gmft

Lightweight, performant, deep table extraction

https://github.com/conjuncts/gmft

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.7%) to scientific vocabulary

Last synced: 10 months ago

Repository

Lightweight, performant, deep table extraction

Basic Info

Host: GitHub
Owner: conjuncts
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 44 MB

Statistics

Stars: 503
Watchers: 6
Forks: 36
Open Issues: 30
Releases: 9

Created about 2 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog License Citation

gmft

There are many pdfs out there, and many of those pdfs have tables. But despite a plethora of table extraction options, there is still no definitive extraction method.

About

gmft converts pdf tables to many formats. It is lightweight, modular, and performant. Batteries included: it just works, offering strong performance with the default settings.

It relies on Microsoft's Table Transformers, qualitatively the most performant and reliable of the many alternatives.

Install: pip install gmft

Quickstarts: demo notebook, bulk extract, readthedocs.

Documentation: readthedocs

Why should I use gmft?

Fast, lightweight, and performant, gmft is a great choice for extracting tables from pdfs.

The extraction quality is superb: check out the bulk extract notebook for approximate quality. When testing the same tables across many table extraction options, gmft fares extremely well.

Many Formats

We support the following export options:
- Pandas dataframe
- By extension: markdown, latex, html, csv, json, etc.
- List of text + positions
- Cropped image of table
- Table caption

Cropped images can be passed into a vision recognizer, like:
- GPT-4 vision
- Mathpix/Adobe/Google/Amazon/Azure/etc.
- Or saved to disk for human evaluation

Lightweight

gmft is very lightweight. It can run on cpu - no GPU necessary.

High throughput

Benchmark using Colab's cpu indicates ~1.381 s/page; converting to df takes ~1.168 s/table. This makes gmft about 10x faster than alternatives like unstructured, nougat, and open-parse/unitable on cpu.

- The base model, Smock et al.'s Table Transformer, is very efficient.
- gmft focuses on table extraction, so figures, titles, sections, etc. are not extracted.
- In most cases, OCR is not necessary; pdfs already contain text positional data. Using this existing data drastically speeds up inference. For images or scanned pdfs, bboxes can be exported for further processing.
- PyPDFium2 is chosen for its high throughput and permissive license.
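To get a feel for what the benchmark figures mean for a whole document, here is a back-of-the-envelope sketch. It assumes the per-page and per-table costs simply add up, which the benchmark does not state explicitly; `estimate_seconds` and the 20-page workload are hypothetical and not part of gmft.

```python
# Figures quoted above (Colab CPU benchmark).
SECONDS_PER_PAGE = 1.381   # approximate cost per page
SECONDS_PER_TABLE = 1.168  # approximate cost of converting one table to a df


def estimate_seconds(num_pages: int, num_tables: int) -> float:
    """Rough estimate, assuming page and table costs are additive."""
    return num_pages * SECONDS_PER_PAGE + num_tables * SECONDS_PER_TABLE


# Hypothetical workload: a 20-page paper containing 5 tables.
total = estimate_seconds(num_pages=20, num_tables=5)
print(f"{total:.2f} s")  # prints "33.46 s"
```

Actual timings will vary with hardware, page complexity, and model settings, so treat the result as an order-of-magnitude guide only.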

Few dependencies

gmft does not require any external dependencies (detectron2, poppler, paddleocr, tesseract etc.)

To install gmft, first install transformers and pytorch with the necessary GPU/CPU options. gmft also relies on pypdfium2.

Dependable

The base model is Microsoft's Table Transformer (TATR) pretrained on PubTables-1M, which works best with scientific papers. TATR handles implicit table structure very well. Current failure modes include OCR issues, merged cells, or false positives. Even so, the text is highly usable, and alignment of a value to its row/column header remains very accurate because of the underlying procedural algorithm.
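To illustrate why a procedural step can keep values aligned with their headers, here is a simplified sketch: given row and column bands (as a structure model might predict) and word bounding boxes (as already stored in the pdf), each word goes to the row and column it overlaps most. This is an illustration under assumed data shapes, not gmft's actual algorithm; `words_to_grid`, `best_band`, and the coordinates are hypothetical.

```python
from typing import List, Tuple

Band = Tuple[float, float]                # (start, end) along one axis
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


def overlap_1d(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def best_band(lo: float, hi: float, bands: List[Band]) -> int:
    """Index of the band overlapping [lo, hi] the most (index 0 if none overlap)."""
    return max(range(len(bands)), key=lambda i: overlap_1d(lo, hi, *bands[i]))


def words_to_grid(words: List[Tuple[str, BBox]],
                  rows: List[Band], cols: List[Band]) -> List[List[str]]:
    """Place each word in the cell given by its best-matching row and column."""
    grid = [["" for _ in cols] for _ in rows]
    for text, (x0, y0, x1, y1) in words:
        r = best_band(y0, y1, rows)  # rows are vertical (y) bands
        c = best_band(x0, x1, cols)  # columns are horizontal (x) bands
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


# Hypothetical 2x2 table: two row bands, two column bands, five words.
rows = [(0, 10), (10, 20)]
cols = [(0, 50), (50, 100)]
words = [
    ("Name", (2, 1, 30, 9)),
    ("Test", (52, 1, 70, 9)),
    ("Score", (72, 1, 95, 9)),
    ("Alice", (2, 11, 28, 19)),
    ("9.5", (55, 11, 70, 19)),
]
grid = words_to_grid(words, rows, cols)
print(grid)  # [['Name', 'Test Score'], ['Alice', '9.5']]
```

Because the assignment is purely geometric, a value cannot drift to a neighbouring header as long as the predicted bands are reasonable, which is the intuition behind the accuracy claim above.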

We invite you to explore the comparison notebooks to survey use cases and compare results.

As of gmft v0.3, the library supports multiple-column headers (TATRFormatConfig.enable_multi_header = True), spanning cells (TATRFormatConfig.semantic_spanning_cells = True), and rotated tables.

Why should I not use gmft?

gmft focuses on tables, and aims to maximize performance on tables alone. If you need to extract other document features like figures or a table of contents, you may want a different tool. Consider instead (in no particular order): marker, nougat, open-parse, docling, unstructured, surya, deepdoctection, DocTR. For table detection, img2table is excellent for tables with explicit (solid) cell boundaries.

Current limitations include: false positives (references, indexes, and large columnar text), false negatives, and no OCR support.

Quickstart

See the docs and the config guide for more information. The demo notebook and bulk extract contain more comprehensive code examples.

```python
# new in v0.3: gmft.auto
from gmft.auto import CroppedTable, TableDetector, AutoTableFormatter, AutoTableDetector
from gmft.pdf_bindings import PyPDFium2Document

detector = AutoTableDetector()
formatter = AutoTableFormatter()

def ingest_pdf(pdf_path):  # produces list[CroppedTable]
    doc = PyPDFium2Document(pdf_path)
    tables = []
    for page in doc:
        tables += detector.extract(page)
    return tables, doc

tables, doc = ingest_pdf("path/to/pdf.pdf")
doc.close()  # once you're done with the document
```

Configuration

See the config guide for discussion on gmft settings.

Development

```bash
git clone https://github.com/conjuncts/gmft
cd gmft
pip install -e .
pip install pytest
```

Run tests:

tests are in ./test directory

Build docs:

```bash
cd docs
make html
```

What does gmft stand for?

give me formatted tables!

Acknowledgements

I gratefully acknowledge the support of Vanderbilt Data Science Institute and the Zhongyue Yang Lab at Vanderbilt.
The library builds upon work by:
- Smock, Brandon, Rohith Pesala, and Robin Abraham. "PubTables-1M: Towards comprehensive table extraction from unstructured documents." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
- Niels Rogge from huggingface.

License

GMFT is released under MIT.

PyMuPDF support is available in a separate repository in observance of pymupdf's AGPL 3.0 license.

Owner

Login: conjuncts
Kind: user

Repositories: 1
Profile: https://github.com/conjuncts

Citation (CITATION.cff)

cff-version: 1.2.0
title: gmft
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Galen
    family-names: Wei
    orcid: 'https://orcid.org/0009-0003-4440-4728'
repository-code: 'https://github.com/conjuncts/gmft'
abstract: 'Lightweight, performant, deep table extraction'
license: MIT

GitHub Events

Total

Create event: 6
Release event: 4
Issues event: 38
Watch event: 196
Delete event: 3
Issue comment event: 53
Push event: 42
Pull request event: 17
Fork event: 19

Last Year

Create event: 6
Release event: 4
Issues event: 38
Watch event: 196
Delete event: 3
Issue comment event: 53
Push event: 42
Pull request event: 17
Fork event: 19

Committers

Last synced: about 1 year ago

All Time

Total Commits: 98
Total Committers: 5
Avg Commits per committer: 19.6
Development Distribution Score (DDS): 0.041

Past Year

Commits: 98
Committers: 5
Avg Commits per committer: 19.6
Development Distribution Score (DDS): 0.041

Top Committers

Name	Email	Commits
conjuncts	6****s	94
Prathamesh Ghatole	7****e	1
Na'aman Hirschfeld	n**d@g**m	1
Bryce	g**3@a**m	1
Andreas Weiden	a**n@s**e	1
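The summary figures above can be reproduced from the committer table. This sketch assumes the Development Distribution Score is defined as one minus the top committer's share of all commits, which matches the reported 0.041 but is not stated on this page.

```python
# Commit counts copied from the Top Committers table.
commits_by_committer = {
    "conjuncts": 94,
    "Prathamesh Ghatole": 1,
    "Na'aman Hirschfeld": 1,
    "Bryce": 1,
    "Andreas Weiden": 1,
}

total_commits = sum(commits_by_committer.values())       # 98
avg_commits = total_commits / len(commits_by_committer)  # 98 / 5

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits_by_committer.values()) / total_commits

print(total_commits, round(avg_commits, 1), round(dds, 3))  # 98 19.6 0.041
```

A score this close to zero indicates that almost all commits come from a single committer.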

Committer Domains (Top 20 + Academic)

skillbyte.de: 1
accounts.brycedrennan.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 53
Total pull requests: 16
Average time to close issues: 14 days
Average time to close pull requests: about 2 hours
Total issue authors: 44
Total pull request authors: 6
Average comments per issue: 1.4
Average comments per pull request: 0.5
Merged pull requests: 15
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 38
Pull requests: 15
Average time to close issues: 9 days
Average time to close pull requests: about 2 hours
Issue authors: 31
Pull request authors: 5
Average comments per issue: 1.0
Average comments per pull request: 0.53
Merged pull requests: 14
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Goldziher (3)
komalpasumarthy (2)
etomlins (2)
wassim (2)
Anonymous62-bug (2)
cspx2 (2)
Datata1 (2)
conjuncts (2)
jiauy (2)
vivekrathiave (1)
xdave (1)
bgriffen (1)
snexus (1)
Bahadir-Danisik (1)
rjyo (1)

Pull Request Authors

conjuncts (11)
graipher (2)
Prathamesh-Ghatole (2)
Goldziher (2)
brycedrennan (2)

Top Labels

Issue Labels

enhancement (3)
detection accuracy (2)
bug (2)
ocr (1)
structure accuracy (1)
help wanted (1)

Pull Request Labels

Packages

Total packages: 3
Total downloads:
- pypi 8,540 last-month

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 22
Total maintainers: 2

pypi.org: gmft-locally

Lightweight, performant, deep table extraction

Homepage: https://github.com/conjuncts/gmft
Documentation: https://gmft-locally.readthedocs.io/
License: MIT License
Latest release: 0.4.4
published over 1 year ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 22 Last month

Rankings

Stargazers count: 4.4%

Forks count: 9.4%

Dependent packages count: 9.9%

Average: 19.9%

Dependent repos count: 55.8%

Maintainers (1)

etom

Last synced: 10 months ago

pypi.org: gmft

Lightweight, performant, deep table extraction

Homepage: https://github.com/conjuncts/gmft
Documentation: https://gmft.readthedocs.io/
License: MIT License
Latest release: 0.4.2
published 12 months ago

Versions: 19
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 8,392 Last month

Rankings

Dependent packages count: 10.8%

Downloads: 14.6%

Average: 28.8%

Dependent repos count: 60.8%

Maintainers (1)

conjunct

Last synced: 10 months ago

pypi.org: gmft-local

Lightweight, performant, deep table extraction

Homepage: https://github.com/conjuncts/gmft
Documentation: https://gmft-local.readthedocs.io/
License: MIT License
Status: removed
Latest release: 0.4.1
published over 1 year ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 126 Last month

Rankings

Dependent packages count: 9.9%

Average: 32.9%

Dependent repos count: 55.8%

Maintainers (1)

etom

Last synced: about 1 year ago

Dependencies

pyproject.toml pypi

pandas *
pillow *
pypdfium2 >= 4
timm *
transformers >= 4.24.0

requirements.txt pypi

pandas *
pillow *
pypdfium2 *
timm *
transformers *

docs/requirements.txt pypi

sphinx ==7.1.2
sphinx-rtd-theme ==1.3.0rc1

requirements-dev.txt pypi

pytest * development
sphinx * development
sphinx_rtd_theme * development
