Recent Releases of gmft
gmft - v0.4.2
v0.4.2
Bugfixes before the upcoming release (v0.5.0).
- Better imports and lazy loading
- Default device is now auto, which resolves to cuda/cpu depending on availability
- Rich text now available as AutoPageFormatter
- Fixed bug with permuted coordinates (e0c6dc52)
- CroppedTable now directly has angle property
- CI tests, Python 3.9 support
- More type hints
- Light restructuring (non-breaking)
- Internal data structure tweaks
- (fctn_results → predictions.tatr)
- (effective_* → predictions.effective)
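The "auto" device behavior noted above can be illustrated with a minimal sketch. The function name `resolve_device` is hypothetical and is not gmft's actual implementation; it only assumes that PyTorch, when installed, reports CUDA availability through `torch.cuda.is_available()`.

```python
def resolve_device(device: str = "auto") -> str:
    """Map "auto" to "cuda" when CUDA is usable, otherwise to "cpu".

    Hypothetical helper for illustration; not part of gmft's API.
    """
    if device != "auto":
        # Explicit choices such as "cpu" or "cuda" are passed through unchanged.
        return device
    try:
        import torch  # treated as optional in this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Calling `resolve_device()` with no arguments returns whichever of the two devices is available on the current machine.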
Published by conjuncts 12 months ago
gmft - v0.4.0
v0.4.0
Features
3 new table structure recognition options!
- Added TabledFormatter, with support of the fantastic new Tabled library from VikParuchuri. Check out the demo notebook for a quick example.
- Added HistogramFormatter, a super-fast and decently accurate algorithmic option for table structure recognition. The algorithm uses word bboxes to detect separating lines between text. Check out the demo notebook for a quick example.
- Added DITRFormatter. This formatter is a blend between TATRFormatter and HistogramFormatter, being trained to recognize table separating lines rather than cells. It fine tunes microsoft/table-transformer-structure-recognition-v1.1-all on PubTables-1M for 15 epochs. Its main draw is mixing and matching deep and algorithmic separating line detection. Check out the demo notebook for a quick example.
These formatters can all be used in combination with any detector (like TATRDetector).
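The idea behind HistogramFormatter, using word bboxes to find separating lines, can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not gmft's algorithm: the function `find_separators`, the `resolution` parameter, and the (x0, y0, x1, y1) bbox layout are all hypothetical. It projects word bboxes onto one axis, counts how many words cover each bin, and reports the midpoint of every empty run as a candidate separator.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def find_separators(
    word_bboxes: List[BBox], axis: int = 0, resolution: float = 1.0
) -> List[float]:
    """Return midpoints of empty gaps between words along one axis.

    axis=0 looks for column separators (along x),
    axis=1 looks for row separators (along y).
    """
    if not word_bboxes:
        return []
    lo = min(b[axis] for b in word_bboxes)
    hi = max(b[axis + 2] for b in word_bboxes)
    n_bins = max(1, int((hi - lo) / resolution) + 1)

    # Histogram: how many word bboxes cover each bin along the axis.
    hist = [0] * n_bins
    for b in word_bboxes:
        start = int((b[axis] - lo) / resolution)
        end = min(int((b[axis + 2] - lo) / resolution), n_bins - 1)
        for i in range(start, end + 1):
            hist[i] += 1

    # Every maximal run of empty bins yields one separator at its midpoint.
    separators: List[float] = []
    gap_start = None
    for i, count in enumerate(hist):
        if count == 0 and gap_start is None:
            gap_start = i
        elif count > 0 and gap_start is not None:
            separators.append(lo + (gap_start + i) / 2 * resolution)
            gap_start = None
    return separators
```

For a 2x2 grid of words, calling the function once with `axis=0` and once with `axis=1` gives one column separator and one row separator, each lying in the whitespace between the words.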
[Figure: a visual explaining HistogramFormatter]

Bugfixes
- Tweaked spanning cell merging
- Fixed bug where it would overwrite data
- Give warning when importing from gmft directly (use gmft.auto instead)
- Merged PR #32, thanks!
Published by conjuncts over 1 year ago
gmft - v0.4.0.rc1
v0.4.0rc1
Exciting upcoming changes:
- Added TabledFormatter, with support of the fantastic new Tabled library. Check out the demo notebook for a quick example.
- Added IntervalicFormatter, a super-fast and fairly accurate algorithmic option for table structure recognition. Check out the demo notebook for a quick example.
- These formatters can all be used in combination with any detector (like TATRDetector).
Published by conjuncts over 1 year ago
gmft - v0.3.x
v0.3.2
Changes:
- Raise default threshold of heuristic for rejecting tables on high overlap. Makes ValueErrors more rare.
- (total_overlap_reject_threshold) ValueError thrown on overlap > 90%, up from 20%
- (total_overlap_warn_threshold) warning issued on overlap > 10%, up from 5%
- Python 3.9 compatibility.
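The two-tier overlap heuristic described above can be sketched as a small check. The function name `check_total_overlap` and its parameter names are hypothetical; only the 90% reject and 10% warn defaults come from the notes above, and gmft's real check may compute and apply the overlap differently.

```python
import warnings


def check_total_overlap(
    total_overlap: float,
    reject_threshold: float = 0.9,  # from the notes: reject above 90%
    warn_threshold: float = 0.1,    # from the notes: warn above 10%
) -> None:
    """Raise on very high overlap, warn on moderate overlap (illustrative only)."""
    if total_overlap > reject_threshold:
        raise ValueError(
            f"Total overlap {total_overlap:.0%} exceeds {reject_threshold:.0%}"
        )
    if total_overlap > warn_threshold:
        warnings.warn(
            f"Total overlap {total_overlap:.0%} exceeds {warn_threshold:.0%}"
        )
```

With the older 20% reject threshold, a table with 50% overlap would have raised; under the defaults above it only warns.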
v0.3.1
Bugfix:
- divide by 0 when taking median of empty list in row height estimate
- Fix broken build in v0.3.0 (missing formatters)
Changes:
- Added Img2TableDetector.
- refactor of code into organizational modules, detectors and formatters
- Importing from gmft is no longer encouraged. Please import from gmft.auto instead.
- Tentative richtext module and FormattedPage for direct RAG embedding usage
- Configs are now dataclasses. However, a possibly breaking change is that **passing `config_overrides` will now completely replace the config**, rather than updating it.
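The difference between replacing and updating a dataclass config can be illustrated with a stand-in class. `DemoConfig` and its fields are hypothetical and are not gmft classes; the sketch only shows why a full replacement discards customizations that an update would have kept.

```python
from dataclasses import dataclass, replace


@dataclass
class DemoConfig:  # hypothetical stand-in, not a gmft config class
    verbosity: int = 1
    margin: float = 0.0


base = DemoConfig(verbosity=3, margin=5.0)
override = DemoConfig(margin=10.0)

# "Replace" semantics (the behavior described above): the override is used
# as-is, so verbosity falls back to the dataclass default, not base's value.
replaced = override

# "Update" semantics (the earlier behavior): only the changed field is
# merged onto the base config, and other customizations survive.
updated = replace(base, margin=10.0)
```

Under replace semantics, any setting that should persist has to be restated in the override object itself.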
Published by conjuncts over 1 year ago
gmft - v0.2.2
Changes
- is_projecting_row is removed, with the information now available under FormattedTable._projecting_indices
- Formally removed timm as a dependency
- Slight tweak to captions with the aim to better reflect paragraph word height, still WIP. See #8 and be93159
- Fix: return result so image can be used outside of notebook by @brycedrennan in https://github.com/conjuncts/gmft/pull/15
Full Changelog: https://github.com/conjuncts/gmft/compare/v0.2.1...v0.2.2
Published by conjuncts almost 2 years ago
gmft - v0.2.0
Features:
- Multiple headers; multi-index tables (6225043)
- Spanning cells on both the top and left (bbbbd7c)
- Captions for tables (ca18bcc)
- "Margin" parameter allows text outside of table bbox to be included (ab81f22)
- Return visualized images as PIL image; allow padding or margin around visualized (ab81f22)
Several tweaks to formatting algorithm that may result in different outputs compared to prior versions.
- Automatically drop rows whose only non-null value is in the "is_projecting_row" column
- Fill in gaps between table rows, to reduce skipped text
- Non-maxima suppression, as seen in inference.py (ab81f22)
- "total overlap" metric has become less useful in favor of "rows removed by NMS"
- Widen out the rows to same length
- Several tweaks to conditions, parameters, heuristics
- superscripts/subscripts now more likely to be merged to their parent rows
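Non-maxima suppression, mentioned in the list above, can be sketched in a few lines. This is a generic greedy NMS using intersection-over-union, written for illustration; it is not the code from inference.py, and the `nms_iob_threshold` setting named in these notes suggests gmft uses a different overlap measure. The names `iou` and `nms` are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring boxes, dropping any box that
    overlaps an already-kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

Applied to predicted table rows, two near-duplicate row boxes collapse into the higher-confidence one, while a clearly separate row is kept.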
Many possibly breaking changes to config.
- TableDetectorConfig.confidence_score_threshold has been renamed to TableDetectorConfig.detector_base_threshold
- TableFormatter.deduplication_iob_threshold has been removed in favor of nms_iob_threshold
- spanning_cell_minimum_width, corner_clip_outlier_threshold, and aggregate_spanning_cells have been removed
- Tweaks to default settings may yield different results
- no_timm is now the default, which fixes #1.
- this might cause slightly different bboxes
Published by conjuncts almost 2 years ago
gmft - v0.1.1
- Created AutoTableFormatter and AutoTableDetector for future flexibility (v0.1.1, a840488)
- Renamed is_spanning_row to is_projecting_row (v0.1.1, a840488)
Older:
- Even better accuracy for large tables (v0.1.0, 8c537ed)
Full Changelog: https://github.com/conjuncts/gmft/compare/v0.1.0...v0.1.1
Published by conjuncts almost 2 years ago