Recent Releases of gmft

gmft - v0.4.2

v0.4.2

Bugfixes before the upcoming release (v0.5.0).

  • Better imports and lazy loading
  • Default device is now auto, which resolves to cuda/cpu depending on availability
  • Rich text now available as AutoPageFormatter
  • Fixed bug with permuted coordinates (e0c6dc52)
  • CroppedTable now directly has angle property
  • CI tests, Python 3.9 support
  • More type hints
  • Light restructuring (non-breaking)
  • Internal data structure tweaks
    • fctn_results → predictions.tatr
    • effective_* → predictions.effective
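
As a rough illustration of what an "auto" device default can look like, here is a minimal sketch. The function name `resolve_device` is hypothetical and is not part of gmft; the only real API used is `torch.cuda.is_available()`, and the sketch falls back to "cpu" when torch is not installed.

```python
def resolve_device(requested: str = "auto") -> str:
    """Map "auto" to "cuda" when torch reports a usable GPU, else "cpu".

    Hypothetical helper for illustration only; not gmft's implementation.
    """
    if requested != "auto":
        # Explicit choices such as "cpu" or "cuda" pass through unchanged.
        return requested
    try:
        import torch  # treated as optional in this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(resolve_device())       # "cuda" or "cpu", depending on the machine
print(resolve_device("cpu"))  # explicit request is respected
```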

- Python
Published by conjuncts 12 months ago

gmft - v0.4.0

v0.4.0

Features

3 new table structure recognition options!

  • Added TabledFormatter, with support of the fantastic new Tabled library from VikParuchuri. Check out the demo notebook for a quick example.
  • Added HistogramFormatter, a super-fast and decently accurate algorithmic option for table structure recognition. The algorithm uses word bboxes to detect separating lines between text. Check out the demo notebook for a quick example.
  • Added DITRFormatter. This formatter is a blend between TATRFormatter and HistogramFormatter, trained to recognize table separating lines rather than cells. It fine-tunes microsoft/table-transformer-structure-recognition-v1.1-all on PubTables-1M for 15 epochs. Its main draw is mixing and matching deep and algorithmic separating line detection. Check out the demo notebook for a quick example.

These formatters can all be used in combination with any detector (like TATRDetector).

[Image: a visual explaining HistogramFormatter]
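
The idea of using word bboxes to find separating lines can be sketched as follows. This is only an assumption-laden illustration of the general technique (project the boxes onto one axis and look for stretches that no word covers); it is not HistogramFormatter's actual code, and the names `find_gaps`, `column_separators`, and `row_separators` are hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def find_gaps(intervals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the stretches not covered by any of the given 1-D intervals."""
    gaps = []
    covered_until = None
    for start, end in sorted(intervals):
        if covered_until is not None and start > covered_until:
            gaps.append((covered_until, start))
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps


def column_separators(word_bboxes: List[BBox]) -> List[float]:
    """Place a vertical separator at the midpoint of each horizontal gap."""
    x_intervals = [(x0, x1) for x0, _, x1, _ in word_bboxes]
    return [(lo + hi) / 2 for lo, hi in find_gaps(x_intervals)]


def row_separators(word_bboxes: List[BBox]) -> List[float]:
    """Place a horizontal separator at the midpoint of each vertical gap."""
    y_intervals = [(y0, y1) for _, y0, _, y1 in word_bboxes]
    return [(lo + hi) / 2 for lo, hi in find_gaps(y_intervals)]


# Four words arranged as a 2x2 grid (made-up coordinates).
words = [(0, 0, 10, 5), (20, 0, 30, 5), (2, 10, 12, 15), (22, 10, 28, 15)]
print(column_separators(words))  # [16.0]
print(row_separators(words))     # [7.5]
```

A real formatter would also need to tolerate words that straddle a column boundary, which this sketch does not attempt.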

Bugfixes

  • Tweaked spanning cell merging
    • Fixed bug where it would overwrite data
  • Give warning when importing from gmft directly (use gmft.auto instead)
  • Merged PR #32, thanks!

- Python
Published by conjuncts over 1 year ago

gmft - v0.4.0.rc1

v0.4.0rc1

Exciting upcoming changes:

  • Added TabledFormatter, with support of the fantastic new Tabled library. Check out the demo notebook for a quick example.
  • Added IntervalicFormatter, a super-fast and fairly accurate algorithmic option for table structure recognition. Check out the demo notebook for a quick example.

These formatters can all be used in combination with any detector (like TATRDetector).

- Python
Published by conjuncts over 1 year ago

gmft - v0.3.x

v0.3.2

Changes:

  • Raised the default thresholds of the heuristic that rejects tables on high overlap. This makes ValueErrors rarer.
    • total_overlap_reject_threshold: ValueError is thrown on overlap > 90%, up from 20%
    • total_overlap_warn_threshold: a warning is issued on overlap > 10%, up from 5%
  • Python 3.9 compatibility.
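
To make the two thresholds concrete, here is a minimal sketch of a warn/reject check. The function `check_total_overlap` and its parameter names are hypothetical and only mirror the numbers quoted above; how gmft computes the overlap value is not shown here.

```python
import warnings


def check_total_overlap(
    total_overlap: float,
    reject_threshold: float = 0.9,  # 90%, per the v0.3.2 notes
    warn_threshold: float = 0.1,    # 10%, per the v0.3.2 notes
) -> None:
    """Raise on very high overlap, warn on moderate overlap (illustrative only)."""
    if total_overlap > reject_threshold:
        raise ValueError(f"overlap {total_overlap:.0%} exceeds reject threshold")
    if total_overlap > warn_threshold:
        warnings.warn(f"overlap {total_overlap:.0%} exceeds warn threshold")


# With the new defaults, 50% overlap only warns.
check_total_overlap(0.5)

# With the previous 20% reject threshold, the same table would have been rejected.
try:
    check_total_overlap(0.5, reject_threshold=0.2, warn_threshold=0.05)
except ValueError as exc:
    print("rejected:", exc)
```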

v0.3.1

Bugfix:

  • Divide by 0 when taking median of empty list in row height estimate
  • Fix broken build in v0.3.0 (missing formatters)

Changes:

  • Added Img2TableDetector.
  • Refactored code into organizational modules, detectors and formatters
  • Importing from gmft is no longer encouraged. Please import from gmft.auto instead.
  • Tentative richtext module and FormattedPage for direct RAG embedding usage
  • Configs are now dataclasses. However, a possibly breaking change is that passing config_overrides will now completely replace the config, rather than updating it.
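
The difference between updating a config and replacing it can be illustrated with plain dataclasses. Everything in this sketch is hypothetical (`ExampleConfig`, `apply_by_update`, `apply_by_replace`, and the field names are stand-ins, not gmft classes); it only shows why settings made on the base config may silently revert to defaults under replace semantics.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ExampleConfig:
    """Hypothetical stand-in for a config dataclass; not a gmft class."""
    verbosity: int = 1
    threshold: float = 0.5


def apply_by_update(base: ExampleConfig, **changes) -> ExampleConfig:
    """Update semantics: keep the base settings, change only the named fields."""
    return replace(base, **changes)


def apply_by_replace(base: ExampleConfig, override: Optional[ExampleConfig]) -> ExampleConfig:
    """Replace semantics: an override, when given, is used as-is."""
    return base if override is None else override


base = ExampleConfig(verbosity=3)

updated = apply_by_update(base, threshold=0.7)
print(updated)   # ExampleConfig(verbosity=3, threshold=0.7)

replaced = apply_by_replace(base, ExampleConfig(threshold=0.7))
print(replaced)  # ExampleConfig(verbosity=1, threshold=0.7)
```

Under replace semantics, any field not set on the override falls back to its dataclass default, so an override should carry every setting that matters.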

- Python
Published by conjuncts over 1 year ago

gmft - v0.2.2

Changes

  • is_projecting_row is removed, with the information now available under FormattedTable._projecting_indices
  • Formally removed timm as a dependency
  • Slight tweak to captions with the aim to better reflect paragraph word height, still WIP. See #8 and be93159
  • Fix: return result so image can be used outside of notebook by @brycedrennan in https://github.com/conjuncts/gmft/pull/15

Full Changelog: https://github.com/conjuncts/gmft/compare/v0.2.1...v0.2.2

- Python
Published by conjuncts almost 2 years ago

gmft - v0.2.1

  • GPU support, thank you @MathiasToftas!

Full Changelog: https://github.com/conjuncts/gmft/compare/v0.2.0...v0.2.1

- Python
Published by conjuncts almost 2 years ago

gmft - v0.2.0

Features:

  • Multiple headers; multi-index tables (6225043)
  • Spanning cells on both the top and left (bbbbd7c)
  • Captions for tables (ca18bcc)
  • "Margin" parameter allows text outside of table bbox to be included (ab81f22)
  • Return visualized images as PIL image; allow padding or margin around visualized (ab81f22)

Several tweaks to formatting algorithm that may result in different outputs compared to prior versions.

  • Automatically drop rows whose only non-null value is in the "is_projecting_row" column
  • Fill in gaps between table rows, to reduce skipped text
  • Non-maxima suppression, as seen in inference.py (ab81f22)
    • "total overlap" metric has become less useful in favor of "rows removed by NMS"
  • Widen out the rows to same length
  • Several tweaks to conditions, parameters, heuristics
    • superscripts/subscripts now more likely to be merged to their parent rows
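
For readers unfamiliar with non-maxima suppression, the following is a minimal sketch of the textbook greedy variant using intersection-over-union. It is not the code from inference.py or gmft, which may use a different overlap measure; the names `iou` and `nms` and the 0.5 threshold are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Two near-duplicate row predictions plus one distinct row (made-up values).
rows = [(0, 0, 100, 10), (0, 1, 100, 11), (0, 20, 100, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(rows, scores))  # [0, 2]
```

In this sketch, "rows removed by NMS" would simply be the number of indices missing from the returned list.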

Many possibly breaking changes to config.

  • TableDetectorConfig.confidence_score_threshold has been renamed to TableDetectorConfig.detector_base_threshold
  • TableFormatter.deduplication_iob_threshold has been removed in favor of nms_iob_threshold
  • spanning_cell_minimum_width, corner_clip_outlier_threshold, and aggregate_spanning_cells have been removed
  • Tweaks to default settings may yield different results
  • no_timm is now the default, which fixes #1.
    • this might cause slightly different bboxes

- Python
Published by conjuncts almost 2 years ago

gmft - v0.1.1

  • Created AutoTableFormatter and AutoTableDetector for future flexibility (v0.1.1, a840488)
  • Renamed is_spanning_row to is_projecting_row (v0.1.1, a840488)

Older:

  • Even better accuracy for large tables (v0.1.0, 8c537ed)

Full Changelog: https://github.com/conjuncts/gmft/compare/v0.1.0...v0.1.1

- Python
Published by conjuncts almost 2 years ago

gmft - v0.0.4

  • Added support for rotated tables (5aeb80d)

- Python
Published by conjuncts almost 2 years ago