Recent Releases of edspdf

edspdf - v0.10.0

Changelog

Added

  • Support packaging models made in setuptools based projects

Fixed

  • Support packaging with poetry 2.0

Changed

  • Handle cases like the distant superscript in "³ something", where the superscript and the rest of the text are parsed as two lines, one above the other, when they should be on the same line.
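
The release notes do not describe how the superscript case is detected, so the following is only a minimal sketch of one plausible approach, not edspdf's implementation. It assumes each parsed line carries a bounding box with `y` growing downward, and merges two lines when their vertical extents overlap by at least half the height of the shorter one. The `Line` class, `vertical_overlap_ratio`, `merge_overlapping_lines` and the `min_ratio` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Hypothetical parsed line: text plus bounding box (y grows downward)."""
    text: str
    x0: float
    y0: float  # top
    x1: float
    y1: float  # bottom


def vertical_overlap_ratio(a: Line, b: Line) -> float:
    """Vertical overlap of a and b, relative to the shorter of the two lines."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    shortest = min(a.y1 - a.y0, b.y1 - b.y0)
    if shortest <= 0:
        return 0.0
    return max(overlap, 0.0) / shortest


def merge_overlapping_lines(lines: List[Line], min_ratio: float = 0.5) -> List[Line]:
    """Merge lines whose vertical extents overlap enough to be one visual line."""
    merged: List[Line] = []
    # Visit lines from top to bottom, then left to right
    for line in sorted(lines, key=lambda l: (l.y0, l.x0)):
        if merged and vertical_overlap_ratio(merged[-1], line) >= min_ratio:
            prev = merged[-1]
            # Keep reading order inside the merged line: leftmost part first
            parts = sorted([prev, line], key=lambda l: l.x0)
            merged[-1] = Line(
                text=" ".join(p.text for p in parts),
                x0=min(prev.x0, line.x0),
                y0=min(prev.y0, line.y0),
                x1=max(prev.x1, line.x1),
                y1=max(prev.y1, line.y1),
            )
        else:
            merged.append(line)
    return merged
```

With this heuristic, a small "³" box sitting slightly above a "something" box is folded into a single "³ something" line, while a line further down the page is left untouched.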

Pull Requests

  • Handle cases like distant superscripts by @percevalw in https://github.com/aphp/edspdf/pull/32
  • chore: bump version to 0.10.0 by @percevalw in https://github.com/aphp/edspdf/pull/33

Full Changelog: https://github.com/aphp/edspdf/compare/v0.9.3...v0.10.0

- Python
Published by percevalw about 1 year ago

edspdf - v0.9.3

Changelog

  • Support pydantic v2

Pull Requests

  • Support pydantic v2 by @percevalw in https://github.com/aphp/edspdf/pull/31

Full Changelog: https://github.com/aphp/edspdf/compare/v0.9.2...v0.9.3

Published by percevalw over 1 year ago

edspdf - v0.9.2

Changelog

Changed

  • Default to fp16 when inferring with gpu
  • Support inputs parameter in TrainablePipe.postprocess(...) method (as in edsnlp)
  • We now check that the user isn't trying to write a single file in a split fashion (when write_in_worker is True or num_rows_per_file is not None) and raise an error if they do
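
As an illustration of the last check above, here is a minimal sketch of this kind of validation. It is not edspdf's code: the function name `validate_write_options` and the rule used to recognise a single-file target (the path has a file extension) are assumptions made for the example. Only the `write_in_worker` and `num_rows_per_file` parameter names come from the changelog entry.

```python
import os
from typing import Optional


def validate_write_options(
    path: str,
    write_in_worker: bool = False,
    num_rows_per_file: Optional[int] = None,
) -> None:
    """Raise if a single output file is requested together with split writing."""
    # Assumption: a path with an extension (e.g. "out.parquet") targets one file,
    # while a path without one (e.g. "out_dir") targets a directory of files.
    targets_single_file = os.path.splitext(path)[1] != ""
    split_requested = write_in_worker or num_rows_per_file is not None
    if targets_single_file and split_requested:
        raise ValueError(
            f"Cannot write the single file {path!r} in a split fashion: "
            "pass a directory path, or disable write_in_worker and num_rows_per_file."
        )
```

Failing early like this avoids several workers, or several row chunks, silently overwriting the same output file.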

Fixed

  • Batches full of empty content boxes no longer crash the huggingface-embedding component
  • Ensure models are always loaded in non training mode
  • Improved performance of edspdf.data methods over a filesystem (fs parameter)

Pull Requests

  • Fix empty batches & update data API by @percevalw in https://github.com/aphp/edspdf/pull/28
  • chore: bump version to 0.9.2 by @percevalw in https://github.com/aphp/edspdf/pull/30

Full Changelog: https://github.com/aphp/edspdf/compare/v0.9.1...v0.9.2

Published by percevalw over 1 year ago

edspdf - v0.9.1

Changelog

Fixed

  • It is now possible to recursively retrieve pdf files in a directory using edspdf.data.read_files
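
For readers unfamiliar with recursive file discovery, the idea can be sketched with the standard library alone. This is an assumption-laden illustration, not the code behind `edspdf.data.read_files`: the helper name `find_pdf_files` and its `recursive` flag are hypothetical.

```python
from pathlib import Path
from typing import List, Union


def find_pdf_files(root: Union[str, Path], recursive: bool = True) -> List[Path]:
    """Return the PDF files under root, in sorted order."""
    root = Path(root)
    pattern = "*.pdf"
    # rglob walks every subdirectory, glob only looks at the top level
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())
```

Sorting the result keeps the file order deterministic across runs, which is convenient when the paths are later used as document identifiers.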

What's Changed

  • fix: allow recursive pdf file searching by @percevalw in https://github.com/aphp/edspdf/pull/26

Full Changelog: https://github.com/aphp/edspdf/compare/v0.9.0...v0.9.1

Published by percevalw almost 2 years ago

edspdf - v0.9.0

What's Changed

Added

  • New unified edspdf.data API (pdf files, pandas, parquet) and LazyCollection object to efficiently read / write data from / to different formats & sources. This API has been heavily inspired by the edsnlp.data API.
  • New unified processing API to select the execution backend via data.set_processing(...) to replace the old accelerators API (which is now deprecated, but still available).
  • huggingface-embedding now supports quantization and other AutoModel.from_pretrained kwargs
  • It is now possible to convert a label to multiple labels in the simple-aggregator component:

```ini
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text" : [ "title", "body", "table" ],
    "title": "title",
}
```
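
To make the effect of such a mapping concrete, here is a small sketch in plain Python. It does not use or reproduce the simple-aggregator component: `aggregate_lines` is a hypothetical helper, and lines are modelled as simple `(label, text)` pairs joined with newlines, which is an assumption about the output format.

```python
from typing import Dict, List, Sequence, Tuple, Union

# Same mapping as in the configuration above: output field -> source label(s)
label_map: Dict[str, Union[str, List[str]]] = {
    "text": ["title", "body", "table"],
    "title": "title",
}


def aggregate_lines(
    lines: Sequence[Tuple[str, str]],
    label_map: Dict[str, Union[str, List[str]]],
) -> Dict[str, str]:
    """Build one text field per entry of label_map from (label, text) lines."""
    fields: Dict[str, str] = {}
    for field, sources in label_map.items():
        # A single label may be given as a bare string
        if isinstance(sources, str):
            sources = [sources]
        fields[field] = "\n".join(text for label, text in lines if label in sources)
    return fields
```

In this sketch a line labelled "title" contributes to both the "text" and the "title" fields, while lines whose label appears nowhere in the mapping (a footer, for instance) are dropped.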

Fixed

  • huggingface-embedding now resizes bbox features for large PDFs instead of crashing the model
  • huggingface-embedding and sub-box-cnn-pooler now handle empty PDFs correctly

Pull Requests

  • API update (data & processing) by @percevalw in https://github.com/aphp/edspdf/pull/25

Full Changelog: https://github.com/aphp/edspdf/compare/v0.8.1...v0.9.0

Published by percevalw almost 2 years ago

edspdf - v0.8.1

Changelog

Fixed

  • Fix typing to allow passing an accelerator dict to Pipeline.pipe(...)
  • Removed multiprocessing accelerator debug output
  • Fixed absolute links in github-pages docs (e.g. image assets)

Changed

  • Added auto-links to components in the docs (by comparing span contents with entry points)

Pull Requests

  • v0.8.1 by @percevalw in https://github.com/aphp/edspdf/pull/23

Full Changelog: https://github.com/aphp/edspdf/compare/v0.8.0...v0.8.1

Published by percevalw over 2 years ago

edspdf - v0.8.0

What's changed

Added

  • Add multi-modal transformers (huggingface-embedding) with windowing options
  • Add render_page option to pdfminer extractor, for multi-modal PDF features
  • Add inference utilities (accelerators), with simple mono process support and multi gpu / cpu support
  • Packaging utils (pipeline.package(...)) to make a pip installable package from a pipeline

Changed

  • Updated API to follow EDS-NLP's refactoring
  • Updated confit to 0.4.2 (better errors) and foldedtensor to 0.3.0 (better multiprocess support)
  • Removed pipeline.score. You should use pipeline.pipe, a custom scorer and pipeline.select_pipes instead.
  • Better test coverage
  • Use hatch instead of setuptools to build the package / docs and run the tests

Fixed

  • Fixed attrs dependency only being installed in dev mode

Pull Requests

  • Huggingface multi-modal transformers by @percevalw in https://github.com/aphp/edspdf/pull/15
  • Dev install documentation and dependencies fix by @ian-fox in https://github.com/aphp/edspdf/pull/16
  • Huggingface by @percevalw in https://github.com/aphp/edspdf/pull/17
  • Accelerators by @percevalw in https://github.com/aphp/edspdf/pull/19
  • Scoring by @percevalw in https://github.com/aphp/edspdf/pull/20
  • Packaging utils by @percevalw in https://github.com/aphp/edspdf/pull/18
  • chore: bump version to 0.8.0 by @percevalw in https://github.com/aphp/edspdf/pull/21
  • feat: switch to hatch package manager by @percevalw in https://github.com/aphp/edspdf/pull/22

New Contributors

  • @ian-fox made their first contribution in https://github.com/aphp/edspdf/pull/16

Full Changelog: https://github.com/aphp/edspdf/compare/v0.7.0...v0.8.0

Published by percevalw over 2 years ago

edspdf - v0.7.0

What's changed

This public release comes with a major overhaul of the library since v0.5.3

Core features

  • new pipeline system whose API is inspired by spaCy
  • first-class support for pytorch
  • hybrid model inference and training (rules + deep learning)
  • moved from pandas DataFrame to attrs dataclasses (PDFDoc, Page, Box, ...) for representing PDF documents
  • new configuration system based on confit, with support for instantiation of complex deep learning models, off-the-shelf CLI, ...

Functional features

  • new extractors: pymupdf and poppler (separate packages for licensing reasons)
  • many deep learning layers (box-transformer, 2d attention with relative position information, ...)
  • trainable deep learning classifier
  • training recipes for deep learning models

Full Changelog: https://github.com/aphp/edspdf/compare/v0.5.3...v0.7.0

Published by percevalw over 2 years ago

edspdf - v0.5.3

What's Changed

Added

  • Add label mapping parameter to aggregators (to merge different types of blocks such as title and body)
  • Improved line aggregation formula

Full Changelog: https://github.com/aphp/edspdf/compare/v0.5.2...v0.5.3

Published by percevalw over 3 years ago

edspdf - v0.5.2

What's Changed

  • ci: remove unnecessary poppler dependency by @bdura in https://github.com/aphp/edspdf/pull/7
  • Fix aggregation for empty documents by @percevalw in https://github.com/aphp/edspdf/pull/8

Full Changelog: https://github.com/aphp/edspdf/compare/v0.5.1...v0.5.2

Published by percevalw over 3 years ago

edspdf - v0.5.1

Changelog

Changed

  • Drop the pdf2image dependency, replacing it with pypdfium2 (easier installation)

Pull Requests

  • New PDF rendering library by @bdura in https://github.com/aphp/edspdf/pull/3
  • docs: add mike plugin by @bdura in https://github.com/aphp/edspdf/pull/4
  • docs: add demo by @bdura in https://github.com/aphp/edspdf/pull/5
  • chore: bump version by @bdura in https://github.com/aphp/edspdf/pull/6

Full Changelog: https://github.com/aphp/edspdf/compare/v0.5.0...v0.5.1

Published by bdura over 3 years ago

edspdf - v0.5.0

EDS-PDF is a generic, pure-Python facility for text extraction from PDF documents. It provides the machinery to use rule-based or machine-learning-based approaches to classify text blocks as body or metadata.

Published by bdura over 3 years ago