Recent Releases of edspdf
edspdf - v0.10.0
Changelog
Added
- Support packaging models made in setuptools based projects
Fixed
- Support packaging with poetry 2.0
Changed
- Handle cases like a distant superscript ("³ something") where the superscript and the rest of the text are parsed as two lines, one above the other, when they should be on the same line.
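To illustrate the situation this change targets, the sketch below merges a fragment into the following line when their vertical extents overlap. The `Line` class and the `merge_overlapping_lines` function are hypothetical names made up for this example, and the overlap test is an assumed heuristic; it is not the logic edspdf actually implements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    y0: float  # top of the line (0 = top of the page)
    y1: float  # bottom of the line


def merge_overlapping_lines(lines: List[Line]) -> List[Line]:
    """Merge consecutive lines whose vertical extents overlap (hypothetical heuristic)."""
    merged: List[Line] = []
    for line in sorted(lines, key=lambda item: item.y0):
        prev = merged[-1] if merged else None
        # A superscript parsed on its own starts slightly higher than the
        # text it belongs to, but its bottom edge still overlaps that text.
        if prev is not None and line.y0 < prev.y1:
            merged[-1] = Line(
                text=f"{prev.text} {line.text}",
                y0=min(prev.y0, line.y0),
                y1=max(prev.y1, line.y1),
            )
        else:
            merged.append(line)
    return merged


lines = [
    Line("³", y0=0.100, y1=0.105),
    Line("something", y0=0.104, y1=0.120),
    Line("next paragraph", y0=0.200, y1=0.215),
]
result = merge_overlapping_lines(lines)
print([line.text for line in result])
```

In this toy input, the superscript and "something" overlap vertically and end up on one line, while the distant paragraph stays separate.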
Pull Requests
- Handle cases like distant superscripts by @percevalw in https://github.com/aphp/edspdf/pull/32
- chore: bump version to 0.10.0 by @percevalw in https://github.com/aphp/edspdf/pull/33
Full Changelog: https://github.com/aphp/edspdf/compare/v0.9.3...v0.10.0
- Python
Published by percevalw about 1 year ago
edspdf - v0.9.2
Changelog
Changed
- Default to fp16 when inferring with gpu
- Support `inputs` parameter in `TrainablePipe.postprocess(...)` method (as in edsnlp)
- We now check that the user isn't trying to write a single file in a split fashion (when `write_in_worker is True` or `num_rows_per_file is not None`) and raise an error if they do
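The single-file check described above can be pictured with a small validation function. This is only a sketch: `check_write_target` is a hypothetical helper, and treating any path with a file suffix as "a single file" is an assumption made for the example. Only the `write_in_worker` and `num_rows_per_file` names come from the changelog.

```python
from pathlib import Path
from typing import Optional


def check_write_target(
    path: str,
    write_in_worker: bool = False,
    num_rows_per_file: Optional[int] = None,
) -> None:
    """Raise if a single output file is combined with split writing."""
    # Assumption for this sketch: a path with a suffix designates a single file.
    is_single_file = Path(path).suffix != ""
    split_writing = write_in_worker or num_rows_per_file is not None
    if is_single_file and split_writing:
        raise ValueError(
            f"Cannot write to the single file {path!r} in a split fashion: "
            "use a directory path, or disable write_in_worker / num_rows_per_file."
        )


# A directory target is accepted even when the output is split.
check_write_target("out/", num_rows_per_file=1000)

# A single file combined with split writing is rejected.
try:
    check_write_target("out.parquet", write_in_worker=True)
    raised = False
except ValueError:
    raised = True
```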
Fixed
- Batches full of empty content boxes no longer crash the `huggingface-embedding` component
- Ensure models are always loaded in non training mode
- Improved performance of `edspdf.data` methods over a filesystem (`fs` parameter)
Pull Requests
- Fix empty batches & update data API by @percevalw in https://github.com/aphp/edspdf/pull/28
- chore: bump version to 0.9.2 by @percevalw in https://github.com/aphp/edspdf/pull/30
Full Changelog: https://github.com/aphp/edspdf/compare/v0.9.1...v0.9.2
Published by percevalw over 1 year ago
edspdf - v0.9.1
Changelog
Fixed
- It is now possible to recursively retrieve pdf files in a directory using `edspdf.data.read_files`
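For readers unfamiliar with what recursive retrieval means here, the sketch below does it with the standard library only. `find_pdfs` and its `recursive` flag are hypothetical; they do not reflect the signature or the implementation of `edspdf.data.read_files`.

```python
import tempfile
from pathlib import Path
from typing import List


def find_pdfs(root: str, recursive: bool = True) -> List[Path]:
    """Return the PDF files under `root`, optionally descending into subfolders."""
    base = Path(root)
    # "**/" matches zero or more directory levels, so top-level files are included.
    pattern = "**/*.pdf" if recursive else "*.pdf"
    return sorted(p for p in base.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.pdf").write_bytes(b"%PDF-1.4")
    (Path(tmp) / "sub" / "b.pdf").write_bytes(b"%PDF-1.4")
    flat = [p.name for p in find_pdfs(tmp, recursive=False)]
    deep = [p.name for p in find_pdfs(tmp)]
```

The non-recursive call only sees the top-level file, while the recursive one also picks up the file in the subdirectory.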
What's Changed
- fix: allow recursive pdf file searching by @percevalw in https://github.com/aphp/edspdf/pull/26
Full Changelog: https://github.com/aphp/edspdf/compare/v0.9.0...v0.9.1
Published by percevalw almost 2 years ago
edspdf - v0.9.0
What's Changed ?
Added
- New unified `edspdf.data` API (pdf files, pandas, parquet) and LazyCollection object to efficiently read / write data from / to different formats & sources. This API has been heavily inspired by the `edsnlp.data` API.
- New unified processing API to select the execution backend via `data.set_processing(...)` to replace the old `accelerators` API (which is now deprecated, but still available).
- `huggingface-embedding` now supports quantization and other `AutoModel.from_pretrained` kwargs
- It is now possible to convert a label to multiple labels in the `simple-aggregator` component:
```ini
# To build the "text" field, we will aggregate "title", "body" and "table" lines,
# and output "title" lines in a separate field as well.
label_map = {
    "text" : [ "title", "body", "table" ],
    "title": "title",
}
```
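One way to read this mapping is that each output field collects the lines whose label appears in its list. The sketch below applies such a mapping to a few labelled lines in plain Python. The `aggregate` function and the `(label, text)` tuples are hypothetical stand-ins for illustration, not the `simple-aggregator` implementation.

```python
from typing import Dict, List, Tuple, Union

LabelMap = Dict[str, Union[str, List[str]]]


def aggregate(lines: List[Tuple[str, str]], label_map: LabelMap) -> Dict[str, str]:
    """Build one text field per key of `label_map` from labelled lines."""
    fields: Dict[str, str] = {}
    for field, sources in label_map.items():
        # A bare string is treated as a one-element list of source labels.
        wanted = [sources] if isinstance(sources, str) else list(sources)
        fields[field] = "\n".join(text for label, text in lines if label in wanted)
    return fields


label_map = {"text": ["title", "body", "table"], "title": "title"}
lines = [
    ("title", "Discharge summary"),
    ("header", "Page 1/2"),
    ("body", "The patient was admitted on Monday."),
]
fields = aggregate(lines, label_map)
```

Here the "title" line contributes to both the "text" and "title" fields, and the "header" line is dropped because no field lists it.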
Fixed
- `huggingface-embedding` now resizes bbox features for large PDFs, instead of making the model crash
- `huggingface-embedding` and `sub-box-cnn-pooler` now handle empty PDFs correctly
Pull Requests
- API update (data & processing) by @percevalw in https://github.com/aphp/edspdf/pull/25
Full Changelog: https://github.com/aphp/edspdf/compare/v0.8.1...v0.9.0
Published by percevalw almost 2 years ago
edspdf - v0.8.1
Changelog
Fixed
- Fix typing to allow passing an accelerator dict to `Pipeline.pipe(...)`
- Removed multiprocessing accelerator debug output
- Fixed absolute links in github-pages docs (e.g. image assets)
Changed
- Added auto-links to components in the docs (by comparing span contents with entry points)
Pull Requests
- v0.8.1 by @percevalw in https://github.com/aphp/edspdf/pull/23
Full Changelog: https://github.com/aphp/edspdf/compare/v0.8.0...v0.8.1
Published by percevalw over 2 years ago
edspdf - v0.8.0
What's changed
Added
- Add multi-modal transformers (`huggingface-embedding`) with windowing options
- Add `render_page` option to `pdfminer` extractor, for multi-modal PDF features
- Add inference utilities (`accelerators`), with simple mono process support and multi gpu / cpu support
- Packaging utils (`pipeline.package(...)`) to make a pip installable package from a pipeline
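The windowing options mentioned for `huggingface-embedding` address inputs longer than a transformer's maximum sequence length. As a generic illustration of the idea, the sketch below splits a long sequence into overlapping windows. The `sliding_windows` function and its `window` / `stride` parameters are assumptions made for this example, not edspdf's code or option names.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Split `tokens` into chunks of at most `window` items, advancing by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window]))
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += stride
    return windows


chunks = sliding_windows(list(range(10)), window=4, stride=2)
print(chunks)
```

With a stride smaller than the window, consecutive chunks share tokens, which lets each token be embedded with some context on both sides.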
Changed
- Updated API to follow EDS-NLP's refactoring
- Updated `confit` to 0.4.2 (better errors) and `foldedtensor` to 0.3.0 (better multiprocess support)
- Removed `pipeline.score`. You should use `pipeline.pipe`, a custom scorer and `pipeline.select_pipes` instead.
- Better test coverage
- Use `hatch` instead of `setuptools` to build the package / docs and run the tests
Fixed
- Fixed `attrs` dependency only being installed in dev mode
Pull Requests
- Huggingface multi-modal transformers by @percevalw in https://github.com/aphp/edspdf/pull/15
- Dev install documentation and dependencies fix by @ian-fox in https://github.com/aphp/edspdf/pull/16
- Huggingface by @percevalw in https://github.com/aphp/edspdf/pull/17
- Accelerators by @percevalw in https://github.com/aphp/edspdf/pull/19
- Scoring by @percevalw in https://github.com/aphp/edspdf/pull/20
- Packaging utils by @percevalw in https://github.com/aphp/edspdf/pull/18
- chore: bump version to 0.8.0 by @percevalw in https://github.com/aphp/edspdf/pull/21
- feat: switch to hatch package manager by @percevalw in https://github.com/aphp/edspdf/pull/22
New Contributors
- @ian-fox made their first contribution in https://github.com/aphp/edspdf/pull/16
Full Changelog: https://github.com/aphp/edspdf/compare/v0.7.0...v0.8.0
Published by percevalw over 2 years ago
edspdf - v0.7.0
What's changed
This public release comes with a major overhaul of the library since v0.5.3
Core features
- new pipeline system whose API is inspired by spaCy
- first-class support for pytorch
- hybrid model inference and training (rules + deep learning)
- moved from pandas DataFrame to attrs dataclasses (`PDFDoc`, `Page`, `Box`, ...) for representing PDF documents
- new configuration system based on confit, with support for instantiation of complex deep learning models, off-the-shelf CLI, ...
Functional features
- new extractors: pymupdf and poppler (separate packages for licensing reasons)
- many deep learning layers (box-transformer, 2d attention with relative position information, ...)
- trainable deep learning classifier
- training recipes for deep learning models
Full Changelog: https://github.com/aphp/edspdf/compare/v0.5.3...v0.7.0
Published by percevalw over 2 years ago
edspdf - v0.5.2
What's Changed
- ci: remove unnecessary poppler dependency by @bdura in https://github.com/aphp/edspdf/pull/7
- Fix aggregation for empty documents by @percevalw in https://github.com/aphp/edspdf/pull/8
Full Changelog: https://github.com/aphp/edspdf/compare/v0.5.1...v0.5.2
Published by percevalw over 3 years ago
edspdf - v0.5.1
Changelog
Changed
- Drop the `pdf2image` dependency, replacing it with `pypdfium2` (easier installation)
Pull Requests
- New PDF rendering library by @bdura in https://github.com/aphp/edspdf/pull/3
- docs: add mike plugin by @bdura in https://github.com/aphp/edspdf/pull/4
- docs: add demo by @bdura in https://github.com/aphp/edspdf/pull/5
- chore: bump version by @bdura in https://github.com/aphp/edspdf/pull/6
Full Changelog: https://github.com/aphp/edspdf/compare/v0.5.0...v0.5.1
Published by bdura over 3 years ago