edspdf
EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to use rule-based or machine-learning-based approaches to classify text blocks as body or metadata.
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: aphp
- License: bsd-3-clause
- Language: Python
- Default Branch: main
- Homepage: https://aphp.github.io/edspdf/
- Size: 8.93 MB
Statistics
- Stars: 51
- Watchers: 2
- Forks: 7
- Open Issues: 0
- Releases: 12
Metadata Files
README.md
EDS-PDF
EDS-PDF provides a modular framework to extract text information from PDF documents.
You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:
- 📄 Extractors to parse PDFs (based on pdfminer, mupdf or poppler)
- 🎯 Classifiers to perform text box classification, in order to segment PDFs
- 🧩 Aggregators to produce an aggregated output from the detected text boxes
- 🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., embedding building blocks or a trainable classifier)
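As a rough mental model, the extractor, classifier, and aggregator stages can be pictured as plain functions chained together. The sketch below is not the EDS-PDF API: `TextBox`, `extract`, `classify`, and `aggregate` are hypothetical names, and the extractor returns hard-coded boxes instead of parsing a real PDF.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextBox:
    """Hypothetical text box: a snippet of text, its vertical position, and a label."""

    text: str
    y: float  # relative vertical position on the page (0 = top, 1 = bottom)
    label: str = ""


def extract(pdf: bytes) -> List[TextBox]:
    # Stand-in for an extractor: a real one would parse `pdf`.
    return [
        TextBox("Hospital letterhead", y=0.05),
        TextBox("Dear colleague,", y=0.40),
        TextBox("Page 1/1", y=0.95),
    ]


def classify(boxes: List[TextBox]) -> List[TextBox]:
    # Stand-in for a classifier: boxes in the middle of the page are "body".
    for box in boxes:
        box.label = "body" if 0.1 <= box.y <= 0.9 else "meta"
    return boxes


def aggregate(boxes: List[TextBox]) -> Dict[str, str]:
    # Stand-in for an aggregator: join the text of each label in reading order.
    texts: Dict[str, List[str]] = {}
    for box in sorted(boxes, key=lambda b: b.y):
        texts.setdefault(box.label, []).append(box.text)
    return {label: "\n".join(parts) for label, parts in texts.items()}


result = aggregate(classify(extract(b"%PDF-")))
print(result["body"])
```

Each stage only depends on the output of the previous one, which is what makes it possible to swap one component (say, a rule-based classifier for a trained one) without touching the others.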
Visit the [documentation](https://aphp.github.io/edspdf/) for more information!
Getting started
Installation
Install the library with pip:
```bash
pip install edspdf
```
Extracting text
Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this, either by using the configuration system or by using the pipeline API.
Create a configuration file:
config.cfg
```ini
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
```
and load it from Python:
```python
import edspdf
from pathlib import Path

model = edspdf.load("config.cfg")
```
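To make the structure of the configuration file concrete, the sketch below reads the same content with the standard library only. This is an illustration of how the `[pipeline]` list refers to the `[components.*]` sections, not a description of how EDS-PDF itself loads the file; `CONFIG`, `component_names`, and `components` are names chosen for this example.

```python
import configparser
import json

CONFIG = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# Values are written as JSON literals, so json.loads recovers lists, strings and floats.
component_names = json.loads(parser["pipeline"]["pipeline"])

# Each name listed in the pipeline has a matching [components.<name>] section.
components = {}
for name in component_names:
    section = parser[f"components.{name}"]
    components[name] = {key: json.loads(value) for key, value in section.items()}

for name, settings in components.items():
    print(name, settings)
```

The order of the `pipeline` list is the order in which components run, and the `@factory` key in each section names the component to instantiate with the remaining keys as its parameters.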
Or create a pipeline directly from Python:
```python
from edspdf import Pipeline

model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
```
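For intuition about the `mask-classifier` parameters, here is a minimal geometric sketch. It rests on two assumptions that are not stated in this README: that `x0`, `x1`, `y0`, `y1` are relative page coordinates between 0 and 1, and that a box is labelled `body` when the fraction of its area inside the mask reaches `threshold`. `Rect`, `overlap_ratio`, `classify`, and the `"other"` label are hypothetical and are not EDS-PDF's implementation.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle in relative page coordinates (0 to 1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Degenerate or inverted rectangles have zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(box: Rect, mask: Rect) -> float:
    """Fraction of the box's area that falls inside the mask."""
    inter = Rect(
        x0=max(box.x0, mask.x0),
        y0=max(box.y0, mask.y0),
        x1=min(box.x1, mask.x1),
        y1=min(box.y1, mask.y1),
    )
    return inter.area / box.area if box.area > 0 else 0.0


def classify(box: Rect, mask: Rect, threshold: float) -> str:
    # Assumed rule: "body" when enough of the box lies inside the mask.
    return "body" if overlap_ratio(box, mask) >= threshold else "other"


mask = Rect(x0=0.2, y0=0.3, x1=0.9, y1=0.6)
paragraph = Rect(x0=0.3, y0=0.40, x1=0.5, y1=0.45)  # fully inside the mask
header = Rect(x0=0.1, y0=0.02, x1=0.9, y1=0.08)  # entirely above the mask

print(classify(paragraph, mask, threshold=0.1))  # body
print(classify(header, mask, threshold=0.1))  # other
```

Under these assumptions, a low threshold such as 0.1 keeps boxes that only partly overlap the mask, while a threshold close to 1 keeps only boxes that sit almost entirely inside it.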
This pipeline can then be applied (for instance with this PDF):
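```python
from pathlib import Path

# Get a PDF
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()

# Run the pipeline on the raw bytes
pdf = model(pdf)

# Retrieve the aggregated text labelled "body"
body = pdf.aggregated_texts["body"]

text, style = body.text, body.properties
```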
See the rule-based recipe for a step-by-step explanation of what is happening.
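The call pattern shown above can be wrapped in a small helper to process a whole folder. This is a sketch that relies only on what the example above uses (calling the pipeline on PDF bytes and reading `aggregated_texts["body"].text`); `extract_bodies` and its parameters are hypothetical names, and error handling for PDFs without a `body` entry is left out.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_bodies(model: Callable, folder: Path) -> Dict[str, str]:
    """Apply `model` to every PDF in `folder` and collect the body text per file."""
    bodies = {}
    for path in sorted(folder.glob("*.pdf")):
        # The pipeline is called on raw bytes, as in the example above.
        doc = model(path.read_bytes())
        bodies[path.name] = doc.aggregated_texts["body"].text
    return bodies
```

With a pipeline built as above, `extract_bodies(model, Path("pdfs"))` would return a dictionary mapping each file name to its extracted body text, where `pdfs` stands for whatever folder holds your documents.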
Citation
If you use EDS-PDF, please cite us as below.
```bibtex
@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}
```
Acknowledgement
We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.
Owner
- Name: Greater Paris University Hospitals (AP-HP)
- Login: aphp
- Kind: organization
- Location: Paris
- Website: https://www.aphp.fr/
- Repositories: 35
- Profile: https://github.com/aphp
Citation (CITATION.cff)
```yaml
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  EDS-PDF: Smart text extraction from PDF documents
message: If you use EDS-PDF, please cite us as below.
type: software
authors:
  - given-names: Basile
    family-names: Dura
    orcid: "https://orcid.org/0000-0002-8315-4050"
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Perceval
    family-names: Wajsburt
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Alice
    family-names: Calliger
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Christel
    family-names: Gérardin
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Romain
    family-names: Bey
    affiliation: Assistance Publique – Hôpitaux de Paris
repository-code: "https://github.com/aphp/edspdf"
url: "https://github.com/aphp/edspdf"
abstract: >-
  EDS-PDF provides a modular and extendable framework to extract text from PDF documents.
keywords:
  - PDF
  - extraction
  - python
  - NLP
license: BSD-3-Clause
year: 2022
doi: 10.5281/zenodo.6902977
```
GitHub Events
Total
- Create event: 8
- Release event: 5
- Issues event: 2
- Watch event: 12
- Delete event: 3
- Issue comment event: 17
- Push event: 24
- Pull request event: 7
- Fork event: 1
Last Year
- Create event: 8
- Release event: 5
- Issues event: 2
- Watch event: 12
- Delete event: 3
- Issue comment event: 17
- Push event: 24
- Pull request event: 7
- Fork event: 1
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Perceval Wajsbürt | p****t@a****r | 172 |
| Basile Dura | b****t@a****r | 75 |
| Basile Dura | b****e@b****e | 48 |
| acalliger | a****r@g****m | 1 |
| Ian Fox | i****x@p****e | 1 |
| alice.calliger | a****t@a****r | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 3
- Total pull requests: 31
- Average time to close issues: 23 days
- Average time to close pull requests: 11 days
- Total issue authors: 3
- Total pull request authors: 4
- Average comments per issue: 0.67
- Average comments per pull request: 1.03
- Merged pull requests: 29
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 5
- Average time to close issues: 2 days
- Average time to close pull requests: about 6 hours
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 2.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bdura (1)
- percevalw (1)
- xav-hydra (1)
Pull Request Authors
- percevalw (23)
- bdura (7)
- acalliger (2)
- ian-fox (1)
- JeremyMelton (1)
Dependencies
- streamlit *
- black 22.6.0 develop
- flake8 >=3.0 develop
- mike ^1.1.2 develop
- mkdocs-autorefs ^0.4.1 develop
- mkdocs-bibtex ^2.0.3 develop
- mkdocs-gen-files ^0.3.4 develop
- mkdocs-glightbox ^0.1.6 develop
- mkdocs-literate-nav ^0.4.1 develop
- mkdocs-material ^8.2.8 develop
- mkdocstrings ^0.18.1 develop
- mkdocstrings-python ^0.6.6 develop
- mypy ^0.950 develop
- pre-commit ^2.18.1 develop
- pytest ^7.1.1 develop
- pytest-cov ^3.0.0 develop
- streamlit ^1.8.1 develop
- catalogue ^2.0.7
- loguru ^0.6.0
- networkx ^2.6
- pandas ^1.2
- pdfminer.six ^20220319
- pydantic ^1.2
- pypdfium2 ^2.7.1
- python >=3.7.1,!=3.7.6,!=3.8.1,<3.11
- scikit-learn ^1.0.2
- scipy ^1.7.0
- thinc ^8.0.15
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- codecov/codecov-action v2 composite