edspdf

EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to classify text blocks as body or metadata, using either rule-based or machine-learning approaches.

https://github.com/aphp/edspdf

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.9%) to scientific vocabulary

Keywords

extraction machine-learning pdf

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: aphp
License: bsd-3-clause
Language: Python
Default Branch: main
Homepage: https://aphp.github.io/edspdf/
Size: 8.93 MB

Statistics

Stars: 51
Watchers: 2
Forks: 7
Open Issues: 0
Releases: 12

Topics

extraction machine-learning pdf

Created over 3 years ago · Last pushed about 1 year ago

Metadata Files

Readme Changelog Contributing License Citation Roadmap

README.md

EDS-PDF

EDS-PDF provides a modular framework to extract text information from PDF documents.

You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:

- 📄 Extractors to parse PDFs (based on pdfminer, mupdf or poppler)
- 🎯 Classifiers to perform text box classification, in order to segment PDFs
- 🧩 Aggregators to produce an aggregated output from the detected text boxes
- 🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., embedding building blocks or a trainable classifier)

Visit the 📖 documentation for more information!

Getting started

Installation

Install the library with pip:

```bash
pip install edspdf
```

Extracting text

Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this, either by using the configuration system or by using the pipeline API.

Create a configuration file:

`config.cfg`

```ini
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
```

and load it from Python:

```python
import edspdf
from pathlib import Path

model = edspdf.load("config.cfg")
```
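To make the layout of `config.cfg` concrete, here is a minimal sketch that reads the same text with Python's standard `configparser` and `json` modules. It only illustrates how the `[pipeline]` list names the `[components.*]` sections; it is not how `edspdf.load` is implemented, and the helper `read_components` is a hypothetical name introduced for this example.

```python
import configparser
import json

# Same content as the config.cfg shown above.
CONFIG_TEXT = """
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
"""


def read_components(text):
    """Hypothetical helper: map each pipeline step to its settings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # The pipeline entry is a JSON-style list of component names.
    names = json.loads(parser["pipeline"]["pipeline"])
    components = {}
    for name in names:
        section = parser[f"components.{name}"]
        # Values are JSON-style literals (quoted strings, numbers).
        components[name] = {key: json.loads(value) for key, value in section.items()}
    return components


components = read_components(CONFIG_TEXT)
for name, settings in components.items():
    print(name, "->", settings["@factory"])
```

Each step in the `pipeline` list resolves to one section, and every remaining key in that section is a parameter passed to the factory named by `@factory`.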

Or create a pipeline directly from Python:

```python
from edspdf import Pipeline

model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
```
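The `mask-classifier` settings (`x0`, `x1`, `y0`, `y1`, `threshold`) describe a rectangular region of the page. The sketch below shows one plausible reading of these parameters, assuming the coordinates are fractions of the page size and that `threshold` is the minimum share of a text box's area that must fall inside the mask for it to count as body text. This is an illustration in plain Python, not the library's implementation; `Box`, `overlap_ratio`, `classify` and the `"other"` label are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rectangle in relative page coordinates (0 to 1)."""

    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def overlap_ratio(box: Box, mask: Box) -> float:
    """Share of the box's area that lies inside the mask."""
    width = min(box.x1, mask.x1) - max(box.x0, mask.x0)
    height = min(box.y1, mask.y1) - max(box.y0, mask.y0)
    if width <= 0 or height <= 0 or box.area == 0:
        return 0.0
    return (width * height) / box.area


def classify(box: Box, mask: Box, threshold: float) -> str:
    """Label a box as body when enough of it overlaps the mask (assumed rule)."""
    return "body" if overlap_ratio(box, mask) >= threshold else "other"


# Mask and threshold taken from the configuration above.
MASK = Box(x0=0.2, y0=0.3, x1=0.9, y1=0.6)
THRESHOLD = 0.1

paragraph = Box(x0=0.3, y0=0.35, x1=0.8, y1=0.4)  # fully inside the mask
header = Box(x0=0.1, y0=0.05, x1=0.9, y1=0.1)  # above the mask

print(classify(paragraph, MASK, THRESHOLD))
print(classify(header, MASK, THRESHOLD))
```

Under these assumptions, a low threshold such as 0.1 keeps boxes that only partly overlap the mask, while a threshold of 1.0 would keep only boxes that lie entirely inside it.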

This pipeline can then be applied (for instance with this PDF):

```python
# Get a PDF
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
# Apply the pipeline to the raw bytes
pdf = model(pdf)

body = pdf.aggregated_texts["body"]

text, style = body.text, body.properties
```

See the rule-based recipe for a step-by-step explanation of what is happening.

Citation

If you use EDS-PDF, please cite us as below.

```bibtex
@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}
```

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and AP-HP Foundation for funding this project.

Owner

Name: Greater Paris University Hospitals (AP-HP)
Login: aphp
Kind: organization
Location: Paris

Website: https://www.aphp.fr/
Repositories: 35
Profile: https://github.com/aphp

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  EDS-PDF: Smart text extraction from PDF documents
message: If you use EDS-PDF, please cite us as below.
type: software
authors:
  - given-names: Basile
    family-names: Dura
    orcid: "https://orcid.org/0000-0002-8315-4050"
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Perceval
    family-names: Wajsburt
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Alice
    family-names: Calliger
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Christel
    family-names: Gérardin
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Romain
    family-names: Bey
    affiliation: Assistance Publique – Hôpitaux de Paris
repository-code: "https://github.com/aphp/edspdf"
url: "https://github.com/aphp/edspdf"
abstract: >-
  EDS-PDF provides a modular and extendable framework to extract text from PDF documents.
keywords:
  - PDF
  - extraction
  - python
  - NLP
license: BSD-3-Clause
year: 2022
doi: 10.5281/zenodo.6902977

GitHub Events

Total

Create event: 8
Release event: 5
Issues event: 2
Watch event: 12
Delete event: 3
Issue comment event: 17
Push event: 24
Pull request event: 7
Fork event: 1

Last Year

Create event: 8
Release event: 5
Issues event: 2
Watch event: 12
Delete event: 3
Issue comment event: 17
Push event: 24
Pull request event: 7
Fork event: 1

Committers

Last synced: 9 months ago

All Time

Total Commits: 298
Total Committers: 6
Avg Commits per committer: 49.667
Development Distribution Score (DDS): 0.423

Past Year

Commits: 12
Committers: 1
Avg Commits per committer: 12.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Perceval Wajsbürt	p**t@a**r	172
Basile Dura	b**t@a**r	75
Basile Dura	b**e@b**e	48
acalliger	a**r@g**m	1
Ian Fox	i**x@p**e	1
alice.calliger	a**t@a**r	1
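
The all-time committer figures can be cross-checked from the commit counts in the table above. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption rather than something stated on this page, although it reproduces the reported values.

```python
# Commit counts per committer, copied from the Top Committers table.
commits_by_committer = [172, 75, 48, 1, 1, 1]

total_commits = sum(commits_by_committer)
avg_commits = total_commits / len(commits_by_committer)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits_by_committer) / total_commits

print(f"{total_commits} commits, {avg_commits:.3f} per committer, DDS {dds:.3f}")
# prints "298 commits, 49.667 per committer, DDS 0.423"
```

With the same assumed definition, a single committer accounting for every commit in the past year gives a score of 0.0, consistent with the Past Year figures.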

Committer Domains (Top 20 + Academic)

aphp.fr: 3
pm.me: 1
bdura.me: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 3
Total pull requests: 31
Average time to close issues: 23 days
Average time to close pull requests: 11 days
Total issue authors: 3
Total pull request authors: 4
Average comments per issue: 0.67
Average comments per pull request: 1.03
Merged pull requests: 29
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 5
Average time to close issues: 2 days
Average time to close pull requests: about 6 hours
Issue authors: 1
Pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 2.0
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

bdura (1)
percevalw (1)
xav-hydra (1)

Pull Request Authors

percevalw (23)
bdura (7)
acalliger (2)
ian-fox (1)
JeremyMelton (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

demo/requirements.txt pypi

streamlit *

pyproject.toml pypi

black 22.6.0 develop
flake8 >=3.0 develop
mike ^1.1.2 develop
mkdocs-autorefs ^0.4.1 develop
mkdocs-bibtex ^2.0.3 develop
mkdocs-gen-files ^0.3.4 develop
mkdocs-glightbox ^0.1.6 develop
mkdocs-literate-nav ^0.4.1 develop
mkdocs-material ^8.2.8 develop
mkdocstrings ^0.18.1 develop
mkdocstrings-python ^0.6.6 develop
mypy ^0.950 develop
pre-commit ^2.18.1 develop
pytest ^7.1.1 develop
pytest-cov ^3.0.0 develop
streamlit ^1.8.1 develop
catalogue ^2.0.7
loguru ^0.6.0
networkx ^2.6
pandas ^1.2
pdfminer.six ^20220319
pydantic ^1.2
pypdfium2 ^2.7.1
python >=3.7.1,!=3.7.6,!=3.8.1,<3.11
scikit-learn ^1.0.2
scipy ^1.7.0
thinc ^8.0.15

.github/workflows/documentation.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-python v2 composite

.github/workflows/release.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-python v2 composite

.github/workflows/tests.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-python v2 composite
codecov/codecov-action v2 composite