edspdf

EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to use rule-based or machine-learning-based approaches to classify text blocks as body or metadata.

https://github.com/aphp/edspdf

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

extraction machine-learning pdf
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 51
  • Watchers: 2
  • Forks: 7
  • Open Issues: 0
  • Releases: 12
Topics
extraction machine-learning pdf
Created over 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme Changelog Contributing License Citation Roadmap

README.md


EDS-PDF

EDS-PDF provides a modular framework to extract text information from PDF documents.

You can use it out-of-the-box, or extend it to fit your specific use case. We provide a pipeline system and various utilities for visualizing and processing PDFs, as well as multiple components to build complex models:

- 📄 Extractors to parse PDFs (based on pdfminer, mupdf or poppler)
- 🎯 Classifiers to perform text box classification, in order to segment PDFs
- 🧩 Aggregators to produce an aggregated output from the detected text boxes
- 🧠 Trainable layers to incorporate machine learning in your pipeline (e.g., embedding building blocks or a trainable classifier)
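The division of labour between extractors, classifiers and aggregators can be illustrated without the library. The following is a minimal sketch in plain Python, not EDS-PDF's implementation: `TextBox`, `ToyPipeline`, `add_step` and the three `toy_*` functions are hypothetical names, the extractor returns hard-coded boxes instead of parsing a real PDF, and the "middle band of the page is body" rule is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TextBox:
    """A piece of text with a vertical position (hypothetical type)."""
    text: str
    y: float  # assumed relative position: 0 = top of the page, 1 = bottom
    label: Optional[str] = None


def toy_extractor(pdf: bytes) -> List[TextBox]:
    # Stand-in for a real extractor: ignores the bytes and returns fixed boxes.
    return [
        TextBox("Hospital letterhead", y=0.05),
        TextBox("Dear colleague,", y=0.40),
        TextBox("The patient was seen today.", y=0.50),
        TextBox("Page 1/1", y=0.95),
    ]


def toy_classifier(boxes: List[TextBox]) -> List[TextBox]:
    # Label boxes in the middle band of the page as body, everything else as meta.
    for box in boxes:
        box.label = "body" if 0.3 <= box.y <= 0.6 else "meta"
    return boxes


def toy_aggregator(boxes: List[TextBox]) -> Dict[str, str]:
    # Join the text of all boxes that share a label, in reading order.
    grouped: Dict[str, List[str]] = {}
    for box in boxes:
        grouped.setdefault(box.label, []).append(box.text)
    return {label: "\n".join(texts) for label, texts in grouped.items()}


class ToyPipeline:
    """Applies each registered step to the output of the previous one."""

    def __init__(self) -> None:
        self.steps: List[Callable] = []

    def add_step(self, step: Callable) -> None:
        self.steps.append(step)

    def __call__(self, data):
        for step in self.steps:
            data = step(data)
        return data


pipeline = ToyPipeline()
for step in (toy_extractor, toy_classifier, toy_aggregator):
    pipeline.add_step(step)

result = pipeline(b"%PDF-placeholder")
print(result["body"])
```

Keeping each stage a plain callable means any one of them can be swapped, for example a rule-based classifier for a trained one, without touching the others.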

Visit the 📖 documentation for more information!

Getting started

Installation

Install the library with pip:

```bash
pip install edspdf
```

Extracting text

Let's build a simple PDF extractor that uses a rule-based classifier. There are two ways to do this, either by using the configuration system or by using the pipeline API.

Create a configuration file:

config.cfg

```ini
[pipeline]
pipeline = ["extractor", "classifier", "aggregator"]

[components.extractor]
@factory = "pdfminer-extractor"

[components.classifier]
@factory = "mask-classifier"
x0 = 0.2
x1 = 0.9
y0 = 0.3
y1 = 0.6
threshold = 0.1

[components.aggregator]
@factory = "simple-aggregator"
```

and load it from Python:

```python
import edspdf
from pathlib import Path

# Build the pipeline described in the configuration file
model = edspdf.load("config.cfg")
```

Or create a pipeline directly from Python:

```python
from edspdf import Pipeline

model = Pipeline()
model.add_pipe("pdfminer-extractor")
model.add_pipe(
    "mask-classifier",
    config=dict(
        x0=0.2,
        x1=0.9,
        y0=0.3,
        y1=0.6,
        threshold=0.1,
    ),
)
model.add_pipe("simple-aggregator")
```

This pipeline can then be applied (for instance with this PDF):

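```python
from pathlib import Path

# Get a PDF
pdf = Path("/Users/perceval/Development/edspdf/tests/resources/letter.pdf").read_bytes()
# Apply the pipeline to the raw bytes
pdf = model(pdf)

# Retrieve the aggregated text labelled as body
body = pdf.aggregated_texts["body"]

# The text itself and its associated properties
text, style = body.text, body.properties
```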

See the rule-based recipe for a step-by-step explanation of what is happening.
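To give an intuition for the `x0`, `x1`, `y0`, `y1` and `threshold` settings used above, here is a minimal sketch of mask-based classification in plain Python. It is not the library's code. It assumes that coordinates are relative to the page (between 0 and 1) and that a box is labelled `body` when the fraction of its area inside the mask reaches `threshold`. The `Box` type, the helper functions and the `other` label are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in relative page coordinates (assumed in [0, 1])."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_fraction(box: Box, mask: Box) -> float:
    """Fraction of the box's area that falls inside the mask."""
    width = max(0.0, min(box.x1, mask.x1) - max(box.x0, mask.x0))
    height = max(0.0, min(box.y1, mask.y1) - max(box.y0, mask.y0))
    area = (box.x1 - box.x0) * (box.y1 - box.y0)
    return (width * height) / area if area > 0 else 0.0


def classify(box: Box, mask: Box, threshold: float) -> str:
    # Assumed rule: enough overlap with the mask means "body".
    return "body" if overlap_fraction(box, mask) >= threshold else "other"


# Same values as in the configuration above
mask = Box(x0=0.2, y0=0.3, x1=0.9, y1=0.6)
threshold = 0.1

paragraph = Box(x0=0.25, y0=0.35, x1=0.80, y1=0.40)  # entirely inside the mask
header = Box(x0=0.10, y0=0.02, x1=0.90, y1=0.08)     # entirely above the mask

print(classify(paragraph, mask, threshold))
print(classify(header, mask, threshold))
```

Under these assumptions, a low threshold such as 0.1 keeps boxes that only partly overlap the mask, while a threshold of 1.0 would keep only boxes lying entirely inside it.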

Citation

If you use EDS-PDF, please cite us as below.

```bibtex
@software{edspdf,
  author  = {Dura, Basile and Wajsburt, Perceval and Calliger, Alice and Gérardin, Christel and Bey, Romain},
  doi     = {10.5281/zenodo.6902977},
  license = {BSD-3-Clause},
  title   = {{EDS-PDF: Smart text extraction from PDF documents}},
  url     = {https://github.com/aphp/edspdf}
}
```

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and the AP-HP Foundation for funding this project.

Owner

  • Name: Greater Paris University Hospitals (AP-HP)
  • Login: aphp
  • Kind: organization
  • Location: Paris

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  EDS-PDF: Smart text extraction from PDF documents
message: If you use EDS-PDF, please cite us as below.
type: software
authors:
  - given-names: Basile
    family-names: Dura
    orcid: "https://orcid.org/0000-0002-8315-4050"
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Perceval
    family-names: Wajsburt
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Alice
    family-names: Calliger
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Christel
    family-names: Gérardin
    affiliation: Assistance Publique – Hôpitaux de Paris
  - given-names: Romain
    family-names: Bey
    affiliation: Assistance Publique – Hôpitaux de Paris
repository-code: "https://github.com/aphp/edspdf"
url: "https://github.com/aphp/edspdf"
abstract: >-
  EDS-PDF provides a modular and extendable framework to extract text from PDF documents.
keywords:
  - PDF
  - extraction
  - python
  - NLP
license: BSD-3-Clause
year: 2022
doi: 10.5281/zenodo.6902977

GitHub Events

Total
  • Create event: 8
  • Release event: 5
  • Issues event: 2
  • Watch event: 12
  • Delete event: 3
  • Issue comment event: 17
  • Push event: 24
  • Pull request event: 7
  • Fork event: 1
Last Year
  • Create event: 8
  • Release event: 5
  • Issues event: 2
  • Watch event: 12
  • Delete event: 3
  • Issue comment event: 17
  • Push event: 24
  • Pull request event: 7
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 298
  • Total Committers: 6
  • Avg Commits per committer: 49.667
  • Development Distribution Score (DDS): 0.423
Past Year
  • Commits: 12
  • Committers: 1
  • Avg Commits per committer: 12.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Perceval Wajsbürt p****t@a****r 172
Basile Dura b****t@a****r 75
Basile Dura b****e@b****e 48
acalliger a****r@g****m 1
Ian Fox i****x@p****e 1
alice.calliger a****t@a****r 1
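The summary figures above can be reproduced from the per-committer counts. The sketch below assumes that the Development Distribution Score is one minus the top committer's share of all commits. This definition is not stated on this page, but it matches the listed values. The function name is hypothetical.

```python
from typing import Sequence


def development_distribution_score(commit_counts: Sequence[int]) -> float:
    """Assumed definition: 1 - (commits by the top committer / total commits)."""
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


# Commit counts from the Top Committers table, one entry per listed row
all_time = [172, 75, 48, 1, 1, 1]
# Past year: a single committer with 12 commits
past_year = [12]

print(sum(all_time))                                        # 298
print(round(sum(all_time) / len(all_time), 3))              # 49.667
print(round(development_distribution_score(all_time), 3))   # 0.423
print(round(development_distribution_score(past_year), 3))  # 0.0
```

A score of 0.0 means a single person made every commit in the period, which is the case for the past year here.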

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 3
  • Total pull requests: 31
  • Average time to close issues: 23 days
  • Average time to close pull requests: 11 days
  • Total issue authors: 3
  • Total pull request authors: 4
  • Average comments per issue: 0.67
  • Average comments per pull request: 1.03
  • Merged pull requests: 29
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 5
  • Average time to close issues: 2 days
  • Average time to close pull requests: about 6 hours
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 2.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • bdura (1)
  • percevalw (1)
  • xav-hydra (1)
Pull Request Authors
  • percevalw (23)
  • bdura (7)
  • acalliger (2)
  • ian-fox (1)
  • JeremyMelton (1)

Dependencies

demo/requirements.txt pypi
  • streamlit *
pyproject.toml pypi
  • black 22.6.0 develop
  • flake8 >=3.0 develop
  • mike ^1.1.2 develop
  • mkdocs-autorefs ^0.4.1 develop
  • mkdocs-bibtex ^2.0.3 develop
  • mkdocs-gen-files ^0.3.4 develop
  • mkdocs-glightbox ^0.1.6 develop
  • mkdocs-literate-nav ^0.4.1 develop
  • mkdocs-material ^8.2.8 develop
  • mkdocstrings ^0.18.1 develop
  • mkdocstrings-python ^0.6.6 develop
  • mypy ^0.950 develop
  • pre-commit ^2.18.1 develop
  • pytest ^7.1.1 develop
  • pytest-cov ^3.0.0 develop
  • streamlit ^1.8.1 develop
  • catalogue ^2.0.7
  • loguru ^0.6.0
  • networkx ^2.6
  • pandas ^1.2
  • pdfminer.six ^20220319
  • pydantic ^1.2
  • pypdfium2 ^2.7.1
  • python >=3.7.1,!=3.7.6,!=3.8.1,<3.11
  • scikit-learn ^1.0.2
  • scipy ^1.7.0
  • thinc ^8.0.15
.github/workflows/documentation.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/release.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/tests.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v2 composite