https://github.com/datamade/pdf-textextract

Docker Container for a Make-based, PDF extraction using OCR

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.5%) to scientific vocabulary

Last synced: 10 months ago

Repository

Docker Container for a Make-based, PDF extraction using OCR

Basic Info

Host: GitHub
Owner: datamade
License: mit
Language: Python
Default Branch: main
Size: 147 KB

Statistics

Stars: 12
Watchers: 2
Forks: 4
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License

PDF Text Extraction, the simplistic and resilient way.

Extracting text from PDFs seems pretty simple until you try. Most PDFs are happy little trees, and so wonderful tools like pdftotext just work. If you just have a few PDFs that you want to extract text from, you should start there.

But the PDF "format" is a vast and shadowy forest, and there are lots of PDFs where extracting the text by looking into the file is impossible or very difficult.

However, well-formed PDFs can always be opened by PDF programs and have their contents rendered on a screen. That's where our approach to text extraction starts. Here's what we do:

1. Render the pages of a PDF as images with poppler-utils.
2. Perform a little image processing, with opencv, to make those images more amenable to OCR.
3. OCR the individual images using tesseract.
4. Recombine the OCRed texts into a single file.

We perform these steps using Make because it gives us easy parallelism and a nice way to restart processes if they get interrupted.
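The flow of those steps can be sketched in Python. This is only an illustration, not the repository's code: the project drives the tools from a Makefile, whereas this sketch calls them with `subprocess`. The function names, the 300 DPI setting, and the `page-<n>.png` naming are assumptions, and the opencv preprocessing step is left out for brevity. It assumes `pdftoppm` (from poppler-utils) and `tesseract` are on the PATH.

```python
import json
import re
import subprocess
from pathlib import Path


def render_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Build the pdftoppm command that renders every page of `pdf` as a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_command(image: Path) -> list[str]:
    """Build the tesseract command for one image; 'stdout' prints the text."""
    return ["tesseract", str(image), "stdout"]


def page_number(image: Path) -> int:
    """pdftoppm names pages '<prefix>-<n>.png'; extract n for numeric sorting."""
    match = re.search(r"-(\d+)\.png$", image.name)
    if match is None:
        raise ValueError(f"unexpected page image name: {image.name}")
    return int(match.group(1))


def recombine(filename: str, page_texts: list[str]) -> str:
    """Combine per-page texts into one JSON document, one entry per page."""
    return json.dumps({"filename": filename, "pages": page_texts})


def extract(pdf: Path, workdir: Path) -> str:
    """Render, OCR, and recombine one PDF (hypothetical helper, runs serially)."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, workdir / "page"), check=True)
    images = sorted(workdir.glob("page-*.png"), key=page_number)
    texts = [
        subprocess.run(
            ocr_command(image), check=True, capture_output=True, text=True
        ).stdout
        for image in images
    ]
    return recombine(pdf.name, texts)
```

A serial loop like this has to start over if it is interrupted. With Make, each page image and each OCR result is a file target, so finished work is skipped on restart and independent pages can run in parallel.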

How to use

The docker-compose.yml gives an example of how you might use this pdf-textextract Docker container. The most important settings are the mount points:

```yml
volumes:
  # Set these to change where you are reading pdfs from and
  # writing processed json to
  - "./pdfs:/app/input"
  - "./intermediate:/app/intermediate"
  - "./texts:/app/output"
```

If your PDFs are in a directory called ~/lots_of_pdfs, you can adjust the docker-compose.yml like:

```yml
volumes:
  # Set these to change where you are reading pdfs from and
  # writing processed json to
  - "~/lots_of_pdfs:/app/input"
  - "./intermediate:/app/intermediate"
  - "./texts:/app/output"
```

Once you've got that set the way you want it, run:

```bash
docker-compose up
```

and you'll start processing the PDFs.

When everything is done you'll have a lot of files in your output directory (by default ./texts) that look like d356e527274c55c51c8008af74dfd08ce3051710f336341556e3d1d4eb7c6cf2.json and have contents like:

```json
{
  "filename": "6.pdf",
  "pages": [
    " \n\n \n\nWade L. Robison and Linda Reeser\nEthical Decision-Making in Social Work\n\nChapter 6\n\nJustice\n\nIntroduction\n1. Particular justice\n. The formal principle of justice\n. Substantive principles of justice\n. Using princ.."
  ]
}
```
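Because each output file records the original filename and one string per page, reading the results back is straightforward. Here is a minimal sketch, assuming the output directory contains only files with the `filename` and `pages` keys shown above; `load_texts` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def load_texts(output_dir):
    """Map each original PDF filename to its full text, pages joined by newlines."""
    texts = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            record = json.load(handle)
        texts[record["filename"]] = "\n".join(record["pages"])
    return texts
```

For example, `load_texts("./texts")` would return a dictionary keyed by names such as `6.pdf`.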

The names of the output files are based on a hash of the original PDF filename. We do this because Make handles filenames containing spaces or other special characters poorly, and hashing avoids a lot of complexity.
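The idea can be sketched as follows. The choice of SHA-256 is an assumption: the example name above is 64 hexadecimal characters long, which is consistent with SHA-256, but the exact hash function and input encoding used by the project are not stated here. `output_name` is a hypothetical helper.

```python
import hashlib


def output_name(pdf_filename: str) -> str:
    """Derive a Make-friendly output name from a PDF filename.

    Assumption: SHA-256 over the UTF-8 filename; the project may differ.
    """
    digest = hashlib.sha256(pdf_filename.encode("utf-8")).hexdigest()
    return f"{digest}.json"
```

Whatever the original name contains (spaces, parentheses, non-ASCII characters), the result uses only hexadecimal digits plus the `.json` extension, and the original name is still recoverable from the `filename` field inside the file.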

This repository

This repository contains the Makefile, Python scripts, and a Dockerfile for building an image, as well as a GitHub Action to publish to GitHub's container registry.

Funding

The open sourcing of this work was funded by Project Recognize. Funding for Project Recognize is provided by an R01 grant from the National Institute on Alcohol Abuse and Alcoholism within the National Institutes of Health (NIH) under grant number R01AA029076. You can find out more at ProjectRecognize.org.

Owner

Name: datamade
Login: datamade
Kind: organization
Email: info@datamade.us
Location: Chicago, IL

Website: http://datamade.us
Twitter: datamadeco
Repositories: 123
Profile: https://github.com/datamade

We build open source technology using open data to empower journalists, researchers, governments and advocacy organizations.

GitHub Events

Total

Watch event: 3
Fork event: 2

Last Year

Watch event: 3
Fork event: 2

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 11 days
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 11 days
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Pull Request Authors

alison985 (2)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

.github/workflows/main.yml actions

docker/build-push-action v2 composite
docker/login-action v1 composite

Dockerfile docker

python 3.10-slim-bullseye build

docker-compose.yml docker

ghcr.io/datamade/pdf-textextract latest

requirements.txt pypi

numpy *
opencv-python-headless *
