https://github.com/datamade/pdf-textextract

Docker Container for a Make-based, PDF extraction using OCR

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: datamade
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 147 KB
Statistics
  • Stars: 12
  • Watchers: 2
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed almost 2 years ago
Metadata Files
  • Readme
  • License

README.md

PDF Text Extraction, the simplistic and resilient way.

Extracting text from PDFs seems pretty simple until you try. Most PDFs are happy little trees, and so wonderful tools like pdftotext just work. If you just have a few PDFs that you want to extract text from, you should start there.

But the PDF "format" is a vast and shadowy forest, and there are lots of PDFs where extracting the text by looking into the file is impossible or very difficult.

However, well-formed PDFs can always be opened by PDF programs and have their contents rendered on a screen. That's where we start in our approach to text extraction. Here's what we do:

  1. Render the pages of a PDF as images with poppler-utils.
  2. Perform a little image processing, with opencv, to make those images more amenable to OCR.
  3. OCR the individual images using tesseract.
  4. Recombine the OCRed texts into a single file.
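
As a rough illustration, steps 1, 3 and 4 could be driven from Python as shown below. This is only a sketch: the repository runs these steps from a Makefile, and the function names, the 300 dpi resolution, and the `page` file prefix are assumptions rather than the project's actual settings. Step 2 (the opencv cleanup) is left out to keep the example free of third-party dependencies.

```python
import json
import subprocess
from pathlib import Path


def render_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> list[str]:
    """Step 1: pdftoppm (from poppler-utils) writes one PNG per page."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(out_prefix)]


def ocr_command(image: Path, out_base: Path) -> list[str]:
    """Step 3: tesseract writes the recognized text to <out_base>.txt."""
    return ["tesseract", str(image), str(out_base)]


def combine_pages(filename: str, page_texts: list[str]) -> str:
    """Step 4: gather the per-page texts into one JSON document."""
    return json.dumps({"filename": filename, "pages": page_texts})


def extract(pdf: Path, workdir: Path) -> str:
    """Run steps 1, 3 and 4 for one PDF (needs pdftoppm and tesseract installed)."""
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_command(pdf, workdir / "page"), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(ocr_command(image, image.with_suffix("")), check=True)
        texts.append(image.with_suffix(".txt").read_text())
    return combine_pages(pdf.name, texts)
```

Calling `extract(Path("6.pdf"), Path("intermediate/6"))` would return a JSON string with the same `filename` and `pages` keys as the output files described later in this README.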

We perform these steps using Make because it gives us easy parallelism and a nice way to restart processes if they get interrupted.
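
The restart behaviour follows from Make's rule that a target is rebuilt only when it is missing or older than its prerequisite. A minimal Python sketch of that check is below; `needs_rebuild` is a hypothetical helper written for illustration and is not part of this repository.

```python
import os


def needs_rebuild(target: str, source: str) -> bool:
    """Mimic Make: rebuild if the target is missing or older than its source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)
```

After an interrupted run, pages whose text files already exist are skipped and only the missing ones are processed again. Because each page is an independent target, Make can also build them in parallel with its `-j` option.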

How to use

The docker-compose.yml gives an example of how you might use this pdf-textextract Docker container. The most important settings are the mount points:

    volumes:
      # Set these to change where you are reading pdfs from and
      # writing processed json to
      - "./pdfs:/app/input"
      - "./intermediate:/app/intermediate"
      - "./texts:/app/output"

If your PDFs are in a directory called ~/lots_of_pdfs, you can adjust the docker-compose.yml like:

    volumes:
      # Set these to change where you are reading pdfs from and
      # writing processed json to
      - "~/lots_of_pdfs:/app/input"
      - "./intermediate:/app/intermediate"
      - "./texts:/app/output"

Once you've got that set the way you want it, you run

    docker-compose up

and you'll start processing the PDFs.

When everything is done you'll have a lot of files in your output directory (by default ./texts) that look like d356e527274c55c51c8008af74dfd08ce3051710f336341556e3d1d4eb7c6cf2.json and have contents like:

    {
      "filename": "6.pdf",
      "pages": [
        " \n\n \n\nWade L. Robison and Linda Reeser\nEthical Decision-Making in Social Work\n\nChapter 6\n\nJustice\n\nIntroduction\n1. Particular justice\n. The formal principle of justice\n. Substantive principles of justice\n. Using princ.."
      ]
    }
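
To work with these files afterwards, something like the following sketch could collect them into a single mapping. It assumes only the `filename` and `pages` keys shown above; `load_texts` is a hypothetical helper and is not part of this repository.

```python
import json
from pathlib import Path


def load_texts(output_dir: str) -> dict[str, str]:
    """Map each original PDF filename to its full text, pages joined in order."""
    texts = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        texts[record["filename"]] = "\n".join(record["pages"])
    return texts
```

With the default mount points, `load_texts("./texts")` would return one entry per processed PDF, keyed by the original filename rather than by the hashed output name.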

The names of the output files are based on a hash of the original PDF filename. We do this because Make is sensitive to the form of filenames (spaces and special characters, for example), and hashing avoids a lot of complexity.
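
The example output name above is 64 hexadecimal characters long, which is consistent with a SHA-256 digest, but the exact algorithm and the exact input being hashed are assumptions here. Under that assumption, the naming scheme might look like this sketch, where `output_name` is a hypothetical helper:

```python
import hashlib


def output_name(pdf_filename: str) -> str:
    """Derive a Make-friendly output name from a PDF filename (assumed SHA-256)."""
    digest = hashlib.sha256(pdf_filename.encode("utf-8")).hexdigest()
    return f"{digest}.json"
```

Whatever the input looks like, the result contains only hexadecimal characters and a `.json` suffix, so a name such as `my report (final).pdf` no longer poses a problem for Make. The original name is still recoverable from the `filename` field inside each output file.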

This repository

This repository contains the Makefile, the Python scripts, and a Dockerfile for building an image, as well as a GitHub Action to publish to GitHub's container registry.

Funding

The open sourcing of this work was funded by Project Recognize. Funding for Project Recognize is provided by an R01 grant from the National Institute on Alcohol Abuse and Alcoholism within the National Institutes of Health (NIH) under grant number R01AA029076. You can find out more at ProjectRecognize.org.

Owner

  • Name: datamade
  • Login: datamade
  • Kind: organization
  • Email: info@datamade.us
  • Location: Chicago, IL

We build open source technology using open data to empower journalists, researchers, governments and advocacy organizations.

GitHub Events

Total
  • Watch event: 3
  • Fork event: 2
Last Year
  • Watch event: 3
  • Fork event: 2

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 11 days
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 11 days
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • alison985 (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/main.yml actions
  • docker/build-push-action v2 composite
  • docker/login-action v1 composite
Dockerfile docker
  • python 3.10-slim-bullseye build
docker-compose.yml docker
  • ghcr.io/datamade/pdf-textextract latest
requirements.txt pypi
  • numpy *
  • opencv-python-headless *