https://github.com/clowder-framework/extractors-pymupdf

Clowder extractor for PyMuPDF

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.3%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: clowder-framework
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 19.5 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
  • Readme
  • License

README.md

extractors-pymupdf

Clowder extractor for PyMuPDF. The extractor takes a PDF file as input and outputs JSON and CSV files containing the textual contents of the PDF.
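The repository's extractor.py is not reproduced on this page, so the following is only a minimal sketch of the PDF-to-JSON/CSV step described above, not the project's actual implementation. It assumes one record per page with the hypothetical field names `page` and `text`; the helper names `extract_page_texts`, `to_records`, and `write_outputs` are also made up for illustration. Only `fitz.open` and `page.get_text()` come from PyMuPDF.

```python
import csv
import json
from pathlib import Path


def extract_page_texts(pdf_path):
    """Return the plain text of each page, in page order."""
    # PyMuPDF is imported lazily so the writer helpers below work without it.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def to_records(page_texts):
    """Pair each page's text with a 1-based page number (assumed layout)."""
    return [
        {"page": number, "text": text.strip()}
        for number, text in enumerate(page_texts, start=1)
    ]


def write_outputs(records, json_path, csv_path):
    """Write the same records as a JSON array and as a CSV table."""
    Path(json_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["page", "text"])
        writer.writeheader()
        writer.writerows(records)
```

With PyMuPDF installed, something like `write_outputs(to_records(extract_page_texts("paper.pdf")), "paper.json", "paper.csv")` would produce both files for a hypothetical `paper.pdf`. Keeping extraction separate from writing means the output format can be checked without a PDF on hand.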

Instructions to run the extractor

  • Activate the virtual environment
  • Install dependencies: pip install -r requirements.txt
  • Run the extractor: python extractor.py

Build extractor image

  • Run docker build . -t hub.ncsa.illinois.edu/clowder/extractors-pymupdf:<version> to build the Docker image
  • If you run into the error [Errno 28] No space left on device, try the following:
    • Free up space by running docker system prune --all
    • Increase the disk image size. You can find this setting in Docker Desktop

Publish Image to Private NCSA repo

  • Log in first: docker login hub.ncsa.illinois.edu
  • Run docker image push hub.ncsa.illinois.edu/clowder/extractors-pymupdf:<version>
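The build and publish steps share one image reference, so a small helper can keep the version tag consistent between them. The sketch below is an assumption, not part of the repository: the names `image_ref`, `build_command`, `push_command`, and `build_and_push` are hypothetical, and it only assembles the same `docker build` and `docker image push` commands given in the README. It defaults to a dry run that prints the commands, and `docker login` must still be done beforehand.

```python
import subprocess

# Image name taken from the README; the version tag is supplied by the caller.
REGISTRY_IMAGE = "hub.ncsa.illinois.edu/clowder/extractors-pymupdf"


def image_ref(version):
    """Return the full image reference for a version tag."""
    if not version:
        raise ValueError("version must be a non-empty string")
    return f"{REGISTRY_IMAGE}:{version}"


def build_command(version, context="."):
    """Argument list equivalent to: docker build . -t <image>:<version>"""
    return ["docker", "build", context, "-t", image_ref(version)]


def push_command(version):
    """Argument list equivalent to: docker image push <image>:<version>"""
    return ["docker", "image", "push", image_ref(version)]


def build_and_push(version, dry_run=True):
    """Build then push; with dry_run=True the commands are only printed."""
    commands = [build_command(version), push_command(version)]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops before the push if the build fails.
            subprocess.run(command, check=True)
    return commands
```

Calling `build_and_push("<version>", dry_run=False)` with a real tag in place of `<version>` would run both steps in order.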

Deployment

  • Please refer to Clowder instructions
  • Current deployment: hub.ncsa.illinois.edu/clowder/extractors-pymupdf:0.2.0.0

Owner

  • Name: Clowder
  • Login: clowder-framework
  • Kind: organization
  • Email: clowder@lists.illinois.edu

Research data management for long tail data.

GitHub Events

Total
  • Member event: 1
  • Push event: 5
  • Create event: 2
Last Year
  • Member event: 1
  • Push event: 5
  • Create event: 2

Dependencies

Dockerfile docker
  • python 3.10 build
requirements.txt pypi
  • pandas *
  • pyclowder ==2.7.0
  • pymupdf *
  • scispacy *
  • spacy *
  • thinc *