https://github.com/clowder-framework/extractors-pymupdf
Clowder extractor for PyMuPDF
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.3%) to scientific vocabulary
Last synced: 10 months ago
Repository
Clowder extractor for PyMuPDF
Basic Info
- Host: GitHub
- Owner: clowder-framework
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 19.5 KB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
- Created: over 1 year ago
- Last pushed: over 1 year ago
Metadata Files
Readme
License
README.md
extractors-pymupdf
Clowder extractor for PyMuPDF. The extractor takes a PDF file as input and outputs JSON and CSV files containing the text content of the PDF.
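The PDF-to-JSON/CSV flow described above can be sketched roughly as follows. This is a minimal illustration and not the code in `extractor.py`: the function names (`extract_pages`, `write_outputs`), the one-record-per-page layout, and the `page`/`text` column names are all assumptions. It relies on PyMuPDF's `fitz.open` and `Page.get_text` for extraction and on the standard library for serialization; the repository lists pandas as a dependency, which could be used for the CSV step instead.

```python
import csv
import json
from pathlib import Path


def extract_pages(pdf_path):
    """Return one record per page: {"page": 1-based number, "text": page text}."""
    import fitz  # PyMuPDF; imported here so write_outputs works without it

    with fitz.open(pdf_path) as doc:
        return [
            {"page": number, "text": page.get_text()}
            for number, page in enumerate(doc, start=1)
        ]


def write_outputs(records, out_dir, stem):
    """Write the records to <stem>.json and <stem>.csv inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / f"{stem}.json"
    csv_path = out_dir / f"{stem}.csv"

    # JSON: a list of per-page records (hypothetical layout).
    json_path.write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )

    # CSV: one row per page; newline="" lets the csv module handle line endings.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["page", "text"])
        writer.writeheader()
        writer.writerows(records)

    return json_path, csv_path
```

For example, `write_outputs(extract_pages("paper.pdf"), "out", "paper")` would produce `out/paper.json` and `out/paper.csv` (the file names here are placeholders). How the real extractor hands these files back to Clowder is not shown here.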
Instructions to run the extractor
- Activate the virtual environment
- Install dependencies: pip install -r requirements.txt
- Run the extractor: python extractor.py
Build extractor image
- Run docker build . -t hub.ncsa.illinois.edu/clowder/extractors-pymupdf:<version> to build the Docker image.
- If you run into the error [Errno 28] No space left on device, try the following:
  - Free up space by running docker system prune --all
  - Increase the disk image size. You can find this setting in Docker Desktop.
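Because the build can fail with `[Errno 28] No space left on device`, a small pre-flight check before `docker build` may save a failed run. The sketch below is an illustration only and is not part of the repository: the helper names and the 5 GiB threshold are assumptions, and the check inspects the host filesystem for the path you pass in, which may differ from the Docker Desktop disk image on macOS or Windows.

```python
import shutil
import subprocess

IMAGE = "hub.ncsa.illinois.edu/clowder/extractors-pymupdf"
MIN_FREE_GIB = 5  # assumed threshold; adjust for your image size


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024**3


def build_command(version, context="."):
    """Compose the README's docker build command for a given version tag."""
    return ["docker", "build", context, "-t", f"{IMAGE}:{version}"]


def build(version, context="."):
    """Run the build only when enough disk space appears to be available."""
    available = free_gib(context)
    if available < MIN_FREE_GIB:
        raise RuntimeError(
            f"Only {available:.1f} GiB free; "
            "consider `docker system prune --all` first."
        )
    subprocess.run(build_command(version, context), check=True)
```

Calling `build("<version>")` with the tag you intend to publish runs the same command as the build step above, after the disk check passes.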
Publish Image to Private NCSA repo
- Log in first: docker login hub.ncsa.illinois.edu
- Run docker image push hub.ncsa.illinois.edu/clowder/extractors-pymupdf:<version>
Deployment
- Please refer to Clowder instructions
- Current deployment: hub.ncsa.illinois.edu/clowder/extractors-pymupdf:0.2.0.0
Owner
- Name: Clowder
- Login: clowder-framework
- Kind: organization
- Email: clowder@lists.illinois.edu
- Website: https://clowderframework.org/
- Repositories: 30
- Profile: https://github.com/clowder-framework
Research data management for long tail data.
GitHub Events
Total
- Member event: 1
- Push event: 5
- Create event: 2
Last Year
- Member event: 1
- Push event: 5
- Create event: 2
Dependencies
Dockerfile
docker
- python 3.10 build
requirements.txt
pypi
- pandas *
- pyclowder ==2.7.0
- pymupdf *
- scispacy *
- spacy *
- thinc *