pdf-microarray
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bjorntropf
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 216 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
- Releases: 1
Metadata Files
README.md
PDFMicroarray
Overview
PDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.
It extracts text from multiple sources within PDF documents, including:
- Plain text
- Text from images (through OCR)
- Text from embedded diagrams (through page rendering and OCR)
and stores the extracted text in a designated output directory.
The processed text can then be analyzed using the Levenshtein distance to detect occurrences of specified words. These occurrences can be plotted in a microarray format.
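To make the Levenshtein-based matching concrete, here is a minimal pure-Python sketch. It is not the tool's implementation (the README states that thefuzz is used for this); the function names `levenshtein` and `count_fuzzy_matches` and the `max_distance` threshold are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def count_fuzzy_matches(text: str, word: str, max_distance: int = 1) -> int:
    """Count tokens in text that are within max_distance edits of word.

    max_distance is a hypothetical tolerance, not a documented option.
    """
    target = word.lower()
    tokens = (t.strip(".,;:!?()[]\"'").lower() for t in text.split())
    return sum(1 for t in tokens if t and levenshtein(t, target) <= max_distance)


sample = "OCR misread the word as micr0array, not microarray."
print(count_fuzzy_matches(sample, "microarray"))     # tolerant of one OCR error
print(count_fuzzy_matches(sample, "microarray", 0))  # exact matches only
```

A small tolerance like this is what makes fuzzy matching useful on OCR output, where single-character misreads are common.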
Installation
Tesseract is required for this CLI tool. Please follow the installation instructions for your platform.
bash
pip install pipx
pipx install pdf-microarray
Usage
bash
mkdir processed
pdf-microarray process -i documents -o processed
pdf-microarray analyze -i processed -w words.txt -o data.csv
pdf-microarray plot -i data.csv -o plot.png
Put each word in words.txt on its own line. If a line contains multiple words, it counts as an occurrence only when all of those words occur.
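The words.txt convention can be sketched as follows. This is one reading of the rule, under two assumptions that the README does not confirm: that words on a line are separated by whitespace, and that "all words occur" is checked against the whole text of one processed document. The helpers `parse_word_groups` and `group_occurs` are hypothetical, and exact matching is used here for brevity instead of the Levenshtein-based matching the tool applies.

```python
def parse_word_groups(content: str) -> list[list[str]]:
    """Turn the contents of a words file into one word group per non-empty line."""
    groups = []
    for line in content.splitlines():
        words = line.split()
        if words:  # skip blank lines
            groups.append(words)
    return groups


def group_occurs(text: str, group: list[str]) -> bool:
    """True only if every word of the group appears in text (case-insensitive)."""
    tokens = set(text.lower().split())
    return all(word.lower() in tokens for word in group)


groups = parse_word_groups("gene\nprotein folding\n")
document = "Protein folding is studied here"
for group in groups:
    print(group, group_occurs(document, group))
```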
Example

Technical details
The library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.
The library uses PyMuPDF for PDF page rendering, pytesseract for OCR, and thefuzz to calculate the Levenshtein distance.
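The segmentation and multithreading idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: `split_into_segments`, `extract_segment`, `extract_document`, the segment size, and the worker count are all hypothetical, and `extract_segment` is a placeholder where real text extraction and OCR would run.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(page_count: int, segment_size: int) -> list[range]:
    """Split page indices 0..page_count-1 into consecutive segments."""
    return [
        range(start, min(start + segment_size, page_count))
        for start in range(0, page_count, segment_size)
    ]


def extract_segment(pages: range) -> list[str]:
    """Placeholder: a real worker would extract and OCR each page here."""
    return [f"text of page {n}" for n in pages]


def extract_document(page_count: int, segment_size: int = 50, workers: int = 4) -> list[str]:
    """Process all segments in a thread pool and return page texts in order."""
    segments = split_into_segments(page_count, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so page order is preserved
        results = pool.map(extract_segment, segments)
        return [text for segment in results for text in segment]


print(split_into_segments(5, 2))
print(extract_document(5, segment_size=2, workers=2))
```

Threads are a plausible fit when most of the time is spent waiting on an external OCR process; whether the project uses threads in exactly this way is not stated in the README.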
Contributing
Contributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.
License
Distributed under the GNU General Public License v3.0. See LICENSE for more information.
Owner
- Name: Björn Tropf
- Login: bjorntropf
- Kind: user
- Repositories: 1
- Profile: https://github.com/bjorntropf
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: PDFMicroarray
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Björn
    family-names: Tropf
  - given-names: Jan
    family-names: Tropf
repository-code: "https://github.com/bjorntropf/PDFMicroarray"
abstract: >-
  A Python CLI tool designed to assist with literature
  research by visualizing the occurrence of words in PDF
  documents with a microarray format.
keywords:
  - literature
  - research
  - pdf
  - OCR
  - microarray
license: GPL-3.0-or-later
commit: 4b6b75d
version: 1.0.0
date-released: "2024-05-01"
Packages
- Total packages: 1
- Total downloads: 17 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: pdf-microarray
A Python CLI tool designed to assist with literature research by visualizing the occurrence of words in PDF documents with a microarray format.
- Homepage: https://github.com/bjorntropf/PDFMicroarray
- Documentation: https://pdf-microarray.readthedocs.io/
- License: GPL-3.0-or-later
- Latest release: 1.0.1 (published almost 2 years ago)
Maintainers (1)
Dependencies
- black 24.4.2
- click 8.1.7
- colorama 0.4.6
- contourpy 1.2.1
- cycler 0.12.1
- exceptiongroup 1.2.1
- flake8 7.0.0
- fonttools 4.51.0
- importlib-resources 6.4.0
- iniconfig 2.0.0
- isort 5.13.2
- kiwisolver 1.4.5
- matplotlib 3.8.4
- mccabe 0.7.0
- mypy-extensions 1.0.0
- numpy 1.26.4
- packaging 24.0
- pandas 2.2.2
- pathspec 0.12.1
- pillow 10.3.0
- platformdirs 4.2.1
- pluggy 1.5.0
- pycodestyle 2.11.1
- pyflakes 3.2.0
- pymupdf 1.24.2
- pymupdfb 1.24.1
- pyparsing 3.1.2
- pytesseract 0.3.10
- pytest 8.2.0
- python-dateutil 2.9.0.post0
- pytz 2024.1
- rapidfuzz 3.8.1
- seaborn 0.13.2
- six 1.16.0
- thefuzz 0.22.1
- tomli 2.0.1
- typing-extensions 4.11.0
- tzdata 2024.1
- zipp 3.18.1
- black ^24.4.0 develop
- flake8 ^7.0.0 develop
- isort ^5.9.3 develop
- pytest ^8.1.1 develop
- PyMuPDF ^1.24.2
- click ^8.1.7
- matplotlib ^3.8.4
- pandas ^2.2.2
- pillow ^10.3.0
- pytesseract ^0.3.10
- python ^3.9
- seaborn ^0.13.2
- thefuzz ^0.22.1