pdf-microarray

https://github.com/bjorntropf/pdfmicroarray

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to: zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.7%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: bjorntropf
License: gpl-3.0
Language: Python
Default Branch: main
Size: 216 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 2
Releases: 1

Created about 2 years ago · Last pushed almost 2 years ago

Metadata Files


PDFMicroarray

Overview

PDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.

It extracts text from multiple sources within PDF documents, including:

Plain text
Text from images (through OCR)
Text from embedded diagrams (through page rendering and OCR)

and stores the extracted text in a designated output directory.

The processed text can then be analyzed using the Levenshtein distance to detect occurrences of specified words. These occurrences can be plotted in a microarray format.
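As an illustration of why an edit-distance match helps with OCR output, here is a minimal sketch of fuzzy word counting. It is not the tool's actual code: the README says the tool relies on thefuzz, whereas this sketch uses a small pure-Python Levenshtein function, and the names `levenshtein`, `count_fuzzy_occurrences`, the tokenization, and the `max_distance` threshold are all assumptions made for the example.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def count_fuzzy_occurrences(text: str, word: str, max_distance: int = 1) -> int:
    """Count tokens within max_distance edits of word (hypothetical threshold)."""
    tokens = re.findall(r"\w+", text.lower())
    target = word.lower()
    return sum(1 for token in tokens if levenshtein(token, target) <= max_distance)


# "microarrav" imitates an OCR misread of "microarray".
sample = "The microarrav plot shows each microarray cell; a Microarrays overview follows."
print(count_fuzzy_occurrences(sample, "microarray"))
```

With a threshold of one edit, the misread token and the plural form are counted alongside the exact match, which an exact string search would miss.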

Installation

Tesseract is required for this CLI tool. Please follow the installation instructions for your platform.

    pip install pipx
    pipx install pdf-microarray

Usage

    mkdir processed
    pdf-microarray process -i documents -o processed
    pdf-microarray analyze -i processed -w words.txt -o data.csv
    pdf-microarray plot -i data.csv -o plot.png

The words in words.txt should be separated by newlines. If a line contains multiple words, a match is counted only when all of the words on that line occur.
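The word-list format can be sketched as follows. This is an assumption-laden illustration rather than the tool's implementation: the helper names `parse_word_groups` and `group_occurs` are hypothetical, words on a line are assumed to be separated by whitespace, exact matching stands in for the Levenshtein comparison, and the README does not say over which unit of text (page, segment, or document) the words must occur together.

```python
import re


def parse_word_groups(words_file_text: str) -> list[list[str]]:
    """Turn the contents of a words file into one word group per non-empty line."""
    groups = []
    for line in words_file_text.splitlines():
        words = line.lower().split()  # assumed separator: whitespace
        if words:
            groups.append(words)
    return groups


def group_occurs(text: str, group: list[str]) -> bool:
    """Return True only if every word of the group appears in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(word in tokens for word in group)


words_txt = "microarray\ngene expression\n\nOCR\n"
groups = parse_word_groups(words_txt)

page = "Gene chips measure expression levels."
for group in groups:
    print(group, group_occurs(page, group))
```

Here the two-word line `gene expression` matches the sample text because both words are present, while the single-word lines do not.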

Example

Technical details

The library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.
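The general pattern of segmenting a document and processing the segments in parallel might look like the sketch below. It is not taken from the project: the function names, the segment size, the worker count, and the placeholder `extract_segment` (which only fabricates a string per page instead of reading a PDF) are all assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(page_count: int, segment_size: int) -> list[range]:
    """Split page indices 0..page_count-1 into consecutive segments."""
    return [
        range(start, min(start + segment_size, page_count))
        for start in range(0, page_count, segment_size)
    ]


def extract_segment(pages: range) -> str:
    """Placeholder for the real per-segment work (text extraction, rendering, OCR)."""
    return "\n".join(f"text of page {n}" for n in pages)


def extract_document(page_count: int, segment_size: int = 50, workers: int = 4) -> str:
    """Process all segments on a thread pool and join the results in page order."""
    segments = split_into_segments(page_count, segment_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the segments were submitted,
        # so the output stays in page order even if segments finish out of order.
        parts = list(pool.map(extract_segment, segments))
    return "\n".join(parts)


print(split_into_segments(120, 50))
print(extract_document(5, segment_size=2))
```

Threads are a plausible fit for this kind of workload because pytesseract runs the Tesseract binary as an external process, so the Python thread mostly waits while OCR runs.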

The library uses PyMuPDF for PDF page rendering, pytesseract for OCR, and thefuzz to calculate the Levenshtein distance.

Contributing

Contributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.

Owner

Name: Björn Tropf
Login: bjorntropf
Kind: user

Repositories: 1
Profile: https://github.com/bjorntropf

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: PDFMicroarray
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Björn
    family-names: Tropf
  - given-names: Jan
    family-names: Tropf
repository-code: "https://github.com/bjorntropf/PDFMicroarray"
abstract: >-
  A Python CLI tool designed to assist with literature
  research by visualizing the occurrence of words in PDF
  documents with a microarray format.
keywords:
  - literature
  - research
  - pdf
  - OCR
  - microarray
license: GPL-3.0-or-later
commit: 4b6b75d
version: 1.0.0
date-released: "2024-05-01"

Packages

Total packages: 1
Total downloads:
- pypi 17 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 1

pypi.org: pdf-microarray

A Python CLI tool designed to assist with literature research by visualizing the occurrence of words in PDF documents with a microarray format.

Homepage: https://github.com/bjorntropf/PDFMicroarray
Documentation: https://pdf-microarray.readthedocs.io/
License: GPL-3.0-or-later
Latest release: 1.0.1
published about 2 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 17 Last month

Rankings

Dependent packages count: 9.5%

Average: 35.9%

Dependent repos count: 62.4%

Maintainers (1)

asym

Last synced: 10 months ago

Dependencies

poetry.lock pypi

black 24.4.2
click 8.1.7
colorama 0.4.6
contourpy 1.2.1
cycler 0.12.1
exceptiongroup 1.2.1
flake8 7.0.0
fonttools 4.51.0
importlib-resources 6.4.0
iniconfig 2.0.0
isort 5.13.2
kiwisolver 1.4.5
matplotlib 3.8.4
mccabe 0.7.0
mypy-extensions 1.0.0
numpy 1.26.4
packaging 24.0
pandas 2.2.2
pathspec 0.12.1
pillow 10.3.0
platformdirs 4.2.1
pluggy 1.5.0
pycodestyle 2.11.1
pyflakes 3.2.0
pymupdf 1.24.2
pymupdfb 1.24.1
pyparsing 3.1.2
pytesseract 0.3.10
pytest 8.2.0
python-dateutil 2.9.0.post0
pytz 2024.1
rapidfuzz 3.8.1
seaborn 0.13.2
six 1.16.0
thefuzz 0.22.1
tomli 2.0.1
typing-extensions 4.11.0
tzdata 2024.1
zipp 3.18.1

pyproject.toml pypi

black ^24.4.0 develop
flake8 ^7.0.0 develop
isort ^5.9.3 develop
pytest ^8.1.1 develop
PyMuPDF ^1.24.2
click ^8.1.7
matplotlib ^3.8.4
pandas ^2.2.2
pillow ^10.3.0
pytesseract ^0.3.10
python ^3.9
seaborn ^0.13.2
thefuzz ^0.22.1
