Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bjorntropf
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 216 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 2
  • Releases: 1
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

PDFMicroarray

Overview

PDFMicroarray is a Python CLI tool designed to assist with literature research for scientific papers and books.

It extracts text from multiple sources within PDF documents, including:

  • Plain text
  • Text from images (through OCR)
  • Text from embedded diagrams (through page rendering and OCR)

and stores the extracted text in a designated output directory.

The processed text can then be analyzed using the Levenshtein distance to detect occurrences of specified words. These occurrences are visualized in a microarray format.
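To make the fuzzy matching step concrete, here is a minimal, standard-library-only sketch of counting word occurrences by Levenshtein distance. It is not the tool's implementation (the README states that thefuzz is used for the distance calculation); the function names `levenshtein` and `count_fuzzy_occurrences` and the `max_distance` threshold are assumptions made for illustration.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def count_fuzzy_occurrences(text: str, word: str, max_distance: int = 1) -> int:
    """Count tokens within max_distance edits of word (case-insensitive).

    max_distance is a hypothetical tolerance, useful for OCR errors
    such as '0' read in place of 'o'.
    """
    target = word.lower()
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if levenshtein(token, target) <= max_distance)
```

With a tolerance of one edit, an OCR misreading such as `micr0array` would still count as an occurrence of `microarray`; with `max_distance=0` only exact tokens match.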

Installation

Tesseract is required for this CLI tool. Please follow the installation instructions for your platform.

```bash
pip install pipx
pipx install pdf-microarray
```

Usage

```bash
mkdir processed
pdf-microarray process -i documents -o processed
pdf-microarray analyze -i processed -w words.txt -o data.csv
pdf-microarray plot -i data.csv -o plot.png
```

The words in words.txt should be separated by newlines. If a line contains multiple words, it is counted only when all of its words occur.
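One way to read the words.txt rule is sketched below: each non-empty line becomes a group of words, and a group counts only if every word in it is found. This is an interpretation of the sentence above, not the tool's code. The helper names `parse_word_groups` and `group_occurs` are hypothetical, and exact token matching stands in for the fuzzy matching the tool performs.

```python
import re


def parse_word_groups(words_file_text: str) -> list[list[str]]:
    """Turn the contents of a words file into one word group per non-empty line."""
    groups = []
    for line in words_file_text.splitlines():
        words = line.split()
        if words:  # skip blank lines
            groups.append([word.lower() for word in words])
    return groups


def group_occurs(text: str, group: list[str]) -> bool:
    """Return True only if every word of the group appears in text.

    Assumption: exact, case-insensitive token matching; the real tool
    compares words by Levenshtein distance instead.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(word in tokens for word in group)
```

For example, a line `protein folding` would match a document that mentions both "protein" and "folding", but not one that mentions only "protein".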

Example

[Image: Example]

Technical details

The library uses document segmentation and multithreading to speed up the extraction process, so that even large books in PDF form can be parsed within a few minutes.
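The general idea of segmenting a document and processing the segments in parallel can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the library's code: `split_into_segments`, `extract_segment`, and `extract_document` are hypothetical names, the segment size and worker count are arbitrary, and `extract_segment` is a placeholder that returns dummy strings where real code would render and OCR each page.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_segments(page_count: int, segment_size: int) -> list[range]:
    """Split page indices 0..page_count-1 into consecutive ranges."""
    return [
        range(start, min(start + segment_size, page_count))
        for start in range(0, page_count, segment_size)
    ]


def extract_segment(pages: range) -> list[str]:
    """Placeholder for per-page extraction (hypothetical).

    Real code would extract plain text and run OCR for each page here.
    """
    return [f"text of page {number}" for number in pages]


def extract_document(page_count: int, segment_size: int = 10,
                     max_workers: int = 4) -> list[str]:
    """Extract all pages by handing each segment to a worker thread."""
    segments = split_into_segments(page_count, segment_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in segment order, regardless of completion order
        results = list(pool.map(extract_segment, segments))
    return [text for segment in results for text in segment]
```

Because `pool.map` preserves input order, the page texts come back in document order even when segments finish at different times. How much threads help in practice depends on how much of the per-page work runs outside the Python interpreter, for example in an external OCR process.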

The library uses PyMuPDF for PDF page rendering, pytesseract for OCR, and thefuzz to calculate the Levenshtein distance.

Contributing

Contributions to PDFMicroarray are welcome! Please feel free to fork the repository, make changes, and submit pull requests. For major changes, please open an issue first to discuss what you would like to change.

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.

Owner

  • Name: Björn Tropf
  • Login: bjorntropf
  • Kind: user

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: PDFMicroarray
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Björn
    family-names: Tropf
  - given-names: Jan
    family-names: Tropf
repository-code: "https://github.com/bjorntropf/PDFMicroarray"
abstract: >-
  A Python CLI tool designed to assist with literature
  research by visualizing the occurrence of words in PDF
  documents with a microarray format.
keywords:
  - literature
  - research
  - pdf
  - OCR
  - microarray
license: GPL-3.0-or-later
commit: 4b6b75d
version: 1.0.0
date-released: "2024-05-01"

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 17 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
pypi.org: pdf-microarray

A Python CLI tool designed to assist with literature research by visualizing the occurrence of words in PDF documents with a microarray format.

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 17 Last month
Rankings
Dependent packages count: 9.5%
Average: 35.9%
Dependent repos count: 62.4%
Maintainers (1)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • black 24.4.2
  • click 8.1.7
  • colorama 0.4.6
  • contourpy 1.2.1
  • cycler 0.12.1
  • exceptiongroup 1.2.1
  • flake8 7.0.0
  • fonttools 4.51.0
  • importlib-resources 6.4.0
  • iniconfig 2.0.0
  • isort 5.13.2
  • kiwisolver 1.4.5
  • matplotlib 3.8.4
  • mccabe 0.7.0
  • mypy-extensions 1.0.0
  • numpy 1.26.4
  • packaging 24.0
  • pandas 2.2.2
  • pathspec 0.12.1
  • pillow 10.3.0
  • platformdirs 4.2.1
  • pluggy 1.5.0
  • pycodestyle 2.11.1
  • pyflakes 3.2.0
  • pymupdf 1.24.2
  • pymupdfb 1.24.1
  • pyparsing 3.1.2
  • pytesseract 0.3.10
  • pytest 8.2.0
  • python-dateutil 2.9.0.post0
  • pytz 2024.1
  • rapidfuzz 3.8.1
  • seaborn 0.13.2
  • six 1.16.0
  • thefuzz 0.22.1
  • tomli 2.0.1
  • typing-extensions 4.11.0
  • tzdata 2024.1
  • zipp 3.18.1
pyproject.toml pypi
  • black ^24.4.0 develop
  • flake8 ^7.0.0 develop
  • isort ^5.9.3 develop
  • pytest ^8.1.1 develop
  • PyMuPDF ^1.24.2
  • click ^8.1.7
  • matplotlib ^3.8.4
  • pandas ^2.2.2
  • pillow ^10.3.0
  • pytesseract ^0.3.10
  • python ^3.9
  • seaborn ^0.13.2
  • thefuzz ^0.22.1