zbinfigs

Tools to classify scholarly articles based on extracted figures

https://github.com/spackman/zbinfigs

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.8%) to scientific vocabulary

Last synced: 9 months ago

Repository

Tools to classify scholarly articles based on extracted figures

Basic Info

Host: GitHub
Owner: spackman
License: mit
Language: Jupyter Notebook
Default Branch: master
Homepage: https://zbinfigs.readthedocs.io/en/latest/
Size: 30.5 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 3

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

Zbinfigs


For the USRSE25 Conference Submission, please see the tutorial notebook here.

Version : 0.1.7

Overview

zbinfigs is a Python package that facilitates classification of scholarly articles stored in Zotero collections. Its input is a .csv file exported from a Zotero collection. The package supports four steps:
1. Organize and compile .pdfs
2. Extract images to a summary .pdf
3. Read annotations
4. Train and test a classifier

Organize and Compile

A raw Zotero collection may not have .pdf files for all records, and some of the .pdf files it does have may not be accessible. In the organize and compile step, the program reads the default .csv export describing a Zotero collection and checks every file link. Records with no available file are flagged in a summary document. If only an .html file is available, the program strips the CSS from it and converts it to a .pdf file. At the end of execution, all available .pdf files are written to a common folder, which is useful for transferring the data to an HPC system for further analysis.
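
The link check for a single record can be sketched as follows. This is a minimal illustration rather than the package's implementation: the helper name `attachment_status`, the three status labels, and the assumption that a record's attachment paths arrive as one semicolon-separated string (as in the `File Attachments` column of a Zotero .csv export) are all assumptions made for this example.

```python
from pathlib import Path


def attachment_status(attachments: str) -> str:
    """Classify one record by the attachment files that actually exist on disk.

    `attachments` is assumed to be a semicolon-separated list of file paths.
    Returns "pdf", "html_only", or "missing" (labels chosen for this sketch).
    """
    paths = [Path(p.strip()) for p in attachments.split(";") if p.strip()]
    # keep only links that resolve to a real file
    existing = [p for p in paths if p.is_file()]
    suffixes = {p.suffix.lower() for p in existing}
    if ".pdf" in suffixes:
        return "pdf"
    if suffixes & {".html", ".htm"}:
        return "html_only"  # candidate for CSS stripping and .pdf conversion
    return "missing"  # to be flagged in the summary document
```

A record that resolves to `"pdf"` would be copied to the common folder, while `"html_only"` and `"missing"` records would be converted or flagged, respectively.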

Extract Images

From a folder of .pdf files and the original exported Zotero .csv file, the program uses the marker package to convert each .pdf to markdown and extract its figure images. The figure images for the entire directory are stitched into a single .pdf file for review and annotation. The converted text and original image files are retained for further semantic analysis. This step may be split into chunks for parallel processing on an HPC system.
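
One way to split the work into chunks is to give each job a contiguous range of file indices. The sketch below is only an illustration: `chunk_ranges` is a hypothetical helper, not part of zbinfigs, and it assumes half-open ranges (the stop index is excluded), which may differ from the convention the package uses.

```python
def chunk_ranges(n_files: int, n_chunks: int) -> list[tuple[int, int]]:
    """Split indices 0..n_files-1 into up to n_chunks contiguous (start, stop) ranges.

    Ranges are half-open (stop is exclusive) and differ in size by at most one.
    """
    if n_files < 0 or n_chunks < 1:
        raise ValueError("n_files must be >= 0 and n_chunks must be >= 1")
    base, extra = divmod(n_files, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        # the first `extra` chunks take one additional file
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more chunks than files: stop once the files run out
        ranges.append((start, start + size))
        start += size
    return ranges


print(chunk_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each `(start, stop)` pair could then be passed to a separate job as its minimum and maximum file index.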

Read Annotations

The program takes an annotations input file that lists the .pdf page number of every figure that was marked as a match during manual annotation. These annotations are then joined to the original Zotero collection .csv file, and summary statistics are written.
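
Connecting annotated pages back to records amounts to a lookup against per-record page ranges. The sketch below assumes a hypothetical layout in which each record owns an inclusive, non-overlapping range of pages; the function name `pages_to_records` and the tuple format are illustrative and not taken from the package.

```python
from bisect import bisect_right


def pages_to_records(page_ranges, annotated_pages):
    """Map annotated pages to the records whose figures appear on them.

    page_ranges: list of (record_key, first_page, last_page), inclusive and
                 non-overlapping (hypothetical layout for this sketch).
    annotated_pages: iterable of page numbers flagged during annotation.
    Returns {record_key: [pages...]} for records with at least one match.
    """
    ordered = sorted(page_ranges, key=lambda r: r[1])
    starts = [first for _, first, _ in ordered]
    matches = {}
    for page in sorted(set(annotated_pages)):
        # find the last range that starts at or before this page
        i = bisect_right(starts, page) - 1
        if i < 0:
            continue  # page precedes every record's range
        key, first, last = ordered[i]
        if first <= page <= last:
            matches.setdefault(key, []).append(page)
    return matches
```

Pages that fall in a gap between ranges are silently skipped here; a real workflow might prefer to log them as annotation errors.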

Train and Test Classifier

From the annotated dataset, several classifiers are trained and tested, and a summary of their performance is written.
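
As an illustration of what a performance summary could contain, the sketch below counts outcomes for a binary match / no-match classifier. The function name `binary_summary` and the returned keys are assumptions for this example; scikit-learn's `sklearn.metrics` module offers equivalent tools such as `confusion_matrix`.

```python
def binary_summary(y_true, y_pred):
    """Count outcomes for a binary classifier and derive precision and recall.

    y_true, y_pred: equal-length sequences of truthy/falsy labels.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}
```

Reporting false positives separately matters here because each one is a figure a reviewer would have to reject by hand.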

Documentation

The documentation for this project is hosted on Read the Docs. You can find the latest version of the documentation there, including installation instructions, usage guides, and other important information.

For more details, visit:
Zbinfigs Documentation

Examples

This is a sketch of the planned usage and will probably change.

File Preparation

```python
import zbinfigs as zbf

# read the collection into a collection object
collection = zbf.read_collection("mycollection.csv")

# export the collection .pdfs to a common folder
collection.exportpdfstofolder(foldername="mypdfs")
```

Figure Extraction
This is the slowest step and should be parallelized if possible.

```python
import sys

import zbinfigs as zbf

# get the minimum and maximum file indices to process
# (sys.argv[0] is the script name, so the arguments start at index 1)
rmin, rmax = int(sys.argv[1]), int(sys.argv[2])

# read the pdfs in the range selected and convert to markdown + extract images
# the result is a folder for each .pdf
zbf.processpdffolder(folder_name="mypdfs", range=(rmin, rmax))
```

Figure Compilation

```python
import sys

import zbinfigs as zbf

# get the minimum and maximum file indices to process
# (sys.argv[0] is the script name, so the arguments start at index 1)
rmin, rmax = int(sys.argv[1]), int(sys.argv[2])

# locate the image files for the .pdfs in the range selected and merge into a single .pdf
# export the page ranges for each record to a .csv
zbf.gatherfigures(foldername="mypdfs", range=(rmin, rmax))

# merge pdfs into a single document and make an updated page range .csv file
# this knows to look for the .csv page number metadata
zbf.mergesummarypdfs(folder_name="mypdfs")
```

Read Annotations

```python
import zbinfigs as zbf

# locate the raw images based on their annotations and sort into folders accordingly
# this expects a summary pdf .csv of page range data to match with
zbf.sortannotated(foldername="mypdfs", annotations="myannotations.csv")

# add annotations to the main .csv
zbf.add_annotations(collection="mycollection.csv", annotations="myannotations.csv", outfile="myannotatedcollection.csv")
```

Analysis

```python
import zbinfigs as zbf

collection = zbf.read_collection("myannotatedcollection.csv")

# plot the total number of records as a bar chart by year and as a pie chart
# sections for .html files and .pdf files available
collection.plotpdfsavailable(type="bar", x="year")
collection.plotpdfsavailable(type="pie")

# plot the total number of records analyzed as a bar chart by year and as a pie chart
# sections for each annotation condition
collection.plotannotations(type="bar", x="year")
collection.plotannotations(type="pie")

# make a plot of the annotation type v. the document processed
collection.plotannotations(type="bar", x="pdfhtml")
```

Train and Test Classifier
This shows an example of using several different image classification strategies implemented in sklearn on the resulting dataset. Uniquely, this system can also apply OCR to an image and use the result to aid classification.

```python
# implement some stuff here

# if the OCR turns up a key term (like axis titles or units that are conserved by figure type)
# skip the image classification and directly assign to a group

# make useful summary plots that classify performance, false positives, etc.
```
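
The OCR shortcut outlined in the comments above might look roughly like the sketch below. Nothing here comes from the package: `classify_with_ocr_shortcut`, the `key_terms` lookup table, and the example terms are hypothetical, the OCR text is assumed to have been extracted already by a separate tool, and the image classifier is represented by a plain callable.

```python
def classify_with_ocr_shortcut(ocr_text, key_terms, fallback):
    """Assign a group from OCR key terms, else defer to an image classifier.

    ocr_text: text already extracted from the figure by an OCR tool.
    key_terms: {group_name: [terms...]}, a hypothetical lookup table.
    fallback: zero-argument callable that runs the image classifier.
    Returns (group, source) where source is "ocr" or "classifier".
    """
    text = ocr_text.lower()
    for group, terms in key_terms.items():
        # a single key term is enough to skip image classification
        if any(term.lower() in text for term in terms):
            return group, "ocr"
    return fallback(), "classifier"


# hypothetical key terms, for illustration only
terms = {"spectrum": ["wavenumber"], "chromatogram": ["retention time"]}
print(classify_with_ocr_shortcut("Retention Time (min)", terms, lambda: "other"))
```

Returning the source alongside the group makes it possible to report how many figures were assigned by OCR versus by the image classifier.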

Future Plans

An example semantic analysis / clustering would be nice.
This would use the markdown text generated during figure extraction to extract the region the paper is from, author information, or any number of other characteristics (paper length?). It may also check for linguistic similarities among the papers that match the annotation conditions, and for potential differences from the remaining papers, as an alternative classification approach.
- Unit tests for all functions
- Code linting
- Continuous integration for package deployment
- More documentation on annotation examples
- Tutorial script
- Prepare for RSE presentation?
- Parallelization of pdf processing with mpi4py (platform agnostic)
- Robust try/except error handling with logging information

Owner

Name: Isaac Spackman
Login: spackman
Kind: user
Location: Golden, CO
Company: Colorado School of Mines

Repositories: 1
Profile: https://github.com/spackman

Applied Chemistry PhD Student | Colorado School of Mines

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Spackman
    given-names: Isaac
    orcid: https://orcid.org/0000-0003-4080-0144
title: "Zbinfigs: An image parsing utility for Zotero collections."
version: 0.1.0
identifiers:
  - type: doi
    value: 
date-released: 2024-12-20

GitHub Events

Total

Create event: 3
Issues event: 1
Release event: 2
Push event: 38

Last Year

Create event: 3
Issues event: 1
Release event: 2
Push event: 38

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

spackman (1)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 20 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 2
Total maintainers: 1

pypi.org: zbinfigs

A package for figure classification from a zotero collection of scholarly articles

Documentation: https://zbinfigs.readthedocs.io/
License: MIT License
Latest release: 0.1.7
published over 1 year ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 20 Last month

Rankings

Dependent packages count: 9.9%

Average: 32.7%

Dependent repos count: 55.5%

Maintainers (1)

ispackman

Last synced: 10 months ago
