zbinfigs
Tools to classify scholarly articles based on extracted figures
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.8%) to scientific vocabulary
Repository
Tools to classify scholarly articles based on extracted figures
Basic Info
- Host: GitHub
- Owner: spackman
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://zbinfigs.readthedocs.io/en/latest/
- Size: 30.5 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 3
Metadata Files
README.md
Zbinfigs
For the USRSE25 Conference Submission, please see the tutorial notebook here.
Version : 0.1.7
Overview
zbinfigs is a Python package designed to facilitate classification of scholarly articles stored in Zotero collections.
The input is a .csv file exported from a Zotero collection. The package supports four operations:
1. Organize and compile .pdfs
2. Extract images to summary .pdf
3. Read annotations
4. Train and test classifier
Organize and Compile
A raw Zotero collection may not have .pdf files for all records, and the .pdf files that do exist may not all be accessible. In the organize and compile option, the program reads a default .csv describing a Zotero collection and checks every file link. Records with no available files are flagged in a summary document. If only .html files are available, the program strips the CSS from the file and converts it to a .pdf file. At the end of execution, all available .pdf files are written to a common folder. This can be useful for transferring data to an HPC system for further analysis.
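The file-link check described above can be sketched with the standard library alone. This is a minimal illustration, not the zbinfigs implementation: the function names are hypothetical, and it assumes the exported .csv stores attachment paths in a `File Attachments` column separated by semicolons, with the record title in a `Title` column.

```python
import csv
from pathlib import Path


def attachment_status(file_field: str) -> str:
    """Classify one record from its attachment field (hypothetical helper).

    Returns "pdf" if any listed .pdf exists on disk, "html_only" if only
    .html/.htm files exist, and "missing" if nothing listed can be found.
    """
    paths = [Path(p.strip()) for p in file_field.split(";") if p.strip()]
    suffixes = {p.suffix.lower() for p in paths if p.is_file()}
    if ".pdf" in suffixes:
        return "pdf"
    if suffixes & {".html", ".htm"}:
        return "html_only"
    return "missing"


def summarize_collection(csv_path, file_column="File Attachments"):
    """Return a list of (title, status) pairs, one per row of the export.

    The column names are assumptions about the export layout.
    """
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        return [
            (row.get("Title") or "", attachment_status(row.get(file_column) or ""))
            for row in csv.DictReader(handle)
        ]
```

Records reported as `html_only` are the ones that would need the HTML-to-.pdf conversion, and `missing` records are the ones to flag in the summary document.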
Extract Images
Given a folder of .pdf files and the original exported Zotero .csv file, the program uses the marker package to convert each .pdf to markdown and extract its figure images. The figure images for the entire directory are stitched into a single .pdf file for review and annotation. The converted text and original image files are retained for further semantic analysis. This step may be split into chunks for parallel processing on an HPC system.
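One way to split the work into chunks is to divide the file indices into contiguous ranges, one per job. The sketch below is an assumption about how such ranges could be computed; `chunk_ranges` is a hypothetical helper, not part of zbinfigs, and it uses half-open `(start, stop)` ranges.

```python
def chunk_ranges(n_files: int, n_chunks: int) -> list[tuple[int, int]]:
    """Split indices 0..n_files-1 into at most n_chunks contiguous ranges.

    Each range is (start, stop) with stop exclusive. Chunk sizes differ by
    at most one, and empty chunks are dropped.
    """
    if n_files < 0 or n_chunks < 1:
        raise ValueError("n_files must be >= 0 and n_chunks must be >= 1")
    base, extra = divmod(n_files, n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional file each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each resulting pair could then be passed to one job, for example as the two command-line arguments of a per-chunk script.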
Read Annotations
The program takes an annotations input file that lists the .pdf page number of every figure marked as a match during manual annotation. These annotations are then joined with the original Zotero collection .csv file, and summary statistics are written.
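Connecting an annotated page number back to a record amounts to a lookup in a table of page ranges. The following is a minimal sketch under stated assumptions: each record occupies an inclusive, non-overlapping `(first_page, last_page)` range, and the function names and record keys are hypothetical.

```python
from bisect import bisect_right


def build_page_index(page_ranges):
    """Sort records by first page.

    page_ranges maps a record key to an inclusive (first_page, last_page)
    tuple. Returns the sorted items and the list of first pages.
    """
    ordered = sorted(page_ranges.items(), key=lambda item: item[1][0])
    starts = [first for _, (first, _) in ordered]
    return ordered, starts


def record_for_page(page, ordered, starts):
    """Return the record key whose range contains `page`, or None."""
    pos = bisect_right(starts, page) - 1
    if pos < 0:
        return None
    key, (first, last) = ordered[pos]
    return key if first <= page <= last else None


def matched_records(annotated_pages, page_ranges):
    """Return the set of record keys hit by at least one annotated page."""
    ordered, starts = build_page_index(page_ranges)
    hits = (record_for_page(p, ordered, starts) for p in annotated_pages)
    return {key for key in hits if key is not None}
```

The binary search keeps each lookup logarithmic in the number of records, which matters little for small collections but costs nothing to include.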
Train and Test Classifier
Several classifiers are trained and tested on the annotated dataset, and a summary of their performance is written.
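As an illustration of what a performance summary might contain, the sketch below counts true and false positives and negatives for a binary match / no-match labelling and derives precision and recall. It is a generic example in plain Python, not the metrics zbinfigs actually reports; `binary_summary` is a hypothetical name.

```python
from collections import Counter


def binary_summary(y_true, y_pred, positive):
    """Summarize binary predictions against ground-truth annotations."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "tn": tn,
        # Guard against division by zero when a class is never predicted
        # or never present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```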
Documentation
The documentation for this project is hosted on Read the Docs. You can find the latest version of the documentation there, including installation instructions, usage guides, and other important information.
For more details, visit:
Zbinfigs Documentation
Examples
This is a sketch of the planned usage and will probably change.
**File Preparation**
```python
import zbinfigs as zbf

# read the collection into a collection object
collection = zbf.read_collection("mycollection.csv")

# export the collection .pdfs to a common folder
collection.exportpdfstofolder(foldername="mypdfs")
```
**Figure Extraction**
This is the slowest step and should be parallelized if possible.
```python
import sys

import zbinfigs as zbf

# get the minimum and maximum file indices to process from the command line
# (sys.argv[0] is the script name, so the arguments start at index 1)
rmin, rmax = int(sys.argv[1]), int(sys.argv[2])

# read the pdfs in the range selected and convert to markdown + extract images
# the result is a folder for each .pdf
zbf.processpdffolder(folder_name="mypdfs", range=(rmin, rmax))
```
**Figure Compilation**
```python
import sys

import zbinfigs as zbf

# get the minimum and maximum file indices to process from the command line
# (sys.argv[0] is the script name, so the arguments start at index 1)
rmin, rmax = int(sys.argv[1]), int(sys.argv[2])

# locate the image files for the .pdfs in the range selected and merge into a single .pdf
# export the page ranges for each record to a .csv
zbf.gatherfigures(foldername="mypdfs", range=(rmin, rmax))

# merge pdfs into a single document and make an updated page range .csv file
# this knows to look for the .csv page number metadata
zbf.mergesummarypdfs(folder_name="mypdfs")
```
**Read Annotations**
```python
import zbinfigs as zbf

# locate the raw images based on their annotations and sort into folders accordingly
# this expects a summary pdf .csv of page range data to match with
zbf.sortannotated(foldername="mypdfs", annotations="myannotations.csv")

# add annotations to the main .csv
zbf.add_annotations(
    collection="mycollection.csv",
    annotations="myannotations.csv",
    outfile="myannotatedcollection.csv",
)
```
**Analysis**
```python
import zbinfigs as zbf

collection = zbf.read_collection("myannotatedcollection.csv")

# make a bar chart by year and a pie chart of the total number of records
# sections for .html files and .pdf files available
collection.plotpdfsavailable(type="bar", x="year")
collection.plotpdfsavailable(type="pie")

# make a bar chart by year and a pie chart of the total number of records analyzed
# sections for each annotation condition
collection.plotannotations(type="bar", x="year")
collection.plotannotations(type="pie")

# make a plot of the annotation type v. the document processed
collection.plotannotations(type="bar", x="pdfhtml")
```
Train and Test Classifier
This shows an example of using several different image classification strategies implemented in sklearn on the resulting dataset. Uniquely, this system can also apply OCR to an image and use the result to aid classification.
```python
# TODO: implement some stuff here
# if the OCR turns up a key term (like axis titles or units that are conserved by figure type),
# skip the image classification and directly assign to a group
# make useful summary plots that show classifier performance, false positives, etc.
```
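The OCR shortcut outlined in the placeholder above could look roughly like the following. This is only a sketch of the idea, not the planned implementation: `classify_figure`, the keyword table, and the fallback callable are all hypothetical, and the OCR text is assumed to have been produced already by some other tool.

```python
def classify_figure(ocr_text, keyword_groups, fallback):
    """Assign a figure to a group, preferring an OCR keyword match.

    keyword_groups maps a group name to a list of key terms (for example
    axis titles or units). `fallback` is a zero-argument callable that runs
    the slower image classifier and returns a group name; it is only called
    when no key term is found. Returns (group, method).
    """
    text = ocr_text.lower()
    for group, keywords in keyword_groups.items():
        if any(keyword.lower() in text for keyword in keywords):
            return group, "ocr"
    return fallback(), "image"


# Hypothetical keyword table, for illustration only.
example_groups = {"spectrum": ["wavelength (nm)"], "timeseries": ["time (s)"]}
example = classify_figure("Absorbance vs Wavelength (nm)", example_groups, lambda: "other")
```

Passing the image classifier as a callable keeps it from running at all when the OCR text already decides the group.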
Future Plans
- An example semantic analysis / clustering would be nice. This would use the markdown text generated during figure extraction to identify the region a paper is from, author information, or any number of other characteristics (paper length?). It may also check for linguistic similarities among the papers that match the annotation conditions, and for potential differences from the remaining papers, as an alternative classification approach.
- Unit tests for all functions
- Code linting
- Continuous integration for package deployment
- More documentation on annotation examples
- Tutorial script
- Prepare for RSE presentation?
- Parallelization of pdf processing with mpi4py (platform agnostic)
- Robust try/except error handling with logging information
Owner
- Name: Isaac Spackman
- Login: spackman
- Kind: user
- Location: Golden, CO
- Company: Colorado School of Mines
- Repositories: 1
- Profile: https://github.com/spackman
Applied Chemistry PhD Student | Colorado School of Mines
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Spackman
    given-names: Isaac
    orcid: https://orcid.org/0000-0003-4080-0144
title: "Zbinfigs: An image parsing utility for Zotero collections."
version: 0.1.0
identifiers:
  - type: doi
    value:
date-released: 2024-12-20
GitHub Events
Total
- Create event: 3
- Issues event: 1
- Release event: 2
- Push event: 38
Last Year
- Create event: 3
- Issues event: 1
- Release event: 2
- Push event: 38
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- spackman (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 20 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: zbinfigs
A package for figure classification from a zotero collection of scholarly articles
- Documentation: https://zbinfigs.readthedocs.io/
- License: MIT License
- Latest release: 0.1.7 (published about 1 year ago)