https://github.com/bertsky/ocrd_publaynet
convert PubLayNet data into METS/PAGE-XML
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 1 committers (100.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.9%) to scientific vocabulary
Repository
convert PubLayNet data into METS/PAGE-XML
Basic Info
- Host: GitHub
- Owner: bertsky
- Language: Python
- Default Branch: master
- Size: 5.86 KB
Statistics
- Stars: 10
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ocrd_publaynet
convert PubLayNet data into METS/PAGE-XML
Introduction
This package converts PubLayNet, or similar MS-COCO-based ground-truth data, into an OCR-D compliant (i.e. METS-XML/PAGE-XML based) representation.
Installation
System packages
Install GNU make and wget if you wish to use the Makefile.
# on Debian / Ubuntu:
sudo apt install make wget
Install Python3 regardless:
# on Debian / Ubuntu:
sudo apt install python3 python3-pip python3-venv
Equivalently:
# on Debian / Ubuntu:
sudo make deps-ubuntu
Python packages
It is strongly recommended to use venv. You can create and activate a virtual environment of your own (which the Makefile will re-use when activated), or have the Makefile do that for you.
pip install -r requirements.txt
pip install .
Equivalently:
make install
Usage
command-line interface ocrd-import-mscoco
Once installed, the following executable becomes available:
```
Usage: ocrd-import-mscoco [OPTIONS] COCOFILE DIRECTORY

  Convert MS-COCO JSON to METS/PAGE XML files.

  Load JSON cocofile (in MS-COCO format) and chdir to directory
  (which it refers to).

  Start a METS file mets.xml with references to the image files (under
  fileGrp OCR-D-IMG) and their corresponding PAGE-XML annotations (under
  fileGrp OCR-D-GT-SEG-BLOCK), as parsed from cocofile and written
  using the same basename.

Options:
  --help  Show this message and exit.
```
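To illustrate what converting an MS-COCO annotation into a PAGE-XML region outline involves, here is a minimal sketch. It is not taken from this package's source: the function names are hypothetical, and it only assumes the usual MS-COCO conventions (a polygon is a flat list `[x1, y1, x2, y2, ...]`, a bbox is `[x, y, width, height]`) and the PAGE-XML convention of a `points` string made of space-separated `x,y` integer pairs.

```python
def coco_polygon_to_points(flat):
    """Turn a flat COCO polygon [x1, y1, x2, y2, ...] into a PAGE 'points' string.

    Hypothetical helper for illustration; not part of ocrd_publaynet.
    """
    if len(flat) < 6 or len(flat) % 2:
        raise ValueError("expected an even number of values, at least 3 x/y pairs")
    pairs = zip(flat[0::2], flat[1::2])
    # PAGE-XML uses non-negative integer pixel coordinates: round and clamp at zero
    return " ".join(f"{max(0, round(x))},{max(0, round(y))}" for x, y in pairs)


def coco_bbox_to_points(bbox):
    """Turn a COCO bbox [x, y, width, height] into a rectangular 'points' string."""
    x, y, w, h = bbox
    # corners in order: top-left, top-right, bottom-right, bottom-left
    return coco_polygon_to_points([x, y, x + w, y, x + w, y + h, x, y + h])


if __name__ == "__main__":
    print(coco_bbox_to_points([10, 20, 30, 40]))
```

The resulting string could then be written into the `points` attribute of a region's `Coords` element. How the package itself maps PubLayNet categories to PAGE region types is not shown here.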
apply on PubLayNet
To apply on the validation subsection:
ocrd-import-mscoco publaynet/val.json publaynet/val
This will create a METS publaynet/val/mets.xml and PAGE files publaynet/val/*.xml for all image files.
To apply on the training subsection:
ocrd-import-mscoco publaynet/train.json publaynet/train
This will create a METS publaynet/train/mets.xml and PAGE files publaynet/train/*.xml for all image files.
Equivalently (including download/extraction if necessary):
make convert
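To check the result of a conversion, one could list the files registered under each fileGrp of the generated METS. The following is a rough sketch using only the Python standard library. The function name `list_files` and the sample file names are made up for illustration; the only assumptions about the output are the standard METS and XLink namespaces and the fileGrp names mentioned above.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Tiny hand-written METS excerpt (illustrative file names, not real PubLayNet data)
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/jpeg">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page_0001.jpg"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-GT-SEG-BLOCK">
      <mets:file ID="GT_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="page_0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def list_files(mets_root, file_grp):
    """Return the xlink:href of every file listed under the given fileGrp."""
    hrefs = []
    for grp in mets_root.iterfind(".//mets:fileGrp", NS):
        if grp.get("USE") != file_grp:
            continue
        for flocat in grp.iterfind("mets:file/mets:FLocat", NS):
            hrefs.append(flocat.get("{%s}href" % NS["xlink"]))
    return hrefs


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for grp in ("OCR-D-IMG", "OCR-D-GT-SEG-BLOCK"):
        print(grp, list_files(root, grp))
```

For a real conversion result, replace `ET.fromstring(SAMPLE)` with `ET.parse("publaynet/val/mets.xml").getroot()`.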
Note: PubLayNet itself requires approximately 103 GB of disk space. If you already have it (elsewhere), but still wish to use the Makefile to convert the files, make sure to symlink it here, so it does not get downloaded twice:
ln -s your/path/to/publaynet publaynet
Note: PubLayNet's train.json is 1.6 GB on disk and takes about 10 GB of (resident!) memory to load. Any incremental/stream-based method would be orders of magnitude slower than plain json.load(). Also, MS-COCO cannot be split, because it basically defines a (humongous) annotations dict with pointers to a (large) images dict, sequentially. Another problem is that the conversion cannot be parallelized, since everything needs to end up in one final METS file. So this may take days. Just grin and bear it!
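The reason the whole file has to be in memory can be seen from the structure of MS-COCO: under the top-level keys `images` and `annotations` there are two separate lists of records, and each annotation refers to its image only by `image_id`. A minimal sketch of the grouping step, assuming nothing beyond that standard layout (the function name and the inline sample are illustrative, not this package's code):

```python
import json
from collections import defaultdict

# Tiny inline stand-in for a COCO file (illustrative values only)
SAMPLE_JSON = """{
  "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
    {"id": 11, "image_id": 1, "category_id": 4, "bbox": [50, 60, 70, 80]}
  ]
}"""


def group_annotations(coco):
    """Map each image file name to the list of annotations pointing at it."""
    by_image = defaultdict(list)
    # One pass over all annotations: any of them may belong to any image,
    # so none can be processed before the full list is available.
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return {img["file_name"]: by_image.get(img["id"], []) for img in coco["images"]}


if __name__ == "__main__":
    coco = json.loads(SAMPLE_JSON)  # for a real file: json.load(open(path))
    for name, anns in group_annotations(coco).items():
        print(name, [a["category_id"] for a in anns])
```

Grouping by `image_id` once up front keeps the per-page lookup cheap, at the cost of holding the full annotation list in memory.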
all Makefile targets
```
Rules to install ocrd-import-mscoco, and to use it on PubLayNet
(by downloading, extracting and converting).

Targets:
    help: this message
    deps-ubuntu: install system dependencies for Ubuntu
    all: alias for install download convert
    install: alias for pip install .
    download: alias for publaynet.tar.gz
    convert: alias for publaynet/val/mets.xml publaynet/train/mets.xml
    uninstall: alias for pip uninstall ocrd_publaynet
    clean-xml: remove results of conversion (METS and PAGE files in publaynet)
    clean: remove publaynet altogether

Variables:
    VIRTUAL_ENV: absolute path to (re-)use for the virtual environment
    PYTHON: name of the Python binary
    PIP: name of the Python packaging binary
```
Owner
- Name: Robert Sachunsky
- Login: bertsky
- Kind: user
- Repositories: 114
- Profile: https://github.com/bertsky
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 4
- Total Committers: 1
- Avg Commits per committer: 4.0
- Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| Robert Sachunsky | s****y@i****e | 4 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Packages
- Total packages: 1
- Total downloads: 14 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 1
- Total maintainers: 1
pypi.org: ocrd-publaynet
convert PubLayNet data into METS/PAGE-XML
- Homepage: https://github.com/bertsky/ocrd_publaynet
- Documentation: https://ocrd-publaynet.readthedocs.io/
- License: Apache License 2.0
- Latest release: 0.1.0 (published about 6 years ago)
Dependencies
- click >=7.0
- numpy *
- ocrd >=2.4.0