hocr-to-alto
Convert between Tesseract hOCR and ALTO XML using XSL stylesheets
Keywords
alto
hocr
xsl
xsl-stylesheets
xslt2
Repository
Statistics
- Stars: 55
- Watchers: 11
- Forks: 14
- Open Issues: 0
- Releases: 1
Created about 10 years ago
· Last pushed 9 months ago
Metadata Files
Readme
License
Citation
README.md
hOCR-to-ALTO
Convert between Tesseract hOCR and ALTO XML 2.0/2.1/3/4 using XSL stylesheets
The stylesheets use XSLT 2.0 features, so an XSLT 2.0-capable processor such as Saxon is required.
To run a conversion from the Saxon-HE command line (this example converts ALTO to hOCR):
- Unpack the Saxon distribution into the saxon subdir
- Place your input file(s) into the _input subdir
Run:
java -jar "saxon/saxon-he-12.7.jar" -s:_input/input-alto.xml -xsl:alto__hocr.xsl -o:_output/output-hocr.xml
Or use the run-saxon script from bash:
$ ./run-saxon input-alto.xml alto__hocr.xsl output-hocr.xml
- Check the _output dir.
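The Saxon invocation above can also be assembled programmatically, for example when converting many files in a loop. The following is a minimal Python sketch and not part of this repository: the helper names `build_saxon_command` and `run_conversion` are hypothetical, and the default jar path and directory names are assumptions copied from the example command. It uses only the `-s:`, `-xsl:` and `-o:` options shown above.

```python
import subprocess
from pathlib import Path


def build_saxon_command(input_name, stylesheet, output_name,
                        jar="saxon/saxon-he-12.7.jar",
                        input_dir="_input", output_dir="_output"):
    """Return the argument list for one Saxon-HE run (hypothetical helper).

    Mirrors the README command: -s: is the source file, -xsl: the
    stylesheet, and -o: the output file.
    """
    source = Path(input_dir) / input_name
    target = Path(output_dir) / output_name
    return [
        "java", "-jar", jar,
        f"-s:{source.as_posix()}",
        f"-xsl:{stylesheet}",
        f"-o:{target.as_posix()}",
    ]


def run_conversion(input_name, stylesheet, output_name):
    """Run one conversion; requires Java and the Saxon jar to be present."""
    cmd = build_saxon_command(input_name, stylesheet, output_name)
    # check=True raises CalledProcessError if the java process exits non-zero.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Saxon.
    print(" ".join(build_saxon_command(
        "input-alto.xml", "alto__hocr.xsl", "output-hocr.xml")))
```

Keeping the command construction separate from the call to `subprocess.run` makes it possible to inspect or log the exact command line before anything is executed.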
See ocr-fileformat for an interface to these stylesheets.
hOCR specification: https://github.com/kba/hocr-spec
File naming scheme: sourceFormatVersion__targetFormatVersion.xsl
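As an illustration of the naming scheme, the sketch below splits a stylesheet file name into its source and target parts at the double underscore. It assumes the scheme is followed exactly as stated above; the function name `parse_stylesheet_name` is hypothetical and not part of the repository.

```python
from pathlib import Path


def parse_stylesheet_name(filename):
    """Split 'source__target.xsl' into a (source, target) tuple."""
    path = Path(filename)
    if path.suffix != ".xsl":
        raise ValueError(f"not an .xsl file: {filename}")
    # The double underscore separates the source format from the target format.
    source, separator, target = path.stem.partition("__")
    if not separator or not source or not target:
        raise ValueError(f"name does not follow source__target.xsl: {filename}")
    return source, target


if __name__ == "__main__":
    print(parse_stylesheet_name("alto__hocr.xsl"))  # ('alto', 'hocr')
```

Any format version is returned as part of the source or target string, since the scheme embeds it directly in the name.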
CONTENTS
- Convert ALTO to hOCR
- Convert hOCR to ALTO
- Convert ALTO to plain text
- Convert hOCR to plain text
- Language codes
codes_lookup.xml: generated with https://github.com/filak/iso-language-codes
Owner
- Name: filak
- Login: filak
- Kind: user
- Company: NLK
- Repositories: 5
- Profile: https://github.com/filak
Citation (CITATION.cff)
cff-version: 1.2.0
message: "Please cite this software using these metadata."
abstract: "Convert between Tesseract hOCR and ALTO XML using XSL stylesheets"
authors:
  - family-names: "Kriz"
    given-names: "Filip"
    url: "https://www.medvik.cz/link/xx0060106"
  - family-names: "Baierer"
    given-names: "Konstantin"
  - family-names: "Zumstein"
    given-names: "Philipp"
  - family-names: "Lehečka"
    given-names: "Boris"
  - family-names: "Companjen"
    given-names: "Ben"
  - family-names: "Hartwig"
    given-names: "Uwe"
  - family-names: "Weil"
    given-names: "Stefan"
title: "hOCR-to-ALTO"
license: "MIT"
repository-code: "https://github.com/filak/hOCR-to-ALTO"
GitHub Events
Total
- Create event: 4
- Release event: 3
- Issues event: 1
- Watch event: 1
- Delete event: 6
- Issue comment event: 5
- Push event: 11
- Pull request event: 1
Last Year
- Create event: 4
- Release event: 3
- Issues event: 1
- Watch event: 1
- Delete event: 6
- Issue comment event: 5
- Push event: 11
- Pull request event: 1