hocr-to-alto

Convert between Tesseract hOCR and ALTO XML using XSL stylesheets

https://github.com/filak/hocr-to-alto

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.6%) to scientific vocabulary

Keywords

alto hocr xsl xsl-stylesheets xslt2

Repository


Basic Info
  • Host: GitHub
  • Owner: filak
  • License: mit
  • Language: XSLT
  • Default Branch: master
  • Homepage:
  • Size: 153 KB
Statistics
  • Stars: 55
  • Watchers: 11
  • Forks: 14
  • Open Issues: 0
  • Releases: 1
Created about 10 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

hOCR-to-ALTO

Convert between Tesseract hOCR and ALTO XML 2.0/2.1/3/4 using XSL stylesheets

The stylesheets use XSLT 2.0 features, so an XSLT 2.0-capable transformer such as Saxon is required.

Running the conversion with the Saxon-HE command line (this example converts ALTO to hOCR):

  1. Unpack the Saxon distribution into the saxon subdir
  2. Place your input file(s) into the _input subdir
  3. Run:

    java -jar "saxon/saxon-he-12.7.jar" -s:_input/input-alto.xml -xsl:alto__hocr.xsl -o:_output/output-hocr.xml
    

Or use the run-saxon script from bash:

   $ ./run-saxon input-alto.xml alto__hocr.xsl output-hocr.xml
  4. Check the _output dir.
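
To convert several files at once, the documented Saxon command can be wrapped in a loop. The following is a minimal sketch, not part of the repository: `convert_all`, `SAXON_JAR` and `DRY_RUN` are illustrative names, the jar path is copied from the example above, and writing each result under the same base name into `_output` is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: apply one stylesheet to every XML file in an input directory.
# convert_all, SAXON_JAR and DRY_RUN are illustrative names (assumptions).
SAXON_JAR="${SAXON_JAR:-saxon/saxon-he-12.7.jar}"

convert_all() {
  local xsl="$1" in_dir="${2:-_input}" out_dir="${3:-_output}" src out
  mkdir -p "$out_dir"
  for src in "$in_dir"/*.xml; do
    [ -e "$src" ] || continue            # skip when the glob matches nothing
    out="$out_dir/$(basename "$src")"    # assumption: keep the input file name
    if [ -n "${DRY_RUN:-}" ]; then
      # Print the command instead of running it.
      echo java -jar "$SAXON_JAR" "-s:$src" "-xsl:$xsl" "-o:$out"
    else
      java -jar "$SAXON_JAR" "-s:$src" "-xsl:$xsl" "-o:$out"
    fi
  done
}
```

For example, `DRY_RUN=1 convert_all alto__hocr.xsl` prints the commands that would be run for each file in `_input` without invoking Java.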

See ocr-fileformat for an interface to these stylesheets.

hOCR specification: https://github.com/kba/hocr-spec

File naming scheme: sourceFormatVersion__targetFormatVersion.xsl
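
As an illustration of the naming scheme, a stylesheet name can be split on the double underscore to recover the source and target formats. The helper below is a hypothetical sketch (`split_stylesheet_name` is not part of the repository) and assumes the name ends in `.xsl` and contains a `__` separator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split sourceFormatVersion__targetFormatVersion.xsl
# into "source target". Not part of the repository.
split_stylesheet_name() {
  local name base
  name="$(basename "$1")"
  base="${name%.xsl}"                    # strip the .xsl extension
  case "$base" in
    *__*) ;;                             # has the expected separator
    *) echo "not a source__target stylesheet name: $1" >&2; return 1 ;;
  esac
  # Text before the first "__" is the source, text after it is the target.
  printf '%s %s\n' "${base%%__*}" "${base#*__}"
}

split_stylesheet_name alto__hocr.xsl     # prints "alto hocr"
```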

CONTENTS

Owner

  • Name: filak
  • Login: filak
  • Kind: user
  • Company: NLK

Citation (CITATION.cff)

cff-version: 1.2.0
message: "Please cite this software using these metadata."
abstract: "Convert between Tesseract hOCR and ALTO XML using XSL stylesheets"  
authors:
- family-names: "Kriz"
  given-names: "Filip"
  url: "https://www.medvik.cz/link/xx0060106"
- family-names: "Baierer"
  given-names: "Konstantin"
- family-names: "Zumstein"
  given-names: "Philipp"
- family-names: "Lehečka"
  given-names: "Boris"
- family-names: "Companjen"
  given-names: "Ben"
- family-names: "Hartwig"
  given-names: "Uwe"
- family-names: "Weil"
  given-names: "Stefan"
title: "hOCR-to-ALTO"
license: "MIT"
repository-code: "https://github.com/filak/hOCR-to-ALTO"

GitHub Events

Total
  • Create event: 4
  • Release event: 3
  • Issues event: 1
  • Watch event: 1
  • Delete event: 6
  • Issue comment event: 5
  • Push event: 11
  • Pull request event: 1
Last Year
  • Create event: 4
  • Release event: 3
  • Issues event: 1
  • Watch event: 1
  • Delete event: 6
  • Issue comment event: 5
  • Push event: 11
  • Pull request event: 1