ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)

https://github.com/ub-mannheim/ocr-fileformat

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    3 of 11 committers (27.3%) from academic institutions
  • Institutional organization owner
    Organization ub-mannheim has institutional domain (www.bib.uni-mannheim.de)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

alto finereader hocr ocr ocr-d page-xml transformation validation
Last synced: 6 months ago

Repository


Basic Info
Statistics
  • Stars: 188
  • Watchers: 19
  • Forks: 24
  • Open Issues: 34
  • Releases: 16
Topics
alto finereader hocr ocr ocr-d page-xml transformation validation
Created almost 10 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

ocr-fileformat


Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)


Installation

Docker

You can run the command line scripts and the web interface in a Docker container; you only need Docker installed.

To start the web interface on http://localhost:8080:

docker run --rm -it -p 8080:8080 ubma/ocr-fileformat

To run the command line scripts, mount the directory containing your input files into the container's /data directory:

docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat ocr-transform alto2.0 hocr somefile.alto
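If you call the containerized scripts often, a small shell wrapper can hide the Docker boilerplate. The following is a minimal sketch, not part of ocr-fileformat: the function name `ocr_transform_docker` and the `DOCKER` variable are hypothetical, and `-it` is left out on the assumption that the wrapper is used from non-interactive scripts.

```sh
# DOCKER can be overridden, e.g. to use a compatible container runtime.
DOCKER=${DOCKER:-docker}

# ocr_transform_docker ARGS...: run ocr-transform inside the container,
# with the current directory mounted at /data as described above.
ocr_transform_docker() {
    "$DOCKER" run --rm -v "$PWD":/data ubma/ocr-fileformat ocr-transform "$@"
}

# Example call (commented out so that sourcing this file has no side effects):
# ocr_transform_docker alto2.0 hocr somefile.alto
```

All arguments are passed through unchanged, so the wrapper accepts the same formats and file names as the command shown above.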

System-wide

To install system-wide to /usr/local:

sudo make install

To install without sudo to your home directory:

make install PREFIX=$HOME/.local

If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):

export PATH="$HOME/.local/bin:$PATH"
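Startup files are often sourced more than once, which can add the same directory to PATH repeatedly. As an optional alternative, here is a minimal sketch of an idempotent variant; `path_prepend` is a hypothetical helper name, not something ocr-fileformat provides.

```sh
# path_prepend DIR: put DIR at the front of PATH unless it is already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;      # entries in PATH are separated by colons
    esac
}

path_prepend "$HOME/.local/bin"
export PATH
```

Wrapping PATH in colons before matching makes sure that only whole entries are compared, so `/opt/bin` is not mistaken for `/opt/bin2`.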

The web application has a PHP backend. You can deploy it on any PHP-capable server by copying the web folder somewhere below the document root of your server, e.g. /var/www/html for Apache on Debian/Ubuntu:

sudo -u www-data cp -r web /var/www/html/ocr-fileformat

In this example the GUI would be available under http://localhost/ocr-fileformat/.

Usage

The project offers two functions, transformation and validation, which can be accessed via command line scripts (CLI), a web interface (GUI), or from your own tools (API).

CLI

  • ocr-transform: Transformation of OCR output between OCR formats
  • ocr-validate: Validation of OCR output against OCR format schemas

GUI

The web interface is for testing validation and transformations. You can upload a file or select an input file by URL.

API

Transformation

Transformation CLI

Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]

For example, you can transform an ALTO XML to a hOCR file with:

ocr-transform alto hocr sample.xml sample.hocr

Or convert from ALTO XML (version 2.1) to hOCR with:

ocr-transform alto2.1 hocr sample.alto sample.hocr

You can also pass arguments directly to the Saxon CLI by passing them after a double dash (--). For example, to set the foo parameter to bar:

ocr-transform alto hocr sample.xml sample.hocr -- foo=bar
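The single-file calls above extend naturally to a whole directory. The following sketch assumes that the ALTO files end in `.xml` and that each result should be written next to its input with a `.hocr` extension; the function name `alto_dir_to_hocr` and the `OCR_TRANSFORM` variable are hypothetical.

```sh
# OCR_TRANSFORM can point to another command, e.g. a Docker wrapper.
OCR_TRANSFORM=${OCR_TRANSFORM:-ocr-transform}

# alto_dir_to_hocr DIR: convert every DIR/*.xml file from ALTO to hOCR.
alto_dir_to_hocr() {
    dir=$1
    for input in "$dir"/*.xml; do
        [ -e "$input" ] || continue        # the glob matched nothing
        output="${input%.xml}.hocr"        # sample.xml -> sample.hocr
        # Stop at the first file that fails to convert.
        "$OCR_TRANSFORM" alto hocr "$input" "$output" || return 1
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# alto_dir_to_hocr ./scans
```

Stopping at the first failure keeps errors visible; replace `return 1` with `continue` if the remaining files should still be processed.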

Try ocr-transform -h to get an overview:

```
Usage: ocr-transform [OPTIONS] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]
       ocr-transform [OPTIONS] --help-args   Show script-args, and exit
       ocr-transform [OPTIONS] -h|--help     Show this help, and exit
       ocr-transform [OPTIONS] -v|--version  Show version, and exit
       ocr-transform [OPTIONS] -L|--list     List available from/to, and exit

Options:
    --debug   -d     Increase debug level by 1, can be repeated

Transformations:
    abbyy hocr
    abbyy page
    alto hocr
    alto page
    alto text
    alto2.0 alto3.0
    alto2.0 alto3.1
    alto2.0 hocr
    alto2.1 alto3.0
    alto2.1 alto3.1
    alto2.1 hocr
    alto4.2 alto2.1
    gcv alto
    gcv hocr
    gcv page
    hocr alto
    hocr alto2.0
    hocr alto2.1
    hocr alto3.0
    hocr alto4.0
    hocr page
    hocr tei
    hocr text
    mybib alto3.0
    page alto
    page alto_legacy
    page hocr
    page page2019
    page text
    tei hocr
    textract page

```
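Scripts can check whether a given conversion exists before attempting it. The sketch below reads a list of transformations from standard input and assumes one whitespace-separated "from to" pair per line, as in the Transformations list above; whether `ocr-transform -L` prints exactly this layout is an assumption, and `has_transformation` is a hypothetical helper name.

```sh
# has_transformation FROM TO: succeed if the pair "FROM TO" appears on stdin.
# Leading whitespace is ignored because awk splits fields on blanks.
has_transformation() {
    awk -v src="$1" -v dst="$2" '
        $1 == src && $2 == dst { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Example call (commented out; requires ocr-transform to be installed):
# ocr-transform -L | has_transformation alto hocr && echo "supported"
```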

Transformation GUI

Select the Transform menu option. Choose a URL, an input and an output format. Click Transform.

Transformation API

The stylesheets are installed in $PREFIX/share/ocr-fileformat/xslt and can be used directly in your scripts and software. You will need an XSLT 2.0-capable processor such as Saxon, which the CLI itself uses.
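To see which stylesheets an installation provides, you can list the directory from a script. This is a minimal sketch: the helper name `list_stylesheets` is hypothetical, and the `.xsl` file extension is an assumption about how the stylesheets are named.

```sh
# list_stylesheets PREFIX: print every *.xsl file below the xslt directory
# of an installation under PREFIX, sorted by path.
list_stylesheets() {
    find "$1/share/ocr-fileformat/xslt" -type f -name '*.xsl' | sort
}

# Example call (commented out; adjust the prefix to your installation):
# list_stylesheets /usr/local
```

Each printed path can then be passed to the XSLT processor of your choice together with the input document.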

Supported Transformations

| From ╲ To | hOCR | ALTO | PAGEXML | TEI | Text |
| ---: | --- | --- | --- | --- | --- |
| hOCR | - | ✓ | ✓ | ✓ | ✓ |
| ALTO | ✓ | ✓ | ✓ | - | ✓ |
| PAGEXML | ✓ | ✓ | ✓ | - | ✓ |
| ABBYY FineReader | ✓ | - | ✓ | - | - |
| Google Cloud Vision | ✓ | ✓ | ✓ | - | - |
| Amazon AWS Textract | - | - | ✓ | - | - |
| TEI | ✓ | - | - | - | - |

Validation

```
Usage: ocr-validate [OPTIONS] []
       ocr-validate [OPTIONS] -h|--help     Show this help, and exit
       ocr-validate [OPTIONS] -v|--version  Show version, and exit
       ocr-validate [OPTIONS] -L|--list     List available schemas, and exit

Options:
    --debug   -d     Increase debug level by 1, can be repeated

Schemas:
    hocr
    alto-1-0 alto-1-1 alto-1-2 alto-1-3 alto-1-4 alto-2-0 alto-2-1 alto-2-2-draft alto-3-0 alto-3-1 alto-3-2-draft alto-4-0 alto-4-1 alto-4-2 alto-4-3
    abbyy-6-schema-v1 abbyy-8-schema-v2 abbyy-9-schema-v1 abbyy-10-schema-v1
    page-2009-03-16 page-2010-01-12 page-2010-03-19 page-2013-07-15 page-2016-07-15 page-2017-07-15 page-2018-07-15 page-2019-07-15

```

Validation CLI

For example, to validate an XML file against the ALTO 3.1 schema:

ocr-validate alto-3-1 myFile.alto
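To check several files in one go, the call above can be wrapped in a loop. The sketch below assumes that `ocr-validate` exits with a non-zero status when a file does not validate; that behaviour is not stated here and should be verified for your version. The names `validate_all` and `OCR_VALIDATE` are hypothetical.

```sh
# OCR_VALIDATE can point to another command, e.g. a Docker wrapper.
OCR_VALIDATE=${OCR_VALIDATE:-ocr-validate}

# validate_all SCHEMA FILE...: validate each FILE against SCHEMA, print one
# status line per file, and fail if at least one file did not validate.
validate_all() {
    schema=$1
    shift
    failed=0
    for file in "$@"; do
        if "$OCR_VALIDATE" "$schema" "$file" >/dev/null 2>&1; then
            echo "OK   $file"
        else
            echo "FAIL $file"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Example call (commented out so that sourcing this file has no side effects):
# validate_all alto-3-1 ./*.alto
```

The validator's own output is discarded to keep the summary short; remove the redirection to see the detailed error messages.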

Validation GUI

Select the Validate menu option. Choose a URL and a schema. Click Validate.

Validation API

The XSD files are installed under $PREFIX/share/ocr-fileformat/xsd and can be used directly with any XML Schema validator.

Supported Validation Formats

| | hOCR | ALTO | PAGEXML | FineReader | Google Cloud Vision | Amazon AWS Textract |
| ---: | --- | --- | --- | --- | --- | --- |
| Validation | ✓ | ✓ | ✓ | ✓ | - | - |

License

This is free software. You may use it under the terms of the MIT License.

During the installation process several projects are included (in ./vendor). These projects have different licenses:

Owner

  • Name: Universitätsbibliothek Mannheim
  • Login: UB-Mannheim
  • Kind: organization
  • Email: info.ub@uni-mannheim.de
  • Location: Mannheim, Germany

Mannheim University Library

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: ocr-fileformat
message: >-
  You may cite this software using the metadata from this file.
type: software
authors:
  - name: Universitätsbibliothek Mannheim
    country: DE
    city: Mannheim
    website: 'https://www.bib.uni-mannheim.de/'
  - given-names: Konstantin
    family-names: Baierer
    orcid: 'https://orcid.org/0000-0003-2397-242X'
  - given-names: Stefan
    family-names: Weil
    affiliation: Universitätsbibliothek Mannheim
    orcid: 'https://orcid.org/0000-0002-0524-9898'
  - family-names: Zumstein
    given-names: Philipp
    affiliation: Universitätsbibliothek Mannheim
    orcid: 'https://orcid.org/0000-0002-6485-9434'
  - given-names: Robert
    family-names: Sachunsky
  - given-names: Jörg
    orcid: 'https://orcid.org/0000-0002-6406-4906'
    family-names: Mechnich
    affiliation: Universitätsbibliothek Mannheim
  - given-names: Uwe
    family-names: Hartwig
    orcid: 'https://orcid.org/0000-0001-7164-6376'
  - given-names: Mike
    family-names: Gerber
  - given-names: Clemens
    orcid: 'https://orcid.org/0000-0001-5293-8322'
    family-names: Neudecker

GitHub Events

Total
  • Create event: 1
  • Issues event: 1
  • Release event: 2
  • Watch event: 7
  • Delete event: 1
  • Issue comment event: 8
  • Push event: 4
  • Pull request review event: 3
  • Pull request event: 4
  • Fork event: 2
Last Year
  • Create event: 1
  • Issues event: 1
  • Release event: 2
  • Watch event: 7
  • Delete event: 1
  • Issue comment event: 8
  • Push event: 4
  • Pull request review event: 3
  • Pull request event: 4
  • Fork event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 249
  • Total Committers: 11
  • Avg Commits per committer: 22.636
  • Development Distribution Score (DDS): 0.582
Past Year
  • Commits: 11
  • Committers: 3
  • Avg Commits per committer: 3.667
  • Development Distribution Score (DDS): 0.364
Top Committers
Name Email Commits
Konstantin Baierer u****g@g****m 104
Stefan Weil sw@w****e 65
Philipp Zumstein z****p@g****m 43
Robert Sachunsky s****y@i****e 23
Jörg Mechnich j****h@b****e 6
Konstantin Baierer k****r@b****e 3
Clemens Neudecker c****r@g****m 1
Gerber, Mike m****r@s****e 1
LGTM Migrator l****r 1
The Codacy Badger b****r@c****m 1
Uwe Hartwig u****g@b****o 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 94
  • Total pull requests: 95
  • Average time to close issues: 6 months
  • Average time to close pull requests: 2 months
  • Total issue authors: 13
  • Total pull request authors: 5
  • Average comments per issue: 4.2
  • Average comments per pull request: 1.85
  • Merged pull requests: 83
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 3.67
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zuphilip (25)
  • stweil (10)
  • kba (6)
  • jbarth-ubhd (4)
  • jtlz2 (3)
  • yanirmr (2)
  • mhucka (2)
  • jbaiter (1)
  • yuvaler1 (1)
  • TuulaP (1)
  • FoxKyong (1)
  • dinosauria123 (1)
  • jmokoistinen (1)
  • asor12 (1)
Pull Request Authors
  • kba (25)
  • zuphilip (25)
  • stweil (6)
  • bertsky (5)
  • codacy-badger (1)
Top Labels
Issue Labels
enhancement (9) transformation (8) format (7) upstream (5) bug (3) help wanted (1)
Pull Request Labels
enhancement (1) bug (1)

Dependencies

.github/workflows/codeql.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
Dockerfile docker
  • alpine edge build