gt-mufilevelrules

OCR-D-Level-Rules can be created automatically with gt-MufiLevelRules from the encodings published by MUFI: The Medieval Unicode Font Initiative.

https://github.com/ocr-d/gt-mufilevelrules

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

ground-truth guidelines ocr ocr-d transcription

Keywords from Contributors

standardization
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
ground-truth guidelines ocr ocr-d transcription
Created over 3 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

gt-MufiLevelRules

Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by MUFI: The Medieval Unicode Font Initiative.

The resulting OCR-D level rules conform to the OCR-D specification. These rules can be used for substitutions or level checks, among other things.
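As an illustration of a level check, the following is a minimal sketch, not the project's own implementation. It assumes each rule is a triple of character forms for levels 1 to 3, as in the JSON format described further down, and it flags forms in a text that the rules assign to a higher level than the one being checked. The function name `find_level_violations` and the sample rule are hypothetical; the level 2 form is assumed to be "a" followed by U+0364 (combining small letter e).

```python
def find_level_violations(text, rules, level):
    """Return forms found in text that the rules assign to a level above `level`."""
    violations = []
    for forms in rules:
        expected = forms[level - 1]
        if not expected:
            continue  # no definition at this level, nothing to compare against
        # forms[level:] holds the forms of all levels above the checked one
        for higher in forms[level:]:
            if higher and higher != expected and higher in text and higher not in violations:
                violations.append(higher)
    return violations


# Hypothetical sample rule: (level 1, level 2, level 3); level 3 is undefined here.
RULES = [("\u00e4", "a\u0364", "")]

level2_text = "Ma\u0364dchen"  # uses the level 2 form
level1_text = "M\u00e4dchen"   # uses the level 1 form

print(find_level_violations(level2_text, RULES, level=1))
print(find_level_violations(level1_text, RULES, level=1))
```

The first call reports the level 2 form, the second returns an empty list. Rules without a definition at the checked level are skipped, which matches the caveat that not every level is defined.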

Note:

  • There may not always be a definition for every level, especially on level 1.
  • OCR-D will try to fill in these gaps manually or automatically. The automated completion is based on the unicruft program.
  • For this reason, using the rules for automatic character normalization from level 3 or level 2 to level 1 is currently not recommended before manually checking and correcting the corresponding rules.

Download the Rules

🚦 You can download the set of rules here. 🚦

  • select the corresponding rule file: rules directory
  • as zip release file: latest Releases

Recreation of the rules

  1. Copy or clone the repository:

    git clone https://github.com/tboenig/gt-MufiLevelRules.git

  2. Install Saxon for XSL Transformations v3.0. Then simply run with:

    java -jar saxon-he-XX.jar -xsl:scripts/MufiGTLevelRules2.xsl -s:scripts/MufiGTLevelRules.xsl output=characters merge=yes

Parameters:

  • output=characters: creates the rules; all rules are saved under the directory [directory]/rules/characters
  • merge=yes: creates the megarules (all rules in one file); the megarules are saved under the directory [directory]/rules

The result of the conversion can be found in the directory [directory]/rules/characters.

Output formats:

  • xml
  • json

The script uses:

  1. the MUFI rules (new version) and the MUFI rules (old version)

  2. a summary of additional rules from the OCR-D Ground-Truth Transcription Guide, which take precedence over the MUFI rules where applicable

Description of the rules

JSON Format

All JSON files (both the pure MUFI rules and the final result) follow the same schema.

Example:

    {"ruleset":[
      ...
      {"rule": ["ä", "aͤ", ""], "type": "level"}
      ...
    ]}

  • Each rule has a key: rule and a list of values
  • The values define the character representation on each of the 3 transcription levels:
    • Level 1 is at the first position
    • Level 2 is at the second position
    • Level 3 is at the third position
  • Additional key-value combinations: ...
  • Character values can be empty to signify there is no definition (representation) at that level.
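The schema above can be turned into a substitution table. The following is a minimal sketch under stated assumptions, not the project's own tooling: the helper names `build_mapping` and `substitute` are hypothetical, the sample is inlined instead of read from a rule file, and the level 2 form is assumed to be "a" followed by U+0364. Rules with an empty value at the source or target level are skipped, because an empty value means there is no definition at that level.

```python
import json
import re

# Inline sample in the schema shown above; a real rule file would be read with json.load().
SAMPLE = '{"ruleset": [{"rule": ["\\u00e4", "a\\u0364", ""], "type": "level"}]}'


def build_mapping(ruleset, source_level, target_level):
    """Map each form at source_level to its form at target_level (levels are 1-based)."""
    mapping = {}
    for entry in ruleset:
        forms = entry["rule"]
        src = forms[source_level - 1]
        dst = forms[target_level - 1]
        if src and dst:  # skip rules that are undefined at either level
            mapping[src] = dst
    return mapping


def substitute(text, mapping):
    """Replace all source forms in one pass, trying longer forms first."""
    if not mapping:
        return text
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(form) for form in alternatives))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


rules = json.loads(SAMPLE)["ruleset"]
level2_to_level1 = build_mapping(rules, source_level=2, target_level=1)
normalized = substitute("Ma\u0364dchen", level2_to_level1)
print(normalized)
```

A single regular-expression pass is used so that the output of one replacement is never rewritten by another rule. As the note near the top of the README cautions, normalizing to level 1 automatically is only advisable after the corresponding rules have been checked manually.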

XML Format

    <levelrules>
      <ruleset>
        <range>AlphPresForm</range>
        <desc>LATIN SMALL LIGATURE FF</desc>
        <rule>ff</rule>
        <rule>ff</rule>
        <rule>ff</rule>
        <type>level</type>
      </ruleset>
    </levelrules>

Elements:

  • <levelrules> = root element of a gt-MufiLevelRules dataset
  • <ruleset> = root element of a ruleset
  • <range> = category of characters
  • <desc> = general description of the sign or symbol
  • <rule>
    • Level 1: rule[position() = 1]
    • Level 2: rule[position() = 2]
    • Level 3: rule[position() = 3]

The category of characters <range> and the general description of the sign or symbol <desc> were imported from the MUFI dataset.
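The XML layout described above can be read with Python's standard library. This is a minimal sketch that assumes the files use no XML namespace and that the order of the <rule> elements is the level order; the function name `read_rulesets` and the dictionary keys are hypothetical. The sample data is copied from the example above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<levelrules>
  <ruleset>
    <range>AlphPresForm</range>
    <desc>LATIN SMALL LIGATURE FF</desc>
    <rule>ff</rule>
    <rule>ff</rule>
    <rule>ff</rule>
    <type>level</type>
  </ruleset>
</levelrules>"""


def read_rulesets(xml_text):
    """Return one dictionary per <ruleset>, with the level forms in document order."""
    root = ET.fromstring(xml_text)
    records = []
    for ruleset in root.iter("ruleset"):
        # An empty <rule/> has text None; store it as "" to mean "no definition".
        levels = [(rule.text or "") for rule in ruleset.findall("rule")]
        records.append({
            "range": ruleset.findtext("range", default=""),
            "desc": ruleset.findtext("desc", default=""),
            "levels": levels,
            "type": ruleset.findtext("type", default=""),
        })
    return records


records = read_rulesets(SAMPLE)
for record in records:
    for number, form in enumerate(record["levels"], start=1):
        print(f"{record['desc']}: level {number} = {form!r}")
```

Index 0 of `levels` corresponds to rule[position() = 1], so level n is found at index n - 1.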

The JSONPaths are:

  • range: $['..']['range']
  • desc: $['..']['description']

See Also

  • MUFI: The Medieval Unicode Font Initiative https://mufi.info/
  • MUFI's data as JSON export https://gefin.ku.dk/q.php?q=mufiexport
  • OCR-D Ground Truth Transcription Guidelines https://ocr-d.de/en/gt-guidelines/trans/
  • Ground Truth level overview https://ocr-d.de/en/gt-guidelines/trans/trLevels.html

Owner

  • Name: OCR-D
  • Login: OCR-D
  • Kind: organization

DFG-Koordinierungsprojekt zur Weiterentwicklung von Verfahren der Optical Character Recognition

Citation (CITATION.cff)

cff-version: 1.2.0
title: gt-MufiLevelRules
message: If you use this dataset, please cite it using the metadata from this file.
type: dataset
authors:
    - given-names: Matthias
      family-names: Boenig
      orcid: 'https://orcid.org/0000-0003-4615-4753'
repository-code: 'https://github.com/OCR-D/gt-MufiLevelRules'
url: 'https://github.com/OCR-D/gt-MufiLevelRules'
abstract: Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by the Medieval Unicode Font Initiative (MUFI). The generated OCR-D level rules conform to the OCR-D specification. These rules can be used for substitutions or level checks, among other things.
keywords:
    - ocr-d
    - repository
    - ground-truth
    - level classification
    - level checks
    - guidelines
    - transcription
license: LGPL-3.0
commit: v1.2.5
version: 65_v1.2.5
date-released: '2024-04-18'

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 187
  • Total Committers: 3
  • Avg Commits per committer: 62.333
  • Development Distribution Score (DDS): 0.064
Past Year
  • Commits: 134
  • Committers: 2
  • Avg Commits per committer: 67.0
  • Development Distribution Score (DDS): 0.067
Top Committers
Name Email Commits
Matthias Boenig m****g@g****t 175
github-actions[bot] 4****] 9
Robert Sachunsky 3****y 3
Committer Domains (Top 20 + Academic)
gmx.net: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 1
  • Total pull requests: 2
  • Average time to close issues: 8 days
  • Average time to close pull requests: 25 days
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

.github/workflows/gt-MufiLevelRules.yml actions
  • JamesIves/github-pages-deploy-action v4 composite
  • actions/checkout v3 composite
  • actions/create-release v1 composite
  • actions/upload-release-asset v1 composite
  • thedoctor0/zip-release master composite
.github/workflows/gt-MufiLevelRules2.yml actions
  • JamesIves/github-pages-deploy-action v4 composite
  • actions/checkout v3 composite
  • thedoctor0/zip-release master composite