gt-repo-scripts
XSLT and shell scripts for analyzing and creating GitHub pages of a ground truth repository. These are centrally managed and can be used by all repositories created with gt-repo-template (https://github.com/OCR-D/gt-repo-template).
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.0%) to scientific vocabulary
Statistics
- Stars: 0
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 9
Metadata Files
README.md

gt-repo-scripts
Description
XSLT and shell scripts for analyzing and creating GitHub pages of a ground truth repository. These are centrally managed and can be used by all repositories created with gt-repo-template (https://github.com/OCR-D/gt-repo-template).
The output files come in the following formats:
- Markdown
- ruleset (JSON)
- METS (XML)
- Shell scripts
Overview of scripts or programs
🚀 gt-overview_unitTest.xsl
- It lists all files in the Ground Truth (GT) directory. In a second step, the stylesheet checks whether the expected GT directory structure, with the data and GT-PAGE directories, is present. If other directories or a different directory structure are found, an error is written to pathtest.md.
- It is part of the gtrepo GitHub Actions workflow.
- :wrench: general program call
    java -jar saxon-XX.jar -xsl:scripts/gt-overview_unitTest.xsl \
      output=unitTest1 \
      -s:scripts/gt-overview_unitTest.xsl -o:ghout/pathtest.md
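The directory check described above can be pictured with a small shell sketch. It is not the stylesheet's logic: the layout `data/<document>/GT-PAGE/` and the function name `check_gt_layout` are assumptions made for illustration. Like pathtest.md, the report stays empty when nothing is wrong.

```shell
#!/bin/sh
# List directories directly below data/ that lack a GT-PAGE subdirectory.
# Assumed layout (not confirmed by the text above): data/<document>/GT-PAGE/
check_gt_layout() {
    root=$1
    if [ ! -d "$root/data" ]; then
        echo "missing directory: data"
        return 0
    fi
    for doc in "$root"/data/*/; do
        [ -d "$doc" ] || continue            # glob did not match anything
        [ -d "${doc}GT-PAGE" ] || echo "missing GT-PAGE in: ${doc%/}"
    done
}

# Demo on a throwaway tree: one valid document, one without GT-PAGE.
tmp=$(mktemp -d)
mkdir -p "$tmp/data/doc_ok/GT-PAGE" "$tmp/data/doc_bad/images"
check_gt_layout "$tmp" > "$tmp/pathtest.md"
cat "$tmp/pathtest.md"
rm -rf "$tmp"
```

An empty report file then signals a valid structure, which is the condition the example workflow tests with `[ -s ghout/pathtest.md ]`.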
🚀 gt-overview_metadata.xsl
Environment parameter group:
- Required for the analysis of the ground truth and the creation of the GitHub pages. The values come from GitHub environment variables (https://docs.github.com/en/actions/learn-github-actions/environment-variables).
- repoBase=$GITHUB_REF_NAME
- repoName=$GITHUB_REPOSITORY
- bagitDumpNum=$GITHUB_RUN_NUMBER
Output parameter group:
- Specifies the type of analysis and the form in which the result is displayed.
- output=METADATA -> transform METADATA and create the GT overview
- output=TABLE -> compressed table view
- output=OVERVIEW -> detailed table view
Metadata parameter group:
- Indicates that a metadata set is created for the GT corpus and that the README file is adapted.
- output=METS -> generate metadata for the (METS) ingest in the OCR-D workflow; mets.sh is generated
- output=METSvolume -> generate METS metadata for the whole corpus
- output=METSdefault -> generate a METS metadata file without the DEFAULT fileGrp (file group); the METS file(s) contain only the release files
- output=README -> create a customized README file
- :wrench: general program call
    java -jar saxon-XX.jar -xsl:scripts/gt-overview_metadata.xsl \
      output=XX repoBase=$GITHUB_REF_NAME repoName=$GITHUB_REPOSITORY bagitDumpNum=$GITHUB_RUN_NUMBER \
      -s:scripts/gt-overview_metadata.xsl -o:XX
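Outside a GitHub Actions run the three environment variables are not set, so a local call needs substitutes. The following sketch only prints the resulting command line and does not start Saxon. The function name `build_metadata_call`, the variable `SAXON_JAR`, and the fallback values `main`, `owner/repo` and `0` are placeholders invented for this example.

```shell
#!/bin/sh
# Print (not run) the Saxon call for gt-overview_metadata.xsl.
# SAXON_JAR and all fallback values are hypothetical stand-ins for local runs.
build_metadata_call() {
    output=$1      # e.g. METADATA, TABLE, OVERVIEW
    target=$2      # output file, e.g. ghout/table.md
    printf '%s\n' "java -jar ${SAXON_JAR:-saxon-XX.jar} -xsl:scripts/gt-overview_metadata.xsl output=$output repoBase=${GITHUB_REF_NAME:-main} repoName=${GITHUB_REPOSITORY:-owner/repo} bagitDumpNum=${GITHUB_RUN_NUMBER:-0} -s:scripts/gt-overview_metadata.xsl -o:$target"
}

build_metadata_call TABLE ghout/table.md
```

Printing the command first makes it easy to check the parameter values before a real transformation is started.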
🚀 gt-level_parser.xsl
- It is a rule-based parser that determines the transcription level and the structure level of a single PAGE file and of the whole corpus of PAGE files. There are three transcription levels and two structure levels.
- The parser determines the frequencies of the characters and structures (regions) that are defined in the rules. Based on this analysis, a specific level is determined for the page and for the corpus.
- gt-level_parser.xsl includes gt-level_structure.xsl. gt-level_structure.xsl specialises in determining the regions used in the PAGE-XML files. It is not intended to be called on its own.
- The output file is overview-level.md. It contains the level matrix, the result of the analysis.
- :wrench: general program call
    java -jar saxon-XX.jar -xsl:scripts/gt-level_parser.xsl \
      repoName=$GITHUB_REPOSITORY \
      -s:scripts/gt-level_parser.xsl -o:ghout/overview-level.md
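The idea of counting rule characters and deriving a level from them can be sketched in a few lines of shell and awk. This is an illustration, not the parser's algorithm: the rule format (`<character> <level>` per line), the function name `page_level`, and the choice to report the highest level among the characters found are all assumptions. The real rules are stored in megalevelrules.xml, and structure levels and the corpus level are not covered here.

```shell
#!/bin/sh
# Count how often each rule character occurs in a page text and report the
# highest level among the characters found (assumed rule, see lead-in).
# Hypothetical rule format: one "<character> <level>" pair per line.
# Characters are compared byte by byte, so the demo sticks to ASCII.
page_level() {
    # $1: text file of one page; $2: rule file
    awk '
        NR == FNR { level[$1] = $2; next }       # first file: load the rules
        {
            for (i = 1; i <= length($0); i++) {  # second file: count matches
                c = substr($0, i, 1)
                if (c in level) count[c]++
            }
        }
        END {
            max = 0
            for (c in count) {
                printf "%s\t%d\tlevel %d\n", c, count[c], level[c]
                if (level[c] + 0 > max) max = level[c] + 0
            }
            printf "page level: %d\n", max
        }
    ' "$2" "$1"
}

# Demo with placeholder rules: "e" counts as level 1, "&" as level 2.
tmp=$(mktemp -d)
printf 'e 1\n& 2\n' > "$tmp/rules.txt"
printf 'Tee & See\n' > "$tmp/page.txt"
page_level "$tmp/page.txt" "$tmp/rules.txt"
rm -rf "$tmp"
```

The per-character lines correspond loosely to the frequencies the parser collects, and the last line to the level assigned to the page.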
🚀 gt-coll_metadata.xsl
- gt-coll_metadata.xsl automatically creates a readme file for a collection/corpus of Ground Truth repositories.
- :wrench: general program call
    java -jar saxon-xx.jar -xsl:scripts/gt-coll_metadata.xsl \
      -s:scripts/gt-coll_metadata.xsl -o:README.md
🚀 data_structure.sh
- Analyses the data structure, determines the METS metadata file, and then creates the BagIt files. For BagIt see: https://ocr-d.de/en/spec/ocrd_zip
- :wrench: general program call
    sh scripts/data_structure.sh
🚀 data_mets.sh
- During the GitHub Actions workflow, METS files that do not contain an OCR-D-IMG fileGrp are deleted.
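A minimal sketch of such a clean-up step is shown below. It is not the content of data_mets.sh: the function name `prune_mets`, the file name `mets.xml`, and the plain text match on `USE="OCR-D-IMG"` (instead of real XML parsing) are simplifying assumptions.

```shell
#!/bin/sh
# Delete every mets.xml below a directory that has no OCR-D-IMG fileGrp.
# Simplification: a plain text match on USE="OCR-D-IMG", not an XML parse.
prune_mets() {
    find "$1" -type f -name 'mets.xml' | while IFS= read -r mets; do
        if ! grep -q 'USE="OCR-D-IMG"' "$mets"; then
            echo "removing $mets"
            rm -- "$mets"
        fi
    done
}

# Demo: one METS file with an image file group, one without.
tmp=$(mktemp -d)
mkdir -p "$tmp/with_img" "$tmp/without_img"
printf '<mets:fileGrp USE="OCR-D-IMG"/>\n' > "$tmp/with_img/mets.xml"
printf '<mets:fileGrp USE="OCR-D-GT-SEG-PAGE"/>\n' > "$tmp/without_img/mets.xml"
prune_mets "$tmp"
rm -rf "$tmp"
```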
🚀 readmefolder.sh
- Archives the original README file in the readme_old folder.
- :wrench: general program call
    sh scripts/readmefolder.sh
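The archiving step amounts to a few file operations. The sketch below assumes that the README is called README.md and that it is copied rather than moved; both points, and the function name `archive_readme`, are assumptions and may differ from readmefolder.sh.

```shell
#!/bin/sh
# Copy README.md into readme_old/ before a new README is generated.
# Copying (instead of moving) is an assumption of this sketch.
archive_readme() {
    dir=${1:-.}
    [ -f "$dir/README.md" ] || return 0      # nothing to archive
    mkdir -p "$dir/readme_old"
    cp "$dir/README.md" "$dir/readme_old/README.md"
}

# Demo in a throwaway directory.
tmp=$(mktemp -d)
echo "# original README" > "$tmp/README.md"
archive_readme "$tmp"
ls "$tmp/readme_old"
rm -rf "$tmp"
```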
🚀 xreadme.sh
- Determines the README file and changes its filename extension from Markdown to XML.
- :wrench: general program call
    sh scripts/xreadme.sh
🌻 lang.js
- JavaScript for the automated language switching (German/English) of the level description and the links to the OCR-D-GT Guidelines.
🌻 table_hide.css
- CSS stylesheet for customising the formatting of the GH pages. The GH pages use the dinky template (https://pages-themes.github.io/dinky/).
🌻 levelparser.css
- CSS stylesheet for customising the formatting of the GH pages, in particular for the display of the transcription and structure levels.
Overview of additional files
🖹 megalevelrules.xml
- The megalevelrules.xml file contains all OCR-D Ground Truth transcription level rules. These rules are based on the encodings published by the Medieval Unicode Font Initiative (MUFI).
- These rules are used for so-called level parsing.
- The megalevelrules are generated automatically. See also: https://github.com/OCR-D/gt-MufiLevelRules
- The file available here is a copy of: https://raw.githubusercontent.com/OCR-D/gt-MufiLevelRules/gh-pages/rules/megalevelrules.xml
🌻 metadata.xsl
- The metadata.xsl file updates the metadata file CITATION.cff of the gt-repo-scripts repository. The update is performed by a GitHub Actions workflow.
Github Action Template
The programs and stylesheets can be used in a GitHub Actions workflow, individually or in combination.
- The XSLT stylesheets require an XSLT transformer to be installed.
- OCR-D is used to create the BagIt data containers.
Example Github Action Workflow with the programs
Example 1
See the application in https://github.com/OCR-D/gt-repo-template, which uses:
- gt-overview_unitTest.xsl
- gt-overview_metadata.xsl
- gt-level_parser.xsl
- data_structure.sh
- data_mets.sh
- readmefolder.sh
- xreadme.sh
```yml
name: gtrepo
on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'

  workflow_dispatch:
    inputs:
      tag-name:
        description: Name of the release tag

jobs:
  job1:
    name: uniTest
    runs-on: ubuntu-latest
    permissions:
      checks: write
      contents: write
    # Map a step output to a job output
    outputs:
      output1: ${{ steps.step4.outputs.test }}
      output2: ${{ steps.step4.outputs.test2 }}

    steps:
      - name: Git checkout
        id: step1
        uses: actions/checkout@v4
      # Installation Styles and Saxon
      - name: install analyse xsl-styles
        id: step2
        run: |
          git clone https://github.com/tboenig/gt-repo-scripts.git
          mv gt-repo-scripts/scripts scripts/
          rm -r gt-repo-scripts
      - name: Download and install saxon
        id: step3
        run: |
          wget https://github.com/Saxonica/Saxon-HE/releases/download/SaxonHE12-3/SaxonHE12-3J.zip
          unzip SaxonHE12-3J.zip
      # Installation and Directories
      - name: make gh-pages_out
        run: mkdir ghout
      - name: Get SDK Version from config
        id: lookupSdkVersion
        uses: mikefarah/yq@master
        with:
          cmd: yq -o=json METADATA.yml > METADATA.json
      - name: PathTest
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_unitTest.xsl \
            output=unitTest1 \
            -s:scripts/gt-overview_unitTest.xsl -o:ghout/pathtest.md
        shell: bash
      # Test GT-Page Folder Repo Structure
      - name: Empty
        id: step4
        run: |
          [ -s ghout/pathtest.md ] || echo "test=empty" >> $GITHUB_OUTPUT
          [ ! -s ghout/pathtest.md ] || echo "test2=full" >> $GITHUB_OUTPUT
      # Error Logview
      - name: uniTestError
        id: step5
        if: ${{steps.step4.outputs.test2 == 'full'}}
        run: |
          less ghout/pathtest.md

  job2:
    name: analyse_and_makebagit
    needs: job1
    if: ${{needs.job1.outputs.output1 == 'empty'}}
    runs-on: ubuntu-latest
    permissions:
      checks: write
      contents: write
    steps:
      - name: Using tag name from ref name
        if: github.event.inputs.tag-name == ''
        run: echo "TAG_NAME=$GITHUB_REF_NAME" >> $GITHUB_ENV
      - name: Using tag name from input param
        if: github.event.inputs.tag-name != ''
        run: echo "TAG_NAME=${{ github.event.inputs.tag-name}}" >> $GITHUB_ENV
      - name: Git checkout
        uses: actions/checkout@v4
      # Installation Styles
      - name: install analyse xsl-styles
        run: |
          git clone https://github.com/tboenig/gt-repo-scripts.git
          mv gt-repo-scripts/scripts scripts/
          rm -r gt-repo-scripts
      # Installation GT-Labelling Documentation
      - name: install labeling
        run: |
          git clone https://github.com/tboenig/gt-guidelines.git
      # Installation and Directories
      - name: install jq
        run: sudo apt-get install jq
      - name: Download and install saxon
        run: |
          wget https://github.com/Saxonica/Saxon-HE/releases/download/SaxonHE12-3/SaxonHE12-3J.zip
          unzip SaxonHE12-3J.zip
      - name: make metadata_out
        run: mkdir metadata_out
      - name: make ocrdzip_out
        run: mkdir ocrdzip_out
      - name: make gh-pages_out
        run: mkdir ghout
      - name: make readme_out
        run: sh scripts/readmefolder.sh
      - name: readme.xml file
        run: sh scripts/xreadme.sh
      # Transformation and analyzing
      - name: Get SDK Version from config
        id: lookupSdkVersion
        uses: mikefarah/yq@master
        with:
          cmd: yq -o=json METADATA.yml > METADATA.json
      - name: transform METADATA and make GT-Overview
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=METADATA repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY bagitDumpNum=$GITHUB_RUN_NUMBER releaseTag=${{ env.TAG_NAME }} \
            -s:scripts/gt-overview_metadata.xsl -o:ghout/metadata.md
        shell: bash
      - name: make Compressed table view
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=TABLE repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY \
            -s:scripts/gt-overview_metadata.xsl -o:ghout/table.md
        shell: bash
      - name: detailed table view
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=OVERVIEW repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY \
            -s:scripts/gt-overview_metadata.xsl -o:ghout/overview.md
        shell: bash
      - name: leveling the volume and documents
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-level_parser.xsl \
            repoName=$GITHUB_REPOSITORY \
            -s:scripts/gt-level_parser.xsl -o:ghout/overview-level.md
        shell: bash
      - name: generate mets.sh
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=METS repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY \
            -s:scripts/gt-overview_metadata.xsl -o:scripts/mets.sh
        shell: bash
      - name: generate Metadata JSON file
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=METAJSON repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY bagitDumpNum=$GITHUB_RUN_NUMBER releaseTag=${{ env.TAG_NAME }} \
            -s:scripts/gt-overview_metadata.xsl -o:metadata_out/metadata_l.json
        shell: bash
      - name: format json file and copy to gh branch
        run: |
          jq '.' metadata_out/metadata_l.json > metadata_out/metadata.json
          cp metadata_out/metadata.json ghout/
          rm metadata_out/metadata_l.json
      - name: generate README
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=README repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY \
            -s:scripts/gt-overview_metadata.xsl -o:README.md
        shell: bash
      - name: generate METS Volume File
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=METSvolume repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY bagitDumpNum=$GITHUB_RUN_NUMBER releaseTag=${{ env.TAG_NAME }} \
            -s:scripts/gt-overview_metadata.xsl -o:metadata_out/mets.xml
        shell: bash
      - name: generate release download List
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=download repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY bagitDumpNum=$GITHUB_RUN_NUMBER releaseTag=${{ env.TAG_NAME }} \
            -s:scripts/gt-overview_metadata.xsl -o:ghout/download.txt
        shell: bash
      - name: delete fileGrp DEFAULT
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=METSdefault repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY bagitDumpNum=$GITHUB_RUN_NUMBER releaseTag=${{ env.TAG_NAME }} \
            -s:scripts/gt-overview_metadata.xsl
        shell: bash
      - name: generate CITATION.cff
        run: |
          java -jar saxon-he-12.3.jar -xsl:scripts/gt-overview_metadata.xsl \
            output=CITATION repoBase=${{ env.TAG_NAME }} repoName=$GITHUB_REPOSITORY bagitDumpNum=$GITHUB_RUN_NUMBER releaseTag=${{ env.TAG_NAME }} \
            -s:scripts/gt-overview_metadata.xsl -o:rawCITATION.cff
        shell: bash
      - name: formating CITATION.cff
        id: lookupSdkVersion2
        uses: mikefarah/yq@master
        with:
          cmd: |
            yq -I4 rawCITATION.cff > CITATION.cff
            rm rawCITATION.cff
      - name: Index-link
        run: |
          cd ghout
          ln -s metadata.md index.md
      # Mets handling, Install OCR-D and Bagit
      - name: del invalidMets
        run: sh -ex scripts/data_mets.sh
        shell: bash
      - name: install ocrd, make validMets and bagit
        run: |
          sudo apt-get install -y python3 imagemagick libgeos-dev
          python3 -m venv venv
          source venv/bin/activate
          pip install -U pip 'setuptools>=61'
          pip install ocrd
          ocrd --version
      - name: make validMets
        run: |
          source venv/bin/activate
          sh -ex scripts/mets.sh
      - name: make bagit
        run: |
          source venv/bin/activate
          sh scripts/data_structure.sh
```
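In the workflow above, job2 only runs when the step with the id `step4` has written `test=empty`, that is, when pathtest.md contains no error report. The gating logic can be tried locally with the following sketch. The function name `report_status` is made up for this example, and an ordinary file stands in for `$GITHUB_OUTPUT`.

```shell
#!/bin/sh
# Write test=empty or test2=full, depending on whether the report has content.
report_status() {
    report=$1
    out=$2                       # stands in for $GITHUB_OUTPUT
    if [ -s "$report" ]; then
        echo "test2=full" >> "$out"
    else
        echo "test=empty" >> "$out"
    fi
}

# Demo: an empty report means the directory structure passed the check.
tmp=$(mktemp -d)
: > "$tmp/pathtest.md"
report_status "$tmp/pathtest.md" "$tmp/github_output"
cat "$tmp/github_output"         # prints test=empty
rm -rf "$tmp"
```

The `if`/`else` form is equivalent to the two `[ ... ] || echo` lines in the workflow, because `-s` is true only for an existing, non-empty file.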
Example 2
See the application in https://github.com/tboenig/gtcorpusbenchmark, which uses:
- gt-coll_metadata.xsl
- xreadme.sh
```yml
name: gtrepo
on:
  push:
    tags:
      - 'v[0-9]+.[0-9]+.[0-9]+'

  workflow_dispatch:

jobs:
  cli:
    name: makeDescription
    runs-on: ubuntu-latest
    permissions:
      checks: write
      contents: write

    steps:
      - name: Git checkout
        uses: actions/checkout@v3
      # Create Directories
      - name: create directories
        run: |
          mkdir frak
          mkdir ant
          mkdir fontmix
          mkdir frak/frak_simple
          mkdir frak/frak_complex
          mkdir ant/ant_simple
          mkdir ant/ant_complex
          mkdir fontmix/fontmix_simple
          mkdir fontmix/fontmix_complex
      # Clone Repos
      - name: clone repos and delete files
        run: |
          cd frak
          cd frak_simple
          git clone https://github.com/tboenig/16_frak_simple.git --branch gh-pages
          cd 16_frak_simple
          rm -rf _config.yml index.md metadata.md overview.md table.md table_hide.css
          cd ..
          git clone https://github.com/tboenig/17_frak_simple.git --branch gh-pages
          cd 17_frak_simple
          rm -rf _config.yml index.md metadata.md overview.md table.md table_hide.css
      # Installation Styles
      - name: install analyse xsl-styles
        run: |
          git clone https://github.com/tboenig/gt-repo-scripts.git
          mv gt-repo-scripts/scripts scripts/
          rm -r gt-repo-scripts
      # Installation GT-Labelling Documentation
      - name: install labeling
        run: |
          git clone https://github.com/tboenig/gt-guidelines.git
      # Installation Transformer
      - name: Download and install saxon
        run: |
          wget https://sourceforge.net/projects/saxon/files/Saxon-HE/11/Java/SaxonHE11-4J.zip/download
          unzip download
      # Transform Readme
      - name: readme.xml file
        run: sh scripts/xreadme.sh
      # Transformation and analyzing
      - name: generate README
        run: |
          java -jar saxon-he-11.4.jar -xsl:scripts/gt-coll_metadata.xsl \
            -s:scripts/gt-coll_metadata.xsl -o:README.md
        shell: bash
```
Owner
- Name: OCR-D
- Login: OCR-D
- Kind: organization
- Website: https://ocr-d.de
- Twitter: OCR_D_community
- Repositories: 27
- Profile: https://github.com/OCR-D
DFG coordination project for the further development of Optical Character Recognition methods
Citation (CITATION.cff)
cff-version: 1.2.0
title: gt-repo-scripts
message: If you use this dataset, please cite it using the metadata from this file.
type: dataset
authors:
- given-names: Matthias
family-names: Boenig
orcid: 'https://orcid.org/0000-0003-4615-4753'
repository-code: 'https://github.com/OCR-D/gt-repo-scripts'
url: 'https://github.com/OCR-D/gt-repo-scripts'
abstract: XSLT and shell scripts for analyzing and creating GitHub pages of a ground truth repository. These are centrally managed and can be used by all repositories created with gt-repo-template (https://github.com/OCR-D/gt-repo-template).
keywords:
- ocr-d
- repository
- ground-truth
- level classification
- level checks
- guidelines
- transcription
- page-xml
- template
license: CC-BY-SA-4.0
commit: v1.1.8
version: 14_v1.1.8
date-released: '2024-04-18'
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 1
- Total pull requests: 2
- Average time to close issues: about 5 hours
- Average time to close pull requests: about 3 hours
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 2
- Average time to close issues: about 5 hours
- Average time to close pull requests: about 3 hours
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bertsky (1)
Pull Request Authors
- bertsky (4)
Dependencies
- actions/checkout v4 composite
- actions/create-release v1 composite
- mikefarah/yq master composite