cremma-wikipedia

A collection of ground truth to train HTR models on contemporary French handwriting

https://github.com/htr-united/cremma-wikipedia

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.0%) to scientific vocabulary

Keywords

cremmawiki french ground-truth htr wikipedia work-in-progress
Last synced: 6 months ago

Repository

A collection of ground truth to train HTR models on contemporary French handwriting

Basic Info
  • Host: GitHub
  • Owner: HTR-United
  • License: cc-by-4.0
  • Default Branch: master
  • Homepage:
  • Size: 1.68 GB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 2
  • Releases: 8
Topics
cremmawiki french ground-truth htr wikipedia work-in-progress
Created almost 4 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

CREMMA - Wikipedia

Description

The CREMMA WIKIPEDIA project aims at creating a collection of ground truth to train HTR models on contemporary French handwriting.

Each image represents an excerpt from a randomly selected Wikipedia page, copied by hand by volunteers. We then aligned the handwritten portion with the original text, which is also present on the image.

Transcription guidelines

The transcription guidelines follow CREMMA's convention for modern documents. In short:
  • Superscript is preceded by a ^.
  • Strikethrough elements are transcribed as >< when unreadable and as >word< when readable.
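
As an illustration of how these markers might be handled downstream, here is a minimal Python sketch. It is not part of the dataset tooling: the function names `strip_markup` and `struck_words` are hypothetical, and it assumes that ^ marks only the single character that follows it and that > and < never appear as ordinary text in a line.

```python
import re

# Assumed conventions, taken from the summary above:
#   ><      an unreadable struck-through span
#   >word<  a readable struck-through span
#   ^x      the single character after ^ is superscript (assumption)
STRIKE = re.compile(r">([^<>]*)<")
SUPERSCRIPT = re.compile(r"\^(.)")


def struck_words(line: str) -> list[str]:
    """Return the content of every struck-through span ('' if unreadable)."""
    return [match.group(1) for match in STRIKE.finditer(line)]


def strip_markup(line: str, keep_struck: bool = False) -> str:
    """Remove the transcription markers from a line.

    Struck-through text is dropped unless keep_struck is True.
    Superscript characters are kept, without the leading ^.
    """
    def replace_strike(match: re.Match) -> str:
        return match.group(1) if keep_struck else ""

    line = STRIKE.sub(replace_strike, line)
    return SUPERSCRIPT.sub(r"\1", line)


if __name__ == "__main__":
    sample = "le >chat< chien n^o 3"
    print(struck_words(sample))
    print(strip_markup(sample, keep_struck=True))
```

Dropping struck-through text leaves the surrounding spaces in place, so a caller that needs clean spacing would have to collapse whitespace afterwards.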

The text to copy may have included phonetic transcription. Non-French letters and diacritics were rendered as well. See characters.csv for the list of characters used in this dataset. The character set can be normalized using ChocoMufin.
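
To get a feel for what a character inventory involves, the sketch below counts the characters in a few transcription lines using only the Python standard library. It is a stand-in under stated assumptions: it does not read characters.csv (whose column layout is not described here), it does not use ChocoMufin, and the helper names `character_inventory` and `describe` are hypothetical.

```python
import unicodedata
from collections import Counter


def character_inventory(lines):
    """Count characters across lines after NFC normalization.

    NFC is an assumption made for this sketch so that a precomposed
    letter and its decomposed form are counted as the same character.
    """
    counts = Counter()
    for line in lines:
        counts.update(unicodedata.normalize("NFC", line))
    return counts


def describe(char: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char, 'UNKNOWN')}"


if __name__ == "__main__":
    # Hypothetical sample lines, including one phonetic letter.
    sample = ["été", "e\u0301te\u0301", "ɛ"]
    for char, count in sorted(character_inventory(sample).items()):
        print(count, describe(char))
```

Listing code points and names in this way makes unusual phonetic letters or diacritics easy to spot before deciding how to normalize them.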

Related tools

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Owner

  • Name: HTR United
  • Login: HTR-United
  • Kind: organization
  • Location: France

Dependencies

.github/workflows/htr-unted-workflow.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • andymckay/get-gist-action master composite
  • rymndhng/release-on-push-action master composite