tapuscorpus

Ground Truth for French 20th century typewritten documents collected on Gallica and Europeana

https://github.com/htr-united/tapuscorpus

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.6%) to scientific vocabulary

Keywords

dataset french ground-truth htr typewritten
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: HTR-United
  • License: cc-by-4.0
  • Default Branch: main
  • Homepage:
  • Size: 123 MB
Statistics
  • Stars: 5
  • Watchers: 2
  • Forks: 0
  • Open Issues: 3
  • Releases: 3
Topics
dataset french ground-truth htr typewritten
Created over 5 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

TAPUSCORPUS

Description

Ground Truth dataset for French typewritten OCR

Content

154 pairs of images and PAGE XML files divided into 9 sub-corpora.

| # | name | nb of images | Date | GT for segmenter? | GT for recognizer? | link(s) to source images |
| --- | :---- | :---: | :---: | :---: | :---: | ---: |
| 1 | 12148-btv1b52502601r | (15) | X | y | y | https://gallica.bnf.fr/ark:/12148/btv1b52502601r |
| 2 | 2013004387-1962-catalogue-dactylographi-des-enregistrements | (14) | 1962 | n | y | https://www.siv.archives-nationales.culture.gouv.fr/siv/UD/FRANIR050603/c-2cfmj84sh--fdjfx0b91tk6 |
| 3 | bljd-DSN390-DSN355-DSN113(27)-DSN169 | (14) | 1922-1945 | y | y | http://bljd.sorbonne.fr/ark:/naan/a011441804309XrO1fa/e3ff656f45 , http://bljd.sorbonne.fr/ark:/naan/a011441804309Dy4ooR/8f9020bc6e , http://bljd.sorbonne.fr/ark:/naan/a01144180430809P21l/036952bd71 , http://bljd.sorbonne.fr/ark:/naan/a0114418043093CzACj/ae2ff446cd |
| 4 | 12148-btv1b10581912k | (12) | 19?? | y | y | https://gallica.bnf.fr/ark:/12148/btv1b10581912k |
| 5 | AN-96AP1-dossier1-pices1202-dbut-du-journal-dactylographi | (9) | 1914-1917 | y | y | https://www.siv.archives-nationales.culture.gouv.fr/siv/UD/FRANIR_050082/c-65dxt4cy4--yqhpaasluvz1 |
| 6 | 12148-btv1b525062185 | (4) | X | y | y | https://gallica.bnf.fr/ark:/12148/btv1b525062185 |
| 7 | 12148-btv1b10583038c | (6) | 1896 | y | y | https://gallica.bnf.fr/ark:/12148/btv1b10583038c |
| 8 | 12148-btv1b10583138r | (2) | 1962 | y | y | https://gallica.bnf.fr/ark:/12148/btv1b10583138r |
| 9 | 12148-btv1b53097347b | (10) | 1984 | y | y | https://gallica.bnf.fr/ark:/12148/btv1b53097347b |
| 10 | europeana-random-selection | (64) | X | n | y | Europeana |

  • (3): images are blurred (low quality), but nonetheless readable.

Unfortunately, I did not keep track of the URLs of the images taken at random from Europeana.
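As an illustration of how the PAGE XML half of each pair can be read, here is a minimal sketch using only Python's standard library. It assumes the usual PAGE layout, where each `TextLine` holds its transcription in `TextEquiv/Unicode`. The function names, the sample document, and its namespace version are made up for the example and are not taken from the corpus. Tags are matched by local name so that the exact PAGE schema version does not matter.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def read_page_lines(xml_text: str) -> list[str]:
    """Return the transcription of every TextLine in a PAGE XML document."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at the line's own TextEquiv, not at nested Word elements.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            unicode_elem = next(
                (c for c in child if local_name(c.tag) == "Unicode"), None
            )
            if unicode_elem is not None:
                lines.append(unicode_elem.text or "")
            break
    return lines


# Hypothetical sample, not an actual file from the corpus.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>_Paris, le 1^er mai</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>~~~ ~~~</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(read_page_lines(SAMPLE))
```

For a file on disk, the same function can be fed the file's content, for instance `read_page_lines(Path("page.xml").read_text(encoding="utf-8"))`, where `page.xml` is a placeholder name.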

Annotation system

Underscored (underlined) words are each preceded with "_". For example, "This is an example", where "This is an" is underlined, will be transcribed as: "_This _is _an example".

Portions of text that are superscripted are preceded with ^ such as "1er" will be transcribed as "1^er". If several words are superscripted, each word starts with a "^".

Crossed out words are not rendered:

  • words that can be read under the stroke are transcribed as if not crossed out;
  • words that cannot be read under the stroke are transcribed like any portion of text that is not type-written.

Any portion of text that is not type-written is transcribed as a series of ~~~ (always 3). There are as many repetitions of ~~~ (with a space between each instance) as there are words.
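The conventions above can be undone in code when plain text is needed, for instance before computing statistics. The following is a minimal sketch, not part of the dataset's tooling: the function names are hypothetical, and it assumes that "_" only ever appears as a word prefix, that "^" never occurs as a literal character in the source documents, and that each "~~~" placeholder stands alone as a whitespace-separated token.

```python
PLACEHOLDER = "~~~"  # stands for one non-typewritten (or unreadable) word


def strip_markup(line: str, keep_placeholders: bool = False) -> str:
    """Remove the underline and superscript markers from a transcribed line."""
    words = []
    for token in line.split():
        if token == PLACEHOLDER:
            # Non-typewritten words carry no text; drop them unless asked not to.
            if keep_placeholders:
                words.append(token)
            continue
        if token.startswith("_"):
            token = token[1:]  # underlined word
        token = token.replace("^", "")  # superscripted portion, e.g. 1^er
        words.append(token)
    return " ".join(words)


def count_non_typewritten(line: str) -> int:
    """Count the words transcribed as '~~~' in a line."""
    return sum(1 for token in line.split() if token == PLACEHOLDER)


print(strip_markup("_This _is _an example"))         # This is an example
print(strip_markup("le 1^er mai ~~~ ~~~"))           # le 1er mai
print(count_non_typewritten("le 1^er mai ~~~ ~~~"))  # 2
```

Note that stripping the markers discards the formatting information, so this is only appropriate when underlining and superscripts are irrelevant to the task at hand.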

How to cite

This dataset was built and is maintained by Alix Chagué (@alix-tz). The original works and their digitization are all copyright-free, but properly annotating a corpus takes time and is a task that should be recognized. If you use any item from this corpus of ground truth, cite the dataset using the following information:

    name: 'Tapuscorpus'
    url: 'https://github.com/HTR-United/tapuscorpus'
    author: 'Alix Chagué'
    month: 'janv'
    year: '2021'
    version: '{any version}'
    description: 'Ground Truth dataset for French typewritten OCR (20th century documents)'
    language: 'French'
    time: '1900-1999'
    hands: '30'
    license:
      - {name: 'CC-BY 4.0', url: 'https://creativecommons.org/licenses/by/4.0/'}
    format: PAGE-XML
    volume:
      - {count: "150", metric: pages}

This work is licensed under a Creative Commons Attribution 4.0 International License.

Owner

  • Name: HTR United
  • Login: HTR-United
  • Kind: organization
  • Location: France

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

.github/workflows/htr-united.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • andymckay/get-gist-action master composite
  • rymndhng/release-on-push-action master composite