tapuscorpus
Ground Truth for French 20th century typewritten documents collected on Gallica and Europeana
Statistics
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 3
- Releases: 3
Metadata Files
README.md
TAPUSCORPUS
Description
Ground Truth dataset for French typewritten OCR
Content
154 pairs of images and PAGE XML files divided into 9 sub-corpora.
| # | name | nb of images | Date | GT for segmenter? | GT for recognizer? | link(s) to source images |
| --- | :---- | :---: | :---: | :---: | :---: | ---: |
| 1 | 12148-btv1b52502601r | (15) | X | y | y | https://gallica.bnf.fr/ark:/12148/btv1b52502601r |
| 2 | 2013004387-1962-catalogue-dactylographi-des-enregistrements | (14) | 1962 | n | y | https://www.siv.archives-nationales.culture.gouv.fr/siv/UD/FRANIR050603/c-2cfmj84sh--fdjfx0b91tk6 |
| 3 | bljd-DSN390-DSN355-DSN113(27)-DSN169 | (14) | 1922-1945 | y | y | http://bljd.sorbonne.fr/ark:/naan/a011441804309XrO1fa/e3ff656f45 , http://bljd.sorbonne.fr/ark:/naan/a011441804309Dy4ooR/8f9020bc6e , http://bljd.sorbonne.fr/ark:/naan/a01144180430809P21l/036952bd71 , http://bljd.sorbonne.fr/ark:/naan/a0114418043093CzACj/ae2ff446cd |
| 4 | 12148-btv1b10581912k | (12) | 19?? | y | y | https://gallica.bnf.fr/ark:/12148/btv1b10581912k |
| 5 | AN-96AP1-dossier1-pices1202-dbut-du-journal-dactylographi | (9) | 1914-1917 | y | y | https://www.siv.archives-nationales.culture.gouv.fr/siv/UD/FRANIR_050082/c-65dxt4cy4--yqhpaasluvz1 |
| 6 | 12148-btv1b525062185 | (4) | X | y | y | https://gallica.bnf.fr/ark:/12148/btv1b525062185 |
| 7 | 12148-btv1b10583038c | (6) | 1896 | y | y | https://gallica.bnf.fr/ark:/12148/btv1b10583038c |
| 8 | 12148-btv1b10583138r | (2) | 1962 | y | y | https://gallica.bnf.fr/ark:/12148/btv1b10583138r |
| 9 | 12148-btv1b53097347b | (10) | 1984 | y | y | https://gallica.bnf.fr/ark:/12148/btv1b53097347b |
| 10 | europeana-random-selection | (64) | X | n | y | Europeana |
- (3): images are blurred (low quality), but readable nonetheless.
Unfortunately, I did not keep track of the URLs of the images taken at random from Europeana.
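As an illustration of how the transcriptions in the PAGE XML files can be read, here is a minimal sketch using only the Python standard library. It assumes the usual PAGE layout in which each `TextLine` carries its transcription in `TextEquiv/Unicode`. The sample document, the namespace version and the `extract_lines` helper are hypothetical and simplified (the sample omits coordinates, so it is not schema-valid); they are not taken from this repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PAGE XML document used only for illustration.
# Real files also contain Coords/Baseline elements and may declare a
# different version of the PAGE namespace.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>_Paris, le 1^er janvier</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <TextEquiv><Unicode>Cher ami, ~~~ ~~~</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across PAGE versions."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(xml_text):
    """Return (line id, transcription) pairs for every TextLine element."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Look only at direct children, so word-level TextEquiv is ignored.
        for child in elem:
            if local_name(child.tag) != "TextEquiv":
                continue
            for node in child:
                if local_name(node.tag) == "Unicode":
                    lines.append((elem.get("id"), node.text or ""))
            break
    return lines


if __name__ == "__main__":
    for line_id, text in extract_lines(SAMPLE_PAGE_XML):
        print(line_id, text)
```

To process a file from the corpus, the same function could be called on the content of the file, for example `extract_lines(open(path, encoding="utf-8").read())`, where `path` points to one of the PAGE XML files.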
Annotation system
Underlined words are each preceded by "_". For example, "This is an example", with the first three words underlined, is transcribed as "_This _is _an example".
Superscripted portions of text are preceded by "^". For example, "1er" with "er" in superscript is transcribed as "1^er". If several words are superscripted, each word starts with a "^".
Crossed-out words are not rendered as such:
- words that can be read under the stroke are transcribed as if they were not crossed out;
- words that cannot be read under the stroke are transcribed like any portion of text that is not typewritten.
Any portion of text that is not typewritten is transcribed as a series of ~~~ (always 3 tildes). There are as many repetitions of ~~~ (with a space between each instance) as there are words.
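The conventions above can be decoded mechanically. The following is a minimal sketch, not a tool shipped with this corpus: the `parse_token` and `to_plain_text` helpers are hypothetical, and the code assumes that markers apply to whitespace-delimited tokens and that `~~~` always stands alone as a token (punctuation attached to a marker is not handled).

```python
NON_TYPEWRITTEN = "~~~"  # placeholder for one word that is not typewritten


def parse_token(token):
    """Decode one whitespace-delimited token of a transcription line."""
    if token == NON_TYPEWRITTEN:
        return {"text": "", "underlined": False,
                "superscript": "", "typewritten": False}
    underlined = token.startswith("_")
    body = token[1:] if underlined else token
    # Everything after the first "^" is the superscripted part.
    base, _, superscript = body.partition("^")
    return {"text": base + superscript, "underlined": underlined,
            "superscript": superscript, "typewritten": True}


def to_plain_text(line):
    """Strip the annotation markers and drop non-typewritten placeholders."""
    tokens = [parse_token(tok) for tok in line.split()]
    return " ".join(tok["text"] for tok in tokens if tok["typewritten"])


if __name__ == "__main__":
    print(to_plain_text("_This _is _an example"))
    print(to_plain_text("le 1^er ~~~ ~~~ janvier"))
```

Keeping the decoded flags in a dictionary rather than discarding them makes it possible to rebuild formatted output later, for instance to render underlined or superscripted text in another format.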
How to cite
This dataset was built and is maintained by Alix Chagué (@alix-tz). The original works and their digitization are all copyright-free, but properly annotating a corpus takes time and is a task that should be recognized. If you use any item from this corpus of ground truth, cite the dataset using the following information:
name: 'Tapuscorpus'
url: 'https://github.com/HTR-United/tapuscorpus'
author: 'Alix Chagué'
month: 'janv'
year: '2021'
version: '{any version}'
description: 'Ground Truth dataset for French typewritten OCR (20th century documents)'
language: 'French'
time: '1900-1999'
hands: '30'
license:
- {name: 'CC-BY 4.0', url: 'https://creativecommons.org/licenses/by/4.0/'}
format: PAGE-XML
volume:
- {count: "150", metric: pages}
This work is licensed under a Creative Commons Attribution 4.0 International License.
Owner
- Name: HTR United
- Login: HTR-United
- Kind: organization
- Location: France
- Website: https://htr-united.github.io
- Repositories: 21
- Profile: https://github.com/HTR-United
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- andymckay/get-gist-action master composite
- rymndhng/release-on-push-action master composite
