peraire-ground-truth

Ground Truth for Digital Peraire

https://github.com/alix-tz/peraire-ground-truth

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found
✓ .zenodo.json file: found
✓ DOI references: found 2 DOI reference(s) in README
✓ Academic publication links: links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.1%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: alix-tz
License: cc-by-4.0
Default Branch: master
Size: 489 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 4
Releases: 3

Created almost 4 years ago · Last pushed over 1 year ago

Peraire Ground Truth

License

This dataset and model are published under the CC-BY 4.0 License.

To cite this dataset:

Chagué, A., & Pérez, G. (2023). Peraire Ground Truth (Version 2.0.0) [Data set]. https://doi.org/10.5281/zenodo.7185907

Description

This dataset was created to train a handwritten text recognition (HTR) model for the Digital Peraire project. The documents are handwritten, date from the second half of the 20th century, and are written in French with a blue ink pen or, more frequently, a blue pencil. Occasional marginal notes appear in red.

Transcription guidelines

The transcription reproduces what is written on the document, including punctuation and spelling errors.

The case is respected: capital letters are transcribed with capital letters.

Crossed-out words are marked with #, which is not used to transcribe anything else.

When a "v"-like sign is used to signal an insertion, it is transcribed with the character ``.
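Because # is reserved for crossed-out words, any transcribed line that contains it has at least one deletion. The following Python sketch shows one way to flag such lines and measure how common they are. The function names and the sample strings are illustrative assumptions, not part of the dataset or its tooling, and the guidelines do not specify how many # characters stand for one crossed-out word, so the sketch only tests for presence.

```python
# Minimal sketch: detect transcribed lines that contain a crossed-out word.
# Assumption: each transcription line is available as a plain Python string.

DELETION_MARK = "#"  # reserved for crossed-out words per the guidelines


def has_deletion(line: str) -> bool:
    """Return True if the line contains at least one deletion mark."""
    return DELETION_MARK in line


def deletion_rate(lines) -> float:
    """Return the share of lines that contain a deletion mark (0.0 if empty)."""
    lines = list(lines)
    if not lines:
        return 0.0
    return sum(has_deletion(line) for line in lines) / len(lines)


if __name__ == "__main__":
    # Hypothetical sample lines, invented for illustration only.
    sample = ["le # voyage", "le voyage", "en route", "vers l'est"]
    print(deletion_rate(sample))
```

A check like this can help decide whether to keep, strip, or separately evaluate lines with deletions when training or scoring a model.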

Segmentation guidelines

The SegmOnto ontology was used for the segmentation of this dataset.

For regions, MainZone and MarginTextZone were used. For lines, DefaultLine and InterlinearLine were used.
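As a sanity check, the labels found in a set of annotations can be compared with the four types listed above. The sketch below is an assumption-laden illustration: it expects the region and line labels to have already been extracted as plain strings (the annotation file format is not stated here), it matches labels exactly without handling any subtype suffix, and the function name is hypothetical.

```python
# Minimal sketch: report labels that fall outside the types used in this dataset.
# Assumption: labels were already extracted from the annotation files as strings.

REGION_TYPES = frozenset({"MainZone", "MarginTextZone"})
LINE_TYPES = frozenset({"DefaultLine", "InterlinearLine"})


def unexpected_labels(region_labels, line_labels):
    """Return two sorted lists: unknown region labels and unknown line labels."""
    bad_regions = sorted(set(region_labels) - REGION_TYPES)
    bad_lines = sorted(set(line_labels) - LINE_TYPES)
    return bad_regions, bad_lines


if __name__ == "__main__":
    # "OtherZone" and "OtherLine" are placeholder labels for illustration.
    print(unexpected_labels(["MainZone", "OtherZone"], ["DefaultLine", "OtherLine"]))
```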

Warning: Since the main goal of this dataset was to produce ground truth for the transcription phase, and given how faded the text is on some pages, it is not recommended to use the following images to train a segmentation model:

B.1.intro-eurasie_0005.jpg
B.1b.europe-centrale_0005.jpg
B.2.europe-orientale_0007.jpg
B.26.malais_0048.jpg
B.28.java2_0017.jpg
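When preparing a training set for segmentation, the five images listed above can be filtered out by file name. The Python sketch below is one possible way to do it; the function name and the example directory are hypothetical, and it assumes image paths use forward slashes.

```python
from pathlib import PurePosixPath

# Images the warning above advises against using for segmentation training.
NOT_FOR_SEGMENTATION = frozenset({
    "B.1.intro-eurasie_0005.jpg",
    "B.1b.europe-centrale_0005.jpg",
    "B.2.europe-orientale_0007.jpg",
    "B.26.malais_0048.jpg",
    "B.28.java2_0017.jpg",
})


def segmentation_safe(image_paths):
    """Return the paths whose file name is not on the exclusion list."""
    return [
        path for path in image_paths
        if PurePosixPath(path).name not in NOT_FOR_SEGMENTATION
    ]


if __name__ == "__main__":
    # Hypothetical paths: the "data/" directory and the second file name
    # are placeholders, not confirmed parts of the repository layout.
    paths = ["data/B.1.intro-eurasie_0005.jpg", "data/example_page.jpg"]
    print(segmentation_safe(paths))
```

The excluded pages remain usable for training a transcription model, which is the purpose the ground truth was produced for.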

Sources

The original documents are held at the Bibliothèque Sébert, Espéranto-France, Paris. They should be mentioned every time the images are used.

Model

See the models' README for more information about the training of the model.

Owner

Name: Alix Chagué
Login: alix-tz
Kind: user
Company: Inria

Website: http://alix-tz.github.io
Twitter: Alix_Tz
Repositories: 10
Profile: https://github.com/alix-tz

PhD student in Digital Humanities @ Université de Montréal and Inria.

GitHub Events

Total

Push event: 1
Pull request event: 3
Create event: 4

Last Year

Push event: 1
Pull request event: 3
Create event: 4

Dependencies

.github/workflows/htr-united-workflows.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
andymckay/get-gist-action master composite
