peraire-ground-truth
Ground Truth for Digital Peraire
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.1%) to scientific vocabulary
Repository
Ground Truth for Digital Peraire
Basic Info
- Host: GitHub
- Owner: alix-tz
- License: cc-by-4.0
- Default Branch: master
- Size: 489 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 4
- Releases: 3
Metadata Files
README.md
Peraire Ground Truth
License
This dataset and model are published under the CC-BY 4.0 License.
To cite this dataset:
Chagué, A., & Pérez, G. (2023). Peraire Ground Truth (Version 2.0.0) [Data set]. https://doi.org/10.5281/zenodo.7185907
Description
This dataset was created in order to produce an HTR model for the Digital Peraire project. The documents are handwritten, dating from the second half of the 20th century, written in French with a blue ink pen or, more frequently, with a blue pencil. Occasional marginal notes appear in red.
Transcription guidelines
The transcription reproduces what is written on the document, including punctuation and spelling errors.

Case is preserved: capital letters are transcribed as capital letters.

Crossed-out words are marked with #, which is not used to transcribe anything else.

When a "v"-like sign is used to signal an insertion, it is transcribed with the character ``.
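
Because # is reserved for crossed-out words, its presence in a transcribed line is enough to flag that line. The following is a minimal sketch of such a check, not part of the dataset's tooling: the function names and the sample lines are invented for illustration, and since the guidelines do not specify where # sits relative to the crossed-out word, the sketch only tests for its presence.

```python
# Marker reserved by the transcription guidelines for crossed-out words.
CROSSED_OUT_MARKER = "#"


def has_crossed_out_text(line: str) -> bool:
    """Return True if a transcribed line contains the crossed-out marker."""
    return CROSSED_OUT_MARKER in line


def split_by_crossed_out(lines):
    """Split transcribed lines into (clean, marked) lists.

    `marked` holds the lines that contain at least one crossed-out marker.
    """
    clean, marked = [], []
    for line in lines:
        (marked if has_crossed_out_text(line) else clean).append(line)
    return clean, marked


# Invented sample lines, not taken from the dataset.
sample = ["Je pars demain matin", "Il fait # beau", "La route est longue"]
clean, marked = split_by_crossed_out(sample)
```

A split like this can help when deciding whether lines with deletions should be kept in or left out of a training set.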

Segmentation guidelines
The SegmOnto ontology was used for the segmentation of this dataset.
For regions, MainZone and MarginTextZone were used. For lines, DefaultLine and InterlinearLine were used.
| Regions | Lines |
| :-----: | :---: |
| (example image not preserved) | (example image not preserved) |
Warning: Because the main goal of this dataset was to produce ground truth for the transcription phase, and because the text is heavily faded on some pages, it is not recommended to use the following images to train a segmentation model:
- B.1.intro-eurasie_0005.jpg
- B.1b.europe-centrale_0005.jpg
- B.2.europe-orientale_0007.jpg
- B.26.malais_0048.jpg
- B.28.java2_0017.jpg
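
One way to apply this warning when assembling a segmentation training set is to filter image paths against the list above. The sketch below is only an illustration: the function name, the `data/` directory, and the file `B.1.intro-eurasie_0006.jpg` are assumptions and may not match the repository layout.

```python
from pathlib import PurePath

# File names listed in the warning above.
EXCLUDED_FROM_SEGMENTATION = frozenset({
    "B.1.intro-eurasie_0005.jpg",
    "B.1b.europe-centrale_0005.jpg",
    "B.2.europe-orientale_0007.jpg",
    "B.26.malais_0048.jpg",
    "B.28.java2_0017.jpg",
})


def segmentation_training_images(paths):
    """Keep only the images whose file name is not on the exclusion list."""
    return [
        path for path in paths
        if PurePath(path).name not in EXCLUDED_FROM_SEGMENTATION
    ]


# Hypothetical paths; the directory and the second file name are made up.
candidates = [
    "data/B.1.intro-eurasie_0005.jpg",
    "data/B.1.intro-eurasie_0006.jpg",
    "data/B.28.java2_0017.jpg",
]
kept = segmentation_training_images(candidates)
```

Matching on the file name rather than the full path keeps the filter independent of where the images are stored locally.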
Sources
The original documents are held at the Bibliothèque Sébert, Espéranto-France, Paris. This source should be mentioned every time the images are used.
Model
See the models' README for more information about the training of the model.
Owner
- Name: Alix Chagué
- Login: alix-tz
- Kind: user
- Company: Inria
- Website: http://alix-tz.github.io
- Twitter: Alix_Tz
- Repositories: 10
- Profile: https://github.com/alix-tz
PhD student in Digital Humanities @ Université de Montréal and Inria.
GitHub Events
Total
- Push event: 1
- Pull request event: 3
- Create event: 4
Last Year
- Push event: 1
- Pull request event: 3
- Create event: 4
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- andymckay/get-gist-action master composite