peraire-ground-truth

Ground Truth for Digital Peraire

https://github.com/alix-tz/peraire-ground-truth

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: alix-tz
  • License: cc-by-4.0
  • Default Branch: master
  • Size: 489 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 4
  • Releases: 3
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Peraire Ground Truth

License

This dataset and model are published under the CC-BY 4.0 License.

To cite this dataset:

Chagué, A., & Pérez, G. (2023). Peraire Ground Truth (Version 2.0.0) [Data set]. https://doi.org/10.5281/zenodo.7185907

Description

This dataset was created in order to produce an HTR model for the Digital Peraire project. The documents are handwritten, dating from the second half of the 20th century, written in French with a blue ink pen or, more frequently, with a blue pencil. Occasional marginal notes appear in red.

Transcription guidelines

The transcription reproduces what is written on the document, including punctuation and spelling errors.

Case is preserved: capital letters are transcribed as capital letters.

Crossed-out words are marked with #, which is not used to transcribe anything else.

When a "v"-like sign is used to signal an insertion, it is transcribed with the character ``.
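
Because # is reserved for crossed-out words, a plain character test is enough to locate the affected lines in a transcription. The snippet below is a minimal sketch, assuming the transcribed lines are already available as Python strings. The function name and the sample lines are invented for illustration and do not come from the dataset.

```python
# Marker reserved by the transcription guidelines for crossed-out words.
CROSSED_OUT_MARK = "#"


def lines_with_deletions(lines):
    """Return (index, line, mark_count) for every line containing the marker."""
    return [
        (index, line, line.count(CROSSED_OUT_MARK))
        for index, line in enumerate(lines)
        if CROSSED_OUT_MARK in line
    ]


# Invented sample lines, not taken from the dataset.
sample = ["Les peuples de l'Asie", "sont # nombreux", "# et # divers"]
flagged = lines_with_deletions(sample)
```

Such a check can help estimate how many lines contain deletions before deciding whether to keep or filter them when training.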

Segmentation guidelines

The SegmOnto ontology was used for the segmentation of this dataset.

For regions, MainZone and MarginTextZone were used. For lines, DefaultLine and InterlinearLine were used.

| Regions | Lines |
| :-----: | :---: |
| visualizing the types of regions | visualizing the type of lines |
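
The four SegmOnto types named above form a small closed vocabulary, so annotations can be checked against it. The following is a minimal sketch, assuming labels have already been extracted from the annotation files as (kind, label) pairs. The extraction step, the function name, and the sample input are hypothetical.

```python
# Types used in this dataset, as stated in the segmentation guidelines.
ALLOWED_TYPES = {
    "region": {"MainZone", "MarginTextZone"},
    "line": {"DefaultLine", "InterlinearLine"},
}


def unexpected_labels(labels):
    """Return the (kind, label) pairs that are not covered by the guidelines."""
    return [
        (kind, label)
        for kind, label in labels
        if label not in ALLOWED_TYPES.get(kind, set())
    ]


# Hypothetical input: pairs already extracted from annotation files.
sample = [
    ("region", "MainZone"),
    ("line", "InterlinearLine"),
    ("region", "NumberingZone"),
    ("line", "MainZone"),
]
problems = unexpected_labels(sample)
```

A label is reported both when it is outside the vocabulary and when a region type is attached to a line, or the reverse.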

Warning: Since the main goal of this dataset was to produce ground truth for the transcription phase, and given how faded the text is on some pages, it is not recommended to use the following images to train a segmentation model:

  • B.1.intro-eurasie_0005.jpg
  • B.1b.europe-centrale_0005.jpg
  • B.2.europe-orientale_0007.jpg
  • B.26.malais_0048.jpg
  • B.28.java2_0017.jpg
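
The exclusion list above can be applied when assembling a training set for segmentation. This is a minimal sketch, assuming images are referenced by file path. The function name, the `data/` directory, and the file `B.1.intro-eurasie_0004.jpg` in the usage example are hypothetical.

```python
from pathlib import PurePath

# File names listed in the warning as unsuitable for segmentation training.
EXCLUDED_FROM_SEGMENTATION = frozenset({
    "B.1.intro-eurasie_0005.jpg",
    "B.1b.europe-centrale_0005.jpg",
    "B.2.europe-orientale_0007.jpg",
    "B.26.malais_0048.jpg",
    "B.28.java2_0017.jpg",
})


def keep_for_segmentation(image_paths):
    """Drop every path whose file name appears in the exclusion list."""
    return [
        path
        for path in image_paths
        if PurePath(path).name not in EXCLUDED_FROM_SEGMENTATION
    ]


# Hypothetical paths; only the second and third names come from the list.
candidates = [
    "data/B.1.intro-eurasie_0004.jpg",
    "data/B.1.intro-eurasie_0005.jpg",
    "data/B.28.java2_0017.jpg",
]
kept = keep_for_segmentation(candidates)
```

Matching on the file name rather than the full path keeps the filter independent of where the images are stored locally.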

Sources

The original documents are held at the Bibliothèque Sébert, Espéranto-France, Paris. They should be credited every time the images are used.

Model

See the models' README for more information about the training of the model.

Owner

  • Name: Alix Chagué
  • Login: alix-tz
  • Kind: user
  • Company: Inria

PhD student in Digital Humanities @ Université de Montréal and Inria.

GitHub Events

Total
  • Push event: 1
  • Pull request event: 3
  • Create event: 4
Last Year
  • Push event: 1
  • Pull request event: 3
  • Create event: 4

Dependencies

.github/workflows/htr-united-workflows.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • andymckay/get-gist-action master composite