Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file
  Found CITATION.cff file
- ✓ codemeta.json file
  Found codemeta.json file
- ✓ .zenodo.json file
  Found .zenodo.json file
- ✓ DOI references
  Found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity
  Low similarity (8.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: HTRogene
- License: cc0-1.0
- Default Branch: main
- Size: 24.5 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 3
Metadata Files
README.md
HTRogène - Medieval French
Introduction
HTRogène is an exploratory project funded by Biblissima+, aiming to develop generic models for automatic transcription of medieval and early modern manuscripts.
This repository focuses on the Medieval French corpus, providing ground-truth data for Handwritten Text Recognition (HTR) and layout segmentation.
The dataset is designed to support the creation of robust and reliable HTR models for French manuscripts.
| Shelfmark | Links | Type | Century | Color Pages | Main Zones | Lines | Characters | Genre |
|-----------------------|-------|-------|---------|-------------|------------|-------|------------|------------|
| Paris, BnF, NAF 4503 | B | verse | 12 | ✗ | 10 | 292 | 9304 | Narratives |
| Paris, BnF, fr. 146 | B | verse | 14 | ✗ | 12 | 414 | 10355 | Narratives |
| Paris, BnF, fr. 12563 | B | verse | 15 | ✗ | 10 | 271 | 8641 | Narratives |
| Paris, BnF, fr. 12575 | B | verse | 15 | ✗ | 10 | 263 | 6181 | Narratives |
Dataset Overview
The dataset comprises carefully selected manuscripts, each containing approximately 10 columns of text (equivalent to 5 bi-column pages or 10 single-column pages).
The data adheres to the Segmonto guidelines, ensuring consistency and compatibility with other datasets following the same standards.
Each image is accompanied by two XML files:
- Files suffixed with .chocomufin.xml are normalized for compliance with broader datasets.
- The other XML files contain repository-specific information.
We recommend using the normalized .chocomufin.xml files for most applications.
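As a minimal sketch of how the two kinds of XML files could be told apart in a local clone, the helper below separates them by suffix. The function name `split_xml_files` and the directory layout are assumptions made for illustration; only the `.chocomufin.xml` suffix comes from this README.

```python
from pathlib import Path

# Suffix documented in this README for the normalized transcriptions.
NORMALIZED_SUFFIX = ".chocomufin.xml"


def split_xml_files(root):
    """Return (normalized, repository_specific) lists of XML paths under root.

    Hypothetical helper: it only inspects file names and makes no
    assumption about the XML schema inside the files.
    """
    normalized, repository_specific = [], []
    for path in sorted(Path(root).rglob("*.xml")):
        if path.name.endswith(NORMALIZED_SUFFIX):
            normalized.append(path)
        else:
            repository_specific.append(path)
    return normalized, repository_specific
```

Calling `split_xml_files("path/to/clone")` (the path is a placeholder) returns both lists, so a training pipeline can take the first one and ignore the second.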
Total number of pages
29
Regions
- MainZone (42)
- NumberingZone (17)
- DecorationZone (1)
- DropCapitalZone (32)
- MarginTextZone (3)
- GraphicZone (1)
Lines
- DefaultLine (1217)
- HeadingLine (1)
- InterlinearLine (22)
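The aggregate counts above can be cross-checked against the per-manuscript table. The short sketch below copies the figures from this README by hand (it does not read the dataset) and sums them. The variable names are illustrative.

```python
# Per-manuscript figures copied from the table in this README:
# (main zones, lines, characters).
manuscripts = {
    "Paris, BnF, NAF 4503": (10, 292, 9304),
    "Paris, BnF, fr. 146": (12, 414, 10355),
    "Paris, BnF, fr. 12563": (10, 271, 8641),
    "Paris, BnF, fr. 12575": (10, 263, 6181),
}

# Aggregate counts copied from the Regions and Lines lists.
main_zone_total = 42
line_types = {"DefaultLine": 1217, "HeadingLine": 1, "InterlinearLine": 22}

total_zones = sum(zones for zones, _, _ in manuscripts.values())
total_lines = sum(lines for _, lines, _ in manuscripts.values())
total_chars = sum(chars for _, _, chars in manuscripts.values())

print(total_zones == main_zone_total)            # True
print(total_lines == sum(line_types.values()))   # True (1240 lines)
print(total_chars)                               # 34481
```

Both totals agree: the four manuscripts account for all 42 main zones and all 1240 lines, for 34481 characters overall.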
Funding and Support
This project is funded by Biblissima+, an observatory for medieval and Renaissance written cultural heritage.
Biblissima+ focuses on the study of the circulation of books and the transmission of texts from the 8th to 18th centuries.
Learn more at the Biblissima+ project page.
License
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
You are free to share and adapt the material, provided appropriate credit is given.
Citation
If you use this dataset in your research, please cite it using the metadata in the repository's CITATION.cff file.
Acknowledgments
We extend our gratitude to the transcribers and supervisors who contributed to the creation of this dataset.
Special thanks to Biblissima+ for their financial support and commitment to advancing the study of medieval manuscripts.
For more information about the HTRogène project and other related resources, please visit the Biblissima+ project page.
Owner
- Name: HTRogene
- Login: HTRogene
- Kind: organization
- Repositories: 1
- Profile: https://github.com/HTRogene
Citation (CITATION.CFF)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: HTRogène, Medieval French corpus of ground-truth for Handwritten Text Recognition and Layout Segmentation
message: >-
  If you use this dataset, please cite it using the
  metadata from this file.
type: dataset
authors:
  - given-names: Vincent
    family-names: Giovannangeli
  - given-names: Ariane
    family-names: Pinche
    orcid: 'https://orcid.org/0000-0002-7843-5050'
  - given-names: Jean-Baptiste
    family-names: Camps
    orcid: 'https://orcid.org/0000-0003-0385-7037'
  - given-names: Thibault
    family-names: Clérice
    email: thibault.clerice@inria.fr
    affiliation: ALMAnaCH, INRIA
    orcid: 'https://orcid.org/0000-0003-1852-9204'
  - given-names: 'Alix'
    family-names: Chagué
    orcid: 'https://orcid.org/0000-0002-0136-4434'
    affiliation: 'Inria, Université de Montréal'
    email: alix.chague@inria.fr
repository-code: 'https://github.com/HTRogene/french'
keywords:
  - HTR
  - French
  - layout segmentation
  - medieval
  - manuscripts
GitHub Events
Total
- Release event: 2
- Public event: 1
- Push event: 1
- Create event: 2
Last Year
- Release event: 2
- Public event: 1
- Push event: 1
- Create event: 2
Dependencies
- actions/checkout v2 composite
- actions/setup-python v2 composite
- andymckay/get-gist-action master composite
- actions/checkout v2 composite
- rymndhng/release-on-push-action master composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/checkout v3 composite
- dieghernan/cff-validator v3 composite

