cremma-medieval
Transcription corpora for training HTR models for medieval manuscripts from the 12th to the 15th century.
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 2 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org, zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (7.1%) to scientific vocabulary
Keywords
Repository
Transcription corpora for training HTR models for medieval manuscripts from the 12th to the 15th century.
Basic Info
Statistics
- Stars: 22
- Watchers: 2
- Forks: 5
- Open Issues: 1
- Releases: 6
Topics
Metadata Files
README.md

cremma-medieval
This project was funded by the DIM MAP in the context of the CREMMA project (https://www.dim-map.fr/projets-soutenus/cremma/)
The cremma-medieval repository was created in order to make available transcription corpora for training HTR models for medieval manuscripts from the 12th to the 15th century.
The CREMMA Medieval dataset has been built with eScriptorium (http://traces6.paris.inria.fr), an interface for HTR ground truth production, and, Kraken, an HTR and layout segmentation engine. It is composed of fifteen Old French manuscripts written between the 13th and 15th centuries, mainly scanned in high definition and color except for one manuscript (Vatican) which is a black and white document and BnF fr. 17229, 13496 and 411 that come from microfilm scans. The datasets is mostly made from pre-existing transcribed texts and the samples size can be very different from one source manuscript to the other. The basis of the dataset is composed of the following transcriptions :
- Bibliothque nationale de France, Arsenal 3516, Crowdsourced transcriptions of the collaborative projects of the Standford Library: Bestiaire de Guillaume le Clerc de Normandie (https://fromthepage.com/stanfordlibraries/guillaume-le-clerc-de-normandie-s-bestiary)
- Bibliothque nationale de France, fr.411, Vie de saint Lambert transcribed by A. Pinche (ENC)
- Bibliothque nationale de France, fr.412, Li Seint Confessor de Wauchier de Denain transcribed by A. Pinche (ENC)
- Bibliothque nationale de France, fr.844, Manuscrit du Roi, Maritem project(https://anr.fr/Projet-ANR-18-CE27-0016) transcribed by V. Mariotti (projet Maritem)
- Bibliothque nationale de France, fr.1728, Le Livre de l'enseignement des roys et des princes, transcribe by No Leroy (Projet Gallic(orpor)a)
- Bibliothque nationale de France, fr.13496, Vie de saint Jrme transcribed by A. Pinche (ENC)
- Bibliothque nationale de France, fr.17229, Vie de saint Jrme transcribed by A. Pinche (ENC)
- Bibliothque nationale de France, fr.25516, Beuve de Hantone transcribed By A. Nolibois (Universit d'Aix-Marseille)
- Bibliothque nationale de France, fr.22550, Les Sept Sages de Thbes, this project just started in Geneva under the direction of Y. Foehr-Janssen (UNIGE), the different have been transcribed by Camille Carnaille (ULB/UNIGE) (fol.157r, 163v, 174v, 178v, 186v, 200v), Prunelle Deleville (UNIGE) (fol. 157v, 178r, 186r, 200r, 204r, 343v), Sophie Lecomte (ULB) (fol. 174v), Aminoel Meylan (UNIGE) (169r), Simone Ventura (ULB) (fol. 163r).
- Cologny, Bodmer, 168 and Vatican, Reg. Lat., 1616, Chanson d'Otinel transcribed by J.-B.Camps (ENC) from the Geste project (https://github.com/Jean-Baptiste-Camps/Geste)
- University of pennsylvania, codex 660, pelerinage de mademoiselle Sapience, transcribed by Ariane Pinche (ENC)
- University of pennsylvania, codex 909, nide, transcribed by Lucien Dugaz (ENC)
- Bibliothque royale, Bruxelles, ms 9232, Examens Moraux, transcribed by Prunelle Deleville (UNIGE)
As the data come from different projects, transcriptions have been standardized to strengthen HTR models. We chose a graphemic transcription method, following D. Stutzmann definitions (see bibliography), to have a sign in the image corresponding to a sign in our text: all the abbreviations are kept, and u/v or i/j are not distinguished. The spaces in the dataset are not homogeneously represented, sometimes transcriptions reproduce the manuscript spacing while others use lexical spaces. It must be stressed that spaces are the most important source of error in medieval HTR models. Most of the transcription follow the layout segmentation of the SegmOnto ontology (https://github.com/SegmOnto/examples), separating the main column, margin, numbering, drop capital, etc.
To ensure the quality of the data, continuous integration workflow (Github Actions) has been put in place checking the segmentation vocabulary (Segmento): XML schema validator (segmentoAltoValidator.xsd), but also the homogeneity of the signs of the characters used in the dataset through a list of authorized signs and translation table (table.csv) with ChocoMufin.
We use releases to make available our HTR models, trained with Kraken, for medieval manuscripts :
- 0.0.1 Arabica, accuracy 89,19% (21/06/21)
- 1.0.0 Bicerin, accuracy 95,49% (21/07/13)
- 1.1.0 Bicerin, accuracy 95,30% (22/06/20)
Models and data are under a Creative Commons Licence CC-BY 4.0.
Table summarizing the corpus (state 22/07/01)
Manuscript | Date | Transcribed Lines | |:---------------------------------------|:------|-------------------:| | BnF, ms fr. 412 | 13th | 6324 | | BnF, Arsenal 3516 | 13th | 1991 | | Cologny, bodmer, 168 | 13th | 1976 | | BnF, ms fr. 24428 | 13th | 1328 | | BnF, ms fr. 25516 | 13th | 717 | | BnF, ms fr. 844 | 13th | 224 | | BnF, ms fr. 17229 | 13th | 164 | | BnF, ms fr. 13496 | 13th | 161 | | BnF, Arsenal 3516 | 13th | 105 | | BnF, ms fr. 22549 | 14th | 2682 | | Vaticane, Reg. Lat., 1616 | 14th | 1772 | | University of pennsylvania, codex 660 | 14th | 368 | | BnF, ms fr. 411 | 14th | 179 | | BnF, ms fr. 1728 | 14th |622 | | University of pennsylvania, codex 909 | 15th | 2513 | | KBR, ms 9232 | 15th | 1006 | | All | | 22662 |
If you want to transcribe texts according to our recommendations, you can load the CremmaLab.json keyboard in the Escriptorium interface. In order to ensure the correct display of the characters, in the FireFox browser, install Stylus plugin (https://addons.mozilla.org/fr/firefox/addon/styl-us/), then the customisation: MUFI for eScriptorium (https://userstyles.world/style/3915/mufi-for-escriptorium).
Bibliography
J.-B. Camps, La Chanson dOtinel : dition complte du corpus manuscrit et prolgomnes ldition critique digital appendices, Ph.D. thesis, 2017. URL: https://zenodo.org/record/1116736#.XN1ufC3M00Q. doi:10.5281/zenodo. 1116736.
J.-B. Camps, T. Clrice, A. Pinche, Stylometry for Noisy Medieval Data: Evaluating Paul Meyers Hagiographic Hypothesis, arXiv:2012.03845). URL: http: //arxiv.org/abs/2012.03845, arXiv: 2012.03845.
J.-B. Camps, C. Vidal-Gorne, M. Vernet, Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches, 2021. URL: https: //hal-enc.archives-ouvertes.fr/hal-03279602.
T. Clrice, Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin, 2019. URL: https://hal.archives-ouvertes. fr/hal-02154122, working paper or preprint.
B. Kiessling, R. Tissot, P. Stokes, D. S. B. Ezra, eScriptorium: An Open Source Platform for Historical Document Analysis, in: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, 2019, pp. 1919. doi:10.1109/ICDARW.2019.10032.
A. Pinche, Edition nativement numrique du recueil hagiographique "Li Seint Confessor" de Wauchier de Denain daprs le manuscrit 412 de la Bibliothque nationale de France, Thse de doctorat, Lyon, Lyon, 2021. URL: http://www.theses. fr/s150996.
D. Stuzmann, Palographie statistique pour dcrire, identifier, dater. . . normaliser pour cooprer et aller plus loin ?, in: F. Fischer, C. Fritze, G. Vogeler (Eds.), Kodikologie und Palographie im digitalen Zeitalter 2 - Codicology and Palaeography in the Digital Age 2, volume 3, Books on Demand (BoD), Norderstedt, 2011, pp. 247277. URL: https://kups.ub.uni-koeln.de/4353/.
Owner
- Name: HTR United
- Login: HTR-United
- Kind: organization
- Location: France
- Website: https://htr-united.github.io
- Repositories: 21
- Profile: https://github.com/HTR-United
GitHub Events
Total
- Release event: 1
- Watch event: 5
- Push event: 2
- Fork event: 1
- Create event: 1
Last Year
- Release event: 1
- Watch event: 5
- Push event: 2
- Fork event: 1
- Create event: 1
Dependencies
- AButler/upload-release-assets v2.0 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- andymckay/get-gist-action master composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/checkout v3 composite
- dieghernan/cff-validator v3 composite