cataloguessegmentationocr
Dataset and models for catalogs' Layout analysis and HTR
https://github.com/imago-catalogues-jjanes/cataloguessegmentationocr
Statistics
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 5
README.md
HTRcatalogs: Dataset for historical catalogs HTR and Segmentation
The Artl@s project focuses on the global circulation of images from the 1890s to the advent of the Internet, using digital methodologies. Among its projects, BasArt is an online database of exhibition catalogs from the 19th and 20th centuries. In order to broaden this database, Caroline Corbières, intern of the project in 2020, worked on automating its workflow: a scanned exhibition catalog is taken as input, then encoded in XML-TEI and structured as CSV.
In this context, the aim of this repository is to improve the OCR and segmentation step of this catalog workflow. This step occurs at the beginning of the workflow and turns page images into text. The idea was not only to refine the OCR for Artl@s but also to provide a useful tool for researchers who need to OCR their own catalogs. The dataset therefore holds three kinds of documents: exhibition catalogs, prepared by Caroline Corbières; manuscript fair catalogs from the 19th century to the present day, from the Katabase project, arranged by Simon Gabay; and owners' directories from the Adresses et Annuaires group of the EHESS Paris Time Machine, produced by Gabriela Elgarrista.
The dataset is composed of 277 original pages (50 pages of Annuaires, 97 pages of manuscript fair catalogs and 130 pages of exhibition catalogs), which were used to create the first models, and 3 complete exhibition catalogs produced using these models. For more information, see dataset.csv.
The schema below explains its process of creation. Since the PAGE XML exported from Transkribus did not include the manual corrections made to the transcription, an XSLT transformation was applied to it. It can be found here. The pages were then prepared and segmented in eScriptorium using the SegmOnto ontology, which provides names for the different zones and lines. This work is developed here. Lastly, the result was exported in ALTO4 format, available in this repository along with the images used.
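To get a quick overview of how a page was segmented, the zone labels of an ALTO4 file can be counted with the Python standard library. The following is a minimal sketch, not part of this repository: the function name `count_zone_labels` and the sample document are hypothetical, and the labels `MainZone` and `NumberingZone` are only illustrative SegmOnto-style names, not necessarily the ones used in this dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace of the ALTO 4 schema.
ALTO_NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}

# Invented sample, only here to make the sketch self-contained.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone"/>
    <OtherTag ID="BT2" LABEL="NumberingZone"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1"/>
    <TextBlock ID="b2" TAGREFS="BT1"/>
    <TextBlock ID="b3" TAGREFS="BT2"/>
    <TextBlock ID="b4"/>
  </PrintSpace></Page></Layout>
</alto>"""


def count_zone_labels(alto_xml: str) -> Counter:
    """Count the TextBlock zones of one ALTO 4 document per tag label."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs (e.g. "BT1") to their human-readable labels.
    labels = {
        tag.get("ID"): tag.get("LABEL")
        for tag in root.findall("a:Tags/a:OtherTag", ALTO_NS)
    }
    counts = Counter()
    for block in root.iter("{%s}TextBlock" % ALTO_NS["a"]):
        refs = (block.get("TAGREFS") or "").split()
        if not refs:
            counts["untagged"] += 1
        for ref in refs:
            counts[labels.get(ref, "unknown")] += 1
    return counts


if __name__ == "__main__":
    for label, n in sorted(count_zone_labels(SAMPLE).items()):
        print(label, n)
```

For a real file, the content can be read with `Path(...).read_text(encoding="utf-8")` and passed to the same function.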
How to build the dataset
In your terminal:
1. Go to the directory 3_Scripts_training_construction
2. Choose the dataset you want:
- All the dataset
- Only one of the catalogs types (Annuaires, exhibition catalogs or manuscripts' fair catalogs)
3. Use the corresponding script with the command bash [SCRIPT]
4. You will get a TrainingData directory containing all the data
The test dataset used for our training can be found in the directory 3_Scripts_training_construction, along with two Python scripts: one splits the data into train, test and eval datasets, and the other removes the entry zones.
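As an illustration of what a train/test/eval split involves, here is a minimal sketch in Python. It is not the script shipped in this repository: the function name `split_pages`, the 80/10/10 ratios, the seed and the file names are all assumptions chosen for the example.

```python
import random


def split_pages(pages, train=0.8, test=0.1, seed=42):
    """Shuffle pages reproducibly and split them into train/test/eval lists.

    The ratios are assumptions for this sketch; whatever is left after
    the train and test shares goes to the eval set.
    """
    if train <= 0 or test < 0 or train + test > 1:
        raise ValueError("train and test must be ratios summing to at most 1")
    # Sort first so the result does not depend on the input order.
    shuffled = sorted(pages)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "eval": shuffled[n_train + n_test:],
    }


if __name__ == "__main__":
    # Hypothetical file names standing in for the 277 pages of the dataset.
    pages = [f"page_{i:03d}.xml" for i in range(277)]
    for name, subset in split_pages(pages).items():
        print(name, len(subset))
```

Fixing the seed keeps the split identical between runs, which matters when models trained at different times are compared on the same test pages.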
Repository
```
├── 1Data
│   ├── annuaires
│   │   ├── altoeScriptorium
│   │   ├── page_Transkribus
│   │   └── images
│   ├── Cat_expositions
│   │   ├── complete_catalogs
│   │   │   ├── altoeScriptorium
│   │   │   └── image
│   │   ├── firstdata
│   │   │   ├── altoeScriptorium
│   │   │   ├── page_Transkribus
│   │   │   ├── pagetransforme
│   │   │   └── image
│   │   └── README.md
│   └── Catmanuscrits
│       ├── altoeScriptorium
│       ├── page_Transkribus
│       ├── pagetransforme
│       └── image
├── 2ToolBox
│   └── Joint Toolbox for dataset's preparation
├── 3_Scripts_training_construction
│   ├── buildtrainalto.sh
│   ├── tests files
│   ├── randomdata.py
│   └── removeentries
├── 4Models
│   ├── HTR
│   │   ├── modelhtrabondance.mlmodel
│   │   ├── modelhtrbeaufort.mlmodel
│   │   ├── modelhtrchaource.mlmodel
│   │   ├── modelhtrdanablu.mlmodel
│   │   ├── modelhtrepoisse.mlmodel
│   │   ├── modelhtrfourme.mlmodel
│   │   ├── modelhtrgruyere.mlmodel
│   │   └── README.md
│   └── Segment
│       ├── modelsegmentationabondance.mlmodel
│       ├── modelsegmentationbeaufort.mlmodel
│       ├── modelsegmentationchaource.mlmodel
│       ├── modelsegmentationcoulommiers.mlmodel
│       └── README.md
├── images
└── Dataset.csv
```
Thanks to
Thanks to Simon Gabay, Claire Jahan, Caroline Corbières, Gabriela Elgarrista and Carmen Brando for their help and work.
Credits
This repository is developed by Juliette Janes, intern of the Artl@s project, with the help of Simon Gabay under the supervision of Béatrice Joyeux-Prunel.
- Manuscripts' catalogs preparation has been done by Simon Gabay.
- Exhibitions' catalogs preparation has been done by Caroline Corbières.
- Annuaires preparation has been done by Gabriela Elgarrista, under the supervision of Carmen Brando.
Licence
Images from catalogs published prior to 1920 and transcriptions are CC-BY.
The other images are extracts from catalogs published after 1920 and remain the intellectual property of their producers.

Cite this repository
Juliette Janes, Simon Gabay, Béatrice Joyeux-Prunel, HTRCatalogs: Dataset for Historical Catalogs Segmentation and HTR, 2021, Paris: ENS Paris https://github.com/Juliettejns/cataloguesSegmentationOCR/
Contacts
If you have any questions or remarks, please contact juliette.janes@chartes.psl.eu and simon.gabay@unige.ch.
Owner
- Name: IMAGO-Catalogues-Jjanes
- Login: IMAGO-Catalogues-Jjanes
- Kind: organization
- Repositories: 2
- Profile: https://github.com/IMAGO-Catalogues-Jjanes