cataloguessegmentationocr

Dataset and models for catalogs' Layout analysis and HTR

https://github.com/imago-catalogues-jjanes/cataloguessegmentationocr

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.3%) to scientific vocabulary

Keywords

alto-xml catalog htr ocr page-xml segmentation segmenter

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: IMAGO-Catalogues-Jjanes
Language: Python
Default Branch: main
Homepage:
Size: 966 MB

Statistics

Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 5

Created over 5 years ago · Last pushed almost 5 years ago

Metadata Files

Readme Citation

HTRcatalogs: Dataset for historical catalogs HTR and Segmentation

The Artl@s project focuses on the global circulation of images from the 1890s to the advent of the Internet, using digital methodologies. Among its projects, BasArt is an online database of exhibition catalogs from the 19th and 20th centuries. To broaden this database, Caroline Corbières, an intern on the project in 2020, worked on automating its workflow: a scanned exhibition catalog is taken as input, encoded in XML-TEI, and structured as CSV.

In this context, the aim of this repository is to improve the OCR and segmentation stage of this catalog workflow. This step occurs at the beginning of the workflow and transforms the data from image to text. The idea was not only to refine the OCR for Artl@s but also to provide a useful tool for researchers who need to OCR their own catalogs. The dataset therefore holds three kinds of documents: exhibition catalogs, prepared by Caroline Corbières; catalogs of manuscript fairs from the 19th century to the present day, from the Katabase project, arranged by Simon Gabay; and owners' directories from the Adresses et Annuaires group of Paris Time Machine at the EHESS, produced by Gabriela Elgarrista.

The dataset is composed of 277 original pages (50 pages of Annuaires, 97 pages of manuscript fair catalogs and 130 pages of exhibition catalogs), which were used to create the first models, plus 3 complete exhibition catalogs produced using these models. For more information, see dataset.csv.
The schema below explains how the dataset was created. Since the PAGE XML exported from Transkribus did not include the manual corrections made to the transcription, an XSLT transformation was applied to it. It can be found here. The pages were then prepared and segmented in eScriptorium using the SegmOnto ontology, which makes it possible to name the different zones and lines. This work is developed here. Lastly, the result was exported in ALTO4 format, which is available in this repository along with the images used.

How to build the dataset

In your terminal:
1. Go to the directory 3_Scripts_training_construction.
2. Choose the dataset you want:
   - The whole dataset
   - Only one of the catalog types (Annuaires, exhibition catalogs or manuscript fair catalogs)
3. Run the corresponding script with the command bash [SCRIPT].
4. You will get a TrainingData directory containing all the data.
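The repository's build scripts are bash scripts and are not reproduced here. As a rough illustration of what gathering the data into a single TrainingData directory might involve, here is a hypothetical Python equivalent. The function `build_training_data`, the list of file extensions and the flat target layout are all assumptions made for this sketch.

```python
import shutil
from pathlib import Path

# Assumed file types: ALTO/PAGE XML files and their page images.
KEPT_SUFFIXES = {".xml", ".jpg", ".jpeg", ".png"}


def build_training_data(source_dirs, target="TrainingData"):
    """Copy XML files and images from source_dirs into one flat directory.

    Files sharing the same name overwrite each other, so this sketch
    assumes file names are unique across the selected directories.
    """
    target_dir = Path(target)
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_dirs:
        for path in sorted(Path(source).rglob("*")):
            if path.is_file() and path.suffix.lower() in KEPT_SUFFIXES:
                destination = target_dir / path.name
                shutil.copy2(path, destination)
                copied.append(destination)
    return copied
```

Passing a single directory corresponds to choosing one catalog type, and passing several corresponds to building the whole dataset.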

The test dataset used for our training can be found in the directory 3_Scripts_training_construction, along with two Python scripts: one splits the data into train, test and eval datasets, and the other removes the entry zones.
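The splitting script itself is not shown here. As an illustration of the general technique, the sketch below performs a reproducible train/test/eval split over a list of page names. The function name `split_dataset`, the 80/10/10 ratios and the seed are assumptions and may differ from what the repository's script actually does.

```python
import random


def split_dataset(pages, train=0.8, test=0.1, seed=42):
    """Shuffle page names reproducibly and split them into three lists.

    The eval set receives whatever remains after the train and test shares.
    """
    if train < 0 or test < 0 or train + test > 1:
        raise ValueError("train and test must be non-negative and sum to at most 1")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(pages)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return {
        "train": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "eval": shuffled[n_train + n_test:],
    }
```

With the 277 original pages and these assumed ratios, the split would contain 221 training pages, 27 test pages and 29 eval pages.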

Repository

```
1Data
    annuaires
        altoeScriptorium
        page_Transkribus
        images
    Cat_expositions
        complete_catalogs
```

Thanks to

Thanks to Simon Gabay, Claire Jahan, Caroline Corbières, Gabriela Elgarrista and Carmen Brando for their help and work.

Credits

This repository is developed by Juliette Janes, intern of the Artl@s project, with the help of Simon Gabay and under the supervision of Béatrice Joyeux-Prunel.
- The manuscript catalogs were prepared by Simon Gabay.
- The exhibition catalogs were prepared by Caroline Corbières.
- The Annuaires were prepared by Gabriela Elgarrista, under the supervision of Carmen Brando.

Licence

Images from catalogs published prior to 1920, as well as the transcriptions, are CC-BY.
The other images are extracts of catalogs published after 1920 and are the intellectual property of their producers.

Cite this repository

Juliette Janes, Simon Gabay, Béatrice Joyeux-Prunel, HTRCatalogs: Dataset for Historical Catalogs Segmentation and HTR, 2021, Paris: ENS Paris, https://github.com/Juliettejns/cataloguesSegmentationOCR/

Contacts

If you have any questions or remarks, please contact juliette.janes@chartes.psl.eu and simon.gabay@unige.ch.
