fondue_wolfflin_fotosammlung

HTR data and models made with the Kunsthistorisches UZH corpus

https://github.com/fondue-htr/fondue_wolfflin_fotosammlung


Repository

Basic Info
  • Host: GitHub
  • Owner: FoNDUE-HTR
  • License: cc-by-4.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 12.8 GB
Statistics
  • Stars: 3
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 3 years ago · Last pushed about 3 years ago

README.md

Fotosammlung von Heinrich Wölfflin


This dataset was created to experiment with the HTR and segmentation tools Kraken and YALTAi on an atypical corpus: the archive of the Kunsthistorisches Institut in Zurich, which contains reproductions of works of art annotated and commented on by archivists. The objective is to find the most efficient method for extracting this heterogeneous textual data. The corpus contains several hands and several typefaces, and the annotations are written in different languages, mostly German, Italian, and French. The documents also contain a large number of proper names and figures.


The organisation of the repository is as follows:

```mermaid
flowchart LR

B{{1_Data}} -.- H>1_1_First_Folderpage]
B -.- I>1_2_Illustrations_Pages]
B -.- J>1_3_ManuscriptLines]
B -.- K>1_4_PrintLines]
B -.- L>1_5_Cremma16-17]
B -.- M>1_6_CremmaMs_20]
B -.- N>1_7_lectaurep-repertoires]
C{{2_Script_training}}  -.- O>2_1_HTR]
C  -.- P>2_2_Segmentation]
D{{3_Models}} -.- Q>3_1_HTR]
D -.- R>3_2_Segmentation]
E{{4_Split}}
F{{5_Script_python}}
G{{6_Images_Readme}}

```

The data are organised in portfolios, which classify them according to the origin of their content. To facilitate the training of the segmentation models, we separated the first pages of the portfolios from the pages containing the scanned illustrations:

  • the folder 1_1_First_Folderpage contains only the first pages (portfolios), which consist mainly of handwritten lines;
  • the folder 1_2_Illustrations_Pages contains the pages with the scanned illustrations.

Using the SegmOnto vocabulary, we wrote a Python script (available in the 5_Script_python folder) that creates two further folders according to the type of writing:

  • 1_3_ManuscriptLines, with XML files containing only handwritten lines;
  • 1_4_PrintLines, with XML files containing only printed lines.

These last two folders are used to evaluate the impact of the type of writing on the training of HTR models.
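The filtering step described above can be sketched as follows. This is only an illustration, not the script from the repository: it assumes ALTO 4 files in which each `TextLine` points, through its `TAGREFS` attribute, to an `OtherTag` whose `LABEL` holds the SegmOnto line type (for example `DefaultLine:print`). The function name `keep_lines_of_type` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the transcriptions are ALTO 4 files using this namespace.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def keep_lines_of_type(alto_xml: str, suffix: str) -> str:
    """Return the ALTO document keeping only lines whose tag label ends with `suffix`."""
    ET.register_namespace("", ALTO_NS)  # serialise without an "ns0:" prefix
    root = ET.fromstring(alto_xml)
    ns = {"a": ALTO_NS}
    # Map tag IDs to their labels, e.g. "LT1" -> "DefaultLine:manuscript".
    labels = {
        tag.get("ID"): tag.get("LABEL", "")
        for tag in root.iterfind(".//a:Tags/a:OtherTag", ns)
    }
    for block in root.iterfind(".//a:TextBlock", ns):
        # Iterate over a copy because lines are removed from the block.
        for line in list(block.findall("a:TextLine", ns)):
            refs = (line.get("TAGREFS") or "").split()
            if not any(labels.get(ref, "").endswith(suffix) for ref in refs):
                block.remove(line)
    return ET.tostring(root, encoding="unicode")
```

Calling it once with `":manuscript"` and once with `":print"` on each file would give the content of two folders of the kind described above.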

HTR Training

To evaluate the efficiency of each model, we divided the data for each writing type into three sets (train, val, test), used respectively to train the model, to evaluate the results at each epoch during training, and to run a final test on data the model has never seen. The different sets are available in the 4_Split folder.
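A reproducible three-way split of this kind can be sketched as below. The ratios (80/10/10), the seed, and the function name `split_dataset` are assumptions made for the example; the proportions actually used for the sets in 4_Split are not stated here.

```python
import random


def split_dataset(files, train=0.8, val=0.1, seed=42):
    """Shuffle file names reproducibly and split them into train/val/test lists."""
    files = sorted(files)               # fixed starting order, independent of the file system
    random.Random(seed).shuffle(files)  # seeded shuffle, so the split can be regenerated
    n_train = int(len(files) * train)
    n_val = int(len(files) * val)
    # Whatever remains after train and val becomes the test set.
    return (
        files[:n_train],
        files[n_train:n_train + n_val],
        files[n_train + n_val:],
    )
```

Splitting at the file level keeps all lines of a page in the same set, so the test set only contains pages the model has never seen.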

Our ground truth consists of 559 portfolio pages (First_Folderpage) and 548 pages with reproductions (Illustrations_Pages). The scripts used to run the HTR training with Kraken are available in the 2_1_HTR folder.

The table below shows the results (character accuracy) obtained during the different training sessions. Results in square brackets are those of the training tests; all other results are those of the evaluation tests. The column headers give the name of the evaluation test set, and the row headers give the datasets used to train the model (in brackets, the name of the corresponding HTR model). Evaluation tests were not performed when the training test results were too low (these cells are marked //).

To obtain the best possible results, some models were trained by combining data from the archives of the Kunsthistorisches Institut with datasets from other projects published on HTR-United, with which fine-tuning was carried out: these are the repositories 1_5_Cremma16-17, 1_6_CremmaMs_20, and 1_7_lectaurep-repertoires.

| Evaluation test set<br>HTR Model | $$\color{Peach}{ManuscriptLines}$$ $$\color{Peach}{Firstfolderpages}$$ | $$\color{SpringGreen}{ManuscriptLines}$$ $$\color{SpringGreen}{Illustrationspages}$$ | $$\color{Periwinkle}{All}$$ $$\color{Periwinkle}{ManuscriptLines}$$ | $$\color{Goldenrod}{All}$$ $$\color{Goldenrod}{PrintLines}$$ | $$\color{SkyBlue}{ManuscriptLines}$$ $$\color{SkyBlue}{+}$$ $$\color{SkyBlue}{PrintLines}$$ $$\color{SkyBlue}{Illustrations pages}$$ | $$\color{Lavender}{All}$$ |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| GT1 Firstfolderpage<br>(HTR_ManuscriptLines_4) | [97] 93.74 | 7.53 | 43.50 | 11.62 | 9.43 | 27.28 |
| GT2 Manuscriptlines illustrationspages<br>(HTR_ManuscriptLines_5) | 89.03 | [80.8] 80.30 | $$\color{Periwinkle}{88.57}$$ | 18.38 | 35.71 | 52.68 |
| GT3 All Manuscriptlines<br>(HTR_ManuscriptLines_6) | $$\color{Peach}{93.93}$$ | 83.53 | [84.3] 85.89 | 17.57 | 33.02 | 27.17 |
| GT4 PrintLines<br>(PrintLines_2) | // | // | // | [0.7] | // | // |
| GT5 illustrationspages<br>(illustrations_pages) | 12.32 | $$\color{SpringGreen}{85.34}$$ | 48.25 | 71.08 | [84.84] $$\color{SkyBlue}{74.04}$$ | 59.84 |
| $$\fcolorbox{red}{Emerald}{GT6 All}$$ | $$\fcolorbox{red}{Emerald}{91.13}$$ | $$\fcolorbox{red}{Emerald}{73.14}$$ | $$\fcolorbox{red}{Emerald}{82.16}$$ | $$\fcolorbox{red}{Emerald}{52.84}$$ | $$\fcolorbox{red}{Emerald}{64.08}$$ | $$\fcolorbox{red}{Emerald}{[80.7]}$$ $$\color{Lavender}\fcolorbox{red}{Emerald}{70.56}$$ |
| GT7 ManuscriptLines<br>+ Lectaurep-repertoires<br>(HTR_ManuscriptLines+Lectaurep) | [24.21] | // | // | // | // | // |
| GT8 PrintLines<br>+ Cremma16-17<br>(HTR_PrintLines+Cremma16-17) | 32.66 | 33.47 | 42.36 | $$\color{Goldenrod}{67.84}$$ | 54.48 | 50.55 |
| GT9 All + Lectaurep-repertoires<br>(AllLines+CremmaMs_20) | 59.91 | 41.72 | 54.87 | 33.92 | 35.52 | 41.64 |
| GT10 Illustrationspages Binaris<br>(illustrations_pages) | // | // | // | // | [36.34] | // |
| GT11 IllustrationspagesContrast<br>(illustrations_pages) | 36.69 | 38.76 | 35.87 | 46.80 | [49.71] 44.97 | 39.70 |

The results are in %; the best results per model are in bold, the best results per set are in colour, and the model with the best averages for all sets is highlighted in green.
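For readers unfamiliar with the metric, character accuracy is commonly defined as the share of reference characters that are not affected by an edit (insertion, deletion, or substitution). The sketch below follows that common definition; it is an assumption that this matches the exact computation behind the figures in the table, and the function names are chosen for the example.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a predicted transcription."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Character accuracy in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * (1 - edit_distance(ref, hyp) / len(ref))
```

For example, `char_accuracy("Zürich", "Zurich")` counts one substitution over six reference characters, which gives roughly 83.3%.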

These tests aimed to determine whether it is preferable to train according to the type of writing (handwritten or printed). It should be noted that the printed lines use very heterogeneous typefaces and are fewer in number than the handwritten lines. Overall, training HTR models according to the type of writing does not considerably improve the HTR results. For example, the GT3 model performs very well on the handwritten lines of the portfolios (one hand only; character accuracy: 93.93%) but less well on the handwritten lines of the documents with reproductions (several hands; character accuracy: 83.53%), whereas the GT6 model performs well across the various types of writing.

Segmentation

The segmentation of the corpus was carried out using the SegmOnto controlled vocabulary. The SegmOnto zones used are:

  • TitlePageZone: for the first page of each portfolio (blue);
  • GraphicZone: for illustrations (turquoise);
  • MarginTextZone: for all text zones that are not in GraphicZone:illustration (purple);
  • NumberingZone: for the numbering (orange).

Concerning the lines, we distinguished handwritten lines from printed lines using sub-types; furthermore, to differentiate the information contained in the headings from the rest, we distinguished two types of lines (Heading and Default):

  • DefaultLine:manuscript;
  • DefaultLine:print;
  • HeadingLine:manuscript;
  • HeadingLine:print.

Models

We first created two segmentation models according to the type of document to be segmented: portfolios (First_Folderpage) and reproductions of works of art (Illustrations_Pages). The results of these two models are, respectively:

  • Segmentation First_Folderpage, average IOU (intersection over union) 42.78%, frequency IOU 0.76;
  • Segmentation Illustrations_Pages, average IOU 26.59%, frequency IOU 0.89.

It is no surprise that the results for the first pages are better: that set is more regular and less complex. We then trained a segmentation model on all the data produced, which gave the following results:

  • Segmentation All, average IOU 44.17%, frequency IOU 0.94.
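To make the two scores easier to read, here is a minimal sketch of how a mean IOU and a frequency-weighted IOU are usually derived from a pixel-level confusion matrix. It is an assumption that the reported figures follow these standard definitions; the function name `iou_scores` and the input layout are chosen for the example.

```python
def iou_scores(confusion):
    """Return (mean IOU, frequency-weighted IOU) from a square confusion matrix.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    ious, weights = [], []
    for c in range(n):
        tp = confusion[c][c]
        true_c = sum(confusion[c])                       # pixels that belong to class c
        pred_c = sum(confusion[r][c] for r in range(n))  # pixels predicted as class c
        union = true_c + pred_c - tp
        if union == 0:
            continue  # class absent from both ground truth and prediction
        ious.append(tp / union)
        weights.append(true_c / total)
    mean_iou = sum(ious) / len(ious)
    freq_iou = sum(w * i for w, i in zip(weights, ious))
    return mean_iou, freq_iou
```

With `[[90, 10], [5, 5]]`, the frequent class has an IOU of about 0.86 and the rare class 0.25, so the mean is about 0.55 while the frequency-weighted score is about 0.80. Under these definitions, a frequency-weighted score well above the mean suggests that the rarer zone types are the ones segmented poorly.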

Results improve overall when the model is trained on all the data together. Based on this observation, we decided to compare the results of the Kraken and YALTAi segmentation tools.

Following the previous experiments, the YALTAi model was trained on the entire ground truth corpus. The results obtained are:

  • Average Accuracy 52.6%
  • Average Recall 36.5%

To better compare the effectiveness of the models, we tried them both on the same set of images they had never encountered:

| | Khist_0175_I_000 | Khist_0175_I_014_R |
|:---:|:---:|:---:|
| **KRAKEN** | [image] | [image] |
| **YALTAi** | [image] | [image] |

The segmentation model made with YALTAi is more accurate and makes very few errors compared to the one made with Kraken. The Kraken model almost systematically misses the NumberingZone and tends to add unwanted zones, which is not the case with the YALTAi model.


Bibliography

  • Clérice, T. You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine, 2022. https://hal-enc.archives-ouvertes.fr/hal-03723208
  • Gabay, S., Camps, J.-B., Pinche, A. & Carboni, N. A Controlled Vocabulary to Describe the Layout of Pages, version 0.9. Paris/Genève, 2021. https://github.com/SegmOnto
  • Gabay, S., Kuenzli, P., Flacone, J.-L., Charpilloz, C. FoNDUE: Documentation, University of Geneva, 2022. https://github.com/fonDUE-HTR/Documentation
  • Kiessling, B. "Kraken - a Universal Text Recognizer for the Humanities", Digital Humanities Conference 2019, Utrecht, The Netherlands, 2019. DataverseNL, V2. https://doi.org/10.34894/Z9G2EX. Version used: 4.2.0, 2022. https://pypi.org/project/kraken/4.2.0/

How to cite

If you use this dataset, please cite it as follows: Jacsont, P., Gabay, S., Weddigen, T. FoNDUE for the Heinrich Wölfflin Fotosammlung of the Kunsthistorisches Institut UZH, version 1.0: Arcimboldo, January 2023.

E-mail

Feel free to contact me if you have any questions or need more information about this repository: pauline.jacsont@unige.ch

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Owner

  • Name: FoNDUE-HTR
  • Login: FoNDUE-HTR
  • Kind: organization
  • Location: Switzerland

Data and models for the FoNDUE project

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

.github/workflows/htr-united-workflows.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • andymckay/get-gist-action master composite