https://github.com/chartes/e-ndp_htr_benchmark

Repository

Basic Info

Host: GitHub
Owner: chartes
License: mit
Language: HTML
Default Branch: main
Size: 223 MB
Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0
Created about 4 years ago · Last pushed over 3 years ago
https://github.com/chartes/e-NDP_HTR_benchmark/blob/main/

# Section 1 : Handwritten text recognition for e-NDP



e-NDP_HTR: training experiments that fit Kraken HTR models ("cores") on transcribed pages from the LL105-LL126 registers of the Archives nationales (AN).


## Definitions

Statistics for each register volume (LL105 to LL126, 26 volumes):


|      Volume     |   Dates   |  Type | N. pages |
|:---------------:|:---------:|:-----:|:--------:|
|       105       | 1326-1352 | Train |    21    |
|       106A      | 1356-1361 | Train |    20    |
|       106B      | 1362-1364 | Train |    11    |
|       107       | 1367-1370 | Train |    10    |
|       108A      | 1392-1394 | Train |    27    |
|       108B      | 1397-1399 | Train |    14    |
|       109A      | 1399-1401 |  Test |     4    |
|       109B      | 1401-1405 | Train |    10    |
|       109C      | 1405-1407 | Train |    12    |
|       110       | 1407-1411 | Train |    17    |
|       111       | 1412-1414 |  Test |     3    |
|       112       | 1414-1424 | Train |    17    |
|       113       | 1425-1432 | Train |    17    |
|       114       | 1433-1437 | Train |    11    |
|       115       | 1440-1444 |  Test |     4    |
|       116       | 1445-1459 |  Test |     4    |
|       117       | 1450-1454 | Train |    60    |
|  118 (minutes)  | 1453-1456 | Train |    13    |
|       119       | 1456-1460 | Train |    18    |
|       120       | 1460-1465 | Train |     3    |
|       121       | 1465-1474 |  Test |     3    |
|       122       | 1474-1481 | Train |    21    |
|     123-124     | 1481-1489 |  Test |     4    |
|       125       | 1489-1493 | Train |   20+16  |
|       126       | 1493-1497 | Train |    18    |
| BnF Latin 17740 | 1430-1444 |  Test |    687   |

References:

- [Projet e-NDP, Notre-Dame de Paris et son cloître : archives des séances du séminaire](https://lamop.hypotheses.org/files/2020/11/e-NDP_seance1_20201020-compresse-1.pdf)
- [SÉRIE LL MONUMENTS ECCLÉSIASTIQUES REGISTRES](http://www.archivesnationales.culture.gouv.fr/chan/chan/fonds/EGF/SA/InvSAPDF/Ll.pdf)

# Architecture

### Architecture 1:


![kraken architecture 1](https://gitlab.com/magistermilitum/home_alcar_kraken/-/raw/main/images/kraken_arch_1.drawio.png)

Hyperparameters: `'hyper_params': {'pad': 24, 'freq': 1.0, 'batch_size': 1, 'lag': 5, 'min_delta': None, 'optimizer': 'Adam', 'lrate': 0.0002, 'momentum': 0.9, 'weight_decay': 0, 'schedule': 'reduceonplateau', 'normalization': None, 'normalize_whitespace': True, 'augment': False, 'step_size': 10, 'gamma': 0.1, 'rop_patience': 3, 'cos_t_max': 50}`

> Trained with Kraken (https://github.com/mittagessen/kraken). Training commands:
>
> Kraken 3.0: `ketos train -N 70 -q dumb -f page --threads 32 -r 0.0001 --schedule reduceonplateau --sched-patience 3 -d cuda:0 --preload --pad 24 -s '[1,128,0,1 Cr4,16,32 Do0.1,2 Mp2,2 Cr4,16,32 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 S1(1x0)1,3 Lbx256 Do0.3,2 Lbx256 Do0.3,2 Lbx256 Do0.3]' --augment training_folder/*.xml`
>
> Kraken 4.0: `ketos train -N 70 -q dumb -f page --workers 32 -r 0.0001 --schedule reduceonplateau --sched-patience 3 -d cuda:0 --pad 24 -s '[1,128,0,1 Cr4,16,32 Do0.1,2 Mp2,2 Cr4,16,32 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 S1(1x0)1,3 Lbx256 Do0.3,2 Lbx256 Do0.3,2 Lbx256 Do0.3]' --augment training_folder/*.xml`
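
The `-s` argument in the commands above is a VGSL network specification. As a reading aid, the following Python sketch splits that spec into its input block and layer tokens and counts the layer types. The prefix-to-layer mapping is an assumption based on Kraken's VGSL notation, and the helper names (`split_spec`, `layer_kind`, `summarize`) are hypothetical; the sketch only inspects the string and does not build or train a network.

```python
import re

# VGSL spec copied from the training commands above.
SPEC = ("[1,128,0,1 Cr4,16,32 Do0.1,2 Mp2,2 Cr4,16,32 Do0.1,2 Mp2,2 "
        "Cr3,8,64 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 S1(1x0)1,3 "
        "Lbx256 Do0.3,2 Lbx256 Do0.3,2 Lbx256 Do0.3]")

# Assumed mapping from token prefix to layer type (illustrative only).
LAYER_KINDS = [
    (r"^C[a-z]", "convolution"),
    (r"^Do", "dropout"),
    (r"^Mp", "max pooling"),
    (r"^S\d", "reshape"),
    (r"^L[frb][xy]", "recurrent (LSTM)"),
    (r"^Gn", "group normalization"),
]


def split_spec(spec):
    """Return (input_block, layer_tokens) for a bracketed VGSL string."""
    tokens = spec.strip().strip("[]").split()
    return tokens[0], tokens[1:]


def layer_kind(token):
    """Classify one layer token by its prefix."""
    for pattern, kind in LAYER_KINDS:
        if re.match(pattern, token):
            return kind
    return "unknown"


def summarize(spec):
    """Read the input block and count layers by type."""
    input_block, layers = split_spec(spec)
    batch, height, width, channels = (int(v) for v in input_block.split(","))
    counts = {}
    for token in layers:
        kind = layer_kind(token)
        counts[kind] = counts.get(kind, 0) + 1
    return {"batch": batch, "height": height, "width": width,
            "channels": channels, "layers": counts}


summary = summarize(SPEC)
print(summary)
```

Read this way, the spec takes single-channel line images of height 128 with variable width (the `0`), followed by a stack of convolution, dropout and pooling layers, a reshape, and three bidirectional recurrent layers.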


### Architecture 2:

# Training


Figure: training board, accuracy on the validation set.


 






# HTR Experiments

## Training and testing data-sets 

- F. Odart de Morchesne:  189 images (1427, 378 pages): https://gallica.bnf.fr/ark:/12148/btv1b9059518w.image  /  http://elec.enc.sorbonne.fr/morchesne/html/morchesne.html
- Cart. ND de Clairmarais :  134 images (1220 - 1460, 268 pages) : https://bvmm.irht.cnrs.fr/mirador/index.php?manifest=https://bvmm.irht.cnrs.fr/iiif/32117/manifest  /  https://doi.org/10.5281/zenodo.5600884
- Livre Rouge (Châtelet de Paris, Y//3) :  34 images (1223 - 1474) : https://www.siv.archives-nationales.culture.gouv.fr/siv/UD/FRAN_IR_056373/c-2xp1c58jd-1w19uv1g8sus6  /  https://gitlab.huma-num.fr/lamop/htr/-/tree/master/Livre_Rouge_Y__3
- e-NDP: LL 108a (27 images) + LL 125 (20) : https://e-ndp-beta.lamop.fr/public/E-NdP/temp/JPEG/
         
         + group 1 : volumes LL 106b-126 (64 images)
         + group 2 : volumes LL 106b-126 (82 images)
         + group 3 : volumes LL 106b-126 (76 images)
         + group 4 : volumes LL 106b-126 (78 images)
         + group 5 : volumes LL 105, 106A, 111, 107, 118, 119, 120, 123-124, 127-128 (82 pages)
         = 429 images (TRAIN)

         + 28 images (semi-external TEST) : LL 109A, 111, 115, 116, 121, 123-124 (volumes not used in training)

Total : 
- TRAIN: 786 images --> 554 folios -> 1109 pages
- TEST: 28 images --> 28 pages
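
The image totals above can be checked with a few lines of Python. This is only an arithmetic sketch: the dictionary keys are labels chosen here for readability, and the counts are copied from the list above.

```python
# Image counts copied from the dataset list above.
external_sources = {
    "Odart de Morchesne": 189,
    "Cart. ND de Clairmarais": 134,
    "Livre Rouge": 34,
}

endp_parts = {
    "LL 108a": 27,
    "LL 125": 20,
    "group 1": 64,
    "group 2": 82,
    "group 3": 76,
    "group 4": 78,
    "group 5": 82,
}

endp_train = sum(endp_parts.values())
train_total = sum(external_sources.values()) + endp_train

print(endp_train)   # 429
print(train_total)  # 786
```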

### External test dataset
- BnF Latin 17740 (1430 - 1444, 687 images): https://gallica.bnf.fr/ark:/12148/btv1b525092040
- Cartulary of Charles II of Navarre (Navarre_Pau_AD_E513, 1297 - 1372, 209 images) : http://earchives.le64.fr/archives-en-ligne/ark:/81221/r13615z7dvnv8k/f1 / https://doi.org/10.5281/zenodo.5600884

## Multilingual 

- Odart de Morchesne : 274 formulas --> 94 lat + 180 fro (35%-40% Latin), where lat = Latin and fro = Old French
- Clairmarais : 178 acts --> 168 lat + 10 fro (92% Latin)
- Livre Rouge (35%-40% Latin)
- e-NDP (almost entirely in Latin)
- Total: 77% Latin / 23% French


## Model versions

Successive HTR model versions, each trained on a larger data set:

- 19/10/2021: V1 core --> Formulaire Odart de Morchesne + Cartulaire de Clairmarais + Livre Rouge + LL 108a (e-ndp_V1)
- 16/11/2021: V2 core --> V1 core + 84 pages (e-ndp transcription group 1)
- 11/01/2022: V3 core --> V1 core + V2 core + 82 pages (e-ndp transcription group 2)
- 10/02/2022: V4 core --> V1 core + V2 core + V3 core + 76 pages (e-ndp transcription group 3)
- 30/06/2022: V5 core --> all of V4 core + 78 pages (e-ndp transcription group 4)
- 17/07/2022: V6 core --> all of V5 core + 40 pages from newly digitized volumes (105, 106A, 111, 107, 118, 119, 120, 123-124, 127-128)
- 16/08/2022: V7 core --> all of V6 core + 42 pages from newly digitized volumes (105, 106A, 111, 107, 118, 119, 120, 123-124, 127-128)


- **val_acc** = accuracy on the validation set _during_ training
- **test_acc** = accuracy on the test corpus _after_ training
- **cer** = character error rate on the test corpus
- **wer** = word error rate on the test corpus
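
Character and word error rates are commonly derived from the Levenshtein edit distance between the reference transcription and the model output. The Python sketch below assumes the standard definitions (number of edits divided by the reference length). The function names and the sample strings are illustrative, and the evaluation logs in the table may apply a different normalization or aggregation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up example line, not taken from the corpus.
reference = "in capitulo fuit conclusum"
hypothesis = "in capitulo fuit conclusam"
print(f"cer={cer(reference, hypothesis):.4f} wer={wer(reference, hypothesis):.4f}")
```

A single wrong character changes one of four words here, which illustrates why the word error rates in the table are several times larger than the character error rates.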

| model_name | Content | arch |val_acc | test_acc |cer | wer | logs |
| ------ | ------ |------ |------ |------ |------ |------ |------ |
| V1_test | Morchesne, Clairmarais, Livre Rouge, 108a |arch_1 | 92.50% | 69.75% |34.88% |71.50% |[log_1](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V1_evaluation) |
| V2_test | V1 core, +LL115 (20 pages), +1 e-ndp group |arch_1 | 94.71% | 83.57% |18.97% | 48.23% |[log_2](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V2_evaluation) |
| V3_test | V1 core, V2 core, +2 e-ndp group |arch_1 |93.90% |86.92% |13.25% |36.24% | [log_3](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V3_evaluation) |
| V3b_test | Only e-ndp transcriptions (193 images) |arch_1| 91.19% |81.90% |18.44% |46.94% |[log_4](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V3b_evaluation) |
| V4_test | V1 core, V2 core, V3 core +3 e-ndp group |arch_1 |93.52% |88.55% |11.43% |32.47% | [log_5](https://github.com/chartes/e-NDP_HTR_benchmark/raw/main/Logs/endp_V4_evaluation) |
| V5_test | V1, V2, V4 cores + 4 e-ndp group |arch_1 |94.48% |90.26% |9.73% |27.61% | [log_6](https://github.com/chartes/e-NDP_HTR_benchmark/raw/main/Logs/endp_V5_evaluation) |
||[all G1 test metrics](https://magistermilitum.gitlab.io/e-ndp_htr/)|
| V3_Latin_17740 | V3 tested on the Latin 17740 manuscript |arch_1| - |89.25% |11.21% |30.28% |[log_7](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3_latin_17740_evaluation) |
| V3b_Latin_17740 | V3b tested on the Latin 17740 manuscript |arch_1| - |82.59% |18.78% |48.17% |[log_8](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3b_latin_17740_evaluation) |
| V7_Latin_17740 | V7 tested on the Latin 17740 manuscript |arch_1| - |91.52% |9.27% |28.28% |[log_9](https://github.com/chartes/e-NDP_HTR/raw/main/Logs/V7_latin_17740_evaluation) |
| V3_Navarre | V3 tested on the Charles II of Navarre manuscript |arch_1| - |82.82% |14.36% |44.42% |[log_10](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3_Navarre_evaluation) |
| V3b_Navarre | V3b tested on the Charles II of Navarre manuscript |arch_1| - |67.81% |29.02% |69.80% |[log_11](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3b_Navarre_evaluation) |
| V7_Navarre | V7 tested on the Charles II of Navarre manuscript |arch_1| - |85.42% |12.52% |37.78% |[log_12](https://github.com/chartes/e-NDP_HTR/raw/main/Logs/V7_Navarre_evaluation) |



# Section 2: Layout Segmentation

Layout segmentation is a required step before HTR: it distinguishes the sections inside a document. The process separates interdependent page zones so that recognition follows a section-by-section order rather than a line-by-line order, which would mix textual and peritextual content.

For e-NDP we define 5 zone types that cover the page layout across all 26 volumes:
1. **Block** : All the central text blocks, which normally correspond to the main content, called "conclusions" in the registers.
2. **Liste** : List of the names of the canons present at the meeting. Normally located before the "conclusions".
3. **Entrée** : Marginal notes or entries that summarize the content of the "conclusions".
4. **Date** : Paragraph containing the date. Normally at the head of a "conclusion", but separate from the main body.
5. **Numérotation** : Page numbers in Roman or Arabic numerals. They usually appear in the corners of the pages.
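
The difference between line-sequence and section-sequence order can be illustrated with a small Python sketch. Everything in it is hypothetical: the `Line` structure, the region identifiers, the coordinates and the sample texts are made up for illustration, and regions are simply ordered by their topmost line, which is an assumption rather than the project's actual reading-order rule.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    y: int           # vertical position of the line on the page
    region_id: str   # identifier of the zone the line belongs to
    zone: str        # zone type, e.g. "Block" or "Date"


def line_order(lines):
    """Naive order: every line sorted top to bottom, zones ignored."""
    return [line.text for line in sorted(lines, key=lambda l: l.y)]


def section_order(lines):
    """Group lines by zone, order zones by their topmost line,
    then read each zone top to bottom."""
    regions = {}
    for line in lines:
        regions.setdefault(line.region_id, []).append(line)
    ordered = sorted(regions.values(), key=lambda ls: min(l.y for l in ls))
    return [l.text for region in ordered for l in sorted(region, key=lambda l: l.y)]


# Made-up page: a two-line main block with a date set beside it.
page = [
    Line("main text, line 1", 100, "r1", "Block"),
    Line("date", 120, "r2", "Date"),
    Line("main text, line 2", 140, "r1", "Block"),
]

print(line_order(page))
print(section_order(page))
```

In line order the date interrupts the main text, whereas in section order the two lines of the block stay together.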



Figure: automatic layout segmentation of an e-NDP page.

## Layout segmentation experiments

We annotated 376 transcribed pages from the e-NDP ground truth (V1 core to V6 core) and experimented with a classical CNN+BiLSTM architecture.

Command to reproduce the training in Kraken 3:
> `ketos segtrain -f page -o seg_model -d cuda:0 -bl --threads 32 --epochs 50 --schedule reduceonplateau -s '[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32]' training_folder/*xml`

Another option is to fine-tune the default blla.mlmodel (https://github.com/mittagessen/kraken/blob/master/kraken/blla.mlmodel):
> `ketos segtrain -i blla.mlmodel -f page -o seg_model -d cuda:0 -bl --threads 32 --resize add --epochs 50 --schedule reduceonplateau -s '[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32]' training_folder/*xml`

## Layout segmentation model versions


- **mean_iu** = Mean intersection over union (IU), averaged over all categories
- **freq_iu** = Frequency-weighted intersection over union (IU)
- **mean_acc** = Mean accuracy (average of the prediction accuracy over all categories)
- **IU** is the overlap ratio between the predicted region and the ground-truth region: the size of their intersection divided by the size of their union.
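
The three metrics can be computed from a pixel-level confusion matrix, following the definitions in Long et al. (2015) cited below. The Python sketch assumes that `confusion[i][j]` counts pixels of true class `i` predicted as class `j` and that every class occurs in the ground truth. The function name and the example matrix are made up, and Kraken's own evaluation code may differ in detail.

```python
def segmentation_metrics(confusion):
    """Return mean_iu, freq_iu and mean_acc for a square confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j.
    """
    n_classes = len(confusion)
    row_totals = [sum(row) for row in confusion]  # pixels per true class
    col_totals = [sum(confusion[i][j] for i in range(n_classes))
                  for j in range(n_classes)]      # pixels per predicted class
    total = sum(row_totals)

    ious, accs = [], []
    for i in range(n_classes):
        correct = confusion[i][i]
        union = row_totals[i] + col_totals[i] - correct
        ious.append(correct / union)
        accs.append(correct / row_totals[i])

    return {
        "mean_iu": sum(ious) / n_classes,
        "freq_iu": sum(t * iu for t, iu in zip(row_totals, ious)) / total,
        "mean_acc": sum(accs) / n_classes,
    }


# Made-up two-class example (e.g. background vs. one zone type).
example = [[3, 1],
           [1, 5]]
metrics = segmentation_metrics(example)
print(metrics)
```

Because freq_iu weights each class by its pixel count, large zones dominate it, while mean_iu treats every class equally. This is one plausible reason why mean_iu is the lowest of the three columns in the table below when small zones are harder to segment.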

| model_name | Content | mean_iu | freq_iu |mean_acc |
| ------ | ------ |------ |------ |------ |
| V1_layout | endp V1-V2 cores | 0.6508| 0.7918 |0.9552 |
| V2_layout | endp V1-V4 cores  | 0.6744 | 0.8366 |0.9648 |
| V3_layout | endp V1-V6 cores  | 0.6936 | 0.8455 |0.9673 |


references:
[Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).](https://openaccess.thecvf.com/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf)
Owner

Name: École nationale des chartes
Login: chartes
Kind: organization
Location: 65 rue de Richelieu, 75002 Paris
Website: http://www.chartes.psl.eu/
Repositories: 12
Profile: https://github.com/chartes
Grand établissement d’enseignement supérieur dédié à la recherche historique