https://github.com/chartes/e-ndp_htr_benchmark
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 4 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (3.2%) to scientific vocabulary
Last synced: 6 months ago
·
JSON representation
Repository
Basic Info
- Host: GitHub
- Owner: chartes
- License: mit
- Language: HTML
- Default Branch: main
- Size: 223 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Created about 4 years ago
· Last pushed over 3 years ago
https://github.com/chartes/e-NDP_HTR_benchmark/blob/main/
# Section 1 : Handwritten text recognition for e-NDP
e-NDP_HTR training experiments to fit transcribed pages from AN, LL105-126 registers into Kraken HTR core.
## Definitions
Stats for each register volume (LL105 - LL126, 26 volumes)
| Volume | Dates | Type | N.Pages |
|:---------------:|:----------:|:-----:|:-------:|
| 105 | 1326-1352. | Train | 21 |
| 106A | 1356-1361. | Train | 20 |
| 106B | 1362-1364. | Train | 11 |
| 107 | 1367-1370. | Train | 10 |
| 108A | 1392-1394. | Train | 27 |
| 108B | 1397-1399. | Train | 14 |
| 109A | 1399-1401. | Test | 4 |
| 109B | 1401-1405. | Train | 10 |
| 109C | 1405-1407. | Train | 12 |
| 110 | 1407-1411. | Train | 17 |
| 111 | 1412-1414. | Test | 3 |
| 112 | 1414-1424. | Train | 17 |
| 113 | 1425-1432. | Train | 17 |
| 114 | 1433-1437 | Train | 11 |
| 115 | 1440-1444. | Test | 4 |
| 116 | 1445-1459. | Test | 4 |
| 117 | 1450-1454. | Train | 60 |
| 118 (minutes) | 1453-1456. | Train | 13 |
| 119 | 1456-1460. | Train | 18 |
| 120 | 1460-1465. | Train | 3 |
| 121 | 1465-1474. | Test | 3 |
| 122 | 1474-1481. | Train | 21 |
| 123-124 | 1481-1489. | Test | 4 |
| 125 | 1489-1493 | Train | 20+16 |
| 126 | 1493-1497 | Train | 18 |
| Bnf Latin 17740 | 1430-1444 | Test | 687 |
references:
[Projet e-NDP Notre-Dame de Paris et son clotre : archives des sances du sminaire](https://lamop.hypotheses.org/files/2020/11/e-NDP_seance1_20201020-compresse-1.pdf)
[SRIE LL MONUMENTS ECCLSIASTIQUES REGISTRES](http://www.archivesnationales.culture.gouv.fr/chan/chan/fonds/EGF/SA/InvSAPDF/Ll.pdf)
# Architecture
### Architecure 1:

hyper_params': {'pad': 24, 'freq': 1.0, 'batch_size': 1, 'lag': 5, 'min_delta': None, 'optimizer': 'Adam', 'lrate': 0.0002, 'momentum': 0.9, 'weight_decay': 0, 'schedule': 'reduceonplateau', 'normalization': None, 'normalize_whitespace': True, 'augment': False, 'step_size': 10, 'gamma': 0.1, 'rop_patience': 3, 'cos_t_max': 50}}
> Trained on Kraken (https://github.com/mittagessen/kraken). Training command:
>
> kraken 3.0 : ketos train -N 70 -q dumb -f page --threads 32 -r 0.0001 --schedule reduceonplateau --sched-patience 3 -d cuda:0 --preload --pad 24 -s '[1,128,0,1 Cr4,16,32 Do0.1,2 Mp2,2 Cr4,16,32 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 S1(1x0)1,3 Lbx256 Do0.3,2 Lbx256 Do0.3,2 Lbx256 Do0.3]' --augment training_folder/*.xml
>
> kraken 4.0 : ketos train -N 70 -q dumb -f page --workers 32 -r 0.0001 --schedule reduceonplateau --sched-patience 3 -d cuda:0 --pad 24 -s '[1,128,0,1 Cr4,16,32 Do0.1,2 Mp2,2 Cr4,16,32 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 Mp2,2 Cr3,8,64 Do0.1,2 S1(1x0)1,3 Lbx256 Do0.3,2 Lbx256 Do0.3,2 Lbx256 Do0.3]' --augment training_folder/*.xml
### Architecure 2:
# Training
training board accuracy on validation set
# HTR Experiments
## Training and testing data-sets
- F. Odart de Morchesne: 189 images (1427, 378 pages): https://gallica.bnf.fr/ark:/12148/btv1b9059518w.image / http://elec.enc.sorbonne.fr/morchesne/html/morchesne.html
- Cart. ND de Clairmarais : 134 images (1220 - 1460, 268 pages) : https://bvmm.irht.cnrs.fr/mirador/index.php?manifest=https://bvmm.irht.cnrs.fr/iiif/32117/manifest / https://doi.org/10.5281/zenodo.5600884
- Livre Rouge ( Chtelet de Paris. Y//3 ) : 34 images (1223 - 1474) : https://www.siv.archives-nationales.culture.gouv.fr/siv/UD/FRAN_IR_056373/c-2xp1c58jd-1w19uv1g8sus6 / https://gitlab.huma-num.fr/lamop/htr/-/tree/master/Livre_Rouge_Y__3
- e-NDP: LL 108a (27 images) + LL 125 (20) : https://e-ndp-beta.lamop.fr/public/E-NdP/temp/JPEG/
+ 1 group : volumes LL 106b-126 (64 images)
+ 2 group : volumes LL 106b-126 (82 images)
+ 3 group : volumes LL 106b-126 (76 images)
+ 4 group : volumes LL 106b-126 (78 images)
+ 5 group : volumes LL 105, 106A, 111, 107, 118, 119, 120, 123-124, 127-128 (82 pages)
= 429 images (TRAIN)
+ 28 images (semi-external TEST) : LL 109A, 111, 115, 116, 121, 123-124 (volumes not used in training)
Total :
- TRAIN: 786 images --> 554 folios -> 1109 pages
- TEST: 28 images --> 28 pages
### External test dataset
- Bnf Latin 17740 (1430 - 1444, 687 images): https://gallica.bnf.fr/ark:/12148/btv1b525092040
- Cartulary of Charles II of Navarre (Navarre_Pau_AD_E513, 1297 - 1372, 209 images) : http://earchives.le64.fr/archives-en-ligne/ark:/81221/r13615z7dvnv8k/f1 / https://doi.org/10.5281/zenodo.5600884
## Multilingual
- Odart de Morchesne : 274 formules --> 94 lat + 180 fro (35%-40% lat)
- Clairmarais : 178 actes ---> 168 lat + 10 fro (92% lat)
- Livre Rouge (35% - 40% latin)
- e-ndp (almost all in latin)
- Total: 77% latin / 23% french
## Model versions
Training HTR versions using varied data:
- 19/10/2021: V1 core --> Formulaire Odart de Morchesne + Cartulaire de Clairmarais + Livre Rouge + LL 108a (e-dnp_V1) :
- 16/11/2021: V2 core --> V1 core + 84 pages (1 e-ndp transcription group)
- 11/01/2022: V3 core --> V1 core + V2 core + 82 pages (2 e-ndp transcription group)
- 10/02/2022: V4 core --> V1 core + V2 core + V3 core + 76 pages (3 e-ndp transcription group)
- 30/06/2022: V5 core ---> all V4 core + 78 pages (4 e-ndp transcription group)
- 17/07/2022: V6 core ---> all V5 core + 40 pages coming for new digitized volumes: 105, 106A, 111, 107, 118, 119, 120, 123-124, 127-128)
- 16/08/2022: V7 core ---> all V6 core + 42 pages coming for new digitized volumes: 105, 106A, 111, 107, 118, 119, 120, 123-124, 127-128)
- **val_acc** = accuracy on validation set _during_ training
- **test_acc** = accuracy on corpus test _after_ training
- **cer** = test character error rate
- **wer** = test word error rate
| model_name | Content | arch |val_acc | test_acc |cer | wer | logs |
| ------ | ------ |------ |------ |------ |------ |------ |------ |
| V1_test | Morchesne, Clairmarais, Livre Rouge, 108a |arch_1 | 92.50% | 69.75% |34.88% |71.50% |[log_1](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V1_evaluation) |
| V2_test | V1 core, +LL115 (20 pages), +1 e-ndp group |arch_1 | 94.71% | 83.57% |18.97% | 48.23% |[log_2](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V2_evaluation) |
| V3_test | V1 core, V2 core, +2 e-ndp group |arch_1 |93.90% |86.92% |13.25% |36.24% | [log_3](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V3_evaluation) |
| V3b_test | Only e-ndp transcriptions (193 images) |arch_1| 91.19% |81.90% |18.44% |46.94% |[log_4](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/endp_V3b_evaluation) |
| V4_test | V1 core, V2 core, V3 core +3 e-ndp group |arch_1 |93.52% |88.55% |11.43% |32.47% | [log_5](https://github.com/chartes/e-NDP_HTR_benchmark/raw/main/Logs/endp_V4_evaluation) |
| V5_test | V1, V2, V4 cores + 4 e-ndp group |arch_1 |94.48% |90.26% |9.73% |27.61% | [log_6](https://github.com/chartes/e-NDP_HTR_benchmark/raw/main/Logs/endp_V5_evaluation) |
||[all G1 test metrics](https://magistermilitum.gitlab.io/e-ndp_htr/)|
| V3_Latin_17740 | V3 tested on Latin 17740 manuscrit |arch_1| - |89.25% |11.21% |30.28% |[log_7](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3_latin_17740_evaluation) |
| V3b_Latin_17740 | V3b tested on Latin 17740 manuscrit |arch_1| - |82.59% |18.78% |48.17% |[log_8](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3b_latin_17740_evaluation) |
| V7_Latin_17740 | V7 tested on Latin 17740 manuscrit |arch_1| - |91.52% |9.27% |28.28% |[log_9](https://github.com/chartes/e-NDP_HTR/raw/main/Logs/V7_latin_17740_evaluation) |
| V3_Navarre | V3 tested on Charles II of Navarre manuscrit |arch_1| - |82.82% |14.36% |44.42% |[log_10](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3_Navarre_evaluation) |
| V3b_Navarre | V3b tested on Charles II of Navarre manuscrit |arch_1| - |67.81% |29.02% |69.80% |[log_11](https://gitlab.com/magistermilitum/e-ndp_htr/-/raw/main/Logs/V3b_Navarre_evaluation) |
| V7_Navarre | V7 tested on Charles II of Navarre manuscrit |arch_1| - |85.42% |12.52% |37.78% |[log_12](https://github.com/chartes/e-NDP_HTR/raw/main/Logs/V7_Navarre_evaluation) |
# Section 2: Layout Segmentation
Layout segmentation is a compulsory step before HTR recognition in order to distinguish sections inside a document. This process intend to separate interdependant page zones to produce a recognition in a section-sequence order and not in a line-sequence order which mix textual and peri-textual content.
For e-NDP we contemplate 5 sections to englobe the page distribution in all the 26 volumes:
1. **Block** : All the central text blocks, that normally corresponds to the main content called "conclusions" in registers.
2. **Liste** : List of names of the canons who were present during the meeting. Normally located before the "conclusions".
3. **Entre** : Marginal notes or entries to inform about the content of "conclusions".
4. **Date** : Paragraph contending the date. Normally at the head of a "conclusion", but separate of the main body.
5. **Numrotation** : Page numbers in roman or arabic. Usually appear in the corners of the pages.
Automatic layout segmentation in a e-NDP page.
## Layout segmentation experiments
We annotate 376 transcribed pages from the e-NDP ground-truth (V1 core to V6 core) and we experiment using a classical CNN+BiLSTM architecture.
Order to replicate the training in Kraken 3:
> ketos segtrain -f page -o seg_model -d cuda:0 -bl --threads 32 --epochs 50 --schedule reduceonplateau -s '[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32]' training_folder/*xml
Another training option is the fine-tuning on the default blla.mlmodel (https://github.com/mittagessen/kraken/blob/master/kraken/blla.mlmodel):
> ketos segtrain -i blla.mlmodel -f page -o seg_model -d cuda:0 -bl --threads 32 --resize add --epochs 50 --schedule reduceonplateau -s '[1,1200,0,3 Cr7,7,64,2,2 Gn32 Cr3,3,128,2,2 Gn32 Cr3,3,128 Gn32 Cr3,3,256 Gn32 Cr3,3,256 Gn32 Lbx32 Lby32 Cr1,1,32 Gn32 Lby32 Lbx32]' training_folder/*xml
## Layout segmentation model versions
- **mean_iu** = Mean intersection over union (IU)
- **freq_iu** = Frequency intersection over union (IU)
- **mean_acc** = Mean accuracy (average of the prediction accuracy over all categories)
- **IU** is the overlap ratio between the candidate bound and the ground truth bound.
| model_name | Content | mean_iu | freq_iu |mean_acc |
| ------ | ------ |------ |------ |------ |
| V1_layout | endp V1-V2 cores | 0.6508| 0.7918 |0.9552 |
| V2_layout | endp V1-V4 cores | 0.6744 | 0.8366 |0.9648 |
| V3_layout | endp V1-V6 cores | 0.6936 | 0.8455 |0.9673 |
references:
[Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).](https://openaccess.thecvf.com/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf)
Owner
- Name: École nationale des chartes
- Login: chartes
- Kind: organization
- Location: 65 rue de Richelieu, 75002 Paris
- Website: http://www.chartes.psl.eu/
- Repositories: 12
- Profile: https://github.com/chartes
Grand établissement d’enseignement supérieur dédié à la recherche historique