https://github.com/ai-forever/digital_peter_aij2020
Materials of the AI Journey 2020 competition dedicated to the recognition of Peter the Great's manuscripts, https://ai-journey.ru/contest/task01
Science Score: 20.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary
Keywords
ancient-texts
computer-vision
handwritten-text-recognition
nlp
python3
transfer-learning
Keywords from Contributors
transformer
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: ai-forever
- License: MIT
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://ods.ai/tracks/aij2020
- Size: 5.93 MB
Statistics
- Stars: 66
- Watchers: 8
- Forks: 9
- Open Issues: 1
- Releases: 0
Topics
ancient-texts
computer-vision
handwritten-text-recognition
nlp
python3
transfer-learning
Created over 5 years ago · Last pushed almost 5 years ago
https://github.com/ai-forever/digital_peter_aij2020/blob/master/
# Digital Peter: recognition of Peter the Great's manuscripts

[Russian version](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/README.ru.md)

## Preprint

A preprint of the paper is available at http://arxiv.org/abs/2103.09354.

## INFO ABOUT DATASETS CORRECTION

The **fixed** train dataset can be downloaded [here](https://drive.google.com/file/d/1Qki21iEcg_iwMo3kWuaHi5AlxxpLKpof/view?usp=sharing). Alternatively, you can fix the **old** version of the train dataset (still available [here](https://storage.yandexcloud.net/datasouls-ods/materials/46b7bb85/datasets.zip)) yourself using the following command:

```bash
python checker_train.py 'train/words'
```

Here [`checker_train.py`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/fixes/checker_train.py) is the script that corrects `'train/words'`, the folder with the old versions of the transcribed strings. A complete list of the names of the fixed files (as well as info about the corrections) can be found [here](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/fixes/correction_info.txt) and inside [`checker_train.py`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/fixes/checker_train.py) itself.

Statistics (**train** dataset):

```bash
Number of corrected files = 91
Total number of files = 6196
Percentage of corrected files = 1.47%
```

Similar fixes have been made to **test_public** and **test_private**. Statistics (**test_public** dataset):

```bash
Number of corrected files = 24
Total number of files = 1527
Percentage of corrected files = 1.57%
```

The public leaderboard **will not be recalculated** on the corrected **test_public**, given how minor the fixes are and how close the competition is to its end. In contrast, the private leaderboard will be calculated on the corrected **test_private**.

## MAIN DESCRIPTION

Digital Peter is an educational task with a historical slant, created on the basis of several AI technologies (Computer Vision, NLP, and knowledge graphs). The task was prepared jointly with the Saint Petersburg Institute of History (N.P. Lihachov mansion) of the Russian Academy of Sciences, the Federal Archival Agency of Russia, and the Russian State Archive of Ancient Acts.

### Description of the task and data

Contestants are invited to create an algorithm for line-by-line recognition of manuscripts written by Peter the Great. A detailed description of the problem (with an immersion into its context) can be found in [`desc/detailed_description_of_the_task_en.pdf`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/desc/detailed_description_of_the_task_en.pdf).

The **NOT FIXED** train dataset can be downloaded [here](https://storage.yandexcloud.net/datasouls-ods/materials/46b7bb85/datasets.zip). The dataset was prepared jointly with a working group of researchers from the Saint Petersburg Institute of History (N.P. Lihachov mansion) of the Russian Academy of Sciences: specialists in the history of the Petrine era, as well as in paleography and archeography. The Federal Archival Agency of Russia and the Russian State Archive of Ancient Acts were of great help in providing digital copies of the autographs.

There are two folders inside: `images` and `words`. The `images` folder contains jpg files with cut-out lines from Peter the Great's documents, and the `words` folder contains txt files (transcribed versions of the jpg files). Mapping is performed by name. For example, the original text (1_1_10.jpg):
*(image: the scanned line 1_1_10.jpg)*

and its translation (1_1_10.txt):

```bash
i
```

File names have the format `x_y_z`, where `x` is the series number (a series is a set of pages with text), `y` is the page number, and `z` is the line number on that page. The absolute values of `x`, `y`, `z` carry no meaning (they are internal numbers); only the sequence of `z` matters for a fixed `x_y`. For example, the files

```
987_65_10.jpg
987_65_11.jpg
987_65_12.jpg
987_65_13.jpg
987_65_14.jpg
```

contain exactly 5 consecutive lines. Thus, by choosing particular values of `x` and `y`, it is possible to restore the sequence of lines in a given document: these are the numbers `z` in ascending order for fixed `x`, `y`. This fact can be used to further improve recognition quality. File names in the test dataset have the same structure.

The overwhelming majority of the lines were written by Peter the Great's own hand between 1709 and 1713 (there are lines written in 1704, 1707, and 1708, but no more than 150 of them; these lines are included in both the train and the test datasets).
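Since only the order of `z` matters within a fixed `x_y`, restoring a page's reading order is a simple grouping and sort. A minimal sketch (the `lines_by_page` helper is hypothetical, not part of the repository):

```python
from collections import defaultdict
from pathlib import Path

def lines_by_page(images_dir: str) -> dict:
    """Group image files by (series, page) and sort each group by line number."""
    pages = defaultdict(list)
    for path in Path(images_dir).glob("*.jpg"):
        x, y, z = map(int, path.stem.split("_"))  # e.g. "987_65_10" -> (987, 65, 10)
        pages[(x, y)].append((z, path.name))
    # Ascending z restores the sequence of lines on each page
    return {page: [name for _, name in sorted(lines)] for page, lines in pages.items()}

# Example: all lines of series 987, page 65, in reading order
# lines_by_page("train/images")[(987, 65)]
```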
### Baseline

A notebook with the baseline solution: [`baseline.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/baseline.ipynb)

For text recognition, the baseline uses the following architecture:

*(figure: the baseline architecture)*
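The figure is not reproduced here; the exact layer configuration is in `baseline.ipynb`. For orientation only, below is a minimal PyTorch sketch of a generic CNN + BiLSTM + CTC line recognizer, the family of architectures commonly used for this kind of task. All layer sizes are illustrative assumptions, not the baseline's actual values.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Conv features -> BiLSTM -> per-timestep character logits (for CTC)."""
    def __init__(self, n_chars: int, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128 * 32, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, images):                 # images: (B, 1, 128, W)
        f = self.conv(images)                  # (B, 128, 32, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (B, W/4, 128*32): one step per image column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)    # (B, T, n_chars + 1)

# Training would use nn.CTCLoss on these log-probabilities (it expects (T, B, C)).
```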
One possible way to improve performance is to apply a sequence-to-sequence model for post-processing, e.g.:

* Encoder-Decoder with Bahdanau Attention;
* a Transformer-based sequence-to-sequence model.

A Jupyter notebook with the baseline that incorporates the sequence-to-sequence models: [`baseline_seq2seq.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/seq2seq/baseline_seq2seq.ipynb). The data and models for it are available [here](https://drive.google.com/file/d/1QXCKKZWa0wqIkHzW_OEf1SiPPPP7HtFH/view?usp=sharing).
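Both post-processors treat correction as character-level translation from the raw recognizer output to clean text. Purely as an illustration of that framing, here is a minimal `nn.Transformer` sketch; the `Corrector` class, vocabulary, and sizes are illustrative assumptions, not the notebook's actual configuration:

```python
import torch
import torch.nn as nn

class Corrector(nn.Module):
    """Character-level 'noisy -> clean' translation model for post-processing.
    Positional encodings are omitted for brevity; a real model needs them."""
    def __init__(self, vocab: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab)

    def forward(self, noisy, clean_prefix):
        # Teacher forcing: predict the next clean character given the noisy
        # recognizer output and the clean prefix decoded so far.
        mask = self.transformer.generate_square_subsequent_mask(clean_prefix.size(1))
        h = self.transformer(self.embed(noisy), self.embed(clean_prefix), tgt_mask=mask)
        return self.out(h)
```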
### Description of the metrics

The leaderboard takes into account the following recognition quality metrics (computed on the test dataset):

* **CER** - Character Error Rate:

  $$\mathrm{CER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_{c}(\hat{y}_{i}, y_{i})}{\sum_{i=1}^{n} \mathrm{len}_{c}(y_{i})}$$

  where $\mathrm{dist}_{c}$ is the Levenshtein distance calculated for character tokens (including spaces) and $\mathrm{len}_{c}$ is the length of the string in characters.

* **WER** - Word Error Rate:

  $$\mathrm{WER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_{w}(\hat{y}_{i}, y_{i})}{\sum_{i=1}^{n} \mathrm{len}_{w}(y_{i})}$$

  where $\mathrm{dist}_{w}$ is the Levenshtein distance calculated for word tokens and $\mathrm{len}_{w}$ is the length of the string in words.

* **String Accuracy** - the number of fully matching test strings divided by the total number of test strings, written with the Iverson bracket:

  $$\mathrm{String\ Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \left[\hat{y}_{i} = y_{i}\right]$$

In the formulas above, $n$ is the size of the test sample, $\hat{y}_{i}$ is the string of characters that the model recognized in the $i$-th image, and $y_{i}$ is the true translation of the $i$-th image made by the expert.
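As a self-contained reference, the three metrics can be sketched as follows. The official implementation used for scoring is the repository's `eval/evaluate.py`, described below; `levenshtein` and `metrics` here are illustrative helpers.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def metrics(pred, true):
    """CER, WER, and String Accuracy over parallel lists of strings."""
    cer = sum(levenshtein(p, t) for p, t in zip(pred, true)) / sum(len(t) for t in true)
    wer = sum(levenshtein(p.split(), t.split()) for p, t in zip(pred, true)) \
        / sum(len(t.split()) for t in true)
    acc = sum(p == t for p, t in zip(pred, true)) / len(true)
    return cer, wer, acc
```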
Follow this [link](https://sites.google.com/site/textdigitisation/qualitymeasures/computingerrorrates) to learn more about these metrics. You can see exactly how they are calculated in the script [`eval/evaluate.py`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/eval/evaluate.py). It accepts two parameters as input, [`eval/pred_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/pred_dir) and [`eval/true_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/true_dir). The [`eval/true_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/true_dir) folder should contain txt files with the true string translations (the structure is the same as in the `words` folder), while the [`eval/pred_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/pred_dir) folder should contain txt files with the strings recognized by the model. Mapping is again done by name, so the lists of file names in [`eval/true_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/true_dir) and [`eval/pred_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/pred_dir) **must be identical**!

The quality can be calculated with the following command (run from the [`eval`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval) folder):

```bash
python evaluate.py pred_dir true_dir
```

The result is displayed as follows:

```bash
Ground truth -> Recognized
[ERR:3] " " -> " "
[ERR:3] " " -> " "
[ERR:2] " I" -> " 1"
[OK] "!" -> "!"
Character error rate: 11.267606%
Word error rate: 70.000000%
String accuracy: 25.000000%
```

**CER**, %, is the key metric used to sort the leaderboard (lower is better). If two or more contestants have the same **CER**, they are sorted by **WER**, % (lower is better); if both **CER** and **WER** match, by **String Accuracy**, % (higher is better). The next metric is **Time** (sec.), the time it takes your model to process the test dataset on an NVidia Tesla V100 (lower is better). If all the metrics match, the solution submitted earlier ranks first (and if even that is equal, teams are sorted alphabetically by team name).

The latest version of the model (see [`baseline.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/baseline.ipynb)) has the following quality metrics on the public part of the test sample:

```bash
CER = 10.526%
WER = 44.432%
String Accuracy = 21.662%
Time = 60 sec
```

The latest version of the baseline that incorporates the sequence-to-sequence models (see [`baseline_seq2seq.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/seq2seq/baseline_seq2seq.ipynb)) achieved the following metrics on the public part of the test sample:

```bash
Encoder-Decoder with Bahdanau Attention
CER = 14.957%
WER = 49.716%
String Accuracy = 13.547%
Time = 359 sec

Transformer-based sequence-to-sequence model
CER = 14.489%
WER = 54.974%
String Accuracy = 9.228%
Time = 76 sec
```

### Solution format

The accepted solution is a ZIP archive that contains your code and an entry point to run it. The entry point is set in the `metadata.json` file in the root of the solution archive:
", "entry_point": " " } ``` For example: ``` { "image": "odsai/python-gpu", "entry_point": "python predict.py" } ``` The data is supposed to be read from `/data` directory. Your predictions should go to `/output`. For each picture file from `/data` ` .jpg` you have to get the corresponding recognized text file ` .txt` in `/output`. The solution is run in Docker container. You can start with the ready-to-go image we prepared https://hub.docker.com/r/odsai/python-gpu. It contains CUDA 10.1, CUDNN 7.6 and the latest Python libraries. Also you can use your own image for the competition, which must be uploaded to https://hub.docker.com. The image name is changed in hereabove mentioned `metadata.json`. Provided resources: - 8 CPU cores - 94 GB RAM - NVidia Tesla V100 GPU Restrictions: - Up to 5 GB size of the working dir - Up to 5 GB size of an archive with the solution - 10 minutes calculation time limit You can download the example solution: [`submit_example`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/submit_example) [Here](https://drive.google.com/file/d/16qXYiferc8WAja_3FUbLsG3zwpmmKZBM/view?usp=sharing) is the ```.zip``` file to build a Docker container from the baseline solution that incorporates the transformer-based sequence-to-sequence model. ### Leaderboard The competition is over. Here is the final leaderboard for this competition. Scores are presented for the private set. Baseline solution presented in this github has the following metrics - 9.786, 44.222, 21.532 (CER,WER,ACC).
### Leaderboard

The competition is over. Below is the final leaderboard; scores are computed on the private test set. The baseline solution presented in this repository has the following metrics: CER = 9.786%, WER = 44.222%, String Accuracy = 21.532%.

*(figure: the final private leaderboard)*
Owner
- Name: AI Forever
- Login: ai-forever
- Kind: organization
- Location: Armenia
- Repositories: 60
- Profile: https://github.com/ai-forever
Creating ML for the future. AI projects you already know. We are a non-profit organization with members from all over the world.
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Denis | d****v@g****m | 214 |
| MarkPotanin | m****n@p****u | 21 |
| Vlad Mikhailov | 4****v | 15 |
| mike0sv | m****v@g****m | 2 |
| Anton Emelyanov | l****t@m****u | 1 |
Committer Domains (Top 20 + Academic)
- mail.ru: 1
- phystech.edu: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 3
- Total pull requests: 7
- Average time to close issues: 21 minutes
- Average time to close pull requests: 1 day
- Total issue authors: 2
- Total pull request authors: 4
- Average comments per issue: 0.67
- Average comments per pull request: 0.14
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jibozogsotron (1)
- alikaz3mi (1)
Pull Request Authors
- vmkhlv (3)
- denndimitrov (1)
- MarkPotanin (1)
- mike0sv (1)