https://github.com/ai-forever/digital_peter_aij2020
Materials of the AI Journey 2020 competition dedicated to the recognition of Peter the Great's manuscripts, https://ai-journey.ru/contest/task01
Science Score: 20.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ✓ Committers with academic emails: 1 of 5 committers (20.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary
Keywords
ancient-texts
computer-vision
handwritten-text-recognition
nlp
python3
transfer-learning
Keywords from Contributors
transformer
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: ai-forever
- License: MIT
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://ods.ai/tracks/aij2020
- Size: 5.93 MB
Statistics
- Stars: 66
- Watchers: 8
- Forks: 9
- Open Issues: 1
- Releases: 0
Topics
ancient-texts
computer-vision
handwritten-text-recognition
nlp
python3
transfer-learning
Created over 5 years ago · Last pushed almost 5 years ago
https://github.com/ai-forever/digital_peter_aij2020/blob/master/
# Digital Peter: recognition of Peter the Great's manuscripts

[Russian version](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/README.ru.md)

## Preprint

A preprint of the paper is available at http://arxiv.org/abs/2103.09354.

## INFO ABOUT DATASETS CORRECTION

The **fixed** train dataset can be downloaded [here](https://drive.google.com/file/d/1Qki21iEcg_iwMo3kWuaHi5AlxxpLKpof/view?usp=sharing). Alternatively, you can fix the **old** version of the train dataset (still available [here](https://storage.yandexcloud.net/datasouls-ods/materials/46b7bb85/datasets.zip)) yourself using the following command:

```bash
python checker_train.py 'train/words'
```

Here [`checker_train.py`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/fixes/checker_train.py) is the script that corrects `'train/words'`, the folder with the old versions of the transcribed strings. A complete list of the names of the fixed files (as well as info about the corrections) can be found [here](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/fixes/correction_info.txt) and inside [`checker_train.py`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/fixes/checker_train.py) itself.

Statistics (**train** dataset):

```bash
Number of corrected files = 91
Total number of files = 6196
Percentage of corrected files = 1.47%
```

Similar fixes have been made to **test_public** and **test_private**. Statistics (**test_public** dataset):

```bash
Number of corrected files = 24
Total number of files = 1527
Percentage of corrected files = 1.57%
```

The public leaderboard **will not be recalculated** on the corrected **test_public**, given how minor the fixes are and how close the competition is to its end. In contrast, the private leaderboard will be calculated on the corrected **test_private**.

## MAIN DESCRIPTION

Digital Peter is an educational task with a historical slant, created on the basis of several AI technologies (Computer Vision, NLP, and knowledge graphs). The task was prepared jointly with the Saint Petersburg Institute of History (N.P. Lihachov mansion) of the Russian Academy of Sciences, the Federal Archival Agency of Russia, and the Russian State Archive of Ancient Acts.

### Description of the task and data

Contestants are invited to create an algorithm for line-by-line recognition of manuscripts written by Peter the Great. A detailed description of the problem (with an immersion into its context) can be found in [`desc/detailed_description_of_the_task_en.pdf`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/desc/detailed_description_of_the_task_en.pdf).

The **NOT FIXED** train dataset can be downloaded [here](https://storage.yandexcloud.net/datasouls-ods/materials/46b7bb85/datasets.zip). The dataset was prepared jointly with a working group of researchers from the Saint Petersburg Institute of History (N.P. Lihachov mansion) of the Russian Academy of Sciences: specialists in the history of the Petrine era, as well as in paleography and archeography. The Federal Archival Agency of Russia and the Russian State Archive of Ancient Acts were of great help in providing digital copies of the autographs.

There are two folders inside: `images` and `words`. The `images` folder contains jpg files with cut-out lines from Peter the Great's documents, and the `words` folder contains txt files (transcribed versions of the jpg files). Mapping is performed by name. For example, the original text (1_1_10.jpg):
*(image: the scanned line 1_1_10.jpg)*

and its translation (1_1_10.txt):

```bash
i
```

File names have the format `x_y_z`, where `x` is the series number (a series is a set of pages with text), `y` is the page number, and `z` is the line number on that page. The absolute values of `x`, `y`, `z` carry no meaning (they are internal numbers); only the sequence of `z` matters for a fixed `x_y`. For example, the files

```
987_65_10.jpg
987_65_11.jpg
987_65_12.jpg
987_65_13.jpg
987_65_14.jpg
```

contain exactly 5 consecutive lines. Thus, by choosing particular values of `x` and `y`, it is possible to restore the sequence of lines in a given document: these are the numbers `z` in ascending order for fixed `x`, `y`. This fact can be used to further improve recognition quality. File names in the test dataset have the same structure.

The overwhelming majority of the lines were written by Peter the Great's own hand between 1709 and 1713 (there are lines written in 1704, 1707, and 1708, but no more than 150 of them; these lines are included in both the train and the test datasets).
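Since only the order of `z` matters within a fixed `x_y`, restoring a page's reading order is a simple grouping and sort. A minimal sketch (the `lines_by_page` helper is hypothetical, not part of the repository):

```python
from collections import defaultdict
from pathlib import Path

def lines_by_page(images_dir: str) -> dict:
    """Group image files by (series, page) and sort each group by line number."""
    pages = defaultdict(list)
    for path in Path(images_dir).glob("*.jpg"):
        x, y, z = map(int, path.stem.split("_"))  # e.g. "987_65_10" -> (987, 65, 10)
        pages[(x, y)].append((z, path.name))
    # Ascending z restores the sequence of lines on each page
    return {page: [name for _, name in sorted(lines)] for page, lines in pages.items()}

# Example: all lines of series 987, page 65, in reading order
# lines_by_page("train/images")[(987, 65)]
```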
### Baseline

A notebook with the baseline solution: [`baseline.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/baseline.ipynb)

For text recognition, the baseline uses the following architecture:

*(figure: the baseline architecture)*
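The figure is not reproduced here; the exact layer configuration is in `baseline.ipynb`. For orientation only, below is a minimal PyTorch sketch of a generic CNN + BiLSTM + CTC line recognizer, the family of architectures commonly used for this kind of task. All layer sizes are illustrative assumptions, not the baseline's actual values.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Conv features -> BiLSTM -> per-timestep character logits (for CTC)."""
    def __init__(self, n_chars: int, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128 * 32, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, images):                 # images: (B, 1, 128, W)
        f = self.conv(images)                  # (B, 128, 32, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (B, W/4, 128*32): one step per image column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)    # (B, T, n_chars + 1)

# Training would use nn.CTCLoss on these log-probabilities (it expects (T, B, C)).
```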
One possible way to improve performance is to apply a sequence-to-sequence model for post-processing, e.g.:

* Encoder-Decoder with Bahdanau Attention;
* a Transformer-based sequence-to-sequence model.

A Jupyter notebook with the baseline that incorporates the sequence-to-sequence models: [`baseline_seq2seq.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/seq2seq/baseline_seq2seq.ipynb). The data and models for it are available [here](https://drive.google.com/file/d/1QXCKKZWa0wqIkHzW_OEf1SiPPPP7HtFH/view?usp=sharing).
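Both post-processors treat correction as character-level translation from the raw recognizer output to clean text. Purely as an illustration of that framing, here is a minimal `nn.Transformer` sketch; the `Corrector` class, vocabulary, and sizes are illustrative assumptions, not the notebook's actual configuration:

```python
import torch
import torch.nn as nn

class Corrector(nn.Module):
    """Character-level 'noisy -> clean' translation model for post-processing.
    Positional encodings are omitted for brevity; a real model needs them."""
    def __init__(self, vocab: int, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(
            d_model, nhead=4, num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab)

    def forward(self, noisy, clean_prefix):
        # Teacher forcing: predict the next clean character given the noisy
        # recognizer output and the clean prefix decoded so far.
        mask = self.transformer.generate_square_subsequent_mask(clean_prefix.size(1))
        h = self.transformer(self.embed(noisy), self.embed(clean_prefix), tgt_mask=mask)
        return self.out(h)
```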
### Description of the metrics

The leaderboard takes into account the following recognition quality metrics (computed on the test dataset):

* **CER** - Character Error Rate:

  $$\mathrm{CER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_{c}(\hat{y}_{i}, y_{i})}{\sum_{i=1}^{n} \mathrm{len}_{c}(y_{i})}$$

  where $\mathrm{dist}_{c}$ is the Levenshtein distance calculated for character tokens (including spaces) and $\mathrm{len}_{c}$ is the length of the string in characters.

* **WER** - Word Error Rate:

  $$\mathrm{WER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_{w}(\hat{y}_{i}, y_{i})}{\sum_{i=1}^{n} \mathrm{len}_{w}(y_{i})}$$

  where $\mathrm{dist}_{w}$ is the Levenshtein distance calculated for word tokens and $\mathrm{len}_{w}$ is the length of the string in words.

* **String Accuracy** - the number of fully matching test strings divided by the total number of test strings, written with the Iverson bracket:

  $$\mathrm{String\ Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \left[\hat{y}_{i} = y_{i}\right]$$

In the formulas above, $n$ is the size of the test sample, $\hat{y}_{i}$ is the string of characters that the model recognized in the $i$-th image, and $y_{i}$ is the true translation of the $i$-th image made by the expert.
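As a self-contained reference, the three metrics can be sketched as follows. The official implementation used for scoring is the repository's `eval/evaluate.py`, described below; `levenshtein` and `metrics` here are illustrative helpers.

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (characters or words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def metrics(pred, true):
    """CER, WER, and String Accuracy over parallel lists of strings."""
    cer = sum(levenshtein(p, t) for p, t in zip(pred, true)) / sum(len(t) for t in true)
    wer = sum(levenshtein(p.split(), t.split()) for p, t in zip(pred, true)) \
        / sum(len(t.split()) for t in true)
    acc = sum(p == t for p, t in zip(pred, true)) / len(true)
    return cer, wer, acc
```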
Follow this [link](https://sites.google.com/site/textdigitisation/qualitymeasures/computingerrorrates) to learn more about these metrics. You can see exactly how they are calculated in the script [`eval/evaluate.py`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/eval/evaluate.py). It accepts two parameters as input, [`eval/pred_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/pred_dir) and [`eval/true_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/true_dir). The [`eval/true_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/true_dir) folder should contain txt files with the true string translations (the structure is the same as in the `words` folder), while the [`eval/pred_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/pred_dir) folder should contain txt files with the strings recognized by the model. Mapping is again done by name, so the lists of file names in [`eval/true_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/true_dir) and [`eval/pred_dir`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval/pred_dir) **must be identical**!

The quality can be calculated with the following command (run from the [`eval`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/eval) folder):

```bash
python evaluate.py pred_dir true_dir
```

The result is displayed as follows:

```bash
Ground truth -> Recognized
[ERR:3] " " -> " "
[ERR:3] " " -> " "
[ERR:2] " I" -> " 1"
[OK] "!" -> "!"
Character error rate: 11.267606%
Word error rate: 70.000000%
String accuracy: 25.000000%
```

**CER**, %, is the key metric used to sort the leaderboard (lower is better). If two or more contestants have the same **CER**, they are sorted by **WER**, % (lower is better); if both **CER** and **WER** match, by **String Accuracy**, % (higher is better). The next metric is **Time** (sec.), the time it takes your model to process the test dataset on an NVidia Tesla V100 (lower is better). If all the metrics match, the solution submitted earlier ranks first (and if even that is equal, teams are sorted alphabetically by team name).

The latest version of the model (see [`baseline.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/master/baseline.ipynb)) has the following quality metrics on the public part of the test sample:

```bash
CER = 10.526%
WER = 44.432%
String Accuracy = 21.662%
Time = 60 sec
```

The latest version of the baseline that incorporates the sequence-to-sequence models (see [`baseline_seq2seq.ipynb`](https://github.com/sberbank-ai/digital_peter_aij2020/blob/seq2seq/baseline_seq2seq.ipynb)) achieved the following metrics on the public part of the test sample:

```bash
Encoder-Decoder with Bahdanau Attention
CER = 14.957%
WER = 49.716%
String Accuracy = 13.547%
Time = 359 sec

Transformer-based sequence-to-sequence model
CER = 14.489%
WER = 54.974%
String Accuracy = 9.228%
Time = 76 sec
```

### Solution format

The accepted solution is a ZIP archive that contains your code and an entry point to run it. The entry point is set in the `metadata.json` file in the root of the solution archive:
", "entry_point": " " } ``` For example: ``` { "image": "odsai/python-gpu", "entry_point": "python predict.py" } ``` The data is supposed to be read from `/data` directory. Your predictions should go to `/output`. For each picture file from `/data` ` .jpg` you have to get the corresponding recognized text file ` .txt` in `/output`. The solution is run in Docker container. You can start with the ready-to-go image we prepared https://hub.docker.com/r/odsai/python-gpu. It contains CUDA 10.1, CUDNN 7.6 and the latest Python libraries. Also you can use your own image for the competition, which must be uploaded to https://hub.docker.com. The image name is changed in hereabove mentioned `metadata.json`. Provided resources: - 8 CPU cores - 94 GB RAM - NVidia Tesla V100 GPU Restrictions: - Up to 5 GB size of the working dir - Up to 5 GB size of an archive with the solution - 10 minutes calculation time limit You can download the example solution: [`submit_example`](https://github.com/sberbank-ai/digital_peter_aij2020/tree/master/submit_example) [Here](https://drive.google.com/file/d/16qXYiferc8WAja_3FUbLsG3zwpmmKZBM/view?usp=sharing) is the ```.zip``` file to build a Docker container from the baseline solution that incorporates the transformer-based sequence-to-sequence model. ### Leaderboard The competition is over. Here is the final leaderboard for this competition. Scores are presented for the private set. Baseline solution presented in this github has the following metrics - 9.786, 44.222, 21.532 (CER,WER,ACC).
### Leaderboard

The competition is over. Below is the final leaderboard; scores are computed on the private test set. The baseline solution presented in this repository has the following metrics: CER = 9.786%, WER = 44.222%, String Accuracy = 21.532%.

*(figure: the final private leaderboard)*
Owner
- Name: AI Forever
- Login: ai-forever
- Kind: organization
- Location: Armenia
- Repositories: 60
- Profile: https://github.com/ai-forever
Creating ML for the future. AI projects you already know. We are a non-profit organization with members from all over the world.
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Denis | d****v@g****m | 214 |
| MarkPotanin | m****n@p****u | 21 |
| Vlad Mikhailov | 4****v | 15 |
| mike0sv | m****v@g****m | 2 |
| Anton Emelyanov | l****t@m****u | 1 |
Committer Domains (Top 20 + Academic)
- mail.ru: 1
- phystech.edu: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 3
- Total pull requests: 7
- Average time to close issues: 21 minutes
- Average time to close pull requests: 1 day
- Total issue authors: 2
- Total pull request authors: 4
- Average comments per issue: 0.67
- Average comments per pull request: 0.14
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jibozogsotron (1)
- alikaz3mi (1)
Pull Request Authors
- vmkhlv (3)
- denndimitrov (1)
- MarkPotanin (1)
- mike0sv (1)