post-ocr-text-correction
Repository for post-OCR text correction for Bulgarian historical documents
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: links to springer.com
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (4.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: angelbeshirov
- Language: Python
- Default Branch: main
- Size: 37.6 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Post-OCR Text Correction for Bulgarian historical documents
A repository for post-OCR text correction for Bulgarian historical documents written in the Drinov or Ivanchev orthographies. Code for the paper: https://link.springer.com/article/10.1007/s00799-025-00415-x
Description
In this work, we focus on Bulgarian and create the first benchmark dataset for evaluating post-OCR text correction on historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and with copy and coverage mechanisms, to improve post-OCR text correction. The proposed method reduces the errors introduced during recognition: it improves the quality of the documents by 25%, which is an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset.
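The description reports an improvement percentage but does not define how it is measured. As a rough illustration only, the sketch below assumes the improvement is the relative reduction in character-level edit distance to the ground truth, which is one common way to score post-OCR correction. The function names `edit_distance` and `relative_improvement` and the sample strings are hypothetical and are not taken from this repository.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def relative_improvement(ocr: str, corrected: str, truth: str) -> float:
    """Fraction of the OCR errors removed by the correction (assumed metric)."""
    before = edit_distance(ocr, truth)
    after = edit_distance(corrected, truth)
    if before == 0:
        # Nothing to fix, so report no improvement instead of dividing by zero.
        return 0.0
    return (before - after) / before


# Made-up example: the OCR output has two wrong characters and the
# corrected output still has one, so half of the errors were removed.
truth = "градъ"
ocr = "грздь"
corrected = "градь"
print(relative_improvement(ocr, corrected, truth))
```

A negative value from `relative_improvement` would mean the correction step introduced more errors than it fixed, which is why a score like this is usually reported alongside the raw error counts.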
Citation
Beshirov, A., Dobreva, M., Dimitrov, D., Hardalov, M., Koychev, I., & Nakov, P. (2025). Post-OCR Text Correction for Bulgarian historical documents. International Journal on Digital Libraries. https://doi.org/10.1007/s00799-025-00415-x
Owner
- Login: angelbeshirov
- Kind: user
- Location: Sofia, Bulgaria
- Repositories: 3
- Profile: https://github.com/angelbeshirov
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this code, please cite our paper:"
title: "Post-OCR Text Correction for Bulgarian historical documents"
authors:
  - family-names: Beshirov
    given-names: Angel
    orcid: https://orcid.org/0000-0002-0684-2730
  - family-names: Dobreva
    given-names: Milena
    orcid: https://orcid.org/0000-0002-2579-7541
  - family-names: Dimitrov
    given-names: Dimitar
    orcid: https://orcid.org/0000-0003-1308-180X
  - family-names: Hardalov
    given-names: Momchil
    orcid: https://orcid.org/0000-0001-8095-3570
  - family-names: Koychev
    given-names: Ivan
    orcid: https://orcid.org/0000-0003-3919-030X
  - family-names: Nakov
    given-names: Preslav
    orcid: https://orcid.org/0000-0002-3600-1510
date-released: 2025-02-21
doi: 10.1007/s00799-025-00415-x
preferred-citation:
  type: article
  title: "Post-OCR Text Correction for Bulgarian historical documents"
  authors:
    - family-names: Beshirov
      given-names: Angel
    - family-names: Dobreva
      given-names: Milena
    - family-names: Dimitrov
      given-names: Dimitar
    - family-names: Hardalov
      given-names: Momchil
    - family-names: Koychev
      given-names: Ivan
    - family-names: Nakov
      given-names: Preslav
  journal: "International Journal on Digital Libraries"
  year: 2025
  doi: 10.1007/s00799-025-00415-x
GitHub Events
Total
- Watch event: 1
- Push event: 2
- Fork event: 1
Last Year
- Watch event: 1
- Push event: 2
- Fork event: 1