post-ocr-text-correction

Repository for post-OCR text correction for Bulgarian historical documents

https://github.com/angelbeshirov/post-ocr-text-correction

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: springer.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: angelbeshirov
  • Language: Python
  • Default Branch: main
  • Size: 37.6 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed 8 months ago

README.md

Post-OCR Text Correction for Bulgarian historical documents

A repository for post-OCR text correction for Bulgarian historical documents written in the Drinov or Ivanchev orthographies. Code for the paper: https://link.springer.com/article/10.1007/s00799-025-00415-x

Description

In this work, we focus on Bulgarian and create the first benchmark dataset for evaluating post-OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century. We further develop a method for automatically generating synthetic data in this orthography, as well as in the subsequent Ivanchev orthography, by leveraging vast amounts of contemporary Bulgarian literary texts. We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with a diagonal attention loss and with copy and coverage mechanisms, to improve post-OCR text correction. The proposed method reduces the errors introduced during recognition: it improves the quality of the documents by 25%, which is an increase of 16% compared to the state of the art on the ICDAR 2019 Bulgarian dataset.
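To make the reported percentages more concrete, here is a minimal sketch of one common way to score post-OCR correction. It assumes improvement is measured as the relative reduction in character-level edit distance to the ground truth, which the description above does not specify. The function names `levenshtein` and `relative_improvement` and the sample strings are hypothetical and are not taken from this repository.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def relative_improvement(ocr_text: str, corrected_text: str, ground_truth: str) -> float:
    """Percentage reduction in edit distance to the ground truth.

    Assumed metric: positive values mean the correction helped,
    negative values mean it introduced more errors than it fixed.
    """
    before = levenshtein(ocr_text, ground_truth)
    after = levenshtein(corrected_text, ground_truth)
    if before == 0:
        # The OCR output was already perfect, so there is nothing to improve.
        return 0.0
    return 100.0 * (before - after) / before


# Hypothetical example: two OCR errors, one of which the corrector fixes.
truth = "история"
ocr_output = "нстормя"
corrected = "истормя"
print(relative_improvement(ocr_output, corrected, truth))  # 50.0
```

Under this assumed definition, a score of 25% would mean that a quarter of the character-level errors in the OCR output were removed by the correction step.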

Citation

Beshirov, A., Dobreva, M., Dimitrov, D., Hardalov, M., Koychev, I., & Nakov, P. (2025). Post-OCR Text Correction for Bulgarian historical documents. International Journal on Digital Libraries. https://doi.org/10.1007/s00799-025-00415-x

Owner

  • Login: angelbeshirov
  • Kind: user
  • Location: Sofia, Bulgaria

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this code, please cite our paper:"
title: "Post-OCR Text Correction for Bulgarian historical documents"
authors:
  - family-names: Beshirov
    given-names: Angel
    orcid: https://orcid.org/0000-0002-0684-2730
  - family-names: Dobreva
    given-names: Milena
    orcid: https://orcid.org/0000-0002-2579-7541
  - family-names: Dimitrov
    given-names: Dimitar
    orcid: https://orcid.org/0000-0003-1308-180X
  - family-names: Hardalov
    given-names: Momchil
    orcid: https://orcid.org/0000-0001-8095-3570
  - family-names: Koychev
    given-names: Ivan
    orcid: https://orcid.org/0000-0003-3919-030X
  - family-names: Nakov
    given-names: Preslav
    orcid: https://orcid.org/0000-0002-3600-1510
date-released: 2025-02-21
doi: 10.1007/s00799-025-00415-x
preferred-citation:
  type: article
  title: "Post-OCR Text Correction for Bulgarian historical documents"
  authors:
    - family-names: Beshirov
      given-names: Angel
    - family-names: Dobreva
      given-names: Milena
    - family-names: Dimitrov
      given-names: Dimitar
    - family-names: Hardalov
      given-names: Momchil
    - family-names: Koychev
      given-names: Ivan
    - family-names: Nakov
      given-names: Preslav
  journal: "International Journal on Digital Libraries"
  year: 2025
  doi: 10.1007/s00799-025-00415-x

GitHub Events

Total
  • Watch event: 1
  • Push event: 2
  • Fork event: 1
Last Year
  • Watch event: 1
  • Push event: 2
  • Fork event: 1