duosearch

Search engine for historical documents that uses ElasticSearch and deep neural networks to handle OCR errors and orthographic variation.

https://github.com/angelbeshirov/duosearch

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.7%) to scientific vocabulary

Repository

Search engine for historical documents that uses ElasticSearch and deep neural networks to handle OCR errors and orthographic variation.

Basic Info
  • Host: GitHub
  • Owner: angelbeshirov
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 4.23 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed 9 months ago
Metadata Files
Readme Citation

README.md

DuoSearch

A novel search engine for historical newspapers utilizing ElasticSearch and machine learning methods. Code for the paper https://arxiv.org/abs/2305.19392

Purpose

The purpose of this research is to build a proof-of-concept search engine that addresses two issues: OCR errors, and the orthographic variation introduced by the Bulgarian language reforms between the 1850s and 1945.
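To illustrate the first issue, here is a minimal sketch of a query that tolerates small OCR mistakes. It builds an ElasticSearch `match` query with `fuzziness` set to `"AUTO"`. This is an illustration only, not necessarily how DuoSearch queries its index: the function name `build_fuzzy_query` and the field name `content` are assumptions made for this example.

```python
import json


def build_fuzzy_query(user_query: str, field: str = "content") -> dict:
    """Build a match query body that tolerates small spelling differences.

    `field` is a hypothetical index field name; adjust it to the real mapping.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": user_query,
                    # "AUTO" lets ElasticSearch choose the allowed edit
                    # distance based on the length of each term.
                    "fuzziness": "AUTO",
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the request body that would be sent to the search endpoint.
    body = build_fuzzy_query("вестникъ")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Fuzzy matching only covers small character-level differences, so it is best seen as a baseline next to the machine learning methods the paper describes.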

Scope

This is a PoC version and can be used for collections of digitised historical documents from the same time span. The tool uses dictionaries for Bulgarian, but it can be adapted to other languages.
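As a rough illustration of how a dictionary can help with pre-reform spelling, the sketch below generates candidate modern spellings for a historical word and keeps only those found in a word list. It is a simplified example, not the method used in DuoSearch: the replacement table covers only the letters ѣ and ѫ and the silent word-final ъ/ь, and `MODERN_DICTIONARY` is a tiny hypothetical stand-in for a real Bulgarian dictionary.

```python
from __future__ import annotations

from itertools import product

# Letters dropped by the 1945 reform and their possible modern replacements.
# Simplified on purpose: the correct choice depends on the word.
REPLACEMENTS = {
    "ѣ": ("е", "я"),
    "ѫ": ("ъ",),
}

# Hypothetical stand-in for a real modern Bulgarian word list.
MODERN_DICTIONARY = {"град", "мъж", "вяра", "река"}


def candidate_spellings(word: str) -> set[str]:
    """Return every modern spelling the replacement table allows."""
    word = word.lower()
    # Pre-reform spelling wrote a silent ъ or ь at the end of many words.
    if len(word) > 1 and word[-1] in "ъь":
        word = word[:-1]
    options = [REPLACEMENTS.get(ch, (ch,)) for ch in word]
    return {"".join(combo) for combo in product(*options)}


def modernise(word: str, dictionary: set[str] = MODERN_DICTIONARY) -> set[str]:
    """Keep only the candidates that appear in the dictionary."""
    return candidate_spellings(word) & dictionary


if __name__ == "__main__":
    for historical in ("градъ", "мѫжъ", "вѣра", "рѣка"):
        print(historical, "->", modernise(historical))
```

Adapting this sketch to another language would mean swapping the replacement table and the word list, which is the kind of adaptation the paragraph above refers to.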

Target audience

This research is useful for anyone interested in search tools for collections of historical documents or newspapers that contain errors and/or linguistic variation. The target user of the engine is a library in Bulgaria, but the engine can be adapted for and used by external users as well.

Architecture

[Figure: architecture diagram]

Citation

Beshirov, A., Hadzhieva, S., Dobreva, M., & Koychev, I. (2022). DuoSearch: A Novel Search Engine for Bulgarian Historical Documents. Proceedings of the European Conference on Information Retrieval. https://doi.org/10.1007/978-3-030-99739-7_31

Owner

  • Login: angelbeshirov
  • Kind: user
  • Location: Sofia, Bulgaria

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this code, please cite our paper:"
title: "DuoSearch: A Novel Search Engine for Bulgarian Historical Documents"
authors:
  - family-names: Beshirov
    given-names: Angel
    orcid: https://orcid.org/0000-0002-0684-2730
  - family-names: Hadzhieva
    given-names: Suzan
    orcid: https://orcid.org/0000-0002-1480-1437
  - family-names: Dobreva
    given-names: Milena
    orcid: https://orcid.org/0000-0002-2579-7541
  - family-names: Koychev
    given-names: Ivan
    orcid: https://orcid.org/0000-0003-3919-030X
date-released: 2022-04-05
doi: 10.1007/978-3-030-99739-7_31
preferred-citation:
  type: article
  title: "DuoSearch: A Novel Search Engine for Bulgarian Historical Documents"
  authors:
    - family-names: Beshirov
      given-names: Angel
    - family-names: Hadzhieva
      given-names: Suzan
    - family-names: Dobreva
      given-names: Milena
    - family-names: Koychev
      given-names: Ivan
  journal: "Proceedings of the European Conference on Information Retrieval"
  year: 2022
  doi: 10.1007/978-3-030-99739-7_31

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2