amharic_ocr

Amharic OCR based on MMOCR

https://github.com/dikubab/amharic_ocr

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.8%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: dikubab
License: apache-2.0
Language: Python
Default Branch: main
Size: 19.9 MB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme, License, Citation

Detection and Recognition of Amharic Scene Text using the MMOCR toolbox

This is an open-source toolbox based on MMOCR. For installation details and related information, see https://github.com/open-mmlab/mmocr.

In general, the Geʽez (Ethiopic) script, an abugida, has up to 519 characters. For Amharic, we use 289 to 319 characters, depending on whether Ethiopic numerals and punctuation are included.
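To give a sense of where such counts come from, the following minimal sketch enumerates the main Ethiopic Unicode block (U+1200 to U+137F) with Python's standard `unicodedata` module, optionally including numerals and punctuation. The function name `ethiopic_block_chars` and the restriction to a single Unicode block are assumptions made for illustration. The result is not the Amharic character set used by this repository, which is a curated subset.

```python
import unicodedata

# Main Ethiopic block in Unicode; the extended blocks are deliberately ignored here.
ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F


def ethiopic_block_chars(include_numerals_and_punctuation=False):
    """Return the assigned characters of the main Ethiopic block as a list.

    Hypothetical helper for illustration only; not part of MMOCR or this repo.
    """
    chars = []
    for code_point in range(ETHIOPIC_START, ETHIOPIC_END + 1):
        ch = chr(code_point)
        category = unicodedata.category(ch)
        if category.startswith("L"):
            chars.append(ch)  # syllables (letters)
        elif include_numerals_and_punctuation and category in ("No", "Po"):
            chars.append(ch)  # Ethiopic numerals and punctuation
    return chars


letters_only = ethiopic_block_chars()
with_extras = ethiopic_block_chars(include_numerals_and_punctuation=True)
print(len(letters_only), len(with_extras))
```

Filtering by Unicode general category keeps the sketch independent of any hand-written character table, at the cost of including syllables that Amharic itself does not use.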

Amharic Text Detection dataset preprocessing

We have two datasets for the detection task. HUST-ART is the real-world dataset, and HUST-AST is the synthetic dataset. HUST-ART consists of 1500 training images and 700 test images. HUST-AST comprises 75,904 training images. To convert the dataset labels to MMOCR format, use tools/data/textdet/icdar_converter.py as follows:

python tools/data/textdet/icdar_converter.py det_datasets/HUST-ART -o det_datasets/HUST-ART -d icdar2015 --split-list training test
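The `-d icdar2015` flag suggests the labels follow the ICDAR 2015 text-file layout, in which each line holds eight comma-separated corner coordinates followed by a transcription. As a rough sketch of what a converter has to do with one such line (the function name `parse_icdar2015_line`, the sample line, and the use of `###` as an ignore marker for HUST-ART are assumptions; this is not the code in `icdar_converter.py`):

```python
def parse_icdar2015_line(line):
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,text' into a polygon, bbox and transcription."""
    # Strip surrounding whitespace and a possible UTF-8 byte-order mark.
    parts = line.strip().lstrip("\ufeff").split(",", 8)
    if len(parts) < 9:
        raise ValueError(f"expected 8 coordinates and a transcription, got: {line!r}")
    coords = [int(float(value)) for value in parts[:8]]
    text = parts[8]  # may itself contain commas, hence the split limit above
    xs, ys = coords[0::2], coords[1::2]
    return {
        "segmentation": [coords],
        # Axis-aligned box as [x, y, width, height].
        "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
        "text": text,
        "ignore": text == "###",
    }


sample = "10,20,110,20,110,60,10,60,ሰላም"
print(parse_icdar2015_line(sample))
```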

Amharic Text Recognition

We have two training sets and two test sets. The Tana (TN) and Waliya (WL) training sets consist of 2.85M and 6M cropped words, respectively. The HUST-ART and ABE test sets consist of 4039 and 5218 text images, respectively. We also have a validation dataset of 14835 text images, drawn from the training parts of HUST-ART and ABE. All five datasets are in LMDB format.
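Recognition LMDB sets are commonly laid out with a `num-samples` key plus zero-padded `image-%09d` and `label-%09d` keys. Whether these five datasets use exactly that layout is an assumption here, as are the helper names below. Under that assumption, a minimal reader sketch could look like this. It accepts any already-opened environment that exposes `begin()`, such as one returned by `lmdb.open(path, readonly=True, lock=False)`.

```python
def sample_keys(index):
    """Build the (image, label) keys for a 1-based sample index (assumed layout)."""
    return f"image-{index:09d}".encode(), f"label-{index:09d}".encode()


def read_sample(env, index):
    """Return (image_bytes, label_text) for one sample from an opened LMDB env."""
    image_key, label_key = sample_keys(index)
    with env.begin(write=False) as txn:
        image_bytes = txn.get(image_key)
        label_bytes = txn.get(label_key)
    if image_bytes is None or label_bytes is None:
        raise KeyError(f"sample {index} not found")
    return bytes(image_bytes), bytes(label_bytes).decode("utf-8")


def count_samples(env):
    """Read the total number of samples from the assumed 'num-samples' key."""
    with env.begin(write=False) as txn:
        value = txn.get(b"num-samples")
    if value is None:
        raise KeyError("num-samples key missing")
    return int(bytes(value).decode("utf-8"))
```

With the `lmdb` package installed, calling `read_sample(env, 1)` on an environment opened from one of the downloaded folders would return the first encoded image and its label, provided the key layout matches.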

Toolbox usage

1. In the directory configs/_base_/recog_pipelines/, there are several pipelines; in each one, change dict(type='LoadImageFromFile') to dict(type='LoadImageFromLMDB').
2. In the directory configs/_base_/recog_datasets/, modify the paths of the test and train datasets.
3. In mmocr/models/textrecog/convertors/base.py, define the dictionary using the 314 Amharic characters. We have already made this change. Based on your character set, modify dict_type in all other related files. We have modified the configs/textrecog/satrn/satrn_small.py settings, which you can use as an example.

The datasets for both detection and recognition can be downloaded from https://github.com/dikubab/HUST-ASTD/blob/main/index.md.
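Switching a recognition pipeline to LMDB input amounts to replacing its loading transform. MMOCR configs are plain Python, so the change can be sketched as below. The helper `use_lmdb_loader` and the example pipeline contents are illustrative assumptions, not the repository's actual config.

```python
import copy


def use_lmdb_loader(pipeline):
    """Return a copy of a pipeline with LoadImageFromFile replaced by LoadImageFromLMDB."""
    new_pipeline = copy.deepcopy(pipeline)
    for step in new_pipeline:
        if step.get("type") == "LoadImageFromFile":
            step["type"] = "LoadImageFromLMDB"
    return new_pipeline


# Illustrative pipeline; the second step stands in for whatever follows the loader,
# and its parameter values are placeholders.
train_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(type="ResizeOCR", height=32, min_width=100, max_width=100,
         keep_aspect_ratio=False),
]

lmdb_pipeline = use_lmdb_loader(train_pipeline)
print(lmdb_pipeline[0])  # {'type': 'LoadImageFromLMDB'}
```

In practice the edit is usually made by hand in each pipeline file; the helper only makes the before and after explicit.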

The Waliya-related LMDB dataset link will be provided very soon.

1. Test and Validation sets LMDB: https://mega.nz/folder/Ub0SnBBa#Fh6pFqbvXVxsa7OJPfJEwA
2. Tana (TN) LMDB: https://mega.nz/folder/NGcC1DaQ#soagog8p_LgOnm6Gx9wdCQ

Citation

If you find our datasets useful in your research, please consider citing:

```bibtex
@article{dikubab2022comprehensive,
  title={Comprehensive benchmark datasets for Amharic scene text detection and recognition},
  author={Dikubab, Wondimu and Liang, Dingkang and Liao, Minghui and Bai, Xiang},
  journal={Science China Information Sciences, Vol. 65, Special Focus on Deep Learning for Computer Vision, Article number: 160106},
  year={2022}
}
```

License

This project is released under the Apache 2.0 license.

Owner

Name: Dikubab
Login: dikubab
Kind: user

Repositories: 1
Profile: https://github.com/dikubab

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "OpenMMLab Text Detection, Recognition and Understanding Toolbox"
authors:
  - name: "MMOCR Contributors"
version: 0.3.0
date-released: 2020-08-15
repository-code: "https://github.com/open-mmlab/mmocr"
license: Apache-2.0

GitHub Events

Total

Watch event: 1
Fork event: 2

Last Year

Watch event: 1
Fork event: 2

Dependencies

docker/Dockerfile docker

pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build

docker/serve/Dockerfile docker

pytorch/pytorch ${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel build

docs/en/requirements.txt pypi

recommonmark *
sphinx *
sphinx_markdown_tables *
sphinx_rtd_theme *

requirements/albu.txt pypi

albumentations >=1.1.0

requirements/build.txt pypi

numpy *
pyclipper *
torch >=1.1

requirements/docs.txt pypi

docutils ==0.16.0
markdown >=3.4.0
myst-parser *
sphinx ==4.0.2
sphinx_copybutton *
sphinx_markdown_tables >=0.0.16

requirements/mminstall.txt pypi

mmcv-full >=1.3.8,<1.7.0
mmdet >=2.21.0,<3.0.0

requirements/readthedocs.txt pypi

imgaug *
kwarray *
lanms-neo ==1.0.2
lmdb *
matplotlib *
mmcv *
mmdet *
pyclipper *
rapidfuzz >=2.0.0
regex *
scikit-image *
scipy *
shapely *
titlecase *
torch *
torchvision *

requirements/runtime.txt pypi

imgaug *
lanms-neo ==1.0.2
lmdb *
matplotlib *
numpy *
opencv-python >=4.2.0.32,
pyclipper *
pycocotools *
rapidfuzz >=2.0.0
scikit-image *

requirements/tests.txt pypi

asynctest * test
codecov * test
flake8 * test
isort * test
kwarray * test
pytest * test
pytest-cov * test
pytest-runner * test
ubelt * test
xdoctest >=0.10.0 test
yapf * test

requirements/optional.txt pypi

requirements.txt pypi

setup.py pypi
