contextual-spell-checker-for-bangla

Automatic Context-Sensitive Spelling Correction for Bangla Text Using BERT and Levenshtein Distance

https://github.com/mahirmahbub/contextual-spell-checker-for-bangla

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.0%) to scientific vocabulary

Keywords

bangla-bert bangla-nlp bert fastapi levenshtein-distance ner nlp spellcheck spelling-correction

Repository

Automatic Context-Sensitive Spelling Correction for Bangla Text Using BERT and Levenshtein Distance

Basic Info
  • Host: GitHub
  • Owner: MahirMahbub
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 146 KB
Statistics
  • Stars: 20
  • Watchers: 3
  • Forks: 5
  • Open Issues: 3
  • Releases: 2
Created about 4 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

Auto Progressive Contextual Spell Checker For Bangla

Automatic Progressive Context-Sensitive Spelling Correction for Bangla Text Using BERT and Levenshtein Distance.

  • BERT masked language model (added). Support for other models, for example LSTM/GRU-based masked prediction models, will be added.

  • BERT NER model (added)

  • Levenshtein distance (added)

  • Dictionary lookup (added), with 451,742 unique words from the OSCAR 2019 dataset.

  • Progressive spell checking with NER (added)

  • Additional constraints applied while checking the spelling (added)
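How masked-model candidates, the dictionary, and Levenshtein distance can work together is sketched below. This is a minimal illustration under assumptions, not the repository's code: `levenshtein` and `pick_correction` are hypothetical names, the candidate list stands in for the top-k predictions of a masked language model, and distances are counted over Unicode code points rather than grapheme clusters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) over Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def pick_correction(word, candidates, dictionary, max_distance):
    """Hypothetical helper: keep a dictionary word, otherwise take the first
    ranked candidate that is in the dictionary and close enough to the word."""
    if word in dictionary:
        return word
    for candidate in candidates:  # assumed to be ordered by model score
        if candidate in dictionary and levenshtein(word, candidate) <= max_distance:
            return candidate
    return word  # nothing suitable found: leave the word unchanged


# Toy data; a real run would use the masked model's top-k list and the full dictionary.
dictionary = {"গেছে", "গেল"}
print(pick_correction("গোছে", ["গেল", "গেছে"], dictionary, max_distance=2))  # → গেছে
```

Filtering ranked candidates by edit distance keeps the context-aware suggestion close to what was actually typed; the threshold used here is arbitrary.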

Instructions

  • Download a BERT masked model into "model/bangla-bert-base" (recommended: https://huggingface.co/sagorsarker/bangla-bert-base)
  • Download a BERT NER model into "model/mbert-bengali-ner" (recommended: https://huggingface.co/sagorsarker/mbert-bengali-ner)
  • Specify the controller class names of the BERT masked model and the BERT NER model in "config.json"
  • Download the dictionary and place it in /data/output/
  • Run app.py to start the API (based on FastAPI)

Example:

```python
from source.spell_checker import SpellChecker

# Each call is followed by the output reported for it.
sentence = "পুলিশ আসা আগে ডাকাত পালিয়ে গোছে".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['পুলিশ', 'আসার', 'আগে', 'ডাকাত', 'পালিয়ে', 'গেছে']

sentence = "এক এলাকা সোলতা আহমেদের ছে আব্দুর রহমান (৩০)".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['একই', 'এলাকার', 'সোলতা', 'আহমেদের', 'ছেলে', 'আব্দুর', 'রহমান', '(৩০)']

sentence = "২০১৫ সালের নভেম্বরে প্রান্সে জলবায়ূ সসেলনে বিশেবর ২০০ দেশ অংশগ্রহণ করে".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['২০১৫', 'সালের', 'নভেম্বরে', 'ফ্রান্সের', 'জলবায়ূ', 'সম্মেলনে', 'বিশ্বের', '২০০', 'দেশ', 'অংশগ্রহণ', 'করে']

# The next two examples pass the sentence as an unsplit string, as in the original.
sentence = "পরে তাদসের উিপর হামলা করে এলোপাতাড়ি কুপাতে থাকে"
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['পরে', 'তাদের', 'উপর', 'হামলা', 'করে', 'এলোপাতাড়ি', 'কুপাতে', 'থাকে']

sentence = "তাূরা দেখেন ঢাকার দূই সিটিতে মশা মারতে যে ওষধূ ছিটানো হয় তা অকার্যকর"
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['তারা', 'দেখে', 'ঢাকার', 'দুই', 'সিটিতে', 'মশা', 'মারতে', 'যে', 'ওষুধ', 'ছিটানো', 'হত', 'তা', 'অকার্যকর']
```

Results (based on 1.0.2-alpha)

The evaluation dataset was created with https://github.com/habibsifat/Algorithm-for-Bengali-Error-Dataset-Generation.

TP: Correct words left unchanged / total correct words.

FN: Correct words changed (incorrectly) / total correct words.

FP: Incorrect words left unchanged (marked as correct) / total incorrect words.

TN: Incorrect words changed to the right word / total incorrect words.

TN_PLUS: Incorrect words changed to another wrong word / total incorrect words.
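The five rates can be computed from word-level triples, as in the sketch below. It is illustrative only: the `evaluate` function and the `(gold, given, predicted)` triple layout are assumptions, and dividing TN_PLUS by the number of incorrect words is inferred from the reported FP, TN, and TN_PLUS values summing to 1.

```python
from collections import Counter


def evaluate(triples):
    """Compute TP, FN, FP, TN, TN_PLUS rates.

    triples: iterable of (gold, given, predicted) words, where `given` is the
    possibly misspelled input word and `predicted` is the checker's output.
    """
    counts = Counter()
    for gold, given, predicted in triples:
        if given == gold:  # the input word was already correct
            counts["correct_total"] += 1
            counts["TP" if predicted == given else "FN"] += 1
        else:              # the input word was misspelled
            counts["incorrect_total"] += 1
            if predicted == given:
                counts["FP"] += 1       # left unchanged
            elif predicted == gold:
                counts["TN"] += 1       # fixed correctly
            else:
                counts["TN_PLUS"] += 1  # changed to another wrong word

    def rate(key, total_key):
        total = counts[total_key]
        return counts[key] / total if total else 0.0

    return {
        "TP": rate("TP", "correct_total"),
        "FN": rate("FN", "correct_total"),
        "FP": rate("FP", "incorrect_total"),
        "TN": rate("TN", "incorrect_total"),
        "TN_PLUS": rate("TN_PLUS", "incorrect_total"),
    }


# Placeholder words, only to show the call shape.
print(evaluate([("a", "a", "a"), ("b", "x", "b"), ("c", "y", "y")]))
```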

Results for different language models

| Model | Top N | TP | FN | FP | TN | TN_PLUS |
| :----------- | :----------- | :----------- | :----------- | :----------- | :----------- | :------------ |
| Sagor Sarkar | 50 | 0.9782 | 0.0218 | 0.4150 | 0.5017 | 0.0833 |
| NWP(W2V Skipgram) | 50 | 0.9879 | 0.0121 | 0.6612 | 0.2825 | 0.0563 |

Results of the Bangla BERT based spell checker under different conditions

We ran the experiment with different values of the maximum edit distance (ml). The conditions are given below, where mw is the probable misspelled word:

C1: ml = mw’s length // 2.

C2: ml = mw’s length // 2 if mw’s length > 4, else ml = 2.

C3: ml = mw’s length // 2 if mw’s length > 6, else ml = 2.

C4: ml = mw’s length // 2 if mw’s length > 6, else ml = 3.
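The four conditions translate directly into a small function. The sketch below is an assumption-laden illustration: `max_edit_distance` is a hypothetical name, and word length is taken as `len()` of the Python string (Unicode code points), which may differ from how the repository measures it.

```python
def max_edit_distance(word: str, condition: str = "C2") -> int:
    """Maximum allowed edit distance (ml) for a probable misspelled word (mw)."""
    length = len(word)  # assumption: length counted in Unicode code points
    half = length // 2
    if condition == "C1":
        return half
    if condition == "C2":
        return half if length > 4 else 2
    if condition == "C3":
        return half if length > 6 else 2
    if condition == "C4":
        return half if length > 6 else 3
    raise ValueError(f"unknown condition: {condition}")


# For a 6-character word the conditions give C1=3, C2=3, C3=2, C4=3.
for name in ("C1", "C2", "C3", "C4"):
    print(name, max_edit_distance("abcdef", name))
```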

| Condition | TP | FN | FP | TN | TN_PLUS |
| :----------- | :----------- | :----------- | :----------- | :----------- | :----------- |
| C1 | 0.9837 | 0.0163 | 0.6779 | 0.3209 | 0.0012 |
| C2 | 0.9782 | 0.0218 | 0.4150 | 0.5017 | 0.0833 |
| C3 | 0.9776 | 0.0224 | 0.5534 | 0.4410 | 0.0056 |
| C4 | 0.9623 | 0.0377 | 0.6498 | 0.2010 | 0.1492 |

API

An API is also included; run app.py to start it (based on FastAPI).

Owner

  • Name: Mahir Mahbub
  • Login: MahirMahbub
  • Kind: user
  • Location: Dhaka,Bangladesh
  • Company: Bangabandhu Digital University, Bangladesh

Lecturer at Bangabandhu Digital University, Bangladesh. Former Software Engineer at iXora Solution Ltd. Studied at the University of Dhaka.

GitHub Events

Total
  • Watch event: 1
  • Push event: 1
Last Year
  • Watch event: 1
  • Push event: 1

Dependencies

requirements.txt pypi
  • Cython ==0.29.23
  • Pillow ==8.4.0
  • PySocks ==1.7.1
  • PyYAML ==6.0
  • anyio ==3.5.0
  • asgiref ==3.5.0
  • brotlipy ==0.7.0
  • certifi ==2021.10.8
  • cffi ==1.14.6
  • charset-normalizer ==2.0.4
  • click ==8.0.3
  • colorama ==0.4.4
  • cryptography ==36.0.0
  • dpcpp-cpp-rt ==2022.0.3
  • fastapi ==0.75.1
  • fastapi-camelcase ==1.0.5
  • filelock ==3.4.0
  • gensim ==4.1.2
  • h11 ==0.13.0
  • httptools ==0.4.0
  • huggingface-hub ==0.2.1
  • idna ==3.3
  • importlib-metadata ==4.8.2
  • intel-cmplr-lib-rt ==2022.0.3
  • intel-cmplr-lic-rt ==2022.0.3
  • intel-opencl-rt ==2022.0.3
  • intel-openmp ==2022.0.3
  • joblib ==1.1.0
  • mkl ==2022.0.3
  • mkl-fft ==1.3.1
  • mkl-random ==1.2.2
  • mkl-service ==2.4.0
  • numpy ==1.21.5
  • olefile ==0.46
  • packaging ==21.3
  • pyOpenSSL ==21.0.0
  • pycparser ==2.21
  • pydantic ==1.9.0
  • pyhumps ==3.5.3
  • pyparsing ==3.0.4
  • python-dotenv ==0.20.0
  • regex ==2021.8.3
  • requests ==2.27.1
  • sacremoses ==0.0.43
  • scipy ==1.8.0
  • six ==1.16.0
  • smart-open ==5.2.1
  • sniffio ==1.2.0
  • starlette ==0.17.1
  • tbb ==2021.5.2
  • tokenizers ==0.10.3
  • torch ==1.8.1
  • torchaudio ==0.8.1
  • torchvision ==0.9.1
  • tqdm ==4.62.3
  • transformers ==4.14.1
  • typing-extensions ==3.10.0.2
  • urllib3 ==1.26.7
  • uvicorn ==0.17.6
  • watchgod ==0.8.2
  • websockets ==10.2
  • wincertstore ==0.2
  • zipp ==3.7.0
.github/workflows/python-app.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite