contextual-spell-checker-for-bangla
Automatic Context-Sensitive Spelling Correction for Bangla Text Using BERT and Levenshtein Distance
https://github.com/mahirmahbub/contextual-spell-checker-for-bangla
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.0%) to scientific vocabulary
Repository
Statistics
- Stars: 20
- Watchers: 3
- Forks: 5
- Open Issues: 3
- Releases: 2
Metadata Files
README.md
Auto Progressive Contextual Spell Checker For Bangla
Automatic Progressive Context-Sensitive Spelling Correction for Bangla Text Using BERT and Levenshtein Distance.
- BERT masked model (added). Support for other models, for example an LSTM/GRU-based masked prediction model, will be added.
- BERT NER model (added)
- Levenshtein distance (added)
- Dictionary lookup (added): 451,742 unique words from the OSCAR 2019 dataset.
- Progressive spell checking with NER (added)
- New constraints applied while checking spelling (added)
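To illustrate how the Levenshtein distance and the dictionary lookup can work together, here is a minimal sketch. It assumes that the masked model returns a ranked list of candidate words and that a candidate is accepted only if it is in the dictionary and within a maximum edit distance of the original word. The function names `levenshtein` and `pick_correction` are hypothetical and are not taken from the repository.

```python
from typing import Iterable, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) over Unicode code points."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def pick_correction(word: str, candidates: Iterable[str],
                    dictionary: Set[str], max_distance: int) -> str:
    """Hypothetical selection rule: keep dictionary words as they are,
    otherwise take the first ranked candidate that is a dictionary word
    within max_distance edits; fall back to the original word."""
    if word in dictionary:
        return word
    for candidate in candidates:
        if candidate in dictionary and levenshtein(word, candidate) <= max_distance:
            return candidate
    return word


if __name__ == "__main__":
    # Toy dictionary and candidate list, for illustration only.
    print(pick_correction("গোছে", ["গেছে"], {"গেছে"}, max_distance=2))
```

The project's actual ranking and constraint logic may differ; the sketch only shows the general idea of filtering model suggestions by edit distance.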
Instructions
- Download a BERT masked model into "model/bangla-bert-base" (recommended: https://huggingface.co/sagorsarker/bangla-bert-base)
- Download a BERT NER model into "model/mbert-bengali-ner" (recommended: https://huggingface.co/sagorsarker/mbert-bengali-ner)
- Specify the controller class names for the BERT masked model and the BERT NER model in "config.json"
- Download the dictionary and place it in /data/output/
- Run app.py to start the API (based on FastAPI)
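One possible way to fetch the two recommended models into the directories named above is to use the `transformers` library. This is a sketch under assumptions: the helper `download_models` and the `MODELS` mapping are hypothetical, and the layout written by `save_pretrained` may not be exactly what the project's loader expects.

```python
from pathlib import Path

# Local directory -> (Hugging Face repo id, transformers model class name).
# The directories and repo ids come from the instructions above; the choice
# of Auto* classes is an assumption.
MODELS = {
    "model/bangla-bert-base": ("sagorsarker/bangla-bert-base", "AutoModelForMaskedLM"),
    "model/mbert-bengali-ner": ("sagorsarker/mbert-bengali-ner", "AutoModelForTokenClassification"),
}


def download_models(root: str = ".") -> None:
    """Download each tokenizer and model, then save them under `root`."""
    import transformers  # imported lazily so the module loads without it

    for local_dir, (repo_id, class_name) in MODELS.items():
        target = Path(root) / local_dir
        target.mkdir(parents=True, exist_ok=True)
        tokenizer = transformers.AutoTokenizer.from_pretrained(repo_id)
        model = getattr(transformers, class_name).from_pretrained(repo_id)
        tokenizer.save_pretrained(str(target))
        model.save_pretrained(str(target))


# Calling download_models() requires network access and downloads both models.
```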
Example:
```python
from source.spell_checker import SpellChecker

sentence = "পুলিশ আসা আগে ডাকাত পালিয়ে গোছে".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['পুলিশ', 'আসার', 'আগে', 'ডাকাত', 'পালিয়ে', 'গেছে']

sentence = "এক এলাকা সোলতা আহমেদের ছে আব্দুর রহমান (৩০)".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['একই', 'এলাকার', 'সোলতা', 'আহমেদের', 'ছেলে', 'আব্দুর', 'রহমান', '(৩০)']

sentence = "২০১৫ সালের নভেম্বরে প্রান্সে জলবায়ূ সসেলনে বিশেবর ২০০ দেশ অংশগ্রহণ করে".split(" ")
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['২০১৫', 'সালের', 'নভেম্বরে', 'ফ্রান্সের', 'জলবায়ূ', 'সম্মেলনে', 'বিশ্বের', '২০০', 'দেশ', 'অংশগ্রহণ', 'করে']

sentence = "পরে তাদসের উিপর হামলা করে এলোপাতাড়ি কুপাতে থাকে"
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['পরে', 'তাদের', 'উপর', 'হামলা', 'করে', 'এলোপাতাড়ি', 'কুপাতে', 'থাকে']

sentence = "তাূরা দেখেন ঢাকার দূই সিটিতে মশা মারতে যে ওষধূ ছিটানো হয় তা অকার্যকর"
print(SpellChecker().prediction(sentence=sentence, k=100))
# ['তারা', 'দেখে', 'ঢাকার', 'দুই', 'সিটিতে', 'মশা', 'মারতে', 'যে', 'ওষুধ', 'ছিটানো', 'হত', 'তা', 'অকার্যকর']
```
Results (based on 1.0.2-alpha)
The evaluation dataset was created with https://github.com/habibsifat/Algorithm-for-Bengali-Error-Dataset-Generation.
TP: correct words left unchanged / total correct words.
FN: correct words changed (wrongly) / total correct words.
FP: incorrect words left unchanged (an incorrect word marked as correct) / total incorrect words.
TN: incorrect words changed to the correct word / total incorrect words.
TN_PLUS: incorrect words changed to another incorrect word / total incorrect words.
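The five rates defined above can be computed from word-level triples, as in the following sketch. It assumes each evaluation item is a `(reference, noisy_input, prediction)` triple, and that TN_PLUS is normalised by the number of incorrect words (consistent with FP, TN, and TN_PLUS summing to 1 in the reported tables). The function name `evaluate` is hypothetical and not part of the repository.

```python
from typing import Dict, Iterable, Tuple


def evaluate(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute TP, FN, FP, TN and TN_PLUS rates from word-level triples."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0, "TN_PLUS": 0}
    n_correct = 0    # input words that were already correct
    n_incorrect = 0  # input words that were misspelled
    for reference, noisy, predicted in triples:
        if noisy == reference:
            n_correct += 1
            # Any change to a correct word counts as FN.
            counts["TP" if predicted == noisy else "FN"] += 1
        else:
            n_incorrect += 1
            if predicted == noisy:
                counts["FP"] += 1        # misspelling left untouched
            elif predicted == reference:
                counts["TN"] += 1        # misspelling fixed
            else:
                counts["TN_PLUS"] += 1   # changed, but to a wrong word
    rates = {}
    for name, value in counts.items():
        total = n_correct if name in ("TP", "FN") else n_incorrect
        rates[name] = value / total if total else 0.0
    return rates


if __name__ == "__main__":
    sample = [
        ("a", "a", "a"), ("b", "b", "x"),  # correct inputs
        ("c", "z", "z"), ("d", "y", "d"),  # incorrect inputs
        ("e", "w", "q"), ("f", "v", "f"),
    ]
    print(evaluate(sample))
```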
Results of Bangla BERT for different language models
| Model | Top N | TP | FN | FP | TN | TN_PLUS |
| :----------- | :----------- | :----------- | :----------- | :----------- | :----------- | :------------ |
| Sagor Sarkar | 50 | 0.9782 | 0.0218 | 0.4150 | 0.5017 | 0.0833 |
| NWP (W2V Skipgram) | 50 | 0.9879 | 0.0121 | 0.6612 | 0.2825 | 0.0563 |
Results of the spell checker based on Bangla BERT under different conditions
We ran the experiment with different values of the maximum edit distance (ml). Here mw is the probable misspelled word. The conditions are:
• C1: ml = mw's length // 2.
• C2: ml = mw's length // 2 if mw's length > 4, else ml = 2.
• C3: ml = mw's length // 2 if mw's length > 6, else ml = 2.
• C4: ml = mw's length // 2 if mw's length > 6, else ml = 3.
| Condition | TP | FN | FP | TN | TN_PLUS |
| :----------- | :----------- | :----------- | :----------- | :----------- | :----------- |
| C1 | 0.9837 | 0.0163 | 0.6779 | 0.3209 | 0.0012 |
| C2 | 0.9782 | 0.0218 | 0.4150 | 0.5017 | 0.0833 |
| C3 | 0.9776 | 0.0224 | 0.5534 | 0.4410 | 0.0056 |
| C4 | 0.9623 | 0.0377 | 0.6498 | 0.2010 | 0.1492 |
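The conditions C1 to C4 can be expressed as a small function, sketched below. It assumes the word length is measured with Python's `len()` (Unicode code points); the repository may measure length differently. The name `max_edit_distance` is hypothetical.

```python
def max_edit_distance(word: str, condition: str = "C2") -> int:
    """Return the maximum edit distance (ml) allowed for a probable
    misspelled word (mw) under conditions C1 to C4."""
    length = len(word)  # assumption: length counted in code points
    half = length // 2
    if condition == "C1":
        return half
    if condition == "C2":
        return half if length > 4 else 2
    if condition == "C3":
        return half if length > 6 else 2
    if condition == "C4":
        return half if length > 6 else 3
    raise ValueError(f"unknown condition: {condition!r}")


if __name__ == "__main__":
    for name in ("C1", "C2", "C3", "C4"):
        print(name, [max_edit_distance("x" * n, name) for n in range(1, 10)])
```

Under C1, very short words get a tiny budget (a two-character word allows only one edit), while C2 to C4 put a floor of 2 or 3 edits on short words.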
API
An API built on FastAPI is also included; start it by running app.py.
Owner
- Name: Mahir Mahbub
- Login: MahirMahbub
- Kind: user
- Location: Dhaka,Bangladesh
- Company: Bangabandhu Digital University, Bangladesh
- Website: https://www.linkedin.com/in/mahirmahbub/
- Repositories: 9
- Profile: https://github.com/MahirMahbub
Lecturer at Bangabandhu Digital University, Bangladesh. Former Software Engineer at iXora Solution Ltd. Studied at the University of Dhaka.
GitHub Events
Total
- Watch event: 1
- Push event: 1
Last Year
- Watch event: 1
- Push event: 1
Dependencies
- Cython ==0.29.23
- Pillow ==8.4.0
- PySocks ==1.7.1
- PyYAML ==6.0
- anyio ==3.5.0
- asgiref ==3.5.0
- brotlipy ==0.7.0
- certifi ==2021.10.8
- cffi ==1.14.6
- charset-normalizer ==2.0.4
- click ==8.0.3
- colorama ==0.4.4
- cryptography ==36.0.0
- dpcpp-cpp-rt ==2022.0.3
- fastapi ==0.75.1
- fastapi-camelcase ==1.0.5
- filelock ==3.4.0
- gensim ==4.1.2
- h11 ==0.13.0
- httptools ==0.4.0
- huggingface-hub ==0.2.1
- idna ==3.3
- importlib-metadata ==4.8.2
- intel-cmplr-lib-rt ==2022.0.3
- intel-cmplr-lic-rt ==2022.0.3
- intel-opencl-rt ==2022.0.3
- intel-openmp ==2022.0.3
- joblib ==1.1.0
- mkl ==2022.0.3
- mkl-fft ==1.3.1
- mkl-random ==1.2.2
- mkl-service ==2.4.0
- numpy ==1.21.5
- olefile ==0.46
- packaging ==21.3
- pyOpenSSL ==21.0.0
- pycparser ==2.21
- pydantic ==1.9.0
- pyhumps ==3.5.3
- pyparsing ==3.0.4
- python-dotenv ==0.20.0
- regex ==2021.8.3
- requests ==2.27.1
- sacremoses ==0.0.43
- scipy ==1.8.0
- six ==1.16.0
- smart-open ==5.2.1
- sniffio ==1.2.0
- starlette ==0.17.1
- tbb ==2021.5.2
- tokenizers ==0.10.3
- torch ==1.8.1
- torchaudio ==0.8.1
- torchvision ==0.9.1
- tqdm ==4.62.3
- transformers ==4.14.1
- typing-extensions ==3.10.0.2
- urllib3 ==1.26.7
- uvicorn ==0.17.6
- watchgod ==0.8.2
- websockets ==10.2
- wincertstore ==0.2
- zipp ==3.7.0
- actions/checkout v3 composite
- actions/setup-python v3 composite