contextualspellcheck

✔️Contextual word checker for better suggestions (not actively maintained)

https://github.com/r1j1t/contextualspellcheck

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary

Keywords

bert chatbot help-wanted natural-language-processing nlp oov preprocessing python python-spelling-corrector spacy spacy-extension spellcheck spellchecker spelling-correction spelling-corrections
Last synced: 4 months ago

Repository

✔️Contextual word checker for better suggestions (not actively maintained)

Basic Info
  • Host: GitHub
  • Owner: R1j1t
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 2.45 MB
Statistics
  • Stars: 417
  • Watchers: 9
  • Forks: 64
  • Open Issues: 9
  • Releases: 15
Topics
bert chatbot help-wanted natural-language-processing nlp oov preprocessing python python-spelling-corrector spacy spacy-extension spellcheck spellchecker spelling-correction spelling-corrections
Created over 5 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

spellCheck

Contextual word checker for better suggestions


Types of spelling mistakes

It is essential to understand that identifying whether a candidate is a spelling error is a big task.

Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE; otherwise it is an NWE.

-- Monojit Choudhury et al. (2007)

This package currently focuses on Out of Vocabulary (OOV) word, or non-word error (NWE), correction using a BERT model. For example, "milion" is an NWE, whereas "there" typed in place of "their" is an RWE. The idea behind using BERT is to use the context of the sentence when correcting OOV words. To improve this package, I would like to extend the functionality to identify RWE, optimise the package, and improve the documentation.

Install

The package can be installed using pip. You will need Python 3.6+.

```bash
pip install contextualSpellCheck
```
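
The usage examples below load spaCy's small English model. If it is not installed already, the following is a minimal sketch for fetching it from Python (equivalent to running `python -m spacy download en_core_web_sm` on the command line):

```python
# Optional: download the small English spaCy model used in the examples below.
# Equivalent to `python -m spacy download en_core_web_sm` in a shell.
import spacy.cli

spacy.cli.download("en_core_web_sm")
```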

Usage

Note: For usage in other languages, check the examples folder.

How to load the package in the spaCy pipeline

```python
>>> import contextualSpellCheck
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")

>>> # We require NER to identify if a token is a PERSON
>>> # also require parser because we use Token.sent for context
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> contextualSpellCheck.add_to_pipe(nlp)
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']

>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
'Income was $9.4 million compared to the prior year of $2.7 million.'
```

Or you can add it to the spaCy pipeline manually!

```python
>>> import spacy
>>> import contextualSpellCheck

>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']

>>> # You can pass optional parameters to the contextual spellchecker,
>>> # e.g. set the maximum edit distance with config={"max_edit_dist": 3}
>>> nlp.add_pipe("contextual spellchecker")
>>> nlp.pipe_names
['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']

>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
```

After adding the contextual spellchecker to the pipeline, you use the pipeline as usual. The spell-check suggestions and other data can be accessed via the extensions.

Using the pipeline

```python
>>> doc = nlp(u'Income was $9.4 milion compared to the prior year of $2.7 milion.')

>>> # Doc extensions
>>> print(doc._.contextual_spellCheck)
True
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.suggestions_spellCheck)
{milion: 'million', milion: 'million'}
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.
>>> print(doc._.score_spellCheck)
{milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], milion: [('billion', 0.65934), ('million', 0.26185), ('trillion', 0.05391), ('##M', 0.0051), ('Million', 0.00425), ('##B', 0.00268), ('USD', 0.00153), ('##b', 0.00077), ('millions', 0.00059), ('%', 0.00041)]}

>>> # Token extensions
>>> print(doc[4]._.get_require_spellCheck)
True
>>> print(doc[4]._.get_suggestion_spellCheck)
'million'
>>> print(doc[4]._.score_spellCheck)
[('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)]

>>> # Span extensions
>>> print(doc[2:6]._.get_has_spellCheck)
True
>>> print(doc[2:6]._.score_spellCheck)
{$: [], 9.4: [], milion: [('million', 0.59422), ('billion', 0.24349), (',', 0.08809), ('trillion', 0.01835), ('Million', 0.00826), ('%', 0.00672), ('##M', 0.00591), ('annually', 0.0038), ('##B', 0.00205), ('USD', 0.00113)], compared: []}
```

Extensions

To make usage easy, contextualSpellCheck provides custom spaCy extensions which your code can consume, at the doc, span and token level. The tables below summarise the extensions; a short usage sketch follows them.

spaCy.Doc level extensions

| Extension | Type | Description | Default |
|-----------|------|-------------|---------|
| doc._.contextual_spellCheck | Boolean | To check whether contextualSpellCheck is added as an extension | True |
| doc._.performed_spellCheck | Boolean | To check whether contextualSpellCheck identified any misspells and performed correction | False |
| doc._.suggestions_spellCheck | {spaCy.Token: str} | If corrections are performed, returns the mapping of each misspelled token (spaCy.Token) to the suggested word (str) | {} |
| doc._.outcome_spellCheck | str | Corrected sentence (str) as output | "" |
| doc._.score_spellCheck | {spaCy.Token: List(str, float)} | If corrections are identified, returns the mapping of each misspelled token (spaCy.Token) to the suggested words (str) and the probability of each correction | None |

spaCy.Span level extensions

| Extension | Type | Description | Default |
|-----------|------|-------------|---------|
| span._.get_has_spellCheck | Boolean | To check whether contextualSpellCheck identified any misspells and performed correction in this span | False |
| span._.score_spellCheck | {spaCy.Token: List(str, float)} | If corrections are identified, returns the mapping of each misspelled token (spaCy.Token) in this span to the suggested words (str) and the probability of each correction | {spaCy.Token: []} |

spaCy.Token level extensions

| Extension | Type | Description | Default |
|-----------|------|-------------|---------|
| token._.get_require_spellCheck | Boolean | To check whether contextualSpellCheck identified a misspell and performed correction on this token | False |
| token._.get_suggestion_spellCheck | str | If a correction is performed, returns the suggested word (str) | "" |
| token._.score_spellCheck | [(str, float)] | If corrections are identified, returns the suggested words (str) and the probability (float) of each correction | [] |
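
As the usage sketch referenced above, here is one way your code might consume the Doc-level extensions, assuming the pipeline built in the earlier examples:

```python
# A minimal sketch that consumes the Doc-level extensions listed above.
import spacy
import contextualSpellCheck

nlp = spacy.load("en_core_web_sm")
contextualSpellCheck.add_to_pipe(nlp)

doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")

if doc._.performed_spellCheck:
    # suggestions_spellCheck maps each misspelled token to its suggested replacement
    for token, suggestion in doc._.suggestions_spellCheck.items():
        print(f"{token.text} -> {suggestion}")
    print("Corrected:", doc._.outcome_spellCheck)
```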

API

At present, there is a simple GET API to get you started. You can run the app locally and play with it.

Query: You can use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY
Note: Your browser can handle the text encoding.

GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.

Response:

json { "success": true, "input": "Income was $9.4 milion compared to the prior year of $2.7 milion.", "corrected": "Income was $9.4 milion compared to the prior year of $2.7 milion.", "suggestion_score": { "milion": [ [ "million", 0.59422 ], [ "billion", 0.24349 ], ... ], "milion:1": [ [ "billion", 0.65934 ], [ "million", 0.26185 ], ... ] } }

Task List

  • [ ] use cython for part of the code to improve performance (#39)
  • [ ] Improve metric for candidate selection (#40)
  • [ ] Add examples for other languages (#41)
  • [ ] Update the logic of misspell identification (OOV) (#44)
  • [ ] better candidate generation (solved by #44?)
  • [ ] add metric by testing on datasets
  • [ ] Improve documentation
  • [ ] Improve logging in code
  • [ ] Add support for Real Word Error (RWE) (Big Task)
  • [ ] add multi mask out capability
Completed Tasks

- [x] specify maximum edit distance for `candidateRanking`
- [x] allow user to specify bert model
- [x] Include transformers deTokenizer to get better suggestions
- [x] dependency version in setup.py ([#38](https://github.com/R1j1t/contextualSpellCheck/issues/38))

Support and contribution

If you like the project, please ⭑ it to show your support! If you feel the current behaviour is not as expected, feel free to raise an issue. If you can help with any of the above tasks, please open a PR with the necessary changes to documentation and tests.

Cite

If you are using contextualSpellCheck in your academic work, please consider citing the library using the BibTeX entry below:

```bibtex
@misc{Goel_Contextual_Spell_Check_2021,
  author = {Goel, Rajat},
  doi = {10.5281/zenodo.4642379},
  month = {3},
  title = {{Contextual Spell Check}},
  url = {https://github.com/R1j1t/contextualSpellCheck},
  year = {2021}
}
```

Reference

Below are some of the projects and papers I referred to while developing this package:

  1. Explosion AI. Architecture. May 2020. URL: https://spacy.io/api.
  2. Monojit Choudhury et al. "How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach". In: arXiv preprint physics/0703198 (2007).
  3. Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
  4. Hugging Face. Fast Coreference Resolution in spaCy with Neural Networks. May 2020. URL: https://github.com/huggingface/neuralcoref.
  5. Ines. Chapter 3: Processing Pipelines. May 2020. URL: https://course.spacy.io/en/chapter3.
  6. Eric Mays, Fred J Damerau, and Robert L Mercer. "Context based spelling correction". In: Information Processing & Management 27.5 (1991), pp. 517–522.
  7. Peter Norvig. How to Write a Spelling Corrector. May 2020. URL: http://norvig.com/spell-correct.html.
  8. Yifu Sun and Haoming Jiang. Contextual Text Denoising with Masked Language Models. 2019. arXiv: 1910.14080 [cs.CL].
  9. Thomas Wolf et al. "Transformers: State-of-the-Art Natural Language Processing". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, Oct. 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.

Owner

  • Name: Rajat
  • Login: R1j1t
  • Kind: user

Computational Chemist (in the making) | Contact me: r1j1t [at] pm.me

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Goel"
  given-names: "Rajat"
  orcid: "https://orcid.org/0000-0002-8051-9798"
title: "Contextual Spell Check"
version: 0.4.1
doi: 10.5281/zenodo.4642379
date-released: 2021-03-1
url: "https://github.com/R1j1t/contextualSpellCheck"

GitHub Events

Total
  • Issues event: 1
  • Watch event: 11
  • Delete event: 2
  • Issue comment event: 12
  • Push event: 1
  • Pull request event: 5
  • Fork event: 2
Last Year
  • Issues event: 1
  • Watch event: 11
  • Delete event: 2
  • Issue comment event: 12
  • Push event: 1
  • Pull request event: 5
  • Fork event: 2

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 149
  • Total Committers: 7
  • Avg Commits per committer: 21.286
  • Development Distribution Score (DDS): 0.134
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
R1j1t 2****t 129
dc-aichara d****7@g****m 12
Alvarole l****7@l****m 4
tpanza 1****a 1
manooka 7****m 1
Nikita Sobolev m****l@s****e 1
Adheeshk13 1****3 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 44
  • Total pull requests: 52
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 37
  • Total pull request authors: 12
  • Average comments per issue: 3.2
  • Average comments per pull request: 1.35
  • Merged pull requests: 42
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 2
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 months
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 4.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • R1j1t (5)
  • nicno90 (2)
  • AlvaroCavalcante (2)
  • gaurav0804 (2)
  • KennethEnevoldsen (1)
  • xei (1)
  • virdiprateek (1)
  • Xiaoping777 (1)
  • geeky-programer (1)
  • yishairasowsky (1)
  • BradenAnderson (1)
  • Jonathanpro (1)
  • CamilleSchr (1)
  • tpanza (1)
  • wushixian (1)
Pull Request Authors
  • R1j1t (37)
  • AlvaroCavalcante (2)
  • dependabot[bot] (2)
  • dc-aichara (2)
  • fingoldo (2)
  • it176131 (2)
  • adkiem (1)
  • tpanza (1)
  • sobolevn (1)
  • Adheeshk13 (1)
  • jonmun (1)
  • maxbachmann (1)
Top Labels
Issue Labels
wontfix (13) bug (13) enhancement (10) usage (8) documentation (7) help wanted (4) duplicate (3) good first issue (2) question (1)
Pull Request Labels
documentation (9) enhancement (6) wontfix (5) bug (4) dependencies (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 6,732 last-month
  • Total docker downloads: 8
  • Total dependent packages: 1
  • Total dependent repositories: 4
  • Total versions: 18
  • Total maintainers: 1
pypi.org: contextualspellcheck

Contextual spell correction using BERT (bidirectional representations)

  • Versions: 18
  • Dependent Packages: 1
  • Dependent Repositories: 4
  • Downloads: 6,732 Last month
  • Docker Downloads: 8
Rankings
Downloads: 2.7%
Stargazers count: 3.4%
Docker downloads count: 4.1%
Average: 4.7%
Dependent packages count: 4.8%
Forks count: 5.9%
Dependent repos count: 7.5%
Maintainers (1)
Last synced: 4 months ago

Dependencies

requirements.txt pypi
  • black ==20.8b1
  • editdistance ==0.5.3
  • flake8 >=3.8.3
  • pytest *
  • spacy >=3.0.0
  • torch >=1.4
  • transformers >=4.0.0
setup.py pypi
  • editdistance ==0.5.3
  • spacy >=3.0.0
  • torch >=1.4
  • transformers >=4.0.0
.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v1 composite
  • github/codeql-action/autobuild v1 composite
  • github/codeql-action/init v1 composite
.github/workflows/python-package.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v4 composite
pyproject.toml pypi