https://github.com/barrust/pyspellchecker
Pure Python Spell Checking http://pyspellchecker.readthedocs.io/en/latest/
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.8%) to scientific vocabulary
Keywords
levenshtein-distance
python
python-spell-checking
spellcheck
spellchecker
spelling-checker
Keywords from Contributors
mesh
interactive
projection
generic
sequences
archival
data-structures
genomics
observability
autograding
Last synced: 6 months ago
Repository
Basic Info
Statistics
- Stars: 745
- Watchers: 7
- Forks: 163
- Open Issues: 7
- Releases: 32
Created about 8 years ago
· Last pushed 9 months ago
Metadata Files
Readme
Changelog
License
README.rst
pyspellchecker
===============================================================================
.. image:: https://img.shields.io/badge/license-MIT-blue.svg
    :target: https://opensource.org/licenses/MIT/
    :alt: License
.. image:: https://img.shields.io/github/release/barrust/pyspellchecker.svg
    :target: https://github.com/barrust/pyspellchecker/releases
    :alt: GitHub release
.. image:: https://github.com/barrust/pyspellchecker/workflows/Python%20package/badge.svg
    :target: https://github.com/barrust/pyspellchecker/actions?query=workflow%3A%22Python+package%22
    :alt: Build Status
.. image:: https://codecov.io/gh/barrust/pyspellchecker/branch/master/graph/badge.svg?token=OdETiNgz9k
    :target: https://codecov.io/gh/barrust/pyspellchecker
    :alt: Test Coverage
.. image:: https://badge.fury.io/py/pyspellchecker.svg
    :target: https://badge.fury.io/py/pyspellchecker
    :alt: PyPi Package
.. image:: http://pepy.tech/badge/pyspellchecker
    :target: https://pepy.tech/project/pyspellchecker
    :alt: Downloads

Pure Python Spell Checking based on Peter Norvig's blog post on setting
up a simple spell checking algorithm.
It uses a Levenshtein Distance algorithm to generate candidate strings
within an edit distance of 2 from the original word. It then compares all
of these candidates (produced by insertions, deletions, replacements, and
transpositions) to known words in a word frequency list. Those words that
are found more often in the frequency list are **more likely** the correct
results.
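
The approach described above can be illustrated with a minimal sketch in the
style of Peter Norvig's post. It is not the library's implementation: the
``WORD_COUNTS`` table is a made-up toy frequency list, the alphabet is assumed
to be lowercase English, and ``edits1``, ``edits2``, ``known``, ``candidates``
and ``correction`` are illustrative free functions.

.. code:: python

    from collections import Counter

    # Toy frequency list (hypothetical counts, for illustration only).
    WORD_COUNTS = Counter({"happening": 50, "opening": 20, "here": 120, "is": 900})
    ALPHABET = "abcdefghijklmnopqrstuvwxyz"


    def edits1(word):
        """Return every string exactly one edit away from ``word``."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [left + right[1:] for left, right in splits if right]
        transposes = [left + right[1] + right[0] + right[2:]
                      for left, right in splits if len(right) > 1]
        replaces = [left + c + right[1:]
                    for left, right in splits if right for c in ALPHABET]
        inserts = [left + c + right for left, right in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts)


    def edits2(word):
        """Return every string reachable by applying two single edits."""
        return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


    def known(words):
        """Keep only the words that appear in the frequency list."""
        return {w for w in words if w in WORD_COUNTS}


    def candidates(word):
        """Prefer the word itself, then distance 1, then distance 2."""
        return known([word]) or known(edits1(word)) or known(edits2(word)) or {word}


    def correction(word):
        """Pick the candidate with the highest count in the frequency list."""
        return max(candidates(word), key=lambda w: WORD_COUNTS[w])


    print(correction("hapenning"))  # prints: happening

Here ``hapenning`` is two edits away from ``happening`` (insert a ``p``, delete
an ``n``), so the correct word is only found once the distance-2 strings are
checked.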
``pyspellchecker`` supports multiple languages including English, Spanish,
German, French, Portuguese, Arabic and Basque. For information on how the dictionaries were
created and how they can be updated and improved, please see the
**Dictionary Creation and Updating** section of the readme!
``pyspellchecker`` supports **Python 3**.
``pyspellchecker`` allows for the setting of the Levenshtein Distance (up to two) to check.
For longer words, it is highly recommended to use a distance of 1 and not the
default 2. See the quickstart to find how one can change the distance parameter.
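
To see why a smaller distance helps for longer words, the rough sketch below
counts how many distinct strings a Norvig-style generator produces at each
distance. The ``edits1`` and ``count_candidates`` helpers and the 26-letter
alphabet are assumptions for illustration, not the library's code; the actual
numbers depend on the alphabet of the selected language.

.. code:: python

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet


    def edits1(word):
        """Return every string exactly one edit away from ``word``."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [left + right[1:] for left, right in splits if right]
        transposes = [left + right[1] + right[0] + right[2:]
                      for left, right in splits if len(right) > 1]
        replaces = [left + c + right[1:]
                    for left, right in splits if right for c in ALPHABET]
        inserts = [left + c + right for left, right in splits for c in ALPHABET]
        return set(deletes + transposes + replaces + inserts)


    def count_candidates(word):
        """Return how many distinct strings one and two edit steps produce."""
        one = edits1(word)
        two = {e2 for e1 in one for e2 in edits1(e1)}
        return len(one), len(two)


    for word in ("cat", "spelling", "misunderstand"):
        d1, d2 = count_candidates(word)
        print(f"{word!r}: 1 edit -> {d1} strings, 2 edits -> {d2} strings")

For a word of length ``n``, one edit step yields at most ``54n + 25`` strings
with a 26-letter alphabet, and the second step applies the same expansion to
each of those, so every one of them must then be looked up in the word
frequency list.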
Installation
-------------------------------------------------------------------------------
The easiest method to install is using pip:
.. code:: bash

    pip install pyspellchecker

To build from source:
.. code:: bash

    git clone https://github.com/barrust/pyspellchecker.git
    cd pyspellchecker
    python -m build

For *python 2.7* support, install release 0.5.6,
but note that no future updates will support *python 2*.
.. code:: bash

    pip install pyspellchecker==0.5.6

Quickstart
-------------------------------------------------------------------------------
After installation, using ``pyspellchecker`` should be fairly
straightforward:
.. code:: python

    from spellchecker import SpellChecker

    spell = SpellChecker()

    # find those words that may be misspelled
    misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

    for word in misspelled:
        # Get the one `most likely` answer
        print(spell.correction(word))

        # Get a list of `likely` options
        print(spell.candidates(word))

If the Word Frequency list is not to your liking, you can add additional
text to generate a more appropriate list for your use case.
.. code:: python

    from spellchecker import SpellChecker

    spell = SpellChecker()  # loads default word frequency list
    spell.word_frequency.load_text_file('./my_free_text_doc.txt')

    # if I just want to make sure some words are not flagged as misspelled
    spell.word_frequency.load_words(['microsoft', 'apple', 'google'])
    spell.known(['microsoft', 'google'])  # will return both now!

If the words that you wish to check are long, it is recommended to reduce the
``distance`` to 1. This can be done either when initializing the
``SpellChecker`` class or afterward.
.. code:: python

    from spellchecker import SpellChecker

    spell = SpellChecker(distance=1)  # set at initialization

    # do some work on longer words

    spell.distance = 2  # set the distance parameter back to the default

Non-English Dictionaries
-------------------------------------------------------------------------------
``pyspellchecker`` supports several default dictionaries as part of the default
package. Each is simple to use when initializing the dictionary:
.. code:: python

    from spellchecker import SpellChecker

    english = SpellChecker()  # the default is English (language='en')
    spanish = SpellChecker(language='es')  # use the Spanish Dictionary
    russian = SpellChecker(language='ru')  # use the Russian Dictionary
    arabic = SpellChecker(language='ar')   # use the Arabic Dictionary

The currently supported dictionaries are:
* English - 'en'
* Spanish - 'es'
* French - 'fr'
* Portuguese - 'pt'
* German - 'de'
* Italian - 'it'
* Russian - 'ru'
* Arabic - 'ar'
* Basque - 'eu'
* Latvian - 'lv'
* Dutch - 'nl'
* Persian - 'fa'
Dictionary Creation and Updating
-------------------------------------------------------------------------------
The creation of the dictionaries is, unfortunately, not an exact science. I have provided a script that, given a text file of sentences (in this case from OpenSubtitles), generates a word frequency list based on the words found within the text. The script then attempts to **clean up** the word frequency by, for example, removing words with invalid characters (usually from other languages), removing low-count terms (possibly misspellings), and enforcing language-specific rules where available (for example, no more than one accent per word in Spanish). Next, it removes any words that appear on a list of known words to be removed. Finally, it adds words that are known to be missing or that were removed for having too low a frequency.
The script can be found here: ``scripts/build_dictionary.py``. The original word frequency list parsed from OpenSubtitles can be found in the ``scripts/data/`` folder along with each language's *include* and *exclude* text files.
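
The cleanup steps described above could look roughly like the following
sketch. It is not the project's ``scripts/build_dictionary.py``: the function
name ``build_word_frequency``, the ``min_count`` threshold, the allowed-letter
set, and the count assigned to included words are all assumptions made for
illustration.

.. code:: python

    import re
    from collections import Counter


    def build_word_frequency(text, allowed_letters, min_count=2,
                             exclude=(), include=()):
        """Count words in ``text`` and apply simple cleanup rules."""
        # Split into runs of letters (Unicode-aware), ignoring case.
        words = re.findall(r"[^\W\d_]+", text.lower())
        counts = Counter(words)

        allowed = set(allowed_letters)
        cleaned = {
            word: count
            for word, count in counts.items()
            # Drop low-count terms and words with characters outside the alphabet.
            if count >= min_count and set(word) <= allowed
        }

        # Remove words that are explicitly excluded.
        for word in exclude:
            cleaned.pop(word, None)

        # Add back words that are known to be missing (assumed default count).
        for word in include:
            cleaned.setdefault(word, min_count)

        return cleaned


    sample = "the cat sat on the mat the cat saw the señor and the señor teh"
    freq = build_word_frequency(
        sample,
        allowed_letters="abcdefghijklmnopqrstuvwxyz",
        exclude=["cat"],
        include=["mat"],
    )
    print(freq)  # prints: {'the': 5, 'mat': 2}

In this toy run, ``señor`` is dropped because ``ñ`` is outside the assumed
alphabet, ``teh`` and other single-occurrence words are dropped as low-count
terms, ``cat`` is removed by the exclude list, and ``mat`` is restored by the
include list.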
Any help in updating and maintaining the dictionaries would be greatly appreciated. To contribute, start a discussion on GitHub or open a pull request that updates the include and exclude files.
Additional Methods
-------------------------------------------------------------------------------
On-line documentation is available; below is a condensed summary of some of the available functions:
``correction(word)``: Returns the most probable result for the
misspelled word
``candidates(word)``: Returns a set of possible candidates for the
misspelled word
``known([words])``: Returns those words that are in the word frequency
list
``unknown([words])``: Returns those words that are not in the frequency
list
``word_probability(word)``: The frequency of the given word out of all
words in the frequency list
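
As a rough mental model of the lookup methods listed above, the sketch below
reimplements ``known``, ``unknown`` and a word probability on top of a plain
``collections.Counter``. The toy counts and the free-function form are
assumptions for illustration; the library's own methods live on
``SpellChecker`` and may normalize their input differently.

.. code:: python

    from collections import Counter

    # Toy frequency list (hypothetical counts, for illustration only).
    WORD_COUNTS = Counter({"something": 30, "is": 900, "here": 70})


    def known(words):
        """Return the words that appear in the frequency list."""
        return {w.lower() for w in words if w.lower() in WORD_COUNTS}


    def unknown(words):
        """Return the words that do not appear in the frequency list."""
        return {w.lower() for w in words if w.lower() not in WORD_COUNTS}


    def word_probability(word):
        """Return the word's count divided by the total of all counts."""
        total = sum(WORD_COUNTS.values())
        return WORD_COUNTS[word.lower()] / total


    words = ["something", "is", "hapenning", "here"]
    print(known(words))
    print(unknown(words))
    print(word_probability("here"))

With these toy counts the total is 1000, so ``here`` has a probability of
70 / 1000.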
The following are less likely to be needed by the user but are available:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``edit_distance_1(word)``: Returns a set of all strings at a Levenshtein
Distance of one based on the alphabet of the selected language
``edit_distance_2(word)``: Returns a set of all strings at a Levenshtein
Distance of two based on the alphabet of the selected language
Credits
-------------------------------------------------------------------------------
* Peter Norvig's blog post on setting up a simple spell checking algorithm
* P Lison and J Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
Owner
- Name: Tyler Barrus
- Login: barrust
- Kind: user
- Location: Richmond Va
- Repositories: 17
- Profile: https://github.com/barrust
GitHub Events
Total
- Create event: 8
- Commit comment event: 2
- Release event: 2
- Issues event: 8
- Watch event: 49
- Delete event: 3
- Issue comment event: 17
- Push event: 28
- Pull request event: 16
- Fork event: 7
Last Year
- Create event: 8
- Commit comment event: 2
- Release event: 2
- Issues event: 8
- Watch event: 49
- Delete event: 3
- Issue comment event: 17
- Push event: 28
- Pull request event: 16
- Fork event: 7
Committers
Last synced: 12 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Tyler Barrus | b****t@g****m | 114 |
| dependabot[bot] | 4****] | 3 |
| Vladislav Sobolev | 3****m | 2 |
| grayjk | g****k@g****m | 1 |
| davido-brainlabs | d****o@b****m | 1 |
| blayzen-w | 3****w | 1 |
| Xabi | x****a@g****m | 1 |
| Thomas Decaux | e****y@g****m | 1 |
| Stephen Cawood | s****d | 1 |
| Raivis Dejus | o****s@g****m | 1 |
| Mahmoud Salhab | m****d@s****k | 1 |
| Lode Nachtergaele | c****2@g****m | 1 |
| John O'Sullivan | j****7@y****m | 1 |
| James Riley | j****s@c****m | 1 |
| CangarejoAsul | 1****l | 1 |
| Arvin Nick | a****0@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 74
- Total pull requests: 65
- Average time to close issues: 4 months
- Average time to close pull requests: about 1 month
- Total issue authors: 66
- Total pull request authors: 21
- Average comments per issue: 2.55
- Average comments per pull request: 1.09
- Merged pull requests: 53
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 4
- Pull requests: 12
- Average time to close issues: 2 days
- Average time to close pull requests: 8 days
- Issue authors: 4
- Pull request authors: 7
- Average comments per issue: 1.0
- Average comments per pull request: 1.58
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 3
Top Authors
Issue Authors
- barrust (5)
- mrodin52 (2)
- pctjsm (2)
- cbsnagur (2)
- sviperm (2)
- ghost (1)
- stephencawood (1)
- mhillendahl (1)
- skwolvie (1)
- madkote (1)
- Balurc (1)
- cmaureir (1)
- 7heo (1)
- hemanta212 (1)
- akarmazyan (1)
Pull Request Authors
- barrust (37)
- dependabot[bot] (6)
- sviperm (4)
- ashkanfeyzollahi (2)
- CangarejoAsul (2)
- tomkralidis (2)
- grayjk (2)
- RDxR10 (2)
- idiotcommerce (1)
- mikemalinowski (1)
- blayzen-w (1)
- stephencawood (1)
- xezpeleta (1)
- ron-oren97 (1)
- raivisdejus (1)
Top Labels
Issue Labels
help wanted (4)
enhancement (2)
Pull Request Labels
dependencies (6)
Packages
- Total packages: 3
- Total downloads: 587,548 last month (pypi)
- Total docker downloads: 2,032
- Total dependent packages: 38 (may contain duplicates)
- Total dependent repositories: 671 (may contain duplicates)
- Total versions: 42
- Total maintainers: 2
pypi.org: pyspellchecker
Pure python spell checker based on work by Peter Norvig
- Documentation: https://pyspellchecker.readthedocs.io/
- License: MIT
- Latest release: 0.8.3 (published 10 months ago)
Rankings
Dependent packages count: 0.4%
Dependent repos count: 0.5%
Downloads: 0.6%
Average: 0.9%
Docker downloads count: 2.1%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: pyspellchecker
- Homepage: https://github.com/barrust/pyspellchecker
- License: MIT
- Latest release: 0.7.0 (published over 3 years ago)
Rankings
Stargazers count: 16.5%
Dependent repos count: 18.1%
Average: 18.5%
Dependent packages count: 19.6%
Forks count: 19.7%
Last synced: 7 months ago
spack.io: py-pyspellchecker
Pure python spell checker based on work by Peter Norvig
- Homepage: https://github.com/barrust/pyspellchecker
- License: []
- Latest release: 0.6.2 (published almost 4 years ago)
Rankings
Dependent repos count: 0.0%
Stargazers count: 9.7%
Forks count: 11.7%
Average: 19.7%
Dependent packages count: 57.3%
Maintainers (1)
Last synced: 7 months ago
Dependencies
.github/workflows/publish.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
.github/workflows/python-package.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
- codecov/codecov-action v3 composite
pyproject.toml
pypi
- black ^20.8b1 develop
- flake8 ^3.6.0 develop
- isort ^5.6.4 develop
- pre-commit >=2.18.1 develop
- pytest ^6.1.1 develop
docs/requirements.txt
pypi
- sphinx-rtd-theme *