autocorrect

Spelling corrector in python

https://github.com/filyp/autocorrect

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 19 committers (5.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

autocorrect autocorrection czech english languages levenshtein-distance multilanguage multilingual nlp ocr polish portuguese python russian spanish spellchecker spelling spelling-corrector turkish ukrainian
Last synced: 6 months ago · JSON representation

Repository

Spelling corrector in python

Basic Info
  • Host: GitHub
  • Owner: filyp
  • License: lgpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 3.81 MB
Statistics
  • Stars: 486
  • Watchers: 7
  • Forks: 92
  • Open Issues: 5
  • Releases: 6
Created over 5 years ago · Last pushed 8 months ago
Metadata Files
Readme Contributing License

README.md

Autocorrect


Spelling corrector in Python. Currently supports English, Polish, Turkish, Russian, Ukrainian, Czech, Portuguese, Greek, Italian, Vietnamese, French and Spanish, but you can easily add new languages.

Based on: https://github.com/phatpiglet/autocorrect and Peter Norvig's spelling corrector.

Installation

pip install autocorrect

Examples

Autocorrect full sentences:

```python
>>> from autocorrect import Speller
>>> spell = Speller()
>>> spell("I'm not sleapy and tehre is no place I'm giong to.")
"I'm not sleepy and there is no place I'm going to."
```

Use other languages:

```python
>>> spell = Speller('pl')
>>> spell('ptaaki latatją kluczmm')
'ptaki latają kluczem'
```

Get multiple correction candidates for a single word:

```python
>>> spell.get_candidates("tehre")
[(5437024, 'there'), (5860, 'terre')]
```

The numbers are word frequencies, so higher is better.
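Because each candidate is a (frequency, word) pair, choosing the most likely correction amounts to taking the maximum. A minimal sketch using the sample output above; `best_candidate` is a hypothetical helper written for this example, not part of autocorrect:

```python
def best_candidate(candidates):
    """Return the word with the highest frequency, or None if the list is empty."""
    if not candidates:
        return None
    # Tuples compare by their first element (the frequency) first.
    frequency, word = max(candidates)
    return word


# Sample output copied from the get_candidates example.
candidates = [(5437024, 'there'), (5860, 'terre')]
print(best_candidate(candidates))
```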

Speed

%timeit spell("I'm not sleapy and tehre is no place I'm giong to.")
373 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit spell("There is no comin to consiousnes without pain.")
150 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As you can see, correcting some words can take ~200 ms. If speed is important for your use case (e.g. a chatbot), you may want to use the option 'fast':

spell = Speller(fast=True)
%timeit spell("There is no comin to consiousnes without pain.")
344 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Now correction should always take microseconds, but words with double typos (like 'consiousnes') won't be corrected.
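To illustrate why a single-edit search misses double typos, here is a minimal sketch in the style of Peter Norvig's spelling corrector, which this package is based on. It is an illustration under assumptions, not autocorrect's actual code: the `edits1` and `edits2` helpers and the lowercase English alphabet are chosen for this example only.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this example


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away: a much larger set, hence slower to search."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


one_edit = edits1("consiousnes")
two_edits = edits2("consiousnes")

print("coming" in edits1("comin"))      # single typo: found with one edit
print("consciousness" in one_edit)      # double typo: not reachable with one edit
print("consciousness" in two_edits)     # reachable only with two edits
print(len(one_edit), len(two_edits))    # sizes of the two search spaces
```

The last line prints how many more strings a two-edit search has to consider, which is consistent with the gap between the millisecond and microsecond timings above.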

OCR

When cleaning up OCR output, the large majority of errors are character replacements. If this is the case for your data, you may want to use the option 'only_replacements':

```python
spell = Speller(only_replacements=True)
```
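As an illustration of what restricting the search to replacements means, the sketch below generates only same-length candidates that differ from the input in exactly one character. The helper name, the alphabet, and the tiny vocabulary are assumptions made for this example, not the library's internals.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet for this example


def replacements(word):
    """All strings made by swapping exactly one character of word for another letter."""
    return {
        word[:i] + c + word[i + 1:]
        for i in range(len(word))
        for c in ALPHABET
        if c != word[i]
    }


# Hypothetical mini vocabulary mapping words to frequencies.
vocabulary = {"the": 100, "tie": 5, "he": 50}

# 'tbe' stands in for an OCR misreading of 'the' (b recognized instead of h).
known = {w for w in replacements("tbe") if w in vocabulary}
print(sorted(known))
```

Words of a different length, such as 'he', are never produced, which keeps the candidate set small.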

Custom word sets

If you wish to use your own set of words for autocorrection, you can pass an nlp_data argument:

spell = Speller(nlp_data=your_word_frequency_dict)

where your_word_frequency_dict is a dictionary that maps words to their average frequencies in your text. If you only want to change the default word set a bit, you can edit spell.nlp_data after spell has been initialized.
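One way to build such a dictionary is to count words in your own text. A minimal sketch, assuming raw occurrence counts are acceptable as frequencies and that a simple lowercase ASCII tokenizer is good enough; `build_word_frequencies` is a hypothetical helper, not part of autocorrect:

```python
import re
from collections import Counter


def build_word_frequencies(text):
    """Count occurrences of each lowercase word in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(words))


sample = "The cat sat on the mat. The mat was flat."
your_word_frequency_dict = build_word_frequencies(sample)
print(your_word_frequency_dict)

# With autocorrect installed, the dict would then be passed in as described:
# from autocorrect import Speller
# spell = Speller(nlp_data=your_word_frequency_dict)
```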

Adding new languages

A simpler but untested way - wordfreq

It should be possible to get word frequencies from the wordfreq package. You should be able to provide this word frequency data to autocorrect through the nlp_data parameter. You will also need to generate an appropriate alphabet (see constants.py).
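A rough sketch of how this could look, equally untested against autocorrect. `top_n_list` and `word_frequency` are wordfreq functions; everything else (the helper names, the scale factor, and the idea of converting relative frequencies to integer pseudo-counts) is an assumption for illustration. wordfreq is imported lazily so the helpers can be tried without it installed.

```python
def to_counts(relative_freqs, scale=1_000_000_000):
    """Convert relative frequencies (0..1) to integer pseudo-counts (assumed format)."""
    return {word: max(1, round(freq * scale)) for word, freq in relative_freqs.items()}


def alphabet_from_words(words):
    """Collect every alphabetic character used in the word list, sorted."""
    return "".join(sorted({ch for word in words for ch in word if ch.isalpha()}))


def load_from_wordfreq(lang, n=100_000):
    """Fetch the top n words with their frequencies (requires `pip install wordfreq`)."""
    from wordfreq import top_n_list, word_frequency  # imported lazily on purpose
    return {word: word_frequency(word, lang) for word in top_n_list(lang, n)}


# Stand-in data so the example runs without wordfreq installed.
demo = {"the": 0.05, "żółw": 0.000002}
print(to_counts(demo))
print(alphabet_from_words(demo))

# With wordfreq and autocorrect installed, one might try:
# freqs = load_from_wordfreq("pl")
# spell = Speller("pl", nlp_data=to_counts(freqs))
```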

A more complicated but tested way - wikipedia text

Note: I will no longer accept PRs to add individual languages using this method. A more sensible approach would be to try using the wordfreq method, and adding many languages at once in some general way. But I don't have time to implement this myself.

First, define your language's special letters by adding entries to the word_regexes and alphabets dicts in autocorrect/constants.py.
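For orientation, entries of this kind might look like the sketch below. The dict names come from the text above, but the exact patterns and strings are assumptions; check autocorrect/constants.py for the real format before copying anything.

```python
import re

# Hypothetical entries; the actual contents of constants.py may differ.
word_regexes = {
    "en": r"[A-Za-z]+",
    "ru": r"[А-Яа-яЁё]+",
}
alphabets = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "ru": "абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
}

# The regex decides what counts as a word when scanning text.
text = "Привет, мир! Hello, world."
print(re.findall(word_regexes["ru"], text))
print(re.findall(word_regexes["en"], text))
```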

Now, you need a bunch of text. The easiest way is to download a Wikipedia dump. For example, for Russian you would go to https://dumps.wikimedia.org/ruwiki/latest/ and download ruwiki-latest-pages-articles.xml.bz2, then decompress it:

bzip2 -d ruwiki-latest-pages-articles.xml.bz2

After that, make sure the dictionaries in autocorrect.constants accommodate the regexes and alphabets for your language (if you have not done so already), then count the words:

```python
from autocorrect.word_count import count_words

count_words('ruwiki-latest-pages-articles.xml', 'ru')
```

tar -zcvf autocorrect/data/ru.tar.gz word_count.json

For the correction to work well, you need to cut out rarely used words. First, in test_all.py, write test words for your language, and add them to optional_language_tests the same way as it's done for other languages. It's good to have at least 30 words. Now run:

```
python test_all.py find_threshold ru
```

and see which threshold value gives the fewest badly corrected words. After that, manually delete all the words with fewer occurrences than the threshold value you found from the file in ru.tar.gz (it's already sorted, so it should be easy).
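The pruning step could also be scripted instead of done by hand. A minimal sketch, assuming word_count.json holds a flat mapping from word to occurrence count; `prune_rare_words` is a hypothetical helper, not part of autocorrect:

```python
def prune_rare_words(word_counts, threshold):
    """Keep only the words that occur at least `threshold` times."""
    return {word: count for word, count in word_counts.items() if count >= threshold}


# Hypothetical counts, sorted by frequency like the generated file.
word_counts = {"и": 9000, "в": 7500, "кошка": 40, "кошкаа": 2}
print(prune_rare_words(word_counts, threshold=10))
```

In practice you would read the counts with `json.load`, prune them, write them back with `json.dump`, and then recreate the tar.gz archive.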

To distribute this language support to others, you will need to upload your tar.gz file to IPFS (for example with Pinata, which will pin the file so it doesn't disappear), and then add its path to ipfs_paths in constants.py. (Tip: first put the file inside a folder and upload the folder to IPFS, so that the downloaded file has the correct filename.)

Good luck!

Owner

  • Name: Filip Sondej
  • Login: filyp
  • Kind: user
  • Location: Krakow

GitHub Events

Total
  • Issues event: 5
  • Watch event: 37
  • Issue comment event: 3
  • Push event: 2
  • Fork event: 5
Last Year
  • Issues event: 5
  • Watch event: 37
  • Issue comment event: 3
  • Push event: 2
  • Fork event: 5

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 223
  • Total Committers: 19
  • Avg Commits per committer: 11.737
  • Development Distribution Score (DDS): 0.305
Past Year
  • Commits: 7
  • Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| fsondej | f****j@p****m | 155 |
| Jonas McCallum | j****m@g****m | 12 |
| Raphael Boidol | b****r | 9 |
| oscar-defelice | o****e@g****m | 9 |
| Filip Sondej | 2****p | 7 |
| Martin Vejvar | v****m@g****m | 6 |
| Khiem Le | t****1@g****m | 4 |
| TurcoEFelice | d****r@g****m | 4 |
| negm | h****m@g****m | 3 |
| mehmetandic_teknasyon | m****c@t****m | 2 |
| Ryan Freckleton | r****n@g****m | 2 |
| pr3ssh | x@p****t | 2 |
| Julin S | 4****h | 2 |
| Jonas McCallum | j****m@h****m | 1 |
| magdalini-anastasiadou | m****u@g****m | 1 |
| Oscar | 4****e | 1 |
| Jonas McCallum | f****s@g****m | 1 |
| Jerry Qu | j****y@b****i | 1 |
| AdamLouly | a****3@g****m | 1 |
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 40
  • Total pull requests: 20
  • Average time to close issues: 6 months
  • Average time to close pull requests: 10 days
  • Total issue authors: 37
  • Total pull request authors: 16
  • Average comments per issue: 2.63
  • Average comments per pull request: 2.8
  • Merged pull requests: 9
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: about 10 hours
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 0.67
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ju-sh (3)
  • Conduit83-Q (2)
  • ByUnal (1)
  • Dobatymo (1)
  • macialek (1)
  • blahiri (1)
  • fliuzzi02 (1)
  • SebastianS93 (1)
  • symbwell (1)
  • deroace (1)
  • Garve (1)
  • sumanthdonapati (1)
  • himanshudhingra (1)
  • Mohamednow25 (1)
  • Ransly (1)
Pull Request Authors
  • magdalini-anastasiadou (3)
  • filyp (2)
  • TurconiAndrea (2)
  • Jerry2001Qu (1)
  • ju-sh (1)
  • khiemledev (1)
  • oscar-defelice (1)
  • vejvarm (1)
  • PetricaR (1)
  • fliuzzi02 (1)
  • boidolr (1)
  • NiklasHoltmeyer (1)
  • vladimarius (1)
  • AdamLouly (1)
  • pr3ssh (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Dependencies

.github/workflows/benchamrk.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/python-app.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/python-publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
setup.py pypi