Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 19 committers (5.3%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.3%) to scientific vocabulary
Keywords
Repository
Spelling corrector in python
Basic Info
Statistics
- Stars: 486
- Watchers: 7
- Forks: 92
- Open Issues: 5
- Releases: 6
Topics
Metadata Files
README.md
Autocorrect
Spelling corrector in python. Currently supports English, Polish, Turkish, Russian, Ukrainian, Czech, Portuguese, Greek, Italian, Vietnamese, French and Spanish, but you can easily add new languages.
Based on: https://github.com/phatpiglet/autocorrect and Peter Norvig's spelling corrector.
Installation
```bash
pip install autocorrect
```
Examples
Autocorrect full sentences:
```python
from autocorrect import Speller

spell = Speller()
spell("I'm not sleapy and tehre is no place I'm giong to.")
# "I'm not sleepy and there is no place I'm going to."
```
Use other languages:
```python
spell = Speller('pl')
spell('ptaaki latatją kluczmm')
# 'ptaki latają kluczem'
```
Get multiple correction candidates for a single word:
```python
spell.get_candidates("tehre")
# [(5437024, 'there'), (5860, 'terre')]
```
The numbers are word frequencies, so the higher, the better.
Speed
```python
%timeit spell("I'm not sleapy and tehre is no place I'm giong to.")
373 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit spell("There is no comin to consiousnes without pain.")
150 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
As you can see, correcting some words can take on the order of 150 ms. If speed is important for your use case (e.g. a chatbot), you may want to use the 'fast' option:
```python
spell = Speller(fast=True)
%timeit spell("There is no comin to consiousnes without pain.")
344 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Now, the correction should always work in microseconds, but words with double typos (like 'consiousnes') won't be corrected.
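The fast-mode limitation follows from how Norvig-style spelling correctors (which this project is based on) generate candidates: within a fixed edit distance. A minimal illustrative sketch, not autocorrect's actual internals (the names `edits1` and `edits2` are made up here):

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit (delete, swap, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)

def edits2(word):
    """All strings two edits away -- a much larger set, hence slower."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

# 'consiousnes' is two insertions away from 'consciousness', so only
# the (expensive) distance-2 search can reach it:
print("consciousness" in edits1("consiousnes"))  # False
print("consciousness" in edits2("consiousnes"))  # True
```

Skipping the distance-2 pass is what makes fast mode run in microseconds, at the cost of missing double typos.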
OCR
When cleaning up OCR output, replacements account for the large majority of errors. If that matches your data, you may want to use the 'only_replacements' option:
```python
spell = Speller(only_replacements=True)
```
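The idea behind replacement-only correction can be sketched as follows (illustrative only; the function name below is made up, not autocorrect's API):

```python
def replacements_only(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings reachable by replacing exactly one character.

    Insertions and deletions are excluded, since OCR errors are
    mostly character substitutions ('b' read as 'h', etc.).
    """
    return {
        word[:i] + c + word[i + 1:]
        for i in range(len(word))
        for c in alphabet
        if c != word[i]
    }

candidates = replacements_only("tbe")
print("the" in candidates)   # True  ('b' -> 'h' is one replacement)
print("tube" in candidates)  # False (would require an insertion)
```

Restricting candidates this way also shrinks the search space, so it tends to be faster as well as more accurate on OCR text.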
Custom word sets
If you wish to use your own set of words for autocorrection, you can pass an nlp_data argument:
```python
spell = Speller(nlp_data=your_word_frequency_dict)
```
Here your_word_frequency_dict is a dictionary mapping words to their average frequencies in your text. If you only want to tweak the default word set a bit, you can edit the spell.nlp_data attribute after spell has been initialized.
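For example, such a frequency dict can be built from your own corpus with the standard library (a minimal sketch; the Speller call is shown commented out so the snippet runs without the autocorrect package installed):

```python
import re
from collections import Counter

# Toy corpus standing in for your own text.
corpus = "the cat sat on the mat and the cat slept"

# Count lowercase word occurrences; this shape (word -> count)
# matches what nlp_data expects.
word_freqs = Counter(re.findall(r"[a-z]+", corpus.lower()))
print(word_freqs["the"])  # 3

# from autocorrect import Speller
# spell = Speller(nlp_data=dict(word_freqs))
```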
Adding new languages
A simpler but untested way - wordfreq
It should be possible to get word frequencies from the wordfreq package and pass that frequency data to autocorrect through the nlp_data parameter. You will also need to generate an appropriate alphabet (see constants.py).
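A hedged sketch of the alphabet step: one plausible approach is to derive the alphabet from the words themselves (the exact format constants.py expects may differ, and the wordfreq calls are left as untested comments):

```python
# Untested assumption: wordfreq's top_n_list / word_frequency could
# supply the frequency data, e.g.:
# from wordfreq import top_n_list, word_frequency
# nlp_data = {w: word_frequency(w, "pl") for w in top_n_list("pl", 50000)}

# Toy stand-in for the real frequency dict:
nlp_data = {"ptaki": 120, "latają": 45, "kluczem": 30}

# Collect every character that appears in any word, sorted for
# a deterministic alphabet string.
alphabet = "".join(sorted({ch for word in nlp_data for ch in word}))
print(alphabet)  # aceijklmptuzą
```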
A more complicated but tested way - wikipedia text
Note: I will no longer accept PRs to add individual languages using this method. A more sensible approach would be to try using the wordfreq method, and adding many languages at once in some general way. But I don't have time to implement this myself.
First, define special letters, by adding entries in word_regexes and alphabets dicts in autocorrect/constants.py.
Now you need a large amount of text. The easiest way is to download a Wikipedia dump. For example, for Russian you would go to https://dumps.wikimedia.org/ruwiki/latest/, download ruwiki-latest-pages-articles.xml.bz2, and decompress it:
```bash
bzip2 -d ruwiki-latest-pages-articles.xml.bz2
```
After that, edit the autocorrect.constants dictionaries to accommodate the regexes and dictionaries for your language, then run:
```python
from autocorrect.word_count import count_words

count_words('ruwiki-latest-pages-articles.xml', 'ru')
```
and pack the result:
```bash
tar -zcvf autocorrect/data/ru.tar.gz word_count.json
```
For the correction to work well, you need to cut out rarely used words. First, in test_all.py, write test words for your language and add them to optional_language_tests the same way as it's done for other languages. It's good to have at least 30 words. Now run:
```bash
python test_all.py find_threshold ru
```
and see which threshold value yields the fewest badly corrected words. After that, manually delete all the words with fewer occurrences than the threshold you found from the file in ru.tar.gz (it's already sorted, so this should be easy).
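The manual cutting step can be sketched in a few lines (toy data below; in practice you would json.load word_count.json, filter it, and json.dump the result before re-packing the tar.gz):

```python
import json

# Toy stand-in for the contents of word_count.json.
word_count = {"ptaki": 500, "ptaaki": 2, "latają": 300}

# Threshold as reported by find_threshold (hypothetical value).
threshold = 10

# Keep only words at or above the threshold.
word_count = {w: n for w, n in word_count.items() if n >= threshold}
print(json.dumps(word_count, ensure_ascii=False))
```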
To distribute this language support to others, you will need to upload your tar.gz file to IPFS (for example with Pinata, which will pin the file so it doesn't disappear), and then add its path to ipfs_paths in constants.py. (Tip: first put the file inside a folder and upload the folder to IPFS, so that the downloaded file keeps the correct filename.)
Good luck!
Owner
- Name: Filip Sondej
- Login: filyp
- Kind: user
- Location: Krakow
- Repositories: 56
- Profile: https://github.com/filyp
GitHub Events
Total
- Issues event: 5
- Watch event: 37
- Issue comment event: 3
- Push event: 2
- Fork event: 5
Last Year
- Issues event: 5
- Watch event: 37
- Issue comment event: 3
- Push event: 2
- Fork event: 5
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| fsondej | f****j@p****m | 155 |
| Jonas McCallum | j****m@g****m | 12 |
| Raphael Boidol | b****r | 9 |
| oscar-defelice | o****e@g****m | 9 |
| Filip Sondej | 2****p | 7 |
| Martin Vejvar | v****m@g****m | 6 |
| Khiem Le | t****1@g****m | 4 |
| TurcoEFelice | d****r@g****m | 4 |
| negm | h****m@g****m | 3 |
| mehmetandic_teknasyon | m****c@t****m | 2 |
| Ryan Freckleton | r****n@g****m | 2 |
| pr3ssh | x@p****t | 2 |
| Julin S | 4****h | 2 |
| Jonas McCallum | j****m@h****m | 1 |
| magdalini-anastasiadou | m****u@g****m | 1 |
| Oscar | 4****e | 1 |
| Jonas McCallum | f****s@g****m | 1 |
| Jerry Qu | j****y@b****i | 1 |
| AdamLouly | a****3@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 40
- Total pull requests: 20
- Average time to close issues: 6 months
- Average time to close pull requests: 10 days
- Total issue authors: 37
- Total pull request authors: 16
- Average comments per issue: 2.63
- Average comments per pull request: 2.8
- Merged pull requests: 9
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 0
- Average time to close issues: about 10 hours
- Average time to close pull requests: N/A
- Issue authors: 3
- Pull request authors: 0
- Average comments per issue: 0.67
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ju-sh (3)
- Conduit83-Q (2)
- ByUnal (1)
- Dobatymo (1)
- macialek (1)
- blahiri (1)
- fliuzzi02 (1)
- SebastianS93 (1)
- symbwell (1)
- deroace (1)
- Garve (1)
- sumanthdonapati (1)
- himanshudhingra (1)
- Mohamednow25 (1)
- Ransly (1)
Pull Request Authors
- magdalini-anastasiadou (3)
- filyp (2)
- TurconiAndrea (2)
- Jerry2001Qu (1)
- ju-sh (1)
- khiemledev (1)
- oscar-defelice (1)
- vejvarm (1)
- PetricaR (1)
- fliuzzi02 (1)
- boidolr (1)
- NiklasHoltmeyer (1)
- vladimarius (1)
- AdamLouly (1)
- pr3ssh (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite