https://github.com/ai-forever/augmentex

Augmentex — a library for augmenting texts with errors

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: ai-forever
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 22.3 MB

Statistics

Stars: 65
Watchers: 6
Forks: 0
Open Issues: 5
Releases: 5

Created almost 3 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License

Augmentex — a library for augmenting texts with errors

Augmentex combines rule-based and statistics-based approaches (the statistics are powered by the KartaSlov project) to insert errors into text. It is described in full in the Paper and in this 🗣️Talk.

Installation

```commandline
pip install augmentex
```

Implemented functionality

We collected statistics from different languages and from different input sources. This table shows what functionality the library currently supports.

| | Russian | English |
| -----------:|:-----------:|:-----------:|
| PC keyboard | ✅ | ✅ |
| Mobile kb | ✅ | ❌ |

Support for new languages and additional input sources is planned.

Usage

🖇️ Augmentex lets you corrupt text at two levels of granularity and offers a set of methods for each level:

- Word level:
  - replace - replace a random word with its incorrect counterpart;
  - delete - delete a random word;
  - swap - swap two random words;
  - stopword - add random words from a stop-list;
  - split - add spaces between the letters of a word;
  - reverse - change the case of the first letter of a random word;
  - text2emoji - change a word to the corresponding emoji;
  - ngram - replace an n-gram in a word with an erroneous one.
- Character level:
  - shift - randomly swap upper / lower case in a string;
  - orfo - substitute correct characters with their common incorrect counterparts;
  - typo - substitute correct characters as if they were mistyped on a keyboard;
  - delete - delete a random character;
  - insert - insert a random character;
  - multiply - repeat a random character;
  - swap - swap two adjacent characters.

Word level

```python
from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4,  # Percentage of the phrase to which augmentations will be applied
    min_aug=1,  # Minimum number of augmentations
    max_aug=5,  # Maximum number of augmentations
    lang="eng",  # supports: "rus", "eng"
    platform="pc",  # supports: "pc", "mobile"
    random_seed=42,
)
```
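How the three numeric settings interact is not spelled out here. One plausible reading, which is an assumption and not taken from the library's source, is that the share of the phrase is converted to a count and then clamped between the minimum and maximum number of augmentations. The helper `planned_aug_count` below is hypothetical and only illustrates that reading:

```python
def planned_aug_count(n_units: int, unit_prob: float, min_aug: int, max_aug: int) -> int:
    """Hypothetical helper: how many units (words or characters) to corrupt.

    Assumption: the share `unit_prob` is turned into a count and then
    clamped to [min_aug, max_aug]; the result never exceeds the number
    of units available.
    """
    if n_units <= 0:
        return 0
    raw = int(n_units * unit_prob)  # share of the phrase, rounded down
    clamped = max(min_aug, min(raw, max_aug))
    return min(clamped, n_units)


# The sample phrase has 8 whitespace-separated words.
words = "Screw you guys, I am going home. (c)".split()
print(planned_aug_count(len(words), unit_prob=0.4, min_aug=1, max_aug=5))  # 3
```

Under this reading, a long text is capped at the maximum no matter how high the share is, and a very short text still receives at least the minimum number of corruptions.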

Replace a random word with its incorrect counterpart;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="replace")
# Screw to guys, I to going com. (c)
```
Delete a random word;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="delete")
# you I am home. (c)
```
Swap two random words;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="swap")
# Screw I guys, am home. going you (c)
```
Add random words from a stop-list;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="stopword")
# like Screw you guys, I am going completely home. by the way (c)
```
Add spaces between the letters of a word;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="split")
# Screw y o u guys, I am going h o m e . (c)
```
Change the case of the first letter of a random word;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="reverse")
# Screw You guys, i Am going home. (c)
```
Change a word to the corresponding emoji;
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="text2emoji")
# Screw you guys, I am going home. (c)
```
Replace an n-gram in a word with an erroneous one.
```python
text = "Screw you guys, I am going home. (c)"
word_aug.augment(text=text, action="ngram")
# Scren you guys, I am going home. (c)
```
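To make the word-level transformations concrete, here is a minimal plain-Python sketch of two of them, split and reverse. It is not the library's implementation: the function names are hypothetical, and the word positions are passed in by hand, whereas the library picks them at random.

```python
def split_word(word: str) -> str:
    """Insert a space between every pair of adjacent characters."""
    return " ".join(word)


def flip_first_letter_case(word: str) -> str:
    """Swap the case of the first character and leave the rest untouched."""
    if not word:
        return word
    return word[0].swapcase() + word[1:]


def apply_to_words(text: str, indices: set, transform) -> str:
    """Apply `transform` to the whitespace-separated words at `indices`."""
    words = text.split()
    return " ".join(transform(w) if i in indices else w for i, w in enumerate(words))


text = "Screw you guys, I am going home. (c)"
print(apply_to_words(text, {1, 6}, split_word))
# Screw y o u guys, I am going h o m e . (c)
print(apply_to_words(text, {1, 3, 4}, flip_first_letter_case))
# Screw You guys, i Am going home. (c)
```

With these hand-picked positions the sketch reproduces the split and reverse outputs shown above.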

Character level

```python
from augmentex import CharAug

char_aug = CharAug(
    unit_prob=0.3,  # Percentage of the phrase to which augmentations will be applied
    min_aug=1,  # Minimum number of augmentations
    max_aug=5,  # Maximum number of augmentations
    mult_num=3,  # Maximum number of repetitions of characters (only for the multiply method)
    lang="eng",  # supports: "rus", "eng"
    platform="pc",  # supports: "pc", "mobile"
    random_seed=42,
)
```

Randomly swap upper / lower case in a string;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="shift")
# Screw YoU guys, I am going Home. (C)
```
Substitute correct characters with their common incorrect counterparts;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="orfo")
# Sedew you guya, I am going home. (c)
```
Substitute correct characters as if they were mistyped on a keyboard;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="typo")
# Sxrew you gugs, I am going home. (x)
```
Delete a random character;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="delete")
# crew you guys Iam goinghme. (c)
```
Insert a random character;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="insert")
# Screw you ughuys, I vam gcoing hxome. (c)
```
Repeat a random character;
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="multiply")
# Screw yyou guyss, I am ggoinng home. (c)
```
Swap two adjacent characters.
```python
text = "Screw you guys, I am going home. (c)"
char_aug.augment(text=text, action="swap")
# Srcewy ou guys,I am oging hmoe. (c)
```
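The idea behind the typo action, replacing a character with one that sits next to it on the keyboard, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `QWERTY_NEIGHBOURS` is a tiny hand-written map and `keyboard_typo` is a hypothetical helper.

```python
import random

# Tiny hand-written neighbour map for a few QWERTY keys (illustrative only).
QWERTY_NEIGHBOURS = {
    "c": "xv",
    "y": "tu",
    "s": "ad",
    "o": "ip",
}


def keyboard_typo(text: str, prob: float, rng: random.Random) -> str:
    """Replace each mapped character with a neighbouring key with probability `prob`."""
    out = []
    for ch in text:
        neighbours = QWERTY_NEIGHBOURS.get(ch.lower())
        if neighbours and rng.random() < prob:
            repl = rng.choice(neighbours)
            # Keep the original letter case so "C" becomes "X" or "V".
            out.append(repl.upper() if ch.isupper() else repl)
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(42)
print(keyboard_typo("Screw you guys, I am going home. (c)", prob=0.3, rng=rng))
```

Passing a seeded `random.Random` instance keeps the corruption reproducible, which serves the same purpose as a fixed seed in the constructor.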

Batch processing

📁 To process texts in batches, call the aug_batch method instead of augment and pass it a list of strings.

```python
from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4,  # Percentage of the phrase to which augmentations will be applied
    min_aug=1,  # Minimum number of augmentations
    max_aug=5,  # Maximum number of augmentations
    lang="eng",  # supports: "rus", "eng"
    platform="pc",  # supports: "pc", "mobile"
    random_seed=42,
)

text_list = ["Screw you guys, I am going home. (c)"] * 10
word_aug.aug_batch(text_list, batch_prob=0.5)  # without action

text_list = ["Screw you guys, I am going home. (c)"] * 10
word_aug.aug_batch(text_list, batch_prob=0.5, action="replace")  # with action
```
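The meaning of batch_prob is not described here. A reasonable guess, and it is only an assumption, is that it is the share of texts in the batch that get augmented while the rest pass through unchanged. The stand-in below (hypothetical `augment_batch`, with `str.upper` playing the role of an augmentation) sketches that behaviour:

```python
import random


def augment_batch(texts, augment, batch_prob, rng):
    """Hypothetical stand-in for batch processing.

    Assumption: `batch_prob` is the share of texts in the batch that are
    augmented; the remaining texts are returned unchanged, in order.
    """
    n_selected = int(len(texts) * batch_prob)
    selected = set(rng.sample(range(len(texts)), n_selected))
    return [augment(t) if i in selected else t for i, t in enumerate(texts)]


text_list = ["Screw you guys, I am going home. (c)"] * 10
result = augment_batch(text_list, str.upper, batch_prob=0.5, rng=random.Random(42))
print(sum(r != t for r, t in zip(result, text_list)))  # 5
```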

Compute your own statistics

📊 To use your own statistics for the replace and orfo methods, specify the paths to two parallel corpora: one with error-free texts and one with the same texts containing errors.

Example of txt files:

| texts_without_errors.txt | texts_with_errors.txt |
|:-----------|:-----------|
| some text without errors 1 | some text with errors 1 |
| some text without errors 2 | some text with errors 2 |
| some text without errors 3 | some text with errors 3 |
| ... | ... |

```python
from augmentex import WordAug

word_aug = WordAug(
    unit_prob=0.4,  # Percentage of the phrase to which augmentations will be applied
    min_aug=1,  # Minimum number of augmentations
    max_aug=5,  # Maximum number of augmentations
    lang="eng",  # supports: "rus", "eng"
    platform="pc",  # supports: "pc", "mobile"
    random_seed=42,
    correct_texts_path="correct_texts.txt",
    error_texts_path="error_texts.txt",
)
```
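To show what statistics from parallel corpora can look like, here is a simplified sketch that counts word-level substitutions between line-aligned texts. It is not how Augmentex computes its statistics; `collect_replacements` is a hypothetical helper, and it only compares lines that have the same number of words.

```python
from collections import Counter, defaultdict


def collect_replacements(correct_lines, error_lines):
    """Count word-level substitutions in line-aligned parallel texts.

    Simplification: only line pairs with the same number of words are used,
    and words are compared position by position.
    """
    stats = defaultdict(Counter)
    for correct, wrong in zip(correct_lines, error_lines):
        c_words, w_words = correct.split(), wrong.split()
        if len(c_words) != len(w_words):
            continue  # a real aligner would handle insertions and deletions
        for c, w in zip(c_words, w_words):
            if c != w:
                stats[c][w] += 1
    return stats


correct = ["I am going home", "you are going home"]
errors = ["I am goin hom", "you are going hom"]
stats = collect_replacements(correct, errors)
print(stats["home"].most_common())   # [('hom', 2)]
print(stats["going"].most_common())  # [('goin', 1)]
```

Counts like these can then be turned into sampling weights, so that frequent real-world misspellings are generated more often than rare ones.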

Google Colab example

You can familiarize yourself with the usage in the Google Colab example.

Contributing

Issue

If you see an open issue and are willing to work on it, assign yourself to it and leave a comment estimating how long the fix will take. See the Pull request section below.
If you want to add something new or you have found a bug, start by creating a new issue that describes the problem or feature. Don't forget to add the appropriate labels.

Pull request

How to make a pull request:

1. Clone the repository;
2. Create a new branch, for example `git checkout -b issue-id-short-name`;
3. Make changes to the code (make sure you are working in the new branch);
4. `git push`;
5. Create a pull request to the develop branch;
6. Add a brief description of the work done;
7. Expect comments from the authors.

References

SAGE: a superlib developed jointly with our friends by the AGI NLP team. It provides advanced spelling corruption and spell checking techniques, some of which use Augmentex.

Authors

Aleksandr Abramov — Source code and algorithm author;
Mark Baushenko — Source code lead developer.

Owner

Name: AI Forever
Login: ai-forever
Kind: organization
Location: Armenia

Repositories: 60
Profile: https://github.com/ai-forever

Creating ML for the future. AI projects you already know. We are a non-profit organization with members from all over the world.

GitHub Events

Total

Watch event: 14

Last Year

Watch event: 14

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 8
Total pull requests: 12
Average time to close issues: about 2 months
Average time to close pull requests: about 4 hours
Total issue authors: 4
Total pull request authors: 2
Average comments per issue: 0.38
Average comments per pull request: 0.0
Merged pull requests: 9
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

e0xextazy (4)
Koziev (1)
Ab1992ao (1)
sheriff1max (1)

Pull Request Authors

e0xextazy (10)
sheriff1max (2)

Top Labels

Issue Labels

enhancement (3) documentation (1) wontfix (1)

Pull Request Labels

Packages

Total packages: 2
Total downloads:
- pypi 313 last-month

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 10
Total maintainers: 1

proxy.golang.org: github.com/ai-forever/augmentex

Documentation: https://pkg.go.dev/github.com/ai-forever/augmentex#section-documentation
License: mit
Latest release: v1.3.1
published almost 2 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 5.4%

Average: 5.6%

Dependent repos count: 5.7%

Last synced: 10 months ago

pypi.org: augmentex

Augmentex — a library for augmenting texts with errors

Homepage: https://github.com/ai-forever/augmentex
Documentation: https://augmentex.readthedocs.io/
License: MIT
Latest release: 1.3.1
published almost 2 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 313 Last month

Rankings

Dependent packages count: 7.3%

Forks count: 30.0%

Stargazers count: 32.0%

Average: 34.5%

Dependent repos count: 68.5%

Maintainers (1)

mbaushenko

Last synced: 10 months ago

Science Score: 36.0%
