Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary
Repository
yet another text augmentation python package
Basic Info
- Host: GitHub
- Owner: ulf1
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 103 KB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 31
- Releases: 7
Metadata Files
README.md
augtxt -- Text Augmentation
Yet another text augmentation python package.
Table of Contents
- Usage
- Appendix
Usage
```py
import augtxt
import numpy as np
```
Pipelines
Sentence Augmentations
Check the demo notebook for a usage example.
Word typos
The function augtxt.augmenters.wordtypo randomly applies different augmentations to one word.
The result is a simulated distribution of possible word augmentations, e.g. how possible typographical errors are distributed for a specific original word.
The procedure does not guarantee that the original word will be augmented.
Check the demo notebook for a usage example.
Word typos for a sentence
The function augtxt.augmenters.senttypo randomly applies different augmentations to
a) at least one word in a sentence, or
b) not more than a certain percentage of words in a sentence.
The procedure guarantees that the sentence is augmented.
The function also allows excluding specific strings from augmentation (e.g. exclude=("[MASK]", "[UNK]")). However, these strings cannot include the special characters .,;:!? or whitespace.
Check the demo notebook for a usage example.
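The selection constraints described above can be illustrated with a small sketch. This is not the implementation of augtxt.augmenters.senttypo; the function name `pick_words_to_augment`, the parameter `pct_words`, and the way the two constraints are combined (at least one word, at most a given share of all words) are assumptions made for illustration only.

```python
import math
import random


def pick_words_to_augment(tokens, pct_words, exclude=(), rng=None):
    """Return sorted indices of tokens to augment (hypothetical helper).

    Assumption: at least one token is picked, and never more than
    floor(pct_words * len(tokens)) tokens, unless that bound is below 1.
    Tokens listed in `exclude` are never picked.
    """
    rng = rng or random.Random()
    candidates = [i for i, tok in enumerate(tokens) if tok not in exclude]
    if not candidates:
        return []
    upper = max(1, math.floor(pct_words * len(tokens)))
    n = rng.randint(1, min(upper, len(candidates)))
    return sorted(rng.sample(candidates, n))


tokens = "Die Lehrerin [MASK] einen Roman .".split()
print(pick_words_to_augment(tokens, 0.5, exclude=("[MASK]",), rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps the selection reproducible without touching the global random state.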
Typographical Errors (Tippfehler)
The augtxt.typo module augments characters to mimic human errors made while typing on a keyboard.
Swap two consecutive characters (Vertauscher)
The user mixes up two consecutive characters.
- Swap 1st and 2nd characters:
augtxt.typo.swap_consecutive("Kinder", loc=0) (Result: iKnder)
- Swap 1st and 2nd characters, and enforce letter cases:
augtxt.typo.swap_consecutive("Kinder", loc=0, keep_case=True) (Result: Iknder)
- Swap random i-th and (i+1)-th characters, where positions near the end of the word are more likely:
np.random.seed(seed=123); augtxt.typo.swap_consecutive("Kinder", loc='end')
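To make the effect of keep_case concrete, here is a minimal sketch of the swap in plain Python. It is not the library's code; `swap_consecutive_sketch` is a hypothetical name, and the reading of keep_case (each position keeps its original upper/lower case) is an assumption inferred from the two results listed above.

```python
def swap_consecutive_sketch(word, loc, keep_case=False):
    """Swap the characters at positions loc and loc+1 (illustrative only)."""
    if not 0 <= loc < len(word) - 1:
        return word  # no neighbour to swap with
    chars = list(word)
    first, second = chars[loc], chars[loc + 1]
    if keep_case:
        # Assumption: each position keeps its original letter case
        new_first = second.upper() if first.isupper() else second.lower()
        new_second = first.upper() if second.isupper() else first.lower()
    else:
        new_first, new_second = second, first
    chars[loc], chars[loc + 1] = new_first, new_second
    return "".join(chars)


print(swap_consecutive_sketch("Kinder", loc=0))                  # iKnder
print(swap_consecutive_sketch("Kinder", loc=0, keep_case=True))  # Iknder
```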
Add double letter (Einfüger)
The user accidentally presses a key twice.
- Make 5th letter a double letter:
augtxt.typo.pressed_twice("Eltern", loc=4) (Result: Elterrn)
Drop character (Auslasser)
The user does not press the key firmly enough (Lisbach, 2011, p.72), the key is broken, or the finger motion fails.
- Drop the 3rd letter:
augtxt.typo.drop_char("Straße", loc=2) (Result: Staße)
Drop character followed by double letter (Vertipper)
Letter is left out, but the following letter is typed twice.
It's a combination of augtxt.typo.pressed_twice and augtxt.typo.drop_char.
```py
from augtxt.typo import drop_n_next_twice
augm = drop_n_next_twice("Tante", loc=2)
# Tatte
```
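The combination can be sketched with plain string slicing. The three functions below are hypothetical stand-ins written for illustration, not the augtxt implementations; they only reproduce the results listed in this section.

```python
def pressed_twice_sketch(word, loc):
    """Duplicate the character at position loc."""
    if not 0 <= loc < len(word):
        return word
    return word[:loc + 1] + word[loc] + word[loc + 1:]


def drop_char_sketch(word, loc):
    """Remove the character at position loc."""
    if not 0 <= loc < len(word):
        return word
    return word[:loc] + word[loc + 1:]


def drop_n_next_twice_sketch(word, loc):
    """Drop the character at loc and double the one that followed it."""
    if not 0 <= loc < len(word) - 1:
        return word  # there is no following character to double
    # After the drop, the following character sits at position loc
    return pressed_twice_sketch(drop_char_sketch(word, loc), loc)


print(pressed_twice_sketch("Eltern", loc=4))     # Elterrn
print(drop_char_sketch("Straße", loc=2))         # Staße
print(drop_n_next_twice_sketch("Tante", loc=2))  # Tatte
```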
Pressed SHIFT, ALT, or SHIFT+ALT
Usually SHIFT is used to type a capital letter, and ALT or ALT+SHIFT for less common characters.
A typo might occur because these special keys are, or are not, pressed in combination with a normal key.
The function augtxt.typo.pressed_shiftalt simulates such errors randomly.
```py
from augtxt.typo import pressed_shiftalt
augm = pressed_shiftalt("Onkel", loc=2)
# OnKel, On˚el, Onel
```
The keymap can differ depending on the language and the keyboard layout.
```py
from augtxt.typo import pressed_shiftalt
import augtxt.keyboard_layouts as kbl
augm = pressed_shiftalt("Onkel", loc=2, keymap=kbl.macbook_us)
# OnKel, On˚el, Onel
```
Further, transition probabilities in case of a typo can be specified:
```py
from augtxt.typo import pressed_shiftalt
import augtxt.keyboard_layouts as kbl

keyboard_transprob = {
    "keys": [.0, .75, .2, .05],
    "shift": [.9, 0, .05, .05],
    "alt": [.9, .05, .0, .05],
    "shift+alt": [.3, .35, .35, .0]
}

augm = pressed_shiftalt("Onkel", loc=2, keymap=kbl.macbook_us, trans=keyboard_transprob)
```
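One way to read such a table is as a row-wise probability distribution. The sketch below assumes that each row belongs to the intended key state and that its four entries are the probabilities of ending up in the states keys, shift, alt, and shift+alt, in that order. That column order is an assumption, and `sample_typo_state` is a hypothetical helper, not part of augtxt.

```python
import random

# Assumed column order of every row in the table
STATES = ["keys", "shift", "alt", "shift+alt"]

keyboard_transprob = {
    "keys": [.0, .75, .2, .05],
    "shift": [.9, 0, .05, .05],
    "alt": [.9, .05, .0, .05],
    "shift+alt": [.3, .35, .35, .0],
}


def sample_typo_state(intended, rng):
    """Draw the mistyped key state, given the intended one."""
    return rng.choices(STATES, weights=keyboard_transprob[intended], k=1)[0]


rng = random.Random(42)
counts = {state: 0 for state in STATES}
for _ in range(1000):
    counts[sample_typo_state("keys", rng)] += 1
print(counts)
```

Under this reading, the zero on each row's diagonal means that a typo always leaves the intended state, so the count for "keys" stays at 0 in the loop above.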
References
- Lisbach, B., 2011. Linguistisches Identity Matching. Vieweg+Teubner, Wiesbaden. https://doi.org/10.1007/978-3-8348-9791-6
Punctuation Errors (Zeichensetzungsfehler)
Remove PUNCT and COMMA tokens
The PUNCT (.?!;:) and COMMA (,) tokens carry syntactic information.
A use case:
```py
import augtxt.punct
text = ("Die Lehrerin [MASK] einen Roman. "
        "Die Schülerin [MASK] ein Aufsatz, der sehr [MASK] war.")
augmented = augtxt.punct.remove_syntaxinfo(text)
# 'Die Lehrerin [MASK] einen Roman Die Schülerin [MASK] ein Aufsatz der sehr [MASK] war'
```
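For readers who want to see the transformation without installing the package, the same output can be reproduced with `str.translate`. This is a sketch under the assumption that only the characters .?!;: and , are deleted; the function name `remove_punct_and_comma` is hypothetical and the library may handle more cases.

```python
def remove_punct_and_comma(text):
    """Delete PUNCT (.?!;:) and COMMA (,) characters from the text."""
    table = str.maketrans("", "", ".?!;:,")
    return text.translate(table)


text = ("Die Lehrerin [MASK] einen Roman. "
        "Die Schülerin [MASK] ein Aufsatz, der sehr [MASK] war.")
print(remove_punct_and_comma(text))
```

Tokens such as [MASK] survive unchanged because they contain none of the deleted characters.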
Merge two consecutive words
The function augtxt.punct.merge_words randomly removes whitespace or hyphens between words, and transforms the second word to lower case.
```py
import augtxt.punct

text = "Die Bindestrich-Wörter sind da."

np.random.seed(seed=23)
augmented = augtxt.punct.merge_words(text, num_aug=1)
assert augmented == 'Die Bindestrich-Wörter sindda.'

np.random.seed(seed=1)
augmented = augtxt.punct.merge_words(text, num_aug=1)
assert augmented == 'Die Bindestrichwörter sind da.'
```
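The merge step itself can be sketched deterministically. The hypothetical helper `merge_words_sketch` below takes the index `k` of the separator to remove instead of drawing it at random; it is an illustration, not the augtxt implementation.

```python
import re


def merge_words_sketch(text, k):
    """Remove the k-th space or hyphen between two words and
    lower-case the word that followed it."""
    separators = list(re.finditer(r"(?<=\w)[ -](?=\w)", text))
    if not 0 <= k < len(separators):
        return text
    sep = separators[k]
    # The lookahead guarantees that a word starts right after the separator
    second = re.match(r"\w+", text[sep.end():])
    rest = text[sep.end() + second.end():]
    return text[:sep.start()] + second.group().lower() + rest


text = "Die Bindestrich-Wörter sind da."
print(merge_words_sketch(text, 1))  # Die Bindestrichwörter sind da.
print(merge_words_sketch(text, 3))  # Die Bindestrich-Wörter sindda.
```

A random variant would pick `k` uniformly from the available separators.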
Word Order Errors (Wortstellungsfehler)
The augtxt.order module simulates errors at the word-token level.
Swap words
```py
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.swap_consecutive(text, exclude=["[MASK]"], num_aug=1))
# die Tausche Wörter, lasse sie weg, oder [MASK] was.
```
Write twice
```py
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.write_twice(text, exclude=["[MASK]"], num_aug=1))
# Tausche die die Wörter, lasse sie weg, oder [MASK] was.
```
Drop word
```py
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.drop_word(text, exclude=["[MASK]"], num_aug=1))
# Tausche Wörter, lasse sie weg, oder [MASK] was.
```
Drop word followed by a double word
```py
np.random.seed(seed=42)
text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(augtxt.order.drop_n_next_twice(text, exclude=["[MASK]"], num_aug=1))
# die die Wörter, lasse sie weg, oder [MASK] was.
```
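The four word-level operations can be sketched on whitespace-separated tokens with an explicit position instead of a random one. The function names, the `loc` parameter, and the simple whitespace tokenization are assumptions made for illustration; the library's tokenization and sampling may differ.

```python
def swap_words_sketch(text, loc, exclude=()):
    """Swap the tokens at loc and loc+1 unless one of them is excluded."""
    tok = text.split(" ")
    if loc + 1 < len(tok) and not {tok[loc], tok[loc + 1]} & set(exclude):
        tok[loc], tok[loc + 1] = tok[loc + 1], tok[loc]
    return " ".join(tok)


def write_twice_sketch(text, loc, exclude=()):
    """Repeat the token at loc."""
    tok = text.split(" ")
    if loc < len(tok) and tok[loc] not in exclude:
        tok.insert(loc, tok[loc])
    return " ".join(tok)


def drop_word_sketch(text, loc, exclude=()):
    """Remove the token at loc."""
    tok = text.split(" ")
    if loc < len(tok) and tok[loc] not in exclude:
        del tok[loc]
    return " ".join(tok)


def drop_n_next_twice_words_sketch(text, loc, exclude=()):
    """Drop the token at loc and write the following token twice."""
    tok = text.split(" ")
    if loc + 1 < len(tok) and not {tok[loc], tok[loc + 1]} & set(exclude):
        tok[loc] = tok[loc + 1]
    return " ".join(tok)


text = "Tausche die Wörter, lasse sie weg, oder [MASK] was."
print(swap_words_sketch(text, 0, exclude=["[MASK]"]))
print(write_twice_sketch(text, 1, exclude=["[MASK]"]))
print(drop_word_sketch(text, 1, exclude=["[MASK]"]))
print(drop_n_next_twice_words_sketch(text, 0, exclude=["[MASK]"]))
```

With these positions the sketch reproduces the four outputs shown in this section, and a position that touches [MASK] leaves the sentence unchanged.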
~~Word substitutions~~ (Deprecated)
Deprecation Notice:
augtxt.wordsubs will be deleted in 0.6.0 and replaced.
Synonym replacement in particular is not trivial in the German language.
Please check https://github.com/ulf1/flexion for further information.
The augtxt.wordsubs module is about replacing specific strings, e.g. words, morphemes, named entities, abbreviations, etc.
Using pseudo-synonym dictionaries to augment tokenized sequences
It is recommended to filter vocab further. For example, PoS tag the sequences and only augment VERB and NOUN tokens.
```py
import itertools
import augtxt.wordsubs
import numpy as np

original_seqs = [
    ["Das", "ist", "ein", "Satz", "."],
    ["Dies", "ist", "ein", "anderer", "Satz", "."]]
vocab = set([s.lower() for s in itertools.chain(*original_seqs) if len(s) > 1])

synonyms = {
    'anderer': ['verschiedener', 'einiger', 'vieler', 'diverser', 'sonstiger',
                'etlicher', 'einzelner', 'bestimmter', 'ähnlicher'],
    'satz': ['sätze', 'anfangssatz', 'schlussatz', 'eingangssatz',
             'einleitungssatzes', 'einleitungsssatz', 'einleitungssatz',
             'behauptungssatz', 'beispielsatz', 'schlusssatz', 'anfangssatzes',
             'einzelsatz', '#einleitungssatz', 'minimalsatz', 'inhaltssatz',
             'aufforderungssatz', 'ausgangssatz'],
    '.': [',', '🎅'],
    'das': ['welches', 'solches'],
    'ein': ['weiteres'],
    'dies': ['was', 'umstand', 'dass']
}

np.random.seed(42)
augmented_seqs = augtxt.wordsubs.synonym_replacement(
    original_seqs, synonyms, num_aug=10, keep_case=True)

# check results for 1st sentence
for s in augmented_seqs[0]:
    print(s)
```
Appendix
Installation
The augtxt git repo is available as a PyPI package.
```sh
pip install "augtxt>=0.5.0"
pip install git+ssh://git@github.com/ulf1/augtxt.git
```
Commands
Install a virtual environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -r requirements-demo.txt
(If your git repo is stored in a folder with whitespaces, then don't use the subfolder .venv. Use an absolute path without whitespaces.)
Python commands
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
- Run Unit Tests:
pytest
Publish
```sh
pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
```
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Support
Please open an issue for support.
Contributing
Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.
Owner
- Name: Ulf Hamster
- Login: ulf1
- Kind: user
- Repositories: 45
- Profile: https://github.com/ulf1
1x developer
GitHub Events
Total
- Push event: 3
- Pull request event: 1
- Create event: 1
Last Year
- Push event: 3
- Pull request event: 1
- Create event: 1
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 33
- Total pull requests: 35
- Average time to close issues: about 1 month
- Average time to close pull requests: about 12 hours
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.12
- Average comments per pull request: 0.09
- Merged pull requests: 23
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 4
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ulf1 (33)
Pull Request Authors
- ulf1 (38)
- snyk-bot (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- jupyterlab >=3.0.5,<4
- matplotlib >=3.3.3,<4
- flake8 >=3.8.4 development
- pypandoc >=1.5 development
- pytest >=6.2.1 development
- setuptools >=56. development
- twine ==3.3.0 development
- wheel >=0.31.0 development
- kshingle >=0.8.3,<1
- numpy >=1.19.0,<2
- scipy >=1.5.4,<2
- kshingle >=0.6.1,<1
- numpy >=1.19.0,<2
- scipy >=1.5.4,<2
- actions/checkout v1 composite
- actions/setup-python v1 composite