Science Score: 33.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ✓ Committers with academic emails: 1 of 33 committers (3.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.2%) to scientific vocabulary
Repository
Data augmentation for NLP
Basic Info
- Host: GitHub
- Owner: makcedward
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Homepage: https://makcedward.github.io/
- Size: 3.21 MB
Statistics
- Stars: 4,577
- Watchers: 41
- Forks: 468
- Open Issues: 80
- Releases: 25
Topics
Metadata Files
README.md
nlpaug
This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about data augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.
Features
- Generate synthetic data for improving model performance without manual effort
- Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
- Plug and play with any machine learning/neural network framework (e.g. scikit-learn, PyTorch, TensorFlow)
- Support textual and audio input
Textual Data Augmentation Example

Acoustic Data Augmentation Example

| Section | Description |
|:---:|:---:|
| Quick Demo | How to use this library |
| Augmenter | Introduce all available augmentation methods |
| Installation | How to install this library |
| Recent Changes | Latest enhancement |
| Extension Reading | More real-life examples or research |
| Reference | Reference of external resources such as data or model |
Quick Demo
- Quick Example
- Example of Augmentation for Textual Inputs
- Example of Augmentation for Multilingual Textual Inputs
- Example of Augmentation for Spectrogram Inputs
- Example of Augmentation for Audio Inputs
- Example of Orchestrating Multiple Augmenters
- Example of Showing Augmentation History
- How to train TF-IDF model
- How to train LAMBADA model
- How to create custom augmentation
- API Documentation
Augmenter
| Type | Target | Augmenter | Action | Description |
|:---:|:---:|:---:|:---:|:---:|
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Textual | | OcrAug | substitute | Simulate OCR engine error |
| Textual | | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute a word with its opposite meaning according to WordNet antonyms |
| Textual | | ContextualWordEmbsAug | insert, substitute | Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation |
| Textual | | RandomWordAug | swap, crop, delete | Apply augmentation randomly |
| Textual | | SpellingAug | substitute | Substitute a word according to a spelling mistake dictionary |
| Textual | | SplitAug | split | Split one word into two words randomly |
| Textual | | SynonymAug | substitute | Substitute a similar word according to WordNet/PPDB synonyms |
| Textual | | TfIdfAug | insert, substitute | Use TF-IDF to find out how a word should be augmented |
| Textual | | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe or fasttext embeddings to apply augmentation |
| Textual | | BackTranslationAug | substitute | Leverage two translation models for augmentation |
| Textual | | ReservedAug | substitute | Replace reserved words |
| Textual | Sentence | ContextualWordEmbsForSentenceAug | insert | Insert a sentence according to XLNet, GPT2 or DistilGPT2 prediction |
| Textual | | AbstSummAug | substitute | Summarize an article with an abstractive summarization method |
| Textual | | LambadaAug | substitute | Use a language model to generate text, then use a classification model to retain high-quality results |
| Signal | Audio | CropAug | delete | Delete an audio segment |
| Signal | | LoudnessAug | substitute | Adjust the audio's volume |
| Signal | | MaskAug | substitute | Mask an audio segment |
| Signal | | NoiseAug | substitute | Inject noise |
| Signal | | PitchAug | substitute | Adjust the audio's pitch |
| Signal | | ShiftAug | substitute | Shift the time dimension forward/backward |
| Signal | | SpeedAug | substitute | Adjust the audio's speed |
| Signal | | VtlpAug | substitute | Change vocal tract |
| Signal | | NormalizeAug | substitute | Normalize audio |
| Signal | | PolarityInverseAug | substitute | Swap positive and negative for audio |
| Signal | Spectrogram | FrequencyMaskingAug | substitute | Set a block of values to zero along the frequency dimension |
| Signal | | TimeMaskingAug | substitute | Set a block of values to zero along the time dimension |
| Signal | | LoudnessAug | substitute | Adjust volume |
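To make two rows of the table concrete, here is a minimal pure-Python sketch of the ideas behind a keyboard-distance substitution (as in KeyboardAug) and a time mask over a spectrogram (as in TimeMaskingAug). It is not nlpaug's implementation: the `KEYBOARD_NEIGHBOURS` map, the function names and the `prob`/`max_width` parameters are assumptions made for illustration only.

```python
import random

# Hypothetical, heavily abbreviated QWERTY neighbour map (illustration only).
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz",
    "e": "wrsd",
    "i": "uojk",
    "o": "ipkl",
    "s": "awedxz",
    "t": "rfgy",
}


def keyboard_substitute(text, prob=0.3, rng=None):
    """Replace some characters with a neighbouring key to mimic typing errors."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        neighbours = KEYBOARD_NEIGHBOURS.get(ch.lower())
        # Only characters present in the map can be substituted.
        if neighbours and rng.random() < prob:
            out.append(rng.choice(neighbours))
        else:
            out.append(ch)
    return "".join(out)


def time_mask(spectrogram, max_width=2, rng=None):
    """Zero one random block of consecutive time steps in every frequency row.

    `spectrogram` is a list of rows (frequency bins); each row is a list of
    values over time. A new nested list is returned; the input is not modified.
    """
    rng = rng or random.Random()
    n_steps = len(spectrogram[0])
    width = rng.randint(1, min(max_width, n_steps))
    start = rng.randint(0, n_steps - width)
    return [
        [0.0 if start <= t < start + width else v for t, v in enumerate(row)]
        for row in spectrogram
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    print(keyboard_substitute("the quick brown fox", rng=rng))
    print(time_mask([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]], rng=rng))
```

Both functions take an optional `rng` so that results can be reproduced with a seeded `random.Random`; the output length and shape always match the input.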
Flow
| Type | Flow | Description |
|:---:|:---:|:---:|
| Pipeline | Sequential | Apply list of augmentation functions sequentially |
| Pipeline | Sometimes | Apply some augmentation functions randomly |
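The two flow types can be sketched in a few lines of plain Python. This is an illustration of the pattern, not nlpaug's own classes: the class bodies, the `prob` parameter, and the reading of "Sometimes" as "apply each augmenter independently with a fixed probability" are assumptions, and `drop_last_word` and `shout` are hypothetical toy augmenters.

```python
import random


class Sequential:
    """Apply every augmenter in order, feeding each output into the next."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, data):
        for aug in self.augmenters:
            data = aug(data)
        return data


class Sometimes:
    """Apply each augmenter independently with probability `prob` (assumed semantics)."""

    def __init__(self, augmenters, prob=0.5, rng=None):
        self.augmenters = list(augmenters)
        self.prob = prob
        self.rng = rng or random.Random()

    def augment(self, data):
        for aug in self.augmenters:
            if self.rng.random() < self.prob:
                data = aug(data)
        return data


# Two toy augmenters (hypothetical, for illustration only).
def drop_last_word(text):
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


def shout(text):
    return text.upper()


if __name__ == "__main__":
    flow = Sequential([drop_last_word, shout])
    print(flow.augment("the quick brown fox"))  # THE QUICK BROWN
```

Treating an augmenter as any callable keeps the sketch small; a flow then only needs to decide which callables run and in what order.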
Installation
The library supports Python 3.5+ on Linux and Windows.
To install the library:
```bash
pip install numpy requests nlpaug
```
Or install the latest version (including beta features) directly from GitHub:
```bash
pip install numpy git+https://github.com/makcedward/nlpaug.git
```
Or install with conda:
```bash
conda install -c makcedward nlpaug
```
If you use BackTranslationAug, ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug or AbstSummAug, install the following dependencies as well. The version specifiers are quoted so that the shell does not treat `>` as output redirection:
```bash
pip install "torch>=1.6.0" "transformers>=4.11.3" sentencepiece
```
If you use LambadaAug, install the following dependency as well:
```bash
pip install "simpletransformers>=0.61.10"
```
If you use AntonymAug or SynonymAug, install the following dependency as well:
```bash
pip install "nltk>=3.4.5"
```
If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first and install the following dependency as well:
```python
from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.')  # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.')  # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.')  # Download fasttext model
```
```bash
pip install "gensim>=4.1.2"
```
If you use SynonymAug (PPDB), download the file from the following URI. You may not be able to run the augmenter if you get the PPDB file from another website:
http://paraphrase.org/#/download
If you use PitchAug, SpeedAug or VtlpAug, install the following dependencies as well:
```bash
pip install "librosa>=0.9.1" matplotlib
```
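Because the optional dependencies differ per augmenter, it can help to check what is missing before constructing one. The sketch below is not part of nlpaug: it copies the augmenter-to-package mapping from the installation notes above into a hypothetical `OPTIONAL_DEPENDENCIES` table and uses the standard-library `importlib.metadata` module, which assumes Python 3.8 or newer. It checks only whether a package is installed, not its minimum version.

```python
from importlib import metadata

# Optional packages per augmenter, copied from the installation notes.
_TRANSFORMER_STACK = ["torch", "transformers", "sentencepiece"]
_AUDIO_STACK = ["librosa", "matplotlib"]

OPTIONAL_DEPENDENCIES = {
    "BackTranslationAug": _TRANSFORMER_STACK,
    "ContextualWordEmbsAug": _TRANSFORMER_STACK,
    "ContextualWordEmbsForSentenceAug": _TRANSFORMER_STACK,
    "AbstSummAug": _TRANSFORMER_STACK,
    "LambadaAug": ["simpletransformers"],
    "AntonymAug": ["nltk"],
    "SynonymAug": ["nltk"],
    "WordEmbsAug": ["gensim"],
    "PitchAug": _AUDIO_STACK,
    "SpeedAug": _AUDIO_STACK,
    "VtlpAug": _AUDIO_STACK,
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def missing_packages(augmenter):
    """List the optional packages for `augmenter` that are not installed."""
    required = OPTIONAL_DEPENDENCIES.get(augmenter, [])
    return [pkg for pkg in required if installed_version(pkg) is None]


if __name__ == "__main__":
    for name in sorted(OPTIONAL_DEPENDENCIES):
        print(name, "missing:", missing_packages(name))
```

An augmenter that is not in the table is reported as having nothing missing, since the notes list no extra packages for it.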
Recent Changes
1.1.11 Jul 6, 2022
- Return list of output
- Fix download util
- Fix lambda label misalignment
- Add language pack reference link for SynonymAug
See changelog for more details.
Extension Reading
- Data Augmentation library for Text
- Does your NLP model able to prevent adversarial attack?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
- A Visual Survey of Data Augmentation in NLP
Reference
This library uses data (e.g. captured from the internet), research (e.g. following augmenter ideas) and models (e.g. using pre-trained models). See data source for more details.
Citation
```latex
@misc{ma2019nlpaug,
  title={NLP Augmentation},
  author={Edward Ma},
  howpublished={https://github.com/makcedward/nlpaug},
  year={2019}
}
```
This package is cited by many books, workshops and academic research papers (70+). Here are some examples; you may visit here to get the full list.
Workshops citing nlpaug
- S. Vajjala. NLP without a readymade labeled dataset at Toronto Machine Learning Summit, 2021. 2021
Books citing nlpaug
- S. Vajjala, B. Majumder, A. Gupta and H. Surana. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. 2020
- A. Bartoli and A. Fusiello. Computer Vision–ECCV 2020 Workshops. 2020
- L. Werra, L. Tunstall, and T. Wolf. Natural Language Processing with Transformers. 2022
Research papers citing nlpaug
- Google: M. Raghu and E. Schmidt. A Survey of Deep Learning for Scientific Discovery. 2020
- Sirius XM: E. Jing, K. Schneck, D. Egan and S. A. Waterman. Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts. 2021
- Salesforce Research: B. Newman, P. K. Choubey and N. Rajani. P-adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts. 2021
- Salesforce Research: L. Xue, M. Gao, Z. Chen, C. Xiong and R. Xu. Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks. 2021
Contributions
- sakares saengkaew
- Binoy Dalal
- Emrecan Çelik
Owner
- Name: Edward Ma
- Login: makcedward
- Kind: user
- Location: San Francisco Bay Area
- Company: SambaNova Systems
- Website: https://makcedward.github.io/
- Repositories: 11
- Profile: https://github.com/makcedward
Focus on Natural Language Processing, Transferring Learning, Data Science Architecture
GitHub Events
Total
- Issues event: 1
- Watch event: 170
- Issue comment event: 2
- Pull request event: 3
- Fork event: 9
Last Year
- Issues event: 1
- Watch event: 170
- Issue comment event: 2
- Pull request event: 3
- Fork event: 9
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Edward Ma | m****d@g****m | 554 |
| Chirag Jain | j****5@g****m | 8 |
| binoydalal | b****l@u****u | 4 |
| Ricardo Pieper | r****r@l****m | 3 |
| Anatoly Vostryakov | a****v@g****m | 2 |
| DrMatters | q****q@g****m | 2 |
| John Giorgi | j****i@g****m | 2 |
| logigo | 4****o | 2 |
| Mariia Trofimova | m****y@b****m | 2 |
| Jessica Sousa | j****s@g****m | 1 |
| Joanna Bitton | j****n@g****m | 1 |
| João António | j****e@g****m | 1 |
| MarkusSagen | m****n@g****m | 1 |
| Narayan Acharya | n****6@g****m | 1 |
| Rogier Stegeman | 4****n | 1 |
| Sakares Saengkaew | s****s@g****m | 1 |
| Sebastian Sosa | s****e@g****m | 1 |
| Tan Li | t****n@t****v | 1 |
| USVSN SAI PRASHANTH | 5****h | 1 |
| Vishal Singh | v****x@g****m | 1 |
| b.giahuy | h****i@e****t | 1 |
| emrecncelik | e****k@g****m | 1 |
| hsm207 | h****7 | 1 |
| karthikmurugadoss | k****k@n****t | 1 |
| phunc20 | w****0@g****m | 1 |
| robolamp | r****p@y****u | 1 |
| Ivan Pereira | n****1@g****m | 1 |
| Ilya Fedorov | b****n@m****u | 1 |
| Harrison Chase | h****7@g****m | 1 |
| Chandan Akiti | c****i@g****m | 1 |
| and 3 more... | ||
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 94
- Total pull requests: 24
- Average time to close issues: 3 months
- Average time to close pull requests: 16 days
- Total issue authors: 89
- Total pull request authors: 21
- Average comments per issue: 1.29
- Average comments per pull request: 0.13
- Merged pull requests: 14
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 2
- Pull requests: 4
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 2
- Pull request authors: 4
- Average comments per issue: 0.0
- Average comments per pull request: 0.25
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- beyondguo (3)
- lindsaydbrin (2)
- pratikchhapolika (2)
- kgarg8 (2)
- wiseyoungbuck (1)
- moe-men (1)
- 980202006 (1)
- EtherealRise (1)
- Juliano-rb (1)
- vc34 (1)
- fratambot (1)
- lei-liu1 (1)
- bhomass (1)
- le8888e (1)
- anvitha-jain (1)
Pull Request Authors
- JohnGiorgi (2)
- SR-Rubel (2)
- Keramatfar (2)
- makcedward (2)
- igopalakrishna (2)
- sbrugman (2)
- Logigo (2)
- tshu-w (2)
- emrecncelik (1)
- robolamp (1)
- EvanUp (1)
- litanlitudan (1)
- IgorMunizS (1)
- baskrahmer (1)
- MarkusSagen (1)
Packages
- Total packages: 3
- Total downloads:
  - pypi 159,014 last-month
- Total docker downloads: 5,590
- Total dependent packages: 28 (may contain duplicates)
- Total dependent repositories: 141 (may contain duplicates)
- Total versions: 45
- Total maintainers: 1
pypi.org: nlpaug
Natural language processing augmentation library for deep neural networks
- Homepage: https://github.com/makcedward/nlpaug
- Documentation: https://nlpaug.readthedocs.io/
- License: MIT
- Latest release: 1.1.11 (published over 3 years ago)
Rankings
Maintainers (1)
proxy.golang.org: github.com/makcedward/nlpaug
- Documentation: https://pkg.go.dev/github.com/makcedward/nlpaug#section-documentation
- License: mit
- Latest release: v0.0.5 (published over 6 years ago)
Rankings
conda-forge.org: nlpaug
This Python library helps you augment NLP data for your machine learning projects. `Augmenter` is the basic element of augmentation, while `Flow` is a pipeline that orchestrates multiple augmenters together. Nlpaug generates synthetic data for improving model performance without manual effort. It is a simple, easy-to-use and lightweight library where you can augment data in 3 lines of code, and it plugs into any machine learning or neural network framework (e.g. scikit-learn, PyTorch, TensorFlow). Nlpaug supports textual and audio input as well.
- Homepage: https://github.com/makcedward/nlpaug
- License: MIT
- Latest release: 1.1.11 (published over 3 years ago)
Rankings
Dependencies
- gdown >=4.0.0
- numpy >=1.16.2
- pandas >=1.2.0
- requests >=2.22.0
- gensim >=4.1.2 development
- librosa >=0.9 development
- nltk >=3.4.5 development
- pyinstrument * development
- python-dotenv >=0.10.1 development
- setuptools >=39.1.0 development
- simpletransformers * development
- torch * development
- transformers * development