https://github.com/amazon-science/transformers-data-augmentation
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.4%) to scientific vocabulary
Keywords
Repository
Basic Info
Statistics
- Stars: 52
- Watchers: 2
- Forks: 7
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
Data Augmentation using Pre-trained Transformer Models
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper
Code contains implementations of the following data augmentation methods:
- EDA (baseline)
- Backtranslation (baseline)
- CBERT (baseline)
- BERT Prepend (our paper)
- GPT-2 Prepend (our paper)
- BART Prepend (our paper)
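The "Prepend" methods fine-tune a pre-trained language model on examples whose class label is prepended to the text, so the model can later generate label-conditioned synthetic examples. A minimal sketch of that input formatting, assuming a hypothetical `prepend_label` helper and an illustrative `SEP` separator (the repository's actual tokenization and separators may differ):

```python
def prepend_label(label: str, text: str, sep: str = "SEP") -> str:
    """Format one training example as '<label> <sep> <text>' for
    label-conditioned language-model fine-tuning (illustrative only)."""
    return f"{label} {sep} {text}"

# Example sentiment-classification examples (made up for illustration)
examples = [
    ("positive", "a gripping and moving film"),
    ("negative", "the plot never gets going"),
]

formatted = [prepend_label(lbl, txt) for lbl, txt in examples]
for line in formatted:
    print(line)
```

At generation time, the fine-tuned model is prompted with a label (plus separator) and asked to complete the sentence, yielding new training examples for that class.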
Datasets
In the paper, we use three datasets from the following resources:
- STSA-2: https://github.com/1024er/cbert_aug/tree/crayon/datasets/stsa.binary
- TREC: https://github.com/1024er/cbert_aug/tree/crayon/datasets/TREC
- SNIPS: https://github.com/MiuLab/SlotGated-SLU/tree/master/data/snips
Low-data regime experiment setup
Run the src/utils/download_and_prepare_datasets.sh file to prepare all datasets.
download_and_prepare_datasets.sh performs the following steps:
1. Download data from GitHub.
2. Replace numeric labels with text for the STSA-2 and TREC datasets.
3. For a given dataset, create 15 random splits of train and dev data.
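Step 3 above (creating 15 random train/dev splits) can be sketched as follows. The `make_splits` helper, the 10% dev fraction, and the per-split seeding scheme are illustrative assumptions, not the repository's exact procedure:

```python
import random

def make_splits(lines, n_splits=15, dev_fraction=0.1, seed=0):
    """Create n_splits random (train, dev) partitions of a dataset.

    Each split shuffles a copy of the data with its own seed, then
    holds out dev_fraction of the examples as the dev set.
    """
    splits = []
    for i in range(n_splits):
        rng = random.Random(seed + i)  # distinct, reproducible seed per split
        shuffled = list(lines)
        rng.shuffle(shuffled)
        n_dev = max(1, int(len(shuffled) * dev_fraction))
        splits.append((shuffled[n_dev:], shuffled[:n_dev]))
    return splits

# Toy dataset of 20 labeled lines
data = [f"label\tsentence {i}" for i in range(20)]
splits = make_splits(data)
train, dev = splits[0]
print(len(splits), len(train), len(dev))  # 15 18 2
```

Multiple random splits like these are a common way to reduce variance when reporting low-data-regime results, since a single small train/dev split can be unrepresentative.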
Dependencies
To run this code, you need the following dependencies:
- PyTorch 1.5
- fairseq 0.9
- transformers 2.9
How to run
To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder.
For example, to run data augmentation on the SNIPS dataset:
- run scripts/bart_snips_lower.sh for the BART experiment
- run scripts/bert_snips_lower.sh for the rest of the data augmentation methods
How to cite
```bibtex
@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun and
      Choudhary, Ashutosh and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}
```
Contact
Please reach out to kuvrun@amazon.com for any questions related to this code.
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license.
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 1
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- SouLeo (1)
Pull Request Authors
- dependabot[bot] (1)