https://github.com/amazon-science/stil-mbart-multiatis
Repository
Basic Info
- Host: GitHub
- Owner: amazon-science
- License: other
- Language: Python
- Default Branch: main
- Size: 13.7 KB
Statistics
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
STIL - Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using mBART on MultiATIS++
A paper presented at the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) 2020
Jack FitzGerald
Copyright Amazon.com Inc. or its affiliates
This repository contains some of the code used in the paper named above. The purpose of this repo is only to allow other researchers to reproduce the study and results presented in the paper. Jack and Amazon likely will not improve this repo over time.
Setup
I used AWS SageMaker for this work: an m5.xlarge instance for data preparation and analysis, and a p3.16xlarge instance for training.
Download the pretrained model
This study uses the pretrained mBART CC25 model, which can be downloaded from the fairseq repository.
Create the datasets
The dataset is based on the MultiATIS++ and MultiATIS datasets.
MultiATIS++
The 2020 paper by Xu et al. entitled "End-to-End Slot Alignment and Recognition for Cross-Lingual NLU" describes the dataset in more detail. At the time of writing, the dataset was still under review by LDC. Please contact saabm@amazon.com to obtain a copy.
Once you have the data, place it in a folder called MultiATISpp-RAW/.
At the time of my research, there were a small number of English alignment problems with the Japanese, Hindi, and Turkish data. For this reason, Japanese was excluded, and the Hindi and Turkish data from MultiATIS were used instead. To ensure a fair comparison with the work by Xu et al., extract the Hindi and Turkish validation sets from MultiATIS++ and remove those examples from the MultiATIS training sets. Place the two extracted validation files in another folder called hi_tr_devsets/.
The TSVs should have the following columns, and headers should be included in the files.
id
utterance
slot_labels
intent
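Before running the preprocessing scripts, it can help to confirm that each MultiATIS++ file really starts with the header listed above. The following is a minimal sketch, not part of this repository: the function name `check_header` and the sample row are hypothetical and serve only as illustration.

```python
import csv
import io

# Column names taken from the list above.
EXPECTED_COLUMNS = ["id", "utterance", "slot_labels", "intent"]

def check_header(tsv_file):
    """Return True if the first row of a tab-separated file matches EXPECTED_COLUMNS."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)  # None when the file is empty
    return header == EXPECTED_COLUMNS

# Hypothetical two-line file used only for illustration.
sample = io.StringIO(
    "id\tutterance\tslot_labels\tintent\n"
    "1\tshow flights to boston\tO O O B-toloc.city_name\tatis_flight\n"
)
print(check_header(sample))
```

For a real file, pass an open handle instead, for example `open("MultiATISpp-RAW/train_EN.tsv", encoding="utf-8", newline="")`.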
MultiATIS
MultiATIS is available from LDC. Please place the TSVs for Hindi and Turkish in a folder called `MultiATIS-RAW/`. The TSVs should have the following columns, with no header row.
English utterance
English annotations
machine translation back to English
intent
non-English utterance
non-English annotations
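The removal of validation examples from the MultiATIS training data, described in the MultiATIS++ section, might look roughly like the sketch below. It rests on several assumptions that the source does not confirm: that the files in hi_tr_devsets/ use the MultiATIS++ layout with a header row, that an exact string match on the utterance identifies the same example, and that the helper names are free to choose (they are hypothetical). The actual preprocessing scripts may handle this differently.

```python
import csv
import io

# Zero-based position of the non-English utterance in a headerless MultiATIS row,
# following the column order listed above.
NON_ENGLISH_UTTERANCE = 4

def read_dev_utterances(dev_file):
    """Collect the 'utterance' column from a dev TSV that has a header row."""
    reader = csv.DictReader(dev_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["utterance"].strip() for row in reader}

def drop_dev_rows(train_file, dev_utterances):
    """Keep only training rows whose non-English utterance is not in the dev set."""
    reader = csv.reader(train_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    kept = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[NON_ENGLISH_UTTERANCE].strip() not in dev_utterances:
            kept.append(row)
    return kept

# Placeholder data for illustration only (assumed layouts, made-up values).
dev = io.StringIO("id\tutterance\tslot_labels\tintent\n1\tutt_b\tO\tintent_x\n")
train = io.StringIO(
    "en_a\ten_ann_a\tmt_a\tintent_x\tutt_a\tann_a\n"
    "en_b\ten_ann_b\tmt_b\tintent_x\tutt_b\tann_b\n"
)
kept = drop_dev_rows(train, read_dev_utterances(dev))
print(len(kept))
```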
Directory tree for raw data
MultiATISpp-RAW/
|- dev_DE.tsv
|- dev_ES.tsv
|- dev_ZH.tsv
|- test_EN.tsv
|- test_FR.tsv
|- train_DE.tsv
|- train_ES.tsv
|- train_ZH.tsv
|- dev_EN.tsv
|- dev_FR.tsv
|- test_DE.tsv
|- test_ES.tsv
|- test_ZH.tsv
|- train_EN.tsv
|- train_FR.tsv
MultiATIS-RAW/
|- Hindi-test.tsv
|- Hindi-train_1600.tsv
|- Turkish-test.tsv
|- Turkish-train_638.tsv
hi_tr_devsets/
|- dev_HI.tsv
|- dev_TR.tsv
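A quick way to confirm the layout above before preprocessing is to list any expected file that is absent. This is a small sketch under the assumption that all three folders share one parent directory; `missing_files` is a hypothetical helper, not something shipped with this repository.

```python
from pathlib import Path

# File names copied from the directory tree above.
MULTIATISPP_LANGS = ["DE", "EN", "ES", "FR", "ZH"]
EXPECTED = (
    [f"MultiATISpp-RAW/{split}_{lang}.tsv"
     for split in ("train", "dev", "test")
     for lang in MULTIATISPP_LANGS]
    + ["MultiATIS-RAW/Hindi-test.tsv",
       "MultiATIS-RAW/Hindi-train_1600.tsv",
       "MultiATIS-RAW/Turkish-test.tsv",
       "MultiATIS-RAW/Turkish-train_638.tsv"]
    + ["hi_tr_devsets/dev_HI.tsv", "hi_tr_devsets/dev_TR.tsv"]
)

def missing_files(root="."):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_file()]

if __name__ == "__main__":
    for name in missing_files("."):
        print("missing:", name)
```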
Create the STIL dataset
To create the STIL dataset, run the preprocess_atis_stil.py script on the data described above. Example:
python path/to/preprocess_atis_stil.py MultiATISpp-RAW/ MultiATIS-RAW/ hi_tr_devsets/ MultiATISpp-FLAT/
Create the traditional NLU dataset
To create the traditional NLU dataset (no translation of the slots), run the preprocess_atis_traditional.py script on the data described above. Example:
python path/to/preprocess_atis_traditional.py MultiATISpp-RAW/ MultiATIS-RAW/ hi_tr_devsets/ MultiATISpp-FLAT/
Tokenize the dataset
The mBART model uses sentencepiece tokenization. More information can be found in the sentencepiece repo. The following commands can be used to build sentencepiece.
```
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

pip install sentencepiece
```
Once sentencepiece has been built, tokenize the datasets:
```
SPM=path/to/sentencepiece/build/src/spm_encode
MODEL=path/to/mbart.cc25/sentence.bpe.model
DATA_PATH=path/to/MultiATISpp-FLAT

for SPLIT in train dev test; do
  for INOUT in input output; do
    $SPM --model=$MODEL < ${DATA_PATH}/${SPLIT}.${INOUT} > ${DATA_PATH}/${SPLIT}.spm.${INOUT}
  done
done
```
Binarize the data
The model requires binarized data.
The gcc in my instances of SageMaker was too old. Upgrade first if needed:
conda install -c psi4 gcc-5
Install fairseq as editable:
git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .
Run binarization:
```
FAIRSEQ_PATH=path/to/fairseq
DATA_PATH=path/to/tokenized_data
DICT_PATH=path/to/mbart.cc25

python ${FAIRSEQ_PATH}/preprocess.py \
  --source-lang input \
  --target-lang output \
  --trainpref ${DATA_PATH}/train.spm \
  --validpref ${DATA_PATH}/dev.spm \
  --testpref ${DATA_PATH}/test.spm \
  --srcdict $DICT_PATH/dict.txt \
  --tgtdict $DICT_PATH/dict.txt \
  --workers 8 \
  --destdir MultiATISpp-BIN
```
Train the mBART model
I used a p3.16xlarge instance for training, which has 8 Nvidia V100 GPUs. With a max sentences value of 2 per GPU and an update frequency of 2, the effective batch size is 8 x 2 x 2 = 32.
```
PRETRAINED_BART=path/to/mbart.cc25
DATA_PATH=path/to/data-bin
FAIRSEQ_PATH=path/to/fairseq
CHECKPOINT_PATH=path/to/checkpoints
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

python ${FAIRSEQ_PATH}/train.py ${DATA_PATH} \
  --num-workers 32 \
  --encoder-normalize-before --decoder-normalize-before \
  --arch mbart_large \
  --task translation_from_pretrained_bart \
  --source-lang input --target-lang output \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --dataset-impl mmap \
  --optimizer adam --adam-eps 1e-08 --adam-betas '(0.9, 0.999)' \
  --lr-scheduler polynomial_decay --lr 3e-05 --min-lr -1 \
  --warmup-updates 936 --total-num-update 20000 \
  --dropout 0.2 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-sentences 2 --update-freq 2 \
  --save-interval 1 --max-epoch 40 \
  --save-dir ${CHECKPOINT_PATH} \
  --validate-interval 1 \
  --seed 222 \
  --log-format json --log-interval 60 \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --restore-file ${PRETRAINED_BART}/model.pt \
  --langs $langs \
  --layernorm-embedding \
  --ddp-backend no_c10d \
  --memory-efficient-fp16 |& tee train_history.log
```
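The batch size of 32 quoted above is the product of the GPU count, the sentences per GPU, and the number of gradient-accumulation steps. Here is a minimal sketch of that arithmetic, assuming every per-GPU batch is full; the function name is illustrative and not part of fairseq.

```python
def effective_batch_size(num_gpus, max_sentences, update_freq):
    """Upper bound on sentences contributing to one optimizer step."""
    return num_gpus * max_sentences * update_freq

# Values from the training command: 8 GPUs, --max-sentences 2, --update-freq 2.
print(effective_batch_size(num_gpus=8, max_sentences=2, update_freq=2))  # → 32
```

On a machine with fewer GPUs, raising the update frequency in proportion keeps the same product.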
Watch GPU utilization:
nvidia-smi -l
Parse training and validation losses as tables from the log file:
python path/to/parse_fairseq_train_logs.py train_history.log prefix_for_output_file
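For readers who want to inspect the log without the script, the general idea can be sketched as follows. This is not the code in parse_fairseq_train_logs.py. It assumes that each relevant log line ends with a JSON object, and the key names `epoch`, `train_loss`, and `valid_loss` as well as the sample lines are assumptions that may not match the fairseq version used.

```python
import json

def extract_records(lines):
    """Parse the JSON object found on each log line, skipping lines without one."""
    records = []
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # text after the brace was not valid JSON
        if isinstance(record, dict):
            records.append(record)
    return records

def loss_table(records, key):
    """Return (epoch, loss) pairs for every record that carries `key`."""
    return [(int(r["epoch"]), float(r[key]))
            for r in records if key in r and "epoch" in r]

# Hypothetical log lines; real key names depend on the fairseq version.
SAMPLE_LOG = [
    '2020-01-01 00:00:00 | INFO | train | {"epoch": 1, "train_loss": "4.2"}',
    '2020-01-01 00:05:00 | INFO | valid | {"epoch": 1, "valid_loss": "3.9"}',
    "some line without JSON",
]
records = extract_records(SAMPLE_LOG)
print(loss_table(records, "valid_loss"))
```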
Run inference on the test data
This command will use 8 shards of data. Be sure to pick the right model checkpoint based on validation curves, etc.
for SHARD_ID in {0..7}; do (CUDA_VISIBLE_DEVICES=$SHARD_ID python $FAIRSEQ_PATH/generate.py data-bin/ --path $CHECKPOINT_PATH/checkpoint19.pt --task translation_from_pretrained_bart --gen-subset test -t output -s input --sacrebleu --remove-bpe 'sentencepiece' --langs $langs --memory-efficient-fp16 --max-sentences 64 --num-workers 4 --num-shards 8 --shard-id $SHARD_ID |& tee hyps_test19_${SHARD_ID}.log &); done
Combine the data from the 8 shards into one file:
```
for file in hyps_test19_*; do cat $file >> hyps_test_epoch19.log; done

rm hyps_test19_*
```
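Once the shard logs are combined, the hypotheses can be pulled out and put back into dataset order. The sketch below assumes the usual fairseq generate.py layout, in which hypothesis lines look like `H-<id><TAB><score><TAB><text>`; check this against your log, since the format may differ between fairseq versions. The function name and sample lines are hypothetical.

```python
def collect_hypotheses(lines):
    """Return hypothesis strings ordered by sample id, ignoring all other lines."""
    hyps = {}
    for line in lines:
        if not line.startswith("H-"):
            continue  # skip source (S-), target (T-), and logging lines
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # malformed hypothesis line
        tag, _score, text = parts
        hyps[int(tag[2:])] = text
    return [hyps[i] for i in sorted(hyps)]

# Made-up lines for illustration; shards interleave ids in arbitrary order.
SAMPLE_LINES = [
    "S-1\tsource one\n",
    "H-1\t-0.25\thypothesis one\n",
    "S-0\tsource zero\n",
    "H-0\t-0.10\thypothesis zero\n",
]
print(collect_hypotheses(SAMPLE_LINES))
```

Sorting by id matters here because the eight shards write their results in whatever order they finish.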
License
See the file entitled LICENSE
Note: This work is dependent on fairseq and sentencepiece, which were licensed under the MIT License and the Apache 2.0 license, respectively, at the time this work was conducted.
Citation
@inproceedings{fitzgerald2020mbartmultiatis,
title = {STIL - Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using mBART on MultiATIS++},
author = {Jack G. M. FitzGerald},
booktitle = {Proceedings of 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
year = {2020},
url = {https://arxiv.org/abs/2010.00760}
}
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science