https://github.com/amazon-science/stil-mbart-multiatis


Repository

Basic Info

Host: GitHub
Owner: amazon-science
License: other
Language: Python
Default Branch: main
Size: 13.7 KB

Statistics

Stars: 4
Watchers: 1
Forks: 1
Open Issues: 1
Releases: 0

Created over 5 years ago · Last pushed over 5 years ago


STIL - Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using mBART on MultiATIS++

A paper presented at the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL) 2020

Jack FitzGerald

Copyright Amazon.com Inc. or its affiliates

This repository contains some of the code used in the paper named above. The purpose of this repo is only to allow other researchers to reproduce the study and results presented in the paper. Jack and Amazon likely will not improve this repo over time.

Setup

I used AWS SageMaker for this work, with an m5.xlarge instance for data preparation and analysis and a p3.16xlarge instance for training.

Download the pretrained model

This study uses the pretrained mBART CC25 model. It can be downloaded from fairseq here.

Create the datasets

The dataset is based on the MultiATIS++ and MultiATIS datasets.

MultiATIS++

The 2020 paper by Xu et al. entitled "End-to-End Slot Alignment and Recognition for Cross-Lingual NLU" describes the dataset in more detail. As of writing, the dataset was still under review by LDC. Please contact saabm@amazon.com to obtain a copy.

Once you have the data, place it in a folder called MultiATISpp-RAW/.

At the time of my research, there were a small number of English alignment problems with the Japanese, Hindi, and Turkish data. For this reason, Japanese was excluded, and the Hindi and Turkish data from MultiATIS were used. To ensure a fair comparison with the work by Xu et al., we need to extract the validation set from MultiATIS++ and remove those examples from the MultiATIS training set. Another folder must be created called hi_tr_devsets containing those two files.
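The removal of validation examples from the MultiATIS training set could be sketched as follows. This is only an illustration, not the repository's preprocessing code: the function name `remove_dev_examples` and the choice of matching key are hypothetical, and the source does not say how examples were matched between the two datasets.

```python
def remove_dev_examples(train_rows, dev_rows, key):
    """Drop every training row whose key also occurs in the dev set.

    `key` is a caller-supplied function mapping a row to a hashable value.
    Which field identifies "the same example" is an assumption left to the caller.
    """
    dev_keys = {key(row) for row in dev_rows}
    return [row for row in train_rows if key(row) not in dev_keys]


# Placeholder rows, not real dataset content.
train = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
dev = [{"text": "b"}]
print(remove_dev_examples(train, dev, key=lambda row: row["text"]))
```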

The TSVs should have the following columns, and headers should be included in the files.

id utterance slot_labels intent
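A quick header check before preprocessing can catch a mis-formatted file early. The sketch below assumes tab-separated files with exactly the four column names listed above; the function name `check_multiatispp_header` is hypothetical and not part of this repository.

```python
import csv
import io

# Column names as listed in this README.
EXPECTED_COLUMNS = ["id", "utterance", "slot_labels", "intent"]


def check_multiatispp_header(handle):
    """Return True if the first row of a TSV stream matches the expected header."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    return header == EXPECTED_COLUMNS


# Placeholder content, not real dataset rows.
sample = io.StringIO("id\tutterance\tslot_labels\tintent\n0\tplaceholder text\tO O\tplaceholder_intent\n")
print(check_multiatispp_header(sample))
```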

MultiATIS

MultiATIS is available from LDC here. Please place the TSVs for Hindi and Turkish in a folder called MultiATIS-RAW/. The TSVs should have the following columns, and headers should be excluded.

English utterance, English annotations, machine translation back to English, intent, non-English utterance, non-English annotations
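Because these files carry no header row, column names have to be supplied when reading them. A minimal sketch, assuming six tab-separated columns in the order listed above; the short field names and the function `read_multiatis_rows` are hypothetical labels chosen for illustration.

```python
import csv
import io

# Hypothetical short names for the six columns, in the order given in this README.
MULTIATIS_FIELDS = [
    "en_utterance",
    "en_annotations",
    "back_translation",
    "intent",
    "utterance",
    "annotations",
]


def read_multiatis_rows(handle):
    """Read a headerless MultiATIS-style TSV stream into a list of dicts."""
    reader = csv.DictReader(
        handle,
        fieldnames=MULTIATIS_FIELDS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return list(reader)


# Placeholder row, not real dataset content.
sample = io.StringIO("en text\ten tags\tback text\tsome_intent\tforeign text\tforeign tags\n")
rows = read_multiatis_rows(sample)
print(rows[0]["intent"])
```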

Directory tree for raw data

Create the STIL dataset

To create the STIL dataset, run the preprocess_atis_stil.py script on the data described above. Example:

python path/to/preprocess_atis_stil.py MultiATISpp-RAW/ MultiATIS-RAW/ hi_tr_devsets/ MultiATISpp-FLAT/

Create the traditional NLU dataset

To create the traditional NLU dataset (no translation of the slots), run the preprocess_atis_traditional.py script on the data described above. Example:

python path/to/preprocess_atis_traditional.py MultiATISpp-RAW/ MultiATIS-RAW/ hi_tr_devsets/ MultiATISpp-FLAT/

Tokenize the dataset

The mBART model uses sentencepiece tokenization. More information can be found in the sentencepiece repo. The following commands can be used to build sentencepiece.

```
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

pip install sentencepiece
```

Once sentencepiece has been built, tokenize the datasets:

```
SPM=path/to/sentencepiece/build/src/spm_encode
MODEL=path/to/mbart.cc25/sentence.bpe.model
DATA_PATH=path/to/MultiATISpp-FLAT

# Tokenize the input and output side of every split
for SPLIT in train dev test; do
  for INOUT in input output; do
    $SPM --model=$MODEL < ${DATA_PATH}/${SPLIT}.${INOUT} > ${DATA_PATH}/${SPLIT}.spm.${INOUT}
  done
done
```
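The loop above reads six files and writes six tokenized counterparts. The sketch below lists those source and destination pairs and reports any missing source file, which can be useful before starting tokenization. The file naming follows the shell loop; the function names are hypothetical.

```python
from pathlib import Path

SPLITS = ("train", "dev", "test")
SIDES = ("input", "output")


def tokenization_pairs(data_path):
    """List (source, destination) paths in the same order as the shell loop."""
    root = Path(data_path)
    return [
        (root / f"{split}.{side}", root / f"{split}.spm.{side}")
        for split in SPLITS
        for side in SIDES
    ]


def missing_inputs(data_path):
    """Return the source files that do not exist yet."""
    return [src for src, _ in tokenization_pairs(data_path) if not src.exists()]


for src, dst in tokenization_pairs("path/to/MultiATISpp-FLAT"):
    print(src.name, "->", dst.name)
```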

Binarize the data

The model requires binarized data.

The gcc in my instances of SageMaker was too old. Upgrade first if needed:

conda install -c psi4 gcc-5

Install fairseq as editable:

git clone https://github.com/pytorch/fairseq.git
cd fairseq
pip install --editable .

Run binarization:

```
FAIRSEQ_PATH=path/to/fairseq
DATA_PATH=path/to/tokenized_data
DICT_PATH=path/to/mbart.cc25

python ${FAIRSEQ_PATH}/preprocess.py \
  --source-lang input \
  --target-lang output \
  --trainpref ${DATA_PATH}/train.spm \
  --validpref ${DATA_PATH}/dev.spm \
  --testpref ${DATA_PATH}/test.spm \
  --srcdict ${DICT_PATH}/dict.txt \
  --tgtdict ${DICT_PATH}/dict.txt \
  --workers 8 \
  --destdir MultiATISpp-BIN
```

Train the mBART model

I used a p3.16xlarge instance for training, which has 8 NVIDIA V100 GPUs. With --max-sentences 2 and --update-freq 2 on each of the 8 GPUs, the effective batch size is 32.
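The batch size arithmetic can be written out as a small helper, which makes it easy to adjust the flags for a machine with a different GPU count. The function name is hypothetical; the calculation assumes data-parallel training in which each GPU processes its own `--max-sentences` examples per step and gradients are accumulated over `--update-freq` steps.

```python
def effective_batch_size(max_sentences, update_freq, num_gpus):
    """Sentences contributing to one optimizer update under data-parallel training."""
    return max_sentences * update_freq * num_gpus


# Settings from this README: 2 sentences per GPU, 2 accumulation steps, 8 GPUs.
print(effective_batch_size(max_sentences=2, update_freq=2, num_gpus=8))
```

For example, keeping the same effective batch size on a single GPU would require an update frequency of 16 with 2 sentences per step.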

```
PRETRAINED_BART=path/to/mbart.cc25
DATA_PATH=path/to/data-bin
FAIRSEQ_PATH=path/to/fairseq
CHECKPOINT_PATH=path/to/checkpoints
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

python ${FAIRSEQ_PATH}/train.py ${DATA_PATH} \
  --num-workers 32 \
  --encoder-normalize-before --decoder-normalize-before \
  --arch mbart_large \
  --task translation_from_pretrained_bart \
  --source-lang input --target-lang output \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --dataset-impl mmap \
  --optimizer adam --adam-eps 1e-08 --adam-betas '(0.9, 0.999)' \
  --lr-scheduler polynomial_decay --lr 3e-05 --min-lr -1 \
  --warmup-updates 936 --total-num-update 20000 \
  --dropout 0.2 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-sentences 2 --update-freq 2 \
  --save-interval 1 --max-epoch 40 \
  --save-dir ${CHECKPOINT_PATH} \
  --validate-interval 1 \
  --seed 222 \
  --log-format json --log-interval 60 \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --restore-file ${PRETRAINED_BART}/model.pt \
  --langs $langs \
  --layernorm-embedding \
  --ddp-backend no_c10d \
  --memory-efficient-fp16 |& tee train_history.log
```
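To get a feel for the learning rate implied by the warmup and decay flags above, the schedule can be approximated in a few lines. This is a sketch of linear warmup followed by polynomial decay, not fairseq's own scheduler code; the end learning rate of 0 and the power of 1.0 are assumed values, since the command does not set them.

```python
def polynomial_decay_lr(
    step,
    peak_lr=3e-05,        # --lr
    warmup_updates=936,   # --warmup-updates
    total_updates=20000,  # --total-num-update
    end_lr=0.0,           # assumed, not set in the command
    power=1.0,            # assumed, not set in the command
):
    """Approximate learning rate at a given update step."""
    if warmup_updates > 0 and step <= warmup_updates:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_updates
    if step >= total_updates:
        return end_lr
    # Fraction of the decay phase that is still ahead.
    remaining = 1 - (step - warmup_updates) / (total_updates - warmup_updates)
    return (peak_lr - end_lr) * remaining ** power + end_lr


for step in (0, 468, 936, 10000, 20000):
    print(step, polynomial_decay_lr(step))
```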

Watch GPU utilization:

nvidia-smi -l

Parse training and validation losses as tables from the log file:

python path/to/parse_fairseq_train_logs.py train_history.log prefix_for_output_file
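For readers who want to inspect the log without the script, the idea can be sketched as follows. This is not the code of parse_fairseq_train_logs.py; it assumes only that each logged record is a JSON object somewhere on its line, and the key names used in the example (`train_loss`, `valid_loss`) are illustrative assumptions that may differ between fairseq versions.

```python
import json


def parse_json_log_lines(lines):
    """Collect every JSON object found on a log line, skipping non-JSON lines."""
    records = []
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            records.append(record)
    return records


def column(records, key):
    """Extract one numeric field from all records that contain it."""
    return [float(record[key]) for record in records if key in record]


# Synthetic lines for illustration only, not real training output.
example = [
    'some non-JSON message',
    'prefix | {"epoch": 1, "train_loss": "5.0"}',
    '{"epoch": 1, "valid_loss": "4.5"}',
]
records = parse_json_log_lines(example)
print(column(records, "train_loss"), column(records, "valid_loss"))
```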

Run inference on the test data

This command will use 8 shards of data. Be sure to pick the right model checkpoint based on validation curves, etc.

for SHARD_ID in {0..7}; do (CUDA_VISIBLE_DEVICES=$SHARD_ID python $FAIRSEQ_PATH/generate.py data-bin/ --path $CHECKPOINT_PATH/checkpoint19.pt --task translation_from_pretrained_bart --gen-subset test -t output -s input --sacrebleu --remove-bpe 'sentencepiece' --langs $langs --memory-efficient-fp16 --max-sentences 64 --num-workers 4 --num-shards 8 --shard-id $SHARD_ID |& tee hyps_test19_${SHARD_ID}.log &); done

Combine the data from the 8 shards into one file:

```
for file in hyps_test19_*; do cat $file >> hyps_test_epoch19.log; done

rm hyps_test19_*
```
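Because the shards are concatenated, the hypotheses in the combined file are not in test-set order. A minimal sketch for recovering them in order, assuming the usual fairseq generate layout in which hypothesis lines look like `H-<id><TAB><score><TAB><text>`; the function name is hypothetical.

```python
def collect_hypotheses(lines):
    """Return hypothesis strings sorted by their example id."""
    hyps = {}
    for line in lines:
        if not line.startswith("H-"):
            continue
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed lines
        tag, _score, text = parts
        hyps[int(tag[2:])] = text
    return [hyps[i] for i in sorted(hyps)]


# Synthetic lines for illustration only, not real model output.
example = [
    "S-1\tsource one\n",
    "H-1\t-0.2\thypothesis one\n",
    "H-0\t-0.1\thypothesis zero\n",
]
print(collect_hypotheses(example))
```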

License

See the file entitled LICENSE

Note: This work is dependent on fairseq and sentencepiece, which were licensed under the MIT License and the Apache 2.0 license, respectively, at the time this work was conducted.

Citation

@inproceedings{fitzgerald2020mbartmultiatis,
  title = {STIL - Simultaneous Slot Filling, Translation, Intent Classification, and Language Identification: Initial Results using mBART on MultiATIS++},
  author = {Jack G. M. FitzGerald},
  booktitle = {Proceedings of 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics},
  year = {2020},
  url = {https://arxiv.org/abs/2010.00760}
}

Owner

Name: Amazon Science
Login: amazon-science
Kind: organization

Website: https://amazon.science
Twitter: AmazonScience
Repositories: 80
Profile: https://github.com/amazon-science

