Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary
Keywords
Repository
Seq2seq model structures implemented with TensorFlow 2
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
seq2seq
This repository implements seq2seq model structures with TensorFlow 2.
There are three model architectures: RNNSeq2Seq, RNNSeq2SeqWithAttention, and TransformerSeq2Seq.
It contains scripts for training, evaluation, inference, and conversion to SavedModel format.
The results of training and experimenting with this code can be found in "Tensorflow2 기반 Seq2Seq 모델, 학습, 서빙 코드 구현" (in Korean).
Train
Example
You can start training by running a script like the one below:
```sh
$ python -m scripts.train \
    --dataset-path "data/*.txt" \
    --batch-size 2048 --dev-batch-size 2048 \
    --epoch 90 --steps-per-epoch 250 --auto-encoding \
    --learning-rate 2e-4 \
    --device gpu \
    --tensorboard-update-freq 50 --model-name TransformerSeq2Seq --model-config-path resources/configs/transformer.yml
```
Arguments
```
File Paths:
  --model-name MODEL_NAME
                        Seq2seq model name
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --dataset-path DATASET_PATH
                        a text file or multiple files ex) *.txt
  --pretrained-model-path PRETRAINED_MODEL_PATH
                        pretrained model checkpoint
  --output-path OUTPUT_PATH
                        output directory to save log and model checkpoints
  --sp-model-path SP_MODEL_PATH

Training Parameters:
  --epochs EPOCHS
  --steps-per-epoch STEPS_PER_EPOCH
  --learning-rate LEARNING_RATE
  --min-learning-rate MIN_LEARNING_RATE
  --warm-up-rate WARM_UP_RATE
  --batch-size BATCH_SIZE
  --dev-batch-size DEV_BATCH_SIZE
  --num-dev-dataset NUM_DEV_DATASET
  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE
  --prefetch-buffer-size PREFETCH_BUFFER_SIZE
  --max-sequence-length MAX_SEQUENCE_LENGTH

Other settings:
  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ
                        log losses and metrics every this number of steps
  --disable-mixed-precision
                        disable FP16 mixed precision (enabled by default)
  --auto-encoding       train by auto encoding with text lines dataset
  --use-tfrecord        train using tfrecord dataset
  --debug-nan-loss      training with this flag prints the number of NaN losses
                        (not supported on TPU)
  --device {CPU,GPU,TPU}
                        device to train model
  --max-over-sequence-policy {filter,slice}
                        policy for sequences whose length is over the max
```
- `model-name` is the seq2seq model class name, so it is one of RNNSeq2Seq, RNNSeq2SeqWithAttention, or TransformerSeq2Seq.
- `model-config-path` is the model config file path. The config file describes the model parameters. There are default model configs in `resources/configs`.
- `dataset-path` is a dataset file glob expression. The dataset file format is a TSV file without a header, having two columns: sequence A and sequence B. The model is trained to predict sequence B from input sequence A. However, with the `auto-encoding` option, the dataset format is just lines of text, so the model is trained to echo the texts.
- `sp-model-path` is the sentencepiece model path used to tokenize text.
- `disable-mixed-precision` disables FP16 mixed precision. Mixed precision is on by default.
- `device` is the training device, one of (cpu, gpu, tpu), but TPU is not supported yet. If you want to train on TPU, the `use-tfrecord` option is necessary. TFRecord files can be made with the `scripts/make_tfrecord.py` script; a rough sketch of the idea follows at the end of this section.
When training ends, the model checkpoints and TensorBoard log files are saved to the output directory.
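For illustration only, here is a minimal sketch of converting a text-lines dataset to TFRecord. The feature key `text` and the serialization scheme are assumptions; the repository's `scripts/make_tfrecord.py` may use a different format.

```python
import tensorflow as tf

def write_tfrecord(input_path: str, output_path: str) -> None:
    """Serialize each non-empty line of a text file as a tf.train.Example."""
    with tf.io.TFRecordWriter(output_path) as writer, open(input_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # NOTE: the feature key "text" is hypothetical; the repo's
            # make_tfrecord.py may use a different schema.
            example = tf.train.Example(features=tf.train.Features(feature={
                "text": tf.train.Feature(bytes_list=tf.train.BytesList(value=[line.encode("utf-8")]))
            }))
            writer.write(example.SerializeToString())

write_tfrecord("data/train.txt", "data/train.tfrecord")
```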
Evaluate
Example
You can start evaluation by running a script like the one below:
```sh
$ python -m scripts.evaluate \
    --model-path ~/Downloads/output/models/model-50epoch-nanloss_0.3870acc.ckpt \
    --model-config-path ~/Downloads/output/model_config.yml \
    --dataset-path test.txt \
    --auto-encoding \
    --beam-size 2 \
    --disable-mixed-precision
[skip some messy logs...]
[2020-12-20 01:42:28,308] RNN implementation=2 is not supported when recurrent_dropout is set. Using implementation=1.
DEBUG:tensorflow:RNN implementation=2 is not supported when recurrent_dropout is set. Using implementation=1.
[2020-12-20 01:42:28,311] RNN implementation=2 is not supported when recurrent_dropout is set. Using implementation=1.
DEBUG:tensorflow:RNN implementation=2 is not supported when recurrent_dropout is set. Using implementation=1.
[2020-12-20 01:42:28,315] RNN implementation=2 is not supported when recurrent_dropout is set. Using implementation=1.
[2020-12-20 01:42:30,963] Loaded weights of model
Perplexity: 17.618855794270832, BLEU: 0.07615733809469007: : 1it [00:04, 4.38s/it]
[2020-12-20 01:42:35,347] Finished evalaution!
[2020-12-20 01:42:35,348] Perplexity: 17.618855794270832, BLEU: 0.07615733809469007
```
The results are perplexity (PPL) and BLEU.
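For intuition about these metrics (a hedged sketch, not the repository's implementation): perplexity is the exponential of the mean per-token negative log-likelihood, and BLEU measures n-gram overlap between reference and hypothesis, here computed with nltk (a dependency of this project). The `token_nlls` values and token lists below are hypothetical.

```python
import math
from nltk.translate.bleu_score import sentence_bleu

# Perplexity: exp of the mean per-token negative log-likelihood.
token_nlls = [2.1, 3.4, 2.8, 3.0]  # hypothetical per-token NLLs from a model
perplexity = math.exp(sum(token_nlls) / len(token_nlls))

# BLEU between a tokenized reference and a model hypothesis.
reference = ["나는", "오늘", "밥을", "먹었다"]
hypothesis = ["나는", "오늘", "밥을", "먹었다"]
bleu = sentence_bleu([reference], hypothesis)

print(f"Perplexity: {perplexity:.4f}, BLEU: {bleu:.4f}")
```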
Arguments
```
File Paths:
  --model-name MODEL_NAME
                        Seq2seq model name
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --dataset-path DATASET_PATH
                        a tsv file or multiple files ex) *.tsv
  --model-path MODEL_PATH
                        pretrained model checkpoint
  --sp-model-path SP_MODEL_PATH

Inference Parameters:
  --batch-size BATCH_SIZE
  --prefetch-buffer-size PREFETCH_BUFFER_SIZE
  --max-sequence-length MAX_SEQUENCE_LENGTH
  --header              use this flag if dataset (tsv file) has header
  --beam-size BEAM_SIZE
                        not given, use greedy search else beam search with
                        this value as beam size

Other settings:
  --disable-mixed-precision
                        disable FP16 mixed precision (enabled by default)
  --auto-encoding       evaluate autoencoding performance; dataset format is
                        lines of texts (.txt)
  --device DEVICE       device to train model
```
- Most arguments are the same as in the training script.
- `beam-size` is the beam search parameter. When it is less than two or not given, greedy search is used.
Inference
Example
You can start inference by running a script like the one below:
```sh
$ python -m scripts.inference \
    --dataset-path test.txt \
    --model-path ~/Downloads/output/models/model-50epoch-nanloss_0.3870acc.ckpt \
    --output-path out.txt \
    --save-pair
[skip some messy logs...]
[2020-12-20 01:52:27,856] Loaded weights of model
[2020-12-20 01:52:27,857] Start Inference
[2020-12-20 01:52:35,629] Ended Inference, Start to save...
[2020-12-20 01:52:35,631] Saved (original sentence,decoded sentence) pairs to out.txt
```
Arguments
```
File Paths:
  --model-name MODEL_NAME
                        Seq2seq model name
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --dataset-path DATASET_PATH
                        a text file or multiple files ex) *.txt
  --model-path MODEL_PATH
                        pretrained model checkpoint
  --output-path OUTPUT_PATH
                        output file path to save generated sentences
  --sp-model-path SP_MODEL_PATH

Inference Parameters:
  --batch-size BATCH_SIZE
  --prefetch-buffer-size PREFETCH_BUFFER_SIZE
  --max-sequence-length MAX_SEQUENCE_LENGTH
  --beam-size BEAM_SIZE
                        not given, use greedy search else beam search with
                        this value as beam size

Other settings:
  --disable-mixed-precision
                        disable FP16 mixed precision (enabled by default)
  --save-pair           save result as the pairs of original and decoded
                        sentences
  --device DEVICE       device to train model
```
- With the `save-pair` option, results are saved as (original sentence, generated sentence) pairs in TSV format. Otherwise, only the generated sentences are saved.
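As a small hedged example of consuming that output (the two-column, header-less layout is an assumption based on the log message above), the saved pairs can be read back like this:

```python
import csv

# Read (original, decoded) sentence pairs saved by --save-pair.
# Assumes two tab-separated columns with no header row.
with open("out.txt", encoding="utf-8") as f:
    for original, decoded in csv.reader(f, delimiter="\t"):
        print(f"{original} -> {decoded}")
```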
Interactive
Example
You can test your trained model interactively. To finish, just press enter on an empty input.
```sh
$ python -m scripts.interactive \
    --model-name TransformerSeq2Seq \
    --model-path ~/model-28epoch-0.1396loss_0.9812acc.ckpt \
    --model-config-path resources/configs/transformer.yml
[2021-04-30 00:27:00,037] Loaded weights of model
Please Input Text: 너 이름이 뭐야?
2021-04-30 00:27:23.437004: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-04-30 00:27:23.705647: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
Output: 너 이름이 뭐야?, Perplexity: 1.0628
Please Input Text: 근데 어쩌라는 걸까
Output: 근데 어쩌라는 걸까, Perplexity: 1.0594
Please Input Text: 헤헤헤헤
Output: 헤헤헤헤, Perplexity: 1.4151
Please Input Text:
```
Arguments
```
File Paths:
  --model-name MODEL_NAME
                        Seq2seq model name
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --model-path MODEL_PATH
                        pretrained model checkpoint
  --sp-model-path SP_MODEL_PATH

Inference Parameters:
  --batch-size BATCH_SIZE
  --prefetch-buffer-size PREFETCH_BUFFER_SIZE
  --max-sequence-length MAX_SEQUENCE_LENGTH
  --pad-id PAD_ID       Pad token id when tokenize with sentencepiece
  --beam-size BEAM_SIZE
                        not given, use greedy search else beam search with
                        this value as beam size

Other settings:
  --mixed-precision     Use mixed precision FP16
  --device DEVICE       device to train model
```
Convert to savedmodel
Example
You can simply convert a model checkpoint to SavedModel format.
```sh
$ python -m scripts.convert_to_savedmodel \
    --model-name RNNSeq2SeqWithAttention \
    --model-config-path ~/Downloads/output/model_config.yml \
    --model-weight-path ~/Downloads/output/models/model-50epoch-nanloss_0.3870acc.ckpt \
    --output-path seq2seq-model/1
[skip some messy logs...]
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: seq2seq-model/1/assets
[2020-12-20 01:58:49,424] Assets written to: seq2seq-model/1/assets
[2020-12-20 01:58:51,285] Saved model to seq2seq-model/1
```
If you make a SavedModel with this script, tokenization is included in the SavedModel, so you can generate sequences without a separate tokenizer or vocab.
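Because tokenization is baked in, the SavedModel can also be called directly from Python with raw strings. A minimal hedged sketch; the signature input name and output keys are assumptions based on the serving examples below, so inspect the signature first:

```python
import tensorflow as tf

# Load the converted SavedModel; tokenization is embedded, so inputs
# are raw strings rather than token ids.
model = tf.saved_model.load("seq2seq-model/1")
serving_fn = model.signatures["serving_default"]

# The input keyword depends on the exported signature; inspect it first:
print(serving_fn.structured_input_signature)

# Hypothetical call, assuming the input tensor is named "texts":
outputs = serving_fn(texts=tf.constant(["안녕하세요", "나는 오늘 밥을 먹었다"]))
print(outputs["sentences"], outputs["perplexity"])
```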
Arguments
```
Arguments:
  --model-name MODEL_NAME
                        Seq2seq model name
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --model-weight-path MODEL_WEIGHT_PATH
                        Model weight file path saved in training
  --sp-model-path SP_MODEL_PATH
                        sp tokenizer model path
  --output-path OUTPUT_PATH
                        Savedmodel path

Search Method Configs:
  --pad-id PAD_ID       Pad token id when tokenize with sentencepiece
  --max-sequence-length MAX_SEQUENCE_LENGTH
                        Max number of tokens including bos, eos
  --alpha ALPHA         length penalty control variable when beam searching
  --beta BETA           length penalty control variable when beam searching
```
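The `--alpha` and `--beta` parameters are not documented further. As a hedged illustration only, a common length penalty is the GNMT formulation, where hypotheses are ranked by log-probability divided by a length penalty; in GNMT, `alpha` controls length normalization and `beta` a coverage penalty. Whether this repository uses the same formula is an assumption.

```python
def gnmt_length_penalty(length: int, alpha: float) -> float:
    """GNMT-style length penalty: lp(Y) = ((5 + |Y|) / 6) ** alpha.

    Larger alpha penalizes longer hypotheses less, counteracting
    beam search's bias toward short sequences.
    """
    return ((5 + length) / 6) ** alpha

def rescored(log_prob: float, length: int, alpha: float) -> float:
    # Beam hypotheses are compared by log_prob / lp(length).
    return log_prob / gnmt_length_penalty(length, alpha)

# Example: a longer hypothesis can win after length normalization.
print(rescored(-6.0, 10, alpha=0.8))  # longer, lower raw log-prob: ~-2.88
print(rescored(-4.5, 5, alpha=0.8))   # shorter, higher raw log-prob: ~-2.99
```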
Use of savedmodel
```sh
$ docker run -v `pwd`/seq2seq-model:/models/seq2seq -e MODEL_NAME=seq2seq -p 8501:8501 -dt tensorflow/serving
```
This starts a TensorFlow Serving server.
```sh
$ curl -XPOST localhost:8501/v1/models/seq2seq:predict -d '{"inputs":["안녕하세요", "나는 오늘 밥을 먹었다", "아니 지금 뭐라고요?, 그게 대체 무슨 말이에요!!"]}'
{
    "outputs": {
        "perplexity": [
            1.00468457,
            1.06678605,
            1.04327798
        ],
        "sentences": [
            "안녕하세요",
            "나는 오늘 밥을 먹었다",
            "아니 지금 뭐라고요?, 그게 대체 무슨 말이에요!!"
        ]
    }
}
```
- By default, the signature function is greedy search. As in the example above, you can send texts and receive perplexity and generated texts.
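Equivalently, as a hedged sketch, the same REST call can be made from Python with only the standard library, assuming the serving container above is running on localhost:8501:

```python
import json
import urllib.request

# Same request as the curl example above, against TF Serving's REST API.
payload = {"inputs": ["안녕하세요", "나는 오늘 밥을 먹었다"]}
req = urllib.request.Request(
    "http://localhost:8501/v1/models/seq2seq:predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    outputs = json.load(resp)["outputs"]
print(outputs["sentences"], outputs["perplexity"])
```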
```sh
$ curl -XPOST localhost:8501/v1/models/seq2seq:predict -d '{"inputs":{"texts":["반갑습니다", "학교가기 싫다"], "beam_size":3}, "signature_name":"beam_search"}'
{
    "outputs": {
        "sentences": [
            [
                "반갑습니다",
                "반갑습니다",
                "반갑습니다"
            ],
            [
                "학교가기 싫다",
                "학교가기놔",
                "학교가기 챙겨"
            ]
        ],
        "perplexity": [
            [
                1.0299294,
                1.0299294,
                1.0299294
            ],
            [
                1.22807097,
                1.54527545,
                1.56684875
            ]
        ]
    }
}
```
- If you want to run inference with beam search, set `signature_name` to `beam_search` and include `beam_size` in the request.
- The response then contains `beam_size` generated texts per example.
References
- Sequence to Sequence Learning with Neural Networks, Ilya Sutskever, Oriol Vinyals, Quoc V. Le
- Neural Machine Translation by Jointly Learning to Align and Translate, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
- Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Owner
- Name: ParkSangJun
- Login: cosmoquester
- Kind: user
- Location: Seoul, Korea
- Company: @scatterlab @pingpong-ai
- Website: https://cosmoquester.github.io
- Repositories: 12
- Profile: https://github.com/cosmoquester
Machine Learning Engineer @scatterlab Korea. Thank you.
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
type: generic
message: "If you use this code, please cite this as below."
authors:
  - family-names: "Park"
    given-names: "Sangjun"
    orcid: "https://orcid.org/0000-0002-1838-9259"
title: "seq2seq"
version: 0.0.1
date-released: 2022-11-16
url: "https://github.com/cosmoquester/seq2seq"
```
Dependencies
- black * development
- codecov * development
- isort * development
- pytest * development
- pytest-cov * development
- nltk *
- pyyaml *
- tensorboard_plugin_profile *
- tensorflow >=2, <2.5
- tensorflow-serving-api *
- tensorflow-text *
- tqdm *