transformers-bart-pretrain

Script to pre-train Hugging Face transformers BART with TensorFlow 2

https://github.com/cosmoquester/transformers-bart-pretrain

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.4%) to scientific vocabulary

Keywords

bart gpu huggingface-transformers pretraining tensorflow tpu
Last synced: 6 months ago

Repository

Script to pre-train Hugging Face transformers BART with TensorFlow 2

Basic Info
  • Host: GitHub
  • Owner: cosmoquester
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 1.43 MB
Statistics
  • Stars: 33
  • Watchers: 1
  • Forks: 6
  • Open Issues: 1
  • Releases: 0
Topics
bart gpu huggingface-transformers pretraining tensorflow tpu
Created over 4 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

transformers TF BART pre-training

Badges: Code style: black · Imports: isort · CI (cosmoquester) · codecov

Train

You can train a Hugging Face transformers model simply, as in the example below. (The example works as-is, without changes, using the included sample data.)

```sh
$ CUDA_VISIBLE_DEVICES=1 python -m scripts.train \
    --model-config-path configs/base.json \
    --train-dataset-path tests/data/sample1.txt \
    --dev-dataset-path tests/data/sample1.txt \
    --sp-model-path sp_model/sp_model_unigram_8K.model \
    --device GPU \
    --auto-encoding \
    --batch-size 2 \
    --steps-per-epoch 100 \
    --mask-token "[MASK]" \
    --mixed-precision
```

Arguments

```sh
File Paths:
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --train-dataset-path TRAIN_DATASET_PATH
                        training dataset, a text file or multiple files ex) *.txt
  --dev-dataset-path DEV_DATASET_PATH
                        dev dataset, a text file or multiple files ex) *.txt
  --pretrained-checkpoint PRETRAINED_CHECKPOINT
                        pretrained checkpoint path
  --output-path OUTPUT_PATH
                        output directory to save log and model checkpoints
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path to tokenizer

Training Parameters:
  --mask-token MASK_TOKEN
                        mask token ex) [MASK]
  --mask-token-id MASK_TOKEN_ID
                        mask token id of vocab
  --epochs EPOCHS
  --steps-per-epoch STEPS_PER_EPOCH
  --learning-rate LEARNING_RATE
  --min-learning-rate MIN_LEARNING_RATE
  --warmup-steps WARMUP_STEPS
  --warmup-rate WARMUP_RATE
  --batch-size BATCH_SIZE
                        total training batch size of all devices
  --dev-batch-size DEV_BATCH_SIZE
  --num-total-dataset NUM_TOTAL_DATASET
  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE
  --prefetch-buffer-size PREFETCH_BUFFER_SIZE
  --max-sequence-length MAX_SEQUENCE_LENGTH
  --weight-decay WEIGHT_DECAY
                        use weight decay
  --clipnorm CLIPNORM   clip gradients to a maximum norm
  --disable-text-infilling
                        disable input noising
  --disable-sentence-permutation
                        disable input noising
  --masking-rate MASKING_RATE
                        text infilling masking rate
  --permutation-segment-token-id PERMUTATION_SEGMENT_TOKEN_ID
                        segment token id for sentence permutation

Other settings:
  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ
                        log losses and metrics after every this many steps
  --mixed-precision     use mixed precision FP16
  --auto-encoding       train by auto encoding with a text-lines dataset
  --use-tfrecord        train using a tfrecord dataset
  --repeat-each-file    repeat each dataset file and sample uniformly for training examples
  --debug-nan-loss      with this flag, training prints the number of NaN losses (not supported on TPU)
  --seed SEED           random seed
  --skip-epochs SKIP_EPOCHS
                        skip this number of epochs
  --device {CPU,GPU,TPU}
                        device to train model
  --max-over-sequence-policy {filter,slice}
                        policy for sequences whose length is over the max
```

- `--model-config-path` is the Hugging Face BART model config file path.
- `--pretrained-checkpoint` is a trained model checkpoint path.
- `--sp-model-path` is the SentencePiece tokenizer model path.
- With the `--repeat-each-file` flag, each dataset file is repeated indefinitely, so training continues even if one of the dataset files runs out.
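For instance, combining several of the flags above, resuming pre-training from a checkpoint on multiple text files might look like the sketch below. The checkpoint, dataset, and output paths are placeholders for illustration, not files shipped with the repository; only flags documented above are used.

```sh
# Hypothetical example: checkpoint, data, and output paths are placeholders.
$ python -m scripts.train \
    --model-config-path configs/base.json \
    --pretrained-checkpoint /path/to/pretrained-checkpoint \
    --train-dataset-path "data/corpus-*.txt" \
    --dev-dataset-path data/dev.txt \
    --sp-model-path sp_model/sp_model_unigram_8K.model \
    --output-path outputs/bart-pretrain \
    --device GPU \
    --auto-encoding \
    --repeat-each-file \
    --mask-token "[MASK]"
```

Here the glob `data/corpus-*.txt` relies on the documented support for multiple dataset files, and `--repeat-each-file` keeps training going even if one of the files runs out.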

Owner

  • Name: ParkSangJun
  • Login: cosmoquester
  • Kind: user
  • Location: Seoul, Korea
  • Company: @scatterlab @pingpong-ai

Machine Learning Engineer @scatterlab Korea. Thank you.

Citation (CITATION.cff)

cff-version: 1.2.0
type: generic
message: "If you use this code, please cite this as below."
authors:
- family-names: "Park"
  given-names: "Sangjun"
  orcid: "https://orcid.org/0000-0002-1838-9259"
title: "transformers-bart-pretrain"
version: 1.0.0
date-released: 2022-11-02
url: "https://github.com/cosmoquester/transformers-bart-pretrain"

Dependencies

requirements-dev.txt pypi
  • black * development
  • codecov * development
  • isort * development
  • pytest * development
  • pytest-cov * development
requirements.txt pypi
  • tensorflow >=2
  • tensorflow-text *
  • transformers *
setup.py pypi
  • tensorflow >=2
pyproject.toml pypi
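
Based on the manifests above, a minimal environment setup might look like the following sketch. Package names are taken directly from requirements.txt and requirements-dev.txt; no exact versions are pinned there.

```sh
# Runtime dependencies (requirements.txt)
$ pip install "tensorflow>=2" tensorflow-text transformers

# Development dependencies (requirements-dev.txt)
$ pip install black codecov isort pytest pytest-cov
```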