transformers-bart-pretrain
Script to pre-train huggingface transformers BART with TensorFlow 2
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.4%) to scientific vocabulary
Keywords
Repository
Script to pre-train huggingface transformers BART with TensorFlow 2
Basic Info
Statistics
- Stars: 33
- Watchers: 1
- Forks: 6
- Open Issues: 1
- Releases: 0
Topics
Metadata Files
README.md
transformers TF BART pre-training
- Script to pre-train huggingface transformers BART
- Training BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- `Text infilling` and `Sentence Permutation` functions are available now
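The sketch below is a rough, illustrative outline of what these two noising steps do, in plain Python rather than the repo's actual TensorFlow input pipeline; the function names and signatures are hypothetical. Text infilling replaces randomly sampled token spans with a single mask token, and sentence permutation shuffles sentences split on a segment token, mirroring the `--masking-rate` and `--permutation-segment-token-id` options described under Arguments.

```python
import random

# Illustrative-only sketch of BART-style input noising; the repo's actual
# implementation works on TensorFlow tensors inside the tf.data pipeline.

def text_infilling(token_ids, mask_token_id, masking_rate=0.3, max_span=3):
    """Replace randomly sampled spans of tokens with a single mask token id."""
    noised = []
    i = 0
    while i < len(token_ids):
        if random.random() < masking_rate:
            noised.append(mask_token_id)       # whole span becomes one [MASK]
            i += random.randint(1, max_span)   # skip the tokens in the span
        else:
            noised.append(token_ids[i])
            i += 1
    return noised

def sentence_permutation(token_ids, segment_token_id):
    """Shuffle the order of sentences, splitting on a segment token id."""
    sentences, current = [], []
    for token in token_ids:
        current.append(token)
        if token == segment_token_id:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    random.shuffle(sentences)
    return [token for sentence in sentences for token in sentence]
```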
Train
You can train a huggingface transformers model with a command like the example below. (The example works as-is, using the included sample data.)
```sh
$ CUDA_VISIBLE_DEVICES=1 python -m scripts.train \
    --model-config-path configs/base.json \
    --train-dataset-path tests/data/sample1.txt \
    --dev-dataset-path tests/data/sample1.txt \
    --sp-model-path sp_model/sp_model_unigram_8K.model \
    --device GPU \
    --auto-encoding \
    --batch-size 2 \
    --steps-per-epoch 100 \
    --mask-token "[MASK]" \
    --mixed-precision
```
Arguments
```sh
File Paths:
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --train-dataset-path TRAIN_DATASET_PATH
                        training dataset, a text file or multiple files ex) *.txt
  --dev-dataset-path DEV_DATASET_PATH
                        dev dataset, a text file or multiple files ex) *.txt
  --pretrained-checkpoint PRETRAINED_CHECKPOINT
                        pretrained checkpoint path
  --output-path OUTPUT_PATH
                        output directory to save log and model checkpoints
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path to tokenizer

Training Parameters:
  --mask-token MASK_TOKEN
                        mask token ex) [MASK]
  --mask-token-id MASK_TOKEN_ID
                        mask token id of vocab
  --epochs EPOCHS
  --steps-per-epoch STEPS_PER_EPOCH
  --learning-rate LEARNING_RATE
  --min-learning-rate MIN_LEARNING_RATE
  --warmup-steps WARMUP_STEPS
  --warmup-rate WARMUP_RATE
  --batch-size BATCH_SIZE
                        total training batch size of all devices
  --dev-batch-size DEV_BATCH_SIZE
  --num-total-dataset NUM_TOTAL_DATASET
  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE
  --prefetch-buffer-size PREFETCH_BUFFER_SIZE
  --max-sequence-length MAX_SEQUENCE_LENGTH
  --weight-decay WEIGHT_DECAY
                        use weight decay
  --clipnorm CLIPNORM   clips gradients to a maximum norm
  --disable-text-infilling
                        disable input noising
  --disable-sentence-permutation
                        disable input noising
  --masking-rate MASKING_RATE
                        text infilling masking rate
  --permutation-segment-token-id PERMUTATION_SEGMENT_TOKEN_ID
                        segment token id for sentence permutation

Other settings:
  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ
                        log losses and metrics every this many steps
  --mixed-precision     use mixed precision FP16
  --auto-encoding       train by auto encoding with text lines dataset
  --use-tfrecord        train using tfrecord dataset
  --repeat-each-file    repeat each dataset and uniformly sample train examples
  --debug-nan-loss      training with this flag prints the number of NaN losses
                        (not supported on TPU)
  --seed SEED           random seed
  --skip-epochs SKIP_EPOCHS
                        skip this number of epochs
  --device {CPU,GPU,TPU}
                        device to train model
  --max-over-sequence-policy {filter,slice}
                        policy for sequences whose length is over the max
```
- `model-config-path` is the huggingface BART model config file path.
- `pretrained-checkpoint` is a trained model checkpoint path.
- `sp-model-path` is the sentencepiece tokenizer model path (see the sketch after this list).
- With the `repeat-each-file` flag, you can repeat each dataset file indefinitely, even if one of the datasets runs out.
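As a side note, the sentencepiece model passed via `--sp-model-path` (e.g. `sp_model/sp_model_unigram_8K.model` from the example above) can be inspected with `tensorflow-text`, which is already a dependency of this project. The snippet below is a minimal standalone sketch for checking tokenization; it is not part of the repo's scripts, and the tokenizer wrapper actually used by `scripts.train` may differ.

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Standalone check (not part of scripts.train): load the serialized
# sentencepiece model and tokenize a sample line to inspect the output ids.
sp_model_path = "sp_model/sp_model_unigram_8K.model"  # path from the README example
serialized_model = tf.io.gfile.GFile(sp_model_path, "rb").read()

tokenizer = tf_text.SentencepieceTokenizer(model=serialized_model)

token_ids = tokenizer.tokenize("Example sentence to check the tokenizer.")
print(token_ids)                        # int32 tensor of token ids
print(tokenizer.detokenize(token_ids))  # round-trip back to a string
```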
Owner
- Name: ParkSangJun
- Login: cosmoquester
- Kind: user
- Location: Seoul, Korea
- Company: @scatterlab @pingpong-ai
- Website: https://cosmoquester.github.io
- Repositories: 12
- Profile: https://github.com/cosmoquester
Machine Learning Engineer @scatterlab Korea. Thank you.
Citation (CITATION.cff)
cff-version: 1.2.0
type: generic
message: "If you use this code, please cite this as below."
authors:
  - family-names: "Park"
    given-names: "Sangjun"
    orcid: "https://orcid.org/0000-0002-1838-9259"
title: "transformers-bart-pretrain"
version: 1.0.0
date-released: 2022-11-02
url: "https://github.com/cosmoquester/transformers-bart-pretrain"
Dependencies
- black * development
- codecov * development
- isort * development
- pytest * development
- pytest-cov * development
- tensorflow >=2
- tensorflow-text *
- transformers *