transformers-bart-pretrain

Script to pre-train Hugging Face transformers BART with TensorFlow 2

https://github.com/cosmoquester/transformers-bart-pretrain

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.4%) to scientific vocabulary

Keywords

bart gpu huggingface-transformers pretraining tensorflow tpu
Last synced: 6 months ago

Repository

Script to pre-train Hugging Face transformers BART with TensorFlow 2

Basic Info
  • Host: GitHub
  • Owner: cosmoquester
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 1.43 MB
Statistics
  • Stars: 33
  • Watchers: 1
  • Forks: 6
  • Open Issues: 1
  • Releases: 0
Topics
bart gpu huggingface-transformers pretraining tensorflow tpu
Created over 4 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

transformers TF BART pre-training

Badges: Code style: black · Imports: isort · CI (cosmoquester) · codecov

Train

You can train a Hugging Face transformers model simply, as in the example below. (The example works as-is, without changes, using the included sample data.)

```sh
$ CUDA_VISIBLE_DEVICES=1 python -m scripts.train \
    --model-config-path configs/base.json \
    --train-dataset-path tests/data/sample1.txt \
    --dev-dataset-path tests/data/sample1.txt \
    --sp-model-path sp_model/sp_model_unigram_8K.model \
    --device GPU \
    --auto-encoding \
    --batch-size 2 \
    --steps-per-epoch 100 \
    --mask-token "[MASK]" \
    --mixed-precision
```

Arguments

```sh
File Paths:
  --model-config-path MODEL_CONFIG_PATH
                        model config file
  --train-dataset-path TRAIN_DATASET_PATH
                        training dataset, a text file or multiple files ex) *.txt
  --dev-dataset-path DEV_DATASET_PATH
                        dev dataset, a text file or multiple files ex) *.txt
  --pretrained-checkpoint PRETRAINED_CHECKPOINT
                        pretrained checkpoint path
  --output-path OUTPUT_PATH
                        output directory to save log and model checkpoints
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path to tokenizer

Training Parameters:
  --mask-token MASK_TOKEN
                        mask token ex) [MASK]
  --mask-token-id MASK_TOKEN_ID
                        mask token id of vocab
  --epochs EPOCHS
  --steps-per-epoch STEPS_PER_EPOCH
  --learning-rate LEARNING_RATE
  --min-learning-rate MIN_LEARNING_RATE
  --warmup-steps WARMUP_STEPS
  --warmup-rate WARMUP_RATE
  --batch-size BATCH_SIZE
                        total training batch size of all devices
  --dev-batch-size DEV_BATCH_SIZE
  --num-total-dataset NUM_TOTAL_DATASET
  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE
  --prefetch-buffer-size PREFETCH_BUFFER_SIZE
  --max-sequence-length MAX_SEQUENCE_LENGTH
  --weight-decay WEIGHT_DECAY
                        use weight decay
  --clipnorm CLIPNORM   clip gradients to a maximum norm
  --disable-text-infilling
                        disable input noising
  --disable-sentence-permutation
                        disable input noising
  --masking-rate MASKING_RATE
                        text infilling masking rate
  --permutation-segment-token-id PERMUTATION_SEGMENT_TOKEN_ID
                        segment token id for sentence permutation

Other settings:
  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ
                        log losses and metrics after every this many steps
  --mixed-precision     use mixed precision FP16
  --auto-encoding       train by auto encoding with a text-lines dataset
  --use-tfrecord        train using a tfrecord dataset
  --repeat-each-file    repeat each dataset file and sample uniformly for training examples
  --debug-nan-loss      with this flag, training prints the number of NaN losses (not supported on TPU)
  --seed SEED           random seed
  --skip-epochs SKIP_EPOCHS
                        skip this number of epochs
  --device {CPU,GPU,TPU}
                        device to train model
  --max-over-sequence-policy {filter,slice}
                        policy for sequences whose length is over the max
```

- `--model-config-path` is the Hugging Face BART model config file path.
- `--pretrained-checkpoint` is a trained model checkpoint path.
- `--sp-model-path` is the SentencePiece tokenizer model path.
- With the `--repeat-each-file` flag, each dataset file is repeated indefinitely, so training continues even if one of the dataset files runs out.
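For instance, combining several of the flags above, resuming pre-training from a checkpoint on multiple text files might look like the sketch below. The checkpoint, dataset, and output paths are placeholders for illustration, not files shipped with the repository; only flags documented above are used.

```sh
# Hypothetical example: checkpoint, data, and output paths are placeholders.
$ python -m scripts.train \
    --model-config-path configs/base.json \
    --pretrained-checkpoint /path/to/pretrained-checkpoint \
    --train-dataset-path "data/corpus-*.txt" \
    --dev-dataset-path data/dev.txt \
    --sp-model-path sp_model/sp_model_unigram_8K.model \
    --output-path outputs/bart-pretrain \
    --device GPU \
    --auto-encoding \
    --repeat-each-file \
    --mask-token "[MASK]"
```

Here the glob `data/corpus-*.txt` relies on the documented support for multiple dataset files, and `--repeat-each-file` keeps training going even if one of the files runs out.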

Owner

  • Name: ParkSangJun
  • Login: cosmoquester
  • Kind: user
  • Location: Seoul, Korea
  • Company: @scatterlab @pingpong-ai

Machine Learning Engineer @scatterlab Korea. Thank you.

Citation (CITATION.cff)

cff-version: 1.2.0
type: generic
message: "If you use this code, please cite this as below."
authors:
- family-names: "Park"
  given-names: "Sangjun"
  orcid: "https://orcid.org/0000-0002-1838-9259"
title: "transformers-bart-pretrain"
version: 1.0.0
date-released: 2022-11-02
url: "https://github.com/cosmoquester/transformers-bart-pretrain"

Dependencies

requirements-dev.txt pypi
  • black * development
  • codecov * development
  • isort * development
  • pytest * development
  • pytest-cov * development
requirements.txt pypi
  • tensorflow >=2
  • tensorflow-text *
  • transformers *
setup.py pypi
  • tensorflow >=2
pyproject.toml pypi
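
Based on the manifests above, a minimal environment setup might look like the following sketch. Package names are taken directly from requirements.txt and requirements-dev.txt; no exact versions are pinned there.

```sh
# Runtime dependencies (requirements.txt)
$ pip install "tensorflow>=2" tensorflow-text transformers

# Development dependencies (requirements-dev.txt)
$ pip install black codecov isort pytest pytest-cov
```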