https://github.com/chapzq77/albert_zh
Large-scale Chinese pre-trained ALBERT models: A Lite BERT for Self-Supervised Learning of Language Representations
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.6%) to scientific vocabulary
Last synced: 10 months ago
Repository
Basic Info
- Host: GitHub
- Owner: chapzq77
- Default Branch: master
- Homepage: https://arxiv.org/pdf/1909.11942.pdf
- Size: 1.4 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of brightmart/albert_zh
Created over 6 years ago · Last pushed over 6 years ago
https://github.com/chapzq77/albert_zh/blob/master/
# albert_zh

An implementation of A Lite BERT for Self-Supervised Learning of Language Representations with TensorFlow.

Chinese version of the ALBERT pre-trained model. Both TensorFlow and PyTorch checkpoints for Chinese will be available.

*** UPDATE, 2019-09-28 ***

Added code for the three main changes of ALBERT from BERT, together with their test functions.

Introduction of ALBERT
-----------------------------------------------
ALBERT is based on BERT, with some improvements. It recently achieved state-of-the-art performance on the main benchmarks with roughly 30% fewer parameters.

Three main changes of ALBERT from BERT:

1. Factorized embedding parameterization

   The embedding parameter count goes from O(V * H) to O(V * E + E * H).

   Taking ALBert_xxlarge as an example, V=30000, H=4096, E=128. The original count is V * H = 30000 * 4096 ≈ 123 million parameters. The factorized count is V * E + E * H = 30000 * 128 + 128 * 4096 ≈ 3.84 million + 0.52 million = 4.36 million, about 28 times fewer.

2. Cross-layer parameter sharing

3. Inter-sentence coherence loss

   This replaces BERT's NSP task. We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the same technique as BERT (two consecutive segments from the same document), and as negative examples the same two consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about discourse-level coherence properties.

Other changes:

1. Remove dropout to enlarge the capacity of the model.

   We also note that, even after training for 1M steps, our largest models still do not overfit to their training data. As a result, we decide to remove dropout to further increase our model capacity.

2. Use LAMB as the optimizer, to train with a large batch size (batch_size = 4096).

3. Use n-gram masking (uni-gram, bi-gram, tri-gram) for the masked language model.

   Related masking schemes are whole word mask and n-gram mask; the n-gram approach comes from SpanBERT.

Release Plan
-----------------------------------------------
1. albert_base, 12M parameters, 12 layers, October 5
2. albert_large, 18M parameters, 24 layers, October 13
3. albert_xlarge, 59M parameters, 24 layers, October 6
4. albert_xxlarge, 233M parameters, 12 layers, October 7

Training data
-----------------------------------------------
40 GB of Chinese text.

Performance and Comparison
-----------------------------------------------
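As a parameter-side comparison, the embedding counts quoted for ALBert_xxlarge in the introduction (V=30000, H=4096, E=128) can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic only; the function name `embedding_params` is hypothetical and is not part of the repository.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count embedding parameters (hypothetical helper, for illustration).

    Without embedding_size: a single V x H embedding matrix.
    With embedding_size: a V x E matrix followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


# Values quoted in the README for ALBert_xxlarge.
V, H, E = 30000, 4096, 128

unfactorized = embedding_params(V, H)      # V * H
factorized = embedding_params(V, H, E)     # V * E + E * H
ratio = unfactorized / factorized

print(f"unfactorized: {unfactorized:,}")
print(f"factorized:   {factorized:,}")
print(f"ratio:        {ratio:.1f}x")
```

The saving grows with H because only the small E x H projection depends on the hidden size, while the large vocabulary dimension is paired with the small E.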
Performance on Chinese datasets
-----------------------------------------------

### XNLI of Chinese Version

| Model | Dev | Test |
| :------- | :---------: | :---------: |
| BERT | 77.8 (77.4) | 77.8 (77.5) |
| ERNIE | 79.7 (79.4) | 78.6 (78.2) |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
| XLNet | 79.2 | 78.7 |
| RoBERTa-zh-base | 79.8 | 78.8 |
| RoBERTa-zh-Large | 80.2 (80.0) | 79.9 (79.5) |
| ALBERT-xlarge | ? | ? |
| ALBERT-xxlarge | ? | ? |

Note: RoBERTa-zh-base refers to the 12-layer RoBERTa model.

### LCQMC (Sentence Pair Matching)

| Model | Dev | Test |
| :------- | :---------: | :---------: |
| BERT | 89.4 (88.4) | 86.9 (86.4) |
| ERNIE | 89.8 (89.6) | 87.2 (87.0) |
| BERT-wwm | 89.4 (89.2) | 87.0 (86.8) |
| BERT-wwm-ext | - | - |
| RoBERTa-zh-base | 88.7 | 87.0 |
| RoBERTa-zh-Large | 89.9 (89.6) | 87.2 (86.7) |
| RoBERTa-zh-Large(20w_steps) | 89.7 | 87.0 |
| ALBERT-xlarge | ? | ? |
| ALBERT-xxlarge | ? | ? |

Configuration of Models
-----------------------------------------------
Implementation and Code Testing
-----------------------------------------------
Run the following command to test the main changes:

    python test_changes.py

Pre-training
-----------------------------------------------

#### Generate tfrecords Files

Run the following command. Example input file: data/news_zh_1.txt

    bash create_pretrain_data.sh

#### Pre-training on GPU/TPU

GPU:

    export BERT_BASE_DIR=bert_config
    nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord \
    --output_dir=my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/bert_config_xxlarge.json \
    --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=76 \
    --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176 \
    --save_checkpoints_steps=2000 --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &

TPU, add the following flags:

    --use_tpu=True --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a

Note: when training from scratch, init_checkpoint can be left unset. When continuing from an existing model, set BERT_BASE_DIR and make sure that bert_config_file and init_checkpoint point to the corresponding files.

#### Join us on the QQ group: 836811304

If you have any questions, you can raise an issue or send me an email: brightmart@hotmail.com

You can also send a pull request to report your performance on your task, or to add methods for loading the models in PyTorch and so on.

If you have ideas for producing the best-performing pre-trained Chinese model, please also let me know.

Reference
-----------------------------------------------
1. ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations
2. (Chinese-language article on ALBERT, 13 NLP tasks, and GLUE; title garbled in the source)
3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
4. SpanBERT: Improving Pre-training by Representing and Predicting Spans
5. RoBERTa: A Robustly Optimized BERT Pretraining Approach
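The GPU command in the Pre-training section sets num_warmup_steps to 10% of num_train_steps. The sketch below shows how the learning rate would evolve under those flags, assuming the schedule of the original BERT optimization code (linear warmup, then linear decay to zero over num_train_steps). Whether this repository's optimizer follows exactly that schedule is an assumption, and `learning_rate_at` is a hypothetical helper, not a function from the repository.

```python
def learning_rate_at(step: int,
                     peak_lr: float = 0.00176,
                     num_train_steps: int = 125000,
                     num_warmup_steps: int = 12500) -> float:
    """Learning rate at a given step (hypothetical helper, for illustration).

    Assumed schedule, mirroring the original BERT code:
      - step <  num_warmup_steps: peak_lr * step / num_warmup_steps
      - step >= num_warmup_steps: peak_lr * (1 - step / num_train_steps)
    """
    if step < num_warmup_steps:
        return peak_lr * step / num_warmup_steps
    remaining = max(num_train_steps - step, 0)
    return peak_lr * remaining / num_train_steps


# Defaults are the flag values from the README's GPU command.
for step in (0, 6250, 12500, 62500, 125000):
    print(f"step {step:>6}: lr = {learning_rate_at(step):.6f}")
```

Under this assumed schedule the decay is measured from step 0, so the rate right after warmup is 90% of the value passed as learning_rate rather than the full value.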
Owner
- Name: 周奇
- Login: chapzq77
- Kind: user
- Repositories: 3
- Profile: https://github.com/chapzq77