https://github.com/chapzq77/albert_zh

Large-scale Chinese pre-trained ALBERT models. A Lite BERT for Self-Supervised Learning of Language Representations

Fork of brightmart/albert_zh
Created over 6 years ago · Last pushed over 6 years ago


# albert_zh

An implementation of A Lite BERT for Self-Supervised Learning of Language Representations (ALBERT) with TensorFlow.

This is the Chinese version of the ALBERT pre-trained model. Both TensorFlow and PyTorch checkpoints for Chinese will be made available.

*** UPDATE, 2019-09-28 *** Added code for the three main changes of ALBERT from BERT, together with its test functions.

Introduction of ALBERT
-----------------------------------------------
ALBERT is based on BERT, with several improvements. It recently achieved state-of-the-art performance on the main benchmarks while using 30% fewer parameters, or an even larger reduction.

Three main changes of ALBERT from BERT

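1 Factorized embedding parameterization
   
     The embedding parameters go from O(V * H) to O(V * E + E * H)
     (V: vocabulary size, H: hidden size, E: embedding size).
     
     Taking ALBert_xxlarge as an example: V=30000, H=4096, E=128
       
     Before: V * H = 30000 * 4096 = 122.88M (about 123 million) parameters
     After:  V * E + E * H = 30000*128 + 128*4096 = 3.84M + 0.52M = 4.36M parameters
       
     The embedding-related parameters shrink by a factor of about 28.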
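
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function names `embedding_params_unfactorized` and `embedding_params_factorized` are illustrative and are not part of this repository.

```python
def embedding_params_unfactorized(vocab_size, hidden_size):
    """Parameters of a single V x H embedding matrix (BERT-style)."""
    return vocab_size * hidden_size


def embedding_params_factorized(vocab_size, embedding_size, hidden_size):
    """Parameters of a V x E lookup table plus an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


# Values quoted above for ALBert_xxlarge.
V, H, E = 30000, 4096, 128

before = embedding_params_unfactorized(V, H)
after = embedding_params_factorized(V, E, H)

print(f"unfactorized: {before:,}")
print(f"factorized:   {after:,}")
print(f"reduction:    {before / after:.1f}x")
```

The saving grows with the gap between H and E, since the V-sized table is multiplied by the small E instead of the large H.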


2 Cross-Layer Parameter Sharing
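
With cross-layer sharing, all encoder layers reuse one set of weights, so the encoder size no longer grows with depth. The sketch below gives a rough feel for the effect. The per-layer estimate (four H x H attention projections plus a feed-forward block with intermediate size 4H, biases and LayerNorm ignored) is a simplifying assumption, the layer count of 12 is assumed for the xxlarge setting, and the function names are hypothetical.

```python
def layer_params(hidden_size):
    """Approximate weight count of one Transformer encoder layer.

    Assumes four H x H attention projections (query, key, value, output)
    and a feed-forward block with intermediate size 4H. Biases and
    LayerNorm parameters are ignored.
    """
    attention = 4 * hidden_size * hidden_size
    feed_forward = 2 * hidden_size * (4 * hidden_size)
    return attention + feed_forward


def encoder_params(hidden_size, num_layers, share_across_layers):
    """Total encoder weights; with sharing, every layer reuses one set."""
    per_layer = layer_params(hidden_size)
    return per_layer if share_across_layers else per_layer * num_layers


H, NUM_LAYERS = 4096, 12  # assumed xxlarge-like setting
unshared = encoder_params(H, NUM_LAYERS, share_across_layers=False)
shared = encoder_params(H, NUM_LAYERS, share_across_layers=True)

print(f"without sharing: {unshared:,}")
print(f"with sharing:    {shared:,}")
```

Note that sharing reduces the number of stored parameters only; every layer is still executed, so the compute per forward pass is unchanged.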

     

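3 Inter-sentence coherence loss.
     
     ALBERT replaces the next sentence prediction (NSP) task of BERT with a sentence-order prediction (SOP) task. As the ALBERT paper puts it: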

      We maintain that inter-sentence modeling is an important aspect of language understanding, but we propose a loss 
      based primarily on coherence. That is, for ALBERT, we use a sentence-order prediction (SOP) loss, which avoids topic 
      prediction and instead focuses on modeling inter-sentence coherence. The SOP loss uses as positive examples the 
      same technique as BERT (two consecutive segments from the same document), and as negative examples the same two 
      consecutive segments but with their order swapped. This forces the model to learn finer-grained distinctions about
      discourse-level coherence properties. 
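
The construction of SOP examples described above can be sketched as follows. This is an illustration and not the data pipeline of this repository: the helper `make_sop_example`, the 50% swap rate, and the label convention (1 = original order, 0 = swapped) are assumptions made for the example.

```python
import random


def make_sop_example(segment_a, segment_b, rng):
    """Build one sentence-order prediction (SOP) example.

    segment_a and segment_b must be two consecutive segments from the
    same document, in their original order. Returns (first, second, label),
    where label 1 means "original order" and label 0 means "swapped".
    """
    if rng.random() < 0.5:
        return segment_a, segment_b, 1  # positive: order kept
    return segment_b, segment_a, 0      # negative: same segments, swapped


rng = random.Random(0)
document = ["The cat sat down.", "Then it fell asleep.", "It woke up at noon."]

# Pair each segment with the one that follows it in the same document.
examples = [
    make_sop_example(document[i], document[i + 1], rng)
    for i in range(len(document) - 1)
]
for first, second, label in examples:
    print(label, "|", first, "|", second)
```

Because both the positive and the negative example contain the same two segments, topic cues cannot separate them; only the order can.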

 Other changes

    1 Remove dropout to enlarge the capacity of the model.
        We also note that, even after training for 1M steps, our largest models still do not overfit to their training data. As a result, we decide to remove dropout to further increase our model capacity.
    
    2 Use LAMB as the optimizer, to train with a big batch size.
        A large batch_size (4096) is used for training; the LAMB optimizer makes training with very large batch sizes practical.
    
    3 Use n-gram (uni-gram, bi-gram, tri-gram) masking for the masked language model.
        Masked spans are n-grams: uni-grams, bi-grams, and tri-grams.
        Related schemes are whole word mask and n-gram mask; the n-gram approach comes from SpanBERT.
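
For reference, the ALBERT paper samples the length n of each masked n-gram with probability proportional to 1/n, up to a maximum length of 3. The sketch below only illustrates that length distribution. It is not the masking code of this repository, and `ngram_length_probs` is a hypothetical helper.

```python
import random
from fractions import Fraction


def ngram_length_probs(max_n):
    """Return p(n) proportional to 1/n for n = 1..max_n."""
    weights = [Fraction(1, n) for n in range(1, max_n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = ngram_length_probs(3)  # uni-gram, bi-gram, tri-gram
for n, p in enumerate(probs, start=1):
    print(f"p(n={n}) = {str(p)} (about {float(p):.3f})")

# Draw a few span lengths from that distribution.
rng = random.Random(0)
lengths = rng.choices([1, 2, 3], weights=[float(p) for p in probs], k=10)
print(lengths)
```

Shorter spans are favoured, so most masks still cover a single token while a minority cover two or three consecutive tokens.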

 Release Plan
-----------------------------------------------
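1. albert_base, 12M parameters, 12 layers, October 5

2. albert_large, 18M parameters, 24 layers, October 13

3. albert_xlarge, 59M parameters, 24 layers, October 6

4. albert_xxlarge, 233M parameters, 12 layers, October 7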

 Training data
-----------------------------------------------
About 40 GB of Chinese text, more than 10 billion Chinese characters.

 Performance and Comparison
-----------------------------------------------    

  
   






 Performance on Chinese datasets
----------------------------------------------- 

### XNLI of Chinese Version

| Model | Dev | Test |
| :------- | :---------: | :---------: |
| BERT | 77.8 (77.4) | 77.8 (77.5) | 
| ERNIE | 79.7 (79.4) | 78.6 (78.2) | 
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) | 
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
| XLNet | 79.2  | 78.7 |
| RoBERTa-zh-base | 79.8 |78.8  |
| RoBERTa-zh-Large | 80.2 (80.0) | 79.9 (79.5) |
| ALBERT-xlarge | ? | ? |
| ALBERT-xxlarge | ? | ? |


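Note: the BERT-wwm-ext and XLNet results are cited from their respective sources; RoBERTa-zh-base refers to the 12-layer Chinese RoBERTa model.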
   

### LCQMC (Sentence Pair Matching)

| Model | Dev | Test |
| :------- | :---------: | :---------: |
| BERT | 89.4(88.4) | 86.9(86.4) | 
| ERNIE | 89.8 (89.6) | 87.2 (87.0) | 
| BERT-wwm |89.4 (89.2) | 87.0 (86.8) | 
| BERT-wwm-ext | - |-  |
| RoBERTa-zh-base | 88.7 | 87.0  |
| RoBERTa-zh-Large | 89.9(89.6) | 87.2(86.7) |
| RoBERTa-zh-Large(20w_steps) | 89.7| 87.0 |
| ALBERT-xlarge | ? | ? |
| ALBERT-xxlarge | ? | ? |




 Configuration of Models
-----------------------------------------------


 Implementation and Code Testing
-----------------------------------------------


To run the test functions for the three main changes:

    python test_changes.py

 Pre-training
-----------------------------------------------

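#### Generate tfrecords Files

The project ships with a sample text file (data/news_zh_1.txt). To turn it into pre-training data, run:
   
       bash create_pretrain_data.sh
   
The script writes its output as tfrecords files.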

#### Pre-training on GPU/TPU
    GPU:
    export BERT_BASE_DIR=bert_config
    nohup python3 run_pretraining.py --input_file=./data/tf*.tfrecord  \
    --output_dir=my_new_model_path --do_train=True --do_eval=True --bert_config_file=$BERT_BASE_DIR/bert_config_xxlarge.json \
    --train_batch_size=4096 --max_seq_length=512 --max_predictions_per_seq=76 \
    --num_train_steps=125000 --num_warmup_steps=12500 --learning_rate=0.00176    \
    --save_checkpoints_steps=2000   --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt &
    
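    For TPU, add the following flags:
        --use_tpu=True  --tpu_name=grpc://10.240.1.66:8470 --tpu_zone=us-central1-a
        
    Note: when training from scratch, init_checkpoint can be omitted.
    When continuing from an existing model, set BERT_BASE_DIR and make sure that bert_config_file and init_checkpoint point to the matching files.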
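
Some of the flag values in the GPU command are related to each other, and a quick check can catch mismatches before a long run. The sketch below assumes a masking rate of 15% (`masked_lm_prob`, the default in the original BERT data-generation script); the command itself does not state this rate, so treat it as an assumption.

```python
# Flag values copied from the GPU command above.
max_seq_length = 512
max_predictions_per_seq = 76
num_train_steps = 125000
num_warmup_steps = 12500

# Assumption: 15% of tokens are masked (BERT's default masked_lm_prob).
masked_lm_prob = 0.15

# max_predictions_per_seq is usually about max_seq_length * masked_lm_prob.
expected_predictions = int(max_seq_length * masked_lm_prob)

# Share of training spent in learning-rate warmup.
warmup_fraction = num_warmup_steps / num_train_steps

print("expected max_predictions_per_seq:", expected_predictions)
print("warmup fraction:", warmup_fraction)
```

If max_seq_length is changed, for example to shorten sequences on smaller GPUs, max_predictions_per_seq should be rescaled in the same way, and the tfrecords files need to be regenerated with matching values.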
    


#### Join us on QQ group: 836811304

If you have any questions, you can raise an issue or send me an email: brightmart@hotmail.com

You can also send a pull request to report your performance on your task, or to add methods for loading the models in PyTorch and so on.

If you have ideas for producing the best-performing pre-trained Chinese model, please also let me know.

Reference
-----------------------------------------------
1. ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations

2. ALBERT on 13 NLP tasks and the GLUE benchmark (Chinese-language article)

3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

4. SpanBERT: Improving Pre-training by Representing and Predicting Spans

5. RoBERTa: A Robustly Optimized BERT Pretraining Approach




Owner

  • Name: 周奇
  • Login: chapzq77
  • Kind: user
