language-pretraining

Pre-training Language Models for Japanese

https://github.com/retarfi/language-pretraining

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.5%) to scientific vocabulary

Keywords

bert electra implementation japanese language-model language-models natural-language-processing nlp pytorch transformer transformers
Last synced: 6 months ago

Repository

Pre-training Language Models for Japanese

Basic Info
  • Host: GitHub
  • Owner: retarfi
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 203 KB
Statistics
  • Stars: 49
  • Watchers: 4
  • Forks: 6
  • Open Issues: 1
  • Releases: 5
Topics
bert electra implementation japanese language-model language-models natural-language-processing nlp pytorch transformer transformers
Created over 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

Pre-training Language Models for Japanese


This is a repository of pretrained Japanese transformer-based models. BERT, ELECTRA, RoBERTa, DeBERTa, and DeBERTaV2 are available.

Our pre-trained models are available in Transformers by Hugging Face: https://huggingface.co/izumi-lab. BERT-small, BERT-base, ELECTRA-small, ELECTRA-small-paper, and ELECTRA-base models trained on Wikipedia or a financial dataset are available at this URL.
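
For reference, here is a minimal sketch (not part of the original README) of loading one of these models with the transformers library; it assumes fugashi and ipadic are installed for the MeCab-based Japanese tokenizer.

```
# Minimal sketch (assumption): load a pretrained model from the izumi-lab
# Hugging Face organization. Requires transformers plus fugashi and ipadic
# for the MeCab-based Japanese tokenizer.
from transformers import AutoModel, AutoTokenizer

model_name = "izumi-lab/bert-small-japanese"  # model name taken from the examples below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("これはテストです。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```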

Issues in Japanese are also welcome.

Table of Contents
  1. Usage
  2. Pre-trained Models
  3. Training Data
  4. Roadmap
  5. Citation
  6. Licenses
  7. Related Work
  8. Acknowledgements

Usage

Train Tokenizer

In our pretrained models, the texts are first tokenized by MeCab with the IPAdic dictionary and then split into subwords by the WordPiece algorithm.

From v2.2.0, jptranstokenizer is required, which enables the use of word tokenizers other than MeCab, such as Juman++, Sudachi, and spaCy LUW.

For subword tokenization, SentencePiece is also available in addition to WordPiece.

$ python train_tokenizer.py \
    --word_tokenizer mecab \
    --input_file corpus.txt \
    --model_dir tokenizer/ \
    --intermediate_dir ./data/corpus_split/ \
    --mecab_dic ipadic \
    --tokenizer_type wordpiece \
    --vocab_size 32768 \
    --min_frequency 2 \
    --limit_alphabet 2900 \
    --num_unused_tokens 10

You can see all the arguments with python train_tokenizer.py --help
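
For illustration, here is a minimal sketch (an assumption, not part of the repository) of the two-stage tokenization described above, using transformers' BertJapaneseTokenizer, which applies MeCab word segmentation followed by WordPiece:

```
# Minimal sketch (assumption): MeCab word segmentation followed by WordPiece
# subword splitting, using an existing pretrained tokenizer from the Hub.
from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer.from_pretrained(
    "izumi-lab/bert-small-japanese",
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
)
print(tokenizer.tokenize("日本語の事前学習モデルを作ります。"))
```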

Create Dataset

You can train on any type of Japanese corpus.
When you train with another dataset, please add your corpus name to the corresponding line in the code.
The output directory name is <dataset_type>_<max_length>_<input_corpus>.
In the following case, the output directory name is nsp_128_wiki-ja.
tokenizer_name_or_path should end with vocab.txt for wordpiece and with spiece.model for sentencepiece.
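
As a small illustration of the naming convention above (a hypothetical helper, not from the repository):

```
# Hypothetical helper illustrating the <dataset_type>_<max_length>_<input_corpus> convention.
def dataset_dir_name(dataset_type: str, max_length: int, input_corpus: str) -> str:
    return f"{dataset_type}_{max_length}_{input_corpus}"

print(dataset_dir_name("nsp", 128, "wiki-ja"))  # -> nsp_128_wiki-ja
```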

We show two examples of creating a dataset:

  • When you use your trained tokenizer:

$ python create_datasets.py \
    --input_corpus wiki-ja \
    --max_length 512 \
    --input_file corpus.txt \
    --mask_style bert \
    --tokenizer_name_or_path tokenizer/vocab.txt \
    --word_tokenizer_type mecab \
    --subword_tokenizer_type wordpiece \
    --mecab_dic ipadic

  • When you use a pretrained tokenizer from the Hugging Face Hub:

$ python create_datasets.py \
    --input_corpus wiki-ja \
    --max_length 512 \
    --input_file corpus.txt \
    --mask_style roberta-wwm \
    --tokenizer_name_or_path izumi-lab/bert-small-japanese \
    --load_from_hub

Training

Distributed training is available. For the run command, please see the PyTorch documentation for details. The official PyTorch implementation does not support different batch sizes across nodes, so we improved the PyTorch sampling implementation (utils/trainer_pt_utils.py).

For example, bert-base-dist model is defined in parameter.json:

"bert-base-dist" : { "number-of-layers" : 12, "hidden-size" : 768, "sequence-length" : 512, "ffn-inner-hidden-size" : 3072, "attention-heads" : 12, "warmup-steps" : 10000, "learning-rate" : 1e-4, "batch-size" : { "0" : 80, "1" : 80, "2" : 48, "3" : 48 }, "train-steps" : 1000000, "save-steps" : 50000, "logging-steps" : 5000, "fp16-type": 0, "bf16": false }

In this case, nodes 0 and 1 have a batch size of 80, and nodes 2 and 3 have a batch size of 48. If node 0 has 2 GPUs, each GPU has a batch size of 40. A network speed of 10 Gbps or higher is recommended for multi-node training.
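
As a sketch of how these per-node batch sizes combine (not the repository's code; node_rank and the GPU count are hypothetical values for illustration):

```
# Minimal sketch (assumption): read the per-node batch sizes of bert-base-dist
# from parameter.json and derive the per-GPU batch size.
import json

with open("parameter.json") as f:
    params = json.load(f)["bert-base-dist"]

node_rank, gpus_on_node = 0, 2  # hypothetical values for illustration
per_node_batch = params["batch-size"][str(node_rank)]
print(per_node_batch // gpus_on_node)  # 80 // 2 -> 40, matching the text above
```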

fp16-type argument specifies which precision mode to use:

  • 0: FP32 training
  • 1: Mixed Precision
  • 2: "Almost FP16" Mixed Precision
  • 3: FP16 training

For details, please see the NVIDIA Apex documentation.

The bf16 argument determines whether bfloat16 is enabled.
You cannot use fp16-type (1, 2 or 3) and bf16 (true) simultaneously.
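
A minimal sketch (not from the repository) of this mutual-exclusion rule:

```
# Hypothetical check: fp16-type 1, 2 or 3 cannot be combined with bf16=true.
def check_precision(fp16_type: int, bf16: bool) -> None:
    if bf16 and fp16_type in (1, 2, 3):
        raise ValueError("fp16-type 1, 2 or 3 cannot be combined with bf16=true")

check_precision(fp16_type=0, bf16=True)   # OK: FP32 base config with bf16 enabled
check_precision(fp16_type=2, bf16=True)   # raises ValueError
```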

The whole word masking option is also available.

```
# Train with 1 node
$ python run_pretraining.py \
    --dataset_dir ./datasets/nsp_128_wiki-ja/ \
    --model_dir ./model/bert/ \
    --parameter_file parameter.json \
    --model_type bert-small \
    --tokenizer_name_or_path tokenizer/vocab.txt \
    --word_tokenizer_type mecab \
    --subword_tokenizer_type wordpiece \
    --mecab_dic ipadic \
    (--use_deepspeed) (--do_whole_word_mask) (--do_continue)

# Train with multi-node and multi-process
$ NCCL_SOCKET_IFNAME=eno1 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
    --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" \
    --master_port=50916 run_pretraining.py \
    --dataset_dir ./datasets/nsp_128_wiki-ja/ \
    --model_dir ./model/bert/ \
    --parameter_file parameter.json \
    --model_type bert-small \
    --tokenizer_name_or_path tokenizer/vocab.txt \
    --word_tokenizer_type mecab \
    --subword_tokenizer_type wordpiece \
    --mecab_dic ipadic \
    (--use_deepspeed) (--do_whole_word_mask) (--do_continue)
```

Additional Pre-training

You can additionally train models starting from an existing pre-trained model.
For example, the bert-small-additional model is defined in parameter.json:

"bert-small-additional" : { "pretrained_model_name_or_path" : "izumi-lab/bert-small-japanese", "flozen-layers" : 6, "warmup-steps" : 10000, "learning-rate" : 5e-4, "batch-size" : { "-1" : 128 }, "train-steps" : 1450000, "save-steps" : 100000, "fp16-type": 0, "bf16": false }

pretrained_model_name_or_path specifies a pretrained model on the Hugging Face Hub or a local path to a pretrained model.
flozen-layers specifies the number of frozen (not trained) transformer layers.
When it is -1, all layers (including the embedding layer) are trained.
When it is 3, the upper 9 layers (near the output layer) are trained.
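
A minimal sketch (an assumption, not the repository's implementation) of what freezing the lower layers could look like in PyTorch for the bert-small-additional example above (6 of 12 layers frozen):

```
# Minimal sketch (assumption): freeze the embeddings and the lower 6 encoder
# layers of a 12-layer BERT model, so only the upper layers are trained.
from transformers import AutoModel

model = AutoModel.from_pretrained("izumi-lab/bert-small-japanese")
frozen_layers = 6  # corresponds to "flozen-layers" : 6 above

for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:frozen_layers]:
    for param in layer.parameters():
        param.requires_grad = False
```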

When you additionally train an ELECTRA model, you need to specify pretrained_generator_model_name_or_path and discriminator_model_name_or_path instead of pretrained_model_name_or_path.

$ python run_pretraining.py \
    --tokenizer_name_or_path izumi-lab/bert-small-japanese \
    --dataset_dir ./datasets/nsp_128_fin-ja/ \
    --model_dir ./model/bert/ \
    --parameter_file parameter.json \
    --model_type bert-small-additional

For ELECTRA

ELECTRA models generated by run_pretraining.py contain both the generator and the discriminator. For general use, they need to be separated.

$ python extract_electra_model.py \
    --input_dir ./model/electra/checkpoint-1000000 \
    --output_dir ./model/electra/extracted-1000000 \
    --parameter_file parameter.json \
    --model_type electra-small \
    --generator \
    --discriminator

In this example, the generator model is saved in ./model/electra/extracted-1000000/generator/ and the discriminator model is saved in ./model/electra/extracted-1000000/discriminator/.
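
The extracted directories can then be loaded like any other checkpoint; a minimal sketch, assuming the extracted directory is in the standard transformers format:

```
# Minimal sketch (assumption): load the extracted discriminator for downstream use.
from transformers import AutoModel

discriminator = AutoModel.from_pretrained("./model/electra/extracted-1000000/discriminator/")
```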

Training Log

TensorBoard is available for viewing the training log.

Pre-trained Models

Model Architecture

The following models are currently available:

  • BERT
  • ELECTRA

The architectures of the BERT-small, BERT-base, ELECTRA-small-paper, and ELECTRA-base models are the same as those in the original ELECTRA paper (ELECTRA-small-paper is described as ELECTRA-small in the paper). The architecture of ELECTRA-small is the same as that in the ELECTRA implementation by Google.

| Parameter | BERT-small | BERT-base | ELECTRA-small | ELECTRA-small-paper | ELECTRA-base |
| :--------------: | :--------: | :-------: | :-----------: | :-----------------: | :----------: |
| Number of layers | 12 | 12 | 12 | 12 | 12 |
| Hidden Size | 256 | 768 | 256 | 256 | 768 |
| Attention Heads | 4 | 12 | 4 | 4 | 12 |
| Embedding Size | 128 | 512 | 128 | 128 | 128 |
| Generator Size | - | - | 1/1 | 1/4 | 1/3 |
| Train Steps | 1.45M | 1M | 1M | 1M | 766k |

Other models such as BERT-large or ELECTRA-large are also available in this implementation. You can also add your original parameters in parameter.json.

Training Data

Training data are aggregated into a single text file, with one sentence per line and a blank line between documents.
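
A minimal sketch (not from the repository) of writing a corpus file in this format:

```
# Hypothetical example: one sentence per line, blank line between documents.
documents = [
    ["最初の文書の一文目です。", "最初の文書の二文目です。"],
    ["二番目の文書の一文目です。"],
]
with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write("\n".join(doc) + "\n\n")
```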

Wikipedia Model

The normal models (not the financial models) are trained on the Japanese version of Wikipedia, using the Wikipedia dump file as of June 1, 2021. The corpus file is 2.9 GB and consists of approximately 20M sentences.

Financial Model

The financial models are trained on the Wikipedia corpus and a financial corpus. The Wikipedia corpus is the same as described above. The financial corpus consists of two corpora:

  • Summaries of financial results from October 9, 2012, to December 31, 2020
  • Securities reports from February 8, 2018, to December 31, 2020

The financial corpus file is 5.2GB, consisting of approximately 27M sentences.

Roadmap

See the open issues for a full list of proposed features (and known issues).

Citation

@article{Suzuki-etal-2023-ipm,
  title = {Constructing and analyzing domain-specific language model for financial text mining},
  author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
  journal = {Information Processing \& Management},
  volume = {60},
  number = {2},
  pages = {103194},
  year = {2023},
  doi = {10.1016/j.ipm.2022.103194}
}

Licenses

The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license.

The code in this repository is distributed under the MIT License.

Related Work

  • Original BERT model by Google Research Team
    • https://github.com/google-research/bert
  • Original ELECTRA model by Google Research Team
    • https://github.com/google-research/electra
  • Pretrained Japanese BERT models
    • Author: Tohoku University
    • https://github.com/cl-tohoku/bert-japanese
  • ELECTRA training with PyTorch implementation
    • Author: Richard Wang
    • https://github.com/richarddwang/electra_pytorch

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP21K12010, JST-Mirai Program Grant Number JPMJMI20B1, and JST PRESTO Grant Number JPMJPR2267, Japan.

Owner

  • Name: Masahiro Suzuki
  • Login: retarfi
  • Kind: user
  • Location: Tokyo
  • Company: Nikko Asset Management Co., Ltd.

Ph.D. student at the University of Tokyo / NLP Engineer in Finance

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Suzuki"
  given-names: "Masahiro"
  orcid: "https://orcid.org/0000-0001-8519-5617"
- family-names: "Sakaji"
  given-names: "Hiroki"
  orcid: "https://orcid.org/0000-0001-5030-625X"
- family-names: "Hirano"
  given-names: "Masanori"
  orcid: "https://orcid.org/0000-0001-5883-8250"
- family-names: "Izumi"
  given-names: "Kiyoshi"
title: "Pre-training Language Models for Japanese"
version: 2.2.1
date-released: 2023-04-28
url: "https://github.com/retarfi/language-pretraining"
preferred-citation:
  type: article
  authors:
  - family-names: "Suzuki"
    given-names: "Masahiro"
    orcid: "https://orcid.org/0000-0001-8519-5617"
  - family-names: "Sakaji"
    given-names: "Hiroki"
    orcid: "https://orcid.org/0000-0001-5030-625X"
  - family-names: "Hirano"
    given-names: "Masanori"
    orcid: "https://orcid.org/0000-0001-5883-8250"
  - family-names: "Izumi"
    given-names: "Kiyoshi"
  doi: "10.1016/j.ipm.2022.103194"
  journal: "Information Processing & Management"
  month: 3
  title: "Constructing and analyzing domain-specific language model for financial text mining"
  volume: 60
  number: 2
  year: 2023

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 6
  • Total pull requests: 12
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 3
  • Total pull request authors: 2
  • Average comments per issue: 3.17
  • Average comments per pull request: 0.08
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kajyuuen (2)
  • KoichiYasuoka (1)
  • ghost (1)
Pull Request Authors
  • retarfi (9)
  • upura (1)
Top Labels
Issue Labels
bug (1) enhancement (1)
Pull Request Labels

Dependencies

requirements.txt pypi
  • datasets >=1.10.0
  • fugashi *
  • ipadic ==1.0.0
  • tensorflow >=2.5.0
  • tokenizers >=0.10.0
  • torch >=1.8.0
  • tqdm *
  • transformers >=4.7.0
  • unidic ==1.0.3
  • unidic-lite ==1.0.8
pyproject.toml pypi
  • SudachiTra ^0.1.7 develop
  • black ^22.6.0 develop
  • deepspeed ^0.8.1 develop
  • ipadic ^1.0.0 develop
  • isort ^5.10.1 develop
  • ja-gsdluw * develop
  • mpi4py ^3.1.3 develop
  • mypy ^0.971 develop
  • pre-commit ^2.21.0 develop
  • pytest ^7.1.2 develop
  • spacy ^3.2.0 develop
  • unidic ^1.1.0 develop
  • unidic-lite 1.0.8 develop
  • datasets ^1.10.0
  • fugashi ^1.1.0
  • jptranstokenizer ^0.3.1
  • psutil ^5.9.3
  • python ^3.8
  • sentencepiece ^0.1.98
  • tensorflow ^2.5.0
  • tokenizers ^0.11
  • torch *
  • tqdm ^4.63.0
  • transformers 4.20.1