shiba-model

PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.

https://github.com/octanove/shiba

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
✓ Academic publication links: links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (3.5%) to scientific vocabulary

Keywords

deep-learning natural-language-processing neural-network

Last synced: 6 months ago

Repository

PyTorch implementation and pre-trained Japanese model for CANINE, the efficient character-level transformer.

Basic Info

Host: GitHub
Owner: octanove
License: other
Language: Python
Default Branch: main
Homepage:
Size: 253 KB

Statistics

Stars: 89
Watchers: 3
Forks: 14
Open Issues: 1
Releases: 0

Topics

deep-learning natural-language-processing neural-network

Created almost 5 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

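What is SHIBA?

SHIBA is a PyTorch re-implementation of CANINE [1], pre-trained on the Japanese Wikipedia corpus. If you are not familiar with CANINE, think of it as a highly efficient character-level BERT model. The name SHIBA comes, of course, from the Shiba Inu, Japan's own canine.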

CANINE/SHIBA Architecture

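Like CANINE, SHIBA has two main advantages:

1. It has no vocabulary restrictions and can process any Unicode character. It can be fine-tuned on characters, words, or even languages that the model never saw during pre-training.
2. It processes many characters efficiently. Compared with a character-level BERT, it can embed four times as many characters (2048) for a comparable amount of compute.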
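
Both points can be illustrated without the library. The sketch below assumes that a codepoint tokenizer maps each character to its Unicode codepoint (as the name CodepointTokenizer suggests) and that the fourfold figure corresponds to a downsampling rate of 4, as in CANINE. The helper names are hypothetical and are not part of the shiba package.

```python
# Assumption: characters are represented by their Unicode codepoints,
# so no vocabulary lookup is needed and every character has an ID.
def to_codepoints(text):
    return [ord(ch) for ch in text]


# Assumption: character embeddings are downsampled by a factor of 4
# before the deep transformer stack, as described for CANINE.
DOWNSAMPLING_RATE = 4


def downsampled_length(num_chars, rate=DOWNSAMPLING_RATE):
    # Ceiling division: number of positions the deep stack has to process.
    return -(-num_chars // rate)


ids = to_codepoints("柴ドリル")        # works for any Unicode string
positions = downsampled_length(2048)  # 2048 characters -> 512 positions
```

Under these assumptions, 2048 input characters cost the deep layers about as much as 512 positions, which is the usual sequence length of a BERT-style model.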

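It also performs well on two downstream tasks.

Performance

The first downstream task is classification on the Livedoor News Corpus, using as much of each article's text as the model can process in one pass.

| Model | Accuracy |
|---|---|
| SHIBA | 95.5% |
| bert-base-japanese | 95.1% |
| bert-base-japanese-char | 92.9% |

The second downstream task is word segmentation on the UD Japanese GSD corpus.

| Model | F1 score |
|---|---|
| MeCab | 99.7% |
| SHIBA | 97.9% |

Beating MeCab at word segmentation on UD appears difficult. Unlike MeCab, however, SHIBA needs no dictionary, so we expect it to be useful for segmenting non-standard text.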
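
Word segmentation can be framed as character-level sequence labeling. The sketch below assumes a hypothetical two-label scheme (`B` starts a word, `I` continues one) and shows how per-character labels turn into words. The source does not say which label set the SHIBA segmentation model uses, and `labels_to_words` is an illustrative helper, not part of the shiba package.

```python
def labels_to_words(text, labels):
    """Group characters into words from per-character B/I labels."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    words = []
    for ch, label in zip(text, labels):
        # Start a new word on "B", or at the very first character.
        if label == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


words = labels_to_words("柴犬は犬", ["B", "I", "B", "B"])
```

Because the labels are predicted per character, no dictionary is involved at any step.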

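Usage

If you only need the model, you can install it as follows:

```bash
pip install shiba-model
```

The checkpoint pre-trained on Japanese Wikipedia can be used as shown below. get_pretrained_from_hub() downloads the checkpoint automatically. If you prefer, you can also download the checkpoint manually.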

```python
from shiba import Shiba, CodepointTokenizer, get_pretrained_from_hub

shiba_model = Shiba()
shiba_model.load_state_dict(get_pretrained_from_hub())
shiba_model.eval()  # disable dropout
tokenizer = CodepointTokenizer()

inputs = tokenizer.encode_batch(['自然言語処理', '柴ドリル'])
outputs = shiba_model(**inputs)
```

Like any other transformer encoder, SHIBA can be fine-tuned for classification or character-level tasks. Adding task-specific layers should be straightforward, but this repository also includes ShibaForClassification and ShibaForSequenceLabeling, ready-made models for classification and sequence labeling.

```python
from shiba import ShibaForClassification

cls_model = ShibaForClassification(vocab_size=3)
cls_model.load_encoder_checkpoint()
```

load_encoder_checkpoint() loads only the pre-trained encoder; it is roughly equivalent to cls_model.shiba_model.load_state_dict(get_pretrained_state_dict()).

If you simply want an efficient character-level model for a relatively easy task, you can also train SHIBA from scratch.

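Details

We plan to publish a technical blog post about SHIBA soon. The most important details are described below.

Differences from CANINE

SHIBA's architecture is almost identical to CANINE's, but there are a few differences worth noting:

- SHIBA uses windowed local attention rather than the blockwise local attention used in CANINE.
- SHIBA has no token type embeddings.
- The details of character-embedding downsampling differ slightly between SHIBA and CANINE. The main difference is that, unlike CANINE, SHIBA does not truncate the final character of maximum-length sequences.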
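
The difference between the two local attention schemes can be illustrated by listing which positions a given character may attend to. This is a simplified sketch under assumed definitions: fixed non-overlapping blocks for blockwise attention and a symmetric window for windowed attention. The function names and the `block` and `radius` parameters are hypothetical and do not reflect the actual SHIBA or CANINE implementations.

```python
def blockwise_neighbors(i, seq_len, block):
    """Positions visible to position i when attention is limited to its own block."""
    start = (i // block) * block
    return list(range(start, min(start + block, seq_len)))


def windowed_neighbors(i, seq_len, radius):
    """Positions visible to position i within a window centered on i."""
    return list(range(max(0, i - radius), min(seq_len, i + radius + 1)))


# Positions 3 and 4 are adjacent, but fall into different blocks of size 4.
block_view_3 = blockwise_neighbors(3, seq_len=8, block=4)
block_view_4 = blockwise_neighbors(4, seq_len=8, block=4)

# With a window, each position sees its neighbors on both sides.
window_view_3 = windowed_neighbors(3, seq_len=8, radius=2)
window_view_4 = windowed_neighbors(4, seq_len=8, radius=2)
```

In this toy setting, blockwise attention never lets positions 3 and 4 see each other, while the windowed variant keeps the context centered on every position.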

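Model Code

The model code is in model.py and the tokenizer is in codepoint_tokenizer.py. The code was written to be easy to understand and modify, so if you want to know exactly how the model works, reading and experimenting with the code yourself is probably the quickest route.

Training Method

We trained on the Japanese Wikipedia corpus, preprocessed in almost the same way as Tohoku University's Japanese BERT. Training instances were built as in RoBERTa [2], packing as many sentences as possible into each instance. For masking, we used random span masking, a technique that dynamically masks random spans. If [M] is the Unicode codepoint representing a mask character, a concrete example of masking looks like this: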

柴犬は最強の犬種である

柴犬は[M][M]の犬種である

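When a masked span is replaced instead, a replacement of the same length is chosen at random from a BPE vocabulary trained on the same data. If you want to reproduce our training, see TRAINING.md.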
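
The two masking operations described above can be sketched in a few lines. This is a minimal illustration, not the code in masking.py: the mask codepoint, the function names, and the tiny vocabulary are all hypothetical.

```python
import random

MASK = "\uE003"  # hypothetical mask codepoint from the Unicode private use area


def pick_span(length, span_len, rng):
    """Pick a random start index for a span of span_len characters."""
    if not 0 < span_len <= length:
        raise ValueError("span_len must be between 1 and the text length")
    return rng.randrange(length - span_len + 1)


def mask_span(text, start, span_len):
    """Replace the span with mask characters, keeping the length unchanged."""
    return text[:start] + MASK * span_len + text[start + span_len:]


def replace_span(text, start, span_len, vocab, rng):
    """Replace the span with a random vocabulary entry of the same length."""
    candidates = [piece for piece in vocab if len(piece) == span_len]
    if not candidates:
        # Fall back to plain masking when no entry has the right length.
        return mask_span(text, start, span_len)
    return text[:start] + rng.choice(candidates) + text[start + span_len:]


rng = random.Random(0)  # seeded only to make the sketch repeatable
text = "柴犬は最強の犬種である"
toy_vocab = ["最高", "日本", "である"]  # stand-in for a real BPE vocabulary

start = pick_span(len(text), 2, rng)
masked = mask_span(text, start, 2)
replaced = replace_span(text, start, 2, toy_vocab, rng)
```

Choosing the span at call time, rather than once during preprocessing, is what makes the masking dynamic: the same sentence can be masked differently each time it is seen.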

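Hyperparameters, including the masking method, were chosen based on performance on a subset of the data. We also drew on the hyperparameters used in RoBERTa [2] and Academic Budget BERT [3], both of which deal with training transformer encoders.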
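
The RoBERTa-style instance generation mentioned in this section, packing as many sentences as possible into each instance, can be sketched as a greedy loop. This is an illustration under assumptions: `pack_sentences` and `max_chars` are hypothetical names, length is measured in characters, and overlong sentences are simply truncated, which may differ from the actual training code.

```python
def pack_sentences(sentences, max_chars):
    """Greedily pack consecutive sentences into instances of at most max_chars."""
    instances = []
    current = ""
    for sentence in sentences:
        sentence = sentence[:max_chars]  # assumption: truncate overlong sentences
        # Start a new instance when the next sentence no longer fits.
        if current and len(current) + len(sentence) > max_chars:
            instances.append(current)
            current = ""
        current += sentence
    if current:
        instances.append(current)
    return instances


packed = pack_sentences(["ab", "cd", "efg", "h"], max_chars=5)
```

Packing this way keeps each instance close to the maximum length, so little of the input window is spent on padding.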

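Training Code

This repository also includes the code used to train SHIBA. Compared with the model code, the training code has more dependencies and is less polished, but it may be useful if you want to train a similar model. The implementations of random span masking and random BPE masking are in masking.py.

Checkpoints

The default model is the one that performed best on downstream tasks, but other checkpoints, including ones with a language-model head, are also provided.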

| Type | Step | Note |
|---|---|---|
| Encoder Only | 45k | (default model) |
| Encoder Only | 60k | |
| LM Head + Encoder | 45k | |
| LM Head + Encoder | 60k | |

License

The contents and code of this repository are provided under the Apache 2.0 license. The pre-trained model checkpoints are provided under the CC BY-SA 4.0 license.

Citation

Please cite this repository as follows:

```bibtex
@misc{shiba,
  author = {Joshua Tanner and Masato Hagiwara},
  title = {SHIBA: Japanese CANINE model},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/octanove/shiba}},
}
```

Please cite the CANINE paper as follows:

```bibtex
@misc{clark2021canine,
  title={CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation},
  author={Jonathan H. Clark and Dan Garrette and Iulia Turc and John Wieting},
  year={2021},
  eprint={2103.06874},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

References

[1] Jonathan H. Clark and Dan Garrette and Iulia Turc and John Wieting (2021). CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. CoRR, abs/2103.06874.

[2] Yinhan Liu and Myle Ott and Naman Goyal and Jingfei Du and Mandar Joshi and Danqi Chen and Omer Levy and Mike Lewis and Luke Zettlemoyer and Veselin Stoyanov (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692.

[3] Peter Izsak and Moshe Berchansky and Omer Levy (2021). How to Train BERT with an Academic Budget. CoRR, abs/2104.07705.

Owner

Name: Octanove Labs
Login: octanove
Kind: organization

Website: https://www.octanove.com/
Repositories: 2
Profile: https://github.com/octanove

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Tanner"
  given-names: "Joshua"
- family-names: "Hagiwara"
  given-names: "Masato"
title: "SHIBA: Japanese CANINE model"
version: 0.1.1
date-released: 2021-06-24
url: "https://github.com/octanove/shiba"

GitHub Events

Total

Issues event: 2
Issue comment event: 12

Last Year

Issues event: 2
Issue comment event: 12

Committers

Last synced: almost 3 years ago

All Time

Total Commits: 4
Total Committers: 2
Avg Commits per committer: 2.0
Development Distribution Score (DDS): 0.25

Top Committers

Name	Email	Commits
Joshua Tanner	m**t@g**m	3
Shunya Ueta	h**a@u**m	1
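
The Development Distribution Score reported above is consistent with a common definition: one minus the top committer's share of all commits. A minimal sketch, assuming that definition (the function name is hypothetical, and the exact formula used by this page is not stated in the source):

```python
def development_distribution_score(commit_counts):
    """Return 1 minus the top committer's share of all commits."""
    total = sum(commit_counts)
    if total == 0:
        raise ValueError("at least one commit is required")
    return 1 - max(commit_counts) / total


# Commit counts from the Top Committers table: 3 and 1.
dds = development_distribution_score([3, 1])
```

With 3 of 4 commits coming from one committer, the top share is 0.75, which leaves the 0.25 shown in the statistics.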

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 8
Total pull requests: 4
Average time to close issues: 2 months
Average time to close pull requests: 26 days
Total issue authors: 8
Total pull request authors: 4
Average comments per issue: 5.63
Average comments per pull request: 0.75
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 0
Average time to close issues: 2 days
Average time to close pull requests: N/A
Issue authors: 2
Pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

stefan-it (1)
shamweelm (1)
cdleong (1)
SohaibAnwaar (1)
sven-nm (1)
jiminsun (1)
Mindful (1)
nicholasdehnen (1)
CsAbdulelah (1)

Pull Request Authors

Mindful (1)
hurutoriya (1)
SohaibAnwaar (1)
CsAbdulelah (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 29 last-month

Total dependent packages: 0
Total dependent repositories: 2
Total versions: 2
Total maintainers: 1

pypi.org: shiba-model

An efficient character-level transformer encoder, pretrained for Japanese

Homepage: https://github.com/octanove/shiba
Documentation: https://shiba-model.readthedocs.io/
License: Apache Software License
Latest release: 0.1.1
published almost 3 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 2
Downloads: 29 Last month

Rankings

Stargazers count: 7.5%

Forks count: 9.8%

Dependent packages count: 10.1%

Dependent repos count: 11.6%

Average: 19.3%

Downloads: 57.3%

Maintainers (1)

mindful

Last synced: 6 months ago

Dependencies

requirements.txt pypi

conllu ==4.4
datasets *
fugashi ==1.1.0
ipadic *
jsonlines *
mpi4py *
shiba-model *
tokenizers *
torchmetrics ==0.3.2
tqdm *
transformers ==4.6.1
wandb *

setup.py pypi

local-attention *
torch *
