nusabert
NusaBERT: Teaching IndoBERT to be multilingual and multicultural!
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (6.4%) to scientific vocabulary
Repository
NusaBERT: Teaching IndoBERT to be multilingual and multicultural!
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
NusaBERT: Teaching IndoBERT to be multilingual and multicultural!
This project aims to extend the multilingual and multicultural capabilities of IndoBERT (Wilie et al., 2020). We expanded the IndoBERT tokenizer to cover 12 new regional languages of Indonesia, and continued pre-training on a large-scale corpus consisting of Indonesian and 12 regional languages of Indonesia. Our models are highly competitive and robust on multilingual and multicultural benchmarks such as IndoNLU, NusaX, and NusaWrites.
Pre-trained Models
| Model | #params | Dataset |
| ----- | :-----: | ------- |
| LazarusNLP/NusaBERT-base | 111M | sabilmakbar/indo_wiki, acul3/KoPI-NLLB, uonlp/CulturaX |
| LazarusNLP/NusaBERT-large | 337M | sabilmakbar/indo_wiki, acul3/KoPI-NLLB, uonlp/CulturaX |
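Both checkpoints can be loaded with the standard Hugging Face 🤗 Transformers API. A minimal sketch, assuming the Hub IDs above resolve to BERT-style masked language models:

```python
# Minimal sketch: load NusaBERT from the Hugging Face Hub.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LazarusNLP/NusaBERT-base")
model = AutoModelForMaskedLM.from_pretrained("LazarusNLP/NusaBERT-base")

inputs = tokenizer("NusaBERT adalah model bahasa untuk bahasa-bahasa di Indonesia.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```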
Results
We evaluate our models on three benchmarks: IndoNLU, NusaX, and NusaWrites, which measure the models' natural language understanding, multilingual, and multicultural capabilities. The datasets cover a variety of languages of Indonesia.
The values in the tables below denote the F1 score on the test set.
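The evaluate and seqeval libraries are listed among the dependencies; purely as an illustration of the metric, a macro-averaged F1 can be computed over hypothetical label IDs like so:

```python
# Rough sketch: macro-averaged F1 with the `evaluate` library.
# `predictions` and `references` below are hypothetical label IDs.
import evaluate

f1 = evaluate.load("f1")
predictions = [0, 2, 1, 1, 0]
references = [0, 2, 0, 1, 0]
print(f1.compute(predictions=predictions, references=references, average="macro"))
```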
IndoNLU (Classification)
| Model | EmoT | SmSA | CASA | HoASA | WReTE | AVG |
| ----- | :--: | :--: | :--: | :---: | :---: | :-: |
| mBERT | 67.30 | 84.14 | 72.23 | 84.63 | 84.40 | 78.54 |
| XLM-MLM | 65.75 | 86.33 | 82.17 | 88.89 | 64.35 | 77.50 |
| XLM-R Base | 71.15 | 91.39 | 91.71 | 91.57 | 79.95 | 85.15 |
| XLM-R Large | 78.51 | 92.35 | 92.40 | 94.27 | 83.82 | 88.27 |
| IndoBERT Lite Base p1 | 73.88 | 90.85 | 89.68 | 88.07 | 82.17 | 84.93 |
| IndoBERT Lite Base p2 | 72.27 | 90.29 | 87.63 | 87.62 | 83.62 | 84.29 |
| IndoBERT Base p1 | 75.48 | 87.73 | 93.23 | 92.07 | 78.55 | 85.41 |
| IndoBERT Base p2 | 76.28 | 87.66 | 93.24 | 92.70 | 78.68 | 85.71 |
| IndoBERT Lite Large p1 | 75.19 | 88.66 | 90.99 | 89.53 | 78.98 | 84.67 |
| IndoBERT Lite Large p2 | 70.80 | 88.61 | 88.13 | 91.05 | 85.41 | 84.80 |
| IndoBERT Large p1 | 77.08 | 92.72 | 95.69 | 93.75 | 82.91 | 88.43 |
| IndoBERT Large p2 | 79.47 | 92.03 | 94.94 | 93.38 | 80.30 | 88.02 |
| Our work |
| LazarusNLP/NusaBERT-base | 76.10 | 87.46 | 91.26 | 89.80 | 76.77 | 84.28 |
| LazarusNLP/NusaBERT-large | 78.90 | 87.36 | 92.13 | 93.18 | 82.64 | 86.84 |
IndoNLU (Sequence Labeling)
| Model | POSP | BaPOS | TermA | KEPS | NERGrit | NERP | FacQA | AVG |
| ----- | :--: | :---: | :---: | :--: | :-----: | :--: | :---: | :-: |
| mBERT | 91.85 | 83.25 | 89.51 | 64.31 | 75.02 | 69.27 | 61.29 | 76.36 |
| XLM-MLM | 95.87 | 88.40 | 90.55 | 65.35 | 74.75 | 75.06 | 62.15 | 78.88 |
| XLM-R Base | 95.16 | 84.64 | 90.99 | 68.82 | 79.09 | 75.03 | 64.58 | 79.76 |
| XLM-R Large | 92.73 | 87.03 | 91.45 | 70.88 | 78.26 | 78.52 | 74.61 | 81.92 |
| IndoBERT Lite Base p1 | 91.40 | 75.10 | 89.29 | 69.02 | 66.62 | 46.58 | 54.99 | 70.43 |
| IndoBERT Lite Base p2 | 90.05 | 77.59 | 89.19 | 69.13 | 66.71 | 50.52 | 49.18 | 70.34 |
| IndoBERT Base p1 | 95.26 | 87.09 | 90.73 | 70.36 | 69.87 | 75.52 | 53.45 | 77.47 |
| IndoBERT Base p2 | 95.23 | 85.72 | 91.13 | 69.17 | 67.42 | 75.68 | 57.06 | 77.34 |
| IndoBERT Lite Large p1 | 91.56 | 83.74 | 90.23 | 67.89 | 71.19 | 74.37 | 65.50 | 77.78 |
| IndoBERT Lite Large p2 | 94.53 | 84.91 | 90.72 | 68.55 | 73.07 | 74.89 | 62.87 | 78.51 |
| IndoBERT Large p1 | 95.71 | 90.35 | 91.87 | 71.18 | 77.60 | 79.25 | 62.48 | 81.21 |
| IndoBERT Large p2 | 95.34 | 87.36 | 92.14 | 71.27 | 76.63 | 77.99 | 68.09 | 81.26 |
| Our work |
| LazarusNLP/NusaBERT-base | 95.77 | 96.02 | 90.54 | 66.67 | 72.93 | 82.29 | 54.81 | 79.86 |
| LazarusNLP/NusaBERT-large | 96.89 | 96.76 | 91.73 | 71.53 | 79.86 | 85.12 | 66.77 | 84.09 |
NusaX
| Model | ace | ban | bbc | bjn | bug | eng | ind | jav | mad | min | nij | sun | AVG |
| ----------------------------------------------------------------------------- | :------: | :-------: | :------: | :-------: | :------: | :------: | :-------: | :-------: | :-------: | :-------: | :-------: | :------: | :-------: |
| Naive Bayes | 72.5 | 72.6 | 73.0 | 71.9 | 73.7 | 76.5 | 73.1 | 69.4 | 66.8 | 73.2 | 68.8 | 71.9 | 72.0 |
| SVM | 75.7 | 75.3 | 76.7 | 74.8 | 77.2 | 75.0 | 78.7 | 71.3 | 73.8 | 76.7 | 75.1 | 74.3 | 75.4 |
| Logistic Regression | 77.4 | 76.3 | 76.3 | 75.0 | 77.2 | 75.9 | 74.7 | 73.7 | 74.7 | 74.8 | 73.4 | 75.8 | 75.4 |
| IndoNLU IndoBERT Base | 75.4 | 74.8 | 70.0 | 83.1 | 73.9 | 79.5 | 90.0 | 81.7 | 77.8 | 82.5 | 75.8 | 77.5 | 78.5 |
| IndoNLU IndoBERT Large | 76.3 | 79.5 | 74.0 | 83.2 | 70.9 | 87.3 | 90.2 | 85.6 | 77.2 | 82.9 | 75.8 | 77.2 | 80.0 |
| IndoLEM IndoBERT Base | 72.6 | 65.4 | 61.7 | 71.2 | 66.9 | 71.2 | 87.6 | 74.5 | 71.8 | 68.9 | 69.3 | 71.7 | 71.1 |
| mBERT Base | 72.2 | 70.6 | 69.3 | 70.4 | 68.0 | 84.1 | 78.0 | 73.2 | 67.4 | 74.9 | 70.2 | 74.5 | 72.7 |
| XLM-R Base | 73.9 | 72.8 | 62.3 | 76.6 | 66.6 | 90.8 | 88.4 | 78.9 | 69.7 | 79.1 | 75.0 | 80.1 | 76.2 |
| XLM-R Large | 75.9 | 77.1 | 65.5 | 86.3 | 70.0 | 92.6 | 91.6 | 84.2 | 74.9 | 83.1 | 73.3 | 86.0 | 80.0 |
| Our work |
| LazarusNLP/NusaBERT-base | 76.51 | 78.67 | 74.02 | 82.38 | 71.64 | 84.09 | 89.74 | 84.09 | 75.62 | 80.77 | 74.93 | 85.21 | 79.81 |
| LazarusNLP/NusaBERT-large | 81.8 | 82.83 | 74.71 | 86.51 | 73.36 | 84.63 | 93.33 | 87.20 | 82.50 | 83.54 | 77.72 | 82.74 | 82.57 |
NusaWrites (NusaParagraph)
| Models | Emotion | Rhetorical Mode | Topic |
| ------ | :-----: | :-------------: | :---: |
| Naive Bayes | 75.51 | 37.73 | 85.06 |
| SVM | 76.36 | 45.44 | 85.86 |
| Logistic Regression | 78.23 | 45.21 | 87.67 |
| IndoNLU IndoBERT Base | 67.12 | 47.92 | 85.87 |
| IndoNLU IndoBERT Large | 62.65 | 31.75 | 85.41 |
| IndoLEM IndoBERT Base | 66.94 | 51.93 | 84.87 |
| mBERT | 63.15 | 50.01 | 73.82 |
| XLM-R Base | 59.15 | 49.17 | 71.68 |
| XLM-R Large | 67.42 | 51.57 | 83.05 |
| Our work |
| LazarusNLP/NusaBERT-base | 67.18 | 51.34 | 84.17 |
| LazarusNLP/NusaBERT-large | 71.82 | 53.06 | 85.89 |
NusaWrites (NusaTranslation)
| Models | Emotion | Sentiment |
| ------ | :-----: | :-------: |
| Naive Bayes | 52.70 | 74.89 |
| SVM | 55.08 | 76.04 |
| Logistic Regression | 56.18 | 74.89 |
| IndoNLU IndoBERT Base | 54.50 | 75.24 |
| IndoNLU IndoBERT Large | 57.80 | 77.40 |
| IndoLEM IndoBERT Base | 52.59 | 69.08 |
| mBERT | 44.13 | 68.72 |
| XLM-R Base | 47.02 | 68.62 |
| XLM-R Large | 54.84 | 79.06 |
| Our work |
| LazarusNLP/NusaBERT-base | 56.54 | 77.07 |
| LazarusNLP/NusaBERT-large | 61.40 | 79.54 |
Installation
```sh
git clone https://github.com/LazarusNLP/NusaBERT.git
cd NusaBERT
pip install -r requirements.txt
```
Dataset
For pre-training, we leverage three existing open-source corpora that include the Indonesian language and regional languages of Indonesia. A summary of the datasets is as follows (a short loading sketch is shown after the table):
| Dataset | Language | #documents |
| ------------------------------------------------------------------------------ | ---------------------- | :--------: |
| uonlp/CulturaX | Indonesian (ind) | 23,251,368 |
| uonlp/CulturaX | Javanese (jav) | 2,058 |
| uonlp/CulturaX | Malay (msa) | 238,000 |
| uonlp/CulturaX | Sundanese (sun) | 1,554 |
| sabilmakbar/indo_wiki | Acehnese (ace) | 12,904 |
| sabilmakbar/indo_wiki | Balinese (ban) | 19,837 |
| sabilmakbar/indo_wiki | Banjarese (bjn) | 10,437 |
| sabilmakbar/indo_wiki | Buginese (bug) | 9,793 |
| sabilmakbar/indo_wiki | Gorontalo (gor) | 14,514 |
| sabilmakbar/indo_wiki | Indonesian (ind) | 654,287 |
| sabilmakbar/indo_wiki | Javanese (jav) | 72,667 |
| sabilmakbar/indo_wiki | Banyumasan (map_bms) | 11,832 |
| sabilmakbar/indo_wiki | Minangkabau (min) | 225,858 |
| sabilmakbar/indo_wiki | Malay (msa) | 346,186 |
| sabilmakbar/indo_wiki | Nias (nia) | 1,650 |
| sabilmakbar/indo_wiki | Sundanese (sun) | 61,494 |
| sabilmakbar/indo_wiki | Tetum (tet) | 1,465 |
| acul3/KoPI-NLLB | Acehnese (ace) | 792,594 |
| acul3/KoPI-NLLB | Balinese (ban) | 244,545 |
| acul3/KoPI-NLLB | Banjarese (bjn) | 296,314 |
| acul3/KoPI-NLLB | Javanese (jav) | 1,155,142 |
| acul3/KoPI-NLLB | Minangkabau (min) | 113,323 |
| acul3/KoPI-NLLB | Sundanese (sun) | 894,626 |
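All three corpora are hosted on the Hugging Face Hub and can be streamed with the 🤗 Datasets library. A minimal sketch; the config name ("id" for the Indonesian subset of CulturaX) is an assumption, so check each dataset card for the exact subset names (some datasets may also require accepting their terms of use on the Hub):

```python
# Sketch: stream one pre-training subset with 🤗 Datasets.
# The config name "id" (Indonesian) is an assumption; see the dataset card for available subsets.
from datasets import load_dataset

culturax_ind = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)
for example in culturax_ind.take(3):
    print(example["text"][:80])
```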
Extend NusaBERT Tokenizer
We first train a WordPiece tokenizer on our pre-training corpus, limiting its vocabulary size to 10,000. We then add the non-overlapping tokens from the new tokenizer to the original IndoBERT tokenizer. Since many tokens overlap between the two tokenizers, we ended up adding only 1,511 new tokens to the original tokenizer. Refer to the script for more details; a rough sketch of the procedure is shown below.
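The following is a minimal sketch of that procedure using the Transformers API, assuming a plain-text corpus iterator; the actual script may differ in how it trains and merges the vocabularies:

```python
# Sketch: extend the IndoBERT tokenizer with new WordPiece tokens.
from transformers import AutoModelForMaskedLM, AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")

def corpus_iterator():
    # Placeholder: yield raw text from the regional-language pre-training corpus.
    yield "conto kalimat dina basa daérah"

# Train a new WordPiece tokenizer with the vocabulary capped at 10,000.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=10_000)

# Keep only the tokens that IndoBERT's tokenizer does not already contain.
new_tokens = [tok for tok in new_tokenizer.get_vocab() if tok not in old_tokenizer.get_vocab()]
old_tokenizer.add_tokens(new_tokens)

# Resize the embedding matrix to match the extended vocabulary before continued pre-training.
model = AutoModelForMaskedLM.from_pretrained("indobenchmark/indobert-base-p1")
model.resize_token_embeddings(len(old_tokenizer))
```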
Pre-train NusaBERT
We modified the Hugging Face 🤗 masked language modeling pre-training script and conducted continued pre-training of IndoBERT on the dataset detailed above. Running pre-training is as simple as:
```sh
python scripts/run_mlm.py \
    --model_name_or_path indobenchmark/indobert-base-p1 \
    --tokenizer_name LazarusNLP/nusabert-base \
    --max_seq_length 128 \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 256 \
    --do_train --do_eval \
    --max_steps 500000 \
    --warmup_steps 24000 \
    --learning_rate 3e-4 \
    --weight_decay 0.01 \
    --optim adamw_torch_fused \
    --bf16 \
    --preprocessing_num_workers 24 \
    --dataloader_num_workers 24 \
    --save_steps 10000 --save_total_limit 3 \
    --output_dir outputs/nusabert-base \
    --overwrite_output_dir \
    --report_to tensorboard \
    --push_to_hub --hub_private_repo \
    --hub_model_id LazarusNLP/nusabert-base
```
We achieved a negative log-likelihood loss of 1.4876 and an accuracy of 68.66% on a held-out subset (5%) of the pre-training corpus.
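A quick qualitative check of the resulting checkpoint is the fill-mask pipeline; a minimal sketch with an arbitrary example sentence:

```python
# Sketch: sanity-check the pre-trained masked language model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="LazarusNLP/NusaBERT-base")
print(fill_mask("Ibu kota Indonesia adalah [MASK]."))  # top-k candidate tokens with scores
```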
Fine-tune NusaBERT
We developed fine-tuning scripts for NusaBERT based on Hugging Face 🤗's sample fine-tuning scripts.
In particular, we developed fine-tuning scripts for single-sentence classification, multi-class multi-label classification, token classification, and pair token classification, which you can find in scripts. These scripts support IndoNLU, NusaX, and NusaWrites datasets.
Single-Sentence Classification Task
The tasks included under this category are emotion classification, sentiment analysis, topic classification, etc. To fine-tune for single-sentence classification, run the following command and modify accordingly:
```sh
python scripts/run_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config emot \
    --input-column-names tweet \
    --target-column-name label \
    --input-max-length 128 \
    --output-dir outputs/nusabert-base-emot \
    --num-train-epochs 100 \
    --optim adamw_torch_fused \
    --learning-rate 1e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 32 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-EmoT
```
Single-Sentence Classification recipes are provided here.
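Once fine-tuned and pushed to the Hub (e.g. as LazarusNLP/NusaBERT-base-EmoT above), the classifier can be queried with the text-classification pipeline; a minimal sketch, assuming that checkpoint exists and is accessible:

```python
# Sketch: run a fine-tuned single-sentence classifier.
from transformers import pipeline

classifier = pipeline("text-classification", model="LazarusNLP/NusaBERT-base-EmoT")
print(classifier("Aku senang sekali hari ini!"))  # [{'label': ..., 'score': ...}]
```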
Multi-label Multi-class Classification
The task included under this category is aspect-based sentiment analysis (e.g. IndoNLU CASA and HoASA). To fine-tune for multi-label multi-class classification, run the following command and modify accordingly:
```sh
python scripts/run_multi_label_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config casa \
    --input-column-name sentence \
    --target-column-names fuel,machine,others,part,price,service \
    --input-max-length 128 \
    --output-dir outputs/nusabert-base-casa \
    --num-train-epochs 100 \
    --optim adamw_torch_fused \
    --learning-rate 1e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 32 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-CASA
```
Multi-label Multi-class Classification recipes are provided here.
Token Classification
Token classification is also known as sequence labeling. The tasks included under this category are part-of-speech tagging (POS), named entity recognition (NER), and token-level span extraction (e.g. IndoNLU TermA, KEPS). To fine-tune for token classification, run the following command and modify accordingly:
```sh
python scripts/run_token_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config posp \
    --input-column-name tokens \
    --target-column-name pos_tags \
    --output-dir outputs/nusabert-base-posp \
    --num-train-epochs 10 \
    --optim adamw_torch_fused \
    --learning-rate 2e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 16 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-POSP
```
Token Classification recipes are provided here.
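A fine-tuned tagger (e.g. LazarusNLP/NusaBERT-base-POSP from the command above) can be used through the token-classification pipeline; a minimal sketch, assuming the checkpoint is available on the Hub:

```python
# Sketch: run a fine-tuned POS tagger / NER model.
from transformers import pipeline

tagger = pipeline(
    "token-classification",
    model="LazarusNLP/NusaBERT-base-POSP",
    aggregation_strategy="simple",  # merge word pieces into word-level predictions
)
print(tagger("Presiden bertemu dengan para menteri di Jakarta."))
```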
Pair Token Classification
Pair token classification is much like token classification, except that it involves a pair of input sentences instead of one. The task included under this category is token-level question-passage answering (e.g. IndoNLU FacQA). To fine-tune for pair question-answering, run the following command and modify accordingly:
```sh
python scripts/run_pair_token_classification.py \
    --model-checkpoint LazarusNLP/NusaBERT-base \
    --dataset-name indonlp/indonlu \
    --dataset-config facqa \
    --input-column-name-1 question \
    --input-column-name-2 passage \
    --target-column-name seq_label \
    --output-dir outputs/nusabert-base-facqa \
    --num-train-epochs 10 \
    --optim adamw_torch_fused \
    --learning-rate 2e-5 \
    --weight-decay 0.01 \
    --per-device-train-batch-size 16 \
    --per-device-eval-batch-size 64 \
    --hub-model-id LazarusNLP/NusaBERT-base-FacQA
```
Pair Token Classification recipes are provided here.
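Under the hood, the question and passage are encoded as a single sequence pair so the model can label answer tokens inside the passage. A minimal sketch of the pair encoding (the example texts are arbitrary, and the script's actual preprocessing may differ):

```python
# Sketch: encode a question/passage pair for token-level answer extraction.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LazarusNLP/NusaBERT-base")
encoding = tokenizer(
    "Siapa presiden pertama Indonesia?",            # question
    "Presiden pertama Indonesia adalah Soekarno.",  # passage
    return_tensors="pt",
)
# token_type_ids distinguish question tokens (0) from passage tokens (1)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))
print(encoding["token_type_ids"][0].tolist())
```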
Citation
If you use NusaBERT in your research, please cite the following:
```bibtex
@misc{wongso2024nusabertteachingindobertmultilingual,
  title = {NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural},
  author = {Wilson Wongso and David Samuel Setiawan and Steven Limcorn and Ananto Joyoadikusumo},
  year = {2024},
  eprint = {2403.01817},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2403.01817},
}
```
Credits
NusaBERT is developed with love by:
Owner
- Name: LazarusNLP
- Login: LazarusNLP
- Kind: organization
- Location: Indonesia
- Website: https://lazarusnlp.github.io/
- Repositories: 1
- Profile: https://github.com/LazarusNLP
Lazarus NLP is a collective initiative to revive the dying languages of Indonesia through speech and language technology.
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite both the article from preferred-citation and the software itself.
authors:
  - family-names: Wongso
    given-names: Wilson
  - family-names: Setiawan
    given-names: David Samuel
  - family-names: Limcorn
    given-names: Steven
  - family-names: Joyoadikusumo
    given-names: Ananto
title: 'NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural'
version: 1.0.0
url: https://arxiv.org/abs/2403.01817
date-released: '2024-08-07'
preferred-citation:
  authors:
    - family-names: Wongso
      given-names: Wilson
    - family-names: Setiawan
      given-names: David Samuel
    - family-names: Limcorn
      given-names: Steven
    - family-names: Joyoadikusumo
      given-names: Ananto
  title: 'NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural'
  url: https://arxiv.org/abs/2403.01817
  type: generic
  year: '2024'
  conference: {}
  publisher: {}
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
- datasets *
- evaluate *
- seqeval *
- tokenizers *
- transformers *