https://github.com/cahya-wirawan/indonlu

The first vast natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained models, and starter code! (AACL-IJCNLP 2020)

Fork of IndoNLP/indonlu
Created over 5 years ago · Last pushed over 5 years ago

# IndoNLU 
![Pull Requests Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat) [![GitHub license](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/indobenchmark/indonlu/blob/master/LICENSE) [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg)](code_of_conduct.md)

IndoNLU is a collection of Natural Language Understanding (NLU) resources for Bahasa Indonesia with 12 downstream tasks. We provide the code to reproduce the results and large pre-trained models (IndoBERT and IndoBERT-lite) trained on a corpus of around 4 billion words (Indo4B), more than 20 GB of text data. This project was started as a joint collaboration between universities and industry, including Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, Gojek, and Prosa.AI.

## Research Paper
IndoNLU has been accepted by AACL-IJCNLP 2020, and you can find the details in our preprint: https://arxiv.org/abs/2009.05387.
If you use any component of IndoNLU, including Indo4B, FastText-Indo4B, or IndoBERT, in your work, please cite the following paper:
```
@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}
```

## How to contribute to IndoNLU?
Be sure to check the [contributing guidelines](https://github.com/indobenchmark/indonlu/blob/master/CONTRIBUTING.md), and contact the maintainers or open an issue to collect feedback before starting your PR.

## 12 Downstream Tasks
- The datasets for the downstream tasks are available in the repository [[Link]](https://github.com/indobenchmark/indonlu/tree/master/dataset)
- We provide train, valid, and test sets. The labels of the test set are masked (no true labels) to preserve the integrity of the evaluation. Please submit your predictions to the submission portal at [CodaLab](https://competitions.codalab.org/competitions/26537).

### Examples
- The examples show how to load an IndoBERT model and fine-tune it on sequence classification and sequence tagging tasks.
- See the examples directory: [link](https://github.com/indobenchmark/indonlu/tree/master/examples)

### Submission Format
Please check the submission examples at this [link](https://github.com/indobenchmark/indonlu/tree/master/submission_examples). Each task has a different format. Every submission file starts with the `index` column (the id of the test sample, following the order of the masked test set).

To submit, rename your prediction file to `pred.txt` and zip it. After uploading, wait for the system to compute the results. You can follow the progress in your `results` tab.
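
The packaging steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `label` column name, the comma delimiter, zero-based indices, the helper name `write_submission`, and the archive name `submission.zip` are all hypothetical. Only the leading `index` column and the `pred.txt` file name come from the text above, so check the submission examples for the exact layout of each task.

```python
import csv
import zipfile
from pathlib import Path


def write_submission(predictions, out_dir="."):
    """Write predictions to pred.txt and zip it for upload.

    `predictions` is a list of labels in the same order as the masked test set.
    The `label` column, the comma delimiter, and zero-based indices are
    assumptions made for illustration only.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pred_path = out_dir / "pred.txt"

    with pred_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["index", "label"])  # `index` is always the first column
        for idx, label in enumerate(predictions):
            writer.writerow([idx, label])

    zip_path = out_dir / "submission.zip"  # hypothetical archive name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Store the file at the archive root under the required name.
        zf.write(pred_path, arcname="pred.txt")
    return zip_path
```

Calling `write_submission(["positive", "negative"], "out")` would produce `out/submission.zip` containing a single `pred.txt`.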

## Indo4B Dataset
We provide access to our large pretraining dataset. In this version, we exclude all Twitter tweets due to restrictions of the Twitter Developer Policy and Agreement.
- Indo4B Dataset (23 GB uncompressed, 5.6 GB compressed) [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/dataset/preprocessed/dataset_wot_uncased_blanklines.tar.xz)
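
Because the archive expands to 23 GB, it can be handy to peek at its contents before extracting everything. The sketch below reads a few lines directly from a downloaded `.tar.xz` file using only the Python standard library. The helper name `iter_archive_lines` is hypothetical, and the internal layout of the archive (file names, one document per block, and so on) is not described here, so the code assumes nothing beyond UTF-8 text files inside a tar archive.

```python
import io
import tarfile


def iter_archive_lines(archive_path, limit=5):
    """Yield up to `limit` (member name, line) pairs from a .tar.xz archive."""
    emitted = 0
    with tarfile.open(archive_path, mode="r:xz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            raw = tar.extractfile(member)
            # Decode lazily so a large member is never loaded into memory at once.
            for line in io.TextIOWrapper(raw, encoding="utf-8", errors="replace"):
                if emitted >= limit:
                    return
                yield member.name, line.rstrip("\n")
                emitted += 1
```

For example, `for name, line in iter_archive_lines("dataset_wot_uncased_blanklines.tar.xz"): print(name, line)` would print the first five lines found in the archive.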

## IndoBERT and IndoBERT-lite Models
We provide 4 IndoBERT and 4 IndoBERT-lite pretrained language models [[Link]](https://huggingface.co/indobenchmark)
- IndoBERT-base
  - Phase 1  [[Link]](https://huggingface.co/indobenchmark/indobert-base-p1)
  - Phase 2  [[Link]](https://huggingface.co/indobenchmark/indobert-base-p2)
- IndoBERT-large
  - Phase 1  [[Link]](https://huggingface.co/indobenchmark/indobert-large-p1)
  - Phase 2  [[Link]](https://huggingface.co/indobenchmark/indobert-large-p2)
- IndoBERT-lite-base
  - Phase 1  [[Link]](https://huggingface.co/indobenchmark/indobert-lite-base-p1)
  - Phase 2  [[Link]](https://huggingface.co/indobenchmark/indobert-lite-base-p2)
- IndoBERT-lite-large
  - Phase 1  [[Link]](https://huggingface.co/indobenchmark/indobert-lite-large-p1)
  - Phase 2  [[Link]](https://huggingface.co/indobenchmark/indobert-lite-large-p2)
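
The eight checkpoints listed above follow a regular naming pattern, so a small helper can build the model ID and load it. This is a minimal sketch, assuming the Hugging Face `transformers` library (with PyTorch) is installed. The helper names `indobert_model_id` and `load_indobert` are hypothetical, and using `BertTokenizer` for both the IndoBERT and IndoBERT-lite families is an assumption that should be checked against each model card.

```python
def indobert_model_id(variant="base", phase=1, lite=False):
    """Build a model ID such as 'indobenchmark/indobert-base-p1'."""
    if variant not in ("base", "large"):
        raise ValueError("variant must be 'base' or 'large'")
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    family = "indobert-lite" if lite else "indobert"
    return f"indobenchmark/{family}-{variant}-p{phase}"


def load_indobert(variant="base", phase=1, lite=False):
    """Download and return (tokenizer, model) for the chosen checkpoint."""
    # Imported lazily so the ID helper works without transformers installed.
    from transformers import AutoModel, BertTokenizer

    model_id = indobert_model_id(variant, phase, lite)
    # Assumption: BertTokenizer is used for every checkpoint, lite ones included.
    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

A typical call would be `tokenizer, model = load_indobert("base", phase=1)`, followed by `model(**tokenizer("aku adalah anak", return_tensors="pt"))` to obtain contextual representations. The first call downloads the weights, so it needs network access.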

## FastText (Indo4B)
We provide the full uncased FastText model file (11.9 GB) and the corresponding vector file (3.9 GB).
- FastText model (11.9 GB) [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext.4B.id.300.epoch5.uncased.bin) 
- Vector file (3.9 GB) [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext.4B.id.300.epoch5.uncased.vec.zip)
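
Once unzipped, the vector file can be read without any extra dependency. The sketch below assumes it uses the standard fastText text format: a header line with the vocabulary size and dimension, then one token per line followed by its values. The helper name `read_vec` and the `limit` parameter are hypothetical, and the unzipped file name used in the usage note is an assumption.

```python
def read_vec(handle, limit=None):
    """Parse fastText text-format vectors from an open text file.

    Returns (dim, vectors), where `vectors` maps each token to a list of floats.
    `limit` caps the number of tokens read, which helps with multi-GB files.
    """
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        # Split from the right so the last `dim` fields are the vector values.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors
```

For instance, opening `fasttext.4B.id.300.epoch5.uncased.vec` with `encoding="utf-8"` and calling `read_vec(f, limit=10000)` would load the first 10,000 entries. In the fastText text format, entries are usually ordered by descending frequency.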

We also provide smaller FastText models with a reduced vocabulary for each of the 12 downstream tasks.
- FastText-Indo4B [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext-4B-id-uncased.zip)
- FastText-CC-ID [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext-cc-id.zip)

## Leaderboard
- Community Portal and Public Leaderboard [[Link]](https://www.indobenchmark.com/leaderboard.html)
- Submission Portal https://competitions.codalab.org/competitions/26537

Owner

  • Name: Cahya Wirawan
  • Login: cahya-wirawan
  • Kind: user
  • Location: Vienna, Austria

System engineer, currently working on NLP, CV and Speech Recognition for fun and curiosity
