https://github.com/cahya-wirawan/indonlu
The first-ever vast natural language processing benchmark for Indonesian Language. We provide multiple downstream tasks, pre-trained models, and a starter code! (AACL-IJCNLP 2020)
Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (13.7%) to scientific vocabulary
Last synced: 10 months ago
·
JSON representation
Repository
The first-ever vast natural language processing benchmark for Indonesian Language. We provide multiple downstream tasks, pre-trained models, and a starter code! (AACL-IJCNLP 2020)
Basic Info
- Host: GitHub
- Owner: cahya-wirawan
- License: mit
- Default Branch: master
- Homepage: https://indobenchmark.com
- Size: 4.74 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of IndoNLP/indonlu
Created over 5 years ago
· Last pushed over 5 years ago
https://github.com/cahya-wirawan/indonlu/blob/master/
# IndoNLU
 [](https://github.com/indobenchmark/indonlu/blob/master/LICENSE) [](code_of_conduct.md)
IndoNLU is a collection of Natural Language Understanding (NLU) resources for Bahasa Indonesia with 12 downstream tasks. We provide the code to reproduce the results and large pre-trained models (IndoBERT and IndoBERT-lite) trained with around 4 billion word corpus (Indo4B), more than 20 GB of text data. This project was initially started by a joint collaboration between universities and industry, such as Institut Teknologi Bandung, Universitas Multimedia Nusantara, The Hong Kong University of Science and Technology, Universitas Indonesia, Gojek, and Prosa.AI.
## Research Paper
IndoNLU has been accepted by AACL-IJCNLP 2020 and you can find the details in our preprint https://arxiv.org/abs/2009.05387.
If you are using any component on IndoNLU including Indo4B, FastText-Indo4B, or IndoBERT in your work, please cite the following paper:
```
@inproceedings{wilie2020indonlu,
title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
year={2020}
}
```
## How to contribute to IndoNLU?
Be sure to check the [contributing guidelines](https://github.com/indobenchmark/indonlu/blob/master/CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
## 12 Downstream Tasks
- You can check [[Link]](https://github.com/indobenchmark/indonlu/tree/master/dataset)
- We provide train, valid, and test sets. The labels of the test set are masked (no true labels) in order to preserve the integrity of the evaluation. Please submit your predictions to the submission portal at [CodaLab](https://competitions.codalab.org/competitions/26537)
### Examples
- A guide to load IndoBERT model and finetune the model on Sequence Classification and Sequence Tagging task.
- You can check [link](https://github.com/indobenchmark/indonlu/tree/master/examples)
### Submission Format
Please kindly check the [link](https://github.com/indobenchmark/indonlu/tree/master/submission_examples). For each task, there is different format. Every submission file always start with the `index` column (the id of the test sample following the order of the masked test set).
For the submission, first you need to rename your prediction into `pred.txt`, then zip the file. After that, you need to allow the system to compute the results. You can easily check the progress in your `results` tab.
## Indo4B Dataset
We provide the access to our large pretraining dataset. In this version, we exclude all Twitter tweets due to restrictions of the Twitter Developer Policy and Agreement.
- Indo4B Dataset (23 GB uncompressed, 5.6 GB compressed) [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/dataset/preprocessed/dataset_wot_uncased_blanklines.tar.xz)
## IndoBERT and IndoBERT-lite Models
We provide 4 IndoBERT and 4 IndoBERT-lite Pretrained Language Model [[Link]](https://huggingface.co/indobenchmark)
- IndoBERT-base
- Phase 1 [[Link]](https://huggingface.co/indobenchmark/indobert-base-p1)
- Phase 2 [[Link]](https://huggingface.co/indobenchmark/indobert-base-p2)
- IndoBERT-large
- Phase 1 [[Link]](https://huggingface.co/indobenchmark/indobert-large-p1)
- Phase 2 [[Link]](https://huggingface.co/indobenchmark/indobert-large-p2)
- IndoBERT-lite-base
- Phase 1 [[Link]](https://huggingface.co/indobenchmark/indobert-lite-base-p1)
- Phase 2 [[Link]](https://huggingface.co/indobenchmark/indobert-lite-base-p2)
- IndoBERT-lite-large
- Phase 1 [[Link]](https://huggingface.co/indobenchmark/indobert-lite-large-p1)
- Phase 2 [[Link]](https://huggingface.co/indobenchmark/indobert-lite-large-p2)
## FastText (Indo4B)
We provide the full uncased FastText model file (11.9 GB) and the corresponding Vector file (3.9 GB)
- FastText model (11.9 GB) [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext.4B.id.300.epoch5.uncased.bin)
- Vector file (3.9 GB) [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext.4B.id.300.epoch5.uncased.vec.zip)
We provide smaller FastText models with smaller vocabulary for each of the 12 downstream tasks
- FastText-Indo4B [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext-4B-id-uncased.zip)
- FastText-CC-ID [[Link]](https://storage.googleapis.com/babert-pretraining/IndoNLU_finals/models/fasttext/fasttext-cc-id.zip)
## Leaderboard
- Community Portal and Public Leaderboard [[Link]](https://www.indobenchmark.com/leaderboard.html)
- Submission Portal https://competitions.codalab.org/competitions/26537
Owner
- Name: Cahya Wirawan
- Login: cahya-wirawan
- Kind: user
- Location: Vienna, Austria
- Website: https://www.linkedin.com/in/cahyawirawan/
- Twitter: CahyaWr
- Repositories: 171
- Profile: https://github.com/cahya-wirawan
System engineer, currently working on NLP, CV and Speech Recognition for fun and curiosity