indonesian-language-models

Indonesian Language Models and its Usage

https://github.com/cahya-wirawan/indonesian-language-models

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.1%) to scientific vocabulary

Keywords

deep-learning fastai huggingface-transformers language-model machine-learning nlp pytorch transformer
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 157
  • Watchers: 12
  • Forks: 29
  • Open Issues: 1
  • Releases: 0
Created over 7 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Code of conduct Citation

README.md

Indonesian Language Models

A language model is a probability distribution over word sequences that can be used to predict the next word from the words that precede it. This ability makes language models a core component of modern natural language processing. They are used for many tasks, such as speech recognition, conversational AI, information retrieval, sentiment analysis, and text summarization.
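
To make the idea of "a probability distribution used to predict the next word" concrete, here is a minimal sketch of a bigram model in plain Python. It is only an illustration: the toy corpus and the function names `train_bigram` and `next_word_distribution` are assumptions made for this example and are not taken from the repository, whose models are neural networks rather than count-based.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Add sentence boundary markers so the first word also has a context.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_distribution(counts, prev):
    """Return P(next | prev) as a dict of maximum-likelihood estimates."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# Toy corpus (hypothetical, not the repository's training data).
corpus = [
    "saya suka makan nasi",
    "saya suka minum teh",
    "saya suka makan mie",
]
model = train_bigram(corpus)
dist = next_word_distribution(model, "suka")
# The most probable next word is the one with the highest probability.
best = max(dist, key=dist.get)
```

A neural language model replaces the count table with a learned function, but the output has the same shape: a probability for every candidate next word given the context.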

For this reason, many large companies compete to build ever larger language models with massive numbers of parameters, such as Google's BERT, Facebook's RoBERTa, and OpenAI's GPT-3. Most of these models were built only for English and a few other European languages. Countries with low-resource languages face big challenges in catching up in this technology race.

Therefore, the author set out to build language models for Indonesian, starting with ULMFiT in 2018. The first language model was trained only on Indonesian Wikipedia, which is very small compared to the datasets used to train English language models.

Universal Language Model Fine-tuning (ULMFiT)

Jeremy Howard and Sebastian Ruder proposed ULMFiT in early 2018 as a novel method for fine-tuning language models for inductive transfer learning. The ULMFiT language model for Indonesian was trained as part of the author's project while learning fastai. It achieved a perplexity of 27.67 on Indonesian Wikipedia.
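
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token; lower is better. The sketch below shows the standard definition in plain Python. The function name `perplexity` and the example probabilities are assumptions for illustration, and this is not the evaluation code used to obtain the 27.67 figure.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A model that always spreads its probability evenly over 4 candidates
# assigns p = 0.25 to every observed token, giving a perplexity of about 4.
ppl_uniform = perplexity([0.25] * 10)

# A more confident (and correct) model gets a lower perplexity.
ppl_confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

Read this way, a perplexity of 27.67 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 28 tokens at each step.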

Transformers

Ashish Vaswani et al. proposed the Transformer in the paper Attention Is All You Need. It is a novel architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease.
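
The core operation introduced in that paper is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, which lets every position look directly at every other position and is what makes long-range dependencies easy to model. The following is a minimal single-head sketch in plain Python, without masking, learned projections, or batching; the function names and the toy input are assumptions for illustration and are not code from the repository.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of vectors (one head, no mask)."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Self-attention over a toy sequence of three 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
```

Because each output row is a weighted average over all positions, the distance between two tokens in the sequence does not limit how strongly one can influence the other.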

At the time of writing (March 2021), there are already more than 50 different types of transformer-based language models (according to the model list at Hugging Face), such as BERT, GPT-2, Longformer, and mT5, built by companies and individual contributors. The author has also built several Indonesian transformer-based language models using the Hugging Face Transformers library and hosts them on the Hugging Face model hub.

Owner

  • Name: Cahya Wirawan
  • Login: cahya-wirawan
  • Kind: user
  • Location: Vienna, Austria

System engineer, currently working on NLP, CV and Speech Recognition for fun and curiosity

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Wirawan
    given-names: Cahya
    orcid: https://orcid.org/0000-0002-0263-8273
title: "Indonesian Language Models"
version: 1.0.0
date-released: 2018-08-19

GitHub Events

Total
  • Watch event: 7
  • Fork event: 1
Last Year
  • Watch event: 7
  • Fork event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 3
  • Total pull requests: 2
  • Average time to close issues: 6 days
  • Average time to close pull requests: 24 days
  • Total issue authors: 2
  • Total pull request authors: 2
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • wahyubram82 (2)
  • sigit-purnomo (1)
Pull Request Authors
  • cahya-wirawan (1)
  • guspan-tanadi (1)

Dependencies

ULMFiT/0.3/requirements.txt pypi
  • fastai >=1.0.0
  • numpy *
  • pytorch >=1.0.0
requirements.txt pypi
  • numpy *
  • pytorch *