multilingual-asr
Multilingual Speech Recognition for Indonesian Languages
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.6%) to scientific vocabulary
Repository
Statistics
- Stars: 62
- Watchers: 3
- Forks: 5
- Open Issues: 0
- Releases: 0
README.md
Multilingual Speech Recognition for Indonesian Languages
Introduction
Automatic Speech Recognition (ASR) converts spoken language into text. Typically, an ASR model is trained and used for a single language. However, Indonesia has more than 700 spoken languages, so it is impractical to provide a separate speech recognition model for each one.
Therefore, we want to develop a multilingual speech recognition model that can at least support some of the main Indonesian languages without sacrificing model performance for each language.
Objectives
We want to build a multilingual speech recognition model using Indonesian, Javanese, and Sundanese datasets. The model should perform well in all three languages. We also train monolingual models for comparison.
Methods
We used the following speech datasets for training and fine-tuning:
- Indonesian Common Voice
- High-quality TTS data for Javanese
- High-quality TTS data for Sundanese
We used Wav2vec 2.0, a framework for self-supervised learning of speech representations that achieved state-of-the-art results on the LibriSpeech benchmark for noisy speech, for the Indonesian, Javanese, and Sundanese languages.
We trained a multilingual Wav2vec 2.0 model on the three languages combined for 200 epochs. We also trained three monolingual Wav2vec 2.0 models, one each for Indonesian, Javanese, and Sundanese, each for 200 epochs.
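Training one model on several languages requires, among other things, a single output vocabulary that covers all of them. The sketch below shows one common way to build a shared character vocabulary for CTC-style fine-tuning. It is an illustration only: the sample sentences, the special tokens (`[PAD]`, `[UNK]`, `|`), and the helper names `build_shared_vocab` and `encode` are assumptions and are not taken from this repository.

```python
from collections import Counter

# Hypothetical sample transcripts; the real corpora are far larger.
transcripts = {
    "indonesian": ["selamat pagi", "terima kasih"],
    "javanese": ["sugeng enjing", "matur nuwun"],
    "sundanese": ["wilujeng enjing", "hatur nuhun"],
}


def build_shared_vocab(corpora):
    """Return a character-to-id mapping covering every language in `corpora`."""
    counts = Counter()
    for sentences in corpora.values():
        for sentence in sentences:
            counts.update(sentence.lower())
    # Assumed special tokens: padding, unknown, and a visible word delimiter
    # that stands in for the space character.
    vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2}
    for ch in sorted(counts):
        if ch == " ":
            continue
        vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text, vocab):
    """Map text to ids, using the delimiter for spaces and [UNK] otherwise."""
    unk = vocab["[UNK]"]
    return [vocab.get("|" if ch == " " else ch, unk) for ch in text.lower()]


vocab = build_shared_vocab(transcripts)
print(len(vocab), encode("matur nuwun", vocab))
```

Because the three languages are written in the same Latin script, the shared vocabulary stays small, which is one reason a single model for all three is plausible.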
Results and Comparison
We built a multilingual speech recognition model and published it as an open-source model. We also provide a live demo for testing the model.
The following sections compare the models and list their performance evaluations:
The Models Comparison
The following figures compare the models by Word Error Rate (WER) on the test split of Indonesian Common Voice 6.1 (lower is better).
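For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition for a single sentence pair. The function name `word_error_rate` and the example sentences are illustrative, and this is not necessarily the evaluation code used in this project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is dropped by the hypothetical model output.
print(word_error_rate("saya suka makan nasi", "saya suka nasi"))
```

A corpus-level WER is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-sentence scores.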
Without Language Model

With Language Model
Lastly, we integrated a language model into our speech recognition pipeline, which reduces the WER from 11.57% to 4.27% on the test split of Indonesian Common Voice 6.1. We also evaluated Google Speech-to-Text on the same test split; its WER is 9.22%.
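To put these numbers in perspective, it helps to separate the absolute change (in percentage points) from the relative change (as a share of the starting WER). The short sketch below does that arithmetic using only the figures reported above; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


wer_without_lm = 11.57  # reported WER without a language model (%)
wer_with_lm = 4.27      # reported WER with a language model (%)
wer_google = 9.22       # reported WER of Google Speech-to-Text (%)

absolute_gain = wer_without_lm - wer_with_lm
relative_gain = relative_reduction(wer_without_lm, wer_with_lm)
gap_to_google = wer_google - wer_with_lm

print(f"absolute: {absolute_gain:.2f} points")
print(f"relative: {relative_gain:.1f}%")
print(f"gap to Google Speech-to-Text: {gap_to_google:.2f} points")
```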

The detail of the performance evaluation
The performance evaluation can be found here
Conclusion
- The experiment shows that the multilingual model performs on par with a model trained on a single language; the Word Error Rate (WER) difference is at most 0.6 percentage points. We also trained the multilingual model for more epochs, and it then outperforms the monolingual model.
- The monolingual model performs very well in the language we trained for but poorly in other languages.
- The multilingual speech recognition model removes the need for a separate model for each language in Indonesia. It therefore reduces hardware requirements and simplifies model deployment.
Future Works
We plan the following future work:
- Training the model with more data and more Indonesian languages.
- ~~Integrating a language model to reduce the WER~~
- Compressing the model to speed up inference and reduce hardware requirements.
- Developing real-time speech recognition based on this multilingual model.
Owner
- Name: indonesian-nlp
- Login: indonesian-nlp
- Kind: organization
- Repositories: 6
- Profile: https://github.com/indonesian-nlp
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Wirawan
    given-names: Cahya
    orcid: https://orcid.org/0000-0002-0263-8273
title: "Multilingual Speech Recognition for Indonesian Languages"
version: 1.0.0
date-released: 2021-10-29
url: "https://github.com/indonesian-nlp/multilingual-asr"
GitHub Events
Total
- Watch event: 12
- Fork event: 2
Last Year
- Watch event: 12
- Fork event: 2