vietnamese-asr-released-model
Vietnamese Automatic Speech Recognition using Wav2vec 2.0
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (6.7%) to scientific vocabulary
Repository
Vietnamese Automatic Speech Recognition using Wav2vec 2.0
Basic Info
- Host: GitHub
- Owner: khanld
- Default Branch: main
- Size: 40 KB
Statistics
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Vietnamese Speech Recognition using Wav2vec 2.0
Table of contents
- Model Description
- Implementation
- Benchmark Result
- Example Usage
- Evaluation
- Citation
- Contact

### Model Description
This Wav2vec2-based model was fine-tuned on about 160 hours of Vietnamese speech from several sources, including VIVOS, COMMON VOICE, FOSD and VLSP 100h. We have not yet incorporated a language model (LM) into our ASR system, but the results are already promising.

### Implementation
We also provide code for pre-training and fine-tuning the Wav2vec2 model. If you wish to train on your own dataset, check it out here:
- Pre-train code
- Fine-tune code
Benchmark WER Result
| | VIVOS | COMMON VOICE 8.0 |
|---|---|---|
| without LM | 15.05 | 10.78 |
| with LM | in progress | in progress |
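For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. The following is a minimal sketch for a single utterance pair, not the implementation used to produce the benchmark numbers; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("chào" -> "chao") and one deletion ("nam") over 4 reference words
print(word_error_rate("xin chào việt nam", "xin chao việt"))  # 0.5
```

A corpus-level score is usually obtained by summing the errors over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.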
Example Usage 
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)

def transcribe(wav):
    # Convert the raw 16 kHz waveform into model inputs
    input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
    logits = model(input_values.to(device)).logits
    # Greedy decoding: take the most likely token for every frame
    pred_ids = torch.argmax(logits, dim=-1)
    pred_transcript = processor.batch_decode(pred_ids)[0]
    return pred_transcript

# Load the audio file and resample it to 16 kHz
wav, _ = librosa.load('path/to/your/audio/file', sr=16000)
print(f"transcript: {transcribe(wav)}")
```
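The `argmax` followed by `batch_decode` in the example above amounts to greedy CTC decoding: consecutive repeated token ids are collapsed, blank tokens are dropped, and the word delimiter is turned into a space. The sketch below illustrates that idea in plain Python. The toy vocabulary, the token ids, and the function name `ctc_greedy_decode` are assumptions made for illustration and do not reflect the released model's actual vocabulary.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the remaining tokens."""
    # Step 1: merge runs of identical ids (one id per run)
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the CTC blank token
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Step 3: map the word delimiter to a space
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank
vocab = {0: "<pad>", 1: "|", 2: "c", 3: "h", 4: "à", 5: "o"}

# Per-frame argmax ids, as they might come out of the acoustic model
frames = [2, 2, 0, 3, 3, 4, 0, 0, 5, 5]
print(ctc_greedy_decode(frames, vocab))  # chào
```

A blank between two identical ids keeps them as separate characters, so `[5, 0, 5]` decodes to `oo` while `[5, 5]` decodes to `o`.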
Evaluation 
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio
import torch
import re

# Note: load_metric is deprecated in recent `datasets` releases;
# the `evaluate` library offers the same "wer" metric.
wer = load_metric("wer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

# Load dataset
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token="your_huggingface_auth_token")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
chars_to_ignore = r'[,?.!\-;:"“%\'�]'  # ignore special characters

# preprocess data
def preprocess(batch):
    audio = batch["audio"]
    batch["input_values"] = audio["array"]
    batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
    return batch

# run inference
def inference(batch):
    input_values = processor(batch["input_values"], sampling_rate=16000, return_tensors="pt").input_values
    logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_transcript"] = processor.batch_decode(pred_ids)
    return batch

test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))
```

Test Result: 10.78%
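The preprocessing step in the evaluation script normalizes each reference sentence by stripping punctuation and lowercasing it before WER is computed. The sketch below isolates that step so its effect is easy to inspect. It assumes the hyphen inside the character class is escaped (`\-`): an unescaped `!-;` would be read as a character range and would also strip digits. The function name `normalize_transcript` is hypothetical.

```python
import re

# Punctuation and special characters removed from reference transcripts.
# The hyphen is escaped so it is matched literally, not as a range.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%\'�]'


def normalize_transcript(sentence: str) -> str:
    """Strip the ignored characters and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_transcript("Xin chào, Việt Nam!"))  # xin chào việt nam
print(normalize_transcript("Năm 2022."))            # năm 2022
```

Whatever normalization is applied to the references should match the character set the model can emit, otherwise punctuation and casing differences are counted as errors.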
Citation
BibTeX
@misc{Duy_Khanh_Finetune_Wav2vec_2_0_2022,
author = {Duy Khanh, Le},
doi = {10.5281/zenodo.6542357},
license = {CC-BY-NC-4.0},
month = {5},
title = {{Finetune Wav2vec 2.0 For Vietnamese Speech Recognition}},
url = {https://github.com/khanld/ASR-Wa2vec-Finetune},
year = {2022}
}
APA
Duy Khanh, L. (2022). Finetune Wav2vec 2.0 For Vietnamese Speech Recognition [Data set]. https://doi.org/10.5281/zenodo.6542357
Contact
Owner
- Name: Duy Khánh
- Login: khanld
- Kind: user
- Location: VietNam
- Website: http://linkedin.com/in/khanhld257
- Repositories: 3
- Profile: https://github.com/khanld
I hate my job!!!
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Finetune Wav2vec 2.0 For Vietnamese Speech
  Recognition
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: dataset
authors:
  - given-names: Le
    family-names: Khanh
    name-particle: Duy
    email: khanhld218@uef.edu.vn
identifiers:
  - type: doi
    value: 10.5281/zenodo.6542357
repository-code: 'https://github.com/khanld/ASR-Wa2vec-Finetune'
url: >-
  https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h
keywords:
  - audio
  - speech
  - Transformer
  - wav2vec2
  - automatic-speech-recognition
  - vietnamese
date-released: 2022-05-12
doi: 10.5281/zenodo.6542357
license: CC-BY-NC-4.0
GitHub Events
Total
- Watch event: 3
Last Year
- Watch event: 3