vietnamese-asr-released-model

Vietnamese Automatic Speech Recognition using Wav2vec 2.0

https://github.com/khanld/vietnamese-asr-released-model

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (6.7%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: khanld
Default Branch: main
Size: 40 KB

Statistics

Stars: 4
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 1

Created about 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme Citation

Vietnamese Speech Recognition using Wav2vec 2.0

Model Description
Implementation
Benchmark Result
Example Usage
Evaluation
Citation
Contact

### Model Description

This Wav2vec2-based model was fine-tuned on about 160 hours of Vietnamese speech drawn from several sources, including VIVOS, Common Voice, FOSD and VLSP 100h. No language model has been incorporated into the ASR system yet, but the results are already promising.

### Implementation

We also provide code for pre-training and fine-tuning the Wav2vec2 model. If you wish to train on your own dataset, see:

Pre-train code
Fine-tune code

Benchmark WER Result

| | VIVOS | COMMON VOICE 8.0 |
|---|---|---|
| without LM | 15.05 | 10.78 |
| with LM | in progress | in progress |
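For readers unfamiliar with the metric, word error rate is the word-level edit distance between a hypothesis and its reference, divided by the number of reference words; the table reports it as a percentage. The following is a minimal sketch for a single sentence pair, not the scoring code used for the benchmark. The function name `word_error_rate` and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is missing.
print(100 * word_error_rate("xin chào các bạn", "xin chào bạn"))  # 25.0
```

A WER of 10.78 in the table therefore corresponds to roughly one word error for every nine reference words.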

Example Usage

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)

def transcribe(wav):
    input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
    logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)  # most likely token per frame
    pred_transcript = processor.batch_decode(pred_ids)[0]
    return pred_transcript

# The audio is resampled to 16 kHz on load
wav, _ = librosa.load('path/to/your/audio/file', sr=16000)
print(f"transcript: {transcribe(wav)}")
```
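The `argmax` followed by `batch_decode` in the example amounts to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse step on its own. The toy vocabulary, the frame sequence, and the choice of `blank_id=0` are assumptions for illustration and are not taken from the model's actual tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate frame predictions, then drop blank tokens."""
    collapsed = [token for token, _ in groupby(frame_ids)]
    return [token for token in collapsed if token != blank_id]


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "x", 2: "i", 3: "n"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

ids = ctc_greedy_collapse(frames)
print("".join(vocab[i] for i in ids))  # xin
```

The blank is what allows genuinely repeated characters to survive: the frames `[1, 0, 1]` collapse to `[1, 1]`, whereas `[1, 1]` collapses to `[1]`.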

Evaluation

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio
import torch
import re

# Note: load_metric is deprecated in recent `datasets` releases;
# evaluate.load("wer") from the `evaluate` package is its replacement.
wer = load_metric("wer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

# Load dataset
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token="your_huggingface_auth_token")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
# ignore special characters (the hyphen is escaped so it is matched literally)
chars_to_ignore = r'[,?.!\-;:"“%\'�]'

# preprocess data
def preprocess(batch):
    audio = batch["audio"]
    batch["input_values"] = audio["array"]
    batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
    return batch

# run inference
def inference(batch):
    input_values = processor(batch["input_values"], sampling_rate=16000, return_tensors="pt").input_values
    logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_transcript"] = processor.batch_decode(pred_ids)
    return batch

test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))
```

Test Result: 10.78%
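The reference transcripts are normalized before scoring: punctuation is stripped and the text is lowercased, so that the WER reflects recognition errors rather than formatting differences. The sketch below isolates that step. The function name `normalize` and the sample sentence are hypothetical, and the pattern is written with the hyphen escaped, on the assumption that a literal hyphen was intended; unescaped, `!-;` is read as a character range that also covers digits.

```python
import re

# Punctuation to strip; the escaped hyphen is matched literally.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%\'�]'


def normalize(sentence: str) -> str:
    """Remove ignored punctuation and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize("Xin chào, các bạn!"))  # xin chào các bạn

# Pitfall: an unescaped hyphen between "!" and ";" forms a range
# that silently removes digits as well.
print(re.sub(r"[!-;]", "", "phòng 12"))  # "phòng " (digits removed)
print(normalize("phòng 12"))             # phòng 12
```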

Citation

BibTeX

@misc{Duy_Khanh_Finetune_Wav2vec_2_0_2022,
  author = {Duy Khanh, Le},
  doi = {10.5281/zenodo.6542357},
  license = {CC-BY-NC-4.0},
  month = {5},
  title = {{Finetune Wav2vec 2.0 For Vietnamese Speech Recognition}},
  url = {https://github.com/khanld/ASR-Wa2vec-Finetune},
  year = {2022}
}

APA

Duy Khanh, L. (2022). Finetune Wav2vec 2.0 For Vietnamese Speech Recognition [Data set]. https://doi.org/10.5281/zenodo.6542357

Contact

khanhld218@uef.edu.vn

Owner

Name: Duy Khánh
Login: khanld
Kind: user
Location: VietNam

Website: http://linkedin.com/in/khanhld257
Repositories: 3
Profile: https://github.com/khanld

I hate my job!!!

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Finetune Wav2vec 2.0 For Vietnamese Speech
  Recognition
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: dataset
authors:
  - given-names: Le
    family-names: Khanh
    name-particle: Duy
    email: khanhld218@uef.edu.vn
identifiers:
  - type: doi
    value: 10.5281/zenodo.6542357
repository-code: 'https://github.com/khanld/ASR-Wa2vec-Finetune'
url: >-
  https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h
keywords:
  - audio
  - speech
  - Transformer
  - wav2vec2
  - automatic-speech-recognition
  - vietnamese
date-released: 2022-05-12
doi: 10.5281/zenodo.6542357
license: CC-BY-NC-4.0
