vietnamese-asr-released-model

Vietnamese Automatic Speech Recognition using Wav2vec 2.0

https://github.com/khanld/vietnamese-asr-released-model

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.7%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: khanld
  • Default Branch: main
  • Size: 40 KB
Statistics
  • Stars: 4
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created about 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme Citation

README.md

Vietnamese Speech Recognition using Wav2vec 2.0

Table of contents

  1. Model Description
  2. Implementation
  3. Benchmark Result
  4. Example Usage
  5. Evaluation
  6. Citation
  7. Contact

### Model Description

This Wav2vec2-based model was fine-tuned on about 160 hours of Vietnamese speech from several sources, including VIVOS, COMMON VOICE, FOSD and VLSP 100h. We have not yet incorporated a language model (LM) into our ASR system, but the results are already promising.

### Implementation

We also provide code for pre-training and fine-tuning the Wav2vec2 model. If you wish to train on your own dataset, check it out here:

  • Pre-train code
  • Fine-tune code

Benchmark WER Result

| | VIVOS | COMMON VOICE 8.0 |
|---|---|---|
| without LM | 15.05 | 10.78 |
| with LM | in progress | in progress |
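For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words; the table reports that ratio multiplied by 100. The sketch below is an illustrative, dependency-free implementation. The function name `word_error_rate` and the example sentences are hypothetical and are not part of this repository, which uses the `wer` metric from the `datasets` library in its evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words (Levenshtein dynamic programming).
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


# Made-up sentences: one substituted word out of four reference words.
score = word_error_rate("xin chào các bạn", "xin chào cá bạn")
print(f"WER: {100 * score:.2f}")  # prints "WER: 25.00"
```

Only the current and previous rows of the distance table are kept, so memory grows with the hypothesis length rather than with the product of both lengths.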

Example Usage

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)

def transcribe(wav):
    # wav: 1-D float array sampled at 16 kHz
    input_values = processor(wav, sampling_rate=16000, return_tensors="pt").input_values
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    pred_transcript = processor.batch_decode(pred_ids)[0]
    return pred_transcript

# librosa resamples the file to 16 kHz on load
wav, _ = librosa.load('path/to/your/audio/file', sr=16000)
print(f"transcript: {transcribe(wav)}")
```
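The `torch.argmax` and `batch_decode` steps in the example amount to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The following is a simplified sketch of that idea in plain Python. The toy vocabulary, the blank id `0`, and the function name `greedy_ctc_decode` are assumptions made for illustration; they do not reflect the model's real tokenizer, which also handles details such as a word delimiter token.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 is assumed to be the CTC blank token.
VOCAB = {0: "", 1: "c", 2: "h", 3: "à", 4: "o"}
BLANK_ID = 0


def greedy_ctc_decode(frame_ids):
    """Collapse consecutive repeated ids, then drop blanks and map ids to characters."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(VOCAB[token_id] for token_id in collapsed if token_id != BLANK_ID)


# Per-frame argmax ids for one utterance (made-up values).
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4]
print(greedy_ctc_decode(frames))  # prints "chào"
```

Repeats are collapsed before blanks are removed, which is why a blank between two identical ids (for example `[4, 0, 4]`) yields a doubled character rather than a single one.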

Evaluation

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset, load_metric, Audio
import torch
import re

# Note: load_metric is deprecated in recent releases of datasets;
# the evaluate library (evaluate.load("wer")) is its replacement.
wer = load_metric("wer")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load processor and model
processor = Wav2Vec2Processor.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model = Wav2Vec2ForCTC.from_pretrained("khanhld/wav2vec2-base-vietnamese-160h")
model.to(device)
model.eval()

# Load dataset
test_dataset = load_dataset("mozilla-foundation/common_voice_8_0", "vi", split="test", use_auth_token="your_huggingface_auth_token")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))
chars_to_ignore = r'[,?.!\-;:"“%\'�]'  # ignore special characters (hyphen escaped so it is literal)

# preprocess data
def preprocess(batch):
    audio = batch["audio"]
    batch["input_values"] = audio["array"]
    batch["transcript"] = re.sub(chars_to_ignore, '', batch["sentence"]).lower()
    return batch

# run inference
def inference(batch):
    input_values = processor(batch["input_values"], sampling_rate=16000, return_tensors="pt").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_transcript"] = processor.batch_decode(pred_ids)
    return batch

test_dataset = test_dataset.map(preprocess)
result = test_dataset.map(inference, batched=True, batch_size=1)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_transcript"], references=result["transcript"])))
```

Test Result: 10.78%
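Because WER compares words exactly, the reference transcripts are stripped of punctuation and lowercased before scoring. The sketch below isolates that normalization step on a made-up sentence. It assumes the hyphen in the character class is meant literally and therefore escapes it, since an unescaped `!-;` would be read as a character range that also covers digits. The names `CHARS_TO_IGNORE` and `normalize` are illustrative only.

```python
import re

# Assumption: the hyphen is intended as a literal character, so it is escaped.
CHARS_TO_IGNORE = r'[,?.!\-;:"“%\'�]'


def normalize(sentence: str) -> str:
    """Remove the ignored punctuation and lowercase the sentence."""
    return re.sub(CHARS_TO_IGNORE, '', sentence).lower()


# Made-up sentence for illustration.
print(normalize('Xin chào, các bạn! Bây giờ là 10 giờ.'))
# prints "xin chào các bạn bây giờ là 10 giờ"

# With an unescaped "!-;" the class is a range (0x21 to 0x3B) that also matches digits.
print(re.sub(r'[!-;]', '', '10 giờ'))  # prints " giờ"
```

Applying the same normalization to references is what keeps the reported WER from being inflated by punctuation or casing differences that the model never emits.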

Citation

BibTeX

@misc{Duy_Khanh_Finetune_Wav2vec_2_0_2022,
  author = {Duy Khanh, Le},
  doi = {10.5281/zenodo.6542357},
  license = {CC-BY-NC-4.0},
  month = {5},
  title = {{Finetune Wav2vec 2.0 For Vietnamese Speech Recognition}},
  url = {https://github.com/khanld/ASR-Wa2vec-Finetune},
  year = {2022}
}

APA

Duy Khanh, L. (2022). Finetune Wav2vec 2.0 For Vietnamese Speech Recognition [Data set]. https://doi.org/10.5281/zenodo.6542357

Contact

  • khanhld218@uef.edu.vn
  • GitHub
  • LinkedIn

Owner

  • Name: Duy Khánh
  • Login: khanld
  • Kind: user
  • Location: VietNam

I hate my job!!!

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Finetune Wav2vec 2.0 For Vietnamese Speech
  Recognition
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: dataset
authors:
  - given-names: Le
    family-names: Khanh
    name-particle: Duy
    email: khanhld218@uef.edu.vn
identifiers:
  - type: doi
    value: 10.5281/zenodo.6542357
repository-code: 'https://github.com/khanld/ASR-Wa2vec-Finetune'
url: >-
  https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h
keywords:
  - audio
  - speech
  - Transformer
  - wav2vec2
  - automatic-speech-recognition
  - vietnamese
date-released: 2022-05-12
doi: 10.5281/zenodo.6542357
license: CC-BY-NC-4.0

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3