https://github.com/ai4bharat/indicoov

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: AI4Bharat
  • License: cc-by-4.0
  • Default Branch: master
  • Size: 79.1 KB
Statistics
  • Stars: 0
  • Watchers: 5
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme

README.md

IndicOOV: Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies

🎉 Accepted at INTERSPEECH 2024

We release IndicOOV, a benchmark for Hindi and Tamil with ~1300 words, spread over 6-7 categories, consisting of both In-Vocabulary (IV) and Out-of-Vocabulary (OOV) words.
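The IV/OOV distinction can be illustrated with a minimal sketch. It assumes that a word counts as In-Vocabulary if it appears verbatim (after Unicode NFC normalization and whitespace tokenization) in the TTS training transcripts, and as Out-of-Vocabulary otherwise. The function name `split_iv_oov` and the sample inputs are hypothetical and not part of the released benchmark.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply NFC normalization so visually identical strings compare equal."""
    return unicodedata.normalize("NFC", text)


def split_iv_oov(benchmark_words, training_transcripts):
    """Split benchmark words into (IV, OOV) lists.

    Assumption: a word is IV if it occurs as a whitespace-separated
    token in any training transcript, and OOV otherwise.
    """
    vocab = {
        normalize(token)
        for line in training_transcripts
        for token in line.split()
    }
    iv, oov = [], []
    for word in benchmark_words:
        (iv if normalize(word) in vocab else oov).append(word)
    return iv, oov


# Hypothetical example: one training sentence, two benchmark words.
iv_words, oov_words = split_iv_oov(["नमस्ते", "UPI"], ["नमस्ते दुनिया"])
```

Real transcripts would likely need punctuation stripping and language-specific normalization before the lookup; this sketch leaves those steps out.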

We show in the paper that the pronunciation of TTS models with poor vocabulary coverage can be improved by recording and fine-tuning on small amounts of carefully curated data, chosen specifically to improve the bigram coverage of the training data. We release the benchmark to demonstrate the problem and to show the improvements obtained with our proposed solution.

Authors: Srija Anand, Praveen S V, Ashwin Sankar, Giri Raju, Mitesh M. Khapra

Abstract

Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications where domain-specific vocabulary, coupled with frequent code-mixing with English, results in many OOV words. To highlight this problem, we create a benchmark containing OOV words from several real-world applications. Indeed, state-of-the-art Hindi and Tamil TTS systems perform poorly on this OOV benchmark, as indicated by intelligibility tests. To improve the model’s OOV performance, we propose a low-effort and economically viable strategy to obtain more training data. Specifically, we propose using volunteers as opposed to high-quality voice artists to record words containing character bigrams unseen in the training data. We show that using such inexpensive data, the model’s performance improves on OOV words, while not affecting voice quality and in-domain performance.
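The idea of picking words that contain character bigrams unseen in the training data can be sketched as follows. This is an illustration under stated assumptions, not the authors' recording pipeline: "character" is taken to mean a Unicode code point (the paper may define it differently for Indic scripts), selection is a simple greedy pass, and the names `char_bigrams` and `select_words_with_unseen_bigrams` are hypothetical.

```python
def char_bigrams(word: str) -> set:
    """Return the set of adjacent code-point pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def select_words_with_unseen_bigrams(candidate_words, training_words):
    """Greedily pick candidates that add at least one new bigram.

    Assumption: a bigram is 'seen' once it occurs in any training word
    or in a previously selected candidate.
    """
    seen = set()
    for word in training_words:
        seen |= char_bigrams(word)

    selected = []
    for word in candidate_words:
        new_bigrams = char_bigrams(word) - seen
        if new_bigrams:
            selected.append(word)
            seen |= new_bigrams  # later words must contribute something new
    return selected


# Hypothetical example with Latin letters for readability.
picked = select_words_with_unseen_bigrams(["abcd", "bcd", "xy"], ["abc"])
```

Here "bcd" is skipped because its bigrams are already covered once "abcd" has been selected, which keeps the recording list short.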

IndicOOV Benchmark

Hindi

The Hindi benchmark covers 6 categories:

1. abbreviations
2. brands
3. codemixed
4. companynames
5. govtschemes
6. propernouns

Tamil

The Tamil benchmark covers 7 categories:

1. abbreviations
2. brands
3. codemixed
4. propernouns
5. healthcaremedicalterms
6. education_literature
7. navigation

Citation

If you want to use our benchmark, templates or recording scripts, please cite our work as follows:

    @article{anand2024enhancingoutofvocabularyperformanceindian,
      title={Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies},
      author={Srija Anand and Praveen Srinivasa Varadhan and Ashwin Sankar and Giri Raju and Mitesh M. Khapra},
      year={2024},
      eprint={2407.13435},
      url={https://arxiv.org/abs/2407.13435},
    }

We used the implementation of VITS by jaywalnut310 and the Indic-TTS Fastpitch model released by AI4Bharat trained in the Coqui framework.

Owner

  • Name: AI4Bhārat
  • Login: AI4Bharat
  • Kind: organization
  • Email: opensource@ai4bharat.org
  • Location: India

Artificial-Intelligence-For-Bhārat : Building open-source AI solutions for India!

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1