tts-scores

Scripts for computing the Intelligibility and CLVP scores for evaluating TTS models

https://github.com/neonbjb/tts-scores

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Scientific Fields

  • Mathematics (Computer Science) - 84% confidence
  • Artificial Intelligence and Machine Learning (Computer Science) - 69% confidence
Last synced: 4 months ago

Repository

Scripts for computing the Intelligibility and CLVP scores for evaluating TTS models

Basic Info
  • Host: GitHub
  • Owner: neonbjb
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 43 KB
Statistics
  • Stars: 160
  • Watchers: 5
  • Forks: 15
  • Open Issues: 7
  • Releases: 0
Created almost 4 years ago · Last pushed about 2 years ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

TTS Scores - Better evaluation metrics for text to speech models

TTS quality is a difficult thing to measure. Distance-based metrics are poor measurements because they only capture similarity to the test set, not the realism of the generated speech. For this reason, most TTS papers rely on Mean Opinion Scores (MOS) to report model quality. Computing MOS involves humans in the loop, making it costly and time-consuming. More importantly, it cannot be used during training to evaluate a model's performance in real time.

The field of image generation has settled on the Frechet Inception Distance (FID) and Inception Score (IS) metrics to measure live performance, with considerable success. I think we should take a page out of their book, but we can modernize the approach a little:

Installation

tts-scores is available on pypi:

```shell
pip install tts-scores
```

Contrastive Language-Voice Pretrained model (CLVP)

To this end, I trained a CLIP-like architecture with a twist: instead of measuring the similarity of text and images, it measures the similarity of text and voice clips. I call this model CLVP. I believe such a model is an exceptional candidate for synthesizing a quality metric for Text->Voice models, much in the way that the Inception model is used for FID and IS scores.
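To make the idea concrete, here is a minimal sketch of a CLIP-style similarity in PyTorch. This is not the actual CLVP code; `text_emb` and `voice_emb` are stand-ins for whatever the real text and voice encoders output:

```python
import torch
import torch.nn.functional as F

def clip_style_similarity(text_emb: torch.Tensor, voice_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between text and voice embeddings in a shared space.

    In a CLIP-like model, both encoders are trained contrastively so that
    matching (text, voice) pairs score high and mismatched pairs score low.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    voice_emb = F.normalize(voice_emb, dim=-1)
    return (text_emb * voice_emb).sum(dim=-1)
```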

This repo contains the source code for CLVP and scripts that allow you to use it. I have built two metrics:

CLVP Score

The CLVP score measures the distance predicted by CLVP between text and an audio clip in which that text is spoken. A lower score is better. It can be obtained by:

```python
from tts_scores.clvp import CLVPMetric
cv_metric = CLVPMetric(device='cuda')
score = cv_metric.compute_clvp('<path_to_your_tsv>', 'D:\\tmp\\tortoise-tts-eval\\real')
```

Note: the format of the TSV file is described in a later section

CLVP Frechet Distance

Similar to FID, this metric compares the distribution of real spoken text with whatever your TTS model generates. It is particularly useful if you have a corpus of speech that you want to compare against but do not have transcriptions for. For example, it is a good fit for measuring the performance of vocoders.

It works by computing the Frechet distance between the outputs of the last layer of the CLVP model when it is fed data from both distributions. As with FID, a lower score is better. It can be obtained by:

```python
from tts_scores.clvp import CLVPMetric
cv_metric = CLVPMetric(device='cuda')
score = cv_metric.compute_fd('<path_to_your_generated_audio>', '<path_to_your_real_audio>')
```
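Under the hood, the Frechet distance is the standard FID-style computation between Gaussians fitted to the two embedding sets. A minimal sketch with numpy and scipy (the repo's pytorch_fid dependency provides an equivalent routine):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    feats_a, feats_b: (num_clips, embedding_dim) arrays, e.g. last-layer
    CLVP outputs for generated and real audio.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma_a @ sigma_b, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))
```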

TSV format

The TSV input is a tab-separated-value file. Each line must contain a transcript, followed by a tab, followed by a filename. Lines may optionally contain further tab-separated values; only the first two matter:

```
<transcript1><|tab|><filename1><|tab|>....
<transcript2><|tab|><filename2><|tab|>....
...
<transcriptN><|tab|><filenameN><|tab|>....
```
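If you need to produce such a file yourself, a few lines of Python suffice; the transcripts and filenames here are hypothetical placeholders:

```python
# Write an evaluation TSV: one "<transcript>\t<filename>" pair per line.
rows = [
    ("the quick brown fox jumps over the lazy dog", "clips/sample_001.wav"),
    ("hello world", "clips/sample_002.wav"),
]
with open("eval.tsv", "w", encoding="utf-8") as f:
    for transcript, filename in rows:
        f.write(f"{transcript}\t{filename}\n")
```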

wav2vec2 Intelligibility Score

One rather obvious way to measure the performance of a TTS system, which I have not seen used before, is to leverage an ASR system: if the goal is to produce intelligible speech, why not use a speech recognition system to measure that intelligibility?

The intelligibility score packaged in this repo does exactly that. It takes in a list of generated and real audio files and their transcriptions, and feeds everything through a pre-trained wav2vec2 model. The raw losses are returned. The score is the difference between the wav2vec2 losses for the fake/generated samples and the real samples.
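As a rough sketch of that mechanism (not the repo's exact implementation, and assuming the facebook/wav2vec2-base-960h checkpoint), the per-clip loss could be computed with transformers like so:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Assumed checkpoint for illustration; the repo may use a different model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def wav2vec2_loss(waveform, transcript):
    # waveform: 1-D float array of 16 kHz mono audio samples.
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    # This checkpoint's vocabulary is uppercase, so normalize the transcript.
    labels = processor.tokenizer(transcript.upper(), return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(inputs.input_values, labels=labels)
    # Lower CTC loss = the ASR model finds the speech easier to transcribe.
    return out.loss.item()

# The intelligibility score is then the mean loss over generated clips
# minus the mean loss over the corresponding real clips (lower is better).
```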

While CLVP scores take things like voice quality, voice diversity and prosody into account, the intelligibility score only considers whether or not the speech your TTS model generates maps coherently to the text you put into it. For some use cases, this will be the most important score. For others, all of the scores are important.

```python
from tts_scores.intelligibility import IntelligibilityMetric
is_metric = IntelligibilityMetric(device='cuda')
score = is_metric.compute_intelligibility('<path_to_your_tsv>', '<path_to_your_real_audio>')
```

Scores from common models

A metric is only good if there are benchmarks which can be used as points of comparison. To this end, I computed all of the scores in this repo on two high-performance TTS models:

  1. Tacotron2+waveglow from NVIDIA's repo
  2. FastSpeech2+hifigan from ming024's repo

See the scores below:

Citations

Please cite this repo if you use it in your work:

```bibtex
@software{TTS-scores,
  author = {Betker, James},
  month = {4},
  title = {{TTS-scores}},
  url = {https://github.com/neonbjb/tts-scores},
  version = {1.0.0},
  year = {2022}
}
```

Owner

  • Name: James Betker
  • Login: neonbjb
  • Kind: user
  • Location: CO
  • Company: OpenAI

Latent Analyst, Entropy Wrangler

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Betker"
  given-names: "James"
  orcid: "https://orcid.org/0000-0003-3259-4862"
title: "TTS Scores"
version: 1.0.0
date-released: 2022-04-01
url: "https://github.com/neonbjb/tts-scores"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 22
  • Issue comment event: 7
  • Fork event: 1
Last Year
  • Issues event: 2
  • Watch event: 22
  • Issue comment event: 7
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 7
  • Total Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • James Betker (j****r@g****m): 7 commits

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 14
  • Total pull requests: 2
  • Average time to close issues: about 11 hours
  • Average time to close pull requests: 4 months
  • Total issue authors: 10
  • Total pull request authors: 1
  • Average comments per issue: 2.36
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: 11 minutes
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • fakerybakery (5)
  • CamellIyquitous (1)
  • tatujan (1)
  • a1rishav (1)
  • rsxdalv (1)
  • Ahmed-Hossam-Aldeen (1)
  • PiotrDabkowski (1)
  • Ryu1845 (1)
  • Keith-Hon (1)
  • ranjana-creator (1)
Pull Request Authors
  • fakerybakery (2)

Dependencies

requirements.txt pypi
  • einops *
  • inflect *
  • pytorch_fid *
  • requests *
  • scipy *
  • tokenizers *
  • torch *
  • tqdm *
  • transformers *
  • unidecode *
setup.py pypi
  • einops *
  • ffmpeg *
  • inflect *
  • pytorch_fid *
  • requests *
  • scipy *
  • tokenizers *
  • torch >=1.8
  • torchaudio >0.9
  • tqdm *
  • transformers *
  • unidecode *