tts-scores

Scripts for computing the Intelligibility and CLVP scores for evaluating TTS models

https://github.com/neonbjb/tts-scores

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Scientific Fields

  • Mathematics (Computer Science) - 84% confidence
  • Artificial Intelligence and Machine Learning (Computer Science) - 69% confidence
Last synced: 4 months ago

Repository

Scripts for computing the Intelligibility and CLVP scores for evaluating TTS models

Basic Info
  • Host: GitHub
  • Owner: neonbjb
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 43 KB
Statistics
  • Stars: 160
  • Watchers: 5
  • Forks: 15
  • Open Issues: 7
  • Releases: 0
Created almost 4 years ago · Last pushed about 2 years ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

TTS Scores - Better evaluation metrics for text to speech models

TTS quality is a difficult thing to measure. Distance-based metrics are poor measurements because they only capture similarity to the test set, not the realism of the generated speech. For this reason, most TTS papers rely on Mean Opinion Scores (MOS) to report model quality. Computing MOS involves humans in the loop, making it costly and time-consuming. More importantly, it cannot be used during training to evaluate a model's performance in real time.

The field of image generation has settled on the Frechet Inception Distance (FID) and Inception Score (IS) metrics to measure live performance, with considerable success. I think we should take a page out of their book, but we can modernize the approach a little:

Installation

tts-scores is available on pypi:

```shell
pip install tts-scores
```

Contrastive Language-Voice Pretrained model (CLVP)

To this end, I trained a CLIP-like architecture with a twist: instead of measuring the similarity of text and images, it measures the similarity of text and voice clips. I call this model CLVP. I believe such a model is an exceptional candidate for synthesizing a quality metric for Text->Voice models, much in the way that the Inception model is used for FID and IS scores.
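To make the idea concrete, here is a minimal sketch of a CLIP-style similarity in PyTorch. This is not the actual CLVP code; `text_emb` and `voice_emb` are stand-ins for whatever the real text and voice encoders output:

```python
import torch
import torch.nn.functional as F

def clip_style_similarity(text_emb: torch.Tensor, voice_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between text and voice embeddings in a shared space.

    In a CLIP-like model, both encoders are trained contrastively so that
    matching (text, voice) pairs score high and mismatched pairs score low.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    voice_emb = F.normalize(voice_emb, dim=-1)
    return (text_emb * voice_emb).sum(dim=-1)
```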

This repo contains the source code for CLVP and scripts that allow you to use it. I have built two metrics:

CLVP Score

The CLVP score measures the distance predicted by CLVP between text and an audio clip in which that text is spoken. A lower score is better. It can be obtained by:

```python
from tts_scores.clvp import CLVPMetric
cv_metric = CLVPMetric(device='cuda')
score = cv_metric.compute_clvp('<path_to_your_tsv>', 'D:\\tmp\\tortoise-tts-eval\\real')
```

Note: the format of the TSV file is described in a later section

CLVP Frechet Distance

Similar to FID, this metric compares the distribution of real spoken text with whatever your TTS model generates. It is particularly useful if you have a corpus of speech that you want to compare against but do not have transcriptions for. For example, it is a good fit for measuring the performance of vocoders.

It works by computing the Frechet distance between the outputs of the last layer of the CLVP model when it is fed data from both distributions. As with FID, a lower score is better. It can be obtained by:

```python
from tts_scores.clvp import CLVPMetric
cv_metric = CLVPMetric(device='cuda')
score = cv_metric.compute_fd('<path_to_your_generated_audio>', '<path_to_your_real_audio>')
```
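Under the hood, the Frechet distance is the standard FID-style computation between Gaussians fitted to the two embedding sets. A minimal sketch with numpy and scipy (the repo's pytorch_fid dependency provides an equivalent routine):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    feats_a, feats_b: (num_clips, embedding_dim) arrays, e.g. last-layer
    CLVP outputs for generated and real audio.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma_a @ sigma_b, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sigma_a + sigma_b - 2.0 * covmean))
```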

TSV format

The TSV input is a tab-separated-value file. Each line must contain a transcript, followed by a tab, followed by a filename. Lines may optionally contain further tab-separated values; only the first two matter:

```
<transcript1><|tab|><filename1><|tab|>....
<transcript2><|tab|><filename2><|tab|>....
...
<transcriptN><|tab|><filenameN><|tab|>....
```
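If you need to produce such a file yourself, a few lines of Python suffice; the transcripts and filenames here are hypothetical placeholders:

```python
# Write an evaluation TSV: one "<transcript>\t<filename>" pair per line.
rows = [
    ("the quick brown fox jumps over the lazy dog", "clips/sample_001.wav"),
    ("hello world", "clips/sample_002.wav"),
]
with open("eval.tsv", "w", encoding="utf-8") as f:
    for transcript, filename in rows:
        f.write(f"{transcript}\t{filename}\n")
```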

wav2vec2 Intelligibility Score

One rather obvious way to measure the performance of a TTS system, which I have not seen used before, is to leverage an ASR system: if the goal is to produce intelligible speech, why not use a speech recognition system to measure that intelligibility?

The intelligibility score packaged in this repo does exactly that. It takes in a list of generated and real audio files and their transcriptions, and feeds everything through a pre-trained wav2vec2 model. The raw losses are returned. The score is the difference between the wav2vec2 losses for the fake/generated samples and the real samples.
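As a rough sketch of that mechanism (not the repo's exact implementation, and assuming the facebook/wav2vec2-base-960h checkpoint), the per-clip loss could be computed with transformers like so:

```python
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Assumed checkpoint for illustration; the repo may use a different model.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def wav2vec2_loss(waveform, transcript):
    # waveform: 1-D float array of 16 kHz mono audio samples.
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    # This checkpoint's vocabulary is uppercase, so normalize the transcript.
    labels = processor.tokenizer(transcript.upper(), return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(inputs.input_values, labels=labels)
    # Lower CTC loss = the ASR model finds the speech easier to transcribe.
    return out.loss.item()

# The intelligibility score is then the mean loss over generated clips
# minus the mean loss over the corresponding real clips (lower is better).
```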

While CLVP scores take things like voice quality, voice diversity and prosody into account, the intelligibility score only considers whether or not the speech your TTS model generates maps coherently to the text you put into it. For some use cases, this will be the most important score. For others, all of the scores are important.

```python
from tts_scores.intelligibility import IntelligibilityMetric
is_metric = IntelligibilityMetric(device='cuda')
score = is_metric.compute_intelligibility('<path_to_your_tsv>', '<path_to_your_real_audio>')
```

Scores from common models

A metric is only good if there are benchmarks which can be used as points of comparison. To this end, I computed all of the scores in this repo on two high-performance TTS models:

  1. Tacotron2+waveglow from NVIDIA's repo
  2. FastSpeech2+hifigan from ming024's repo

See the scores below:

Citations

Please cite this repo if you use it in your work:

```bibtex
@software{TTS-scores,
  author = {Betker, James},
  month = {4},
  title = {{TTS-scores}},
  url = {https://github.com/neonbjb/tts-scores},
  version = {1.0.0},
  year = {2022}
}
```

Owner

  • Name: James Betker
  • Login: neonbjb
  • Kind: user
  • Location: CO
  • Company: OpenAI

Latent Analyst, Entropy Wrangler

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Betker"
  given-names: "James"
  orcid: "https://orcid.org/0000-0003-3259-4862"
title: "TTS Scores"
version: 1.0.0
date-released: 2022-04-01
url: "https://github.com/neonbjb/tts-scores"

GitHub Events

Total
  • Issues event: 2
  • Watch event: 22
  • Issue comment event: 7
  • Fork event: 1
Last Year
  • Issues event: 2
  • Watch event: 22
  • Issue comment event: 7
  • Fork event: 1

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 7
  • Total Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • James Betker (j****r@g****m): 7 commits

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 14
  • Total pull requests: 2
  • Average time to close issues: about 11 hours
  • Average time to close pull requests: 4 months
  • Total issue authors: 10
  • Total pull request authors: 1
  • Average comments per issue: 2.36
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: 11 minutes
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • fakerybakery (5)
  • CamellIyquitous (1)
  • tatujan (1)
  • a1rishav (1)
  • rsxdalv (1)
  • Ahmed-Hossam-Aldeen (1)
  • PiotrDabkowski (1)
  • Ryu1845 (1)
  • Keith-Hon (1)
  • ranjana-creator (1)
Pull Request Authors
  • fakerybakery (2)

Dependencies

requirements.txt pypi
  • einops *
  • inflect *
  • pytorch_fid *
  • requests *
  • scipy *
  • tokenizers *
  • torch *
  • tqdm *
  • transformers *
  • unidecode *
setup.py pypi
  • einops *
  • ffmpeg *
  • inflect *
  • pytorch_fid *
  • requests *
  • scipy *
  • tokenizers *
  • torch >=1.8
  • torchaudio >0.9
  • tqdm *
  • transformers *
  • unidecode *