comprehensive-e2e-tts

A Non-Autoregressive End-to-End Text-to-Speech model (text-to-wav), supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS.

https://github.com/keonlee9420/comprehensive-e2e-tts

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary

Keywords

deep-learning end-to-end fastspeech2 hifi-gan jets multi-speaker neural-tts non-ar non-autoregressive pytorch single-speaker sota speech-synthesis text-to-speech text-to-wav tts ultimate-tts unsupervised
Last synced: 6 months ago

Repository

A Non-Autoregressive End-to-End Text-to-Speech model (text-to-wav), supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS.

Basic Info
  • Host: GitHub
  • Owner: keonlee9420
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 3.45 MB
Statistics
  • Stars: 146
  • Watchers: 10
  • Forks: 19
  • Open Issues: 4
  • Releases: 0
Topics
deep-learning end-to-end fastspeech2 hifi-gan jets multi-speaker neural-tts non-ar non-autoregressive pytorch single-speaker sota speech-synthesis text-to-speech text-to-wav tts ultimate-tts unsupervised
Created almost 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme · License · Citation

README.md

Comprehensive-E2E-TTS - PyTorch Implementation

A Non-Autoregressive End-to-End Text-to-Speech model (generating a waveform directly from text), supporting a family of SOTA unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS. Any suggestions toward the best End-to-End TTS are welcome :)

Architecture Design

Linguistic Encoder

Audio Upsampler

Duration Modeling

Quickstart

In the following sections, DATASET refers to the name of a dataset such as LJSpeech or VCTK.

Dependencies

You can install the Python dependencies with pip3 install -r requirements.txt. A Dockerfile is also provided for Docker users.
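For Docker users, a typical workflow is to build an image from the provided Dockerfile and run it with GPU access; the image tag below is arbitrary, and --gpus all assumes the NVIDIA Container Toolkit is installed:

  docker build -t comprehensive-e2e-tts .
  docker run --gpus all -it comprehensive-e2e-tts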

Inference

You have to download the pretrained models (to be shared soon) and put them in output/ckpt/DATASET/.

For a single-speaker TTS, run:

  python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run:

  python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
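To pick a valid SPEAKER_ID for the multi-speaker command, one option is to inspect preprocessed_data/DATASET/speakers.json programmatically. A minimal sketch, assuming the file maps speaker names to integer IDs (the exact schema may differ in this repository):

  import json

  # speakers.json lives under the preprocessed data directory described above
  with open("preprocessed_data/VCTK/speakers.json") as f:
      speakers = json.load(f)  # assumed: {speaker name: integer ID}

  # Print a few available speakers and the IDs to pass via --speaker_id
  for name, speaker_id in sorted(speakers.items())[:5]:
      print(name, speaker_id)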

Batch Inference

Batch inference is also supported. To synthesize all utterances in preprocessed_data/DATASET/val.txt, run:

  python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

Controllability

The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with:

  python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
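The control flags correspond to simple multiplicative scaling of the model's predicted variances, in the usual FastSpeech2 style. A minimal sketch of the idea, using hypothetical variable and function names rather than this repository's actual code:

  import numpy as np

  def apply_controls(durations, pitch, energy, d_ratio=1.0, p_ratio=1.0, e_ratio=1.0):
      # durations: predicted frames per phoneme; a ratio < 1.0 shortens them (faster speech)
      durations = np.round(durations * d_ratio).astype(int)
      # pitch and energy contours are scaled element-wise before being re-embedded
      return durations, pitch * p_ratio, energy * e_ratio

  # --duration_control 0.8 --energy_control 0.8 corresponds roughly to
  # apply_controls(d, p, e, d_ratio=0.8, e_ratio=0.8)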

Training

Datasets

The supported datasets are

  • LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
  • VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, selected from a newspaper, the rainbow passage, and an elicitation paragraph used for the speech accent archive.

Any other single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following LJSpeech or VCTK, respectively. Moreover, your own language and dataset can be adapted following here.

Preprocessing

Training

Train your model with:

  python3 train.py --dataset DATASET

Useful options:
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, prepend CUDA_VISIBLE_DEVICES=<GPU_IDs> to the above command.
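For example, to train on LJSpeech using only the first two GPUs (the GPU indices here are just an illustration):

  CUDA_VISIBLE_DEVICES=0,1 python3 train.py --dataset LJSpeech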

TensorBoard

Use tensorboard --logdir output/log to serve TensorBoard on your localhost.

Notes

  • There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config (between 'none' and 'DeepSpeaker').
  • DeepSpeaker on the VCTK dataset shows clear identification among speakers. The following figure shows a t-SNE plot of the extracted speaker embeddings (a minimal sketch of producing such a plot appears after this list).
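As a rough sketch of how such a visualization can be produced from saved speaker embeddings (the file paths and embedding format below are assumptions, not this repository's actual layout):

  import numpy as np
  import matplotlib.pyplot as plt
  from sklearn.manifold import TSNE

  # Assumed: one embedding vector per utterance plus a parallel array of integer speaker IDs
  embeddings = np.load("speaker_embeddings.npy")  # shape: (num_utterances, dim)
  labels = np.load("speaker_labels.npy")          # shape: (num_utterances,)

  # Project the embeddings to 2D for plotting
  points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(embeddings)

  for speaker in np.unique(labels):
      mask = labels == speaker
      plt.scatter(points[mask, 0], points[mask, 1], s=5, label=str(speaker))
  plt.title("t-SNE of speaker embeddings")
  plt.savefig("tsne_speakers.png")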

Citation

Please cite this repository using the "Cite this repository" button in the About section (top right of the main page).

References

Owner

  • Name: Keon Lee
  • Login: keonlee9420
  • Kind: user
  • Location: Seoul, Republic of Korea
  • Company: KRAFTON Inc.

Everything towards Conversational AI

Citation (CITATION.cff)

cff-version: 1.0.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Lee"
  given-names: "Keon"
  orcid: "https://orcid.org/0000-0001-9028-1018"
title: "Comprehensive-E2E-TTS"
version: 0.1.0
doi: ___
date-released: 2022-05-06
url: "https://github.com/keonlee9420/Comprehensive-E2E-TTS"

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Dependencies

requirements.txt pypi
  • PyYAML ==5.4.1
  • g2p-en ==2.1.0
  • inflect ==4.1.0
  • librosa ==0.7.2
  • matplotlib ==3.2.2
  • numba ==0.48
  • numpy ==1.19.2
  • pandas ==1.1.5
  • pillow ==8.3.2
  • praat-parselmouth ==0.4.1
  • pycwt ==0.3.0a22
  • pypinyin ==0.39.0
  • python_speech_features ==0.6
  • pyworld ==0.3.0
  • scikit-learn ==0.23.2
  • scipy ==1.5.0
  • soundfile ==0.10.3.post1
  • tensorboard ==2.5
  • tensorflow ==2.5.1
  • tgt ==1.4.4
  • torch ==1.7.0
  • torchaudio ==0.7.0
  • tqdm ==4.46.1
  • unidecode ==1.1.1