comprehensive-transformer-tts
A Non-Autoregressive Transformer-based Text-to-Speech system, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, aiming to achieve the ultimate TTS.
https://github.com/keonlee9420/comprehensive-transformer-tts
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.3%) to scientific vocabulary
Keywords
Repository
Basic Info
Statistics
- Stars: 325
- Watchers: 12
- Forks: 42
- Open Issues: 10
- Releases: 4
Topics
Metadata Files
README.md
Comprehensive-Transformer-TTS - PyTorch Implementation
A Non-Autoregressive Transformer-based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modeling. This project grows with the research community, aiming to achieve the ultimate TTS. Any suggestions toward the best Non-AR TTS are welcome :)
Transformers
- [x] Fastformer: Additive Attention Can Be All You Need (Wu et al., 2021)
- [x] Long-Short Transformer: Efficient Transformers for Language and Vision (Zhu et al., 2021)
- [x] Conformer: Convolution-augmented Transformer for Speech Recognition (Gulati et al., 2020)
- [x] Reformer: The Efficient Transformer (Kitaev et al., 2020)
- [x] Attention Is All You Need (Vaswani et al., 2017)
Prosody Modelings (WIP)
- [x] DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021 (Liu et al., 2021)
- [x] Rich Prosody Diversity Modelling with Phone-level Mixture Density Network (Du et al., 2021)
Supervised Duration Modelings
- [x] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech (Ren et al., 2020)
Unsupervised Duration Modelings
- [x] One TTS Alignment To Rule Them All (Badlani et al., 2021): We are finally freed from external aligners such as MFA! Validation alignments for LJ014-0329 up to 70K are shown below as an example.
Transformer Performance Comparison on LJSpeech (1 TITAN RTX 24G / 16 batch size)
| Model | Memory Usage | Training Time (1K steps) |
| --- | --- | --- |
| Fastformer (lucidrains') | 10531MiB / 24220MiB | 4m 25s |
| Fastformer (wuch15's) | 10515MiB / 24220MiB | 4m 45s |
| Long-Short Transformer | 10633MiB / 24220MiB | 5m 26s |
| Conformer | 18903MiB / 24220MiB | 7m 4s |
| Reformer | 10293MiB / 24220MiB | 10m 16s |
| Transformer | 7909MiB / 24220MiB | 4m 51s |
| Transformer_fs2 | 11571MiB / 24220MiB | 4m 53s |
Toggle the type of building blocks by

```yaml
# In the model.yaml
block_type: "transformer_fs2" # ["transformer_fs2", "transformer", "fastformer", "lstransformer", "conformer", "reformer"]
```
Toggle the type of prosody modelings by

```yaml
# In the model.yaml
prosody_modeling:
  model_type: "none" # ["none", "du2021", "liu2021"]
```
Toggle the type of duration modelings by

```yaml
# In the model.yaml
duration_modeling:
  learn_alignment: True # True for unsupervised modeling, and False for supervised modeling
```
Quickstart
In the following documentation, DATASET refers to the name of a dataset such as LJSpeech or VCTK.
Dependencies
You can install the Python dependencies with
pip3 install -r requirements.txt
Also, a Dockerfile is provided for Docker users.
Inference
You have to download the pretrained models and put them in output/ckpt/DATASET/. The models are trained with unsupervised duration modeling and the "transformer_fs2" building block.
For a single-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
For a multi-speaker TTS, run
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET
The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
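To pick a valid SPEAKER_ID, you can list the learned speakers from that file. This is a rough sketch, assuming speakers.json is a plain name-to-index mapping (an assumption about the file layout, with VCTK used as the example DATASET):

```python
# Sketch: list the speakers available for --speaker_id.
# Assumes speakers.json maps speaker names to integer IDs.
import json

with open("preprocessed_data/VCTK/speakers.json") as f:  # DATASET = VCTK here
    speakers = json.load(f)

for name, speaker_id in sorted(speakers.items(), key=lambda kv: kv[1]):
    print(speaker_id, name)
```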
Batch Inference
Batch inference is also supported; try
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
to synthesize all utterances in preprocessed_data/DATASET/val.txt.
Controllability
The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios. For example, one can increase the speaking rate by 20% and decrease the volume by 20% with
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8
Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
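Conceptually, these flags act as multiplicative factors on the variance adaptor's predictions. The following is a hedged sketch of that idea under the usual FastSpeech2-style convention of log-domain duration prediction; variable names are illustrative and it is not the repository's exact code.

```python
# Conceptual sketch of pitch/energy/duration control: predicted values are
# scaled by the user-supplied ratios before the decoder consumes them.
import torch

def apply_controls(pitch, energy, log_duration,
                   pitch_control=1.0, energy_control=1.0, duration_control=1.0):
    pitch = pitch * pitch_control
    energy = energy * energy_control
    # Durations are assumed to be predicted in log space, so exponentiate,
    # scale, and round to whole frames (duration_control < 1.0 speeds speech up).
    duration = torch.clamp(torch.round(torch.exp(log_duration) * duration_control), min=0)
    return pitch, energy, duration
```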
Training
Datasets
The supported datasets are
- LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
- VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.
Any single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added following LJSpeech or VCTK, respectively. Moreover, your own language and dataset can be adapted following here.
Preprocessing
- For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and locate it in ./deepspeaker/pretrained_models/.
- Run python3 prepare_align.py --dataset DATASET for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
Pre-extracted alignments for the datasets are provided here.
You have to unzip the files in preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner by yourself.
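For context, the alignments in each TextGrid can be turned into per-phoneme frame durations roughly as follows. This is a hedged sketch using the tgt library (the sampling rate, hop length, tier name, and function name are assumptions, not values read from the project's config):

```python
# Sketch: derive per-phoneme durations (in mel frames) from an MFA TextGrid.
import tgt

SAMPLING_RATE = 22050  # assumption: LJSpeech default
HOP_LENGTH = 256       # assumption: STFT hop size used for mel extraction

def textgrid_to_durations(tg_path):
    textgrid = tgt.io.read_textgrid(tg_path)
    phones_tier = textgrid.get_tier_by_name("phones")  # assumed tier name
    phones, durations = [], []
    for interval in phones_tier.intervals:
        phones.append(interval.text)
        start = int(round(interval.start_time * SAMPLING_RATE / HOP_LENGTH))
        end = int(round(interval.end_time * SAMPLING_RATE / HOP_LENGTH))
        durations.append(end - start)
    return phones, durations
```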
After that, run the preprocessing script by
python3 preprocess.py --dataset DATASET
Training
Train your model with
python3 train.py --dataset DATASET
Useful options:
- To use Automatic Mixed Precision, append the --use_amp argument to the above command (see the sketch after this list).
- The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES=<GPU_IDs> at the beginning of the above command.
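The --use_amp flag typically wraps the forward/backward pass in PyTorch's automatic mixed precision. A minimal, generic training-step sketch (not the repository's trainer; model, optimizer, batch, and the loss computation are placeholders) looks like:

```python
# Generic PyTorch AMP training step for illustration only.
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)  # assumed: model returns a scalar loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```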
TensorBoard
Use
tensorboard --logdir output/log
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
LJSpeech
(TensorBoard screenshots for LJSpeech)
VCTK
(TensorBoard screenshots for VCTK)
Ablation Study
| ID | Model | Block Type | Pitch Conditioning |
| --- | --- | --- | --- |
| 1 | LJSpeech_transformer_fs2_cwt | `transformer_fs2` | continuous wavelet transform |
| 2 | LJSpeech_transformer_cwt | `transformer` | continuous wavelet transform |
| 3 | LJSpeech_transformer_frame | `transformer` | frame-level f0 |
| 4 | LJSpeech_transformer_ph | `transformer` | phoneme-level f0 |
Observations:
1. Changing building block (ID 1~2): "transformer_fs2" seems to be more optimized in terms of memory usage and model size, so the training time and mel losses are decreased. However, the output quality is not improved dramatically, and sometimes the "transformer" block generates speech with an even more stable pitch contour than "transformer_fs2".
2. Changing pitch conditioning (ID 2~4): There is a trade-off between audio quality (pitch stability) and expressiveness.
   - audio quality: "ph" >= "frame" > "cwt"
   - expressiveness: "cwt" > "frame" > "ph"
Notes
- Both phoneme-level and frame-level variance are supported in both supervised and unsupervised duration modeling.
- Note that there are no pre-extracted phoneme-level variance features in unsupervised duration modeling.
- Unsupervised duration modeling at the phoneme level will take longer than at the frame level, since the additional computation of phoneme-level variance is performed at runtime (see the sketch after these notes).
- There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle it in the config (between 'none' and 'DeepSpeaker').
- DeepSpeaker on the VCTK dataset shows clear identification among speakers, as illustrated by a t-SNE plot of the extracted speaker embeddings.
- For vocoder, HiFi-GAN and MelGAN are supported.
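As a rough illustration of the runtime phoneme-level variance computation mentioned above, frame-level values (e.g., pitch) can be pooled over each phoneme's frames using the durations. This is a simplified sketch, not the repository's implementation:

```python
# Sketch: average frame-level pitch over each phoneme's frames using durations.
import torch

def frame_to_phoneme_level(pitch_frames: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """pitch_frames: (T,) frame-level values; durations: (N,) frames per phoneme."""
    values = []
    offset = 0
    for d in durations.tolist():
        d = int(d)
        if d > 0:
            values.append(pitch_frames[offset:offset + d].mean())
        else:
            values.append(pitch_frames.new_tensor(0.0))  # zero for empty phonemes
        offset += d
    return torch.stack(values)  # (N,) phoneme-level values
```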
Updates Log
- Mar.05, 2022 (v0.2.1): Fix and update codebase & pre-trained models with demo samples
- Fix variance adaptor to make it work with all combinations of building block and variance type/level
- Update pre-trained models with demo samples of LJSpeech and VCTK under "transformer_fs2" building block and "cwt" pitch conditioning
- Share the result of ablation studies of comparing "transformer" vs. "transformer_fs2" paired among three types of pitch conditioning ("frame", "ph", and "cwt")
- Feb.18, 2022 (v0.2.0): Update data preprocessor and variance adaptor & losses following keonlee9420's DiffSinger / Add various prosody modeling methods
- Prepare two different types of data pipeline in preprocessor to maximize unsupervised/supervised duration modelings
- Adopt wavelet for pitch modeling & loss
- Add fine-trained duration loss
- Apply var_start_steps for better model convergence, especially under unsupervised duration modeling
- Remove dependency of energy modeling on pitch variance
- Add "transformer_fs2" building block, which is more close to the original FastSpeech2 paper
- Add two types of prosody modeling methods
- Loss comparison on validation set:
- LJSpeech - blue: v0.1.1 / green: v0.2.0
- VCTK - skyblue: v0.1.1 / orange: v0.2.0
- Sep.21, 2021 (v0.1.1): Initialize with ming024's FastSpeech2
Citation
Please cite this repository using the "Cite this repository" option in the About section (top right of the main page).
References
- ming024's FastSpeech2
- wuch15's Fastformer
- lucidrains' fast-transformer-pytorch
- lucidrains' long-short-transformer
- sooftware's conformer
- lucidrains' reformer-pytorch
- sagelywizard's pytorch-mdn
- keonlee9420's RobustFineGrainedProsodyControl
- keonlee9420's Cross-Speaker-Emotion-Transfer
- keonlee9420's DiffSinger
- NVIDIA's NeMo: Special thanks to Onur Babacan and Rafael Valle for unsupervised duration modeling.
Owner
- Name: Keon Lee
- Login: keonlee9420
- Kind: user
- Location: Seoul, Republic of Korea
- Company: KRAFTON Inc.
- Website: keonlee.notion.site
- Twitter: KeonLee26348956
- Repositories: 18
- Profile: https://github.com/keonlee9420
Everything towards Conversational AI
GitHub Events
Total
- Watch event: 6
- Fork event: 2
Last Year
- Watch event: 6
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 20
- Total pull requests: 3
- Average time to close issues: 23 days
- Average time to close pull requests: about 14 hours
- Total issue authors: 18
- Total pull request authors: 2
- Average comments per issue: 2.7
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- godspirit00 (2)
- blx0102 (2)
- dunky11 (1)
- gen35 (1)
- dasstyx (1)
- inconnu11 (1)
- GuangChen2016 (1)
- Stardust-minus (1)
- wizardk (1)
- bondio77 (1)
- vietvq-vbee (1)
- LEECHOONGHO (1)
- gongchenghhu (1)
- ease-zh (1)
- michael-conrad (1)
Pull Request Authors
- dependabot[bot] (2)
- happylittlecat2333 (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- nvcr.io/nvidia/cuda 11.1.1-cudnn8-devel-ubuntu18.04 build