tacotron-cli
Command-line interface to train Tacotron 2 using .wav <=> .TextGrid pairs.
Statistics
- Stars: 6
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Releases: 5
Topics
Metadata Files
README.md
tacotron-cli
Command-line interface (CLI) to train Tacotron 2 using .wav <=> .TextGrid pairs.
Features
- train phoneme stress separately (ARPAbet/IPA)
- train phoneme tone separately (IPA)
- train phoneme duration separately (IPA)
- train single/multi-speaker
- train/synthesize on CPU or GPU
- synthesis of paragraphs
- copy embeddings from one checkpoint to another
- train using embeddings or one-hot encodings
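To picture what training stress, tone and duration "separately" could mean, the sketch below splits a single IPA symbol into its base phoneme and its marks. It is an illustration only, not tacotron-cli's implementation: the function `split_symbol` and the constant names are hypothetical, the tone and duration characters are the ones listed under Training, and the two stress marks are assumed from the example transcriptions.

```python
# Hypothetical helper, not part of tacotron-cli.
STRESS_MARKS = {"ˈ", "ˌ"}                 # assumed: primary and secondary stress
TONE_MARKS = {"˥", "˦", "˧", "˨", "˩"}    # tone-characters listed in the README
DURATION_MARKS = {"˘", "ˑ", "ː"}          # duration-characters listed in the README


def split_symbol(symbol):
    """Split an IPA symbol into (phoneme, stress, tone, duration)."""
    stress = "".join(c for c in symbol if c in STRESS_MARKS)
    tone = "".join(c for c in symbol if c in TONE_MARKS)
    duration = "".join(c for c in symbol if c in DURATION_MARKS)
    marks = STRESS_MARKS | TONE_MARKS | DURATION_MARKS
    # Everything that is not a mark (including tie bars) stays in the phoneme.
    phoneme = "".join(c for c in symbol if c not in marks)
    return phoneme, stress, tone, duration


if __name__ == "__main__":
    for example in ["ˈɛ", "a˥˩", "ʌː"]:
        print(example, split_symbol(example))
```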
Installation
```sh
pip install tacotron-cli --user
```
Usage
```txt
usage: tacotron-cli [-h] [-v] {create-mels,train,continue-train,validate,synthesize,synthesize-grids,analyze,add-missing-symbols} ...

Command-line interface (CLI) to train Tacotron 2 using .wav <=> .TextGrid pairs.

positional arguments:
  {create-mels,train,continue-train,validate,synthesize,synthesize-grids,analyze,add-missing-symbols}
                        description
    create-mels         create mel-spectrograms from audio files
    train               start training
    continue-train      continue training from a checkpoint
    validate            validate checkpoint(s)
    synthesize          synthesize lines from a file
    synthesize-grids    synthesize .TextGrid files
    analyze             analyze checkpoint
    add-missing-symbols copy missing symbols from one checkpoint to another

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
```
Training
The dataset structure needs to follow the generic format of speech-dataset-parser, i.e., each TextGrid needs to contain a tier in which all phonemes are separated into single intervals, e.g., T|h|i|s| |i|s| |a| |t|e|x|t|.
Tips:
- place stress directly on the vowel of the syllable, e.g. b|ˈo|d|i instead of ˈb|o|d|i (body)
- place tone directly on the vowel of the syllable, e.g. ʈʂʰ|w|a˥˩|n instead of ʈʂʰ|w|a|n˥˩ (串)
  - tone-characters which are considered: ˥ ˦ ˧ ˨ ˩, e.g., ɑ˥˩
- duration-characters which are considered: ˘ ˑ ː, e.g., ʌː
- normalize the text, e.g., numbers should be written out
- substitute spaces by either SIL0, SIL1 or SIL2 depending on the duration of the pause
  - use SIL0 for no pause
  - use SIL1 for a short pause, for example after a comma: ...|v|i|ˈɛ|n|ʌ|,|SIL1|ˈɔ|s|t|ɹ|i|ʌ|...
  - use SIL2 for a longer pause, for example after a sentence: ...|ˈɝ|θ|.|SIL2
- Note: only phonemes occurring in the TextGrids (on the selected tier) can be synthesized
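The pause rules from the tips can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from tacotron-cli: `join_words` is a hypothetical helper, each word is assumed to be a non-empty list of symbols with punctuation as its own trailing symbol, and the punctuation sets that trigger SIL1 and SIL2 are assumptions that go no further than the comma and end-of-sentence cases named above.

```python
# Hypothetical helper, not part of tacotron-cli.
SHORT_PAUSE = {","}            # assumed: punctuation followed by SIL1
LONG_PAUSE = {".", "!", "?"}   # assumed: punctuation followed by SIL2


def join_words(words, sep="|"):
    """Join words (lists of symbols) and replace spaces with SIL markers."""
    symbols = []
    for index, word in enumerate(words):
        if not word:
            continue  # skip empty words instead of failing
        symbols.extend(word)
        last = word[-1]
        if last in LONG_PAUSE:
            symbols.append("SIL2")
        elif last in SHORT_PAUSE:
            symbols.append("SIL1")
        elif index < len(words) - 1:
            # A plain word boundary inside the sentence: no audible pause.
            symbols.append("SIL0")
    return sep.join(symbols)


if __name__ == "__main__":
    print(join_words([["ˈʌ", "p", ","], ["ɪ", "t"], ["ˈɔ", "l", "s", "oʊ"]]))
```

Working on lists of symbols rather than raw strings keeps multi-character symbols such as ˈɔ intact when the line is joined.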
Synthesis
To prepare a text for synthesis, the following points need to be considered:
- each line in the text file will be synthesized as a single file, therefore it is recommended to place each sentence onto a single line
- paragraphs can be separated by a blank line
- symbols can be separated by a separator like |, e.g. s|ˌɪ|ɡ|ɝ|ˈɛ|t
  - this is useful if the model contains phonemes/symbols that consist of multiple characters, e.g., ˈɛ
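The layout described above (blank lines between paragraphs, one utterance per line, symbols split by a separator) can be read with a short parser. The sketch below is only an illustration of that layout; `parse_text` is a hypothetical helper and does not claim to mirror how tacotron-cli reads its input.

```python
# Hypothetical helper, not part of tacotron-cli.
def parse_text(text, sep="|"):
    """Return a list of paragraphs; each paragraph is a list of lines,
    and each line is a list of symbols."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current paragraph (if any).
            if current:
                paragraphs.append(current)
                current = []
            continue
        current.append(line.split(sep))
    if current:
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    sample = "s|ˌɪ|ɡ|ɝ|ˈɛ|t\nɪ|t\n\nˈʌ|p\n"
    for number, paragraph in enumerate(parse_text(sample), start=1):
        print(number, paragraph)
```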
Example valid sentence: "As the overlying plate lifts up, it also forms mountain ranges." => ˈæ|z|SIL0|ð|ʌ|SIL0|ˌoʊ|v|ɝ|l|ˈaɪ|ɪ|ŋ|SIL0|p|l|ˈeɪ|t|SIL0|l|ˈɪ|f|t|s|SIL0|ˈʌ|p|,|SIL1|ɪ|t|SIL0|ˈɔ|l|s|oʊ|SIL0|f|ˈɔ|ɹ|m|z|SIL0|m|ˈaʊ|n|t|ʌ|n|SIL0|ɹ|ˈeɪ|n|d͡ʒ|ʌ|z|.|SIL2
Example invalid sentence: "Digestion is a vital process which involves the breakdown of food into smaller and smaller components, until they can be absorbed and assimilated into the body." => daɪˈʤɛsʧʌn ɪz ʌ ˈvaɪtʌl ˈpɹɑˌsɛs wɪʧ ɪnˈvɑlvz ðʌ ˈbɹeɪkˌdaʊn ʌv fud ˈɪntu ˈsmɔlɝ ænd ˈsmɔlɝ kʌmˈpoʊnʌnts, ʌnˈtɪl ðeɪ kæn bi ʌbˈzɔɹbd ænd ʌˈsɪmʌˌleɪtɪd ˈɪntu ðʌ ˈbɑdi.
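The README does not spell out why the second sentence is invalid. A likely reason is that it is not split into symbols by the separator and still contains plain spaces instead of SIL markers. The sketch below shows how such a line could be checked against the symbols a model knows before synthesis. It is an assumption-laden illustration: `find_unknown_symbols` is a hypothetical helper, and the symbol set is made up for the example rather than read from a checkpoint.

```python
# Hypothetical helper, not part of tacotron-cli.
def find_unknown_symbols(line, known_symbols, sep="|"):
    """Return the symbols of a line that are not in known_symbols,
    in order of first occurrence and without repeats."""
    unknown = []
    for symbol in line.split(sep):
        if symbol not in known_symbols and symbol not in unknown:
            unknown.append(symbol)
    return unknown


if __name__ == "__main__":
    # Made-up symbol set for illustration only.
    known = {"ɪ", "t", "SIL0", "ˈɔ", "l", "s", "oʊ"}
    print(find_unknown_symbols("ɪ|t|SIL0|ˈɔ|l|s|oʊ", known))
    # Without separators, the whole line is treated as one unknown symbol.
    print(find_unknown_symbols("ɪt ˈɔlsoʊ", known))
```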
Pretrained Models
- English
- Chinese
Audio Example
"The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak." Listen here (headphones recommended)
Example Synthesis
To reproduce the audio example from above, you can use the following commands:
```sh
# Create example directory
mkdir ~/example

# Download pre-trained Tacotron model checkpoint
wget https://tuc.cloud/index.php/s/xxFCDMgEk8dZKbp/download/LJS-IPA-101500.pt -O ~/example/checkpoint-tacotron.pt

# Download pre-trained Waveglow model checkpoint
wget https://tuc.cloud/index.php/s/yBRaWz5oHrFwigf/download/LJS-v3-580000.pt -O ~/example/checkpoint-waveglow.pt

# Create text containing phonetic transcription of: "The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak."
cat > ~/example/text.txt << EOF
ð|ʌ|SIL0|n|ˈɔ|ɹ|θ|SIL0|w|ˈɪ|n|d|SIL0|ˈæ|n|d|SIL0|ð|ʌ|SIL0|s|ˈʌ|n|SIL0|w|ɝ|SIL0|d|ɪ|s|p|j|ˈu|t|ɪ|ŋ|SIL0|h|w|ˈɪ|t͡ʃ|SIL0|w|ˈɑ|z|SIL0|ð|ʌ|SIL0|s|t|ɹ|ˈɔ|ŋ|ɝ|,|SIL1|h|w|ˈɛ|n|SIL0|ʌ|SIL0|t|ɹ|ˈæ|v|ʌ|l|ɝ|SIL0|k|ˈeɪ|m|SIL0|ʌ|l|ˈɔ|ŋ|SIL0|ɹ|ˈæ|p|t|SIL0|ɪ|n|SIL0|ʌ|SIL0|w|ˈɔ|ɹ|m|SIL0|k|l|ˈoʊ|k|.|SIL2
EOF

# Synthesize text to mel-spectrogram
tacotron-cli synthesize \
  ~/example/checkpoint-tacotron.pt \
  ~/example/text.txt \
  --sep "|"

# Install waveglow-cli for synthesis of mel-spectrograms
pip install waveglow-cli --user

# Synthesize mel-spectrogram to wav
waveglow-cli synthesize \
  ~/example/checkpoint-waveglow.pt \
  ~/example/text \
  -o

# Resulting wav is written to: ~/example/text/1-1.npy.wav
```
Roadmap
- Outsource method to convert audio files to mel-spectrograms before training
- Better logging
- Provide more pre-trained models
- Adding tests
Development setup
```sh
# update
sudo apt update

# install Python 3.8-3.11 for ensuring that tests can be run
sudo apt install python3-pip \
  python3.8 python3.8-dev python3.8-distutils python3.8-venv \
  python3.9 python3.9-dev python3.9-distutils python3.9-venv \
  python3.10 python3.10-dev python3.10-distutils python3.10-venv \
  python3.11 python3.11-dev python3.11-distutils python3.11-venv

# install pipenv for creation of virtual environments
python3.8 -m pip install pipenv --user

# check out repo
git clone https://github.com/stefantaubert/tacotron.git
cd tacotron

# create virtual environment
python3.8 -m pipenv install --dev
```
Running the tests
```sh
# first install the tool like in "Development setup"
# then, navigate into the directory of the repo (if not already done)
cd tacotron

# activate environment
python3.8 -m pipenv shell

# run tests
tox
```
Final lines of test result output:
```log
py38: commands succeeded
py39: commands succeeded
py310: commands succeeded
py311: commands succeeded
congratulations :)
```
License
MIT License
Acknowledgments
Model code adapted from Nvidia.
Papers:
- Tacotron: Towards End-to-End Speech Synthesis
- Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410
Citation
If you want to cite this repo, you can use the BibTeX-entry generated by GitHub (see About => Cite this repository).
```txt
Taubert, S. (2024). tacotron-cli (Version 0.0.5) [Computer software]. https://doi.org/10.5281/zenodo.10568731
```
Cited by
- Taubert, S., Sternkopf, J., Kahl, S., & Eibl, M. (2022). A Comparison of Text Selection Algorithms for Sequence-to-Sequence Neural TTS. 2022 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), 1–6. https://doi.org/10.1109/ICSPCC55723.2022.9984283
- Albrecht, S., Tamboli, R., Taubert, S., Eibl, M., Rey, G. D., & Schmied, J. (2022). Towards a Vowel Formant Based Quality Metric for Text-to-Speech Systems: Measuring Monophthong Naturalness. 2022 IEEE 9th International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), 1–6. https://doi.org/10.1109/CIVEMSA53371.2022.9853712
Owner
- Name: Stefan Taubert
- Login: stefantaubert
- Kind: user
- Location: Chemnitz, Germany
- Company: Chemnitz University of Technology
- Website: https://stefantaubert.com
- Twitter: Stefan_Taubert
- Repositories: 75
- Profile: https://github.com/stefantaubert
Currently I am working on my PhD about the topic of speech synthesis at Chemnitz University of Technology.
Citation (CITATION.cff)
cff-version: 1.2.0
title: tacotron-cli
abstract: Command-line interface (CLI) to train Tacotron 2 using .wav <=> .TextGrid pairs.
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - email: github@stefantaubert.com
    given-names: Stefan
    family-names: Taubert
    affiliation: Chemnitz University of Technology
    orcid: 'https://orcid.org/0000-0002-4932-2874'
    website: 'https://stefantaubert.com/'
version: 0.0.5
date-released: 2024-01-25
license: MIT
url: https://github.com/stefantaubert/tacotron
doi: 10.5281/zenodo.10568731
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 3
- Total pull requests: 1
- Average time to close issues: 7 months
- Average time to close pull requests: 4 days
- Total issue authors: 3
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- stefantaubert (1)
Pull Request Authors
- jasminsternkopf (1)
Dependencies
- autoflake * develop
- autopep8 * develop
- isort * develop
- pycodestyle * develop
- pylint * develop
- pytest * develop
- rope * develop
- tacotron * develop
- tox * develop
- twine * develop
- librosa *
- matplotlib *
- mel-cepstral-distance >=0.0.1
- numpy *
- ordered-set >=4.1.0
- pandas *
- plotly *
- scikit-image *
- scikit-learn *
- scipy *
- speech-dataset-parser >=0.0.1
- torch *
- tqdm *
- 137 dependencies
- librosa *
- matplotlib *
- mel-cepstral-distance >=0.0.2
- numpy *
- ordered_set >=4.1.0
- pandas *
- plotly *
- scikit-image *
- scikit-learn *
- scipy *
- speech-dataset-parser >=0.0.4
- torch *
- tqdm *