emospeech
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.2%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: deepvk
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 1.14 MB
Statistics
- Stars: 123
- Watchers: 6
- Forks: 13
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech
How to run
Build env
You can build an environment with Docker or Conda.
To set up the environment with Docker
If you don't have Docker installed, please follow the links to find installation instructions for Ubuntu, Mac or Windows.
Build docker image:
docker build -t emospeech .
Run docker image:
bash run_docker.sh
To set up the environment with Conda
If you don't have Conda installed, please find the installation instructions for your OS here.
conda create -n etts python=3.10
conda activate etts
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
If you have a different version of CUDA on your machine, you can find the applicable PyTorch installation command here.
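After installing, a quick sanity check (not part of this repo) confirms that PyTorch sees your GPU and reports the CUDA version it was built against:

```python
# Sanity check: PyTorch build and GPU visibility (not part of the repo).
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Built for CUDA:  {torch.version.cuda}")
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
```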
Download and preprocess data
We used data from the 10 English speakers of the ESD dataset. To download all .wav and .txt files, along with the .TextGrid files created using MFA:
bash download_data.sh
To train a model, we need precomputed duration, energy, pitch, and eGeMAPS features. From the src directory, run:
python -m src.preprocess.preprocess
This is how your data folder should look:
.
├── data
│   ├── ssw_esd
│   ├── test_ids.txt
│   ├── val_ids.txt
│   └── preprocessed
│       ├── duration
│       ├── egemap
│       ├── energy
│       ├── mel
│       ├── phones.json
│       ├── pitch
│       ├── stats.json
│       ├── test.txt
│       ├── train.txt
│       ├── trimmed_wav
│       └── val.txt
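A minimal sketch (not part of the repo) to sanity-check that preprocessing produced this layout before you start training; adjust the root path if your data lives elsewhere:

```python
# verify_preprocessed.py - confirm the expected outputs of src.preprocess.preprocess exist.
from pathlib import Path

root = Path("data/preprocessed")
expected_dirs = ["duration", "egemap", "energy", "mel", "pitch", "trimmed_wav"]
expected_files = ["phones.json", "stats.json", "test.txt", "train.txt", "val.txt"]

missing = [d for d in expected_dirs if not (root / d).is_dir()]
missing += [f for f in expected_files if not (root / f).is_file()]
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Preprocessed data layout looks complete.")
```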
Training
1. Configure arguments in config/config.py.
2. Run python -m src.scripts.train.
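The dependency list pins pyrallis, a library that maps dataclasses to config files and CLI arguments. A hypothetical sketch of that pattern (the field names are illustrative, not the actual contents of config/config.py):

```python
# Hypothetical pyrallis-style config; see config/config.py for the real fields.
from dataclasses import dataclass

import pyrallis


@dataclass
class TrainConfig:
    data_dir: str = "data/preprocessed"  # where preprocessing wrote its outputs
    batch_size: int = 16                 # illustrative default
    learning_rate: float = 1e-4          # illustrative default


if __name__ == "__main__":
    # pyrallis.parse also accepts --config_path and per-field CLI overrides.
    cfg = pyrallis.parse(config_class=TrainConfig)
    print(cfg)
```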
Testing
Testing is implemented on the test subset of the ESD dataset. To synthesize audio and compute neural MOS (NISQA TTS):
1. Configure arguments in config/config.py under the Inference section.
2. Run python -m src.scripts.test.
You can find NISQA TTS scores for the original, reconstructed, and generated audio in test.log.
Inference
EmoSpeech is trained on phoneme sequences. Supported phones can be found in data/preprocessed/phones.json. This repository was created for academic research and doesn't support automatic grapheme-to-phoneme conversion. However, if you would like to synthesize an arbitrary sentence with emotion conditioning, you can:
1. Generate a phoneme sequence from graphemes with MFA:
   1.1 Follow the [installation guide](https://montreal-forced-aligner.readthedocs.io/en/latest/installation.html)
   1.2 Download the English g2p model: `mfa model download g2p english_us_arpa`
   1.3 Generate phoneme.txt from graphemes.txt: `mfa g2p graphemes.txt english_us_arpa phoneme.txt`
2. Run `python -m src.scripts.inference`, specifying arguments:
Argument | Meaning | Possible Values | Default value
---|---|---|---
-sq | Phoneme sequence to synthesize | Find in data/preprocessed/phones.json. | Not set, required argument.
-emo | Id of desired voice emotion | 0: neutral, 1: angry, 2: happy, 3: sad, 4: surprise. | 1
-sp | Id of speaker voice | From 1 to 10, corresponding to 0011 ... 0020 in the original ESD notation. | 5
-p | Path where to save synthesized audio | Any with .wav extension. | generation_from_phoneme_sequence.wav
For example:
python -m src.scripts.inference --sq "S P IY2 K ER1 F AY1 V T AO1 K IH0 NG W IH0 TH AE1 NG G R IY0 IH0 M OW0 SH AH0 N"
If the result file is not synthesized, check inference.log for OOV phones.
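You can also pre-check a sequence for OOV phones before running inference. A minimal sketch (not part of the repo); it assumes phones.json is either a list of phone symbols or a dict keyed by them, which is an assumption about the file's layout:

```python
# check_phones.py - flag OOV phones before calling src.scripts.inference.
# Usage: python check_phones.py "S P IY2 K ER1 F AY1 V ..."
import json
import sys

with open("data/preprocessed/phones.json") as f:
    known = set(json.load(f))  # iterating a dict yields its keys; a list, its items

oov = [p for p in sys.argv[1].split() if p not in known]
if oov:
    print("OOV phones:", " ".join(oov))
else:
    print("All phones are in the vocabulary.")
```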
References
- FastSpeech 2 - PyTorch Implementation
- iSTFTNet : Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform
- Publicly Available Emotional Speech Dataset (ESD) for Speech Synthesis and Voice Conversion
- NISQA: Speech Quality and Naturalness Assessment
- Montreal Forced Aligner Models
- Modified VocGAN
- AdaSpeech
Owner
- Name: Deep VK
- Login: deepvk
- Kind: organization
- Repositories: 3
- Profile: https://github.com/deepvk
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: emospeech
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Daria
    family-names: Diatlova
    affiliation: deepvk
    email: d.dyatlova@vk.team
  - given-names: Vitaly
    family-names: Shutov
    affiliation: deepvk
    email: vi.shutov@corp.vk.com
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2307.00024'
    description: arXiv preprint
repository-code: 'https://github.com/deepvk/emospeech'
url: 'https://dariadiatlova.github.io/emospeech'
abstract: >-
  EmoSpeech is an acoustic model for an emotional speech
  synthesis.
keywords:
  - TTS
license: Apache-2.0
GitHub Events
Total
- Watch event: 21
- Fork event: 3
Last Year
- Watch event: 21
- Fork event: 3
Dependencies
- nvidia/cuda 11.8.0-runtime-ubuntu22.04 build
- PyYAML *
- gdown *
- librosa *
- lightning *
- loguru *
- matplotlib *
- numpy *
- opensmile *
- pandas *
- pillow *
- pip-system-certs *
- pyrallis *
- pyworld *
- pyyaml *
- scikit-learn *
- scipy *
- seaborn *
- soundfile *
- tgt *
- torch *
- torchaudio *
- torchmetrics *
- torchvision *
- tqdm *
- wandb *