ShepardTTS
ShepardTTS is a fine-tuned XTTS v2.0.3 model, trained on paired dialogue/audio samples from the Mass Effect 2 and Mass Effect 3 base games.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references: not found
- ○ Academic publication links: not found
- ○ Academic email domains: not found
- ○ Institutional organization owner: not found
- ○ JOSS paper metadata: not found
- ○ Scientific vocabulary similarity: low similarity (11.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Darwinkel
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Homepage: https://shepardtts.darwinkel.net/
- Size: 1.59 MB
Statistics
- Stars: 7
- Watchers: 1
- Forks: 0
- Open Issues: 16
- Releases: 0
Metadata Files
README.md
ShepardTTS
ShepardTTS is a free and open-source fine-tuned XTTS v2.0.3 model, trained on paired dialogue/audio samples from the Mass Effect 2 and Mass Effect 3 base games. It is a multilingual and multispeaker model, and can make all of our beloved characters come to life.
Pull requests, feature requests, and discussions are welcome!
If you are a researcher, and you want access to the public ShepardTTS deployment, contact me.
Usage notes
Most voices perform best when narrating medium-length sentences with medium-length words. They tend to produce garbage and artifacts when confronted with very short words and sentences, excessive punctuation, and abbreviations. Sentences that are too long tend to cause hallucinations. As a rule of thumb: provide text input that could plausibly have occurred in the games. The more out-of-domain (and unnatural) the text input, the lower the chances of a good narration.
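Following this guidance, overly long input can be split into medium-length sentences before synthesis. A minimal sketch of such a pre-processing step (the helper name and the length threshold are illustrative, not part of ShepardTTS):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split input into sentence-sized chunks suited to the model.

    Very long sentences tend to cause hallucinations, so sentences
    exceeding max_chars are further split on clause boundaries.
    """
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            if sentence:
                chunks.append(sentence)
            continue
        # Fall back to comma/semicolon boundaries for overlong sentences.
        part = ""
        for clause in re.split(r"(?<=[,;])\s+", sentence):
            if part and len(part) + len(clause) + 1 > max_chars:
                chunks.append(part)
                part = clause
            else:
                part = f"{part} {clause}".strip()
        if part:
            chunks.append(part)
    return chunks
```

Each resulting chunk can then be narrated separately and the audio concatenated.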
Deployment
GitHub Actions automatically produces a fresh image on every push to the main branch. See docker-compose.example.yml for an example of how it can be deployed.
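A deployment along these lines might look as follows; note that the service name, image path, port, and volume layout here are assumptions for illustration — the repository's docker-compose.example.yml is authoritative:

```yaml
services:
  shepardtts:
    # Image produced by the GitHub Actions workflow on pushes to main
    # (registry path assumed).
    image: ghcr.io/darwinkel/shepardtts:main
    restart: unless-stopped
    ports:
      - "7860:7860"  # assumed default Gradio port
    volumes:
      # The model checkpoint is not distributed; mount your own
      # fine-tuned weights.
      - ./checkpoints:/app/checkpoints:ro
```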
History (and other experiments)
I initially fine-tuned SpeechT5, but the results were disappointing. That model very frequently produced garbage and/or hallucinated output for most voices. Interestingly, it also had a very strong bias towards female speakers.
Dataset
After dumping the dialogue strings with the Legendary Explorer and the audio samples with Gibbed's ME2/ME3 extractor, you can use create_dataset.py to align and filter the two. This transforms the dialogue-audio pairs into a HuggingFace dataset, which is then exported in the LJSpeech format.
You can then proceed to train the model, and create character embeddings when it finishes training.
The audio samples and dialogue strings are extremely clean. The audio has a sample rate of 24000 Hz (downsampled to 22050 Hz for training). The dialogue strings are corrupted in some cases (possibly an issue with the Legendary Explorer?).
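The 24000 Hz to 22050 Hz conversion reduces to exact rational resampling, since 22050/24000 simplifies to 147/160. A sketch using scipy (the repository itself presumably relies on the tts/torchaudio tooling for this step):

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_24k_to_22050(audio: np.ndarray) -> np.ndarray:
    """Resample a 24000 Hz mono waveform to 22050 Hz.

    22050 / 24000 = 147 / 160, so polyphase resampling with
    up=147, down=160 performs the conversion exactly.
    """
    return resample_poly(audio, up=147, down=160)

# One second of a 440 Hz tone at 24000 Hz becomes 22050 samples.
tone = np.sin(2 * np.pi * 440 * np.arange(24000) / 24000)
resampled = downsample_24k_to_22050(tone)
```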
Training
Trained for 12 epochs on an RTX 3060 with 12 GB of VRAM; training took about 14 hours. Judging from the eval loss, this is roughly the point where the model starts overfitting. See train.py for the parameters used.
Future work
See the project board.
GPU inference with DeepSpeed is ~20x faster (minutes down to seconds), but renting GPUs is very expensive. Do we have a generous sponsor in the audience, perhaps?
Ethical (and legal) statement
There are probably copyright issues with a generative model trained on game files. More importantly, I'm not sure how the voice actors feel about their voice being cloned. Do not use ShepardTTS for commercial or harmful purposes. This software is a labor of love built for the Mass Effect fan community.
Due to these legal and ethical issues, I will not distribute the game files or the model checkpoint at this time. Dump the files and fine-tune the model yourself.
Risks
Voice cloning technology has been around for a couple of years. Hand-picked audio samples with commercial-grade voice models likely produce better audio than ShepardTTS. Furthermore, waveforms produced by this model are easily recognizable as such just by visual inspection, as it always produces (some) characteristic artifacts.
Access to the public deployment is highly restricted, so there is no straightforward way to use the system in a way that hurts the interests of the original voice actors.
All things considered, this software should not produce additional harm beyond what already exists.
License
The model and its output: Coqui Public Model License (CPML)
The code: GNU General Public License v3.0
Acknowledgements
- Coqui for their amazing XTTS v2.0.3 model;
- Mass Effect Legendary Explorer to dump the dialogue strings;
- Gibbed's AudioExtractor to bulk-export audio files from ME2 and ME3;
- Gradio, HuggingFace, and HuggingFace Spaces, for inspiration regarding the deployment of this model.
Owner
- Name: Patrick
- Login: Darwinkel
- Kind: user
- Location: Groningen, Netherlands
- Company: @code050 | @rijksuniversiteit-groningen
- Repositories: 1
- Profile: https://github.com/Darwinkel
BSc in Information Science; software engineer
Citation (CITATION.cff)
cff-version: 1.2.0
title: ShepardTTS
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Patrick
    family-names: Darwinkel
    orcid: 'https://orcid.org/0009-0009-6604-1175'
identifiers:
  - type: url
    value: 'https://github.com/Darwinkel/ShepardTTS'
    description: Source code
abstract: >-
  ShepardTTS is a free and open-source fine-tuned XTTS
  v2.0.3 model, trained on paired dialogue/audio samples
  from the Mass Effect 2 and Mass Effect 3 base games. It is
  a multilingual and multispeaker model, and can make all of
  our beloved characters come to life.
license: GPL-3.0
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- python 3.11-bookworm build
- 222 dependencies
- gradio 4.19.1 deploy
- mypy 1.8.0 develop
- poetry-plugin-export 1.6.0 develop
- ruff 0.2.1 develop
- clean-text 0.6.0
- deepspeed 0.13.2
- num2words 0.5.13
- numpy 1.26.4
- pandas 1.5.3
- python ^3.11,<3.12
- torch 2.2.0
- torchaudio 2.2.0
- tts 0.22.0
- unidecode 1.3.8
- accelerate 0.27.2 train
- datasets 2.17.0 train
- openpyxl 3.1.2 train
- soundfile 0.12.1 train
- trainer 0.0.36 train
- actions/checkout v4 composite
- actions/setup-python v5 composite
- docker/build-push-action v5 composite
- docker/login-action v3 composite
- docker/metadata-action v5 composite
- docker/setup-buildx-action v3 composite
- actions/checkout v4 composite
- actions/setup-python v5 composite
- actions/upload-artifact v4 composite