mel-cepstral-distance

A Python library for computing the Mel-Cepstral Distance (Mel-Cepstral Distortion, MCD) between two inputs. This implementation is based on the method proposed by Robert F. Kubichek in "Mel-Cepstral Distance Measure for Objective Speech Quality Assessment".

https://github.com/stefantaubert/mel-cepstral-distance

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 21 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org, ieee.org, zenodo.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary

Keywords

cepstral distance distortion divergence dtw dynamic-time-warping language linguistics mcd mel mfcc objective-evaluation spectrogram spectrum speech-quality speech-synthesis text-to-speech tts voice-cloning

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: stefantaubert
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 62.7 MB

Statistics

Stars: 55
Watchers: 3
Forks: 10
Open Issues: 0
Releases: 4

Created over 5 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog Contributing License Code of conduct Citation

README.md

mel-cepstral-distance

A Python library for computing the Mel-Cepstral Distance (also known as Mel-Cepstral Distortion, MCD) between two inputs. This implementation is based on the method proposed by Robert F. Kubichek in "Mel-Cepstral Distance Measure for Objective Speech Quality Assessment".

- Compute MCD between two inputs: audio files, amplitude spectrograms, Mel spectrograms, or MFCCs.
- Calculate an alignment penalty (PEN) as an additional metric to indicate the extent of alignment applied.
- Remove pauses from audio files or feature representations (amplitude spectrograms, Mel spectrograms, or MFCCs) using a threshold.
- Align feature representations using either Dynamic Time Warping (DTW) or zero-padding.
Experimental results show a moderate negative correlation with naturalness (Spearman: –0.31) and a weak negative correlation with intelligibility (–0.24). For a detailed analysis of parameter configurations and their impact on correlation strength, see the experiment report.

Getting Started

Installation

pip install mel-cepstral-distance

Example usage

Compare two audio files with default parameters:

```py
from mel_cepstral_distance import compare_audio_files

# Returns the mean MCD and the alignment penalty (PEN) for the two files.
mcd, penalty = compare_audio_files(
  'examples/GT.wav',
  'examples/Tacotron-2.wav',
)

print(f'MCD: {mcd:.2f}, Penalty: {penalty:.4f}')
# MCD: 7.45, Penalty: 0.1087
```

Calculation

Spectrogram

$$ X(k, m) = \text{FFT of } x_k(n), \text{ for real input.} $$

Where:

$X(k, m)$: The result (amplitude spectrogram) of the real-valued FFT for the $k$-th frame at frequency index $m$.
$x_k(n)$: The time-domain signal of the $k$-th frame.
$\text{FFT}$: The real-valued discrete Fourier transform, computed using np.fft.rfft.
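
As an illustration of the definition above, the following minimal sketch frames a signal, applies a window, and takes the magnitude of `np.fft.rfft`. The function name `amplitude_spectrogram`, the Hann window, and the framing without padding are assumptions made for this example; the library may frame and pad the signal differently.

```python
import numpy as np

def amplitude_spectrogram(signal, win_len, hop_len, n_fft):
    """Return |X(k, m)| with shape (frames, n_fft // 2 + 1).

    Hypothetical helper: frames the signal without padding and
    assumes len(signal) >= win_len.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([
        signal[k * hop_len:k * hop_len + win_len] * window
        for k in range(n_frames)
    ])
    # rfft zero-pads each frame to n_fft and keeps non-negative frequencies.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

# One second of a 1 kHz tone at 16 kHz; 32 ms window, 8 ms hop.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
spec = amplitude_spectrogram(tone, win_len=512, hop_len=128, n_fft=512)
print(spec.shape)
```

With a bin spacing of 16000 / 512 = 31.25 Hz, the 1 kHz tone falls on frequency index 32, which is where each frame of `spec` peaks.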

Mel spectrogram

$$ X_{k,n} = \log_{10}\left\lbrace\sum_{m}^{M} |X(k, m)|^2 \cdot w_n(m)\right\rbrace $$

Where:

$X_{k,n}$: The logarithmic Mel-scaled power spectrogram for the $k$-th frame at Mel frequency $n$.
$X(k, m)$: The amplitude spectrum of the $k$-th frame at frequency $m$.
$M$: The total number of Mel frequency bins.
$w_n(m)$: The Mel filter bank weights for Mel frequency $n$ and frequency bin $m$.
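
The formula can be sketched with NumPy alone. The triangular filter shape, the HTK-style Mel conversion (`2595 * log10(1 + f / 700)`), the helper names, and the small floor `eps` that avoids `log10(0)` are all assumptions of this example and are not taken from the library, which may construct its filter bank differently.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel scale (an assumption for this sketch).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_fft, sample_rate, fmin, fmax):
    """Triangular weights w_n(m), shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    weights = np.zeros((n_mels, len(fft_freqs)))
    for n in range(n_mels):
        lower, center, upper = edges[n], edges[n + 1], edges[n + 2]
        rising = (fft_freqs - lower) / (center - lower)
        falling = (upper - fft_freqs) / (upper - center)
        weights[n] = np.clip(np.minimum(rising, falling), 0.0, None)
    return weights

def log_mel_spectrogram(amp_spec, weights, eps=1e-10):
    """X_{k,n} = log10(sum_m |X(k, m)|^2 * w_n(m)), shape (frames, n_mels)."""
    mel_power = (amp_spec ** 2) @ weights.T
    return np.log10(np.maximum(mel_power, eps))

weights = mel_filter_bank(n_mels=20, n_fft=512, sample_rate=16000,
                          fmin=0.0, fmax=8000.0)
log_mel = log_mel_spectrogram(np.ones((5, 257)), weights)
print(weights.shape, log_mel.shape)
```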

Mel-frequency cepstral coefficients

$$ MC_X(i, k) = \sum_{n=1}^{M} X_{k,n} \cos\left[i\left(n - \frac{1}{2}\right)\frac{\pi}{M}\right] $$

Where:

$MC_X(i, k)$: The $i$-th Mel-frequency cepstral coefficient (MFCC) for the $k$-th frame.
$X_{k,n}$: The logarithmic Mel-scaled power spectrogram for the $k$-th frame at Mel frequency $n$.
$M$: The total number of Mel frequency bins.
$i$: The index of the MFCC being computed.
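
A direct NumPy transcription of this cosine sum is sketched below. The function name `mfcc_from_log_mel` and the array layout (frames in rows, coefficients in columns) are choices made for this example.

```python
import numpy as np

def mfcc_from_log_mel(log_mel, n_coeffs):
    """Apply the cosine sum to a log Mel spectrogram of shape (frames, M).

    Returns an array of shape (frames, n_coeffs) where column i holds
    MC_X(i, k) for i = 0 .. n_coeffs - 1.
    """
    M = log_mel.shape[1]
    i = np.arange(n_coeffs)[:, None]        # coefficient index i
    n = np.arange(1, M + 1)[None, :]        # Mel band index n = 1 .. M
    basis = np.cos(i * (n - 0.5) * np.pi / M)   # shape (n_coeffs, M)
    return log_mel @ basis.T

# A flat log Mel spectrum has all its energy in the 0th coefficient.
flat = np.ones((3, 20))
mfccs = mfcc_from_log_mel(flat, n_coeffs=17)
print(mfccs.shape)
```

The sum has the form of an unnormalized type-II discrete cosine transform. For a flat input, the 0th coefficient equals $M$ and all higher coefficients vanish, which is one reason the 0th coefficient is associated with energy.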

Mel-cepstral distance

Per frame

$$ MCD(k) = \alpha\sqrt{\sum_{i=s}^{D} \left(MC_X(i, k) - MC_Y(i, k)\right)^2} $$

Where:

$MCD(k)$: The Mel-cepstral distance for the $k$-th frame.
$MC_X(i, k)$: The $i$-th MFCC of the reference signal for the $k$-th frame.
$MC_Y(i, k)$: The $i$-th MFCC of the target signal for the $k$-th frame.
$D$: The number of MFCCs used in the computation.
$\alpha$: Optional scaling factor used in some literature, e.g. $\frac{10\sqrt{2}}{\ln 10}$.
- Note: Kubichek did not use a scaling factor, so $\alpha$ defaults to 1.
$s$: Parameter to exclude the 0th coefficient (corresponding to energy):
- $s = 0$: Includes the 0th coefficient
- $s = 1$: Excludes the 0th coefficient
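
The per-frame distance can be written in a few lines. This sketch assumes both MFCC arrays are already aligned to the same number of frames and store coefficient $i$ in column $i$; the function name `mcd_per_frame` is hypothetical.

```python
import numpy as np

def mcd_per_frame(mfccs_x, mfccs_y, s=1, D=16, alpha=1.0):
    """Return MCD(k) for every frame k.

    mfccs_x, mfccs_y: aligned arrays of shape (frames, n_coeffs).
    The sum runs over coefficients i = s .. D inclusive, as in the formula.
    """
    diff = mfccs_x[:, s:D + 1] - mfccs_y[:, s:D + 1]
    return alpha * np.sqrt(np.sum(diff ** 2, axis=1))

x = np.zeros((3, 17))
y = np.zeros((3, 17))
y[:, 0], y[:, 1], y[:, 2] = 5.0, 3.0, 4.0

without_c0 = mcd_per_frame(x, y, s=1)   # sqrt(3^2 + 4^2) per frame
with_c0 = mcd_per_frame(x, y, s=0)      # sqrt(5^2 + 3^2 + 4^2) per frame
print(without_c0, with_c0)
```

The example shows how much a difference in the 0th coefficient alone can change the result when $s = 0$.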

Mean over all frames

$$ MCD = \frac{1}{N} \sum_{k=1}^{N} MCD(k) $$

Where:

$MCD$: The mean Mel-cepstral distance over all frames.
$N$: The total number of frames.
$MCD(k)$: The Mel-cepstral distance for the $k$-th frame.
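
The following short sketch averages hypothetical per-frame values and, as an optional step, rescales the result with the factor $\frac{10\sqrt{2}}{\ln 10}$ used in some literature. The frame values are made up for illustration.

```python
import math
import numpy as np

frame_mcd = np.array([4.0, 6.0, 8.0])    # hypothetical MCD(k) values
mcd = float(np.mean(frame_mcd))          # mean over all N frames

# Optional scaling factor from some literature, roughly 6.14.
alpha_db = 10.0 * math.sqrt(2.0) / math.log(10.0)
mcd_scaled = alpha_db * mcd

print(mcd, mcd_scaled)
```

Because $\alpha$ is a constant, scaling the mean gives the same value as averaging the scaled per-frame distances.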

Alignment penalty during Dynamic Time Warping (DTW)

$$ PEN = 2 - \frac{N_X + N_Y}{N_{XY}} $$

Where:

$N_X$: The number of frames in the reference sequence.
$N_Y$: The number of frames in the target sequence.
$N_{XY}$: The number of frames after alignment (same for X and Y).
$PEN$: A value in interval $[0, 1)$, where a smaller value indicates less alignment.
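
To make the penalty concrete, the sketch below aligns two short sequences with a plain, exact DTW and derives PEN from the length of the warping path. The exact DTW with Euclidean frame cost is a stand-in chosen for clarity; the dependency list suggests the library relies on `fastdtw`, an approximate method, so its paths may differ. The names `dtw_path` and `alignment_penalty` are hypothetical.

```python
import numpy as np

def dtw_path(x, y):
    """Exact DTW between x (N_X, d) and y (N_Y, d); returns index pairs."""
    n_x, n_y = len(x), len(y)
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((n_x + 1, n_y + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            best = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best
    # Backtrack from the end, preferring the diagonal step on ties.
    i, j = n_x, n_y
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    return path[::-1]

def alignment_penalty(n_x, n_y, n_xy):
    return 2.0 - (n_x + n_y) / n_xy

x = np.array([[0.0], [1.0], [2.0]])           # N_X = 3
y = np.array([[0.0], [1.0], [1.0], [2.0]])    # N_Y = 4
path = dtw_path(x, y)
pen = alignment_penalty(len(x), len(y), len(path))
print(path, pen)
```

Here the second frame of `x` is matched to two frames of `y`, so $N_{XY} = 4$ and $PEN = 2 - 7/4 = 0.25$. Two sequences of equal length that are matched frame by frame give $PEN = 0$.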

Used parameters in literature

| Literature | Sampling Rate | Window Size | Hop Length | FFT Size | Window Function | $M$ | Min Frequency | Max Frequency | $s$ | $D$ | Pause | DTW | $\alpha$ | Smallest MCD | Largest MCD | Citation MCD | Domain |
| ---------- | ------------- | --------------------- | -------------------- | ------------ | --------------- | --- | ------------- | ------------- | --- | --- | ----- | --- | ----------------------------- | ------------ | ----------- | ------------ | ------- |
| [1] | 8kHz | 32ms/256 | <16ms/128* | 32ms/256* | ? | 20 | 0Hz* | 4kHz* | 1 | 16 | no | no | 1 | ~0.8 | ~1.05 | original | generic |
| [2] | ? | ? | ? | ? | ? | 80* | 80Hz* | 12kHz* | 1 | 13 | yes* | no | 1 | 0.294 | 0.518 | [3] | TTS |
| [3] | 24kHz* | ? | ? | ? | ? | 80 | 80Hz | 12kHz | 1 | 13 | yes* | no | 1 | 6.99 | 12.37 | [1] | TTS |
| [4] | 16kHz* | 25ms | 5ms | ? | ? | ? | 0Hz* | 8kHz* | 1 | 24 | yes* | no | $\frac{10}{\ln(10)}$ | ~2.5dB | ~12.5dB | [5] | TTS |
| [5] | ? | 30ms | 10ms | ? | Hamming | ? | ? | ? | 1 | 10 | yes* | yes | 1 | 3.415 | 4.066 | [1] | TTS |
| [6] | ? | >10ms* | 5ms | >10ms* | Gaussian* | ? | ? | 8kHz* | 1 | 24 | no | no | $\frac{10 \sqrt{2}}{\ln(10)}$ | ~4.75 | ~6 | [7] | VC |
| [7] | 16kHz | 40ms* | 5ms | 64ms/1024 | Gaussian | ? | ? | 12kHz | 1 | 40 | yes | no | $\frac{10 \sqrt{2}}{\ln(10)}$ | 2.32dB | 3.53dB | none | TTS |
| [8] | 24kHz | 50ms/1200 | 12.5ms/300 | 2048/~85.3ms | Hann | 80 | 80Hz | 12kHz | 1 | 13 | yes* | yes | 1 | 4.83 | 5.68 | [1] | TTS |
| [9] | 16kHz | 64ms/1024 | 16ms/256 | 128ms/2048 | Hann | 80 | 125Hz | 7.6kHz | 1* | 16* | yes* | yes | 1* | 10.62 | 14.38 | [1] | TTS |
| [10] | 16kHz | ? | ? | ? | ? | ? | ? | ? | 1 | 16* | yes* | yes | 1* | 8.67 | 19.41 | none | TTS |
| [11] | 16kHz* | 64ms* (at 16kHz)/1024 | 16ms* (at 16kHz)/256 | 64ms/1024 | Hann* | 80 | 0Hz | 8kHz | 1 | 60 | yes* | no | $\frac{10 \sqrt{2}}{\ln(10)}$ | 5.32dB | 6.78dB | [12] | TTS |

*Values marked with an asterisk are not explicitly stated in the source; they were estimated from the information given in the literature.

Literature:

[1] Kubichek, R. (1993). Mel-cepstral distance measure for objective speech quality assessment. Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing, 1, 125–128. https://doi.org/10.1109/PACRIM.1993.407206
[2] Lee, Y., & Kim, T. (2019). Robust and Fine-grained Prosody Control of End-to-end Speech Synthesis. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5911–5915. https://doi.org/10.1109/ICASSP.2019.8683501
[3] Ref-Tacotron -> Skerry-Ryan, R. J., Battenberg, E., Xiao, Y., Wang, Y., Stanton, D., Shor, J., Weiss, R., Clark, R., & Saurous, R. A. (2018). Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. Proceedings of the 35th International Conference on Machine Learning, 4693–4702. https://proceedings.mlr.press/v80/skerry-ryan18a.html
[4] Anumanchipalli, G. K., Chartier, J., & Chang, E. F. (2019). Speech synthesis from neural decoding of spoken sentences. Nature, 568(7753), Article 7753. https://doi.org/10.1038/s41586-019-1119-1
[5] Shah, N. J., Vachhani, B. B., Sailor, H. B., & Patil, H. A. (2014). Effectiveness of PLP-based phonetic segmentation for speech synthesis. 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 270–274. https://doi.org/10.1109/ICASSP.2014.6853600
[6] Kominek, J., Schultz, T., & Black, A. W. (2008). Synthesizer voice quality of new languages calibrated with mean Mel cepstral distortion. SLTU, 63–68. http://www.cs.cmu.edu/~./awb/papers/sltu2008/kominekblack.sltu2008.pdf
[7] Mashimo, M., Toda, T., Shikano, K., & Campbell, N. (2001). Evaluation of cross-language voice conversion based on GMM and straight. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 361–364. https://doi.org/10.21437/Eurospeech.2001-111
[8] Capacitron -> Battenberg, E., Mariooryad, S., Stanton, D., Skerry-Ryan, R. J., Shannon, M., Kao, D., & Bagby, T. (2019). Effective Use of Variational Embedding Capacity in Expressive End-to-End Speech Synthesis (No. arXiv:1906.03402). arXiv. http://arxiv.org/abs/1906.03402
[9] Attentron -> Choi, S., Han, S., Kim, D., & Ha, S. (2020). Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding. Interspeech 2020, 2007–2011. https://doi.org/10.21437/Interspeech.2020-2096
[10] VoiceLoop -> Taigman, Y., Wolf, L., Polyak, A., & Nachmani, E. (2018). VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. 6th International Conference on Learning Representations (ICLR 2018), 2, 1374–1387. https://openreview.net/forum?id=SkFAWax0-
[11] MIST-Tacotron -> Moon, S., Kim, S., & Choi, Y.-H. (2022). MIST-Tacotron: End-to-End Emotional Speech Synthesis Using Mel-Spectrogram Image Style Transfer. IEEE Access, 10, 25455–25463. https://doi.org/10.1109/ACCESS.2022.3156093
[12] Kim, J., Choi, H., Park, J., Hahn, M., Kim, S., & Kim, J.-J. (2018). Korean Singing Voice Synthesis Based on an LSTM Recurrent Neural Network. Interspeech 2018, 1551–1555. https://doi.org/10.21437/Interspeech.2018-1575

Default parameters

The default parameters were chosen based on the values reported in the literature:

Hop Length (hop_len): 8 ms
- Note: should be 1/2 or 1/4 of the window size
Window Size (win_len): 32 ms
FFT Size (n_fft): 32 ms
- For faster computation, the sample equivalent should be a power of 2.
Window Function (window): Hanning
Sampling Rate (sample_rate): is taken from the audio file
Minimum Frequency (fmin): 0 Hz
Maximum Frequency (fmax): sampling rate / 2
- Cannot exceed half the sampling rate.
Num. Mel-Bands ($M$): 20
- Increasing the number will increase the resulting MCD values.
$s$: 1
$D$: 16
$\alpha$: 1 (alternate values can be applied by multiplying the MCD with a custom factor)
Aligning: DTW
Align Target (align_target): MFCC
Remove Silence: No
- Silence can be removed from Mel spectrograms before computing the MCD, with dataset-specific thresholds.
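
Since window, hop, and FFT sizes are given in milliseconds, their length in samples depends on the sampling rate. The sketch below converts the defaults for two sampling rates and checks the power-of-two recommendation for the FFT size. Rounding to the nearest sample is an assumption of this example; the library's own conversion may round differently.

```python
def ms_to_samples(duration_ms, sample_rate):
    # Hypothetical helper: rounds to the nearest whole sample.
    return int(round(duration_ms * sample_rate / 1000))

def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0

for sr in (16000, 22050):
    win_len = ms_to_samples(32, sr)
    hop_len = ms_to_samples(8, sr)
    n_fft = ms_to_samples(32, sr)
    print(sr, win_len, hop_len, n_fft, is_power_of_two(n_fft))
```

At 16 kHz the defaults correspond to 512, 128, and 512 samples, so the FFT size is a power of 2. At 22.05 kHz, 32 ms is about 706 samples, which is not, so a different `n_fft` may be preferable when speed matters.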

Suggested parameters

Based on the conducted experiments, the following parameter settings are recommended to achieve the strongest correlation with subjective ratings:

sample_rate = 96000 Hz
n_fft = 64 ms
win_len = 32 ms
hop_len = 16 ms
window = 'hanning'
fmin = 0 Hz
fmax = 48000 Hz
M = 20
s = 1
D = 13
align_method = 'dtw'
align_target = 'mel'
remove_silence = 'no'
silence_threshold_A = None
silence_threshold_B = None
norm_audio = True
dtw_radius = 2

Furthermore, combining MCD and PEN using the formula MCD*(PEN+1) yields the strongest correlation with subjective ratings, according to the experimental results.
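
As a small worked example of the combined score, the snippet below applies MCD*(PEN+1) to the values printed in the usage example (MCD 7.45, PEN 0.1087). The function name `combined_score` is hypothetical and not part of the library.

```python
def combined_score(mcd, pen):
    # MCD weighted by the alignment penalty: MCD * (PEN + 1).
    return mcd * (pen + 1.0)

score = combined_score(7.45, 0.1087)
print(f'{score:.2f}')
```

Since PEN lies in $[0, 1)$, the combined score is at least the MCD and less than twice the MCD.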

Note: To enable meaningful cross-paper comparisons, users of this library (whether adopting it directly or implementing their own version) are strongly encouraged to report all parameter settings used for feature extraction and distance calculation. Inconsistent or undocumented configurations remain a major issue in the current literature.

License

MIT License

Citation

If you want to cite this repo, you can use the BibTeX-entry generated by GitHub (see About => Cite this repository).

Taubert, S., & Sternkopf, J. (2025). mel-cepstral-distance (Version 0.0.4) [Computer software]. https://doi.org/10.5281/zenodo.15213012

Acknowledgments

Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 416228727 – CRC 1410

Owner

Name: Stefan Taubert
Login: stefantaubert
Kind: user
Location: Chemnitz, Germany
Company: Chemnitz University of Technology

Website: https://stefantaubert.com
Twitter: Stefan_Taubert
Repositories: 75
Profile: https://github.com/stefantaubert

Currently I am working on my PhD about the topic of speech synthesis at Chemnitz University of Technology.

Citation (CITATION.cff)

cff-version: 1.2.0
title: mel-cepstral-distance
abstract: A Python library for computing the Mel-Cepstral Distance (also known as Mel-Cepstral Distortion, MCD) between two inputs. This implementation is based on the paper 'Mel-Cepstral Distance Measure for Objective Speech Quality Assessment' by Kubichek (1993).
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Stefan
    family-names: Taubert
    affiliation: Chemnitz University of Technology
    orcid: 'https://orcid.org/0000-0002-4932-2874'
    website: 'https://stefantaubert.com/'
  - given-names: Jasmin
    family-names: Sternkopf
    affiliation: Chemnitz University of Technology
version: 0.0.4
date-released: 2025-04-14
license: MIT
url: https://github.com/stefantaubert/mel-cepstral-distance
doi: 10.5281/zenodo.15213012

GitHub Events

Total

Watch event: 4
Push event: 16

Last Year

Watch event: 4
Push event: 16

Committers

Last synced: 12 months ago

All Time

Total Commits: 158
Total Committers: 2
Avg Commits per committer: 79.0
Development Distribution Score (DDS): 0.285

Past Year

Commits: 76
Committers: 1
Avg Commits per committer: 76.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Stefan Taubert	2****t	113
Jasmin Sternkopf	j**f@w**e	45

Issues and Pull Requests

Last synced: 12 months ago

All Time

Total issues: 4
Total pull requests: 1
Average time to close issues: 7 months
Average time to close pull requests: over 2 years
Total issue authors: 4
Total pull request authors: 1
Average comments per issue: 1.5
Average comments per pull request: 1.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: 40 minutes
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

chankl3579 (1)
piotrkawa (1)

Pull Request Authors

ckdgus0505 (1)

Top Labels

Issue Labels

enhancement (1)

Pull Request Labels

Dependencies

Pipfile pypi

autoflake * develop
autopep8 * develop
build * develop
isort * develop
mel_cepstral_distance * develop
pycodestyle * develop
pylint * develop
pytest * develop
rope * develop
tox * develop
twine * develop
fastdtw >=0.3.4
librosa >=0.9.1
numpy >=1.22.3
scipy >=1.8.0

Pipfile.lock pypi

appdirs ==1.4.4 develop
astroid ==2.11.7 develop
attrs ==22.1.0 develop
audioread ==3.0.0 develop
autoflake ==1.4 develop
autopep8 ==1.7.0 develop
bleach ==5.0.1 develop
build ==0.8.0 develop
certifi ==2022.6.15 develop
cffi ==1.15.1 develop
charset-normalizer ==2.1.0 develop
commonmark ==0.9.1 develop
cryptography ==37.0.4 develop
decorator ==5.1.1 develop
dill ==0.3.5.1 develop
distlib ==0.3.5 develop
docutils ==0.19 develop
fastdtw ==0.3.4 develop
filelock ==3.8.0 develop
idna ==3.3 develop
importlib-metadata ==4.12.0 develop
iniconfig ==1.1.1 develop
isort ==5.10.1 develop
jeepney ==0.8.0 develop
joblib ==1.1.0 develop
keyring ==23.8.2 develop
lazy-object-proxy ==1.7.1 develop
librosa ==0.9.2 develop
llvmlite ==0.39.0 develop
mccabe ==0.7.0 develop
mel-cepstral-distance * develop
numba ==0.56.0 develop
numpy ==1.22.4 develop
packaging ==21.3 develop
pep517 ==0.13.0 develop
pkginfo ==1.8.3 develop
platformdirs ==2.5.2 develop
pluggy ==1.0.0 develop
pooch ==1.6.0 develop
py ==1.11.0 develop
pycodestyle ==2.9.1 develop
pycparser ==2.21 develop
pyflakes ==2.5.0 develop
pygments ==2.13.0 develop
pylint ==2.14.5 develop
pyparsing ==3.0.9 develop
pytest ==7.1.2 develop
pytoolconfig ==1.2.2 develop
readme-renderer ==36.0 develop
requests ==2.28.1 develop
requests-toolbelt ==0.9.1 develop
resampy ==0.4.0 develop
rfc3986 ==2.0.0 develop
rich ==12.5.1 develop
rope ==1.3.0 develop
scikit-learn ==1.1.2 develop
scipy ==1.9.0 develop
secretstorage ==3.3.3 develop
setuptools ==65.0.2 develop
six ==1.16.0 develop
soundfile ==0.10.3.post1 develop
threadpoolctl ==3.1.0 develop
toml ==0.10.2 develop
tomli ==2.0.1 develop
tomlkit ==0.11.4 develop
tox ==3.25.1 develop
twine ==4.0.1 develop
typing-extensions ==4.3.0 develop
urllib3 ==1.26.11 develop
virtualenv ==20.16.3 develop
webencodings ==0.5.1 develop
wrapt ==1.14.1 develop
zipp ==3.8.1 develop
appdirs ==1.4.4
audioread ==3.0.0
certifi ==2022.6.15
cffi ==1.15.1
charset-normalizer ==2.1.0
decorator ==5.1.1
fastdtw ==0.3.4
idna ==3.3
importlib-metadata ==4.12.0
joblib ==1.1.0
librosa ==0.9.2
llvmlite ==0.39.0
numba ==0.56.0
numpy ==1.22.4
packaging ==21.3
pooch ==1.6.0
pycparser ==2.21
pyparsing ==3.0.9
requests ==2.28.1
resampy ==0.4.0
scikit-learn ==1.1.2
scipy ==1.9.0
setuptools ==65.0.2
soundfile ==0.10.3.post1
threadpoolctl ==3.1.0
urllib3 ==1.26.11
zipp ==3.8.1

pyproject.toml pypi

fastdtw >= 0.3.4, < 0.4
librosa >= 0.9.1, < 0.10
numpy >= 1.22.3, < 1.24
scipy >= 1.8.0, < 1.10