Software Design and User Interface of ESPnet-SE++

Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing - Published in JOSS (2023)

https://github.com/espnet/espnet

Keywords

chainer deep-learning end-to-end kaldi machine-translation pytorch singing-voice-synthesis speaker-diarization speech-enhancement speech-recognition speech-separation speech-synthesis speech-translation spoken-language-understanding text-to-speech voice-conversion

Repository

End-to-End Speech Processing Toolkit

Basic Info

Host: GitHub
Owner: espnet
License: apache-2.0
Language: Python
Default Branch: master
Homepage: https://espnet.github.io/espnet/
Size: 1.22 GB

Statistics

Stars: 9,431
Watchers: 168
Forks: 2,325
Open Issues: 99
Releases: 56

Created about 8 years ago · Last pushed 6 months ago

README.md

ESPnet: end-to-end speech processing toolkit

| system/pytorch ver. | 2.5.1 | 2.6.0 | 2.7.1 | 2.8.0 |
| :---- | :---: | :---: | :---: | :---: |
| ubuntu/python3.10/pip | | | | |
| ubuntu/python3.11/pip | | | | |
| ubuntu/python3.10/conda | | | | |
| debian12/python3.10/conda | | | | |
| windows/python3.10/pip | | | | |
| macos/python3.10/pip | | | | |
| macos/python3.10/conda | | | | |

______________________________________________________________________

[![PyPI version](https://badge.fury.io/py/espnet.svg)](https://badge.fury.io/py/espnet)
[![Python Versions](https://img.shields.io/pypi/pyversions/espnet.svg)](https://pypi.org/project/espnet/)
[![Downloads](https://pepy.tech/badge/espnet)](https://pepy.tech/project/espnet)
[![GitHub license](https://img.shields.io/github/license/espnet/espnet.svg)](https://github.com/espnet/espnet)
[![codecov](https://codecov.io/gh/espnet/espnet/branch/master/graph/badge.svg)](https://codecov.io/gh/espnet/espnet)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/espnet/espnet/master.svg)](https://results.pre-commit.ci/latest/github/espnet/espnet/master)
[![Mergify Status](https://img.shields.io/endpoint.svg?url=https://api.mergify.com/v1/badges/espnet/espnet&style=flat)](https://mergify.com)
[![Discord](https://img.shields.io/discord/1174538500360650773?color=%239B59B6&label=chat%20on%20discord)](https://discord.gg/hrCs85gFWM)

______________________________________________________________________

[**Docs**](https://espnet.github.io/espnet/) | [**Example**](https://github.com/espnet/espnet/tree/master/egs) | [**Example (ESPnet2)**](https://github.com/espnet/espnet/tree/master/egs2) | [**Docker**](https://github.com/espnet/espnet/tree/master/docker) | [**Notebook**](https://github.com/espnet/notebook)

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.

Tutorial Series

2019 Tutorial at Interspeech
- Material
2021 Tutorial at CMU
- Online video
- Material
2022 Tutorial at CMU
- Usage of ESPnet (ASR as an example)
  - Online video
  - Material
- Add new models/tasks to ESPnet
  - Online video
  - Material

Key Features

Kaldi-style complete recipe

Supports a number of ASR recipes (WSJ, Switchboard, CHiME-4/5, Librispeech, TED, CSJ, AMI, HKUST, Voxforge, REVERB, Gigaspeech, etc.)
Supports a number of TTS recipes in a similar manner to the ASR recipes (LJSpeech, LibriTTS, M-AILABS, etc.)
Supports a number of ST recipes (Fisher-CallHome Spanish, Libri-trans, IWSLT'18, How2, Must-C, Mboshi-French, etc.)
Supports a number of MT recipes (IWSLT'14, IWSLT'16, the above ST recipes, etc.)
Supports a number of SLU recipes (CATSLU-MAPS, FSC, Grabo, IEMOCAP, JDCINAL, SNIPS, SLURP, SWBD-DA, etc.)
Supports a number of SE/SS recipes (DNS-IS2020, LibriMix, SMS-WSJ, VCTK-noisyreverb, WHAM!, WHAMR!, WSJ-2mix, etc.)
Supports a voice conversion recipe (VCC2020 baseline)
Supports speaker diarization recipes (mini_librispeech, librimix)
Supports singing voice synthesis recipes (ofutonputagoe_db, opencpop, m4singer, etc.)

ASR: Automatic Speech Recognition

State-of-the-art performance in several ASR benchmarks (comparable/superior to hybrid DNN/HMM and CTC)
Hybrid CTC/attention based end-to-end ASR
- Fast/accurate training with CTC/attention multitask training
- CTC/attention joint decoding to boost monotonic alignment decoding
- Encoder: VGG-like CNN + BiRNN (LSTM/GRU), sub-sampling BiRNN (LSTM/GRU), Transformer, Conformer, Branchformer, or E-Branchformer
- Decoder: RNN (LSTM/GRU), Transformer, or S4
Attention: Flash Attention, Dot product, location-aware attention, variants of multi-head
Incorporate RNNLM/LSTMLM/TransformerLM/N-gram trained only with text data
Batch GPU decoding
Data augmentation
Transducer based end-to-end ASR
- Architecture:
  - Custom encoder supporting RNNs, Conformer, Branchformer (w/ variants), 1D Conv / TDNN.
  - Decoder w/ parameters shared across blocks supporting RNN, stateless w/ 1D Conv, MEGA, and RWKV.
  - Pre-encoder: VGG2L or Conv2D available.
- Search algorithms:
  - Greedy search constrained to one emission by timestep.
  - Default beam search algorithm [Graves, 2012] without prefix search.
  - Alignment-Length Synchronous decoding [Saon et al., 2020].
  - Time Synchronous Decoding [Saon et al., 2020].
  - N-step Constrained beam search modified from [Kim et al., 2020].
  - Modified Adaptive Expansion Search based on [Kim et al., 2021] and NSC.
- Features:
  - Unified interface for offline and streaming speech recognition.
  - Multi-task learning with various auxiliary losses:
    - Encoder: CTC, auxiliary Transducer and symmetric KL divergence.
    - Decoder: cross-entropy w/ label smoothing.
  - Transfer learning with an acoustic model and/or language model.
  - Training with FastEmit regularization method [Yu et al., 2021].

> Please refer to the tutorial page for complete documentation.

CTC segmentation
Non-autoregressive model based on Mask-CTC
ASR examples for supporting endangered language documentation (Please refer to egs/pueblanahuatl and egs/yoloxochitlmixtec for details)
Wav2Vec2.0 pre-trained model as Encoder, imported from FairSeq.
Self-supervised learning representations as features, using upstream models in S3PRL in frontend.
- Set frontend to s3prl
- Select any upstream model by setting the frontend_conf to the corresponding name.
Transfer learning:
- Easy usage and transfer from models previously trained by your group or from models in the ESPnet Hugging Face repository.
- Documentation and toy example runnable on colab.
Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
Restricted Self-Attention based on Longformer as an encoder for long sequences
OpenAI Whisper model, robust ASR based on large-scale, weakly-supervised multitask learning
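
The CTC/attention joint decoding listed above scores each hypothesis with two log-probabilities, one from the CTC branch and one from the attention decoder. A minimal sketch of that combination follows, assuming the usual weighted log-linear interpolation. The function names, the default weight of 0.3, and the toy scores are illustrative assumptions and not ESPnet's API; an actual decoder applies the combination to partial hypotheses inside beam search, whereas this sketch only reranks complete hypotheses.

```python
def joint_score(att_logp, ctc_logp, ctc_weight=0.3):
    """Interpolate attention and CTC log-probabilities.

    ctc_weight = 0.0 uses the attention score only,
    ctc_weight = 1.0 uses the CTC score only.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp


def rank_hypotheses(hyps, ctc_weight=0.3):
    """Sort (text, att_logp, ctc_logp) tuples from best to worst joint score."""
    return sorted(
        hyps,
        key=lambda h: joint_score(h[1], h[2], ctc_weight),
        reverse=True,
    )


# Toy example (made-up scores): the attention decoder prefers the first
# hypothesis, but CTC penalizes it heavily, e.g. for a non-monotonic alignment.
hyps = [("a b c", -1.0, -5.0), ("a b", -1.5, -2.0)]
best_text = rank_hypotheses(hyps, ctc_weight=0.5)[0][0]
```

With `ctc_weight=0.5` the second hypothesis wins; with `ctc_weight=0.0` the ranking falls back to the attention score alone.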

Demonstration
- Real-time ASR demo with ESPnet2
- Gradio Web Demo on Hugging Face Spaces. Check out the Web Demo
- Streaming Transformer ASR Local Demo with ESPnet2.

TTS: Text-to-speech

Architecture
- Tacotron2
- Transformer-TTS
- FastSpeech
- FastSpeech2
- Conformer FastSpeech & FastSpeech2
- VITS
- JETS
Multi-speaker & multi-language extension
- Pre-trained speaker embedding (e.g., X-vector)
- Speaker ID embedding
- Language ID embedding
- Global style token (GST) embedding
- Mix of the above embeddings
End-to-end training
- End-to-end text-to-wav model (e.g., VITS, JETS, etc.)
- Joint training of text2mel and vocoder
Various language support
- En / Jp / Zh / De / Ru / and more...
Integration with neural vocoders
- Parallel WaveGAN
- MelGAN
- Multi-band MelGAN
- HiFiGAN
- StyleMelGAN
- Mix of the above models

Demonstration
- Real-time TTS demo with ESPnet2
- Integrated into Hugging Face Spaces with Gradio.

To train the neural vocoder, please check the following repositories:
- kan-bayashi/ParallelWaveGAN
- r9y9/wavenet_vocoder

SE: Speech enhancement (and separation)

Single-speaker speech enhancement
Multi-speaker speech separation
Unified encoder-separator-decoder structure for time-domain and frequency-domain models
- Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
- Separators: BLSTM, Transformer, Conformer, TasNet, DPRNN, SkiM, SVoice, DC-CRN, DCCRN, Deep Clustering, Deep Attractor Network, FaSNet, iFaSNet, Neural Beamformers, etc.
Flexible ASR integration: working as an individual task or as the ASR frontend
Easy to import pre-trained models from Asteroid
- Both the pre-trained models from Asteroid and the specific configuration are supported.

Demonstration
- Interactive SE demo with ESPnet2
- Streaming SE demo with ESPnet2

ST: Speech Translation & MT: Machine Translation

State-of-the-art performance in several ST benchmarks (comparable/superior to cascaded ASR and MT)
Transformer-based end-to-end ST (new!)
Transformer-based end-to-end MT (new!)

VC: Voice conversion

Transformer and Tacotron2-based parallel VC using Mel spectrogram
End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

SLU: Spoken Language Understanding

Architecture
- Transformer-based Encoder
- Conformer-based Encoder
- Branchformer based Encoder
- E-Branchformer based Encoder
- RNN based Decoder
- Transformer-based Decoder
Support Multitasking with ASR
- Predict both intent and ASR transcript
Support Multitasking with NLU
- Deliberation encoder based 2 pass model
Support using pre-trained ASR models
- Hubert
- Wav2vec2
- VQ-APC
- TERA
- And more...
Support using pre-trained NLP models
- BERT
- MPNet
- And more...
Various language support
- En / Jp / Zh / Nl / and more...
Supports using context from previous utterances
Supports using other tasks like SE in a pipeline manner
Supports Two Pass SLU that combines audio and ASR transcript

Demonstration
- Performing noisy spoken language understanding using a speech enhancement model followed by a spoken language understanding model.
- Performing two-pass spoken language understanding where the second pass model attends to both acoustic and semantic information.
- Integrated into Hugging Face Spaces with Gradio, with an SLU demo in multiple languages.

SUM: Speech Summarization

End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [Sharma et al., 2022]

SVS: Singing Voice Synthesis

Framework merge from Muskits
Architecture
- RNN-based non-autoregressive model
- Xiaoice
- Tacotron-singing
- DiffSinger (in progress)
- VISinger
- VISinger 2 (and its variations with different vocoder architectures)
Support multi-speaker & multilingual singing synthesis
- Speaker ID embedding
- Language ID embedding
Various language support
- Jp / En / Kr / Zh
Tight integration with neural vocoders (the same as TTS)

SSL: Self-supervised Learning

Support HuBERT Pre-training:
- Example recipe: egs2/LibriSpeech/ssl1

UASR: Unsupervised ASR (EURO: ESPnet Unsupervised Recognition - Open-source)

Architecture
- wav2vec-U (with different self-supervised models)
- wav2vec-U 2.0 (in progress)
Support PrefixBeamSearch and K2-based WFST decoding

S2T: Speech-to-text with Whisper-style multilingual multitask models

Reproduces Whisper-style training from scratch using public data: OWSM
Supports multiple tasks in a single model
- Multilingual speech recognition
- Any-to-any speech translation
- Language identification
- Utterance-level timestamp prediction (segmentation)

DNN Framework

Flexible network architecture thanks to Chainer and PyTorch
Flexible front-end processing thanks to kaldiio and HDF5 support
Tensorboard-based monitoring
DeepSpeed-based large-scale training

ESPnet2

See ESPnet2.

Independent from Kaldi/Chainer, unlike ESPnet1
On-the-fly feature extraction and text processing when training
Supporting both DistributedDataParallel and DataParallel
Supporting multiple nodes training and integrated with Slurm or MPI
Supporting Sharded Training provided by fairscale
A template recipe that can be applied to all corpora
Possible to train any size of corpus without CPU memory error
ESPnet Model Zoo
Integrated with wandb

Installation

If you intend to do full experiments, including DNN training, then see Installation.
If you just need the Python module only:

```sh
# We recommend you install PyTorch before installing espnet following https://pytorch.org/get-started/locally/
pip install espnet

# To install the latest
pip install git+https://github.com/espnet/espnet

# To install additional packages
pip install "espnet[all]"
```
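
After a pip-only install, it can help to confirm which version of the package the current Python environment actually sees. The following is a small sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of ESPnet.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# None means the distribution is not installed in this environment.
espnet_version = installed_version("espnet")
print("espnet:", espnet_version if espnet_version else "not installed")
```

The same helper can be pointed at `torch` to check that PyTorch was installed first, as recommended above.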

If you use ESPnet1, please install chainer and cupy.

```sh
pip install chainer==6.0.0 cupy==6.0.0    # [Option]
```

You might need to install some packages depending on each task. We prepared various installation scripts at `tools/installers`.
(ESPnet2) Once installed, run `wandb login` and set `--use_wandb true` to enable tracking runs using W&B.

Docker Container

Go to `docker/` and follow the instructions.

Contribution

Thank you for taking the time to contribute to ESPnet! Any contributions are welcome, and feel free to post questions or requests in the issues. If this is your first ESPnet contribution, please follow the contribution guide.

ASR results

We list the character error rate (CER) and word error rate (WER) of major ASR tasks.

| Task | CER (%) | WER (%) | Pre-trained model |
| --- | :---: | :---: | :---: |
| Aishell dev/test | 4.6/5.1 | N/A | [link](https://github.com/espnet/espnet/blob/master/egs/aishell/asr1/RESULTS.md#conformer-kernel-size--15--specaugment--lm-weight--00-result) |
| **ESPnet2** Aishell dev/test | 4.1/4.4 | N/A | [link](https://github.com/espnet/espnet/tree/master/egs2/aishell/asr1#branchformer-initial) |
| Common Voice dev/test | 1.7/1.8 | 2.2/2.3 | [link](https://github.com/espnet/espnet/blob/master/egs/commonvoice/asr1/RESULTS.md#first-results-default-pytorch-transformer-setting-with-bpe-100-epochs-single-gpu) |
| CSJ eval1/eval2/eval3 | 5.7/3.8/4.2 | N/A | [link](https://github.com/espnet/espnet/blob/master/egs/csj/asr1/RESULTS.md#pytorch-backend-transformer-without-any-hyperparameter-tuning) |
| **ESPnet2** CSJ eval1/eval2/eval3 | 4.5/3.3/3.6 | N/A | [link](https://github.com/espnet/espnet/tree/master/egs2/csj/asr1#initial-conformer-results) |
| **ESPnet2** GigaSpeech dev/test | N/A | 10.6/10.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/gigaspeech/asr1#e-branchformer) |
| HKUST dev | 23.5 | N/A | [link](https://github.com/espnet/espnet/blob/master/egs/hkust/asr1/RESULTS.md#transformer-only-20-epochs) |
| **ESPnet2** HKUST dev | 21.2 | N/A | [link](https://github.com/espnet/espnet/tree/master/egs2/hkust/asr1#transformer-asr--transformer-lm) |
| Librispeech dev_clean/dev_other/test_clean/test_other | N/A | 1.9/4.9/2.1/4.9 | [link](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-conformer-with-specaug--speed-perturbation-8-gpus--transformer-lm-4-gpus) |
| **ESPnet2** Librispeech dev_clean/dev_other/test_clean/test_other | 0.6/1.5/0.6/1.4 | 1.7/3.4/1.8/3.6 | [link](https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1#self-supervised-learning-features-hubert_large_ll60k-conformer-utt_mvn-with-transformer-lm) |
| Switchboard (eval2000) callhm/swbd | N/A | 14.0/6.8 | [link](https://github.com/espnet/espnet/blob/master/egs/swbd/asr1/RESULTS.md#conformer-with-bpe-2000-specaug-speed-perturbation-transformer-lm-decoding) |
| **ESPnet2** Switchboard (eval2000) callhm/swbd | N/A | 13.4/7.3 | [link](https://github.com/espnet/espnet/tree/master/egs2/swbd/asr1#e-branchformer) |
| TEDLIUM2 dev/test | N/A | 8.6/7.2 | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium2/asr1/RESULTS.md#conformer-large-model--specaug--speed-perturbation--rnnlm) |
| **ESPnet2** TEDLIUM2 dev/test | N/A | 7.3/7.1 | [link](https://github.com/espnet/espnet/blob/master/egs2/tedlium2/asr1/README.md#e-branchformer-12-encoder-layers) |
| TEDLIUM3 dev/test | N/A | 9.6/7.6 | [link](https://github.com/espnet/espnet/blob/master/egs/tedlium3/asr1/RESULTS.md) |
| WSJ dev93/eval92 | 3.2/2.1 | 7.0/4.7 | N/A |
| **ESPnet2** WSJ dev93/eval92 | 1.1/0.8 | 2.8/1.8 | [link](https://github.com/espnet/espnet/tree/master/egs2/wsj/asr1#self-supervised-learning-features-wav2vec2_large_ll60k-conformer-utt_mvn-with-transformer-lm) |

Note that the performance of the CSJ, HKUST, and Librispeech tasks was significantly improved by using the wide network (#units = 1024) and, if necessary, large subword units, as reported by [RWTH](https://arxiv.org/pdf/1805.03294.pdf).

If you want to check the results of the other recipes, please check `egs//asr1/RESULTS.md`.
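
For readers unfamiliar with the two metrics, both are an edit distance between the reference and the hypothesis, divided by the reference length. The sketch below is a plain-Python illustration under simple assumptions (whitespace tokenization for WER, spaces dropped for CER, one utterance at a time); it is not the scoring script used by the recipes, and the function names are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference token
                    curr[j - 1] + 1,     # insertion of a hypothesis token
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length, in percent."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate with whitespace tokenization (an assumption)."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate, ignoring spaces (an assumption)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
```

For example, `wer("the cat sat", "the cat sit")` counts one substitution over three reference words. Because insertions are counted, either rate can exceed 100%.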

ASR demo

You can recognize speech in a WAV file using pre-trained models. Go to a recipe directory and run `utils/recog_wav.sh` as follows:

```sh
# go to the recipe directory and source path of espnet tools
cd egs/tedlium2/asr1 && . ./path.sh
# let's recognize speech!
recog_wav.sh --models tedlium2.transformer.v1 example.wav
```

where `example.wav` is a WAV file to be recognized. The sampling rate must be consistent with that of data used in training.

Available pre-trained models in the demo script are listed below.

| Model | Notes |
| :--- | :--- |
| [tedlium2.rnn.v1](https://drive.google.com/open?id=1UqIY6WJMZ4sxNxSugUqp3mrGb3j6h7xe) | Streaming decoding based on CTC-based VAD |
| [tedlium2.rnn.v2](https://drive.google.com/open?id=1cac5Uc09lJrCYfWkLQsF8eapQcxZnYdf) | Streaming decoding based on CTC-based VAD (batch decoding) |
| [tedlium2.transformer.v1](https://drive.google.com/open?id=1cVeSOYY1twOfL9Gns7Z3ZDnkrJqNwPow) | Joint-CTC attention Transformer trained on Tedlium 2 |
| [tedlium3.transformer.v1](https://drive.google.com/open?id=1zcPglHAKILwVgfACoMWWERiyIquzSYuU) | Joint-CTC attention Transformer trained on Tedlium 3 |
| [librispeech.transformer.v1](https://drive.google.com/open?id=1BtQvAnsFvVi-dp_qsaFP7n4A_5cwnlR6) | Joint-CTC attention Transformer trained on Librispeech |
| [commonvoice.transformer.v1](https://drive.google.com/open?id=1tWccl6aYU67kbtkm8jv5H6xayqg1rzjh) | Joint-CTC attention Transformer trained on CommonVoice |
| [csj.transformer.v1](https://drive.google.com/open?id=120nUQcSsKeY5dpyMWw_kI33ooMRGT2uF) | Joint-CTC attention Transformer trained on CSJ |
| [csj.rnn.v1](https://drive.google.com/open?id=1ALvD4nHan9VDJlYJwNurVr7H7OV0j2X9) | Joint-CTC attention VGGBLSTM trained on CSJ |
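
Since the sampling rate of the input must match the training data, it may be worth checking a file before passing it to the demo script. Here is a minimal sketch using Python's standard `wave` module; the helper names and the 16000 Hz value in the usage note are assumptions for illustration, so look up the rate actually used by the model you pick.

```python
import wave


def wav_sample_rate(path):
    """Read the sampling rate (Hz) from the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def check_sample_rate(path, expected_hz):
    """Return the file's rate, or raise ValueError if it differs from expected_hz."""
    actual_hz = wav_sample_rate(path)
    if actual_hz != expected_hz:
        raise ValueError(
            f"{path}: sampling rate is {actual_hz} Hz, expected {expected_hz} Hz; "
            "resample the file before recognition"
        )
    return actual_hz
```

A call such as `check_sample_rate("example.wav", 16000)` would raise if `example.wav` was recorded at a different rate. The `wave` module only reads uncompressed PCM WAV files.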

SE results

We list results from three different models on WSJ0-2mix, which is one of the most widely used benchmark datasets for speech separation.

| Model | STOI | SAR | SDR | SIR |
| --- | --- | --- | --- | --- |
| [TF Masking](https://zenodo.org/record/4498554) | 0.89 | 11.40 | 10.24 | 18.04 |
| [Conv-Tasnet](https://zenodo.org/record/4498562) | 0.95 | 16.62 | 15.94 | 25.90 |
| [DPRNN-Tasnet](https://zenodo.org/record/4688000) | 0.96 | 18.82 | 18.29 | 28.92 |
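
As a rough intuition for the SDR column, a signal-to-distortion ratio compares the energy of the reference signal with the energy of the estimation error, in decibels. The sketch below implements only this simplified energy-ratio form on plain Python lists. It is an assumption-laden illustration, not the evaluation procedure behind the table, which may decompose the error into interference and artifact terms (hence the separate SIR and SAR columns).

```python
import math


def sdr_db(reference, estimate):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy signal: the estimate is the reference scaled by 0.9.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
toy_sdr = sdr_db(reference, estimate)
```

Higher values mean the estimate is closer to the reference.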

SE demos

You can try the interactive demo with Google Colab. Please click the following button to get access to the demos.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing)

It is based on ESPnet2. Pre-trained models are available for both speech enhancement and speech separation tasks.

Speech separation streaming demos:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17vd1V78eJpp3PHBnbFE5aVY5uMxQFL6o?usp=sharing)

ST results

We list 4-gram BLEU of major ST tasks.

#### end-to-end system

| Task | BLEU | Pre-trained model |
| --- | :---: | :---: |
| Fisher-CallHome Spanish fisher_test (Es->En) | 51.03 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 20.44 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/st1/RESULTS.md#train_spen_lcrm_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans) |
| Libri-trans test (En->Fr) | 16.70 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/st1/RESULTS.md#train_spfr_lc_pytorch_train_pytorch_transformer_bpe_short_long_bpe1000_specaug_asrtrans_mttrans-1) |
| How2 dev5 (En->Pt) | 45.68 | [link](https://github.com/espnet/espnet/blob/master/egs/how2/st1/RESULTS.md#trainpt_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans-1) |
| Must-C tst-COMMON (En->De) | 22.91 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/st1/RESULTS.md#train_spen-dede_tc_pytorch_train_pytorch_transformer_short_long_bpe8000_specaug_asrtrans_mttrans) |
| Mboshi-French dev (Fr->Mboshi) | 6.18 | N/A |

#### cascaded system

| Task | BLEU | Pre-trained model |
| --- | :---: | :---: |
| Fisher-CallHome Spanish fisher_test (Es->En) | 42.16 | N/A |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 19.82 | N/A |
| Libri-trans test (En->Fr) | 16.96 | N/A |
| How2 dev5 (En->Pt) | 44.90 | N/A |
| Must-C tst-COMMON (En->De) | 23.65 | N/A |

If you want to check the results of the other recipes, please check `egs//st1/RESULTS.md`.
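
To make "4-gram BLEU" concrete: the score is the geometric mean of clipped n-gram precisions for n = 1 to 4, multiplied by a brevity penalty for hypotheses shorter than the reference. The following is a simplified sketch that assumes whitespace tokenization, a single reference per hypothesis, and no smoothing. Scores in the tables come from the recipes' own scoring setup, so this function (a hypothetical `corpus_bleu` written for this example) is not expected to reproduce them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts = ngrams(ref_tok, n)
            hyp_counts = ngrams(hyp_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Note that the statistics are pooled over the whole test set before the precisions are computed, which is why BLEU is usually reported per corpus rather than averaged over sentences.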

ST demo

(**New!**) We made a new real-time E2E-ST + TTS demonstration in Google Colab. Please access the notebook from the following button and enjoy the real-time speech-to-speech translation!

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/st_demo.ipynb)

---

You can translate speech in a WAV file using pre-trained models. Go to a recipe directory and run `utils/translate_wav.sh` as follows:

```sh
# Go to recipe directory and source path of espnet tools
cd egs/fisher_callhome_spanish/st1 && . ./path.sh
# download example wav file
wget -O - https://github.com/espnet/espnet/files/4100928/test.wav.tar.gz | tar zxvf -
# let's translate speech!
translate_wav.sh --models fisher_callhome_spanish.transformer.v1.es-en test.wav
```

where `test.wav` is a WAV file to be translated. The sampling rate must be consistent with that of data used in training.

Available pre-trained models in the demo script are listed below.

| Model | Notes |
| :--- | :--- |
| [fisher_callhome_spanish.transformer.v1](https://drive.google.com/open?id=1hawp5ZLw4_SIHIT3edglxbKIIkPVe8n3) | Transformer-ST trained on Fisher-CallHome Spanish Es->En |

MT results

| Task | BLEU | Pre-trained model |
| --- | :---: | :---: |
| Fisher-CallHome Spanish fisher_test (Es->En) | 61.45 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |
| Fisher-CallHome Spanish callhome_evltest (Es->En) | 29.86 | [link](https://github.com/espnet/espnet/blob/master/egs/fisher_callhome_spanish/mt1/RESULTS.md#trainen_lcrm_lcrm_pytorch_train_pytorch_transformer_bpe_bpe1000) |
| Libri-trans test (En->Fr) | 18.09 | [link](https://github.com/espnet/espnet/blob/master/egs/libri_trans/mt1/RESULTS.md#trainfr_lcrm_tc_pytorch_train_pytorch_transformer_bpe1000) |
| How2 dev5 (En->Pt) | 58.61 | [link](https://github.com/espnet/espnet/blob/master/egs/how2/mt1/RESULTS.md#trainpt_tc_tc_pytorch_train_pytorch_transformer_bpe8000) |
| Must-C tst-COMMON (En->De) | 27.63 | [link](https://github.com/espnet/espnet/blob/master/egs/must_c/mt1/RESULTS.md#summary-4-gram-bleu) |
| IWSLT'14 test2014 (En->De) | 24.70 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
| IWSLT'14 test2014 (De->En) | 29.22 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
| IWSLT'14 test2014 (De->En) | 32.2 | [link](https://github.com/espnet/espnet/blob/master/egs2/iwslt14/mt1/README.md) |
| IWSLT'16 test2014 (En->De) | 24.05 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |
| IWSLT'16 test2014 (De->En) | 29.13 | [link](https://github.com/espnet/espnet/blob/master/egs/iwslt16/mt1/RESULTS.md#result) |

TTS results

ESPnet2

You can listen to the generated samples at the following URL.

- [ESPnet2 TTS generated samples](https://drive.google.com/drive/folders/1H3fnlBbWMEkQUfrHqosKN_ZX_WjO29ma?usp=sharing)

> Note that in the generation, we use Griffin-Lim (`wav/`) and [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) (`wav_pwg/`).

You can download pre-trained models via `espnet_model_zoo`.

- [ESPnet model zoo](https://github.com/espnet/espnet_model_zoo)
- [Pre-trained model list](https://github.com/espnet/espnet_model_zoo/blob/master/espnet_model_zoo/table.csv)

You can download pre-trained vocoders via `kan-bayashi/ParallelWaveGAN`.

- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN)
- [Pre-trained vocoder list](https://github.com/kan-bayashi/ParallelWaveGAN#results)

ESPnet1

> NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest results in the ESPnet2 results above.

You can listen to our samples on the demo page [espnet-tts-sample](https://espnet.github.io/espnet-tts-sample/). Here we list some notable ones:

- [Single English speaker Tacotron2](https://drive.google.com/open?id=18JgsOCWiP_JkhONasTplnHS7yaF_konr)
- [Single Japanese speaker Tacotron2](https://drive.google.com/open?id=1fEgS4-K4dtgVxwI4Pr7uOA1h4PE-zN7f)
- [Single other language speaker Tacotron2](https://drive.google.com/open?id=1q_66kyxVZGU99g8Xb5a0Q8yZ1YVm2tN0)
- [Multi English speaker Tacotron2](https://drive.google.com/open?id=18S_B8Ogogij34rIfJOeNF8D--uG7amz2)
- [Single English speaker Transformer](https://drive.google.com/open?id=14EboYVsMVcAq__dFP1p6lyoZtdobIL1X)
- [Single English speaker FastSpeech](https://drive.google.com/open?id=1PSxs1VauIndwi8d5hJmZlppGRVu2zuy5)
- [Multi English speaker Transformer](https://drive.google.com/open?id=1_vrdqjM43DdN1Qz7HJkvMQ6lCMmWLeGp)
- [Single Italian speaker FastSpeech](https://drive.google.com/open?id=13I5V2w7deYFX4DlVk1-0JfaXmUR2rNOv)
- [Single Mandarin speaker Transformer](https://drive.google.com/open?id=1mEnZfBKqA4eT6Bn0eRZuP6lNzL-IL3VD)
- [Single Mandarin speaker FastSpeech](https://drive.google.com/open?id=1Ol_048Tuy6BgvYm1RpjhOX4HfhUeBqdK)
- [Multi Japanese speaker Transformer](https://drive.google.com/open?id=1fFMQDF6NV5Ysz48QLFYE8fEvbAxCsMBw)
- [Single English speaker models with Parallel WaveGAN](https://drive.google.com/open?id=1HvB0_LDf1PVinJdehiuCt5gWmXGguqtx)
- [Single English speaker knowledge distillation-based FastSpeech](https://drive.google.com/open?id=1wG-Y0itVYalxuLAHdkAHO7w1CWFfRPF4)

You can download all of the pre-trained models and generated samples:

- [All of the pre-trained E2E-TTS models](https://drive.google.com/open?id=1k9RRyc06Zl0mM2A7mi-hxNiNMFb_YzTF)
- [All of the generated samples](https://drive.google.com/open?id=1bQGuqH92xuxOX__reWLP4-cif0cbpMLX)

Note that in the generated samples, we use the following vocoders: Griffin-Lim (**GL**), WaveNet vocoder (**WaveNet**), Parallel WaveGAN (**ParallelWaveGAN**), and MelGAN (**MelGAN**). The neural vocoders are based on the following repositories.

- [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN): Parallel WaveGAN / MelGAN / Multi-band MelGAN
- [r9y9/wavenet_vocoder](https://github.com/r9y9/wavenet_vocoder): 16 bit mixture of Logistics WaveNet vocoder
- [kan-bayashi/PytorchWaveNetVocoder](https://github.com/kan-bayashi/PytorchWaveNetVocoder): 8 bit Softmax WaveNet Vocoder with the noise shaping

If you want to build your own neural vocoder, please check the above repositories. [kan-bayashi/ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) provides [the manual](https://github.com/kan-bayashi/ParallelWaveGAN#decoding-with-espnet-tts-models-features) on how to decode ESPnet-TTS model features with neural vocoders.

Here we list all of the pre-trained neural vocoders. Please download and enjoy the generation of high-quality speech!

| Model link | Lang | Fs [Hz] | Mel range [Hz] | FFT / Shift / Win [pt] | Model type |
| :--- | :---: | :---: | :---: | :---: | :--- |
| [ljspeech.wavenet.softmax.ns.v1](https://drive.google.com/open?id=1eA1VcRS9jzFa-DovyTgJLQ_jmwOLIi8L) | EN | 22.05k | None | 1024 / 256 / None | [Softmax WaveNet](https://github.com/kan-bayashi/PytorchWaveNetVocoder) |
| [ljspeech.wavenet.mol.v1](https://drive.google.com/open?id=1sY7gEUg39QaO1szuN62-Llst9TrFno2t) | EN | 22.05k | None | 1024 / 256 / None | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [ljspeech.parallel_wavegan.v1](https://drive.google.com/open?id=1tv9GKyRT4CDsvUWKwH3s_OfXkiTi0gw7) | EN | 22.05k | None | 1024 / 256 / None | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [ljspeech.wavenet.mol.v2](https://drive.google.com/open?id=1es2HuKUeKVtEdq6YDtAsLNpqCy4fhIXr) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [ljspeech.parallel_wavegan.v2](https://drive.google.com/open?id=1Grn7X9wD35UcDJ5F7chwdTqTa4U7DeVB) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [ljspeech.melgan.v1](https://drive.google.com/open?id=1ipPWYl8FBNRlBFaKj1-i23eQpW_W_YcR) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [ljspeech.melgan.v3](https://drive.google.com/open?id=1_a8faVA5OGCzIcJNw4blQYjfG4oA9VEt) | EN | 22.05k | 80-7600 | 1024 / 256 / None | [MelGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [libritts.wavenet.mol.v1](https://drive.google.com/open?id=1jHUUmQFjWiQGyDd7ZeiCThSjjpbF_B4h) | EN | 24k | None | 1024 / 256 / None | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [jsut.wavenet.mol.v1](https://drive.google.com/open?id=187xvyNbmJVZ0EZ1XHCdyjZHTXK9EcfkK) | JP | 24k | 80-7600 | 2048 / 300 / 1200 | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [jsut.parallel_wavegan.v1](https://drive.google.com/open?id=1OwrUQzAmvjj1x9cDhnZPp6dqtsEqGEJM) | JP | 24k | 80-7600 | 2048 / 300 / 1200 | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |
| [csmsc.wavenet.mol.v1](https://drive.google.com/open?id=1PsjFRV5eUP0HHwBaRYya9smKy5ghXKzj) | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | [MoL WaveNet](https://github.com/r9y9/wavenet_vocoder) |
| [csmsc.parallel_wavegan.v1](https://drive.google.com/open?id=10M6H88jEUGbRWBmU1Ff2VaTmOAeL8CEy) | ZH | 24k | 80-7600 | 2048 / 300 / 1200 | [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) |

If you want to use the above pre-trained vocoders, make sure your feature settings match theirs exactly.

TTS demo

ESPnet2

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis!

- Real-time TTS demo with ESPnet2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)

English, Japanese, and Mandarin models are available in the demo.

ESPnet1

> NOTE: We are moving to ESPnet2-based development for TTS. Please check the latest demo in the ESPnet2 section above.

You can try the real-time demo in Google Colab. Please access the notebook from the following button and enjoy the real-time synthesis.

- Real-time TTS demo with ESPnet1 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)

We also provide a shell script to perform synthesis. Go to a recipe directory and run `utils/synth_wav.sh` as follows:

```sh
# Go to recipe directory and source path of espnet tools
cd egs/ljspeech/tts1 && . ./path.sh
# We use an upper-case char sequence for the default model.
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example.txt
# let's synthesize speech!
synth_wav.sh example.txt

# Also, you can use multiple sentences
echo "THIS IS A DEMONSTRATION OF TEXT TO SPEECH." > example_multi.txt
echo "TEXT TO SPEECH IS A TECHNIQUE TO CONVERT TEXT INTO SPEECH." >> example_multi.txt
synth_wav.sh example_multi.txt
```

You can change the pre-trained model as follows:

```sh
synth_wav.sh --models ljspeech.fastspeech.v1 example.txt
```

Waveform synthesis is performed with the Griffin-Lim algorithm and neural vocoders (WaveNet and ParallelWaveGAN). You can change the pre-trained vocoder model as follows:

```sh
synth_wav.sh --vocoder_models ljspeech.wavenet.mol.v1 example.txt
```

The WaveNet vocoder provides very high-quality speech, but generation is slow.

See more details or available models via `--help`.

```sh
synth_wav.sh --help
```

VC results

- Transformer and Tacotron2-based VC

  You can listen to some samples on the [demo webpage](https://unilight.github.io/Publication-Demos/publications/transformer-vc/).

- Cascade ASR+TTS as one of the baseline systems of VCC2020

  The [Voice Conversion Challenge 2020](http://www.vc-challenge.org/) (VCC2020) adopts ESPnet to build an end-to-end based baseline system. In VCC2020, the objective is intra/cross-lingual nonparallel VC. You can download converted samples of the cascade ASR+TTS baseline system [here](https://drive.google.com/drive/folders/1oeZo83GrOgtqxGwF7KagzIrfjr8X59Ue?usp=sharing).

SLU results

We list the performance on various SLU tasks and datasets using the metric reported in the original dataset paper.

| Task | Dataset | Metric | Result | Pre-trained Model |
| --- | :---: | :---: | :---: | :---: |
| Intent Classification | SLURP | Acc | 86.3 | [link](https://github.com/espnet/espnet/tree/master/egs2/slurp/asr1/README.md) |
| Intent Classification | FSC | Acc | 99.6 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc/asr1/README.md) |
| Intent Classification | FSC Unseen Speaker Set | Acc | 98.6 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_unseen/asr1/README.md) |
| Intent Classification | FSC Unseen Utterance Set | Acc | 86.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_unseen/asr1/README.md) |
| Intent Classification | FSC Challenge Speaker Set | Acc | 97.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_challenge/asr1/README.md) |
| Intent Classification | FSC Challenge Utterance Set | Acc | 78.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_challenge/asr1/README.md) |
| Intent Classification | SNIPS | F1 | 91.7 | [link](https://github.com/espnet/espnet/tree/master/egs2/snips/asr1/README.md) |
| Intent Classification | Grabo (Nl) | Acc | 97.2 | [link](https://github.com/espnet/espnet/tree/master/egs2/grabo/asr1/README.md) |
| Intent Classification | CAT SLU MAP (Zn) | Acc | 78.9 | [link](https://github.com/espnet/espnet/tree/master/egs2/catslu/asr1/README.md) |
| Intent Classification | Google Speech Commands | Acc | 98.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/speechcommands/asr1/README.md) |
| Slot Filling | SLURP | SLU-F1 | 71.9 | [link](https://github.com/espnet/espnet/tree/master/egs2/slurp_entity/asr1/README.md) |
| Dialogue Act Classification | Switchboard | Acc | 67.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/swbd_da/asr1/README.md) |
| Dialogue Act Classification | Jdcinal (Jp) | Acc | 67.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/jdcinal/asr1/README.md) |
| Emotion Recognition | IEMOCAP | Acc | 69.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/iemocap/asr1/README.md) |
| Emotion Recognition | swbd_sentiment | Macro F1 | 61.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/swbd_sentiment/asr1/README.md) |
| Emotion Recognition | slue_voxceleb | Macro F1 | 44.0 | [link](https://github.com/espnet/espnet/tree/master/egs2/slue-voxceleb/asr1/README.md) |

If you want to check the results of the other recipes, please check `egs2//asr1/RESULTS.md`.
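Some rows above report Macro F1 rather than accuracy. As a reminder of how the two differ, the following sketch computes both from label lists in plain Python. The function names and toy labels are illustrative assumptions, and the recipes' own scoring scripts may differ in detail (for example, in which classes are averaged over).

```python
from collections import Counter


def accuracy(refs, hyps):
    """Fraction of utterances whose predicted label equals the reference."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


def macro_f1(refs, hyps):
    """Unweighted mean of per-class F1 over the classes seen in the references."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for r, h in zip(refs, hyps):
        if r == h:
            tp[r] += 1
        else:
            fp[h] += 1  # hypothesis class gets a false positive
            fn[r] += 1  # reference class gets a false negative
    scores = []
    for label in set(refs):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced example: the system always predicts "pos"
refs = ["pos", "pos", "pos", "neg"]
hyps = ["pos", "pos", "pos", "pos"]
print(accuracy(refs, hyps))  # 0.75
print(macro_f1(refs, hyps))  # 3/7, i.e. about 0.43
```

On imbalanced label sets, Macro F1 penalizes a system that ignores the minority class, which accuracy alone does not show.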

CTC Segmentation demo

ESPnet1

[CTC segmentation](https://arxiv.org/abs/2007.09127) determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As a demo, we align the start and end of utterances within the audio file `ctc_align_test.wav`, using the example script `utils/asr_align_wav.sh`. For preparation, set up a data directory:

```sh
cd egs/tedlium2/align1/
# data directory
align_dir=data/demo
mkdir -p ${align_dir}
# wav file
base=ctc_align_test
wav=../../../test_utils/${base}.wav
# recipe files
echo "batchsize: 0" > ${align_dir}/align.yaml

cat << EOF > ${align_dir}/utt_text
${base} THE SALE OF THE HOTELS
${base} IS PART OF HOLIDAY'S STRATEGY
${base} TO SELL OFF ASSETS
${base} AND CONCENTRATE
${base} ON PROPERTY MANAGEMENT
EOF
```

Here, `utt_text` is the file containing the list of utterances. Choose a pre-trained ASR model that includes a CTC layer to find utterance segments:

```sh
# pre-trained ASR model
model=wsj.transformer_small.v1
mkdir ./conf && cp ../../wsj/asr1/conf/no_preprocess.yaml ./conf

../../../utils/asr_align_wav.sh \
    --models ${model} \
    --align_dir ${align_dir} \
    --align_config ${align_dir}/align.yaml \
    ${wav} ${align_dir}/utt_text
```

Segments are written to `aligned_segments` as a list of file/utterance names, utterance start and end times in seconds, and a confidence score. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

```sh
min_confidence_score=-5
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${align_dir}/aligned_segments
```

The demo script `utils/asr_align_wav.sh` uses a pre-trained ASR model (see the list above for more models). For aligning large audio files, it is recommended to use models with RNN-based encoders (such as BLSTMP) rather than Transformer models, which consume a lot of memory on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed. A full example recipe is in `egs/tedlium2/align1/`.
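The same filtering can be done in Python when the segments need further processing. This is a minimal sketch that assumes the line layout implied by the `awk` command above, with the confidence score in the fifth whitespace-separated field; `filter_segments` is a hypothetical helper and the sample lines are made up for illustration.

```python
def filter_segments(lines, min_confidence_score=-5.0):
    """Keep segment lines whose fifth field (the confidence score) exceeds the threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        if float(fields[4]) > min_confidence_score:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up example lines: <utterance> <file> <start> <end> <score>
example = [
    "utt_0000 ctc_align_test 0.26 1.73 -0.0154",
    "utt_0001 ctc_align_test 1.73 3.19 -6.2000",
]
print(filter_segments(example))
# ['utt_0000 ctc_align_test 0.26 1.73 -0.0154']
```

To filter a real file, pass an open file handle, for example `filter_segments(open("data/demo/aligned_segments"))`.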

ESPnet2

[CTC segmentation](https://arxiv.org/abs/2007.09127) determines utterance segments within audio files. Aligned utterance segments constitute the labels of speech datasets.

As a demo, we align the start and end of utterances within the audio file `ctc_align_test.wav`. This can be done either directly from the Python command line or using the script `espnet2/bin/asr_align.py`.

From the Python command line interface:

```python
# load a model with character tokens
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader(cachedir="./modelcache")
wsjmodel = d.download_and_unpack("kamo-naoyuki/wsj")
# load the example file included in the ESPnet repository
import soundfile
speech, rate = soundfile.read("./test_utils/ctc_align_test.wav")
# CTC segmentation
from espnet2.bin.asr_align import CTCSegmentation
aligner = CTCSegmentation( **wsjmodel , fs=rate )
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""
segments = aligner(speech, text)
print(segments)
# utt1 utt 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 utt 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 utt 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 utt 4.20 6.10 -0.4899 AND CONCENTRATE ON PROPERTY MANAGEMENT
```

Aligning also works with fragments of the text. For this, set the `gratis_blank` option, which allows skipping unrelated audio sections without penalty. It is also possible to omit the utterance names at the beginning of each line by setting `kaldi_style_text` to False.

```python
aligner.set_config( gratis_blank=True, kaldi_style_text=False )
text = ["SALE OF THE HOTELS", "PROPERTY MANAGEMENT"]
segments = aligner(speech, text)
print(segments)
# utt_0000 utt 0.37 1.72 -2.0651 SALE OF THE HOTELS
# utt_0001 utt 4.70 6.10 -5.0566 PROPERTY MANAGEMENT
```

The script `espnet2/bin/asr_align.py` uses a similar interface. To align utterances:

```sh
# ASR model and config files from pre-trained model (e.g., from cachedir):
asr_config=/config.yaml
asr_model=/valid.*best.pth
# prepare the text file
wav="test_utils/ctc_align_test.wav"
text="test_utils/ctc_align_text.txt"
cat << EOF > ${text}
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE
utt5 ON PROPERTY MANAGEMENT
EOF
# obtain alignments:
python espnet2/bin/asr_align.py --asr_train_config ${asr_config} --asr_model_file ${asr_model} --audio ${wav} --text ${text}
# utt1 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS
# utt2 ctc_align_test 1.73 3.19 -0.7674 IS PART OF HOLIDAY'S STRATEGY
# utt3 ctc_align_test 3.19 4.20 -0.7433 TO SELL OFF ASSETS
# utt4 ctc_align_test 4.20 4.97 -0.6017 AND CONCENTRATE
# utt5 ctc_align_test 4.97 6.10 -0.3477 ON PROPERTY MANAGEMENT
```

The output of the script can be redirected to a `segments` file by adding the argument `--output segments`. Each line contains the file/utterance name, utterance start and end times in seconds, and a confidence score; optionally also the utterance text. The confidence score is a probability in log space that indicates how well the utterance was aligned. If needed, remove bad utterances:

```sh
min_confidence_score=-7
# here, we assume that the output was written to the file `segments`
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' segments
```

See the module documentation for more information. For aligning large audio files, it is recommended to use models with RNN-based encoders (such as BLSTMP) rather than Transformer models, which consume a lot of memory on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed.

This tool can also provide token-level segmentation information if the `text` file contains a list of tokens instead of a list of utterances. See the discussion in https://github.com/espnet/espnet/issues/4278#issuecomment-1100756463.
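Because the confidence score is described as a probability in log space, exponentiating it gives a value between 0 and 1 that can be easier to interpret. The sketch below parses one output line in the format shown above and derives the segment duration and that linear-scale value. `Segment` and `parse_segment` are hypothetical names, not part of ESPnet, and reading `exp(score)` as a confidence is an assumption based on the description above.

```python
import math
from typing import NamedTuple


class Segment(NamedTuple):
    utt_id: str
    file_id: str
    start: float  # seconds
    end: float    # seconds
    score: float  # confidence in log space, as printed by the aligner
    text: str     # optional utterance text ("" if absent)


def parse_segment(line):
    """Parse '<utt> <file> <start> <end> <score> [text...]' into a Segment."""
    utt_id, file_id, start, end, score, *words = line.split()
    return Segment(utt_id, file_id, float(start), float(end),
                   float(score), " ".join(words))


seg = parse_segment("utt1 ctc_align_test 0.26 1.73 -0.0154 THE SALE OF THE HOTELS")
print(round(seg.end - seg.start, 2))  # 1.47 (duration in seconds)
print(round(math.exp(seg.score), 3))  # 0.985
```

On this scale, the thresholds used in the `awk` filters correspond to very small values (`exp(-7)` is below 0.001), so they only discard clearly misaligned utterances.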

Citations

```
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
@inproceedings{hayashi2020espnet,
  title={{Espnet-TTS}: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Inoue, Katsuki and Yoshimura, Takenori and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Zhang, Yu and Tan, Xu},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7654--7658},
  year={2020},
  organization={IEEE}
}
@inproceedings{inaguma-etal-2020-espnet,
  title = "{ESP}net-{ST}: All-in-One Speech Translation Toolkit",
  author = "Inaguma, Hirofumi and Kiyono, Shun and Duh, Kevin and Karita, Shigeki and Yalta, Nelson and Hayashi, Tomoki and Watanabe, Shinji",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
  month = jul,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.acl-demos.34",
  pages = "302--311",
}
@article{hayashi2021espnet2,
  title={{ESP}net2-{TTS}: Extending the edge of {TTS} research},
  author={Hayashi, Tomoki and Yamamoto, Ryuichi and Yoshimura, Takenori and Wu, Peter and Shi, Jiatong and Saeki, Takaaki and Ju, Yooncheol and Yasuda, Yusuke and Takamichi, Shinnosuke and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2110.07840},
  year={2021}
}
@inproceedings{li2020espnet,
  title={{ESPnet-SE}: End-to-End Speech Enhancement and Separation Toolkit Designed for {ASR} Integration},
  author={Chenda Li and Jing Shi and Wangyou Zhang and Aswin Shanmugam Subramanian and Xuankai Chang and Naoyuki Kamo and Moto Hira and Tomoki Hayashi and Christoph Boeddeker and Zhuo Chen and Shinji Watanabe},
  booktitle={Proceedings of IEEE Spoken Language Technology Workshop (SLT)},
  pages={785--792},
  year={2021},
  organization={IEEE},
}
@inproceedings{arora2021espnet,
  title={{ESPnet-SLU}: Advancing Spoken Language Understanding through ESPnet},
  author={Arora, Siddhant and Dalmia, Siddharth and Denisov, Pavel and Chang, Xuankai and Ueda, Yushi and Peng, Yifan and Zhang, Yuekai and Kumar, Sujay and Ganesan, Karthik and Yan, Brian and others},
  booktitle={ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={7167--7171},
  year={2022},
  organization={IEEE}
}
@inproceedings{shi2022muskits,
  author={Shi, Jiatong and Guo, Shuai and Qian, Tao and Huo, Nan and Hayashi, Tomoki and Wu, Yuning and Xu, Frank and Chang, Xuankai and Li, Huazhe and Wu, Peter and Watanabe, Shinji and Jin, Qin},
  title={{Muskits}: an End-to-End Music Processing Toolkit for Singing Voice Synthesis},
  year={2022},
  booktitle={Proceedings of Interspeech},
  pages={4277-4281},
  url={https://www.isca-speech.org/archive/pdfs/interspeech2022/shi22dinterspeech.pdf}
}
@inproceedings{lu22c_interspeech,
  author={Yen-Ju Lu and Xuankai Chang and Chenda Li and Wangyou Zhang and Samuele Cornell and Zhaoheng Ni and Yoshiki Masuyama and Brian Yan and Robin Scheibler and Zhong-Qiu Wang and Yu Tsao and Yanmin Qian and Shinji Watanabe},
  title={{ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={5458--5462},
}
@inproceedings{gao2023euro,
  title={{EURO: ESP}net unsupervised {ASR} open-source toolkit},
  author={Gao, Dongji and Shi, Jiatong and Chuang, Shun-Po and Garcia, Leibny Paola and Lee, Hung-yi and Watanabe, Shinji and Khudanpur, Sanjeev},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
@inproceedings{peng2023reproducing,
  title={Reproducing {W}hisper-style training using an open-source toolkit and publicly available data},
  author={Peng, Yifan and Tian, Jinchuan and Yan, Brian and Berrebbi, Dan and Chang, Xuankai and Li, Xinjian and Shi, Jiatong and Arora, Siddhant and Chen, William and Sharma, Roshan and others},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
@inproceedings{sharma2023espnet,
  title={ESPnet-{SUMM}: Introducing a novel large dataset, toolkit, and a cross-corpora evaluation of speech summarization systems},
  author={Sharma, Roshan and Chen, William and Kano, Takatomo and Sharma, Ruchira and Arora, Siddhant and Watanabe, Shinji and Ogawa, Atsunori and Delcroix, Marc and Singh, Rita and Raj, Bhiksha},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
@article{jung2024espnet,
  title={{ESPnet-SPK}: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models},
  author={Jung, Jee-weon and Zhang, Wangyou and Shi, Jiatong and Aldeneh, Zakaria and Higuchi, Takuya and Theobald, Barry-John and Abdelaziz, Ahmed Hussen and Watanabe, Shinji},
  journal={Proc. Interspeech 2024},
  year={2024}
}
@inproceedings{yan-etal-2023-espnet,
  title = "{ESP}net-{ST}-v2: Multipurpose Spoken Language Translation Toolkit",
  author = "Yan, Brian and Shi, Jiatong and Tang, Yun and Inaguma, Hirofumi and Peng, Yifan and Dalmia, Siddharth and Pol{\'a}k, Peter and Fernandes, Patrick and Berrebbi, Dan and Hayashi, Tomoki and Zhang, Xiaohui and Ni, Zhaoheng and Hira, Moto and Maiti, Soumi and Pino, Juan and Watanabe, Shinji",
  booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
  year = "2023",
  publisher = "Association for Computational Linguistics",
  pages = "400--411",
}
```

Owner

Name: ESPnet
Login: espnet
Kind: organization

Website: https://espnet.github.io/espnet/
Repositories: 9
Profile: https://github.com/espnet

end-to-end speech processing toolkit

JOSS Publication

Software Design and User Interface of ESPnet-SE++: Speech Enhancement for Robust Speech Processing

Published

November 20, 2023

DOI

10.21105/joss.05403

Volume 8, Issue 91, Page 5403

Authors

Yen-Ju Lu

Johns Hopkins University, USA

Xuankai Chang

Carnegie Mellon University, USA

Chenda Li

Shanghai Jiao Tong University, Shanghai

Wangyou Zhang

Shanghai Jiao Tong University, Shanghai

Samuele Cornell

Carnegie Mellon University, USA, Università Politecnica delle Marche, Italy

Zhaoheng Ni
Meta AI, USA

Yoshiki Masuyama
Carnegie Mellon University, USA, Tokyo Metropolitan University, Japan

Brian Yan
Carnegie Mellon University, USA

Robin Scheibler

LINE Corporation, Japan

Zhong-Qiu Wang

Carnegie Mellon University, USA

Yu Tsao

Academia Sinica, Taipei

Yanmin Qian

Shanghai Jiao Tong University, Shanghai

Shinji Watanabe

Carnegie Mellon University, USA

Editor

Fabian-Robert Stöter

GitHub Events

Total

Create event: 20
Commit comment event: 2
Release event: 3
Issues event: 350
Watch event: 922
Delete event: 17
Issue comment event: 1,606
Push event: 208
Pull request review event: 635
Pull request review comment event: 679
Pull request event: 328
Fork event: 175

Last Year

Create event: 20
Commit comment event: 2
Release event: 3
Issues event: 353
Watch event: 923
Delete event: 17
Issue comment event: 1,610
Push event: 208
Pull request review event: 635
Pull request review comment event: 679
Pull request event: 328
Fork event: 175

Committers

Last synced: 10 months ago

All Time

Total Commits: 17,520
Total Committers: 450
Avg Commits per committer: 38.933
Development Distribution Score (DDS): 0.85

Past Year

Commits: 1,262
Committers: 73
Avg Commits per committer: 17.288
Development Distribution Score (DDS): 0.841

Top Committers

Name	Email	Commits
kan-bayashi	h**i@g**p	2,623
kamo-naoyuki	n**9@g**m	1,498
ftshijt	7**8@q**m	1,063
Hirofumi Inaguma	h**c@g**m	1,013
fboyer	f**r@l**r	836
pre-commit-ci[bot]	6****]	700
Shinji Watanabe	s**0@g**m	590
karita	s**a@g**m	587
Wangyou Zhang	C**n@1**m	455
Fhrozen	n**1@g**m	428
popcornell	c**e@g**m	397
Yifan Peng	p**1@g**m	322
LiChenda	l**6@s**n	273
Jungjee	j**g@g**m	258
Yuning Wu	1**0@q**m	205
Guillaume Tâche	g**e@h**m	187
Masao-Someki	m**i@g**m	183
jerryuhoo	j**o@g**m	170
Yosuke Higuchi	w**v@g**m	158
D-Keqi	6****i	149
unilight	v**6@g**m	133
Xuankai Chang	n**k@g**m	128
Jinchuan	j**t@a**u	113
Shikhar Bharadwaj	s**2@a**u	111
roshansh-cmu	r**h@a**u	102
Nanxin Chen	b**n@g**m	101
Yushi Ueda	y**a@b**u	99
neillu23	n**u@g**m	98
wtc6	s**6@g**m	93
Ludwig Kürzinger	l**r@t**e	93
and 420 more...
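The summary figures above can be cross-checked from the listed counts. The sketch below recomputes the average commits per committer and the all-time DDS, under the assumption that the Development Distribution Score is defined as one minus the top committer's share of all commits. That definition is an assumption of this example and is not stated on this page.

```python
# All-time figures as listed above
total_commits = 17_520
total_committers = 450
top_committer_commits = 2_623  # kan-bayashi

avg_commits = total_commits / total_committers
dds = 1 - top_committer_commits / total_commits  # assumed definition of DDS

print(round(avg_commits, 3))  # 38.933
print(round(dds, 2))          # 0.85

# Past-year average, from 1,262 commits by 73 committers
print(round(1_262 / 73, 3))   # 17.288
```

Under this assumed definition, a DDS near 1 means commits are spread across many contributors, while a value near 0 means a single committer dominates.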

Committer Domains (Top 20 + Academic)

br011.ib.bridges2.psc.edu: 14 qq.com: 13 andrew.cmu.edu: 12 br012.ib.bridges2.psc.edu: 11 br013.ib.bridges2.psc.edu: 9 163.com: 9 br014.ib.bridges2.psc.edu: 9 swl.condo.cs.cmu.edu: 7 ntu.edu.tw: 4 jhu.edu: 3 dt-login03.delta.ncsa.illinois.edu: 2 foxmail.com: 2 test1.cm.gemini: 2 speechpro.com: 2 bigo.sg: 2 tir.lti.cs.cmu.edu: 2 swl-0-2.lti.cs.cmu.edu: 2 bytedance.com: 2 yahoo-corp.jp: 2 toki.waseda.jp: 2 sjtu.edu.cn: 2 gh-login02.delta.ncsa.illinois.edu: 2 million1.sp.m.is.nagoya-u.ac.jp: 1 v019.ib.bridges2.psc.edu: 1 gh-login03.delta.ncsa.illinois.edu: 1 iis.sinica.edu.tw: 1 swl-0-1.lti.cs.cmu.edu: 1 v003.ib.bridges2.psc.edu: 1 v021.ib.bridges2.psc.edu: 1 dt-login02.delta.ncsa.illinois.edu: 1 emo.speech.cs.cmu.edu: 1 i6.informatik.rwth-aachen.de: 1 snuchennai.edu.in: 1 dt-login03.delta.internal.ncsa.edu: 1 cornell.edu: 1 polytechnique.edu: 1 v001.ib.bridges2.psc.edu: 1 a.cs.okayama-u.ac.jp: 1 knights.ucf.edu: 1 dt-login02.delta.internal.ncsa.edu: 1 ims.uni-stuttgart.de: 1 tum.de: 1 nehs.hc.edu.tw: 1 gh-login01.delta.ncsa.illinois.edu: 1 a09.clsp.jhu.edu: 1 v034.ib.bridges2.psc.edu: 1 ieee.org: 1 b15.clsp.jhu.edu: 1 g.sp.m.is.nagoya-u.ac.jp: 1 gpu-cloud-anode43.dakao.io: 1 gpu-cloud-anode3.dakao.io: 1 gpub048.delta.ncsa.illinois.edu: 1 r085.ib.bridges2.psc.edu: 1 r046.ib.bridges2.psc.edu: 1 v024.ib.bridges2.psc.edu: 1 v020.ib.bridges2.psc.edu: 1 v017.ib.bridges2.psc.edu: 1 v012.ib.bridges2.psc.edu: 1 tju.edu.cn: 1 r218.ib.bridges2.psc.edu: 1 gpub079.delta.internal.ncsa.edu: 1 gpub077.delta.internal.ncsa.edu: 1 gpub017.delta.internal.ncsa.edu: 1 gpub087.delta.ncsa.illinois.edu: 1 login.clsp.jhu.edu: 1 b08.clsp.jhu.edu: 1 b12.clsp.jhu.edu: 1 mail.utoronto.ca: 1 b19.clsp.jhu.edu: 1 gpu-cloud-anode45.dakao.io: 1 g.sp.m.nagoya-u.ac.jp: 1 epfl.ch: 1 v006.ib.bridges2.psc.edu: 1 v011.ib.bridges2.psc.edu: 1 swl-0-3.lti.cs.cmu.edu: 1 gpuc03.delta.ncsa.illinois.edu: 1 ruc.edu.cn: 1 dt-login01.delta.ncsa.illinois.edu: 1 nwpu-aslp.org: 1 v009.ib.bridges2.psc.edu: 1 
v033.ib.bridges2.psc.edu: 1 dlr.de: 1 m.fudan.edu.cn: 1 b11.clsp.jhu.edu: 1 swl-2-13.condo.cs.cmu.edu: 1 andrew.cmu.edu”: 1 gh067.hsn.cm.delta.internal.ncsa.edu: 1 gpub075.delta.ncsa.illinois.edu: 1 dt-login01.delta.internal.ncsa.edu: 1 gpub073.delta.internal.ncsa.edu: 1 r205.ib.bridges2.psc.edu: 1 v016.ib.bridges2.psc.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 459
Total pull requests: 968
Average time to close issues: over 1 year
Average time to close pull requests: 2 months
Total issue authors: 314
Total pull request authors: 166
Average comments per issue: 3.74
Average comments per pull request: 3.67
Merged pull requests: 571
Bot issues: 0
Bot pull requests: 32

Past Year

Issues: 108
Pull requests: 363
Average time to close issues: 18 days
Average time to close pull requests: 21 days
Issue authors: 80
Pull request authors: 63
Average comments per issue: 1.55
Average comments per pull request: 2.93
Merged pull requests: 183
Bot issues: 0
Bot pull requests: 11

View more stats

Top Authors

Issue Authors

mukherjeesougata (18)
underdogliu (6)
minamo817 (5)
Curiosci (5)
pyf98 (5)
david-gimeno (5)
abhijitmohanta (4)
iamanigeeit (4)
cgbhat1978 (4)
GoGoAsahi (4)
nellorebhanuteja (4)
espnetUser (4)
bloodraven66 (4)
songjie1121 (3)
kbramhendra (3)

Pull Request Authors

ftshijt (72)
Fhrozen (61)
Masao-Someki (59)
jctian98 (43)
juice500ml (30)
siddhu001 (27)
sw005320 (24)
Emrys365 (24)
pre-commit-ci[bot] (23)
pyf98 (23)
wyh2000 (22)
Shikhar-S (19)
wanchichen (19)
popcornell (17)
South-Twilight (17)

Top Labels

Issue Labels

Question (257) Stale (180) Bug (85) TTS (47) Installation (31) Feature request (25) ASR (24) Discussion (23) SE (12) SSL (7) Diarization (6) OWSM (5) Roadmap (5) Wontfix (4) ST (4) ESPnet2 (4) Music (4) SLU (4) New Features (3) Docker (3) Streaming (3) Recipe (3) Help wanted (3) ESPnet1 (3) VC (3) SID (2) LM (2) RNNT (2) Unsupervise (2) MT (1)

Pull Request Labels

ESPnet2 (663) README (319) auto-merge (183) Recipe (178) ESPnet1 (143) CI (135) Installation (130) Bugfix (129) ASR (123) New Features (103) conflicts (65) Documentation (60) Stale (49) TTS (48) SE (40) Enhancement (37) Music (35) Codec (30) SID (27) size:XXL (20) SLU (20) OWSM (18) ST (17) dependencies (15) size:L (15) size:XS (15) Docker (15) mergify (14) Need review (13) size:XL (12)

Packages

Total packages: 2
Total downloads:
- pypi 22,548 last-month
Total docker downloads: 748

Total dependent packages: 6
(may contain duplicates)
Total dependent repositories: 217
(may contain duplicates)
Total versions: 37
Total maintainers: 5

pypi.org: espnet

ESPnet: end-to-end speech processing toolkit

Homepage: http://github.com/espnet/espnet
Documentation: https://espnet.readthedocs.io/
License: Apache Software License
Latest release: 0.10.6
published about 4 years ago

Versions: 36
Dependent Packages: 6
Dependent Repositories: 217
Downloads: 22,548 Last month
Docker Downloads: 748

Rankings

Stargazers count: 0.3%

Forks count: 0.3%

Dependent repos count: 1.0%

Average: 1.2%

Dependent packages count: 1.3%

Downloads: 1.7%

Docker downloads count: 2.4%

Maintainers (5)

naoyuki_kamo espnet kan-bayashi sw005320 Fhrozen

Last synced: 6 months ago

conda-forge.org: espnet

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses pytorch as a deep learning engine and also follows Kaldi style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.

Homepage: http://github.com/espnet/espnet
License: Apache-2.0
Latest release: 202209
published over 3 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Forks count: 2.8%

Stargazers count: 4.2%

Average: 23.0%

Dependent repos count: 34.0%

Dependent packages count: 51.2%

Last synced: 6 months ago

Software Design and User Interface of ESPnet-SE++

Science Score: 95.0%
