nemo_voicetextblender

NAACL 2025 main conference: "VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning"

https://github.com/pyf98/nemo_voicetextblender

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: pyf98
  • License: apache-2.0
  • Language: Python
  • Default Branch: speechllm-develop-yifanp
  • Size: 147 MB
Statistics
  • Stars: 8
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme Contributing License Citation

README.md

VoiceTextBlender

This repo contains the code for our paper:

Yifan Peng, Krishna C. Puvvada, Zhehuai Chen*, Piotr Zelasko, He Huang, Kunal Dhawan, Ke Hu, Shinji Watanabe, Jagadeesh Balam, and Boris Ginsburg, "VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning," in Proc. NAACL, 2025. [arXiv]

Overview

Recent studies have augmented large language models (LLMs) with speech understanding capabilities, leading to SpeechLMs. However, existing SpeechLMs typically require multi-stage curriculum learning and suffer from catastrophic forgetting of the original text-only capabilities. In this work, we propose a novel training pipeline, which streamlines the training process, enhances speech understanding performance, and maintains text-only performance of SpeechLMs. Specifically, we combine multi-turn text-only supervised fine-tuning (SFT) data with three types of single-turn speech-related SFT data and perform single-stage training.

The main results and ablation studies are presented below.

[Figure: results]

Demonstrations

Please find various demonstrations in our paper or the imgs directory in this repo.

Installation

The code is based on an older version of NVIDIA NeMo. We train and decode the model inside a Docker container. Please build the container and install the relevant packages.

Data Generation Scripts

As described in our paper, we combine four types of data to train VoiceTextBlender:

  • Text-only conversations
  • ASR and AST
  • SQA generated from ASR data
  • Mixed-modal SFT generated with TTS

Our training data is prepared in Lhotse's format in NeMo. The first two types of data are commonly used in LLM or SpeechLM training. Please refer to our paper for data sources and detailed statistics of these data. The last two types of data are newly generated. Below are the data generation scripts.

SQA generated from ASR data

Given a transcript from an ASR dataset, we prompt an LLM to generate a question-answer pair. This is implemented in a Python script: generate_audioqa_from_llm.py
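
The prompting step can be illustrated with a minimal sketch. This is not the code in generate_audioqa_from_llm.py: the prompt wording, the `Question:`/`Answer:` reply format, the helper names, and the `fake_llm` stand-in are all assumptions made for this example. A real run would replace `fake_llm` with a call to an actual LLM.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical prompt template; the prompt used by the authors is not shown here.
PROMPT_TEMPLATE = (
    "Read the transcript below and write one question about it, "
    "followed by its answer.\n"
    "Use exactly this format:\n"
    "Question: <question>\n"
    "Answer: <answer>\n\n"
    "Transcript: {transcript}\n"
)


def build_prompt(transcript: str) -> str:
    """Fill the (assumed) template with a single ASR transcript."""
    return PROMPT_TEMPLATE.format(transcript=transcript.strip())


def parse_qa(reply: str) -> Optional[Tuple[str, str]]:
    """Extract (question, answer) from a reply, or None if the format is not followed."""
    match = re.search(r"Question:\s*(.+?)\s*Answer:\s*(.+)", reply, flags=re.DOTALL)
    if match is None:
        return None
    question, answer = match.group(1).strip(), match.group(2).strip()
    if not question or not answer:
        return None
    return question, answer


def generate_qa(transcript: str, llm: Callable[[str], str]) -> Optional[Tuple[str, str]]:
    """Prompt the given LLM callable and parse its reply."""
    return parse_qa(llm(build_prompt(transcript)))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return "Question: What did the speaker buy?\nAnswer: A red bicycle."


if __name__ == "__main__":
    print(generate_qa("I bought a red bicycle yesterday.", fake_llm))
```

Returning `None` for malformed replies lets a caller skip or retry a transcript instead of writing a broken training example.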

We submit parallel jobs to a SLURM GPU cluster using submitit. The script can be customized for other environments, data, or LLMs.
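
When adapting the script to another environment, the core of the parallelization is dividing the transcripts among jobs. The sketch below shows one simple way to do that using only the standard library. The function name `shard` and the strided split are assumptions for illustration; it does not reproduce how the repository's script or submitit assigns work.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(items: Sequence[T], num_jobs: int, job_index: int) -> List[T]:
    """Return the slice of `items` handled by job `job_index` out of `num_jobs`.

    A strided split keeps shard sizes within one item of each other.
    """
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    if not 0 <= job_index < num_jobs:
        raise ValueError("job_index must be in [0, num_jobs)")
    return list(items[job_index::num_jobs])


if __name__ == "__main__":
    transcripts = [f"utterance-{i}" for i in range(10)]
    for job in range(3):
        print(job, shard(transcripts, num_jobs=3, job_index=job))
```

Each job would then run the generation step only on its own shard and write its outputs to a separate file.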

Mixed-modal SFT generated with TTS

We create mixed-modal SFT data from single-turn text SFT data. We randomly select consecutive sentences in the user turn and synthesize speech for them using a pre-trained TTS model. During training, these sentences are replaced by the synthesized speech, which yields mixed-modal user input. The script is tts_generate_mixedmodal.py, which first downloads a text SFT dataset from Hugging Face and then performs TTS. As with SQA generation, we submit parallel jobs on a SLURM cluster, but the script can be customized for other environments.
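
The span selection described above can be sketched as follows. This is an illustration under stated assumptions, not the logic of tts_generate_mixedmodal.py: the naive regex sentence splitter, the uniform sampling of span length and position, the `<audio>` placeholder, and all function names are hypothetical. The TTS call itself is omitted; the sketch only returns the text that would be synthesized.

```python
import random
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pick_speech_span(num_sentences: int, rng: random.Random) -> Tuple[int, int]:
    """Pick a random run of consecutive sentences as (start, end), end exclusive."""
    if num_sentences < 1:
        raise ValueError("need at least one sentence")
    length = rng.randint(1, num_sentences)         # how many sentences to speak
    start = rng.randint(0, num_sentences - length)  # where the run begins
    return start, start + length


def make_mixed_modal_turn(
    user_text: str, rng: random.Random, audio_tag: str = "<audio>"
) -> Tuple[str, str]:
    """Return (user turn with a placeholder for speech, text to synthesize)."""
    sentences = split_sentences(user_text)
    start, end = pick_speech_span(len(sentences), rng)
    to_speak = " ".join(sentences[start:end])
    mixed = sentences[:start] + [audio_tag] + sentences[end:]
    return " ".join(mixed), to_speak


if __name__ == "__main__":
    rng = random.Random(0)
    turn = "Here is some context. Please summarize it! Keep it short. Thanks."
    mixed_turn, spoken = make_mixed_modal_turn(turn, rng)
    print(mixed_turn)
    print(spoken)
```

Passing an explicit `random.Random` instance keeps the selection reproducible across parallel jobs when each job is given its own seed.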

Training Scripts

The training Python script is examples/multimodal/speech_llm/modular_audio_gpt_train.py

The training config file is singlestage_gemma_enc-ft_adp-ft_llm-lora_lr1e-4_max100k.yaml

We launch the training job on a SLURM cluster using train_N8.sh.

Inference Scripts

The inference script is examples/multimodal/speech_llm/modular_audio_gpt_eval.py

We provide an example launching script for SQA: test_sqa.sh

For SQA, we use OpenAI's GPT API for scoring. The system prompt is below:

You are an expert evaluator of question-answering performance. Your task is to evaluate the "correctness" and "redundancy" of an AI assistant's response to a user question based on the provided context. Provide your output following the schema provided. Here is a description of the required fields:

- correctness_score: either 0 or 1
    - Score 0: The AI assistant's answer is incorrect based on the provided context, or the AI assistant's answer simply copies the context.
    - Score 1: The AI assistant's answer is correct based on the provided context, and it does not simply copy the context.
- correctness_explanation: explanation of your score for "correctness".
- redundancy_score: an integer score between 1 and 10, where a higher score indicates that the AI assistant's answer copies more redundant information from the context.
- redundancy_explanation: explanation of your score for "redundancy".
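
The four fields in the prompt map naturally onto a small record type. The sketch below validates one evaluator reply and averages the scores over a test set. It assumes the reply arrives as a JSON string with exactly these field names; the class name, the helper functions, and the choice to report mean correctness and mean redundancy are illustrative assumptions, not the evaluation code or metric definition used in the paper.

```python
import json
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass(frozen=True)
class SqaJudgement:
    """One evaluator reply, with the value ranges stated in the system prompt."""

    correctness_score: int
    correctness_explanation: str
    redundancy_score: int
    redundancy_explanation: str

    def __post_init__(self) -> None:
        if self.correctness_score not in (0, 1):
            raise ValueError("correctness_score must be 0 or 1")
        if not isinstance(self.redundancy_score, int) or not 1 <= self.redundancy_score <= 10:
            raise ValueError("redundancy_score must be an integer between 1 and 10")


def parse_judgement(raw_json: str) -> SqaJudgement:
    """Parse an evaluator reply, assumed here to be a JSON object."""
    data = json.loads(raw_json)
    return SqaJudgement(
        correctness_score=data["correctness_score"],
        correctness_explanation=data["correctness_explanation"],
        redundancy_score=data["redundancy_score"],
        redundancy_explanation=data["redundancy_explanation"],
    )


def summarize(judgements: Iterable[SqaJudgement]) -> Dict[str, float]:
    """Average the two scores over all judged examples."""
    items = list(judgements)
    return {
        "accuracy": mean(j.correctness_score for j in items),
        "mean_redundancy": mean(j.redundancy_score for j in items),
    }


if __name__ == "__main__":
    reply = json.dumps(
        {
            "correctness_score": 1,
            "correctness_explanation": "Matches the context.",
            "redundancy_score": 2,
            "redundancy_explanation": "Little copied text.",
        }
    )
    print(summarize([parse_judgement(reply)]))
```

Validating the ranges at parse time surfaces malformed evaluator replies immediately instead of letting them skew the averages.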

Citation

BibTeX

@inproceedings{vtblender,
  title={{VoiceTextBlender}: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning},
  author={Yifan Peng and Krishna C. Puvvada and Zhehuai Chen and Piotr Zelasko and He Huang and Kunal Dhawan and Ke Hu and Shinji Watanabe and Jagadeesh Balam and Boris Ginsburg},
  year={2025},
  booktitle={Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL)},
}

Owner

  • Name: Yifan Peng
  • Login: pyf98
  • Kind: user
  • Location: Pittsburgh, PA
  • Company: Carnegie Mellon University

Speech Recognition | Ph.D. Candidate at CMU | B.E. from Tsinghua EE

GitHub Events

Total
  • Issues event: 3
  • Watch event: 7
  • Issue comment event: 4
  • Push event: 19
  • Fork event: 2
  • Create event: 2
Last Year
  • Issues event: 3
  • Watch event: 7
  • Issue comment event: 4
  • Push event: 19
  • Fork event: 2
  • Create event: 2

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sankar-mukherjee (1)
  • Sangkikim-77 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/actions/cancel-workflow/action.yml actions
.github/workflows/_test_template.yml actions
  • NVIDIA/NeMo/.github/actions/cancel-workflow main composite
  • actions/checkout v4 composite
.github/workflows/changelog-build.yml actions
  • actions/checkout v2 composite
  • mikepenz/release-changelog-builder-action v3.3.1 composite
.github/workflows/cherry-pick-release-commit.yml actions
  • actions/checkout v3 composite
  • carloscastrojumo/github-cherry-pick-action bb0869df47c27be4ae4c7a2d93d22827aa5a0054 composite
.github/workflows/cicd-main.yml actions
  • NVIDIA/NeMo/.github/actions/cancel-workflow main composite
  • actions/checkout v4 composite
  • docker/build-push-action v5 composite
  • docker/setup-buildx-action v3 composite
.github/workflows/close-inactive-issue-pr.yml actions
  • actions/stale v6 composite
.github/workflows/code-formatting.yml actions
  • EndBug/add-and-commit v9 composite
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • isort/isort-action v1 composite
  • psf/black stable composite
  • tj-actions/changed-files v44 composite
.github/workflows/codeql.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/config/codeql.yml actions
.github/workflows/gh-docs.yml actions
  • actions/checkout v3 composite
.github/workflows/import-test.yml actions
  • actions/checkout v2 composite
.github/workflows/labeler.yml actions
  • actions/labeler v4 composite
Dockerfile docker
  • ${BASE_IMAGE} latest build
  • nemo-deps latest build
  • scratch latest build
pyproject.toml pypi
requirements/requirements.txt pypi
  • fiddle *
  • huggingface_hub >=0.20.3
  • numba *
  • numpy >=1.22
  • onnx >=1.7.0
  • python-dateutil *
  • ruamel.yaml *
  • scikit-learn *
  • setuptools >=65.5.1
  • tensorboard *
  • text-unidecode *
  • torch *
  • tqdm >=4.41.0
  • wget *
  • wrapt *
requirements/requirements_asr.txt pypi
  • braceexpand *
  • editdistance *
  • einops *
  • g2p_en *
  • ipywidgets *
  • jiwer *
  • kaldi-python-io *
  • kaldiio *
  • lhotse >=1.22.0
  • librosa >=0.10.0
  • marshmallow *
  • matplotlib *
  • packaging *
  • pyannote.core *
  • pyannote.metrics *
  • pydub *
  • pyloudnorm *
  • resampy *
  • ruamel.yaml *
  • scipy >=0.14
  • soundfile *
  • sox *
  • texterrors *
requirements/requirements_common.txt pypi
  • datasets *
  • inflect *
  • pandas *
  • sacremoses >=0.0.43
  • sentencepiece <1.0.0
requirements/requirements_docs.txt pypi
  • Jinja2 *
  • Sphinx *
  • boto3 *
  • latexcodec *
  • numpy *
  • pydata-sphinx-theme *
  • sphinx-book-theme *
  • sphinx-copybutton *
  • sphinxcontrib-bibtex *
  • sphinxext-opengraph *
  • urllib3 *
  • wrapt *
requirements/requirements_infer.txt pypi
  • nvidia-pytriton *
  • tensorstore ==0.1.45
  • zarr *
requirements/requirements_lightning.txt pypi
  • cloudpickle *
  • fiddle *
  • hydra-core >1.3,<=1.3.2
  • omegaconf <=2.3
  • pytorch-lightning >=2.2.1
  • torchmetrics >=0.11.0
  • transformers >=4.36.0,<=4.40.2
  • wandb *
  • webdataset >=0.2.86
requirements/requirements_multimodal.txt pypi
  • PyMCubes *
  • addict *
  • clip *
  • decord *
  • diffusers >=0.19.3
  • einops_exts *
  • imageio *
  • kornia *
  • nerfacc >=0.5.3
  • open_clip_torch *
  • taming-transformers *
  • torchdiffeq *
  • torchsde *
  • trimesh *
requirements/requirements_nlp.txt pypi
  • accelerated-scan *
  • boto3 *
  • causal-conv1d ==1.2.0.post2
  • einops *
  • faiss-cpu *
  • fasttext *
  • flask_restful *
  • ftfy *
  • gdown *
  • h5py *
  • ijson *
  • jieba *
  • markdown2 *
  • matplotlib >=3.3.2
  • nltk >=3.6.5
  • opencc <1.1.7
  • pangu *
  • rapidfuzz *
  • rouge_score *
  • sacrebleu *
  • sentence_transformers *
  • tensorstore <0.1.46
  • zarr *
requirements/requirements_slu.txt pypi
  • jiwer >=2.0.0
  • progress >=1.5
  • tabulate >=0.8.7
  • textdistance >=4.1.5
  • tqdm *
requirements/requirements_test.txt pypi
  • black * test
  • click ==8.0.2 test
  • isort >5.1.0,<6.0.0 test
  • parameterized * test
  • pytest * test
  • pytest-mock * test
  • pytest-runner * test
  • ruamel.yaml * test
  • sphinx * test
  • sphinxcontrib-bibtex * test
  • wandb * test
  • wget * test
  • wrapt * test
requirements/requirements_tts.txt pypi
  • attrdict *
  • einops *
  • jieba *
  • kornia *
  • librosa *
  • matplotlib *
  • nemo_text_processing *
  • nltk *
  • pandas *
  • pypinyin *
  • pypinyin-dict *
scripts/freesound_download_resample/freesound_requirements.txt pypi
  • joblib *
  • librosa *
  • requests *
  • requests_oauthlib *
  • sox *
setup.py pypi
tools/ctc_segmentation/requirements.txt pypi
  • ctc_segmentation ==1.7.1
  • num2words *
tools/nemo_forced_aligner/requirements.txt pypi
  • nemo_toolkit *
  • prettyprinter *
  • pytest *
tools/nmt_webapp/requirements.txt pypi
  • flask *
  • flask_cors *
  • nemo_toolkit >=1.0.0rc1
tools/speech_data_explorer/requirements.txt pypi
  • SoundFile *
  • dash >=2.1.0
  • dash_bootstrap_components >=1.0.3
  • diff_match_patch *
  • editdistance *
  • jiwer *
  • librosa >=0.9.1
  • numpy *
  • plotly *
  • tqdm *