speechllm

This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.

https://github.com/skit-ai/speechllm

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.2%) to scientific vocabulary

Keywords

conversational-ai llm multi-modal-llms multi-modality speech
Last synced: 6 months ago

Repository

This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.

Basic Info
  • Host: GitHub
  • Owner: skit-ai
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 3.88 MB
Statistics
  • Stars: 98
  • Watchers: 5
  • Forks: 9
  • Open Issues: 3
  • Releases: 0
Topics
conversational-ai llm multi-modal-llms multi-modality speech
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

SpeechLLM


SpeechLLM is a multi-modal large language model (LLM) trained to analyze and predict metadata from a speaker's turn in a conversation. The model integrates a speech encoder that transforms speech signals into meaningful speech representations; these embeddings, combined with text instructions, are then processed by the LLM to generate predictions.
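As a rough illustration of that flow, here is a conceptual sketch. All module names, shapes, and the connector layer are assumptions for illustration only, not the repository's actual API:

```python
import torch
import torch.nn as nn

# Conceptual sketch of the flow described above. All names, shapes, and the
# connector layer are illustrative assumptions, not the repository's API.
class SpeechLLMSketch(nn.Module):
    def __init__(self, speech_encoder, llm, enc_dim: int, llm_dim: int):
        super().__init__()
        self.speech_encoder = speech_encoder          # e.g. HuBERT / WavLM
        self.connector = nn.Linear(enc_dim, llm_dim)  # project speech features into the LLM embedding space
        self.llm = llm                                # e.g. TinyLlama

    def generate(self, waveform, instruction_embeds, max_new_tokens=500):
        speech_feats = self.speech_encoder(waveform)  # (B, T, enc_dim), assumed output
        speech_embeds = self.connector(speech_feats)  # (B, T, llm_dim)
        # Speech embeddings and the embedded text instruction are concatenated
        # along the sequence axis and decoded by the LLM.
        inputs = torch.cat([speech_embeds, instruction_embeds], dim=1)
        return self.llm.generate(inputs_embeds=inputs, max_new_tokens=max_new_tokens)
```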

The model takes a 16 kHz mono speech audio file as input and predicts the following:

1. SpeechActivity: whether the audio signal contains speech (True/False)
2. Transcript: the ASR transcript of the audio
3. Gender of the speaker (Female/Male)
4. Age of the speaker (Young/Middle-Age/Senior)
5. Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)

Usage

```python
# Load the model directly from Hugging Face
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],  # [Optional] pass either audio_path or audio_tensor
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Model Generation
'''
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}
'''
```

Try the model in the Google Colab Notebook. Also, check out our blog on SpeechLLM for end-to-end conversational agents (User Speech -> Response).

Model Weights

We released the speechllm-2B and speechllm-1.5B model checkpoints on Hugging Face :hugs:.

| Model | Speech Encoder | LLM | Checkpoint URL |
|-------|----------------|-----|----------------|
| speechllm-2B | facebook/hubert-xlarge-ll60k | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Huggingface |
| speechllm-1.5B | microsoft/wavlm-large | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Huggingface |

Latest Checkpoint Results

speechllm-2B

| Dataset | Type | Word Error Rate | Gender Acc | Age Acc | Accent Acc |
|:-------:|:----:|:---------------:|:----------:|:-------:|:----------:|
| librispeech-test-clean | Read Speech | 6.73 | 0.9496 | | |
| librispeech-test-other | Read Speech | 9.13 | 0.9217 | | |
| CommonVoice test | Diverse Accent, Age | 25.66 | 0.8680 | 0.6041 | 0.6959 |

speechllm-1.5B

| Dataset | Type | Word Error Rate | Gender Acc | Age Acc | Accent Acc |
|:-------:|:----:|:---------------:|:----------:|:-------:|:----------:|
| librispeech-test-clean | Read Speech | 11.51 | 0.9594 | | |
| librispeech-test-other | Read Speech | 16.68 | 0.9297 | | |
| CommonVoice test | Diverse Accent, Age | 26.02 | 0.9476 | 0.6498 | 0.8121 |
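For reference, the Word Error Rate figures above measure word-level edits against the reference transcript. A minimal sketch using the jiwer package pinned in requirements.txt (the sample strings here are made up):

```python
from jiwer import wer

# Toy example of the WER metric reported above; the strings are made up.
reference = "yes i got it i'll make the payment now"
hypothesis = "yes i got it i will make the payment now"
print(wer(reference, hypothesis))  # 1 substitution + 1 insertion over 9 words ~= 0.222
```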

Training

Dataset Preparation and Installation

Install the necessary packages from requirements.txt, taking care that the CUDA versions match your environment. Then prepare the audio dataset in the same format as data_samples/train.csv and data_samples/dev.csv; if new tasks (e.g. noise or environment class) have to be added, update dataset.py accordingly.

```bash
pip install -r requirements.txt
```
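As a hypothetical illustration of writing such a file (the authoritative column names are the ones in the repository's own data_samples/*.csv; the ones below are assumptions):

```python
import csv

# Hypothetical layout for data_samples/train.csv; check the repository's
# sample files for the authoritative column names.
with open("data_samples/train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "audio_path", "transcript", "gender", "age", "accent", "emotion"])
    writer.writeheader()
    writer.writerow({
        "audio_path": "clips/sample_0001.wav",  # 16 kHz mono
        "transcript": "yes i got it",
        "gender": "Female",
        "age": "Young",
        "accent": "America",
        "emotion": "Neutral",
    })
```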

Train

Update the config in train.py, such as audio_encoder_name, llm_name, and the other hyperparameters.

```bash
python train.py
```
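The edit might look like the following sketch; verify the exact variable names and defaults against train.py itself:

```python
# Hypothetical excerpt of the config section in train.py; the exact
# names and defaults should be taken from the file itself.
audio_encoder_name = "microsoft/wavlm-large"     # speech encoder (see Model Weights table)
llm_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # base LLM
max_epochs = 10                                  # illustrative hyperparameter
learning_rate = 1e-4                             # illustrative hyperparameter
```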

Evaluation

After training, update the checkpoint path and the test dataset path (same format as train.csv/dev.csv).

```bash
python test.py
```
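That edit might look like this sketch (the variable names are assumptions; check test.py):

```python
# Hypothetical excerpt of test.py; the actual variable names may differ.
checkpoint_path = "checkpoints/speechllm-last.ckpt"  # trained checkpoint from train.py
test_csv = "data_samples/test.csv"                   # same format as train.csv/dev.csv
```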

Infer model in Streamlit app

```bash
streamlit run app.py
```

Disclaimer

The models provided in this repository are not perfect and may produce errors in Automatic Speech Recognition (ASR), gender identification, age estimation, accent recognition, and emotion detection. Additionally, these models may exhibit biases related to gender, age, accent, and emotion. Please use with caution, especially in production environments, and be aware of potential inaccuracies and biases.

License

This project is released under the Apache 2.0 license, as found in the LICENSE file. The released checkpoints and code are intended for research purposes, subject to the licenses of the facebook/hubert-xlarge-ll60k, microsoft/wavlm-large, and TinyLlama/TinyLlama-1.1B-Chat-v1.0 models.

Cite

```bibtex
@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
  author = {Rajaa, Shangeth and Tushar, Abhinav},
  title  = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
  url    = {https://github.com/skit-ai/SpeechLLM}
}
```

Owner

  • Name: Skit.ai
  • Login: skit-ai
  • Kind: organization
  • Email: hello@skit.ai
  • Location: Bangalore, India

Transforming Customer Experience with Voice AI

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this model, please cite it using these metadata."
authors:
- family-names: "Rajaa"
  given-names: "Shangeth"
- family-names: "Tushar"
  given-names: "Abhinav"

title: "SpeechLLM: Multi-Modal LLM for Speech Understanding"
abstract : ""
type: model
keywords:
  - "multi-modal-llms"
  - "llm"
  - "speech"
  - "conversational-ai"
version: 1.0.0
date-released: 2024-06-25
url: "https://github.com/skit-ai/SpeechLLM"
license: Apache-2.0

GitHub Events

Total
  • Issues event: 3
  • Watch event: 59
  • Fork event: 7
Last Year
  • Issues event: 3
  • Watch event: 59
  • Fork event: 7

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 30
  • Total Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 30
  • Committers: 1
  • Avg Commits per committer: 30.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Shangeth s****a@g****m 30

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 3
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 3
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • lisicheng-csn (1)
  • cbowdon (1)
  • zlin0 (1)
  • JohnHerry (1)
  • uni-manjunath-ke (1)

Dependencies

requirements.txt pypi
  • accelerate ==0.30.0
  • audio_recorder_streamlit ==0.0.8
  • datasets ==2.2.1
  • huggingface-hub ==0.23.0
  • jiwer ==3.0.3
  • librosa ==0.10.1
  • peft ==0.9.0
  • pytorch-lightning ==1.9.4
  • streamlit ==1.34.0
  • tokenizers ==0.19.1
  • torch ==2.0.1
  • torchaudio ==2.0.2
  • transformers ==4.41.2
  • wandb ==0.15.3