https://github.com/audiollms/audiobench
AudioBench: A Universal Benchmark for Audio Large Language Models
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary
Repository
AudioBench: A Universal Benchmark for Audio Large Language Models
Basic Info
- Host: GitHub
- Owner: AudioLLMs
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2406.16020
- Size: 70.6 MB
Statistics
- Stars: 223
- Watchers: 10
- Forks: 9
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
AudioBench
A repository for evaluating AudioLLMs across various tasks
AudioBench: A Universal Benchmark for Audio Large Language Models
Come view our live leaderboard on Hugging Face Spaces
AudioBench Leaderboard | Huggingface Datasets | AudioLLM Paper Collection
Change log
- Mar 2025: Supported the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
- Mar 2025: Support the MMAU test set. Multiple-choice questions for speech, audio, and music understanding!
- Mar 2025: AudioBench now supports over 50 datasets!
- Mar 2025: Support the SEAME test sets (dev). It is a code-switching dataset for Chinese and Singapore-accented English.
- Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
- Jan 2025: Support 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
- Dec 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
- Sep 2024: Add the MuChoMusic dataset for music evaluation (multiple-choice questions).
- Aug 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
- Aug 2024: The leaderboard is live. Check it out here.
- Jul 2024: We are working hard on the leaderboard and the speech translation datasets. Stay tuned!
- Jul 2024: Support all initial 26 datasets listed in the AudioBench manuscript.
Supported Evaluation Data
- [x] librispeech_test_clean, ASR, English, Metric: wer
- [x] librispeech_test_other, ASR, English, Metric: wer
- [x] common_voice_15_en_test, ASR, English, Metric: wer
- [x] peoples_speech_test, ASR, English, Metric: wer
- [x] gigaspeech_test, ASR, English, Metric: wer
- [x] tedlium3_test, ASR, English, Metric: wer
- [x] tedlium3_long_form_test, ASR, English, Long recording, Metric: wer
- [x] earnings21_test, ASR, English, Long recording, Metric: wer
- [x] earnings22_test, ASR, English, Long recording, Metric: wer
- [x] aishell_asr_zh_test, ASR, Chinese, Metric: wer
- [x] covost2_en_id_test, Speech Translation, English-Indonesian, Metric: bleu
- [x] covost2_en_zh_test, Speech Translation, English-Chinese, Metric: bleu
- [x] covost2_en_ta_test, Speech Translation, English-Tamil, Metric: bleu
- [x] covost2_id_en_test, Speech Translation, Indonesian-English, Metric: bleu
- [x] covost2_zh_en_test, Speech Translation, Chinese-English, Metric: bleu
- [x] covost2_ta_en_test, Speech Translation, Tamil-English, Metric: bleu
- [x] cn_college_listen_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
- [x] slue_p2_sqa5_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] dream_tts_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
- [x] public_sg_speech_qa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] spoken_squad_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] openhermes_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
- [x] alpaca_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
- [x] spoken-mqa_short_digit, Speech Instruction, Metric: acc
- [x] spoken-mqa_long_digit, Speech Instruction, Metric: acc
- [x] spoken-mqa_single_step_reasoning, Speech Instruction, Metric: acc
- [x] spoken-mqa_multi_step_reasoning, Speech Instruction, Metric: acc
- [x] clotho_aqa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] wavcaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] audiocaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] wavcaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
- [x] audiocaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
- [x] iemocap_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] meld_sentiment_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] meld_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] voxceleb_accent_test, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] voxceleb_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] iemocap_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] muchomusic_test, Music Understanding, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part1_asr_test, Singlish ASR, Metric: wer
- [x] imda_part2_asr_test, Singlish ASR, Metric: wer
- [x] imda_part3_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part4_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part5_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part6_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part3_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part4_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part5_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part6_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part3_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part4_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part5_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part6_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_ar_sentence, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_ar_dialogue, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_gr_sentence, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_gr_dialogue, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] seame_dev_man, English-Chinese Code-Switching, Metric: wer
- [x] seame_dev_sge, English-Chinese Code-Switching, Metric: wer
- [x] mmau_mini, Audio Understanding and Reasoning, Multiple Choice Questions, Metric: llama3_70b_judge,string_match,gpt4o_judge
- [x] gigaspeech2_thai, ASR for the Thai language, Metric: wer
- [x] gigaspeech2_indo, ASR for the Indonesian language, Metric: wer
- [x] gigaspeech2_viet, ASR for the Vietnamese language, Metric: wer
- [ ] ASCEND, English-Chinese Code-Switching, Metric: wer
- [ ] [fleurs] speech translation
- [ ] [AIR-Bench] airbench tasks
How to evaluate on the supported datasets? It is as simple as replacing the DATASET and METRIC names:
```shell
DATASET=librispeech_test_clean
METRIC=wer
```
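For example, reusing the eval.sh entry point from the Quick Start section below (the argument order is assumed from that example; the model and GPU values here are illustrative):

```shell
# a minimal sketch, assuming eval.sh takes: dataset, model, GPU, batch size, overwrite, metrics, sample count
DATASET=librispeech_test_clean
METRIC=wer
MODEL_NAME=Qwen2-Audio-7B-Instruct

bash eval.sh $DATASET $MODEL_NAME 1 1 True $METRIC -1
```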
How to Evaluate on Your Own Dataset?
Three simple steps:
1. Make a copy of one of the customized dataset loaders (example: cn_college_listen_mcq_test) and adapt it to your own dataset.
2. Add a new entry in dataset.py.
3. Done! A minimal sketch of steps 1-2 follows below.
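The sketch assumes the loaders live in a src/dataset_src directory; the paths and file names are hypothetical, for illustration only:

```shell
# 1. copy an existing loader as a template (hypothetical path)
cp src/dataset_src/cn_college_listen_mcq_test.py src/dataset_src/my_dataset_test.py

# 2. edit my_dataset_test.py to load your own audio and reference answers,
#    then register "my_dataset_test" with a new entry in dataset.py
```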
Supported Models
- [x] cascade_whisper_large_v3_llama_3_8b_instruct
- [x] cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct
- [x] MERaLiON-AudioLLM-Whisper-SEA-LION
- [x] Qwen-Audio-Chat
- [x] Qwen2-Audio-7B-Instruct
- [x] SALMONN_7B: requires an extra git clone.
- [x] WavLLM_fairseq: no longer supported, as inference takes too much effort.
- [x] whisper_large_v3
- [x] whisper_large_v2
- [x] gemini-1.5-flash: API key needed
- [x] gemini-2-flash: API key needed
- [x] gpt-4o-audio: API key needed
- [x] phi_4_multimodal_instruct
- [x] seallms_audio_7b
- [ ] ultravox: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_1-8b / https://www.ultravox.ai/
- [ ] llama3_s
- [ ] audio-flamingo-2
- [ ] GLM4-Voice
- [ ] Mini-Omni
- [ ] SLAM-Omni
- [ ] https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct
- [ ] https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b
How to evaluate your own models?
As long as the model can run inference, you can load it and collect its responses for scoring. To evaluate new models, please refer to adding_new_model.
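As a rough illustration of the workflow (the paths and file names below are assumptions, not the documented interface; adding_new_model is the authoritative reference):

```shell
# copy an existing model adapter as a starting point (hypothetical path)
cp src/model_src/qwen2_audio.py src/model_src/my_model.py

# implement the model loading and inference logic inside my_model.py,
# then register "my_model" so it can be selected via MODEL_NAME in eval.sh
```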
Installation
Installation with pip:
```shell
pip install -r requirements.txt
```
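If you prefer an isolated environment (standard Python tooling, not specific to AudioBench):

```shell
# create and activate a virtual environment, then install the pinned dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```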
Quick Start
For model-as-judge evaluation, we serve the judgement model as a service via vLLM on port 5000.
The example below hosts a Llama-3-70B-Instruct judge and runs the cascaded Whisper + Llama-3 model.
```shell
# Step 1:
# Serve the judgement model using the vLLM framework (this example uses an int4-quantized version).
# This requires 1 * 80GB GPU.
bash vllm_model_judge_llama3_70b.sh

# Step 2:
# Perform model inference and obtain the evaluation results on the second GPU.
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1   # -1 evaluates all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct

DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
```
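Before running Step 2, you can sanity-check that the judge is reachable; this assumes the Step 1 script exposes vLLM's OpenAI-compatible API on port 5000 as described above:

```shell
# list the models served by the judge endpoint; a JSON response means the judge is up
curl http://localhost:5000/v1/models
```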
Citation
If you find our work useful, please consider citing our paper!
```bibtex
@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}
```
To submit your model to the leaderboard
Email: bwang28c@gmail.com
Researchers, companies or groups that are using AudioBench:
- Llama3-S: When Llama Learns to Listen
- [lmms-eval] https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md
- More to come...
To-Do List
- [ ] Features
- [ ] Evaluation with audio/speech generation
- [ ] Evaluation with a multi-round chatbot
- [ ] Also support other model-as-judge options and report the results
- [ ] Update AISHELL from WER to CER
- [x] Bugs
- [x] Threads of model-as-judge
- [x] Post-processing script for IMDA Part 4, which contains code-switching in 4 languages.
Contributors
- Xue Cong Tey (MMAU-mini Dataset)
Owner
- Name: AudioLLMs
- Login: AudioLLMs
- Kind: organization
- Repositories: 1
- Profile: https://github.com/AudioLLMs
GitHub Events
Total
- Issues event: 14
- Watch event: 141
- Member event: 3
- Issue comment event: 12
- Push event: 62
- Pull request review event: 1
- Fork event: 9
- Create event: 2
Last Year
- Issues event: 14
- Watch event: 141
- Member event: 3
- Issue comment event: 12
- Push event: 62
- Pull request review event: 1
- Fork event: 9
- Create event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 8
- Total pull requests: 0
- Average time to close issues: 21 days
- Average time to close pull requests: N/A
- Total issue authors: 4
- Total pull request authors: 0
- Average comments per issue: 0.38
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 8
- Pull requests: 0
- Average time to close issues: 21 days
- Average time to close pull requests: N/A
- Issue authors: 4
- Pull request authors: 0
- Average comments per issue: 0.38
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- BinWang28 (4)
- lxt160980 (2)
- Peter-SungwooCho (1)
- Ashbajawed (1)
Pull Request Authors
- zhimin-z (1)
- bachvudinh (1)