https://github.com/audiollms/audiobench
AudioBench: A Universal Benchmark for Audio Large Language Models
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary
Repository
AudioBench: A Universal Benchmark for Audio Large Language Models
Basic Info
- Host: GitHub
- Owner: AudioLLMs
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2406.16020
- Size: 70.6 MB
Statistics
- Stars: 223
- Watchers: 10
- Forks: 9
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
AudioBench
A repository for evaluating AudioLLMs across various tasks
AudioBench: A Universal Benchmark for Audio Large Language Models
Come view our live leaderboard on Hugging Face Spaces
AudioBench Leaderboard | Huggingface Datasets | AudioLLM Paper Collection
Change log
- Mar 2025: Supported the phi_4_multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
- Mar 2025: Support the MMAU test set. Multiple-choice questions for speech, audio, and music understanding!
- Mar 2025: AudioBench now supports over 50 datasets!
- Mar 2025: Support the SEAME test sets (dev). It is a code-switching dataset for Chinese and Singapore-accented English.
- Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
- Jan 2025: Support 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
- Dec 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
- Sep 2024: Add the MuChoMusic dataset for music evaluation (multiple-choice questions).
- Aug 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
- Aug 2024: The leaderboard is live. Check it out here.
- Jul 2024: We are working hard on the leaderboard and the speech translation datasets. Stay tuned!
- Jul 2024: Support all initial 26 datasets listed in the AudioBench manuscript.
Supported Evaluation Data
- [x] librispeech_test_clean, ASR, English, Metric: wer
- [x] librispeech_test_other, ASR, English, Metric: wer
- [x] common_voice_15_en_test, ASR, English, Metric: wer
- [x] peoples_speech_test, ASR, English, Metric: wer
- [x] gigaspeech_test, ASR, English, Metric: wer
- [x] tedlium3_test, ASR, English, Metric: wer
- [x] tedlium3_long_form_test, ASR, English, Long recording, Metric: wer
- [x] earnings21_test, ASR, English, Long recording, Metric: wer
- [x] earnings22_test, ASR, English, Long recording, Metric: wer
- [x] aishell_asr_zh_test, ASR, Chinese, Metric: wer
- [x] covost2_en_id_test, Speech Translation, English-Indonesian, Metric: bleu
- [x] covost2_en_zh_test, Speech Translation, English-Chinese, Metric: bleu
- [x] covost2_en_ta_test, Speech Translation, English-Tamil, Metric: bleu
- [x] covost2_id_en_test, Speech Translation, Indonesian-English, Metric: bleu
- [x] covost2_zh_en_test, Speech Translation, Chinese-English, Metric: bleu
- [x] covost2_ta_en_test, Speech Translation, Tamil-English, Metric: bleu
- [x] cn_college_listen_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
- [x] slue_p2_sqa5_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] dream_tts_mcq_test, Speech Question Answering, Multiple Choice, Metric: llama3_70b_judge,gpt4o_judge
- [x] public_sg_speech_qa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] spoken_squad_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] openhermes_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
- [x] alpaca_audio_test, Speech Instruction, Metric: llama3_70b_judge,gpt4o_judge
- [x] spoken-mqa_short_digit, Speech Instruction, Metric: acc
- [x] spoken-mqa_long_digit, Speech Instruction, Metric: acc
- [x] spoken-mqa_single_step_reasoning, Speech Instruction, Metric: acc
- [x] spoken-mqa_multi_step_reasoning, Speech Instruction, Metric: acc
- [x] clotho_aqa_test, Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] wavcaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] audiocaps_qa_test, Audio Scene Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] wavcaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
- [x] audiocaps_test, Audio Scene Question Answering, Metric: llama3_70b_judge,meteor,gpt4o_judge
- [x] iemocap_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] meld_sentiment_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] meld_emotion_test, Emotion Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] voxceleb_accent_test, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] voxceleb_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] iemocap_gender_test, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] muchomusic_test, Music Understanding, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part1_asr_test, Singlish ASR, Metric: wer
- [x] imda_part2_asr_test, Singlish ASR, Metric: wer
- [x] imda_part3_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part4_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part5_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part6_30s_asr_test, Singlish ASR, Metric: wer
- [x] imda_part3_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part4_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part5_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part6_30s_sqa_human_test, Singlish Speech Question Answering, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part3_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part4_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part5_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_part6_30s_ds_human_test, Singlish Speech Summarization, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_ar_sentence, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_ar_dialogue, Singlish, Accent Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_gr_sentence, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] imda_gr_dialogue, Singlish, Gender Recognition, Metric: llama3_70b_judge,gpt4o_judge
- [x] seame_dev_man, English-Chinese Code-Switching, Metric: wer
- [x] seame_dev_sge, English-Chinese Code-Switching, Metric: wer
- [x] mmau_mini, Audio Understanding and Reasoning, Multiple Choice Questions, Metric: llama3_70b_judge,string_match,gpt4o_judge
- [x] gigaspeech2_thai, ASR for the Thai language, Metric: wer
- [x] gigaspeech2_indo, ASR for the Indonesian language, Metric: wer
- [x] gigaspeech2_viet, ASR for the Vietnamese language, Metric: wer
- [ ] ASCEND, English-Chinese Code-Switching, Metric: wer
- [ ] [fleurs] speech translation
- [ ] [AIR-Bench] airbench tasks
How to evaluate on the supported datasets? It is as simple as replacing the DATASET and METRIC names:
```shell
DATASET=librispeech_test_clean
METRIC=wer
```
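For example, reusing the eval.sh entry point from the Quick Start section below (the argument order is assumed from that example; the model and GPU values here are illustrative):

```shell
# a minimal sketch, assuming eval.sh takes: dataset, model, GPU, batch size, overwrite, metrics, sample count
DATASET=librispeech_test_clean
METRIC=wer
MODEL_NAME=Qwen2-Audio-7B-Instruct

bash eval.sh $DATASET $MODEL_NAME 1 1 True $METRIC -1
```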
How to Evaluate on Your Own Dataset?
Three simple steps:
1. Make a copy of one of the customized dataset loaders (example: cn_college_listen_mcq_test) and adapt it to your own dataset.
2. Add a new entry in dataset.py.
3. Done! A minimal sketch of steps 1-2 follows below.
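The sketch assumes the loaders live in a src/dataset_src directory; the paths and file names are hypothetical, for illustration only:

```shell
# 1. copy an existing loader as a template (hypothetical path)
cp src/dataset_src/cn_college_listen_mcq_test.py src/dataset_src/my_dataset_test.py

# 2. edit my_dataset_test.py to load your own audio and reference answers,
#    then register "my_dataset_test" with a new entry in dataset.py
```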
Supported Models
- [x] cascade_whisper_large_v3_llama_3_8b_instruct
- [x] cascade_whisper_large_v2_gemma2_9b_cpt_sea_lionv3_instruct
- [x] MERaLiON-AudioLLM-Whisper-SEA-LION
- [x] Qwen-Audio-Chat
- [x] Qwen2-Audio-7B-Instruct
- [x] SALMONN_7B: requires an extra git clone.
- [x] WavLLM_fairseq: no longer supported, as inference takes too much effort.
- [x] whisper_large_v3
- [x] whisper_large_v2
- [x] gemini-1.5-flash: API key needed
- [x] gemini-2-flash: API key needed
- [x] gpt-4o-audio: API key needed
- [x] phi_4_multimodal_instruct
- [x] seallms_audio_7b
- [ ] ultravox: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_1-8b / https://www.ultravox.ai/
- [ ] llama3_s
- [ ] audio-flamingo-2
- [ ] GLM4-Voice
- [ ] Mini-Omni
- [ ] SLAM-Omni
- [ ] https://huggingface.co/scb10x/llama3.1-typhoon2-audio-8b-instruct
- [ ] https://huggingface.co/WillHeld/DiVA-llama-3-v0-8b
How to evaluate your own models?
As long as the model can run inference, you can load it and collect its responses for scoring. To evaluate new models, please refer to adding_new_model.
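As a rough illustration of the workflow (the paths and file names below are assumptions, not the documented interface; adding_new_model is the authoritative reference):

```shell
# copy an existing model adapter as a starting point (hypothetical path)
cp src/model_src/qwen2_audio.py src/model_src/my_model.py

# implement the model loading and inference logic inside my_model.py,
# then register "my_model" so it can be selected via MODEL_NAME in eval.sh
```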
Installation
Installation with pip:
```shell
pip install -r requirements.txt
```
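If you prefer an isolated environment (standard Python tooling, not specific to AudioBench):

```shell
# create and activate a virtual environment, then install the pinned dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```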
Quick Start
For model-as-judge evaluation, we serve the judgement model as a service via vLLM on port 5000.
The example below hosts a Llama-3-70B-Instruct judge and runs the cascaded Whisper + Llama-3 model.
```shell
# Step 1:
# Serve the judgement model using the vLLM framework (this example uses an int4-quantized version).
# This requires 1 * 80GB GPU.
bash vllm_model_judge_llama3_70b.sh

# Step 2:
# Perform model inference and obtain the evaluation results on the second GPU.
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1   # -1 evaluates all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct

DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
```
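Before running Step 2, you can sanity-check that the judge is reachable; this assumes the Step 1 script exposes vLLM's OpenAI-compatible API on port 5000 as described above:

```shell
# list the models served by the judge endpoint; a JSON response means the judge is up
curl http://localhost:5000/v1/models
```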
Citation
If you find our work useful, please consider citing our paper!
```bibtex
@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}
```
To submit your model to the leaderboard
Email: bwang28c@gmail.com
Researchers, companies or groups that are using AudioBench:
- Llama3-S: When Llama Learns to Listen
- [lmms-eval] https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/lmms-eval-0.3.md
- More to come...
To-Do List
- [ ] Features
- [ ] Evaluation with audio/speech generation
- [ ] Evaluation with a multi-round chatbot
- [ ] Also support other model-as-judge options and report the results
- [ ] Update AISHELL from WER to CER
- [x] Bugs
- [x] Threads of model-as-judge
- [x] Post-processing script for IMDA Part 4, which contains code-switching in 4 languages.
Contributors
- Xue Cong Tey (MMAU-mini Dataset)
Owner
- Name: AudioLLMs
- Login: AudioLLMs
- Kind: organization
- Repositories: 1
- Profile: https://github.com/AudioLLMs
GitHub Events
Total
- Issues event: 14
- Watch event: 141
- Member event: 3
- Issue comment event: 12
- Push event: 62
- Pull request review event: 1
- Fork event: 9
- Create event: 2
Last Year
- Issues event: 14
- Watch event: 141
- Member event: 3
- Issue comment event: 12
- Push event: 62
- Pull request review event: 1
- Fork event: 9
- Create event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 8
- Total pull requests: 0
- Average time to close issues: 21 days
- Average time to close pull requests: N/A
- Total issue authors: 4
- Total pull request authors: 0
- Average comments per issue: 0.38
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 8
- Pull requests: 0
- Average time to close issues: 21 days
- Average time to close pull requests: N/A
- Issue authors: 4
- Pull request authors: 0
- Average comments per issue: 0.38
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- BinWang28 (4)
- lxt160980 (2)
- Peter-SungwooCho (1)
- Ashbajawed (1)
Pull Request Authors
- zhimin-z (1)
- bachvudinh (1)