AudioBench: A Universal Benchmark for Audio Large Language Models

https://github.com/audiollms/audiobench

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links (links to arxiv.org)
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity (low: 10.2%)

Keywords

audio-scene-understanding speech speech-question-answering speech-recognition
Last synced: 6 months ago

Repository

AudioBench: A Universal Benchmark for Audio Large Language Models

Basic Info
Statistics
  • Stars: 223
  • Watchers: 10
  • Forks: 9
  • Open Issues: 4
  • Releases: 0
Topics
audio-scene-understanding speech speech-question-answering speech-recognition
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme License

README.md

AudioBench

A repository for evaluating AudioLLMs across various tasks.
AudioBench: A Universal Benchmark for Audio Large Language Models
View our live leaderboard on Hugging Face Spaces.

AudioBench Leaderboard | Huggingface Datasets | AudioLLM Paper Collection

Change log

  • Mar 2025: Supported the phi4multimodal_instruct model and GigaSpeech 2 evaluation (Thai, Vietnamese, and Indonesian).
  • Mar 2025: Support the MMAU test set. Multiple-choice questions for speech, audio, and music understanding!
  • Mar 2025: AudioBench now supports over 50 datasets!!
  • Mar 2025: Support the SEAME test sets (dev), a code-switching dataset for Chinese and Singapore-accented English.
  • Jan 2025: The AudioBench paper is accepted to the NAACL 2025 Main Conference.
  • Jan 2025: Support 10+ MNSC (Singlish understanding) datasets; the results are updated on the leaderboard.
  • Dec 2024: Support more datasets (35) and more models (2 cascade and 3 fusion models).
  • Sep 2024: Add the MuChoMusic dataset for music evaluation (multiple-choice questions).
  • Aug 2024: Support 6 speech translation datasets. Update the evaluation script for several MCQ evaluations.
  • Aug 2024: The leaderboard is live. Check it out here.
  • Jul 2024: We are working hard on the leaderboard and the speech translation datasets. Stay tuned!
  • Jul 2024: Support all initial 26 datasets listed in the AudioBench manuscript.

Supported Evaluation Data

How do you evaluate with the supported datasets? It's as simple as replacing the DATASET and METRICS names, e.g. DATASET=librispeech_test_clean METRICS=wer.
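For example, a full WER run on LibriSpeech might look like the following sketch, which reuses the eval.sh argument order from the Quick Start section below (the GPU and batch-size values are illustrative, not prescribed):

```shell
# Sketch: evaluate a supported model on LibriSpeech test-clean with WER.
# The eval.sh argument order follows the Quick Start example below.
GPU=1
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1   # -1 means all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct
DATASET=librispeech_test_clean
METRICS=wer

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
```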

How to Evaluate on Your Own Dataset?

Three simple steps: 1. Make a copy of one of the customized dataset loaders (example: cn_college_listen_mcq_test) and customize it for your own dataset, as in the sketch below. 2. Add a new entry in dataset.py. 3. Done!
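A minimal sketch of steps 1 and 2 (the paths below are hypothetical; mirror wherever cn_college_listen_mcq_test actually lives in this repo):

```shell
# Hypothetical layout -- adapt the paths to the repo's actual structure.
cp src/dataset_src/cn_college_listen_mcq_test.py src/dataset_src/my_dataset.py
# Edit my_dataset.py to point at your own audio files, questions, and references,
# then register the name "my_dataset" in dataset.py so it can be selected
# with DATASET=my_dataset when calling eval.sh.
```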

Supported Models

How to evaluate your own models?

As long as the model can run inference, you can load it and generate responses for evaluation. To evaluate new models, please refer to the adding_new_model guide.
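Once the new model is registered, evaluation goes through the same entry point (a sketch; my_new_model is a hypothetical name you would register while following the guide):

```shell
# Hypothetical: "my_new_model" must first be registered per the adding_new_model guide.
# Positional arguments: DATASET MODEL_NAME GPU BATCH_SIZE OVERWRITE METRICS NUMBER_OF_SAMPLES
MODEL_NAME=my_new_model
DATASET=librispeech_test_clean
METRICS=wer
bash eval.sh $DATASET $MODEL_NAME 1 1 True $METRICS -1
```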

Installation

Installation with pip:

```shell
pip install -r requirements.txt
```
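If you prefer an isolated environment, a standard virtualenv setup works first (a sketch; this README does not pin a Python version):

```shell
python -m venv .venv
source .venv/bin/activate          # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```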

Quick Start

For model-as-judge evaluation, we serve the judgement model as a service via vLLM on port 5000.

This example hosts a Llama-3-70B-Instruct model as the judge and runs the cascade Whisper + Llama-3 model.

```shell
# Step 1:
# Serve the judgement model using the vLLM framework (this example uses an int4 quantized version).
# This requires 1 x 80GB GPU.
bash vllm_model_judge_llama3_70b.sh

# Step 2:
# Perform model inference and obtain the evaluation results on the second GPU.
GPU=2
BATCH_SIZE=1
OVERWRITE=True
NUMBER_OF_SAMPLES=-1   # -1 evaluates all test samples

MODEL_NAME=Qwen2-Audio-7B-Instruct

DATASET=cn_college_listen_mcq_test
METRICS=llama3_70b_judge

bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
```
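Before launching step 2, you can sanity-check that the judge is up, assuming the step-1 script starts vLLM's OpenAI-compatible server on port 5000 as stated above (a sketch, not part of the repo's scripts):

```shell
# Should return a JSON listing that includes the served Llama-3-70B judge model.
curl http://localhost:5000/v1/models
```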

Citation

If you find our work useful, please consider citing our paper!

```bibtex
@article{wang2024audiobench,
  title={AudioBench: A Universal Benchmark for Audio Large Language Models},
  author={Wang, Bin and Zou, Xunlong and Lin, Geyu and Sun, Shuo and Liu, Zhuohan and Zhang, Wenyu and Liu, Zhengyuan and Aw, AiTi and Chen, Nancy F},
  journal={NAACL},
  year={2025}
}
```

To submit your model to the leaderboard

Email: bwang28c@gmail.com

Researchers, companies, or groups that are using AudioBench:

To-Do List

  • [ ] Features
    • [ ] Evaluation with audio/speech generation
    • [ ] Evaluation with multi-round chatbots
    • [ ] Support other model-as-judge options and report the results
    • [ ] Update AISHELL from WER to CER
  • [x] Bugs
    • [x] Threading for model-as-judge evaluation
    • [x] Post-processing script for IMDA PART4, which contains code-switching in 4 languages.

Contributors

  • Xue Cong Tey (MMAU-mini Dataset)

Owner

  • Name: AudioLLMs
  • Login: AudioLLMs
  • Kind: organization

GitHub Events

Total
  • Issues event: 14
  • Watch event: 141
  • Member event: 3
  • Issue comment event: 12
  • Push event: 62
  • Pull request review event: 1
  • Fork event: 9
  • Create event: 2
Last Year
  • Issues event: 14
  • Watch event: 141
  • Member event: 3
  • Issue comment event: 12
  • Push event: 62
  • Pull request review event: 1
  • Fork event: 9
  • Create event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 8
  • Total pull requests: 0
  • Average time to close issues: 21 days
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 0.38
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 8
  • Pull requests: 0
  • Average time to close issues: 21 days
  • Average time to close pull requests: N/A
  • Issue authors: 4
  • Pull request authors: 0
  • Average comments per issue: 0.38
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • BinWang28 (4)
  • lxt160980 (2)
  • Peter-SungwooCho (1)
  • Ashbajawed (1)
Pull Request Authors
  • zhimin-z (1)
  • bachvudinh (1)