https://github.com/bethgelab/onebench

[ACL'25] The official code for "ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities"

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
✓ .zenodo.json file (found .zenodo.json file)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.7%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: bethgelab
License: MIT
Language: Python
Default Branch: main
Size: 1.82 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 11 months ago · Last pushed 11 months ago

Metadata Files

Readme, License

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

An ever-evolving benchmark for LLMs and LMMs.

Installation

(Recommended) Create and activate a new virtual environment. Some packages require Python >= 3.11, so we suggest the following:

    conda create -n onebench python=3.11 -y
    conda activate onebench

Install the required packages:

    python -m pip install -r requirements.txt

Install ONEBench in editable mode:

    python -m pip install -e .

Test the installation:

    python -c "import onebench"
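The import test above only confirms that the package resolves. A slightly more informative check, sketched here under the assumption that nothing beyond the standard library is available, also verifies the Python >= 3.11 requirement mentioned earlier. The function name `check_environment` is hypothetical and not part of ONEBench.

```python
import importlib.util
import sys

# From the installation notes: some packages require Python >= 3.11.
MIN_PYTHON = (3, 11)


def check_environment(package: str = "onebench") -> dict:
    """Report whether the interpreter version and the package look usable.

    Hypothetical helper for illustration; it does not import the package,
    it only asks the import system whether the package can be located.
    """
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "package_found": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `package_found` is false, the editable install step most likely ran in a different environment than the one currently active.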

Downloading the data

LLM

HELM

[Optional] Upgrade the Google Cloud SDK:

    brew install python@3.11
    export CLOUDSDK_PYTHON=$(which python3.11)
    gcloud components update

Authenticate to Google Cloud:

    gcloud init

Download the HELM data:

    python llm/download_helm.py

Open LLM Leaderboard

Download the Open LLM Leaderboard data:

    python llm/download_open_llm_leaderboard.py

Chatbot Arena

Download the LMSYS Chatbot Arena data:

    python llm/download_chatbot_arena.py
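The three LLM download steps above can be chained from a single script. The following is a minimal sketch, assuming it is run from the repository root and that Google Cloud authentication has already been completed for the HELM step. The script paths come from this README; the helper names `build_commands` and `run_downloads` are hypothetical.

```python
import subprocess
import sys

# Script paths as listed in this README, relative to the repository root.
DOWNLOAD_SCRIPTS = [
    "llm/download_helm.py",
    "llm/download_open_llm_leaderboard.py",
    "llm/download_chatbot_arena.py",
]


def build_commands() -> list[list[str]]:
    """Return one command per download script, using the current interpreter."""
    return [[sys.executable, script] for script in DOWNLOAD_SCRIPTS]


def run_downloads(repo_root: str = ".") -> None:
    """Run each download script in order and stop at the first failure."""
    for cmd in build_commands():
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(cmd, check=True, cwd=repo_root)
```

Calling `run_downloads()` from the repository root runs the scripts in the same order as the sections above. Using `sys.executable` keeps the downloads in the same environment in which ONEBench was installed.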

VLM

The VLM results are in the data/vlm/{dataset} directory, where {dataset} is either vhelm or lmms-eval. The a-matrices for the individual datasets are located in data/vlm/{dataset}/binary and data/vlm/{dataset}/numeric. The results from Prometheus2 are located in data/vlm/{dataset}/pairwise_num.
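The directory layout described above can be resolved programmatically. This is a small sketch that assumes the working directory is the repository root; the file formats inside each directory are not documented here, so the code only builds paths and lists file names. The helper names `vlm_dirs` and `list_files` are hypothetical.

```python
from pathlib import Path

# Dataset and subdirectory names as described in this README.
DATASETS = ("vhelm", "lmms-eval")
SUBDIRS = ("binary", "numeric", "pairwise_num")


def vlm_dirs(dataset: str, root: str = "data/vlm") -> dict[str, Path]:
    """Map each result type to its directory for the given dataset."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    base = Path(root) / dataset
    return {name: base / name for name in SUBDIRS}


def list_files(dataset: str, root: str = "data/vlm") -> dict[str, list[str]]:
    """List file names per result type; directories that do not exist map to []."""
    listing = {}
    for name, path in vlm_dirs(dataset, root).items():
        if path.is_dir():
            listing[name] = sorted(p.name for p in path.iterdir() if p.is_file())
        else:
            listing[name] = []
    return listing
```

For example, `vlm_dirs("vhelm")["pairwise_num"]` points at the Prometheus2 results for VHELM.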

[TODO]: Add instructions for JSON downloads, a-matrix creation, Prometheus scripts, and capability querying.

📚 Citation

If you find our work helpful, please use the following citation:

@inproceedings{ghosh2025onebench,
  title={ONEBench to test them all: Sample-level benchmarking over open-ended capabilities},
  author={Ghosh, Adhiraj and Dziadzio, Sebastian and Prabhu, Ameya and Udandarao, Vishaal and Albanie, Samuel and Bethge, Matthias},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}

🪪 License

Code: MIT. Check LICENSE.

Owner

Name: Bethge Lab
Login: bethgelab
Kind: organization
Location: Tübingen

Website: http://bethgelab.org
Repositories: 23
Profile: https://github.com/bethgelab

Perceiving Neural Networks

GitHub Events

Total

Watch event: 1
Member event: 2
Push event: 7

Last Year

Watch event: 1
Member event: 2
Push event: 7

Dependencies

requirements.txt pypi

choix *
datasets *
fastparquet *
google-cloud-storage *
hydra-core *
matplotlib *
pandas *
plotly *
prometheus-eval *
pyarrow *
rapidfuzz *
requests *
scienceplots *
scikit-learn *
vllm *
