https://github.com/bethgelab/onebench
[ACL'25] The official code for "ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities"
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bethgelab
- License: mit
- Language: Python
- Default Branch: main
- Size: 1.82 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
An ever-evolving benchmark for LLMs and LMMs.
Installation
(Recommended) Create a new virtual environment and activate it. Some packages require Python>=3.11, so we suggest the following:
```bash
conda create -n onebench python=3.11 -y
conda activate onebench
```
Install the required packages:
```bash
python -m pip install -r requirements.txt
```
Install ONEBench in editable mode:
```bash
python -m pip install -e .
```
Test the installation:
```bash
python -c "import onebench"
```
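For a slightly more informative check than the one-line import, the following is a minimal sketch that reports both the Python version and whether the package can be found. The function name `check_environment` and the messages it returns are illustrative assumptions and are not part of ONEBench.

```python
import importlib.util
import sys

# Minimum version taken from the installation note (Python>=3.11).
MIN_PYTHON = (3, 11)


def check_environment(package: str = "onebench") -> list[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(f"Python {found} found, but >=3.11 is recommended")
    # find_spec locates a top-level package without importing it.
    if importlib.util.find_spec(package) is None:
        problems.append(
            f"package '{package}' is not importable; try `python -m pip install -e .`"
        )
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks fine"]:
        print(line)
```

Run it from the activated environment; any line it prints other than the final "looks fine" message points at the installation step to revisit.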
Downloading the data
LLM
HELM
[Optional] Upgrade the Google Cloud SDK (the commands below assume Homebrew is available):
```bash
brew install python@3.11
export CLOUDSDK_PYTHON=$(which python3.11)
gcloud components update
```
Authenticate to Google Cloud:
```bash
gcloud init
```
Download the HELM data:
```bash
python llm/download_helm.py
```
Open LLM Leaderboard
Download the Open LLM Leaderboard data:
```bash
python llm/download_open_llm_leaderboard.py
```
Chatbot Arena
Download the LMSYS Chatbot Arena data:
```bash
python llm/download_chatbot_arena.py
```
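If you want all three LLM sources in one go, the download scripts above can be chained. The following is a minimal sketch, assuming it is run from the repository root and that Google Cloud authentication for HELM has already been completed. The helper names `build_commands` and `run_downloads` are hypothetical, and the `dry_run` flag is a convenience of this sketch, not an option of the scripts themselves.

```python
import subprocess
import sys

# Script paths copied from the commands in this section.
DOWNLOAD_SCRIPTS = [
    "llm/download_helm.py",
    "llm/download_open_llm_leaderboard.py",
    "llm/download_chatbot_arena.py",
]


def build_commands(python: str = sys.executable) -> list[list[str]]:
    """Build one argv list per download script, using the given interpreter."""
    return [[python, script] for script in DOWNLOAD_SCRIPTS]


def run_downloads(dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first script that exits with an error.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_downloads(dry_run=True)
```

With `dry_run=True` the sketch only prints the commands; set it to `False` to actually start the downloads.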
VLM
The VLM results are in the data/vlm/{dataset} directory, where {dataset} is either vhelm or lmms-eval. The a-matrices for the individual datasets are located in data/vlm/{dataset}/binary and data/vlm/{dataset}/numeric. The results from Prometheus2 are located in data/vlm/{dataset}/pairwise_num.
[TODO]: Add instructions for JSON downloads, a-matrix creation, Prometheus scripts, and capability querying.
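The directory layout described above can be enumerated with a few lines of Python. This is a minimal sketch that only relies on the paths named in this section. The helper names `vlm_result_dirs` and `list_result_files` are hypothetical, and nothing is assumed about the file formats inside each directory.

```python
from pathlib import Path

# Dataset and subdirectory names as listed in this README.
DATASETS = ("vhelm", "lmms-eval")
SUBDIRS = ("binary", "numeric", "pairwise_num")


def vlm_result_dirs(root: Path = Path("data/vlm")) -> dict[str, dict[str, Path]]:
    """Map each dataset to its binary, numeric, and pairwise_num directories."""
    return {ds: {sub: root / ds / sub for sub in SUBDIRS} for ds in DATASETS}


def list_result_files(root: Path = Path("data/vlm")):
    """Yield (dataset, kind, file) for every file present under the layout."""
    for ds, kinds in vlm_result_dirs(root).items():
        for kind, directory in kinds.items():
            if not directory.is_dir():
                continue  # skip directories that have not been created yet
            for f in sorted(directory.iterdir()):
                if f.is_file():
                    yield ds, kind, f


if __name__ == "__main__":
    for ds, kind, f in list_result_files():
        print(ds, kind, f.name)
```

Run from the repository root, this prints one line per result file, which is a quick way to confirm that the expected directories are populated.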
📚 Citation
If you find our work helpful, please use the following citation:
@inproceedings{ghosh2025onebench,
title={ONEBench to test them all: Sample-level benchmarking over open-ended capabilities},
author={Ghosh, Adhiraj and Dziadzio, Sebastian and Prabhu, Ameya and Udandarao, Vishaal and Albanie, Samuel and Bethge, Matthias},
booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
year={2025}
}
🪪 License
Code: MIT. Check LICENSE.
Owner
- Name: Bethge Lab
- Login: bethgelab
- Kind: organization
- Location: Tübingen
- Website: http://bethgelab.org
- Repositories: 23
- Profile: https://github.com/bethgelab
Perceiving Neural Networks
GitHub Events
Total
- Watch event: 1
- Member event: 2
- Push event: 7
Last Year
- Watch event: 1
- Member event: 2
- Push event: 7
Dependencies
- choix *
- datasets *
- fastparquet *
- google-cloud-storage *
- hydra-core *
- matplotlib *
- pandas *
- plotly *
- prometheus-eval *
- pyarrow *
- rapidfuzz *
- requests *
- scienceplots *
- scikit-learn *
- vllm *