https://github.com/bethgelab/onebench

[ACL'25] The official code for "ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities"

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bethgelab
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 1.82 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 11 months ago
Metadata Files
Readme License

README.md

ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

An ever-evolving benchmark for LLMs and LMMs.

Installation

(Recommended) Create and activate a new virtual environment. Some packages require Python >= 3.11, so we suggest the following:

    conda create -n onebench python=3.11 -y
    conda activate onebench

Install the required packages:

    python -m pip install -r requirements.txt

Install ONEBench in editable mode:

    python -m pip install -e .

Test the installation:

    python -c "import onebench"
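If the import check fails, it can help to know whether the interpreter or the package is at fault. The following is a minimal sketch, not part of the repository: `check_environment` and `MIN_PYTHON` are hypothetical names, and the only facts taken from this README are the Python >= 3.11 requirement and the package name `onebench`.

```python
import importlib.util
import sys

# Minimum interpreter version stated in the installation notes above.
MIN_PYTHON = (3, 11)


def check_environment(package: str = "onebench") -> dict:
    """Report whether the interpreter is recent enough and the package is importable.

    Hypothetical helper for illustration; it only looks the package up,
    it does not import or execute it.
    """
    python_ok = sys.version_info >= MIN_PYTHON
    package_found = importlib.util.find_spec(package) is not None
    return {"python_ok": python_ok, "package_found": package_found}


if __name__ == "__main__":
    print(check_environment())
```

A result with `package_found` set to `False` suggests the editable install did not run in the active environment.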

Downloading the data

LLM

HELM

[Optional] Upgrade the Google Cloud SDK:

    brew install python@3.11
    export CLOUDSDK_PYTHON=$(which python3.11)
    gcloud components update

Authenticate to Google Cloud:

    gcloud init

Download the HELM data:

    python llm/download_helm.py

Open LLM Leaderboard

Download the Open LLM Leaderboard data:

    python llm/download_open_llm_leaderboard.py

Chatbot Arena

Download the LMSYS Chatbot Arena data:

    python llm/download_chatbot_arena.py
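The three LLM download steps can also be run back to back. The sketch below is an assumption-laden convenience wrapper, not something shipped with the repository: `build_commands` and `run_all` are hypothetical names, and it assumes the scripts take no arguments (as in the commands above), are run from the repository root, and that Google Cloud authentication for HELM has already been completed.

```python
import subprocess
import sys

# Script paths exactly as they appear in the commands above.
DOWNLOAD_SCRIPTS = [
    "llm/download_helm.py",
    "llm/download_open_llm_leaderboard.py",
    "llm/download_chatbot_arena.py",
]


def build_commands() -> list[list[str]]:
    """Build one command per download script, using the current interpreter."""
    return [[sys.executable, script] for script in DOWNLOAD_SCRIPTS]


def run_all(repo_root: str = ".") -> None:
    """Run each download script in order; stop at the first failure.

    Assumption: repo_root is the root of the cloned repository.
    """
    for command in build_commands():
        subprocess.run(command, check=True, cwd=repo_root)
```

Calling `run_all()` from the repository root would execute the downloads sequentially. With `check=True`, a failing script raises `subprocess.CalledProcessError` instead of silently continuing.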

VLM

The VLM results are in the data/vlm/{dataset} directory, where {dataset} is either vhelm or lmms-eval. The a-matrices for the individual datasets are located in data/vlm/{dataset}/binary and data/vlm/{dataset}/numeric. The results from Prometheus2 are located in data/vlm/{dataset}/pairwise_num.
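The directory layout described above can be spelled out programmatically. This is a minimal sketch under stated assumptions: `vlm_result_dirs` and `missing_dirs` are hypothetical helpers, and only the directory names come from this README. It says nothing about the file formats inside those directories.

```python
from pathlib import Path

# Dataset and subdirectory names as listed in the README text.
DATASETS = ("vhelm", "lmms-eval")
SUBDIRS = ("binary", "numeric", "pairwise_num")


def vlm_result_dirs(data_root: str = "data/vlm") -> dict[str, dict[str, Path]]:
    """Map each dataset to its expected result subdirectories."""
    root = Path(data_root)
    return {
        dataset: {sub: root / dataset / sub for sub in SUBDIRS}
        for dataset in DATASETS
    }


def missing_dirs(data_root: str = "data/vlm") -> list[Path]:
    """List the expected directories that do not exist on disk."""
    return [
        path
        for subdirs in vlm_result_dirs(data_root).values()
        for path in subdirs.values()
        if not path.is_dir()
    ]
```

Running `missing_dirs()` from the repository root gives a quick check of whether the expected VLM result directories are present.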

[TODO]: Add instructions for JSON downloads, a-matrix creation, Prometheus scripts, and capability querying.

📚 Citation

If you find our work helpful, please use the following citation:

@inproceedings{ghosh2025onebench,
  title={ONEBench to test them all: Sample-level benchmarking over open-ended capabilities},
  author={Ghosh, Adhiraj and Dziadzio, Sebastian and Prabhu, Ameya and Udandarao, Vishaal and Albanie, Samuel and Bethge, Matthias},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}

🪪 License

Code: MIT. Check LICENSE.

Owner

  • Name: Bethge Lab
  • Login: bethgelab
  • Kind: organization
  • Location: Tübingen

Perceiving Neural Networks

GitHub Events

Total
  • Watch event: 1
  • Member event: 2
  • Push event: 7
Last Year
  • Watch event: 1
  • Member event: 2
  • Push event: 7

Dependencies

requirements.txt pypi
  • choix *
  • datasets *
  • fastparquet *
  • google-cloud-storage *
  • hydra-core *
  • matplotlib *
  • pandas *
  • plotly *
  • prometheus-eval *
  • pyarrow *
  • rapidfuzz *
  • requests *
  • scienceplots *
  • scikit-learn *
  • vllm *