https://github.com/anjiang-wei/equibench
Repository
Basic Info
- Host: GitHub
- Owner: Anjiang-Wei
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 70.3 KB
Statistics
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking
Overview
EquiBench is a comprehensive benchmark designed to evaluate the code reasoning capabilities of Large Language Models (LLMs) through equivalence checking tasks. This framework helps researchers and developers assess how well different LLMs understand code semantics, reason about program functionality, and determine when two code snippets are functionally equivalent despite syntactic differences.
Key Features
- Diverse Test Cases: Includes 2400 program pairs (half equivalent, half inequivalent) across six distinct categories (DCE for C programs, STOKE for x86-64 programs, TVM for CUDA programs, and OJ_A, OJ_V, OJ_VA for Python competitive programming problems)
- Multiple Prompting Strategies: Support for zero-shot, few-shot, and chain-of-thought variations to evaluate different reasoning approaches
- Wide Model Support: Compatible with leading LLMs from OpenAI, Anthropic, Meta, Mistral AI, Qwen, and DeepSeek
- Standardized Methodology: Consistent evaluation framework enabling fair comparison across different model architectures
Table of Contents
- Overview
- Steps
- Details
- HuggingFace Dataset
- Evaluation Results
- Citation
- License
Initial Setup
Clone the repository and navigate to the directory
Shell
git clone https://github.com/Anjiang-Wei/EquiBench.git
cd EquiBench
Use Python version 3.12 or higher
You can use pyenv to manage Python versions:
Prerequisite: Install pyenv
Install Python 3.12 using pyenv:
Shell
pyenv install 3.12
pyenv local 3.12
Create a virtual environment and activate it
Shell
python -m venv .venv
source .venv/bin/activate
Update pip and install required packages
Shell
pip install --upgrade pip
pip install .
Set up API keys in a .env file:
Create an empty .env file
Shell
touch .env
Add the following API keys to your .env file:
Shell
OPENAI_API_KEY=<your OpenAI key here>
ANTHROPIC_API_KEY=<your Anthropic key here>
TOGETHER_API_KEY=<your Together key here>
HF_TOKEN=<your HuggingFace access token here>
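Before running an evaluation, it can help to confirm that the .env file actually defines the keys listed above. The following is a minimal standard-library sketch, not the repository's own loading code. The helper names parse_env_text and missing_keys are hypothetical, and treating all four keys as required is an assumption (you may only need the keys for the providers you evaluate).

```python
import os
from pathlib import Path

# Key names come from the setup instructions; treating every one of them
# as required is an assumption made for this sketch.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "TOGETHER_API_KEY", "HF_TOKEN"]


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    merged = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    merged.update(os.environ)  # variables already exported take precedence
    missing = missing_keys(merged)
    print("All keys present" if not missing else f"Missing keys: {missing}")
```

Run it from the repository root so that the relative .env path resolves.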
Daily Setup
When returning to work on EquiBench:
Navigate to the repository directory
Shell
cd EquiBench
Activate the virtual environment
Shell
source .venv/bin/activate
Step 1: Downloading Datasets
Obtain a read or write type access token from HuggingFace
Log in using the access token:
Option A: Log in via command line and verify access:
Shell
huggingface-cli login
huggingface-cli whoami
Option B: Add your token directly to the .env file as the HF_TOKEN environment variable.
Download the datasets:
Shell
python step1_data.py data
This will download all 2400 program pairs from the EquiBench-Datasets repository on HuggingFace.
Step 2: Running Evaluations
Execute the evaluation script with your desired configuration. The example below runs a zero-shot evaluation on three different models with a sample limit of 1 for each category:
Shell
python step2_eval.py data result/eval \
--prompt_types ZERO \
--models \
openai/gpt-4o-mini-2024-07-18 \
anthropic/claude-3-5-sonnet-20241022 \
Qwen/Qwen2.5-7B-Instruct-Turbo \
--limit 1
Additional Options
The evaluation script supports several command-line options:
- --models: List of models to evaluate (see Supported Models)
- --limit: Number of test pairs to evaluate per category (omit to evaluate all 400 pairs per category)
- --prompt_types: Prompting strategies to use (see Supported Prompt Types)
- --categories: Specific categories to evaluate. Choices: DCE, STOKE, TVM, OJ_A, OJ_V, OJ_VA
- --prompt_path: Path to custom prompt templates. Default: prompts.toml
- --log_level: Logging verbosity. Default: INFO. Choices: DEBUG, INFO, WARNING, ERROR
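When sweeping over several configurations, it can be convenient to assemble the step2_eval.py command line programmatically. The sketch below only uses the flags documented above. The function name build_eval_command and the up-front category validation are illustrative assumptions, not part of the repository.

```python
import shlex

# Category names as documented for the --categories option.
CATEGORIES = ("DCE", "STOKE", "TVM", "OJ_A", "OJ_V", "OJ_VA")


def build_eval_command(data_dir, output_dir, prompt_types,
                       models=None, categories=None, limit=None):
    """Build an argument list for step2_eval.py (hypothetical helper)."""
    cmd = ["python", "step2_eval.py", data_dir, output_dir,
           "--prompt_types", *prompt_types]
    if models:
        cmd += ["--models", *models]
    if categories:
        unknown = [c for c in categories if c not in CATEGORIES]
        if unknown:
            raise ValueError(f"Unknown categories: {unknown}")
        cmd += ["--categories", *categories]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


# Print a shell-ready command for a small zero-shot smoke test.
command = build_eval_command(
    "data", "result/eval", ["ZERO"],
    models=["openai/gpt-4o-mini-2024-07-18"], limit=1,
)
print(shlex.join(command))
```

The returned list can be passed to subprocess.run unchanged, which avoids shell quoting issues.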
Example Commands
```Shell
# Evaluate all models on a single category with few-shot prompting
python step2_eval.py data result/eval --prompt_types FEW --categories OJ_A --limit 10

# Evaluate one model on all categories with chain-of-thought reasoning
python step2_eval.py data result/eval --prompt_types ZERO_COT --models openai/gpt-4o-2024-11-20

# Custom output directory
python step2_eval.py data custom_results --prompt_types ZERO FEW
```
Finally, run the result summary
Shell
python step3_stat.py
Supported Categories
EquiBench contains 2400 pairs of programs across six distinct categories of code equivalence tasks:
- DCE (Dead Code Elimination for C programs): Code pairs that differ by removal of dead / live code
- STOKE (Superoptimizer for x86-64 programs): Assembly code pairs optimized using the STOKE framework
- TVM (Compiler Scheduling for CUDA programs): Code pairs optimized for tensor operations
- OJ_A (Python Competitive Programming - Algorithm): Different algorithmic solutions to the same programming problem
- OJ_V (Python Competitive Programming - Variable Renaming): Code pairs with variable renaming transformations
- OJ_VA (Python Competitive Programming - Variables + Algorithms): Code pairs with both variable renaming and algorithmic differences
Each category contains 400 pairs of programs (200 equivalent and 200 inequivalent), providing a diverse range of challenges for LLMs to reason about code semantics.
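Because each category is balanced, a model that always gives the same answer scores exactly 50% accuracy, which matches the random baseline in the results table. The sketch below illustrates this with a plain accuracy function. It is an illustration under that balance assumption, not the scoring code used by the repository.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# One category: 200 equivalent pairs (True) and 200 inequivalent pairs (False).
labels = [True] * 200 + [False] * 200

# A degenerate model that always answers "equivalent".
always_equivalent = [True] * len(labels)

print(accuracy(always_equivalent, labels))  # 50.0
```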
Supported Prompt Types
EquiBench evaluates models using four different prompting strategies:
- ZERO: Zero-shot prompting (directly asking the model without examples)
- FEW: Few-shot prompting (providing example problems and solutions)
- ZERO_COT: Zero-shot chain of thought (encouraging step-by-step reasoning)
- FEW_COT: Few-shot chain of thought (examples with step-by-step reasoning)
Each strategy tests different aspects of a model's reasoning capabilities, from basic understanding to advanced reasoning chains.
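The four prompt types can be read as a two-by-two grid: with or without examples, and with or without chain-of-thought reasoning. The enum below is a hypothetical way to model that grid in your own scripts. Only the four names come from the documentation above, and the class and property names are assumptions.

```python
from enum import Enum


class PromptType(Enum):
    """The four documented prompt types (hypothetical wrapper class)."""
    ZERO = "ZERO"
    FEW = "FEW"
    ZERO_COT = "ZERO_COT"
    FEW_COT = "FEW_COT"

    @property
    def uses_examples(self) -> bool:
        # Few-shot variants include worked examples in the prompt.
        return self.name.startswith("FEW")

    @property
    def uses_chain_of_thought(self) -> bool:
        # Chain-of-thought variants ask for step-by-step reasoning.
        return self.name.endswith("_COT")


for prompt_type in PromptType:
    print(prompt_type.value, prompt_type.uses_examples, prompt_type.uses_chain_of_thought)
```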
Supported Models
EquiBench supports evaluation across a diverse range of LLMs:
OpenAI Models
Shell
openai/o1-mini-2024-09-12
openai/gpt-4o-2024-11-20
openai/gpt-4o-mini-2024-07-18
openai/o3-mini-2025-01-31
Anthropic Models
Shell
anthropic/claude-3-5-sonnet-20241022
Meta (Llama) Models
Shell
meta-llama/Llama-3.2-3B-Instruct-Turbo
meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
Mistral AI Models
Shell
mistralai/Mistral-7B-Instruct-v0.3
mistralai/Mixtral-8x7B-Instruct-v0.1
mistralai/Mixtral-8x22B-Instruct-v0.1
Qwen Models
Shell
Qwen/Qwen2.5-7B-Instruct-Turbo
Qwen/Qwen2.5-72B-Instruct-Turbo
Qwen/QwQ-32B-Preview
DeepSeek Models
Shell
deepseek-ai/DeepSeek-R1
deepseek-ai/DeepSeek-V3
Additional models from OpenAI, Anthropic, and together.ai platforms are also supported.
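Model identifiers follow a provider/model pattern. The sketch below splits an identifier and guesses which API key from the .env file it would need. The mapping is an assumption inferred from the supported platforms listed above (identifiers prefixed with openai/ or anthropic/ use those providers, and everything else is assumed to be served through together.ai). The function names are hypothetical and do not reflect the repository's internal routing.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider/model' into its two parts."""
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"Expected 'provider/model', got: {model_id!r}")
    return provider, name


def guess_api_key_name(model_id: str) -> str:
    """Guess the environment variable a model needs (assumed mapping)."""
    provider, _ = split_model_id(model_id)
    known = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}
    # Assumption: any other prefix is routed through together.ai.
    return known.get(provider, "TOGETHER_API_KEY")


for model in ("openai/o3-mini-2025-01-31",
              "anthropic/claude-3-5-sonnet-20241022",
              "deepseek-ai/DeepSeek-R1"):
    print(model, "->", guess_api_key_name(model))
```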
HuggingFace Dataset
The EquiBench dataset is hosted on HuggingFace as anjiangwei/EquiBench-Datasets.
Dataset Statistics
| Category | Language | Equivalent Pairs | Inequivalent Pairs | Total |
|----------|----------|------------------|--------------------|-------|
| DCE | C | 200 | 200 | 400 |
| STOKE | x86-64 | 200 | 200 | 400 |
| TVM | CUDA | 200 | 200 | 400 |
| OJ_A | Python | 200 | 200 | 400 |
| OJ_V | Python | 200 | 200 | 400 |
| OJ_VA | Python | 200 | 200 | 400 |
| Total | | 1200 | 1200 | 2400 |
Direct Access via HuggingFace Datasets Library
You can directly access the dataset using the HuggingFace datasets library:
```python
from datasets import load_dataset

# Define dataset path
hf_path = "anjiangwei/EquiBench-Datasets"

# Load specific categories
dce_dataset = load_dataset(path=hf_path, name="DCE")
stoke_dataset = load_dataset(path=hf_path, name="STOKE")
tvm_dataset = load_dataset(path=hf_path, name="TVM")
oj_a_dataset = load_dataset(path=hf_path, name="OJ_A")
oj_v_dataset = load_dataset(path=hf_path, name="OJ_V")
oj_va_dataset = load_dataset(path=hf_path, name="OJ_VA")

# Example: Access the first pair in OJ_A category
print(oj_a_dataset["train"][0])
```
Evaluation Results
Below is a summary of performance across different models and prompting strategies based on our paper experiments:
| Model | DCE | CUDA | x86-64 | OJ_A | OJ_V | OJ_VA | Overall Accuracy |
|-------------------------------|------|------|--------|------|------|-------|------------------|
| Random Baseline | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| Llama-3.2-3B-Instruct-Turbo | 50.0 | 49.8 | 50.0 | 51.5 | 51.5 | 51.5 | 50.7 |
| Llama-3.1-8B-Instruct-Turbo | 41.8 | 49.8 | 50.5 | 57.5 | 75.5 | 56.8 | 55.3 |
| Mistral-7B-Instruct-v0.3 | 51.0 | 57.2 | 73.8 | 50.7 | 50.5 | 50.2 | 55.6 |
| Mixtral-8x7B-Instruct-v0.1 | 50.2 | 47.0 | 64.2 | 59.0 | 61.5 | 55.0 | 56.1 |
| Mixtral-8x22B-Instruct-v0.1 | 46.8 | 49.0 | 62.7 | 63.5 | 76.0 | 62.7 | 60.1 |
| Llama-3.1-70B-Instruct-Turbo | 47.5 | 50.0 | 58.5 | 66.2 | 72.0 | 67.5 | 60.3 |
| QwQ-32B-Preview | 48.2 | 50.5 | 62.7 | 65.2 | 71.2 | 64.2 | 60.3 |
| Qwen2.5-7B-Instruct-Turbo | 50.5 | 49.2 | 58.0 | 62.0 | 80.8 | 63.0 | 60.6 |
| gpt-4o-mini-2024-07-18 | 46.8 | 50.2 | 56.8 | 64.5 | 91.2 | 64.0 | 62.2 |
| Qwen2.5-72B-Instruct-Turbo | 42.8 | 56.0 | 64.8 | 72.0 | 76.5 | 70.8 | 63.8 |
| Llama-3.1-405B-Instruct-Turbo | 40.0 | 49.0 | 75.0 | 72.2 | 74.5 | 72.8 | 63.9 |
| DeepSeek-V3 | 41.0 | 50.7 | 69.2 | 73.0 | 83.5 | 72.5 | 65.0 |
| gpt-4o-2024-11-20 | 43.2 | 49.5 | 65.2 | 71.0 | 87.0 | 73.8 | 65.0 |
| claude3.5-sonnet-2024-10-22 | 38.5 | 62.3 | 70.0 | 71.2 | 78.0 | 73.5 | 65.6 |
| o1-mini-2024-09-12 | 55.8 | 50.7 | 74.2 | 80.0 | 89.8 | 78.8 | 71.5 |
| DeepSeek-R1 | 52.2 | 61.0 | 78.2 | 79.8 | 91.5 | 78.0 | 73.5 |
| o3-mini-2025-01-31 | 68.8 | 59.0 | 84.5 | 84.2 | 88.2 | 83.2 | 78.0 |
| Mean | 47.9 | 52.4 | 65.8 | 67.3 | 76.4 | 67.0 | 62.8 |
Table: Accuracy of 17 models on EquiBench under 0-shot prompting.
We report accuracy for each of the six equivalence categories along with the overall accuracy.
Note: The Overall Accuracy column is the average of the six per-category accuracies. For detailed results, please refer to our paper.
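The Overall Accuracy values in the table are consistent with an unweighted mean of the six per-category accuracies, rounded to one decimal place. The sketch below checks this for two rows copied from the table. The function name overall_accuracy is hypothetical, and the one-decimal rounding rule is an assumption about how the table was produced.

```python
from statistics import fmean

# Per-category accuracies copied from the table, in column order:
# DCE, CUDA, x86-64, OJ_A, OJ_V, OJ_VA.
ROWS = {
    "o3-mini-2025-01-31": [68.8, 59.0, 84.5, 84.2, 88.2, 83.2],
    "Llama-3.2-3B-Instruct-Turbo": [50.0, 49.8, 50.0, 51.5, 51.5, 51.5],
}


def overall_accuracy(per_category):
    """Unweighted mean across categories, rounded to one decimal place."""
    return round(fmean(per_category), 1)


for model, scores in ROWS.items():
    print(model, overall_accuracy(scores))
```

Since every category has the same number of pairs, the unweighted mean over categories equals the accuracy over all 2400 pairs when each category is evaluated in full.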
Citation
If you use EquiBench in your research, please cite our paper:
plaintext
@article{wei2025equibench,
title={EquiBench: Benchmarking Code Reasoning Capabilities of Large Language Models via Equivalence Checking},
author={Wei, Anjiang and Cao, Jiannan and Li, Ran and Chen, Hongyu and Zhang, Yuhui and Wang, Ziheng and Sun, Yaofeng and Liu, Yuan and Teixeira, Thiago S. F. X. and Yang, Diyi and Wang, Ke and Aiken, Alex},
journal={arXiv preprint arXiv:2502.12466},
year={2025}
}
License
Apache License 2.0. See the LICENSE file for details.
Owner
- Login: Anjiang-Wei
- Kind: user
- Repositories: 19
- Profile: https://github.com/Anjiang-Wei
GitHub Events
Total
- Watch event: 4
- Member event: 1
- Push event: 17
- Create event: 2
Last Year
- Watch event: 4
- Member event: 1
- Push event: 17
- Create event: 2
Dependencies
- anthropic *
- dataclasses_json *
- datasets *
- jsonlines *
- marshmallow *
- openai *
- pandas *
- python-dotenv *
- pyyaml *
- together *