llvm-apr-benchmark
A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.7%) to scientific vocabulary
Repository
A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs
Basic Info
- Host: GitHub
- Owner: dtcxzyw
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://huggingface.co/spaces/dtcxzyw/llvm-apr-benchmark-leaderboard
- Size: 2.06 MB
Statistics
- Stars: 15
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
LLVM APR Benchmark: A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs
GitHub (we only accept pull requests from GitHub)
Motivation
Compilers are critical infrastructure in software development, and the LLVM compiler infrastructure is widely used in both academia and industry. However, due to its inherent complexity, LLVM still contains many bugs that can be triggered in edge cases. As one of the LLVM maintainers, my job is to provide minimal reproducible test cases for issues reported by fuzzers and downstream users, and to fix these bugs (or assign them to the right person). This process is time-consuming and tedious. Thanks to recent advances in compiler testing, we can automatically generate interesting test cases that trigger bugs and automatically reduce them to minimal ones. If we could also perform bug localization and repair automatically, it would significantly reduce the workload of us maintainers!

Recently, LLM-based automated program repair (APR) techniques have been proposed, with some successful results on APR benchmarks like Defects4J and SWE-bench. But I believe that fixing LLVM bugs is more challenging than what existing benchmarks cover, due to LLVM's large C/C++ codebase, complex logic, long history, and the need for domain-specific knowledge. Therefore, I built this benchmark to see whether we can automatically repair real-world LLVM bugs with the help of large language models and APR techniques. I hope this benchmark helps both SE researchers and the LLVM community understand how APR techniques perform on a large-scale, real-world C/C++ project.
Dataset Description
In this benchmark, we only focus on three kinds of bugs in the LLVM middle-end:
+ Crash: the compiler terminates abnormally or hits an assertion failure (LLVM is built with `-DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ABI_BREAKING_CHECKS=WITH_ASSERTS`).
+ Miscompilation: the compiler generates an incorrect program from well-defined source code.
+ Hang: the compiler runs into an infinite loop or fails to reach a fixpoint.

All bugs can be triggered with an `opt` command and a small piece of LLVM textual IR.
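For instance, a reproduction harness can distinguish these outcomes from a single `opt` invocation. The sketch below is illustrative only: the `opt` path, pass pipeline, and timeout are placeholders, and `classify_result` is not part of this benchmark's API.

```python
import subprocess

def classify_result(opt_path, test_file, passes, timeout=60):
    """Run opt on a test case and classify the outcome (illustrative only)."""
    try:
        proc = subprocess.run(
            [opt_path, "-S", f"-passes={passes}", test_file],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "hang"  # opt failed to terminate within the time budget
    if proc.returncode != 0:
        return "crash"  # assertion failure or abnormal termination
    return "ok"  # exited cleanly; miscompilation needs a semantic check

print(classify_result("work/llvm-project/build/bin/opt",
                      "test.ll", "instcombine<no-verify-fixpoint>"))
```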
This dataset collects fixed LLVM middle-end bugs from GitHub issues since 2024-01-01. Each issue contains an issue description, test cases, a reference patch, and some hints. All issues are checked against the following criteria (a verification sketch follows the list):

- At least one of the given test cases can be used to reproduce the bug at a specific commit (`base_commit`). For most of the miscompilation bugs, the `src` and `tgt` functions will be checked with alive2, an automatic refinement verification tool for LLVM. If miscompilation happens, `alive-tv` will provide a counterexample. The remaining miscompilation bugs will be checked by `lli`.
- `opt` passes all the given tests after fixing the bug with the given reference patch (`patch`).
- `opt` passes all regression tests at a specific commit (`hints.fix_commit`).
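For the alive2 check, the driver might look like the following sketch. It assumes an `alive-tv` binary and a `.ll` file containing the `src`/`tgt` pair; the output matching (looking for an "ERROR" line) is a guess at alive-tv's report format, not a documented interface.

```python
import subprocess

def check_refinement(alive_tv_path, ll_file, timeout=300):
    """Ask alive-tv whether tgt refines src; returns (ok, report). Sketch only."""
    proc = subprocess.run([alive_tv_path, ll_file],
                          capture_output=True, text=True, timeout=timeout)
    # On refinement failure, alive-tv prints a counterexample; we
    # conservatively treat any reported error as a miscompilation.
    ok = proc.returncode == 0 and "ERROR" not in proc.stdout
    return ok, proc.stdout

ok, report = check_refinement("alive-tv", "reduced.ll")
print("refines" if ok else report)
```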
Take issue 121459 as an example:

```jsonc
{
  // Identifier of the bug. It can be an issue number, a pull request number,
  // or a commit hash.
  "bug_id": "121459",
  // Points to the issue/PR/commit URL
  "issue_url": "https://github.com/llvm/llvm-project/issues/121459",
  // Bug type: crash/miscompilation/hang
  "bug_type": "miscompilation",
  // Fixes should be applied at the base commit
  "base_commit": "68d265666e708bad1c63b419b6275aaba1a7dcd2",
  // Knowledge cutoff date. It is not allowed to use the web knowledge base
  // after this date or use a large language model trained with newer
  // information. Please check the "Rules" section for exemptions.
  "knowledge_cutoff": "2025-01-02T09:03:32Z",
  // Regression test directories
  "lit_test_dir": [
    "llvm/test/Transforms/InstCombine"
  ],
  // Bug localization hints at different granularity levels.
  // Note that this information is provided in a best-effort way.
  // They are not guaranteed to be available or accurate.
  "hints": {
    "fix_commit": "a4d92400a6db9566d84cb4b900149e36e117f452",
    "components": [
      "InstCombine"
    ],
    "bug_location_lineno": {
      "llvm/lib/Transforms/InstCombine/InstructionCombining.cpp": [
        [2782, 2787],
        [2838, 2843],
        [2847, 2852]
      ]
    },
    "bug_location_funcname": {
      "llvm/lib/Transforms/InstCombine/InstructionCombining.cpp": [
        "foldGEPOfPhi"
      ]
    }
  },
  // A reference patch extracted from hints.fix_commit
  "patch": "<omitted>",
  // Minimal reproducible tests
  "tests": [
    {
      "file": "llvm/test/Transforms/InstCombine/opaque-ptr.ll",
      "commands": [
        "opt -S -passes='instcombine<no-verify-fixpoint>' < %s"
      ],
      "tests": [
        {
          "test_name": "gep_of_phi_of_gep_different_type",
          "test_body": "<omitted>"
        },
        {
          "test_name": "gep_of_phi_of_gep_flags2",
          "test_body": "<omitted>"
        },
        {
          "test_name": "gep_of_phi_of_gep_flags1",
          "test_body": "<omitted>"
        }
      ]
    }
  ],
  // Issue description
  "issue": {
    "title": "[InstCombine] GEPNoWrapFlags is propagated incorrectly",
    "body": "<omitted>",
    "author": "dtcxzyw",
    "labels": [
      "miscompilation",
      "llvm:instcombine"
    ],
    "comments": []
  },
  "verified": true,
  // You are allowed to choose a subset of issues to fix.
  // Although these properties are obtained from the golden patch,
  // using properties is not treated as using hints.
  "properties": {
    "is_single_file_fix": true,
    "is_single_func_fix": true
  }
}
```
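Since each record is plain JSON, it can be inspected without any harness. A minimal sketch, assuming issues are stored as `dataset/<bug_id>.json` and `LAB_DATASET_DIR` points at that directory (see the environment variables below):

```python
import json
import os

# Load one issue record; field names are as documented above.
issue_path = os.path.join(os.environ["LAB_DATASET_DIR"], "121459.json")
with open(issue_path) as f:
    bug = json.load(f)

print(bug["bug_type"])                    # "miscompilation"
print(bug["base_commit"])                 # commit to apply fixes at
print(bug["hints"]["components"])         # ["InstCombine"]
print([t["file"] for t in bug["tests"]])  # files holding the repro tests
```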
As of May 19, 2025, this benchmark contains 295 issues. You can run `python3 scripts/dataset_summary.py` locally to obtain the latest statistics:
```
Total issues: 295
Verified issues: 295 (100.00%)
Bug type summary:
  miscompilation: 106
  crash: 181
  hang: 8
Bug component summary (Total = 50):
  SLPVectorizer: 73
  LoopVectorize: 71
  InstCombine: 54
  ScalarEvolution: 15
  VectorCombine: 11
  ValueTracking: 8
  IR: 6
  ConstraintElimination: 5
  InstructionSimplify: 5
  SimplifyIndVar: 4
  Local: 4
  LoopAccessAnalysis: 3
  LoopPeel: 3
  MemCpyOptimizer: 3
  DeadStoreElimination: 3
  MemorySSAUpdater: 3
  ...
Label summary:
  crash: 111
  miscompilation: 108
  vectorizers: 81
  llvm:SLPVectorizer: 73
  crash-on-valid: 61
  llvm:instcombine: 57
  llvm:transforms: 40
  llvm:analysis: 22
  release:backport: 16
  llvm:SCEV: 16
  generated by fuzzer: 13
  confirmed: 9
  llvm:crash: 8
  regression: 7
  llvm:hang: 6
  ...
Changed files count summary: Average: 1.17 Max: 5 Min: 1 Median: 1
Inserted lines summary: Average: 10.71 Max: 164 Min: 0 Median: 6
Deleted lines summary: Average: 5.55 Max: 169 Min: 0 Median: 2
Test count summary: Average: 3.59 Max: 107 Min: 1 Median: 1
Patch summary:
  Single file fix: 264 (89.49%)
  Single func fix: 227 (76.95%)
  Single hunk fix: 168 (56.95%)
```
You can see from the statistics that more than half of the bugs can be fixed with a single hunk, so I believe most of the bugs can be fixed with the aid of LLM-based APR techniques :)
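The single-file/single-hunk classification can be re-derived from the `patch` field with `unidiff` (already pinned in `requirements.txt`); a minimal sketch, assuming the same `dataset/<bug_id>.json` layout as above:

```python
import json
import os

from unidiff import PatchSet

def patch_shape(issue_file):
    """Count changed files and hunks in an issue's reference patch."""
    with open(issue_file) as f:
        patch = PatchSet.from_string(json.load(f)["patch"])
    # len(PatchSet) = patched files; len(PatchedFile) = hunks in that file.
    return len(patch), sum(len(pf) for pf in patch)

issue = os.path.join(os.environ["LAB_DATASET_DIR"], "121459.json")
files, hunks = patch_shape(issue)
print(f"single file: {files == 1}, single hunk: {hunks == 1}")
```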
Getting Started
Prerequisites
- A C++17 compatible compiler
- ninja
- ccache
- Pre-built LLVM core libraries
- alive-tv
You can follow the Dockerfile to set up the environment.
Installation
```bash
git clone https://github.com/dtcxzyw/llvm-apr-benchmark.git
cd llvm-apr-benchmark
pip3 install -r requirements.txt
mkdir -p work && cd work
git clone https://github.com/llvm/llvm-project.git
```
Please set the following environment variables:
```bash
export LAB_LLVM_DIR=<path-to-llvm-src>
export LAB_LLVM_BUILD_DIR=<path-to-llvm-build-dir>
export LAB_LLVM_ALIVE_TV=<path-to-alive-tv>
export LAB_DATASET_DIR=<path-to-llvm-apr-benchmark>/dataset
export LAB_FIX_DIR=<path-to-llvm-apr-benchmark>/examples/fixes
```
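A quick sanity check that the environment is complete may save a failed run later; a small sketch:

```python
import os

# The five variables the benchmark scripts expect (listed above).
REQUIRED = ["LAB_LLVM_DIR", "LAB_LLVM_BUILD_DIR", "LAB_LLVM_ALIVE_TV",
            "LAB_DATASET_DIR", "LAB_FIX_DIR"]
missing = [v for v in REQUIRED if v not in os.environ]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```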
Usage
This benchmark provides two helper modules, llvm_helper and lab_env, which allow researchers to easily interact with LLVM and the benchmark itself.
To use these two helpers:
```python
import os
import sys

sys.path.append(os.path.join(os.path.dirname(os.environ["LAB_DATASET_DIR"]), "scripts"))
import llvm_helper
from lab_env import Environment as Env
```
llvm_helper

```python
# Environment variables
llvm_helper.llvm_dir        # os.environ["LAB_LLVM_DIR"]
llvm_helper.llvm_build_dir  # os.environ["LAB_LLVM_BUILD_DIR"]
llvm_helper.llvm_alive_tv   # os.environ["LAB_LLVM_ALIVE_TV"]
llvm_helper.dataset_dir     # os.environ["LAB_DATASET_DIR"]

# Execute git commands on the llvm source tree
source_code = llvm_helper.git_execute(['show', f'{commit}:{file_path}'])

# Get information about the first failed test from the result of
# Environment.check_fast/check_full
res, log = env.check_fast()
if isinstance(log, list):
    test = llvm_helper.get_first_failed_test(log)
```
[lab_env](./scripts/lab_env.py)
```python
env = Env(
    # Load an issue from dataset/{issue_id}.json
    issue_id,
    # The knowledge cutoff date of the LLM
    base_model_knowledge_cutoff = "2024-01-01Z",
    # Max concurrent jobs for build/test
    max_build_jobs=None,
    max_test_jobs=None,
)

# If any external knowledge is used, please call this function.
env.use_knowledge(url = "...")

# Reset the source tree to the base commit. Please call it before each attempt.
env.reset()

# Build llvm
res, log = env.build()

# Provide a certificate with the patch and verification result
certificate = env.dump()

# Perform build + test
res, log = env.check_fast()

# Perform build + test + lit regression test
res, log = env.check_full()

# Issue information (always available)
bug_type = env.get_bug_type()
base_commit = env.get_base_commit()
tests = env.get_tests()

# Hints (optional)
fix_commit = env.get_hint_fix_commit()
components = env.get_hint_components()
files = env.get_hint_files()
functions = env.get_hint_bug_functions()
linenos = env.get_hint_line_level_bug_locations()

# Issue description (optional)
issue = env.get_hint_issue()

# Collect instructions and intrinsics from the given LLVM IR.
# Then it will retrieve descriptions from llvm/docs/LangRef.rst.
# It is useful for LLMs to understand new flags/attributes/metadata.
keywords = env.get_ir_keywords(llvm_ir)
desc = env.get_langref_desc(keywords)

# Properties
is_single_func_fix = env.is_single_func_fix()
is_single_file_fix = env.is_single_file_fix()
```
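Putting the two helpers together, a typical first step is to fetch the hinted files as they looked at the base commit, e.g. to build a bug-localization prompt for an LLM. A sketch, assuming `git_execute` returns the file contents as a string:

```python
env = Env("121459")
env.reset()  # check out the base commit before doing anything else

base = env.get_base_commit()
for path in env.get_hint_files():
    # Retrieve the (buggy) source as of the base commit.
    source = llvm_helper.git_execute(["show", f"{base}:{path}"])
    print(f"{path}: {len(source.splitlines())} lines at {base[:12]}")
```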
Here is a simple repair loop:

```python
env = Env(...)

# System prompts and user prompts
messages = []
while True:
    # Reset the LLVM source code tree
    env.reset()
    # Get information from env
    ...
    # Chat with LLM
    ...
    # Modify the source code in place
    ...
    res, log = env.check_full()
    if res:
        # The bug is fixed successfully
        cert = json.dumps(env.dump(log = messages), indent=2)
        print(cert)
        break
    # Append the feedback into user prompts for the next iteration
    messages.append(construct_user_prompt_from_feedback(log))
```
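`construct_user_prompt_from_feedback` above is left to the tool author. One hypothetical shape, assuming OpenAI-style message dicts and that `log` is a list of test results on test failures (as in the llvm_helper section) and raw build output otherwise:

```python
def construct_user_prompt_from_feedback(log, max_chars=4000):
    """Hypothetical helper: turn a failed check's log into the next user turn."""
    if isinstance(log, list):
        # Test failure: surface the first failed test via llvm_helper.
        failed = llvm_helper.get_first_failed_test(log)
        feedback = f"The following test still fails:\n{failed}"
    else:
        # Build failure: feed back the tail of the compiler output.
        feedback = f"The build failed with:\n{str(log)[-max_chars:]}"
    return {"role": "user", "content": feedback + "\nPlease revise your patch."}
```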
I have drafted an (admittedly poor) baseline powered by DeepSeek-R1. This baseline implementation is only for reference purposes, since I am an expert in neither LLMs nor APR.
Rules
To claim that your APR tool successfully fixes a bug, please obey the following rules:
+ Knowledge allowed to use:
  + Any static content/dynamic feedback provided by `lab_env.Environment`
  + Any content in the LLVM source tree before the base commit
  + A large language model trained on data available before the knowledge cutoff date
  + Any other content on the web created before the knowledge cutoff date
+ `opt` with your patch passes both the given tests and the regression test suite.
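Mechanically, the cutoff rule can be enforced before each lookup; a sketch in which the creation date of the consulted page is assumed to be obtained out of band (only the `url` keyword of `use_knowledge` appears in the API above):

```python
from datetime import datetime, timezone

def consult(env, url, created_at, knowledge_cutoff):
    """Use web content only if it predates the issue's knowledge cutoff,
    and record the access as the rules require. Illustrative only."""
    if created_at >= knowledge_cutoff:
        raise ValueError(f"{url} postdates the knowledge cutoff")
    env.use_knowledge(url=url)
    # ... fetch and use the content ...

cutoff = datetime(2025, 1, 2, 9, 3, 32, tzinfo=timezone.utc)  # from the issue JSON
```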
License
This project is licensed under the Apache License 2.0. Please see the LICENSE for details.
Please cite this work with the following BibTeX entry:

```bibtex
@misc{llvm-apr-benchmark,
  title = {LLVM APR Benchmark: A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs},
  url = {https://github.com/dtcxzyw/llvm-apr-benchmark},
  author = {Yingwei Zheng},
  year = {2025},
}
```
Owner
- Name: Yingwei Zheng
- Login: dtcxzyw
- Kind: user
- Company: SUSTech
- Repositories: 4
- Profile: https://github.com/dtcxzyw
CG & HPC & Compiler
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
title: "LLVM APR Benchmark: A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs"
type: dataset
authors:
  - given-names: Yingwei
    family-names: Zheng
    email: dtcxzyw2333@gmail.com
    affiliation: Southern University of Science and Technology
url: "https://github.com/dtcxzyw/llvm-apr-benchmark"
license: Apache-2.0
```
GitHub Events
Total
- Watch event: 15
- Push event: 54
- Public event: 1
- Pull request event: 1
Last Year
- Watch event: 15
- Push event: 54
- Public event: 1
- Pull request event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Yingwei Zheng | d****3@g****m | 97 |
| dependabot[bot] | 4****] | 1 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Pull Request Authors
- dependabot[bot] (1)
Dependencies
- ubuntu 22.04 build
- Requests ==2.32.3
- dateparser ==1.2.0
- openai ==1.60.1
- psutil ==6.1.1
- tqdm ==4.66.5
- tree_sitter ==0.23.2
- tree_sitter_cpp ==0.23.4
- unidiff ==0.7.5