llvm-apr-benchmark

A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs

https://github.com/dtcxzyw/llvm-apr-benchmark

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary

Keywords

automated-program-repair compiler llm llvm software-engineering

Keywords from Contributors

mesh interpretability sequences generic projection interactive optim hacking network-simulation
Last synced: 6 months ago

Repository

A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs

Basic Info
Statistics
  • Stars: 15
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
automated-program-repair compiler llm llvm software-engineering
Created about 1 year ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

LLVM APR Benchmark: A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs

GitHub (We only accept pull requests from GitHub)

Hugging Face Mirror

Hugging Face Leaderboard

Evaluation Result Submission

Motivation

Compilers are critical infrastructure for software development, and the LLVM compiler infrastructure is widely used in both academia and industry. However, due to its inherent complexity, LLVM still contains many bugs that can be triggered in edge cases. As one of the LLVM maintainers, part of my job is to provide minimal reproducible test cases for issues reported by fuzzers and downstream users, and to fix these bugs (or assign them to the right person). This process is time-consuming and tedious. Thanks to recent advances in compiler testing, we can automatically generate interesting test cases that trigger bugs, and automatically reduce those tests to minimal reproducers. If we could also perform bug localization and repair automatically, it would significantly reduce the workload on us maintainers!

Recently, LLM-based automated program repair (APR) techniques have been proposed, with some successful results on APR benchmarks like Defects4J and SWE-bench. But I believe that fixing LLVM bugs is more challenging than existing benchmarks due to LLVM's large C/C++ codebase, complex logic, long history, and the domain-specific knowledge required. Therefore, I built this benchmark to see whether we can automatically repair real-world LLVM bugs with the help of large language models and APR techniques. I hope this benchmark helps both SE researchers and the LLVM community understand how APR techniques perform on a large-scale, real-world C/C++ project.

Dataset Description

In this benchmark, we only focus on three kinds of bugs in the LLVM middle-end:

  • Crash: the compiler terminates exceptionally or hits an assertion failure (LLVM is built with -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ABI_BREAKING_CHECKS=WITH_ASSERTS).
  • Miscompilation: the compiler generates an incorrect program from well-defined source code.
  • Hang: the compiler runs into an infinite loop or fails to reach a fixpoint.

All bugs can be triggered with an opt command and a small piece of LLVM textual IR.
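Concretely, a reproduction is just an opt invocation on a small .ll file. The sketch below is illustrative only; the pass pipeline and file name are hypothetical, not taken from the dataset:

```python
import subprocess

# Illustrative reproduction of a crash bug: run one pass pipeline over a
# small IR file. With assertions enabled, a crash exits non-zero with an
# assertion message on stderr; a hang would never terminate (hence the timeout).
proc = subprocess.run(
    ["opt", "-S", "-passes=instcombine", "reduced.ll"],  # hypothetical inputs
    capture_output=True, text=True, timeout=60,
)
if proc.returncode != 0:
    print("opt crashed:\n", proc.stderr)
```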

This dataset collects fixed LLVM middle-end bugs reported in GitHub issues since 2024-01-01. Each issue contains an issue description, test cases, a reference patch, and some hints. All issues are checked against the following criteria:

  • At least one of the given test cases can be used to reproduce the bug at a specific commit (base_commit). For most miscompilation bugs, the src and tgt functions are checked with alive2, an automatic refinement verification tool for LLVM; if a miscompilation occurs, alive-tv provides a counterexample. The remaining miscompilation bugs are checked with lli.
  • opt passes all the given tests after fixing the bug with the given reference patch (patch).
  • opt passes all regression tests at a specific commit (hints.fix_commit).
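For intuition, the alive2 check can also be run by hand. A minimal sketch, assuming src.ll and tgt.ll hold the IR before and after the buggy transform (hypothetical file names) and that LAB_LLVM_ALIVE_TV points at the alive-tv binary (see Installation below):

```python
import os
import subprocess

# Illustrative refinement check: alive-tv compares the functions in the two
# files and prints a counterexample when the optimized IR does not refine
# the original IR.
alive_tv = os.environ["LAB_LLVM_ALIVE_TV"]
verdict = subprocess.run(
    [alive_tv, "src.ll", "tgt.ll"],  # hypothetical before/after IR files
    capture_output=True, text=True, timeout=300,
)
# The summary line reports how many incorrect transformations were found.
print(verdict.stdout)
```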

Take Issue #121459 as an example:

```jsonc
{
  // Identifier of the bug. It can be an issue number, a pull request number,
  // or a commit hash.
  "bug_id": "121459",
  // Points to issue/PR/commit url
  "issue_url": "https://github.com/llvm/llvm-project/issues/121459",
  // Bug type: crash/miscompilation/hang
  "bug_type": "miscompilation",
  // Fixes should be applied at the base commit
  "base_commit": "68d265666e708bad1c63b419b6275aaba1a7dcd2",
  // Knowledge cutoff date. It is not allowed to use the web knowledge base
  // after this date or use a large language model trained with newer
  // information. Please check the "Rules" section for exemptions.
  "knowledge_cutoff": "2025-01-02T09:03:32Z",
  // Regression test directories
  "lit_test_dir": [
    "llvm/test/Transforms/InstCombine"
  ],
  // Bug localization hints at different granularity levels.
  // Note that this information is provided in a best-effort way.
  // They are not guaranteed to be available or accurate.
  "hints": {
    "fix_commit": "a4d92400a6db9566d84cb4b900149e36e117f452",
    "components": [
      "InstCombine"
    ],
    "bug_location_lineno": {
      "llvm/lib/Transforms/InstCombine/InstructionCombining.cpp": [
        [ 2782, 2787 ],
        [ 2838, 2843 ],
        [ 2847, 2852 ]
      ]
    },
    "bug_location_funcname": {
      "llvm/lib/Transforms/InstCombine/InstructionCombining.cpp": [
        "foldGEPOfPhi"
      ]
    }
  },
  // A reference patch extracted from hints.fix_commit
  "patch": "<omitted>",
  // Minimal reproducible tests
  "tests": [
    {
      "file": "llvm/test/Transforms/InstCombine/opaque-ptr.ll",
      "commands": [
        "opt -S -passes='instcombine<no-verify-fixpoint>' < %s"
      ],
      "tests": [
        {
          "test_name": "gep_of_phi_of_gep_different_type",
          "test_body": "<omitted>"
        },
        {
          "test_name": "gep_of_phi_of_gep_flags2",
          "test_body": "<omitted>"
        },
        {
          "test_name": "gep_of_phi_of_gep_flags1",
          "test_body": "<omitted>"
        }
      ]
    }
  ],
  // Issue description
  "issue": {
    "title": "[InstCombine] GEPNoWrapFlags is propagated incorrectly",
    "body": "<omitted>",
    "author": "dtcxzyw",
    "labels": [
      "miscompilation",
      "llvm:instcombine"
    ],
    "comments": []
  },
  "verified": true,
  // You are allowed to choose a subset of issues to fix.
  // Although these properties are obtained from the golden patch,
  // using properties is not treated as using hints.
  "properties": {
    "is_single_file_fix": true,
    "is_single_func_fix": true
  }
}
```
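Since each issue is stored as a file under dataset/, it can be inspected without the helper modules. A minimal sketch, assuming the stored files are plain JSON named dataset/{bug_id}.json (the // comments in the listing above are documentation only):

```python
import json
import os

# Load one issue record and print a few of the fields shown above.
dataset_dir = os.environ["LAB_DATASET_DIR"]
with open(os.path.join(dataset_dir, "121459.json")) as f:
    issue = json.load(f)

print(issue["bug_type"])             # "miscompilation"
print(issue["base_commit"])          # commit the fix should be applied at
print(issue["hints"]["components"])  # ["InstCombine"]
for group in issue["tests"]:
    for test in group["tests"]:
        print(group["file"], test["test_name"])
```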

As of May 19, 2025, this benchmark contains 295 issues. You can run `python3 scripts/dataset_summary.py` locally to obtain the latest statistics.

```
Total issues: 295
Verified issues: 295 (100.00%)

Bug type summary:
  miscompilation: 106
  crash: 181
  hang: 8

Bug component summary (Total = 50):
  SLPVectorizer: 73
  LoopVectorize: 71
  InstCombine: 54
  ScalarEvolution: 15
  VectorCombine: 11
  ValueTracking: 8
  IR: 6
  ConstraintElimination: 5
  InstructionSimplify: 5
  SimplifyIndVar: 4
  Local: 4
  LoopAccessAnalysis: 3
  LoopPeel: 3
  MemCpyOptimizer: 3
  DeadStoreElimination: 3
  MemorySSAUpdater: 3
  ...

Label summary:
  crash: 111
  miscompilation: 108
  vectorizers: 81
  llvm:SLPVectorizer: 73
  crash-on-valid: 61
  llvm:instcombine: 57
  llvm:transforms: 40
  llvm:analysis: 22
  release:backport: 16
  llvm:SCEV: 16
  generated by fuzzer: 13
  confirmed: 9
  llvm:crash: 8
  regression: 7
  llvm:hang: 6
  ...

Changed files count summary: Average: 1.17 Max: 5 Min: 1 Median: 1
Inserted lines summary: Average: 10.71 Max: 164 Min: 0 Median: 6
Deleted lines summary: Average: 5.55 Max: 169 Min: 0 Median: 2
Test count summary: Average: 3.59 Max: 107 Min: 1 Median: 1

Patch summary:
  Single file fix: 264 (89.49%)
  Single func fix: 227 (76.95%)
  Single hunk fix: 168 (56.95%)
```

You can see from the statistics that more than half of the bugs can be fixed with a single hunk, so I believe most of them can be fixed with the aid of LLM-based APR techniques :)
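The gist of scripts/dataset_summary.py is easy to reproduce. A minimal sketch, assuming plain-JSON issue files carrying the bug_type and properties fields shown earlier:

```python
import glob
import json
import os
from collections import Counter

# Aggregate a few of the statistics above from the raw dataset files.
dataset_dir = os.environ["LAB_DATASET_DIR"]
paths = glob.glob(os.path.join(dataset_dir, "*.json"))

bug_types = Counter()
single_file = 0
for path in paths:
    with open(path) as f:
        issue = json.load(f)
    bug_types[issue["bug_type"]] += 1
    single_file += issue["properties"]["is_single_file_fix"]

print("Total issues:", len(paths))
print("Bug type summary:", dict(bug_types))
print(f"Single file fix: {single_file} ({single_file / len(paths):.2%})")
```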

Getting Started

Prerequisites

  • A C++17 compatible compiler
  • ninja
  • ccache
  • Pre-built LLVM core libraries
  • alive-tv

You can follow the Dockerfile to set up the environment.

Installation

```bash
git clone https://github.com/dtcxzyw/llvm-apr-benchmark.git
cd llvm-apr-benchmark
pip3 install -r requirements.txt
mkdir -p work && cd work
git clone https://github.com/llvm/llvm-project.git
```

Please set the following environment variables:

```bash
export LAB_LLVM_DIR=<path-to-llvm-src>
export LAB_LLVM_BUILD_DIR=<path-to-llvm-build-dir>
export LAB_LLVM_ALIVE_TV=<path-to-alive-tv>
export LAB_DATASET_DIR=<path-to-llvm-apr-benchmark>/dataset
export LAB_FIX_DIR=<path-to-llvm-apr-benchmark>/examples/fixes
```

Usage

This benchmark provides two helper modules that allow researchers to easily interact with LLVM and with the benchmark itself.

To use these two helpers:

```python
import os
import sys

sys.path.append(os.path.join(os.path.dirname(os.environ["LAB_DATASET_DIR"]), "scripts"))
import llvm_helper
from lab_env import Environment as Env
```

[llvm_helper](./scripts/llvm_helper.py)

```python
# Environment variables
llvm_helper.llvm_dir        # os.environ["LAB_LLVM_DIR"]
llvm_helper.llvm_build_dir  # os.environ["LAB_LLVM_BUILD_DIR"]
llvm_helper.llvm_alive_tv   # os.environ["LAB_LLVM_ALIVE_TV"]
llvm_helper.dataset_dir     # os.environ["LAB_DATASET_DIR"]

# Execute git commands on the llvm source tree
source_code = llvm_helper.git_execute(['show', f'{commit}:{file_path}'])

# Get information of the first failed test from the result of
# Environment.check_fast/check_full
res, log = env.check_fast()
if isinstance(log, list):
    test = llvm_helper.get_first_failed_test(log)
```

[lab_env](./scripts/lab_env.py)

```python
env = Env(
    # Load an issue from dataset/{issue_id}.json
    issue_id,
    # The knowledge cutoff date of the LLM
    base_model_knowledge_cutoff = "2024-01-01Z",
    # Max concurrent jobs for build/test
    max_build_jobs=None,
    max_test_jobs=None,
)

# If any external knowledge is used, please call this function.
env.use_knowledge(url = "", date = "")

# Reset the source tree to the base commit. Please call it before each attempt.
env.reset()

# Build llvm
res, log = env.build()

# Provide a certificate with the patch and verification result
certificate = env.dump()

# Perform build + test
res, log = env.check_fast()

# Perform build + test + lit regression test
res, log = env.check_full()

# Issue information (always available)
bug_type = env.get_bug_type()
base_commit = env.get_base_commit()
tests = env.get_tests()

# Hints (optional)
fix_commit = env.get_hint_fix_commit()
components = env.get_hint_components()
files = env.get_hint_files()
functions = env.get_hint_bug_functions()
linenos = env.get_hint_line_level_bug_locations()

# Issue description (optional)
issue = env.get_hint_issue()

# Collect instructions and intrinsics from the given LLVM IR.
# Then it will retrieve descriptions from llvm/docs/LangRef.rst.
# It is useful for LLMs to understand new flags/attributes/metadata.
keywords = env.get_ir_keywords(llvm_ir)
desc = env.get_langref_desc(keywords)

# Properties
is_single_func_fix = env.is_single_func_fix()
is_single_file_fix = env.is_single_file_fix()
```

Here is a simple repair loop:

```python
env = Env(...)

# System prompts and user prompts
messages = []

while True:
    # Reset the LLVM source code tree
    env.reset()
    # Get information from env
    ...
    # Chat with LLM
    ...
    # Modify the source code in place
    ...
    res, log = env.check_full()
    if res:
        # The bug is fixed successfully
        cert = json.dumps(env.dump(log = messages), indent=2)
        print(cert)
        break
    # Append the feedback into user prompts for the next iteration
    messages.append(construct_user_prompt_from_feedback(log))
```
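The construct_user_prompt_from_feedback helper above is left to the user. One possible sketch, assuming log is either a list of test results (as returned by check_fast/check_full) or a plain build/verification log, and an OpenAI-style chat message format:

```python
def construct_user_prompt_from_feedback(log):
    # Hypothetical helper: turn verification feedback into the next user prompt.
    if isinstance(log, list):
        # A list means the build succeeded and tests ran; show the first failure.
        test = llvm_helper.get_first_failed_test(log)
        content = f"The patch still fails on this test:\n{test}\nPlease revise it."
    else:
        # Otherwise the build (or an earlier step) failed; forward the raw log.
        content = f"The build or verification failed:\n{log}\nPlease revise the patch."
    return {"role": "user", "content": content}
```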

I have drafted a (poor) baseline powered by DeepSeek-R1. This baseline implementation is for reference only, since I am an expert in neither LLMs nor APR.

Rules

To claim that your APR tool successfully fixes a bug, please obey the following rules:

  • Knowledge allowed to use:
    • any static content / dynamic feedback provided by lab_env.Environment;
    • any content in the LLVM source tree before the base commit;
    • a large language model trained on data from before the knowledge cutoff date;
    • any other content on the web created before the knowledge cutoff date.
  • opt with the patch applied passes both the given tests and the regression test suite.
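If an attempt does consult external material, declare it through the environment so that the dumped certificate records it; for example (the URL and date below are illustrative):

```python
# Record external knowledge used during the attempt, per the rules above.
env.use_knowledge(url="https://llvm.org/docs/LangRef.html", date="2024-12-30Z")
```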

License

This project is licensed under the Apache License 2.0. Please see the LICENSE for details.

Please cite this work with the following BibTeX entry:

```bibtex
@misc{llvm-apr-benchmark,
  title = {LLVM APR Benchmark: A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs},
  url = {https://github.com/dtcxzyw/llvm-apr-benchmark},
  author = {Yingwei Zheng},
  year = {2025},
}
```

Owner

  • Name: Yingwei Zheng
  • Login: dtcxzyw
  • Kind: user
  • Company: SUSTech

CG & HPC & Compiler

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
title: "LLVM APR Benchmark: A Large-Scale Automated Program Repair Benchmark of Real-World LLVM Middle-End Bugs"
type: dataset
authors:
  - given-names: Yingwei
    family-names: Zheng
    email: dtcxzyw2333@gmail.com
    affiliation: Southern University of Science and Technology
url: "https://github.com/dtcxzyw/llvm-apr-benchmark"
license: Apache-2.0

GitHub Events

Total
  • Watch event: 15
  • Push event: 54
  • Public event: 1
  • Pull request event: 1
Last Year
  • Watch event: 15
  • Push event: 54
  • Public event: 1
  • Pull request event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 98
  • Total Committers: 2
  • Avg Commits per committer: 49.0
  • Development Distribution Score (DDS): 0.01
Past Year
  • Commits: 98
  • Committers: 2
  • Avg Commits per committer: 49.0
  • Development Distribution Score (DDS): 0.01
Top Committers
Name Email Commits
Yingwei Zheng d****3@g****m 97
dependabot[bot] 4****] 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (1)
Top Labels
Issue Labels
Pull Request Labels
dependencies (1) python (1)

Dependencies

Dockerfile docker
  • ubuntu 22.04 build
requirements.txt pypi
  • Requests ==2.32.3
  • dateparser ==1.2.0
  • openai ==1.60.1
  • psutil ==6.1.1
  • tqdm ==4.66.5
  • tree_sitter ==0.23.2
  • tree_sitter_cpp ==0.23.4
  • unidiff ==0.7.5