verina

Verina (Verifiable Code Generation Arena) is a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions.

https://github.com/sunblaze-ucb/verina

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.1%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: sunblaze-ucb
  • License: apache-2.0
  • Language: Lean
  • Default Branch: main
  • Homepage: https://verina.io/
  • Size: 1.54 MB
Statistics
  • Stars: 22
  • Watchers: 7
  • Forks: 2
  • Open Issues: 1
  • Releases: 0
Created 10 months ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Verina: Benchmarking Verifiable Code Generation

Overview

Verina (Verifiable Code Generation Arena) is a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Read more about the project in our website and paper.

Dataset

See the datasets/verina directory for the curated dataset used in the benchmark. Each datapoint in the dataset is organized as a folder containing the following files:
  • task.json: A JSON file describing the task, including the id, task signature, paths to necessary data files, and other metadata.
  • description.txt: The description of the programming task.
  • task.lean: The Lean 4 file containing the ground-truth code, specification, and proof.
  • test.json and reject_inputs.json: JSON files containing test cases and rejected inputs for the task.
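As a concrete illustration of this layout, the sketch below builds one mock datapoint folder and reads it back. The folder name, the file contents, and every task.json field other than id are hypothetical placeholders, not the benchmark's actual schema.

```python
import json
import tempfile
from pathlib import Path

def load_datapoint(folder: Path) -> dict:
    """Read a Verina-style datapoint folder into one dict."""
    task = json.loads((folder / "task.json").read_text())
    task["description"] = (folder / "description.txt").read_text()
    task["lean_source"] = (folder / "task.lean").read_text()
    return task

# Build a mock datapoint in the layout described above.
root = Path(tempfile.mkdtemp())
dp = root / "verina_basic_0"  # hypothetical task id
dp.mkdir()
(dp / "task.json").write_text(json.dumps({"id": "verina_basic_0"}))
(dp / "description.txt").write_text("Return the maximum element of a list.")
(dp / "task.lean").write_text("-- ground-truth code, spec, and proof go here")

print(load_datapoint(dp)["id"])  # verina_basic_0
```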

The HuggingFace dataset is an aggregated dataset extracted from datasets/verina. To contribute to Verina, please submit a pull request to this repository.

Prerequisites

Setup

```bash
uv sync
source .venv/bin/activate  # Activate the virtual environment created by uv
lake exe cache get
lake update
```

API Key Configuration

See .env.example for the required environment variables. Run cp .env.example .env to create a .env file, then fill in your API keys for the language model providers you want to use. LANGFUSE related variables are optional and used for logging the benchmark runs.
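For orientation, a filled-in .env might look roughly like the fragment below. The variable names are placeholders, not copied from the repository; only the LANGFUSE prefix is mentioned in the README, so check .env.example for the real names.

```bash
# Hypothetical .env fragment -- see .env.example for the actual variable names
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=...

# Optional: Langfuse logging of benchmark runs
LANGFUSE_PUBLIC_KEY=...
LANGFUSE_SECRET_KEY=...
LANGFUSE_HOST=https://cloud.langfuse.com
```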

Running Benchmarks on Baselines

Starting Prefect Server

Before running the benchmarks, start the Prefect server:

```bash
docker compose up -d  # This will start the database for Prefect in the background
uv run prefect server start
```

The server will be available at http://127.0.0.1:4200 with the API at http://127.0.0.1:4200/api.

Alternative Setup without Docker

Docker is only needed to set up a PostgreSQL database for Prefect. If you prefer not to use Docker, you can set up a local PostgreSQL database and update database.connection_url in the prefect.toml file to point to your local database.

Alternatively, you can comment out the database.connection_url line in prefect.toml to use the default SQLite database. In this case, we do not recommend running the benchmarks with a large number of workers (e.g. 16 workers).
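The two options above amount to toggling a single line in prefect.toml. The exact key path and connection string below are assumptions inferred from the description here, so verify them against the prefect.toml shipped in the repository.

```toml
# Option 1: point Prefect at your own PostgreSQL instance
# (key path assumed; verify against prefect.toml)
[server.database]
connection_url = "postgresql+asyncpg://prefect:prefect@localhost:5432/prefect"

# Option 2: comment out connection_url to fall back to the default SQLite
# database (not recommended with many workers):
# connection_url = "postgresql+asyncpg://..."
```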

Using Configuration Files

Existing configuration files are located in the configs directory. You can create your own configuration file by copying one of the existing files and modifying it as needed.

```toml
output_dir = "output/gpt-4o-mini-5rnd" # Directory where all generated output will be saved
max_workers = 128 # Number of parallel worker tasks to use for generation and evaluation
rounds = 5 # Total number of rounds to run each benchmark task
few_shot_example_names = ["verina_basic_15", "verina_basic_44"] # Identifiers for example tasks used in few-shot prompting

# Tasks selected to run:
codegen = false # Code generation task
specgen = false # Specification generation task
proofgen = true # Proof generation task
codespecgen = false # Combined code and specification generation task
codeproofgen = true # Combined code and proof generation task
specproofgen = true # Combined specification and proof generation task
codespecproofgen = true # Combined code, specification, and proof generation task

[gen_lm_config] # Settings for the language model generation, see utils/lm.py for more details
provider = "openai" # The language model API provider to use
model_name = "gpt-4o-mini" # Specific model name for generation

[baseline_config]
name = "baseline" # Name of the baseline method ["baseline", "proof-refinement"]
resume_from_checkpoint = true # Whether to resume from the previous result file
refinements = 64 # Number of refinements to run for the proof-refinement baseline
```

Custom Models

You can use custom models by specifying the provider and model_name in the gen_lm_config section of your configuration file. Currently supported providers include openai, anthropic, vertex_ai, together_ai, and any OpenAI-compatible API endpoint (use the local provider). See utils/lm.py for more details.

To add a new provider or new models, or to customize hyperparameters, you can modify the relevant functions in utils/lm.py, or use the local provider (see configs/dsprover2-7b.toml as an example).
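Pointing generation at a self-hosted OpenAI-compatible endpoint might then look like the fragment below. The commented keys are illustrative guesses, not Verina's actual configuration fields; configs/dsprover2-7b.toml and utils/lm.py show the real shape.

```toml
[gen_lm_config]
provider = "local"                    # any OpenAI-compatible API endpoint
model_name = "deepseek-prover-v2-7b"  # hypothetical model name
# The keys below are guesses for illustration only; check
# configs/dsprover2-7b.toml and utils/lm.py for the actual fields.
# api_base = "http://localhost:8000/v1"
# api_key = "EMPTY"
```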

Running the Benchmark

To run the benchmark:

```bash
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/[config_name].toml
```

It can be faster to run the benchmark by separating the generation and evaluation steps:

```bash
# For generation only
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/[config_name].toml --no-eval

# For evaluation only
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/[config_name].toml --no-gen -ew
```

Running Quality Assurance

```bash
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/quality_assurance.py -c configs/qa.toml
```

Reading Results

The detailed results of the benchmark will be saved in the output_dir directory specified in your configuration file, including the generated artifacts and evaluation scores. You can use the following snippet to obtain a summary of the results:

```python
from pathlib import Path

from src.verina.benchmark.report import EvaluationRoundsReport
from src.verina.benchmark.summary import DatapointSummaryReport

output_dir = Path("<your_output_dir>")
report = EvaluationRoundsReport.load_latest(output_dir)
summary = DatapointSummaryReport.from_rounds_report(report)
print(summary.pass_at_k(1))
```
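The pass_at_k summary is presumably the standard unbiased pass@k estimator (the probability that at least one of k samples, drawn from n total of which c are correct, passes). A self-contained sketch of that formula, independent of Verina's actual implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k).

    n: total samples per task, c: correct samples, k: sample budget.
    """
    if n - c < k:  # fewer failures than k draws => at least one success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With rounds = 5 and exactly one passing round, pass@1 is 1/5.
print(pass_at_k(5, 1, 1))  # 0.2
```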

Creating Your Own Solution

To create your own solution for the Verina benchmark, implement a custom method by inheriting from the Solution class in benchmark/solution.py. It is easier to start with the SimpleSolution class, which only requires implementing the basic tasks and derives the combined tasks automatically.

The BaselineSolution class in baseline/baseline.py is a good reference implementation of the baseline methods for code, specification, and proof generation.
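The inheritance pattern might look like the toy sketch below. The base class and all method names here are stand-ins defined locally for illustration, not Verina's real Solution/SimpleSolution API; check benchmark/solution.py for the actual interface.

```python
# Toy stand-ins: the real interface lives in benchmark/solution.py.
class SimpleSolutionLike:
    """Mock of a SimpleSolution-style base: subclasses implement the
    basic tasks, and a combined task is derived automatically."""

    def gen_code(self, task: dict) -> str:
        raise NotImplementedError

    def gen_proof(self, task: dict, code: str) -> str:
        raise NotImplementedError

    def gen_code_proof(self, task: dict) -> tuple[str, str]:
        # Combined task derived from the basic ones.
        code = self.gen_code(task)
        return code, self.gen_proof(task, code)

class MySolution(SimpleSolutionLike):
    def gen_code(self, task: dict) -> str:
        return f"-- code for {task['id']}"

    def gen_proof(self, task: dict, code: str) -> str:
        return "-- proof placeholder"

code, proof = MySolution().gen_code_proof({"id": "verina_basic_0"})
print(code)  # -- code for verina_basic_0
```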

FAQs

  • How to disable prefect caching?

We recommend specifying a new output_dir in your configuration file to avoid caching issues. To disable caching completely, set refresh_cache = true in the prefect.toml file.

Development

Setup

```bash
# pre-commit
pre-commit install
```

Code Formatting

```bash
# Sort imports
ruff check --select I --fix

# Format code
ruff format
```

Citation

```bibtex
@article{ye2025verina,
  title={VERINA: Benchmarking Verifiable Code Generation},
  author={Ye, Zhe and Yan, Zhengxu and He, Jingxuan and Kasriel, Timothe and Yang, Kaiyu and Song, Dawn},
  journal={arXiv preprint arXiv:2505.23135},
  year={2025}
}
```

Owner

  • Name: sunblaze-ucb
  • Login: sunblaze-ucb
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
title: "VERINA: Benchmarking Verifiable Code Generation"
message: >-
  If you use this benchmark, please cite it as below.
type: software
authors:
  - given-names: Zhe
    family-names: Ye
  - given-names: Zhengxu
    family-names: Yan
  - given-names: Jingxuan
    family-names: He
  - given-names: Timothe
    family-names: Kasriel
  - given-names: Kaiyu
    family-names: Yang
  - given-names: Dawn
    family-names: Song
date-released: 2025-05-16
url: "https://github.com/sunblaze-ucb/verina"
preferred-citation:
  type: article
  title: "VERINA: Benchmarking Verifiable Code Generation"
  authors:
    - given-names: Zhe
      family-names: Ye
    - given-names: Zhengxu
      family-names: Yan
    - given-names: Jingxuan
      family-names: He
    - given-names: Timothe
      family-names: Kasriel
    - given-names: Kaiyu
      family-names: Yang
    - given-names: Dawn
      family-names: Song
  url: "https://arxiv.org/abs/2505.23135"
  journal: "arXiv preprint arXiv:2505.23135"
  year: 2025

GitHub Events

Total
  • Watch event: 10
  • Delete event: 3
  • Issue comment event: 1
  • Member event: 2
  • Push event: 21
  • Pull request review event: 2
  • Pull request review comment event: 10
  • Pull request event: 9
  • Fork event: 1
  • Create event: 5
Last Year
  • Watch event: 10
  • Delete event: 3
  • Issue comment event: 1
  • Member event: 2
  • Push event: 21
  • Pull request review event: 2
  • Pull request review comment event: 10
  • Pull request event: 9
  • Fork event: 1
  • Create event: 5

Dependencies

pyproject.toml pypi
  • datasets >=3.4.0
  • dspy >=2.6.23
  • google-cloud-aiplatform >=1.92.0
  • ipykernel >=6.29.5
  • jinja2 >=3.1.6
  • langfuse >=2.60.3
  • loguru >=0.7.3
  • matplotlib >=3.10.0
  • pandas >=2.2.3
  • prefect ==3.4.1
  • pydantic >=2.10.6
  • python-dotenv >=1.0.1
  • pyyaml >=6.0.1
  • rich >=13.9.4
  • seaborn >=0.13.2
  • setuptools >=75.8.0
  • tqdm >=4.67.1
  • typer >=0.15.1
  • vllm >=0.8.0
uv.lock pypi
  • 263 dependencies