verina
Verina (Verifiable Code Generation Arena) is a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions.
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (15.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: sunblaze-ucb
- License: apache-2.0
- Language: Lean
- Default Branch: main
- Homepage: https://verina.io/
- Size: 1.54 MB
Statistics
- Stars: 22
- Watchers: 7
- Forks: 2
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Verina: Benchmarking Verifiable Code Generation
Overview
Verina (Verifiable Code Generation Arena) is a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Read more about the project on our website and in our paper.
Dataset
See the datasets/verina directory for the curated dataset used in the benchmark.
Each datapoint in the dataset is organized as a folder containing the following files:
- task.json: A JSON file describing the task, including the id, task signature, paths to necessary data files, and other metadata.
- description.txt: The description of the programming task.
- task.lean: The Lean 4 file containing ground truth code, specification, and proof.
- test.json and reject_inputs.json: JSON files containing test cases and rejected inputs for the task.
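Assuming the folder layout above, one datapoint can be collected into a single dict with a short script. The helper function and the demo field values below are illustrative; only the file names come from the list above.

```python
import json
import tempfile
from pathlib import Path

def load_datapoint(folder: Path) -> dict:
    """Gather the files of one Verina datapoint (layout per the README)."""
    return {
        "task": json.loads((folder / "task.json").read_text()),
        "description": (folder / "description.txt").read_text(),
        "lean": (folder / "task.lean").read_text(),
        "tests": json.loads((folder / "test.json").read_text()),
        "rejected": json.loads((folder / "reject_inputs.json").read_text()),
    }

# Demo: build a fake datapoint in a temp dir and load it back.
with tempfile.TemporaryDirectory() as tmp:
    dp = Path(tmp) / "verina_demo_0"
    dp.mkdir()
    (dp / "task.json").write_text(json.dumps({"id": "verina_demo_0"}))
    (dp / "description.txt").write_text("Sum two numbers.")
    (dp / "task.lean").write_text("-- ground-truth code, spec, and proof")
    (dp / "test.json").write_text(json.dumps([{"input": [1, 2], "expected": 3}]))
    (dp / "reject_inputs.json").write_text(json.dumps([]))
    loaded = load_datapoint(dp)
    print(loaded["task"]["id"])  # → verina_demo_0
```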
The HuggingFace dataset is an aggregated dataset extracted from datasets/verina.
To contribute to Verina, please submit a pull request to this repository.
Prerequisites
Setup
```bash
uv sync
source .venv/bin/activate  # Activate the virtual environment created by uv
lake exe cache get
lake update
```
API Key Configuration
See .env.example for the required environment variables.
Run cp .env.example .env to create a .env file, then fill in your API keys for the language model providers you want to use.
LANGFUSE related variables are optional and used for logging the benchmark runs.
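The variable names below (OPENAI_API_KEY, LANGFUSE_HOST) are illustrative; check .env.example for the real ones. As a sketch of what python-dotenv (a project dependency) does when the .env file is loaded, here is a stdlib-only parser for simple KEY=VALUE lines:

```python
import os
import tempfile
from pathlib import Path

def load_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines from a .env file into os.environ.

    A stripped-down stand-in for python-dotenv's load_dotenv: no quoting,
    interpolation, or export handling, just enough to show the mechanism.
    """
    loaded = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip()
    os.environ.update(loaded)
    return loaded

# Demo with hypothetical variable names.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# API keys\nOPENAI_API_KEY=sk-test\nLANGFUSE_HOST=http://localhost:3000\n")
    env_path = Path(f.name)
vars_loaded = load_env_file(env_path)
env_path.unlink()
print(sorted(vars_loaded))  # ['LANGFUSE_HOST', 'OPENAI_API_KEY']
```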
Running Benchmarks on Baselines
Starting Prefect Server
Before running the benchmarks, start the Prefect server:
```bash
docker compose up -d       # Start the database for Prefect in the background
uv run prefect server start
```
The server will be available at http://127.0.0.1:4200 with the API at http://127.0.0.1:4200/api.
Alternative Setup without Docker
Docker is only needed for setting up a postgres database for Prefect.
If you prefer not to use Docker, you can set up a local PostgreSQL database and update the database.connection_url in the prefect.toml file to point to your local database.
Alternatively, you can comment out the database.connection_url line in prefect.toml to use the default SQLite database.
In this case, we do not recommend running the benchmarks with a large number of workers (e.g. 16 workers).
Using Configuration Files
Existing configuration files are located in the configs directory.
You can create your own configuration file by copying one of the existing files and modifying it as needed.
```toml
output_dir = "output/gpt-4o-mini-5rnd"  # Directory where all generated output will be saved
max_workers = 128                       # Number of parallel worker tasks to use for generation and evaluation
rounds = 5                              # Total number of rounds to run each benchmark task
few_shot_example_names = ["verina_basic_15", "verina_basic_44"]  # Identifiers for example tasks used in few-shot prompting

# Tasks selected to run:
codegen = false           # Code generation task
specgen = false           # Specification generation task
proofgen = true           # Proof generation task
codespecgen = false       # Combined code and specification generation task
codeproofgen = true       # Combined code and proof generation task
specproofgen = true       # Combined specification and proof generation task
codespecproofgen = true   # Combined code, specification, and proof generation task

[gen_lm_config]             # Settings for language model generation; see utils/lm.py for more details
provider = "openai"         # The language model API provider to use
model_name = "gpt-4o-mini"  # Specific model name for generation

[baseline_config]
name = "baseline"               # Name of the baseline method: "baseline" or "proof-refinement"
resume_from_checkpoint = true   # Whether to resume from the previous result file
refinements = 64                # Number of refinements to run for the proof-refinement baseline
```
Custom Models
You can use custom models by specifying the provider and model_name in the gen_lm_config section of your configuration file.
Currently supported providers include openai, anthropic, vertex_ai, together_ai, and any OpenAI-compatible API endpoints (use local provider).
See utils/lm.py for more details.
To add a new provider or new models, or to customize hyperparameters, you can modify the relevant functions in utils/lm.py, or use the local provider (see configs/dsprover2-7b.toml as an example).
Running Benchmark on Baselines
To run the benchmark:
```bash
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/[config_name].toml
```
It is faster to run the benchmark by separating the generation and evaluation steps.
```bash
# For generation only
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/
# For evaluation only
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/benchmark.py -c configs/
```
Running Quality Assurance
```bash
PREFECT_API_URL=http://127.0.0.1:4200/api uv run scripts/quality_assurance.py -c configs/qa.toml
```
Reading Results
The detailed results of the benchmark will be saved in the output_dir directory specified in your configuration file, including the generated artifacts and evaluation scores.
You can use the following snippet to obtain a summary of the results:
```python
from pathlib import Path

from src.verina.benchmark.report import EvaluationRoundsReport
from src.verina.benchmark.summary import DatapointSummaryReport

output_dir = Path("<your_output_dir>")
report = EvaluationRoundsReport.load_latest(output_dir)
summary = DatapointSummaryReport.from_rounds_report(report)
print(summary.pass_at_k(1))
```
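Whether Verina's summary uses exactly this formula is an assumption, but pass@k is conventionally the unbiased estimator popularized by the HumanEval/Codex evaluation: the probability that at least one of k samples drawn from n generations (c of which pass) is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), i.e. the
    probability that a random size-k subset of n generations contains
    at least one of the c passing ones."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 5 rounds per task (rounds = 5 in the example config) and 2 passing:
print(round(pass_at_k(5, 2, 1), 2))  # 0.4
```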
Creating Your Own Solution
To create your own solution for the Verina benchmark, you can implement a custom method by inheriting from the Solution class in benchmark/solution.py.
It is easier to start with the SimpleSolution class, which only requires implementing the basic tasks and derives the combined tasks automatically.
The BaselineSolution class in baseline/baseline.py is a good reference implementation which implements the baseline methods for code, specification, and proof generation.
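The real base classes live in benchmark/solution.py; the self-contained sketch below only illustrates the pattern described above, in which the basic tasks are implemented and a combined task is derived from them. Every name and signature here is hypothetical except the class names Solution and SimpleSolution mentioned in the text.

```python
from abc import ABC, abstractmethod

class Solution(ABC):
    """Illustrative stand-in for the Solution interface; not the real API."""

    @abstractmethod
    def gen_code(self, task: dict) -> str: ...

    @abstractmethod
    def gen_spec(self, task: dict) -> str: ...

    @abstractmethod
    def gen_proof(self, task: dict, code: str, spec: str) -> str: ...

class SimpleSolutionSketch(Solution):
    """Mirrors the README's description of SimpleSolution: implement the
    basic tasks, and combined tasks are derived from them. Bodies are
    placeholders where a real method would call a language model."""

    def gen_code(self, task: dict) -> str:
        return f"-- code for {task['id']}"

    def gen_spec(self, task: dict) -> str:
        return f"-- spec for {task['id']}"

    def gen_proof(self, task: dict, code: str, spec: str) -> str:
        return "sorry  -- a real method would produce a Lean proof here"

    def gen_code_proof(self, task: dict, spec: str) -> tuple[str, str]:
        # A combined task derived mechanically from the basic ones.
        code = self.gen_code(task)
        return code, self.gen_proof(task, code, spec)

sol = SimpleSolutionSketch()
code, proof = sol.gen_code_proof({"id": "verina_basic_15"}, spec="-- spec")
print(code)  # → -- code for verina_basic_15
```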
FAQs
- How to disable Prefect caching?
We recommend specifying a new output_dir in your configuration file to avoid caching issues.
To completely disable caching, set refresh_cache = true in the prefect.toml file.
Development
Setup
```bash
# Install pre-commit hooks
pre-commit install
```
Code Formatting
```bash
# Sort imports
ruff check --select I --fix
# Format code
ruff format
```
Citation
```bibtex
@article{ye2025verina,
  title={VERINA: Benchmarking Verifiable Code Generation},
  author={Ye, Zhe and Yan, Zhengxu and He, Jingxuan and Kasriel, Timothe and Yang, Kaiyu and Song, Dawn},
  journal={arXiv preprint arXiv:2505.23135},
  year={2025}
}
```
Owner
- Name: sunblaze-ucb
- Login: sunblaze-ucb
- Kind: organization
- Repositories: 8
- Profile: https://github.com/sunblaze-ucb
Citation (CITATION.cff)
cff-version: 1.2.0
title: "VERINA: Benchmarking Verifiable Code Generation"
message: >-
If you use this benchmark, please cite it as below.
type: software
authors:
- given-names: Zhe
family-names: Ye
- given-names: Zhengxu
family-names: Yan
- given-names: Jingxuan
family-names: He
- given-names: Timothe
family-names: Kasriel
- given-names: Kaiyu
family-names: Yang
- given-names: Dawn
family-names: Song
date-released: 2025-05-16
url: "https://github.com/sunblaze-ucb/verina"
preferred-citation:
type: article
title: "VERINA: Benchmarking Verifiable Code Generation"
authors:
- given-names: Zhe
family-names: Ye
- given-names: Zhengxu
family-names: Yan
- given-names: Jingxuan
family-names: He
- given-names: Timothe
family-names: Kasriel
- given-names: Kaiyu
family-names: Yang
- given-names: Dawn
family-names: Song
url: "https://arxiv.org/abs/2505.23135"
journal: "arXiv preprint arXiv:2505.23135"
year: 2025
GitHub Events
Total
- Watch event: 10
- Delete event: 3
- Issue comment event: 1
- Member event: 2
- Push event: 21
- Pull request review event: 2
- Pull request review comment event: 10
- Pull request event: 9
- Fork event: 1
- Create event: 5
Last Year
- Watch event: 10
- Delete event: 3
- Issue comment event: 1
- Member event: 2
- Push event: 21
- Pull request review event: 2
- Pull request review comment event: 10
- Pull request event: 9
- Fork event: 1
- Create event: 5
Dependencies
- datasets >=3.4.0
- dspy >=2.6.23
- google-cloud-aiplatform >=1.92.0
- ipykernel >=6.29.5
- jinja2 >=3.1.6
- langfuse >=2.60.3
- loguru >=0.7.3
- matplotlib >=3.10.0
- pandas >=2.2.3
- prefect ==3.4.1
- pydantic >=2.10.6
- python-dotenv >=1.0.1
- pyyaml >=6.0.1
- rich >=13.9.4
- seaborn >=0.13.2
- setuptools >=75.8.0
- tqdm >=4.67.1
- typer >=0.15.1
- vllm >=0.8.0
- 263 dependencies