PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary
Keywords
Repository
PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing
Basic Info
- Host: GitHub
- Owner: mrigankpawagi
- Language: Python
- Default Branch: main
- Homepage: http://mrigank.in/PropertyEval/
- Size: 6.44 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
PropertyEval
A repository of property-based tests for thorough benchmarking of LLM code generation.
Citation
If you use or modify this dataset or related tools, please cite it as below.
```bibtex
@software{propertyEval,
  author = {Pawagi, Mrigank},
  doi = {10.5281/zenodo.15081893},
  title = {{PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing}},
  url = {https://github.com/mrigankpawagi/PropertyEval},
  year = {2025}
}
```
Usage
The /tests directory contains directories labeled from 0 to 163, each of which contains a `strategy.py` file. This file contains the Hypothesis strategy for the corresponding problem from the HumanEval dataset. `__init__.py` files have been placed in each directory so that the tests can be imported as modules. The strategies are available as the `strategy` attribute of these strategy modules. The strategies are used as follows.
```python
from hypothesis import given, strategies

@given(strategies.tuples(*st))
def test_property(args):
    # call functions as f(*args)
    # for example, assert f(*args) == ground_truth(*args)
    ...
```
Here, `st` is the imported strategy. One way to import it is by using the importlib module.
```python
import importlib

st_module = importlib.import_module(f"test.{human_eval_id}.strategy")
st = st_module.strategy
```
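Putting these pieces together, an equivalence test against a canonical solution looks roughly like the sketch below. Everything in it is illustrative rather than taken from the repository: the ground-truth function, the candidate, and the inline strategy tuple stand in for a HumanEval problem and its imported `strategy`.

```python
# Self-contained sketch of the usage pattern above; all names are illustrative.
from hypothesis import given, strategies

def ground_truth(numbers, threshold):
    # Stand-in for a canonical solution: are any two elements closer than threshold?
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers) for b in numbers[i + 1:])

candidate = ground_truth  # replace with the LLM-generated implementation under test

# Stand-in for the `strategy` attribute of a strategy module.
st = (
    strategies.lists(strategies.floats(allow_nan=False, allow_infinity=False), max_size=20),
    strategies.floats(min_value=0, max_value=10),
)

@given(strategies.tuples(*st))
def test_property(args):
    assert candidate(*args) == ground_truth(*args)

test_property()  # Hypothesis runs the property over many generated inputs
```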
Contribution
Thoroughness
We show that it is possible to improve the thoroughness of programming benchmarks using Property-Based Testing (PBT), leveraging the canonical solutions within these benchmarks. For the HumanEval dataset, since adequate property-based tests cannot be automatically generated using rule-based tools, we carefully construct these tests manually. We show that our approach using PBT allows us to synthesize test cases as thorough as those generated using type-aware mutations in Liu et al.'s EvalPlus[^2]. However, our approach can be easily adapted to other contexts.

Dataset
We share our full set of property-based tests as a complementary resource to existing manual and synthesized test suites.
Examples
A non-trivial strategy:

```python
# HumanEval 129: minPath
@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    n = draw(n_st)
    grid = draw(lists(lists(integers(), min_size=n, max_size=n), min_size=n, max_size=n))
    perm = draw(permutations(range(1, n**2 + 1)))
    # fill grid with perm
    for i in range(n):
        for j in range(n):
            grid[i][j] = perm[i*n + j]
    return grid

grid = create_grid()
k = integers(min_value=1, max_value=MAX_INT)
strategy = grid, k
```
Examples of additional constraints on the input space. Here, we have restricted the alphabet and introduced bounds on the lengths of strings and lists.

```python
# HumanEval 134: check_if_last_char_is_a_letter
txt = text(alphabet='abcde0123 ')
strategy = txt
```

```python
# HumanEval 143: words_in_sentence
sentence = text(alphabet="a ", min_size=1, max_size=100) \
    .map(lambda s: re.sub(r"\s+", " ", s)) \
    .filter(lambda s: not (s.startswith(" ") or s.endswith(" ")))
strategy = sentence
```

```python
# HumanEval 158: find_max
words = lists(text(alphabet='abc', max_size=MAX_SEQUENCE_LEN), min_size=1, max_size=MAX_SEQUENCE_LEN)
strategy = words
```
> [!TIP]
> The ghAIstwriter project uses these strategies to create a dataset for finetuning LLMs to generate such strategies automatically.
Automation
For the MBPP dataset, we demonstrate that these tests can be generated largely automatically using GPT-3.5 by providing few-shot prompts based on some of our manually constructed tests. This demonstrates that our approach can be easily scaled to other datasets.
> [!WARNING]
> This is a work in progress, but some preliminary results are available here.
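As a rough illustration of this few-shot idea, strategy generation could be driven as in the sketch below. This is a hypothetical sketch, not the project's actual prompts or pipeline; the model name, prompt wording, and `openai` client usage are assumptions.

```python
# Hypothetical few-shot strategy generation; not the project's actual pipeline.
# Assumes the `openai` package (v1+) and an API key in the environment.
from openai import OpenAI

FEW_SHOT_EXAMPLES = """\
Problem: check_if_last_char_is_a_letter(txt: str) -> bool
Strategy:
txt = text(alphabet='abcde0123 ')
strategy = txt

Problem: find_max(words: list[str]) -> str
Strategy:
words = lists(text(alphabet='abc', max_size=100), min_size=1, max_size=100)
strategy = words
"""

def generate_strategy(problem_description: str) -> str:
    # Ask the model to complete a "Strategy:" block for a new problem,
    # conditioned on the manually written examples above.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Write a Hypothesis strategy for the "
             "given problem, following the format of the examples."},
            {"role": "user", "content": FEW_SHOT_EXAMPLES
             + "\nProblem: " + problem_description + "\nStrategy:"},
        ],
    )
    return response.choices[0].message.content
```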
Contributing back to EvalPlus
Work on PropertyEval has led to several contributions to EvalPlus as suggestions for improving contracts in many problems. These have been acknowledged by the authors of EvalPlus and incorporated in their subsequent releases.
- Incomplete contracts in problems involving balanced parentheses (1, 6)
- Contract misses infinities and NaN (2, 99) (sketched below)
- Contract misses empty list in problem 35
- Incomplete contract in HumanEval #32 (`find_zero`)
- Incomplete contract in HumanEval #160 (`do_algebra`)
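To make the notion of a contract concrete, the sketch below shows, for HumanEval #2 (truncate_number), the kind of input assertion that rules out infinities and NaN. It is only an illustration of the gap reported above, not EvalPlus's actual contract code.

```python
import math

def truncate_number(number: float) -> float:
    # Hypothetical contract sketch (not EvalPlus's code): reject non-float,
    # non-finite, and non-positive inputs before computing the decimal part.
    assert isinstance(number, float) and math.isfinite(number) and number > 0, \
        "invalid inputs"
    return number - int(number)
```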
Evaluation
The /humaneval_groundtruth directory contains canonical solutions to HumanEval problems, adapted from the ground truth solutions provided with EvalPlus v0.1.0. The results from the equivalence tests on code samples for 84 (model, size, temperature) combinations provided with EvalPlus v0.1.0 are available in evaldata.csv. The script for executing this benchmark is a modified fork[^1] of the EvalPlus script.
The `limits/limits.py` file contains several standardized limits for the strategies. The `limits/fuzzer.py` script runs fuzz tests on all HumanEval ground-truth solutions with the strategies in order to validate these limits.
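A validation loop of this kind might look roughly like the sketch below; the module path, the `load_ground_truth` helper, and the settings values are assumptions for illustration rather than the repository's actual script.

```python
# Rough sketch of fuzzing ground-truth solutions with the strategies to
# validate the standardized limits; helper names and paths are assumptions.
import importlib

from hypothesis import given, settings, strategies

def load_ground_truth(problem_id):
    # Placeholder: return the canonical solution callable for `problem_id`.
    raise NotImplementedError

def fuzz_problem(problem_id, max_examples=1000):
    st_module = importlib.import_module(f"test.{problem_id}.strategy")
    st = st_module.strategy
    if not isinstance(st, tuple):
        st = (st,)  # normalize single-strategy problems
    ground_truth = load_ground_truth(problem_id)

    @settings(max_examples=max_examples, deadline=200)  # 200 ms, Hypothesis's default
    @given(strategies.tuples(*st))
    def run(args):
        ground_truth(*args)  # must not crash or exceed the deadline

    run()

# for problem_id in range(164):
#     fuzz_problem(problem_id)
```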
[^1]: We forked EvalPlus and modified the evaluation script to evaluate code samples with PropertyEval's property-based tests as well, in addition to the Base and Base + Extra test cases. We further modified the existing pass@k estimation pipeline so that it also covers PropertyEval's property-based tests. The fork is available as EvalPlusPro. Some points to note are as follows.
1. The property-based tests are executed with 1000 examples, with `@settings(max_examples=1000)`.
2. Instead of the time limits enforced by EvalPlus, we use the default deadline of 200ms that comes with Hypothesis.
[^2]: Jiawei Liu et al. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”. In: arXiv preprint arXiv:2305.01210 (2023)
Owner
- Name: Mrigank Pawagi
- Login: mrigankpawagi
- Kind: user
- Location: Noida, Uttar Pradesh, India
- Website: https://mrigank.in/
- Twitter: mrigankpawagi
- Repositories: 0
- Profile: https://github.com/mrigankpawagi
I am a passionate learner who seeks to find innovative ways of leveraging science and technology to make people confident, secure, reliant, and aware.
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use or modify this dataset or related tools, please cite it as below."
authors:
  - family-names: "Pawagi"
    given-names: "Mrigank"
    orcid: "https://orcid.org/0009-0002-6169-4766"
title: "PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing"
version: 0.1.0-alpha
doi: 10.5281/zenodo.15081893
date-released: 2025-03-25
url: "https://github.com/mrigankpawagi/PropertyEval"
```
GitHub Events
Total
- Release event: 1
- Push event: 5
- Create event: 1
Last Year
- Release event: 1
- Push event: 5
- Create event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Mrigank Pawagi | m****i@g****m | 54 |
| manyatapawagi | m****i@g****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0