PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary
Keywords
Repository
PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing
Basic Info
- Host: GitHub
- Owner: mrigankpawagi
- Language: Python
- Default Branch: main
- Homepage: http://mrigank.in/PropertyEval/
- Size: 6.44 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
PropertyEval
A repository of property-based tests for thorough benchmarking of LLM code generation.
Citation
If you use or modify this dataset or related tools, please cite it as below.
```bibtex
@software{propertyEval,
  author = {Pawagi, Mrigank},
  doi = {10.5281/zenodo.15081893},
  title = {{PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing}},
  url = {https://github.com/mrigankpawagi/PropertyEval},
  year = {2025}
}
```
Usage
The /tests directory contains directories labeled from 0 to 163, each of which contains a `strategy.py` file. This file contains the Hypothesis strategy for the corresponding problem from the HumanEval dataset. `__init__.py` files have been placed in each directory so that the tests can be imported as modules. The strategies are available as the `strategy` attribute of these strategy modules. The strategies are used as follows.
```python
from hypothesis import given, strategies

@given(strategies.tuples(*st))
def test_property(args):
    # call functions as f(*args)
    # for example, assert f(*args) == ground_truth(*args)
    ...
```
Here, `st` is the imported strategy. One way to import it is by using the importlib module.
```python
import importlib

st_module = importlib.import_module(f"test.{human_eval_id}.strategy")
st = st_module.strategy
```
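Putting these pieces together, an equivalence test against a canonical solution looks roughly like the sketch below. Everything in it is illustrative rather than taken from the repository: the ground-truth function, the candidate, and the inline strategy tuple stand in for a HumanEval problem and its imported `strategy`.

```python
# Self-contained sketch of the usage pattern above; all names are illustrative.
from hypothesis import given, strategies

def ground_truth(numbers, threshold):
    # Stand-in for a canonical solution: are any two elements closer than threshold?
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers) for b in numbers[i + 1:])

candidate = ground_truth  # replace with the LLM-generated implementation under test

# Stand-in for the `strategy` attribute of a strategy module.
st = (
    strategies.lists(strategies.floats(allow_nan=False, allow_infinity=False), max_size=20),
    strategies.floats(min_value=0, max_value=10),
)

@given(strategies.tuples(*st))
def test_property(args):
    assert candidate(*args) == ground_truth(*args)

test_property()  # Hypothesis runs the property over many generated inputs
```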
Contribution
Thoroughness
We show that it is possible to improve the thoroughness of programming benchmarks using Property-Based Testing (PBT), leveraging the canonical solutions within these benchmarks. For the HumanEval dataset, since adequate property-based tests cannot be automatically generated using rule-based tools, we carefully construct these tests manually. We show that our approach using PBT allows us to synthesize test cases as thorough as those generated using type-aware mutations in Liu et al.'s EvalPlus[^2]. However, our approach can be easily adapted to other contexts.

Dataset
We share our full set of property-based tests as a complementary resource to existing manual and synthesized test suites.
Examples
A non-trivial strategy:

```python
# HumanEval 129: minPath
@composite
def create_grid(draw, n_st=integers(min_value=2, max_value=MAX_SEQUENCE_LEN)):
    n = draw(n_st)
    grid = draw(lists(lists(integers(), min_size=n, max_size=n), min_size=n, max_size=n))
    perm = draw(permutations(range(1, n**2 + 1)))
    # fill grid with perm
    for i in range(n):
        for j in range(n):
            grid[i][j] = perm[i*n + j]
    return grid

grid = create_grid()
k = integers(min_value=1, max_value=MAX_INT)
strategy = grid, k
```
Examples of additional constraints on the input space. Here, we have restricted the alphabet and introduced bounds on the lengths of strings and lists.

```python
# HumanEval 134: check_if_last_char_is_a_letter
txt = text(alphabet='abcde0123 ')
strategy = txt
```

```python
# HumanEval 143: words_in_sentence
sentence = text(alphabet="a ", min_size=1, max_size=100) \
    .map(lambda s: re.sub(r"\s+", " ", s)) \
    .filter(lambda s: not (s.startswith(" ") or s.endswith(" ")))
strategy = sentence
```

```python
# HumanEval 158: find_max
words = lists(text(alphabet='abc', max_size=MAX_SEQUENCE_LEN), min_size=1, max_size=MAX_SEQUENCE_LEN)
strategy = words
```
> [!TIP]
> The ghAIstwriter project uses these strategies to create a dataset for finetuning LLMs to generate such strategies automatically.
Automation
For the MBPP dataset, we demonstrate that these tests can be generated largely automatically using GPT-3.5 by providing few-shot prompts based on some of our manually constructed tests. This demonstrates that our approach can be easily scaled to other datasets.
> [!WARNING]
> This is a work in progress, but some preliminary results are available here.
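As a rough illustration of this few-shot idea, strategy generation could be driven as in the sketch below. This is a hypothetical sketch, not the project's actual prompts or pipeline; the model name, prompt wording, and `openai` client usage are assumptions.

```python
# Hypothetical few-shot strategy generation; not the project's actual pipeline.
# Assumes the `openai` package (v1+) and an API key in the environment.
from openai import OpenAI

FEW_SHOT_EXAMPLES = """\
Problem: check_if_last_char_is_a_letter(txt: str) -> bool
Strategy:
txt = text(alphabet='abcde0123 ')
strategy = txt

Problem: find_max(words: list[str]) -> str
Strategy:
words = lists(text(alphabet='abc', max_size=100), min_size=1, max_size=100)
strategy = words
"""

def generate_strategy(problem_description: str) -> str:
    # Ask the model to complete a "Strategy:" block for a new problem,
    # conditioned on the manually written examples above.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Write a Hypothesis strategy for the "
             "given problem, following the format of the examples."},
            {"role": "user", "content": FEW_SHOT_EXAMPLES
             + "\nProblem: " + problem_description + "\nStrategy:"},
        ],
    )
    return response.choices[0].message.content
```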
Contributing back to EvalPlus
Work on PropertyEval has led to several contributions to EvalPlus as suggestions for improving contracts in many problems. These have been acknowledged by the authors of EvalPlus and incorporated in their subsequent releases.
- Incomplete contracts in problems involving balanced parentheses (1, 6)
- Contract misses infinities and NaN (2, 99) (sketched below)
- Contract misses empty list in problem 35
- Incomplete contract in HumanEval #32 (`find_zero`)
- Incomplete contract in HumanEval #160 (`do_algebra`)
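To make the notion of a contract concrete, the sketch below shows, for HumanEval #2 (truncate_number), the kind of input assertion that rules out infinities and NaN. It is only an illustration of the gap reported above, not EvalPlus's actual contract code.

```python
import math

def truncate_number(number: float) -> float:
    # Hypothetical contract sketch (not EvalPlus's code): reject non-float,
    # non-finite, and non-positive inputs before computing the decimal part.
    assert isinstance(number, float) and math.isfinite(number) and number > 0, \
        "invalid inputs"
    return number - int(number)
```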
Evaluation
The /humaneval_groundtruth directory contains canonical solutions to HumanEval problems, adapted from the ground truth solutions provided with EvalPlus v0.1.0. The results from the equivalence tests on code samples for 84 (model, size, temperature) combinations provided with EvalPlus v0.1.0 are available in evaldata.csv. The script for executing this benchmark is a modified fork[^1] of the EvalPlus script.
The `limits/limits.py` file contains several standardized limits for the strategies. The `limits/fuzzer.py` script runs fuzz tests on all HumanEval ground-truth solutions with the strategies in order to validate these limits.
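A validation loop of this kind might look roughly like the sketch below; the module path, the `load_ground_truth` helper, and the settings values are assumptions for illustration rather than the repository's actual script.

```python
# Rough sketch of fuzzing ground-truth solutions with the strategies to
# validate the standardized limits; helper names and paths are assumptions.
import importlib

from hypothesis import given, settings, strategies

def load_ground_truth(problem_id):
    # Placeholder: return the canonical solution callable for `problem_id`.
    raise NotImplementedError

def fuzz_problem(problem_id, max_examples=1000):
    st_module = importlib.import_module(f"test.{problem_id}.strategy")
    st = st_module.strategy
    if not isinstance(st, tuple):
        st = (st,)  # normalize single-strategy problems
    ground_truth = load_ground_truth(problem_id)

    @settings(max_examples=max_examples, deadline=200)  # 200 ms, Hypothesis's default
    @given(strategies.tuples(*st))
    def run(args):
        ground_truth(*args)  # must not crash or exceed the deadline

    run()

# for problem_id in range(164):
#     fuzz_problem(problem_id)
```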
[^1]: We forked EvalPlus and modified the evaluation script to evaluate code samples with PropertyEval's property-based tests as well, in addition to the Base and Base + Extra test cases. We further modified the existing pass@k estimation pipeline so that it also covers PropertyEval's property-based tests. The fork is available as EvalPlusPro. Some points to note are as follows.
1. The property-based tests are executed with 1000 examples, with `@settings(max_examples=1000)`.
2. Instead of the time limits enforced by EvalPlus, we use the default deadline of 200ms that comes with Hypothesis.
[^2]: Jiawei Liu et al. “Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation”. In: arXiv preprint arXiv:2305.01210 (2023)
Owner
- Name: Mrigank Pawagi
- Login: mrigankpawagi
- Kind: user
- Location: Noida, Uttar Pradesh, India
- Website: https://mrigank.in/
- Twitter: mrigankpawagi
- Repositories: 0
- Profile: https://github.com/mrigankpawagi
I am a passionate learner who seeks to find innovative ways of leveraging science and technology to make people confident, secure, reliant, and aware.
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use or modify this dataset or related tools, please cite it as below."
authors:
  - family-names: "Pawagi"
    given-names: "Mrigank"
    orcid: "https://orcid.org/0009-0002-6169-4766"
title: "PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing"
version: 0.1.0-alpha
doi: 10.5281/zenodo.15081893
date-released: 2025-03-25
url: "https://github.com/mrigankpawagi/PropertyEval"
```
GitHub Events
Total
- Release event: 1
- Push event: 5
- Create event: 1
Last Year
- Release event: 1
- Push event: 5
- Create event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Mrigank Pawagi | m****i@g****m | 54 |
| manyatapawagi | m****i@g****m | 1 |
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0