https://github.com/cornell-zhang/heurigym
Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity: 11.8%)
Repository
Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Basic Info
- Host: GitHub
- Owner: cornell-zhang
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2506.07972
- Size: 8.43 MB
Statistics
- Stars: 39
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
📙About • 📚Problems • 🔥Quick Start • 🚀LLM Solver Agent • 🤝Contribute • 📜Citation
📘 About
HeuriGym is a benchmark for evaluating how well LLMs generate and refine heuristics for real-world combinatorial optimization (CO) tasks through agentic, code-driven interaction.

🔍 Why HeuriGym?
Existing LLM benchmarks fall short:
- 🎯 Closed-form tasks (e.g., AIME, HumanEval): Saturated, too simplistic for real-world reasoning.
- 🤖 Subjective evaluations (e.g., Chatbot Arena): Noisy, inconsistent, and unreliable for technical tasks.
HeuriGym fills this gap with:
- 🧩 Open-ended problems: Well-defined objectives with large solution spaces.
- 🤖 Agentic interaction: LLMs improve heuristics through feedback-driven code execution.
- 📏 Expert comparison metrics: Measure both pass rate and quality relative to expert solutions.
Let LLMs think, code, and improve—just like real solvers.
📚 Problems
The initial release of the HeuriGym benchmark includes nine distinct optimization problems spanning four scientific and engineering domains.
| Domain | Problem | Difficulty |
| :--: | :--: | :--: |
| EDA | Operator scheduling | ★ |
| EDA | Technology mapping | ★★ |
| EDA | Global routing | ★★★ |
| Compilers | E-graph extraction | ★ |
| Compilers | Intra-operator parallelism | ★★ |
| CompBio | Protein sequence design | ★ |
| CompBio | Mendelian error detection | ★★ |
| Logistics | Airline crew pairing | ★★ |
| Logistics | Pickup and delivery w/ time windows | ★★★ |
🔥 Quick Start
Clone the repository:
```bash
git clone https://github.com/cornell-zhang/heurigym.git
cd heurigym
```
Install the required dependencies:
```bash
pip install -r requirements.txt
```
Set up API keys:
```bash
# You need a HuggingFace token to download the dataset.
export HUGGINGFACE_TOKEN=<your_huggingface_key_here>
# If you are using Google models, you also need a Google API key.
export GOOGLE_API_KEY=<your_google_key_here>
```
Run the agent to solve the operator scheduling problem with Gemini 2.5 Pro:
```bash
python llm_solver_agent.py --problem operator_scheduling \
    --models gemini-2.5-pro-preview-05-06
```
Check the results in the llm_solutions directory. Best results are saved in best_results.json and error analysis is saved in error_summary.json.
🚀 LLM Solver Agent
Create a .env file in the root directory with the API keys for the models you want to use:
```
# Required only if using models from OpenAI (e.g., o4-mini:high)
OPENAI_API_KEY=your_openai_key_here
# Required only if using models from Anthropic (e.g., claude-3-7-sonnet-20250219)
ANTHROPIC_API_KEY=your_anthropic_key_here
# Required only if using models from DeepSeek (e.g., deepseek-chat, deepseek-coder)
DEEPSEEK_API_KEY=your_deepseek_key_here
# Required only if using models from Google (e.g., gemini-2.5-flash-preview-04-17, gemini-2.5-pro-preview-05-06)
GOOGLE_API_KEY=your_google_key_here
# Required only if using models from OpenRouter (e.g., openrouter/meta-llama/llama-4-maverick)
OPENROUTER_API_KEY=your_openrouter_key_here
# Required only if using models from Alibaba (e.g., qwen3-235b-a22b)
DASHSCOPE_API_KEY=your_alibaba_key_here
# Required in all cases: a HuggingFace token to download the dataset
HUGGINGFACE_TOKEN=your_huggingface_key_here
```
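Since python-dotenv appears in the dependency list, the agent presumably loads this file at startup; below is a minimal sketch of that pattern (how llm_solver_agent.py actually consumes these variables is an assumption):
```python
import os
from dotenv import load_dotenv  # python-dotenv, listed in requirements.txt

load_dotenv()  # read KEY=value pairs from .env into os.environ
google_key = os.environ.get("GOOGLE_API_KEY")
hf_token = os.environ.get("HUGGINGFACE_TOKEN")
```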
Usage
Run the agent to solve the operator scheduling problem with Gemini 2.5 Pro:
```bash
# Requires GOOGLE_API_KEY
python llm_solver_agent.py --problem operator_scheduling \
    --models gemini-2.5-pro-preview-05-06
```
Run the agent to solve the e-graph extraction problem with Claude 3.7 Sonnet:
```bash
# Requires ANTHROPIC_API_KEY
python llm_solver_agent.py --problem egraph_extraction \
    --models claude-3-7-sonnet-20250219
```
Run the agent to solve the airline crew pairing problem with o4-mini:high:
```bash
# Requires OPENAI_API_KEY
python llm_solver_agent.py --problem crew_pairing \
    --models o4-mini:high
```
Command Line Arguments
The agent supports the following command line arguments:
```bash
python llm_solver_agent.py [options]
```
Options:
- --models MODEL1 MODEL2 ...: List of models to use (default: all supported models)
- --iterations N: Maximum number of iterations for each model (default: 3)
- --problem PROBLEM_NAME: Specific problem to solve (folder name)
- --timeout TIMEOUT: Timeout in seconds for program execution (default: 10)
- --temperature TEMPERATURE: Temperature for LLM generation (default: 0.0)
- --stream: Enable streaming output from LLM (default: False, but True for Qwen models)
- --history_rounds H: Number of previous rounds to keep in conversation history (default: None, keep all history)
- --num_cores C: Number of CPU cores to use for program execution (default: 8)
- --few_shots S: Number of training examples to provide to LLMs (default: None, use all examples)
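For example, a run that exercises several of these flags at once might look like the following (the flag combination and values are illustrative, not a recommended configuration):
```bash
# Illustrative combination of the documented flags
python llm_solver_agent.py --problem operator_scheduling \
    --models gemini-2.5-pro-preview-05-06 claude-3-7-sonnet-20250219 \
    --iterations 5 --timeout 60 --temperature 0.2 \
    --history_rounds 2 --num_cores 16 --few_shots 3
```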
The agent will:
1. Scan all directories in the workspace for README.md files
2. Parse the problem descriptions
3. Request solutions from configured LLMs with iterative improvement
4. Save solutions in the llm_solutions directory
5. Collect results, analyze all solutions, find the best results, and perform error analysis. Best results are saved in best_results.json and error analysis in error_summary.json.
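The schema of best_results.json is not documented here, so a quick way to inspect a finished run is simply to load and pretty-print the JSON; a minimal sketch:
```python
import json

# Path per the README; the JSON schema itself is not documented here.
with open("llm_solutions/best_results.json") as f:
    best = json.load(f)

# Pretty-print a truncated view to get a feel for the structure.
print(json.dumps(best, indent=2)[:1000])
```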
🤝 Contribute
We welcome contributions to the HeuriGym benchmark!
To add a new problem to the benchmark suite, you need to create a new folder in the problems directory.
The folder should have two subfolders:
* dataset: A folder for problem instances
* program: A folder for the program template
You can copy the template folder as a starting point. There are several files you need to implement or include:
* README.md: Problem description, formalization, and input/output format
* solver.py: A template solver function for the LLM to fill in. Feel free to overload the solve function by copying it to your problem folder.
* verifier.py: After the LLM provides a solution, the verifier checks whether the solution is valid. Please implement the verify function in this file.
* evaluator.py: After the solution is verified, the evaluator calculates the cost of the solution. Please implement the evaluate function in this file (a sketch of both functions follows below).
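As a minimal sketch of what verifier.py and evaluator.py might contain for a toy problem (the function names follow the README, but the exact signatures and file formats here are assumptions, not the template's actual API):
```python
# Hypothetical toy problem: a solution file lists one integer per line,
# and the cost is the sum of those integers. Signatures are assumptions.

def verify(input_file: str, solution_file: str) -> tuple[bool, str]:
    """Return (is_valid, error_message) for the proposed solution."""
    with open(solution_file) as f:
        lines = [line.strip() for line in f if line.strip()]
    if not lines:
        return False, "empty solution file"
    if not all(line.lstrip("-").isdigit() for line in lines):
        return False, "solution must contain one integer per line"
    return True, ""

def evaluate(input_file: str, solution_file: str) -> float:
    """Return the cost of a verified solution (lower is better)."""
    with open(solution_file) as f:
        return float(sum(int(line) for line in f if line.strip()))
```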
📜 Citation
```bibtex
@article{chen2025heurigym,
  title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization},
  author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
  journal={arXiv preprint arXiv:2506.07972},
  year={2025}
}
```
Owner
- Name: Cornell Zhang Research Group
- Login: cornell-zhang
- Kind: organization
- Website: https://zhang.ece.cornell.edu/
- Repositories: 12
- Profile: https://github.com/cornell-zhang
GitHub Events
Total
- Watch event: 17
- Issue comment event: 1
- Push event: 11
- Public event: 1
- Pull request event: 4
- Pull request review event: 1
- Fork event: 1
- Create event: 1
Last Year
- Watch event: 17
- Issue comment event: 1
- Push event: 11
- Public event: 1
- Pull request event: 4
- Pull request review event: 1
- Fork event: 1
- Create event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Pull Request Authors
- tonyjie (2)
Dependencies
- biopython *
- datasets *
- huggingface_hub *
- markdown >=3.5.2
- matplotlib *
- networkx ==3.4.2
- numpy ==2.2.5
- openai >=1.12.0
- pandas ==2.2.3
- python-dotenv >=1.0.0