Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

https://github.com/cornell-zhang/heurigym

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary

Keywords

agent, benchmark, chatgpt, large-language-models, optimization
Last synced: 5 months ago

Repository

Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

Basic Info
Statistics
  • Stars: 39
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Topics
agent, benchmark, chatgpt, large-language-models, optimization
Created 11 months ago · Last pushed 6 months ago
Metadata Files
Readme · License

README.md

📙 About · 📚 Problems · 🔥 Quick Start · 🚀 LLM Solver Agent · 🤝 Contribute · 📜 Citation

📘 About

HeuriGym is a benchmark for evaluating how well LLMs generate and refine heuristics for real-world combinatorial optimization (CO) tasks through agentic, code-driven interaction.

*(Figure: overview of the HeuriGym framework)*

🔍 Why HeuriGym?

Existing LLM benchmarks fall short:

  • 🎯 Closed-form tasks (e.g., AIME, HumanEval): Saturated, too simplistic for real-world reasoning.
  • 🤖 Subjective evaluations (e.g., Chatbot Arena): Noisy, inconsistent, and unreliable for technical tasks.

HeuriGym fills this gap with:

  • 🧩 Open-ended problems: Well-defined objectives with large solution spaces.
  • 🤖 Agentic interaction: LLMs improve heuristics through feedback-driven code execution.
  • 📏 Expert comparison metrics: Measure both pass rate and quality relative to expert solutions (see the sketch below).

Let LLMs think, code, and improve—just like real solvers.
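
As a concrete illustration, here is a hedged sketch of the two metric ingredients. These definitions are illustrative only, not necessarily the exact formulas from the HeuriGym paper:

```python
# Illustrative-only metric definitions, not necessarily the exact
# formulas from the HeuriGym paper.

def pass_rate(valid_flags: list[bool]) -> float:
    """Fraction of instances where the heuristic produced a verifier-approved solution."""
    return sum(valid_flags) / len(valid_flags)

def quality(llm_cost: float, expert_cost: float) -> float:
    """Expert-relative quality for a minimization objective:
    1.0 means expert parity, above 1.0 means the LLM heuristic found
    a cheaper solution than the expert baseline."""
    return expert_cost / llm_cost

print(pass_rate([True, True, False]))              # 0.67: 2 of 3 instances solved validly
print(quality(llm_cost=120.0, expert_cost=100.0))  # 0.83: slightly worse than the expert
```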

📚 Problems

The initial release of the HeuriGym benchmark includes nine distinct optimization problems spanning four scientific and engineering domains.

| Domain | Problem | Difficulty |
| :--: | :--: | :--: |
| EDA | Operator scheduling | ★ |
| EDA | Technology mapping | ★★ |
| EDA | Global routing | ★★★ |
| Compilers | E-graph extraction | ★ |
| Compilers | Intra-operator parallelism | ★★ |
| CompBio | Protein sequence design | ★ |
| CompBio | Mendelian error detection | ★★ |
| Logistics | Airline crew pairing | ★★ |
| Logistics | Pickup and delivery w/ time windows | ★★★ |

🔥 Quick Start

1. Clone the repository:

```bash
git clone https://github.com/cornell-zhang/heurigym.git
cd heurigym
```

2. Install the required dependencies:

```bash
pip install -r requirements.txt
```

3. Set up API keys:

```bash
# You need a HuggingFace token to download the dataset.
export HUGGINGFACE_TOKEN=<your_huggingface_key_here>

# If you are using Google models, you also need a Google API key.
export GOOGLE_API_KEY=<your_google_key_here>
```

4. Run the agent to solve the operator scheduling problem with Gemini 2.5 Pro:

```bash
python llm_solver_agent.py --problem operator_scheduling \
    --models gemini-2.5-pro-preview-05-06
```

5. Check the results in the `llm_solutions` directory. Best results are saved in `best_results.json` and error analysis is saved in `error_summary.json`.
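
To peek at the aggregated output, you can pretty-print the JSON directly. A minimal sketch; the file's internal schema is not documented here, so nothing beyond valid JSON is assumed:

```python
import json

# Pretty-print whatever the agent wrote; only "valid JSON" is assumed.
with open("llm_solutions/best_results.json") as f:
    print(json.dumps(json.load(f), indent=2))
```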

🚀 LLM Solver Agent

Create a `.env` file in the root directory with the API keys for the models you want to use:

```
# Required only if using models from OpenAI (e.g., o4-mini:high)
OPENAI_API_KEY=your_openai_key_here

# Required only if using models from Anthropic (e.g., claude-3-7-sonnet-20250219)
ANTHROPIC_API_KEY=your_anthropic_key_here

# Required only if using models from DeepSeek (e.g., deepseek-chat, deepseek-coder)
DEEPSEEK_API_KEY=your_deepseek_key_here

# Required only if using models from Google (e.g., gemini-2.5-flash-preview-04-17, gemini-2.5-pro-preview-05-06)
GOOGLE_API_KEY=your_google_key_here

# Required only if using models from OpenRouter (e.g., openrouter/meta-llama/llama-4-maverick)
OPENROUTER_API_KEY=your_openrouter_key_here

# Required only if using models from Alibaba (e.g., qwen3-235b-a22b)
DASHSCOPE_API_KEY=your_alibaba_key_here
```

You also need a HuggingFace token to download the dataset: `HUGGINGFACE_TOKEN=your_huggingface_key_here`.
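
`python-dotenv` appears in `requirements.txt`, so the agent presumably reads these keys from `.env` at startup. A minimal sketch to confirm your keys are visible before kicking off a long run:

```python
import os

from dotenv import load_dotenv

# Load variables from .env in the current directory into os.environ.
load_dotenv()

# Spot-check the key for the provider you plan to use (GOOGLE_API_KEY here).
if not os.getenv("GOOGLE_API_KEY"):
    print("GOOGLE_API_KEY is not set; Gemini models will fail.")
```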

Usage

Run the agent to solve the operator scheduling problem with Gemini 2.5 Pro:

```bash
# Requires GOOGLE_API_KEY
python llm_solver_agent.py --problem operator_scheduling \
    --models gemini-2.5-pro-preview-05-06
```

Run the agent to solve the e-graph extraction problem with Claude 3.7 Sonnet:

```bash
# Requires ANTHROPIC_API_KEY
python llm_solver_agent.py --problem egraph_extraction \
    --models claude-3-7-sonnet-20250219
```

Run the agent to solve the airline crew pairing problem with o4-mini:high:

```bash
# Requires OPENAI_API_KEY
python llm_solver_agent.py --problem crew_pairing \
    --models o4-mini:high
```

Command Line Arguments

The agent supports the following command line arguments:

```bash
python llm_solver_agent.py [options]
```

Options:

- `--models MODEL1 MODEL2 ...`: List of models to use (default: all supported models)
- `--iterations N`: Maximum number of iterations for each model (default: 3)
- `--problem PROBLEM_NAME`: Specific problem to solve (folder name)
- `--timeout TIMEOUT`: Timeout in seconds for program execution (default: 10)
- `--temperature TEMPERATURE`: Temperature for LLM generation (default: 0.0)
- `--stream`: Enable streaming output from the LLM (default: False, but True for Qwen models)
- `--history_rounds H`: Number of previous rounds to keep in conversation history (default: None, keep all history)
- `--num_cores C`: Number of CPU cores to use for program execution (default: 8)
- `--few_shots S`: Number of training examples to provide to LLMs (default: None, use all examples)
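
The flags compose; for example, a hedged combined invocation (all flags are documented above, but the `global_routing` folder name is an assumption based on the problems table):

```bash
# Hypothetical combined invocation: 5 refinement iterations, a 60-second
# execution timeout, keep only the last 2 rounds of conversation history,
# and show the model 3 training examples. The global_routing folder name
# is assumed, not confirmed by the README.
python llm_solver_agent.py --problem global_routing \
    --models gemini-2.5-pro-preview-05-06 \
    --iterations 5 --timeout 60 --history_rounds 2 --few_shots 3
```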

The agent will:

1. Scan all directories in the workspace for README.md files
2. Parse the problem descriptions
3. Request solutions from the configured LLMs with iterative improvement
4. Save solutions in the `llm_solutions` directory
5. Collect results, analyze all solutions, find the best results, and perform error analysis; best results are saved in `best_results.json` and error analysis in `error_summary.json`
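
Conceptually, the iterative-improvement loop looks roughly like the following sketch. Every helper here is a hypothetical stub standing in for the real logic in `llm_solver_agent.py`:

```python
# Structural sketch of the agent's refine loop; generate_heuristic and
# run_program below are hypothetical stubs, not the repository's functions.
from dataclasses import dataclass

@dataclass
class RunResult:
    valid: bool   # did the verifier accept the solution?
    cost: float   # objective value from the evaluator (lower is better)
    log: str      # execution/verification output fed back to the LLM

def generate_heuristic(prompt: str) -> str:
    return "def solve(instance): ..."                  # stub for the LLM call

def run_program(code: str) -> RunResult:
    return RunResult(valid=True, cost=42.0, log="ok")  # stub for execution

max_iterations = 3                                     # mirrors --iterations
prompt = "problem description parsed from README.md"
best_cost, best_code = float("inf"), None

for _ in range(max_iterations):
    code = generate_heuristic(prompt)
    result = run_program(code)                         # verify, then evaluate
    if result.valid and result.cost < best_cost:
        best_cost, best_code = result.cost, code
    # Feed execution errors and solution cost back for the next round.
    prompt += f"\nFeedback: {result.log} (cost={result.cost})"
```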

🤝 Contribute

We welcome contributions to the HeuriGym benchmark!

To add a new problem to the benchmark suite, you need to create a new folder in the `problems` directory. The folder should have two subfolders:

- `dataset`: A folder for problem instances
- `program`: A folder for the program template

You can copy the `template` folder as a starting point. There are several files you need to implement or include:

- `README.md`: Problem description, formalization, and input/output format
- `solver.py`: A template solver function for the LLM to fill in. Feel free to overload the `solve` function by copying it to your problem folder.
- `verifier.py`: After the LLM provides a solution, the verifier will check if the solution is valid. Please implement the `verify` function in this file.
- `evaluator.py`: After the solution is verified, the evaluator will calculate the cost of the solution. Please implement the `evaluate` function in this file.
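
As a rough starting point, the two contributed functions might look like the following. The signatures are hypothetical; check the `template` folder for the exact interfaces:

```python
# Hypothetical skeletons for a new problem's verifier.py and evaluator.py;
# the authoritative signatures live in the repository's template folder.

def verify(input_file: str, solution_file: str) -> tuple[bool, str]:
    """Check that a candidate solution satisfies all problem constraints.
    Returning a message lets the agent feed errors back to the LLM."""
    # ... parse both files and check feasibility ...
    return True, "solution is feasible"

def evaluate(input_file: str, solution_file: str) -> float:
    """Compute the objective cost of a verified solution (lower is better)."""
    # ... parse both files and compute the objective ...
    return 0.0
```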

📜 Citation

```bibtex
@article{chen2025heurigym,
  title={HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization},
  author={Hongzheng Chen and Yingheng Wang and Yaohui Cai and Hins Hu and Jiajie Li and Shirley Huang and Chenhui Deng and Rongjian Liang and Shufeng Kong and Haoxing Ren and Samitha Samaranayake and Carla P. Gomes and Zhiru Zhang},
  journal={arXiv preprint arXiv:2506.07972},
  year={2025}
}
```

Owner

  • Name: Cornell Zhang Research Group
  • Login: cornell-zhang
  • Kind: organization

GitHub Events

Total
  • Watch event: 17
  • Issue comment event: 1
  • Push event: 11
  • Public event: 1
  • Pull request event: 4
  • Pull request review event: 1
  • Fork event: 1
  • Create event: 1
Last Year
  • Watch event: 17
  • Issue comment event: 1
  • Push event: 11
  • Public event: 1
  • Pull request event: 4
  • Pull request review event: 1
  • Fork event: 1
  • Create event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Pull Request Authors
  • tonyjie (2)

Dependencies

requirements.txt pypi
  • biopython *
  • datasets *
  • huggingface_hub *
  • markdown >=3.5.2
  • matplotlib *
  • networkx ==3.4.2
  • numpy ==2.2.5
  • openai >=1.12.0
  • pandas ==2.2.3
  • python-dotenv >=1.0.0