https://github.com/anjiang-wei/satbench

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (9.6%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: Anjiang-Wei
Language: Python
Default Branch: main
Size: 24.4 KB

Statistics

Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created 11 months ago · Last pushed 11 months ago

Metadata Files

Readme

SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Setting Up the Environment

1. Clone the repository and enter the directory:

bash git clone https://github.com/Anjiang-Wei/SATBench.git cd SATBench

2. Create and activate a conda environment:

bash conda create -y -n SATBench python=3.12 conda activate SATBench

3. Install dependencies:

bash pip install -r requirements.txt

4. Set API keys:

bash export OPENAI_API_KEY=<your_openai_api_key> export ANTHROPIC_API_KEY=<your_anthropic_api_key> export TOGETHER_API_KEY=<your_together_api_key>

Dataset Generation Pipeline

1. Generate SAT/UNSAT CNF Problems

bash python step1.1_sat_problem_generation.py \ --output sat_problems.json \ --num_per_config 10 \ --max_clause_len 3 \ --sat_ratio 0.5 \ --dimensions "[[3,2],[3,3,2]]" \ --clauses "[4,5,6,7,8,9,10]"

Key Arguments: - --output: Output file for SAT problems - --num_per_config: Number of examples per config - --dimensions: Variable dimensionality - --clauses: Clause counts to vary difficulty

2. Generate Scenario + Variable Mapping

bash python step1.2_scenario_mapping_generation.py \ --input sat_problems.json \ --output scenario_and_mapping.jsonl

Key Arguments: - --input: Input CNF problems - --output: Output JSONL file with generated scenarios and variable meanings

3. Generate Final Puzzle Conditions and Questions

bash python step1.3_puzzle_generation.py \ --input scenario_and_mapping.jsonl \ --output puzzle_problems.jsonl \ --target_per_clause 5 \ --clause_values "[4,5,6]"

Key Arguments: - --input: Scenario and mapping file - --output: Final puzzle output - --target_per_clause: Number of examples to generate per clause count - --clause_values: Which clause numbers to include in the final benchmark

You can also access the released version on Hugging Face:

👉 https://huggingface.co/datasets/LLM4Code/SATBench

Each puzzle includes: - A natural language scenario + variable mapping - A list of logical conditions - A final question - Ground-truth SAT/UNSAT label

Evaluation Pipeline

We provide scripts to evaluate LLMs on SATBench directly from the Hugging Face dataset.

Running Evaluations

bash python step2_evaluation.py \ --mode sat \ --eval_model openai_gpt-4o \ --limit 10

bash python step2_evaluation.py \ --mode trace \ --eval_model openai_gpt-4o

Modes: - --mode sat: Run SAT/UNSAT prediction for each puzzle. - --mode trace: Judge reasoning trace quality (after SAT/UNSAT evaluation).

Arguments: - --limit: Max number of examples to evaluate (optional). - --eval_model: Model used to answer puzzles (e.g., openai_gpt-4o, together_Qwen3-14B).

Statistics Summary

After evaluation, generate per-bucket statistics:

bash python step3_stats.py

Or for specific models:

bash python step3_stats.py --models openai_gpt-4o

This prints accuracy by SAT/UNSAT and difficulty (easy/medium/hard)

Supported Models

SATBench supports evaluation for a wide range of models via the OpenAI, Anthropic, and Together APIs:

openai_gpt-4o
openai_gpt-4o-mini
openai_o4-mini
together_deepseek-ai_DeepSeek-R1
together_deepseek-ai_DeepSeek-V3
anthropic_claude-3-7-sonnet-20250219
together_Qwen_Qwen3-235B-A22B-fp8-tput
together_Qwen_QwQ-32B
together_deepseek-ai_DeepSeek-R1-Distill-Qwen-14B
together_meta-llama_Llama-4-Scout-17B-16E-Instruct
together_meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8
together_meta-llama_Llama-3.3-70B-Instruct-Turbo
together_meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo
together_meta-llama_Meta-Llama-3.1-8B-Instruct-Turbo

Citation

If you use this benchmark in your research, please cite:

@article{wei2025satbench, title={SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas}, author={Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex}, journal={arXiv preprint arXiv:2505.14615}, year={2025} }

Owner

Login: Anjiang-Wei
Kind: user

Repositories: 19
Profile: https://github.com/Anjiang-Wei

GitHub Events

Total

Push event: 1
Create event: 2

Last Year

Push event: 1
Create event: 2

Dependencies

requirements.txt pypi

PyYAML ==6.0.2
Pygments ==2.19.1
aiofiles ==24.1.0
aiohappyeyeballs ==2.6.1
aiohttp ==3.11.16
aiolimiter ==1.2.1
aiosignal ==1.3.2
annotated-types ==0.7.0
anthropic ==0.50.0
anyio ==4.9.0
async-timeout ==5.0.1
attrs ==25.3.0
bidict ==0.22.1
certifi ==2025.1.31
charset-normalizer ==3.4.2
click ==8.1.8
cloudpickle ==3.1.1
contourpy ==1.3.1
cycler ==0.12.1
dask ==2025.4.1
datasets ==3.6.0
dill ==0.3.8
distro ==1.9.0
eval_type_backport ==0.2.2
exceptiongroup ==1.2.2
filelock ==3.18.0
fonttools ==4.57.0
frozenlist ==1.5.0
fsspec ==2025.3.0
funcy ==1.18
h11 ==0.14.0
hf-xet ==1.1.2
httpcore ==1.0.7
httpx ==0.28.1
huggingface-hub ==0.32.1
idna ==3.10
importlib_metadata ==8.7.0
jiter ==0.9.0
kiwisolver ==1.4.8
locket ==1.0.0
markdown-it-py ==3.0.0
matplotlib ==3.10.1
mdurl ==0.1.2
multidict ==6.4.2
multiprocess ==0.70.16
numpy ==2.2.4
nvitop ==1.4.2
openai ==1.72.0
packaging ==24.2
pandas ==2.2.3
partd ==1.4.2
patsy ==1.0.1
pillow ==11.2.1
propcache ==0.3.1
py-aiger ==6.2.3
py-aiger-cnf ==5.0.8
pyapproxmc ==4.1.24
pyarrow ==20.0.0
pycryptosat ==5.11.21
pydantic ==2.11.3
pydantic_core ==2.33.1
pyparsing ==3.2.3
pypblib ==0.0.4
pyrsistent ==0.19.3
python-dateutil ==2.9.0.post0
python-sat ==1.8.dev16
pytz ==2025.2
regex ==2024.11.6
requests ==2.32.3
rich ==14.0.0
scipy ==1.15.2
seaborn ==0.13.2
shellingham ==1.5.4
six ==1.17.0
sniffio ==1.3.1
sortedcontainers ==2.4.0
statsmodels ==0.14.4
tabulate ==0.9.0
tiktoken ==0.9.0
together ==1.5.7
toolz ==1.0.0
tqdm ==4.67.1
triton ==3.3.0
typer ==0.15.3
typing-inspection ==0.4.0
typing_extensions ==4.13.1
tzdata ==2025.2
urllib3 ==2.4.0
xxhash ==3.5.0
yarl ==1.19.0
zipp ==3.21.0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science