https://github.com/anjiang-wei/satbench
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: arxiv.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (9.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: Anjiang-Wei
- Language: Python
- Default Branch: main
- Size: 24.4 KB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
Setting Up the Environment
1. Clone the repository and enter the directory:
bash
git clone https://github.com/Anjiang-Wei/SATBench.git
cd SATBench
2. Create and activate a conda environment:
bash
conda create -y -n SATBench python=3.12
conda activate SATBench
3. Install dependencies:
bash
pip install -r requirements.txt
4. Set API keys:
bash
export OPENAI_API_KEY=<your_openai_api_key>
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
export TOGETHER_API_KEY=<your_together_api_key>
Dataset Generation Pipeline
1. Generate SAT/UNSAT CNF Problems
bash
python step1.1_sat_problem_generation.py \
--output sat_problems.json \
--num_per_config 10 \
--max_clause_len 3 \
--sat_ratio 0.5 \
--dimensions "[[3,2],[3,3,2]]" \
--clauses "[4,5,6,7,8,9,10]"
Key Arguments:
- --output: Output file for SAT problems
- --num_per_config: Number of examples per config
- --dimensions: Variable dimensionality
- --clauses: Clause counts to vary difficulty
2. Generate Scenario + Variable Mapping
bash
python step1.2_scenario_mapping_generation.py \
--input sat_problems.json \
--output scenario_and_mapping.jsonl
Key Arguments:
- --input: Input CNF problems
- --output: Output JSONL file with generated scenarios and variable meanings
3. Generate Final Puzzle Conditions and Questions
bash
python step1.3_puzzle_generation.py \
--input scenario_and_mapping.jsonl \
--output puzzle_problems.jsonl \
--target_per_clause 5 \
--clause_values "[4,5,6]"
Key Arguments:
- --input: Scenario and mapping file
- --output: Final puzzle output
- --target_per_clause: Number of examples to generate per clause count
- --clause_values: Which clause numbers to include in the final benchmark
You can also access the released version on Hugging Face:
👉 https://huggingface.co/datasets/LLM4Code/SATBench
Each puzzle includes: - A natural language scenario + variable mapping - A list of logical conditions - A final question - Ground-truth SAT/UNSAT label
Evaluation Pipeline
We provide scripts to evaluate LLMs on SATBench directly from the Hugging Face dataset.
Running Evaluations
bash
python step2_evaluation.py \
--mode sat \
--eval_model openai_gpt-4o \
--limit 10
bash
python step2_evaluation.py \
--mode trace \
--eval_model openai_gpt-4o
Modes:
- --mode sat: Run SAT/UNSAT prediction for each puzzle.
- --mode trace: Judge reasoning trace quality (after SAT/UNSAT evaluation).
Arguments:
- --limit: Max number of examples to evaluate (optional).
- --eval_model: Model used to answer puzzles (e.g., openai_gpt-4o, together_Qwen3-14B).
Statistics Summary
After evaluation, generate per-bucket statistics:
bash
python step3_stats.py
Or for specific models:
bash
python step3_stats.py --models openai_gpt-4o
This prints accuracy by SAT/UNSAT and difficulty (easy/medium/hard)
Supported Models
SATBench supports evaluation for a wide range of models via the OpenAI, Anthropic, and Together APIs:
openai_gpt-4oopenai_gpt-4o-miniopenai_o4-minitogether_deepseek-ai_DeepSeek-R1together_deepseek-ai_DeepSeek-V3anthropic_claude-3-7-sonnet-20250219together_Qwen_Qwen3-235B-A22B-fp8-tputtogether_Qwen_QwQ-32Btogether_deepseek-ai_DeepSeek-R1-Distill-Qwen-14Btogether_meta-llama_Llama-4-Scout-17B-16E-Instructtogether_meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8together_meta-llama_Llama-3.3-70B-Instruct-Turbotogether_meta-llama_Meta-Llama-3.1-70B-Instruct-Turbotogether_meta-llama_Meta-Llama-3.1-8B-Instruct-Turbo
Citation
If you use this benchmark in your research, please cite:
@article{wei2025satbench,
title={SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas},
author={Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex},
journal={arXiv preprint arXiv:2505.14615},
year={2025}
}
Owner
- Login: Anjiang-Wei
- Kind: user
- Repositories: 19
- Profile: https://github.com/Anjiang-Wei
GitHub Events
Total
- Push event: 1
- Create event: 2
Last Year
- Push event: 1
- Create event: 2
Dependencies
- PyYAML ==6.0.2
- Pygments ==2.19.1
- aiofiles ==24.1.0
- aiohappyeyeballs ==2.6.1
- aiohttp ==3.11.16
- aiolimiter ==1.2.1
- aiosignal ==1.3.2
- annotated-types ==0.7.0
- anthropic ==0.50.0
- anyio ==4.9.0
- async-timeout ==5.0.1
- attrs ==25.3.0
- bidict ==0.22.1
- certifi ==2025.1.31
- charset-normalizer ==3.4.2
- click ==8.1.8
- cloudpickle ==3.1.1
- contourpy ==1.3.1
- cycler ==0.12.1
- dask ==2025.4.1
- datasets ==3.6.0
- dill ==0.3.8
- distro ==1.9.0
- eval_type_backport ==0.2.2
- exceptiongroup ==1.2.2
- filelock ==3.18.0
- fonttools ==4.57.0
- frozenlist ==1.5.0
- fsspec ==2025.3.0
- funcy ==1.18
- h11 ==0.14.0
- hf-xet ==1.1.2
- httpcore ==1.0.7
- httpx ==0.28.1
- huggingface-hub ==0.32.1
- idna ==3.10
- importlib_metadata ==8.7.0
- jiter ==0.9.0
- kiwisolver ==1.4.8
- locket ==1.0.0
- markdown-it-py ==3.0.0
- matplotlib ==3.10.1
- mdurl ==0.1.2
- multidict ==6.4.2
- multiprocess ==0.70.16
- numpy ==2.2.4
- nvitop ==1.4.2
- openai ==1.72.0
- packaging ==24.2
- pandas ==2.2.3
- partd ==1.4.2
- patsy ==1.0.1
- pillow ==11.2.1
- propcache ==0.3.1
- py-aiger ==6.2.3
- py-aiger-cnf ==5.0.8
- pyapproxmc ==4.1.24
- pyarrow ==20.0.0
- pycryptosat ==5.11.21
- pydantic ==2.11.3
- pydantic_core ==2.33.1
- pyparsing ==3.2.3
- pypblib ==0.0.4
- pyrsistent ==0.19.3
- python-dateutil ==2.9.0.post0
- python-sat ==1.8.dev16
- pytz ==2025.2
- regex ==2024.11.6
- requests ==2.32.3
- rich ==14.0.0
- scipy ==1.15.2
- seaborn ==0.13.2
- shellingham ==1.5.4
- six ==1.17.0
- sniffio ==1.3.1
- sortedcontainers ==2.4.0
- statsmodels ==0.14.4
- tabulate ==0.9.0
- tiktoken ==0.9.0
- together ==1.5.7
- toolz ==1.0.0
- tqdm ==4.67.1
- triton ==3.3.0
- typer ==0.15.3
- typing-inspection ==0.4.0
- typing_extensions ==4.13.1
- tzdata ==2025.2
- urllib3 ==2.4.0
- xxhash ==3.5.0
- yarl ==1.19.0
- zipp ==3.21.0