https://github.com/anjiang-wei/satbench

https://github.com/anjiang-wei/satbench

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: Anjiang-Wei
  • Language: Python
  • Default Branch: main
  • Size: 24.4 KB
Statistics
  • Stars: 1
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 11 months ago
Metadata Files
Readme

README.md

SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Version arXiv License Python HuggingFace


Setting Up the Environment

1. Clone the repository and enter the directory:

bash git clone https://github.com/Anjiang-Wei/SATBench.git cd SATBench

2. Create and activate a conda environment:

bash conda create -y -n SATBench python=3.12 conda activate SATBench

3. Install dependencies:

bash pip install -r requirements.txt

4. Set API keys:

bash export OPENAI_API_KEY=<your_openai_api_key> export ANTHROPIC_API_KEY=<your_anthropic_api_key> export TOGETHER_API_KEY=<your_together_api_key>

Dataset Generation Pipeline

1. Generate SAT/UNSAT CNF Problems

bash python step1.1_sat_problem_generation.py \ --output sat_problems.json \ --num_per_config 10 \ --max_clause_len 3 \ --sat_ratio 0.5 \ --dimensions "[[3,2],[3,3,2]]" \ --clauses "[4,5,6,7,8,9,10]"

Key Arguments: - --output: Output file for SAT problems - --num_per_config: Number of examples per config - --dimensions: Variable dimensionality - --clauses: Clause counts to vary difficulty

2. Generate Scenario + Variable Mapping

bash python step1.2_scenario_mapping_generation.py \ --input sat_problems.json \ --output scenario_and_mapping.jsonl

Key Arguments: - --input: Input CNF problems - --output: Output JSONL file with generated scenarios and variable meanings

3. Generate Final Puzzle Conditions and Questions

bash python step1.3_puzzle_generation.py \ --input scenario_and_mapping.jsonl \ --output puzzle_problems.jsonl \ --target_per_clause 5 \ --clause_values "[4,5,6]"

Key Arguments: - --input: Scenario and mapping file - --output: Final puzzle output - --target_per_clause: Number of examples to generate per clause count - --clause_values: Which clause numbers to include in the final benchmark

You can also access the released version on Hugging Face:

👉 https://huggingface.co/datasets/LLM4Code/SATBench

Each puzzle includes: - A natural language scenario + variable mapping - A list of logical conditions - A final question - Ground-truth SAT/UNSAT label

Evaluation Pipeline

We provide scripts to evaluate LLMs on SATBench directly from the Hugging Face dataset.

Running Evaluations

bash python step2_evaluation.py \ --mode sat \ --eval_model openai_gpt-4o \ --limit 10

bash python step2_evaluation.py \ --mode trace \ --eval_model openai_gpt-4o

Modes: - --mode sat: Run SAT/UNSAT prediction for each puzzle. - --mode trace: Judge reasoning trace quality (after SAT/UNSAT evaluation).

Arguments: - --limit: Max number of examples to evaluate (optional). - --eval_model: Model used to answer puzzles (e.g., openai_gpt-4o, together_Qwen3-14B).

Statistics Summary

After evaluation, generate per-bucket statistics:

bash python step3_stats.py

Or for specific models:

bash python step3_stats.py --models openai_gpt-4o

This prints accuracy by SAT/UNSAT and difficulty (easy/medium/hard)

Supported Models

SATBench supports evaluation for a wide range of models via the OpenAI, Anthropic, and Together APIs:

  • openai_gpt-4o
  • openai_gpt-4o-mini
  • openai_o4-mini
  • together_deepseek-ai_DeepSeek-R1
  • together_deepseek-ai_DeepSeek-V3
  • anthropic_claude-3-7-sonnet-20250219
  • together_Qwen_Qwen3-235B-A22B-fp8-tput
  • together_Qwen_QwQ-32B
  • together_deepseek-ai_DeepSeek-R1-Distill-Qwen-14B
  • together_meta-llama_Llama-4-Scout-17B-16E-Instruct
  • together_meta-llama_Llama-4-Maverick-17B-128E-Instruct-FP8
  • together_meta-llama_Llama-3.3-70B-Instruct-Turbo
  • together_meta-llama_Meta-Llama-3.1-70B-Instruct-Turbo
  • together_meta-llama_Meta-Llama-3.1-8B-Instruct-Turbo

Citation

If you use this benchmark in your research, please cite:

@article{wei2025satbench, title={SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas}, author={Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex}, journal={arXiv preprint arXiv:2505.14615}, year={2025} }

Owner

  • Login: Anjiang-Wei
  • Kind: user

GitHub Events

Total
  • Push event: 1
  • Create event: 2
Last Year
  • Push event: 1
  • Create event: 2

Dependencies

requirements.txt pypi
  • PyYAML ==6.0.2
  • Pygments ==2.19.1
  • aiofiles ==24.1.0
  • aiohappyeyeballs ==2.6.1
  • aiohttp ==3.11.16
  • aiolimiter ==1.2.1
  • aiosignal ==1.3.2
  • annotated-types ==0.7.0
  • anthropic ==0.50.0
  • anyio ==4.9.0
  • async-timeout ==5.0.1
  • attrs ==25.3.0
  • bidict ==0.22.1
  • certifi ==2025.1.31
  • charset-normalizer ==3.4.2
  • click ==8.1.8
  • cloudpickle ==3.1.1
  • contourpy ==1.3.1
  • cycler ==0.12.1
  • dask ==2025.4.1
  • datasets ==3.6.0
  • dill ==0.3.8
  • distro ==1.9.0
  • eval_type_backport ==0.2.2
  • exceptiongroup ==1.2.2
  • filelock ==3.18.0
  • fonttools ==4.57.0
  • frozenlist ==1.5.0
  • fsspec ==2025.3.0
  • funcy ==1.18
  • h11 ==0.14.0
  • hf-xet ==1.1.2
  • httpcore ==1.0.7
  • httpx ==0.28.1
  • huggingface-hub ==0.32.1
  • idna ==3.10
  • importlib_metadata ==8.7.0
  • jiter ==0.9.0
  • kiwisolver ==1.4.8
  • locket ==1.0.0
  • markdown-it-py ==3.0.0
  • matplotlib ==3.10.1
  • mdurl ==0.1.2
  • multidict ==6.4.2
  • multiprocess ==0.70.16
  • numpy ==2.2.4
  • nvitop ==1.4.2
  • openai ==1.72.0
  • packaging ==24.2
  • pandas ==2.2.3
  • partd ==1.4.2
  • patsy ==1.0.1
  • pillow ==11.2.1
  • propcache ==0.3.1
  • py-aiger ==6.2.3
  • py-aiger-cnf ==5.0.8
  • pyapproxmc ==4.1.24
  • pyarrow ==20.0.0
  • pycryptosat ==5.11.21
  • pydantic ==2.11.3
  • pydantic_core ==2.33.1
  • pyparsing ==3.2.3
  • pypblib ==0.0.4
  • pyrsistent ==0.19.3
  • python-dateutil ==2.9.0.post0
  • python-sat ==1.8.dev16
  • pytz ==2025.2
  • regex ==2024.11.6
  • requests ==2.32.3
  • rich ==14.0.0
  • scipy ==1.15.2
  • seaborn ==0.13.2
  • shellingham ==1.5.4
  • six ==1.17.0
  • sniffio ==1.3.1
  • sortedcontainers ==2.4.0
  • statsmodels ==0.14.4
  • tabulate ==0.9.0
  • tiktoken ==0.9.0
  • together ==1.5.7
  • toolz ==1.0.0
  • tqdm ==4.67.1
  • triton ==3.3.0
  • typer ==0.15.3
  • typing-inspection ==0.4.0
  • typing_extensions ==4.13.1
  • tzdata ==2025.2
  • urllib3 ==2.4.0
  • xxhash ==3.5.0
  • yarl ==1.19.0
  • zipp ==3.21.0