s1: Simple test-time scaling

https://github.com/simplescaling/s1

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.8%) to scientific vocabulary
Last synced: 6 months ago

Repository

s1: Simple test-time scaling

Basic Info
Statistics
  • Stars: 6,375
  • Watchers: 56
  • Forks: 746
  • Open Issues: 65
  • Releases: 0
Created about 1 year ago · Last pushed 9 months ago
Metadata Files
Readme · License · Citation

README.md

s1: Simple test-time scaling

Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing



This repository provides an overview of all resources for the paper "s1: Simple test-time scaling".

Artifacts

  • Paper: https://arxiv.org/abs/2501.19393
  • Model: https://hf.co/simplescaling/s1.1-32B (Old: https://hf.co/simplescaling/s1-32B)
  • Data: https://hf.co/datasets/simplescaling/s1K-1.1 (Old: https://hf.co/datasets/simplescaling/s1K)
    • s1-prob: https://hf.co/datasets/simplescaling/s1-prob
    • s1-teasers: https://hf.co/datasets/simplescaling/s1-teasers
  • Full 59K: https://hf.co/datasets/simplescaling/data_ablation_full59K

Structure

  • eval/: Evaluation scripts
  • data/: Synthetic data creation scripts & co
  • train/: Training scripts

Inference

vLLM

Install the `vllm` library and run:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = LLM(
    "simplescaling/s1.1-32B",
    tensor_parallel_size=2,
)
tok = AutoTokenizer.from_pretrained("simplescaling/s1-32B")

stop_token_ids = tok("<|im_end|>")["input_ids"]

sampling_params = SamplingParams(
    max_tokens=32768,
    min_tokens=0,
    stop_token_ids=stop_token_ids,
)

prompt = "How many r in raspberry"
prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n" + prompt + "<|im_end|>\n<|im_start|>assistant\n"

o = model.generate(prompt, sampling_params=sampling_params)
print(o[0].outputs[0].text)
```
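Note that the prompt spells out Qwen's ChatML template (`<|im_start|>`/`<|im_end|>`) by hand, since s1 is finetuned from Qwen2.5-32B-Instruct; `tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` should produce the same format.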

vLLM with budget forcing

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Decide on a token limit for thinking; as the model's max tokens is 32768,
# 32000 usually leaves enough space for the model to still answer
MAX_TOKENS_THINKING = 32000
# Decide how often to ignore the end-of-thinking token
NUM_IGNORE = 1

model = LLM(
    "simplescaling/s1-32B",  # s1 originally gets this prompt wrong, but budget forcing fixes it
    tensor_parallel_size=2,
)
tok = AutoTokenizer.from_pretrained("simplescaling/s1-32B")

stop_token_ids = tok("<|im_end|>")["input_ids"]
sampling_params = SamplingParams(
    max_tokens=32768,
    min_tokens=0,
    stop_token_ids=stop_token_ids,
    skip_special_tokens=False,
    temperature=0.0,
)

# For the exact raspberry sample in the paper see
prompts = [
    "How many r in raspberry",
]

for i, p in enumerate(prompts):
    prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n" + p + "<|im_end|>\n<|im_start|>assistant\n"
    stop_token_ids = tok("<|im_start|><|im_end|>")["input_ids"]
    sampling_params = SamplingParams(
        max_tokens=MAX_TOKENS_THINKING,
        min_tokens=0,
        stop_token_ids=stop_token_ids,
        skip_special_tokens=False,
        temperature=0.0,
    )
    prompt += "<|im_start|>think"
    o = model.generate(prompt, sampling_params=sampling_params)
    ignore_str = "Wait"
    max_tokens_thinking_tmp = MAX_TOKENS_THINKING
    for i in range(NUM_IGNORE):  # Number of times to skip the stop token
        max_tokens_thinking_tmp -= len(o[0].outputs[0].token_ids)
        if max_tokens_thinking_tmp > 0:
            prompt += o[0].outputs[0].text + ignore_str
            sampling_params = SamplingParams(
                max_tokens=max_tokens_thinking_tmp,
                min_tokens=1,
                stop_token_ids=stop_token_ids,
                skip_special_tokens=False,
                temperature=0.0,
            )
            o = model.generate(prompt, sampling_params=sampling_params)
    ### Final answer ###
    # You can also append "Final Answer:" here, like we do for some evaluations,
    # to prevent the model from just continuing to reason in its answer when early exiting
    prompt += o[0].outputs[0].text
    stop_token_ids = tok("<|im_end|>")["input_ids"]
    sampling_params = SamplingParams(
        max_tokens=32768,
        min_tokens=0,
        stop_token_ids=stop_token_ids,
        skip_special_tokens=False,
        temperature=0.0,
    )
    o = model.generate(prompt, sampling_params=sampling_params)
    print("With budget forcing:")  # After the "Wait" in the reasoning trace, the model fixes its answer
    print(prompt + o[0].outputs[0].text)
```
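Budget forcing as implemented above is two decoding-time interventions: it caps thinking at MAX_TOKENS_THINKING tokens (early exit), and it suppresses the end-of-thinking delimiter NUM_IGNORE times by appending "Wait" to the trace, nudging the model to keep reasoning before the final-answer generation pass.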

transformers

Install the transformers & torch libraries and run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "simplescaling/s1.1-32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in raspberry"
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
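Note that `max_new_tokens=512` will truncate long reasoning traces; the vLLM examples above allow up to 32768 tokens, so raise it accordingly if you want full generations, then `print(response)` to inspect the answer.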

Training

To run training, use our script at train/sft.py, which you can invoke via one of the train/sft*.sh scripts. These in turn can be launched via train/launch.sh if you are on a SLURM cluster (this requires editing the file for your cluster setup).

To train s1-32B/s1.1-32B, we recommend 16 H100 GPUs, i.e. 2 nodes with 8 GPUs each. For s1.1, we set the block size to 20000 to avoid OOM (https://github.com/simplescaling/s1/blob/0ad4b3de32507b4aa0d4be28f336276ee99b2315/train/sft.sh#L17); check the wandb logs here.

Quick start:

```bash
git clone https://github.com/simplescaling/s1.git
cd s1
pip3 install -r requirements.txt
bash train/sft.sh
```

Note: If you encounter an out-of-memory (OOM) issue with 8 GPUs, consider enabling gradient checkpointing by adding the following flag to your script: `--gradient_checkpointing=True`.
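For illustration, here is a sketch of where those settings live, assuming train/sft.sh wraps a torchrun invocation of train/sft.py (the exact command line in the script may differ; check the file linked above):

```bash
# Hypothetical excerpt of train/sft.sh -- the torchrun wrapper is an assumption;
# only the --block_size and --gradient_checkpointing values come from the notes above.
torchrun --nproc_per_node=8 train/sft.py \
    --block_size=20000 \
    --gradient_checkpointing=True
```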

Evaluation

We cloned lm-evaluation-harness at commit 4cec66e4e468d15789473d6d63c3a61a751fa524 and modified it. Setup:

```bash
cd eval/lm-evaluation-harness
pip install -e .[math,vllm]
```

All commands are in eval/commands.sh. For AIME24 we always pick the aime24_nofigures result, which uses a dataset that only contains the AIME24 figures if they are important for the task.

If you want to compute statistics (average thinking tokens etc.) for an evaluation run, you can use `python eval/compute_sample_stats.py path_to_samples_file.jsonl`.

All our evaluation result files are at: https://hf.co/datasets/simplescaling/results

To run REBASE: commands are in eval/rebase/run.sh. Note that for the evaluations in the Discussion section with REBASE, we used https://huggingface.co/simplescaling/step-conditional-control-old, trained on an older version of our dataset (https://huggingface.co/datasets/simplescaling/s1K-step-conditional-control-old), and ran on an older version of our evaluation using https://huggingface.co/datasets/Maxwell-Jia/AIME_2024.

Data

To recreate s1K follow the steps below. In various files you will have to replace the organizations simplescaling and qfq with an organization that you own. Note that s1K-1.1 is a better dataset generated with r1 traces instead of Gemini traces.

1. Run data/collect_data.py followed by data/fix_gpqa.py & data/add_aime.py to collect the questions; make sure to change the hub path in the respective files to one of your own.
2. Generate traces with Gemini via python data/gemini.py. This step uses https://hf.co/datasets/qfq/train, which should be roughly equivalent to the dataset you have produced in step 1.
3. Generate answers with Qwen via python data/bulk_inference.py, which can be launched with data/bulk_inference.sh.
4. Add features by running python data/featurization.py.
5. Run final filtering by going through data/filter.ipynb.
6. If you want to run grading on the final questions to produce e.g. a gemini_grade column as in this dataset, you can use data/grading.ipynb.

Visuals

All figures and some tables are created via a Colab notebook equivalent to visuals/visuals.ipynb. Some are subsequently edited via the visuals/s1.fig file, which you can load in Figma. The output figures are in visuals/ in PDF or PNG format.

Known Issues

  • vLLM throws ValueError: Token id XXXXX is out of vocabulary
    • This can happen with budget forcing, especially when running with temperature 1: the model will sometimes predict a vocab id that is larger than its max token id but still within its embedding size, i.e. anything <152064 but >151664. When we refeed the model's previous outputs to it, which happens when setting e.g. max_thinking_tokens in the evaluation, this triggers the error because vLLM runs this check even though it would only be an issue for IDs >152064. To fix it you can simply comment out the vLLM ValueError (it is the line if max_input_id > tokenizer.max_token_id: in vllm/engine/llm_engine.py); see the sketch below.
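For illustration, a sketch of that workaround, assuming the vLLM version pinned in requirements.txt (0.6.4.post1); the surrounding code is paraphrased, and only the quoted condition comes from the note above:

```python
# vllm/engine/llm_engine.py -- neutralize the vocabulary check by commenting it out.
# Before (paraphrased; the exact error message may differ):
#
#     if max_input_id > tokenizer.max_token_id:
#         raise ValueError(f"Token id {max_input_id} is out of vocabulary")
#
# After: the check is skipped, since IDs in (151664, 152064) are within the
# embedding size and only IDs > 152064 would actually be out of range here.
```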

Citation

```bibtex
@misc{muennighoff2025s1simpletesttimescaling,
      title={s1: Simple test-time scaling},
      author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},
      year={2025},
      eprint={2501.19393},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.19393},
}
```

Owner

  • Name: simplescaling
  • Login: simplescaling
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
abstract: "Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing"
repository-code: "https://github.com/simplescaling/s1"
authors:
  - family-names: "Muennighoff"
    given-names: "Niklas"
  - family-names: "Yang"
    given-names: "Zitong"
  - family-names: "Shi"
    given-names: "Weijia"
  - family-names: "Li"
    given-names: "Xiang Lisa"
  - family-names: "Fei-Fei"
    given-names: "Li"
  - family-names: "Hajishirzi"
    given-names: "Hannaneh"
  - family-names: "Zettlemoyer"
    given-names: "Luke"
  - family-names: "Liang"
    given-names: "Percy"
  - family-names: "Candès"
    given-names: "Emmanuel"
  - family-names: "Hashimoto"
    given-names: "Tatsunori"
title: "s1: Simple test-time scaling"
type: software
keywords:
  - "test-time scaling"
  - "language models"
  - "reasoning"
  - "budget forcing"
references:
  - type: software
    title: "s1-32B"
    url: "https://hf.co/simplescaling/s1-32B"
  - type: dataset
    title: "s1K"
    url: "https://hf.co/datasets/simplescaling/s1K"
  - type: dataset
    title: "s1-prob"
    url: "https://hf.co/datasets/simplescaling/s1-prob"
  - type: dataset
    title: "s1-teasers"
    url: "https://hf.co/datasets/simplescaling/s1-teasers"
  - type: dataset
    title: "data_ablation_full59K"
    url: "https://hf.co/datasets/simplescaling/data_ablation_full59K"
date-released: "2025"
url: "https://arxiv.org/abs/2501.19393"
identifiers:
  - type: arxiv
    value: 2501.19393
    description: The ArXiv preprint
preferred-citation:
  type: misc
  authors:
    - family-names: "Muennighoff"
      given-names: "Niklas"
    - family-names: "Yang"
      given-names: "Zitong"
    - family-names: "Shi"
      given-names: "Weijia"
    - family-names: "Li"
      given-names: "Xiang Lisa"
    - family-names: "Fei-Fei"
      given-names: "Li"
    - family-names: "Hajishirzi"
      given-names: "Hannaneh"
    - family-names: "Zettlemoyer"
      given-names: "Luke"
    - family-names: "Liang"
      given-names: "Percy"
    - family-names: "Candès"
      given-names: "Emmanuel"
    - family-names: "Hashimoto"
      given-names: "Tatsunori"
  title: "s1: Simple test-time scaling"
  year: 2025
  url: "https://arxiv.org/abs/2501.19393"
  arxiv: "2501.19393"

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 94
  • Total pull requests: 14
  • Average time to close issues: about 21 hours
  • Average time to close pull requests: about 11 hours
  • Total issue authors: 74
  • Total pull request authors: 14
  • Average comments per issue: 0.61
  • Average comments per pull request: 0.29
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 94
  • Pull requests: 14
  • Average time to close issues: about 21 hours
  • Average time to close pull requests: about 11 hours
  • Issue authors: 74
  • Pull request authors: 14
  • Average comments per issue: 0.61
  • Average comments per pull request: 0.29
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • agoyang (8)
  • JH-ninjatech (4)
  • tonychenxyz (3)
  • wanghanbinpanda (2)
  • Aradhye2002 (2)
  • Zzzsjj (2)
  • chen1325-star (2)
  • kyZhao-1 (2)
  • JaydencoolCC (2)
  • huangtiansheng (2)
  • jd730 (2)
  • laolaorkkkkk (1)
  • neelbhandari6 (1)
  • SusMaria (1)
  • guozihang (1)
Pull Request Authors
  • gamble369 (1)
  • rrostt (1)
  • freakynit (1)
  • kiwamizamurai (1)
  • nonaghazizadeh (1)
  • eltociear (1)
  • Henry-Jessie (1)
  • invictus717 (1)
  • test1githu (1)
  • bigsnarfdude (1)
  • SamYuan1990 (1)
  • tpoisonooo (1)
  • kothasuhas (1)

Dependencies

eval/lm-evaluation-harness/pyproject.toml pypi
  • accelerate >=0.26.0
  • datasets >=2.16.0
  • dill *
  • evaluate >=0.4.0
  • evaluate *
  • jsonlines *
  • more_itertools *
  • numexpr *
  • peft >=0.2.0
  • pybind11 >=2.6.2
  • pytablewriter *
  • rouge-score >=0.0.4
  • sacrebleu >=1.5.0
  • scikit-learn >=0.24.1
  • sqlitedict *
  • torch >=1.8
  • tqdm-multiprocess *
  • transformers >=4.1
  • word2number *
  • zstandard *
eval/lm-evaluation-harness/requirements.txt pypi
eval/lm-evaluation-harness/setup.py pypi
requirements.txt pypi
  • accelerate ==1.0.1
  • datasets ==3.1.0
  • gradio ==4.44.0
  • ipykernel ==6.28.0
  • ipython ==8.20.0
  • openai ==1.56.1
  • torch ==2.5.1
  • torchaudio ==2.5.1
  • torchvision ==0.20.1
  • transformers ==4.46.1
  • trl ==0.12.0
  • vllm ==0.6.4.post1
  • wandb ==0.17.3
eval/rebase/sglang/examples/usage/triton/Dockerfile docker
  • nvcr.io/nvidia/tritonserver 24.01-py3 build
eval/rebase/inference_scaling/finetune/gpt-accelera/requirements.txt pypi
  • sentencepiece *
  • torch *
eval/rebase/inference_scaling/finetune/gpt-accelera/setup.py pypi
  • torch *
eval/rebase/sglang/python/pyproject.toml pypi
  • requests *
eval/rebase/sglang/python/sglang.egg-info/requires.txt pypi
  • aiohttp *
  • anthropic >=0.20.0
  • cloudpickle *
  • diskcache *
  • fastapi *
  • interegular *
  • lark *
  • numba *
  • numpy *
  • openai >=1.0
  • outlines >=0.0.27
  • pebble *
  • pillow *
  • psutil *
  • pydantic *
  • referencing *
  • requests *
  • rpyc *
  • sglang *
  • timeout-decorator *
  • torch *
  • uvicorn *
  • uvloop *
  • vllm ==0.3.3
  • zmq *