train_simplescaling-s1k-1.1
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: zhangbo2008
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 6.33 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
1. Another new project: https://docs.unsloth.ai/
When deciding between Retrieval-Augmented Generation (RAG) and fine-tuning, it's important to note that fine-tuning can replicate RAG's functionality, but not vice versa. Instead of choosing one or the other, we recommend using both, which greatly increases accuracy and usability and reduces hallucinations.
So the best approach appears to be fine-tuning plus RAG.
The parameters have been changed so that the code also adapts automatically to CPU. Just run 2.py; no manual changes are needed.
Step 1: run data/fix_gpqa.py, data/add_aime.py, data/collect_data.py, data/gemini.py, data/bulk_inference.py, data/featurization.py, and data/filter.ipynb. This produces the dataset ("qfq/s1Ktokenized"), which has also been uploaded to Hugging Face. Better options are simplescaling/s1K-1.1 and simplescaling/s1K-1.1_tokenized; the latter adds the tokenized prompt.
Differences between simplescaling/s1K-1.1_tokenized and simplescaling/s1K-1.1:
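Fields of the 1.1 dataset:

| Field | Type | Lengths / values |
| --- | --- | --- |
| solution | string (lengths) | 16.9k |
| question | string (lengths) | 402.42k |
| cot_type | string (classes) | 3 values |
| source_type | string (classes) | 34 values |
| metadata | string (lengths) | 219.4k |
| gemini_thinking_trajectory | string (lengths) | 1.47k to 25.4k |
| gemini_attempt | string (lengths) | 99.82k |
| deepseek_thinking_trajectory | string (lengths) | 2.33k to 80.3k |
| deepseek_attempt | string (lengths) | 2.33k to 80.3k |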
Fields of the 1.1 tokenized dataset: the same fields as above plus one additional text field. The data we ultimately use is therefore https://huggingface.co/datasets/simplescaling/s1K-1.1_tokenized
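The relationship between the two datasets can be checked with a small sketch like the one below. It assumes the Hugging Face datasets library is installed and that both repositories expose a split named train (an assumption, not something stated above). The helper names load_columns and extra_columns are hypothetical, and the column list is copied from the field list above.

```python
def load_columns(repo_id, split="train"):
    """Return the column names of a dataset on the Hugging Face Hub.

    Assumption: the `datasets` library is installed and the repo has a
    split called `split`. The import is lazy so that the pure helper
    below also works without the library or network access.
    """
    from datasets import load_dataset
    return load_dataset(repo_id, split=split).column_names


def extra_columns(base_columns, other_columns):
    """Columns that appear in `other_columns` but not in `base_columns`."""
    base = set(base_columns)
    return [name for name in other_columns if name not in base]


# Field names as listed above for simplescaling/s1K-1.1.
S1K_11_COLUMNS = [
    "solution", "question", "cot_type", "source_type", "metadata",
    "gemini_thinking_trajectory", "gemini_attempt",
    "deepseek_thinking_trajectory", "deepseek_attempt",
]

# Offline check: the tokenized variant is described as "same fields plus text".
print(extra_columns(S1K_11_COLUMNS, S1K_11_COLUMNS + ["text"]))
```

To compare the live datasets instead, one could call extra_columns(load_columns("simplescaling/s1K-1.1"), load_columns("simplescaling/s1K-1.1_tokenized")), which needs network access.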
Notes on some mathematical symbols.
Three Unicode space characters: \u00A0 (non-breaking space), \u0020 (ordinary space), and \u3000 (ideographic full-width space). Source: a CSDN blog post (blog.csdn.net) dated 2020-06-24.
{"question": "Let $\mathbb{R}$ be the set of real numbers . Determine all functions $f \u00a0:\mathbb{R} \rightarrow \mathbb{R}$ such that\n \nfor all pairs of real numbers $x$ and $y$ ."}
{"question": "Find the sum of the ages of everyone who wrote a problem for this year's HMMT November contest. If your answer is $X$ and the actual value is $Y$, your score will be $\max (0,20-|X-Y|)$"}
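As the first sample above shows, a question can contain a non-breaking space (\u00a0) inside a formula. The following is a minimal sketch of how such characters could be mapped to an ordinary space before further processing. Whether the s1 pipeline performs any such normalization is not stated here, and normalize_spaces is a hypothetical helper name.

```python
# Map the two non-ASCII space characters mentioned above to a plain space.
SPACE_MAP = str.maketrans({"\u00a0": " ", "\u3000": " "})


def normalize_spaces(text: str) -> str:
    """Replace non-breaking and ideographic spaces with U+0020."""
    return text.translate(SPACE_MAP)


# Fragment of the first sample question above.
sample = "Determine all functions $f \u00a0:\\mathbb{R} \\rightarrow \\mathbb{R}$"
cleaned = normalize_spaces(sample)

print("\u00a0" in sample)   # the raw text contains a non-breaking space
print("\u00a0" in cleaned)  # the cleaned text does not
```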
s1: Simple test-time scaling
Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing

Updates:
- 2025-02: We released s1.1, a better model than s1, by reusing the same s1K questions but with reasoning traces generated by r1 instead of Gemini: s1K-1.1. Check this tweet for details.
- 2025-01: We released our paper, announced via this tweet.
This repository provides an overview of all resources for the paper "s1: Simple test-time scaling".
Artifacts
- Paper: https://arxiv.org/abs/2501.19393
- Model: https://hf.co/simplescaling/s1-32B
- Data: https://hf.co/datasets/simplescaling/s1K
- s1-prob: https://hf.co/datasets/simplescaling/s1-prob
- s1-teasers: https://hf.co/datasets/simplescaling/s1-teasers
- Full 59K: https://hf.co/datasets/simplescaling/data_ablation_full59K
Structure
- eval/: Evaluation scripts
- data/: Synthetic data creation scripts & co
- train/: Training scripts
Inference
vLLM
A plain LLM call.
Install the vllm library and run:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model = LLM(
    "simplescaling/s1.1-32B",
    tensor_parallel_size=2,
)
tok = AutoTokenizer.from_pretrained("simplescaling/s1-32B")

stop_token_ids = tok("<|im_end|>")["input_ids"]

sampling_params = SamplingParams(
    max_tokens=32768,
    min_tokens=0,
    stop_token_ids=stop_token_ids,
)

prompt = "How many r in raspberry"
prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n" + prompt + "<|im_end|>\n<|im_start|>assistant\n"

o = model.generate(prompt, sampling_params=sampling_params)
print(o[0].outputs[0].text)
```
vLLM with budget forcing
A call with budget forcing.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Decide on a token limit for thinking. As the model's max tokens is 32768,
# 32000 usually ensures there is enough space for the model to still answer.
MAX_TOKENS_THINKING = 32000
# Decide how often to ignore the end-of-thinking token
NUM_IGNORE = 1

model = LLM(
    "simplescaling/s1-32B",  # s1 originally gets this prompt wrong but with budget forcing it fixes it
    tensor_parallel_size=2,
)
tok = AutoTokenizer.from_pretrained("simplescaling/s1-32B")

stop_token_ids = tok("<|im_end|>")["input_ids"]  # token ids of the end marker
sampling_params = SamplingParams(
    max_tokens=32768,
    min_tokens=0,
    stop_token_ids=stop_token_ids,
    skip_special_tokens=False,
    temperature=0.0,
)

# For the exact raspberry sample in the paper see
prompts = [
    "How many r in raspberry",
]

for i, p in enumerate(prompts):
    prompt = "<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n" + p + "<|im_end|>\n<|im_start|>assistant\n"
    stop_token_ids = tok("<|im_start|><|im_end|>")["input_ids"]
    sampling_params = SamplingParams(
        max_tokens=MAX_TOKENS_THINKING,
        min_tokens=0,
        stop_token_ids=stop_token_ids,
        skip_special_tokens=False,
        temperature=0.0,
    )
    prompt += "<|im_start|>think"
    # First pass: let the model think for up to MAX_TOKENS_THINKING tokens.
    o = model.generate(prompt, sampling_params=sampling_params)
    ignore_str = "Wait"
    max_tokens_thinking_tmp = MAX_TOKENS_THINKING
    # NUM_IGNORE is the number of times to skip the stop token.
    for _ in range(NUM_IGNORE):
        # Work out how many thinking tokens are left.
        max_tokens_thinking_tmp -= len(o[0].outputs[0].token_ids)
        if max_tokens_thinking_tmp < 1:
            break  # thinking budget exhausted; vLLM needs max_tokens >= 1
        # The next prompt is the previous prompt plus the previous output
        # plus ignore_str, i.e. the string "Wait".
        prompt += o[0].outputs[0].text + ignore_str
        # Sampling parameters for the follow-up generation.
        sampling_params = SamplingParams(
            max_tokens=max_tokens_thinking_tmp,
            min_tokens=1,
            stop_token_ids=stop_token_ids,
            skip_special_tokens=False,
            temperature=0.0,
        )
        o = model.generate(prompt, sampling_params=sampling_params)

    ### Final answer ###
    # You can also append "Final Answer:" here like we do for some evaluations
    # to prevent the model from just continuing to reason in its answer when
    # early exiting.
    prompt += o[0].outputs[0].text
    stop_token_ids = tok("<|im_end|>")["input_ids"]
    sampling_params = SamplingParams(
        max_tokens=32768,
        min_tokens=0,
        stop_token_ids=stop_token_ids,
        skip_special_tokens=False,
        temperature=0.0,
    )
    # Generate once more to produce the answer.
    o = model.generate(prompt, sampling_params=sampling_params)
    print("With budget forcing:")  # After the "Wait" in the reasoning trace the model fixes its answer
    print(prompt + o[0].outputs[0].text)
```
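The control flow of the budget-forcing loop above can be isolated from vLLM. The sketch below is a model-free illustration under stated assumptions: generate is a hypothetical callable taking (prompt, max_tokens) and returning (text, tokens_used), fake_generate is a toy stand-in for a model with made-up token counts, and the early exit when the budget is used up is an added safeguard.

```python
def budget_forced_thinking(generate, prompt, max_thinking_tokens,
                           num_ignore, ignore_str="Wait"):
    """Extend the thinking phase by appending `ignore_str` after each stop.

    `generate(prompt, max_tokens)` is a hypothetical callable that returns
    a tuple (generated_text, number_of_tokens_used).
    """
    remaining = max_thinking_tokens
    text, used = generate(prompt, remaining)
    for _ in range(num_ignore):
        remaining -= used            # tokens still available for thinking
        if remaining < 1:
            break                    # budget exhausted: stop forcing
        prompt += text + ignore_str  # suppress the stop and nudge the model on
        text, used = generate(prompt, remaining)
    return prompt + text             # prompt plus the full thinking trace


def fake_generate(prompt, max_tokens):
    """Toy model: answers wrongly first, then corrects itself after 'Wait'."""
    if prompt.endswith("Wait"):
        return ", let me recount: r-a-s-p-b-e-r-r-y has 3 r.", 12
    return " There are 2 r.", 5


out = budget_forced_thinking(fake_generate, "think:",
                             max_thinking_tokens=100, num_ignore=1)
print(out)
```

With num_ignore=0 the function returns after the first generation, which corresponds to running without budget forcing.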
transformers
Install the transformers & torch libraries and run:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "simplescaling/s1.1-32B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in raspberry"
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Drop the prompt tokens so that only the newly generated tokens remain.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
Training
To run training, you can find our script at train/sft.py which you can invoke via one of the train/sft*sh scripts which in turn you can launch via train/launch.sh if you are on a SLURM cluster (requires editing the file for your cluster setup).
To train s1-32B/s1.1-32B, we recommend 16 H100 GPUs, i.e. 2 nodes with 8 GPUs each. For s1.1, we set the block size to 20000 to avoid OOM (https://github.com/simplescaling/s1/blob/0ad4b3de32507b4aa0d4be28f336276ee99b2315/train/sft.sh#L17). Check the wandb logs here.
Quick start:
git clone https://github.com/simplescaling/s1.git
cd s1
pip3 install -r requirements.txt
bash train/sft.sh
Note: If you encounter an out-of-memory (OOM) issue with 8 GPUs, consider enabling gradient checkpointing by adding the following line to your script: --gradient_checkpointing=True.
Evaluation
We cloned lm-evaluation-harness at commit 4cec66e4e468d15789473d6d63c3a61a751fa524 and modified it. Setup:
cd eval/lm-evaluation-harness
pip install -e .[math,vllm]
All commands are in eval/commands.sh. For AIME24 we always pick the aime24_nofigures result, which uses a dataset that only contains the AIME24 figures if they are important for the task.
If you want to compute statistics (avg thinking tokens etc) for an evaluation run you can use
python eval/compute_sample_stats.py path_to_samples_file.jsonl
All our evaluation result files are at: https://hf.co/datasets/simplescaling/results
To run REBASE: commands are in eval/rebase/run.sh
Note that for the evaluations in the Discussion section with REBASE we used https://huggingface.co/simplescaling/step-conditional-control-old trained on an older version of our dataset https://huggingface.co/datasets/simplescaling/s1K-step-conditional-control-old and run on an older version of our evaluation using https://huggingface.co/datasets/Maxwell-Jia/AIME_2024.
Data
To recreate s1K follow the steps below. In various files you will have to replace the organizations simplescaling and qfq with an organization that you own. Note that s1K-1.1 is a better dataset generated with r1 traces instead of Gemini traces.
1. Run data/collect_data.py followed by data/fix_gpqa.py & data/add_aime.py to collect the questions. Make sure to change the hub path in the respective files to one of your own.
2. Generate traces with Gemini via python data/gemini.py.
3. Generate answers with Qwen via python data/bulk_inference.py, which can be launched with data/bulk_inference.sh.
4. Add features by running python data/featurization.py.
5. Run final filtering by going through data/filter.ipynb.
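The script order in the steps above can be captured in a small driver. This is only a sketch: the names PIPELINE, build_commands, and run_pipeline are hypothetical, it assumes it is run from the repository root, and it defaults to a dry run that only prints the commands. The final filtering step is a notebook and is left as a manual step, and the sketch ignores the data/bulk_inference.sh launcher.

```python
import subprocess
import sys

# Script order taken from the steps above. data/filter.ipynb is a notebook
# and is meant to be worked through by hand, so it is not listed here.
PIPELINE = [
    "data/collect_data.py",
    "data/fix_gpqa.py",
    "data/add_aime.py",
    "data/gemini.py",
    "data/bulk_inference.py",
    "data/featurization.py",
]


def build_commands(python=sys.executable):
    """Return one [interpreter, script] command per pipeline step."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


run_pipeline(dry_run=True)
```

Remember to change the hub paths inside the scripts before running with dry_run=False.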
Visuals
All figures and some tables are created via this colab equivalent to visuals/visuals.ipynb. Some are subsequently edited via the visuals/s1.fig file, which you can load in Figma.
Known Issues
- vLLM throws ValueError: Token id XXXXX is out of vocabulary
  - This can happen with budget forcing, especially when running with temperature 1, where the model will sometimes predict a vocab id that is larger than its max token id but still within its embedding size, i.e. anything <152064, >151664. When we refeed the model's previous outputs to it, which is done when setting e.g. max_thinking_tokens in the evaluation, this causes the error because vLLM performs this check even though it would only be an issue for IDs >152064. To fix it you can comment out the vLLM ValueError (it is the line if max_input_id > tokenizer.max_token_id: in vllm/engine/llm_engine.py).
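To see whether a generated sequence would trigger this error before re-feeding it, the token ids can be inspected directly. The sketch below is a diagnostic aid only, not the fix described above. The two constants are the values quoted in the note, and the helper names and sample ids are hypothetical.

```python
MAX_TOKEN_ID = 151664    # tokenizer max token id, as quoted in the note above
EMBEDDING_SIZE = 152064  # embedding size, as quoted in the note above


def out_of_vocab_ids(token_ids, max_token_id=MAX_TOKEN_ID):
    """Ids that exceed the tokenizer's max token id (these trip the vLLM check)."""
    return [t for t in token_ids if t > max_token_id]


def is_within_embedding(token_id, embedding_size=EMBEDDING_SIZE):
    """True if the id still indexes a valid row of the embedding matrix."""
    return 0 <= token_id < embedding_size


# Made-up ids for illustration: two ordinary ids, one in the gap between
# the max token id and the embedding size, and one beyond the embedding.
ids = [9707, 151645, 151900, 152100]
bad = out_of_vocab_ids(ids)
print(bad)
print([is_within_embedding(t) for t in bad])
```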
Citation
@misc{muennighoff2025s1simpletesttimescaling,
title={s1: Simple test-time scaling},
author={Niklas Muennighoff and Zitong Yang and Weijia Shi and Xiang Lisa Li and Li Fei-Fei and Hannaneh Hajishirzi and Luke Zettlemoyer and Percy Liang and Emmanuel Candès and Tatsunori Hashimoto},
year={2025},
eprint={2501.19393},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.19393},
}
Owner
- Name: 张博
- Login: zhangbo2008
- Kind: user
- Location: 天津河东卫国道翰林园
- Website: www.cnblogs.com/zhangbo2008
- Repositories: 8
- Profile: https://github.com/zhangbo2008
I am a Chinese coder sharing code and notes for machine learning and math books. Contact me at 15122306087@163.com
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
abstract: "Minimal recipe for test-time scaling and strong reasoning performance matching o1-preview with just 1,000 examples & budget forcing"
repository-code: "https://github.com/simplescaling/s1"
authors:
- family-names: "Muennighoff"
given-names: "Niklas"
- family-names: "Yang"
given-names: "Zitong"
- family-names: "Shi"
given-names: "Weijia"
- family-names: "Li"
given-names: "Xiang Lisa"
- family-names: "Fei-Fei"
given-names: "Li"
- family-names: "Hajishirzi"
given-names: "Hannaneh"
- family-names: "Zettlemoyer"
given-names: "Luke"
- family-names: "Liang"
given-names: "Percy"
- family-names: "Candès"
given-names: "Emmanuel"
- family-names: "Hashimoto"
given-names: "Tatsunori"
title: "s1: Simple test-time scaling"
type: software
keywords:
- "test-time scaling"
- "language models"
- "reasoning"
- "budget forcing"
references:
- type: software
title: "s1-32B"
url: "https://hf.co/simplescaling/s1-32B"
- type: dataset
title: "s1K"
url: "https://hf.co/datasets/simplescaling/s1K"
- type: dataset
title: "s1-prob"
url: "https://hf.co/datasets/simplescaling/s1-prob"
- type: dataset
title: "s1-teasers"
url: "https://hf.co/datasets/simplescaling/s1-teasers"
- type: dataset
title: "data_ablation_full59K"
url: "https://hf.co/datasets/simplescaling/data_ablation_full59K"
date-released: "2025"
url: "https://arxiv.org/abs/2501.19393"
identifiers:
- type: arxiv
value: 2501.19393
description: The ArXiv preprint
preferred-citation:
type: misc
authors:
- family-names: "Muennighoff"
given-names: "Niklas"
- family-names: "Yang"
given-names: "Zitong"
- family-names: "Shi"
given-names: "Weijia"
- family-names: "Li"
given-names: "Xiang Lisa"
- family-names: "Fei-Fei"
given-names: "Li"
- family-names: "Hajishirzi"
given-names: "Hannaneh"
- family-names: "Zettlemoyer"
given-names: "Luke"
- family-names: "Liang"
given-names: "Percy"
- family-names: "Candès"
given-names: "Emmanuel"
- family-names: "Hashimoto"
given-names: "Tatsunori"
title: "s1: Simple test-time scaling"
year: 2025
url: "https://arxiv.org/abs/2501.19393"
arxiv: "2501.19393"
GitHub Events
Total
- Push event: 5
- Create event: 2
Last Year
- Push event: 5
- Create event: 2
Dependencies
- nvcr.io/nvidia/tritonserver 24.01-py3 build
- accelerate >=0.26.0
- datasets >=2.16.0
- dill *
- evaluate >=0.4.0
- evaluate *
- jsonlines *
- more_itertools *
- numexpr *
- peft >=0.2.0
- pybind11 >=2.6.2
- pytablewriter *
- rouge-score >=0.0.4
- sacrebleu >=1.5.0
- scikit-learn >=0.24.1
- sqlitedict *
- torch >=1.8
- tqdm-multiprocess *
- transformers >=4.1
- word2number *
- zstandard *
- sentencepiece *
- torch *
- torch *
- requests *
- aiohttp *
- anthropic >=0.20.0
- cloudpickle *
- diskcache *
- fastapi *
- interegular *
- lark *
- numba *
- numpy *
- openai >=1.0
- outlines >=0.0.27
- pebble *
- pillow *
- psutil *
- pydantic *
- referencing *
- requests *
- rpyc *
- sglang *
- timeout-decorator *
- torch *
- uvicorn *
- uvloop *
- vllm ==0.3.3
- zmq *
- accelerate ==1.0.1
- datasets ==3.1.0
- gradio ==4.44.0
- ipykernel ==6.28.0
- ipython ==8.20.0
- openai ==1.56.1
- torch ==2.5.1
- torchaudio ==2.5.1
- torchvision ==0.20.1
- transformers ==4.46.1
- trl ==0.12.0
- vllm ==0.6.4.post1
- wandb ==0.17.3