https://github.com/acai66/qwen_numpy

Implementing the inference process of DeepSeek-R1-Distill-Qwen-1.5B using numpy, making it easy to learn LLM (Large Language Model) inference and to port to other programming languages for acceleration.


Science Score: 26.0%

This score indicates how likely this project is to be science-related, based on the following indicators:

  • CITATION.cff file
  • codemeta.json file: found
  • .zenodo.json file: found
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low (6.0%)

Keywords

deepseek deepseek-r1 llama-cpp llm-inference numpy qwen qwen2
Last synced: 5 months ago

Repository

Implementing the inference process of DeepSeek-R1-Distill-Qwen-1.5B using numpy, making it easy to learn LLM (Large Language Model) inference and to port to other programming languages for acceleration.

Basic Info
  • Host: GitHub
  • Owner: acai66
  • Language: Python
  • Default Branch: main
  • Homepage: https://hyacm.com
  • Size: 45.9 KB
Statistics
  • Stars: 9
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
deepseek deepseek-r1 llama-cpp llm-inference numpy qwen qwen2
Created about 1 year ago · Last pushed 7 months ago
Metadata Files
  • Readme: README.md

Alibaba Tongyi Qianwen Qwen2.5 / Qwen3 inference in numpy (supports DeepSeek-R1-distilled Qwen models)

  • Implements Qwen inference using numpy only, without frameworks such as torch or transformers, making the LLM inference process easy to study and easy to port to other languages
  • Supports Alibaba Cloud's original Qwen2.5 and Qwen3 models as well as Qwen2.5 models distilled from DeepSeek-R1; other fine-tuned models are untested (but should work in theory)
  • Supports batch inference
  • Supports temperature, top_k, top_p, penalty, and other sampling parameters (see the sketch after this list)
  • Supports KV cache
  • Supports q8_0 quantization
  • Written for learning: the complete LLM inference process in about 400 lines of code, excluding tokenization
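
As a rough sketch of how those sampling parameters interact (the function below is illustrative only and not taken from this repository; names and default values are invented), a single numpy sampling step could look like:

```python
import numpy as np

def sample_next_token(logits, generated_ids, temperature=0.7,
                      top_k=50, top_p=0.9, penalty=1.1):
    """Illustrative sampling step; names and defaults are assumptions."""
    logits = logits.astype(np.float64)
    # Repetition penalty: push down logits of tokens already generated.
    for t in set(generated_ids):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    # Temperature: sharpen (<1) or flatten (>1) the distribution.
    logits = logits / temperature
    # Top-k: mask everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf
    # Softmax.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest probability-sorted prefix of
    # tokens whose cumulative mass reaches top_p; always keep the top token.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    drop = order[1:][cum[:-1] >= top_p]
    probs[drop] = 0.0
    probs /= probs.sum()
    return int(np.random.choice(probs.size, p=probs))
```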

Testing

1. Install dependencies

```bash
pip install numpy tokenizers
```

2. Download a safetensors model

Download a complete model from a model-sharing platform; see the modelscope platform's download instructions:

  1. Qwen3-0.6B
  2. Qwen3-1.7B
  3. Qwen2.5-0.5B-Instruct
  4. Qwen2.5-1.5B-Instruct
  5. DeepSeek-R1-Distill-Qwen-1.5B

3. Convert the model

Convert the model with the parse_safetensors.py script, passing the downloaded model directory and the directory where the converted npy model should be saved, for example:

```bash
python parse_safetensors.py --model_dir <downloaded_model_dir> --npy_save_dir <npy_output_dir>
```
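
For reference, the safetensors container is simple enough to parse with the standard library plus numpy: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape, and byte offsets, then the raw tensor buffer. A minimal reader in that spirit (a sketch of the file format, not necessarily how parse_safetensors.py is implemented) might look like:

```python
import json
import numpy as np

def bf16_to_f32(raw_u16):
    # bfloat16 is the top 16 bits of an IEEE float32
    return (raw_u16.astype(np.uint32) << 16).view(np.float32)

def read_safetensors(path):
    """Minimal pure-numpy safetensors reader (sketch)."""
    with open(path, 'rb') as f:
        header_len = int.from_bytes(f.read(8), 'little')  # u64 LE header size
        header = json.loads(f.read(header_len))  # name -> dtype/shape/offsets
        buf = f.read()                           # raw tensor bytes
    tensors = {}
    for name, meta in header.items():
        if name == '__metadata__':
            continue
        start, end = meta['data_offsets']  # byte offsets into buf
        if meta['dtype'] == 'BF16':
            arr = bf16_to_f32(np.frombuffer(buf[start:end], dtype=np.uint16))
        else:
            dtype = {'F32': np.float32, 'F16': np.float16}[meta['dtype']]
            arr = np.frombuffer(buf[start:end], dtype=dtype)
        tensors[name] = arr.reshape(meta['shape'])
    return tensors

# Each tensor could then be dumped for the numpy model to load, e.g.:
# for name, arr in read_safetensors('model.safetensors').items():
#     np.save(name.replace('/', '_') + '.npy', arr)
```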

4. Run inference

Edit the model path and prompt in model.py and run it, or import the Model class from model.py yourself; see the usage in the main function of model.py:

```python
import numpy as np

from model import Model  # needed only if this snippet is run outside model.py

if __name__ == '__main__':
    # chat_template = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n'
    # model_weights_path = '/Users/acai/Downloads/models/Qwen2.5_0.5B_Instruct_npy_FP32'
    # chat_template = '<|begin▁of▁sentence|><|begin▁of▁sentence|>You are a helpful assistant.<|User|>{}<|Assistant|>\n'
    # model_weights_path = '/Users/acai/Downloads/models/DeepSeek_R1_Distill_Qwen_1.5B_npy_FP32'
    chat_template = '<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n'  # no thinking mode
    # chat_template = '<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n<think>\n'  # thinking mode
    model_weights_path = '/Users/acai/Downloads/models/Qwen3_0.6B_npy_FP32'

    model = Model(model_weights_path)

    prompt = [
        # "怎么用python numpy实现softmax?",  # "How do I implement softmax with python numpy?"
        "你是谁?",  # "Who are you?"
        # "计算456+826",  # "Compute 456+826"
    ]  # batch
    text = list(map(lambda x: chat_template.format(x), prompt))

    model_inputs = np.array([model.tokenizer.encode_batch_fast(text)[i].ids for i in range(len(text))], dtype=np.int32)

    generated_ids = model.generate(
        model_inputs,
        max_new_tokens=2048
    )

    response = model.tokenizer.decode_batch(generated_ids, skip_special_tokens=True)
    print('\n'.join(response))
```

Benchmark

Tokens-per-second comparison against llama.cpp (higher is better); test platform: Mac mini M4, 16 GB RAM.

| Model | Precision | numpy | llama.cpp |
|:---:|:---:|:---:|:---:|
| Qwen2.5-0.5B-Instruct | float32 | 29.77 | 45.6 |
| Qwen2.5-0.5B-Instruct | float16 | - | 86.44 |
| Qwen2.5-0.5B-Instruct | q8_0 | 1.94 | 140.53 |
| DeepSeek-R1-Distill-Qwen-1.5B | float32 | 10.31 | 15.55 |
| DeepSeek-R1-Distill-Qwen-1.5B | float16 | - | 31.55 |
| DeepSeek-R1-Distill-Qwen-1.5B | q8_0 | 0.68 | 54.47 |

At float32 precision a 7B model needs about 30 GB of RAM (7B parameters × 4 bytes ≈ 28 GB, before activations and the KV cache); the test machine did not have enough memory, so it was not benchmarked.

The surprising part is that numpy's accelerated matrix routines only support float32 and float64, not integers or half precision, which makes float32 the fastest option here. But float32 weights occupy a lot of memory, easily exhausting RAM, and place heavy demands on memory bandwidth; matching llama.cpp's speed likely requires porting to another language and optimizing the matrix kernels at a lower level.
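
That dtype effect is easy to check: numpy routes float32/float64 matmul through BLAS, while float16 falls back to a slow generic loop. A quick, machine-dependent probe (timings will vary with your BLAS build and hardware):

```python
import time
import numpy as np

def matmul_ms(dtype, n=512, reps=10):
    """Average wall-clock time of one n x n matmul in milliseconds."""
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    a @ b  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    return (time.perf_counter() - t0) / reps * 1e3

for dt in (np.float64, np.float32, np.float16):
    print(np.dtype(dt).name, f'{matmul_ms(dt):.1f} ms')
```

On typical builds the float16 row is orders of magnitude slower than float32, which is consistent with the benchmark table above.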


Owner

  • Name: acai
  • Login: acai66
  • Kind: user

"Realizing the past is beyond correction, I know the future may yet be pursued."


Committers

Last synced: 7 months ago

All Time
  • Total Commits: 10
  • Total Committers: 1
  • Avg Commits per committer: 10.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 10
  • Committers: 1
  • Avg Commits per committer: 10.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| acai66 | 1****6@q****m | 10 |
Committer Domains (Top 20 + Academic)
qq.com: 1

Issues and Pull Requests

Last synced: 7 months ago