https://github.com/acai66/qwen_numpy
Implementing the inference process of DeepSeek-R1-Distill-Qwen-1.5B using numpy, making it easy to learn LLM (Large Language Model) inference and to port to other programming languages for acceleration.
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 6.0%, to scientific vocabulary)
Basic Info
- Host: GitHub
- Owner: acai66
- Language: Python
- Default Branch: main
- Homepage: https://hyacm.com
- Size: 45.9 KB
Statistics
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Alibaba Tongyi Qianwen Qwen2.5 / Qwen3 numpy inference (supports Qwen models distilled from DeepSeek-R1)
- Implements Qwen inference using only numpy, without torch, transformers, or other frameworks, making it easy to learn the LLM inference process and to port it to other languages
- Supports Alibaba Cloud's original Tongyi Qianwen Qwen2.5 and Qwen3 models, as well as the DeepSeek-R1-distilled Qwen2.5 models; other fine-tuned models are untested (but should work in theory)
- Supports batch inference
- Supports temperature, topk, topp, penalty, and other sampling parameters (see the sketch after this list)
- Supports KV caching
- Supports q8_0 quantization
- Written for learning: the complete LLM inference process in about 400 lines of code, excluding tokenization
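The sampling parameters above are standard logit post-processing. A minimal sketch of how temperature, topk, topp, and a repetition penalty might be applied to next-token logits in numpy; the function name and defaults here are illustrative, not the repo's actual API in model.py:
```python
import numpy as np

def sample_token(logits, generated_ids, temperature=0.7, topk=50, topp=0.9, penalty=1.1):
    """Pick the next token id from raw logits for one sequence (illustrative)."""
    logits = logits.astype(np.float64)
    # Repetition penalty: damp the logits of tokens already generated.
    for t in set(generated_ids):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    logits /= max(temperature, 1e-6)
    # Top-k: mask everything below the k-th largest logit.
    if 0 < topk < logits.size:
        kth = np.sort(logits)[-topk]
        logits[logits < kth] = -np.inf
    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p (nucleus): keep the smallest prefix of the sorted distribution
    # whose cumulative probability reaches topp, then renormalize.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), topp) + 1
    probs[order[cutoff:]] = 0.0
    probs /= probs.sum()
    return int(np.random.choice(probs.size, p=probs))
```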
Testing
1. Install dependencies
```bash
pip install numpy tokenizers
```
2. Download a safetensors model
Download the complete model from a model-sharing platform; see the modelscope platform's download instructions.
3. Convert the model
Convert the model with the parse_safetensors.py script, passing the directory of the downloaded model and the directory where the converted npy model should be saved, for example:
```bash
python parse_safetensors.py --model_dir <downloaded_model_dir> --npy_save_dir <npy_save_dir>
```
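Conceptually, this step reads each safetensors shard and writes every tensor out as a .npy file. A rough sketch using the `safetensors` package's numpy loader; this is only illustrative, and the repo's parse_safetensors.py may parse the format directly and lay files out differently:
```python
import os

import numpy as np
from safetensors.numpy import load_file

def convert(model_dir: str, npy_save_dir: str) -> None:
    os.makedirs(npy_save_dir, exist_ok=True)
    for fname in sorted(os.listdir(model_dir)):
        if not fname.endswith('.safetensors'):
            continue
        # Each shard maps weight names to numpy arrays.
        tensors = load_file(os.path.join(model_dir, fname))
        for name, arr in tensors.items():
            # One .npy file per tensor, named after the weight.
            np.save(os.path.join(npy_save_dir, name + '.npy'), arr)
```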
4. Run inference
Edit the model path and prompt in model.py and run it, or import the Model class from model.py yourself; see the main function in model.py for usage:
```python
import numpy as np

from model import Model

if __name__ == '__main__':
    # Qwen2.5 / Qwen3 chat template:
    chat_template = '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n'
    model_weights_path = '/Users/acai/Downloads/models/Qwen2.5_0.5B_Instruct_npy_FP32'
    # DeepSeek-R1-distilled chat template:
    # chat_template = '<|begin▁of▁sentence|><|begin▁of▁sentence|>You are a helpful assistant.<|User|>{}<|Assistant|>'
    model = Model(model_weights_path)
    prompt = [
        # "How do I implement softmax with python numpy?",
        "Who are you?",
        # "Compute 456+826",
    ]  # batch of prompts
    text = list(map(lambda x: chat_template.format(x), prompt))
    model_inputs = np.array([model.tokenizer.encode_batch_fast(text)[i].ids for i in range(len(text))], dtype=np.int32)
    generated_ids = model.generate(
        model_inputs,
        max_new_tokens=2048
    )
    response = model.tokenizer.decode_batch(generated_ids, skip_special_tokens=True)
    print('\n'.join(response))
```
Benchmark
Tokens-per-second comparison against llama.cpp. Test platform: Mac mini M4, 16 GB RAM.
| Model | Precision | numpy (tok/s) | llama.cpp (tok/s) |
|:---:|:---:|:---:|:---:|
| Qwen2.5-0.5B-Instruct | float32 | 29.77 | 45.6 |
| Qwen2.5-0.5B-Instruct | float16 | - | 86.44 |
| Qwen2.5-0.5B-Instruct | q8_0 | 1.94 | 140.53 |
| DeepSeek-R1-Distill-Qwen-1.5B | float32 | 10.31 | 15.55 |
| DeepSeek-R1-Distill-Qwen-1.5B | float16 | - | 31.55 |
| DeepSeek-R1-Distill-Qwen-1.5B | q8_0 | 0.68 | 54.47 |
The 7B model needs around 30 GB of RAM at float32 precision; the test machine did not have enough memory, so it was not benchmarked.
Surprisingly, numpy's accelerated matrix routines only support float32 and float64, not integer or half-precision types, so float32 ends up being the fastest path. But float32 weights take a lot of memory, which easily leads to out-of-memory conditions and puts heavy demands on memory bandwidth; reaching llama.cpp's speed will likely require porting to another language and optimizing the matrix kernels at a lower level.
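This dtype limitation is easy to verify with a micro-benchmark: numpy dispatches float32/float64 matrix multiplies to BLAS, while float16 falls back to a slow generic loop. An illustrative timing sketch (numbers vary by machine, and the float16 case can take noticeably long):
```python
import time

import numpy as np

def bench_matmul(dtype, n=1024, reps=3):
    a = np.random.rand(n, n).astype(dtype)
    b = np.random.rand(n, n).astype(dtype)
    start = time.perf_counter()
    for _ in range(reps):
        _ = a @ b
    # Average seconds per matmul.
    return (time.perf_counter() - start) / reps

# float32 hits BLAS; float16 does not, despite halving the memory traffic.
for dt in (np.float32, np.float16):
    print(f'{np.dtype(dt).name}: {bench_matmul(dt) * 1e3:.1f} ms per 1024x1024 matmul')
```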
Owner
- Name: acai
- Login: acai66
- Kind: user
- Website: https://hyacm.com
- Repositories: 33
- Profile: https://github.com/acai66
- Bio: "Realizing the past cannot be remedied, knowing what is to come may yet be pursued"
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| acai66 | 1****6@q****m | 10 |
Issues and Pull Requests
Last synced: 7 months ago