lite_llama
A light llama-like llm inference framework based on the triton kernel.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary
Keywords
Repository
A light llama-like llm inference framework based on the triton kernel.
Basic Info
Statistics
- Stars: 144
- Watchers: 4
- Forks: 18
- Open Issues: 3
- Releases: 0
Topics
Metadata Files
README.md
✅ Flash attention ✅ Reduce GPU memory (fp16/32) ✅ Beginner friendly
Features
- Up to 4x speedup over transformers for the llama3 1B and 3B models.
- Supports the latest `llama3`, `Qwen2.5`, `Qwen3`, and `Llava1.5` model inference, `top-p` sampling, and streaming output.
- Supports GQA, ~~decode stage support cuda graph optimization (with batch_size limitations)~~.
- Supports `flashattention1`, `flashattention2`, and `flashdecoding` (supports `NopadAttention`).
- Supports efficient dynamic management of the kv cache (`auto token attention`).
- Supports operator fusion, e.g. fusion of `*` and `silu` for element-by-element multiplication, k v linear layer fusion, and fusion of `skip` and `rmsnorm` (see the sketch below).
- Some custom operators such as `rmsnorm`, `rope`, `softmax`, element-by-element multiplication, etc. are implemented with efficient `triton` kernels.
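As an illustration of the element-by-element fusion idea, here is a minimal Triton sketch (not the project's actual kernel; the kernel and wrapper names are made up for this example) that fuses `silu(x) * y` into a single pass over memory:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _silu_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fused: silu(x) * y in one kernel, so no intermediate silu(x) tensor is written back to HBM.
    tl.store(out_ptr + offsets, x * tl.sigmoid(x) * y, mask=mask)

def silu_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Hypothetical fused replacement for torch.nn.functional.silu(x) * y (contiguous inputs assumed)."""
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    _silu_mul_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```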
Setup and Installation
Pre-requisites
If you don't have a physical server, you can try using a virtual cloud remote server.
lite_llama framework requires the following dependencies:
For the CUDA, torch, and triton versions:
```bash
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```

Python 3.11.8:

```bash
pip list | grep torch
torch            2.2.1
triton           2.2.0
transformers     4.52.4
triton-nightly   3.0.0.post20240716052845
```
The latest version of transformers requires the flash-attn package to run correctly; otherwise the error `flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c105Error4whatEv` is reported. flash-attn can be installed with `pip install flash-attn`, but downloading and compiling it this way is very slow, so it is recommended to download the matching wheel from the github-flash-attention-prebuild-wheels website and install that instead.
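A prebuilt-wheel install might look like the following sketch (the file name is only a placeholder; pick the wheel that matches your Python, torch, and CUDA versions from the prebuild-wheels release page):

```bash
# Illustrative only: replace the placeholder with the actual wheel URL/file name
# that matches your Python, torch, and CUDA versions.
wget <URL-of-matching-flash_attn-wheel>.whl
pip install ./<matching-flash_attn-wheel>.whl
```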
For the ROCm, torch, and triton versions:
```bash
rocminfo | grep -i version
ROCk module version 6.10.5 is loaded
Runtime Version: 1.14
Runtime Ext Version: 1.6
```

Python 3.11.8:

```bash
pip list | grep torch
pytorch-triton-rocm   3.2.0
torch                 2.6.0+rocm6.2.4
torchaudio            2.6.0+rocm6.2.4
torchvision           0.21.0+rocm6.2.4
```
Getting Started
CUDA version 12.0 and above is recommended. Download the llama3.2-1B-Instruct model and place it in the specified `checkpoints_dir` directory. `python apply_weight_convert.py` needs to be run to convert the HF model weights to the lite_llama weight format before running `cli.py`.
```bash
apt update
apt install imagemagick
conda create --name lite_llama "python>=3.11"
conda activate lite_llama
git clone https://github.com/harleyszhang/lite_llama.git
cd lite_llama/
pip install -r requirement.txt
python test_weight_convert.py  # convert the model weights
python generate.py --prompt "What is large language model" --checkpoint_path /path/to/model/Llama-3.2-1B-Instruct/  # run once the model has been downloaded and placed in the specified directory
```
ROCm version 5.7 and above is recommended.
```bash
pip install matplotlib
pip install pandas
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4
apt update
apt install imagemagick
conda create --name lite_llama "python>=3.11"
conda activate lite_llama
git clone https://github.com/harleyszhang/lite_llama.git
cd lite_llama/
pip install -r requirement.txt
python test_weight_convert.py  # convert the model weights
python generate.py --prompt "What is large language model" --checkpoint_path /path/to/model/Llama-3.2-1B-Instruct/  # run once the model has been downloaded and placed in the specified directory
```
Evaluation
After cli.py runs successfully, the terminal displays the interface as shown below, and you can enter your question in the terminal.

After generate.py runs successfully, the terminal displays the interface as shown below, and you can enter your question in the terminal.

After cli_llava.py runs successfully, the terminal displays the interface as shown below; enter your image path and prompt in the terminal, then press Enter.

For performance testing, after changing your model weight path, run the `lite_llama/examples/benchmark.py` file directly; it will output a latency and throughput comparison between lite_llama and the transformers library. The result of the first run is not very accurate, so we suggest taking the second run as the reference. For example, for the Llama-3.2-3B model with `prompt_len = 25`, `batch_size = 12`, and `max_gen_len = 1900`, the benchmark result is:
```bash
lite_llama inference time: 31.3463 s
Transformers inference time: 69.1433 s
lite_llama throughput: 730.45 tokens/s
Transformers throughput: 183.95 tokens/s
lite_llama per token latency: 1.369015 ms/token
Transformers per token latency: 5.436221 ms/token
```
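As a quick sanity check on these numbers (assuming the reported throughput is simply the reciprocal of the per-token latency), the throughput figures follow directly from the latencies:

```python
# Assumed relationship: throughput (tokens/s) = 1000 / per-token latency (ms/token).
for name, latency_ms in [("lite_llama", 1.369015), ("transformers", 5.436221)]:
    print(f"{name}: {1000.0 / latency_ms:.2f} tokens/s")  # ~730.45 and ~183.95, matching the output above
```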
TODO
- [x] Optimized for decode phase using cuda graph
- [x] Use flashattention instead of standard attention
- [x] Upgrade `flashattention` to `flashattention2` to reduce some computation.
- [x] Use `flashdecoding` in the decode phase of inference.
- [x] Support efficient dynamic management of the kv cache.
- [x] Use `GQA_KV_heads_index` instead of the `repeat_kv` function (see the sketch after this list).
- [x] kv linear layer fusion.
- [x] Operator fusion: the skip operation on residual connections is fused with the `rmsnorm` operator to form a new `skip_rmsnorm` operator.
- [x] Refactor and optimize the `MHA` module, optimizing the `context_attention` and `token_attention` kernels to support `Nopad attention` and dynamic allocation and management of the `kv cache`.
- [ ] Support continuous batching optimization.
- [ ] Support AWQ and SmoothQuant quantization.
- [ ] Code refactoring and a fix for CUDA graph not working properly after the AutoTokenAttention optimization.
Detailed information can be found in performance optimization
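For context on the `GQA_KV_heads_index` item above: the idea is to replace the usual `repeat_kv` copy with an index lookup. A minimal sketch of the idea (names and shapes are illustrative, not the project's actual code):

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Standard GQA approach: physically repeat each KV head n_rep times.
    # kv: [batch, num_kv_heads, seq_len, head_dim] -> [batch, num_kv_heads * n_rep, seq_len, head_dim]
    b, h_kv, s, d = kv.shape
    return kv[:, :, None].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)

# Index-based alternative: precompute which KV head each query head shares, and let the
# attention kernel gather kv[:, kv_head_index[q]] directly, so no repeated copy of the
# KV cache is ever materialized.
num_q_heads, num_kv_heads = 32, 8
kv_head_index = torch.arange(num_q_heads) // (num_q_heads // num_kv_heads)
# kv_head_index -> tensor([0, 0, 0, 0, 1, 1, 1, 1, ..., 7, 7, 7, 7])
```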
Acknowledgement
Citation
If you use Litellama in your research, please cite the following work:
```bibtex
@misc{litellama-2023,
  author = {Litellama AI team},
  title = {Litellama},
  howpublished = {\url{https://github.com/harleyszhang/lite_llama}},
  year = {2023},
}
```
Owner
- Name: Zhang
- Login: HarleysZhang
- Kind: user
- Location: shenzhen
- Company: SWJTU
- Website: https://www.cnblogs.com/armcvai/
- Repositories: 17
- Profile: https://github.com/HarleysZhang
CV & DL learner; search for the WeChat official account "嵌入式视觉" (Embedded Vision).
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, you can cite it as shown below."
title: "Lite Llama"
abstract: "A light llama-like llm inference framework based on the triton kernel."
date-released: 2023-04-23
authors:
  - name: "The Litellama AI team"
    url: "https://github.com/harleyszhang/lite_llama.git"
```
GitHub Events
Total
- Issues event: 3
- Watch event: 128
- Member event: 1
- Issue comment event: 7
- Push event: 217
- Pull request event: 14
- Fork event: 17
- Create event: 3
Last Year
- Issues event: 3
- Watch event: 128
- Member event: 1
- Issue comment event: 7
- Push event: 217
- Pull request event: 14
- Fork event: 17
- Create event: 3
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| HarleysZhang | z****1@o****m | 217 |
| TATAXIMU | t****u@g****m | 10 |
| zhanghonggao.zhg | z****g@a****m | 5 |
| Ikko Eltociear Ashimine | e****r@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 1
- Total pull requests: 8
- Average time to close issues: about 14 hours
- Average time to close pull requests: 2 days
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 1.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 8
- Average time to close issues: about 14 hours
- Average time to close pull requests: 2 days
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 1.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- zhangtianhong-1998 (1)
- WangKai1123 (1)
Pull Request Authors
- TATAXIMU (8)
- WangKai1123 (2)
- eltociear (2)
- harleyszhang (1)