lite_llama
A light llama-like llm inference framework based on the triton kernel.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.1%) to scientific vocabulary
Keywords
Repository
A light llama-like llm inference framework based on the triton kernel.
Basic Info
Statistics
- Stars: 144
- Watchers: 4
- Forks: 18
- Open Issues: 3
- Releases: 0
Topics
Metadata Files
README.md
✅ Flash attention ✅ Reduce GPU memory (fp16/32) ✅ Beginner friendly
Features
- Up to 4x speedup over transformers for the llama3 1B and 3B models.
- Supports the latest `llama3`, `Qwen2.5`, `Qwen3`, and `Llava1.5` model inference, `top-p` sampling, and streaming output.
- Supports GQA, ~~decode stage support cuda graph optimization (with batch_size limitations)~~.
- Supports `flashattention1`, `flashattention2`, and `flashdecoding` (supports `NopadAttention`).
- Supports efficient dynamic management of the kv cache (`auto token attention`).
- Supports operator fusion, e.g. fusion of `*` and `silu` for element-by-element multiplication, k v linear layer fusion, and fusion of `skip` and `rmsnorm` (see the sketch below).
- Some custom operators such as `rmsnorm`, `rope`, `softmax`, element-by-element multiplication, etc. are implemented with efficient `triton` kernels.
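As an illustration of the element-by-element fusion idea, here is a minimal Triton sketch (not the project's actual kernel; the kernel and wrapper names are made up for this example) that fuses `silu(x) * y` into a single pass over memory:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _silu_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fused: silu(x) * y in one kernel, so no intermediate silu(x) tensor is written back to HBM.
    tl.store(out_ptr + offsets, x * tl.sigmoid(x) * y, mask=mask)

def silu_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Hypothetical fused replacement for torch.nn.functional.silu(x) * y (contiguous inputs assumed)."""
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    _silu_mul_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```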
Setup and Installation
Pre-requisites
If you don't have a physical server, you can try using a virtual cloud remote server.
lite_llama framework requires the following dependencies:
For the CUDA, torch, and triton versions:
```bash
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```

Python 3.11.8:

```bash
pip list | grep torch
torch            2.2.1
triton           2.2.0
transformers     4.52.4
triton-nightly   3.0.0.post20240716052845
```
The latest version of transformers requires the flash-attn package to run correctly; otherwise the error `flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c105Error4whatEv` is reported. flash-attn can be installed with `pip install flash-attn`, but downloading and compiling it this way is very slow, so it is recommended to download the matching wheel from the github-flash-attention-prebuild-wheels website and install that instead.
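A prebuilt-wheel install might look like the following sketch (the file name is only a placeholder; pick the wheel that matches your Python, torch, and CUDA versions from the prebuild-wheels release page):

```bash
# Illustrative only: replace the placeholder with the actual wheel URL/file name
# that matches your Python, torch, and CUDA versions.
wget <URL-of-matching-flash_attn-wheel>.whl
pip install ./<matching-flash_attn-wheel>.whl
```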
For the ROCm, torch, and triton versions:
```bash
rocminfo | grep -i version
ROCk module version 6.10.5 is loaded
Runtime Version: 1.14
Runtime Ext Version: 1.6
```

Python 3.11.8:

```bash
pip list | grep torch
pytorch-triton-rocm   3.2.0
torch                 2.6.0+rocm6.2.4
torchaudio            2.6.0+rocm6.2.4
torchvision           0.21.0+rocm6.2.4
```
Getting Started
CUDA version 12.0 and above is recommended. Download the llama3.2-1B-Instruct model and place it in the specified `checkpoints_dir` directory. `python apply_weight_convert.py` needs to be run to convert the HF model weights to the lite_llama weight format before running `cli.py`.
```bash
apt update
apt install imagemagick
conda create --name lite_llama "python>=3.11"
conda activate lite_llama
git clone https://github.com/harleyszhang/lite_llama.git
cd lite_llama/
pip install -r requirement.txt
python test_weight_convert.py  # convert the model weights
python generate.py --prompt "What is large language model" --checkpoint_path /path/to/model/Llama-3.2-1B-Instruct/  # run once the model has been downloaded and placed in the specified directory
```
ROCm version 5.7 and above is recommended.
```bash
pip install matplotlib
pip install pandas
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4
apt update
apt install imagemagick
conda create --name lite_llama "python>=3.11"
conda activate lite_llama
git clone https://github.com/harleyszhang/lite_llama.git
cd lite_llama/
pip install -r requirement.txt
python test_weight_convert.py  # convert the model weights
python generate.py --prompt "What is large language model" --checkpoint_path /path/to/model/Llama-3.2-1B-Instruct/  # run once the model has been downloaded and placed in the specified directory
```
Evaluation
After cli.py runs successfully, the terminal displays the interface as shown below, and you can enter your question in the terminal.

After generate.py runs successfully, the terminal displays the interface as shown below, and you can enter your question in the terminal.

After cli_llava.py runs successfully, the terminal displays the interface as shown below; enter your image path and prompt in the terminal, then press Enter.

For performance testing, after changing your model weight path, run the `lite_llama/examples/benchmark.py` file directly; it will output a latency and throughput comparison between lite_llama and the transformers library. The result of the first run is not very accurate, so we suggest taking the second run as the reference. For example, for the Llama-3.2-3B model with `prompt_len = 25`, `batch_size = 12`, and `max_gen_len = 1900`, the benchmark result is:
```bash
lite_llama inference time: 31.3463 s
Transformers inference time: 69.1433 s
lite_llama throughput: 730.45 tokens/s
Transformers throughput: 183.95 tokens/s
lite_llama per token latency: 1.369015 ms/token
Transformers per token latency: 5.436221 ms/token
```
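As a quick sanity check on these numbers (assuming the reported throughput is simply the reciprocal of the per-token latency), the throughput figures follow directly from the latencies:

```python
# Assumed relationship: throughput (tokens/s) = 1000 / per-token latency (ms/token).
for name, latency_ms in [("lite_llama", 1.369015), ("transformers", 5.436221)]:
    print(f"{name}: {1000.0 / latency_ms:.2f} tokens/s")  # ~730.45 and ~183.95, matching the output above
```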
TODO
- [x] Optimized for decode phase using cuda graph
- [x] Use flashattention instead of standard attention
- [x] Upgrade `flashattention` to `flashattention2` to reduce some computation.
- [x] Use `flashdecoding` in the decode phase of inference.
- [x] Support efficient dynamic management of the kv cache.
- [x] Use `GQA_KV_heads_index` instead of the `repeat_kv` function (see the sketch after this list).
- [x] kv linear layer fusion.
- [x] Operator fusion: the skip operation on residual connections is fused with the `rmsnorm` operator to form a new `skip_rmsnorm` operator.
- [x] Refactor and optimize the `MHA` module, optimizing the `context_attention` and `token_attention` kernels to support `Nopad attention` and dynamic allocation and management of the `kv cache`.
- [ ] Support continuous batching optimization.
- [ ] Support AWQ and SmoothQuant quantization.
- [ ] Code refactoring and a fix for CUDA graph not working properly after the AutoTokenAttention optimization.
Detailed information can be found in performance optimization
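For context on the `GQA_KV_heads_index` item above: the idea is to replace the usual `repeat_kv` copy with an index lookup. A minimal sketch of the idea (names and shapes are illustrative, not the project's actual code):

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Standard GQA approach: physically repeat each KV head n_rep times.
    # kv: [batch, num_kv_heads, seq_len, head_dim] -> [batch, num_kv_heads * n_rep, seq_len, head_dim]
    b, h_kv, s, d = kv.shape
    return kv[:, :, None].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)

# Index-based alternative: precompute which KV head each query head shares, and let the
# attention kernel gather kv[:, kv_head_index[q]] directly, so no repeated copy of the
# KV cache is ever materialized.
num_q_heads, num_kv_heads = 32, 8
kv_head_index = torch.arange(num_q_heads) // (num_q_heads // num_kv_heads)
# kv_head_index -> tensor([0, 0, 0, 0, 1, 1, 1, 1, ..., 7, 7, 7, 7])
```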
Acknowledgement
Citation
If you use Litellama in your research, please cite the following work:
```bibtex
@misc{litellama-2023,
  author = {Litellama AI team},
  title = {Litellama},
  howpublished = {\url{https://github.com/harleyszhang/lite_llama}},
  year = {2023},
}
```
Owner
- Name: Zhang
- Login: HarleysZhang
- Kind: user
- Location: shenzhen
- Company: SWJTU
- Website: https://www.cnblogs.com/armcvai/
- Repositories: 17
- Profile: https://github.com/HarleysZhang
CV & DL learner; search for the WeChat official account "嵌入式视觉" (Embedded Vision).
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, you can cite it as shown below."
title: "Lite Llama"
abstract: "A light llama-like llm inference framework based on the triton kernel."
date-released: 2023-04-23
authors:
  - name: "The Litellama AI team"
    url: "https://github.com/harleyszhang/lite_llama.git"
```
GitHub Events
Total
- Issues event: 3
- Watch event: 128
- Member event: 1
- Issue comment event: 7
- Push event: 217
- Pull request event: 14
- Fork event: 17
- Create event: 3
Last Year
- Issues event: 3
- Watch event: 128
- Member event: 1
- Issue comment event: 7
- Push event: 217
- Pull request event: 14
- Fork event: 17
- Create event: 3
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| HarleysZhang | z****1@o****m | 217 |
| TATAXIMU | t****u@g****m | 10 |
| zhanghonggao.zhg | z****g@a****m | 5 |
| Ikko Eltociear Ashimine | e****r@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 1
- Total pull requests: 8
- Average time to close issues: about 14 hours
- Average time to close pull requests: 2 days
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 1.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 8
- Average time to close issues: about 14 hours
- Average time to close pull requests: 2 days
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 1.0
- Average comments per pull request: 1.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- zhangtianhong-1998 (1)
- WangKai1123 (1)
Pull Request Authors
- TATAXIMU (8)
- WangKai1123 (2)
- eltociear (2)
- harleyszhang (1)