lite_llama

A light llama-like llm inference framework based on the triton kernel.

https://github.com/harleyszhang/lite_llama

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.1%) to scientific vocabulary

Keywords

attention llama llama3 llava-llama3 llm llm-inference python3 qwen2-5 triton-kernels
Last synced: 6 months ago

Repository

A light llama-like llm inference framework based on the triton kernel.

Basic Info
  • Host: GitHub
  • Owner: harleyszhang
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 39.4 MB
Statistics
  • Stars: 144
  • Watchers: 4
  • Forks: 18
  • Open Issues: 3
  • Releases: 0
Topics
attention llama llama3 llava-llama3 llm llm-inference python3 qwen2-5 triton-kernels
Created over 1 year ago · Last pushed 7 months ago
Metadata Files
  • Readme: README.md
  • Citation: CITATION.cff

README.md

# lite_llama

**A light llama-like llm inference framework based on the triton kernel.**

[![en](https://img.shields.io/badge/lang-en-red.svg)](https://github.com/harleyszhang/lite_llama/blob/main/README.md) [![zh](https://img.shields.io/badge/lang-zh-yellow.svg)](https://github.com/harleyszhang/lite_llama/blob/main/README.zh.md) ![PyPI - Python Version](https://img.shields.io/badge/python-3.11%20%7C%203.12-blue)

✅ Flash attention · ✅ Reduce GPU memory (fp16/32) · ✅ Beginner friendly

Features

  • Up to 4x speedup over the transformers library on Llama 3 1B and 3B models.
  • Supports inference for the latest Llama 3, Qwen2.5, Qwen3, and Llava 1.5 models, with top-p sampling and streaming output.
  • Supports GQA; ~~the decode stage supports CUDA graph optimization (with batch_size limitations)~~.
  • Supports FlashAttention-1, FlashAttention-2, and flash-decoding (including NopadAttention).
  • Supports efficient dynamic management of the KV cache (AutoTokenAttention).
  • Supports operator fusion, e.g. fusing the element-wise multiplication with silu, fusing the K and V linear layers, and fusing the skip (residual) connection with rmsnorm; see the sketch below.
  • Custom operators such as rmsnorm, rope, softmax, and element-wise multiplication are implemented as efficient Triton kernels.
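
As an illustration of the operator-fusion point above, here is a minimal Triton sketch of the fused SiLU-multiply pattern: silu(x) * y is computed in one kernel pass instead of launching separate activation and multiplication kernels. Kernel and function names, and the block size, are illustrative rather than lite_llama's actual operator.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def silu_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fused: silu(x) * y in a single pass, no intermediate tensor written to memory.
    tl.store(out_ptr + offsets, x * tl.sigmoid(x) * y, mask=mask)


def fused_silu_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    silu_mul_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out

# Example: fused_silu_mul(torch.randn(4096, device="cuda"), torch.randn(4096, device="cuda"))
```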

Setup and Installation

Pre-requisites

If you don't have a physical server, you can try using a virtual cloud server.

lite_llama framework requires the following dependencies:

CUDA, torch, and triton versions:

```bash
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

# Python 3.11.8
pip list | grep -E "torch|triton|transformers"
torch           2.2.1
triton          2.2.0
transformers    4.52.4
triton-nightly  3.0.0.post20240716052845
```

The latest version of transformers requires the flash-attn package to run correctly; otherwise an error such as flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c105Error4whatEv will be reported. flash-attn can be installed with pip install flash-attn, but downloading and compiling it is slow, so it is recommended to download the matching wheel from the github-flash-attention-prebuild-wheels site and install that instead.

ROCm, torch, and triton versions:

```bash
rocminfo | grep -i version
ROCk module version 6.10.5 is loaded
Runtime Version: 1.14
Runtime Ext Version: 1.6

# Python 3.11.8
pip list | grep torch
pytorch-triton-rocm  3.2.0
torch                2.6.0+rocm6.2.4
torchaudio           2.6.0+rocm6.2.4
torchvision          0.21.0+rocm6.2.4
```

Getting Started

CUDA version 12.0 or above is recommended. Download the Llama-3.2-1B-Instruct model and place it in the specified checkpoints_dir directory. Before running cli.py, run python apply_weight_convert.py to convert the Hugging Face model weights into the lite_llama weight format.
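
Conceptually, this conversion step remaps the Hugging Face state_dict key names (and optionally the dtype) into the layout lite_llama expects. The sketch below is purely hypothetical; the key names and the actual mapping used by apply_weight_convert.py may differ.

```python
import torch

def convert_hf_to_lite(hf_state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Hypothetical key remapping from a Hugging Face Llama state_dict."""
    renamed = {}
    for key, tensor in hf_state.items():
        new_key = key.removeprefix("model.")                   # e.g. "model.layers.0..." -> "layers.0..."
        new_key = new_key.replace("self_attn.", "attention.")  # illustrative rename only
        renamed[new_key] = tensor.to(torch.float16)            # store weights in fp16
    return renamed

# Usage sketch: load the HF checkpoint, remap, and save in the target format.
# hf_state = torch.load("pytorch_model.bin", map_location="cpu")
# torch.save(convert_hf_to_lite(hf_state), "lite_llama_weights.pth")
```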

```bash
apt update
apt install imagemagick
conda create --name lite_llama "python>=3.11"
conda activate lite_llama
git clone https://github.com/harleyszhang/lite_llama.git
cd lite_llama/
pip install -r requirement.txt
python test_weight_convert.py  # model weight conversion
# Run this after the model has been downloaded and placed in the specified directory
python generate.py --prompt "What is large language model" --checkpoint_path /path/to/model/Llama-3.2-1B-Instruct/
```

ROCm version 5.7 and above is recommended.

```bash
pip install matplotlib
pip install pandas
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4

apt update
apt install imagemagick
conda create --name lite_llama "python>=3.11"
conda activate lite_llama
git clone https://github.com/harleyszhang/lite_llama.git
cd lite_llama/
pip install -r requirement.txt
python test_weight_convert.py  # model weight conversion
# Run this after the model has been downloaded and placed in the specified directory
python generate.py --prompt "What is large language model" --checkpoint_path /path/to/model/Llama-3.2-1B-Instruct/
```

Evaluation

After cli.py runs successfully, the terminal displays the interface as shown below, and you can enter your question in the terminal.

[screenshot: cli]

After generate.py runs successfully, the terminal displays the interface as shown below, and you can enter your question in the terminal.

[screenshot: generate]

After cli_llava.py runs successfully, the terminal displays the interface shown below; enter your image and prompt in the terminal, then press Enter.

[screenshot: llava model streaming output]

For performance testing, change your model weight path and run the lite_llama/examples/benchmark.py file directly; it outputs a latency and throughput comparison between lite_llama and the transformers library. The result of the first run is not very accurate, so we suggest taking the second run as the reference. For example, for the Llama-3.2-3B model with `prompt_len = 25`, `batch_size = 12`, and `max_gen_len = 1900`, the benchmark result is:

```bash
lite_llama inference time: 31.3463 s
Transformers inference time: 69.1433 s
lite_llama throughput: 730.45 tokens/s
Transformers throughput: 183.95 tokens/s
lite_llama per token latency: 1.369015 ms/token
Transformers per token latency: 5.436221 ms/token
```
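
As a rough sanity check on these numbers (a minimal sketch; it assumes all batch_size × max_gen_len tokens are actually generated), throughput is total generated tokens divided by wall time, and per-token latency is simply the reciprocal of throughput:

```python
# Sanity-check the reported benchmark figures.
batch_size, max_gen_len = 12, 1900           # settings from the run above
total_tokens = batch_size * max_gen_len      # 22,800 tokens, assuming no early EOS

lite_time_s = 31.3463
print(total_tokens / lite_time_s)            # ~727 tokens/s, close to the reported 730.45
print(1000.0 / 730.45)                       # ~1.369 ms/token (reported: 1.369015)
print(1000.0 / 183.95)                       # ~5.436 ms/token (reported: 5.436221)
```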

TODO

  • [x] Optimize the decode phase using CUDA graph
  • [x] Use FlashAttention instead of standard attention
  • [x] Upgrade FlashAttention to FlashAttention-2 to reduce some computation
  • [x] Use flash-decoding in the decode phase of inference
  • [x] Support efficient dynamic management of the KV cache
  • [x] Use GQA_KV_heads_index instead of the repeat_kv function (see the sketch after this list)
  • [x] Fuse the KV linear layers
  • [x] Operator fusion: fuse the skip operation on residual connections with the rmsnorm operator into a new skip_rmsnorm operator
  • [x] Refactor and optimize the MHA module: optimize the context_attention and token_attention kernels to support Nopad attention and dynamic KV cache allocation and management
  • [ ] Support continuous batching optimization
  • [ ] Support AWQ and SmoothQuant quantization
  • [ ] Code refactoring, and a fix for CUDA graph not working properly after the AutoTokenAttention optimization
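
The GQA_KV_heads_index item above can be illustrated with a minimal PyTorch sketch (names and shapes are illustrative, not lite_llama's implementation): instead of materializing copies of each KV head with repeat_kv, every query head looks up its shared KV head through an index; in a fused kernel, this index lets the kernel read the shared head in place rather than copying it.

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Baseline: materialize n_rep copies of every KV head."""
    bsz, n_kv_heads, seqlen, head_dim = kv.shape
    return (
        kv[:, :, None, :, :]
        .expand(bsz, n_kv_heads, n_rep, seqlen, head_dim)
        .reshape(bsz, n_kv_heads * n_rep, seqlen, head_dim)
    )

def gqa_kv_heads_index(kv: torch.Tensor, n_q_heads: int) -> torch.Tensor:
    """Map each query head to its shared KV head via an index instead of copying."""
    bsz, n_kv_heads, seqlen, head_dim = kv.shape
    group = n_q_heads // n_kv_heads
    kv_head_index = torch.arange(n_q_heads, device=kv.device) // group
    return kv[:, kv_head_index]  # query head i reads KV head i // group

# Both yield the same [bsz, n_q_heads, seqlen, head_dim] layout:
kv = torch.randn(1, 4, 16, 64)
assert torch.equal(repeat_kv(kv, 2), gqa_kv_heads_index(kv, 8))
```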

Detailed information can be found in the performance optimization documentation.

Acknowledgement

Citation

If you use lite_llama in your research, please cite the following work:

```bibtex
@misc{litellama-2023,
  author       = {Litellama AI team},
  title        = {Litellama},
  howpublished = {\url{https://github.com/harleyszhang/lite_llama}},
  year         = {2023},
}
```

Owner

  • Name: Zhang
  • Login: HarleysZhang
  • Kind: user
  • Location: shenzhen
  • Company: SWJTU

CV&DL Learner; search for the WeChat official account 嵌入式视觉 (Embedded Vision).

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, you can cite it as shown below."
title: "Lite Llama"
abstract: "A light llama-like llm inference framework based on the triton kernel."
date-released: 2023-04-23
authors:
  - name: "The Litellama AI team"
url: "https://github.com/harleyszhang/lite_llama.git"

GitHub Events

Total
  • Issues event: 3
  • Watch event: 128
  • Member event: 1
  • Issue comment event: 7
  • Push event: 217
  • Pull request event: 14
  • Fork event: 17
  • Create event: 3
Last Year
  • Issues event: 3
  • Watch event: 128
  • Member event: 1
  • Issue comment event: 7
  • Push event: 217
  • Pull request event: 14
  • Fork event: 17
  • Create event: 3

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 233
  • Total Committers: 4
  • Avg Commits per committer: 58.25
  • Development Distribution Score (DDS): 0.069
Past Year
  • Commits: 233
  • Committers: 4
  • Avg Commits per committer: 58.25
  • Development Distribution Score (DDS): 0.069
Top Committers
  • HarleysZhang (z****1@o****m): 217 commits
  • TATAXIMU (t****u@g****m): 10 commits
  • zhanghonggao.zhg (z****g@a****m): 5 commits
  • Ikko Eltociear Ashimine (e****r@g****m): 1 commit
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 1
  • Total pull requests: 8
  • Average time to close issues: about 14 hours
  • Average time to close pull requests: 2 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 8
  • Average time to close issues: about 14 hours
  • Average time to close pull requests: 2 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 1.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • zhangtianhong-1998 (1)
  • WangKai1123 (1)
Pull Request Authors
  • TATAXIMU (8)
  • WangKai1123 (2)
  • eltociear (2)
  • harleyszhang (1)
Top Labels
Issue Labels
Pull Request Labels