llmcompressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references: not found
- ○ Academic publication links: not found
- ✓ Committers with academic emails: 2 of 69 committers (2.9%) from academic institutions
- ○ Institutional organization owner: not found
- ○ JOSS paper metadata: not found
- ○ Scientific vocabulary similarity: low similarity (13.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: vllm-project
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://docs.vllm.ai/projects/llm-compressor
- Size: 29.1 MB
Statistics
- Stars: 1,905
- Watchers: 23
- Forks: 221
- Open Issues: 89
- Releases: 13
Metadata Files
README.md
LLM Compressor
[Documentation](https://docs.vllm.ai/projects/llm-compressor/en/latest/) [PyPI](https://pypi.org/project/llmcompressor/)
llmcompressor is an easy-to-use library for optimizing models for deployment with vLLM, including:
- Comprehensive set of quantization algorithms for weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- safetensors-based file format compatible with vLLM
- Large model support via accelerate
✨ Read the announcement blog here! ✨
🚀 What's New!
Big updates have landed in LLM Compressor! To get a more in-depth look, check out the deep-dive.
Some of the exciting new features include:
- QuIP and SpinQuant-style Transforms: The newly added QuIPModifier and SpinQuantModifier allow users to quantize their models after injecting Hadamard weights into the computation graph, reducing quantization error and greatly improving accuracy recovery for low-bit weight and activation quantization (see the sketch after this list).
- DeepSeekV3-style Block Quantization Support: This allows for more efficient compression of large language models without needing a calibration dataset. Quantize a Qwen3 model to W8A8.
- Llama4 Quantization Support: Quantize a Llama4 model to W4A16 or NVFP4. The checkpoint produced can seamlessly run in vLLM.
- FP4 Quantization - now with MoE and non-uniform support: Quantize weights and activations to FP4 and seamlessly run the compressed model in vLLM. Model weights and activations are quantized following the NVFP4 configuration. See examples of fp4 activation support, MoE support, and Non-uniform quantization support where some layers are selectively quantized to fp8 for better recovery. You can also mix other quantization schemes, such as int8 and int4.
- Large Model Support with Sequential Onloading: As of llm-compressor>=0.6.0, you can now quantize very large language models on a single GPU. Models are broken into disjoint layers which are then onloaded to the GPU one layer at a time. For more information on sequential onloading, see Big Modeling with Sequential Onloading as well as the DeepSeek-R1 Example.
- Axolotl Sparse Finetuning Integration: Seamlessly finetune sparse LLMs with our Axolotl integration. Learn how to create fast sparse open-source models with Axolotl and LLM Compressor. See also the Axolotl integration docs.
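As a rough illustration of how the new transform modifiers compose with quantization, here is a minimal sketch. The class names QuIPModifier and SpinQuantModifier come from the list above, but the import path (llmcompressor.modifiers.transform), the constructor defaults, and the model name are assumptions and may differ across releases; see the repository's QuIP/SpinQuant examples for the authoritative version.

```python
# Hedged sketch: pair a SpinQuant-style transform with W4A16 quantization.
# The import path and default arguments below are assumptions; consult the
# QuIP/SpinQuant examples in the repository for the authoritative versions.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import SpinQuantModifier  # assumed path

recipe = [
    SpinQuantModifier(),  # inject rotation (Hadamard) transforms before quantizing
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative checkpoint
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Llama-3.2-1B-Instruct-W4A16-spinquant",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```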
Supported Formats
- Activation Quantization: W8A8 (int8 and fp8)
- Mixed Precision: W4A16, W8A16, NVFP4 (W4A4 and W4A16 support)
- 2:4 Semi-structured and Unstructured Sparsity
Supported Algorithms
- Simple PTQ (see the sketch after this list)
- GPTQ
- AWQ
- SmoothQuant
- SparseGPT
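For the simplest of the algorithms above (Simple PTQ), a data-free run can look like the following. This is a minimal sketch patterned on the project's fp8 example; the scheme string FP8_DYNAMIC and the defaults shown are assumptions and may change between versions.

```python
# Minimal sketch of data-free "Simple PTQ" to FP8 (no calibration set needed).
# The scheme string and defaults are assumptions based on the fp8 example.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all Linear layers...
    scheme="FP8_DYNAMIC",  # ...to fp8 weights with dynamic per-token activations
    ignore=["lm_head"],    # keep the output head in higher precision
)

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)
```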
When to Use Which Optimization
Please refer to compression_schemes.md for detailed information about available optimization schemes and their use cases.
Installation
```bash
pip install llmcompressor
```
Get Started
End-to-End Examples
Applying quantization with llmcompressor:
* Activation quantization to int8
* Activation quantization to fp8
* Activation quantization to fp4
* Weight only quantization to fp4
* Weight only quantization to int4 using GPTQ
* Weight only quantization to int4 using AWQ
* Quantizing MoE LLMs
* Quantizing Vision-Language Models
* Quantizing Audio-Language Models
* Quantizing Models Non-uniformly
User Guides
Deep dives into advanced usage of llmcompressor:
* Quantizing large models with the help of accelerate (see the sketch below)
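When a checkpoint is too large to hold on one device, the model is typically loaded through transformers/accelerate and passed to oneshot as an object rather than a model name. The sketch below assumes the standard from_pretrained loading path; device_map="auto" and the tokenizer keyword are assumptions here, and the sequential-onloading guide describes the settings the project actually recommends.

```python
# Hedged sketch: load a large model with accelerate-managed placement and pass
# the model object (not just its name) into oneshot.
# device_map="auto" is an assumption; see the Big Modeling with Sequential
# Onloading guide for the project's recommended configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # illustrative large checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

oneshot(
    model=model,
    tokenizer=tokenizer,  # assumed keyword; some examples pass only the model
    dataset="open_platypus",
    recipe=GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"]),
    output_dir="Llama-3.1-70B-Instruct-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```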
Quick Tour
Let's quantize TinyLlama with 8 bit weights and activations using the GPTQ and SmoothQuant algorithms.
Note that the model can be swapped for a local or remote HF-compatible checkpoint and the recipe may be changed to target different quantization algorithms or formats.
Apply Quantization
Quantization is applied by selecting an algorithm and calling the oneshot API.
```python
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor import oneshot

# Select quantization algorithm. In this case, we:
#   * apply SmoothQuant to make the activations easier to quantize
#   * quantize the weights to int8 with GPTQ (static per channel)
#   * quantize the activations to int8 (dynamic per token)
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Apply quantization using the built-in open_platypus dataset.
#   * See examples for demos showing how to pass a custom calibration set
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-INT8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```
Inference with vLLM
The checkpoints created by llmcompressor can be loaded and run in vLLM:
Install:
```bash
pip install vllm
```
Run:
```python
from vllm import LLM

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
output = model.generate("My name is")
```
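generate returns a list of RequestOutput objects; the completion text itself lives on the nested outputs field. A small usage note follows (the sampling parameters here are illustrative additions, not part of the README):

```python
from vllm import LLM, SamplingParams

model = LLM("TinyLlama-1.1B-Chat-v1.0-INT8")
outputs = model.generate("My name is", SamplingParams(max_tokens=64, temperature=0.8))
print(outputs[0].outputs[0].text)  # generated continuation for the first prompt
```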
Questions / Contribution
- If you have any questions or requests, open an issue and we will add an example or documentation.
- We appreciate contributions to the code, examples, integrations, and documentation as well as bug reports and feature requests! Learn how here.
Citation
If you find LLM Compressor useful in your research or projects, please consider citing it:
```bibtex
@software{llmcompressor2024,
  title={{LLM Compressor}},
  author={Red Hat AI and vLLM Project},
  year={2024},
  month={8},
  url={https://github.com/vllm-project/llm-compressor},
}
```
Owner
- Name: vLLM
- Login: vllm-project
- Kind: organization
- Repositories: 22
- Profile: https://github.com/vllm-project
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - name: Red Hat AI
  - name: vLLM Project
title: "LLM Compressor"
date-released: 2024-08-08
url: https://github.com/vllm-project/llm-compressor
```
Committers
Last synced: 9 months ago
Top Committers
| Name | Email/handle | Commits |
|---|---|---|
| Benjamin Fineran | b****n | 293 |
| Mark Kurtz | m****k@n****m | 269 |
| Rahul Tuli | r****l@n****m | 204 |
| Sara Adkins | s****a@n****m | 187 |
| Kyle Sayers | k****s@g****m | 153 |
| dbogunowicz | 9****z | 116 |
| Konstantin Gulin | 6****n | 111 |
| Dipika Sikka | d****1@g****m | 109 |
| Tuan Nguyen | t****n@n****m | 91 |
| Benjamin | b****n@n****m | 89 |
| Kevin Escobar Rodriguez | k****z@n****m | 81 |
| Eldar Kurtic | e****i@g****m | 79 |
| corey-nm | 1****m | 64 |
| Michael Goin | m****l@n****m | 59 |
| George | g****e@n****m | 54 |
| Alexandre Marques | a****e@n****m | 50 |
| Jeannie Finks | 7****s | 43 |
| Domenic Barbuzzi | d****c@n****m | 25 |
| rshaw@neuralmagic.com | r****w@n****m | 25 |
| dependabot[bot] | 4****] | 25 |
| Brian Dellabetta | b****a | 24 |
| dhuangnm | 7****m | 24 |
| jonathan rosenfeld | j****r@j****l | 18 |
| Robert Shaw | 1****c | 15 |
| abhinavnmagic | 1****c | 14 |
| Jen Iofinova | j****a@n****m | 11 |
| spacemanidol | d****3@i****u | 8 |
| Robert Shaw | 1****2 | 6 |
| Dan | d****h@g****m | 5 |
| Derrick Mwiti | m****k@g****m | 5 |
| and 39 more... | ||
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 279
- Total pull requests: 873
- Average time to close issues: 22 days
- Average time to close pull requests: 15 days
- Total issue authors: 163
- Total pull request authors: 50
- Average comments per issue: 2.52
- Average comments per pull request: 1.43
- Merged pull requests: 551
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 248
- Pull requests: 805
- Average time to close issues: 19 days
- Average time to close pull requests: 15 days
- Issue authors: 149
- Pull request authors: 43
- Average comments per issue: 2.21
- Average comments per pull request: 1.51
- Merged pull requests: 495
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- alamgirsb01 (35)
- wreetre (31)
- dsikka (15)
- instinktum (14)
- esakhaei (9)
- zjnyly (8)
- trakveeeH (8)
- Kha-Zix-1 (8)
- robertgshaw2-neuralmagic (7)
- kylesayrs (7)
- avdvafqa (6)
- mgoin (6)
- jalebi2 (6)
- jiangjiadi (6)
- hfgthygyu (6)
Pull Request Authors
- kylesayrs (384)
- dsikka (172)
- horheynm (126)
- rahul-tuli (95)
- brian-dellabetta (56)
- dbarbuzzi (47)
- Satrat (44)
- robertgshaw2-neuralmagic (24)
- mgoin (20)
- shanjiaz (15)
- ved1beta (13)
- markurtz (12)
- dhuangnm (10)
- eldarkurtic (8)
- bfineran (6)
Packages
- Total packages: 1
- Total downloads (PyPI): 45,309 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 44
- Total maintainers: 1
pypi.org: llmcompressor
A library for compressing large language models utilizing the latest techniques and research in the field for both training aware and post training techniques. The library is designed to be flexible and easy to use on top of PyTorch and HuggingFace Transformers, allowing for quick experimentation.
- Homepage: https://github.com/vllm-project/llm-compressor
- Documentation: https://llmcompressor.readthedocs.io/
- License: Apache
- Latest release: 0.7.1 (published 6 months ago)
Dependencies
- actions/checkout v3 composite
- docker/build-push-action v2 composite
- docker/login-action v2 composite
- docker/setup-buildx-action v2 composite
- actions/checkout v3 composite
- docker/build-push-action v4 composite
- docker/login-action v2 composite
- docker/setup-buildx-action v2 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- aws-actions/configure-aws-credentials v2 composite
- neuralmagic/nm-actions/actions/pypi_build main composite
- neuralmagic/nm-actions/actions/s3_push main composite
- actions/checkout v2 composite
- gaurav-nelson/github-action-markdown-link-check v1 composite
- actions/checkout v2 composite
- actions/setup-python v4 composite
- actions/checkout v2 composite
- actions/setup-python v4 composite
- actions/checkout v2 composite
- actions/checkout v3 composite
- actions/checkout v2 composite
- actions/checkout v3 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- aws-actions/configure-aws-credentials v2 composite
- neuralmagic/nm-actions/actions/publish-whl main composite
- neuralmagic/nm-actions/actions/s3_pull main composite
- actions/checkout v3 composite
- aws-actions/configure-aws-credentials v2 composite
- base latest build
- container_branch_${DEPS} latest build
- cuda-$CUDA_VERSION latest build
- cuda_builder latest build
- nvidia/cuda ${CUDA_VERSION}-devel-ubuntu18.04 build
- $SOURCE latest build
- $SOURCE latest build
- $SOURCE latest build
- accelerate >=0.20.3
- click >=7.1.2,
- compressed-tensors *
- datasets <2.19
- evaluate >=0.4.1
- loguru *
- numpy >=1.17.0,<2.0
- pyyaml >=5.0.0
- requests >=2.0.0
- safetensors >=0.4.1
- sentencepiece *
- torch >=1.7.0
- tqdm >=4.0.0
- transformers <4.41