https://github.com/mobiusml/hqq

Official implementation of Half-Quadratic Quantization (HQQ)

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 20 committers (5.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary

Keywords

llm machine-learning quantization

Keywords from Contributors

transformers vlms pretrained-models audio deepseek gemma glm model-hub pytorch-transformers optimism
Last synced: 6 months ago

Repository

Official implementation of Half-Quadratic Quantization (HQQ)

Basic Info
Statistics
  • Stars: 878
  • Watchers: 15
  • Forks: 84
  • Open Issues: 2
  • Releases: 29
Topics
llm machine-learning quantization
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License

Readme.md

Half-Quadratic Quantization (HQQ)

This repository contains the official implementation of Half-Quadratic Quantization (HQQ), presented in our articles:
  • HQQ: https://mobiusml.github.io/hqq_blog/
  • HQQ+: https://mobiusml.github.io/1bit_blog/

What is HQQ?

HQQ is a fast and accurate model quantizer that skips the need for calibration data. Quantize the largest models, without calibration data, in just a few minutes at most 🚀.

FAQ

Why should I use HQQ instead of other quantization methods?
  • HQQ is very fast to quantize models.
  • It supports 8, 4, 3, 2 and 1 bits.
  • You can use it on any model (LLMs, vision models, etc.).
  • The dequantization step is a linear operation, which means HQQ is compatible with various optimized CUDA/Triton kernels.
  • HQQ is compatible with peft training.
  • We try to make HQQ fully compatible with `torch.compile` for faster inference and training.
What is the quality of the quantized models?
We have detailed benchmarks on both language and vision models. Please refer to our blog posts: HQQ, HQQ+.
What is the speed of the quantized models?
4-bit models with `axis=1` can use optimized fused inference kernels. Moreover, we focus on making hqq fully compatible with `torch.compile`, which speeds up both training and inference. For more details, please refer to the Backends section below.
What quantization settings should I use?
You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage and speed. If you want better results at the same VRAM usage, switch to `axis=0` and use the ATEN backend; note that this setting is not supported for fast inference (see the sketch after this FAQ).
What does the `axis` parameter mean?
The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.
What is the difference between HQQ and HQQ+?
HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.
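
Putting the settings and axis answers above together, here is a minimal sketch of the recommended configuration and the higher-quality `axis=0` alternative. It assumes `BaseQuantizeConfig` (shown in the Basic Usage section below) accepts the `axis` keyword directly:
```Python
from hqq.core.quantize import BaseQuantizeConfig

# Recommended starting point: good balance of quality, VRAM usage and speed,
# and compatible with the optimized axis=1 inference kernels.
fast_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Better quality at the same VRAM usage, but not supported by the fast
# inference kernels; pair it with the ATEN backend (see Backends below).
quality_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)
```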

Installation

First, make sure you have a PyTorch 2 version that matches your CUDA version: https://pytorch.org/

You can install hqq via
```
# Latest stable version
pip install hqq

# Latest updates - recommended
pip install git+https://github.com/mobiusml/hqq.git

# Disable building the CUDA kernels for the ATEN backend
DISABLE_CUDA=1 pip install ...
```

Alternatively, clone the repo and run `pip install .` from the repository root.

Basic Usage

To perform quantization with HQQ, you simply need to replace the linear layers (`torch.nn.Linear`) as follows:
```Python
import torch
from hqq.core.quantize import *

# Quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Replace your linear layer
hqq_layer = HQQLinear(your_linear_layer,           # torch.nn.Linear or None
                      quant_config=quant_config,   # quantization configuration
                      compute_dtype=torch.float16, # compute dtype
                      device='cuda',               # cuda device
                      initialize=True,             # Use False to quantize later
                      del_orig=True)               # if True, delete the original layer

W_r = hqq_layer.dequantize()               # dequantize
W_q = hqq_layer.unpack(dtype=torch.uint8)  # unpack
y   = hqq_layer(x)                         # forward-pass
```

The quantization parameters are set as follows:

  • nbits (int): supports 8, 4, 3, 2, 1 bits.
  • group_size (int): no restrictions as long as weight.numel() is divisible by the group_size.
  • view_as_float (bool): if True, the quantized parameter is viewed as a float type instead of an int type.
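
For example, the parameters above combine into a single config like this (a sketch; the `axis` and `view_as_float` keywords are assumed to be accepted by `BaseQuantizeConfig` alongside `nbits` and `group_size`):
```Python
from hqq.core.quantize import BaseQuantizeConfig

# 4-bit weights, groups of 64 values, quantized parameters stored as int
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1, view_as_float=False)

# group_size must divide weight.numel(); e.g. for a 4096 x 4096 linear layer:
assert (4096 * 4096) % 64 == 0
```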

Usage with Models

Transformers 🤗

For usage with HF's transformers, see the example below from the documentation:
```Python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config
)
```
You can save/load quantized models as regular transformers models via `save_pretrained`/`from_pretrained`.
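
For example, a sketch of the regular transformers save/load round-trip mentioned above (the directory name is illustrative):
```Python
# Save the quantized model like any transformers model
model.save_pretrained("llama-hqq-4bit")

# Reload it later as a regular transformers model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("llama-hqq-4bit", device_map="cuda")
```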

HQQ Lib

You can also utilize the HQQ library to quantize transformers models:
```Python
# Load the model on CPU
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)

# Quantize
from hqq.models.hf.base import AutoHQQHFModel
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
```
You can save/load quantized models as follows:
```Python
from hqq.models.hf.base import AutoHQQHFModel

# Save: make sure to save the model BEFORE any patching
AutoHQQHFModel.save_quantized(model, save_dir)

# Save as safetensors (to be loaded via transformers or vllm)
AutoHQQHFModel.save_to_safetensors(model, save_dir)

# Load
model = AutoHQQHFModel.from_quantized(save_dir)
```

❗ Note that models saved via the hqq lib are not compatible with `.from_pretrained()`.

Backends

Native Backends

The following native dequantization backends can be used by the `HQQLinear` module:
```Python
HQQLinear.set_backend(HQQBackend.PYTORCH)          # PyTorch backend - Default
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  # Compiled PyTorch
HQQLinear.set_backend(HQQBackend.ATEN)             # ATEN/CUDA backend - only axis=0 supported
```
❗ Note that `HQQBackend.ATEN` only supports `axis=0`.

Optimized Inference

We support external backends for faster inference with fused kernels. You can enable one of the backends after the model has been quantized as follows:
```Python
from hqq.utils.patching import prepare_for_inference

# PyTorch backend that makes the model compatible with fullgraph torch.compile: works with any settings
prepare_for_inference(model)

# Gemlite backend: nbits=4/2/1, compute_dtype=float16, axis=1
prepare_for_inference(model, backend="gemlite")

# Torchao's tinygemm backend (fast for batch-size < 4): nbits=4, compute_dtype=bfloat16, axis=1
prepare_for_inference(model, backend="torchao_int4")
```
Note that these backends only work with `axis=1`. Additional restrictions apply to the group-size values depending on the backend. You should expect ~158 tokens/sec with a Llama3-8B 4-bit quantized model on a 4090 RTX.

When a quantization config is not supported by the specified inference backend, hqq will fall back to the native backend.
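
As a concrete (hedged) example of the `torch.compile` path above, assuming a HF causal-LM `model` that has already been quantized and patched with `prepare_for_inference(model)`, and a matching `tokenizer`:
```Python
import torch

# Compile the forward pass; the default PyTorch backend above is fullgraph-compatible
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# Warm up the compiled graph with a short generation
inputs = tokenizer("Explain HQQ in one sentence.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```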

Custom Quantization Configurations ⚙️

You can set up various quantization configurations for different layers by specifying the settings for each layer name:

Transformers 🤗

```Python
# Each linear layer with the same tag will use a dedicated quantization config
q4_config = {'nbits': 4, 'group_size': 64}
q3_config = {'nbits': 3, 'group_size': 32}

quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,

    'mlp.gate_proj': q3_config,
    'mlp.up_proj':   q3_config,
    'mlp.down_proj': q3_config,
})
```

HQQ lib

```Python
from hqq.core.quantize import *

q4_config = BaseQuantizeConfig(nbits=4, group_size=64)
q3_config = BaseQuantizeConfig(nbits=3, group_size=32)

quant_config = {
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,

    'mlp.gate_proj': q3_config,
    'mlp.up_proj':   q3_config,
    'mlp.down_proj': q3_config,
}
```
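
The per-layer dict can then be passed where a single config would go; a sketch assuming `AutoHQQHFModel.quantize_model` accepts the layer-tag dict as `quant_config`, just like the single config in the HQQ Lib section above:
```Python
import torch
from hqq.models.hf.base import AutoHQQHFModel

# Assumption: quant_config may be the per-layer dict defined above
AutoHQQHFModel.quantize_model(model,
                              quant_config=quant_config,
                              compute_dtype=torch.float16,
                              device='cuda')
```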

VLLM

You can use HQQ in vllm. Make sure to install GemLite before using the backend.

```Python
import torch
from vllm import LLM

# Or you can quantize on-the-fly
from hqq.utils.vllm import set_vllm_onthefly_hqq_quant
skip_modules = ['lm_head', 'visual', 'vision']

# Select one of the following modes:

# INT/FP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weight_only', skip_modules=skip_modules)  # A16W8 - INT8 weight only
set_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128,  quant_mode='int4_weight_only', skip_modules=skip_modules)  # A16W4 - HQQ weight only
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules)                       # A8W8 - INT8 x INT8 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic',  skip_modules=skip_modules)                       # A8W8 - FP8 x FP8 dynamic

# MXFP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)     # A8W8 - MXFP8 x MXFP8 - post_scale=True
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32,   quant_mode='mxfp8_dynamic', skip_modules=skip_modules)     # A8W8 - MXFP8 x MXFP8 - post_scale=False
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weight_only', skip_modules=skip_modules)                  # A16W4 - MXFP4 weight-only
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)                      # A8W4 - MXFP8 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules)                      # A4W4 - MXFP4 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules)                      # A4W4 - NVFP4 x NVFP4 dynamic

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)
```
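
Generation then follows the standard vLLM API; a minimal usage sketch (the prompt and sampling settings are illustrative):
```Python
from vllm import SamplingParams

# Generate with the quantized model loaded above
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is half-quadratic quantization?"], sampling_params)
print(outputs[0].outputs[0].text)
```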

Peft Training

Peft training is directly supported in Hugging Face's peft library. If you still want to use the hqq lib's peft utilities, here's how:

```Python
import torch
from hqq.core.peft import PeftUtils

# First, quantize/load a quantized HQQ model

base_lora_params = {'lora_type': 'default', 'r': 32, 'lora_alpha': 64, 'dropout': 0.05, 'train_dtype': torch.float32}
lora_params = {'self_attn.q_proj': base_lora_params,
               'self_attn.k_proj': base_lora_params,
               'self_attn.v_proj': base_lora_params,
               'self_attn.o_proj': base_lora_params,
               'mlp.gate_proj':  None,
               'mlp.up_proj':    None,
               'mlp.down_proj':  None}

# Add LoRA to linear/HQQ modules
PeftUtils.add_lora(model, lora_params)

# Optional: set your backend
HQQLinear.set_backend(HQQBackend.ATEN if axis == 0 else HQQBackend.PYTORCH_COMPILE)

# Train .... (an illustrative training-step sketch follows this block)

# Convert LoRA weights to the same model dtype for faster inference
model.eval()
PeftUtils.cast_lora_weights(model, dtype=compute_dtype)

# Save LoRA weights
PeftUtils.save_lora_weights(model, filename)

# Load LoRA weights: automatically calls add_lora
PeftUtils.load_lora_weights(model, filename)
```
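
To make the `# Train ....` step concrete, here is an illustrative sketch, not the repo's training recipe: `train_loader` is a hypothetical DataLoader of tokenized batches, and it assumes only the LoRA parameters are left trainable after `PeftUtils.add_lora`:
```Python
import torch

# Optimize only the trainable (LoRA) parameters
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

model.train()
for batch in train_loader:  # hypothetical DataLoader of tokenized batches
    optimizer.zero_grad()
    input_ids = batch['input_ids'].to('cuda')
    out = model(input_ids=input_ids, labels=input_ids)  # causal-LM loss
    out.loss.backward()
    optimizer.step()
```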

We provide a complete example to train a model with HQQ/LoRA that you can find in examples/hqq_plus.py.

If you want to use multi-GPU training via FSDP, check out this awesome repo by Answer.AI: https://github.com/AnswerDotAI/fsdp_qlora

Examples

We provide a variety of examples demonstrating model quantization across different backends within the examples directory.

Citation 📜

```
@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://mobiusml.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
```

Owner

  • Name: mobiusml
  • Login: mobiusml
  • Kind: organization

GitHub Events

Total
  • Create event: 6
  • Commit comment event: 2
  • Release event: 6
  • Issues event: 62
  • Watch event: 180
  • Issue comment event: 211
  • Push event: 90
  • Pull request event: 12
  • Fork event: 20
Last Year
  • Create event: 6
  • Commit comment event: 2
  • Release event: 6
  • Issues event: 62
  • Watch event: 180
  • Issue comment event: 211
  • Push event: 90
  • Pull request event: 12
  • Fork event: 20

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 406
  • Total Committers: 20
  • Avg Commits per committer: 20.3
  • Development Distribution Score (DDS): 0.372
Past Year
  • Commits: 141
  • Committers: 7
  • Avg Commits per committer: 20.143
  • Development Distribution Score (DDS): 0.44
Top Committers
Name Email Commits
mobicham h****m@m****m 255
mobicham 3****m@u****m 113
Kerem Turgutlu k****u@g****m 7
Yeongjae Jang 5****r@u****m 6
Viraat Das v****s@g****m 4
E e****a@g****m 3
Aleksi Ikkala a****a@p****i 2
Benjamin Clavié b****n@c****u 2
Mark Saroufim m****m@m****m 2
fahadh4ilyas f****4@g****m 2
Appu Shaji a****e@g****m 1
Benjamin Warner me@b****v 1
Ikko Eltociear Ashimine e****r@g****m 1
Qubitium q****m@m****i 1
Sian Cao y****y@g****m 1
mark M****1@g****u 1
root m****m 1
xffxff 1****9@q****m 1
yeh-sudo y****n@g****m 1
yiliu30 y****u@i****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 122
  • Total pull requests: 29
  • Average time to close issues: 29 days
  • Average time to close pull requests: 12 days
  • Total issue authors: 86
  • Total pull request authors: 19
  • Average comments per issue: 5.15
  • Average comments per pull request: 3.07
  • Merged pull requests: 21
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 37
  • Pull requests: 11
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 9 days
  • Issue authors: 28
  • Pull request authors: 7
  • Average comments per issue: 6.49
  • Average comments per pull request: 2.27
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kaizizzzzzz (9)
  • Minami-su (5)
  • BeichenHuang (4)
  • kadirnar (4)
  • NEWbie0709 (4)
  • mxjmtxrm (3)
  • ZeleiShao (3)
  • zhangy659 (3)
  • 2U1 (3)
  • mobicham (2)
  • DHKim0428 (2)
  • DominikHil (2)
  • NickyDark1 (2)
  • MarkBenjamin (2)
  • Abdullah-kwl (2)
Pull Request Authors
  • Liberatedwinner (6)
  • sonald (4)
  • fahadh4ilyas (4)
  • aikkala (4)
  • bclavie (3)
  • xffxff (2)
  • KeremTurgutlu (2)
  • larin92 (2)
  • warner-benjamin (2)
  • eltociear (2)
  • yeh-sudo (2)
  • rationalism (2)
  • Qubitium (2)
  • MarkBenjamin (2)
  • anijain2305 (2)
Top Labels
Issue Labels
enhancement (9) help wanted (7) question (7) bug (5) needs more info (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 16,322 last-month
  • Total dependent packages: 2
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 29
  • Total maintainers: 1
proxy.golang.org: github.com/mobiusml/hqq
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.4%
Average: 6.6%
Dependent repos count: 6.8%
Last synced: 6 months ago
pypi.org: hqq

Half-Quadratic Quantization (HQQ)

  • Versions: 26
  • Dependent Packages: 2
  • Dependent Repositories: 0
  • Downloads: 16,322 Last month
Rankings
Dependent packages count: 10.1%
Average: 38.5%
Dependent repos count: 66.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

hqq/kernels/setup.py pypi
setup.py pypi
  • accelerate *
  • huggingface_hub *
  • numpy >=1.24.4
  • timm *
  • torch >=2.0.1
  • tqdm >=4.64.1
  • transformers *