https://github.com/mobiusml/hqq

Official implementation of Half-Quadratic Quantization (HQQ)

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 20 committers (5.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary

Keywords

llm machine-learning quantization

Keywords from Contributors

transformers vlms pretrained-models audio deepseek gemma glm model-hub pytorch-transformers optimism
Last synced: 6 months ago

Repository

Official implementation of Half-Quadratic Quantization (HQQ)

Basic Info
Statistics
  • Stars: 878
  • Watchers: 15
  • Forks: 84
  • Open Issues: 2
  • Releases: 29
Topics
llm machine-learning quantization
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme License

Readme.md

Half-Quadratic Quantization (HQQ)

This repository contains the official implementation of Half-Quadratic Quantization (HQQ), presented in our articles:
  • HQQ: https://mobiusml.github.io/hqq_blog/
  • HQQ+: https://mobiusml.github.io/1bit_blog/

What is HQQ?

HQQ is a fast and accurate model quantizer that skips the need for calibration data. Quantize the largest models, without calibration data, in just a few minutes at most 🚀.

FAQ

Why should I use HQQ instead of other quantization methods?
  • HQQ is very fast to quantize models.
  • It supports 8, 4, 3, 2 and 1 bits.
  • You can use it on any model (LLMs, vision models, etc.).
  • The dequantization step is a linear operation, which means HQQ is compatible with various optimized CUDA/Triton kernels.
  • HQQ is compatible with peft training.
  • We try to make HQQ fully compatible with `torch.compile` for faster inference and training.
What is the quality of the quantized models?
We have detailed benchmarks on both language and vision models. Please refer to our blog posts: HQQ, HQQ+.
What is the speed of the quantized models?
4-bit models with `axis=1` can use optimized fused inference kernels. Moreover, we focus on making hqq fully compatible with `torch.compile`, which speeds up both training and inference. For more details, please refer to the Backends section below.
What quantization settings should I use?
You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage and speed. If you want better results at the same VRAM usage, switch to `axis=0` and use the ATEN backend; note that this setting is not supported for fast inference (see the sketch after this FAQ).
What does the `axis` parameter mean?
The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.
What is the difference between HQQ and HQQ+?
HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.
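
Putting the settings and axis answers above together, here is a minimal sketch of the recommended configuration and the higher-quality `axis=0` alternative. It assumes `BaseQuantizeConfig` (shown in the Basic Usage section below) accepts the `axis` keyword directly:
```Python
from hqq.core.quantize import BaseQuantizeConfig

# Recommended starting point: good balance of quality, VRAM usage and speed,
# and compatible with the optimized axis=1 inference kernels.
fast_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Better quality at the same VRAM usage, but not supported by the fast
# inference kernels; pair it with the ATEN backend (see Backends below).
quality_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)
```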

Installation

First, make sure you have a PyTorch 2 version that matches your CUDA version: https://pytorch.org/

You can install hqq via
```
# Latest stable version
pip install hqq

# Latest updates - recommended
pip install git+https://github.com/mobiusml/hqq.git

# Disable building the CUDA kernels for the ATEN backend
DISABLE_CUDA=1 pip install ...
```

Alternatively, clone the repo and run `pip install .` from the repository root.

Basic Usage

To perform quantization with HQQ, you simply need to replace the linear layers (`torch.nn.Linear`) as follows:
```Python
import torch
from hqq.core.quantize import *

# Quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Replace your linear layer
hqq_layer = HQQLinear(your_linear_layer,           # torch.nn.Linear or None
                      quant_config=quant_config,   # quantization configuration
                      compute_dtype=torch.float16, # compute dtype
                      device='cuda',               # cuda device
                      initialize=True,             # Use False to quantize later
                      del_orig=True)               # if True, delete the original layer

W_r = hqq_layer.dequantize()               # dequantize
W_q = hqq_layer.unpack(dtype=torch.uint8)  # unpack
y   = hqq_layer(x)                         # forward-pass
```

The quantization parameters are set as follows:

  • nbits (int): supports 8, 4, 3, 2, 1 bits.
  • group_size (int): no restrictions as long as weight.numel() is divisible by the group_size.
  • view_as_float (bool): if True, the quantized parameter is viewed as a float type instead of an int type.
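
For example, the parameters above combine into a single config like this (a sketch; the `axis` and `view_as_float` keywords are assumed to be accepted by `BaseQuantizeConfig` alongside `nbits` and `group_size`):
```Python
from hqq.core.quantize import BaseQuantizeConfig

# 4-bit weights, groups of 64 values, quantized parameters stored as int
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1, view_as_float=False)

# group_size must divide weight.numel(); e.g. for a 4096 x 4096 linear layer:
assert (4096 * 4096) % 64 == 0
```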

Usage with Models

Transformers 🤗

For usage with HF's transformers, see the example below from the documentation:
```Python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config
)
```
You can save/load quantized models as regular transformers models via `save_pretrained`/`from_pretrained`.
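
For example, a sketch of the regular transformers save/load round-trip mentioned above (the directory name is illustrative):
```Python
# Save the quantized model like any transformers model
model.save_pretrained("llama-hqq-4bit")

# Reload it later as a regular transformers model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("llama-hqq-4bit", device_map="cuda")
```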

HQQ Lib

You can also utilize the HQQ library to quantize transformers models:
```Python
# Load the model on CPU
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)

# Quantize
from hqq.models.hf.base import AutoHQQHFModel
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
```
You can save/load quantized models as follows:
```Python
from hqq.models.hf.base import AutoHQQHFModel

# Save: make sure to save the model BEFORE any patching
AutoHQQHFModel.save_quantized(model, save_dir)

# Save as safetensors (to be loaded via transformers or vllm)
AutoHQQHFModel.save_to_safetensors(model, save_dir)

# Load
model = AutoHQQHFModel.from_quantized(save_dir)
```

❗ Note that models saved via the hqq lib are not compatible with `.from_pretrained()`.

Backends

Native Backends

The following native dequantization backends can be used by the `HQQLinear` module:
```Python
HQQLinear.set_backend(HQQBackend.PYTORCH)          # PyTorch backend - Default
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  # Compiled PyTorch
HQQLinear.set_backend(HQQBackend.ATEN)             # ATEN/CUDA backend - only axis=0 supported
```
❗ Note that `HQQBackend.ATEN` only supports `axis=0`.

Optimized Inference

We support external backends for faster inference with fused kernels. You can enable one of the backends after the model has been quantized as follows:
```Python
from hqq.utils.patching import prepare_for_inference

# PyTorch backend that makes the model compatible with fullgraph torch.compile: works with any settings
prepare_for_inference(model)

# Gemlite backend: nbits=4/2/1, compute_dtype=float16, axis=1
prepare_for_inference(model, backend="gemlite")

# Torchao's tinygemm backend (fast for batch-size < 4): nbits=4, compute_dtype=bfloat16, axis=1
prepare_for_inference(model, backend="torchao_int4")
```
Note that these backends only work with `axis=1`. Additional restrictions apply to the group-size values depending on the backend. You should expect ~158 tokens/sec with a Llama3-8B 4-bit quantized model on a 4090 RTX.

When a quantization config is not supported by the specified inference backend, hqq will fall back to the native backend.
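
As a concrete (hedged) example of the `torch.compile` path above, assuming a HF causal-LM `model` that has already been quantized and patched with `prepare_for_inference(model)`, and a matching `tokenizer`:
```Python
import torch

# Compile the forward pass; the default PyTorch backend above is fullgraph-compatible
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# Warm up the compiled graph with a short generation
inputs = tokenizer("Explain HQQ in one sentence.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```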

Custom Quantization Configurations ⚙️

You can set up various quantization configurations for different layers by specifying the settings for each layer name:

Transformers 🤗

```Python
# Each linear layer with the same tag will use a dedicated quantization config
q4_config = {'nbits': 4, 'group_size': 64}
q3_config = {'nbits': 3, 'group_size': 32}

quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,

    'mlp.gate_proj': q3_config,
    'mlp.up_proj':   q3_config,
    'mlp.down_proj': q3_config,
})
```

HQQ lib

```Python
from hqq.core.quantize import *

q4_config = BaseQuantizeConfig(nbits=4, group_size=64)
q3_config = BaseQuantizeConfig(nbits=3, group_size=32)

quant_config = {
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,

    'mlp.gate_proj': q3_config,
    'mlp.up_proj':   q3_config,
    'mlp.down_proj': q3_config,
}
```
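
The per-layer dict can then be passed where a single config would go; a sketch assuming `AutoHQQHFModel.quantize_model` accepts the layer-tag dict as `quant_config`, just like the single config in the HQQ Lib section above:
```Python
import torch
from hqq.models.hf.base import AutoHQQHFModel

# Assumption: quant_config may be the per-layer dict defined above
AutoHQQHFModel.quantize_model(model,
                              quant_config=quant_config,
                              compute_dtype=torch.float16,
                              device='cuda')
```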

VLLM

You can use HQQ in vllm. Make sure to install GemLite before using the backend.

```Python
import torch
from vllm import LLM

# Or you can quantize on-the-fly
from hqq.utils.vllm import set_vllm_onthefly_hqq_quant
skip_modules = ['lm_head', 'visual', 'vision']

# Select one of the following modes:

# INT/FP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weight_only', skip_modules=skip_modules)  # A16W8 - INT8 weight only
set_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128,  quant_mode='int4_weight_only', skip_modules=skip_modules)  # A16W4 - HQQ weight only
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules)                       # A8W8 - INT8 x INT8 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic',  skip_modules=skip_modules)                       # A8W8 - FP8 x FP8 dynamic

# MXFP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)     # A8W8 - MXFP8 x MXFP8 - post_scale=True
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32,   quant_mode='mxfp8_dynamic', skip_modules=skip_modules)     # A8W8 - MXFP8 x MXFP8 - post_scale=False
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weight_only', skip_modules=skip_modules)                  # A16W4 - MXFP4 weight-only
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)                      # A8W4 - MXFP8 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules)                      # A4W4 - MXFP4 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules)                      # A4W4 - NVFP4 x NVFP4 dynamic

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)
```
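
Generation then follows the standard vLLM API; a minimal usage sketch (the prompt and sampling settings are illustrative):
```Python
from vllm import SamplingParams

# Generate with the quantized model loaded above
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is half-quadratic quantization?"], sampling_params)
print(outputs[0].outputs[0].text)
```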

Peft Training

Peft training is directly supported in Hugging Face's peft library. If you still want to use the hqq lib's peft utilities, here's how:

```Python
import torch
from hqq.core.peft import PeftUtils

# First, quantize/load a quantized HQQ model

base_lora_params = {'lora_type': 'default', 'r': 32, 'lora_alpha': 64, 'dropout': 0.05, 'train_dtype': torch.float32}
lora_params = {'self_attn.q_proj': base_lora_params,
               'self_attn.k_proj': base_lora_params,
               'self_attn.v_proj': base_lora_params,
               'self_attn.o_proj': base_lora_params,
               'mlp.gate_proj':  None,
               'mlp.up_proj':    None,
               'mlp.down_proj':  None}

# Add LoRA to linear/HQQ modules
PeftUtils.add_lora(model, lora_params)

# Optional: set your backend
HQQLinear.set_backend(HQQBackend.ATEN if axis == 0 else HQQBackend.PYTORCH_COMPILE)

# Train .... (an illustrative training-step sketch follows this block)

# Convert LoRA weights to the same model dtype for faster inference
model.eval()
PeftUtils.cast_lora_weights(model, dtype=compute_dtype)

# Save LoRA weights
PeftUtils.save_lora_weights(model, filename)

# Load LoRA weights: automatically calls add_lora
PeftUtils.load_lora_weights(model, filename)
```
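
To make the `# Train ....` step concrete, here is an illustrative sketch, not the repo's training recipe: `train_loader` is a hypothetical DataLoader of tokenized batches, and it assumes only the LoRA parameters are left trainable after `PeftUtils.add_lora`:
```Python
import torch

# Optimize only the trainable (LoRA) parameters
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

model.train()
for batch in train_loader:  # hypothetical DataLoader of tokenized batches
    optimizer.zero_grad()
    input_ids = batch['input_ids'].to('cuda')
    out = model(input_ids=input_ids, labels=input_ids)  # causal-LM loss
    out.loss.backward()
    optimizer.step()
```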

We provide a complete example to train a model with HQQ/LoRA that you can find in examples/hqq_plus.py.

If you want to use multi-GPU training via FSDP, check out this awesome repo by Answer.AI: https://github.com/AnswerDotAI/fsdp_qlora

Examples

We provide a variety of examples demonstrating model quantization across different backends within the examples directory.

Citation 📜

```
@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://mobiusml.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
```

Owner

  • Name: mobiusml
  • Login: mobiusml
  • Kind: organization

GitHub Events

Total
  • Create event: 6
  • Commit comment event: 2
  • Release event: 6
  • Issues event: 62
  • Watch event: 180
  • Issue comment event: 211
  • Push event: 90
  • Pull request event: 12
  • Fork event: 20
Last Year
  • Create event: 6
  • Commit comment event: 2
  • Release event: 6
  • Issues event: 62
  • Watch event: 180
  • Issue comment event: 211
  • Push event: 90
  • Pull request event: 12
  • Fork event: 20

Committers

Last synced: 6 months ago

All Time
  • Total Commits: 406
  • Total Committers: 20
  • Avg Commits per committer: 20.3
  • Development Distribution Score (DDS): 0.372
Past Year
  • Commits: 141
  • Committers: 7
  • Avg Commits per committer: 20.143
  • Development Distribution Score (DDS): 0.44
Top Committers
Name Email Commits
mobicham h****m@m****m 255
mobicham 3****m@u****m 113
Kerem Turgutlu k****u@g****m 7
Yeongjae Jang 5****r@u****m 6
Viraat Das v****s@g****m 4
E e****a@g****m 3
Aleksi Ikkala a****a@p****i 2
Benjamin Clavié b****n@c****u 2
Mark Saroufim m****m@m****m 2
fahadh4ilyas f****4@g****m 2
Appu Shaji a****e@g****m 1
Benjamin Warner me@b****v 1
Ikko Eltociear Ashimine e****r@g****m 1
Qubitium q****m@m****i 1
Sian Cao y****y@g****m 1
mark M****1@g****u 1
root m****m 1
xffxff 1****9@q****m 1
yeh-sudo y****n@g****m 1
yiliu30 y****u@i****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 122
  • Total pull requests: 29
  • Average time to close issues: 29 days
  • Average time to close pull requests: 12 days
  • Total issue authors: 86
  • Total pull request authors: 19
  • Average comments per issue: 5.15
  • Average comments per pull request: 3.07
  • Merged pull requests: 21
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 37
  • Pull requests: 11
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 9 days
  • Issue authors: 28
  • Pull request authors: 7
  • Average comments per issue: 6.49
  • Average comments per pull request: 2.27
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kaizizzzzzz (9)
  • Minami-su (5)
  • BeichenHuang (4)
  • kadirnar (4)
  • NEWbie0709 (4)
  • mxjmtxrm (3)
  • ZeleiShao (3)
  • zhangy659 (3)
  • 2U1 (3)
  • mobicham (2)
  • DHKim0428 (2)
  • DominikHil (2)
  • NickyDark1 (2)
  • MarkBenjamin (2)
  • Abdullah-kwl (2)
Pull Request Authors
  • Liberatedwinner (6)
  • sonald (4)
  • fahadh4ilyas (4)
  • aikkala (4)
  • bclavie (3)
  • xffxff (2)
  • KeremTurgutlu (2)
  • larin92 (2)
  • warner-benjamin (2)
  • eltociear (2)
  • yeh-sudo (2)
  • rationalism (2)
  • Qubitium (2)
  • MarkBenjamin (2)
  • anijain2305 (2)
Top Labels
Issue Labels
enhancement (9) help wanted (7) question (7) bug (5) needs more info (1)
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 16,322 last-month
  • Total dependent packages: 2
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 29
  • Total maintainers: 1
proxy.golang.org: github.com/mobiusml/hqq
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 6.4%
Average: 6.6%
Dependent repos count: 6.8%
Last synced: 6 months ago
pypi.org: hqq

Half-Quadratic Quantization (HQQ)

  • Versions: 26
  • Dependent Packages: 2
  • Dependent Repositories: 0
  • Downloads: 16,322 Last month
Rankings
Dependent packages count: 10.1%
Average: 38.5%
Dependent repos count: 66.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

hqq/kernels/setup.py pypi
setup.py pypi
  • accelerate *
  • huggingface_hub *
  • numpy >=1.24.4
  • timm *
  • torch >=2.0.1
  • tqdm >=4.64.1
  • transformers *