https://github.com/mobiusml/hqq
Official implementation of Half-Quadratic Quantization (HQQ)
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 20 committers (5.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.7%) to scientific vocabulary
Keywords
Keywords from Contributors
Repository
Official implementation of Half-Quadratic Quantization (HQQ)
Basic Info
- Host: GitHub
- Owner: mobiusml
- License: apache-2.0
- Language: Python
- Default Branch: master
- Homepage: https://mobiusml.github.io/hqq_blog/
- Size: 606 KB
Statistics
- Stars: 878
- Watchers: 15
- Forks: 84
- Open Issues: 2
- Releases: 29
Topics
Metadata Files
Readme.md
Half-Quadratic Quantization (HQQ)
This repository contains the official implementation of Half-Quadratic Quantization (HQQ) presented in our articles:
- HQQ: https://mobiusml.github.io/hqq_blog/
- HQQ+: https://mobiusml.github.io/1bit_blog/
What is HQQ?
HQQ is a fast and accurate model quantizer that skips the need for calibration data. Quantize the largest models, without calibration data, in just a few minutes at most 🚀.
FAQ
Why should I use HQQ instead of other quantization methods?
- HQQ is very fast at quantizing models.
- It supports 8, 4, 3, 2, and 1 bits.
- You can use it on any model (LLMs, vision models, etc.).
- The dequantization step is a linear operation, which means HQQ is compatible with various optimized CUDA/Triton kernels.
- HQQ is compatible with peft training.
- We try to make HQQ fully compatible with `torch.compile` for faster inference and training.
We have detailed benchmarks on both language and vision models. Please refer to our blog posts: HQQ, HQQ+.
What is the speed of the quantized models?
4-bit models with `axis=1` can use optimized fused inference kernels. Moreover, we focus on making hqq fully compatible with `torch.compile`, which speeds up both training and inference. For more details, please refer to the backend section below.
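As a rough illustration, compiling a quantized model's forward pass might look like the sketch below. This is a minimal sketch, not the repo's prescribed setup: the `mode` and `fullgraph` arguments are illustrative choices, and `model`/`input_ids` are assumed to come from the quantization steps shown later in this README.
```Python
import torch

# Assumes `model` is an already-quantized HQQ model on GPU (see the sections below)
# torch.compile traces and fuses the forward pass; "reduce-overhead" targets small batches
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

with torch.inference_mode():
    out = model(input_ids)  # `input_ids` is a placeholder batch of token ids;
                            # the first call triggers compilation, later calls run faster
```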
What quantization settings should I use?
You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage, and speed. If you want better results with the same VRAM usage, switch to `axis=0` and use the ATEN backend; note that this setting is not supported for fast inference.
What does the `axis` parameter mean?
The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.
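As a concrete sketch of these two options (assuming `BaseQuantizeConfig` accepts `axis` alongside `nbits` and `group_size`, as the repo's examples suggest):
```Python
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

# Recommended starting point: good quality/VRAM/speed balance, compatible with fused inference kernels
config_fast = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Better quality at the same VRAM usage, but no optimized inference path
config_quality = BaseQuantizeConfig(nbits=4, group_size=64, axis=0)
HQQLinear.set_backend(HQQBackend.ATEN)  # the ATEN backend only supports axis=0
```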
What is the difference between HQQ and HQQ+?
HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.
Installation
First, make sure you have a Pytorch 2 version that matches your CUDA version: https://pytorch.org/
You can install hqq via
```
# Latest stable version
pip install hqq;

# Latest updates - recommended
pip install git+https://github.com/mobiusml/hqq.git;

# Disable building the CUDA kernels for the aten backend
DISABLE_CUDA=1 pip install ...
```
Alternatively, clone the repo and run `pip install .` from the current folder.
Basic Usage
To perform quantization with HQQ, you simply need to replace the linear layers (`torch.nn.Linear`) as follows:
```Python
from hqq.core.quantize import *

# Quantization settings
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Replace your linear layer
hqq_layer = HQQLinear(your_linear_layer,           # torch.nn.Linear or None
                      quant_config=quant_config,   # quantization configuration
                      compute_dtype=torch.float16, # compute dtype
                      device='cuda',               # cuda device
                      initialize=True,             # use False to quantize later
                      del_orig=True)               # if True, delete the original layer

W_r = hqq_layer.dequantize()              # dequantize
W_q = hqq_layer.unpack(dtype=torch.uint8) # unpack
y   = hqq_layer(x)                        # forward-pass
```
The quantization parameters are set as follows:
- `nbits` (int): supports 8, 4, 3, 2, 1 bits.
- `group_size` (int): no restrictions as long as `weight.numel()` is divisible by the `group_size`.
- `view_as_float` (bool): if True, the quantized parameter is viewed as a float instead of an int type.
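As a quick check of the `group_size` constraint, here is a minimal sketch; the layer shape is made up for the example, and passing `view_as_float` through `BaseQuantizeConfig` is an assumption based on the parameter list above.
```Python
import torch
from hqq.core.quantize import BaseQuantizeConfig

layer = torch.nn.Linear(4096, 4096, bias=False)  # 4096*4096 = 16,777,216 weights
group_size = 64

# group_size is valid as long as weight.numel() is divisible by it
assert layer.weight.numel() % group_size == 0

quant_config = BaseQuantizeConfig(nbits=4, group_size=group_size, view_as_float=False)
```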
Usage with Models
Transformers 🤗
For usage with HF's transformers, see the example below from the documentation:
```Python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

# All linear layers will use the same quantization config
quant_config = HqqConfig(nbits=4, group_size=64)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config
)
```
You can save/load quantized models as regular transformers models via `save_pretrained`/`from_pretrained`.
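For instance, a round-trip might look like the sketch below; `save_dir` is a placeholder path, and the tokenizer line assumes a matching tokenizer was loaded alongside the model.
```Python
save_dir = "llama3-8b-hqq-4bit"  # hypothetical output directory

model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)  # assumes a tokenizer was loaded earlier

# Reload the quantized model like any other transformers checkpoint
model = AutoModelForCausalLM.from_pretrained(save_dir, device_map="cuda")
```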
HQQ Lib
You can also utilize the HQQ library to quantize transformers models:
```Python
# Load the model on CPU
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=compute_dtype)

# Quantize
from hqq.models.hf.base import AutoHQQHFModel
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config, compute_dtype=compute_dtype, device=device)
```
You can save/load quantized models as follows:
```Python
from hqq.models.hf.base import AutoHQQHFModel

# Save: make sure to save the model BEFORE any patching
AutoHQQHFModel.save_quantized(model, save_dir)

# Save as safetensors (to be loaded via transformers or vllm)
AutoHQQHFModel.save_to_safetensors(model, save_dir)

# Load
model = AutoHQQHFModel.from_quantized(save_dir)
```
❗ Note that models saved via the hqq lib are not compatible with `.from_pretrained()`.
Backends
Native Backends
The following native dequantization backends can be used by the HQQLinear module:
```Python
HQQLinear.set_backend(HQQBackend.PYTORCH)         # Pytorch backend - default
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE) # Compiled Pytorch
HQQLinear.set_backend(HQQBackend.ATEN)            # Aten/CUDA backend - only axis=0 supported
```
❗ Note that HQQBackend.ATEN only supports axis=0.
Optimized Inference
We support external backends for faster inference with fused kernels. You can enable one of the backends after the model was quantized as follows:
```Python
from hqq.utils.patching import prepare_for_inference

# Pytorch backend that makes the model compatible with fullgraph torch.compile: works with any settings
prepare_for_inference(model)

# Gemlite backend: nbits=4/2/1, compute_dtype=float16, axis=1
prepare_for_inference(model, backend="gemlite")

# Torchao's tinygemm backend (fast for batch-size < 4): nbits=4, compute_dtype=bfloat16, axis=1
prepare_for_inference(model, backend="torchao_int4")
```
Note that these backends only work with `axis=1`. Additional restrictions apply to the group-size values depending on the backend. You should expect ~158 tokens/sec with a Llama3-8B 4-bit quantized model on a 4090 RTX.
When a quantization config is not supported by the specified inference backend, hqq will fall back to the native backend.
Custom Quantization Configurations ⚙️
You can set up various quantization configurations for different layers by specifying the settings for each layer name:
Transformers 🤗
```Python
# Each linear layer with the same tag will use a dedicated quantization config
q4_config = {'nbits':4, 'group_size':64}
q3_config = {'nbits':3, 'group_size':32}

quant_config = HqqConfig(dynamic_config={
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,

    'mlp.gate_proj': q3_config,
    'mlp.up_proj'  : q3_config,
    'mlp.down_proj': q3_config,
})
```
HQQ lib
```Python
from hqq.core.quantize import *

q4_config = BaseQuantizeConfig(nbits=4, group_size=64)
q3_config = BaseQuantizeConfig(nbits=3, group_size=32)

quant_config = {
    'self_attn.q_proj': q4_config,
    'self_attn.k_proj': q4_config,
    'self_attn.v_proj': q4_config,
    'self_attn.o_proj': q4_config,

    'mlp.gate_proj': q3_config,
    'mlp.up_proj'  : q3_config,
    'mlp.down_proj': q3_config,
}
```
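To apply such a per-layer dictionary with the hqq lib, a plausible sketch is shown below; it assumes `AutoHQQHFModel.quantize_model` accepts a layer-name-to-config mapping in place of a single `BaseQuantizeConfig`, and that `model` was loaded as in the "HQQ Lib" section above.
```Python
import torch
from hqq.models.hf.base import AutoHQQHFModel

# Assumption: quant_config is the per-layer dict built above
AutoHQQHFModel.quantize_model(model,
                              quant_config=quant_config,
                              compute_dtype=torch.float16,
                              device='cuda')
```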
VLLM
You can use HQQ in vllm. Make sure to install GemLite before using the backend.
```Python
import torch
from vllm import LLM

# Or you can quantize on-the-fly
from hqq.utils.vllm import set_vllm_onthefly_hqq_quant
skip_modules = ['lm_head', 'visual', 'vision']

# Select one of the following modes:

# INT/FP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='int8_weightonly', skip_modules=skip_modules)  # A16W8 - INT8 weight only
set_vllm_onthefly_hqq_quant(weight_bits=4, group_size=128,  quant_mode='int4_weightonly', skip_modules=skip_modules)  # A16W4 - HQQ weight only
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='int8_dynamic', skip_modules=skip_modules)  # A8W8 - INT8 x INT8 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=8, quant_mode='fp8_dynamic',  skip_modules=skip_modules)  # A8W8 - FP8 x FP8 dynamic

# MXFP format
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=None, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)  # A8W8 - MXFP8 x MXFP8 - post_scale=True
set_vllm_onthefly_hqq_quant(weight_bits=8, group_size=32,   quant_mode='mxfp8_dynamic', skip_modules=skip_modules)  # A8W8 - MXFP8 x MXFP8 - post_scale=False
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_weightonly', skip_modules=skip_modules)  # A16W4 - MXFP4 weight-only
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp8_dynamic', skip_modules=skip_modules)  # A8W4 - MXFP8 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='mxfp4_dynamic', skip_modules=skip_modules)  # A4W4 - MXFP4 x MXFP4 dynamic
set_vllm_onthefly_hqq_quant(weight_bits=4, quant_mode='nvfp4_dynamic', skip_modules=skip_modules)  # A4W4 - NVFP4 x NVFP4 dynamic

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", max_model_len=4096, gpu_memory_utilization=0.80, dtype=torch.float16)
```
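As a short usage follow-up (hypothetical prompt and sampling settings), the quantized engine is then used like any other vLLM model:
```Python
from vllm import SamplingParams

prompts = ["Explain half-quadratic quantization in one sentence."]  # hypothetical prompt
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=64))
print(outputs[0].outputs[0].text)
```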
Peft Training
Peft training is directly supported in the HuggingFace's peft library. If you still want to use hqq-lib's peft utilities, here's how:
```Python
# First, quantize/load a quantized HQQ model as shown above
from hqq.core.peft import PeftUtils

base_lora_params = {'lora_type':'default', 'r':32, 'lora_alpha':64, 'dropout':0.05, 'train_dtype':torch.float32}
lora_params      = {'self_attn.q_proj': base_lora_params,
                    'self_attn.k_proj': base_lora_params,
                    'self_attn.v_proj': base_lora_params,
                    'self_attn.o_proj': base_lora_params,
                    'mlp.gate_proj'   : None,
                    'mlp.up_proj'     : None,
                    'mlp.down_proj'   : None}

# Add LoRA to linear/HQQ modules
PeftUtils.add_lora(model, lora_params)

# Optional: set your backend
HQQLinear.set_backend(HQQBackend.ATEN if axis == 0 else HQQBackend.PYTORCH_COMPILE)

# Train ....

# Convert LoRA weights to the same model dtype for faster inference
model.eval()
PeftUtils.cast_lora_weights(model, dtype=compute_dtype)

# Save LoRA weights
PeftUtils.save_lora_weights(model, filename)

# Load LoRA weights: automatically calls add_lora
PeftUtils.load_lora_weights(model, filename)
```
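The `Train ....` step above is left open in the snippet; a minimal sketch of what it could look like is given below, with a hypothetical `dataloader`, an AdamW optimizer, and the loss taken from the model outputs. This is not the repo's training script; see examples/hqq_plus.py for the complete example.
```Python
import torch

# Hypothetical setup: optimize only the parameters left trainable (the LoRA weights)
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

model.train()
for batch in dataloader:  # hypothetical dataloader yielding tokenized batches
    optimizer.zero_grad()
    input_ids = batch["input_ids"].cuda()
    out = model(input_ids=input_ids, labels=input_ids)  # causal LM loss
    out.loss.backward()
    optimizer.step()
```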
We provide a complete example to train a model with HQQ/LoRA that you can find in examples/hqq_plus.py.
If you want to use multi-GPU training via FSDP, check out this awesome repo by Answer.AI: https://github.com/AnswerDotAI/fsdp_qlora
Examples
We provide a variety of examples demonstrating model quantization across different backends within the examples directory.
Citation 📜
```
@misc{badri2023hqq,
  title  = {Half-Quadratic Quantization of Large Machine Learning Models},
  url    = {https://mobiusml.github.io/hqq_blog/},
  author = {Hicham Badri and Appu Shaji},
  month  = {November},
  year   = {2023}
}
```
Owner
- Name: mobiusml
- Login: mobiusml
- Kind: organization
- Repositories: 5
- Profile: https://github.com/mobiusml
GitHub Events
Total
- Create event: 6
- Commit comment event: 2
- Release event: 6
- Issues event: 62
- Watch event: 180
- Issue comment event: 211
- Push event: 90
- Pull request event: 12
- Fork event: 20
Last Year
- Create event: 6
- Commit comment event: 2
- Release event: 6
- Issues event: 62
- Watch event: 180
- Issue comment event: 211
- Push event: 90
- Pull request event: 12
- Fork event: 20
Committers
Last synced: 6 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| mobicham | h****m@m****m | 255 |
| mobicham | 3****m@u****m | 113 |
| Kerem Turgutlu | k****u@g****m | 7 |
| Yeongjae Jang | 5****r@u****m | 6 |
| Viraat Das | v****s@g****m | 4 |
| E | e****a@g****m | 3 |
| Aleksi Ikkala | a****a@p****i | 2 |
| Benjamin Clavié | b****n@c****u | 2 |
| Mark Saroufim | m****m@m****m | 2 |
| fahadh4ilyas | f****4@g****m | 2 |
| Appu Shaji | a****e@g****m | 1 |
| Benjamin Warner | me@b****v | 1 |
| Ikko Eltociear Ashimine | e****r@g****m | 1 |
| Qubitium | q****m@m****i | 1 |
| Sian Cao | y****y@g****m | 1 |
| mark | M****1@g****u | 1 |
| root | m****m | 1 |
| xffxff | 1****9@q****m | 1 |
| yeh-sudo | y****n@g****m | 1 |
| yiliu30 | y****u@i****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 122
- Total pull requests: 29
- Average time to close issues: 29 days
- Average time to close pull requests: 12 days
- Total issue authors: 86
- Total pull request authors: 19
- Average comments per issue: 5.15
- Average comments per pull request: 3.07
- Merged pull requests: 21
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 37
- Pull requests: 11
- Average time to close issues: about 1 month
- Average time to close pull requests: 9 days
- Issue authors: 28
- Pull request authors: 7
- Average comments per issue: 6.49
- Average comments per pull request: 2.27
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- kaizizzzzzz (9)
- Minami-su (5)
- BeichenHuang (4)
- kadirnar (4)
- NEWbie0709 (4)
- mxjmtxrm (3)
- ZeleiShao (3)
- zhangy659 (3)
- 2U1 (3)
- mobicham (2)
- DHKim0428 (2)
- DominikHil (2)
- NickyDark1 (2)
- MarkBenjamin (2)
- Abdullah-kwl (2)
Pull Request Authors
- Liberatedwinner (6)
- sonald (4)
- fahadh4ilyas (4)
- aikkala (4)
- bclavie (3)
- xffxff (2)
- KeremTurgutlu (2)
- larin92 (2)
- warner-benjamin (2)
- eltociear (2)
- yeh-sudo (2)
- rationalism (2)
- Qubitium (2)
- MarkBenjamin (2)
- anijain2305 (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: pypi 16,322 last-month
- Total dependent packages: 2 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 29
- Total maintainers: 1
proxy.golang.org: github.com/mobiusml/hqq
- Documentation: https://pkg.go.dev/github.com/mobiusml/hqq#section-documentation
- License: apache-2.0
- Latest release: v0.2.8 (published 6 months ago)
Rankings
pypi.org: hqq
Half-Quadratic Quantization (HQQ)
- Homepage: https://github.com/mobiusml/hqq/
- Documentation: https://hqq.readthedocs.io/
- License: Apache 2
- Latest release: 0.2.8 (published 6 months ago)
Rankings
Maintainers (1)
Dependencies
- accelerate *
- huggingface_hub *
- numpy >=1.24.4
- timm *
- torch >=2.0.1
- tqdm >=4.64.1
- transformers *