Recent Releases of https://github.com/mobiusml/hqq

https://github.com/mobiusml/hqq - v0.2.8

Bug fixes:

  • Fix static cache init with the new transformers version
  • Add mxfp vLLM patching utils
  • Improve CUDA graphs/compile settings for transformer models

- Python
Published by mobicham 6 months ago

https://github.com/mobiusml/hqq - v0.2.7.post1

Bug fixes:

  • HIP graph fix in generation: https://github.com/mobiusml/hqq/commit/bc8f4c7d778a0cdbfe115299ea7253ed28948d31
  • Fix HQQLinear with None linear inputs: https://github.com/mobiusml/hqq/commit/3b86ac950f699a4ca3584cb18bea023b2f5e1da9

- Python
Published by mobicham 8 months ago

https://github.com/mobiusml/hqq - v0.2.7

  • Fix NaN bug when max - min is very small: https://github.com/mobiusml/hqq/commit/373cbea93892cb491a3c072e0036a37848926404
  • Add a DISABLE_CUDA=1 env variable to disable building the CUDA kernels for the aten backend, which allows a faster pip build: https://github.com/mobiusml/hqq/commit/861f6906a2ebf4c864603d7eebd2091b9beb2a77
  • Improve memory usage: https://github.com/mobiusml/hqq/commit/a566c78961ea408c747ad2a9bd4f3a9235ff3b70
  • Fix vLLM torch fallback logic: https://github.com/mobiusml/hqq/commit/d3f14b494eb9939e05a7aba854796eab13da3d3b

- Python
Published by mobicham 9 months ago

https://github.com/mobiusml/hqq - v0.2.6

  • Fix CUDA build
  • torch.compile() support for hqq_aten (see the sketch below)
  • bfloat16 support for vLLM/HQQ
  • Update vLLM utils to support the hqq_gemlite and hqq_torch aliases
  • Fix vLLM v1 issues
  • Extend save_to_safetensors to VLMs

Full Changelog: https://github.com/mobiusml/hqq/compare/v0.2.5...0.2.6
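
A minimal sketch of how the torch.compile() + hqq_aten combination might be used, assuming the CUDA/aten kernels are built and a CUDA device is available; the model id is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any HF causal LM should behave similarly
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 4-bit HQQ quantization of all linear layers
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# Use the aten backend; per this release, its dequant path works with torch.compile
HQQLinear.set_backend(HQQBackend.ATEN)
model.forward = torch.compile(model.forward, mode="reduce-overhead")
```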

- Python
Published by mobicham 9 months ago

https://github.com/mobiusml/hqq - v0.2.5

  • Fix .name in backends
  • Skip invalid GemLite in/out feature sizes in vLLM patching
  • Faster vLLM packing via GemLite

- Python
Published by mobicham 11 months ago

https://github.com/mobiusml/hqq - v.0.2.3.post1

Bug fixes:

  • Check W_q in the state dict to fix a PEFT issue: https://github.com/mobiusml/hqq/issues/151
  • Fix bugs related to AutoHQQHFModel.save_to_safetensors

- Python
Published by mobicham about 1 year ago

https://github.com/mobiusml/hqq - v0.2.3

  • vLLM support via patching: GemLite backend + on-the-fly quantization
  • Add support for Aria
  • Add support for loading quantized SequenceClassification models
  • Faster decoding (custom CUDA graphs, SDPA math backend, etc.)
  • Fix torch.compile and hf_generator bugs related to newer transformers versions
  • Fix bugs related to saving quantized models with no grouping
  • Fix bugs related to saving large quantized models
  • Update examples
  • Add support for HQQLinear.to(device), as shown in the sketch below
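
A minimal sketch of the new HQQLinear.to(device) support, assuming a CUDA machine (the layer shape is arbitrary):

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Quantize a single linear layer to 4-bit
linear = torch.nn.Linear(4096, 4096, bias=False)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float16, device="cuda")

# New in this release: HQQLinear layers can be moved like regular nn.Modules
hqq_layer = hqq_layer.to("cpu")
hqq_layer = hqq_layer.to("cuda")

x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
y = hqq_layer(x)
```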

- Python
Published by mobicham about 1 year ago

https://github.com/mobiusml/hqq - v0.2.2

HQQ v0.2.2

  • Support static cache compilation without using HFGenerator (see the sketch below)
  • Fix various issues related to torch.compile
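
A minimal sketch of compiled generation with a static cache, without going through HFGenerator, assuming the transformers HQQ integration and a CUDA device (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=HqqConfig(nbits=4, group_size=64),
)

# Static KV cache + torch.compile, without hqq's HFGenerator
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```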

- Python
Published by mobicham over 1 year ago

https://github.com/mobiusml/hqq - v.0.2.1

HQQ v0.2.1

  • HQQLinear.state_dict() for non-initialized layers. Mainly used for https://github.com/huggingface/transformers/pull/33141

- Python
Published by mobicham over 1 year ago

https://github.com/mobiusml/hqq - v.0.2.0

HQQ v0.2.0

  • Bug fixes
  • Safetensors support for transformers via https://github.com/huggingface/transformers/pull/33141 (see the sketch below)
  • quant_scale, quant_zero, and offload_meta are now deprecated. You can still use them with the hqq library, but not with the transformers integration.
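
A minimal sketch of the transformers path this enables, assuming a transformers version that includes the linked PR and a CUDA device (the model id is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder

# Quantize on load via the transformers HQQ integration
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=HqqConfig(nbits=4, group_size=64),
)

# With the linked PR, the quantized model can be saved to safetensors and reloaded
model.save_pretrained("llama2-7b-hqq-4bit")
reloaded = AutoModelForCausalLM.from_pretrained(
    "llama2-7b-hqq-4bit", torch_dtype=torch.float16, device_map="cuda"
)
```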

- Python
Published by mobicham over 1 year ago

https://github.com/mobiusml/hqq - v.0.1.8

HQQ v0.1.8

  • Add BitBlas backend support
  • Simpler HQQLinear construction from raw weights: HQQLinear.from_weights(W, bias, etc.) (see the sketch below)
  • Fix memory leak while swapping layers for the TorchAO backend
  • Add an HQQLinear.unpack() call
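
A minimal sketch of the new from_weights/unpack helpers, assuming a CUDA device; keyword names beyond W and bias are assumptions based on common hqq usage and may differ slightly across versions:

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Raw weights for an arbitrary 4096x4096 linear layer
W = torch.randn(4096, 4096, dtype=torch.float16)
bias = None

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Build an HQQLinear directly from the weights, without an nn.Linear wrapper
hqq_layer = HQQLinear.from_weights(W, bias, quant_config,
                                   compute_dtype=torch.float16, device="cuda")

# unpack() (added in this release) recovers the quantized weights from the packed storage
W_q = hqq_layer.unpack()
print(W_q.shape, W_q.dtype)
```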

- Python
Published by mobicham over 1 year ago

https://github.com/mobiusml/hqq - v0.1.7.post3

HQQ v0.1.7.post3

  • Enable CPU quantization and runtime
  • Fix _load_state_dict
  • Fix extra_repr in HQQLinear
  • Fix from_quantized bugs
  • Fix | typing
  • Fix 3-bit axis=1 slicing bug
  • Add 5/6-bit for testing

- Python
Published by mobicham over 1 year ago

https://github.com/mobiusml/hqq - v0.1.7.post2

HQQ v0.1.7.post2

  • Various bug fixes, especially with AutoHQQHFModel and the patching logic, to make it work with any transformers model.
  • Readme refactoring.
  • Whisper example.

- Python
Published by mobicham almost 2 years ago

https://github.com/mobiusml/hqq - v0.1.7

HQQ v0.1.7

  • Faster inference with torchao / Marlin 4-bit kernels
  • Multi-GPU support for model.quantize()
  • Custom HF generator (see the sketch below)
  • Various bug fixes/improvements
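
A minimal sketch combining the torchao 4-bit kernels with the new HF generator, assuming a CUDA device; the model id is a placeholder and details such as compile="partial" follow the project README of that period:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig
from hqq.utils.patching import prepare_for_inference
from hqq.utils.generation_hf import HFGenerator

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 4-bit HQQ quantization (axis=1 / group_size=64 matches the torchao int4 kernel layout)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")
prepare_for_inference(model, backend="torchao_int4")

# The custom HF generator introduced in this release
gen = HFGenerator(model, tokenizer, max_new_tokens=128, do_sample=False, compile="partial").warmup()
out = gen.generate("Explain half-quadratic quantization in one paragraph.", print_tokens=False)
```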

- Python
Published by mobicham almost 2 years ago

https://github.com/mobiusml/hqq - v0.1.6.post2

HQQ v0.1.6.post2

Same as v0.1.6 with setup.py fixes:

  • find_packages fix: https://github.com/mobiusml/hqq/pull/25
  • Auto-build CUDA kernels via pypi package: https://github.com/mobiusml/hqq/pull/26

- Python
Published by mobicham almost 2 years ago

https://github.com/mobiusml/hqq - v0.1.6.post1

HQQ v0.1.6.post1

Same as v0.1.6 with a find_packages fix https://github.com/mobiusml/hqq/pull/25

- Python
Published by mobicham almost 2 years ago

https://github.com/mobiusml/hqq - v0.1.6

HQQ v0.1.6

Use v0.1.6.post1 instead, unless you clone the repo first and then install.

Features

  • Quantize on target device.
  • Meta-offloading uses pinned memory for faster/async transfers.
  • Loading saved LoRA weights automatically adds LoRA modules if not already present.
  • pip install automatically compiles the CUDA kernels now.
  • CUDA backend automatically detected and used when available.
  • You can quantize any HF model automatically via AutoHQQHFModel (see the sketch below).
  • Faster meta-offloading with CUDA streams (experimental).
  • Int8 matmul (experimental).
  • Shared memory CUDA kernels (experimental).
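
A minimal sketch of quantizing an arbitrary HF model directly on the target device with AutoHQQHFModel, then saving and reloading it; the model id is a placeholder and keyword names follow common hqq usage:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any HF model should work
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Quantization now runs directly on the target device
quant_config = BaseQuantizeConfig(nbits=2, group_size=16)
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# Save and reload the quantized model
AutoHQQHFModel.save_quantized(model, "llama2-7b-hqq-2bit")
model = AutoHQQHFModel.from_quantized("llama2-7b-hqq-2bit", device="cuda")
```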

Bugs

  • Fix Peft bias dtype.
  • Removed auto backend setting in LoRA.
  • All HQQLinear dtype/device-related overloads now return self which should solve a couple of issues.

Other

  • Refactor backends (using backprop backends by default now).
  • Added typing.
  • Ruff fix and reformat all Python files.
  • Refactor ATEN for reference tensors.

Issues

  • Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B 2-bit/gs=16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
  • Shared-memory CUDA kernels are, for some reason, a bit slower than the kernels without shared memory.
  • The block size setting doesn't have much influence on the speed.
  • Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the Aten/CUDA side.

- Python
Published by mobicham almost 2 years ago

https://github.com/mobiusml/hqq - v0.1.5

HQQ v0.1.5

New features

  • Added support for multi-GPU FSDP QLoRA training (https://github.com/mobiusml/hqq/pull/17); see the sketch below.
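
A minimal sketch of the view_as_float option the FSDP path relies on; the exact kwarg placement on BaseQuantizeConfig is an assumption, and the linked PR has the full training setup:

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# view_as_float stores the packed integer weights viewed as the compute dtype,
# which lets FSDP shard the quantized parameters during QLoRA training
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, view_as_float=True)

linear = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.bfloat16, device="cuda")
```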

Issues

  • torch.compile and the PYTORCH_COMPILE backend break with view_as_float=True. No known solution for the moment.
  • Slightly slower inference with view_as_float=True. Solution: after training, the user can revert back to int bitpacking.

- Python
Published by mobicham almost 2 years ago

https://github.com/mobiusml/hqq - v0.1.4

HQQ v0.1.4

New features

  • Added 1-bit support with CUDA dequant kernels.

- Python
Published by mobicham almost 2 years ago

https://github.com/mobiusml/hqq - v0.1.3.post1

HQQ v0.1.3.post1

New features

  • meta_offloading support: allows offloading the meta-data to the CPU, hence achieving true n-bit storage on the GPU (see the sketch below).
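
A minimal sketch of enabling meta-data offloading, assuming the offload_meta flag on BaseQuantizeConfig (later deprecated in v0.2.0):

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# offload_meta keeps the quantization meta-data (scales/zeros) on the CPU,
# so only the packed n-bit weights occupy GPU memory
quant_config = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)

linear = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float16, device="cuda")
```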

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - v0.1.3

HQQ v0.1.3

New features

  • Added CUDA kernels for dequantization (up to 2-3x inference speed-up vs. Pytorch)
  • Added support for compute_dtype parameter (useful for float32/bfloat16 LoRA training)
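
A minimal sketch of the compute_dtype parameter, assuming it is passed at HQQLinear construction as in later versions:

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = torch.nn.Linear(4096, 4096, bias=False)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# compute_dtype controls the dtype used for dequantization and the matmul,
# e.g. bfloat16 for LoRA training
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.bfloat16, device="cuda")

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
y = hqq_layer(x)
```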

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - v0.1.2.post1

HQQ v0.1.2.post1

Bug fixes

  • Fixed LoRA adapter loading.

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - v0.1.2

HQQ v0.1.2

Improvements

  • Added LoRA support (see the sketch after this list)
  • Added LoRA with fake quantization support (experimental)
  • Optimizer V2 with scale update support
  • Some code refactoring in quantize.py
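
A minimal sketch of the LoRA support via hqq's PeftUtils, assuming a Llama-style layer layout; the per-layer parameter structure follows the project README, and the quantization step borrows the AutoHQQHFModel helper from later releases for brevity:

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig
from hqq.core.peft import PeftUtils

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
AutoHQQHFModel.quantize_model(model, quant_config=BaseQuantizeConfig(nbits=4, group_size=64),
                              compute_dtype=torch.float16, device="cuda")

# Per-layer LoRA parameters; None disables LoRA for that layer
base_lora = {"lora_type": "default", "r": 16, "lora_alpha": 32,
             "dropout": 0.05, "train_dtype": torch.float32}
lora_params = {"self_attn.q_proj": base_lora, "self_attn.k_proj": base_lora,
               "self_attn.v_proj": base_lora, "self_attn.o_proj": base_lora,
               "mlp.gate_proj": None, "mlp.up_proj": None, "mlp.down_proj": None}

PeftUtils.add_lora(model, lora_params)              # attach LoRA modules to the HQQ layers
# ... training loop ...
PeftUtils.save_lora_weights(model, "lora_weights.pt")
```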

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - v0.1.1.post1

HQQ v0.1.1.post1

No improvements over v0.1.1. Just removed Pytorch from the dependencies and updated the Readme.

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - v0.1.1

HQQ v0.1.1

Improvements:

  • Added Mixtral support for Hugging Face.
  • Added support for layer-wise custom quantization configs (see the sketch below).
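
A minimal sketch of layer-wise quantization configs, assuming a Llama-style layer layout and the dict-of-configs pattern from the project README; the AutoHQQHFModel helper is from a later release (early versions used model-specific wrappers):

```python
import torch
from transformers import AutoModelForCausalLM
from hqq.models.hf.base import AutoHQQHFModel
from hqq.core.quantize import BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Different settings per layer type: 4-bit attention, 3-bit MLP
q4 = BaseQuantizeConfig(nbits=4, group_size=64)
q3 = BaseQuantizeConfig(nbits=3, group_size=32)
quant_config = {"self_attn.q_proj": q4, "self_attn.k_proj": q4,
                "self_attn.v_proj": q4, "self_attn.o_proj": q4,
                "mlp.gate_proj": q3, "mlp.up_proj": q3, "mlp.down_proj": q3}

AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")
```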

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - v0.1.0

HQQ v0.1.0

Improvements

  • Added compile backend support
  • Added Aten C++ backend (experimental)
  • Faster bit unpacking via pre-allocated empty tensor
  • Added VLLM support
  • Refactoring to call quantize_model() on instances

Supported models

  • Llama (Hugging Face + VLLM)
  • ViT-CLIP (timm)

Limitations

  • HF only supports single GPU runtime.
  • VLLM only supports single GPU with a single worker.
  • The compile backend sometimes creates issues with the async runtime.
  • Doesn't support PEFT (LoRA, etc.).

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - 0.1.0-alpha

HQQ 0.1.0-alpha

Alpha version with basic Hugging Face/Timm support.

Supported models:

  • Llama (Hugging Face)
  • ViT (timm)

Limitations:

  • Uses a pure Pytorch implementation without optimizations.
  • Only supports single GPU runtime.
  • Doesn't support Peft (LoRA, etc.) for custom training.

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - HQQ v1.0.0

Limitations:

  • Only supports single GPU runtime.
  • Not compatible with Hugging Face's Peft.

- Python
Published by mobicham about 2 years ago

https://github.com/mobiusml/hqq - HQQ v1.0.0

HQQ v1.0.0

Limitations:

  • Only supports single GPU runtime with Pytorch.
  • Not compatible with Hugging Face's Peft.

- Python
Published by mobicham about 2 years ago