Recent Releases of https://github.com/mobiusml/hqq
https://github.com/mobiusml/hqq - v0.2.8
Bug fixes:
- Fix static cache init with the new transformers version
- Add mxfp vLLM patching utils
- Improve CUDA graphs/compile settings for transformer models
- Python
Published by mobicham 6 months ago
https://github.com/mobiusml/hqq - v0.2.7.post1
Bug fixes:
-HIP graph fix in generation: https://github.com/mobiusml/hqq/commit/bc8f4c7d778a0cdbfe115299ea7253ed28948d31
-Fix HQQLinear with None linear inputs: https://github.com/mobiusml/hqq/commit/3b86ac950f699a4ca3584cb18bea023b2f5e1da9
- Python
Published by mobicham 8 months ago
https://github.com/mobiusml/hqq - v0.2.7
- Fix NaN bug when max - min is very small: https://github.com/mobiusml/hqq/commit/373cbea93892cb491a3c072e0036a37848926404
- Add DISABLE_CUDA=1 env variable to disable building CUDA kernels for the aten backend, allowing faster pip builds: https://github.com/mobiusml/hqq/commit/861f6906a2ebf4c864603d7eebd2091b9beb2a77
- Improve memory usage: https://github.com/mobiusml/hqq/commit/a566c78961ea408c747ad2a9bd4f3a9235ff3b70
- Fix vLLM torch fallback logic: https://github.com/mobiusml/hqq/commit/d3f14b494eb9939e05a7aba854796eab13da3d3b
- Python
Published by mobicham 9 months ago
https://github.com/mobiusml/hqq - v0.2.6
- Fix CUDA build
- torch.compile() support for hqq_aten
- bfloat16 support for vLLM/HQQ
- Update vLLM utils to support the hqq_gemlite and hqq_torch aliases
- Fix vLLM v1 issues
- Extend save_to_safetensors to VLMs
Full Changelog: https://github.com/mobiusml/hqq/compare/v0.2.5...0.2.6
- Python
Published by mobicham 9 months ago
https://github.com/mobiusml/hqq - v0.2.5
- Fix .name in backends
- Skip gemlite invalid in/out feature sizes in VLLM patching
- Faster VLLM packing via GemLite
- Python
Published by mobicham 11 months ago
https://github.com/mobiusml/hqq - v.0.2.3.post1
Bug fixes:
- Check W_q in state dict to fix peft issue https://github.com/mobiusml/hqq/issues/151
- Fix bugs related to AutoHQQHFModel.save_to_safetensors
- Python
Published by mobicham about 1 year ago
https://github.com/mobiusml/hqq - v0.2.3
- vLLM support via patching
- GemLite backend + on-the-fly quantization
- Add support for Aria
- Add support for loading quantized SequenceClassification models
- Faster decoding via custom CUDA graphs, SDPA math backend, etc.
- Fix bugs related to torch.compile and hf_generator with newer transformers versions
- Fix bugs related to saving quantized models with no grouping
- Fix bugs related to saving large quantized models
- Update examples
- Add support for HQQLinear.to(device)
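As a rough illustration of the HQQLinear API and the new .to(device) support, here is a minimal sketch; layer sizes and quantization settings are arbitrary and a CUDA device is assumed to be available.

    import torch
    import torch.nn as nn
    from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

    # Quantize a single linear layer to 4-bit with group size 64
    linear = nn.Linear(4096, 4096, bias=False)
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
    hqq_layer = HQQLinear(linear, quant_config=quant_config,
                          compute_dtype=torch.float16, device="cuda")

    # New in this release: the quantized layer can be moved between devices
    hqq_layer = hqq_layer.to("cpu")
    hqq_layer = hqq_layer.to("cuda")

    # Forward pass in the chosen compute dtype
    y = hqq_layer(torch.randn(2, 4096, dtype=torch.float16, device="cuda"))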
- Python
Published by mobicham about 1 year ago
https://github.com/mobiusml/hqq - v0.2.2
HQQ v0.2.2
- Support static cache compilation without using HFGenerator (see the sketch below)
- Fix various issues related to torch.compile
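A minimal sketch of the static-cache + torch.compile path driven by plain transformers generate() instead of HFGenerator; the checkpoint name is a placeholder and the cache/compile settings follow the standard transformers recipe rather than anything hqq-specific.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

    model_id = "meta-llama/Llama-3.2-1B"  # placeholder; any supported causal LM
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=HqqConfig(nbits=4, group_size=64),
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Static KV cache + compiled forward, without the HFGenerator wrapper
    model.generation_config.cache_implementation = "static"
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(out[0], skip_special_tokens=True))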
- Python
Published by mobicham over 1 year ago
https://github.com/mobiusml/hqq - v.0.2.1
HQQ v0.2.1
- HQQLinear.state_dict() for non-initialized layers. Mainly used for https://github.com/huggingface/transformers/pull/33141
- Python
Published by mobicham over 1 year ago
https://github.com/mobiusml/hqq - v.0.2.0
HQQ v0.2.0
- Bug fixes
- Safetensors support for transformers via https://github.com/huggingface/transformers/pull/33141
- quant_scale, quant_zero and offload_meta are now deprecated. You can still use them with the hqq lib, but you can't use them with the transformers lib.
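A rough sketch of what the safetensors path looks like from the transformers side once the PR above is in; the checkpoint name is a placeholder, and per the deprecation note the quant_scale/quant_zero/offload_meta options are left at their defaults.

    import torch
    from transformers import AutoModelForCausalLM, HqqConfig

    model_id = "meta-llama/Llama-3.2-1B"  # placeholder checkpoint
    quant_config = HqqConfig(nbits=4, group_size=64)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="cuda",
        quantization_config=quant_config,
    )

    # Serialize the quantized model to safetensors and reload it like any other checkpoint
    model.save_pretrained("llama-hqq-4bit")
    model = AutoModelForCausalLM.from_pretrained("llama-hqq-4bit", device_map="cuda")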
- Python
Published by mobicham over 1 year ago
https://github.com/mobiusml/hqq - v.0.1.8
HQQ v0.1.8
- Add BitBlas backend support
- Simpler HQQLinear creation from weights: HQQLinear.from_weights(W, bias, etc.); see the sketch below
- Fix memory leak while swapping layers for the TorchAO backend
- Add HQQLinear.unpack() call
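A small sketch of the new entry points. The release note only spells out from_weights(W, bias, ...), so the remaining keyword arguments below (mirroring the regular HQQLinear constructor) and the exact return value of unpack() are assumptions.

    import torch
    from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

    W = torch.randn(4096, 4096, dtype=torch.float16)
    bias = None
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

    # Build an HQQLinear directly from a weight tensor instead of an nn.Linear module
    hqq_layer = HQQLinear.from_weights(W, bias, quant_config=quant_config,
                                       compute_dtype=torch.float16, device="cuda")

    # New unpack() call: recover the unpacked (still quantized) weight tensor
    W_unpacked = hqq_layer.unpack()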
- Python
Published by mobicham over 1 year ago
https://github.com/mobiusml/hqq - v0.1.7.post3
HQQ v0.1.7.post3
- Enable CPU quantization and runtime
- _load_state_dict fix
- Fix extra_repr in HQQLinear
- Fix from_quantized bugs
- Fix | typing
- Fix 3-bit axis=1 slicing bug
- Add 5/6-bit for testing
- Python
Published by mobicham over 1 year ago
https://github.com/mobiusml/hqq - v0.1.7.post2
HQQ v0.1.7.post2
- Various bug fixes, especially with AutoHQQHFModel and the patching logic, to make it work with any transformers model.
- Readme refactoring.
- Whisper example.
- Python
Published by mobicham almost 2 years ago
https://github.com/mobiusml/hqq - v0.1.7
HQQ v0.1.7
- Faster inference with torchao / marlin 4-bit kernels
- Multi-GPU support for model.quantize()
- Custom HF generator (see the sketch below)
- Various bug fixes/improvements
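A sketch of how these pieces fit together, using the patching and generation utilities as they are named in current versions of the repo; the exact entry points at v0.1.7 may have differed, and the checkpoint name is a placeholder.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from hqq.models.hf.base import AutoHQQHFModel
    from hqq.core.quantize import BaseQuantizeConfig
    from hqq.utils.patching import prepare_for_inference
    from hqq.utils.generation_hf import HFGenerator

    model_id = "meta-llama/Llama-3.2-1B"  # placeholder; any supported causal LM
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Quantize, then swap the quantized layers to a faster 4-bit kernel (torchao int4 here)
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
    AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                                  compute_dtype=torch.bfloat16, device="cuda")
    prepare_for_inference(model, backend="torchao_int4")

    # The custom HF generator mentioned above
    gen = HFGenerator(model, tokenizer, max_new_tokens=256, do_sample=False, compile="partial").warmup()
    out = gen.generate("Explain half-quadratic quantization in one paragraph.", print_tokens=True)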
- Python
Published by mobicham almost 2 years ago
https://github.com/mobiusml/hqq - v0.1.6.post2
HQQ v0.1.6.post2
Same as v0.1.6 with setup.py fixes:
- find_packages fix: https://github.com/mobiusml/hqq/pull/25
- Auto-build CUDA kernels via pypi package: https://github.com/mobiusml/hqq/pull/26
- Python
Published by mobicham almost 2 years ago
https://github.com/mobiusml/hqq - v0.1.6.post1
HQQ v0.1.6.post1
Same as v0.1.6 with a find_packages fix https://github.com/mobiusml/hqq/pull/25
- Python
Published by mobicham almost 2 years ago
https://github.com/mobiusml/hqq - v0.1.6
HQQ v0.1.6
Use v0.1.6.post1 instead, unless you clone the repo first and then install.
Features
- Quantize on target device.
- Meta-offloading uses pinned memory for faster/async transfers.
- Loading saved LoRA weights automatically adds LoRA modules if not already present.
- pip install automatically compiles the CUDA kernels now.
- CUDA backend automatically detected and used when available.
- You can quantize any HF model automatically via AutoHQQHFModel (see the sketch after this list).
- Faster meta-offloading with CUDA streams (experimental).
- Int8 matmul (experimental).
- Shared memory CUDA kernels (experimental).
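A minimal sketch of the AutoHQQHFModel flow referenced above; the checkpoint name and settings are placeholders, and the save_quantized/from_quantized calls follow the library's current naming.

    import torch
    from transformers import AutoModelForCausalLM
    from hqq.models.hf.base import AutoHQQHFModel
    from hqq.core.quantize import BaseQuantizeConfig

    model_id = "meta-llama/Llama-3.2-1B"  # placeholder; any HF transformers model
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

    # Quantize on the target device, then save/reload the quantized weights
    quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
    AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                                  compute_dtype=torch.float16, device="cuda")
    AutoHQQHFModel.save_quantized(model, "llama-hqq-4bit")
    model = AutoHQQHFModel.from_quantized("llama-hqq-4bit")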
Bugs
- Fix Peft bias dtype.
- Removed auto backend setting in LoRA.
- All HQQLinear dtype/device-related overloads now return self, which should solve a couple of issues.
Other
- Refactor backends (using backprop backends by default now).
- Added typing.
- Ruff fix and reformat all Python files.
- Refactor ATEN for reference tensors.
Issues
- Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B 2-bit/gs=16). In fact, it's sometimes almost as fast as keeping data on the GPU, so this is worth looking into.
- Shared memory CUDA kernels are a bit slower than without for some reason.
- The block size setting doesn't have much influence on the speed.
- Int8 matmul is slower than fp16 with the current "placeholder" implementation, it should be done on the Aten/CUDA side.
- Python
Published by mobicham almost 2 years ago
https://github.com/mobiusml/hqq - v0.1.5
HQQ v0.1.5
New features
- Added support for multi-gpu FSDP QLoRA training (https://github.com/mobiusml/hqq/pull/17)
Issues
- torch.compile and the PYTORCH_COMPILE backend break with view_as_float=True. No known solution for the moment.
- A bit slower inference with view_as_float=True. Solution: after training, the user can revert back to int bitpacking.
- Python
Published by mobicham almost 2 years ago
https://github.com/mobiusml/hqq - v0.1.4
HQQ v0.1.4
New features
- Added 1-bit support with CUDA dequant kernels.
- Python
Published by mobicham almost 2 years ago
https://github.com/mobiusml/hqq - v0.1.3.post1
HQQ v0.1.3.post1
New features
- meta_offloading support: allows offloading meta-data to the CPU, hence achieving true n-bit storage on the GPU.
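In the current quantization config this is exposed as an offload_meta flag; the sketch below assumes that BaseQuantizeConfig parameter name, and note that the v0.2.0 entry above deprecates it on the transformers side.

    from hqq.core.quantize import BaseQuantizeConfig

    # Low-bit config with meta-data (scales/zeros) offloaded to the CPU,
    # so only the n-bit packed weights stay on the GPU
    quant_config = BaseQuantizeConfig(nbits=2, group_size=16, offload_meta=True)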
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - v0.1.3
HQQ v0.1.3
New features
- Added CUDA kernels for dequantization (up to 2-3x inference speed-up vs. Pytorch)
- Added support for the compute_dtype parameter (useful for float32/bfloat16 LoRA training)
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - v0.1.2.post1
HQQ v0.1.2.post1
Bug fixes
- Fixed LoRA adapter loading.
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - v0.1.2
HQQ v0.1.2
Improvements
- Added LoRA support
- Added LoRA with fake quantization support (experimental)
- Optimizer V2 with scale update support
- Some code refactoring in quantize.py
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - v0.1.1.post1
HQQ v0.1.1.post1
No improvements over v0.1.1. Just removed Pytorch from the dependencies and updated the Readme.
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - v0.1.1
HQQ v0.1.1
Improvements:
- Added Mixtral support for Hugging Face.
- Added support for layer-wise custom quantization configs.
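A sketch of a layer-wise config: a dict keyed by linear-layer tags (Llama-style names assumed here) mapping to per-layer BaseQuantizeConfig objects, passed to the model quantization call in place of a single config.

    from hqq.core.quantize import BaseQuantizeConfig

    q4 = BaseQuantizeConfig(nbits=4, group_size=64)
    q2 = BaseQuantizeConfig(nbits=2, group_size=16)

    # Attention projections at 4-bit, MLP projections at 2-bit
    quant_config = {
        "self_attn.q_proj": q4,
        "self_attn.k_proj": q4,
        "self_attn.v_proj": q4,
        "self_attn.o_proj": q4,
        "mlp.gate_proj": q2,
        "mlp.up_proj": q2,
        "mlp.down_proj": q2,
    }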
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - v0.1.0
HQQ v0.1.0
Improvements
- Added compile backend support
- Added Aten C++ backend (experimental); see the backend-selection sketch after this list
- Faster bit unpacking via pre-allocated empty tensor
- Added VLLM support
- Refactoring to call quantize_model() on instances
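Backend selection is a one-liner; this is a sketch using the backend enum as it is named in current versions of the library.

    from hqq.core.quantize import HQQLinear, HQQBackend

    # Pure-PyTorch dequantization is the default; the compile backend wraps it
    # with torch.compile, and ATEN uses the experimental C++/CUDA kernels.
    HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
    # HQQLinear.set_backend(HQQBackend.ATEN)  # if the CUDA extension was built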
Supported models
- Llama (Hugging Face + VLLM)
- ViT-CLIP (timm)
Limitations
- HF only supports single GPU runtime.
- VLLM only supports single GPU with a single worker.
- The compile backend sometimes creates issues with async runtime
- Doesn't support PEFT (LoRA, etc.).
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - 0.1.0-alpha
HQQ 0.1.0-alpha
Alpha version with basic Hugging Face/Timm support.
Supported models:
- Llama (Hugging Face)
- ViT (timm)
Limitations:
- Uses a pure Pytorch implementation without optimizations.
- Only supports single GPU runtime.
- Doesn't support Peft (LoRA, etc.) for custom training.
- Python
Published by mobicham about 2 years ago
https://github.com/mobiusml/hqq - HQQ v1.0.0
HQQ v1.0.0
Limitations:
- Only supports single GPU runtime with Pytorch.
- Not compatible with Hugging Face's Peft.
- Python
Published by mobicham about 2 years ago