Recent Releases of torchao
torchao - v0.13.0
Highlights
We are excited to announce the 0.13.0 release of torchao! This release adds numerous QAT improvements, faster MXFP8 pretraining, and more!
Simpler Multi-step QAT API (https://github.com/pytorch/ao/pull/2629)
We added a new, simpler, multi-step QAT API that uses only a single config. Now users can specify the target post-training quantization (PTQ) config as the base config and we will automatically infer the correct fake quantize configs to use!
```py
from torchao.quantization import (
    quantize_,
    Int8DynamicActivationInt4WeightConfig,
)
from torchao.quantization.qat import QATConfig

# prepare
base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))
```
For more advanced use cases, users can continue to specify specific FakeQuantizeConfigs as before:
```py
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import IntxFakeQuantizeConfig, QATConfig

# prepare
activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)
```
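To make the "fake quantize" step more concrete, here is a minimal pure-Python sketch of what a symmetric int4, per-group fake quantizer computes in the forward pass: values are scaled, rounded, clamped to the int4 range, and immediately dequantized back to floats. It is an illustration only and not torchao's implementation; the function name `fake_quantize_int4` and the scale choice (largest magnitude in the group divided by 7) are assumptions made for this example.

```python
def fake_quantize_int4(values, group_size=32):
    """Quantize-dequantize `values` group by group (hypothetical helper)."""
    qmin, qmax = -8, 7  # signed 4-bit integer range
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Assumed scale choice: map the largest magnitude in the group onto qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            q = max(qmin, min(qmax, round(v / scale)))  # round, then clamp
            out.append(q * scale)  # dequantize back to float
    return out


weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.05]
fake = fake_quantize_int4(weights, group_size=4)
max_err = max(abs(a - b) for a, b in zip(weights, fake))
```

The output keeps the original length and floating-point type, but every value sits on a 16-level grid per group, so training sees the rounding error that the converted model will have. QAT implementations typically pair this forward pass with a straight-through estimator for the gradient, which this sketch does not model.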
(Prototype) NVFP4 and FP8 QAT (https://github.com/pytorch/ao/pull/2735, https://github.com/pytorch/ao/pull/2666)
We generalized QAT to support FP8 and NVFP4 use cases. You can try them out as follows:
```py
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationInt4WeightConfig,
    Float8DynamicActivationFloat8WeightConfig,
    Float8WeightOnlyConfig,
)
from torchao.prototype.mx_formats import NVFP4InferenceConfig
from torchao.quantization.qat import QATConfig

# Pick a base config
base_config = Float8DynamicActivationInt4WeightConfig()
# or base_config = Float8DynamicActivationFloat8WeightConfig()
# or base_config = NVFP4InferenceConfig()

# prepare
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))
```
Users can also use the more specific FakeQuantizeConfigs for more advanced use cases, e.g.:
```py
import torch
from torchao.quantization import PerRow, quantize_
from torchao.quantization.qat import Float8FakeQuantizeConfig, QATConfig
from torchao.prototype.qat import NVFP4FakeQuantizeConfig

activation_config = Float8FakeQuantizeConfig(torch.float8_e4m3fn, PerRow())
weight_config = NVFP4FakeQuantizeConfig(use_per_tensor_scale=True)

# prepare
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)
```
(prototype) 1.2x MXFP8 dense pretraining speedups with torchtitan
We landed performance improvements (such as a faster to_mx dim1 cast) to our prototype MXFP8 training APIs, and we now achieve a 1.2x speedup vs bf16 on pretraining LLaMa 3 8B on NVIDIA B200. Please see our training benchmarks README for more information.
torchao float8 training now integrated into axolotl!
You can now use torchao.float8 directly from axolotl to achieve finetuning QPS e2e speedups of up to 1.1x on 3B parameter models (docs, release notes).
BC Breaking
Float8DynamicActivationFloat8WeightConfig and Float8WeightOnlyConfig version bump to 2 (https://github.com/pytorch/ao/pull/2650)
We updated the implementation of the float8 tensor, which bumps the default version of these two configs from 1 to 2. Loading a checkpoint quantized with version 1 now emits deprecation warnings:
```
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "torchao-testing/opt-125m-Float8DynamicActivationFloat8WeightConfig-v1-0.13.dev"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    device_map="cuda",
)

/data/users/jerryzh/ao/torchao/core/config.py:249: UserWarning: Stored version is not the same as current default version of the config: stored_version=1, current_version=2, please check the deprecation warning
  warnings.warn(
/data/users/jerryzh/ao/torchao/dtypes/floatx/float8_layout.py:113: UserWarning: Models quantized with version 1 of Float8DynamicActivationFloat8WeightConfig is deprecated and will no longer be supported in a future release, please upgrade torchao and quantize again, or download a newer torchao checkpoint, see https://github.com/pytorch/ao/issues/2649 for more details
  warnings.warn(
```
Suggestion: upgrade torchao to 0.13 or later and generate the checkpoint again:
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
Or download the checkpoint again (please let us know if the checkpoint is not updated)
Please see https://github.com/pytorch/ao/issues/2649 for more details around the deprecation.
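The warning shown above results from comparing the version stored with a checkpoint's config against the current default. The following standard-library sketch shows one way such a check could look; `check_config_version`, `CURRENT_VERSION`, and the dict layout are hypothetical names chosen for illustration and are not torchao's internal code.

```python
import warnings

CURRENT_VERSION = 2  # default version of the two float8 configs as of 0.13.0


def check_config_version(stored_config):
    """Warn when a stored config predates the current default (hypothetical helper)."""
    stored = stored_config.get("version", 1)
    if stored != CURRENT_VERSION:
        warnings.warn(
            f"stored_version={stored}, current_version={CURRENT_VERSION}: "
            "re-quantize the model or download a newer checkpoint",
            UserWarning,
        )
        return False
    return True


# Capture the warning instead of printing it, so the result can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    is_current = check_config_version({"name": "Float8WeightOnlyConfig", "version": 1})
```

A check of this kind could be run before serving a checkpoint to decide whether it should be regenerated with the current release.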
QAT API Changes (https://github.com/pytorch/ao/pull/2628, https://github.com/pytorch/ao/pull/2641)
On a high level, the following existing APIs are deprecated and replaced by these new ones. Although this is technically BC-breaking due to typing changes, it will not affect most users as old classes are kept around for now. They are planned to be removed in the next release, however.
```py
IntXQuantizationAwareTrainingConfig -> QATConfig
FromIntXQuantizationAwareTrainingConfig -> QATConfig
FakeQuantizeConfig -> IntxFakeQuantizeConfig
FakeQuantizer -> IntxFakeQuantizer
```
Please see https://github.com/pytorch/ao/issues/2630 and the latest QAT README for more information on how to migrate.
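For codebases with many references to the old names, the renames listed above can be applied mechanically as a first pass. The sketch below does this with a regular expression; `RENAMES` and `migrate_source` are hypothetical names for this example. It only renames identifiers and does not adapt call arguments (for instance, the `step="prepare"` / `step="convert"` argument that `QATConfig` takes), so the result still needs manual review against the migration guide.

```python
import re

# Old name -> new name, taken from the list above.
RENAMES = {
    "FromIntXQuantizationAwareTrainingConfig": "QATConfig",
    "IntXQuantizationAwareTrainingConfig": "QATConfig",
    "FakeQuantizeConfig": "IntxFakeQuantizeConfig",
    "FakeQuantizer": "IntxFakeQuantizer",
}

# Match whole identifiers only, so names such as IntxFakeQuantizeConfig
# or Float8FakeQuantizeConfig are left untouched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_source(text):
    """Return `text` with deprecated QAT names replaced (hypothetical helper)."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], text)


old = "cfg = FakeQuantizeConfig(torch.int4, group_size=32)"
new = migrate_source(old)
```

Because the pattern matches whole identifiers, running the helper a second time on already-migrated code leaves it unchanged.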
Remove old change_linear_weights_to_* APIs (https://github.com/pytorch/ao/pull/2721)
The following old quantization APIs no longer work and are removed:
```py
change_linear_weights_to_int8_dqtensors(model)
change_linear_weights_to_int8_woqtensors(model)
change_linear_weights_to_int4_woqtensors(model)
```
Please use the quantize_ API with the following configs instead:
```py
quantize_(model, Int8WeightOnlyConfig())
quantize_(model, Int4WeightOnlyConfig())
```
Deprecations
Deprecate old TORCH_VERSION variables (https://github.com/pytorch/ao/pull/2719)
The following variables are deprecated and will be removed in the next release:
```py
TORCH_VERSION_AT_LEAST_2_2
TORCH_VERSION_AT_LEAST_2_3
TORCH_VERSION_AT_LEAST_2_4
TORCH_VERSION_AT_LEAST_2_5
TORCH_VERSION_AT_LEAST_2_6
TORCH_VERSION_AT_LEAST_2_7
TORCH_VERSION_AT_LEAST_2_8
TORCH_VERSION_AFTER_2_2
TORCH_VERSION_AFTER_2_3
TORCH_VERSION_AFTER_2_4
TORCH_VERSION_AFTER_2_5
```
Drop support for PyTorch 2.5 and before (https://github.com/pytorch/ao/pull/2720)
torchao only supports the latest 3 versions of PyTorch. Please upgrade to PyTorch 2.6.0+ if you were using an older version of PyTorch.
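Code that relied on the deprecated `TORCH_VERSION_*` flags can compare version strings directly instead. The sketch below parses the leading numeric components of a version string such as `torch.__version__`; `version_at_least` is a hypothetical helper written for this example, not a torchao API, and it deliberately ignores local and pre-release suffixes.

```python
import re


def version_at_least(current, minimum):
    """Compare the leading numeric parts of two version strings (hypothetical helper)."""

    def release(version):
        # Keep only the leading "X.Y.Z" digits; suffixes such as "+cu126",
        # "a0", or ".dev20250801" are dropped, so pre-releases compare
        # equal to the corresponding release.
        match = re.match(r"\d+(?:\.\d+)*", version)
        if match is None:
            raise ValueError(f"unrecognized version string: {version!r}")
        return tuple(int(part) for part in match.group(0).split("."))

    cur, low = release(current), release(minimum)
    # Pad to equal length so that "2.6" and "2.6.0" compare as equal.
    width = max(len(cur), len(low))
    cur += (0,) * (width - len(cur))
    low += (0,) * (width - len(low))
    return cur >= low


supported = version_at_least("2.6.0+cu126", "2.6.0")
too_old = version_at_least("2.5.1", "2.6.0")
```

In a real project the first argument would be `torch.__version__`. Comparing integer tuples rather than strings matters here: as strings, "2.10.0" would sort before "2.6.0".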
New Features
- New multi-step QAT API (https://github.com/pytorch/ao/pull/2629)
- Add float8 FakeQuantizeConfig and FakeQuantizer (https://github.com/pytorch/ao/pull/2735)
- (prototype) Add NVFP4 QAT (https://github.com/pytorch/ao/pull/2666)
Improvements
- Add StretchedUnifTorchaoQuantizer (https://github.com/pytorch/ao/pull/2576)
- Allow symmetric_no_clipping_error for KleidiAI kernels, update Readme and validate Kleidi INT4 quantization path (https://github.com/pytorch/ao/pull/2570)
- Enable powers of 2 cast in float8 rowwise_with_gw_hp recipe (https://github.com/pytorch/ao/pull/2677)
- Don't call erase if node is already erased in batch norm fusion. (https://github.com/pytorch/ao/pull/2716)
- Generalize FakeQuantizer beyond intx (https://github.com/pytorch/ao/pull/2714)
- Allow pattern replacement to ignore literals (https://github.com/pytorch/ao/pull/2519)
- Replace `export_for_training` with `torch.export.export` (https://github.com/pytorch/ao/pull/2724)
- Allow no quantization during QATConfig convert (https://github.com/pytorch/ao/pull/2694)
- Int4 sparse marlin tensor (https://github.com/pytorch/ao/pull/2771)
- Remove group_size arg in Float8DynamicActivationInt4WeightConfig (https://github.com/pytorch/ao/pull/2779)
- Fix batch norm folding in `prepare_pt2e` for multiple conv->BN chains sharing the same conv weights (https://github.com/pytorch/ao/pull/2795)
- Add Float8Tensor (https://github.com/pytorch/ao/pull/2463)
- (prototype) Allow per-group quantizers in QuantOptimizer, fix state_dict (https://github.com/pytorch/ao/pull/2743)
- (prototype) SpinQuant support split qkv (prototype) (https://github.com/pytorch/ao/pull/2547)
- (prototype) Make AWQ more general (https://github.com/pytorch/ao/pull/2400)
- (prototype) MX training
  - Integration of new mxfp8 casting cuda kernel (https://github.com/pytorch/ao/pull/2564)
  - Mx: expose scaling calculation methods in training UX (https://github.com/pytorch/ao/pull/2620)
  - Mx: make CUDA kernel for dim1 cast in mxfp8_cublas recipe (https://github.com/pytorch/ao/pull/2661)
- (prototype) MoE training
  - Mxfp8 emulated grouped gemm (https://github.com/pytorch/ao/pull/2626)
  - Add differentiable mxfp8 grouped gemm with dynamic quant (forward pass) (https://github.com/pytorch/ao/pull/2627)
  - Support for 2d-2d emulated mxfp8 grouped gemm (https://github.com/pytorch/ao/pull/2632)
  - Backward pass for differentiable mxfp8 grouped gemm with dynamic quant (https://github.com/pytorch/ao/pull/2639)
  - torch.compile support for ScaledGroupedMMTensor (https://github.com/pytorch/ao/pull/2509)
  - Assert expert weights are column-major; preserve subclass with transpose (https://github.com/pytorch/ao/pull/2663)
  - set token group alignment size to 16 for fp8 training test (https://github.com/pytorch/ao/pull/2678)
  - Make scaling type configurable for MoE training (https://github.com/pytorch/ao/pull/2642)
  - use smaller block sizes for per group scaling kernels to improve perf (https://github.com/pytorch/ao/pull/2668)
  - add llama4 benchmarking script (https://github.com/pytorch/ao/pull/2669)
  - add fp8 rowwise kernels for expert weights (https://github.com/pytorch/ao/pull/2696)
  - add bench script for fp8 rowwise kernels and update autotune configs (https://github.com/pytorch/ao/pull/2697)
  - integrate rowwise expert quant kernel (https://github.com/pytorch/ao/pull/2698)
  - work around wrap_triton bug by using normal custom ops instead for fp8 rowwise kernels (https://github.com/pytorch/ao/pull/2734)
  - fix scaling type bug; refactor distributed tests (https://github.com/pytorch/ao/pull/2749)
  - use llama4 shapes for kernel benchmarks (https://github.com/pytorch/ao/pull/2756)
  - remove duplicate benchmark script (https://github.com/pytorch/ao/pull/2762)
  - refactor to share benchmarking and profiling utils (https://github.com/pytorch/ao/pull/2767)
  - add memory bandwidth calculations to kernel benchmarking scripts (https://github.com/pytorch/ao/pull/2769)
  - update bench script to compare fp8 dynamic quant scaled_grouped_mm fwd+bwd against bf16 (https://github.com/pytorch/ao/pull/2765)
- Float8 blockwise training (prototype)
  - Add Triton kernels for fp8 blockwise quantization and GEMMs (https://github.com/pytorch/ao/pull/2617)
  - Add Float8BlockwiseLinear for training (https://github.com/pytorch/ao/pull/2618)
  - Improve fp8 blockwise gemm perf (https://github.com/pytorch/ao/pull/2784)
Bug Fixes
- Fix autocast handling for float8 training rowwise recipes (https://github.com/pytorch/ao/pull/2587)
- NVFP4 -> Use more of e4m3 range for block_scales (https://github.com/pytorch/ao/pull/2604)
- Handle the case when param groups are passed to optimizer (https://github.com/pytorch/ao/pull/2606)
- Fix bc breakage flex path (https://github.com/pytorch/ao/pull/2652)
- Fix FSDP2 breakage in nightly (https://github.com/pytorch/ao/pull/2684)
- When replacing literals with placeholders lists are always converted to (https://github.com/pytorch/ao/pull/2518)
- Don't learn zero points for symmetric quantization (https://github.com/pytorch/ao/pull/2739)
- fix ROCM build for newer hipblaslt BC-breaking change (https://github.com/pytorch/ao/pull/2510)
- Fix missing QuantOptimizer methods (https://github.com/pytorch/ao/pull/2770)
- Fix float8 + int4 QAT (https://github.com/pytorch/ao/pull/2851)
- Allowlist WeightWithDynamicFloat8CastTensor for deserialization for checkpointing (https://github.com/pytorch/ao/pull/2573)
Performance
- Fix float8 rowwise inference perf with torch.compile (https://github.com/pytorch/ao/pull/2672)
- Add CUDA kernel for MXFP8 dim1 casting (https://github.com/pytorch/ao/pull/2513, https://github.com/pytorch/ao/pull/2550)
- Extend the MX cast benchmark to include casting to mxfp4 (https://github.com/pytorch/ao/pull/2693)
Documentation
- Add QLoRA and FP8 to finetuning tutorial (part 2) (https://github.com/pytorch/ao/pull/2542)
- Clean up QAT API surface + add separate API ref (https://github.com/pytorch/ao/pull/2567)
- Update float8 README with AMD MI300X benchmark results (https://github.com/pytorch/ao/pull/2736)
- Update float8 README.md with more recent e2e performance numbers (https://github.com/pytorch/ao/pull/2774, https://github.com/pytorch/ao/pull/2580)
- Update quantization overview and contributor guide doc (https://github.com/pytorch/ao/pull/2723)
- add e2e training benchmark results to mx_formats README.md (https://github.com/pytorch/ao/pull/2777)
- Update paper link readme (https://github.com/pytorch/ao/pull/2563)
- Minor improvements to OpenVINOQuantizer (https://github.com/pytorch/ao/pull/2581)
- Update README with PEFT integration + installation (https://github.com/pytorch/ao/pull/2559)
Developers
- Bump cutlass version to 4.1.0 (https://github.com/pytorch/ao/pull/2589)
- Fix git repo url in citation (https://github.com/pytorch/ao/pull/2599)
- Simplify Float8Linear (https://github.com/pytorch/ao/pull/2594, https://github.com/pytorch/ao/pull/2595)
- Convert quantization internal methods to private (https://github.com/pytorch/ao/pull/2568)
- Reference representation of dqlinear int4 for xnnpack (https://github.com/pytorch/ao/pull/2520)
- Refactors to align with new tensor subclass design
  - Add all fbgemm kernel Tensors into Int4WeightOnlyConfig and Float8DynamicActivationInt4WeightConfig (https://github.com/pytorch/ao/pull/2474)
  - Add support for float8 activation for Int4PreshuffledTensor (https://github.com/pytorch/ao/pull/2437)
  - Align Int4Tensor implementation details with the design of Float8Tensor (https://github.com/pytorch/ao/pull/2687)
  - Support `optional_tensor_names` in TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2710)
  - Update Int4PreshuffledTensor to align with implementation details of the Float8Tensor (https://github.com/pytorch/ao/pull/2738)
  - Nvfp4 tensor: switch to using `qdata` (https://github.com/pytorch/ao/pull/2787)
  - Nvfp4 tensor: switch to TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2788)
  - Nvfp4 tensor: refactor weight-only vs dynamic quant (https://github.com/pytorch/ao/pull/2790)
  - Mxtensor: make data argument first and rename to `qdata` (https://github.com/pytorch/ao/pull/2804)
  - Mxtensor: inherit from TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2805)
  - Mxtensor: refactor activation quant to use direct logic (https://github.com/pytorch/ao/pull/2806)
  - Support more ops in TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2609)
New Contributors
- @wdvr made their first contribution in https://github.com/pytorch/ao/pull/2548
- @carmocca made their first contribution in https://github.com/pytorch/ao/pull/2539
- @gausah-arm made their first contribution in https://github.com/pytorch/ao/pull/2570
- @daniil-lyakhov made their first contribution in https://github.com/pytorch/ao/pull/2581
- @zeshengzong made their first contribution in https://github.com/pytorch/ao/pull/2599
- @amdfaa made their first contribution in https://github.com/pytorch/ao/pull/2662
- @chowarfb made their first contribution in https://github.com/pytorch/ao/pull/2657
- @abeakkas made their first contribution in https://github.com/pytorch/ao/pull/2716
- @subhankarpal made their first contribution in https://github.com/pytorch/ao/pull/2795
Full Changelog: https://github.com/pytorch/ao/compare/v0.12.0...v0.13.0-rc1
Published by vkuzo 9 months ago
torchao - v0.12.0
Highlights
We are excited to announce the 0.12.0 release of torchao! This release adds support for QAT + Axolotl Integration and prototype MXFP/NVFP support on Blackwell GPUs!
QAT + Axolotl Integration
TorchAO’s QAT support has been integrated into Axolotl’s fine-tuning recipes! Check out the docs here or run it yourself using the following command:
```shell
axolotl train examples/llama-3/3b-qat-fsdp2.yaml
axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
```
Initial results for Llama3.2-3B by @SalmanMohammadi (https://github.com/axolotl-ai-cloud/axolotl/pull/2590):

| Model/Metric | hellaswag acc | hellaswag acc_norm | wikitext bits_per_byte | wikitext byte_perplexity | wikitext word_perplexity |
|--------------|---------------|--------------------|------------------------|--------------------------|--------------------------|
| bfloat16 | 0.5552 | 0.7315 | 0.6410 | 1.5594 | 10.7591 |
| bfloat16 PTQ | 0.5393 | 0.7157 | 0.6613 | 1.5815 | 11.6033 |
| qat ptq | 0.5423 | 0.7180 | 0.6567 | 1.5764 | 11.4043 |
| Recovered (qat ptq) | 18.87% | 14.56% | 22.66% | 23.08% | 23.57% |
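The "Recovered" row is consistent with the share of the bf16-to-PTQ gap that QAT closes, that is, (QAT − PTQ) / (bf16 − PTQ). This definition is inferred from the numbers rather than stated in the source; the sketch below recomputes the row under that assumption, with `recovered_percent` as a hypothetical helper name.

```python
def recovered_percent(bf16, ptq, qat):
    """Share of the bf16-to-PTQ gap that QAT closes, in percent (assumed definition)."""
    return 100.0 * (qat - ptq) / (bf16 - ptq)


# (bfloat16, bfloat16 PTQ, qat ptq) values copied from the results above.
results = {
    "hellaswag acc": (0.5552, 0.5393, 0.5423),
    "hellaswag acc_norm": (0.7315, 0.7157, 0.7180),
    "wikitext bits_per_byte": (0.6410, 0.6613, 0.6567),
    "wikitext byte_perplexity": (1.5594, 1.5815, 1.5764),
    "wikitext word_perplexity": (10.7591, 11.6033, 11.4043),
}

recovered = {name: round(recovered_percent(*row), 2) for name, row in results.items()}
```

The same expression works for metrics where higher is better (accuracy) and where lower is better (perplexity), because the sign of the numerator and denominator flips together.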
[Prototype | API not finalized] MXFP and NVFP support on Blackwell GPUs
TorchAO now includes prototype support for NVFP4 (NVIDIA's 4-bit floating-point format) and Microscaling (MX) formats on NVIDIA's latest Blackwell GPU architecture. These formats enable efficient inference, achieving up to 61% end-to-end performance improvement in vLLM on Qwen3 models and near 2x speedups for diffusion workloads.
To use:
```py
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import (
    MXFPInferenceConfig,
    NVFP4InferenceConfig,
)

# quantize_ modifies the model in place, so its return value is not assigned.

# Quantize model with MXFP8
quantize_(model, MXFPInferenceConfig(block_size=32))

# Quantize model to NVFP4 (without double scaling)
quantize_(model, NVFP4InferenceConfig())
```
Note: This is a prototype feature with APIs subject to change. Requires NVIDIA Blackwell GPUs (B200, 5090) with CUDA 12.8+.
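As background on the block size argument: in microscaling (MX) formats, each block of consecutive elements shares a single power-of-two scale, and the elements are stored in a low-precision format relative to that scale. The pure-Python sketch below illustrates only the shared-scale idea and is not torchao's kernel. The helper `block_scales`, the use of 448.0 (the largest finite FP8 E4M3 magnitude) as the element limit, and the round-up rule for the exponent are all assumptions made for this example.

```python
import math

ELEM_MAX = 448.0  # largest finite FP8 E4M3 magnitude; assumed element format


def block_scales(values, block_size=32):
    """Return one power-of-two scale per block (hypothetical helper)."""
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        if max_abs == 0.0:
            scales.append(1.0)  # an all-zero block needs no scaling
            continue
        # Smallest power of two that brings the block maximum within ELEM_MAX.
        exponent = math.ceil(math.log2(max_abs / ELEM_MAX))
        scales.append(2.0 ** exponent)
    return scales


# Two blocks with very different magnitudes get very different scales.
data = [0.001 * i for i in range(32)] + [1000.0 * i for i in range(32)]
scales = block_scales(data)
```

Because each block carries its own scale, a tensor that mixes small and large values can still use most of the element format's range in every block, which is the motivation for block-wise scaling over a single per-tensor scale.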
BC Breaking
- Remove preserve_zero and zero_point_domain from choose_qparams_affine (https://github.com/pytorch/ao/pull/2149)
- Rename qparams for tinygemm (https://github.com/pytorch/ao/pull/2344)
- Convert quant_primitives methods private (https://github.com/pytorch/ao/pull/2350)
- Delete Galore (https://github.com/pytorch/ao/pull/2397)
- Remove more Galore bits (https://github.com/pytorch/ao/pull/2417)
- Remove `sparsity/prototype/blocksparse` (https://github.com/pytorch/ao/pull/2205)
Deprecations
- Clean up prototype folder (https://github.com/pytorch/ao/pull/2232)
- Make float8 training's force_recompute_fp8_weight_in_bwd flag do nothing (https://github.com/pytorch/ao/pull/2356)
New Features
- Enabling MOE Quantization using linear decomposition (https://github.com/pytorch/ao/pull/2043)
- [PT2E][X86] Migrate fusion passes in Inductor to torchao (https://github.com/pytorch/ao/pull/2140)
- 2:4 activation sparsity packing kernels (https://github.com/pytorch/ao/pull/2012)
- Add subclass based method for inference w/ MXFP8 (https://github.com/pytorch/ao/pull/2132)
- Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors (https://github.com/pytorch/ao/pull/1763)
- Arm_inductor_quantizer for Pt2e quantization (https://github.com/pytorch/ao/pull/2139)
- Add mx_fp4 path (https://github.com/pytorch/ao/pull/2201)
- Add support for KleidiAI int4 kernels on aarch64 Linux (https://github.com/pytorch/ao/pull/2169)
- Add support for fbgemm int4 mm kernel (https://github.com/pytorch/ao/pull/2255)
- Enable fp16+int4 mixed precision path for int4 xpu path with int zero point (https://github.com/pytorch/ao/pull/2240)
- Enable range learning for QAT (https://github.com/pytorch/ao/pull/2033)
- Patch the _is_conv_node function (https://github.com/pytorch/ao/pull/2257)
- Add support for fbgemm fp8 kernels (https://github.com/pytorch/ao/pull/2276)
- Add Float8ActInt4WeightQATQuantizer (https://github.com/pytorch/ao/pull/2289)
- [float8] add _auto_filter_for_recipe to float8 (https://github.com/pytorch/ao/pull/2410)
- NVfp4 (https://github.com/pytorch/ao/pull/2408)
- [float8] Prevent quantize_affine_float8/dequantize_affine_float8 decomposed on inductor (https://github.com/pytorch/ao/pull/2379)
- [CPU] Enable DA8W4 on CPU (https://github.com/pytorch/ao/pull/2128)
- Add exportable coreml codebook quantization op (https://github.com/pytorch/ao/pull/2443)
- Add support for Int4GroupwisePreshuffleTensor for fbgemm (https://github.com/pytorch/ao/pull/2421)
Improvements
- Add serialization support for `AOPerModuleConfig` (https://github.com/pytorch/ao/pull/2186)
- Set eps in end-to-end QAT flow (https://github.com/pytorch/ao/pull/2180)
- Enable {conv3d, conv_transpose3d} + bn fusion in pt2e (https://github.com/pytorch/ao/pull/2212)
- Update GemLite to support vLLM V1 (https://github.com/pytorch/ao/pull/2199)
- [sparse] Add fp8 sparse gemm with rowwise scaling for activation sparsity (https://github.com/pytorch/ao/pull/2242)
- Patch the _is_conv_node function (https://github.com/pytorch/ao/pull/2223)
- Relax int4wo device mismatch error (https://github.com/pytorch/ao/pull/2254)
- Rename AOPerModuleConfig to ModuleFqnToConfig (https://github.com/pytorch/ao/pull/2243)
- [reland2][ROCm] preshuffled weight mm (https://github.com/pytorch/ao/pull/2207)
- GPTQ updates (https://github.com/pytorch/ao/pull/2235)
- Fix QAT range learning, ensure scales get gradients (https://github.com/pytorch/ao/pull/2280)
- Fix slicing and get_plain() in GemLite (https://github.com/pytorch/ao/pull/2288)
- Add slicing support for fbgemm fp8 and int4 (https://github.com/pytorch/ao/pull/2308)
- Add support for bmm and `to` for fbgemm Tensor (https://github.com/pytorch/ao/pull/2337)
- Add dynamic quantization support to gemlite layout (https://github.com/pytorch/ao/pull/2327)
- Test PARQ with torchao activation quantization (https://github.com/pytorch/ao/pull/2370)
- Update index.rst (https://github.com/pytorch/ao/pull/2395)
- Add inplace quantizer examples (https://github.com/pytorch/ao/pull/2345)
- Build mxfp4 kernel for sm120a (https://github.com/pytorch/ao/pull/2285)
- Enable to_mxfp8 cast for DTensor (https://github.com/pytorch/ao/pull/2420)
- Enable tensor parallelism for MXLinear (https://github.com/pytorch/ao/pull/2434)
- Graduate debug handle in torchao (https://github.com/pytorch/ao/pull/2452)
- Switch alignment to 8 for cutlass 4 upgrade (https://github.com/pytorch/ao/pull/2491)
- Mxfp8 training: add TP sharding strategy for dim1 kernel (https://github.com/pytorch/ao/pull/2436)
Bug Fixes
- [optim] Fix low-bit optim when used with FSDP2+CPUOffload (https://github.com/pytorch/ao/pull/2195)
- Fix Per Row scaling for inference (https://github.com/pytorch/ao/pull/2253)
- Fix benchmark_low_bit_adam.py reference (https://github.com/pytorch/ao/pull/2287)
- [optim] Fix bug when default dtype is BF16 (https://github.com/pytorch/ao/pull/2286)
- [sparse] marlin fixes (https://github.com/pytorch/ao/pull/2305)
- Fix ROCM test failures (https://github.com/pytorch/ao/pull/2362)
- [float8] Add fnuz fp8 dtypes to Float8Layout (https://github.com/pytorch/ao/pull/2351)
- Fixing ruff format for trunk (https://github.com/pytorch/ao/pull/2369)
- Fixing trunk - autoquant test failure (https://github.com/pytorch/ao/pull/2363)
- Remove torchao dependency from torchao build script (https://github.com/pytorch/ao/pull/2383)
- Fix torchao quantized model in fbcode (https://github.com/pytorch/ao/pull/2396)
- Gemlite generate.py fix (https://github.com/pytorch/ao/pull/2372)
- Fixes issue #156414: Fixes bug in implementation of _combine_histogram (Follow up) (https://github.com/pytorch/ao/pull/2418)
- TorchAO new observers (https://github.com/pytorch/ao/pull/2508)
- Fix tutorials (https://github.com/pytorch/ao/pull/2516)
Performance
- Add a triton kernel for swizzling (https://github.com/pytorch/ao/pull/2168)
Documentation
- Add blockwise fp8 gemm benchmarks to README (https://github.com/pytorch/ao/pull/2203)
- [float] document e2e training -> inference flow (https://github.com/pytorch/ao/pull/2190)
- Update Readme (https://github.com/pytorch/ao/pull/1526)
- Mark QAT range learning as prototype for now (https://github.com/pytorch/ao/pull/2272)
- Update float8 training readme to include time measurement (https://github.com/pytorch/ao/pull/2291)
- [BE/docs] Add float8 training api ref to docsite (https://github.com/pytorch/ao/pull/2313)
- Enable doc build to run on PRs (https://github.com/pytorch/ao/pull/2315)
- [BE] [docs] Add float8 pretraining tutorial to docsite (https://github.com/pytorch/ao/pull/2304)
- [BE/docs] Add fp8 rowwise perf table to float8 training readme (https://github.com/pytorch/ao/pull/2312)
- Update Quantization docs to show newer AOConfigs (https://github.com/pytorch/ao/pull/2317)
- Update QAT docs, highlight axolotl integration (https://github.com/pytorch/ao/pull/2266)
- Add static quant tutorial (https://github.com/pytorch/ao/pull/2047)
- Update README.md to include seamless v2 (https://github.com/pytorch/ao/pull/2355)
- Add Tutorial on E2E integration into VLLM and minimal Subclass (https://github.com/pytorch/ao/pull/2346)
- [docs] Replace deprecated configs with Config objects (https://github.com/pytorch/ao/pull/2375)
- Revamp README (https://github.com/pytorch/ao/pull/2374)
- Add pt2e tutorials to torchao doc page (https://github.com/pytorch/ao/pull/2384)
- Add part 2 of end-to-end tutorial: fine-tuning (https://github.com/pytorch/ao/pull/2394)
- Call out axolotl + QAT integration on README (https://github.com/pytorch/ao/pull/2442)
- Float8 readme: remove duplication (https://github.com/pytorch/ao/pull/2447)
- Float8 readme: add key features section (https://github.com/pytorch/ao/pull/2448)
- Update README.md to include Flux-Fast (https://github.com/pytorch/ao/pull/2457)
- Inference tutorial - Part 3 of e2e series (https://github.com/pytorch/ao/pull/2343)
- Update QAT README and API docstrings (https://github.com/pytorch/ao/pull/2465)
- Fix typo : whic -> which (https://github.com/pytorch/ao/pull/2495)
- Fix links for torchao tutorials (https://github.com/pytorch/ao/pull/2503)
- Fix docstrings for quantization API docs (https://github.com/pytorch/ao/pull/2471)
- Tutorial for benchmarking (https://github.com/pytorch/ao/pull/2499)
Developers
New Contributors
- @malfet made their first contribution in https://github.com/pytorch/ao/pull/2181
- @the-tuning-machine made their first contribution in https://github.com/pytorch/ao/pull/1763
- @choudhary-devang made their first contribution in https://github.com/pytorch/ao/pull/2139
- @vctrmn made their first contribution in https://github.com/pytorch/ao/pull/2169
- @yuguo68 made their first contribution in https://github.com/pytorch/ao/pull/2225
- @liangan1 made their first contribution in https://github.com/pytorch/ao/pull/2240
- @emmanuel-ferdman made their first contribution in https://github.com/pytorch/ao/pull/2250
- @odiemm-meta made their first contribution in https://github.com/pytorch/ao/pull/2328
- @lilianaairhart made their first contribution in https://github.com/pytorch/ao/pull/2360
- @Gasoonjia made their first contribution in https://github.com/pytorch/ao/pull/2390
- @zixi-qi made their first contribution in https://github.com/pytorch/ao/pull/2396
- @shiyang-weng made their first contribution in https://github.com/pytorch/ao/pull/2379
- @Akabbaj made their first contribution in https://github.com/pytorch/ao/pull/2418
- @mori360 made their first contribution in https://github.com/pytorch/ao/pull/2449
- @henrylhtsang made their first contribution in https://github.com/pytorch/ao/pull/2491
- @namgyu-youn made their first contribution in https://github.com/pytorch/ao/pull/2495
- @rohansjoshi made their first contribution in https://github.com/pytorch/ao/pull/2508
Full Changelog: https://github.com/pytorch/ao/compare/v0.11.0...v0.12.0-rc2
Published by drisspg 11 months ago
torchao - v0.11.0
Highlights
We are excited to announce the 0.11.0 release of torchao! This release adds support for mixture-of-experts (MoE) quantization, PyTorch 2 Export Quantization (PT2E), and a microbenchmarking framework for inference APIs!
MoE Quantization
We’ve added a prototype feature for quantizing MoE modules with a number of TorchAO quantization techniques. This approach leverages the existing TorchAO features for quantizing linear ops and applies them to MoE modules.
```py
from torchao.quantization.prototype.moe_quant.utils import cond_ffn_filter, MoEQuantConfig
from torchao.quantization.quant_api import quantize_, Int8WeightOnlyConfig

quantize_(
    model,
    MoEQuantConfig(Int8WeightOnlyConfig()),
    filter_fn=cond_ffn_filter,
)
model = torch.compile(
    model,
    mode="reduce-overhead",
    fullgraph=is_single_token_inference,
)
```
While the above API is all that is needed to quantize a MoE module that is written to be both quantizable and compilable, in practice it's rare for a user model to satisfy these conditions because MoE implementations vary widely. The normal MoE module must first be swapped for a MoEFeedForwardAOQuantizable module to prepare the model for quantization. An example can be found in llama4_quant.py, which demonstrates this technique for the Hugging Face llama-4-Scout-17B-16E-Instruct model.
We implemented MoE quantization with two methods. The first (designated `base` in the benchmarks below) extends the existing quantized tensor subclass to quantize the 3D MoE expert tensors and perform the necessary indexing and slicing ops. The second (`fake`) uses a new tensor subclass to simulate a 3D quantized parameter by storing a sequence of 2D slices of the quantized parameter. The first approach is faster, with marginally worse memory characteristics. In both cases, MoE quantization done this way isn’t expected to be as performant as fused MoE kernels implemented for each technique, but it can yield both moderate speedups and significant memory savings.
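The second method can be pictured as follows: instead of quantizing one 3D tensor of shape (experts, rows, cols), each expert's 2D weight is quantized on its own, and the results are kept in a sequence that is indexed like a 3D parameter. The pure-Python sketch below illustrates that idea with per-row symmetric int8 weight-only quantization. The class `Fake3DQuantizedWeight` and the helper `quantize_rows_int8` are hypothetical and do not mirror torchao's tensor subclass.

```python
def quantize_rows_int8(matrix):
    """Per-row symmetric int8 quantization of a 2D list (illustrative only)."""
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


class Fake3DQuantizedWeight:
    """Stores one quantized 2D slice per expert and indexes like a 3D parameter."""

    def __init__(self, experts):
        # Each entry is (quantized rows, per-row scales) for one expert.
        self.slices = [quantize_rows_int8(w) for w in experts]

    def __len__(self):
        return len(self.slices)

    def dequantize(self, expert_idx):
        q_rows, scales = self.slices[expert_idx]
        return [[q * s for q in row] for row, s in zip(q_rows, scales)]


experts = [
    [[1.0, -0.5], [0.25, 0.125]],  # expert 0
    [[2.0, 0.0], [-4.0, 1.0]],     # expert 1
]
weight = Fake3DQuantizedWeight(experts)
restored = weight.dequantize(1)
```

Selecting an expert here is a plain sequence lookup, which is simple to implement on top of existing 2D quantized tensors; the trade-off described above is that this is slower than operating on a single quantized 3D tensor.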
The following benchmarks are for mixtral-moe run on a single H100 GPU:
| Technique | tok/s (batch size 1) | memory in GB (batch size 1) | tok/s (batch size 8) | tok/s × batch (batch size 8) | memory in GB (batch size 8) |
|-------------|-------------|-------------|-------------|--------------|-------------|
| None | 78.35 | 93.76 | 18.2 | 145.64 | 94.12 |
| int8wo-base | 98.4 | 48.87 | 4.94 | 39.56 | 49.2 |
| int4wo-base | 79.38 | 36.15 | 10.29 | 82.29 | 36.12 |
| fp8wo-base | 59.41 | 52.07 | 2.98 | 23.81 | 52.05 |
| fp8dq-base | 45.92 | 53.97 | 3.78 | 30.23 | 53.94 |
| int8wo-fake | 6.14 | 49.13 | 5.01 | 40.09 | 49.23 |
| int4wo-fake | 14.25 | 30.21 | 11.84 | 94.75 | 30.19 |
| fp8wo-fake | 3.2 | 50.31 | 2.88 | 23.08 | 50.29 |
| fp8dq-fake | 9.78 | 50.92 | 4.08 | 32.61 | 50.89 |
PT2 Export Quantization
We migrated PyTorch 2 Export (PT2E) quantization from PyTorch to torchao as part of the planned migration. We’ll follow up by adding deprecation warnings to the PyTorch torch.ao.quantization APIs and updating the docs. We also simplified the import paths for some of the utility functions. Here is a non-exhaustive list of APIs you can use:
```
# top level APIs
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, prepare_qat_pt2e, convert_pt2e
from torchao.quantization.pt2e.quantizer import X86InductorQuantizer

# export utils
from torchao.quantization.pt2e import (
    move_exported_model_to_eval,
    move_exported_model_to_train,
    allow_exported_model_train_eval,
)

# graph utils
from torchao.quantization.pt2e import (
    find_sequential_partitions,
    get_equivalent_types,
    update_equivalent_types_dict,
    bfs_trace_with_node_process,
)

# pt2e numeric debugger
from torchao.quantization.pt2e import (
    generate_numeric_debug_handle,
    CUSTOM_KEY,
    NUMERIC_DEBUG_HANDLE_KEY,
    prepare_for_propagation_comparison,
    extract_results_from_loggers,
    compare_results,
)
```
Microbenchmarking Framework for Inference APIs
We’ve introduced a streamlined microbenchmark framework to help developers track and evaluate the performance of their post-training quantization and sparsity APIs for different matrix sizes and model types. The framework also supports advanced GPU and memory profiling techniques, providing deeper insight into performance characteristics.
To run the benchmarks, use the following command:
python -m benchmarks.microbenchmarks.benchmark_runner --config benchmarks/microbenchmarks/test/benchmark_config.yml
Sample Benchmark Results (on 1xH100):
| Name | Quantization | Shape | Baseline Inference Time (ms) | Inference Time (ms) | Speedup |
|-------------------|-----------------|---------------------|------------------------------|---------------------|---------|
| small_bf16_linear | float8dq-tensor | 16384, 16384, 16384 | 13.34 | 7.72 | 1.73x |
| small_bf16_linear | float8dq-tensor | 16384, 16384, 32768 | 26.04 | 14.62 | 1.78x |
| small_bf16_linear | float8dq-tensor | 16384, 16384, 65536 | 53.59 | 29.05 | 1.84x |
| small_bf16_linear | float8dq-tensor | 16384, 32768, 32768 | 68.94 | 28.07 | 2.46x |
| small_bf16_linear | float8dq-tensor | 16384, 32768, 65536 | 108.63 | 58.7 | 1.85x |
| small_bf16_linear | float8dq-tensor | 16384, 65536, 65536 | 215.66 | 118.42 | 1.82x |
| small_bf16_linear | float8dq-tensor | 32768, 32768, 32768 | 108.16 | 57.09 | 1.89x |
| small_bf16_linear | float8dq-tensor | 32768, 32768, 65536 | 214.74 | 110.08 | 1.95x |
| small_bf16_linear | float8dq-tensor | 32768, 65536, 65536 | 432.44 | 223.46 | 1.94x |
| small_bf16_linear | float8dq-tensor | 65536, 65536, 65536 | 870.37 | 447.97 | 1.94x |
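The Speedup column is the baseline inference time divided by the quantized inference time. A minimal sketch that recomputes it for three rows copied from the table above (the helper name `speedup` is ours and is not part of the benchmark framework):

```python
def speedup(baseline_ms: float, quantized_ms: float) -> float:
    """Return how many times faster the quantized run is than the baseline."""
    return baseline_ms / quantized_ms


# (shape, baseline ms, float8dq-tensor ms), copied from the table above
rows = [
    ("16384, 16384, 16384", 13.34, 7.72),
    ("16384, 32768, 32768", 68.94, 28.07),
    ("65536, 65536, 65536", 870.37, 447.97),
]

for shape, base, quant in rows:
    print(f"{shape}: {speedup(base, quant):.2f}x")
```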
BC Breaking
- Remove prototype low bit optim code completely (https://github.com/pytorch/ao/pull/2159)
New Features
- Add quantized attn_scores @ v test for intended use in quantized attention (https://github.com/pytorch/ao/pull/2008)
- Add fallback kernel and interface (https://github.com/pytorch/ao/pull/2010)
- Add fallback kernel and interface for rhs only quantized matmul (https://github.com/pytorch/ao/pull/2011)
- Add KleidiAI gemm kernels (https://github.com/pytorch/ao/pull/2000)
- Use quantized gemm only on aarch64 (https://github.com/pytorch/ao/pull/2023)
- Adds utility to replace Q/DQ ops with torchao quantized linear ops (https://github.com/pytorch/ao/pull/1967)
- Adds Q/DQ layout support for embedding quantization with IntxWeightOnlyConfig (https://github.com/pytorch/ao/pull/1972)
- Move Int8DynamicActivationIntxWeightConfig out of experimental (https://github.com/pytorch/ao/pull/1968)
- Initial ParetoQ commit (https://github.com/pytorch/ao/pull/1876)
- INT4 XPU enabling (https://github.com/pytorch/ao/pull/1577)
- Vectorized row sum (https://github.com/pytorch/ao/pull/2034)
- Add gemm for fp32_a_int8_b matmul kernel (https://github.com/pytorch/ao/pull/2039)
- Add gemm kernel to interface (https://github.com/pytorch/ao/pull/2040)
- Add tests for attention matmul for gemm kernels (https://github.com/pytorch/ao/pull/2041)
- Gemm int8 a int8 b kernels (https://github.com/pytorch/ao/pull/2049)
- Add tests cases for q @ k attention variant (https://github.com/pytorch/ao/pull/2051)
- Add gemm int8 a x int8 b to interface (https://github.com/pytorch/ao/pull/2055)
- [Quant][PT2E][X86] Enable annotation of aten.mul.tensor with X86InductorQuantizer (https://github.com/pytorch/ao/pull/2075)
- Add AOPerModuleConfig to torchao.quantization (https://github.com/pytorch/ao/pull/2134)
- Enabling MoE Quantization using linear decomposition (https://github.com/pytorch/ao/pull/2043)
Improvement
- Match QAT prepare and convert numerics exactly (https://github.com/pytorch/ao/pull/1964)
- [Prototype] Update torchao.prototype.parq and add 4-bit Llama 3.2 1B benchmark (https://github.com/pytorch/ao/pull/2017)
- [ROCm] preshuffled weight mm (https://github.com/pytorch/ao/pull/1702)
- Remove old code from torchao.experimental.quant_api (https://github.com/pytorch/ao/pull/2030)
- Remove zero_point_domain from quant configs (https://github.com/pytorch/ao/pull/2058)
- Match QAT prepare and convert numerics exactly for bf16 and fp16 (https://github.com/pytorch/ao/pull/2060)
- [scaled grouped mm] add triton kernels for float8 rowwise quantization with per-group/jagged scales (https://github.com/pytorch/ao/pull/2064)
- [reland][ROCm] preshuffled weight mm (https://github.com/pytorch/ao/pull/2044)
- [scaled grouped mm] integrate triton kernels into differentiable scaled grouped mm (https://github.com/pytorch/ao/pull/2077)
- Add AOPerModuleConfig (https://github.com/pytorch/ao/pull/2119)
- Improve GemLite Integration (https://github.com/pytorch/ao/pull/2096)
- [prototype] PARQ quantizer support for torchao's weight-only configs (https://github.com/pytorch/ao/pull/2091)
Bug Fixes
- Fix slice and padding for TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/2015)
- Fix Int4WeightEmbeddingQATQuantizer.convert path (https://github.com/pytorch/ao/pull/2024)
- Fix static AQT flow (https://github.com/pytorch/ao/pull/2046)
- Fix QDQ layout slice operation when zero_point is None (https://github.com/pytorch/ao/pull/2054)
- Fix aqt implementation for aten.mm/aten.addmm fallback path (https://github.com/pytorch/ao/pull/2072)
- Fix AO SAM2 issues (https://github.com/pytorch/ao/pull/2109)
- Fix AOPerModuleConfig bug in skipping quantizing modules (https://github.com/pytorch/ao/pull/2135)
- Fixing aliasing behavior for slice in AQT TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/2174)
Performance
- Add profiling to benchmarking (https://github.com/pytorch/ao/pull/2032)
- Model shapes config (https://github.com/pytorch/ao/pull/2036)
Documentation
- Remove hf_eval.py and add documentation on using lm-eval (https://github.com/pytorch/ao/pull/2045)
- Update QAT README.md (https://github.com/pytorch/ao/pull/2162)
Developers
New Contributors
- @YIWENX14 made their first contribution in https://github.com/pytorch/ao/pull/2080
- @navsud made their first contribution in https://github.com/pytorch/ao/pull/2079
- @jlbmorales made their first contribution in https://github.com/pytorch/ao/pull/2109
- @syed-ahmed made their first contribution in https://github.com/pytorch/ao/pull/2163
- @SalmanMohammadi made their first contribution in https://github.com/pytorch/ao/pull/2162
Full Changelog: https://github.com/pytorch/ao/compare/v0.10.0...v0.11.0
Published by andrewor14 about 1 year ago
torchao - v0.10.0
Highlights
We are excited to announce the 0.10.0 release of torchao! This release adds support for end-to-end training for mxfp8 on NVIDIA B200, PARQ (for quantization-aware training), a module swap quantization API for research, and updates for low-bit kernels!
Low Bit Optimizers moved to Official Support (https://github.com/pytorch/ao/pull/1864)
Low-bit optimizers (added in 0.4) have moved out of prototype and now have official support in torchao.
[Prototype] End to End Training Support for mxfp8 on NVIDIA B200 (#1786, #1841, #1951, #1932, #1980)
We have an early version of the end to end training workflow for the mxfp8 dtypes with torch.compile on NVIDIA B200, with the cuBLAS mxfp8 gemm seeing an observed speedup of over 2x over bfloat16 gemm, and casts from bfloat16 to mxfp8 achieving up to 5.5 TB/s. Please see our README.md for MX for more information. We plan to improve performance further in future releases.
[Prototype] Piecewise-Affine Regularized Quantization (https://github.com/pytorch/ao/pull/1738)
- PARQ is a new theoretical framework for inducing quantization through regularization. It supports standard QAT, as well as new gradual quantization methods, in an easy to use optimizer-only interface. No modifications to a model’s forward or backward pass are needed for quantization.
```py
import torch
from torchao.prototype.parq.optim import QuantOptimizer, ProxHardQuant
from torchao.prototype.parq.quant import UnifQuantizer

# Separate quantizable from non-quantizable parameter groups
param_groups = [
    {"params": weights, "quant_bits": 2},  # add extra quant_bits key for QAT
    {"params": others},
]

# Initialize any torch.optim.Optimizer
base_optimizer = torch.optim.SGD(param_groups, lr=0.1, momentum=0.9, weight_decay=1e-4)

# Apply a simple wrapper to quantize in optimizer.step()
optimizer = QuantOptimizer(
    base_optimizer, quantizer=UnifQuantizer(), prox_map=ProxHardQuant()
)
```
[Prototype] Module Swap Quantization API (https://github.com/pytorch/ao/pull/1886)
We added a prototype API for post-training quantization. Users can swap their linear or embedding layers into their QuantizedLinear and QuantizedEmbedding counterparts, and set the quantizers that specify how they want the input activations or weights to be quantized:
```py
quantized_linear = QuantizedLinear(...)
quantized_linear.weight_quantization = IntQuantizer(
    num_bits=4,
    group_size=32,
    dynamic=True,
    quantization_mode="symmetric",
)
quantized_linear.input_quantization = CodeBookQuantizer(
    num_bits=8,
    features=10,
)
```
Note: The API is highly subject to change and will be integrated with quantize_ in the future. For more detail, please see the README.
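To make the `IntQuantizer` settings above more concrete, here is a minimal pure-Python sketch of symmetric, group-wise integer quantization with `num_bits=4` and a small group size. It illustrates the general technique only: the function name `quantize_symmetric` is hypothetical, and this is not the prototype's implementation.

```python
from typing import List, Tuple


def quantize_symmetric(
    values: List[float], num_bits: int, group_size: int
) -> Tuple[List[int], List[float]]:
    """Quantize each group with its own scale and no zero point (hypothetical helper)."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for 4 bits
    qmin = -(2 ** (num_bits - 1))   # -8 for 4 bits
    quantized: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        # Map the largest magnitude in the group onto qmax
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(max(qmin, min(qmax, round(v / scale))) for v in group)
    return quantized, scales


weights = [7.0, -3.0, 0.0, 1.0, 14.0, 2.0, -6.0, 4.0]
q, scales = quantize_symmetric(weights, num_bits=4, group_size=4)
print(q)       # integer codes in [-8, 7]
print(scales)  # one scale per group of 4 values
```

Multiplying each code by its group's scale gives the dequantized approximation of the original weights.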
[Prototype] Low Bit Kernels (#1826, #1935, #1998, #1652)
Low-bit CPU and MPS kernels are now pip installable from source. To install torchao with low-bit CPU kernels, you can use the following command on an Arm-based Mac:
USE_CPP=1 pip install git+https://github.com/pytorch/ao.git
You can then quantize your model to run on Arm-based Macs with high-performance CPU kernels in torchao. SharedEmbeddingQuantizer, EmbeddingQuantizer, and Int8DynamicActivationIntxWeightConfig all support 1-8 bit quantization.
```py
import torch
from torchao.experimental.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    SharedEmbeddingQuantizer,
    EmbeddingQuantizer,
)
from torchao.quantization.granularity import PerGroup, PerRow
from torchao.quantization.quant_api import quantize_

# Quantize embedding/unembedding to 8-bits with SharedEmbeddingQuantizer
# SharedEmbeddingQuantizer is for quantizing models like Llama1B/3B
# where the embedding/unembedding layers share weights
# If the embedding/unembedding layers do not share weights, use
# EmbeddingQuantizer instead
SharedEmbeddingQuantizer(
    weight_dtype=torch.int8,
    granularity=PerRow(),
    has_weight_zeros=True,
).quantize(model)

# Quantize linear layers to 4-bits
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(128),
        has_weight_zeros=False,
    ),
)
```
BC Breaking
Delete delayed scaling from torchao.float8 (https://github.com/pytorch/ao/pull/1753)
The following usage of `Float8LinearConfig` is no longer supported as of torchao v0.10.0:
```py
config = Float8LinearConfig(
    cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
    cast_config_grad_output=CastConfig(scaling_type=ScalingType.DELAYED),
)
```
If you would like to use float8 training with delayed scaling, please use an earlier release of torchao. Please see https://github.com/pytorch/ao/issues/1680 for more context about this deprecation.
Enforce AOBaseConfig type in quantize_'s config argument (https://github.com/pytorch/ao/pull/1861)
This was done following a deprecation window to simplify the arguments of quantize_. Please see https://github.com/pytorch/ao/issues/1690 for more context.
```py
# torchao v0.9.0
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,
    device: Optional[torch.types.Device] = None,
): ...

# torchao v0.10.0
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,
    device: Optional[torch.types.Device] = None,
): ...
```
Remove the set_inductor_config argument of quantize_. (https://github.com/pytorch/ao/pull/1865)
This was done following a deprecation window to decouple quantize_ from torchinductor. Please see https://github.com/pytorch/ao/issues/1715 for more context.
```py
# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):
    # if set_inductor_config != None, throw a deprecation warning
    # if set_inductor_config == None, set it to True to stay consistent with old behavior
    ...

# torchao v0.10.0
def quantize_(
    ...,
):
    # set_inductor_config is removed from quantize_ and moved to relevant individual workflows
    ...
```
Deprecations
We removed some of our prototype features that are not used, including DORA (https://github.com/pytorch/ao/pull/1815), split_k kernel (https://github.com/pytorch/ao/pull/1816), profiler (https://github.com/pytorch/ao/pull/1862) and bitnet (https://github.com/pytorch/ao/pull/1866).
New Features
QAT
- Added PARQ (https://github.com/pytorch/ao/pull/1738)
Low Bit Optimizers
- Promote Low Bit Optim out of prototype (https://github.com/pytorch/ao/pull/1864)
Module swap quantization API
- Add module swap quantization API from Quanty (https://github.com/pytorch/ao/pull/1886)
Benchmarking
- Micro-benchmark inference (https://github.com/pytorch/ao/pull/1759)
- Add sparsity to benchmarking (https://github.com/pytorch/ao/pull/1917)
- Add float8 training benchmarking scripts (https://github.com/pytorch/ao/pull/1802)
Improvement
Kernels
- 1-8 bit CPU and MPS kernels are now pip installable from source (https://github.com/pytorch/ao/pull/1826)
- Added 1-8 bit shared embedding ops to further compress models like Llama1B/3B where the embedding/unembedding weights are shared (https://github.com/pytorch/ao/pull/1935)
- CPU kernels added runtime microkernel selection based on CPU features and matrix size (https://github.com/pytorch/ao/pull/1998)
- KleidiAI microkernel library was integrated with CPU kernels to improve GEMM performance on Arm CPUs (https://github.com/pytorch/ao/pull/1652)
- Add build flag to set parallel_backend (https://github.com/pytorch/ao/pull/1870)
- Add quant api + python test for shared embedding (https://github.com/pytorch/ao/pull/1937)
- Add dynamic shape support for lowbit kernels (https://github.com/pytorch/ao/pull/1942)
- Add LUT-based bitpacking for 1-4 bits (https://github.com/pytorch/ao/pull/1987)
- Add lut support to linear kernel (https://github.com/pytorch/ao/pull/1990)
- Quantized matmul (https://github.com/pytorch/ao/pull/1994)
- Add fp32xint8 matmul (https://github.com/pytorch/ao/pull/2004)
- Add quantized q @ k test for intended use in quantized attention (https://github.com/pytorch/ao/pull/2006)
- ROCm Support : Tile_Layout kernel (https://github.com/pytorch/ao/pull/1201)
- Metal lowbit kernels: pip install (https://github.com/pytorch/ao/pull/1785)
- Metal lowbit ops: ci (https://github.com/pytorch/ao/pull/1825)
- ROCm Sparse Marlin Kernels #1206 (https://github.com/pytorch/ao/pull/1834)
- ROCm OCP FP8 Support (https://github.com/pytorch/ao/pull/1677)
- Migrate to int args (https://github.com/pytorch/ao/pull/1846)
- Add bias support to torchao kernels (https://github.com/pytorch/ao/pull/1879)
- Write weight packing/unpacking functions for universal kernels (https://github.com/pytorch/ao/pull/1921)
- Unpack weights at col (https://github.com/pytorch/ao/pull/1933)
- Shared embedding kernel (https://github.com/pytorch/ao/pull/1934)
- Bug fixes for shared_embedding (https://github.com/pytorch/ao/pull/1941)
- Update linear.h (https://github.com/pytorch/ao/pull/1963)
- Reintroduce has_weight_zeros as a template param (https://github.com/pytorch/ao/pull/1991)
AOConfigs
- Support Serialization for AOConfigs (https://github.com/pytorch/ao/pull/1875)
- Migrate to config for Int8DynamicActivationIntxWeightConfig (https://github.com/pytorch/ao/pull/1836)
- Migrate sparsify_ to configs (https://github.com/pytorch/ao/pull/1856)
SAM2
- SAM2: Use torch.export for VOS (https://github.com/pytorch/ao/pull/1708)
QAT
- Add linear bias support for QAT (https://github.com/pytorch/ao/pull/1755)
MX
- Allow for scales to be in new e8m0 dtype (https://github.com/pytorch/ao/pull/1742)
- Support MXFP6 packing and fused unpack-dequantize kernel (https://github.com/pytorch/ao/pull/1810)
- Implemented RCEIL (CUBLAS-style) MXFP scale factor derivation, with test cases. (https://github.com/pytorch/ao/pull/1835)
- Use torch.float8_e8m0fnu in mx_formats (https://github.com/pytorch/ao/pull/1966)
- Mx_formats: move training to the quantize_ API (https://github.com/pytorch/ao/pull/1970)
Affine Quantization
- Add support for copy_ for plain layout and tensor core tiled layout (https://github.com/pytorch/ao/pull/1791)
- Add bias support for Int8DynActInt4WeightLinear (https://github.com/pytorch/ao/pull/1845)
- Move config out of experimental (https://github.com/pytorch/ao/pull/1954)
Bug Fixes
- Fix potential out-of-bound access in int8_mm.py (https://github.com/pytorch/ao/pull/1751)
- Fixing DORA imports (https://github.com/pytorch/ao/pull/1795)
- Avoid assert error when there's bias (https://github.com/pytorch/ao/pull/1839)
- Update triton import error message (https://github.com/pytorch/ao/pull/1842)
- Enable the CPU int4 with HQQ quant (https://github.com/pytorch/ao/pull/1824)
- Do not override requires_grad=False when enable_float8_all_gather=True (https://github.com/pytorch/ao/pull/1873)
- Add MI300X specs to roofline benchmark (https://github.com/pytorch/ao/pull/1913)
- Fix dynamic shape for shared embedding (https://github.com/pytorch/ao/pull/1946)
Performance
- Modify cast from hp to mx to help inductor fuse (https://github.com/pytorch/ao/pull/1786)
- Enable torch.compile for mxfp8_cublas recipe (https://github.com/pytorch/ao/pull/1841)
- Optimize tensor_flatten for runtime (https://github.com/pytorch/ao/pull/1951)
- Triton kernel to cast to mx and write in col-major (https://github.com/pytorch/ao/pull/1932)
- small speedup with dim0 cast for mx (https://github.com/pytorch/ao/pull/1980)
Documentation
- Updating Cuda 12.1/12.4 to 12.4/12.6 to reflect current state (https://github.com/pytorch/ao/pull/1794)
- Update float8 training benchmark readme (https://github.com/pytorch/ao/pull/1872)
- Add perf benchmarks for float8 training with rowwise + tensorwise scaling (https://github.com/pytorch/ao/pull/1793)
- Fix link markdown in readme (https://github.com/pytorch/ao/pull/1881)
- Refresh torchao.float8 README (https://github.com/pytorch/ao/pull/1986)
- Refresh float8 training section of main README (https://github.com/pytorch/ao/pull/1985)
- Refresh MX README (https://github.com/pytorch/ao/pull/1989)
New Contributors
- @jithunnair-amd made their first contribution in https://github.com/pytorch/ao/pull/1749
- @facebook-github-bot made their first contribution in https://github.com/pytorch/ao/pull/1752
- @mark14wu made their first contribution in https://github.com/pytorch/ao/pull/1751
- @lisjin made their first contribution in https://github.com/pytorch/ao/pull/1738
- @mayank31398 made their first contribution in https://github.com/pytorch/ao/pull/1849
- @alex-titterton made their first contribution in https://github.com/pytorch/ao/pull/1810
- @mreso made their first contribution in https://github.com/pytorch/ao/pull/1913
- @frsun-nvda made their first contribution in https://github.com/pytorch/ao/pull/1835
Full Changelog: https://github.com/pytorch/ao/compare/v0.9.0...v0.10.0-rc1
Published by jerryzh168 about 1 year ago
torchao - v0.9.0
Highlights
We are excited to announce the 0.9.0 release of torchao! This release moves a number of sparsity techniques out of prototype and adds a significant overhaul of the quantize_ API, a new CUTLASS kernel for 4-bit dynamic quantization, and more!
Block Sparsity promoted out of prototype
We’ve promoted block sparsity out of torchao.prototype and made several performance improvements. You can accelerate your models with block sparsity as follows:
```python
from torchao.sparsity import sparsify_, block_sparse_weight

sparsify_(model, block_sparse_weight(blocksize=64))
```
Blocksparse Benchmarks
| Technique | Decode (tok/s) | Model Size (GB) |
|------------------------------|------------------|---------------------|
| baseline | 134.40 | 15.01 |
| 2:4 sparse | 163.13 | 10.08 |
| bsr-0.8-32 | 210.91 | 6.01 |
| bsr-0.8-64 | 222.43 | 6.00 |
| bsr-0.9-32 | 255.19 | 4.88 |
| bsr-0.9-64 | 262.94 | 4.88 |
| 2:4 sparse + int4wo (marlin) | 255.21 | 3.89 |
Block Sparsity technique names (bsr) indicate sparsity fraction and blocksize.
These numbers were generated on H100 using torchao/_models/llama/generate.py on the Meta-Llama-3.1-8B model. You can reproduce these numbers using this script
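As a small reading aid for the benchmark table above, the sketch below splits a `bsr` technique name into its sparsity fraction and blocksize and computes the decode speedup over the baseline. The helper `parse_bsr` is written for this example and is not part of torchao; the throughput numbers are copied from the table.

```python
from typing import Tuple


def parse_bsr(name: str) -> Tuple[float, int]:
    """Split a name such as 'bsr-0.9-64' into (sparsity fraction, blocksize)."""
    prefix, sparsity, blocksize = name.split("-")
    if prefix != "bsr":
        raise ValueError(f"not a block-sparse technique name: {name}")
    return float(sparsity), int(blocksize)


# Decode throughput (tok/s) copied from the benchmark table above
baseline_tok_s = 134.40
results = {"bsr-0.8-32": 210.91, "bsr-0.9-64": 262.94}

for name, tok_s in results.items():
    sparsity, blocksize = parse_bsr(name)
    print(
        f"{name}: sparsity={sparsity}, blocksize={blocksize}, "
        f"speedup={tok_s / baseline_tok_s:.2f}x"
    )
```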
BC Breaking
TorchAO M1 Binaries currently not working
We've identified that the binaries are broken on M1 and have been since v0.8.0, though they were working in v0.7.0. We're working on a fix for this; details and discussion can be found here.
quantize_ configuration callables -> configs (https://github.com/pytorch/ao/pull/1595, https://github.com/pytorch/ao/pull/1694, https://github.com/pytorch/ao/pull/1696, https://github.com/pytorch/ao/pull/1697)
We are migrating the way quantize_ workflows are configured from callables (tensor subclass inserters) to direct configuration (config objects). Motivation: align with the rest of the ecosystem, enable inspection of configs after instantiation, remove a common source of confusion.
What is changing:
Specifically, here is how the signature of quantize_'s second argument will change:
```python
# torchao v0.8.0 and before
def quantize_(
    model: torch.nn.Module,
    apply_tensor_subclass: Callable[[torch.nn.Module], torch.nn.Module],
    ...,
): ...

# torchao v0.9.0
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    ...,
): ...

# torchao v0.10.0 or later (exact version TBD)
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    ...,
): ...
```
- the name of the second argument to `quantize_` changed from `apply_tensor_subclass` to `config`. Since the vast majority of callsites today are passing in configuration with a positional argument, this change should not affect most people.
- the type of the second argument to `quantize_` will change from `Callable[[torch.nn.Module], torch.nn.Module]` to `config: AOBaseConfig`, following a deprecation process detailed below.
- for individual workflows, the user facing API name changed from snake case (`int8_weight_only`) to camel case (`Int8WeightOnlyConfig`). All argument names for each config are kept as-is. We will keep the old snake case names (`int8_weight_only`) around and alias them to the new names (`int8_weight_only = Int8WeightOnlyConfig`), to avoid breaking callsites. We plan to keep the old names forever. Here are all the workflow config name changes:
| old name (will keep working) | new name (recommended) |
| --- | --- |
| int4_weight_only | Int4WeightOnlyConfig |
| float8_dynamic_activation_float8_weight | Float8DynamicActivationFloat8WeightConfig|
| float8_static_activation_float8_weight | Float8StaticActivationFloat8WeightConfig |
| float8_weight_only | Float8WeightOnlyConfig |
| fpx_weight_only | FPXWeightOnlyConfig |
| gemlite_uintx_weight_only | GemliteUIntXWeightOnlyConfig |
| int4_dynamic_activation_int4_weight | Int4DynamicActivationInt4WeightConfig |
| int8_dynamic_activation_int4_weight | Int8DynamicActivationInt4WeightConfig |
| int8_dynamic_activation_int8_semi_sparse_weight | n/a (deprecated) |
| int8_dynamic_activation_int8_weight | Int8DynamicActivationInt8WeightConfig |
| int8_weight_only | Int8WeightOnlyConfig |
| uintx_weight_only | UIntXWeightOnlyConfig |
Configuration for prototype workflows using quantize_ will be migrated at a later time.
How these changes can affect you:
1. If you are a user of existing quantize_ API workflows and are passing in config by a positional argument (quantize_(model, int8_weight_only(group_size=128))), you are not affected. This positional syntax will keep working going forward. You are encouraged to migrate your callsite to the new config name (quantize_(model, Int8WeightOnlyConfig(group_size=128))), though the old names will continue to work indefinitely.
2. If you are a user of existing quantize_ API workflows and are passing in config by a keyword argument (quantize_(model, apply_tensor_subclass=int8_weight_only(group_size=128))), your callsite will break. You will need to change your callsite to quantize_(model, config=int8_weight_only(group_size=128)). We don't expect many people to be in this bucket.
3. If you are a developer writing new workflows for the quantize_ API, you will need to use the new configuration system. Please see https://github.com/pytorch/ao/issues/1690 for details.
4. If you are a user of sparsify_, you are not affected for now and a similar change will happen in a future version of torchao.
This migration will be a two-step process:
* in torchao v0.9.0, we will enable the new syntax while starting the deprecation process for the old syntax.
* in torchao v0.10.0 or later, we will remove the old syntax.
Please see https://github.com/pytorch/ao/issues/1690 for more details.
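As a rough illustration of the transition period described above, the sketch below shows how a function can accept both a config object and a legacy callable while warning about the latter. `AOBaseConfig`, `Int8WeightOnlyConfig`, and `quantize_` here are simplified stand-ins written for this example and are not torchao's actual implementation.

```python
import warnings
from dataclasses import dataclass
from typing import Callable, Optional, Union


class AOBaseConfig:
    """Stand-in base class for workflow configs (illustrative only)."""


@dataclass
class Int8WeightOnlyConfig(AOBaseConfig):
    group_size: Optional[int] = None


# Old snake-case name kept as an alias so existing callsites keep working
int8_weight_only = Int8WeightOnlyConfig


def quantize_(model, config: Union[AOBaseConfig, Callable]):
    """Toy dispatcher: configs are the new path, callables the deprecated one."""
    if isinstance(config, AOBaseConfig):
        return f"quantized {model} with {config!r}"
    if callable(config):
        warnings.warn(
            "passing a callable is deprecated; pass an AOBaseConfig instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return config(model)
    raise TypeError(f"unsupported config type: {type(config).__name__}")


print(quantize_("model", int8_weight_only(group_size=128)))
```

Because the config is a plain object, it can be inspected after instantiation (for example `config.group_size`), which is one of the motivations listed above.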
Block Sparsity imports after moved out of prototype (https://github.com/pytorch/ao/pull/1734)
Before:
```python
from torchao.prototype.sparsity.superblock.blocksparse import block_sparse_weight
```
After:
```python
from torchao.sparsity import block_sparse_weight
```
Deprecations
deprecation of the set_inductor_config argument of quantize_ (https://github.com/pytorch/ao/pull/1716)
We are migrating the set_inductor_config argument of quantize_ to individual workflows. Motivation:
1. this functionality was intended for inference, and we don't want to expose it to future training workflows that we plan to add to quantize_.
2. higher level, this flag couples torchao workflows with torch.compile, which is not ideal. We would rather keep these systems decoupled at the quantize_ API level, with individual workflows opting in as needed.
Impact on users
- for torchao v0.9.0: if you are passing in `set_inductor_config` to `quantize_`, your callsite will keep working with a deprecation warning. We recommend that you migrate this option to your individual workflow.
- for a future version of torchao: the `set_inductor_config` argument will be removed from `quantize_`.
API changes
```python
# torchao v0.8.x
def quantize_(
    ...,
    set_inductor_config: bool = True,
    ...,
): ...

# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):
    # if set_inductor_config != None, throw a deprecation warning
    # if set_inductor_config == None, set it to True to stay consistent with old behavior
    ...

# torchao v TBD (a future release)
def quantize_(
    ...,
):
    # set_inductor_config is removed from quantize_ and moved to relevant individual workflows
    ...
```
Please see https://github.com/pytorch/ao/issues/1715 for more details.
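The v0.9.0 behavior described above follows a common pattern: change the default to a `None` sentinel so that an explicit value can be detected and warned about. A minimal runnable sketch of that pattern, using a toy `quantize_` stand-in rather than torchao's real function:

```python
import warnings
from typing import Optional


def quantize_(model, config, set_inductor_config: Optional[bool] = None):
    """Toy stand-in showing the deprecation pattern (not torchao's code)."""
    if set_inductor_config is not None:
        # The caller passed the argument explicitly: keep working, but warn
        warnings.warn(
            "set_inductor_config is deprecated in quantize_; "
            "set it in the individual workflow instead",
            DeprecationWarning,
            stacklevel=2,
        )
    else:
        # Keep the previous default so existing callsites behave the same
        set_inductor_config = True
    return set_inductor_config


print(quantize_("model", "config"))  # no warning, old default applies
```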
Deprecation warning for float8 training delayed and static scaling (https://github.com/pytorch/ao/pull/1681, https://github.com/pytorch/ao/issues/1680)
We plan to deprecate delayed and static scaling from the torchao.float8 training codebase, due to the lack of real-world use cases for delayed/static scaling (dynamic scaling is required for higher accuracy) and the complexity tax of supporting these features.
* for torchao v0.9.0: add deprecation warning for delayed and static scaling
* for torchao v0.10.0: deprecate delayed and static scaling
New Features
Supermask for improving accuracy for sparse models (https://github.com/pytorch/ao/pull/1729)
Supermask (https://pytorch.org/blog/speeding-up-vits/) is a technique for improving the accuracy of block sparsified models by learning a block-sparse mask during a training phase.
```python
from torchao.sparsity import SupermaskLinear, block_sparse_weight, sparsify_

sparsify_(model, lambda x: SupermaskLinear.from_linear(x, block_size=64, sparsity_level=0.9))

# training here

# collapse supermask into a normal linear layer (with many weights set to 0)
# and then convert to block sparse format for inference speedup
sparsify_(model, lambda x: SupermaskLinear.to_linear(x, sparsity_level=0.9))
sparsify_(model, block_sparse_weight(blocksize=64))
```
Dynamic quantization W4A4 CUTLASS-based kernel (https://github.com/pytorch/ao/pull/1515)
This kernel, which adds support for 4-bit dynamic activation + 4-bit weight quantization, can be used as follows:
```python
from torchao.quantization import quantize_, int4_dynamic_activation_int4_weight

# instantiate the config before passing it to quantize_
quantize_(model, int4_dynamic_activation_int4_weight())
```
Improvements
Early prototype MXFP8 and MXFP4 training and inference support for NVIDIA Blackwell GPUs
In torchao v0.9.0, we include very early support for training and inference on the NVIDIA Blackwell GPUs following the microscaling recipes from https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf, and backed by real MX gemms.
Here is how to use the current prototype APIs.
:warning: Note that torch.compile support is not fully there yet, there are no guarantees on performance at this time, and we expect to change these APIs rapidly as we iterate in future versions of torchao. Please see https://github.com/pytorch/ao/issues/556 for more details.
MX training
```python
import torch
from torchao.prototype.mx_formats.mx_linear import swap_linear_with_mx_linear
from torchao.prototype.mx_formats.config import MXLinearConfig, MXGemmKernelChoice
from torchao.utils import is_sm_at_least_100

# early prototype: on MX-enabled hardware, you can use the real MX gemm backed by
# torchao's CUTLASS kernels. In the future, we will also add cuBLAS kernel support.
gemm_kernel_choice = MXGemmKernelChoice.EMULATED
if is_sm_at_least_100():
    gemm_kernel_choice = MXGemmKernelChoice.CUTLASS

m = torch.nn.Sequential(torch.nn.Linear(32, 32)).cuda()
config = MXLinearConfig(
    elem_dtype=torch.float8_e4m3fn,
    block_size=32,
    gemm_kernel_choice=gemm_kernel_choice,
)
swap_linear_with_mx_linear(m, config=config)

# training loop (not shown)
```
MX inference, weights are in MX and matmul is in high precision.
```python
import torch
from torchao.prototype.mx_formats.mx_linear import swap_linear_with_mx_inference_linear
from torchao.prototype.mx_formats.config import MXLinearConfig

m = torch.nn.Sequential(torch.nn.Linear(32, 32)).cuda()
config = MXLinearConfig(elem_dtype=torch.float8_e4m3fn, block_size=32)
swap_linear_with_mx_inference_linear(m, config=config)

# do inference (not shown)
```
The additional features for MX support in v0.9.0 were enabled by:
* Add mxfp8bf16 kernel (https://github.com/pytorch/ao/pull/1637)
* Support mixed MX element dtype in mx_mm function and MXLinear. (https://github.com/pytorch/ao/pull/1667)
* move blocksize and elemdtype into MXLinearConfig (https://github.com/pytorch/ao/pull/1689)
* hook up mxfp8 and mxfp4 CUTLASS kernels to MXLinear (https://github.com/pytorch/ao/pull/1713)
* add ceil and RNE rounding modes to the cast from fp32 to e8m0 (https://github.com/pytorch/ao/pull/1643)
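For readers new to the MX formats, the sketch below shows how a shared power-of-two scale can be derived for one block of values using the floor-based rule from the OCP MX specification, assuming an FP8 E4M3 element type (largest exponent 8) and an E8M0 scale (bias 127). The function names are ours, the block is shortened from 32 to 4 values for readability, and the ceil and RNE rounding modes mentioned above are not reproduced here.

```python
import math
from typing import List

E4M3_EMAX = 8    # exponent of the largest normal FP8 E4M3 value (448 = 1.75 * 2**8)
E8M0_BIAS = 127  # E8M0 stores the scale exponent with this bias


def shared_exponent_floor(block: List[float]) -> int:
    """Return the unbiased power-of-two scale exponent for one block (floor rule)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return -E8M0_BIAS  # smallest representable scale; a choice made for this sketch
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1
    _, e = math.frexp(amax)
    exponent = (e - 1) - E4M3_EMAX
    return max(-E8M0_BIAS, min(E8M0_BIAS, exponent))


def to_e8m0_bits(exponent: int) -> int:
    """Encode the unbiased exponent as the 8-bit E8M0 value."""
    return exponent + E8M0_BIAS


block = [3.0, -6.0, 0.5, 1.25]
exp = shared_exponent_floor(block)
scaled = [v / 2.0 ** exp for v in block]  # values that would then be cast to E4M3
print(exp, to_e8m0_bits(exp), scaled)
```

With the floor rule, a block whose largest magnitude sits just below a power of two can scale to a value above the E4M3 maximum of 448 and saturate, which is one reason alternative rounding modes for the scale are useful.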
Experimental
- Q dq layout (https://github.com/pytorch/ao/pull/1642)
- Add support for kleidi AI quantization schemes (https://github.com/pytorch/ao/pull/1447)
SAM2
- Add modal script extensions (https://github.com/pytorch/ao/pull/1500)
- Increase export usage, small perf improvements (https://github.com/pytorch/ao/pull/1673)
- Model experiments QoL improvements (https://github.com/pytorch/ao/pull/1683)
- Collect p90 latency statistics (https://github.com/pytorch/ao/pull/1703)
Training
- Support power of 2 scaling factors in float8 training with rowwise scaling and use e4m3 in fwd and bwd pass (https://github.com/pytorch/ao/pull/1670)
- clean up recipe names in Float8 training (https://github.com/pytorch/ao/pull/1730)
- make the "config from recipe" API polished in Float8 training (https://github.com/pytorch/ao/pull/1731)
- Add workaround to reduce FSDP memory usage for float8 rowwise training (https://github.com/pytorch/ao/pull/1629)
- Make FakeQuantizer expose useful config details when printed (https://github.com/pytorch/ao/pull/1717)
Sparsity
- Promote blocksparse from prototype, make it faster (https://github.com/pytorch/ao/pull/1734)
Other
- Relax dtype requirements for int4 and float8 quants in autoquant (https://github.com/pytorch/ao/pull/1571)
- Update `__init__.py` to load experimental ops even if other C++ ops are not found (https://github.com/pytorch/ao/pull/1565)
Bug Fixes
- Fix torch.intx support in FakeQuantizeConfig (https://github.com/pytorch/ao/pull/1544)
- Fix float related autoquant options (https://github.com/pytorch/ao/pull/1562)
- Fix #1559, sparsity instead of sparstiy (https://github.com/pytorch/ao/pull/1560)
- Fix `.item()` issue in running parallel evaluation for BO mixed precision (https://github.com/pytorch/ao/pull/1630)
- Add more stringent test for CPUOffloadOptimizer (https://github.com/pytorch/ao/pull/1650)
- Fix LR scheduler issue with CPU offload optimizer (https://github.com/pytorch/ao/pull/1649)
- Add int8 dynamic activation + int8 weight only test to TensorParallel (https://github.com/pytorch/ao/pull/1657)
- Fix compile issue for Marlin qqq on sm<8.0 (https://github.com/pytorch/ao/pull/1651)
- Fix use_hqq for int4_weight_only quantize (https://github.com/pytorch/ao/pull/1707)
- Unbreak float8 static quant tutorial (https://github.com/pytorch/ao/pull/1709)
- Fix DDP with nf4 (https://github.com/pytorch/ao/pull/1684)
- Fix tensor parallelism for float8 training with rowwise scaling (https://github.com/pytorch/ao/pull/1718)
Documentation
- Update supported dtypes for fp8 (https://github.com/pytorch/ao/pull/1573)
- Sparsity docs update (https://github.com/pytorch/ao/pull/1590)
- Sparsity getting started docs (https://github.com/pytorch/ao/pull/1592)
- Fix broken link on doc page (https://github.com/pytorch/ao/pull/1582)
- Add quick start guide for first time users (https://github.com/pytorch/ao/pull/1611)
- Update api_ref_dtypes docs (https://github.com/pytorch/ao/pull/1610)
- Add module swap -> tensor subclass migration tutorial (https://github.com/pytorch/ao/pull/1596)
- Update docs to refer to version.html (https://github.com/pytorch/ao/pull/1631)
- Split contributor guide into quantization overview (https://github.com/pytorch/ao/pull/1618)
- Update api_ref_quantization docs (https://github.com/pytorch/ao/pull/1619)
- Migrate static quant tutorials to direct configuration (https://github.com/pytorch/ao/pull/1710)
- Update torchao READMEs with new configuration APIs (https://github.com/pytorch/ao/pull/1711)
- Update SAM2 README.md (https://github.com/pytorch/ao/pull/1735)
- Add rowwise scaling README.md entry for float8 training (https://github.com/pytorch/ao/pull/1733)
Developers
- Consolidate ZeroPointDomain.NONE & None zero point domains (https://github.com/pytorch/ao/pull/1556)
- Only run docs build in CI if docs have changed (https://github.com/pytorch/ao/pull/1589)
- Add separate quantization primitives for float8 (https://github.com/pytorch/ao/pull/1597)
- Add boiler plate code to Tensor subclass (https://github.com/pytorch/ao/pull/1663)
- Change TORCH_LIBRARY to TORCH_LIBRARY_FRAGMENT (https://github.com/pytorch/ao/pull/1645)
- Reformat C++ kernels (https://github.com/pytorch/ao/pull/1723)
- Add torchao/experimental CI test (https://github.com/pytorch/ao/pull/1586)
- Clean up linear_int8_dynamic_activation_intx_weight_subclass (https://github.com/pytorch/ao/pull/1553)
New Contributors
- @jaewoosong made their first contribution in https://github.com/pytorch/ao/pull/1560
- @haodongucsb made their first contribution in https://github.com/pytorch/ao/pull/1630
- @nikhil-arm made their first contribution in https://github.com/pytorch/ao/pull/1447
- @ngc92 made their first contribution in https://github.com/pytorch/ao/pull/1650
- @balancap made their first contribution in https://github.com/pytorch/ao/pull/1667
Full Changelog: https://github.com/pytorch/ao/compare/v0.8.0...v0.9.0-rc1
Published by HDCharles over 1 year ago
torchao - v0.8.0
Highlights
We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchAO which adds support for W4A8 linear operator. In addition to this, we’ve also added TTFT benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding.
W4A8 based on CUTLASS
A new W4A8 linear operator is implemented. It corresponds to int8_dynamic_activation_int4_weight quantization, where two 4-bit weights are packed into a single 8-bit integer value. CUTLASS is also now a submodule of the torchao repo, so that more of its functionality can be used to implement new kernels.
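To make the packing concrete, here is a minimal pure-Python sketch of storing two signed 4-bit values in one byte. It is an illustration only: the helper names and the nibble order (first weight in the low nibble) are assumptions, and the actual CUTLASS kernel may lay out its data differently.

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (each in [-8, 7]) into one byte (0..255)."""
    for v in (lo, hi):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in a signed 4-bit integer")
    # Keep the low 4 bits of each value (two's complement), then combine them.
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from a packed byte."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble

    return sign_extend(byte & 0xF), sign_extend((byte >> 4) & 0xF)


# Assumed nibble order: even-indexed weight in the low nibble, odd-indexed in the high nibble.
weights = [-8, 7, 3, -1]
packed = [pack_int4_pair(weights[i], weights[i + 1]) for i in range(0, len(weights), 2)]
unpacked = [v for b in packed for v in unpack_int4_pair(b)]
```

Packing halves the storage for the weights, which is consistent with the smaller model size reported for the w4a8-cutlass row in the benchmark table.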
Benchmarks on A100
| -q parameter | Average tokens/sec | Average Bandwidth in GB/s | Peak Memory Usage in GB | Model Size in GB |
| :--- | ---: | ---: | ---: | ---: |
| | 95.24 | 258.55 | 13.90 | 13.21 |
| -q int8wo | 155.31 | 1028.37 | 8.97 | 6.62 |
| -q int4wo-32 | 186.70 | 774.98 | 5.31 | 4.15 |
| -q int4wo-hqq | 186.47 | 774.01 | 5.04 | 4.15 |
| -q int8dq | 49.64 | 328.72 | 9.44 | 6.62 |
| -q w4a8-cutlass (tuned) | 119.31 | 394.86 | 4.52 | 3.31 |
Prefill performance benchmarks
We’ve added TTFT benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding. During prefill, we are compute bound and find that dynamic quantization offers greater speedups than weight-only quantization, which is faster for decoding. We’ve also added an option for int8 dynamic quantization that selectively uses dynamic quantization during prefill and weight-only quantization during decoding.
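A rough way to see why the two phases behave differently is to estimate the arithmetic intensity (FLOPs per byte moved) of a single linear layer. The sketch below is a back-of-the-envelope model, not part of the torchao benchmarks; the function name, the 4096 x 4096 layer size, and the 16-bit element size are assumptions chosen for illustration.

```python
def linear_arithmetic_intensity(num_tokens: int, in_features: int, out_features: int,
                                bytes_per_element: float = 2.0) -> float:
    """FLOPs per byte moved for y = x @ W^T with x of shape (num_tokens, in_features)."""
    flops = 2 * num_tokens * in_features * out_features
    elements_moved = (
        num_tokens * in_features        # read activations
        + in_features * out_features    # read weights
        + num_tokens * out_features     # write outputs
    )
    return flops / (elements_moved * bytes_per_element)


# Hypothetical layer size (4096 x 4096) with 16-bit elements.
prefill_intensity = linear_arithmetic_intensity(2048, 4096, 4096)  # long prompt
decode_intensity = linear_arithmetic_intensity(1, 4096, 4096)      # one token per step
```

In this model, prefill performs about a thousand FLOPs per byte while single-token decoding performs about one, so decoding time is dominated by reading weights (where smaller weights help) and prefill time by the matmul itself (where faster low-precision matmuls help).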
BC Breaking
Delete the float8-all-gather-only functionality from float8 training (https://github.com/pytorch/ao/pull/1451)
The use_fp8_all_gather_only option was an experimental flag, off by default, which was not marketed and, as far as we know, not used by anyone. We are removing it to simplify the code.
Before
```python
config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)
```
After
The use_fp8_all_gather_only option is no longer supported.
New Features
- Add TTFT benchmarks + update sparsity benchmarks (https://github.com/pytorch/ao/pull/1140)
- Gemlite integration in torchao (https://github.com/pytorch/ao/pull/1034)
- W4A8 based on CUTLASS (https://github.com/pytorch/ao/pull/880)
Improvement
quantize_
- Expose zero_point_domain as arguments (https://github.com/pytorch/ao/pull/1401)
- Add convert path for quantize_ QAT API (https://github.com/pytorch/ao/pull/1540)
- Int8 dynamic prefill weight only decode (https://github.com/pytorch/ao/pull/1436)
autoquant
- Make int8 dynamic quant in autoquant serializable (https://github.com/pytorch/ao/pull/1484)
- Additional fixes for autoquant serialization (https://github.com/pytorch/ao/pull/1486)
- Add exhaustive config option to intmm kernel (https://github.com/pytorch/ao/pull/1392)
float8 training
- [float8] Allow specifying arbitrary dtype for each tensor, enabling recipes with e4m3 in both the forward and the backward (https://github.com/pytorch/ao/pull/1378)
experimental
- Remove temp build files from torchao (https://github.com/pytorch/ao/pull/1551)
other
- Torchao setup.py with cmake (https://github.com/pytorch/ao/pull/1490)
Bug Fixes
- Fix bfloat16/float16/float32 options (https://github.com/pytorch/ao/pull/1369)
- Fix a bug in LinearActivationQuantizedTensor (https://github.com/pytorch/ao/pull/1400)
- Fix error message in float8 FSDP utils (https://github.com/pytorch/ao/pull/1423)
- Fixes observer attachment to model based on config for wanda sparsifier (https://github.com/pytorch/ao/pull/1265)
- [resubmit] Gemlite fix (https://github.com/pytorch/ao/pull/1435)
- 🐛 Fix: Memory leak in image processing endpoint (https://github.com/pytorch/ao/pull/1513)
Performance
- [float8] Re-enable slow-accum in the bwd of axis-wise scaling schemes (https://github.com/pytorch/ao/pull/1377)
Documentation
- Update api_ref_quantization.rst (https://github.com/pytorch/ao/pull/1408)
- Update index.rst (https://github.com/pytorch/ao/pull/1409)
- Update QAT READMEs using new APIs (https://github.com/pytorch/ao/pull/1541)
Developers
- Pytorch/ao/torchao/experimental/ops/mps/test (https://github.com/pytorch/ao/pull/1442)
- Verify that submodules are checked out (https://github.com/pytorch/ao/pull/1536)
New Contributors
- @sanchitintel made their first contribution in https://github.com/pytorch/ao/pull/1375
- @philipbutler made their first contribution in https://github.com/pytorch/ao/pull/1337
- @airMeng made their first contribution in https://github.com/pytorch/ao/pull/1401
- @DerekLiu35 made their first contribution in https://github.com/pytorch/ao/pull/1299
- @agrawal-aka made their first contribution in https://github.com/pytorch/ao/pull/1265
- @gmagogsfm made their first contribution in https://github.com/pytorch/ao/pull/1443
- @dongxiaolong made their first contribution in https://github.com/pytorch/ao/pull/1513
Full Changelog: https://github.com/pytorch/ao/compare/v0.7.0...v0.8.0-rc2
Published by jainapurva over 1 year ago
torchao - v0.7.0
Highlights
We are excited to announce the 0.7.0 release of torchao! This release moves QAT out of prototype with improved LoRA support and more flexible APIs, and adds support for new experimental kernels such as Marlin QQQ (for CUDA), int8_dynamic_activation_intx_weight (for ARM CPU), and more!
QAT moved out of prototype, LoRA integration, new flexible APIs (#1020, #1085, #1152, #1037, #1152)
QAT has been moved out of prototype to torchao/quantization/qat to provide better API stability guarantees moving forward. In addition to the existing *QATQuantizer classes, we now also support the more flexible FakeQuantizedLinear and FakeQuantizedEmbedding modules for users to configure the exact quantization settings they wish to use during QAT.
```python
import torch

from torchao.quantization.qat.api import FakeQuantizeConfig
from torchao.quantization.qat.embedding import FakeQuantizedEmbedding
from torchao.quantization.qat.linear import FakeQuantizedLinear

# Specify quantization schemes to use during QAT
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=8)

# Replace nn.Linear and nn.Embedding with these in your model
fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config)
fq_embedding = FakeQuantizedEmbedding(16, 32, weight_config=weight_config)
```
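For intuition, fake quantization rounds values to the quantized grid and immediately maps them back to floating point, so training sees the quantization error while everything stays in high precision. The pure-Python sketch below illustrates symmetric per-group int4 fake quantization; it is not torchao's implementation, and `fake_quantize_groupwise` is a hypothetical helper.

```python
def fake_quantize_groupwise(values, group_size, num_bits=4):
    """Symmetric per-group quantize-then-dequantize; the output stays in floating point."""
    if len(values) % group_size != 0:
        raise ValueError("len(values) must be a multiple of group_size")
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for v in group:
            q = min(max(round(v / scale), qmin), qmax)  # snap to the integer grid
            out.append(q * scale)                        # map back to float
    return out


weights = [0.7, -0.35, 0.1, 0.0, 2.1, -1.4, 0.3, 0.9]
fq_weights = fake_quantize_groupwise(weights, group_size=4)
```

Each group gets its own scale, so the small values in the first group are not forced onto the coarse grid that the larger second group needs.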
We also leveraged the new flexible APIs to build a new QAT + LoRA fine-tuning flow in torchtune. Try it out today!
```bash
tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora
```
Marlin QQQ for CUDA (#1113)
Marlin QQQ is an optimized GPU kernel that supports W4A8 mixed precision GEMM. For more details about Marlin QQQ, please refer to the paper.
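```python
from torchao.dtypes import MarlinQQQLayout
from torchao.quantization import int8_dynamic_activation_int4_weight, quantize_
from torchao.quantization.quant_primitives import MappingType

quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=128,
        mapping_type=MappingType.SYMMETRIC,
        act_mapping_type=MappingType.SYMMETRIC,
        layout=MarlinQQQLayout(),
    ),
)
```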
Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#marlin-qqq.
This is a prototype feature - feel free to try it out!
int8_dynamic_activation_intx_weight Quantization for ARM CPU (#995, #1027, #1254, #1353)
We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon).
```python
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
assert precision == torch.float32, "int8_dynamic_activation_intx_weight requires fp32 precision"

# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")

# Quantize model
nbit = 4
assert nbit >= 1 and nbit <= 8, "nbits must be 1 to 8"
group_size = 128
has_weight_zeros = False
quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        group_size=group_size,
        nbit=nbit,
        has_weight_zeros=has_weight_zeros,
    ),
)
```
Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#int8dynamicactivationintxweight-quantization
We are still trying to figure out how to ship the ARM CPU kernels, so the exact API is subject to change.
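The "dynamic" part of this scheme means that activation quantization parameters are computed at runtime from the activations themselves. A minimal pure-Python sketch follows, assuming asymmetric int8 quantization with one scale and zero point per token; the helper names are hypothetical and the real kernels are implemented very differently.

```python
def quantize_per_token_int8(tokens):
    """Asymmetric int8 quantization with one (scale, zero_point) pair per token (row)."""
    qmin, qmax = -128, 127
    quantized, params = [], []
    for row in tokens:
        # Include 0.0 in the range so that zero is exactly representable.
        lo, hi = min(min(row), 0.0), max(max(row), 0.0)
        scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        zero_point = min(max(round(qmin - lo / scale), qmin), qmax)
        q_row = [min(max(round(v / scale) + zero_point, qmin), qmax) for v in row]
        quantized.append(q_row)
        params.append((scale, zero_point))
    return quantized, params


def dequantize_per_token(quantized, params):
    """Map int8 values back to floats using each token's scale and zero point."""
    return [[(q - zp) * scale for q in row] for row, (scale, zp) in zip(quantized, params)]


activations = [[0.0, 1.0, 2.55], [-4.0, 0.5, 1.1]]
q_acts, q_params = quantize_per_token_int8(activations)
recovered = dequantize_per_token(q_acts, q_params)
```

Because the parameters are recomputed for every token, no calibration data is needed for the activations; only the weights are quantized ahead of time.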
BC Breaking
Rename AQT#2 LayoutType -> Layout (#1049)
Before:
```python
from torchao.dtypes import (
    BlockSparseLayoutType,
    Int4CPULayoutType,
    MarlinQQQLayoutType,
    MarlinSparseLayoutType,
    SemiSparseLayoutType,
    TensorCoreTiledLayoutType,
    UintxLayoutType,
    Float8LayoutType,
    LayoutType,
    PlainLayoutType,
)
```
After:
```python
from torchao.dtypes import (
    BlockSparseLayout,
    Int4CPULayout,
    MarlinQQQLayout,
    MarlinSparseLayout,
    SemiSparseLayout,
    TensorCoreTiledLayout,
    UintxLayout,
    Float8Layout,
    Layout,
    PlainLayout,
)
```
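Code that has to run against releases from both before and after this rename can resolve the class by name at import time. The sketch below shows one possible pattern; `resolve_renamed` is a hypothetical helper, not a torchao API, and stand-in namespaces replace the real `torchao.dtypes` module so the example runs without torchao installed.

```python
from types import SimpleNamespace


def resolve_renamed(module, new_name, old_name):
    """Return module.<new_name>, falling back to module.<old_name> on older releases."""
    if hasattr(module, new_name):
        return getattr(module, new_name)
    if hasattr(module, old_name):
        return getattr(module, old_name)
    raise AttributeError(f"neither {new_name!r} nor {old_name!r} found on {module!r}")


# Stand-ins for an old and a new torchao.dtypes module (illustration only).
old_dtypes = SimpleNamespace(PlainLayoutType="old PlainLayoutType class")
new_dtypes = SimpleNamespace(PlainLayout="new PlainLayout class")

from_old = resolve_renamed(old_dtypes, "PlainLayout", "PlainLayoutType")
from_new = resolve_renamed(new_dtypes, "PlainLayout", "PlainLayoutType")
```

With torchao installed, the same call would look like `resolve_renamed(torchao.dtypes, "PlainLayout", "PlainLayoutType")`.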
QAT imports after move out of prototype (#1091)
Before:
```python
from torchao.quantization.prototype.qat import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.prototype.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.prototype.qat.fake_quantizer import (
    FakeQuantizer,
)
```
After:
```python
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)
from torchao.quantization.qat.linear import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.qat.fake_quantizer import (
    FakeQuantizer,
)
```
New Features
- Add BF16 stochastic rounding option for optimizers (https://github.com/pytorch/ao/pull/1124)
- Add quantize_() API support for NF4 (https://github.com/pytorch/ao/pull/1216)
- Support W4A8 Marlin kernel (https://github.com/pytorch/ao/pull/1113)
Improvements
quantize_
- Add default filtering to remove misaligned weights (https://github.com/pytorch/ao/pull/1194)
- Add tensor parallelism support for int4_weight_only quantization (https://github.com/pytorch/ao/pull/1120)
- Add support for asymmetric act quant for int8 dynamic quant (https://github.com/pytorch/ao/pull/1131)
- Add support for groupwise quantization for int8 weight only quantization (https://github.com/pytorch/ao/pull/1121)
- Add AQT tensor parallel for float8_dynamic_quant (https://github.com/pytorch/ao/pull/1078)
- Int8wo Embedding Quant (https://github.com/pytorch/ao/pull/1167)
- Making sure int4 weight only supports cpu as well (https://github.com/pytorch/ao/pull/1203)
- BF16 support for Quant-LLM kernel (https://github.com/pytorch/ao/pull/1147)
- Add hardware check to fp8 quant (https://github.com/pytorch/ao/pull/1314)
- Add support for quantize_() with Float8Linear module (https://github.com/pytorch/ao/pull/1344)
autoquant
- Added support for Per Tensor Scaling for Float8 Dynamic Autoquant (https://github.com/pytorch/ao/pull/1175)
- Add floating point options for autoquant and add accuracy measurement (https://github.com/pytorch/ao/pull/1355)
benchmarks
- Adding batchsize support for torchao llama benchmarks (https://github.com/pytorch/ao/pull/1182)
- Add capability of benchmarking arbitrary binary (https://github.com/pytorch/ao/pull/1107)
experimental
- Add embedding ops aten (https://github.com/pytorch/ao/pull/1129)
- Add embedding ops executorch (https://github.com/pytorch/ao/pull/1137)
- Add quantized embedding kernels to torchao (https://github.com/pytorch/ao/pull/1018)
- Allow deprecated declarations when using Parallel ExecuTorch (https://github.com/pytorch/ao/pull/1031)
- Introduce lowbit quantized linear MPS kernels (https://github.com/pytorch/ao/pull/954)
- Enable 6-bit kernel (https://github.com/pytorch/ao/pull/1027)
- Kleidi 4b blockwise gemv prototype (https://github.com/pytorch/ao/pull/997)
- Experimental 6-bit quantization for Llama in torchchat (https://github.com/pytorch/ao/pull/1094)
- Introduce 7-bit quantization for Llama in torchchat. (https://github.com/pytorch/ao/pull/1139)
- Executorch Subclass API (#966) (https://github.com/pytorch/ao/pull/995)
- 8-bit packing support (https://github.com/pytorch/ao/pull/1248)
- Experimental Enable 8-bit (https://github.com/pytorch/ao/pull/1254)
- Experimental Benchmarking (https://github.com/pytorch/ao/pull/1353)
optimizer
- [low-bit optim] Upcast everything to FP32 for internal calculations (https://github.com/pytorch/ao/pull/1068)
- [Low-bit optim] Support for dcp.save() and dcp.load() (https://github.com/pytorch/ao/pull/1217)
- Enable CPU Offload for Intel GPU (https://github.com/pytorch/ao/pull/1324)
SAM2
- SAM2.1 copy (https://github.com/pytorch/ao/pull/1172)
- SAM2 AMG server side request batching (https://github.com/pytorch/ao/pull/1197)
- More SAM2-fast server improvements (https://github.com/pytorch/ao/pull/1285)
- SAM2 Fast AMG: memory profiling and more compile (https://github.com/pytorch/ao/pull/1296)
- SAM2 AMG cli and other QoL improvements (https://github.com/pytorch/ao/pull/1336)
- SAM2 AMG cli.py on modal (https://github.com/pytorch/ao/pull/1349)
- Reduce SAM2 AMG cli startup by using deploy (https://github.com/pytorch/ao/pull/1350)
- Reduce startup time for SAM2 AMG by using torch.export (https://github.com/pytorch/ao/pull/1358)
- More batching and improved furious accuracy/performance (https://github.com/pytorch/ao/pull/1253)
- SAM2.1 and example README (https://github.com/pytorch/ao/pull/1048)
- SAM2 AMG example mIoU, perf numbers and more SAM2 model annotations (https://github.com/pytorch/ao/pull/1196)
other
- Add SpinQuant to generate.py (https://github.com/pytorch/ao/pull/1069)
- SpinQuant (https://github.com/pytorch/ao/pull/983)
- SmoothQuant using tensor subclassing (https://github.com/pytorch/ao/pull/1030)
- Expose FakeQuantizeConfigs in QAT quantizers (https://github.com/pytorch/ao/pull/1214)
- Add module-swap UX for INT8 mixed-precision training (https://github.com/pytorch/ao/pull/1179)
- Float8 training: move module attribute setting to sync function (https://github.com/pytorch/ao/pull/1341)
Bug Fixes
- Header bug fix (https://github.com/pytorch/ao/pull/1079)
- Temporary fix for QAT quantizer when linear layer bias is True (https://github.com/pytorch/ao/pull/1087)
- Fix out-of-bounds memory access in Galore dequant kernel (https://github.com/pytorch/ao/pull/1125)
- Fixed weights_only=True load for float8_dynamic_activation_float8_weight in quant_api (https://github.com/pytorch/ao/pull/1122)
- Fix int8_weight_only group_size (https://github.com/pytorch/ao/pull/1165)
- Is_linear fix for MHA (https://github.com/pytorch/ao/pull/1141)
- Fixing eval.py to use GPTQ_MT for gptq (https://github.com/pytorch/ao/pull/1176)
- [CPU offload optim] Fix when there are non-trainable params (https://github.com/pytorch/ao/pull/1210)
- Fix for weights-only load (https://github.com/pytorch/ao/pull/1228)
- Pin nightlies to deal with std::bad_alloc (https://github.com/pytorch/ao/pull/1256)
- Fix 2.5.1 failing sparsity test (https://github.com/pytorch/ao/pull/1261)
- Call narrow only for TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/1207)
- Fix an autoquant bug in flatten/unflatten (https://github.com/pytorch/ao/pull/1288)
- Float8 with delayed scaling: fix autocast handling (https://github.com/pytorch/ao/pull/1306)
- Fix bug with float8 training + FSDP2 + TP (https://github.com/pytorch/ao/pull/1327)
- Float8 training: fix bug with AC + compile (https://github.com/pytorch/ao/pull/1329)
- Fix torchtitan + float8 + delayed + compile (https://github.com/pytorch/ao/pull/1334)
- [low-bit optim] Fix edge cases for FSDP2 integration (https://github.com/pytorch/ao/pull/1269)
- [NF4] .to() fixes (https://github.com/pytorch/ao/pull/1312)
- Check scale.ndim before applying t/transpose (https://github.com/pytorch/ao/pull/1339)
Performance
- Swap in faster uint6 bitpacking function (https://github.com/pytorch/ao/pull/1098)
- Implement more efficient pack and unpack uint5 (https://github.com/pytorch/ao/pull/1138)
- Fix 20x slowdown of FP6 kernel due to device properties query (https://github.com/pytorch/ao/pull/1092)
Documentation
- Add a developer guide for exporting to executorch (https://github.com/pytorch/ao/pull/1219)
- Enable AWQ example on CPU (https://github.com/pytorch/ao/pull/1043)
- Add readme doc for experimental (https://github.com/pytorch/ao/pull/1130)
- Move float8 out of prototype in quantization README (https://github.com/pytorch/ao/pull/1166)
- Update torchao api reference and add contributor guide (https://github.com/pytorch/ao/pull/1255)
- Fix pickle.dump missing file argument typo in README (https://github.com/pytorch/ao/pull/1316)
- Update README.md (https://github.com/pytorch/ao/pull/1319)
- Update README.md: Fix bibtex and sglang links (https://github.com/pytorch/ao/pull/1361)
- Add bibtex (https://github.com/pytorch/ao/pull/1177)
- Clarify torchao.float8 PyTorch version support (https://github.com/pytorch/ao/pull/1191)
Developers
- [Tp Test] Fix the placement of the device tensor (https://github.com/pytorch/ao/pull/1054)
- Skip test_fpx_weight_only in fbcode (https://github.com/pytorch/ao/pull/1056)
- Pin pt nightly CPU version (https://github.com/pytorch/ao/pull/1061)
- Unpin CUDA Nightly (https://github.com/pytorch/ao/pull/1064)
- Update smoke test (https://github.com/pytorch/ao/pull/1111)
- Update regression_test.yml (https://github.com/pytorch/ao/pull/1163)
- Add PyTorch 2.5 to regression test (https://github.com/pytorch/ao/pull/1168)
- Fix Bias APIs, re-enable kleidi tests for arm64 (https://github.com/pytorch/ao/pull/1162)
- Create CITATION.cff (https://github.com/pytorch/ao/pull/1178)
- Unpin nightlies (https://github.com/pytorch/ao/pull/1183)
- [experimental] Kleidi - add operator level tests (https://github.com/pytorch/ao/pull/1173)
- Ruff format and lint (https://github.com/pytorch/ao/pull/1226)
- Update pre-commit to match CI/CD (https://github.com/pytorch/ao/pull/1227)
- Fixing pytest skip for only test_floatx.py (https://github.com/pytorch/ao/pull/1251)
- Fixed invalid url in citation section (https://github.com/pytorch/ao/pull/1348)
- Add to safe globals (https://github.com/pytorch/ao/pull/1171)
- Aqt rename#1 Layout -> TensorImpl (https://github.com/pytorch/ao/pull/1046)
- Move and rename GranularityType -> Granularity (https://github.com/pytorch/ao/pull/1038)
- Change torchao quantization types from int to size_t and preface vars with "preferred" (https://github.com/pytorch/ao/pull/1041)
- Shrink hadamard matrices (https://github.com/pytorch/ao/pull/1051)
- Use ExecuTorch prebuilt library in pip package to build custom kernels (https://github.com/pytorch/ao/pull/1059)
- Update base.h unit to unsigned int (https://github.com/pytorch/ao/pull/962)
- Create header for packed weight ops (https://github.com/pytorch/ao/pull/1072)
- Update cmake files (https://github.com/pytorch/ao/pull/1070)
- Create build_wheels_aarch64_linux.yml (https://github.com/pytorch/ao/pull/1083)
- ROCM binary upload (https://github.com/pytorch/ao/pull/1099)
- Create build_wheels_windows.yml (https://github.com/pytorch/ao/pull/1101)
- Use fewer instructions when unpacking uint6s. (https://github.com/pytorch/ao/pull/1109)
- [CI] XPU binary build enable (https://github.com/pytorch/ao/pull/1105)
- Move common ET/Aten op stuff to ops/library.h (https://github.com/pytorch/ao/pull/1116)
- Move bias from kernel to packed_weights (https://github.com/pytorch/ao/pull/1119)
- Update gpu_sparsity kernel benchmarking script (https://github.com/pytorch/ao/pull/1143)
- [ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1142)
- Move files to prototype/sparsity (https://github.com/pytorch/ao/pull/1145)
- C10::nullopt -> std::nullopt (#1032) (https://github.com/pytorch/ao/pull/1151)
- [reland][ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1150)
- Move float8_aten_api to float8_ops (https://github.com/pytorch/ao/pull/1155)
- Initialize model with meta device for generation benchmarking (https://github.com/pytorch/ao/pull/1144)
- Replace torch.empty with torch.zeros (https://github.com/pytorch/ao/pull/1157)
- Update utils.py (https://github.com/pytorch/ao/pull/1186)
- Remove int_scaled_mm's dependency on triton for cpu (https://github.com/pytorch/ao/pull/128)
- at::optional -> std::optional (#1170) (https://github.com/pytorch/ao/pull/1212)
- fast_flush kwarg of do_bench is removed (https://github.com/pytorch/ao/pull/1222)
- Remove calibration args from generate.py (https://github.com/pytorch/ao/pull/1258)
- Skip marlin QQQ ops test in fbcode (https://github.com/pytorch/ao/pull/1289)
- Fix Marlin QQQ ops test with unittest (https://github.com/pytorch/ao/pull/1294)
- Fix Failing CI - Update bitsandbytes import (https://github.com/pytorch/ao/pull/1343)
- Remove lm_eval warning (https://github.com/pytorch/ao/pull/1347)
- Refactor Affine Quantized Tensor (#1234)
- Move files from quantization/prototype -> prototype/quantization (#1187)
- Add TTFT benchmarks + update sparsity benchmarks (https://github.com/pytorch/ao/pull/1140)
- Add "gemm_input_role" to dunder slots (https://github.com/pytorch/ao/pull/984)
- Add an option to use fp8-all-gather only without fp8 computation. (https://github.com/pytorch/ao/pull/1093)
- Bump version to 0.7 (https://github.com/pytorch/ao/pull/1045)
New Contributors
- @Jack-Khuu made their first contribution in https://github.com/pytorch/ao/pull/1031
- @keyan made their first contribution in https://github.com/pytorch/ao/pull/1041
- @digantdesai made their first contribution in https://github.com/pytorch/ao/pull/997
- @EnragedAntelope made their first contribution in https://github.com/pytorch/ao/pull/962
- @c4lcut3c made their first contribution in https://github.com/pytorch/ao/pull/1094
- @elfisworking made their first contribution in https://github.com/pytorch/ao/pull/1087
- @chuanqi129 made their first contribution in https://github.com/pytorch/ao/pull/1105
- @p4arth made their first contribution in https://github.com/pytorch/ao/pull/1122
- @xuzijian629 made their first contribution in https://github.com/pytorch/ao/pull/1138
- @jeffdaily made their first contribution in https://github.com/pytorch/ao/pull/1142
- @r-barnes made their first contribution in https://github.com/pytorch/ao/pull/1151
- @helunwencser made their first contribution in https://github.com/pytorch/ao/pull/1157
- @bertmaher made their first contribution in https://github.com/pytorch/ao/pull/1222
- @tibidoh made their first contribution in https://github.com/pytorch/ao/pull/1248
- @mandroid6 made their first contribution in https://github.com/pytorch/ao/pull/1250
- @HandH1998 made their first contribution in https://github.com/pytorch/ao/pull/1113
- @readleyj made their first contribution in https://github.com/pytorch/ao/pull/1316
- @22dimensions made their first contribution in https://github.com/pytorch/ao/pull/1318
- @galqiwi made their first contribution in https://github.com/pytorch/ao/pull/1348
- @dbyoung18 made their first contribution in https://github.com/pytorch/ao/pull/1324
- @sunjiweiswift made their first contribution in https://github.com/pytorch/ao/pull/1259
- @merrymercy made their first contribution in https://github.com/pytorch/ao/pull/1361
Full Changelog: https://github.com/pytorch/ao/compare/v0.6.1...v0.7.0-rc1
Published by vkuzo over 1 year ago
torchao - v0.6.1
Highlights
We are excited to announce the 0.6.1 release of torchao! This release adds Auto-Round support, Float8 axiswise scaled training, a BitNet training recipe, an implementation of AWQ and much more!
Auto-Round Support (#581)
Auto-Round is a new weight-only quantization algorithm that has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2 bits and 3 bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.
```python
from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round

prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

quantize_(model, apply_auto_round(), is_target_module)
```
Added float8 training axiswise scaling support with per-gemm-argument configuration (#940)
We added experimental support for rowwise scaled float8 gemm to torchao.float8, with per-gemm-input configurability to enable exploration of various recipes. Here is how a user can configure all-axiswise scaling:
```python
# all-axiswise scaling
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.ALL_AXISWISE)
m = torchao.float8.convert_to_float8_training(m, config=config)

# or, a custom recipe by @lw where grad_weight is left in bfloat16
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.LW_AXISWISE_WITH_GW_HP)
m = torchao.float8.convert_to_float8_training(m, config=config)
```
Early performance benchmarks show that all-axiswise scaling achieves a 1.13x speedup vs bf16 on torchtitan / LLaMa 3 8B / 8 H100 GPUs (compared to 1.17x from all-tensorwise scaling in the same setup), with loss curves that match bf16 and all-tensorwise scaling. Further performance and accuracy benchmarks will follow in future releases.
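The difference between tensorwise and axiswise (rowwise) scaling comes down to how many scales are computed. The pure-Python sketch below assumes the common convention of `scale = fp8_max / amax` with the e4m3 maximum of 448; the helper names are hypothetical and this is not the torchao.float8 code path.

```python
E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format
EPS = 1e-12       # guards against division by zero for all-zero inputs


def tensorwise_scale(matrix):
    """One scale for the whole tensor, derived from its global absolute maximum."""
    amax = max(abs(v) for row in matrix for v in row)
    return E4M3_MAX / max(amax, EPS)


def rowwise_scales(matrix):
    """One scale per row, derived from each row's absolute maximum."""
    return [E4M3_MAX / max(max(abs(v) for v in row), EPS) for row in matrix]


# A matrix with one outlier row: tensorwise scaling must accommodate the outlier,
# so the small row uses only a sliver of the representable range.
x = [[0.5, -0.25, 0.125], [64.0, -32.0, 16.0]]
t_scale = tensorwise_scale(x)
r_scales = rowwise_scales(x)
```

In this example the first row is multiplied by 7 under tensorwise scaling but by 896 under rowwise scaling, so rowwise scaling spreads its values over far more of the float8 range at the cost of computing and applying more scales.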
Introduced BitNet b1.58 training recipe (#930)
Adds a recipe for [BitNet b1.58](https://arxiv.org/abs/2402.17764) ternary weight clamping.
```python
from torchao.prototype.quantized_training import bitnet_training
from torchao import quantize_

model = ...
quantize_(model, bitnet_training())
```
Notably, our implementation uses INT8 Tensor Cores to make up for the speed lost to the extra quantization work. In fact, it is faster than BF16 training in most cases.
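As a rough illustration of ternary weights, the sketch below follows the absmean quantization described in the BitNet b1.58 paper: divide by the mean absolute weight, round, and clamp to {-1, 0, 1}. It is a pure-Python sketch with a hypothetical helper name, not the torchao training recipe itself.

```python
def ternary_quantize(weights, eps=1e-5):
    """Absmean ternary quantization: returns values in {-1, 0, 1} plus the scale."""
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat)  # mean absolute value
    quantized = [
        [min(max(round(w / (scale + eps)), -1), 1) for w in row]
        for row in weights
    ]
    return quantized, scale


w = [[0.9, -0.05, 0.4], [-1.2, 0.02, -0.43]]
w_ternary, w_scale = ternary_quantize(w)
```

Weights much smaller than the mean magnitude collapse to 0, while everything else keeps only its sign; multiplying the ternary matrix by the scale gives the approximation used in place of the original weights.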
[Prototype] Implemented Activation Aware Weight Quantization AWQ (#743)
Perplexity and performance measured on A100 GPU:

| Model | Quantization | Tokens/sec | Throughput (GB/sec) | Peak Mem (GB) | Model Size (GB) |
|--------------------|--------------|------------|---------------------|---------------|-----------------|
| Llama-2-7b-chat-hf | bfloat16 | 107.38 | 1418.93 | 13.88 | 13.21 |
| | awq-hqq-int4 | 196.6 | 761.2 | 5.05 | 3.87 |
| | awq-uint4 | 43.59 | 194.93 | 7.31 | 4.47 |
| | int4wo-hqq | 209.19 | 804.32 | 4.89 | 3.84 |
| | int4wo-64 | 201.14 | 751.42 | 4.87 | 3.74 |
Usage:
```python
import torch

from torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear
from torchao.quantization import quantize_

quant_dtype = torch.uint4
group_size = 64
calibration_limit = 10
calibration_seq_length = 1024

model = model.to(device)
# Insert observers that record activation statistics during calibration
insert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)
with torch.no_grad():
    for batch in calibration_data:
        model(batch.to(device))

# Quantize only the linear layers that were observed
is_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)
quantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)
```
New Features
- [Prototype] Added Float8 support for AQT tensor parallel (#1003)
- Added composable QAT quantizer (#938)
- Introduced torchchat quantizer (#897)
- Added INT8 mixed-precision training (#748)
- Implemented sparse marlin AQT layout (#621)
- Added a PerTensor static quant api (#787)
- Introduced uintx quant to generate and eval (#811)
- Added Float8 Weight Only and FP8 weight + dynamic activation (#740)
- Implemented Auto-Round support (#581)
- Added 2, 3, 4, 5 bit custom ops (#828)
- Introduced symmetric quantization with no clipping error in the tensor subclass based API (#845)
- Added int4 weight-only embedding QAT (#947)
- Added support for 1-bit and 6-bit quantization for Llama in torchchat (#910, #1007)
- Added a linear_observer class for doing static activation calibration (#807)
- Exposed hqq through uintx_weight_only API (#786)
- Added RowWise scaling option for Float8 dynamic activation quantization (#819)
- Added Float8 weight only to autoquant api (#866)
Improvements
- Enhanced Auto-Round functionality (#870)
- Improved FSDP support for low-bit optimizers (#538)
- Added support for using AffineQuantizedTensor with weights_only=True for torch.load (#630)
- Optimized 3-bit packing (#1029)
- Added more evaluation metrics to llama/eval.sh (#934)
- Improved eager numerics for dynamic scales in float8 (#904)
Bug fixes
- Fixed inference_mode issues (#885)
- Fixed failing FP6 benchmark (#931)
- Resolved various issues with float8 support (#918, #923)
- Fixed load state dict when device is different for low-bit optim (#1021)
Performance
- Added SM75 (Turing) support for FP6 kernel (#942)
- Implemented int8 dynamic quant + bsr support (#821)
- Added workaround to recover the perf for quantized vit in torch.compile (#926)
INT8 Mixed-Precision Training
On NVIDIA GPUs, INT8 Tensor Cores are approximately 2x faster than their BF16/FP16 counterparts. In mixed-precision training, we can down-cast activations and weights dynamically to INT8 to leverage faster matmuls. However, since INT8 has a very limited range [-128,127], we perform row-wise quantization, similar to how INT8 post-training quantization (PTQ) is done. The stored weight is still kept in its original precision.
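To make the row-wise idea concrete, here is a minimal pure-Python sketch of symmetric row-wise INT8 quantization. It is an illustration only and not torchao's implementation: the function names are hypothetical, and the choice of mapping each row's largest magnitude to 127 is an assumption about the scheme.

```python
def quantize_rowwise_int8(matrix):
    """Symmetric row-wise INT8 quantization of a list-of-lists matrix.

    Returns (int8_rows, scales) such that int8_rows[i][j] * scales[i]
    approximates matrix[i][j]. Hypothetical helper for illustration.
    """
    int8_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(x) for x in row)
        # Map the largest magnitude in the row to 127; guard all-zero rows.
        scale = absmax / 127 if absmax > 0 else 1.0
        int8_rows.append([max(-128, min(127, round(x / scale))) for x in row])
        scales.append(scale)
    return int8_rows, scales


def dequantize_rowwise(int8_rows, scales):
    """Recover approximate float values from INT8 rows and per-row scales."""
    return [[q * s for q in row] for row, s in zip(int8_rows, scales)]


# Rows with very different magnitudes each get their own scale, so the
# small-valued row keeps as much resolution as the large-valued one.
weights = [[0.01, -0.02, 0.03], [10.0, -20.0, 30.0]]
q, scales = quantize_rowwise_int8(weights)
restored = dequantize_rowwise(q, scales)
print(q)
print(scales)
```

With a single scale for the whole matrix, the first row would round to almost nothing; with one scale per row, both rows use the full INT8 range.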
```python
from torchao.prototype.quantized_training import int8_mixed_precision_training, Int8MixedPrecisionTrainingConfig
from torchao.quantization import quantize_

model = ...

# apply INT8 matmul to all 3 matmuls
quantize_(model, int8_mixed_precision_training())

# customize which matmul is left in original precision.
config = Int8MixedPrecisionTrainingConfig(
    output=True,
    grad_input=True,
    grad_weight=False,
)
quantize_(model, int8_mixed_precision_training(config))
```

**End2end speed benchmark** using `benchmarks/quantized_training/pretrain_llama2.py`
| Model & GPU | bs x seq_len | Config | Tok/s | Peak mem (GB) |
|-----|-----|-----|-----|-----|
| Llama2-7B, A100 | 8 x 2048 | BF16 (baseline) | ~4400 | 59.69 |
| Llama2-7B, A100 | 8 x 2048 | INT8 mixed-precision | ~6100 (+39%) | 58.28 |
| Llama2-1B, 4090 | 16 x 2048 | BF16 (baseline) | ~17,900 | 18.23 |
| Llama2-1B, 4090 | 16 x 2048 | INT8 mixed-precision | ~30,700 (+72%) | 18.34 |
Docs
- Updated README with more current float8 speedup information (#816)
- Added tutorial for trainable tensor subclass (#908)
- Improved documentation for float8 unification and inference (#895, #896)
Devs
- Added compile tests to test suite (#906)
- Improved CI setup and build processes (#887)
- Added M1 wheel support (#822)
- Added more benchmarking and profiling tools (#1017)
- Renamed `fpx` to `floatx` (#877)
- Removed torchao_nightly package (#661)
- Added more lint fixes (#827)
- Added better subclass testing support (#839)
- Added CI to catch syntax errors (#861)
- Added tutorial on composing quantized subclass w/ Dtensor based TP (#785)
Security
No significant security updates in this release.
Untopiced
- Added basic SAM2 AutomaticMaskGeneration example server (#1039)
New Contributors
- @iseeyuan made their first contribution in https://github.com/pytorch/ao/pull/805
- @YihengBrianWu made their first contribution in https://github.com/pytorch/ao/pull/860
- @kshitij12345 made their first contribution in https://github.com/pytorch/ao/pull/863
- @ZainRizvi made their first contribution in https://github.com/pytorch/ao/pull/887
- @alexsamardzic made their first contribution in https://github.com/pytorch/ao/pull/899
- @vaishnavi17 made their first contribution in https://github.com/pytorch/ao/pull/911
- @tobiasvanderwerff made their first contribution in https://github.com/pytorch/ao/pull/931
- @kwen2501 made their first contribution in https://github.com/pytorch/ao/pull/937
- @y-sq made their first contribution in https://github.com/pytorch/ao/pull/912
- @jimexist made their first contribution in https://github.com/pytorch/ao/pull/969
- @danielpatrickhug made their first contribution in https://github.com/pytorch/ao/pull/914
- @ramreddymounica made their first contribution in https://github.com/pytorch/ao/pull/1007
- @yushangdi made their first contribution in https://github.com/pytorch/ao/pull/1006
- @ringohoffman made their first contribution in https://github.com/pytorch/ao/pull/1023
Full Changelog: https://github.com/pytorch/ao/compare/v0.5.0...v0.6.1
Published by drisspg over 1 year ago
torchao - v0.5.0
Highlights
We are excited to announce the 0.5 release of torchao! This release adds support for memory efficient inference, float8 training and inference, int8 quantized training, HQQ, automatic mixed-precision quantization through bayesian optimization, sparse marlin, and integrations with HuggingFace, SGLang, and diffusers.
Memory Efficient Inference Support https://github.com/pytorch/ao/pull/738
We've added support for Llama 3.1 to the llama benchmarks in TorchAO and added new features and improvements as a proof of concept for memory efficient inference. These additions allow us to do 130k context length inference with Llama 3.1-8B with only 18.91 GB of memory if we combine kv cache quantization, int4 weight only quantization and a linear causal mask.
General savings depend on technique and context length as can be seen in the following graph:
Float8 Training https://github.com/pytorch/ao/pull/551
torchao.float8 implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.
With torch.compile on, current results show throughput speedups of up to 1.5x on 128 H100 GPU LLaMa 3 70B pretraining jobs (details)
```python
from torchao.float8 import convert_to_float8_training
convert_to_float8_training(m, module_filter_fn=...)
```
For a minimal end-to-end training recipe of pretraining with float8, you can check out torchtitan.
Float8 Inference https://github.com/pytorch/ao/pull/740 https://github.com/pytorch/ao/pull/819
We have introduced two new quantization APIs for Float8 inference:
Float8 Weight-Only Quantization: A new quant_api `float8_weight_only()` has been added to apply float8 weight-only symmetric per-channel quantization to linear layers.
Float8 Dynamic Activation and Weight Quantization: A new quant_api `float8_dynamic_activation_float8_weight()` has been introduced to apply float8 dynamic symmetric quantization to both activations and weights of linear layers. It uses PerTensor scaling by default. We have also added an option to do PerRow scaling of both activations and weights. By computing scales at a finer granularity, it can potentially reduce the overall quantization error and increase performance by reducing dynamic quantization overhead.
Example usage:
```python
import torch
from torchao.quantization import quantize_, float8_weight_only, float8_dynamic_activation_float8_weight, PerRow

# Create a model
model = YourModel()

# Apply float8 weight-only quantization
quantize_(model, float8_weight_only())

# Apply float8 dynamic activation and weight quantization
quantize_(model, float8_dynamic_activation_float8_weight())

# Apply PerRow scaling to weight and activations
quantize_(linear_module, float8_dynamic_activation_float8_weight(granularity=PerRow()))
```
Notes:
- These new APIs are designed to work with PyTorch 2.5 and later versions.
- float8_dynamic_activation_float8_weight requires CUDA devices with compute capability 8.9 or higher for hardware acceleration.
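The difference between PerTensor and PerRow scaling can be sketched in a few lines of plain Python. This is a hedged illustration, not torchao's code: the helper names are hypothetical, and it assumes the float8 e4m3fn format (largest finite value 448) with the common convention `scale = absmax / max_representable`.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def per_tensor_scale(matrix):
    """One scale for the whole tensor, set by its single largest magnitude."""
    absmax = max(abs(x) for row in matrix for x in row)
    return absmax / FP8_E4M3_MAX


def per_row_scales(matrix):
    """One scale per row, each set by that row's largest magnitude."""
    return [max(abs(x) for x in row) / FP8_E4M3_MAX for row in matrix]


# The second row has much larger values than the first.
activations = [[0.5, -1.0, 0.25], [100.0, -300.0, 200.0]]

tensor_scale = per_tensor_scale(activations)
row_scales = per_row_scales(activations)
print("per-tensor scale:", tensor_scale)
print("per-row scales:  ", row_scales)
```

Under the per-tensor scale, the small first row is divided by a scale chosen for the large second row, so it occupies only a sliver of the float8 range. Per-row scales let each row use the full range, which is where the potential reduction in quantization error comes from.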
Int8 quantized training #644 #748
@gau-nernst introduced 2 experimental works on training using INT8.
- INT8 quantized training (#644): weight is quantized to INT8 during the whole duration of training to save memory. Compute remains in high precision. To train the model effectively with only quantized weights, we use stochastic rounding for weight update. Right now, memory saving is not too competitive compared to compiled BF16 baseline.
- INT8 mixed-precision training (#748): weight is kept in the original high precision, but weight and activation are dynamically quantized to INT8 during training to utilize INT8 tensor cores. We observe up to 70% speedup for Llama2 pre-training on 4090, and 20% speedup for Llama3 pre-training on 8x A100 with FSDP2.
```python
from torchao.quantization import quantize_
from torchao.prototype.quantized_training import int8_weight_only_quantized_training, int8_mixed_precision_training

model = YourModel()

# apply INT8 quantized training
quantize_(model, int8_weight_only_quantized_training())

# apply INT8 mixed-precision training
quantize_(model, int8_mixed_precision_training())
```
For more information and benchmark results, see README and the respective PR (#644 and #748)
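The role of stochastic rounding in INT8 quantized training can be seen in a small standalone sketch. It is not the torchao kernel: `stochastic_round` is a hypothetical helper operating on plain Python floats, and the update size of 0.1 is an arbitrary value chosen for illustration.

```python
import math
import random


def stochastic_round(x, rng=random):
    """Round x to a neighbouring integer at random.

    The probability of rounding up equals the fractional part of x,
    so the expected value of the result equals x.
    """
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if rng.random() < frac else 0)


rng = random.Random(0)
steps = 10_000
update = 0.1  # a weight update much smaller than one quantization step

# Round-to-nearest discards every small update, so the weight never moves.
nearest_total = sum(round(update) for _ in range(steps))

# Stochastic rounding applies the update on average, so progress accumulates.
stochastic_total = sum(stochastic_round(update, rng) for _ in range(steps))

print("round-to-nearest total:", nearest_total)
print("stochastic total:      ", stochastic_total, "(expected about", steps * update, ")")
```

This is why weights stored only in INT8 can still be trained: updates smaller than one quantization step are not systematically lost.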
HQQ Integration in torchao https://github.com/pytorch/ao/pull/605 https://github.com/pytorch/ao/pull/786
hqq has been added to existing torchao APIs. It improves model accuracy and leverages the existing efficient kernels in torchao. We enabled hqq for the int4_weight_only API:
quantize_(model, int4_weight_only(group_size, use_hqq=True))
We also added this to the uintx api for accuracy experiments (current uintx kernels are slow):
quantize_(model, uintx_weight_only(torch.uint2, group_size, use_hqq=True))
Automatic Mixed-Precision Quantization through Bayesian Optimization https://github.com/pytorch/ao/pull/592, https://github.com/pytorch/ao/pull/694
We provided a Bayesian Optimization (BO) tool leveraging Ax to auto search mixed-precision weight-only quantization configuration, i.e., bit width and group size of intN_weight_only(bit_width, group_size) for each layer. It also includes a sensitivity analysis tool to calculate layer-wise average Hessian trace and average fisher information matrix trace, which is an optional step to customize and improve BO search.
To optimize for model accuracy under a model size constraint (GB):
python BO_acc_modelsize.py --checkpoint=/tmp/Meta-Llama-3-8B --model_size_constraint=6.0
To optimize for inference throughput under a model perplexity constraint:
python BO_acc_throughput.py --checkpoint=/tmp/Meta-Llama-3-8B --ppl_constraint=7.5
For more detailed usage, please refer to this README. On the Llama3-8B model, the mixed-precision quantization found by this tool reduces model size by 20.1% with a 2.8% perplexity reduction, and improves inference throughput by 15.1% with a 3.2% perplexity reduction, compared to int8 uniform quantization.
Sparse Marlin https://github.com/pytorch/ao/pull/621, https://github.com/pytorch/ao/pull/733
@Diogo-V added support for sparse-marlin, a W4AFP16 2:4 sparse kernel, to TorchAO.
On Meta Llama 3, we observe a 25% tok/s increase (180 -> 226) compared to our existing int4-wo implementation.
```python
from torchao.quantization.quant_api import quantize_, int4_weight_only
from torchao.dtypes import MarlinSparseLayoutType
quantize_(model, int4_weight_only(layout_type=MarlinSparseLayoutType()))
```
| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ----------------------- | ------------- | ----------------------- | ---------------- | --------------- |
| Llama-3-8B | Base (bfloat16) | 95.64 | 1435.54 | 16.43 | 15.01 |
| | int8dq | 8.61 | 64.75 | 9.24 | 7.52 |
| | int8wo | 153.03 | 1150.80 | 10.42 | 7.52 |
| | int4wo-64 | 180.80 | 763.33 | 6.88 | 4.22 |
| | int4wo-64-sparse-marlin | 226.02 | 689.20 | 5.32 | 3.05 |
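The columns of this table are related by simple arithmetic. The sketch below, which only uses numbers copied from the table, checks that the reported memory bandwidth is close to tokens/second multiplied by model size, and recomputes the sparse-marlin speedup. The relationship is an observation about these figures, not a statement of how the benchmark script computes them.

```python
# (tokens/second, model size in GB, reported memory bandwidth in GB/s),
# copied from the table above.
rows = {
    "Base (bfloat16)": (95.64, 15.01, 1435.54),
    "int8dq": (8.61, 7.52, 64.75),
    "int8wo": (153.03, 7.52, 1150.80),
    "int4wo-64": (180.80, 4.22, 763.33),
    "int4wo-64-sparse-marlin": (226.02, 3.05, 689.20),
}


def estimated_bandwidth(tokens_per_second, model_size_gb):
    """Each generated token reads the weights once: GB/s ~= tok/s * GB."""
    return tokens_per_second * model_size_gb


for name, (tok_s, size_gb, reported) in rows.items():
    est = estimated_bandwidth(tok_s, size_gb)
    print(f"{name:26s} estimated {est:8.2f} GB/s, reported {reported:8.2f} GB/s")

# Relative tok/s gain of sparse-marlin over the existing int4 weight-only path.
sparse_marlin_speedup = rows["int4wo-64-sparse-marlin"][0] / rows["int4wo-64"][0] - 1
print(f"sparse-marlin vs int4wo-64: {sparse_marlin_speedup:+.0%} tok/s")
```

Read this way, a smaller model can deliver more tokens per second even while using less of the available memory bandwidth, which matches the last two rows.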
HuggingFace Integration
torchao is integrated into huggingface: https://huggingface.co/docs/transformers/main/en/quantization/torchao. You can now use int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight through TorchAoConfig in huggingface. This is currently available in the huggingface main branch only.
SGLang Integration
torchao is also integrated into sglang (https://github.com/sgl-project/sglang/pull/1341) for llama3 models. You can try it out with:
python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128
Supported configurations are ["int4wo-
diffusers Integration
diffusers-torchao provides end-to-end inference and experimental training recipes to use torchao with diffusers in this repo. We demonstrate 53.88% speedup on Flux.1-Dev* and 27.33% speedup on CogVideoX-5b when comparing compiled quantized models against their standard bf16 counterparts.
BC Breaking
Add layout option to woq int4 api https://github.com/pytorch/ao/pull/670
```
# torchao 0.4.0
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(inner_k_tiles=8))

# torchao 0.5.0
from torchao.dtypes import TensorCoreTiledLayoutType
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(layout_type=TensorCoreTiledLayoutType(inner_k_tiles=8)))
```
Refactor QAT to use tensor subclasses https://github.com/pytorch/ao/pull/585
We refactored QAT to use tensor subclasses instead of module swap. This works well with torchtune and FSDP2, but currently lacks support for FSDP1 and DDP. As a fallback for these distribution strategies, please continue to use the old module swap flows.
```
# torchao 0.4.0: This uses the module swap flow
# torchao 0.5.0 + FSDP2: This uses the tensor subclass flow
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer
quantizer = Int8DynActInt4WeightQATQuantizer()
model = quantizer.prepare(model)
train(model)
model = quantizer.convert(model)

# torchao 0.5.0 + DDP or FSDP1: This uses the module swap flow
from torchao.quantization.prototype.qat._module_swap_api import Int8DynActInt4WeightQATQuantizerModuleSwap
quantizer = Int8DynActInt4WeightQATQuantizerModuleSwap()
model = quantizer.prepare(model)
train(model)
model = quantizer.convert(model)
```
Deprecations
New Features
- Optimizer CPU offload for single GPU training https://github.com/pytorch/ao/pull/584
- Add support for save quantized checkpoint in llama code https://github.com/pytorch/ao/pull/553
- Intx quantization tensor subclass https://github.com/pytorch/ao/pull/468
- Add superblock to sparse/prototype https://github.com/pytorch/ao/pull/660
- Add AffineQuantizedObserver https://github.com/pytorch/ao/pull/650
- Add BSR subclass + torch.compile and clean up superblock https://github.com/pytorch/ao/pull/680
- Add HQQ support https://github.com/pytorch/ao/pull/605
- Add performance profiler https://github.com/pytorch/ao/pull/690
- Add experimental INT8 quantized training https://github.com/pytorch/ao/pull/644
- Add high-level operator interface https://github.com/pytorch/ao/pull/708
- Add sparse marlin 2:4 gemm op https://github.com/pytorch/ao/pull/733
- Example for GPTQ-like calibration flow https://github.com/pytorch/ao/pull/721
- Llama3.1 and KV cache quantization https://github.com/pytorch/ao/pull/738
- Add float8 weight only and weight + dynamic activation https://github.com/pytorch/ao/pull/740
- Add Auto-Round support https://github.com/pytorch/ao/pull/581
Mixed-Precision Quantization
- Add sensitivity analysis tool for layer-wise FIT and Hessian trace https://github.com/pytorch/ao/pull/592
- Bayesian optimization tool for mixed precision quantization https://github.com/pytorch/ao/pull/694
Improvements
- Move sam eval from `scripts` to `torchao/_models` https://github.com/pytorch/ao/pull/591
- QOL improvements to float8 gemm benchmark https://github.com/pytorch/ao/pull/596
- Move lowbit universal kernels from torchaccel to torchao https://github.com/pytorch/ao/pull/582
- Refactor autoquant to use AQT https://github.com/pytorch/ao/pull/609
- Add support for using AffineQuantizedTensor with `weights_only=True` https://github.com/pytorch/ao/pull/630
- Move Uintx out of prototype for future extension https://github.com/pytorch/ao/pull/635
- Refactor `_quantized_linear` for better extensibility https://github.com/pytorch/ao/pull/634
- Update micro benchmarking code for AQT https://github.com/pytorch/ao/pull/673
- Refactor superblock code + add final benchmark/eval scripts https://github.com/pytorch/ao/pull/691
- Relax QAT dtype assertion https://github.com/pytorch/ao/pull/692
- Add option to move param to `device` before quantization https://github.com/pytorch/ao/pull/699
- Add gpu benchmarking script https://github.com/pytorch/ao/pull/192
- Enable `to(device=device_name)` for `Uintx` https://github.com/pytorch/ao/pull/722
- Make torchao's llama model trainable https://github.com/pytorch/ao/pull/728
- Specify output dtype to `torch.float32` in `_foreach_norm` https://github.com/pytorch/ao/pull/727
- Add semi-structured sparsity to hf eval https://github.com/pytorch/ao/pull/576
- Use `torch.uint1` to `torch.uint7` for Uintx tensor subclass https://github.com/pytorch/ao/pull/672
- Add AdamW to `CPUOffloadOptimizer` default https://github.com/pytorch/ao/pull/742
- Make developer experience better for extending AQT https://github.com/pytorch/ao/pull/749
- Add back QAT module swap API https://github.com/pytorch/ao/pull/762
- Refactor quant_llm to work with affine quantized tensor https://github.com/pytorch/ao/pull/772
- Move iOS benchmarking infra code to torchao https://github.com/pytorch/ao/pull/766
- Add CPU bandwidth benchmark https://github.com/pytorch/ao/pull/773
- Update method names to support intx and floatx changes https://github.com/pytorch/ao/pull/775
- Add implementation for torchao::parallel_for backends https://github.com/pytorch/ao/pull/774
- Add Llama2-7B finetune benchmarks for low-bit optimizers https://github.com/pytorch/ao/pull/746
- Fix Adam4bit support on PyTorch 2.3 and 2.4 and update AdamFp8 torch requirement https://github.com/pytorch/ao/pull/755
- Improve compile time + fix PyTorch 2.3 support for 4-bit optim https://github.com/pytorch/ao/pull/812
- Allow quantized linear registration in a different file https://github.com/pytorch/ao/pull/783
- Add 2bit, 5bit packing routines https://github.com/pytorch/ao/pull/797, https://github.com/pytorch/ao/pull/798
- Freeze dataclass in nf4, prep for better pt2 support https://github.com/pytorch/ao/pull/799
- Format and lint nf4 file and test https://github.com/pytorch/ao/pull/800
- Move more utils to TorchAOBaseTensor https://github.com/pytorch/ao/pull/784
- Add more information to quantized linear module and added some logs https://github.com/pytorch/ao/pull/782
- Add int4 mode to autoquant https://github.com/pytorch/ao/pull/804
- Add uintx quant to generate and eval https://github.com/pytorch/ao/pull/811
- Move non-NF4 tensor to device prior to quantization on copy https://github.com/pytorch/ao/pull/737
Static quantization
- Add float8 static quant support https://github.com/pytorch/ao/pull/787
- Update how block_size is calculated with Observers https://github.com/pytorch/ao/pull/815
- Add a linear observer class and test https://github.com/pytorch/ao/pull/807
Float8
- Update benchmarks to be more useful for smaller shapes https://github.com/pytorch/ao/pull/615
- Remove unneeded kernel for scale generation https://github.com/pytorch/ao/pull/616
- Filter out microbenchmarking overhead in profiling script https://github.com/pytorch/ao/pull/629
- Save torch_logs, and attach them to profiling trace https://github.com/pytorch/ao/pull/645
- Add option for gpu time in GEMM benchmarks https://github.com/pytorch/ao/pull/666
- Add roofline estimation of GEMM + overhead https://github.com/pytorch/ao/pull/668
- Make roofline utils reusable https://github.com/pytorch/ao/pull/731
- Use `torch.compiler.is_compiling` https://github.com/pytorch/ao/pull/739
- Float8 support in AQT https://github.com/pytorch/ao/pull/671
- Add static scaling for float8 training https://github.com/pytorch/ao/pull/760
- Make roofline script calculate observed overhead https://github.com/pytorch/ao/pull/734
- Make Inference and training code independent https://github.com/pytorch/ao/pull/808
- Add rowwise scaling option to float8 dynamic quant https://github.com/pytorch/ao/pull/819
Bug fixes
- Fix all-gather in 2D with DTensor (WeightWithDynamicFloat8CastTensor) https://github.com/pytorch/ao/pull/590
- Fix FP6-LLM API and add `.to(device)` op https://github.com/pytorch/ao/pull/595
- Fix linear_activation_tensor dynamic quant https://github.com/pytorch/ao/pull/622
- Fix bug with float8 inference_mode https://github.com/pytorch/ao/pull/659
- Quantization kernel bug fixes https://github.com/pytorch/ao/pull/717
- Cast `local_scale_tensor` to fp32 for precompute of float8 dynamic scaling https://github.com/pytorch/ao/pull/713
- Fix affine quantized tensor to device calls https://github.com/pytorch/ao/pull/726
- Small fix for micro benchmark code https://github.com/pytorch/ao/pull/711
- Fix LR schedule handling for low-bit optimizers https://github.com/pytorch/ao/pull/736
- Fix FPX inductor error https://github.com/pytorch/ao/pull/790
- Fixed llama model inference https://github.com/pytorch/ao/pull/769
Docs
- Add QAT README https://github.com/pytorch/ao/pull/597
- Update serialization.rst to include `get_model_size_in_bytes` import https://github.com/pytorch/ao/pull/604
- Clarify details around `unwrap_tensor_subclass` in README.md https://github.com/pytorch/ao/pull/618, https://github.com/pytorch/ao/pull/619
- Spelling fixes https://github.com/pytorch/ao/pull/662
- Move developer guide file to a folder https://github.com/pytorch/ao/pull/681
- Update docs on how to use AUTOQUANT_CACHE https://github.com/pytorch/ao/pull/649
- Update pip install command in README https://github.com/pytorch/ao/pull/723
- Fix docstring args names https://github.com/pytorch/ao/pull/735
- Update README example with correct import of `sparsify_` https://github.com/pytorch/ao/pull/741
- Update main and quantization README https://github.com/pytorch/ao/pull/745, https://github.com/pytorch/ao/pull/747, https://github.com/pytorch/ao/pull/757
- Add README for mixed-precision search tool and code refactor https://github.com/pytorch/ao/pull/776
- Add performance section to float8 README.md https://github.com/pytorch/ao/pull/794
- Make float8 README.md examples standalone https://github.com/pytorch/ao/pull/809
- Add KV cache quantization to READMEs https://github.com/pytorch/ao/pull/813
- Update main README.md with more current float8 speedup https://github.com/pytorch/ao/pull/816
Not user facing
- Fix float8 inference tests and add export test https://github.com/pytorch/ao/pull/613
- Reduce atol/rtol for stable tests https://github.com/pytorch/ao/pull/617
- Fix version guard in https://github.com/pytorch/ao/pull/620, https://github.com/pytorch/ao/pull/679, https://github.com/pytorch/ao/pull/684
- Fix BC for QAT location https://github.com/pytorch/ao/pull/626
- Enable float8 CI on sm89 https://github.com/pytorch/ao/pull/587
- Fix Inductor bench BC change https://github.com/pytorch/ao/pull/638, https://github.com/pytorch/ao/pull/641
- Add CUDA compute capability compile guard https://github.com/pytorch/ao/pull/636
- Remove numpy as bitpack dependency https://github.com/pytorch/ao/pull/677
- Add PyTorch 2.4 tests in CI https://github.com/pytorch/ao/pull/654
- Remove torchao_nightly package https://github.com/pytorch/ao/pull/661
- Update licenses in torchao/experimental https://github.com/pytorch/ao/pull/720
- Add lint checks for float8 inference https://github.com/pytorch/ao/pull/779
New Contributors
- @sayakpaul made their first contribution in https://github.com/pytorch/ao/pull/604
- @metascroy made their first contribution in https://github.com/pytorch/ao/pull/582
- @raziel made their first contribution in https://github.com/pytorch/ao/pull/618
- @nmacchioni made their first contribution in https://github.com/pytorch/ao/pull/641
- @Diogo-V made their first contribution in https://github.com/pytorch/ao/pull/670
- @mobicham made their first contribution in https://github.com/pytorch/ao/pull/605
- @crcrpar made their first contribution in https://github.com/pytorch/ao/pull/703
- @ebsmothers made their first contribution in https://github.com/pytorch/ao/pull/737
- @a-r-r-o-w made their first contribution in https://github.com/pytorch/ao/pull/741
- @kimishpatel made their first contribution in https://github.com/pytorch/ao/pull/766
We were able to close about 70% of tasks for 0.5.0; the remaining tasks will now spill over into upcoming releases. We will post a list for 0.6.0 next, which we aim to release at the end of September 2024. We want to follow a monthly release cadence until further notice.
Full Changelog: https://github.com/pytorch/ao/compare/v0.4.0...v0.5.0-rc1
Published by andrewor14 over 1 year ago
torchao - v0.4.0
v0.4.0
Highlights
We are excited to announce the 0.4 release of torchao! This release adds support for KV cache quantization, quantization aware training (QAT), low bit optimizer support, composing quantization and sparsity, and more!
KV cache quantization (https://github.com/pytorch/ao/pull/532)
We've added support for KV cache quantization, showing a peak memory reduction from 19.7 -> 19.2 GB on Llama3-8B at an 8192 context length. We plan to investigate Llama3.1 next.
Quantization-Aware Training (QAT) (#383, #555)
We now support two QAT schemes for linear layers: Int8 per token dynamic activations + int4 per group weights, and int4 per group weights (using the efficient tinygemm int4 kernel after training). Users can access this feature by transforming their models before and after training using the appropriate quantizer, for example:
```python
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Quantizer for int8 dynamic per token activations +
# int4 grouped per channel weights, only for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics during
# training without performing any dtype casting
model = qat_quantizer.prepare(model)

# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)
```
Initial evaluation results indicate that QAT in torchao can recover up to 96% of quantized accuracy degradation on hellaswag and up to 68% of quantized perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). For more details, please refer to the README and this blog post.
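What a "fake quantize" operation does can be shown with a small pure-Python sketch: values are quantized and immediately dequantized, so they stay in floating point but carry the rounding error of the target format. This is an illustration under stated assumptions, not torchao's implementation: the function name is hypothetical, and it assumes symmetric int4 with one scale per group.

```python
def fake_quantize_int4_groups(weights, group_size=4):
    """Quantize to symmetric int4 per group, then dequantize right away.

    The output is still a list of floats (no dtype change), but each value
    has been snapped to one of the 16 levels representable in its group.
    """
    qmin, qmax = -8, 7
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        # One scale per group; guard against all-zero groups.
        scale = absmax / qmax if absmax > 0 else 1.0
        for w in group:
            q = max(qmin, min(qmax, round(w / scale)))  # integer level
            out.append(q * scale)                       # back to float
    return out


weights = [0.12, -0.70, 0.33, 0.01, 7.0, -3.2, 1.1, 0.2]
simulated = fake_quantize_int4_groups(weights, group_size=4)
for w, s in zip(weights, simulated):
    print(f"{w:6.2f} -> {s:8.4f}")
```

During training the model sees these rounded values in its forward pass, which is how it can adapt to the quantization error before the real conversion step.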
Composing quantization and sparsity (#457, #473)
We've added support for composing int8 dynamic quantization with 2:4 sparsity, using the quantize_ API. We also added SAM benchmarks that show a 7% speedup over standalone sparsity / int8 dynamic quantization here.
```python
from torchao.quantization import quantize_, int8_dynamic_activation_int8_semi_sparse_weight
quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())
```
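For readers unfamiliar with the 2:4 pattern, the following sketch shows what it means for a row of weights: in every contiguous group of four values, two are zero. The helper is hypothetical, and choosing the two values to keep by magnitude is an assumption made here for illustration; it does not describe how torchao or the underlying kernels select them.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.25, 0.01]
sparse_row = prune_2_4(row)
print(sparse_row)
```

Because exactly half of each group is zero, the remaining values can be stored compactly, and int8 quantization can then be applied to the values that are kept.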
Community Contributions
low-bit optimizer support (#478, #463, #482, #484, #538)
@gau-nernst added implementations for 4-bit, 8-bit, and FP8 Adam with FSDP2/FSDP support. Our API is a drop-in replacement for torch.optim.Adam and can be used as follows:
```python
from torchao.prototype.low_bit_optim import Adam8bit, Adam4bit, AdamFp8
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8

model = ...
optim = Adam8bit(model.parameters())  # replace with Adam4bit and AdamFp8 for the 4 / fp8 versions
```
For more information about low bit optimizer support please refer to our README.
Improvements to 4-bit quantization (https://github.com/pytorch/ao/pull/517, https://github.com/pytorch/ao/pull/552, https://github.com/pytorch/ao/pull/544, #479 )
@bdhirsh @jeromeku @yanbing-j @manuelcandales @larryliu0820 added torch.compile support for NF4 Tensor, custom CUDA int4 tinygemm unpacking ops, and several bugfixes to torchao
BC breaking
* `quantize` has been renamed to `quantize_` https://github.com/pytorch/ao/pull/467
```python
# for torchao 0.4
from torchao.quantization import quantize_, int8_weight_only
quantize_(model, int8_weight_only())

# for torchao 0.3
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())
```
* `apply_sparse_semi_structured` has been deprecated in favor of `sparsify_` which matches the `quantize_` API https://github.com/pytorch/ao/pull/473
```python
# for torchao 0.4
from torchao.sparsity import sparsify_, semi_sparse_weight
sparsify_(model, semi_sparse_weight())

# for torchao 0.3
from torchao.sparsity import apply_sparse_semi_structured
apply_sparse_semi_structured(model)
```
Deprecations
New Features
- Added kv_cache quantization https://github.com/pytorch/ao/pull/532
- Migrated float8_experimental to `torchao.float8`, enabling float8 training support https://github.com/pytorch/ao/pull/551 https://github.com/pytorch/ao/pull/529
- Added FP5 E2M2 https://github.com/pytorch/ao/pull/399
- Added 4-bit, 8-bit, and FP8 ADAM support https://github.com/pytorch/ao/pull/478 https://github.com/pytorch/ao/pull/463 https://github.com/pytorch/ao/pull/482
- Added FSDP2 support for low-bit optimizers https://github.com/pytorch/ao/pull/484
- [prototype] mixed-precision quantization and eval framework https://github.com/pytorch/ao/pull/531
- Added int4 weight-only QAT support https://github.com/pytorch/ao/pull/555, https://github.com/pytorch/ao/pull/383
- Added custom CUDA `tinygemm` unpacking ops https://github.com/pytorch/ao/pull/415
Improvements
- Composing quantization and sparsity now uses the unified AQT Layout https://github.com/pytorch/ao/pull/498
- Added default inductor config settings https://github.com/pytorch/ao/pull/423
- Better dtype and device handling for `Int8DynActInt4WeightQuantizer` and `Int4WeightOnlyQuantizer` https://github.com/pytorch/ao/pull/475 https://github.com/pytorch/ao/pull/479
- Enable `model.to` for int4/int8 weight only quantized models https://github.com/pytorch/ao/pull/486 https://github.com/pytorch/ao/pull/522
- Added more logging to `TensorCoreTiledAQTLayout` https://github.com/pytorch/ao/pull/520
- Added general `fake_quantize_affine` op with mask support https://github.com/pytorch/ao/pull/492 https://github.com/pytorch/ao/pull/500
- QAT now uses the shared `fake_quantize_affine` primitive https://github.com/pytorch/ao/pull/527
- Improve FSDP support for low-bit optimizers https://github.com/pytorch/ao/pull/538
- Custom op and inductor decomp registration now uses a decorator https://github.com/pytorch/ao/pull/434
- Updated torch version to no longer require `unwrap_tensor_subclass` https://github.com/pytorch/ao/pull/595
Bug fixes
- Fixed import for `TORCH_VERSION_AFTER_*` https://github.com/pytorch/ao/pull/433
- Fixed crash when PYTORCH_VERSION is not defined https://github.com/pytorch/ao/pull/455
- Added `torch.compile` support for `NF4Tensor` https://github.com/pytorch/ao/pull/544
- Added fbcode check to fix torchtune in Genie https://github.com/pytorch/ao/pull/480
- Fixed `int4pack_mm` error https://github.com/pytorch/ao/pull/517
- Fixed cuda device check https://github.com/pytorch/ao/pull/536
- Weight shuffling now runs on CPU for int4 quantization due to a MPS memory issue https://github.com/pytorch/ao/pull/552
- Scale and input now are the same dtype for int8 weight only quantization https://github.com/pytorch/ao/pull/534
- Fixed FP6-LLM API https://github.com/pytorch/ao/pull/595
Performance
- Added `segment-anything-fast` benchmarks for composed quantization + sparsity https://github.com/pytorch/ao/pull/457
- Updated low-bit Adam benchmark https://github.com/pytorch/ao/pull/481
Docs
- Updated README.md https://github.com/pytorch/ao/pull/583 https://github.com/pytorch/ao/pull/438 https://github.com/pytorch/ao/pull/445 https://github.com/pytorch/ao/pull/460
- Updated installation instructions https://github.com/pytorch/ao/pull/447 https://github.com/pytorch/ao/pull/459
- Added more docs for `int4_weight_only` API https://github.com/pytorch/ao/pull/469
- Added developer guide notebook https://github.com/pytorch/ao/pull/588
- Added optimized model serialization/deserialization doc https://github.com/pytorch/ao/pull/524 https://github.com/pytorch/ao/pull/525
- Added new float8 feature tracker https://github.com/pytorch/ao/pull/557
- Added static quantization tutorial for calibration-based techniques https://github.com/pytorch/ao/pull/487
Devs
- Fix numpy version in CI https://github.com/pytorch/ao/pull/537
- trymerge now uploads merge records to s3 https://github.com/pytorch/ao/pull/448
- Updated python version to 3.9 https://github.com/pytorch/ao/pull/488
- `torchao` no longer depends on `torch` https://github.com/pytorch/ao/pull/449
- `benchmark_model` now accepts args and kwargs and supports `cpu` and `mps` backends https://github.com/pytorch/ao/pull/586 https://github.com/pytorch/ao/pull/406
- Add git version suffix to package name https://github.com/pytorch/ao/pull/547
- Added validations to torchao https://github.com/pytorch/ao/pull/453 https://github.com/pytorch/ao/pull/454
- Parallel test support with pytest-xdist https://github.com/pytorch/ao/pull/518
- `Quantizer` now uses `logging` instead of `print` https://github.com/pytorch/ao/pull/472
Not user facing
- Refactored `_replace_linear_8da4w` https://github.com/pytorch/ao/pull/451
- Remove unused code from AQT implementation https://github.com/pytorch/ao/pull/476 https://github.com/pytorch/ao/pull/440 https://github.com/pytorch/ao/pull/441 https://github.com/pytorch/ao/pull/471
- Improved error message for lm_eval script https://github.com/pytorch/ao/pull/444
- Updated HF_TOKEN env variable https://github.com/pytorch/ao/pull/427
- Fixed typo in Quant-LLM in https://github.com/pytorch/ao/pull/450
- Add a test for map_location="cpu" in https://github.com/pytorch/ao/pull/497
- Removed sparse test collection warning https://github.com/pytorch/ao/pull/489
- Refactored layout implementation https://github.com/pytorch/ao/pull/491
- Refactored `LinearActQuantizedTensor` https://github.com/pytorch/ao/pull/542
New Contributors
- @qingquansong made their first contribution in https://github.com/pytorch/ao/pull/433
- @Hanxian97 made their first contribution in https://github.com/pytorch/ao/pull/451
- @larryliu0820 made their first contribution in https://github.com/pytorch/ao/pull/472
- @SLR722 made their first contribution in https://github.com/pytorch/ao/pull/480
- @jainapurva made their first contribution in https://github.com/pytorch/ao/pull/406
- @bdhirsh made their first contribution in https://github.com/pytorch/ao/pull/544
- @yanbing-j made their first contribution in https://github.com/pytorch/ao/pull/517
- @manuelcandales made their first contribution in https://github.com/pytorch/ao/pull/552
- @Valentine233 made their first contribution in https://github.com/pytorch/ao/pull/534
Full Changelog: https://github.com/pytorch/ao/compare/v0.3.1-rc1...v0.4.0-rc1
We were able to close about 60% of tasks for 0.4.0, which will now spill over into upcoming releases. We will post a list for 0.5.0 next, which we aim to release at the end of August 2024. We want to follow a monthly release cadence until further notice.
Published by jcaip almost 2 years ago
torchao - v0.3.1
Highlights
We are excited to announce the 0.3 release of torchao! This release adds support for a new quantize API, MX format, FP6 dtype and bitpacking, 2:4 sparse accelerated training and benchmarking infra for llama2/llama3 models.
quantize API (https://github.com/pytorch/ao/pull/256)
We added a tensor subclass based quantization API; see the docs and README for usage details. It is planned to replace all existing quantization APIs in torchao for torch 2.4 and later.
Accelerated training with 2:4 sparsity (#184)
You can now accelerate training with 2:4 sparsity, using the runtime pruning + compression kernels written by xFormers. These kernels process a 4x4 sub-tile to be 2:4 sparse in both directions, to handle both the forward and backward pass when training. We see a 1.3x speedup for the MLP layers of ViT-L across a forward and backwards pass.
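To make the 2:4 pattern concrete, here is a minimal pure-Python sketch of magnitude-based 2:4 pruning along one row. It is only an illustration: the function name `prune_2_4` is hypothetical, and it works on a single direction, whereas the kernels described above make each 4x4 sub-tile 2:4 sparse in both directions.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.1, -2.0, 0.3, 1.5, 4.0, 0.0, -0.2, 0.05]
sparse_row = prune_2_4(row)
print(sparse_row)  # [0.0, -2.0, 0.0, 1.5, 4.0, 0.0, -0.2, 0.0]
```

Every group of four keeps at most two non-zero values, which is the property that sparse matmul hardware relies on.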
MX support (https://github.com/pytorch/ao/pull/264)
We added prototype support for MX format for training and inference with a reference native PyTorch implementation of training and inference primitives for using MX accelerated matrix multiplications. The MX numerical formats are new low precision formats with recent acceptance into the OCP spec: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
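The core idea of microscaling can be sketched in a few lines: a block of 32 values shares one power-of-two scale, and each element is stored in a narrow type. The sketch below is an assumption-laden illustration, not the torchao implementation: it uses int8-like integer elements in place of the spec's element encodings, and the helper names are hypothetical.

```python
import math

BLOCK_SIZE = 32   # MX formats share one scale across a block of 32 elements
INT_MAX = 127     # element range assumed for this sketch (int8-like)


def mx_quantize_block(block):
    """Return (shared_exponent, int_elements) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest power-of-two scale that keeps every element within +/-INT_MAX.
    exponent = math.ceil(math.log2(amax / INT_MAX))
    scale = 2.0 ** exponent
    elements = [max(-INT_MAX, min(INT_MAX, round(v / scale))) for v in block]
    return exponent, elements


def mx_dequantize_block(exponent, elements):
    """Rebuild approximate floats from the shared exponent and elements."""
    scale = 2.0 ** exponent
    return [q * scale for q in elements]


block = [math.sin(i) * 3.0 for i in range(BLOCK_SIZE)]
exponent, elements = mx_quantize_block(block)
restored = mx_dequantize_block(exponent, elements)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

Because the scale is a power of two, only its exponent needs to be stored per block, and the rounding error of each element is bounded by half the scale.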
Benchmarking (https://github.com/pytorch/ao/pull/276, https://github.com/pytorch/ao/pull/374)
We added a stable way to benchmark llama2 and llama3 models that includes perf/accuracy comparisons. See torchao/_models/llama/benchmarks.sh for more details.
🌟 💥 Community Contributions 🌟 💥
FP6 support (https://github.com/pytorch/ao/pull/279, https://github.com/pytorch/ao/pull/283, https://github.com/pytorch/ao/pull/358)
@gau-nernst added support for the FP6 dtype and a mixed FP16 x FP6 matmul kernel with support for torch.compile. Benchmark results show a 2.3x speedup over the BF16 baseline for meta-llama/Llama-2-7b-chat-hf
Bitpacking (https://github.com/pytorch/ao/pull/307, https://github.com/pytorch/ao/pull/282)
@vayuda, @melvinebenezer, @CoffeeVampir3 and @andreaskoepf added support for packing and unpacking lower-bit dtypes, using torch.compile to generate the kernels, and added UInt2 and Bitnet tensors based on this approach.
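As a rough illustration of what bitpacking means for a 2-bit dtype, the sketch below packs four 2-bit values into each byte and unpacks them again. It is plain Python with hypothetical helper names; the torchao version operates on tensors and relies on torch.compile to generate the kernels, and its bit ordering may differ.

```python
def pack_uint2(values):
    """Pack a list of 2-bit integers (0..3) into bytes, four per byte."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for start in range(0, len(values), 4):
        byte = 0
        for offset, value in enumerate(values[start:start + 4]):
            if not 0 <= value <= 3:
                raise ValueError("each value must fit in 2 bits")
            byte |= value << (2 * offset)  # lowest bits hold the first value
        packed.append(byte)
    return bytes(packed)


def unpack_uint2(packed):
    """Inverse of pack_uint2."""
    return [(byte >> (2 * offset)) & 0b11 for byte in packed for offset in range(4)]


values = [3, 0, 1, 2, 2, 2, 0, 1]
packed = pack_uint2(values)
assert unpack_uint2(packed) == values  # round trip is lossless
```

Eight 2-bit values fit in two bytes here, a 4x saving over storing each value in its own byte.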
FP8 split-gemm kernel https://github.com/pytorch/ao/pull/263
Added the kernel written by @AdnanHoque to torchao with speedups compared to the cuBLAS kernel for batch size <=16
BC Breaking
Deprecations
- Deprecate top level quantization APIs https://github.com/pytorch/ao/pull/344
1. int8 weight only quantization
apply_weight_only_int8_quant(model) or change_linear_weights_to_int8_woqtensors(model)
-->
```python
# for torch 2.4+
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors
change_linear_weights_to_int8_woqtensors(model)
```
2. int8 dynamic quantization
apply_dynamic_quant(model) or change_linear_weights_to_int8_dqtensors(model)
-->
```python
import torch

# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
torch._inductor.config.force_fuse_int_mm_with_mul = True

# for torch 2.4+
from torchao.quantization import quantize, int8_dynamic_activation_int8_weight
quantize(model, int8_dynamic_activation_int8_weight())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_dqtensors
change_linear_weights_to_int8_dqtensors(model)
```
3. int4 weight only quantization
change_linear_weights_to_int4_woqtensors(model)
-->
```python
# for torch 2.4+
from torchao.quantization import quantize, int4_weight_only
quantize(model, int4_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int4_woqtensors
change_linear_weights_to_int4_woqtensors(model)
```
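Code that has to run on both sides of the torch 2.4 boundary needs some way to pick between the two paths shown above. One possible approach is a small version check; the helpers below are hypothetical (they are not part of torchao) and only parse a version string, so in real code you would pass in `torch.__version__`.

```python
import re


def parse_torch_version(version):
    """Extract (major, minor) from strings such as '2.4.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def use_quantize_api(torch_version):
    """True when the tensor-subclass `quantize` path applies (torch 2.4+)."""
    return parse_torch_version(torch_version) >= (2, 4)


for candidate in ("2.2.2", "2.3.1", "2.4.0+cu121"):
    path = "quantize" if use_quantize_api(candidate) else "change_linear_weights_*"
    print(candidate, "->", path)
```

Comparing integer tuples avoids the classic string-comparison pitfall where "2.10" sorts before "2.4".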
New Features
- Add `quantize` https://github.com/pytorch/ao/pull/256
- Add a prototype of MX format training and inference https://github.com/pytorch/ao/pull/264
- [FP6-LLM] Port splitK map from DeepSpeed https://github.com/pytorch/ao/pull/283
- Improve FP6-LLM 2+4bit weight splitting + user API https://github.com/pytorch/ao/pull/279
- Bitpacking https://github.com/pytorch/ao/pull/291
- training acceleration via runtime semi-structured sparsity https://github.com/pytorch/ao/pull/184
- Bitpackingv2 https://github.com/pytorch/ao/pull/307
- Add FP6-LLM doc and move FP6-LLM to prototype https://github.com/pytorch/ao/pull/358
- Added first bits of Uint2Tensor and BitnetTensor https://github.com/pytorch/ao/pull/282
Improvements
- Improve primitives for FP6 quant https://github.com/pytorch/ao/pull/248
- Extract eval code from GPTQ for more general usage https://github.com/pytorch/ao/pull/275
- Factor out the specific configurations to helper functions https://github.com/pytorch/ao/pull/286
- Add support for `AQTLayout`, `PlainAQTLayout` and `TensorCoreTiledAQTLayout` https://github.com/pytorch/ao/pull/278
- Graceful handling of cpp extensions https://github.com/pytorch/ao/pull/296
- Refactor int8 dynamic quantization with call to `quantize` https://github.com/pytorch/ao/pull/294
- [NF4][FSDP] return contiguous `quantization_factor` https://github.com/pytorch/ao/pull/298
- Refactor int4 and int8 weight only quantization to use `quantize` https://github.com/pytorch/ao/pull/301
- Adding a quick way for users to test model eval for hf models https://github.com/pytorch/ao/pull/328
- Wrap torch.ops.quantized_decomposed to improve import errors https://github.com/pytorch/ao/pull/310
- [NF4Tensor] Switch to save for backward since are now a tensor input https://github.com/pytorch/ao/pull/323
- Refactor rest of tinygemm quant primitive ops https://github.com/pytorch/ao/pull/321
- Move some util functions from quantization.utils to torchao.utils https://github.com/pytorch/ao/pull/337
- Clean up FP6-LLM https://github.com/pytorch/ao/pull/304
- Move quant ops to utils.py https://github.com/pytorch/ao/pull/331
- FP6-LLM clean up (again) https://github.com/pytorch/ao/pull/339
- Improving hf_eval.py https://github.com/pytorch/ao/pull/342
- Generalize Model Size Code https://github.com/pytorch/ao/pull/364
- Minor upgrades to bit pack https://github.com/pytorch/ao/pull/347
- Factor out dispatch and layout registration table https://github.com/pytorch/ao/pull/360
- Add `register_apply_tensor_subclass` https://github.com/pytorch/ao/pull/366
- Refactor custom FPx cast https://github.com/pytorch/ao/pull/363
- Remove all dependencies except torch https://github.com/pytorch/ao/pull/369
- Enable a test for loading state_dict with tensor subclasses https://github.com/pytorch/ao/pull/389
- 073 scripts for benchmarks https://github.com/pytorch/ao/pull/372
- Add WOQ int8 test with Inductor Freeze https://github.com/pytorch/ao/pull/362
- Benchmarking updates for semi-structured sparse training https://github.com/pytorch/ao/pull/398
- add FSDP QLoRA test and revert failing PR https://github.com/pytorch/ao/pull/403
- Refactor the API for quant method argument for quantize function https://github.com/pytorch/ao/pull/400
- eval script fixes https://github.com/pytorch/ao/pull/414
Bug Fixes
- Fixed the HQQ import skip https://github.com/pytorch/ao/pull/262
- fixing autoquant bug https://github.com/pytorch/ao/pull/265
- Fix eval import after #275 https://github.com/pytorch/ao/pull/290
- Fixed f-string printing of `NF4Tensor`s https://github.com/pytorch/ao/pull/297
- Check and fix dequantize_affine is idempotent https://github.com/pytorch/ao/pull/309
- Update old pretrained TorchVision API in ao tutorials (#313) https://github.com/pytorch/ao/pull/314
- Fix dimension issues for int4 weight only quant path https://github.com/pytorch/ao/pull/330
- Fix compile in `hf_eval.py` https://github.com/pytorch/ao/pull/341
- task_list to tasks in hf_eval https://github.com/pytorch/ao/pull/343
- fixing peak memory stats for benchmark https://github.com/pytorch/ao/pull/353
- Fix inductor config BC change https://github.com/pytorch/ao/pull/382
- fixing scripts https://github.com/pytorch/ao/pull/395
Performance
- FP8 splitgemm user defined triton kernel https://github.com/pytorch/ao/pull/263
- sparse benchmarking numbers https://github.com/pytorch/ao/pull/303
- Fix FP6-LLM benchmark https://github.com/pytorch/ao/pull/312
- Adding Llama to TorchAO https://github.com/pytorch/ao/pull/276
- Generalize Model Size Code https://github.com/pytorch/ao/pull/364
- eval script for llama https://github.com/pytorch/ao/pull/374
- 077 autoquant gpt fast https://github.com/pytorch/ao/pull/361
Docs
- add static folder for images + fix links https://github.com/pytorch/ao/pull/271
- Fix Readme and remove unused kernel https://github.com/pytorch/ao/pull/270
- Kernel docs https://github.com/pytorch/ao/pull/274
- Quantization Docstrings https://github.com/pytorch/ao/pull/273
- Add `AffineQuantizedTensor` based workflow doc and examples https://github.com/pytorch/ao/pull/277
- Add `AUTOQUANT_CACHE` docs for reusing the same quantization plan https://github.com/pytorch/ao/pull/329
- Update nightly build instructions https://github.com/pytorch/ao/pull/334
- add link to benchmarking script https://github.com/pytorch/ao/pull/355
- New README https://github.com/pytorch/ao/pull/392
- Minor README updates https://github.com/pytorch/ao/pull/401
- Add `quantize` to doc page https://github.com/pytorch/ao/pull/367
- Add link to new custom op tutorial https://github.com/pytorch/ao/pull/424
Devs
- ci: Add push trigger for binary build workflows https://github.com/pytorch/ao/pull/259
- Make fp8 test explicit https://github.com/pytorch/ao/pull/266
- Move `AffineQuantizedTensor` to torchao/dtypes https://github.com/pytorch/ao/pull/272
- Add suffix to package version https://github.com/pytorch/ao/pull/293
- Re-enable AOTI tests https://github.com/pytorch/ao/pull/212
- Add fused QKV HQQ `triton_mm` test https://github.com/pytorch/ao/pull/306
- Pin CUDA nightly to mitigate regression https://github.com/pytorch/ao/pull/322
- Unpin CUDA nightly https://github.com/pytorch/ao/pull/333
- Add architecture to index postfix for nightly builds https://github.com/pytorch/ao/pull/336
- Update regression test to python 3.8 https://github.com/pytorch/ao/pull/340
- Remove test_ops.py warning spew https://github.com/pytorch/ao/pull/267
- Add torchao.__version__ https://github.com/pytorch/ao/pull/359
- make torchao test discovery pass in fbcode https://github.com/pytorch/ao/pull/351
- use pytorch version env variable https://github.com/pytorch/ao/pull/373
- Update pre_build_script.sh https://github.com/pytorch/ao/pull/390
- Add support for building CUDA extension on Windows https://github.com/pytorch/ao/pull/396
- Add trymerge https://github.com/pytorch/ao/pull/388
- Fix github CI error https://github.com/pytorch/ao/pull/409
- Fix missing dependencies in trymerge workflow https://github.com/pytorch/ao/pull/413
- Setup trymerge secrets https://github.com/pytorch/ao/pull/416
- Pin CUDA nightlies for mx failures https://github.com/pytorch/ao/pull/428
- fix mx triton kernel after PyTorch triton pin change https://github.com/pytorch/ao/pull/431
Untopiced
- Print the code when the check failed https://github.com/pytorch/ao/pull/254
- Retry of D58015187 Move AsyncCompile to a different file by @jamesjwu in https://github.com/pytorch/ao/pull/302
- Revert "Clean up FP6-LLM" https://github.com/pytorch/ao/pull/338
- Update version to 0.3.0 https://github.com/pytorch/ao/pull/348
- Add torchao.__version__ https://github.com/pytorch/ao/pull/359
New Contributors
- @seemethere made their first contribution in https://github.com/pytorch/ao/pull/259
- @yiliu30 made their first contribution in https://github.com/pytorch/ao/pull/262
- @vkuzo made their first contribution in https://github.com/pytorch/ao/pull/264
- @vayuda made their first contribution in https://github.com/pytorch/ao/pull/291
- @awgu made their first contribution in https://github.com/pytorch/ao/pull/297
- @jamesjwu made their first contribution in https://github.com/pytorch/ao/pull/302
- @kit1980 made their first contribution in https://github.com/pytorch/ao/pull/314
- @RobinKa made their first contribution in https://github.com/pytorch/ao/pull/329
- @andreaskoepf made their first contribution in https://github.com/pytorch/ao/pull/282
- @clee2000 made their first contribution in https://github.com/pytorch/ao/pull/388
Full Changelog: https://github.com/pytorch/ao/compare/v0.2.0...v0.3.0-rc1
We were able to close about 60% of tasks for 0.3.0, which will now spill over into upcoming releases. We will post a list for 0.4.0 next, which we aim to release at the end of July 2024. We want to follow a monthly release cadence until further notice.
EDIT: We made a patch release for 0.3.1 to include 2 more PRs so now ao has no runtime dependencies https://github.com/pytorch/ao/pull/449 and https://github.com/pytorch/ao/pull/455
Published by supriyar almost 2 years ago
torchao - v0.2.0
What's Changed
Highlights
Custom CPU/CUDA extension to ship CPU/CUDA binaries.
PyTorch core recently shipped a new custom op registration mechanism in torch.library. Its main benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably by not causing graph breaks under torch.compile().
We've added documentation on how to register your own custom ops at https://github.com/pytorch/ao/tree/main/torchao/csrc and, if you learn better by example, you can follow this PR https://github.com/pytorch/ao/pull/135 to add your own custom ops to torchao.
Most notably these instructions were leveraged by @gau-nernst to integrate some new custom ops for fp6 support https://github.com/pytorch/ao/pull/223
One key benefit of integrating your kernels directly in torchao is that, thanks to our manylinux GPU support, we can ensure the CPU/CUDA kernels you add will work on as many devices and CUDA versions as possible https://github.com/pytorch/ao/pull/176
A lot of prototype and community contributions
@jeromeku was our community champion, merging support for:
1. GaLore, our first pretraining kernel, which allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch
2. DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly https://github.com/pytorch/ao/tree/main/torchao/prototype/dora
3. Fused int4/fp16 quantized matmul, which is particularly useful for compute bound kernels, showing 4x speedups over tinygemm for larger batch sizes such as 512 https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq
@gau-nernst merged fp6 support showing up to 8x speedups over an fp16 baseline for small batch size inference https://github.com/pytorch/ao/pull/223
NF4 support for upcoming FSDP2
@weifengpy merged support for composing FSDP2 with NF4 which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for how to compose smaller dtypes with FSDP https://github.com/pytorch/ao/pull/150 most notably by implementing torch.chunk(). We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research and inspires many more studies such as the ones done by Answer.ai https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
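Why chunking matters for a quantized dtype can be shown with a small pure-Python sketch: a block-quantized tensor carries both codes and per-block scales, so splitting it across shards has to split both consistently and on block boundaries. Everything here is an assumption made for illustration: a generic 4-bit absmax scheme stands in for the NF4 codebook, the block size is tiny, and the function names are hypothetical rather than torchao's.

```python
BLOCK = 4  # elements per quantization block (tiny, for illustration)


def quantize_blocks(values):
    """Quantize floats to 4-bit signed codes with one absmax scale per block."""
    if len(values) % BLOCK != 0:
        raise ValueError("length must be a multiple of BLOCK")
    codes, scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / 7 if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales


def chunk(codes, scales, num_chunks):
    """Split codes and scales together so every shard owns whole blocks."""
    if len(scales) % num_chunks != 0:
        raise ValueError("number of blocks must divide evenly across chunks")
    blocks_per_chunk = len(scales) // num_chunks
    elems_per_chunk = blocks_per_chunk * BLOCK
    return [
        (codes[i * elems_per_chunk:(i + 1) * elems_per_chunk],
         scales[i * blocks_per_chunk:(i + 1) * blocks_per_chunk])
        for i in range(num_chunks)
    ]


def dequantize(codes, scales):
    """Multiply each code by the scale of the block it belongs to."""
    return [c * scales[i // BLOCK] for i, c in enumerate(codes)]


weights = [0.5, -1.0, 0.25, 0.0, 7.0, -3.5, 1.0, 2.0]
codes, scales = quantize_blocks(weights)
shards = chunk(codes, scales, num_chunks=2)
```

Each shard can be dequantized on its own, and concatenating the results matches dequantizing the unsharded tensor, which is the invariant a sharding-aware tensor subclass has to preserve.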
BC breaking
Deprecations
New Features
- Match autoquant API with torch.compile (https://github.com/pytorch/ao/pull/109, https://github.com/pytorch/ao/pull/162, https://github.com/pytorch/ao/pull/175)
- [Prototype] 8da4w QAT (https://github.com/pytorch/ao/pull/138, https://github.com/pytorch/ao/pull/199, https://github.com/pytorch/ao/pull/198, https://github.com/pytorch/ao/pull/211, https://github.com/pytorch/ao/pull/154, https://github.com/pytorch/ao/pull/157, https://github.com/pytorch/ao/pull/229)
- [Prototype] GaLore (https://github.com/pytorch/ao/pull/95)
- [Prototype] DoRA (https://github.com/pytorch/ao/pull/216)
- [Prototype] HQQ (https://github.com/pytorch/ao/pull/153, https://github.com/pytorch/ao/pull/185)
- [Prototype] 2:4 sparse + int8 sparse subclass (https://github.com/pytorch/ao/pull/36)
- [Prototype] Unified quantization primitives (https://github.com/pytorch/ao/pull/159, https://github.com/pytorch/ao/pull/201, https://github.com/pytorch/ao/pull/193, https://github.com/pytorch/ao/pull/220, https://github.com/pytorch/ao/pull/227, https://github.com/pytorch/ao/pull/173, https://github.com/pytorch/ao/pull/210)
- [Prototype] Pruning primitives (https://github.com/pytorch/ao/pull/148, https://github.com/pytorch/ao/pull/194)
- [Prototype] AffineQuantizedTensor subclass (https://github.com/pytorch/ao/pull/214, https://github.com/pytorch/ao/pull/230, https://github.com/pytorch/ao/pull/243, https://github.com/pytorch/ao/pull/247, https://github.com/pytorch/ao/pull/251)
- [Prototype] Add `Int4WeightOnlyQuantizer` (https://github.com/pytorch/ao/pull/119)
- Custom CUDA extensions (https://github.com/pytorch/ao/pull/135, https://github.com/pytorch/ao/pull/186, https://github.com/pytorch/ao/pull/232)
- [Prototype] Add FP6 Linear (https://github.com/pytorch/ao/pull/223)
Improvements
- FSDP2 support for NF4Tensor (https://github.com/pytorch/ao/pull/118, https://github.com/pytorch/ao/pull/150, https://github.com/pytorch/ao/pull/207)
- Add save/load of int8 weight only quantized model (https://github.com/pytorch/ao/pull/122)
- Add int_scaled_mm on CPU (https://github.com/pytorch/ao/pull/121)
- Add cpu and gpu in int4wo and int4wo-gptq quantizer (https://github.com/pytorch/ao/pull/131)
- Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (https://github.com/pytorch/ao/pull/146, https://github.com/pytorch/ao/pull/226, https://github.com/pytorch/ao/pull/213)
- Remove `is_gpt_fast` specialization from GPTQ (https://github.com/pytorch/ao/pull/172)
- Common benchmark and profile utils (https://github.com/pytorch/ao/pull/238)
Bug fixes
- Fix padding in GPTQ (https://github.com/pytorch/ao/pull/119, https://github.com/pytorch/ao/pull/120)
- Fix `Int8DynActInt4WeightLinear` module swap (https://github.com/pytorch/ao/pull/151)
- Fix `NF4Tensor.to` to use device kwarg (https://github.com/pytorch/ao/pull/158)
- Fix `quantize_activation_per_token_absmax` perf regression (https://github.com/pytorch/ao/pull/253)
Performance
- Chunk NF4Tensor construction to reduce memory spike (https://github.com/pytorch/ao/pull/196)
- Fix intmm benchmark script (https://github.com/pytorch/ao/pull/141)
Docs
- Update READMEs (https://github.com/pytorch/ao/pull/140, https://github.com/pytorch/ao/pull/142, https://github.com/pytorch/ao/pull/169, https://github.com/pytorch/ao/pull/155, https://github.com/pytorch/ao/pull/179, https://github.com/pytorch/ao/pull/187, https://github.com/pytorch/ao/pull/188, https://github.com/pytorch/ao/pull/200, https://github.com/pytorch/ao/pull/217, https://github.com/pytorch/ao/pull/245)
- Add https://pytorch.org/ao (https://github.com/pytorch/ao/pull/136, https://github.com/pytorch/ao/pull/145, https://github.com/pytorch/ao/pull/163, https://github.com/pytorch/ao/pull/164, https://github.com/pytorch/ao/pull/165, https://github.com/pytorch/ao/pull/168, https://github.com/pytorch/ao/pull/177, https://github.com/pytorch/ao/pull/195, https://github.com/pytorch/ao/pull/224)
CI
- Add A10G support in CI (https://github.com/pytorch/ao/pull/176)
- General CI improvements (https://github.com/pytorch/ao/pull/161, https://github.com/pytorch/ao/pull/171, https://github.com/pytorch/ao/pull/178, https://github.com/pytorch/ao/pull/180, https://github.com/pytorch/ao/pull/183, https://github.com/pytorch/ao/pull/107, https://github.com/pytorch/ao/pull/215, https://github.com/pytorch/ao/pull/244, https://github.com/pytorch/ao/pull/257, https://github.com/pytorch/ao/pull/235, https://github.com/pytorch/ao/pull/242)
- Add expecttest to requirements.txt (https://github.com/pytorch/ao/pull/225)
- Push button binary support (https://github.com/pytorch/ao/pull/241, https://github.com/pytorch/ao/pull/240, https://github.com/pytorch/ao/pull/250)
Not user facing
Security
Untopiced
- Version bumps (https://github.com/pytorch/ao/pull/125, https://github.com/pytorch/ao/pull/234)
- Don't import _C in fbcode (https://github.com/pytorch/ao/pull/218)
New Contributors
- @Xia-Weiwen made their first contribution in https://github.com/pytorch/ao/pull/121
- @jeromeku made their first contribution in https://github.com/pytorch/ao/pull/95
- @weifengpy made their first contribution in https://github.com/pytorch/ao/pull/118
- @aakashapoorv made their first contribution in https://github.com/pytorch/ao/pull/179
- @UsingtcNower made their first contribution in https://github.com/pytorch/ao/pull/194
- @Jokeren made their first contribution in https://github.com/pytorch/ao/pull/217
- @gau-nernst made their first contribution in https://github.com/pytorch/ao/pull/223
- @janeyx99 made their first contribution in https://github.com/pytorch/ao/pull/245
- @huydhn made their first contribution in https://github.com/pytorch/ao/pull/250
- @lancerts made their first contribution in https://github.com/pytorch/ao/pull/238
Full Changelog: https://github.com/pytorch/ao/compare/v0.2.0...v0.2.1
We were able to close about half of tasks for 0.2.0, which will now spill over into upcoming releases. We will post a list for 0.3.0 next, which we aim to release at the end of May 2024. We want to follow a monthly release cadence until further notice.
Published by cpuhrsch about 2 years ago
torchao - TorchAO 0.1.0: First Release
Highlights
We’re excited to announce the release of TorchAO v0.1.0! TorchAO is a repository that hosts architecture optimization techniques such as quantization and sparsity, along with performance kernels for different backends such as CUDA and CPU. In this release, we added support for a few quantization techniques such as int4 weight only GPTQ quantization, nf4 dtype support for QLoRA, and sparsity features such as WandaSparsifier. We also added an autotuner that can tune Triton integer matrix multiplication kernels on CUDA.
Note: TorchAO is currently in a pre-release state and under extensive development. The public APIs should not be considered stable. But we welcome you to try out our APIs and offerings and provide any feedback on your experience.
torchao 0.1.0 will be compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0 and TorchTune 0.1.0.
New Features
Quantization
- Added tensor subclass based quantization APIs: `change_linear_weights_to_int8_dqtensors`, `change_linear_weights_to_int8_woqtensors` and `change_linear_weights_to_int4_woqtensors` (#1)
- Added module based quantization APIs for int8 dynamic and weight only quantization `apply_weight_only_int8_quant` and `apply_dynamic_quant` (#1)
- Added module swap version of int4 weight only quantization `Int4WeightOnlyQuantizer` and `Int4WeightOnlyGPTQQuantizer` used in TorchTune (#119, #116)
- Added int8 dynamic activation and int4 weight quantization `Int8DynActInt4WeightQuantizer` and `Int8DynActInt4WeightGPTQQuantizer`, used in ExecuTorch (#74) (available in torch 2.3.0 and later)
Sparsity
- Added `WandaSparsifier` that prunes both weights and activations (#22)
Kernels
- Added `autotuner` for int mm Triton kernels (#41)
dtypes
- `nf4` tensor subclass and `nf4` linear (#37, #40, #62)
- Added `uint4` dtype tensor subclass (#13)
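For readers new to the terminology, the following is a minimal pure-Python sketch of what int8 weight only quantization does numerically: weights are stored as int8 with one scale per output row, while activations stay in floating point. It assumes symmetric absmax scaling and uses hypothetical function names; it is not the torchao implementation, which works on tensors and compiled kernels.

```python
def quantize_int8_per_row(weight):
    """Symmetric per-row int8 quantization: returns (int_rows, scales)."""
    int_rows, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row)
        scale = amax / 127 if amax > 0 else 1.0
        int_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return int_rows, scales


def linear_int8_weight_only(x, int_rows, scales):
    """y = x @ W^T with W dequantized on the fly; activations stay in float."""
    return [
        scale * sum(xi * q for xi, q in zip(x, row))
        for row, scale in zip(int_rows, scales)
    ]


weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]]
x = [1.0, 2.0, 3.0]
int_rows, scales = quantize_int8_per_row(weight)
y_quant = linear_int8_weight_only(x, int_rows, scales)
y_ref = [sum(xi * w for xi, w in zip(x, row)) for row in weight]
```

Comparing `y_quant` with `y_ref` gives a quick feel for the accuracy cost: each weight is off by at most half of its row's scale.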
Improvements
- Setup github workflow for regression testing (#50)
- Setup github workflow for `torchao-nightly` release (#54)
Documentation
- Added tutorials for quantizing vision transformer model (#60)
- Added tutorials for how to add an op for `nf4` tensor (#54)
Notes
- We are still debugging the accuracy problem for `Int8DynActInt4WeightGPTQQuantizer`
- Save and load does not work well for tensor subclass based APIs yet
- We will consolidate tensor subclass and module swap based quantization APIs later
- `uint4` tensor subclass is going to be merged into pytorch core in the future
- Quantization ops in `quant_primitives.py` will be deduplicated with similar quantize/dequantize ops in PyTorch later
Published by jerryzh168 about 2 years ago