Recent Releases of torchao

torchao - v0.13.0

Highlights

We are excited to announce the 0.13.0 release of torchao! This release adds numerous QAT improvements, faster MXFP8 pretraining, and more!

Simpler Multi-step QAT API (https://github.com/pytorch/ao/pull/2629)

We added a new, simpler, multi-step QAT API that uses only a single config. Now users can specify the target post-training quantization (PTQ) config as the base config and we will automatically infer the correct fake quantize configs to use!

```py
from torchao.quantization import (
    quantize_,
    Int8DynamicActivationInt4WeightConfig
)
from torchao.quantization.qat import QATConfig

# prepare
base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))
```

For more advanced use cases, users can continue to specify specific FakeQuantizeConfigs as before:

```py
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import IntxFakeQuantizeConfig, QATConfig

# prepare
activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)
```
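To make the term "fake quantize" concrete, here is a minimal pure-Python sketch of the numerics involved: each group of weights is scaled, rounded to the int4 grid, clamped, and immediately dequantized, so training sees the quantization error while values stay in floating point. The function name `fake_quantize_int4_symmetric` and the scale choice (largest magnitude divided by 7) are assumptions for illustration, not torchao's implementation.

```python
def fake_quantize_int4_symmetric(values, group_size=32):
    """Quantize-dequantize a flat list of floats group by group (illustrative only)."""
    qmin, qmax = -8, 7  # signed int4 range
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        # Assumed scale choice: map the largest magnitude in the group onto qmax.
        scale = amax / qmax if amax > 0 else 1.0
        for v in group:
            q = max(qmin, min(qmax, round(v / scale)))  # nearest int4 grid point
            out.append(q * scale)  # dequantize back to float
    return out


weights = [0.7, -0.35, 0.1, 0.02]
print(fake_quantize_int4_symmetric(weights, group_size=4))
```

Each output differs from its input by at most half a quantization step, which is the error the model learns to tolerate during QAT.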

(Prototype) NVFP4 and FP8 QAT (https://github.com/pytorch/ao/pull/2735, https://github.com/pytorch/ao/pull/2666)

We generalized QAT to support FP8 and NVFP4 use cases. You can try them out as follows:

```py
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationInt4WeightConfig,
    Float8DynamicActivationFloat8WeightConfig,
    Float8WeightOnlyConfig,
)
from torchao.prototype.mx_formats import NVFP4InferenceConfig
from torchao.quantization.qat import QATConfig

# Pick a base config
base_config = Float8DynamicActivationInt4WeightConfig()
# or base_config = Float8DynamicActivationFloat8WeightConfig()
# or base_config = NVFP4InferenceConfig()

# prepare
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))
```

Users can also use the more specific FakeQuantizeConfigs for more advanced use cases, e.g.:

```py
import torch
from torchao.quantization import PerRow, quantize_
from torchao.quantization.qat import Float8FakeQuantizeConfig, QATConfig
from torchao.prototype.qat import NVFP4FakeQuantizeConfig

activation_config = Float8FakeQuantizeConfig(torch.float8_e4m3fn, PerRow())
weight_config = NVFP4FakeQuantizeConfig(use_per_tensor_scale=True)

# prepare
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)
```

(prototype) 1.2x MXFP8 dense pretraining speedups with torchtitan

We landed performance improvements (such as a faster to_mx dim1 cast) to our prototype MXFP8 training APIs, and we now achieve a 1.2x speedup vs bf16 when pretraining Llama 3 8B on NVIDIA B200. Please see our training benchmarks README for more information.

torchao float8 training now integrated into axolotl!

You can now use torchao.float8 directly from axolotl to achieve finetuning QPS e2e speedups of up to 1.1x on 3B parameter models (docs, release notes).

BC Breaking

Float8DynamicActivationFloat8WeightConfig and Float8WeightOnlyConfig version bump to 2 (https://github.com/pytorch/ao/pull/2650)

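We updated the implementation of the float8 tensor and, as a result, bumped the default version of these two configs from 1 to 2. Loading a checkpoint that was quantized with version 1 now emits deprecation warnings such as the following: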

```
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "torchao-testing/opt-125m-Float8DynamicActivationFloat8WeightConfig-v1-0.13.dev"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    device_map="cuda",
)

/data/users/jerryzh/ao/torchao/core/config.py:249: UserWarning: Stored version is not the same as current default version of the config: stored_version=1, current_version=2, please check the deprecation warning
  warnings.warn(
/data/users/jerryzh/ao/torchao/dtypes/floatx/float8_layout.py:113: UserWarning: Models quantized with version 1 of Float8DynamicActivationFloat8WeightConfig is deprecated and will no longer be supported in a future release, please upgrade torchao and quantize again, or download a newer torchao checkpoint, see https://github.com/pytorch/ao/issues/2649 for more details
  warnings.warn(
```

Suggestion: upgrade torchao to 0.13 or later and regenerate the checkpoint:

quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

Or download the checkpoint again (please let us know if the checkpoint is not updated)

Please see https://github.com/pytorch/ao/issues/2649 for more details around the deprecation.

QAT API Changes (https://github.com/pytorch/ao/pull/2628, https://github.com/pytorch/ao/pull/2641)

At a high level, the following existing APIs are deprecated and replaced by the new ones shown below. Although this is technically BC-breaking due to typing changes, it will not affect most users because the old classes are kept around for now. They are, however, planned for removal in the next release.

    IntXQuantizationAwareTrainingConfig -> QATConfig
    FromIntXQuantizationAwareTrainingConfig -> QATConfig
    FakeQuantizeConfig -> IntxFakeQuantizeConfig
    FakeQuantizer -> IntxFakeQuantizer

Please see https://github.com/pytorch/ao/issues/2630 and the latest QAT README for more information on how to migrate.
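For codebases that still use the old names, a small script can list what needs attention. The sketch below scans Python source text for the deprecated identifiers listed above and reports the suggested replacement. It is a hypothetical helper (not part of torchao) and deliberately only reports matches: both old config classes map to the single `QATConfig`, whose arguments differ, so a blind textual substitution would not be a complete migration.

```python
import re

# Old name -> new name, as listed in the release notes above
RENAMES = {
    "IntXQuantizationAwareTrainingConfig": "QATConfig",
    "FromIntXQuantizationAwareTrainingConfig": "QATConfig",
    "FakeQuantizeConfig": "IntxFakeQuantizeConfig",
    "FakeQuantizer": "IntxFakeQuantizer",
}

# \b keeps "FakeQuantizeConfig" from matching inside "IntxFakeQuantizeConfig".
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def find_deprecated_names(source: str):
    """Return (line number, old name, suggested new name) for each match."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits


example = """\
from torchao.quantization.qat import FakeQuantizeConfig, IntxFakeQuantizeConfig
cfg = FromIntXQuantizationAwareTrainingConfig()
"""
for hit in find_deprecated_names(example):
    print(hit)
```

The word-boundary anchors matter here because two of the old names are substrings of other valid identifiers.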

Remove old `change_linear_weights_to_*` APIs (https://github.com/pytorch/ao/pull/2721)

The following old quantization APIs no longer work and are removed:

    change_linear_weights_to_int8_dqtensors(model)
    change_linear_weights_to_int8_woqtensors(model)
    change_linear_weights_to_int4_woqtensors(model)

Please use the quantize_ API with the following configs instead:

    quantize_(model, Int8WeightOnlyConfig())
    quantize_(model, Int4WeightOnlyConfig())

Deprecations

Deprecate old TORCH_VERSION variables (https://github.com/pytorch/ao/pull/2719)

The following variables are deprecated and will be removed in the next release:

    TORCH_VERSION_AT_LEAST_2_2
    TORCH_VERSION_AT_LEAST_2_3
    TORCH_VERSION_AT_LEAST_2_4
    TORCH_VERSION_AT_LEAST_2_5
    TORCH_VERSION_AT_LEAST_2_6
    TORCH_VERSION_AT_LEAST_2_7
    TORCH_VERSION_AT_LEAST_2_8
    TORCH_VERSION_AFTER_2_2
    TORCH_VERSION_AFTER_2_3
    TORCH_VERSION_AFTER_2_4
    TORCH_VERSION_AFTER_2_5
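Code that relied on these constants needs another way to gate on the installed PyTorch version. A minimal sketch follows, assuming a hypothetical helper `version_at_least` that compares only the numeric release segment of a version string. torchao may ship its own utility for this, so check the linked PR before relying on a specific name.

```python
import re


def version_at_least(current: str, minimum: str) -> bool:
    """Return True if `current` >= `minimum`, comparing numeric release parts only.

    Hypothetical helper for illustration; suffixes such as ".dev20250601"
    or "+cu128" are ignored rather than ordered.
    """
    def release(version: str):
        match = re.match(r"(\d+(?:\.\d+)*)", version)
        if match is None:
            raise ValueError(f"unrecognized version string: {version!r}")
        return tuple(int(part) for part in match.group(1).split("."))

    cur, req = release(current), release(minimum)
    # Pad to equal length so "2.6" compares equal to "2.6.0".
    width = max(len(cur), len(req))
    cur += (0,) * (width - len(cur))
    req += (0,) * (width - len(req))
    return cur >= req


# In real code, `current` would be torch.__version__.
print(version_at_least("2.8.0+cu128", "2.6.0"))
print(version_at_least("2.5.1", "2.6.0"))
```

Comparing integer tuples rather than raw strings avoids the classic mistake of ordering "2.10" before "2.6".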

Drop support for PyTorch 2.5 and before (https://github.com/pytorch/ao/pull/2720)

torchao only supports the latest 3 versions of PyTorch. Please upgrade to PyTorch 2.6.0+ if you were using an older version of PyTorch.

New Features

New multi-step QAT API (https://github.com/pytorch/ao/pull/2629)
Add float8 FakeQuantizeConfig and FakeQuantizer (https://github.com/pytorch/ao/pull/2735)
(prototype) Add NVFP4 QAT (https://github.com/pytorch/ao/pull/2666)

Improvements

Add StretchedUnifTorchaoQuantizer (https://github.com/pytorch/ao/pull/2576)
Allow symmetric_no_clipping_error for KleidiAI kernels, update Readme and validate Kleidi INT4 quantization path (https://github.com/pytorch/ao/pull/2570)
Enable powers of 2 cast in float8 rowwise_with_gw_hp recipe (https://github.com/pytorch/ao/pull/2677)
Don't call erase if node is already erased in batch norm fusion. (https://github.com/pytorch/ao/pull/2716)
Generalize FakeQuantizer beyond intx (https://github.com/pytorch/ao/pull/2714)
Allow pattern replacement to ignore literals (https://github.com/pytorch/ao/pull/2519)
Replace export_for_training with torch.export.export (https://github.com/pytorch/ao/pull/2724)
Allow no quantization during QATConfig convert (https://github.com/pytorch/ao/pull/2694)
Int4 sparse marlin tensor (https://github.com/pytorch/ao/pull/2771)
Remove group_size arg in Float8DynamicActivationInt4WeightConfig (https://github.com/pytorch/ao/pull/2779)
Fix batch norm folding in prepare_pt2e for multiple conv->BN chains sharing the same conv weights (https://github.com/pytorch/ao/pull/2795)
Add Float8Tensor (https://github.com/pytorch/ao/pull/2463)
(prototype) Allow per-group quantizers in QuantOptimizer, fix state_dict (https://github.com/pytorch/ao/pull/2743)
(prototype) SpinQuant support split qkv (https://github.com/pytorch/ao/pull/2547)
(prototype) Make AWQ more general (https://github.com/pytorch/ao/pull/2400)
(prototype) MX training
- Integration of new mxfp8 casting cuda kernel (https://github.com/pytorch/ao/pull/2564)
- Mx: expose scaling calculation methods in training UX (https://github.com/pytorch/ao/pull/2620)
- Mx: make CUDA kernel for dim1 cast in mxfp8_cublas recipe (https://github.com/pytorch/ao/pull/2661)
(prototype) MoE training
- Mxfp8 emulated grouped gemm (https://github.com/pytorch/ao/pull/2626)
- Add differentiable mxfp8 grouped gemm with dynamic quant (forward pass) (https://github.com/pytorch/ao/pull/2627)
- Support for 2d-2d emulated mxfp8 grouped gemm (https://github.com/pytorch/ao/pull/2632)
- Backward pass for differentiable mxfp8 grouped gemm with dynamic quant (https://github.com/pytorch/ao/pull/2639)
- torch.compile support for ScaledGroupedMMTensor (https://github.com/pytorch/ao/pull/2509)
- Assert expert weights are column-major; preserve subclass with transpose (https://github.com/pytorch/ao/pull/2663)
- set token group alignment size to 16 for fp8 training test (https://github.com/pytorch/ao/pull/2678)
- Make scaling type configurable for MoE training (https://github.com/pytorch/ao/pull/2642)
- use smaller block sizes for per group scaling kernels to improve perf (https://github.com/pytorch/ao/pull/2668)
- add llama4 benchmarking script (https://github.com/pytorch/ao/pull/2669)
- add fp8 rowwise kernels for expert weights (https://github.com/pytorch/ao/pull/2696)
- add bench script for fp8 rowwise kernels and update autotune configs (https://github.com/pytorch/ao/pull/2697)
- integrate rowwise expert quant kernel (https://github.com/pytorch/ao/pull/2698)
- work around wrap_triton bug by using normal custom ops instead for fp8 rowwise kernels (https://github.com/pytorch/ao/pull/2734)
- fix scaling type bug; refactor distributed tests (https://github.com/pytorch/ao/pull/2749)
- use llama4 shapes for kernel benchmarks (https://github.com/pytorch/ao/pull/2756)
- remove duplicate benchmark script (https://github.com/pytorch/ao/pull/2762)
- refactor to share benchmarking and profiling utils (https://github.com/pytorch/ao/pull/2767)
- add memory bandwidth calculations to kernel benchmarking scripts (https://github.com/pytorch/ao/pull/2769)
- update bench script to compare fp8 dynamic quant scaled_grouped_mm fwd+bwd against bf16 (https://github.com/pytorch/ao/pull/2765)
Float8 blockwise training (prototype)
- Add Triton kernels for fp8 blockwise quantization and GEMMs (https://github.com/pytorch/ao/pull/2617)
- Add Float8BlockwiseLinear for training (https://github.com/pytorch/ao/pull/2618)
- Improve fp8 blockwise gemm perf (https://github.com/pytorch/ao/pull/2784)

Bug Fixes

Fix autocast handling for float8 training rowwise recipes (https://github.com/pytorch/ao/pull/2587)
NVFP4 -> Use more of e4m3 range for block_scales (https://github.com/pytorch/ao/pull/2604)
Handle the case when param groups are passed to optimizer (https://github.com/pytorch/ao/pull/2606)
Fix bc breakage flex path (https://github.com/pytorch/ao/pull/2652)
Fix FSDP2 breakage in nightly (https://github.com/pytorch/ao/pull/2684)
When replacing literals with placeholders lists are always converted to (https://github.com/pytorch/ao/pull/2518)
Don't learn zero points for symmetric quantization (https://github.com/pytorch/ao/pull/2739)
fix ROCM build for newer hipblaslt BC-breaking change (https://github.com/pytorch/ao/pull/2510)
Fix missing QuantOptimizer methods (https://github.com/pytorch/ao/pull/2770)
Fix float8 + int4 QAT (https://github.com/pytorch/ao/pull/2851)
Allowlist WeightWithDynamicFloat8CastTensor for deserialization for checkpointing (https://github.com/pytorch/ao/pull/2573)

Performance

Fix float8 rowwise inference perf with torch.compile (https://github.com/pytorch/ao/pull/2672)
Add CUDA kernel for MXFP8 dim1 casting (https://github.com/pytorch/ao/pull/2513, https://github.com/pytorch/ao/pull/2550)
Extend the MX cast benchmark to include casting to mxfp4 (https://github.com/pytorch/ao/pull/2693)

Documentation

Add QLoRA and FP8 to finetuning tutorial (part 2) (https://github.com/pytorch/ao/pull/2542)
Clean up QAT API surface + add separate API ref (https://github.com/pytorch/ao/pull/2567)
Update float8 README with AMD MI300X benchmark results (https://github.com/pytorch/ao/pull/2736)
Update float8 README.md with more recent e2e performance numbers (https://github.com/pytorch/ao/pull/2774, https://github.com/pytorch/ao/pull/2580)
Update quantization overview and contributor guide doc (https://github.com/pytorch/ao/pull/2723)
add e2e training benchmark results to mx_formats README.md (https://github.com/pytorch/ao/pull/2777)
Update paper link readme (https://github.com/pytorch/ao/pull/2563)
Minor improvements to OpenVINOQuantizer (https://github.com/pytorch/ao/pull/2581)
Update README with PEFT integration + installation (https://github.com/pytorch/ao/pull/2559)

Developers

Bump cutlass version to 4.1.0 (https://github.com/pytorch/ao/pull/2589)
Fix git repo url in citation (https://github.com/pytorch/ao/pull/2599)
Simplify Float8Linear (https://github.com/pytorch/ao/pull/2594, https://github.com/pytorch/ao/pull/2595)
Convert quantization internal methods to private (https://github.com/pytorch/ao/pull/2568)
Reference representation of dqlinear int4 for xnnpack (https://github.com/pytorch/ao/pull/2520)
Refactors to align with new tensor subclass design
- Add all fbgemm kernel Tensors into Int4WeightOnlyConfig and Float8DynamicActivationInt4WeightConfig (https://github.com/pytorch/ao/pull/2474)
- Add support for float8 activation for Int4PreshuffledTensor (https://github.com/pytorch/ao/pull/2437)
- Align Int4Tensor implementation details with the design of Float8Tensor (https://github.com/pytorch/ao/pull/2687)
- Support optional_tensor_names in TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2710)
- Update Int4PreshuffledTensor to align with implementation details of the Float8Tensor (https://github.com/pytorch/ao/pull/2738)
- Nvfp4 tensor: switch to using qdata (https://github.com/pytorch/ao/pull/2787)
- Nvfp4 tensor: switch to TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2788)
- Nvfp4 tensor: refactor weight-only vs dynamic quant (https://github.com/pytorch/ao/pull/2790)
- Mxtensor: make data argument first and rename to qdata (https://github.com/pytorch/ao/pull/2804)
- Mxtensor: inherit from TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2805)
- Mxtensor: refactor activation quant to use direct logic (https://github.com/pytorch/ao/pull/2806)
- Support more ops in TorchAOBaseTensor (https://github.com/pytorch/ao/pull/2609)

New Contributors

@wdvr made their first contribution in https://github.com/pytorch/ao/pull/2548
@carmocca made their first contribution in https://github.com/pytorch/ao/pull/2539
@gausah-arm made their first contribution in https://github.com/pytorch/ao/pull/2570
@daniil-lyakhov made their first contribution in https://github.com/pytorch/ao/pull/2581
@zeshengzong made their first contribution in https://github.com/pytorch/ao/pull/2599
@amdfaa made their first contribution in https://github.com/pytorch/ao/pull/2662
@chowarfb made their first contribution in https://github.com/pytorch/ao/pull/2657
@abeakkas made their first contribution in https://github.com/pytorch/ao/pull/2716
@subhankarpal made their first contribution in https://github.com/pytorch/ao/pull/2795

Full Changelog: https://github.com/pytorch/ao/compare/v0.12.0...v0.13.0-rc1

Published by vkuzo 9 months ago

torchao - v0.12.0

Highlights

We are excited to announce the 0.12.0 release of torchao! This release adds QAT + Axolotl integration and prototype MXFP/NVFP support on Blackwell GPUs!

QAT + Axolotl Integration

TorchAO’s QAT support has been integrated into Axolotl’s fine-tuning recipes! Check out the Axolotl docs or run it yourself using the following commands:

    axolotl train examples/llama-3/3b-qat-fsdp2.yaml
    axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml

Initial results for Llama3.2-3B by @SalmanMohammadi (https://github.com/axolotl-ai-cloud/axolotl/pull/2590):

| Model/Metric | hellaswag acc | hellaswag acc_norm | wikitext bits_per_byte | wikitext byte_perplexity | wikitext word_perplexity |
|--------------|---------------|--------------------|------------------------|--------------------------|--------------------------|
| bfloat16 | 0.5552 | 0.7315 | 0.6410 | 1.5594 | 10.7591 |
| bfloat16 PTQ | 0.5393 | 0.7157 | 0.6613 | 1.5815 | 11.6033 |
| qat ptq | 0.5423 | 0.7180 | 0.6567 | 1.5764 | 11.4043 |
| Recovered (qat ptq) | 18.87% | 14.56% | 22.66% | 23.08% | 23.57% |
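The "Recovered" row is consistent with measuring how much of the gap between the bfloat16 baseline and plain PTQ is closed by QAT. A short sketch that recomputes the row from the table values, assuming that definition (the function name `recovered_fraction` is ours, not Axolotl's or torchao's):

```python
def recovered_fraction(baseline, ptq, qat_ptq):
    """Fraction of the baseline-vs-PTQ gap that QAT closes (assumed definition)."""
    return (qat_ptq - ptq) / (baseline - ptq)


# (bfloat16, bfloat16 PTQ, qat ptq) taken from the table above
results = {
    "hellaswag acc": (0.5552, 0.5393, 0.5423),
    "hellaswag acc_norm": (0.7315, 0.7157, 0.7180),
    "wikitext bits_per_byte": (0.6410, 0.6613, 0.6567),
    "wikitext byte_perplexity": (1.5594, 1.5815, 1.5764),
    "wikitext word_perplexity": (10.7591, 11.6033, 11.4043),
}

for metric, (bf16, ptq, qat) in results.items():
    print(f"{metric}: {100 * recovered_fraction(bf16, ptq, qat):.2f}% recovered")
```

The same formula works for metrics where higher is better (accuracy) and where lower is better (perplexity), because the sign of the numerator and denominator flip together.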

[Prototype | API not finalized] MXFP and NVFP support on Blackwell GPUs

TorchAO now includes prototype support for NVFP4 (NVIDIA's 4-bit floating-point format) and Microscaling (MX) formats on NVIDIA's latest Blackwell GPU architecture. These formats enable efficient inference, achieving up to 61% end-to-end performance improvement in vLLM on Qwen3 models and near 2x speedups for diffusion workloads.

To use:

```py
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import (
    MXFPInferenceConfig,
    NVFP4InferenceConfig,
)

# Quantize model with MXFP8 (quantize_ modifies the model in place)
quantize_(model, MXFPInferenceConfig(block_size=32))

# Or quantize model to NVFP4 (without double scaling)
quantize_(model, NVFP4InferenceConfig())
```

Note: This is a prototype feature with APIs subject to change. Requires NVIDIA Blackwell GPUs (B200, 5090) with CUDA 12.8+.
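For intuition about what a block size of 32 means: in microscaling (MX) formats, every block of 32 elements shares a single power-of-two scale, and the elements are stored in a narrow float type relative to that scale. The sketch below computes such a shared scale in pure Python, assuming the floor-based rule from the OCP Microscaling specification and FP8 E4M3 elements (largest exponent 8, largest value 448). `mx_block_scales` is a hypothetical name, and torchao's kernels may choose scales differently.

```python
import math

E4M3_EMAX = 8      # exponent of the largest power of two representable in FP8 E4M3
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value


def mx_block_scales(values, block_size=32):
    """Return one power-of-two scale per block (illustrative, not torchao's kernel)."""
    scales = []
    for start in range(0, len(values), block_size):
        amax = max(abs(v) for v in values[start:start + block_size])
        if amax == 0.0:
            scales.append(1.0)
            continue
        # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
        _, e = math.frexp(amax)
        scales.append(2.0 ** ((e - 1) - E4M3_EMAX))
    return scales


block = [0.0125] * 31 + [-0.025]
scale = mx_block_scales(block)[0]
# Scale and saturate; the final rounding to FP8 is omitted in this sketch.
scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block]
print(scale, max(abs(s) for s in scaled))
```

Because the scale is a power of two, dividing by it only shifts exponents, so the scaling step itself introduces no rounding error; all precision loss comes from the narrow element type.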

BC Breaking

Remove preserve_zero and zero_point_domain from choose_qparams_affine (https://github.com/pytorch/ao/pull/2149)
Rename qparams for tinygemm (https://github.com/pytorch/ao/pull/2344)
Convert quant_primitives methods private (https://github.com/pytorch/ao/pull/2350)
Delete Galore (https://github.com/pytorch/ao/pull/2397)
Remove more Galore bits (https://github.com/pytorch/ao/pull/2417)
Remove sparsity/prototype/blocksparse (https://github.com/pytorch/ao/pull/2205)

Deprecations

Clean up prototype folder (https://github.com/pytorch/ao/pull/2232)
Make float8 training's force_recompute_fp8_weight_in_bwd flag do nothing (https://github.com/pytorch/ao/pull/2356)

New Features

Enabling MOE Quantization using linear decomposition (https://github.com/pytorch/ao/pull/2043)
[PT2E][X86] Migrate fusion passes in Inductor to torchao (https://github.com/pytorch/ao/pull/2140)
2:4 activation sparsity packing kernels (https://github.com/pytorch/ao/pull/2012)
Add subclass based method for inference w/ MXFP8 (https://github.com/pytorch/ao/pull/2132)
Feat: Implementation of the DeepSeek blockwise quantization for fp8 tensors (https://github.com/pytorch/ao/pull/1763)
Arm_inductor_quantizer for Pt2e quantization (https://github.com/pytorch/ao/pull/2139)
Add mx_fp4 path (https://github.com/pytorch/ao/pull/2201)
Add support for KleidiAI int4 kernels on aarch64 Linux (https://github.com/pytorch/ao/pull/2169)
Add support for fbgemm int4 mm kernel (https://github.com/pytorch/ao/pull/2255)
Enable fp16+int4 mixed precision path for int4 xpu path with int zero point (https://github.com/pytorch/ao/pull/2240)
Enable range learning for QAT (https://github.com/pytorch/ao/pull/2033)
Patch the _is_conv_node function (https://github.com/pytorch/ao/pull/2257)
Add support for fbgemm fp8 kernels (https://github.com/pytorch/ao/pull/2276)
Add Float8ActInt4WeightQATQuantizer (https://github.com/pytorch/ao/pull/2289)
[float8] add _auto_filter_for_recipe to float8 (https://github.com/pytorch/ao/pull/2410)
NVfp4 (https://github.com/pytorch/ao/pull/2408)
[float8] Prevent quantize_affine_float8/dequantize_affine_float8 decomposed on inductor (https://github.com/pytorch/ao/pull/2379)
[CPU] Enable DA8W4 on CPU (https://github.com/pytorch/ao/pull/2128)
Add exportable coreml codebook quantization op (https://github.com/pytorch/ao/pull/2443)
Add support for Int4GroupwisePreshuffleTensor for fbgemm (https://github.com/pytorch/ao/pull/2421)

Improvements

Add serialization support for AOPerModuleConfig (https://github.com/pytorch/ao/pull/2186)
Set eps in end-to-end QAT flow (https://github.com/pytorch/ao/pull/2180)
Enable {conv3d, conv_transpose3d} + bn fusion in pt2e (https://github.com/pytorch/ao/pull/2212)
Update GemLite to support vLLM V1 (https://github.com/pytorch/ao/pull/2199)
[sparse] Add fp8 sparse gemm with rowwise scaling for activation sparsity (https://github.com/pytorch/ao/pull/2242)
Patch the _is_conv_node function (https://github.com/pytorch/ao/pull/2223)
Relax int4wo device mismatch error (https://github.com/pytorch/ao/pull/2254)
Rename AOPerModuleConfig to ModuleFqnToConfig (https://github.com/pytorch/ao/pull/2243)
[reland2][ROCm] preshuffled weight mm (https://github.com/pytorch/ao/pull/2207)
GPTQ updates (https://github.com/pytorch/ao/pull/2235)
Fix QAT range learning, ensure scales get gradients (https://github.com/pytorch/ao/pull/2280)
Fix slicing and get_plain() in GemLite (https://github.com/pytorch/ao/pull/2288)
Add slicing support for fbgemm fp8 and int4 (https://github.com/pytorch/ao/pull/2308)
Add support for bmm and to for fbgemm Tensor (https://github.com/pytorch/ao/pull/2337)
Add dynamic quantization support to gemlite layout (https://github.com/pytorch/ao/pull/2327)
Test PARQ with torchao activation quantization (https://github.com/pytorch/ao/pull/2370)
Update index.rst (https://github.com/pytorch/ao/pull/2395)
Add inplace quantizer examples (https://github.com/pytorch/ao/pull/2345)
Build mxfp4 kernel for sm120a (https://github.com/pytorch/ao/pull/2285)
Enable to_mxfp8 cast for DTensor (https://github.com/pytorch/ao/pull/2420)
Enable tensor parallelism for MXLinear (https://github.com/pytorch/ao/pull/2434)
Graduate debug handle in torchao (https://github.com/pytorch/ao/pull/2452)
Switch alignment to 8 for cutlass 4 upgrade (https://github.com/pytorch/ao/pull/2491)
Mxfp8 training: add TP sharding strategy for dim1 kernel (https://github.com/pytorch/ao/pull/2436)

Bug Fixes

[optim] Fix low-bit optim when used with FSDP2+CPUOffload (https://github.com/pytorch/ao/pull/2195)
Fix Per Row scaling for inference (https://github.com/pytorch/ao/pull/2253)
Fix benchmark_low_bit_adam.py reference (https://github.com/pytorch/ao/pull/2287)
[optim] Fix bug when default dtype is BF16 (https://github.com/pytorch/ao/pull/2286)
[sparse] marlin fixes (https://github.com/pytorch/ao/pull/2305)
Fix ROCM test failures (https://github.com/pytorch/ao/pull/2362)
[float8] Add fnuz fp8 dtypes to Float8Layout (https://github.com/pytorch/ao/pull/2351)
Fixing ruff format for trunk (https://github.com/pytorch/ao/pull/2369)
Fixing trunk - autoquant test failure (https://github.com/pytorch/ao/pull/2363)
Remove torchao dependency from torchao build script (https://github.com/pytorch/ao/pull/2383)
Fix torchao quantized model in fbcode (https://github.com/pytorch/ao/pull/2396)
Gemlite generate.py fix (https://github.com/pytorch/ao/pull/2372)
Fixes issue #156414: Fixes bug in implementation of _combine_histogram (Follow up) (https://github.com/pytorch/ao/pull/2418)
TorchAO new observers (https://github.com/pytorch/ao/pull/2508)
Fix tutorials (https://github.com/pytorch/ao/pull/2516)

Performance

Add a triton kernel for swizzling (https://github.com/pytorch/ao/pull/2168)

Documentation

Add blockwise fp8 gemm benchmarks to README (https://github.com/pytorch/ao/pull/2203)
[float] document e2e training -> inference flow (https://github.com/pytorch/ao/pull/2190)
Update Readme (https://github.com/pytorch/ao/pull/1526)
Mark QAT range learning as prototype for now (https://github.com/pytorch/ao/pull/2272)
Update float8 training readme to include time measurement (https://github.com/pytorch/ao/pull/2291)
[BE/docs] Add float8 training api ref to docsite (https://github.com/pytorch/ao/pull/2313)
Enable doc build to run on PRs (https://github.com/pytorch/ao/pull/2315)
[BE] [docs] Add float8 pretraining tutorial to docsite (https://github.com/pytorch/ao/pull/2304)
[BE/docs] Add fp8 rowwise perf table to float8 training readme (https://github.com/pytorch/ao/pull/2312)
Update Quantization docs to show newer AOConfigs (https://github.com/pytorch/ao/pull/2317)
Update QAT docs, highlight axolotl integration (https://github.com/pytorch/ao/pull/2266)
Add static quant tutorial (https://github.com/pytorch/ao/pull/2047)
Update README.md to include seamless v2 (https://github.com/pytorch/ao/pull/2355)
Add Tutorial on E2E integration into VLLM and minimal Subclass (https://github.com/pytorch/ao/pull/2346)
[docs] Replace deprecated configs with Config objects (https://github.com/pytorch/ao/pull/2375)
Revamp README (https://github.com/pytorch/ao/pull/2374)
Add pt2e tutorials to torchao doc page (https://github.com/pytorch/ao/pull/2384)
Add part 2 of end-to-end tutorial: fine-tuning (https://github.com/pytorch/ao/pull/2394)
Call out axolotl + QAT integration on README (https://github.com/pytorch/ao/pull/2442)
Float8 readme: remove duplication (https://github.com/pytorch/ao/pull/2447)
Float8 readme: add key features section (https://github.com/pytorch/ao/pull/2448)
Update README.md to include Flux-Fast (https://github.com/pytorch/ao/pull/2457)
Inference tutorial - Part 3 of e2e series (https://github.com/pytorch/ao/pull/2343)
Update QAT README and API docstrings (https://github.com/pytorch/ao/pull/2465)
Fix typo : whic -> which (https://github.com/pytorch/ao/pull/2495)
Fix links for torchao tutorials (https://github.com/pytorch/ao/pull/2503)
Fix docstrings for quantization API docs (https://github.com/pytorch/ao/pull/2471)
Tutorial for benchmarking (https://github.com/pytorch/ao/pull/2499)

Developers

New Contributors

@malfet made their first contribution in https://github.com/pytorch/ao/pull/2181
@the-tuning-machine made their first contribution in https://github.com/pytorch/ao/pull/1763
@choudhary-devang made their first contribution in https://github.com/pytorch/ao/pull/2139
@vctrmn made their first contribution in https://github.com/pytorch/ao/pull/2169
@yuguo68 made their first contribution in https://github.com/pytorch/ao/pull/2225
@liangan1 made their first contribution in https://github.com/pytorch/ao/pull/2240
@emmanuel-ferdman made their first contribution in https://github.com/pytorch/ao/pull/2250
@odiemm-meta made their first contribution in https://github.com/pytorch/ao/pull/2328
@lilianaairhart made their first contribution in https://github.com/pytorch/ao/pull/2360
@Gasoonjia made their first contribution in https://github.com/pytorch/ao/pull/2390
@zixi-qi made their first contribution in https://github.com/pytorch/ao/pull/2396
@shiyang-weng made their first contribution in https://github.com/pytorch/ao/pull/2379
@Akabbaj made their first contribution in https://github.com/pytorch/ao/pull/2418
@mori360 made their first contribution in https://github.com/pytorch/ao/pull/2449
@henrylhtsang made their first contribution in https://github.com/pytorch/ao/pull/2491
@namgyu-youn made their first contribution in https://github.com/pytorch/ao/pull/2495
@rohansjoshi made their first contribution in https://github.com/pytorch/ao/pull/2508

Full Changelog: https://github.com/pytorch/ao/compare/v0.11.0...v0.12.0-rc2

Published by drisspg 11 months ago

torchao - v0.11.0

Highlights

We are excited to announce the 0.11.0 release of torchao! This release adds support for mixture-of-experts (MoE) quantization, PyTorch 2 Export Quantization (PT2E), and a microbenchmarking framework for inference APIs!

MoE Quantization

We’ve added a prototype feature for quantizing MoE modules with a number of TorchAO quantization techniques. This approach leverages the existing TorchAO features for quantizing linear ops and allows them to be used to quantize MoE modules.

```py
import torch
from torchao.quantization.prototype.moe_quant.utils import cond_ffn_filter, MoEQuantConfig
from torchao.quantization.quant_api import quantize_, Int8WeightOnlyConfig

quantize_(
    model,
    MoEQuantConfig(Int8WeightOnlyConfig()),
    filter_fn=cond_ffn_filter
)
model = torch.compile(
    model,
    mode="reduce-overhead",
    fullgraph=is_single_token_inference
)
```

While the above API is all that is needed to quantize an MoE module that is written to be both quantizable and compilable, in practice it’s rare for a user model to satisfy these conditions due to the variety of MoE implementations. An initial swap of the normal MoE module with a MoEFeedForwardAOQuantizable module is needed to first prepare the model for quantization. An example of this can be found in llama4_quant.py, where this technique is demonstrated for the Hugging Face llama-4-Scout-17B-16E-Instruct model.

We implemented MoE quantization with two methods. The first method (designated `base` in the benchmarks below) simply enhances the existing quantized tensor subclass to quantize the 3D MoE expert tensors and perform the necessary indexing and slicing ops, while the second method (`fake`) uses a new tensor subclass to simulate a 3D quantized parameter by storing a sequence of 2D slices of the quantized parameter. The first approach is faster with marginally worse memory characteristics. In both cases, doing MoE quantization in this way isn’t expected to be maximally performant compared to implementing fused MoE kernels for each technique, but this approach can yield both moderate speedups and significant memory savings.

The following benchmarks are for mixtral-moe run on a single H100 GPU:

| Technique | tok/s (batch size 1) | memory (GB) (batch size 1) | tok/s (batch size 8) | tok/s * batch (batch size 8) | memory (GB) (batch size 8) |
|-------------|-------------|-------------|-------------|--------------|-------------|
| None | 78.35 | 93.76 | 18.2 | 145.64 | 94.12 |
| int8wo-base | 98.4 | 48.87 | 4.94 | 39.56 | 49.2 |
| int4wo-base | 79.38 | 36.15 | 10.29 | 82.29 | 36.12 |
| fp8wo-base | 59.41 | 52.07 | 2.98 | 23.81 | 52.05 |
| fp8dq-base | 45.92 | 53.97 | 3.78 | 30.23 | 53.94 |
| int8wo-fake | 6.14 | 49.13 | 5.01 | 40.09 | 49.23 |
| int4wo-fake | 14.25 | 30.21 | 11.84 | 94.75 | 30.19 |
| fp8wo-fake | 3.2 | 50.31 | 2.88 | 23.08 | 50.29 |
| fp8dq-fake | 9.78 | 50.92 | 4.08 | 32.61 | 50.89 |
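To read the table in relative terms, the short sketch below derives the batch-size-1 throughput ratio and memory saving of each technique against the unquantized baseline. The numbers are copied from the table; the helper name `relative_to_baseline` is ours.

```python
# (tok/s, memory in GB) at batch size 1, copied from the table above
batch1 = {
    "None": (78.35, 93.76),
    "int8wo-base": (98.4, 48.87),
    "int4wo-base": (79.38, 36.15),
    "fp8wo-base": (59.41, 52.07),
    "fp8dq-base": (45.92, 53.97),
    "int8wo-fake": (6.14, 49.13),
    "int4wo-fake": (14.25, 30.21),
    "fp8wo-fake": (3.2, 50.31),
    "fp8dq-fake": (9.78, 50.92),
}


def relative_to_baseline(table, baseline="None"):
    """Return {technique: (throughput ratio, fraction of memory saved)}."""
    base_tps, base_mem = table[baseline]
    return {
        name: (tps / base_tps, 1.0 - mem / base_mem)
        for name, (tps, mem) in table.items()
        if name != baseline
    }


for name, (speedup, saved) in relative_to_baseline(batch1).items():
    print(f"{name}: {speedup:.2f}x throughput, {saved:.0%} less memory")
```

At batch size 1, only the int8wo-base and int4wo-base rows exceed the baseline throughput, while every technique cuts memory by more than 40%.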

PT2 Export Quantization

We migrated PyTorch 2 Export (PT2E) quantization from PyTorch to torchao as part of the planned migration. We’ll follow up by adding deprecation warnings to the PyTorch torch.ao.quantization APIs and updating the docs. We also simplified the import path for some of the util functions. Here is a non-exhaustive list of APIs you can use:

```
# top level APIs
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, prepare_qat_pt2e, convert_pt2e
from torchao.quantization.pt2e.quantizer import X86InductorQuantizer

# export utils
from torchao.quantization.pt2e import (
    move_exported_model_to_eval,
    move_exported_model_to_train,
    allow_exported_model_train_eval
)

# graph utils
from torchao.quantization.pt2e import (
    find_sequential_partitions,
    get_equivalent_types,
    update_equivalent_types_dict,
    bfs_trace_with_node_process,
)

# pt2e numeric debugger
from torchao.quantization.pt2e import (
    generate_numeric_debug_handle,
    CUSTOM_KEY,
    NUMERIC_DEBUG_HANDLE_KEY,
    prepare_for_propagation_comparison,
    extract_results_from_loggers,
    compare_results,
)
```

Microbenchmarking Framework for Inference APIs

We’ve introduced a streamlined microbenchmark framework to help developers track and evaluate the performance of their post-training quantization and sparsity APIs for different matrix sizes and model types. The framework also includes support for advanced GPU and memory profiling techniques, providing deeper insights into performance characteristics.

To run the benchmarks, use the following command:

python -m benchmarks.microbenchmarks.benchmark_runner --config benchmarks/microbenchmarks/test/benchmark_config.yml

Sample Benchmark Results (on 1xH100):

| Name | Quantization | Shape | Baseline Inference Time (ms) | Inference Time (ms) | Speedup |
|-------------------|-----------------|---------------------|------------------------------|---------------------|---------|
| small_bf16_linear | float8dq-tensor | 16384, 16384, 16384 | 13.34 | 7.72 | 1.73x |
| small_bf16_linear | float8dq-tensor | 16384, 16384, 32768 | 26.04 | 14.62 | 1.78x |
| small_bf16_linear | float8dq-tensor | 16384, 16384, 65536 | 53.59 | 29.05 | 1.84x |
| small_bf16_linear | float8dq-tensor | 16384, 32768, 32768 | 68.94 | 28.07 | 2.46x |
| small_bf16_linear | float8dq-tensor | 16384, 32768, 65536 | 108.63 | 58.7 | 1.85x |
| small_bf16_linear | float8dq-tensor | 16384, 65536, 65536 | 215.66 | 118.42 | 1.82x |
| small_bf16_linear | float8dq-tensor | 32768, 32768, 32768 | 108.16 | 57.09 | 1.89x |
| small_bf16_linear | float8dq-tensor | 32768, 32768, 65536 | 214.74 | 110.08 | 1.95x |
| small_bf16_linear | float8dq-tensor | 32768, 65536, 65536 | 432.44 | 223.46 | 1.94x |
| small_bf16_linear | float8dq-tensor | 65536, 65536, 65536 | 870.37 | 447.97 | 1.94x |
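
The Speedup column is the ratio of the baseline inference time to the quantized inference time. As a quick sanity check, the sketch below recomputes it for three rows copied from the table above. The `speedup` helper is our own illustration and is not part of the benchmark framework.

```python
def speedup(baseline_ms: float, quantized_ms: float) -> float:
    """Return how many times faster the quantized run is than the baseline."""
    return baseline_ms / quantized_ms


# (shape, baseline ms, quantized ms) copied from three rows of the table above
rows = [
    ((16384, 16384, 16384), 13.34, 7.72),
    ((16384, 32768, 32768), 68.94, 28.07),
    ((65536, 65536, 65536), 870.37, 447.97),
]

for shape, baseline_ms, quantized_ms in rows:
    print(shape, f"{speedup(baseline_ms, quantized_ms):.2f}x")
```

Rounded to two decimals, these ratios match the 1.73x, 2.46x and 1.94x entries in the table.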

BC Breaking

Remove prototype low bit optim code completely (https://github.com/pytorch/ao/pull/2159)

New Features

Add quantized attn_scores @ v test for intended use in quantized attention (https://github.com/pytorch/ao/pull/2008)
Add fallback kernel and interface (https://github.com/pytorch/ao/pull/2010)
Add fallback kernel and interface for rhs only quantized matmul (https://github.com/pytorch/ao/pull/2011)
Add KleidiAI gemm kernels (https://github.com/pytorch/ao/pull/2000)
Use quantized gemm only on aarch64 (https://github.com/pytorch/ao/pull/2023)
Adds utility to replace Q/DQ ops with torchao quantized linear ops (https://github.com/pytorch/ao/pull/1967)
Adds Q/DQ layout support for embedding quantization with IntxWeightOnlyConfig (https://github.com/pytorch/ao/pull/1972)
Move Int8DynamicActivationIntxWeightConfig out of experimental (https://github.com/pytorch/ao/pull/1968)
Initial ParetoQ commit (https://github.com/pytorch/ao/pull/1876)
INT4 XPU enabling (https://github.com/pytorch/ao/pull/1577)
Vectorized row sum (https://github.com/pytorch/ao/pull/2034)
Add gemm for fp32_a_int8_b matmul kernel (https://github.com/pytorch/ao/pull/2039)
Add gemm kernel to interface (https://github.com/pytorch/ao/pull/2040)
Add tests for attention matmul for gemm kernels (https://github.com/pytorch/ao/pull/2041)
Gemm int8 a int8 b kernels (https://github.com/pytorch/ao/pull/2049)
Add tests cases for q @ k attention variant (https://github.com/pytorch/ao/pull/2051)
Add gemm int8 a x int8 b to interface (https://github.com/pytorch/ao/pull/2055)
[Quant][PT2E][X86] Enable annotation of aten.mul.tensor with X86InductorQuantizer (https://github.com/pytorch/ao/pull/2075)
Add AOPerModuleConfig to torchao.quantization (https://github.com/pytorch/ao/pull/2134)
Enabling MoE Quantization using linear decomposition (https://github.com/pytorch/ao/pull/2043)

Improvement

Match QAT prepare and convert numerics exactly (https://github.com/pytorch/ao/pull/1964)
[Prototype] Update torchao.prototype.parq and add 4-bit Llama 3.2 1B benchmark (https://github.com/pytorch/ao/pull/2017)
[ROCm] preshuffled weight mm (https://github.com/pytorch/ao/pull/1702)
Remove old code from torchao.experimental.quant_api (https://github.com/pytorch/ao/pull/2030)
Remove zero_point_domain from quant configs (https://github.com/pytorch/ao/pull/2058)
Match QAT prepare and convert numerics exactly for bf16 and fp16 (https://github.com/pytorch/ao/pull/2060)
[scaled grouped mm] add triton kernels for float8 rowwise quantization with per-group/jagged scales (https://github.com/pytorch/ao/pull/2064)
[reland][ROCm] preshuffled weight mm (https://github.com/pytorch/ao/pull/2044)
[scaled grouped mm] integrate triton kernels into differentiable scaled grouped mm (https://github.com/pytorch/ao/pull/2077)
Add AOPerModuleConfig (https://github.com/pytorch/ao/pull/2119)
Improve GemLite Integration (https://github.com/pytorch/ao/pull/2096)
[prototype] PARQ quantizer support for torchao's weight-only configs (https://github.com/pytorch/ao/pull/2091)

Bug Fixes

Fix slice and padding for TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/2015)
Fix Int4WeightEmbeddingQATQuantizer.convert path (https://github.com/pytorch/ao/pull/2024)
Fix static AQT flow (https://github.com/pytorch/ao/pull/2046)
Fix QDQ layout slice operation when zero_point is None (https://github.com/pytorch/ao/pull/2054)
Fix aqt implementation for aten.mm/aten.addmm fallback path (https://github.com/pytorch/ao/pull/2072)
Fix AO SAM2 issues (https://github.com/pytorch/ao/pull/2109)
Fix AOPerModuleConfig bug in skipping quantizing modules (https://github.com/pytorch/ao/pull/2135)
Fixing aliasing behavior for slice in AQT TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/2174)

Performance

Add profiling to benchmarking (https://github.com/pytorch/ao/pull/2032)
Model shapes config (https://github.com/pytorch/ao/pull/2036)

Documentation

Remove hf_eval.py and add documentation on using lm-eval (https://github.com/pytorch/ao/pull/2045)
Update QAT README.md (https://github.com/pytorch/ao/pull/2162)

Developers

New Contributors

@YIWENX14 made their first contribution in https://github.com/pytorch/ao/pull/2080
@navsud made their first contribution in https://github.com/pytorch/ao/pull/2079
@jlbmorales made their first contribution in https://github.com/pytorch/ao/pull/2109
@syed-ahmed made their first contribution in https://github.com/pytorch/ao/pull/2163
@SalmanMohammadi made their first contribution in https://github.com/pytorch/ao/pull/2162

Full Changelog: https://github.com/pytorch/ao/compare/v0.10.0...v0.11.0

Published by andrewor14 about 1 year ago

torchao - v0.10.0

Highlights

We are excited to announce the 0.10.0 release of torchao! This release adds support for end-to-end training for mxfp8 on NVIDIA B200, PARQ (for quantization-aware training), a module swap quantization API for research, and updates for low-bit kernels!

Low Bit Optimizers moved to Official Support (https://github.com/pytorch/ao/pull/1864)

Low-bit optimizers (added in 0.4) have moved out of prototype and now have official support in torchao.

[Prototype] End to End Training Support for mxfp8 on NVIDIA B200 (#1786, #1841, #1951, #1932, #1980)

We have an early version of the end-to-end training workflow for the mxfp8 dtypes with torch.compile on NVIDIA B200. The cuBLAS mxfp8 gemm shows an observed speedup of more than 2x over the bfloat16 gemm, and casts from bfloat16 to mxfp8 achieve up to 5.5 TB/s. Please see our README.md for MX for more information. We plan to improve performance further in future releases.

[Prototype] Piecewise-Affine Regularized Quantization (https://github.com/pytorch/ao/pull/1738)

PARQ is a new theoretical framework for inducing quantization through regularization. It supports standard QAT, as well as new gradual quantization methods, in an easy to use optimizer-only interface. No modifications to a model’s forward or backward pass are needed for quantization.

```py
import torch
from torchao.prototype.parq.optim import QuantOptimizer, ProxHardQuant
from torchao.prototype.parq.quant import UnifQuantizer

# Separate quantizable from non-quantizable parameter groups
param_groups = [
    {"params": weights, "quant_bits": 2},  # add extra quant_bits key for QAT
    {"params": others},
]

# Initialize any torch.optim.Optimizer
base_optimizer = torch.optim.SGD(param_groups, lr=0.1, momentum=0.9, weight_decay=1e-4)

# Apply a simple wrapper to quantize in optimizer.step()
optimizer = QuantOptimizer(
    base_optimizer, quantizer=UnifQuantizer(), prox_map=ProxHardQuant()
)
```

[Prototype] Module Swap Quantization API (https://github.com/pytorch/ao/pull/1886)

We added a prototype API for post-training quantization. Users can swap their linear or embedding layers into their QuantizedLinear and QuantizedEmbedding counterparts, and set the quantizers that specify how they want the input activations or weights to be quantized:

    quantized_linear = QuantizedLinear(...)
    quantized_linear.weight_quantization = IntQuantizer(
        num_bits=4,
        group_size=32,
        dynamic=True,
        quantization_mode="symmetric",
    )
    quantized_linear.input_quantization = CodeBookQuantizer(
        num_bits=8,
        features=10,
    )

Note: The API is highly subject to change and will be integrated with quantize_ in the future. For more detail, please see the README.

[Prototype] Low Bit Kernels (#1826, #1935, #1998, #1652)

Low-bit CPU and MPS kernels are now pip installable from source. To install torchao with low-bit CPU kernels, you can use the following command on an Arm-based Mac:

USE_CPP=1 pip install git+https://github.com/pytorch/ao.git

You can then quantize your model to run on Arm-based Macs with high-performance CPU kernels in torchao. SharedEmbeddingQuantizer, EmbeddingQuantizer, and Int8DynamicActivationIntxWeightConfig all support 1-8 bit quantization.

```py
import torch
from torchao.experimental.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    SharedEmbeddingQuantizer,
    EmbeddingQuantizer,
)
from torchao.quantization.granularity import PerGroup, PerRow
from torchao.quantization.quant_api import quantize_

# Quantize embedding/unembedding to 8-bits with SharedEmbeddingQuantizer
# SharedEmbeddingQuantizer is for quantizing models like Llama1B/3B
# where the embedding/unembedding layers share weights
# If the embedding/unembedding layers do not share weights, use
# EmbeddingQuantizer instead
SharedEmbeddingQuantizer(
    weight_dtype=torch.int8,
    granularity=PerRow(),
    has_weight_zeros=True,
).quantize(model)

# Quantize linear layers to 4-bits
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(128),
        has_weight_zeros=False,
    ),
)
```

BC Breaking

Delete delayed scaling from torchao.float8 (https://github.com/pytorch/ao/pull/1753)

The following usage of `Float8LinearConfig` is no longer supported as of torchao v0.10.0:

    config = Float8LinearConfig(
        cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
        cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
        cast_config_grad_output=CastConfig(scaling_type=ScalingType.DELAYED),
    )

If you would like to use float8 training with delayed scaling, please use an earlier release of torchao. Please see https://github.com/pytorch/ao/issues/1680 for more context about this deprecation.

Enforce AOBaseConfig type in `quantize_`'s `config` argument (https://github.com/pytorch/ao/pull/1861)

This was done following a deprecation window to simplify the arguments of quantize_. Please see https://github.com/pytorch/ao/issues/1690 for more context.

```py
# torchao v0.9.0
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,
    device: Optional[torch.types.Device] = None,
): ...

# torchao v0.10.0
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,
    device: Optional[torch.types.Device] = None,
): ...
```

Remove the `set_inductor_config` argument of `quantize_`. (https://github.com/pytorch/ao/pull/1865)

This was done following a deprecation window to decouple quantize_ from torchinductor. Please see https://github.com/pytorch/ao/issues/1715 for more context.

```py
# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):
    # if set_inductor_config != None, throw a deprecation warning
    # if set_inductor_config == None, set it to True to stay consistent with old behavior
    ...

# torchao v0.10.0
def quantize_(
    ...,
):
    # set_inductor_config is removed from quantize_ and moved to relevant individual workflows
    ...
```

Deprecations

We removed some of our prototype features that are not used, including DORA (https://github.com/pytorch/ao/pull/1815), split_k kernel (https://github.com/pytorch/ao/pull/1816), profiler (https://github.com/pytorch/ao/pull/1862) and bitnet (https://github.com/pytorch/ao/pull/1866).

New Features

QAT

Added PARQ (https://github.com/pytorch/ao/pull/1738)

Low Bit Optimizers

Promote Low Bit Optim out of prototype (https://github.com/pytorch/ao/pull/1864)

Module swap quantization API

Add module swap quantization API from Quanty (https://github.com/pytorch/ao/pull/1886)

Benchmarking

Micro-benchmark inference (https://github.com/pytorch/ao/pull/1759)
Add sparsity to benchmarking (https://github.com/pytorch/ao/pull/1917)
Add float8 training benchmarking scripts (https://github.com/pytorch/ao/pull/1802)

Improvement

Kernels

1-8 bit CPU and MPS kernels are now pip installable from source (https://github.com/pytorch/ao/pull/1826)
Added 1-8 bit shared embedding ops to further compress models like Llama1B/3B where the embedding/unembedding weights are shared (https://github.com/pytorch/ao/pull/1935)
CPU kernels added runtime microkernel selection based on CPU features and matrix size (https://github.com/pytorch/ao/pull/1998)
KleidiAI microkernel library was integrated with CPU kernels to improve GEMM performance on Arm CPUs (https://github.com/pytorch/ao/pull/1652)
Add build flag to set parallel_backend (https://github.com/pytorch/ao/pull/1870)
Add quant api + python test for shared embedding (https://github.com/pytorch/ao/pull/1937)
Add dynamic shape support for lowbit kernels (https://github.com/pytorch/ao/pull/1942)
Add LUT-based bitpacking for 1-4 bits (https://github.com/pytorch/ao/pull/1987)
Add lut support to linear kernel (https://github.com/pytorch/ao/pull/1990)
Quantized matmul (https://github.com/pytorch/ao/pull/1994)
Add fp32xint8 matmul (https://github.com/pytorch/ao/pull/2004)
Add quantized q @ k test for intended use in quantized attention (https://github.com/pytorch/ao/pull/2006)
ROCm Support : Tile_Layout kernel (https://github.com/pytorch/ao/pull/1201)
Metal lowbit kernels: pip install (https://github.com/pytorch/ao/pull/1785)
Metal lowbit ops: ci (https://github.com/pytorch/ao/pull/1825)
ROCm Sparse Marlin Kernels #1206 (https://github.com/pytorch/ao/pull/1834)
ROCm OCP FP8 Support (https://github.com/pytorch/ao/pull/1677)
Migrate to int args (https://github.com/pytorch/ao/pull/1846)
Add bias support to torchao kernels (https://github.com/pytorch/ao/pull/1879)
Write weight packing/unpacking functions for universal kernels (https://github.com/pytorch/ao/pull/1921)
Unpack weights at col (https://github.com/pytorch/ao/pull/1933)
Shared embedding kernel (https://github.com/pytorch/ao/pull/1934)
Bug fixes for shared_embedding (https://github.com/pytorch/ao/pull/1941)
Update linear.h (https://github.com/pytorch/ao/pull/1963)
Reintroduce has_weight_zeros as a template param (https://github.com/pytorch/ao/pull/1991)

AOConfigs

Support Serialization for AOConfigs (https://github.com/pytorch/ao/pull/1875)
Migrate to config for Int8DynamicActivationIntxWeightConfig (https://github.com/pytorch/ao/pull/1836)
Migrate sparsify_ to configs (https://github.com/pytorch/ao/pull/1856)

SAM2

SAM2: Use torch.export for VOS (https://github.com/pytorch/ao/pull/1708)

QAT

Add linear bias support for QAT (https://github.com/pytorch/ao/pull/1755)

Allow for scales to be in new e8m0 dtype (https://github.com/pytorch/ao/pull/1742)
Support MXFP6 packing and fused unpack-dequantize kernel (https://github.com/pytorch/ao/pull/1810)
Implemented RCEIL (CUBLAS-style) MXFP scale factor derivation, with test cases. (https://github.com/pytorch/ao/pull/1835)
Use torch.float8_e8m0fnu in mx_formats (https://github.com/pytorch/ao/pull/1966)
Mx_formats: move training to the quantize_ API (https://github.com/pytorch/ao/pull/1970)

Affine Quantization

Add support for copy_ for plain layout and tensor core tiled layout (https://github.com/pytorch/ao/pull/1791)
Add bias support for Int8DynActInt4WeightLinear (https://github.com/pytorch/ao/pull/1845)
Move config out of experimental (https://github.com/pytorch/ao/pull/1954)

Bug Fixes

Fix potential out-of-bound access in int8_mm.py (https://github.com/pytorch/ao/pull/1751)
Fixing DORA imports (https://github.com/pytorch/ao/pull/1795)
Avoid assert error when there's bias (https://github.com/pytorch/ao/pull/1839)
Update triton import error message (https://github.com/pytorch/ao/pull/1842)
Enable the CPU int4 with HQQ quant (https://github.com/pytorch/ao/pull/1824)
Do not override requires_grad=False when enable_float8_all_gather=True (https://github.com/pytorch/ao/pull/1873)
Add MI300X specs to roofline benchmark (https://github.com/pytorch/ao/pull/1913)
Fix dynamic shape for shared embedding (https://github.com/pytorch/ao/pull/1946)

Performance

Modify cast from hp to mx to help inductor fuse (https://github.com/pytorch/ao/pull/1786)
Enable torch.compile for mxfp8_cublas recipe (https://github.com/pytorch/ao/pull/1841)
Optimize tensor_flatten for runtime (https://github.com/pytorch/ao/pull/1951)
Triton kernel to cast to mx and write in col-major (https://github.com/pytorch/ao/pull/1932)
small speedup with dim0 cast for mx (https://github.com/pytorch/ao/pull/1980)

Documentation

Updating Cuda 12.1/12.4 to 12.4/12.6 to reflect current state (https://github.com/pytorch/ao/pull/1794)
Update float8 training benchmark readme (https://github.com/pytorch/ao/pull/1872)
Add perf benchmarks for float8 training with rowwise + tensorwise scaling (https://github.com/pytorch/ao/pull/1793)
Fix link markdown in readme (https://github.com/pytorch/ao/pull/1881)
Refresh torchao.float8 README (https://github.com/pytorch/ao/pull/1986)
Refresh float8 training section of main README (https://github.com/pytorch/ao/pull/1985)
Refresh MX README (https://github.com/pytorch/ao/pull/1989)

New Contributors

@jithunnair-amd made their first contribution in https://github.com/pytorch/ao/pull/1749
@facebook-github-bot made their first contribution in https://github.com/pytorch/ao/pull/1752
@mark14wu made their first contribution in https://github.com/pytorch/ao/pull/1751
@lisjin made their first contribution in https://github.com/pytorch/ao/pull/1738
@mayank31398 made their first contribution in https://github.com/pytorch/ao/pull/1849
@alex-titterton made their first contribution in https://github.com/pytorch/ao/pull/1810
@mreso made their first contribution in https://github.com/pytorch/ao/pull/1913
@frsun-nvda made their first contribution in https://github.com/pytorch/ao/pull/1835

Full Changelog: https://github.com/pytorch/ao/compare/v0.9.0...v0.10.0-rc1

Published by jerryzh168 about 1 year ago

torchao - v0.9.0

Highlights

We are excited to announce the 0.9.0 release of torchao! This release moves a number of sparsity techniques out of prototype, significantly overhauls the quantize_ API, adds a new CUTLASS kernel for 4-bit dynamic quantization, and more!

Block Sparsity promoted out of prototype

We’ve promoted block sparsity out of torchao.prototype and made several performance improvements. You can accelerate your models with block sparsity as follows:

    from torchao.sparsity import sparsify_, block_sparse_weight
    sparsify_(model, block_sparse_weight(blocksize=64))

Blocksparse Benchmarks

| Technique | Decode (tok/s) | Model Size (GB) |
|------------------------------|------------------|---------------------|
| baseline | 134.40 | 15.01 |
| 2:4 sparse | 163.13 | 10.08 |
| bsr-0.8-32 | 210.91 | 6.01 |
| bsr-0.8-64 | 222.43 | 6.00 |
| bsr-0.9-32 | 255.19 | 4.88 |
| bsr-0.9-64 | 262.94 | 4.88 |
| 2:4 sparse + int4wo (marlin) | 255.21 | 3.89 |

Block Sparsity technique names (bsr) indicate sparsity fraction and blocksize.

These numbers were generated on H100 using torchao/_models/llama/generate.py on the Meta-Llama-3.1-8B model. You can reproduce these numbers using this script
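
To make the naming convention concrete, here is a small sketch that splits a `bsr-<sparsity>-<blocksize>` name into its parts and expresses the decode throughput from the benchmark table relative to the baseline. The `parse_bsr` helper and `BsrSpec` type are hypothetical names used only for this illustration; they are not torchao APIs.

```python
from typing import NamedTuple


class BsrSpec(NamedTuple):
    sparsity: float  # fraction of weights pruned, e.g. 0.9
    blocksize: int   # side length of each sparse block, e.g. 64


def parse_bsr(name: str) -> BsrSpec:
    """Split a technique name such as 'bsr-0.9-64' into sparsity and blocksize."""
    prefix, sparsity, blocksize = name.split("-")
    if prefix != "bsr":
        raise ValueError(f"not a block-sparse technique name: {name!r}")
    return BsrSpec(float(sparsity), int(blocksize))


# Decode throughput (tok/s) copied from the benchmark table
BASELINE_TOK_S = 134.40
results = {
    "bsr-0.8-32": 210.91,
    "bsr-0.8-64": 222.43,
    "bsr-0.9-32": 255.19,
    "bsr-0.9-64": 262.94,
}

for name, tok_s in results.items():
    spec = parse_bsr(name)
    print(
        f"{name}: sparsity={spec.sparsity}, blocksize={spec.blocksize}, "
        f"{tok_s / BASELINE_TOK_S:.2f}x baseline decode throughput"
    )
```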

BC Breaking

TorchAO M1 Binaries currently not working

We've identified that the binaries have been broken on M1 since v0.8.0, though they were working in v0.7.0. We're working on a fix; details and discussion can be found here.

quantize_ configuration callables -> configs (https://github.com/pytorch/ao/pull/1595, https://github.com/pytorch/ao/pull/1694, https://github.com/pytorch/ao/pull/1696, https://github.com/pytorch/ao/pull/1697)

We are migrating the way quantize_ workflows are configured from callables (tensor subclass inserters) to direct configuration (config objects). Motivation: align with the rest of the ecosystem, enable inspection of configs after instantiation, remove a common source of confusion.

What is changing:

Specifically, here is how the signature of quantize_'s second argument will change:

```python
# torchao v0.8.0 and before
def quantize_(
    model: torch.nn.Module,
    apply_tensor_subclass: Callable[[torch.nn.Module], torch.nn.Module],
    ...,
): ...

# torchao v0.9.0
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    ...,
): ...

# torchao v0.10.0 or later (exact version TBD)
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    ...,
): ...
```

* The name of the second argument to quantize_ changed from apply_tensor_subclass to config. Since the vast majority of callsites today pass in the configuration as a positional argument, this change should not affect most people.
* The type of the second argument to quantize_ will change from Callable[[torch.nn.Module], torch.nn.Module] to AOBaseConfig, following the deprecation process detailed below.
* For individual workflows, the user-facing API name changed from snake case (int8_weight_only) to camel case (Int8WeightOnlyConfig). All argument names for each config are kept as-is. We will keep the old snake case names (int8_weight_only) around and alias them to the new names (int8_weight_only = Int8WeightOnlyConfig) to avoid breaking callsites. We plan to keep the old names forever. Here are all the workflow config name changes:

| old name (will keep working) | new name (recommended) |
| --- | --- |
| int4_weight_only | Int4WeightOnlyConfig |
| float8_dynamic_activation_float8_weight | Float8DynamicActivationFloat8WeightConfig |
| float8_static_activation_float8_weight | Float8StaticActivationFloat8WeightConfig |
| float8_weight_only | Float8WeightOnlyConfig |
| fpx_weight_only | FPXWeightOnlyConfig |
| gemlite_uintx_weight_only | GemliteUIntXWeightOnlyConfig |
| int4_dynamic_activation_int4_weight | Int4DynamicActivationInt4WeightConfig |
| int8_dynamic_activation_int4_weight | Int8DynamicActivationInt4WeightConfig |
| int8_dynamic_activation_int8_semi_sparse_weight | n/a (deprecated) |
| int8_dynamic_activation_int8_weight | Int8DynamicActivationInt8WeightConfig |
| int8_weight_only | Int8WeightOnlyConfig |
| uintx_weight_only | UIntXWeightOnlyConfig |

Configuration for prototype workflows using quantize_ will be migrated at a later time.

How these changes can affect you:

1. If you are a user of existing quantize_ API workflows and are passing in config by a positional argument (quantize_(model, int8_weight_only(group_size=128))), you are not affected. This positional syntax will keep working going forward. You are encouraged to migrate your callsite to the new config name (quantize_(model, Int8WeightOnlyConfig(group_size=128))), though the old names will continue to work indefinitely.
2. If you are a user of existing quantize_ API workflows and are passing in config by a keyword argument (quantize_(model, apply_tensor_subclass=int8_weight_only(group_size=128))), your callsite will break. You will need to change your callsite to quantize_(model, config=int8_weight_only(group_size=128)). We don't expect many people to be in this bucket.
3. If you are a developer writing new workflows for the quantize_ API, you will need to use the new configuration system. Please see https://github.com/pytorch/ao/issues/1690 for details.
4. If you are a user of sparsify_, you are not affected for now; a similar change will happen in a future version of torchao.

This migration will be a two-step process:

* in torchao v0.9.0, we will enable the new syntax while starting the deprecation process for the old syntax.
* in torchao v0.10.0 or later, we will remove the old syntax.

Please see https://github.com/pytorch/ao/issues/1690 for more details.
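
The aliasing and keyword-rename behavior described in this section can be sketched with toy stand-ins. The classes and the `quantize_` function below are simplified placeholders written for this illustration (they only record the config and model the end state in which only config objects are accepted); they are not torchao's implementation.

```python
from dataclasses import dataclass
from typing import Optional


class AOBaseConfig:
    """Toy stand-in for the config base class (not torchao's real class)."""


@dataclass
class Int8WeightOnlyConfig(AOBaseConfig):
    group_size: Optional[int] = None


# The old snake-case name is kept as a plain alias of the new class.
int8_weight_only = Int8WeightOnlyConfig


def quantize_(model: dict, config: AOBaseConfig) -> None:
    """Toy quantize_: only records the config; no real quantization happens."""
    if not isinstance(config, AOBaseConfig):
        raise TypeError(f"config must be an AOBaseConfig, got {type(config).__name__}")
    model["quant_config"] = config


model = {}
quantize_(model, int8_weight_only(group_size=128))             # old name, positional: works
quantize_(model, config=Int8WeightOnlyConfig(group_size=128))  # new name, new keyword: works

try:
    # The old keyword name no longer exists in the signature.
    quantize_(model, apply_tensor_subclass=int8_weight_only(group_size=128))
except TypeError as err:
    print("old keyword fails:", err)
```

Because the old name is an alias rather than a wrapper, both spellings construct the same config class, which is why positional callsites keep working unchanged.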

Block Sparsity imports after the move out of prototype (https://github.com/pytorch/ao/pull/1734)

Before:

    from torchao.prototype.sparsity.superblock.blocksparse import block_sparse_weight

After:

    from torchao.sparsity import block_sparse_weight

Deprecations

Deprecation of the `set_inductor_config` argument of `quantize_` (https://github.com/pytorch/ao/pull/1716)

We are migrating the set_inductor_config argument of quantize_ to individual workflows. Motivation:

1. This functionality was intended for inference, and we don't want to expose it to future training workflows that we plan to add to quantize_.
2. At a higher level, this flag couples torchao workflows with torch.compile, which is not ideal. We would rather keep these systems decoupled at the quantize_ API level, with individual workflows opting in as needed.

Impact on users

* For torchao v0.9.0: if you are passing in set_inductor_config to quantize_, your callsite will keep working with a deprecation warning. We recommend that you migrate this option to your individual workflow.
* For a future version of torchao: the set_inductor_config argument will be removed from quantize_.

API changes

```python
# torchao v0.8.x
def quantize_(
    ...,
    set_inductor_config: bool = True,
    ...,
): ...

# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):
    # if set_inductor_config != None, throw a deprecation warning
    # if set_inductor_config == None, set it to True to stay consistent with old behavior
    ...

# torchao v TBD (a future release)
def quantize_(
    ...,
):
    # set_inductor_config is removed from quantize_ and moved to relevant individual workflows
    ...
```

Please see https://github.com/pytorch/ao/issues/1715 for more details.
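
The v0.9.0 behavior described above (warn when the argument is passed explicitly, otherwise default to True) follows a common Python deprecation pattern. Here is a minimal sketch using only the standard library; the function name `resolve_set_inductor_config`, the warning class, and the message text are assumptions made for illustration and do not reflect torchao's actual code.

```python
import warnings
from typing import Optional


def resolve_set_inductor_config(set_inductor_config: Optional[bool] = None) -> bool:
    """Toy helper mimicking only the v0.9.0 argument handling described above."""
    if set_inductor_config is not None:
        # The caller passed the argument explicitly: keep working, but warn.
        warnings.warn(
            "set_inductor_config is deprecated; configure it in the individual workflow",
            DeprecationWarning,  # assumed warning class for this sketch
            stacklevel=2,
        )
        return set_inductor_config
    # Argument omitted: fall back to True to stay consistent with the old default.
    return True


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_value = resolve_set_inductor_config()        # no warning
    explicit_value = resolve_set_inductor_config(False)  # one DeprecationWarning

print(default_value, explicit_value, len(caught))
```

Using `None` as the sentinel default is what lets the function tell "argument omitted" apart from "argument explicitly set to True".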

Deprecation warning for float8 training delayed and static scaling (https://github.com/pytorch/ao/pull/1681, https://github.com/pytorch/ao/issues/1680)

We plan to deprecate delayed and static scaling from the torchao.float8 training codebase due to the lack of real-world use cases for delayed/static scaling (dynamic scaling is required for higher accuracy) and the complexity tax of supporting these features.

* for torchao v0.9.0: add deprecation warning for delayed and static scaling
* for torchao v0.10.0: deprecate delayed and static scaling

New Features

Supermask for improving accuracy for sparse models (https://github.com/pytorch/ao/pull/1729)

Supermask (https://pytorch.org/blog/speeding-up-vits/) is a technique for improving the accuracy of block sparsified models by learning a block-sparse mask during a training phase.

```python
from torchao.sparsity import SupermaskLinear, block_sparse_weight, sparsify_

sparsify_(model, lambda x: SupermaskLinear.from_linear(x, block_size=64, sparsity_level=0.9))

# training here

# collapse supermask into a normal linear layer (with many weights set to 0)
# and then convert to block sparse format for inference speedup
sparsify_(model, lambda x: SupermaskLinear.to_linear(x, sparsity_level=0.9))
sparsify_(model, block_sparse_weight(blocksize=64))
```

Dynamic quantization W4A4 CUTLASS-based kernel (https://github.com/pytorch/ao/pull/1515)

This kernel, which adds support for 4-bit dynamic activation + 4-bit weight quantization, can be used as follows:

    from torchao.quantization import quantize_, int4_dynamic_activation_int4_weight
    quantize_(model, int4_dynamic_activation_int4_weight())

Improvements

Early prototype MXFP8 and MXFP4 training and inference support for NVIDIA Blackwell GPUs

In torchao v0.9.0, we include very early support for training and inference on the NVIDIA Blackwell GPUs following the microscaling recipes from https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf, and backed by real MX gemms.

Here is how to use the current prototype APIs.

:warning: Note that torch.compile support is not fully there yet, there are no guarantees on performance at this time, and we expect to change these APIs rapidly as we iterate in future versions of torchao. Please see https://github.com/pytorch/ao/issues/556 for more details.

MX training

```python
import torch
from torchao.prototype.mx_formats.mx_linear import swap_linear_with_mx_linear
from torchao.prototype.mx_formats.config import MXLinearConfig, MXGemmKernelChoice
from torchao.utils import is_sm_at_least_100

# early prototype: on MX-enabled hardware, you can use the real MX gemm backed by
# torchao's CUTLASS kernels. In the future, we will also add cuBLAS kernel support.
gemm_kernel_choice = MXGemmKernelChoice.EMULATED
if is_sm_at_least_100():
    gemm_kernel_choice = MXGemmKernelChoice.CUTLASS

m = torch.nn.Sequential(torch.nn.Linear(32, 32)).cuda()
config = MXLinearConfig(
    elem_dtype=torch.float8_e4m3fn,
    block_size=32,
    gemm_kernel_choice=gemm_kernel_choice,
)
swap_linear_with_mx_linear(m, config=config)

# training loop (not shown)
```

MX inference (weights are in MX and the matmul is in high precision):

```python
import torch
from torchao.prototype.mx_formats.mx_linear import swap_linear_with_mx_inference_linear
from torchao.prototype.mx_formats.config import MXLinearConfig

m = torch.nn.Sequential(torch.nn.Linear(32, 32)).cuda()
config = MXLinearConfig(elem_dtype=torch.float8_e4m3fn, block_size=32)
swap_linear_with_mx_inference_linear(m, config=config)

# do inference (not shown)
```

The additional features for MX support in v0.9.0 were enabled by:

* Add mx_fp8_bf16 kernel (https://github.com/pytorch/ao/pull/1637)
* Support mixed MX element dtype in mx_mm function and MXLinear (https://github.com/pytorch/ao/pull/1667)
* Move block_size and elem_dtype into MXLinearConfig (https://github.com/pytorch/ao/pull/1689)
* Hook up mxfp8 and mxfp4 CUTLASS kernels to MXLinear (https://github.com/pytorch/ao/pull/1713)
* Add ceil and RNE rounding modes to the cast from fp32 to e8m0 (https://github.com/pytorch/ao/pull/1643)
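
For readers new to microscaling: each block of `block_size` elements shares a single power-of-two scale, stored as an e8m0 exponent. The pure-Python sketch below illustrates the floor-based scale derivation from the OCP MX v1.0 specification for an e4m3 element type. It is an illustration only, not torchao's implementation: the helper name, the constants, and the handling of an all-zero block are our assumptions, and the ceil and RNE rounding modes mentioned above are not shown.

```python
import math

E4M3_EMAX = 8     # exponent of the largest normal e4m3 value (448 = 1.75 * 2**8)
E8M0_MIN_EXP = -127
E8M0_MAX_EXP = 127


def shared_scale_exponent(block):
    """Return the unbiased power-of-two exponent shared by one block (floor mode)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        # Assumption for this sketch: use the smallest exponent for an all-zero block.
        return E8M0_MIN_EXP
    # math.frexp gives amax == m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    unbiased = (e - 1) - E4M3_EMAX
    # Clamp to the exponent range an e8m0 scale can represent.
    return max(E8M0_MIN_EXP, min(E8M0_MAX_EXP, unbiased))


block = [0.5 * i for i in range(-16, 16)]  # one block of 32 values, max |v| = 8.0
exponent = shared_scale_exponent(block)
scale = 2.0 ** exponent
scaled = [v / scale for v in block]        # values that would then be cast to e4m3

print(exponent, scale, max(abs(v) for v in scaled))
```

Dividing by the shared scale moves the block's largest magnitude close to the top of the element format's range; values that still exceed the maximum normal value would be clamped during the element cast.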

Experimental

Q dq layout (https://github.com/pytorch/ao/pull/1642)
Add support for kleidi AI quantization schemes (https://github.com/pytorch/ao/pull/1447)

SAM2

Add modal script extensions (https://github.com/pytorch/ao/pull/1500)
Increase export usage, small perf improvements (https://github.com/pytorch/ao/pull/1673)
Model experiments QoL improvements (https://github.com/pytorch/ao/pull/1683)
Collect p90 latency statistics (https://github.com/pytorch/ao/pull/1703)

Training

Support power of 2 scaling factors in float8 training with rowwise scaling and use e4m3 in fwd and bwd pass (https://github.com/pytorch/ao/pull/1670)
clean up recipe names in Float8 training (https://github.com/pytorch/ao/pull/1730)
make the "config from recipe" API polished in Float8 training (https://github.com/pytorch/ao/pull/1731)
Add workaround to reduce FSDP memory usage for float8 rowwise training (https://github.com/pytorch/ao/pull/1629)
Make FakeQuantizer expose useful config details when printed (https://github.com/pytorch/ao/pull/1717)

Sparsity

Promote blocksparse from prototype, make it faster (https://github.com/pytorch/ao/pull/1734)

Other

Relax dtype requirements for int4 and float8 quants in autoquant (https://github.com/pytorch/ao/pull/1571)
Update __init__.py to load experimental ops even if other C++ ops are not found (https://github.com/pytorch/ao/pull/1565)

Bug Fixes

Fix torch.intx support in FakeQuantizeConfig (https://github.com/pytorch/ao/pull/1544)
Fix float related autoquant options (https://github.com/pytorch/ao/pull/1562)
Fix #1559, sparsity instead of sparstiy (https://github.com/pytorch/ao/pull/1560)
Fix .item() issue in running parallel evaluation for BO mixed precision (https://github.com/pytorch/ao/pull/1630)
Add more stringent test for CPUOffloadOptimizer (https://github.com/pytorch/ao/pull/1650)
Fix LR scheduler issue with CPU offload optimizer (https://github.com/pytorch/ao/pull/1649)
Add int8 dynamic activation + int8 weight only test to TensorParallel (https://github.com/pytorch/ao/pull/1657)
Fix compile issue for Marlin qqq on sm<8.0 (https://github.com/pytorch/ao/pull/1651)
Fix use_hqq for int4_weight_only quantize (https://github.com/pytorch/ao/pull/1707)
Unbreak float8 static quant tutorial (https://github.com/pytorch/ao/pull/1709)
Fix DDP with nf4 (https://github.com/pytorch/ao/pull/1684)
Fix tensor parallelism for float8 training with rowwise scaling (https://github.com/pytorch/ao/pull/1718)

Documentation

Update supported dtypes for fp8 (https://github.com/pytorch/ao/pull/1573)
Sparsity docs update (https://github.com/pytorch/ao/pull/1590)
Sparsity getting started docs (https://github.com/pytorch/ao/pull/1592)
Fix broken link on doc page (https://github.com/pytorch/ao/pull/1582)
Add quick start guide for first time users (https://github.com/pytorch/ao/pull/1611)
Update api_ref_dtypes docs (https://github.com/pytorch/ao/pull/1610)
Add module swap -> tensor subclass migration tutorial (https://github.com/pytorch/ao/pull/1596)
Update docs to refer to version.html (https://github.com/pytorch/ao/pull/1631)
Split contributor guide into quantization overview (https://github.com/pytorch/ao/pull/1618)
Update api_ref_quantization docs (https://github.com/pytorch/ao/pull/1619)
Migrate static quant tutorials to direct configuration (https://github.com/pytorch/ao/pull/1710)
Update torchao READMEs with new configuration APIs (https://github.com/pytorch/ao/pull/1711)
Update SAM2 README.md (https://github.com/pytorch/ao/pull/1735)
Add rowwise scaling README.md entry for float8 training (https://github.com/pytorch/ao/pull/1733)

Developers

Consolidate ZeroPointDomain.NONE & None zero point domains (https://github.com/pytorch/ao/pull/1556)
Only run docs build in CI if docs have changed (https://github.com/pytorch/ao/pull/1589)
Add separate quantization primitives for float8 (https://github.com/pytorch/ao/pull/1597)
Add boiler plate code to Tensor subclass (https://github.com/pytorch/ao/pull/1663)
Change TORCH_LIBRARY to TORCH_LIBRARY_FRAGMENT (https://github.com/pytorch/ao/pull/1645)
Reformat C++ kernels (https://github.com/pytorch/ao/pull/1723)
Add torchao/experimental CI test (https://github.com/pytorch/ao/pull/1586)
Clean up linear_int8_dynamic_activation_intx_weight_subclass (https://github.com/pytorch/ao/pull/1553)

New Contributors

@jaewoosong made their first contribution in https://github.com/pytorch/ao/pull/1560
@haodongucsb made their first contribution in https://github.com/pytorch/ao/pull/1630
@nikhil-arm made their first contribution in https://github.com/pytorch/ao/pull/1447
@ngc92 made their first contribution in https://github.com/pytorch/ao/pull/1650
@balancap made their first contribution in https://github.com/pytorch/ao/pull/1667

Full Changelog: https://github.com/pytorch/ao/compare/v0.8.0...v0.9.0-rc1

- Python
Published by HDCharles over 1 year ago

torchao - v0.8.0

Highlights

We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchAO which adds support for W4A8 linear operator. In addition to this, we’ve also added TTFT benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding.

W4A8 based on CUTLASS

A new W4A8 linear operator is implemented. It corresponds to int8_dynamic_activation_int4_weight quantization, where two 4-bit weights are packed into a single 8-bit integer value. CUTLASS is also now a submodule of the torchao repo, so that more of its functionality can be used to implement new kernels.
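To illustrate what "two 4-bit weights packed into a single 8-bit integer" means, here is a minimal pure-Python sketch. The helper names and the nibble order (first value in the low nibble) are assumptions for illustration only; the layout used by the CUTLASS kernel may differ.

```python
def pack_int4_pair(lo: int, hi: int) -> int:
    """Pack two signed 4-bit values (each in [-8, 7]) into one byte."""
    for v in (lo, hi):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} does not fit in a signed 4-bit integer")
    # Keep the low 4 bits of each value (two's complement), then combine.
    # Assumption: `lo` goes in the low nibble, `hi` in the high nibble.
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte: int) -> tuple[int, int]:
    """Recover the two signed 4-bit values from a packed byte."""
    def sign_extend(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble

    return sign_extend(byte & 0xF), sign_extend((byte >> 4) & 0xF)


packed = pack_int4_pair(-3, 5)
assert unpack_int4_pair(packed) == (-3, 5)
```

Packing this way halves the storage of the weight tensor relative to int8, which is consistent with the smaller model size reported for the W4A8 configuration.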

Benchmarks on A100

| -q parameter | Average tokens/sec | Average Bandwidth in GB/s | Peak Memory Usage in GB | Model Size in GB |
| :--- | ---: | ---: | ---: | ---: |
| | 95.24 | 1258.55 | 13.90 | 13.21 |
| -q int8wo | 155.31 | 1028.37 | 8.97 | 6.62 |
| -q int4wo-32 | 186.70 | 774.98 | 5.31 | 4.15 |
| -q int4wo-hqq | 186.47 | 774.01 | 5.04 | 4.15 |
| -q int8dq | 49.64 | 328.72 | 9.44 | 6.62 |
| -q w4a8-cutlass (tuned) | 119.31 | 394.86 | 4.52 | 3.31 |

Prefill performance benchmarks

We’ve added TTFT (time to first token) benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding. During prefill, we are compute bound and find that dynamic quantization offers greater speedups than weight-only quantization, which is faster for decoding. We’ve also added an option for int8 dynamic quantization that selectively uses dynamic quantization during prefill and weight-only quantization during LLM decoding.
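As a rough illustration of how prefill and decoding can be measured separately, the sketch below times the first token (dominated by prefill) apart from the remaining tokens (decoding). `measure_ttft_and_decode_rate` is a hypothetical helper written for this note, not part of the torchao benchmark scripts, and it assumes the model exposes its output as a token iterator.

```python
import time
from typing import Callable, Iterable


def measure_ttft_and_decode_rate(
    token_stream: Iterable[int],
    clock: Callable[[], float] = time.perf_counter,
):
    """Return (ttft_seconds, decode_tokens_per_second) for a token iterator."""
    start = clock()
    it = iter(token_stream)
    next(it)  # first token: its latency is dominated by prefill
    first = clock()
    n_decoded = sum(1 for _ in it)  # the remaining tokens come from decoding
    end = clock()
    decode_time = end - first
    rate = n_decoded / decode_time if decode_time > 0 else float("inf")
    return first - start, rate
```

Passing a custom `clock` makes the helper easy to check with deterministic timestamps; in real use the default `time.perf_counter` would be kept, and GPU work would need to be synchronized before each timestamp.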


BC Breaking

Delete the float8-all-gather-only functionality from float8 training (https://github.com/pytorch/ao/pull/1451)

The use_fp8_all_gather_only flag was experimental, off by default, not advertised, and not used by anyone as far as we know. We are removing it to simplify the code.

Before

```python
config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)
```

After

The use_fp8_all_gather_only option is no longer supported.

New Features

Add TTFT benchmarks + update sparsity benchmarks (https://github.com/pytorch/ao/pull/1140)
Gemlite integration in torchao (https://github.com/pytorch/ao/pull/1034)
W4A8 based on CUTLASS (https://github.com/pytorch/ao/pull/880)

Improvement

quantize_

Expose zero_point_domain as arguments (https://github.com/pytorch/ao/pull/1401)
Add convert path for quantize_ QAT API (https://github.com/pytorch/ao/pull/1540)
Int8 dynamic prefill weight only decode (https://github.com/pytorch/ao/pull/1436)

autoquant

Make int8 dynamic quant in autoquant serializable (https://github.com/pytorch/ao/pull/1484)
Additional fixes for autoquant serialization (https://github.com/pytorch/ao/pull/1486)
Add exhaustive config option to intmm kernel (https://github.com/pytorch/ao/pull/1392)

float8 training

[float8] Allow specifying arbitrary dtype for each tensor, enabling recipes with e4m3 in both the forward and the backward (https://github.com/pytorch/ao/pull/1378)

experimental

Remove temp build files from torchao (https://github.com/pytorch/ao/pull/1551)

other

Torchao setup.py with cmake (https://github.com/pytorch/ao/pull/1490)

Bug Fixes

Fix bfloat16/float16/float32 options (https://github.com/pytorch/ao/pull/1369)
Fix a bug in LinearActivationQuantizedTensor (https://github.com/pytorch/ao/pull/1400)
Fix error message in float8 FSDP utils (https://github.com/pytorch/ao/pull/1423)
Fixes observer attachment to model based on config for wanda sparsifier (https://github.com/pytorch/ao/pull/1265)
[resubmit] Gemlite fix (https://github.com/pytorch/ao/pull/1435)
🐛 Fix: Memory leak in image processing endpoint (https://github.com/pytorch/ao/pull/1513)

Performance

[float8] Re-enable slow-accum in the bwd of axis-wise scaling schemes (https://github.com/pytorch/ao/pull/1377)

Documentation

Update api_ref_quantization.rst (https://github.com/pytorch/ao/pull/1408)
Update index.rst (https://github.com/pytorch/ao/pull/1409)
Update QAT READMEs using new APIs (https://github.com/pytorch/ao/pull/1541)

Developers

Pytorch/ao/torchao/experimental/ops/mps/test (https://github.com/pytorch/ao/pull/1442)
Verify that submodules are checked out (https://github.com/pytorch/ao/pull/1536)

New Contributors

@sanchitintel made their first contribution in https://github.com/pytorch/ao/pull/1375
@philipbutler made their first contribution in https://github.com/pytorch/ao/pull/1337
@airMeng made their first contribution in https://github.com/pytorch/ao/pull/1401
@DerekLiu35 made their first contribution in https://github.com/pytorch/ao/pull/1299
@agrawal-aka made their first contribution in https://github.com/pytorch/ao/pull/1265
@gmagogsfm made their first contribution in https://github.com/pytorch/ao/pull/1443
@dongxiaolong made their first contribution in https://github.com/pytorch/ao/pull/1513

Full Changelog: https://github.com/pytorch/ao/compare/v0.7.0...v0.8.0-rc2

- Python
Published by jainapurva over 1 year ago

torchao - v0.7.0

Highlights

We are excited to announce the 0.7.0 release of torchao! This release moves QAT out of prototype with improved LoRA support and more flexible APIs, and adds support for new experimental kernels such as Marlin QQQ (for CUDA), int8_dynamic_activation_intx_weight (for ARM CPU), and more!

QAT moved out of prototype, LoRA integration, new flexible APIs (#1020, #1085, #1152, #1037)

QAT has been moved out of prototype to torchao/quantization/qat to provide better API stability guarantees moving forward. In addition to the existing *QATQuantizer classes, we now also support the more flexible FakeQuantizedLinear and FakeQuantizedEmbedding modules for users to configure the exact quantization settings they wish to use during QAT.

```python
import torch
from torchao.quantization.qat.api import FakeQuantizeConfig
from torchao.quantization.qat.embedding import FakeQuantizedEmbedding
from torchao.quantization.qat.linear import FakeQuantizedLinear

# Specify quantization schemes to use during QAT
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=8)

# Replace nn.Linear and nn.Embedding with these in your model
fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config)
fq_embedding = FakeQuantizedEmbedding(16, 32, weight_config=weight_config)
```
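For readers new to QAT, the sketch below shows what "fake quantization" of a weight row with `group_size=8` amounts to: each group is rounded onto an int4 grid and immediately converted back to float, so training sees the quantization error while tensors stay in floating point. This is a pure-Python illustration with hypothetical helper names and an assumed symmetric absmax scale; it is not the torchao implementation.

```python
def fake_quantize_group(values, n_bits=4):
    """Symmetric quantize-then-dequantize of one group of floats."""
    qmax = 2 ** (n_bits - 1) - 1   # 7 for int4
    qmin = -(2 ** (n_bits - 1))    # -8 for int4
    max_abs = max(abs(v) for v in values)
    # Assumption: the scale maps the largest magnitude in the group onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # nearest integer grid point
        out.append(q * scale)                       # back to floating point
    return out


def fake_quantize_groupwise(row, group_size=8, n_bits=4):
    """Apply fake quantization independently to each group of `group_size`."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    result = []
    for i in range(0, len(row), group_size):
        result.extend(fake_quantize_group(row[i:i + group_size], n_bits))
    return result
```

Smaller groups give each scale fewer values to cover, which typically reduces rounding error at the cost of storing more scales.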

We also leveraged the new flexible APIs to build a new QAT + LoRA fine-tuning flow in torchtune. Try it out today!

```bash
tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora
```

Marlin QQQ for CUDA (#1113)

Marlin QQQ is an optimized GPU kernel that supports W4A8 mixed precision GEMM. For more details about Marlin QQQ, please refer to the paper.

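```python
from torchao.dtypes import MarlinQQQLayout
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
from torchao.quantization.quant_primitives import MappingType

quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=128,
        mapping_type=MappingType.SYMMETRIC,
        act_mapping_type=MappingType.SYMMETRIC,
        layout=MarlinQQQLayout(),
    ),
)
```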

Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#marlin-qqq.

This is a prototype feature - feel free to try it out!

int8_dynamic_activation_intx_weight Quantization for ARM CPU (#995, #1027, #1254, #1353)

We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon).

```python
import os
import torch
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight
assert precision == torch.float32, "int8_dynamic_activation_intx_weight requires fp32 precision"

# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")

# Quantize model
nbit = 4
assert nbit >= 1 and nbit <= 8, "nbits must be 1 to 8"
group_size = 128
has_weight_zeros = False
quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        group_size=group_size,
        nbit=nbit,
        has_weight_zeros=has_weight_zeros,
    ),
)
```

Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#int8dynamicactivationintxweight-quantization

We are still trying to figure out how to ship the ARM CPU kernels, so the exact API is subject to change.

BC Breaking

Rename AQT#2 LayoutType -> Layout (#1049)

Before:

```python
from torchao.dtypes import (
    BlockSparseLayoutType,
    Int4CPULayoutType,
    MarlinQQQLayoutType,
    MarlinSparseLayoutType,
    SemiSparseLayoutType,
    TensorCoreTiledLayoutType,
    UintxLayoutType,
    Float8LayoutType,
    LayoutType,
    PlainLayoutType,
)
```

After:

```python
from torchao.dtypes import (
    BlockSparseLayout,
    Int4CPULayout,
    MarlinQQQLayout,
    MarlinSparseLayout,
    SemiSparseLayout,
    TensorCoreTiledLayout,
    UintxLayout,
    Float8Layout,
    Layout,
    PlainLayout,
)
```

QAT imports after move out of prototype (#1091)

Before:

```python
from torchao.quantization.prototype.qat import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.prototype.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.prototype.qat.fake_quantizer import (
    FakeQuantizer,
)
```

After:

```python
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)
from torchao.quantization.qat.linear import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.qat.fake_quantizer import (
    FakeQuantizer,
)
```

New Features

Add BF16 stochastic rounding option for optimizers (https://github.com/pytorch/ao/pull/1124)
Add quantize_() API support for NF4 (https://github.com/pytorch/ao/pull/1216)
Support W4A8 Marlin kernel (https://github.com/pytorch/ao/pull/1113)

Improvements

quantize_

Add default filtering to remove misaligned weights (https://github.com/pytorch/ao/pull/1194)
Add tensor parallelism support for int4_weight_only quantization (https://github.com/pytorch/ao/pull/1120)
Add support for asymmetric act quant for int8 dynamic quant (https://github.com/pytorch/ao/pull/1131)
Add support for groupwise quantization for int8 weight only quantization (https://github.com/pytorch/ao/pull/1121)
Add AQT tensor parallel for float8_dynamic_quant (https://github.com/pytorch/ao/pull/1078)
Int8wo Embedding Quant (https://github.com/pytorch/ao/pull/1167)
Making sure int4 weight only supports cpu as well (https://github.com/pytorch/ao/pull/1203)
BF16 support for Quant-LLM kernel (https://github.com/pytorch/ao/pull/1147)
Add hardware check to fp8 quant (https://github.com/pytorch/ao/pull/1314)
Add support for quantize_() with Float8Linear module (https://github.com/pytorch/ao/pull/1344)

autoquant

Added support for Per Tensor Scaling for Float8 Dynamic Autoquant (https://github.com/pytorch/ao/pull/1175)
Add floating point options for autoquant and add accuracy measurement (https://github.com/pytorch/ao/pull/1355)

benchmarks

Adding batchsize support for torchao llama benchmarks (https://github.com/pytorch/ao/pull/1182)
Add capability of benchmarking arbitrary binary (https://github.com/pytorch/ao/pull/1107)

experimental

Add embedding ops aten (https://github.com/pytorch/ao/pull/1129)
Add embedding ops executorch (https://github.com/pytorch/ao/pull/1137)
Add quantized embedding kernels to torchao (https://github.com/pytorch/ao/pull/1018)
Allow deprecated declarations when using Parallel ExecuTorch (https://github.com/pytorch/ao/pull/1031)
Introduce lowbit quantized linear MPS kernels (https://github.com/pytorch/ao/pull/954)
Enable 6-bit kernel (https://github.com/pytorch/ao/pull/1027)
Kleidi 4b blockwise gemv prototype (https://github.com/pytorch/ao/pull/997)
Experimental 6-bit quantization for Llama in torchchat (https://github.com/pytorch/ao/pull/1094)
Introduce 7-bit quantization for Llama in torchchat. (https://github.com/pytorch/ao/pull/1139)
Executorch Subclass API (#966) (https://github.com/pytorch/ao/pull/995)
8-bit packing support (https://github.com/pytorch/ao/pull/1248)
Experimental Enable 8-bit (https://github.com/pytorch/ao/pull/1254)
Experimental Benchmarking (https://github.com/pytorch/ao/pull/1353)

optimizer

[low-bit optim] Upcast everything to FP32 for internal calculations (https://github.com/pytorch/ao/pull/1068)
[Low-bit optim] Support for dcp.save() and dcp.load() (https://github.com/pytorch/ao/pull/1217)
Enable CPU Offload for Intel GPU (https://github.com/pytorch/ao/pull/1324)

SAM2

SAM2.1 copy (https://github.com/pytorch/ao/pull/1172)
SAM2 AMG server side request batching (https://github.com/pytorch/ao/pull/1197)
More SAM2-fast server improvements (https://github.com/pytorch/ao/pull/1285)
SAM2 Fast AMG: memory profiling and more compile (https://github.com/pytorch/ao/pull/1296)
SAM2 AMG cli and other QoL improvements (https://github.com/pytorch/ao/pull/1336)
SAM2 AMG cli.py on modal (https://github.com/pytorch/ao/pull/1349)
Reduce SAM2 AMG cli startup by using deploy (https://github.com/pytorch/ao/pull/1350)
Reduce startup time for SAM2 AMG by using torch.export (https://github.com/pytorch/ao/pull/1358)
More batching and improved furious accuracy/performance (https://github.com/pytorch/ao/pull/1253)
SAM2.1 and example README (https://github.com/pytorch/ao/pull/1048)
SAM2 AMG example mIoU, perf numbers and more SAM2 model annotations (https://github.com/pytorch/ao/pull/1196)

other

Add SpinQuant to generate.py (https://github.com/pytorch/ao/pull/1069)
SpinQuant (https://github.com/pytorch/ao/pull/983)
SmoothQuant using tensor subclassing (https://github.com/pytorch/ao/pull/1030)
Expose FakeQuantizeConfigs in QAT quantizers (https://github.com/pytorch/ao/pull/1214)
Add module-swap UX for INT8 mixed-precision training (https://github.com/pytorch/ao/pull/1179)
Float8 training: move module attribute setting to sync function (https://github.com/pytorch/ao/pull/1341)

Bug Fixes

Header bug fix (https://github.com/pytorch/ao/pull/1079)
Temporary fix for QAT quantizer when linear layer bias is True (https://github.com/pytorch/ao/pull/1087)
Fix out-of-bounds memory access in Galore dequant kernel (https://github.com/pytorch/ao/pull/1125)
Fixed weights_only=True load for float8_dynamic_activation_float8_weight in quant_api (https://github.com/pytorch/ao/pull/1122)
Fix int8_weight_only group_size (https://github.com/pytorch/ao/pull/1165)
Is_linear fix for MHA (https://github.com/pytorch/ao/pull/1141)
Fixing eval.py to use GPTQ_MT for gptq (https://github.com/pytorch/ao/pull/1176)
[CPU offload optim] Fix when there are non-trainable params (https://github.com/pytorch/ao/pull/1210)
Fix for weights-only load (https://github.com/pytorch/ao/pull/1228)
Pin nightlies to deal with std::bad_alloc (https://github.com/pytorch/ao/pull/1256)
Fix 2.5.1 failing sparsity test (https://github.com/pytorch/ao/pull/1261)
Call narrow only for TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/1207)
Fix an autoquant bug in flatten/unflatten (https://github.com/pytorch/ao/pull/1288)
Float8 with delayed scaling: fix autocast handling (https://github.com/pytorch/ao/pull/1306)
Fix bug with float8 training + FSDP2 + TP (https://github.com/pytorch/ao/pull/1327)
Float8 training: fix bug with AC + compile (https://github.com/pytorch/ao/pull/1329)
Fix torchtitan + float8 + delayed + compile (https://github.com/pytorch/ao/pull/1334)
[low-bit optim] Fix edge cases for FSDP2 integration (https://github.com/pytorch/ao/pull/1269)
[NF4] .to() fixes (https://github.com/pytorch/ao/pull/1312)
Check scale.ndim before applying t/transpose (https://github.com/pytorch/ao/pull/1339)

Performance

Swap in faster uint6 bitpacking function (https://github.com/pytorch/ao/pull/1098)
Implement more efficient pack and unpack uint5 (https://github.com/pytorch/ao/pull/1138)
Fix 20x slowdown of FP6 kernel due to device properties query (https://github.com/pytorch/ao/pull/1092)

Documentation

Add a developer guide for exporting to executorch (https://github.com/pytorch/ao/pull/1219)
Enable AWQ example on CPU (https://github.com/pytorch/ao/pull/1043)
Add readme doc for experimental (https://github.com/pytorch/ao/pull/1130)
Move float8 out of prototype in quantization README (https://github.com/pytorch/ao/pull/1166)
Update torchao api reference and add contributor guide (https://github.com/pytorch/ao/pull/1255)
Fix pickle.dump missing file argument typo in README (https://github.com/pytorch/ao/pull/1316)
Update README.md (https://github.com/pytorch/ao/pull/1319)
Update README.md: Fix bibtex and sglang links (https://github.com/pytorch/ao/pull/1361)
Add bibtex (https://github.com/pytorch/ao/pull/1177)
Clarify torchao.float8 PyTorch version support (https://github.com/pytorch/ao/pull/1191)

Developers

[Tp Test] Fix the placement of the device tensor (https://github.com/pytorch/ao/pull/1054)
Skip test_fpx_weight_only in fbcode (https://github.com/pytorch/ao/pull/1056)
Pin pt nightly CPU version (https://github.com/pytorch/ao/pull/1061)
Unpin CUDA Nightly (https://github.com/pytorch/ao/pull/1064)
Update smoke test (https://github.com/pytorch/ao/pull/1111)
Update regression_test.yml (https://github.com/pytorch/ao/pull/1163)
Add PyTorch 2.5 to regression test (https://github.com/pytorch/ao/pull/1168)
Fix Bias APIs, re-enable kleidi tests for arm64 (https://github.com/pytorch/ao/pull/1162)
Create CITATION.cff (https://github.com/pytorch/ao/pull/1178)
Unpin nightlies (https://github.com/pytorch/ao/pull/1183)
[experimental] Kleidi - add operator level tests (https://github.com/pytorch/ao/pull/1173)
Ruff format and lint (https://github.com/pytorch/ao/pull/1226)
Update pre-commit to match CI/CD (https://github.com/pytorch/ao/pull/1227)
Fixing pytest skip for only test_floatx.py (https://github.com/pytorch/ao/pull/1251)
Fixed invalid url in citation section (https://github.com/pytorch/ao/pull/1348)
Add to safe globals (https://github.com/pytorch/ao/pull/1171)
Aqt rename#1 Layout -> TensorImpl (https://github.com/pytorch/ao/pull/1046)
Move and rename GranularityType -> Granularity (https://github.com/pytorch/ao/pull/1038)
Change torchao quantization types from int to size_t and preface vars with "preferred" (https://github.com/pytorch/ao/pull/1041)
Shrink hadamard matrices (https://github.com/pytorch/ao/pull/1051)
Use ExecuTorch prebuilt library in pip package to build custom kernels (https://github.com/pytorch/ao/pull/1059)
Update base.h unit to unsigned int (https://github.com/pytorch/ao/pull/962)
Create header for packed weight ops (https://github.com/pytorch/ao/pull/1072)
Update cmake files (https://github.com/pytorch/ao/pull/1070)
Create build_wheels_aarch64_linux.yml (https://github.com/pytorch/ao/pull/1083)
ROCM binary upload (https://github.com/pytorch/ao/pull/1099)
Create build_wheels_windows.yml (https://github.com/pytorch/ao/pull/1101)
Use fewer instructions when unpacking uint6s. (https://github.com/pytorch/ao/pull/1109)
[CI] XPU binary build enable (https://github.com/pytorch/ao/pull/1105)
Move common ET/Aten op stuff to ops/library.h (https://github.com/pytorch/ao/pull/1116)
Move bias from kernel to packed_weights (https://github.com/pytorch/ao/pull/1119)
Update gpu_sparsity kernel benchmarking script (https://github.com/pytorch/ao/pull/1143)
[ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1142)
Move files to prototype/sparsity (https://github.com/pytorch/ao/pull/1145)
C10::nullopt -> std::nullopt (#1032) (https://github.com/pytorch/ao/pull/1151)
[reland][ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1150)
Move float8_aten_api to float8_ops (https://github.com/pytorch/ao/pull/1155)
Initialize model with meta device for generation benchmarking (https://github.com/pytorch/ao/pull/1144)
Replace torch.empty with torch.zeros (https://github.com/pytorch/ao/pull/1157)
Update utils.py (https://github.com/pytorch/ao/pull/1186)
Remove int_scaled_mm's dependency on triton for cpu (https://github.com/pytorch/ao/pull/128)
at::optional -> std::optional (#1170) (https://github.com/pytorch/ao/pull/1212)
fast_flush kwarg of do_bench is removed (https://github.com/pytorch/ao/pull/1222)
Remove calibration args from generate.py (https://github.com/pytorch/ao/pull/1258)
Skip marlin QQQ ops test in fbcode (https://github.com/pytorch/ao/pull/1289)
Fix Marlin QQQ ops test with unittest (https://github.com/pytorch/ao/pull/1294)
Fix Failing CI - Update bitsandbytes import (https://github.com/pytorch/ao/pull/1343)
Remove lm_eval warning (https://github.com/pytorch/ao/pull/1347)
Refactor Affine Quantized Tensor (#1234)
Move files from quantization/prototype -> prototype/quantization (#1187)
Add TTFT benchmarks + update sparsity benchmarks (https://github.com/pytorch/ao/pull/1140)
Add "gemm_input_role" to dunder slots (https://github.com/pytorch/ao/pull/984)
Add an option to use fp8-all-gather only without fp8 computation. (https://github.com/pytorch/ao/pull/1093)
Bump version to 0.7 (https://github.com/pytorch/ao/pull/1045)

New Contributors

@Jack-Khuu made their first contribution in https://github.com/pytorch/ao/pull/1031
@keyan made their first contribution in https://github.com/pytorch/ao/pull/1041
@digantdesai made their first contribution in https://github.com/pytorch/ao/pull/997
@EnragedAntelope made their first contribution in https://github.com/pytorch/ao/pull/962
@c4lcut3c made their first contribution in https://github.com/pytorch/ao/pull/1094
@elfisworking made their first contribution in https://github.com/pytorch/ao/pull/1087
@chuanqi129 made their first contribution in https://github.com/pytorch/ao/pull/1105
@p4arth made their first contribution in https://github.com/pytorch/ao/pull/1122
@xuzijian629 made their first contribution in https://github.com/pytorch/ao/pull/1138
@jeffdaily made their first contribution in https://github.com/pytorch/ao/pull/1142
@r-barnes made their first contribution in https://github.com/pytorch/ao/pull/1151
@helunwencser made their first contribution in https://github.com/pytorch/ao/pull/1157
@bertmaher made their first contribution in https://github.com/pytorch/ao/pull/1222
@tibidoh made their first contribution in https://github.com/pytorch/ao/pull/1248
@mandroid6 made their first contribution in https://github.com/pytorch/ao/pull/1250
@HandH1998 made their first contribution in https://github.com/pytorch/ao/pull/1113
@readleyj made their first contribution in https://github.com/pytorch/ao/pull/1316
@22dimensions made their first contribution in https://github.com/pytorch/ao/pull/1318
@galqiwi made their first contribution in https://github.com/pytorch/ao/pull/1348
@dbyoung18 made their first contribution in https://github.com/pytorch/ao/pull/1324
@sunjiweiswift made their first contribution in https://github.com/pytorch/ao/pull/1259
@merrymercy made their first contribution in https://github.com/pytorch/ao/pull/1361

Full Changelog: https://github.com/pytorch/ao/compare/v0.6.1...v0.7.0-rc1

- Python
Published by vkuzo over 1 year ago

torchao - v0.6.1

Highlights

We are excited to announce the 0.6.1 release of torchao! This release adds Auto-Round support, Float8 Axiswise scaled training, a BitNet training recipe, an implementation of AWQ and much more!

Auto-Round Support (#581)

Auto-Round is a new weight-only quantization algorithm that has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bits and 3-bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.

```python
from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round

prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

quantize_(model, apply_auto_round(), is_target_module)
```

Added float8 training axiswise scaling support with per-gemm-argument configuration (#940)

We added experimental support for rowwise scaled float8 gemm to torchao.float8, with per-gemm-input configurability to enable exploration of various recipes. Here is how a user can configure all-axiswise scaling:

```python
import torchao.float8
from torchao.float8.config import Float8LinearRecipeName

# m is the nn.Module to convert

# all-axiswise scaling
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.ALL_AXISWISE)
m = torchao.float8.convert_to_float8_training(m, config=config)

# or, a custom recipe by @lw where grad_weight is left in bfloat16
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.LW_AXISWISE_WITH_GW_HP)
m = torchao.float8.convert_to_float8_training(m, config=config)
```

Early performance benchmarks show that all-axiswise scaling achieves a 1.13x speedup vs bf16 on torchtitan / LLaMa 3 8B / 8 H100 GPUs (compared to 1.17x from all-tensorwise scaling in the same setup), with loss curves that match bf16 and all-tensorwise scaling. Further performance and accuracy benchmarks will follow in future releases.
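To make the difference between tensorwise and axiswise (rowwise) scaling concrete, here is a minimal pure-Python sketch of how the scales could be derived from absolute maxima. The function names and the `eps` guard are illustrative assumptions, not the torchao.float8 implementation; 448 is the largest finite value representable in float8 e4m3fn.

```python
E4M3_MAX = 448.0  # largest finite float8 e4m3fn value


def tensorwise_scale(matrix, eps=1e-12):
    """One scale shared by every element of the tensor."""
    amax = max(abs(v) for row in matrix for v in row)
    return E4M3_MAX / max(amax, eps)


def rowwise_scales(matrix, eps=1e-12):
    """One scale per row, so an outlier only affects its own row."""
    return [E4M3_MAX / max(max(abs(v) for v in row), eps) for row in matrix]


x = [[1.0, -2.0], [100.0, 50.0]]
print(tensorwise_scale(x))  # scale dictated by the outlier 100.0
print(rowwise_scales(x))    # the first row keeps a much larger scale
```

With a single tensorwise scale, the small-magnitude first row is scaled by the same factor as the outlier row and uses only a small part of the float8 range; rowwise scaling avoids that at the cost of computing and storing one scale per row.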

Introduced BitNet b1.58 training recipe (#930)

Adds a recipe for [BitNet b1.58](https://arxiv.org/abs/2402.17764) ternary weight clamping.

```python
from torchao.prototype.quantized_training import bitnet_training
from torchao import quantize_

model = ...
quantize_(model, bitnet_training())
```

Notably, our implementation utilizes INT8 Tensor Cores to make up for this loss in speed. In fact, our implementation is faster than BF16 training in most cases.
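For context, the ternary clamping in the BitNet b1.58 paper scales weights by their mean absolute value and then rounds and clips to {-1, 0, 1}. The pure-Python sketch below illustrates that absmean scheme; `ternary_quantize` is a hypothetical helper for illustration and does not reflect how `bitnet_training()` is implemented internally.

```python
def ternary_quantize(weights, eps=1e-5):
    """Absmean ternary quantization: returns (ternary_rows, scale)."""
    flat = [w for row in weights for w in row]
    # Scale is the mean absolute weight (plus eps to avoid division by zero).
    scale = sum(abs(w) for w in flat) / len(flat) + eps
    ternary = [
        [max(-1, min(1, round(w / scale))) for w in row]  # round, then clip
        for row in weights
    ]
    return ternary, scale


w = [[0.9, -0.1], [-1.5, 0.3]]
q, s = ternary_quantize(w)
assert q == [[1, 0], [-1, 0]]
```

Multiplying the ternary values by `scale` gives the dequantized approximation of the original weights.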

[Prototype] Implemented Activation Aware Weight Quantization AWQ (#743)

Perplexity and performance measured on A100 GPU:

| Model | Quantization | Tokens/sec | Throughput (GB/sec) | Peak Mem (GB) | Model Size (GB) |
|--------------------|--------------|------------|---------------------|---------------|-----------------|
| Llama-2-7b-chat-hf | bfloat16 | 107.38 | 1418.93 | 13.88 | 13.21 |
| | awq-hqq-int4 | 196.6 | 761.2 | 5.05 | 3.87 |
| | awq-uint4 | 43.59 | 194.93 | 7.31 | 4.47 |
| | int4wo-hqq | 209.19 | 804.32 | 4.89 | 3.84 |
| | int4wo-64 | 201.14 | 751.42 | 4.87 | 3.74 |

Usage:

```python
from torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear

quant_dtype = torch.uint4
group_size = 64
calibration_limit = 10
calibration_seq_length = 1024
model = model.to(device)
insert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)
with torch.no_grad():
    for batch in calibration_data:
        model(batch.to(device))
is_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)
quantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)
```

New Features

[Prototype] Added Float8 support for AQT tensor parallel (#1003)
Added composable QAT quantizer (#938)
Introduced torchchat quantizer (#897)
Added INT8 mixed-precision training (#748)
Implemented sparse marlin AQT layout (#621)
Added a PerTensor static quant api (#787)
Introduced uintx quant to generate and eval (#811)
Added Float8 Weight Only and FP8 weight + dynamic activation (#740)
Implemented Auto-Round support (#581)
Added 2, 3, 4, 5 bit custom ops (#828)
Introduced symmetric quantization with no clipping error in the tensor subclass based API (#845)
Added int4 weight-only embedding QAT (#947)
Added support for 1-bit and 6-bit quantization for Llama in torchchat (#910, #1007)
Added a linear_observer class for doing static activation calibration (#807)
Exposed hqq through uintx_weight_only API (#786)
Added RowWise scaling option for Float8 dynamic activation quantization (#819)
Added Float8 weight only to autoquant api (#866)

Improvements

Enhanced Auto-Round functionality (#870)
Improved FSDP support for low-bit optimizers (#538)
Added support for using AffineQuantizedTensor with weights_only=True for torch.load (#630)
Optimized 3-bit packing (#1029)
Added more evaluation metrics to llama/eval.sh (#934)
Improved eager numerics for dynamic scales in float8 (#904)

Bug fixes

Fixed inference_mode issues (#885)
Fixed failing FP6 benchmark (#931)
Resolved various issues with float8 support (#918, #923)
Fixed load state dict when device is different for low-bit optim (#1021)

Performance

Added SM75 (Turing) support for FP6 kernel (#942)
Implemented int8 dynamic quant + bsr support (#821)
Added workaround to recover the perf for quantized vit in torch.compile (#926)

INT8 Mixed-Precision Training

On NVIDIA GPUs, INT8 Tensor Cores are approximately 2x faster than their BF16/FP16 counterparts. In mixed-precision training, we can down-cast activations and weights dynamically to INT8 to leverage faster matmuls. However, since INT8 has a very limited range [-128,127], we perform row-wise quantization, similar to how INT8 post-training quantization (PTQ) is done. Weights are still kept in their original precision.

```python
from torchao.prototype.quantized_training import int8_mixed_precision_training, Int8MixedPrecisionTrainingConfig
from torchao.quantization import quantize_

model = ...

# apply INT8 matmul to all 3 matmuls
quantize_(model, int8_mixed_precision_training())

# customize which matmul is left in original precision.
config = Int8MixedPrecisionTrainingConfig(
    output=True,
    grad_input=True,
    grad_weight=False,
)
quantize_(model, int8_mixed_precision_training(config))
```

**End2end speed benchmark** using `benchmarks/quantized_training/pretrain_llama2.py`

| Model & GPU | bs x seq_len | Config | Tok/s | Peak mem (GB) |
|-----|-----|-----|-----|-----|
| Llama2-7B, A100 | 8 x 2048 | BF16 (baseline) | ~4400 | 59.69 |
| Llama2-7B, A100 | 8 x 2048 | INT8 mixed-precision | ~6100 (+39%) | 58.28 |
| Llama2-1B, 4090 | 16 x 2048 | BF16 (baseline) | ~17,900 | 18.23 |
| Llama2-1B, 4090 | 16 x 2048 | INT8 mixed-precision | ~30,700 (+72%) | 18.34 |
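The row-wise quantization described in this section can be illustrated with a small pure-Python sketch. It is not the torchao implementation: the helper names `quantize_rowwise_int8` and `int8_matmul_rowwise` are hypothetical, and the symmetric `amax / 127` scale is an assumption chosen for simplicity. Each row gets its own scale, the dot products are accumulated in integers, and the result is rescaled back to floating point.

```python
def quantize_rowwise_int8(matrix):
    """Symmetric row-wise INT8 quantization: one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / 127 if amax > 0 else 1.0
        # Round to the nearest integer and clamp to the INT8 range [-128, 127].
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_matmul_rowwise(a, b_t):
    """Approximate a @ b_t^T. a is M x K; b_t is N x K (rows of b_t are columns of B)."""
    qa, sa = quantize_rowwise_int8(a)
    qb, sb = quantize_rowwise_int8(b_t)
    out = []
    for i, row_a in enumerate(qa):
        out_row = []
        for j, row_b in enumerate(qb):
            acc = sum(x * y for x, y in zip(row_a, row_b))  # integer accumulation
            out_row.append(acc * sa[i] * sb[j])             # rescale to float
        out.append(out_row)
    return out


a = [[1.0, -2.0, 0.5], [100.0, 50.0, -25.0]]
b_t = [[0.1, 0.2, 0.3], [-1.0, 2.0, 4.0]]
print(int8_matmul_rowwise(a, b_t))
```

Because the two rows of `a` differ in magnitude by about 50x, a single shared scale would leave the first row with very few usable integer levels; per-row scales avoid that.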

Docs

Updated README with more current float8 speedup information (#816)
Added tutorial for trainable tensor subclass (#908)
Improved documentation for float8 unification and inference (#895, #896)

Devs

Added compile tests to test suite (#906)
Improved CI setup and build processes (#887)
Added M1 wheel support (#822)
Added more benchmarking and profiling tools (#1017)
Renamed fpx to floatx (#877)
Removed torchao_nightly package (#661)
Added more lint fixes (#827)
Added better subclass testing support (#839)
Added CI to catch syntax errors (#861)
Added tutorial on composing quantized subclass w/ Dtensor based TP (#785)

Security

No significant security updates in this release.

Untopiced

Added basic SAM2 AutomaticMaskGeneration example server (#1039)

New Contributors

@iseeyuan made their first contribution in https://github.com/pytorch/ao/pull/805
@YihengBrianWu made their first contribution in https://github.com/pytorch/ao/pull/860
@kshitij12345 made their first contribution in https://github.com/pytorch/ao/pull/863
@ZainRizvi made their first contribution in https://github.com/pytorch/ao/pull/887
@alexsamardzic made their first contribution in https://github.com/pytorch/ao/pull/899
@vaishnavi17 made their first contribution in https://github.com/pytorch/ao/pull/911
@tobiasvanderwerff made their first contribution in https://github.com/pytorch/ao/pull/931
@kwen2501 made their first contribution in https://github.com/pytorch/ao/pull/937
@y-sq made their first contribution in https://github.com/pytorch/ao/pull/912
@jimexist made their first contribution in https://github.com/pytorch/ao/pull/969
@danielpatrickhug made their first contribution in https://github.com/pytorch/ao/pull/914
@ramreddymounica made their first contribution in https://github.com/pytorch/ao/pull/1007
@yushangdi made their first contribution in https://github.com/pytorch/ao/pull/1006
@ringohoffman made their first contribution in https://github.com/pytorch/ao/pull/1023

Full Changelog: https://github.com/pytorch/ao/compare/v0.5.0...v0.6.1

Published by drisspg over 1 year ago

torchao - v0.5.0

Highlights

We are excited to announce the 0.5 release of torchao! This release adds support for memory efficient inference, float8 training and inference, int8 quantized training, HQQ, automatic mixed-precision quantization through bayesian optimization, sparse marlin, and integrations with HuggingFace, SGLang, and diffusers.

Memory Efficient Inference Support https://github.com/pytorch/ao/pull/738

We've added support for Llama 3.1 to the llama benchmarks in TorchAO and added new features and improvements as a proof of concept for memory efficient inference. These additions allow us to do 130k context length inference with Llama 3.1-8B with only 18.91 GB of memory when we combine kv cache quantization, int4 weight only quantization and a linear causal mask.

General savings depend on technique and context length as can be seen in the following graph:

Float8 Training https://github.com/pytorch/ao/pull/551

torchao.float8 implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.

With torch.compile on, current results show throughput speedups of up to 1.5x on 128 H100 GPU LLaMa 3 70B pretraining jobs (details)

```python
from torchao.float8 import convert_to_float8_training
convert_to_float8_training(m, module_filter_fn=...)
```

And for an end-to-end minimal training recipe of pretraining with float8, you can check out torchtitan.

Float8 Inference https://github.com/pytorch/ao/pull/740 https://github.com/pytorch/ao/pull/819

We have introduced two new quantization APIs for Float8 inference:

Float8 Weight-Only Quantization: A new quant_api float8_weight_only() has been added to apply float8 weight-only symmetric per-channel quantization to linear layers.
Float8 Dynamic Activation and Weight Quantization: A new quant_api float8_dynamic_activation_float8_weight() has been introduced to apply float8 dynamic symmetric quantization to both activations and weights of linear layers. PerTensor scaling is used by default. We have also added an option to do PerRow scaling of both activations and weights. Computing scales at a finer granularity can potentially reduce the overall quantization error and increase performance by reducing dynamic quantization overhead.

Example usage:

```python
import torch
from torchao.quantization import quantize_, float8_weight_only, float8_dynamic_activation_float8_weight, PerRow

# Create a model
model = YourModel()

# Apply float8 weight-only quantization
quantize_(model, float8_weight_only())

# Apply float8 dynamic activation and weight quantization
quantize_(model, float8_dynamic_activation_float8_weight())

# Apply PerRow scaling to weight and activations
quantize_(linear_module, float8_dynamic_activation_float8_weight(granularity=PerRow()))
```

Notes:
- These new APIs are designed to work with PyTorch 2.5 and later versions.
- float8_dynamic_activation_float8_weight requires CUDA devices with compute capability 8.9 or higher for hardware acceleration.
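To make the difference between PerTensor and PerRow scaling concrete, here is a rough pure-Python sketch. It is not how torchao implements float8 quantization: `round_to_e4m3` is a simplified, hypothetical stand-in for a cast to the e4m3 float8 format (3 mantissa bits, maximum 448), and the `amax / 448` scale convention is an assumption. It shows one case where a finer granularity helps: a row of small values that underflows to zero under a single tensor-wide scale but survives with its own scale.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3 float8 format


def round_to_e4m3(x):
    """Simplified round-to-nearest onto an e4m3-like grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate instead of overflowing
    _, e = math.frexp(abs(x))      # abs(x) = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)      # below the smallest normal exponent the spacing is fixed
    step = 2.0 ** (exponent - 3)   # spacing between neighbouring representable values
    return round(x / step) * step


def quant_dequant(matrix, per_row):
    """Scale, round to the simplified float8 grid, and scale back."""
    if per_row:
        scales = [(max(abs(v) for v in row) / FP8_E4M3_MAX) or 1.0 for row in matrix]
    else:
        amax = max(abs(v) for row in matrix for v in row)
        scales = [(amax / FP8_E4M3_MAX) or 1.0] * len(matrix)
    return [[round_to_e4m3(v / s) * s for v in row] for row, s in zip(matrix, scales)]


def max_abs_error_per_row(matrix, approx):
    return [max(abs(a - b) for a, b in zip(r, q)) for r, q in zip(matrix, approx)]


# Row 0 holds very small values, row 1 holds large ones.
x = [[1e-4, -2e-4, 3e-4, 5e-5], [100.0, -250.0, 300.0, 50.0]]
err_tensor = max_abs_error_per_row(x, quant_dequant(x, per_row=False))
err_row = max_abs_error_per_row(x, quant_dequant(x, per_row=True))
print("per-tensor error per row:", err_tensor)
print("per-row error per row:   ", err_row)
```

In this sketch the large row is quantized identically in both modes, since it determines the tensor-wide maximum; only the small row benefits from having its own scale.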

Int8 quantized training #644 #748

@gau-nernst introduced 2 experimental works on training using INT8.

INT8 quantized training (#644): weight is quantized to INT8 during the whole duration of training to save memory. Compute remains in high precision. To train the model effectively with only quantized weights, we use stochastic rounding for weight update. Right now, memory saving is not too competitive compared to compiled BF16 baseline.
INT8 mixed-precision training (#748): weight is kept in the original high precision, but weight and activation are dynamically quantized to INT8 during training to utilize INT8 tensor cores. We observe up to 70% speedup for Llama2 pre-training on 4090, and 20% speedup for Llama3 pre-training on 8x A100 with FSDP2.

```python
from torchao.quantization import quantize_
from torchao.prototype.quantized_training import int8_weight_only_quantized_training, int8_mixed_precision_training

model = YourModel()

# apply INT8 quantized training
quantize_(model, int8_weight_only_quantized_training())

# apply INT8 mixed-precision training
quantize_(model, int8_mixed_precision_training())
```

For more information and benchmark results, see the README and the respective PRs (#644 and #748).
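The role of stochastic rounding in the weight update for INT8 quantized training can be illustrated with a toy example. This is only a sketch under simplifying assumptions, not the torchao code: the weight is a single integer standing in for an INT8 value, the update is a constant `0.01` per step, and `stochastic_round` and `train` are hypothetical helpers. With round-to-nearest, an update smaller than half a quantization step is always discarded; with stochastic rounding, it is applied with a probability equal to its fractional part, so the expected value of the weight still follows the updates.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x down or up, choosing 'up' with probability equal to the fractional part."""
    low = math.floor(x)
    return low + (1 if rng.random() < x - low else 0)


def train(num_steps, update, rounding, seed=0):
    """Apply the same small update repeatedly to an integer-valued weight."""
    rng = random.Random(seed)
    weight = 0  # integer weight, standing in for an INT8 value
    for _ in range(num_steps):
        new_value = weight + update
        if rounding == "nearest":
            weight = round(new_value)
        else:
            weight = stochastic_round(new_value, rng)
        weight = max(-128, min(127, weight))  # stay inside the INT8 range
    return weight


nearest = train(5000, 0.01, "nearest")
stochastic = train(5000, 0.01, "stochastic")
print("round-to-nearest:", nearest)
print("stochastic rounding:", stochastic)
```

Here the accumulated update is 5000 x 0.01 = 50, so the stochastically rounded weight should land near 50 while the round-to-nearest weight never moves from 0.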

HQQ Integration in torchao https://github.com/pytorch/ao/pull/605 https://github.com/pytorch/ao/pull/786

hqq is added to existing torchao APIs. It improves model accuracy and leverages the existing efficient kernels in torchao. We enabled hqq for the int4_weight_only API: `quantize_(model, int4_weight_only(group_size, use_hqq=True))`. We also added this to the uintx API for accuracy experiments (current uintx kernels are slow): `quantize_(model, uintx_weight_only(torch.uint2, group_size, use_hqq=True))`.

Automatic Mixed-Precision Quantization through Bayesian Optimization https://github.com/pytorch/ao/pull/592, https://github.com/pytorch/ao/pull/694

We provided a Bayesian Optimization (BO) tool leveraging Ax to auto search mixed-precision weight-only quantization configuration, i.e., bit width and group size of intN_weight_only(bit_width, group_size) for each layer. It also includes a sensitivity analysis tool to calculate layer-wise average Hessian trace and average fisher information matrix trace, which is an optional step to customize and improve BO search.

To optimize for model accuracy under a model size constraint (GB): `python BO_acc_modelsize.py --checkpoint=/tmp/Meta-Llama-3-8B --model_size_constraint=6.0`

To optimize for inference throughput under a model perplexity constraint: `python BO_acc_throughput.py --checkpoint=/tmp/Meta-Llama-3-8B --ppl_constraint=7.5`

For more detailed usage, please refer to this README. Compared to int8 uniform quantization on the Llama3-8B model, the mixed-precision quantization found by this tool reduces model size by 20.1% with a 2.8% perplexity reduction, and improves inference throughput by 15.1% with a 3.2% perplexity reduction.

Sparse Marlin https://github.com/pytorch/ao/pull/621, https://github.com/pytorch/ao/pull/733

@Diogo-V added support for sparse-marlin, a W4AFP16 2:4 sparse kernel, to TorchAO. On Meta Llama3, we observe a 25% tok/s increase (180 -> 226) compared to our existing int4-wo implementation.

```python
from torchao.quantization.quant_api import quantize_, int4_weight_only
from torchao.dtypes import MarlinSparseLayoutType
quantize_(model, int4_weight_only(layout_type=MarlinSparseLayoutType()))
```

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ----------------------- | ------------- | ----------------------- | ---------------- | --------------- |
| Llama-3-8B | Base (bfloat16) | 95.64 | 1435.54 | 16.43 | 15.01 |
| | int8dq | 8.61 | 64.75 | 9.24 | 7.52 |
| | int8wo | 153.03 | 1150.80 | 10.42 | 7.52 |
| | int4wo-64 | 180.80 | 763.33 | 6.88 | 4.22 |
| | int4wo-64-sparse-marlin | 226.02 | 689.20 | 5.32 | 3.05 |

HuggingFace Integration

torchao is integrated into huggingface (https://huggingface.co/docs/transformers/main/en/quantization/torchao). You can now use int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight through TorchAoConfig in huggingface. This is currently available in the huggingface main branch only.

SGLang Integration

torchao is also integrated into sglang (https://github.com/sgl-project/sglang/pull/1341) for the llama3 model. You can try it out with: `python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128`. Supported configurations are ["int4wo-", "int8wo", "int8dq", "fp8wo" (only available in torchao 0.5+)].

diffusers Integration

diffusers-torchao provides end-to-end inference and experimental training recipes to use torchao with diffusers in this repo. We demonstrate 53.88% speedup on Flux.1-Dev* and 27.33% speedup on CogVideoX-5b when comparing compiled quantized models against their standard bf16 counterparts.

BC Breaking

Add layout option to woq int4 api https://github.com/pytorch/ao/pull/670

```python
# torchao 0.4.0
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(inner_k_tiles=8))

# torchao 0.5.0
from torchao.dtypes import TensorCoreTiledLayoutType
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(layout_type=TensorCoreTiledLayoutType(inner_k_tiles=8)))
```

Refactor QAT to use tensor subclasses https://github.com/pytorch/ao/pull/585

We refactored QAT to use tensor subclasses instead of module swap. This works well with torchtune and FSDP2, but currently lacks support for FSDP1 and DDP. As a fallback for these distribution strategies, please continue to use the old module swap flows.

```python
# torchao 0.4.0: This uses the module swap flow
# torchao 0.5.0 + FSDP2: This uses the tensor subclass flow
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer
quantizer = Int8DynActInt4WeightQATQuantizer()
model = quantizer.prepare(model)
train(model)
model = quantizer.convert(model)

# torchao 0.5.0 + DDP or FSDP1: This uses the module swap flow
from torchao.quantization.prototype.qat._module_swap_api import Int8DynActInt4WeightQATQuantizerModuleSwap
quantizer = Int8DynActInt4WeightQATQuantizerModuleSwap()
model = quantizer.prepare(model)
train(model)
model = quantizer.convert(model)
```

Deprecations

New Features

Optimizer CPU offload for single GPU training https://github.com/pytorch/ao/pull/584
Add support for save quantized checkpoint in llama code https://github.com/pytorch/ao/pull/553
Intx quantization tensor subclass https://github.com/pytorch/ao/pull/468
Add superblock to sparse/prototype https://github.com/pytorch/ao/pull/660
Add AffineQuantizedObserver https://github.com/pytorch/ao/pull/650
Add BSR subclass + torch.compile and clean up superblock https://github.com/pytorch/ao/pull/680
Add HQQ support https://github.com/pytorch/ao/pull/605
Add performance profiler https://github.com/pytorch/ao/pull/690
Add experimental INT8 quantized training https://github.com/pytorch/ao/pull/644
Add high-level operator interface https://github.com/pytorch/ao/pull/708
Add sparse marlin 2:4 gemm op https://github.com/pytorch/ao/pull/733
Example for GPTQ-like calibration flow https://github.com/pytorch/ao/pull/721
Llama3.1 and KV cache quantization https://github.com/pytorch/ao/pull/738
Add float8 weight only and weight + dynamic activation https://github.com/pytorch/ao/pull/740
Add Auto-Round support https://github.com/pytorch/ao/pull/581

Mixed-Precision Quantization

Add sensitivity analysis tool for layer-wise FIT and Hessian trace https://github.com/pytorch/ao/pull/592
Bayesian optimization tool for mixed precision quantization https://github.com/pytorch/ao/pull/694

Improvements

Move sam eval from scripts to torchao/_models https://github.com/pytorch/ao/pull/591
QOL improvements to float8 gemm benchmark https://github.com/pytorch/ao/pull/596
Move lowbit universal kernels from torchaccel to torchao https://github.com/pytorch/ao/pull/582
Refactor autoquant to use AQT https://github.com/pytorch/ao/pull/609
Add support for using AffineQuantizedTensor with weights_only=True https://github.com/pytorch/ao/pull/630
Move Uintx out of prototype for future extension https://github.com/pytorch/ao/pull/635
Refactor _quantized_linear for better extensibility https://github.com/pytorch/ao/pull/634
Update micro benchmarking code for AQT https://github.com/pytorch/ao/pull/673
Refactor superblock code + add final benchmark/eval scripts https://github.com/pytorch/ao/pull/691
Relax QAT dtype assertion https://github.com/pytorch/ao/pull/692
Add option to move param to device before quantization https://github.com/pytorch/ao/pull/699
Add gpu benchmarking script https://github.com/pytorch/ao/pull/192
Enable to(device=device_name) for Uintx https://github.com/pytorch/ao/pull/722
Make torchao's llama model trainable https://github.com/pytorch/ao/pull/728
Specify output dtype to torch.float32 in _foreach_norm https://github.com/pytorch/ao/pull/727
Add semi-structured sparsity to hf eval https://github.com/pytorch/ao/pull/576
Use torch.uint1 to torch.uint7 for Uintx tensor subclass https://github.com/pytorch/ao/pull/672
Add AdamW to CPUOffloadOptimizer default https://github.com/pytorch/ao/pull/742
Make developer experience better for extending AQT https://github.com/pytorch/ao/pull/749
Add back QAT module swap API https://github.com/pytorch/ao/pull/762
Refactor quant_llm to work with affine quantized tensor https://github.com/pytorch/ao/pull/772
Move iOS benchmarking infra code to torchao https://github.com/pytorch/ao/pull/766
Add CPU bandwidth benchmark https://github.com/pytorch/ao/pull/773
Update method names to support intx and floatx changes https://github.com/pytorch/ao/pull/775
Add implementation for torchao::parallel_for backends https://github.com/pytorch/ao/pull/774
Add Llama2-7B finetune benchmarks for low-bit optimizers https://github.com/pytorch/ao/pull/746
Fix Adam4bit support on PyTorch 2.3 and 2.4 and update AdamFp8 torch requirement https://github.com/pytorch/ao/pull/755
Improve compile time + fix PyTorch 2.3 support for 4-bit optim https://github.com/pytorch/ao/pull/812
Allow quantized linear registration in a different file https://github.com/pytorch/ao/pull/783
Add 2bit, 5bit packing routines https://github.com/pytorch/ao/pull/797, https://github.com/pytorch/ao/pull/798
Freeze dataclass in nf4, prep for better pt2 support https://github.com/pytorch/ao/pull/799
Format and lint nf4 file and test https://github.com/pytorch/ao/pull/800
Move more utils to TorchAOBaseTensor https://github.com/pytorch/ao/pull/784
Add more information to quantized linear module and added some logs https://github.com/pytorch/ao/pull/782
Add int4 mode to autoquant https://github.com/pytorch/ao/pull/804
Add uintx quant to generate and eval https://github.com/pytorch/ao/pull/811
Move non-NF4 tensor to device prior to quantization on copy https://github.com/pytorch/ao/pull/737

Static quantization

Add float8 static quant support https://github.com/pytorch/ao/pull/787
Update how block_size is calculated with Observers https://github.com/pytorch/ao/pull/815
Add a linear observer class and test https://github.com/pytorch/ao/pull/807

Float8

Update benchmarks to be more useful for smaller shapes https://github.com/pytorch/ao/pull/615
Remove unneeded kernel for scale generation https://github.com/pytorch/ao/pull/616
Filter out microbenchmarking overhead in profiling script https://github.com/pytorch/ao/pull/629
Save torch_logs, and attach them to profiling trace https://github.com/pytorch/ao/pull/645
Add option for gpu time in GEMM benchmarks https://github.com/pytorch/ao/pull/666
Add roofline estimation of GEMM + overhead https://github.com/pytorch/ao/pull/668
Make roofline utils reusable https://github.com/pytorch/ao/pull/731
Use torch.compiler.is_compiling https://github.com/pytorch/ao/pull/739
Float8 support in AQT https://github.com/pytorch/ao/pull/671
Add static scaling for float8 training https://github.com/pytorch/ao/pull/760
Make roofline script calculate observed overhead https://github.com/pytorch/ao/pull/734
Make Inference and training code independent https://github.com/pytorch/ao/pull/808
Add rowwise scaling option to float8 dynamic quant https://github.com/pytorch/ao/pull/819

Bug fixes

Fix all-gather in 2D with DTensor (WeightWithDynamicFloat8CastTensor) https://github.com/pytorch/ao/pull/590
Fix FP6-LLM API and add .to(device) op https://github.com/pytorch/ao/pull/595
Fix linear_activation_tensor dynamic quant https://github.com/pytorch/ao/pull/622
Fix bug with float8 inference_mode https://github.com/pytorch/ao/pull/659
Quantization kernel bug fixes https://github.com/pytorch/ao/pull/717
Cast local_scale_tensor to fp32 for precompute of float8 dynamic scaling https://github.com/pytorch/ao/pull/713
Fix affine quantized tensor to device calls https://github.com/pytorch/ao/pull/726
Small fix for micro benchmark code https://github.com/pytorch/ao/pull/711
Fix LR schedule handling for low-bit optimizers https://github.com/pytorch/ao/pull/736
Fix FPX inductor error https://github.com/pytorch/ao/pull/790
Fixed llama model inference https://github.com/pytorch/ao/pull/769

Docs

Add QAT README https://github.com/pytorch/ao/pull/597
Update serialization.rst to include get_model_size_in_bytes import https://github.com/pytorch/ao/pull/604
Clarify details around unwrap_tensor_subclass in README.md https://github.com/pytorch/ao/pull/618, https://github.com/pytorch/ao/pull/619
Spelling fixes https://github.com/pytorch/ao/pull/662
Move developer guide file to a folder https://github.com/pytorch/ao/pull/681
Update docs on how to use AUTOQUANT_CACHE https://github.com/pytorch/ao/pull/649
Update pip install command in README https://github.com/pytorch/ao/pull/723
Fix docstring args names https://github.com/pytorch/ao/pull/735
Update README example with correct import of sparsify_ https://github.com/pytorch/ao/pull/741
Update main and quantization README https://github.com/pytorch/ao/pull/745, https://github.com/pytorch/ao/pull/747, https://github.com/pytorch/ao/pull/757
Add README for mixed-precision search tool and code refactor https://github.com/pytorch/ao/pull/776
Add performance section to float8 README.md https://github.com/pytorch/ao/pull/794
Make float8 README.md examples standalone https://github.com/pytorch/ao/pull/809
Add KV cache quantization to READMEs https://github.com/pytorch/ao/pull/813
Update main README.md with more current float8 speedup https://github.com/pytorch/ao/pull/816

Not user facing

Fix float8 inference tests and add export test https://github.com/pytorch/ao/pull/613
Reduce atol/rtol for stable tests https://github.com/pytorch/ao/pull/617
Fix version guard in https://github.com/pytorch/ao/pull/620, https://github.com/pytorch/ao/pull/679, https://github.com/pytorch/ao/pull/684
Fix BC for QAT location https://github.com/pytorch/ao/pull/626
Enable float8 CI on sm89 https://github.com/pytorch/ao/pull/587
Fix Inductor bench BC change https://github.com/pytorch/ao/pull/638, https://github.com/pytorch/ao/pull/641
Add CUDA compute capability compile guard https://github.com/pytorch/ao/pull/636
Remove numpy as bitpack dependency https://github.com/pytorch/ao/pull/677
Add PyTorch 2.4 tests in CI https://github.com/pytorch/ao/pull/654
Remove torchao_nightly package https://github.com/pytorch/ao/pull/661
Update licenses in torchao/experimental https://github.com/pytorch/ao/pull/720
Add lint checks for float8 inference https://github.com/pytorch/ao/pull/779

New Contributors

@sayakpaul made their first contribution in https://github.com/pytorch/ao/pull/604
@metascroy made their first contribution in https://github.com/pytorch/ao/pull/582
@raziel made their first contribution in https://github.com/pytorch/ao/pull/618
@nmacchioni made their first contribution in https://github.com/pytorch/ao/pull/641
@Diogo-V made their first contribution in https://github.com/pytorch/ao/pull/670
@mobicham made their first contribution in https://github.com/pytorch/ao/pull/605
@crcrpar made their first contribution in https://github.com/pytorch/ao/pull/703
@ebsmothers made their first contribution in https://github.com/pytorch/ao/pull/737
@a-r-r-o-w made their first contribution in https://github.com/pytorch/ao/pull/741
@kimishpatel made their first contribution in https://github.com/pytorch/ao/pull/766

We were able to close about 70% of the tasks planned for 0.5.0; the remaining tasks will spill over into upcoming releases. We will post a list for 0.6.0 next, which we aim to release at the end of September 2024. We want to follow a monthly release cadence until further notice.

Full Changelog: https://github.com/pytorch/ao/compare/v0.4.0...v0.5.0-rc1

Published by andrewor14 over 1 year ago

torchao - v0.4.0

Highlights

We are excited to announce the 0.4 release of torchao! This release adds support for KV cache quantization, quantization aware training (QAT), low bit optimizer support, composing quantization and sparsity, and more!

KV cache quantization (https://github.com/pytorch/ao/pull/532)

We've added support for KV cache quantization, showing a peak memory reduction from 19.7 -> 19.2 GB on Llama3-8B at an 8192 context length. We plan to investigate Llama3.1 next.

Quantization-Aware Training (QAT) (#383, #555)

We now support two QAT schemes for linear layers: Int8 per token dynamic activations + int4 per group weights, and int4 per group weights (using the efficient tinygemm int4 kernel after training). Users can access this feature by transforming their models before and after training using the appropriate quantizer, for example:

```python
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Quantizer for int8 dynamic per token activations +
# int4 grouped per channel weights, only for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics during
# training without performing any dtype casting
model = qat_quantizer.prepare(model)

# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)
```

Initial evaluation results indicate that QAT in torchao can recover up to 96% of quantized accuracy degradation on hellaswag and up to 68% of quantized perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). For more details, please refer to the README and this blog post.
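As a rough illustration of what a "fake quantize" operation does to per-group int4 weights, here is a minimal pure-Python sketch. It is not the torchao implementation: `fake_quantize_per_group` is a hypothetical helper, and the symmetric scheme with an `amax / 7` scale is an assumption. The point is that values are snapped to the int4 grid and immediately mapped back, so the output remains floating point while carrying the rounding error the model will see after conversion.

```python
def fake_quantize_per_group(weights, group_size, n_bits=4):
    """Quantize then dequantize each group of weights; the result stays in floats."""
    qmax = 2 ** (n_bits - 1) - 1    # 7 for int4
    qmin = -(2 ** (n_bits - 1))     # -8 for int4
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        scale = amax / qmax if amax > 0 else 1.0   # one scale per group (assumed symmetric)
        for w in group:
            q = max(qmin, min(qmax, round(w / scale)))  # integer level on the int4 grid
            out.append(q * scale)                       # back to float: no dtype change
    return out


weights = [0.1, -0.7, 0.35, 0.02, 5.0, -2.5, 1.0, 0.0]
fake_quantized = fake_quantize_per_group(weights, group_size=4)
print(fake_quantized)
```

Each group of four weights uses its own scale, so the small values in the first group are not flattened by the large values in the second.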

Composing quantization and sparsity (#457, #473)

We've added support for composing int8 dynamic quantization with 2:4 sparsity, using the quantize_ API. We also added SAM benchmarks that show a 7% speedup over standalone sparsity / int8 dynamic quantization here.

```python
from torchao.quantization import quantize_, int8_dynamic_activation_int8_semi_sparse_weight
quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())
```

Community Contributions

low-bit optimizer support (#478, #463, #482, #484, #538)

@gau-nernst added implementations for 4-bit, 8-bit, and FP8 Adam with FSDP2/FSDP support. Our API is a drop-in replacement for torch.optim.Adam and can be used as follows:

```python
from torchao.prototype.low_bit_optim import Adam8bit, Adam4bit, AdamFp8
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8

model = ...
optim = Adam8bit(model.parameters())  # replace with Adam4bit and AdamFp8 for the 4 / fp8 versions
```

For more information about low bit optimizer support please refer to our README.

Improvements to 4-bit quantization (https://github.com/pytorch/ao/pull/517, https://github.com/pytorch/ao/pull/552, https://github.com/pytorch/ao/pull/544, #479)

@bdhirsh @jeromeku @yanbing-j @manuelcandales @larryliu0820 added torch.compile support for NF4 Tensor, custom CUDA int4 tinygemm unpacking ops, and several bugfixes to torchao

BC breaking

* `quantize` has been renamed to `quantize_` https://github.com/pytorch/ao/pull/467

```python
# for torchao 0.4
from torchao.quantization import quantize_, int8_weight_only
quantize_(model, int8_weight_only())

# for torchao 0.3
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())
```

* `apply_sparse_semi_structured` has been deprecated in favor of `sparsify_` which matches the `quantize_` API https://github.com/pytorch/ao/pull/473

```python
# for torchao 0.4
from torchao.sparsity import sparsify_, semi_sparse_weight
sparsify_(model, semi_sparse_weight())

# for torchao 0.3
from torchao.sparsity import apply_sparse_semi_structured
apply_sparse_semi_structured(model)
```

Deprecations

New Features

Added kv_cache quantization https://github.com/pytorch/ao/pull/532
Migrated float8_experimental to torchao.float8, enabling float8 training support https://github.com/pytorch/ao/pull/551 https://github.com/pytorch/ao/pull/529
Added FP5 E2M2 https://github.com/pytorch/ao/pull/399
Added 4-bit, 8-bit, and FP8 ADAM support https://github.com/pytorch/ao/pull/478 https://github.com/pytorch/ao/pull/463 https://github.com/pytorch/ao/pull/482
Added FSDP2 support for low-bit optimizers https://github.com/pytorch/ao/pull/484
[prototype] mixed-precision quantization and eval framework https://github.com/pytorch/ao/pull/531
Added int4 weight-only QAT support https://github.com/pytorch/ao/pull/555, https://github.com/pytorch/ao/pull/383
Added custom CUDA tinygemm unpacking ops https://github.com/pytorch/ao/pull/415

Improvements

Composing quantization and sparsity now uses the unified AQT Layout https://github.com/pytorch/ao/pull/498
Added default inductor config settings https://github.com/pytorch/ao/pull/423
Better dtype and device handling for Int8DynActInt4WeightQuantizer and Int4WeightOnlyQuantizer https://github.com/pytorch/ao/pull/475 https://github.com/pytorch/ao/pull/479
Enable model.to for int4/int8 weight only quantized models https://github.com/pytorch/ao/pull/486 https://github.com/pytorch/ao/pull/522
Added more logging to TensorCoreTiledAQTLayout https://github.com/pytorch/ao/pull/520
Added general fake_quantize_affine op with mask support https://github.com/pytorch/ao/pull/492 https://github.com/pytorch/ao/pull/500
QAT now uses the shared fake_quantize_affine primitive https://github.com/pytorch/ao/pull/527
Improve FSDP support for low-bit optimizers https://github.com/pytorch/ao/pull/538
Custom op and inductor decomp registration now uses a decorator https://github.com/pytorch/ao/pull/434
Updated torch version to no longer require unwrap_tensor_subclass https://github.com/pytorch/ao/pull/595

Bug fixes

Fixed import for TORCH_VERSION_AFTER_* https://github.com/pytorch/ao/pull/433
Fixed crash when PYTORCH_VERSION is not defined https://github.com/pytorch/ao/pull/455
Added torch.compile support for NF4Tensor https://github.com/pytorch/ao/pull/544
Added fbcode check to fix torchtune in Genie https://github.com/pytorch/ao/pull/480
Fixed int4pack_mm error https://github.com/pytorch/ao/pull/517
Fixed cuda device check https://github.com/pytorch/ao/pull/536
Weight shuffling now runs on CPU for int4 quantization due to a MPS memory issue https://github.com/pytorch/ao/pull/552
Scale and input now are the same dtype for int8 weight only quantization https://github.com/pytorch/ao/pull/534
Fixed FP6-LLM API https://github.com/pytorch/ao/pull/595

Performance

Added segment-anything-fast benchmarks for composed quantization + sparsity https://github.com/pytorch/ao/pull/457
Updated low-bit Adam benchmark https://github.com/pytorch/ao/pull/481

Docs

Updated README.md https://github.com/pytorch/ao/pull/583 https://github.com/pytorch/ao/pull/438 https://github.com/pytorch/ao/pull/445 https://github.com/pytorch/ao/pull/460
Updated installation instructions https://github.com/pytorch/ao/pull/447 https://github.com/pytorch/ao/pull/459
Added more docs for int4_weight_only API https://github.com/pytorch/ao/pull/469
Added developer guide notebook https://github.com/pytorch/ao/pull/588
Added optimized model serialization/deserialization doc https://github.com/pytorch/ao/pull/524 https://github.com/pytorch/ao/pull/525
Added new float8 feature tracker https://github.com/pytorch/ao/pull/557
Added static quantization tutorial for calibration-based techniques https://github.com/pytorch/ao/pull/487

Devs

Fix numpy version in CI https://github.com/pytorch/ao/pull/537
trymerge now uploads merge records to s3 https://github.com/pytorch/ao/pull/448
Updated python version to 3.9 https://github.com/pytorch/ao/pull/488
torchao no longer depends on torch https://github.com/pytorch/ao/pull/449
benchmark_model now accepts args and kwargs and supports cpu and mps backends https://github.com/pytorch/ao/pull/586 https://github.com/pytorch/ao/pull/406
Add git version suffix to package name https://github.com/pytorch/ao/pull/547
Added validations to torchao https://github.com/pytorch/ao/pull/453 https://github.com/pytorch/ao/pull/454
Parallel test support with pytest-xdist https://github.com/pytorch/ao/pull/518
Quantizer now uses logging instead of print https://github.com/pytorch/ao/pull/472

Not user facing

Refactored _replace_linear_8da4w https://github.com/pytorch/ao/pull/451
Remove unused code from AQT implementation https://github.com/pytorch/ao/pull/476 https://github.com/pytorch/ao/pull/440 https://github.com/pytorch/ao/pull/441 https://github.com/pytorch/ao/pull/471
Improved error message for lm_eval script https://github.com/pytorch/ao/pull/444
Updated HF_TOKEN env variable https://github.com/pytorch/ao/pull/427
Fixed typo in Quant-LLM in https://github.com/pytorch/ao/pull/450
Add a test for map_location="cpu" in https://github.com/pytorch/ao/pull/497
Removed sparse test collection warning https://github.com/pytorch/ao/pull/489
Refactored layout implementation https://github.com/pytorch/ao/pull/491
Refactored LinearActQuantizedTensor https://github.com/pytorch/ao/pull/542

New Contributors

@qingquansong made their first contribution in https://github.com/pytorch/ao/pull/433
@Hanxian97 made their first contribution in https://github.com/pytorch/ao/pull/451
@larryliu0820 made their first contribution in https://github.com/pytorch/ao/pull/472
@SLR722 made their first contribution in https://github.com/pytorch/ao/pull/480
@jainapurva made their first contribution in https://github.com/pytorch/ao/pull/406
@bdhirsh made their first contribution in https://github.com/pytorch/ao/pull/544
@yanbing-j made their first contribution in https://github.com/pytorch/ao/pull/517
@manuelcandales made their first contribution in https://github.com/pytorch/ao/pull/552
@Valentine233 made their first contribution in https://github.com/pytorch/ao/pull/534

Full Changelog: https://github.com/pytorch/ao/compare/v0.3.1-rc1...v0.4.0-rc1

We were able to close about 60% of tasks for 0.4.0; the remaining tasks will spill over into upcoming releases. We will post a list for 0.5.0 next, which we aim to release at the end of August 2024. We want to follow a monthly release cadence until further notice.

- Python
Published by jcaip almost 2 years ago

torchao - v0.3.1

v0.3.1

Highlights

We are excited to announce the 0.3 release of torchao! This release adds support for a new quantize API, MX format, FP6 dtype and bitpacking, 2:4 sparse accelerated training and benchmarking infra for llama2/llama3 models.

`quantize` API (https://github.com/pytorch/ao/pull/256)

We added a tensor subclass based quantization API; see the docs and README for details on usage. It is planned to replace all existing quantization APIs in torchao for torch 2.4 and later.

Accelerated training with 2:4 sparsity (#184)

You can now accelerate training with 2:4 sparsity, using the runtime pruning + compression kernels written by xFormers. These kernels prune a 4x4 sub-tile so that it is 2:4 sparse in both directions, which handles both the forward and backward pass when training. We see a 1.3x speedup for the MLP layers of ViT-L across a forward and backward pass.
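To make the 2:4 pattern concrete, here is a minimal pure-Python sketch that keeps the two largest-magnitude values in every group of four and zeroes the rest. It only illustrates the one-directional pattern on a single row; it is not the xFormers kernel, which works on 4x4 sub-tiles in both directions, and the helper name `prune_2_of_4` is hypothetical.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in every group of four.

    Hypothetical helper for illustration; not part of torchao or xFormers.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes; ties go to the earlier index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]
print(prune_2_of_4(weights))
```

Every group of four in the result has at most two non-zero entries, which is the structure that 2:4 sparse hardware kernels rely on.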

MX support (https://github.com/pytorch/ao/pull/264)

We added prototype support for the MX formats for training and inference, with a reference native PyTorch implementation of the training and inference primitives for MX-accelerated matrix multiplications. The MX numerical formats are new low-precision formats recently accepted into the OCP spec: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
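The core idea behind microscaling formats is that a block of elements shares a single power-of-two scale. The sketch below illustrates that idea in pure Python under simplifying assumptions: it uses an int8-like element range and picks the smallest power-of-two scale that fits the block, which is not the exact exponent rule or element encoding from the OCP spec, and the function names are hypothetical rather than torchao APIs.

```python
import math

BLOCK_SIZE = 32  # one shared scale per block of 32 elements
ELEM_MAX = 127   # illustrative int8-like element range (an assumption)


def mx_quantize_block(block):
    """Return (shared_exponent, integer_elements) for one block."""
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} elements")
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0] * BLOCK_SIZE
    # Smallest power-of-two scale for which amax / scale fits the element range.
    shared_exp = math.ceil(math.log2(amax / ELEM_MAX))
    scale = 2.0 ** shared_exp
    elems = [max(-ELEM_MAX, min(ELEM_MAX, round(v / scale))) for v in block]
    return shared_exp, elems


def mx_dequantize_block(shared_exp, elems):
    """Reconstruct approximate values from the shared exponent and elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elems]


block = [i * 0.1 - 1.6 for i in range(BLOCK_SIZE)]
shared_exp, elems = mx_quantize_block(block)
recon = mx_dequantize_block(shared_exp, elems)
print(shared_exp, max(abs(a - b) for a, b in zip(block, recon)))
```

Because the scale is a power of two, applying it is an exponent adjustment rather than a full multiply, which is part of what makes the format hardware friendly.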

Benchmarking (https://github.com/pytorch/ao/pull/276, https://github.com/pytorch/ao/pull/374)

We added a stable way to benchmark llama2 and llama3 models that includes perf/accuracy comparisons. See torchao/_models/llama/benchmarks.sh for more details.

🌟 💥 Community Contributions 🌟 💥

FP6 support (https://github.com/pytorch/ao/pull/279, https://github.com/pytorch/ao/pull/283, https://github.com/pytorch/ao/pull/358)

@gau-nernst added support for the FP6 dtype and a mixed FP16 x FP6 matmul kernel with support for torch.compile. Benchmark results show a 2.3x speedup over the BF16 baseline for meta-llama/Llama-2-7b-chat-hf.

Bitpacking (https://github.com/pytorch/ao/pull/307, https://github.com/pytorch/ao/pull/282)

@vayuda, @melvinebenezer, @CoffeeVampir3 and @andreaskoepf added support for packing/unpacking lower-bit dtypes, leveraging torch.compile to generate the kernels, and added UInt2 and Bitnet tensors based on this approach.
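As a rough illustration of what bit packing means for a 2-bit dtype, the sketch below packs four 2-bit values into each byte and unpacks them again. It is plain Python rather than the torch.compile-generated kernels, and the bit order (first value in the lowest bits) is an assumption, not necessarily the layout torchao uses.

```python
def pack_uint2(values):
    """Pack 2-bit integers (0..3) four to a byte, first value in the low bits."""
    if len(values) % 4 != 0:
        raise ValueError("number of values must be a multiple of 4")
    packed = bytearray()
    for start in range(0, len(values), 4):
        byte = 0
        for offset, v in enumerate(values[start:start + 4]):
            if not 0 <= v <= 3:
                raise ValueError("values must fit in 2 bits")
            byte |= v << (2 * offset)
        packed.append(byte)
    return bytes(packed)


def unpack_uint2(packed):
    """Inverse of pack_uint2: expand each byte back into four 2-bit integers."""
    return [(byte >> (2 * offset)) & 0b11 for byte in packed for offset in range(4)]


values = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_uint2(values)
assert unpack_uint2(packed) == values
print(len(values), "values stored in", len(packed), "bytes")
```

The payoff is storage: eight 2-bit values occupy two bytes instead of eight when each value is held in its own uint8.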

FP8 split-gemm kernel https://github.com/pytorch/ao/pull/263

Added the kernel written by @AdnanHoque to torchao, with speedups compared to the cuBLAS kernel for batch sizes <= 16.
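The notes do not spell out what "split" means here. Assuming it refers to splitting the reduction (K) dimension of the matmul across independent work units, the idea can be sketched in pure Python as follows. This is a conceptual illustration only, not the Triton kernel, and `matmul_split_k` is a hypothetical name.

```python
def matmul_split_k(a, b, num_splits):
    """Multiply a (M x K) by b (K x N), reducing K in num_splits chunks.

    Each chunk yields a partial M x N product; the partials are summed at
    the end. On a GPU the chunks would run in parallel.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // num_splits)  # ceiling division
    partials = []
    for s in range(num_splits):
        k_lo, k_hi = s * chunk, min((s + 1) * chunk, k)
        partials.append([
            [sum(a[i][kk] * b[kk][j] for kk in range(k_lo, k_hi)) for j in range(n)]
            for i in range(m)
        ])
    # Final reduction over the partial products.
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]


a = [[1, 2, 3, 4]]
b = [[1, 0], [0, 1], [1, 1], [2, 0]]
print(matmul_split_k(a, b, num_splits=2))
```

Splitting along K tends to help when M is small, since the output tile alone offers too little parallelism; that would be consistent with the small-batch speedups mentioned above.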

BC Breaking

Deprecations

Deprecate top level quantization APIs https://github.com/pytorch/ao/pull/344

1. int8 weight only quantization

apply_weight_only_int8_quant(model) or change_linear_weights_to_int8_woqtensors(model)

-->

```python
# for torch 2.4+
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors
change_linear_weights_to_int8_woqtensors(model)
```

2. int8 dynamic quantization

apply_dynamic_quant(model) or change_linear_weights_to_int8_dqtensors(model)

-->

```python
import torch

# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
torch._inductor.config.force_fuse_int_mm_with_mul = True

# for torch 2.4+
from torchao.quantization import quantize, int8_dynamic_activation_int8_weight
quantize(model, int8_dynamic_activation_int8_weight())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_dqtensors
change_linear_weights_to_int8_dqtensors(model)
```

3. int4 weight only quantization

change_linear_weights_to_int4_woqtensors(model)

-->

```python
# for torch 2.4+
from torchao.quantization import quantize, int4_weight_only
quantize(model, int4_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int4_woqtensors
change_linear_weights_to_int4_woqtensors(model)
```
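For the int8 dynamic quantization path in item 2, the arithmetic behind the "int8*int8 -> int32 matmul and subsequent mul" comment can be sketched in plain Python. This is an illustrative sketch only: symmetric per-token activation scales and per-output-channel weight scales are assumptions about the scheme, the function names are hypothetical, and torchao's actual kernels may differ.

```python
def quantize_symmetric_int8(values):
    """Map floats to integers in [-127, 127] with a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def int8_dynamic_linear(x_rows, w_rows):
    """x_rows: tokens of length K; w_rows: output channels of length K."""
    # Weights are quantized once, ahead of time (one scale per output channel).
    w_quant = [quantize_symmetric_int8(w) for w in w_rows]
    out = []
    for x in x_rows:
        # Activations are quantized at runtime (one scale per token).
        x_q, x_scale = quantize_symmetric_int8(x)
        row = []
        for w_q, w_scale in w_quant:
            acc = sum(xi * wi for xi, wi in zip(x_q, w_q))  # integer accumulate
            row.append(acc * x_scale * w_scale)  # the "subsequent mul"
        out.append(row)
    return out


x = [[1.0, -2.0, 0.5]]
w = [[0.5, 0.25, -1.0], [1.0, 1.0, 1.0]]
print(int8_dynamic_linear(x, w))  # close to the float result [[-0.5, -0.5]]
```

In this sketch the integer accumulator `acc` is rescaled immediately. Fusing the matmul with the multiply plays the same role on the GPU, so the int32 intermediate never has to be written out as a full tensor.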

New Features

Add quantize https://github.com/pytorch/ao/pull/256
Add a prototype of MX format training and inference https://github.com/pytorch/ao/pull/264
[FP6-LLM] Port splitK map from DeepSpeed https://github.com/pytorch/ao/pull/283
Improve FP6-LLM 2+4bit weight splitting + user API https://github.com/pytorch/ao/pull/279
Bitpacking https://github.com/pytorch/ao/pull/291
training acceleration via runtime semi-structured sparsity https://github.com/pytorch/ao/pull/184
Bitpackingv2 https://github.com/pytorch/ao/pull/307
Add FP6-LLM doc and move FP6-LLM to prototype https://github.com/pytorch/ao/pull/358
Added first bits of Uint2Tensor and BitnetTensor https://github.com/pytorch/ao/pull/282

Improvements

Improve primitives for FP6 quant https://github.com/pytorch/ao/pull/248
Extract eval code from GPTQ for more general usage https://github.com/pytorch/ao/pull/275
Factor out the specific configurations to helper functions https://github.com/pytorch/ao/pull/286
Add support for AQTLayout, PlainAQTLayout and TensorCoreTiledAQTLayout https://github.com/pytorch/ao/pull/278
Graceful handling of cpp extensions https://github.com/pytorch/ao/pull/296
Refactor int8 dynamic quantization with call to quantize https://github.com/pytorch/ao/pull/294
[NF4][FSDP] return contiguous quantization_factor https://github.com/pytorch/ao/pull/298
Refactor int4 and int8 weight only quantization to use quantize https://github.com/pytorch/ao/pull/301
Adding a quick way for users to test model eval for hf models https://github.com/pytorch/ao/pull/328
Wrap torch.ops.quantized_decomposed to improve import errors https://github.com/pytorch/ao/pull/310
[NF4Tensor] Switch to save for backward since are now a tensor input https://github.com/pytorch/ao/pull/323
Refactor rest of tinygemm quant primitive ops https://github.com/pytorch/ao/pull/321
Move some util functions from quantization.utils to torchao.utils https://github.com/pytorch/ao/pull/337
Clean up FP6-LLM https://github.com/pytorch/ao/pull/304
Move quant ops to utils.py https://github.com/pytorch/ao/pull/331
FP6-LLM clean up (again) https://github.com/pytorch/ao/pull/339
Improving hf_eval.py https://github.com/pytorch/ao/pull/342
Generalize Model Size Code https://github.com/pytorch/ao/pull/364
Minor upgrades to bit pack https://github.com/pytorch/ao/pull/347
Factor out dispatch and layout registration table https://github.com/pytorch/ao/pull/360
Add register_apply_tensor_subclass https://github.com/pytorch/ao/pull/366
Refactor custom FPx cast https://github.com/pytorch/ao/pull/363
Remove all dependencies except torch https://github.com/pytorch/ao/pull/369
Enable a test for loading state_dict with tensor subclasses https://github.com/pytorch/ao/pull/389
073 scripts for benchmarks https://github.com/pytorch/ao/pull/372
Add WOQ int8 test with Inductor Freeze https://github.com/pytorch/ao/pull/362
Benchmarking updates for semi-structured sparse training https://github.com/pytorch/ao/pull/398
add FSDP QLoRA test and revert failing PR https://github.com/pytorch/ao/pull/403
Refactor the API for quant method argument for quantize function https://github.com/pytorch/ao/pull/400
eval script fixes https://github.com/pytorch/ao/pull/414

Bug Fixes

Fixed the HQQ import skip https://github.com/pytorch/ao/pull/262
fixing autoquant bug https://github.com/pytorch/ao/pull/265
Fix eval import after #275 https://github.com/pytorch/ao/pull/290
Fixed f-string printing of NF4Tensors https://github.com/pytorch/ao/pull/297
Check and fix dequantize_affine is idempotent https://github.com/pytorch/ao/pull/309
Update old pretrained TorchVision API in ao tutorials (#313) https://github.com/pytorch/ao/pull/314
Fix dimension issues for int4 weight only quant path https://github.com/pytorch/ao/pull/330
Fix compile in hf_eval.py https://github.com/pytorch/ao/pull/341
tasklist to tasks in hfeval https://github.com/pytorch/ao/pull/343
fixing peak memory stats for benchmark https://github.com/pytorch/ao/pull/353
Fix inductor config BC change https://github.com/pytorch/ao/pull/382
fixing scripts https://github.com/pytorch/ao/pull/395

Performance

FP8 splitgemm user defined triton kernel https://github.com/pytorch/ao/pull/263
sparse benchmarking numbers https://github.com/pytorch/ao/pull/303
Fix FP6-LLM benchmark https://github.com/pytorch/ao/pull/312
Adding Llama to TorchAO https://github.com/pytorch/ao/pull/276
Generalize Model Size Code https://github.com/pytorch/ao/pull/364
eval script for llama https://github.com/pytorch/ao/pull/374
077 autoquant gpt fast https://github.com/pytorch/ao/pull/361

Docs

add static folder for images + fix links https://github.com/pytorch/ao/pull/271
Fix Readme and remove unused kernel https://github.com/pytorch/ao/pull/270
Kernel docs https://github.com/pytorch/ao/pull/274
Quantization Docstrings https://github.com/pytorch/ao/pull/273
Add AffineQuantizedTensor based workflow doc and examples https://github.com/pytorch/ao/pull/277
Add AUTOQUANT_CACHE docs for reusing the same quantization plan https://github.com/pytorch/ao/pull/329
Update nightly build instructions https://github.com/pytorch/ao/pull/334
add link to benchmarking script https://github.com/pytorch/ao/pull/355
New README https://github.com/pytorch/ao/pull/392
Minor README updates https://github.com/pytorch/ao/pull/401
Add quantize to doc page https://github.com/pytorch/ao/pull/367
Add link to new custom op tutorial https://github.com/pytorch/ao/pull/424

Devs

ci: Add push trigger for binary build workflows https://github.com/pytorch/ao/pull/259
Make fp8 test explicit https://github.com/pytorch/ao/pull/266
Move AffineQuantizedTensor to torchao/dtypes https://github.com/pytorch/ao/pull/272
Add suffix to package version https://github.com/pytorch/ao/pull/293
Re-enable AOTI tests https://github.com/pytorch/ao/pull/212
Add fused QKV HQQ triton_mm test https://github.com/pytorch/ao/pull/306
Pin CUDA nightly to mitigate regression https://github.com/pytorch/ao/pull/322
Unpin CUDA nightly https://github.com/pytorch/ao/pull/333
Add architecture to index postfix for nightly builds https://github.com/pytorch/ao/pull/336
Update regression test to python 3.8 https://github.com/pytorch/ao/pull/340
Remove test_ops.py warning spew https://github.com/pytorch/ao/pull/267
Add torchao.version https://github.com/pytorch/ao/pull/359
make torchao test discovery pass in fbcode https://github.com/pytorch/ao/pull/351
use pytorch version env variable https://github.com/pytorch/ao/pull/373
Update prebuildscript.sh https://github.com/pytorch/ao/pull/390
Add support for building CUDA extension on Windows https://github.com/pytorch/ao/pull/396
Add trymerge https://github.com/pytorch/ao/pull/388
Fix github CI error https://github.com/pytorch/ao/pull/409
Fix missing dependencies in trymerge workflow https://github.com/pytorch/ao/pull/413
Setup trymerge secrets https://github.com/pytorch/ao/pull/416
Pin CUDA nightlies for mx failures https://github.com/pytorch/ao/pull/428
fix mx triton kernel after PyTorch triton pin change https://github.com/pytorch/ao/pull/431

Untopiced

Print the code when the check failed https://github.com/pytorch/ao/pull/254
Retry of D58015187 Move AsyncCompile to a different file by @jamesjwu in https://github.com/pytorch/ao/pull/302
Revert "Clean up FP6-LLM" https://github.com/pytorch/ao/pull/338
Update version to 0.3.0 https://github.com/pytorch/ao/pull/348
Add torchao.version https://github.com/pytorch/ao/pull/359

New Contributors

@seemethere made their first contribution in https://github.com/pytorch/ao/pull/259
@yiliu30 made their first contribution in https://github.com/pytorch/ao/pull/262
@vkuzo made their first contribution in https://github.com/pytorch/ao/pull/264
@vayuda made their first contribution in https://github.com/pytorch/ao/pull/291
@awgu made their first contribution in https://github.com/pytorch/ao/pull/297
@jamesjwu made their first contribution in https://github.com/pytorch/ao/pull/302
@kit1980 made their first contribution in https://github.com/pytorch/ao/pull/314
@RobinKa made their first contribution in https://github.com/pytorch/ao/pull/329
@andreaskoepf made their first contribution in https://github.com/pytorch/ao/pull/282
@clee2000 made their first contribution in https://github.com/pytorch/ao/pull/388

Full Changelog: https://github.com/pytorch/ao/compare/v0.2.0...v0.3.0-rc1

We were able to close about 60% of tasks for 0.3.0; the remaining tasks will spill over into upcoming releases. We will post a list for 0.4.0 next, which we aim to release at the end of July 2024. We want to follow a monthly release cadence until further notice.

EDIT: We made a patch release, 0.3.1, to include two more PRs, so ao now has no runtime dependencies: https://github.com/pytorch/ao/pull/449 and https://github.com/pytorch/ao/pull/455

- Python
Published by supriyar almost 2 years ago

torchao - v0.2.0

What's Changed

Highlights

Custom CPU/CUDA extension to ship CPU/CUDA binaries.

PyTorch core recently shipped a new custom op registration mechanism in torch.library. Its main benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably without causing graph breaks under torch.compile().

We've added documentation on how to register your own custom ops at https://github.com/pytorch/ao/tree/main/torchao/csrc, and if you learn better by example, you can follow this PR https://github.com/pytorch/ao/pull/135 to add your own custom ops to torchao.

Most notably these instructions were leveraged by @gau-nernst to integrate some new custom ops for fp6 support https://github.com/pytorch/ao/pull/223

One key benefit of integrating your kernels directly in torchao is that, thanks to our manylinux GPU support, we can ensure that the CPU/CUDA kernels you add will work on as many devices and CUDA versions as possible https://github.com/pytorch/ao/pull/176

A lot of prototype and community contributions

@jeromeku was our community champion, merging support for:

1. GaLore, our first pretraining kernel, which allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch
2. DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly https://github.com/pytorch/ao/tree/main/torchao/prototype/dora
3. Fused int4/fp16 quantized matmul, which is particularly useful for compute-bound kernels, showing 4x speedups over tinygemm for larger batch sizes such as 512 https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq

@gau-nernst merged fp6 support showing up to 8x speedups over an fp16 baseline for small batch size inference https://github.com/pytorch/ao/pull/223

NF4 support for upcoming FSDP2

@weifengpy merged support for composing FSDP2 with NF4 which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for how to compose smaller dtypes with FSDP https://github.com/pytorch/ao/pull/150 most notably by implementing torch.chunk(). We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research and inspires many more studies such as the ones done by Answer.ai https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html

BC breaking

Deprecations

New Features

Match autoquant API with torch.compile (https://github.com/pytorch/ao/pull/109, https://github.com/pytorch/ao/pull/162, https://github.com/pytorch/ao/pull/175)
[Prototype] 8da4w QAT (https://github.com/pytorch/ao/pull/138, https://github.com/pytorch/ao/pull/199, https://github.com/pytorch/ao/pull/198, https://github.com/pytorch/ao/pull/211, https://github.com/pytorch/ao/pull/154, https://github.com/pytorch/ao/pull/157, https://github.com/pytorch/ao/pull/229)
[Prototype] GaLore (https://github.com/pytorch/ao/pull/95)
[Prototype] DoRA (https://github.com/pytorch/ao/pull/216)
[Prototype] HQQ (https://github.com/pytorch/ao/pull/153, https://github.com/pytorch/ao/pull/185)
[Prototype] 2:4 sparse + int8 sparse subclass (https://github.com/pytorch/ao/pull/36)
[Prototype] Unified quantization primitives (https://github.com/pytorch/ao/pull/159, https://github.com/pytorch/ao/pull/201, https://github.com/pytorch/ao/pull/193, https://github.com/pytorch/ao/pull/220, https://github.com/pytorch/ao/pull/227, https://github.com/pytorch/ao/pull/173, https://github.com/pytorch/ao/pull/210)
[Prototype] Pruning primitives (https://github.com/pytorch/ao/pull/148, https://github.com/pytorch/ao/pull/194)
[Prototype] AffineQuantizedTensor subclass (https://github.com/pytorch/ao/pull/214, https://github.com/pytorch/ao/pull/230, https://github.com/pytorch/ao/pull/243, https://github.com/pytorch/ao/pull/247, https://github.com/pytorch/ao/pull/251)
[Prototype] Add Int4WeightOnlyQuantizer (https://github.com/pytorch/ao/pull/119)
Custom CUDA extensions (https://github.com/pytorch/ao/pull/135, https://github.com/pytorch/ao/pull/186, https://github.com/pytorch/ao/pull/232)
[Prototype] Add FP6 Linear (https://github.com/pytorch/ao/pull/223)

Improvements

FSDP2 support for NF4Tensor (https://github.com/pytorch/ao/pull/118, https://github.com/pytorch/ao/pull/150, https://github.com/pytorch/ao/pull/207)
Add save/load of int8 weight only quantized model (https://github.com/pytorch/ao/pull/122)
Add int_scaled_mm on CPU (https://github.com/pytorch/ao/pull/121)
Add cpu and gpu in int4wo and int4wo-gptq quantizer (https://github.com/pytorch/ao/pull/131)
Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (https://github.com/pytorch/ao/pull/146, https://github.com/pytorch/ao/pull/226, https://github.com/pytorch/ao/pull/213)
Remove is_gpt_fast specialization from GPTQ (https://github.com/pytorch/ao/pull/172)
Common benchmark and profile utils (https://github.com/pytorch/ao/pull/238)

Bug fixes

Fix padding in GPTQ (https://github.com/pytorch/ao/pull/119, https://github.com/pytorch/ao/pull/120)
Fix Int8DynActInt4WeightLinear module swap (https://github.com/pytorch/ao/pull/151)
Fix NF4Tensor.to to use device kwarg (https://github.com/pytorch/ao/pull/158)
Fix quantize_activation_per_token_absmax perf regression (https://github.com/pytorch/ao/pull/253)

Performance

Chunk NF4Tensor construction to reduce memory spike (https://github.com/pytorch/ao/pull/196)
Fix intmm benchmark script (https://github.com/pytorch/ao/pull/141)

Docs

Update READMEs (https://github.com/pytorch/ao/pull/140, https://github.com/pytorch/ao/pull/142, https://github.com/pytorch/ao/pull/169, https://github.com/pytorch/ao/pull/155, https://github.com/pytorch/ao/pull/179, https://github.com/pytorch/ao/pull/187, https://github.com/pytorch/ao/pull/188, https://github.com/pytorch/ao/pull/200, https://github.com/pytorch/ao/pull/217, https://github.com/pytorch/ao/pull/245)
Add https://pytorch.org/ao (https://github.com/pytorch/ao/pull/136, https://github.com/pytorch/ao/pull/145, https://github.com/pytorch/ao/pull/163, https://github.com/pytorch/ao/pull/164, https://github.com/pytorch/ao/pull/165, https://github.com/pytorch/ao/pull/168, https://github.com/pytorch/ao/pull/177, https://github.com/pytorch/ao/pull/195, https://github.com/pytorch/ao/pull/224)

CI

Add A10G support in CI (https://github.com/pytorch/ao/pull/176)
General CI improvements (https://github.com/pytorch/ao/pull/161, https://github.com/pytorch/ao/pull/171, https://github.com/pytorch/ao/pull/178, https://github.com/pytorch/ao/pull/180, https://github.com/pytorch/ao/pull/183, https://github.com/pytorch/ao/pull/107, https://github.com/pytorch/ao/pull/215, https://github.com/pytorch/ao/pull/244, https://github.com/pytorch/ao/pull/257, https://github.com/pytorch/ao/pull/235, https://github.com/pytorch/ao/pull/242)
Add expecttest to requirements.txt (https://github.com/pytorch/ao/pull/225)
Push button binary support (https://github.com/pytorch/ao/pull/241, https://github.com/pytorch/ao/pull/240, https://github.com/pytorch/ao/pull/250)

Not user facing

Security

Untopiced

Version bumps (https://github.com/pytorch/ao/pull/125, https://github.com/pytorch/ao/pull/234)
Don't import _C in fbcode (https://github.com/pytorch/ao/pull/218)

New Contributors

@Xia-Weiwen made their first contribution in https://github.com/pytorch/ao/pull/121
@jeromeku made their first contribution in https://github.com/pytorch/ao/pull/95
@weifengpy made their first contribution in https://github.com/pytorch/ao/pull/118
@aakashapoorv made their first contribution in https://github.com/pytorch/ao/pull/179
@UsingtcNower made their first contribution in https://github.com/pytorch/ao/pull/194
@Jokeren made their first contribution in https://github.com/pytorch/ao/pull/217
@gau-nernst made their first contribution in https://github.com/pytorch/ao/pull/223
@janeyx99 made their first contribution in https://github.com/pytorch/ao/pull/245
@huydhn made their first contribution in https://github.com/pytorch/ao/pull/250
@lancerts made their first contribution in https://github.com/pytorch/ao/pull/238

Full Changelog: https://github.com/pytorch/ao/compare/v0.2.0...v0.2.1

We were able to close about half of the tasks for 0.2.0; the remaining tasks will spill over into upcoming releases. We will post a list for 0.3.0 next, which we aim to release at the end of May 2024. We want to follow a monthly release cadence until further notice.

- Python
Published by cpuhrsch about 2 years ago

torchao - TorchAO 0.1.0: First Release

Highlights

We’re excited to announce the release of TorchAO v0.1.0! TorchAO is a repository that hosts architecture optimization techniques such as quantization and sparsity, along with performance kernels for different backends such as CUDA and CPU. In this release, we added support for a few quantization techniques such as int4 weight-only GPTQ quantization, nf4 dtype support for QLoRA, and sparsity features such as WandaSparsifier. We also added an autotuner that can tune Triton integer matrix multiplication kernels on CUDA.

Note: TorchAO is currently in a pre-release state and under extensive development. The public APIs should not be considered stable. But we welcome you to try out our APIs and offerings and provide any feedback on your experience.

torchao 0.1.0 will be compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0 and TorchTune 0.1.0.

New Features

Quantization

Added tensor subclass based quantization APIs: change_linear_weights_to_int8_dqtensors, change_linear_weights_to_int8_woqtensors and change_linear_weights_to_int4_woqtensors (#1)
Added module based quantization APIs for int8 dynamic and weight only quantization apply_weight_only_int8_quant and apply_dynamic_quant (#1)
Added module swap version of int4 weight only quantization Int4WeightOnlyQuantizer and Int4WeightOnlyGPTQQuantizer used in TorchTune (#119, #116)
Added int8 dynamic activation and int4 weight quantization Int8DynActInt4WeightQuantizer and Int8DynActInt4WeightGPTQQuantizer, used in ExecuTorch (#74) (available after torch 2.3.0 and later)

Sparsity

Added WandaSparsifier that prunes both weights and activations (#22)

Kernels

Added autotuner for int mm Triton kernels (#41)

dtypes

nf4 tensor subclass and nf4 linear (#37, #40, #62)
Added uint4 dtype tensor subclass (#13)

Improvements

Setup github workflow for regression testing (#50)
Setup github workflow for torchao-nightly release (#54)

Documentation

Added tutorials for quantizing vision transformer model (#60)
Added tutorials for how to add an op for nf4 tensor (#54)

Notes

We are still debugging the accuracy problem for Int8DynActInt4WeightGPTQQuantizer
Save and load does not work well for tensor subclass based APIs yet
We will consolidate tensor subclass and module swap based quantization APIs later
uint4 tensor subclass is going to be merged into pytorch core in the future
Quantization ops in quant_primitives.py will be deduplicated with similar quantize/dequantize ops in PyTorch later

- Python
Published by jerryzh168 about 2 years ago
