Recent Releases of torchao

torchao - v0.13.0

Highlights

We are excited to announce the 0.13.0 release of torchao! This release adds support for numerous QAT improvements, faster mxfp8 pretraining and more!

Simpler Multi-step QAT API (https://github.com/pytorch/ao/pull/2629)

We added a new, simpler, multi-step QAT API that uses only a single config. Now users can specify the target post-training quantization (PTQ) config as the base config and we will automatically infer the correct fake quantize configs to use!

```py
from torchao.quantization import (
    quantize_,
    Int8DynamicActivationInt4WeightConfig
)
from torchao.quantization.qat import QATConfig

# prepare
base_config = Int8DynamicActivationInt4WeightConfig(group_size=32)
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))
```

For more advanced use cases, users can continue to specify specific FakeQuantizeConfigs as before:

```py
import torch
from torchao.quantization import quantize_
from torchao.quantization.qat import IntxFakeQuantizeConfig, QATConfig

# prepare
activation_config = IntxFakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = IntxFakeQuantizeConfig(torch.int4, group_size=32)
qat_config = QATConfig(
    activation_config=activation_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)
```
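As background for the configs above, the following is a minimal pure-Python sketch of what "fake quantization" does numerically during QAT: each group of weights is rounded onto a symmetric int4 grid and immediately dequantized, so the model trains against quantization error while staying in floating point. The helper names (`fake_quantize_group`, `fake_quantize`) and the max-abs scale choice are assumptions made for illustration; this is not torchao's implementation.

```python
def fake_quantize_group(values, num_bits=4):
    """Quantize then dequantize one group of floats on a symmetric integer grid."""
    qmin = -(2 ** (num_bits - 1))   # -8 for int4
    qmax = 2 ** (num_bits - 1) - 1  # 7 for int4
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax  # assumed scale choice: map the largest magnitude to qmax
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # round, then clamp to the grid
        out.append(q * scale)                        # dequantize back to float
    return out


def fake_quantize(weights, group_size=32, num_bits=4):
    """Apply fake quantization independently to each consecutive group."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(fake_quantize_group(weights[start:start + group_size], num_bits))
    return out


weights = [0.7, -0.32, 0.12, 0.0]
fq = fake_quantize(weights, group_size=4)
```

With `group_size=4`, the four example weights share a single scale of roughly 0.1, so every output is a multiple of that step and the rounding error per weight is at most half a step.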

(Prototype) NVFP4 and FP8 QAT (https://github.com/pytorch/ao/pull/2735, https://github.com/pytorch/ao/pull/2666)

We generalized QAT to support FP8 and NVFP4 use cases. You can try them out as follows:

```py
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationInt4WeightConfig,
    Float8DynamicActivationFloat8WeightConfig,
    Float8WeightOnlyConfig,
)
from torchao.prototype.mx_formats import NVFP4InferenceConfig
from torchao.quantization.qat import QATConfig

# Pick a base config
base_config = Float8DynamicActivationInt4WeightConfig()
# or base_config = Float8DynamicActivationFloat8WeightConfig()
# or base_config = NVFP4InferenceConfig()

# prepare
qat_config = QATConfig(base_config, step="prepare")
quantize_(m, qat_config)

# train (not shown)

# convert
quantize_(m, QATConfig(base_config, step="convert"))
```

Users can also use the more specific FakeQuantizeConfigs for more advanced use cases, e.g.:

```py
import torch
from torchao.quantization import PerRow, quantize_
from torchao.quantization.qat import Float8FakeQuantizeConfig, QATConfig
from torchao.prototype.qat import NVFP4FakeQuantizeConfig

act_config = Float8FakeQuantizeConfig(torch.float8_e4m3fn, PerRow())
weight_config = NVFP4FakeQuantizeConfig(use_per_tensor_scale=True)

# prepare
qat_config = QATConfig(
    activation_config=act_config,
    weight_config=weight_config,
    step="prepare",
)
quantize_(model, qat_config)

# train and convert (not shown)
```

(Prototype) 1.2x MXFP8 dense pretraining speedups with torchtitan

We landed performance improvements (such as a faster to_mx dim1 cast) to our prototype MXFP8 training APIs, and we now achieve a 1.2x speedup vs bf16 on pretraining LLaMa 3 8B on NVIDIA B200. Please see our training benchmarks README for more information.

torchao float8 training now integrated into axolotl!

You can now use torchao.float8 directly from axolotl to achieve finetuning QPS e2e speedups of up to 1.1x on 3B parameter models (docs, release notes).

BC Breaking

Float8DynamicActivationFloat8WeightConfig and Float8WeightOnlyConfig version bump to 2 (https://github.com/pytorch/ao/pull/2650)

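We updated the implementation of the float8 Tensor, so the default version of these two configs is bumped from 1 to 2. Loading a checkpoint that was quantized with version 1 now emits warnings like the following: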

```
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "torchao-testing/opt-125m-Float8DynamicActivationFloat8WeightConfig-v1-0.13.dev"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="bfloat16",
    device_map="cuda",
)

/data/users/jerryzh/ao/torchao/core/config.py:249: UserWarning: Stored version is not the same as current default version of the config: stored_version=1, current_version=2, please check the deprecation warning
  warnings.warn(
/data/users/jerryzh/ao/torchao/dtypes/floatx/float8_layout.py:113: UserWarning: Models quantized with version 1 of Float8DynamicActivationFloat8WeightConfig is deprecated and will no longer be supported in a future release, please upgrade torchao and quantize again, or download a newer torchao checkpoint, see https://github.com/pytorch/ao/issues/2649 for more details
  warnings.warn(
```

Suggestion: upgrade torchao to 0.13 or later and generate the checkpoint again:

quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

Or download the checkpoint again (please let us know if the checkpoint has not been updated).

Please see https://github.com/pytorch/ao/issues/2649 for more details around the deprecation.

QAT API Changes (https://github.com/pytorch/ao/pull/2628, https://github.com/pytorch/ao/pull/2641)

On a high level, the following existing APIs are deprecated and replaced by these new ones. Although this is technically BC-breaking due to typing changes, it will not affect most users as old classes are kept around for now. They are planned to be removed in the next release, however.

    IntXQuantizationAwareTrainingConfig -> QATConfig
    FromIntXQuantizationAwareTrainingConfig -> QATConfig
    FakeQuantizeConfig -> IntxFakeQuantizeConfig
    FakeQuantizer -> IntxFakeQuantizer

Please see https://github.com/pytorch/ao/issues/2630 and the latest QAT README for more information on how to migrate.

Remove old change_linear_weights_to_* APIs (https://github.com/pytorch/ao/pull/2721)

The following old quantization APIs have been removed and no longer work:

    change_linear_weights_to_int8_dqtensors(model)
    change_linear_weights_to_int8_woqtensors(model)
    change_linear_weights_to_int4_woqtensors(model)

Please use the quantize_ API with the following configs instead:

    quantize_(model, Int8WeightOnlyConfig())
    quantize_(model, Int4WeightOnlyConfig())

Deprecations

Deprecate old TORCH_VERSION variables (https://github.com/pytorch/ao/pull/2719)

The following variables are deprecated and will be removed in the next release:

    TORCH_VERSION_AT_LEAST_2_2
    TORCH_VERSION_AT_LEAST_2_3
    TORCH_VERSION_AT_LEAST_2_4
    TORCH_VERSION_AT_LEAST_2_5
    TORCH_VERSION_AT_LEAST_2_6
    TORCH_VERSION_AT_LEAST_2_7
    TORCH_VERSION_AT_LEAST_2_8
    TORCH_VERSION_AFTER_2_2
    TORCH_VERSION_AFTER_2_3
    TORCH_VERSION_AFTER_2_4
    TORCH_VERSION_AFTER_2_5

Drop support for PyTorch 2.5 and before (https://github.com/pytorch/ao/pull/2720)

torchao only supports the latest 3 versions of PyTorch. Please upgrade to PyTorch 2.6.0+ if you were using an older version of PyTorch.

New Features

Improvements

Bug Fixes

Performance

Documentation

Developers

New Contributors

Full Changelog: https://github.com/pytorch/ao/compare/v0.12.0...v0.13.0-rc1

- Python
Published by vkuzo 9 months ago

torchao - v0.12.0

Highlights

We are excited to announce the 0.12.0 release of torchao! This release adds support for QAT + Axolotl Integration and prototype MXFP/NVFP support on Blackwell GPUs!

QAT + Axolotl Integration

TorchAO’s QAT support has been integrated into Axolotl’s fine-tuning recipes! Check out the docs here or run it yourself using the following command:

    axolotl train examples/llama-3/3b-qat-fsdp2.yaml
    axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml

Initial results for Llama3.2-3B by @SalmanMohammadi (https://github.com/axolotl-ai-cloud/axolotl/pull/2590):

| Model/Metric | hellaswag acc | hellaswag acc_norm | wikitext bits_per_byte | wikitext byte_perplexity | wikitext word_perplexity |
|--------------|---------------|--------------------|------------------------|--------------------------|--------------------------|
| bfloat16 | 0.5552 | 0.7315 | 0.6410 | 1.5594 | 10.7591 |
| bfloat16 PTQ | 0.5393 | 0.7157 | 0.6613 | 1.5815 | 11.6033 |
| qat ptq | 0.5423 | 0.7180 | 0.6567 | 1.5764 | 11.4043 |
| Recovered (qat ptq) | 18.87% | 14.56% | 22.66% | 23.08% | 23.57% |
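The "Recovered" row is consistent with the share of the PTQ degradation that QAT wins back, that is, (PTQ - QAT) / (PTQ - bfloat16). This formula is inferred from the numbers in the table rather than stated in the source, and the helper name `recovered_fraction` is hypothetical. A short sketch that reproduces the row:

```python
def recovered_fraction(bf16: float, ptq: float, qat_ptq: float) -> float:
    """Fraction of the bf16 -> PTQ gap that QAT + PTQ closes.

    The same expression works for higher-is-better metrics (accuracy)
    and lower-is-better metrics (perplexity), because the sign of the
    numerator and denominator flips together.
    """
    return (ptq - qat_ptq) / (ptq - bf16)


# (bfloat16, bfloat16 PTQ, qat ptq) values copied from the table above
results = {
    "hellaswag acc": (0.5552, 0.5393, 0.5423),
    "hellaswag acc_norm": (0.7315, 0.7157, 0.7180),
    "wikitext bits_per_byte": (0.6410, 0.6613, 0.6567),
    "wikitext byte_perplexity": (1.5594, 1.5815, 1.5764),
    "wikitext word_perplexity": (10.7591, 11.6033, 11.4043),
}

recovered = {
    name: round(100 * recovered_fraction(*values), 2)
    for name, values in results.items()
}
for name, pct in recovered.items():
    print(f"{name}: {pct:.2f}%")
```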

[Prototype | API not finalized] MXFP and NVFP support on Blackwell GPUs

TorchAO now includes prototype support for NVFP4 (NVIDIA's 4-bit floating-point format) and Microscaling (MX) formats on NVIDIA's latest Blackwell GPU architecture. These formats enable efficient inference, achieving up to 61% end-to-end performance improvement in vLLM on Qwen3 models and near 2x speedups for diffusion workloads.

To use:

```py
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import (
    MXFPInferenceConfig,
    NVFP4InferenceConfig,
)

# Quantize model with MXFP8
# (quantize_ modifies the model in place, so its result is not reassigned)
quantize_(model, MXFPInferenceConfig(block_size=32))

# Or quantize model to NVFP4 (without double scaling)
quantize_(model, NVFP4InferenceConfig())
```

Note: This is a prototype feature with APIs subject to change. Requires NVIDIA Blackwell GPUs (B200, 5090) with CUDA 12.8+.

BC Breaking

Deprecations

New Features

Improvement

Bug Fixes

Performance

Documentation

Developers

New Contributors

Full Changelog: https://github.com/pytorch/ao/compare/v0.11.0...v0.12.0-rc2

Published by drisspg 11 months ago

torchao - v0.11.0

Highlights

We are excited to announce the 0.11.0 release of torchao! This release adds support for mixture-of-experts (MoE) quantization, PyTorch 2 Export Quantization (PT2E), and a microbenchmarking framework for inference APIs!

MoE Quantization

We’ve added a prototype feature for quantizing MoE modules with a number of TorchAO quantization techniques. This approach leverages the existing TorchAO features for quantizing linear ops and allows them to be used to quantize MoE modules.

```py
from torchao.quantization.prototype.moe_quant.utils import cond_ffn_filter, MoEQuantConfig
from torchao.quantization.quant_api import quantize_, Int8WeightOnlyConfig

quantize_(
    model,
    MoEQuantConfig(Int8WeightOnlyConfig()),
    filter_fn=cond_ffn_filter
)
model = torch.compile(
    model,
    mode="reduce-overhead",
    fullgraph=is_single_token_inference
)
```

The above API is all that is needed to quantize a MoE module, provided the module is written to be both quantizable and compilable. In practice it’s rare for a user model to satisfy these conditions, due to the variety of MoE implementations. An initial swap of the normal MoE module with a MoEFeedForwardAOQuantizable module is needed to first prepare the model for quantization. An example of this can be found in llama4_quant.py, where this technique is demonstrated for the Hugging Face llama-4-Scout-17B-16E-Instruct model.

We implemented MoE quantization with 2 methods. The first method (designated `base` in the below benchmarks) simply enhances the existing quantized tensor subclass to quantize the 3D MoE expert tensors and perform the necessary indexing and slicing ops while the second method (`fake`), uses a new tensor subclass to simulate a 3D quantized parameter by storing a sequence of 2D slices of the quantized parameter. The first approach is faster with marginally worse memory characteristics. In both cases doing MoE quantization in this way isn’t expected to be maximally performant compared to implementing fused MoE kernels for each technique, but this approach can yield both moderate speedups and significant memory savings.

The following benchmarks are for mixtral-moe run on a single H100 GPU:

| | batchsize 1 | | batchsize 8 | | |
|-------------|-------------|-------------|-------------|--------------|-------------|
| Technique | tok/s | memory (GB) | tok/s | tok/s* batch | memory (GB) |
| None | 78.35 | 93.76 | 18.2 | 145.64 | 94.12 |
| int8wo-base | 98.4 | 48.87 | 4.94 | 39.56 | 49.2 |
| int4wo-base | 79.38 | 36.15 | 10.29 | 82.29 | 36.12 |
| fp8wo-base | 59.41 | 52.07 | 2.98 | 23.81 | 52.05 |
| fp8dq-base | 45.92 | 53.97 | 3.78 | 30.23 | 53.94 |
| int8wo-fake | 6.14 | 49.13 | 5.01 | 40.09 | 49.23 |
| int4wo-fake | 14.25 | 30.21 | 11.84 | 94.75 | 30.19 |
| fp8wo-fake | 3.2 | 50.31 | 2.88 | 23.08 | 50.29 |
| fp8dq-fake | 9.78 | 50.92 | 4.08 | 32.61 | 50.89 |

PT2 Export Quantization

We added PyTorch 2 export (PT2E) quantization to torchao, moved over from PyTorch as part of the planned migration. We’ll follow up by adding deprecation warnings to the PyTorch torch.ao.quantization APIs and updating the docs. We also simplified the import path for some of the utility functions. Here is a non-exhaustive list of APIs you can use:

```
# top level APIs
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, prepare_qat_pt2e, convert_pt2e
from torchao.quantization.pt2e.quantizer import X86InductorQuantizer

# export utils
from torchao.quantization.pt2e import (
    move_exported_model_to_eval,
    move_exported_model_to_train,
    allow_exported_model_train_eval
)

# graph utils
from torchao.quantization.pt2e import (
    find_sequential_partitions,
    get_equivalent_types,
    update_equivalent_types_dict,
    bfs_trace_with_node_process,
)

# pt2e numeric debugger
from torchao.quantization.pt2e import (
    generate_numeric_debug_handle,
    CUSTOM_KEY,
    NUMERIC_DEBUG_HANDLE_KEY,
    prepare_for_propagation_comparison,
    extract_results_from_loggers,
    compare_results,
)
```

Microbenchmarking Framework for Inference APIs

We’ve introduced a streamlined microbenchmark framework to help developers track and evaluate the performance of their post-training quantization and sparsity APIs across different matrix sizes and model types. The framework also supports GPU and memory profiling, providing deeper insight into performance characteristics.

To run the benchmarks, use the following command:

python -m benchmarks.microbenchmarks.benchmark_runner --config benchmarks/microbenchmarks/test/benchmark_config.yml

Sample Benchmark Results (on 1xH100):

| Name | Quantization | Shape | Baseline Inference Time (ms) | Inference Time (ms) | Speedup |
|-------------------|-----------------|---------------------|------------------------------|---------------------|---------|
| small_bf16_linear | float8dq-tensor | 16384, 16384, 16384 | 13.34 | 7.72 | 1.73x |
| small_bf16_linear | float8dq-tensor | 16384, 16384, 32768 | 26.04 | 14.62 | 1.78x |
| small_bf16_linear | float8dq-tensor | 16384, 16384, 65536 | 53.59 | 29.05 | 1.84x |
| small_bf16_linear | float8dq-tensor | 16384, 32768, 32768 | 68.94 | 28.07 | 2.46x |
| small_bf16_linear | float8dq-tensor | 16384, 32768, 65536 | 108.63 | 58.7 | 1.85x |
| small_bf16_linear | float8dq-tensor | 16384, 65536, 65536 | 215.66 | 118.42 | 1.82x |
| small_bf16_linear | float8dq-tensor | 32768, 32768, 32768 | 108.16 | 57.09 | 1.89x |
| small_bf16_linear | float8dq-tensor | 32768, 32768, 65536 | 214.74 | 110.08 | 1.95x |
| small_bf16_linear | float8dq-tensor | 32768, 65536, 65536 | 432.44 | 223.46 | 1.94x |
| small_bf16_linear | float8dq-tensor | 65536, 65536, 65536 | 870.37 | 447.97 | 1.94x |
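The "Speedup" column is the baseline time divided by the quantized time. The sketch below recomputes it for three rows and also estimates achieved throughput. The throughput estimate assumes the "Shape" column is (M, K, N) for a single matmul costing 2·M·K·N floating-point operations, which the source does not state; the helper names are hypothetical.

```python
def speedup(baseline_ms: float, quantized_ms: float) -> float:
    """Speedup as reported in the table: baseline time over quantized time."""
    return baseline_ms / quantized_ms


def tflops(shape, time_ms: float) -> float:
    """Achieved TFLOP/s, ASSUMING shape is (M, K, N) for one matmul."""
    m, k, n = shape
    flops = 2 * m * k * n  # one multiply and one add per (m, k, n) triple
    return flops / (time_ms * 1e-3) / 1e12


# (shape, baseline ms, quantized ms) copied from the table above
rows = [
    ((16384, 16384, 16384), 13.34, 7.72),
    ((16384, 32768, 32768), 68.94, 28.07),
    ((65536, 65536, 65536), 870.37, 447.97),
]
for shape, base_ms, quant_ms in rows:
    print(
        shape,
        f"{speedup(base_ms, quant_ms):.2f}x",
        f"{tflops(shape, quant_ms):.0f} TFLOP/s (estimated)",
    )
```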

BC Breaking

New Features

Improvement

Bug Fixes

Performance

Documentation

Developers

New Contributors

Full Changelog: https://github.com/pytorch/ao/compare/v0.10.0...v0.11.0

Published by andrewor14 about 1 year ago

torchao - v0.10.0

Highlights

We are excited to announce the 0.10.0 release of torchao! This release adds support for end-to-end training for mxfp8 on NVIDIA B200, PARQ (for quantization-aware training), a module swap quantization API for research, and updates for low-bit kernels!

Low Bit Optimizers moved to Official Support (https://github.com/pytorch/ao/pull/1864)

Low-bit optimizers (added in 0.4) have moved out of prototype and now have official support in torchao.

[Prototype] End to End Training Support for mxfp8 on NVIDIA B200 (#1786, #1841, #1951, #1932, #1980)

We have an early version of the end to end training workflow for the mxfp8 dtypes with torch.compile on NVIDIA B200, with the cuBLAS mxfp8 gemm seeing an observed speedup of over 2x over bfloat16 gemm, and casts from bfloat16 to mxfp8 achieving up to 5.5 TB/s. Please see our README.md for MX for more information. We plan to improve performance further in future releases.

[Prototype] Piecewise-Affine Regularized Quantization (https://github.com/pytorch/ao/pull/1738)

  • PARQ is a new theoretical framework for inducing quantization through regularization. It supports standard QAT, as well as new gradual quantization methods, in an easy to use optimizer-only interface. No modifications to a model’s forward or backward pass are needed for quantization.

```py
from torchao.prototype.parq.optim import QuantOptimizer, ProxHardQuant
from torchao.prototype.parq.quant import UnifQuantizer

# Separate quantizable from non-quantizable parameter groups
param_groups = [
    {"params": weights, "quant_bits": 2},  # add extra quant_bits key for QAT
    {"params": others},
]

# Initialize any torch.optim.Optimizer
base_optimizer = torch.optim.SGD(param_groups, lr=0.1, momentum=0.9, weight_decay=1e-4)

# Apply a simple wrapper to quantize in optimizer.step()
optimizer = QuantOptimizer(
    base_optimizer, quantizer=UnifQuantizer(), prox_map=ProxHardQuant()
)
```
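For intuition about "quantize in optimizer.step()", here is a rough pure-Python sketch: take an ordinary SGD step on full-precision (latent) weights, then map each weight to the nearest level of a uniform 2-bit grid. All helper names are hypothetical, and the evenly spaced grid between -max|w| and +max|w| is an assumption chosen for simplicity; it is not PARQ's actual quantizer or proximal map.

```python
def uniform_levels(max_abs, num_bits):
    """Evenly spaced quantization levels covering [-max_abs, max_abs]."""
    n = 2 ** num_bits
    step = 2 * max_abs / (n - 1)
    return [-max_abs + i * step for i in range(n)]


def hard_quantize(weights, num_bits):
    """Project every weight onto its nearest quantization level."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = uniform_levels(max_abs, num_bits)
    return [min(levels, key=lambda q: abs(q - w)) for w in weights]


def sgd_step(weights, grads, lr):
    """Plain SGD update on the latent full-precision weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


latent = [1.0, -0.25, 0.5, -1.0]
grads = [1.0, -0.5, 1.0, 2.0]

latent = sgd_step(latent, grads, lr=0.1)       # full-precision update
quantized = hard_quantize(latent, num_bits=2)  # at most 4 distinct values
```

The model's forward and backward passes are untouched in this picture: only the step that writes weights back applies the projection, which matches the optimizer-only interface described above.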

[Prototype] Module Swap Quantization API (https://github.com/pytorch/ao/pull/1886)

We added a prototype API for post-training quantization. Users can swap their linear or embedding layers into their QuantizedLinear and QuantizedEmbedding counterparts, and set the quantizers that specify how they want the input activations or weights to be quantized:

    quantized_linear = QuantizedLinear(...)
    quantized_linear.weight_quantization = IntQuantizer(
        num_bits=4,
        group_size=32,
        dynamic=True,
        quantization_mode="symmetric",
    )
    quantized_linear.input_quantization = CodeBookQuantizer(
        num_bits=8,
        features=10,
    )

Note: The API is highly subject to change and will be integrated with quantize_ in the future. For more detail, please see the README.

[Prototype] Low Bit Kernels (#1826, #1935, #1998, #1652)

Low-bit CPU and MPS kernels are now pip installable from source. To install torchao with low-bit CPU kernels, you can use the following command on an Arm-based Mac:

USE_CPP=1 pip install git+https://github.com/pytorch/ao.git

You can then quantize your model to run on Arm-based Macs with high-performance CPU kernels in torchao. SharedEmbeddingQuantizer, EmbeddingQuantizer, and Int8DynamicActivationIntxWeightConfig all support 1-8 bit quantization.

```py
from torchao.experimental.quant_api import Int8DynamicActivationIntxWeightConfig, SharedEmbeddingQuantizer, EmbeddingQuantizer
from torchao.quantization.granularity import PerGroup, PerRow
from torchao.quantization.quant_api import quantize_

# Quantize embedding/unembedding to 8-bits with SharedEmbeddingQuantizer
# SharedEmbeddingQuantizer is for quantizing models like Llama1B/3B
# where the embedding/unembedding layers share weights
# If the embedding/unembedding layers do not share weights, use
# EmbeddingQuantizer instead
SharedEmbeddingQuantizer(
    weight_dtype=torch.int8,
    granularity=PerRow(),
    has_weight_zeros=True
).quantize(model)

# Quantize linear layers to 4-bits
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4,
        granularity=PerGroup(128),
        has_weight_zeros=False,
    )
)
```

BC Breaking

Delete delayed scaling from torchao.float8 (https://github.com/pytorch/ao/pull/1753)

The following usage of `Float8LinearConfig` is no longer supported in torchao v0.10.0:

    config = Float8LinearConfig(
        cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
        cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
        cast_config_grad_output=CastConfig(scaling_type=ScalingType.DELAYED),
    )

If you would like to use float8 training with delayed scaling, please use an earlier release of torchao. Please see https://github.com/pytorch/ao/issues/1680 for more context about this deprecation.

Enforce AOBaseConfig type in quantize_'s config argument (https://github.com/pytorch/ao/pull/1861)

This was done following a deprecation window to simplify the arguments of quantize_, please see https://github.com/pytorch/ao/issues/1690 for more context.

```py
# torchao v0.9.0
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,
    device: Optional[torch.types.Device] = None,
): ...

# torchao v0.10.0
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    filter_fn: Optional[Callable[[torch.nn.Module, str], bool]] = None,
    set_inductor_config: Optional[bool] = None,
    device: Optional[torch.types.Device] = None,
): ...
```

Remove the set_inductor_config argument of quantize_. (https://github.com/pytorch/ao/pull/1865)

This was done following a deprecation window to decouple quantize_ from torchinductor, please see https://github.com/pytorch/ao/issues/1715 for more context.

```py
# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):
    # if set_inductor_config != None, throw a deprecation warning
    # if set_inductor_config == None, set it to True to stay consistent with old behavior

# torchao v0.10.0
def quantize_(
    ...,
):
    # set_inductor_config is removed from `quantize_` and moved to relevant individual workflows
```

Deprecations

We removed some of our prototype features that are not used, including DORA (https://github.com/pytorch/ao/pull/1815), split_k kernel (https://github.com/pytorch/ao/pull/1816), profiler (https://github.com/pytorch/ao/pull/1862) and bitnet (https://github.com/pytorch/ao/pull/1866).

New Features

QAT

Low Bit Optimizers

Module swap quantization API

Benchmarking

Improvement

Kernels

AOConfigs

SAM2

QAT

MX

Affine Quantization

Bug Fixes

Performance

Documentation

New Contributors

Full Changelog: https://github.com/pytorch/ao/compare/v0.9.0...v0.10.0-rc1

Published by jerryzh168 about 1 year ago

torchao - v0.9.0

Highlights

We are excited to announce the 0.9.0 release of torchao! This release moves a number of sparsity techniques out of prototype and includes a significant overhaul of the quantize_ API, a new CUTLASS kernel for 4-bit dynamic quantization, and more!

Block Sparsity promoted out of prototype

We’ve promoted block sparsity out of torchao.prototype and made several performance improvements. You can accelerate your models with block sparsity as follows:

    from torchao.sparsity import sparsify_, block_sparse_weight
    sparsify_(model, block_sparse_weight(blocksize=64))

Blocksparse Benchmarks

| Technique | Decode (tok/s) | Model Size (GB) |
|------------------------------|----------------|-----------------|
| baseline | 134.40 | 15.01 |
| 2:4 sparse | 163.13 | 10.08 |
| bsr-0.8-32 | 210.91 | 6.01 |
| bsr-0.8-64 | 222.43 | 6.00 |
| bsr-0.9-32 | 255.19 | 4.88 |
| bsr-0.9-64 | 262.94 | 4.88 |
| 2:4 sparse + int4wo (marlin) | 255.21 | 3.89 |

Block Sparsity technique names (bsr) indicate sparsity fraction and blocksize.

These numbers were generated on H100 using torchao/_models/llama/generate.py on the Meta-Llama-3.1-8B model. You can reproduce these numbers using this script
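To make "sparsity fraction and blocksize" concrete, the sketch below zeroes out whole square blocks of a small matrix until the requested fraction of blocks is empty. Choosing blocks by lowest L1 norm is an assumption made purely for illustration, and `block_sparsify` is a hypothetical helper; the source does not describe how the benchmarked masks were chosen, and torchao's kernels operate on a compressed block-sparse layout rather than a dense matrix with zeros.

```python
def block_sparsify(matrix, blocksize, sparsity):
    """Zero the lowest-magnitude blocks so that `sparsity` of all blocks are empty."""
    rows, cols = len(matrix), len(matrix[0])

    # Collect (L1 norm, top row, left column) for every block.
    blocks = []
    for r in range(0, rows, blocksize):
        for c in range(0, cols, blocksize):
            norm = sum(
                abs(matrix[i][j])
                for i in range(r, min(r + blocksize, rows))
                for j in range(c, min(c + blocksize, cols))
            )
            blocks.append((norm, r, c))

    num_pruned = int(sparsity * len(blocks))
    out = [row[:] for row in matrix]  # copy so the input is left untouched
    for _, r, c in sorted(blocks)[:num_pruned]:
        for i in range(r, min(r + blocksize, rows)):
            for j in range(c, min(c + blocksize, cols)):
                out[i][j] = 0.0
    return out


dense = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [9.0, 9.0, 2.0, 2.0],
    [9.0, 9.0, 2.0, 2.0],
]
sparse = block_sparsify(dense, blocksize=2, sparsity=0.5)
```

Because entire blocks are zero, a block-sparse format only needs to store and multiply the surviving blocks, which is where the decode speedups and smaller model sizes in the table come from.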

BC Breaking

TorchAO M1 Binaries currently not working

We've identified that the binaries are broken on M1 and have been since v0.8.0, though they were working in v0.7.0. We're working on a fix for this; details and discussion can be found here.

quantize_ configuration callables -> configs (https://github.com/pytorch/ao/pull/1595, https://github.com/pytorch/ao/pull/1694, https://github.com/pytorch/ao/pull/1696, https://github.com/pytorch/ao/pull/1697)

We are migrating the way quantize_ workflows are configured from callables (tensor subclass inserters) to direct configuration (config objects). Motivation: align with the rest of the ecosystem, enable inspection of configs after instantiation, remove a common source of confusion.

What is changing:

Specifically, here is how the signature of quantize_'s second argument will change:

```python
# torchao v0.8.0 and before
def quantize_(
    model: torch.nn.Module,
    apply_tensor_subclass: Callable[[torch.nn.Module], torch.nn.Module],
    ...,
): ...

# torchao v0.9.0
def quantize_(
    model: torch.nn.Module,
    config: Union[AOBaseConfig, Callable[[torch.nn.Module], torch.nn.Module]],
    ...,
): ...

# torchao v0.10.0 or later (exact version TBD)
def quantize_(
    model: torch.nn.Module,
    config: AOBaseConfig,
    ...,
): ...
```

  1. the name of the second argument to quantize_ changed from apply_tensor_subclass to config. Since the vast majority of callsites today are passing in configuration with a positional argument, this change should not affect most people.
  2. the type of the second argument to quantize_ will change from Callable[[torch.nn.Module], torch.nn.Module] to config: AOBaseConfig, following a deprecation process detailed below.
  3. for individual workflows, the user facing API name changed from snake case (int8_weight_only) to camel case (Int8WeightOnlyConfig). All argument names for each config are kept as-is. We will keep the old snake case names (int8_weight_only) around and alias them to the new names (int8_weight_only = Int8WeightOnlyConfig), to avoid breaking callsites. We plan to keep the old names forever. Here are all the workflow config name changes:

| old name (will keep working) | new name (recommended) |
| --- | --- |
| int4_weight_only | Int4WeightOnlyConfig |
| float8_dynamic_activation_float8_weight | Float8DynamicActivationFloat8WeightConfig |
| float8_static_activation_float8_weight | Float8StaticActivationFloat8WeightConfig |
| float8_weight_only | Float8WeightOnlyConfig |
| fpx_weight_only | FPXWeightOnlyConfig |
| gemlite_uintx_weight_only | GemliteUIntXWeightOnlyConfig |
| int4_dynamic_activation_int4_weight | Int4DynamicActivationInt4WeightConfig |
| int8_dynamic_activation_int4_weight | Int8DynamicActivationInt4WeightConfig |
| int8_dynamic_activation_int8_semi_sparse_weight | n/a (deprecated) |
| int8_dynamic_activation_int8_weight | Int8DynamicActivationInt8WeightConfig |
| int8_weight_only | Int8WeightOnlyConfig |
| uintx_weight_only | UIntXWeightOnlyConfig |

Configuration for prototype workflows using quantize_ will be migrated at a later time.

How these changes can affect you:

  1. If you are a user of existing quantize_ API workflows and are passing in config by a positional argument (quantize_(model, int8_weight_only(group_size=128))), you are not affected. This positional syntax will keep working going forward. You are encouraged to migrate your callsite to the new config name (quantize_(model, Int8WeightOnlyConfig(group_size=128))), though the old names will continue to work indefinitely.
  2. If you are a user of existing quantize_ API workflows and are passing in config by a keyword argument (quantize_(model, apply_tensor_subclass=int8_weight_only(group_size=128))), your callsite will break. You will need to change your callsite to quantize_(model, config=int8_weight_only(group_size=128)). We don't expect many people to be in this bucket.
  3. If you are a developer writing new workflows for the quantize_ API, you will need to use the new configuration system. Please see https://github.com/pytorch/ao/issues/1690 for details.
  4. If you are a user of sparsify_, you are not affected for now; a similar change will happen in a future version of torchao.
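The aliasing and the keyword-argument breakage described above can be sketched with stand-in classes. Everything below is hypothetical (`Int8WeightOnlyConfigSketch`, `int8_weight_only_sketch`, `quantize_sketch`) and only mirrors the pattern; it does not import or reproduce torchao's real classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Int8WeightOnlyConfigSketch:
    """Stand-in for a workflow config object (NOT the real torchao class)."""
    group_size: Optional[int] = None


# The old snake-case name is kept as an alias of the new config class,
# so existing positional callsites construct the same object.
int8_weight_only_sketch = Int8WeightOnlyConfigSketch


def quantize_sketch(model, config):
    """Stand-in for the new signature: the second argument is named `config`."""
    if not isinstance(config, Int8WeightOnlyConfigSketch):
        raise TypeError("config must be a config object")
    return config


# Positional callsite with the old name: keeps working.
config = quantize_sketch("model", int8_weight_only_sketch(group_size=128))

# Keyword callsite using the old argument name: breaks after the rename.
try:
    quantize_sketch("model", apply_tensor_subclass=int8_weight_only_sketch(group_size=128))
    keyword_call_broke = False
except TypeError:
    keyword_call_broke = True
```

Because the config is a plain object rather than a closure, its fields (here `config.group_size`) can be inspected after instantiation, which is one of the stated motivations for the migration.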

This migration will be a two-step process:

  • in torchao v0.9.0, we will enable the new syntax while starting the deprecation process for the old syntax.
  • in torchao v0.10.0 or later, we will remove the old syntax.

Please see https://github.com/pytorch/ao/issues/1690 for more details.

Block Sparsity imports after moving out of prototype (https://github.com/pytorch/ao/pull/1734)

Before:

    from torchao.prototype.sparsity.superblock.blocksparse import block_sparse_weight

After:

    from torchao.sparsity import block_sparse_weight

Deprecations

deprecation of the set_inductor_config argument of quantize_ (https://github.com/pytorch/ao/pull/1716)

We are migrating the set_inductor_config argument of quantize_ to individual workflows. Motivation:

  1. This functionality was intended for inference, and we don't want to expose it to future training workflows that we plan to add to quantize_.
  2. At a higher level, this flag couples torchao workflows with torch.compile, which is not ideal. We would rather keep these systems decoupled at the quantize_ API level, with individual workflows opting in as needed.

Impact on users
  • for torchao v0.9.0: if you are passing in set_inductor_config to quantize_, your callsite will keep working with a deprecation warning. We recommend that you migrate this option to your individual workflow.
  • for a future version of torchao: the set_inductor_config argument will be removed from quantize_.
API changes

```python
# torchao v0.8.x
def quantize_(
    ...,
    set_inductor_config: bool = True,
    ...,
): ...

# torchao v0.9.0
def quantize_(
    ...,
    set_inductor_config: Optional[bool] = None,
    ...,
):
    # if set_inductor_config != None, throw a deprecation warning
    # if set_inductor_config == None, set it to True to stay consistent with old behavior

# torchao v TBD (a future release)
def quantize_(
    ...,
):
    # set_inductor_config is removed from `quantize_` and moved to relevant individual workflows
```

Please see https://github.com/pytorch/ao/issues/1715 for more details.
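The v0.9.0 behavior described in the comments above is a common "None as sentinel" deprecation pattern. A minimal runnable sketch follows, using a hypothetical `quantize_sketch` function and warning text of my own; it illustrates the pattern and is not torchao's code.

```python
import warnings
from typing import Optional


def quantize_sketch(model, config, set_inductor_config: Optional[bool] = None):
    """Hypothetical stand-in for the v0.9.0 behavior; returns the resolved flag."""
    if set_inductor_config is not None:
        # The caller passed the argument explicitly: still honored, but warned about.
        warnings.warn(
            "set_inductor_config is deprecated; configure it in the individual workflow",
            DeprecationWarning,
            stacklevel=2,
        )
    else:
        # Argument omitted: fall back to True to stay consistent with old behavior.
        set_inductor_config = True
    return set_inductor_config


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_flag = quantize_sketch(object(), object())
    explicit_flag = quantize_sketch(object(), object(), set_inductor_config=False)
```

Changing the default from `True` to `None` is what lets the function tell "argument omitted" apart from "argument explicitly set", so only callers who actually rely on the flag see the warning.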

Deprecation warning for float8 training delayed and static scaling (https://github.com/pytorch/ao/pull/1681, https://github.com/pytorch/ao/issues/1680)

We plan to deprecate delayed and static scaling from the torchao.float8 training codebase, due to the lack of real-world use cases for delayed/static scaling (dynamic scaling is required for higher accuracy) and the complexity tax of supporting these features.

  • for torchao v0.9.0: add deprecation warning for delayed and static scaling
  • for torchao v0.10.0: deprecate delayed and static scaling

New Features

Supermask for improving accuracy for sparse models (https://github.com/pytorch/ao/pull/1729)

Supermask (https://pytorch.org/blog/speeding-up-vits/) is a technique for improving the accuracy of block sparsified models by learning a block-sparse mask during a training phase.

```python
from torchao.sparsity import SupermaskLinear, block_sparse_weight, sparsify_

sparsify_(model, lambda x: SupermaskLinear.from_linear(x, blocksize=64, sparsity_level=0.9))

# training here

# collapse supermask into a normal linear layer (with many weights set to 0)
# and then convert to block sparse format for inference speedup
sparsify_(model, lambda x: SupermaskLinear.to_linear(x, sparsity_level=0.9))
sparsify_(model, block_sparse_weight(blocksize=64))
```

Dynamic quantization W4A4 CUTLASS-based kernel (https://github.com/pytorch/ao/pull/1515)

This kernel, which adds support for 4-bit dynamic activation + 4-bit weight quantization, can be used as follows:

    from torchao.quantization import quantize_, int4_dynamic_activation_int4_weight
    quantize_(model, int4_dynamic_activation_int4_weight())

Improvements

Early prototype MXFP8 and MXFP4 training and inference support for NVIDIA Blackwell GPUs

In torchao v0.9.0, we include very early support for training and inference on the NVIDIA Blackwell GPUs following the microscaling recipes from https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf, and backed by real MX gemms.
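As a rough illustration of the microscaling idea, each block of elements shares one power-of-two scale. The sketch below computes that shared scale for an FP8 E4M3 block following the floor-based rule in the OCP MX v1.0 spec (shared exponent = floor(log2(max|x|)) minus the element format's maximum exponent). The helper name and the all-zero handling are assumptions for this sketch, and it does not perform the actual FP8 element cast or reflect torchao's kernels.

```python
import math

E4M3_EMAX = 8                             # 448 = 1.75 * 2**8 is the largest E4M3 normal
E8M0_MIN_EXP, E8M0_MAX_EXP = -127, 127    # representable exponents of the E8M0 scale


def mx_shared_scale(block):
    """Power-of-two scale shared by every element of one MX block."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 1.0  # arbitrary choice for an all-zero block in this sketch
    # frexp gives max_abs = m * 2**exp with 0.5 <= m < 1, so floor(log2) = exp - 1
    _, exp = math.frexp(max_abs)
    shared_exp = (exp - 1) - E4M3_EMAX
    shared_exp = max(E8M0_MIN_EXP, min(E8M0_MAX_EXP, shared_exp))
    return 2.0 ** shared_exp


block = [3.0] + [0.5] * 31          # one block of 32 elements
scale = mx_shared_scale(block)
scaled = [v / scale for v in block]  # values that would then be cast to FP8 E4M3
```

With floor rounding, the largest scaled magnitude lands in [256, 512), so values above 448 would saturate when cast to E4M3; other rounding modes for the scale trade this off differently.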

Here is how to use the current prototype APIs.

:warning: Note that torch.compile support is not fully there yet, there are no guarantees on performance at this time, and we expect to change these APIs rapidly as we iterate in future versions of torchao. Please see https://github.com/pytorch/ao/issues/556 for more details.

MX training

```python
from torchao.prototype.mx_formats.mx_linear import swap_linear_with_mx_linear
from torchao.prototype.mx_formats.config import MXLinearConfig, MXGemmKernelChoice
from torchao.utils import is_sm_at_least_100

# early prototype: on MX-enabled hardware, you can use the real MX gemm backed by
# torchao's CUTLASS kernels. In the future, we will also add cuBLAS kernel support.
gemm_kernel_choice = MXGemmKernelChoice.EMULATED
if is_sm_at_least_100():
    gemm_kernel_choice = MXGemmKernelChoice.CUTLASS

m = torch.nn.Sequential(torch.nn.Linear(32, 32)).cuda()
config = MXLinearConfig(
    elem_dtype=torch.float8_e4m3fn,
    block_size=32,
    gemm_kernel_choice=gemm_kernel_choice,
)
swap_linear_with_mx_linear(m, config=config)

# training loop (not shown)
```

MX inference, weights are in MX and matmul is in high precision.

```python
import torch
from torchao.prototype.mx_formats.mx_linear import swap_linear_with_mx_inference_linear
from torchao.prototype.mx_formats.config import MXLinearConfig

m = torch.nn.Sequential(torch.nn.Linear(32, 32)).cuda()
config = MXLinearConfig(elem_dtype=torch.float8_e4m3fn, block_size=32)
swap_linear_with_mx_inference_linear(m, config=config)

# do inference (not shown)
```

The additional features for MX support in v0.9.0 were enabled by:

  • Add mxfp8bf16 kernel (https://github.com/pytorch/ao/pull/1637)
  • Support mixed MX element dtype in mx_mm function and MXLinear. (https://github.com/pytorch/ao/pull/1667)
  • move block_size and elem_dtype into MXLinearConfig (https://github.com/pytorch/ao/pull/1689)
  • hook up mxfp8 and mxfp4 CUTLASS kernels to MXLinear (https://github.com/pytorch/ao/pull/1713)
  • add ceil and RNE rounding modes to the cast from fp32 to e8m0 (https://github.com/pytorch/ao/pull/1643)
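To make the microscaling idea more concrete, here is a minimal pure-Python sketch of how one block of 32 values could share a single power-of-two scale. It is not torchao's implementation: `mx_block_scale` is a hypothetical helper, the element format is assumed to be float8 e4m3 (largest power-of-two exponent 8, largest finite value 448), floor rounding is assumed for the shared exponent, and the actual cast of the scaled elements to fp8 is left out.

```python
import math

E4M3_EMAX = 8       # exponent of the largest power of two in float8 e4m3 (assumed format)
E4M3_MAX = 448.0    # largest finite float8 e4m3 value


def mx_block_scale(block):
    """Return (shared_exponent, scaled_values) for one block (hypothetical helper).

    The shared scale is 2**shared_exponent, chosen with floor rounding so that
    the largest element lands near the top of the e4m3 range.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(amax)) - E4M3_EMAX
    scale = 2.0 ** shared_exp
    # With floor rounding the scaled maximum can land in [256, 512), so saturate.
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in block]
    return shared_exp, scaled


block = [0.01 * (i - 16) for i in range(32)]  # one block of 32 elements
shared_exp, scaled = mx_block_scale(block)
print(shared_exp, max(abs(v) for v in scaled))
```

Because the scale is a power of two, dividing by it only shifts exponents, so the scaled values carry the same mantissas as the inputs until they are rounded to the element dtype.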

Experimental

  • Q dq layout (https://github.com/pytorch/ao/pull/1642)
  • Add support for kleidi AI quantization schemes (https://github.com/pytorch/ao/pull/1447)

SAM2

  • Add modal script extensions (https://github.com/pytorch/ao/pull/1500)
  • Increase export usage, small perf improvements (https://github.com/pytorch/ao/pull/1673)
  • Model experiments QoL improvements (https://github.com/pytorch/ao/pull/1683)
  • Collect p90 latency statistics (https://github.com/pytorch/ao/pull/1703)

Training

  • Support power of 2 scaling factors in float8 training with rowwise scaling and use e4m3 in fwd and bwd pass (https://github.com/pytorch/ao/pull/1670)
  • clean up recipe names in Float8 training (https://github.com/pytorch/ao/pull/1730)
  • make the "config from recipe" API polished in Float8 training (https://github.com/pytorch/ao/pull/1731)
  • Add workaround to reduce FSDP memory usage for float8 rowwise training (https://github.com/pytorch/ao/pull/1629)
  • Make FakeQuantizer expose useful config details when printed (https://github.com/pytorch/ao/pull/1717)

Sparsity

  • Promote blocksparse from prototype, make it faster (https://github.com/pytorch/ao/pull/1734)

Other

  • Relax dtype requirements for int4 and float8 quants in autoquant (https://github.com/pytorch/ao/pull/1571)
  • Update __init__.py to load experimental ops even if other C++ ops are not found (https://github.com/pytorch/ao/pull/1565)

Bug Fixes

  • Fix torch.intx support in FakeQuantizeConfig (https://github.com/pytorch/ao/pull/1544)
  • Fix float related autoquant options (https://github.com/pytorch/ao/pull/1562)
  • Fix #1559, sparsity instead of sparstiy (https://github.com/pytorch/ao/pull/1560)
  • Fix .item() issue in running parallel evaluation for BO mixed precision (https://github.com/pytorch/ao/pull/1630)
  • Add more stringent test for CPUOffloadOptimizer (https://github.com/pytorch/ao/pull/1650)
  • Fix LR scheduler issue with CPU offload optimizer (https://github.com/pytorch/ao/pull/1649)
  • Add int8 dynamic activation + int8 weight only test to TensorParallel (https://github.com/pytorch/ao/pull/1657)
  • Fix compile issue for Marlin qqq on sm<8.0 (https://github.com/pytorch/ao/pull/1651)
  • Fix use_hqq for int4_weight_only quantize (https://github.com/pytorch/ao/pull/1707)
  • Unbreak float8 static quant tutorial (https://github.com/pytorch/ao/pull/1709)
  • Fix DDP with nf4 (https://github.com/pytorch/ao/pull/1684)
  • Fix tensor parallelism for float8 training with rowwise scaling (https://github.com/pytorch/ao/pull/1718)

Documentation

  • Update supported dtypes for fp8 (https://github.com/pytorch/ao/pull/1573)
  • Sparsity docs update (https://github.com/pytorch/ao/pull/1590)
  • Sparsity getting started docs (https://github.com/pytorch/ao/pull/1592)
  • Fix broken link on doc page (https://github.com/pytorch/ao/pull/1582)
  • Add quick start guide for first time users (https://github.com/pytorch/ao/pull/1611)
  • Update api_ref_dtypes docs (https://github.com/pytorch/ao/pull/1610)
  • Add module swap -> tensor subclass migration tutorial (https://github.com/pytorch/ao/pull/1596)
  • Update docs to refer to version.html (https://github.com/pytorch/ao/pull/1631)
  • Split contributor guide into quantization overview (https://github.com/pytorch/ao/pull/1618)
  • Update api_ref_quantization docs (https://github.com/pytorch/ao/pull/1619)
  • Migrate static quant tutorials to direct configuration (https://github.com/pytorch/ao/pull/1710)
  • Update torchao READMEs with new configuration APIs (https://github.com/pytorch/ao/pull/1711)
  • Update SAM2 README.md (https://github.com/pytorch/ao/pull/1735)
  • Add rowwise scaling README.md entry for float8 training (https://github.com/pytorch/ao/pull/1733)

Developers

  • Consolidate ZeroPointDomain.NONE & None zero point domains (https://github.com/pytorch/ao/pull/1556)
  • Only run docs build in CI if docs have changed (https://github.com/pytorch/ao/pull/1589)
  • Add separate quantization primitives for float8 (https://github.com/pytorch/ao/pull/1597)
  • Add boiler plate code to Tensor subclass (https://github.com/pytorch/ao/pull/1663)
  • Change TORCH_LIBRARY to TORCH_LIBRARY_FRAGMENT (https://github.com/pytorch/ao/pull/1645)
  • Reformat C++ kernels (https://github.com/pytorch/ao/pull/1723)
  • Add torchao/experimental CI test (https://github.com/pytorch/ao/pull/1586)
  • Clean up linear_int8_dynamic_activation_intx_weight_subclass (https://github.com/pytorch/ao/pull/1553)

New Contributors

  • @jaewoosong made their first contribution in https://github.com/pytorch/ao/pull/1560
  • @haodongucsb made their first contribution in https://github.com/pytorch/ao/pull/1630
  • @nikhil-arm made their first contribution in https://github.com/pytorch/ao/pull/1447
  • @ngc92 made their first contribution in https://github.com/pytorch/ao/pull/1650
  • @balancap made their first contribution in https://github.com/pytorch/ao/pull/1667

Full Changelog: https://github.com/pytorch/ao/compare/v0.8.0...v0.9.0-rc1

Published by HDCharles over 1 year ago

torchao - v0.8.0

Highlights

We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchAO, which adds support for a W4A8 linear operator. In addition, we’ve added TTFT benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding.

W4A8 based on CUTLASS

A new W4A8 linear operator is implemented that corresponds to int8_dynamic_activation_int4_weight quantization, where two 4-bit weights are packed into a single 8-bit integer value. CUTLASS has also been made a submodule of the torchao repo so that more of its functionality can be used to implement new kernels.
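As an illustration of what packing two 4-bit weights into one 8-bit value means, here is a small pure-Python sketch. The helper names are hypothetical and the nibble order (first weight in the low nibble) is an assumption; the layout the CUTLASS kernel actually expects may differ.

```python
def pack_int4_pair(lo, hi):
    """Pack two signed 4-bit ints (each in [-8, 7]) into one byte (0..255)."""
    assert -8 <= lo <= 7 and -8 <= hi <= 7
    return ((hi & 0xF) << 4) | (lo & 0xF)


def unpack_int4_pair(byte):
    """Recover the two signed 4-bit ints from a packed byte."""
    def sign_extend(nibble):
        return nibble - 16 if nibble >= 8 else nibble
    return sign_extend(byte & 0xF), sign_extend((byte >> 4) & 0xF)


weights = [-8, 7, 3, -1, 0, 5]
packed = [pack_int4_pair(weights[i], weights[i + 1]) for i in range(0, len(weights), 2)]
unpacked = [w for b in packed for w in unpack_int4_pair(b)]
print(packed, unpacked)
```

Packing halves the storage for the weights, which is why the w4a8 row in the benchmark table below has the smallest model size.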

Benchmarks on A100

| -q parameter | Average tokens/sec | Average Bandwidth in GB/s | Peak Memory Usage in GB | Model Size in GB |
| :--- | ---: | ---: | ---: | ---: |
| | 95.24 | 258.55 | 13.90 | 13.21 |
| -q int8wo | 155.31 | 1028.37 | 8.97 | 6.62 |
| -q int4wo-32 | 186.70 | 774.98 | 5.31 | 4.15 |
| -q int4wo-hqq | 186.47 | 774.01 | 5.04 | 4.15 |
| -q int8dq | 49.64 | 328.72 | 9.44 | 6.62 |
| -q w4a8-cutlass (tuned) | 119.31 | 394.86 | 4.52 | 3.31 |

Prefill performance benchmarks

We’ve added TTFT benchmarks to torchAO and compared different quantization + sparsity speedups for prefill / decoding. Prefill is compute bound, and there we find that dynamic quantization offers greater speedups than weight-only quantization, which is faster for decoding. We’ve also added an option for int8 dynamic quantization that selectively uses dynamic quantization during prefill and weight-only quantization during decoding.
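A rough way to see why prefill and decoding behave differently is to estimate the arithmetic intensity (FLOPs per byte moved) of a single linear layer. The sketch below is a back-of-the-envelope model, not a benchmark: the function name, the 4096x4096 layer, and the byte counts are illustrative assumptions.

```python
def linear_arithmetic_intensity(num_tokens, in_features, out_features,
                                weight_bytes=2.0, act_bytes=2.0):
    """FLOPs per byte moved for y = x @ W.T with x of shape (num_tokens, in_features)."""
    flops = 2 * num_tokens * in_features * out_features
    bytes_moved = (
        in_features * out_features * weight_bytes   # read the weights once
        + num_tokens * in_features * act_bytes      # read the activations
        + num_tokens * out_features * act_bytes     # write the outputs
    )
    return flops / bytes_moved


# Decoding processes one new token at a time; prefill processes the whole prompt.
decode = linear_arithmetic_intensity(num_tokens=1, in_features=4096, out_features=4096)
prefill = linear_arithmetic_intensity(num_tokens=2048, in_features=4096, out_features=4096)
# Same decode step with 4-bit weights (0.5 bytes per weight).
decode_int4 = linear_arithmetic_intensity(1, 4096, 4096, weight_bytes=0.5)
print(decode, prefill, decode_int4)
```

With roughly one FLOP per byte, decoding is limited by how fast weights can be read, so shrinking the weights helps most. Prefill performs on the order of a thousand FLOPs per byte in this model, so faster low-precision matmuls matter more than smaller weights.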


BC Breaking

Delete the float8-all-gather-only functionality from float8 training (https://github.com/pytorch/ao/pull/1451)

The use_fp8_all_gather_only option was an experimental flag, off by default, that was not advertised and, as far as we know, not used by anyone. We are removing it to simplify the code.

Before

```python
config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)
```

After

The use_fp8_all_gather_only option is no longer supported.

New Features

Improvement

quantize_

autoquant

float8 training

experimental

other

Bug Fixes

Performance

Documentation

Developers

New Contributors

Full Changelog: https://github.com/pytorch/ao/compare/v0.7.0...v0.8.0-rc2

Published by jainapurva over 1 year ago

torchao - v0.7.0

Highlights

We are excited to announce the 0.7.0 release of torchao! This release moves QAT out of prototype with improved LoRA support and more flexible APIs, and adds support for new experimental kernels such as Marlin QQQ (for CUDA), int8_dynamic_activation_intx_weight (for ARM CPU), and more!

QAT moved out of prototype, LoRA integration, new flexible APIs (#1020, #1085, #1152, #1037, #1152)

QAT has been moved out of prototype to torchao/quantization/qat to provide better API stability guarantees moving forward. In addition to the existing *QATQuantizer classes, we now also support the more flexible FakeQuantizedLinear and FakeQuantizedEmbedding modules for users to configure the exact quantization settings they wish to use during QAT.

```python
import torch
from torchao.quantization.qat.api import FakeQuantizeConfig
from torchao.quantization.qat.embedding import FakeQuantizedEmbedding
from torchao.quantization.qat.linear import FakeQuantizedLinear

# Specify quantization schemes to use during QAT
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=8)

# Replace nn.Linear and nn.Embedding with these in your model
fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config)
fq_embedding = FakeQuantizedEmbedding(16, 32, weight_config=weight_config)
```
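For readers new to QAT, the following pure-Python sketch shows what "fake quantization" of a weight group amounts to: values are snapped to the integer grid and immediately converted back to float, so training sees the rounding error. It is a simplified illustration with hypothetical helper names, covering only symmetric grouped weight quantization; it is not the logic inside FakeQuantizedLinear, and real QAT also needs a straight-through estimator for gradients.

```python
def fake_quantize_group(values, bits=4):
    """Symmetric quantize-then-dequantize of one weight group; output stays float."""
    qmax = 2 ** (bits - 1) - 1      # 7 for int4
    qmin = -(2 ** (bits - 1))       # -8 for int4
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # nearest integer grid point
        out.append(q * scale)                        # back to float
    return out


def fake_quantize(weights, group_size=8, bits=4):
    """Apply fake quantization independently to each group of `group_size` weights."""
    result = []
    for start in range(0, len(weights), group_size):
        result.extend(fake_quantize_group(weights[start:start + group_size], bits))
    return result


weights = [0.7, -0.35, 0.1, 0.0, -0.7, 0.2, 0.05, 0.5] + [0.07 * i for i in range(8)]
fq = fake_quantize(weights, group_size=8, bits=4)
print(fq)
```

Each group gets its own scale, so a group with small values is not forced onto the coarse grid of a group with large values.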

We also leveraged the new flexible APIs to build a new QAT + LoRA fine-tuning flow in torchtune. Try it out today!

```bash
tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora
```

Marlin QQQ for CUDA (#1113)

Marlin QQQ is an optimized GPU kernel that supports W4A8 mixed-precision GEMM. For more details about Marlin QQQ, please refer to the paper.

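```python
from torchao.dtypes import MarlinQQQLayout
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
from torchao.quantization.quant_primitives import MappingType

quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=128,
        mapping_type=MappingType.SYMMETRIC,
        act_mapping_type=MappingType.SYMMETRIC,
        layout=MarlinQQQLayout(),
    ),
)
```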

Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#marlin-qqq.

This is a prototype feature, so feel free to try it out!

int8_dynamic_activation_intx_weight Quantization for ARM CPU (#995, #1027, #1254, #1353)

We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac computer with Apple silicon).

```python
import os

import torch
from torchao.quantization import quantize_
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight

# `model` and `precision` are assumed to be defined by the surrounding script
assert precision == torch.float32, "int8_dynamic_activation_intx_weight requires fp32 precision"

# Build kernels in temp location, and load them in torch
# This requires an ARM CPU
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")

# Quantize model
nbit = 4
assert nbit >= 1 and nbit <= 8, "nbits must be 1 to 8"
group_size = 128
has_weight_zeros = False
quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        group_size=group_size,
        nbit=nbit,
        has_weight_zeros=has_weight_zeros,
    ),
)
```

Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#int8dynamicactivationintxweight-quantization

We are still trying to figure out how to ship the ARM CPU kernels, so the exact API is subject to change.

BC Breaking

Rename AQT#2 LayoutType -> Layout (#1049)

Before:

```python
from torchao.dtypes import (
    BlockSparseLayoutType,
    Int4CPULayoutType,
    MarlinQQQLayoutType,
    MarlinSparseLayoutType,
    SemiSparseLayoutType,
    TensorCoreTiledLayoutType,
    UintxLayoutType,
    Float8LayoutType,
    LayoutType,
    PlainLayoutType,
)
```

After:

```python
from torchao.dtypes import (
    BlockSparseLayout,
    Int4CPULayout,
    MarlinQQQLayout,
    MarlinSparseLayout,
    SemiSparseLayout,
    TensorCoreTiledLayout,
    UintxLayout,
    Float8Layout,
    Layout,
    PlainLayout,
)
```

QAT imports after move out of prototype (#1091)

Before:

```python
from torchao.quantization.prototype.qat import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.prototype.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.prototype.qat.fake_quantizer import (
    FakeQuantizer,
)
```

After:

```python
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)
from torchao.quantization.qat.linear import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.qat.fake_quantizer import (
    FakeQuantizer,
)
```

New Features

  • Add BF16 stochastic rounding option for optimizers (https://github.com/pytorch/ao/pull/1124)
  • Add quantize_() API support for NF4 (https://github.com/pytorch/ao/pull/1216)
  • Support W4A8 Marlin kernel (https://github.com/pytorch/ao/pull/1113)

Improvements

quantize_

  • Add default filtering to remove misaligned weights (https://github.com/pytorch/ao/pull/1194)
  • Add tensor parallelism support for int4_weight_only quantization (https://github.com/pytorch/ao/pull/1120)
  • Add support for asymmetric act quant for int8 dynamic quant (https://github.com/pytorch/ao/pull/1131)
  • Add support for groupwise quantization for int8 weight only quantization (https://github.com/pytorch/ao/pull/1121)
  • Add AQT tensor parallel for float8_dynamic_quant (https://github.com/pytorch/ao/pull/1078)
  • Int8wo Embedding Quant (https://github.com/pytorch/ao/pull/1167)
  • Making sure int4 weight only supports cpu as well (https://github.com/pytorch/ao/pull/1203)
  • BF16 support for Quant-LLM kernel (https://github.com/pytorch/ao/pull/1147)
  • Add hardware check to fp8 quant (https://github.com/pytorch/ao/pull/1314)
  • Add support for quantize_() with Float8Linear module (https://github.com/pytorch/ao/pull/1344)

autoquant

  • Added support for Per Tensor Scaling for Float8 Dynamic Autoquant (https://github.com/pytorch/ao/pull/1175)
  • Add floating point options for autoquant and add accuracy measurement (https://github.com/pytorch/ao/pull/1355)

benchmarks

  • Adding batchsize support for torchao llama benchmarks (https://github.com/pytorch/ao/pull/1182)
  • Add capability of benchmarking arbitrary binary (https://github.com/pytorch/ao/pull/1107)

experimental

  • Add embedding ops aten (https://github.com/pytorch/ao/pull/1129)
  • Add embedding ops executorch (https://github.com/pytorch/ao/pull/1137)
  • Add quantized embedding kernels to torchao (https://github.com/pytorch/ao/pull/1018)
  • Allow deprecated declarations when using Parallel ExecuTorch (https://github.com/pytorch/ao/pull/1031)
  • Introduce lowbit quantized linear MPS kernels (https://github.com/pytorch/ao/pull/954)
  • Enable 6-bit kernel (https://github.com/pytorch/ao/pull/1027)
  • Kleidi 4b blockwise gemv prototype (https://github.com/pytorch/ao/pull/997)
  • Experimental 6-bit quantization for Llama in torchchat (https://github.com/pytorch/ao/pull/1094)
  • Introduce 7-bit quantization for Llama in torchchat. (https://github.com/pytorch/ao/pull/1139)
  • Executorch Subclass API (#966) (https://github.com/pytorch/ao/pull/995)
  • 8-bit packing support (https://github.com/pytorch/ao/pull/1248)
  • Experimental Enable 8-bit (https://github.com/pytorch/ao/pull/1254)
  • Experimental Benchmarking (https://github.com/pytorch/ao/pull/1353)

optimizer

  • [low-bit optim] Upcast everything to FP32 for internal calculations (https://github.com/pytorch/ao/pull/1068)
  • [Low-bit optim] Support for dcp.save() and dcp.load() (https://github.com/pytorch/ao/pull/1217)
  • Enable CPU Offload for Intel GPU (https://github.com/pytorch/ao/pull/1324)

SAM2

  • SAM2.1 copy (https://github.com/pytorch/ao/pull/1172)
  • SAM2 AMG server side request batching (https://github.com/pytorch/ao/pull/1197)
  • More SAM2-fast server improvements (https://github.com/pytorch/ao/pull/1285)
  • SAM2 Fast AMG: memory profiling and more compile (https://github.com/pytorch/ao/pull/1296)
  • SAM2 AMG cli and other QoL improvements (https://github.com/pytorch/ao/pull/1336)
  • SAM2 AMG cli.py on modal (https://github.com/pytorch/ao/pull/1349)
  • Reduce SAM2 AMG cli startup by using deploy (https://github.com/pytorch/ao/pull/1350)
  • Reduce startup time for SAM2 AMG by using torch.export (https://github.com/pytorch/ao/pull/1358)
  • More batching and improved furious accuracy/performance (https://github.com/pytorch/ao/pull/1253)
  • SAM2.1 and example README (https://github.com/pytorch/ao/pull/1048)
  • SAM2 AMG example mIoU, perf numbers and more SAM2 model annotations (https://github.com/pytorch/ao/pull/1196)

other

  • Add SpinQuant to generate.py (https://github.com/pytorch/ao/pull/1069)
  • SpinQuant (https://github.com/pytorch/ao/pull/983)
  • SmoothQuant using tensor subclassing (https://github.com/pytorch/ao/pull/1030)
  • Expose FakeQuantizeConfigs in QAT quantizers (https://github.com/pytorch/ao/pull/1214)
  • Add module-swap UX for INT8 mixed-precision training (https://github.com/pytorch/ao/pull/1179)
  • Float8 training: move module attribute setting to sync function (https://github.com/pytorch/ao/pull/1341)

Bug Fixes

  • Header bug fix (https://github.com/pytorch/ao/pull/1079)
  • Temporary fix for QAT quantizer when linear layer bias is True (https://github.com/pytorch/ao/pull/1087)
  • Fix out-of-bounds memory access in Galore dequant kernel (https://github.com/pytorch/ao/pull/1125)
  • Fixed weights_only=True load for float8_dynamic_activation_float8_weight in quant_api (https://github.com/pytorch/ao/pull/1122)
  • Fix int8_weight_only group_size (https://github.com/pytorch/ao/pull/1165)
  • Is_linear fix for MHA (https://github.com/pytorch/ao/pull/1141)
  • Fixing eval.py to use GPTQ_MT for gptq (https://github.com/pytorch/ao/pull/1176)
  • [CPU offload optim] Fix when there are non-trainable params (https://github.com/pytorch/ao/pull/1210)
  • Fix for weights-only load (https://github.com/pytorch/ao/pull/1228)
  • Pin nightlies to deal with std::bad_alloc (https://github.com/pytorch/ao/pull/1256)
  • Fix 2.5.1 failing sparsity test (https://github.com/pytorch/ao/pull/1261)
  • Call narrow only for TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/1207)
  • Fix an autoquant bug in flatten/unflatten (https://github.com/pytorch/ao/pull/1288)
  • Float8 with delayed scaling: fix autocast handling (https://github.com/pytorch/ao/pull/1306)
  • Fix bug with float8 training + FSDP2 + TP (https://github.com/pytorch/ao/pull/1327)
  • Float8 training: fix bug with AC + compile (https://github.com/pytorch/ao/pull/1329)
  • Fix torchtitan + float8 + delayed + compile (https://github.com/pytorch/ao/pull/1334)
  • [low-bit optim] Fix edge cases for FSDP2 integration (https://github.com/pytorch/ao/pull/1269)
  • [NF4] .to() fixes (https://github.com/pytorch/ao/pull/1312)
  • Check scale.ndim before applying t/transpose (https://github.com/pytorch/ao/pull/1339)

Performance

  • Swap in faster uint6 bitpacking function (https://github.com/pytorch/ao/pull/1098)
  • Implement more efficient pack and unpack uint5 (https://github.com/pytorch/ao/pull/1138)
  • Fix 20x slowdown of FP6 kernel due to device properties query (https://github.com/pytorch/ao/pull/1092)

Documentation

  • Add a developer guide for exporting to executorch (https://github.com/pytorch/ao/pull/1219)
  • Enable AWQ example on CPU (https://github.com/pytorch/ao/pull/1043)
  • Add readme doc for experimental (https://github.com/pytorch/ao/pull/1130)
  • Move float8 out of prototype in quantization README (https://github.com/pytorch/ao/pull/1166)
  • Update torchao api reference and add contributor guide (https://github.com/pytorch/ao/pull/1255)
  • Fix pickle.dump missing file argument typo in README (https://github.com/pytorch/ao/pull/1316)
  • Update README.md (https://github.com/pytorch/ao/pull/1319)
  • Update README.md: Fix bibtex and sglang links (https://github.com/pytorch/ao/pull/1361)
  • Add bibtex (https://github.com/pytorch/ao/pull/1177)
  • Clarify torchao.float8 PyTorch version support (https://github.com/pytorch/ao/pull/1191)

Developers

  • [Tp Test] Fix the placement of the device tensor (https://github.com/pytorch/ao/pull/1054)
  • Skip test_fpx_weight_only in fbcode (https://github.com/pytorch/ao/pull/1056)
  • Pin pt nightly CPU version (https://github.com/pytorch/ao/pull/1061)
  • Unpin CUDA Nightly (https://github.com/pytorch/ao/pull/1064)
  • Update smoke test (https://github.com/pytorch/ao/pull/1111)
  • Update regression_test.yml (https://github.com/pytorch/ao/pull/1163)
  • Add PyTorch 2.5 to regression test (https://github.com/pytorch/ao/pull/1168)
  • Fix Bias APIs, re-enable kleidi tests for arm64 (https://github.com/pytorch/ao/pull/1162)
  • Create CITATION.cff (https://github.com/pytorch/ao/pull/1178)
  • Unpin nightlies (https://github.com/pytorch/ao/pull/1183)
  • [experimental] Kleidi - add operator level tests (https://github.com/pytorch/ao/pull/1173)
  • Ruff format and lint (https://github.com/pytorch/ao/pull/1226)
  • Update pre-commit to match CI/CD (https://github.com/pytorch/ao/pull/1227)
  • Fixing pytest skip for only test_floatx.py (https://github.com/pytorch/ao/pull/1251)
  • Fixed invalid url in citation section (https://github.com/pytorch/ao/pull/1348)
  • Add to safe globals (https://github.com/pytorch/ao/pull/1171)
  • Aqt rename#1 Layout -> TensorImpl (https://github.com/pytorch/ao/pull/1046)
  • Move and rename GranularityType -> Granularity (https://github.com/pytorch/ao/pull/1038)
  • Change torchao quantization types from int to size_t and preface vars with "preferred" (https://github.com/pytorch/ao/pull/1041)
  • Shrink hadamard matrices (https://github.com/pytorch/ao/pull/1051)
  • Use ExecuTorch prebuilt library in pip package to build custom kernels (https://github.com/pytorch/ao/pull/1059)
  • Update base.h unit to unsigned int (https://github.com/pytorch/ao/pull/962)
  • Create header for packed weight ops (https://github.com/pytorch/ao/pull/1072)
  • Update cmake files (https://github.com/pytorch/ao/pull/1070)
  • Create build_wheels_aarch64_linux.yml (https://github.com/pytorch/ao/pull/1083)
  • ROCM binary upload (https://github.com/pytorch/ao/pull/1099)
  • Create build_wheels_windows.yml (https://github.com/pytorch/ao/pull/1101)
  • Use fewer instructions when unpacking uint6s. (https://github.com/pytorch/ao/pull/1109)
  • [CI] XPU binary build enable (https://github.com/pytorch/ao/pull/1105)
  • Move common ET/Aten op stuff to ops/library.h (https://github.com/pytorch/ao/pull/1116)
  • Move bias from kernel to packed_weights (https://github.com/pytorch/ao/pull/1119)
  • Update gpu_sparsity kernel benchmarking script (https://github.com/pytorch/ao/pull/1143)
  • [ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1142)
  • Move files to prototype/sparsity (https://github.com/pytorch/ao/pull/1145)
  • C10::nullopt -> std::nullopt (#1032) (https://github.com/pytorch/ao/pull/1151)
  • [reland][ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1150)
  • Move float8_aten_api to float8_ops (https://github.com/pytorch/ao/pull/1155)
  • Initialize model with meta device for generation benchmarking (https://github.com/pytorch/ao/pull/1144)
  • Replace torch.empty with torch.zeros (https://github.com/pytorch/ao/pull/1157)
  • Update utils.py (https://github.com/pytorch/ao/pull/1186)
  • Remove int_scaled_mm's dependency on triton for cpu (https://github.com/pytorch/ao/pull/128)
  • at::optional -> std::optional (#1170) (https://github.com/pytorch/ao/pull/1212)
  • fast_flush kwarg of do_bench is removed (https://github.com/pytorch/ao/pull/1222)
  • Remove calibration args from generate.py (https://github.com/pytorch/ao/pull/1258)
  • Skip marlin QQQ ops test in fbcode (https://github.com/pytorch/ao/pull/1289)
  • Fix Marlin QQQ ops test with unittest (https://github.com/pytorch/ao/pull/1294)
  • Fix Failing CI - Update bitsandbytes import (https://github.com/pytorch/ao/pull/1343)
  • Remove lm_eval warning (https://github.com/pytorch/ao/pull/1347)
  • Refactor Affine Quantized Tensor (#1234)
  • Move files from quantization/prototype -> prototype/quantization (#1187)
  • Add TTFT benchmarks + update sparsity benchmarks (https://github.com/pytorch/ao/pull/1140)
  • Add "gemm_input_role" to dunder slots (https://github.com/pytorch/ao/pull/984)
  • Add an option to use fp8-all-gather only without fp8 computation. (https://github.com/pytorch/ao/pull/1093)
  • Bump version to 0.7 (https://github.com/pytorch/ao/pull/1045)

New Contributors

  • @Jack-Khuu made their first contribution in https://github.com/pytorch/ao/pull/1031
  • @keyan made their first contribution in https://github.com/pytorch/ao/pull/1041
  • @digantdesai made their first contribution in https://github.com/pytorch/ao/pull/997
  • @EnragedAntelope made their first contribution in https://github.com/pytorch/ao/pull/962
  • @c4lcut3c made their first contribution in https://github.com/pytorch/ao/pull/1094
  • @elfisworking made their first contribution in https://github.com/pytorch/ao/pull/1087
  • @chuanqi129 made their first contribution in https://github.com/pytorch/ao/pull/1105
  • @p4arth made their first contribution in https://github.com/pytorch/ao/pull/1122
  • @xuzijian629 made their first contribution in https://github.com/pytorch/ao/pull/1138
  • @jeffdaily made their first contribution in https://github.com/pytorch/ao/pull/1142
  • @r-barnes made their first contribution in https://github.com/pytorch/ao/pull/1151
  • @helunwencser made their first contribution in https://github.com/pytorch/ao/pull/1157
  • @bertmaher made their first contribution in https://github.com/pytorch/ao/pull/1222
  • @tibidoh made their first contribution in https://github.com/pytorch/ao/pull/1248
  • @mandroid6 made their first contribution in https://github.com/pytorch/ao/pull/1250
  • @HandH1998 made their first contribution in https://github.com/pytorch/ao/pull/1113
  • @readleyj made their first contribution in https://github.com/pytorch/ao/pull/1316
  • @22dimensions made their first contribution in https://github.com/pytorch/ao/pull/1318
  • @galqiwi made their first contribution in https://github.com/pytorch/ao/pull/1348
  • @dbyoung18 made their first contribution in https://github.com/pytorch/ao/pull/1324
  • @sunjiweiswift made their first contribution in https://github.com/pytorch/ao/pull/1259
  • @merrymercy made their first contribution in https://github.com/pytorch/ao/pull/1361

Full Changelog: https://github.com/pytorch/ao/compare/v0.6.1...v0.7.0-rc1

Published by vkuzo over 1 year ago

torchao - v0.6.1

Highlights

We are excited to announce the 0.6.1 release of torchao! This release adds Auto-Round support, Float8 axiswise scaled training, a BitNet training recipe, an implementation of AWQ, and much more!

Auto-Round Support (#581)

Auto-Round is a new weight-only quantization algorithm that has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2 and 3 bits). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.

```python
from torchao.quantization import quantize_
from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round
from torchao.prototype.autoround.multi_tensor import MultiTensor

prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

# collect calibration inputs
input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

quantize_(model, apply_auto_round(), is_target_module)
```

Added float8 training axiswise scaling support with per-gemm-argument configuration (#940)

We added experimental support for rowwise scaled float8 gemm to torchao.float8, with per-gemm-input configurability to enable exploration of various recipes. Here is how a user can configure all-axiswise scaling:

```python
import torchao.float8
from torchao.float8.config import Float8LinearRecipeName

# `m` is the model to convert

# all-axiswise scaling
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.ALL_AXISWISE)
m = torchao.float8.convert_to_float8_training(m, config=config)

# or, a custom recipe by @lw where grad_weight is left in bfloat16
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.LW_AXISWISE_WITH_GW_HP)
m = torchao.float8.convert_to_float8_training(m, config=config)
```

Early performance benchmarks show that all-axiswise scaling achieves a 1.13x speedup vs bf16 on torchtitan / LLaMa 3 8B / 8 H100 GPUs (compared to 1.17x from all-tensorwise scaling in the same setup), with loss curves that match bf16 and all-tensorwise scaling. Further performance and accuracy benchmarks will follow in future releases.
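The difference between tensorwise and axiswise (rowwise) scaling can be illustrated with a few lines of pure Python. This is only a sketch of how the scale factors are chosen, assuming float8 e4m3 with a largest finite value of 448 and no all-zero rows; the helper names are hypothetical and no actual float8 cast is performed.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8 e4m3 value


def tensorwise_scale(matrix):
    """One scale for the whole tensor, set by its single largest magnitude."""
    amax = max(abs(v) for row in matrix for v in row)
    return FP8_E4M3_MAX / amax


def rowwise_scales(matrix):
    """One scale per row, set by each row's own largest magnitude."""
    return [FP8_E4M3_MAX / max(abs(v) for v in row) for row in matrix]


x = [
    [0.5, -0.25, 0.125],    # small-magnitude row
    [100.0, -50.0, 25.0],   # outlier row dominates the tensor-wide amax
]
t_scale = tensorwise_scale(x)
r_scales = rowwise_scales(x)
print(t_scale, r_scales)
```

With a single tensorwise scale, the small row is mapped to values no larger than about 2.24 and uses only a sliver of the float8 range. With rowwise scales, each row is stretched to the full range independently, at the cost of tracking one scale per row.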

Introduced BitNet b1.58 training recipe (#930)

Adds a recipe for [BitNet b1.58](https://arxiv.org/abs/2402.17764) ternary weights clamping.

```python
from torchao.prototype.quantized_training import bitnet_training
from torchao import quantize_

model = ...
quantize_(model, bitnet_training())
```

Notably, our implementation uses INT8 Tensor Cores to make up for the speed lost to the extra quantization work. In fact, our implementation is faster than BF16 training in most cases.
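For intuition, ternary weight quantization in the BitNet b1.58 paper scales the weights by their mean absolute value and then rounds and clamps to {-1, 0, 1}. The pure-Python sketch below follows that description; `ternary_quantize` is a hypothetical helper, not the code behind `bitnet_training`.

```python
def ternary_quantize(weights, eps=1e-5):
    """Absmean ternary quantization: returns (ternary values, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    scale = max(scale, eps)                              # avoid division by zero
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


weights = [0.9, -0.05, 0.4, -1.3, 0.02, 0.6]
ternary, scale = ternary_quantize(weights)
dequantized = [t * scale for t in ternary]
print(ternary, scale)
```

Because every weight is one of three values, the matmul can run on integer hardware, with the single float scale applied afterwards.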

[Prototype] Implemented Activation Aware Weight Quantization AWQ (#743)

Perplexity and performance measured on A100 GPU:

| Model | Quantization | Tokens/sec | Throughput (GB/sec) | Peak Mem (GB) | Model Size (GB) |
|--------------------|--------------|------------|---------------------|---------------|-----------------|
| Llama-2-7b-chat-hf | bfloat16 | 107.38 | 1418.93 | 13.88 | 13.21 |
| | awq-hqq-int4 | 196.6 | 761.2 | 5.05 | 3.87 |
| | awq-uint4 | 43.59 | 194.93 | 7.31 | 4.47 |
| | int4wo-hqq | 209.19 | 804.32 | 4.89 | 3.84 |
| | int4wo-64 | 201.14 | 751.42 | 4.87 | 3.74 |

Usage:

```python
import torch
from torchao.quantization import quantize_
from torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear

quant_dtype = torch.uint4
group_size = 64
calibration_limit = 10
calibration_seq_length = 1024

model = model.to(device)
insert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)

# run calibration data through the observed model
with torch.no_grad():
    for batch in calibration_data:
        model(batch.to(device))

is_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)
quantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)
```
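The calibration pass above exists because AWQ chooses per-input-channel scales from activation statistics. The key identity is that scaling a weight channel up and the matching activation down leaves the layer output unchanged before quantization. The pure-Python sketch below checks that identity; the helper names and the scale values are hypothetical, and the search for good scales is not shown.

```python
def matvec(weight, x):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def awq_rescale(weight, x, channel_scales):
    """Scale each input channel of W up by s and the matching activation down by s."""
    scaled_w = [[w * s for w, s in zip(row, channel_scales)] for row in weight]
    scaled_x = [v / s for v, s in zip(x, channel_scales)]
    return scaled_w, scaled_x


weight = [[0.2, -1.0, 0.5], [0.1, 0.3, -0.4]]
x = [8.0, 0.5, 1.0]                # channel 0 carries the largest activation
channel_scales = [2.0, 1.0, 1.0]   # hypothetical scales favoring the salient channel
scaled_w, scaled_x = awq_rescale(weight, x, channel_scales)
reference = matvec(weight, x)
rescaled = matvec(scaled_w, scaled_x)
print(reference, rescaled)
```

The benefit appears only after the scaled weights are quantized: weights feeding the salient channel occupy more of the integer grid, so their relative rounding error shrinks.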

New Features

  • [Prototype] Added Float8 support for AQT tensor parallel (#1003)
  • Added composable QAT quantizer (#938)
  • Introduced torchchat quantizer (#897)
  • Added INT8 mixed-precision training (#748)
  • Implemented sparse marlin AQT layout (#621)
  • Added a PerTensor static quant api (#787)
  • Introduced uintx quant to generate and eval (#811)
  • Added Float8 Weight Only and FP8 weight + dynamic activation (#740)
  • Implemented Auto-Round support (#581)
  • Added 2, 3, 4, 5 bit custom ops (#828)
  • Introduced symmetric quantization with no clipping error in the tensor subclass based API (#845)
  • Added int4 weight-only embedding QAT (#947)
  • Added support for 1-bit and 6-bit quantization for Llama in torchchat (#910, #1007)
  • Added a linear_observer class for doing static activation calibration (#807)
  • Exposed hqq through uintxweightonly API (#786)
  • Added RowWise scaling option for Float8 dynamic activation quantization (#819)
  • Added Float8 weight only to autoquant api (#866)

Improvements

  • Enhanced Auto-Round functionality (#870)
  • Improved FSDP support for low-bit optimizers (#538)
  • Added support for using AffineQuantizedTensor with weights_only=True for torch.load (#630)
  • Optimized 3-bit packing (#1029)
  • Added more evaluation metrics to llama/eval.sh (#934)
  • Improved eager numerics for dynamic scales in float8 (#904)

Bug fixes

  • Fixed inference_mode issues (#885)
  • Fixed failing FP6 benchmark (#931)
  • Resolved various issues with float8 support (#918, #923)
  • Fixed load state dict when device is different for low-bit optim (#1021)

Performance

  • Added SM75 (Turing) support for FP6 kernel (#942)
  • Implemented int8 dynamic quant + bsr support (#821)

  • Added workaround to recover the perf for quantized vit in torch.compile (#926)

INT8 Mixed-Precision Training

On NVIDIA GPUs, INT8 Tensor Cores are approximately 2x faster than their BF16/FP16 counterparts. In mixed-precision training, we can down-cast activations and weights dynamically to INT8 to leverage faster matmuls. However, since INT8 has a very limited range [-128, 127], we perform row-wise quantization, similar to how INT8 post-training quantization (PTQ) is done. The stored weight remains in its original precision.

```python
from torchao.prototype.quantized_training import int8_mixed_precision_training, Int8MixedPrecisionTrainingConfig
from torchao.quantization import quantize_

model = ...

# apply INT8 matmul to all 3 matmuls
quantize_(model, int8_mixed_precision_training())

# customize which matmul is left in original precision.
config = Int8MixedPrecisionTrainingConfig(
    output=True,
    grad_input=True,
    grad_weight=False,
)
quantize_(model, int8_mixed_precision_training(config))
```

**End2end speed benchmark** using `benchmarks/quantized_training/pretrain_llama2.py`

| Model & GPU | bs x seq_len | Config | Tok/s | Peak mem (GB) |
|-----|-----|-----|-----|-----|
| Llama2-7B, A100 | 8 x 2048 | BF16 (baseline) | ~4400 | 59.69 |
| Llama2-7B, A100 | 8 x 2048 | INT8 mixed-precision | ~6100 (+39%) | 58.28 |
| Llama2-1B, 4090 | 16 x 2048 | BF16 (baseline) | ~17,900 | 18.23 |
| Llama2-1B, 4090 | 16 x 2048 | INT8 mixed-precision | ~30,700 (+72%) | 18.34 |
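The row-wise quantization described above can be illustrated with a minimal pure-Python sketch. It is not torchao's implementation: the function names `quantize_rowwise_int8` and `int8_matmul` are hypothetical, and the symmetric absmax scaling is an assumption chosen for simplicity. Each row gets its own scale, the dot products are accumulated in integers, and the result is rescaled with the two scale vectors.

```python
def quantize_rowwise_int8(matrix):
    """Symmetric row-wise quantization: one scale per row, values on the INT8 grid."""
    quantized, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127 if absmax > 0 else 1.0
        quantized.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return quantized, scales


def int8_matmul(a, b_t):
    """Approximate A @ B given A (m x k) and B transposed (n x k).

    Both operands are quantized along the shared k dimension, products are
    accumulated as integers, and each output is rescaled by the product of
    the matching row scales.
    """
    a_q, a_scales = quantize_rowwise_int8(a)
    b_q, b_scales = quantize_rowwise_int8(b_t)
    out = []
    for a_row, a_s in zip(a_q, a_scales):
        out.append([
            sum(x * y for x, y in zip(a_row, b_row)) * a_s * b_s
            for b_row, b_s in zip(b_q, b_scales)
        ])
    return out


# The two activation rows differ in magnitude by ~100x; a separate scale per
# row keeps the small row from being crushed by the large one.
activations = [[0.5, -1.0, 2.0], [100.0, 50.0, -25.0]]
weights_t = [[0.1, 0.2, 0.3], [-0.3, 0.2, -0.1]]  # rows are output channels
approx = int8_matmul(activations, weights_t)
exact = [[sum(x * y for x, y in zip(r, c)) for c in weights_t] for r in activations]
```

With a single tensor-wide scale, the first activation row would occupy only a few of the 256 INT8 levels, which is why a scale per row is the natural fit for such a narrow range.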

Docs

  • Updated README with more current float8 speedup information (#816)
  • Added tutorial for trainable tensor subclass (#908)
  • Improved documentation for float8 unification and inference (#895, #896)

Devs

  • Added compile tests to test suite (#906)
  • Improved CI setup and build processes (#887)
  • Added M1 wheel support (#822)
  • Added more benchmarking and profiling tools (#1017)
  • Renamed fpx to floatx (#877)
  • Removed torchao_nightly package (#661)
  • Added more lint fixes (#827)
  • Added better subclass testing support (#839)
  • Added CI to catch syntax errors (#861)
  • Added tutorial on composing quantized subclass w/ Dtensor based TP (#785)

Security

No significant security updates in this release.

Untopiced

  • Added basic SAM2 AutomaticMaskGeneration example server (#1039)

New Contributors


  • @iseeyuan made their first contribution in https://github.com/pytorch/ao/pull/805
  • @YihengBrianWu made their first contribution in https://github.com/pytorch/ao/pull/860
  • @kshitij12345 made their first contribution in https://github.com/pytorch/ao/pull/863
  • @ZainRizvi made their first contribution in https://github.com/pytorch/ao/pull/887
  • @alexsamardzic made their first contribution in https://github.com/pytorch/ao/pull/899
  • @vaishnavi17 made their first contribution in https://github.com/pytorch/ao/pull/911
  • @tobiasvanderwerff made their first contribution in https://github.com/pytorch/ao/pull/931
  • @kwen2501 made their first contribution in https://github.com/pytorch/ao/pull/937
  • @y-sq made their first contribution in https://github.com/pytorch/ao/pull/912
  • @jimexist made their first contribution in https://github.com/pytorch/ao/pull/969
  • @danielpatrickhug made their first contribution in https://github.com/pytorch/ao/pull/914
  • @ramreddymounica made their first contribution in https://github.com/pytorch/ao/pull/1007
  • @yushangdi made their first contribution in https://github.com/pytorch/ao/pull/1006
  • @ringohoffman made their first contribution in https://github.com/pytorch/ao/pull/1023

Full Changelog: https://github.com/pytorch/ao/compare/v0.5.0...v0.6.1

- Python
Published by drisspg over 1 year ago

torchao - v0.5.0

Highlights

We are excited to announce the 0.5 release of torchao! This release adds support for memory efficient inference, float8 training and inference, int8 quantized training, HQQ, automatic mixed-precision quantization through bayesian optimization, sparse marlin, and integrations with HuggingFace, SGLang, and diffusers.

Memory Efficient Inference Support https://github.com/pytorch/ao/pull/738

We've added support for Llama 3.1 to the llama benchmarks in TorchAO and added new features and improvements as a proof of concept for memory efficient inference. These additions allow us to do 130k context length inference with Llama 3.1-8B with only 18.91 GB of memory when we combine KV cache quantization, int4 weight-only quantization, and a linear causal mask.

General savings depend on the technique and context length, as can be seen in the following graph: [image]

Float8 Training https://github.com/pytorch/ao/pull/551

torchao.float8 implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.

With torch.compile on, current results show throughput speedups of up to 1.5x on 128 H100 GPU LLaMa 3 70B pretraining jobs (details)

```python
from torchao.float8 import convert_to_float8_training

convert_to_float8_training(m, module_filter_fn=...)
```

For an end-to-end minimal training recipe for pretraining with float8, you can check out torchtitan.

Float8 Inference https://github.com/pytorch/ao/pull/740 https://github.com/pytorch/ao/pull/819

We have introduced two new quantization APIs for Float8 inference:

  1. Float8 Weight-Only Quantization: A new quant_api float8_weight_only() has been added to apply float8 weight-only symmetric per-channel quantization to linear layers.

  2. Float8 Dynamic Activation and Weight Quantization: A new quant_api float8_dynamic_activation_float8_weight() has been introduced to apply float8 dynamic symmetric quantization to both activations and weights of linear layers. PerTensor scaling is used by default. We have also added an option to use PerRow scaling for both activations and weights. By computing scales at a finer granularity, it can potentially reduce the overall quantization error and increase performance by reducing dynamic quantization overhead.

Example usage:

```python
import torch
from torchao.quantization import quantize_, float8_weight_only, float8_dynamic_activation_float8_weight, PerRow

# Create a model
model = YourModel()

# Apply float8 weight-only quantization
quantize_(model, float8_weight_only())

# Apply float8 dynamic activation and weight quantization
quantize_(model, float8_dynamic_activation_float8_weight())

# Apply PerRow scaling to weight and activations
quantize_(linear_module, float8_dynamic_activation_float8_weight(granularity=PerRow()))
```

Notes:

  • These new APIs are designed to work with PyTorch 2.5 and later versions.
  • float8_dynamic_activation_float8_weight requires CUDA devices with compute capability 8.9 or higher for hardware acceleration.
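To see why a finer scaling granularity can reduce quantization error, here is a small pure-Python sketch that simulates rounding to the float8 e4m3 grid. It is only an illustration under stated assumptions, not torchao's kernel: `round_to_e4m3`, `quantize_dequantize`, and `mean_relative_error` are hypothetical helpers, and the scale is assumed to be `amax / 448` (448 being the largest finite e4m3fn value).

```python
import math

E4M3_MAX = 448.0  # largest finite value of float8 e4m3fn


def round_to_e4m3(x):
    """Round x to the nearest e4m3 value (3 mantissa bits), saturating at +-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    exponent = max(math.floor(math.log2(mag)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # spacing between neighbouring values
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def quantize_dequantize(matrix, per_row):
    """Scale into the e4m3 range, round, and scale back to the original range."""
    tensor_amax = max(abs(v) for row in matrix for v in row)
    result = []
    for row in matrix:
        amax = max(abs(v) for v in row) if per_row else tensor_amax
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        result.append([round_to_e4m3(v / scale) * scale for v in row])
    return result


def mean_relative_error(reference, approx):
    errors = [
        abs(r - a) / abs(r)
        for ref_row, app_row in zip(reference, approx)
        for r, a in zip(ref_row, app_row)
    ]
    return sum(errors) / len(errors)


# Two rows whose magnitudes differ by five orders of magnitude.
matrix = [[0.001, 0.002, -0.003], [100.0, -200.0, 300.0]]
per_tensor_error = mean_relative_error(matrix, quantize_dequantize(matrix, per_row=False))
per_row_error = mean_relative_error(matrix, quantize_dequantize(matrix, per_row=True))
print(f"PerTensor: {per_tensor_error:.4f}  PerRow: {per_row_error:.4f}")
```

With a single tensor-wide scale the small row lands in the subnormal range of e4m3 and loses most of its precision, while a per-row scale lets every row use the full range.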

Int8 quantized training #644 #748

@gau-nernst introduced 2 experimental works on training using INT8.

  • INT8 quantized training (#644): weight is quantized to INT8 during the whole duration of training to save memory. Compute remains in high precision. To train the model effectively with only quantized weights, we use stochastic rounding for weight update. Right now, memory saving is not too competitive compared to compiled BF16 baseline.
  • INT8 mixed-precision training (#748): weight is kept in the original high precision, but weight and activation are dynamically quantized to INT8 during training to utilize INT8 tensor cores. We observe up to 70% speedup for Llama2 pre-training on 4090, and 20% speedup for Llama3 pre-training on 8x A100 with FSDP2.

```python
from torchao.quantization import quantize_
from torchao.prototype.quantized_training import int8_weight_only_quantized_training, int8_mixed_precision_training

model = YourModel()

# apply INT8 quantized training
quantize_(model, int8_weight_only_quantized_training())

# apply INT8 mixed-precision training
quantize_(model, int8_mixed_precision_training())
```

For more information and benchmark results, see the README and the respective PRs (#644 and #748).
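The role of stochastic rounding in INT8 quantized training can be shown with a toy example. The sketch below is an assumption-laden illustration rather than torchao's code: a single weight lives on an integer grid, and `stochastic_round` (a hypothetical helper) rounds up with probability equal to the fractional part, so the expected value of the rounded result equals the unrounded one.

```python
import math
import random


def stochastic_round(x, rng):
    """Round down or up at random so that the expected result equals x."""
    lower = math.floor(x)
    return lower + (1 if rng.random() < x - lower else 0)


rng = random.Random(0)
weight, update, steps = 10.0, 0.1, 1000  # the update is much smaller than one grid step

nearest = weight
stochastic = weight
for _ in range(steps):
    # Round-to-nearest snaps back to the same grid point on every step.
    nearest = float(round(nearest + update))
    # Stochastic rounding moves up one grid step about 10% of the time.
    stochastic = float(stochastic_round(stochastic + update, rng))

print(nearest, stochastic)
```

With round-to-nearest the weight never leaves 10, because each small update is rounded away. With stochastic rounding the weight drifts toward the high-precision result (10 + 1000 x 0.1 = 110) on average, which is what makes training with only quantized weights feasible.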

HQQ Integration in torchao https://github.com/pytorch/ao/pull/605 https://github.com/pytorch/ao/pull/786

HQQ is now available through existing torchao APIs. It improves model accuracy and leverages the existing efficient kernels in torchao.

We enabled HQQ for the int4_weight_only API: `quantize_(model, int4_weight_only(group_size, use_hqq=True))`

We also added it to the uintx API for accuracy experiments (current uintx kernels are slow): `quantize_(model, uintx_weight_only(torch.uint2, group_size, use_hqq=True))`

Automatic Mixed-Precision Quantization through Bayesian Optimization https://github.com/pytorch/ao/pull/592, https://github.com/pytorch/ao/pull/694

We provided a Bayesian Optimization (BO) tool leveraging Ax to auto search mixed-precision weight-only quantization configuration, i.e., bit width and group size of intN_weight_only(bit_width, group_size) for each layer. It also includes a sensitivity analysis tool to calculate layer-wise average Hessian trace and average fisher information matrix trace, which is an optional step to customize and improve BO search.

To optimize for model accuracy under a model size constraint (GB):

`python BO_acc_modelsize.py --checkpoint=/tmp/Meta-Llama-3-8B --model_size_constraint=6.0`

To optimize for inference throughput under a model perplexity constraint:

`python BO_acc_throughput.py --checkpoint=/tmp/Meta-Llama-3-8B --ppl_constraint=7.5`

For more detailed usage, please refer to this README. Compared to int8 uniform quantization on the Llama3-8B model, the mixed-precision quantization found by this tool reduces model size by 20.1% with a 2.8% perplexity reduction, and improves inference throughput by 15.1% with a 3.2% perplexity reduction.
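To make the search space concrete, the sketch below estimates the weight footprint of a per-layer (bit width, group size) assignment, which is the quantity a model size constraint would bound. Everything here is an illustrative assumption: the layer names and shapes are hypothetical, the helpers are not part of the tool, and the per-group metadata is assumed to be 4 bytes (for example an fp16 scale plus an fp16 zero point).

```python
def layer_size_bytes(num_weights, bit_width, group_size, meta_bytes_per_group=4):
    """Packed weight bytes plus assumed per-group metadata bytes."""
    weight_bytes = num_weights * bit_width / 8
    num_groups = num_weights / group_size
    return weight_bytes + num_groups * meta_bytes_per_group


def model_size_gb(layer_shapes, config):
    """Total size in GB for a {layer name: (bit_width, group_size)} assignment."""
    total = sum(
        layer_size_bytes(rows * cols, *config[name])
        for name, (rows, cols) in layer_shapes.items()
    )
    return total / 1e9


# Hypothetical two-layer model used only to exercise the arithmetic.
layer_shapes = {"attn.q_proj": (4096, 4096), "mlp.up_proj": (14336, 4096)}
uniform_int8 = {name: (8, 64) for name in layer_shapes}
mixed = {"attn.q_proj": (8, 64), "mlp.up_proj": (4, 32)}

uniform_gb = model_size_gb(layer_shapes, uniform_int8)
mixed_gb = model_size_gb(layer_shapes, mixed)
print(f"uniform int8: {uniform_gb:.4f} GB, mixed: {mixed_gb:.4f} GB")
```

A search procedure then has to trade this size (or the measured throughput) against accuracy for each candidate assignment; the sensitivity analysis mentioned above is one way to decide which layers can tolerate fewer bits.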

Sparse Marlin https://github.com/pytorch/ao/pull/621, https://github.com/pytorch/ao/pull/733

@Diogo-V added support for sparse-marlin, a W4AFP16 2:4 sparse kernel, to TorchAO. On Meta Llama3, we observe a 25% tok/s increase (180 -> 226) compared to our existing int4-wo implementation.

```python
from torchao.quantization.quant_api import quantize_, int4_weight_only
from torchao.dtypes import MarlinSparseLayoutType

quantize_(model, int4_weight_only(layout_type=MarlinSparseLayoutType()))
```

| Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ----------------------- | ------------- | ----------------------- | ---------------- | --------------- |
| Llama-3-8B | Base (bfloat16) | 95.64 | 1435.54 | 16.43 | 15.01 |
| | int8dq | 8.61 | 64.75 | 9.24 | 7.52 |
| | int8wo | 153.03 | 1150.80 | 10.42 | 7.52 |
| | int4wo-64 | 180.80 | 763.33 | 6.88 | 4.22 |
| | int4wo-64-sparse-marlin | 226.02 | 689.20 | 5.32 | 3.05 |
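The "2:4" in the kernel description refers to a structured sparsity pattern in which at most two of every four consecutive weights are nonzero. As a minimal sketch of what that pattern looks like, the hypothetical helper below keeps the two largest-magnitude values in each group of four. Magnitude-based selection is an assumption made for illustration; it is not necessarily how weights are pruned before being passed to the sparse-marlin kernel.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse = prune_2_4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Because the pattern is fixed, only the two kept values and their positions within the group need to be stored, which is consistent with the smaller model size reported for int4wo-64-sparse-marlin in the table above.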

HuggingFace Integration

torchao is integrated into Hugging Face Transformers (https://huggingface.co/docs/transformers/main/en/quantization/torchao). You can now use int4_weight_only, int8_weight_only and int8_dynamic_activation_int8_weight through TorchAoConfig. This is currently available in the Hugging Face main branch only.

SGLang Integration

torchao is also integrated into SGLang (https://github.com/sgl-project/sglang/pull/1341) for the Llama3 model. You can try it out with:

`python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 1 --input 128 --output 8 --torchao-config int4wo-128`

Supported configurations are ["int4wo-", "int8wo", "int8dq", "fp8wo" (only available in torchao 0.5+)]

diffusers Integration

diffusers-torchao provides end-to-end inference and experimental training recipes to use torchao with diffusers in this repo. We demonstrate a 53.88% speedup on Flux.1-Dev and a 27.33% speedup on CogVideoX-5b when comparing compiled quantized models against their standard bf16 counterparts.

BC Breaking

Add layout option to woq int4 api https://github.com/pytorch/ao/pull/670

```
# torchao 0.4.0
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(inner_k_tiles=8))

# torchao 0.5.0
from torchao.dtypes import TensorCoreTiledLayoutType
from torchao.quantization import quantize_, int4_weight_only
quantize_(my_model, int4_weight_only(layout_type=TensorCoreTiledLayoutType(inner_k_tiles=8)))
```

Refactor QAT to use tensor subclasses https://github.com/pytorch/ao/pull/585

We refactored QAT to use tensor subclasses instead of module swap. This works well with torchtune and FSDP2, but currently lacks support for FSDP1 and DDP. As a fallback for these distribution strategies, please continue to use the old module swap flows.

```
# torchao 0.4.0: This uses the module swap flow
# torchao 0.5.0 + FSDP2: This uses the tensor subclass flow
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer
quantizer = Int8DynActInt4WeightQATQuantizer()
model = quantizer.prepare(model)
train(model)
model = quantizer.convert(model)

# torchao 0.5.0 + DDP or FSDP1: This uses the module swap flow
from torchao.quantization.prototype.qat._module_swap_api import Int8DynActInt4WeightQATQuantizerModuleSwap
quantizer = Int8DynActInt4WeightQATQuantizerModuleSwap()
model = quantizer.prepare(model)
train(model)
model = quantizer.convert(model)
```

Deprecations

New Features

  • Optimizer CPU offload for single GPU training https://github.com/pytorch/ao/pull/584
  • Add support for save quantized checkpoint in llama code https://github.com/pytorch/ao/pull/553
  • Intx quantization tensor subclass https://github.com/pytorch/ao/pull/468
  • Add superblock to sparse/prototype https://github.com/pytorch/ao/pull/660
  • Add AffineQuantizedObserver https://github.com/pytorch/ao/pull/650
  • Add BSR subclass + torch.compile and clean up superblock https://github.com/pytorch/ao/pull/680
  • Add HQQ support https://github.com/pytorch/ao/pull/605
  • Add performance profiler https://github.com/pytorch/ao/pull/690
  • Add experimental INT8 quantized training https://github.com/pytorch/ao/pull/644
  • Add high-level operator interface https://github.com/pytorch/ao/pull/708
  • Add sparse marlin 2:4 gemm op https://github.com/pytorch/ao/pull/733
  • Example for GPTQ-like calibration flow https://github.com/pytorch/ao/pull/721
  • Llama3.1 and KV cache quantization https://github.com/pytorch/ao/pull/738
  • Add float8 weight only and weight + dynamic activation https://github.com/pytorch/ao/pull/740
  • Add Auto-Round support https://github.com/pytorch/ao/pull/581

Mixed-Precision Quantization

  • Add sensitivity analysis tool for layer-wise FIT and Hessian trace https://github.com/pytorch/ao/pull/592
  • Bayesian optimization tool for mixed precision quantization https://github.com/pytorch/ao/pull/694

Improvements

  • Move sam eval from scripts to torchao/_models https://github.com/pytorch/ao/pull/591
  • QOL improvements to float8 gemm benchmark https://github.com/pytorch/ao/pull/596
  • Move lowbit universal kernels from torchaccel to torchao https://github.com/pytorch/ao/pull/582
  • Refactor autoquant to use AQT https://github.com/pytorch/ao/pull/609
  • Add support for using AffineQuantizedTensor with weights_only=True https://github.com/pytorch/ao/pull/630
  • Move Uintx out of prototype for future extension https://github.com/pytorch/ao/pull/635
  • Refactor _quantized_linear for better extensibility https://github.com/pytorch/ao/pull/634
  • Update micro benchmarking code for AQT https://github.com/pytorch/ao/pull/673
  • Refactor superblock code + add final benchmark/eval scripts https://github.com/pytorch/ao/pull/691
  • Relax QAT dtype assertion https://github.com/pytorch/ao/pull/692
  • Add option to move param to device before quantization https://github.com/pytorch/ao/pull/699
  • Add gpu benchmarking script https://github.com/pytorch/ao/pull/192
  • Enable to(device=device_name) for Uintx https://github.com/pytorch/ao/pull/722
  • Make torchao's llama model trainable https://github.com/pytorch/ao/pull/728
  • Specify output dtype to torch.float32 in _foreach_norm https://github.com/pytorch/ao/pull/727
  • Add semi-structured sparsity to hf eval https://github.com/pytorch/ao/pull/576
  • Use torch.uint1 to torch.uint7 for Uintx tensor subclass https://github.com/pytorch/ao/pull/672
  • Add AdamW to CPUOffloadOptimizer default https://github.com/pytorch/ao/pull/742
  • Make developer experience better for extending AQT https://github.com/pytorch/ao/pull/749
  • Add back QAT module swap API https://github.com/pytorch/ao/pull/762
  • Refactor quant_llm to work with affine quantized tensor https://github.com/pytorch/ao/pull/772
  • Move iOS benchmarking infra code to torchao https://github.com/pytorch/ao/pull/766
  • Add CPU bandwidth benchmark https://github.com/pytorch/ao/pull/773
  • Update method names to support intx and floatx changes https://github.com/pytorch/ao/pull/775
  • Add implementation for torchao::parallel_for backends https://github.com/pytorch/ao/pull/774
  • Add Llama2-7B finetune benchmarks for low-bit optimizers https://github.com/pytorch/ao/pull/746
  • Fix Adam4bit support on PyTorch 2.3 and 2.4 and update AdamFp8 torch requirement https://github.com/pytorch/ao/pull/755
  • Improve compile time + fix PyTorch 2.3 support for 4-bit optim https://github.com/pytorch/ao/pull/812
  • Allow quantized linear registration in a different file https://github.com/pytorch/ao/pull/783
  • Add 2bit, 5bit packing routines https://github.com/pytorch/ao/pull/797, https://github.com/pytorch/ao/pull/798
  • Freeze dataclass in nf4, prep for better pt2 support https://github.com/pytorch/ao/pull/799
  • Format and lint nf4 file and test https://github.com/pytorch/ao/pull/800
  • Move more utils to TorchAOBaseTensor https://github.com/pytorch/ao/pull/784
  • Add more information to quantized linear module and added some logs https://github.com/pytorch/ao/pull/782
  • Add int4 mode to autoquant https://github.com/pytorch/ao/pull/804
  • Add uintx quant to generate and eval https://github.com/pytorch/ao/pull/811
  • Move non-NF4 tensor to device prior to quantization on copy https://github.com/pytorch/ao/pull/737

Static quantization

  • Add float8 static quant support https://github.com/pytorch/ao/pull/787
  • Update how block_size is calculated with Observers https://github.com/pytorch/ao/pull/815
  • Add a linear observer class and test https://github.com/pytorch/ao/pull/807

Float8

  • Update benchmarks to be more useful for smaller shapes https://github.com/pytorch/ao/pull/615
  • Remove unneeded kernel for scale generation https://github.com/pytorch/ao/pull/616
  • Filter out microbenchmarking overhead in profiling script https://github.com/pytorch/ao/pull/629
  • Save torch_logs, and attach them to profiling trace https://github.com/pytorch/ao/pull/645
  • Add option for gpu time in GEMM benchmarks https://github.com/pytorch/ao/pull/666
  • Add roofline estimation of GEMM + overhead https://github.com/pytorch/ao/pull/668
  • Make roofline utils reusable https://github.com/pytorch/ao/pull/731
  • Use torch.compiler.is_compiling https://github.com/pytorch/ao/pull/739
  • Float8 support in AQT https://github.com/pytorch/ao/pull/671
  • Add static scaling for float8 training https://github.com/pytorch/ao/pull/760
  • Make roofline script calculate observed overhead https://github.com/pytorch/ao/pull/734
  • Make Inference and training code independent https://github.com/pytorch/ao/pull/808
  • Add rowwise scaling option to float8 dynamic quant https://github.com/pytorch/ao/pull/819

Bug fixes

  • Fix all-gather in 2D with DTensor (WeightWithDynamicFloat8CastTensor) https://github.com/pytorch/ao/pull/590
  • Fix FP6-LLM API and add .to(device) op https://github.com/pytorch/ao/pull/595
  • Fix linear_activation_tensor dynamic quant https://github.com/pytorch/ao/pull/622
  • Fix bug with float8 inference_mode https://github.com/pytorch/ao/pull/659
  • Quantization kernel bug fixes https://github.com/pytorch/ao/pull/717
  • Cast local_scale_tensor to fp32 for precompute of float8 dynamic scaling https://github.com/pytorch/ao/pull/713
  • Fix affine quantized tensor to device calls https://github.com/pytorch/ao/pull/726
  • Small fix for micro benchmark code https://github.com/pytorch/ao/pull/711
  • Fix LR schedule handling for low-bit optimizers https://github.com/pytorch/ao/pull/736
  • Fix FPX inductor error https://github.com/pytorch/ao/pull/790
  • Fixed llama model inference https://github.com/pytorch/ao/pull/769

Docs

  • Add QAT README https://github.com/pytorch/ao/pull/597
  • Update serialization.rst to include get_model_size_in_bytes import https://github.com/pytorch/ao/pull/604
  • Clarify details around unwrap_tensor_subclass in README.md https://github.com/pytorch/ao/pull/618, https://github.com/pytorch/ao/pull/619
  • Spelling fixes https://github.com/pytorch/ao/pull/662
  • Move developer guide file to a folder https://github.com/pytorch/ao/pull/681
  • Update docs on how to use AUTOQUANT_CACHE https://github.com/pytorch/ao/pull/649
  • Update pip install command in README https://github.com/pytorch/ao/pull/723
  • Fix docstring args names https://github.com/pytorch/ao/pull/735
  • Update README example with correct import of sparsify_ https://github.com/pytorch/ao/pull/741
  • Update main and quantization README https://github.com/pytorch/ao/pull/745, https://github.com/pytorch/ao/pull/747, https://github.com/pytorch/ao/pull/757
  • Add README for mixed-precision search tool and code refactor https://github.com/pytorch/ao/pull/776
  • Add performance section to float8 README.md https://github.com/pytorch/ao/pull/794
  • Make float8 README.md examples standalone https://github.com/pytorch/ao/pull/809
  • Add KV cache quantization to READMEs https://github.com/pytorch/ao/pull/813
  • Update main README.md with more current float8 speedup https://github.com/pytorch/ao/pull/816

Not user facing

  • Fix float8 inference tests and add export test https://github.com/pytorch/ao/pull/613
  • Reduce atol/rtol for stable tests https://github.com/pytorch/ao/pull/617
  • Fix version guard in https://github.com/pytorch/ao/pull/620, https://github.com/pytorch/ao/pull/679, https://github.com/pytorch/ao/pull/684
  • Fix BC for QAT location https://github.com/pytorch/ao/pull/626
  • Enable float8 CI on sm89 https://github.com/pytorch/ao/pull/587
  • Fix Inductor bench BC change https://github.com/pytorch/ao/pull/638, https://github.com/pytorch/ao/pull/641
  • Add CUDA compute capability compile guard https://github.com/pytorch/ao/pull/636
  • Remove numpy as bitpack dependency https://github.com/pytorch/ao/pull/677
  • Add PyTorch 2.4 tests in CI https://github.com/pytorch/ao/pull/654
  • Remove torchao_nightly package https://github.com/pytorch/ao/pull/661
  • Update licenses in torchao/experimental https://github.com/pytorch/ao/pull/720
  • Add lint checks for float8 inference https://github.com/pytorch/ao/pull/779

New Contributors

  • @sayakpaul made their first contribution in https://github.com/pytorch/ao/pull/604
  • @metascroy made their first contribution in https://github.com/pytorch/ao/pull/582
  • @raziel made their first contribution in https://github.com/pytorch/ao/pull/618
  • @nmacchioni made their first contribution in https://github.com/pytorch/ao/pull/641
  • @Diogo-V made their first contribution in https://github.com/pytorch/ao/pull/670
  • @mobicham made their first contribution in https://github.com/pytorch/ao/pull/605
  • @crcrpar made their first contribution in https://github.com/pytorch/ao/pull/703
  • @ebsmothers made their first contribution in https://github.com/pytorch/ao/pull/737
  • @a-r-r-o-w made their first contribution in https://github.com/pytorch/ao/pull/741
  • @kimishpatel made their first contribution in https://github.com/pytorch/ao/pull/766

We were able to close about 70% of the tasks planned for 0.5.0; the remaining tasks will spill over into upcoming releases. We will post a list for 0.6.0 next, which we aim to release at the end of September 2024. We want to follow a monthly release cadence until further notice.

Full Changelog: https://github.com/pytorch/ao/compare/v0.4.0...v0.5.0-rc1

- Python
Published by andrewor14 over 1 year ago

torchao - v0.4.0

v0.4.0

Highlights

We are excited to announce the 0.4 release of torchao! This release adds support for KV cache quantization, quantization aware training (QAT), low bit optimizer support, composing quantization and sparsity, and more!

KV cache quantization (https://github.com/pytorch/ao/pull/532)

We've added support for KV cache quantization, showing a peak memory reduction from 19.7 -> 19.2 GB on Llama3-8B at an 8192 context length. We plan to investigate Llama3.1 next.
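As a rough sanity check of the reported saving, the KV cache size can be estimated from the model shape. The sketch below assumes the public Llama3-8B configuration (32 layers, 8 KV heads, head dimension 128), a batch size of 1, a bf16 baseline cache, and an int8 quantized cache, and it ignores the small overhead of scales and zero points. The helper name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value, batch_size=1):
    """Keys and values for every layer: 2 tensors of [batch, kv_heads, seq_len, head_dim]."""
    return 2 * num_layers * batch_size * num_kv_heads * seq_len * head_dim * bytes_per_value


llama3_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)
bf16_cache = kv_cache_bytes(**llama3_8b, seq_len=8192, bytes_per_value=2)
int8_cache = kv_cache_bytes(**llama3_8b, seq_len=8192, bytes_per_value=1)
saved_gib = (bf16_cache - int8_cache) / 2**30

print(f"bf16: {bf16_cache / 2**30:.2f} GiB, int8: {int8_cache / 2**30:.2f} GiB, saved: {saved_gib:.2f} GiB")
# bf16: 1.00 GiB, int8: 0.50 GiB, saved: 0.50 GiB
```

Under these assumptions the estimated saving of about 0.5 GiB is in line with the observed drop from 19.7 to 19.2 GB. Since the cache grows linearly with sequence length, the saving becomes more significant at longer contexts.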

Quantization-Aware Training (QAT) (#383, #555)

We now support two QAT schemes for linear layers: Int8 per token dynamic activations + int4 per group weights, and int4 per group weights (using the efficient tinygemm int4 kernel after training). Users can access this feature by transforming their models before and after training using the appropriate quantizer, for example:

```python
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Quantizer for int8 dynamic per token activations +
# int4 grouped per channel weights, only for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics during
# training without performing any dtype casting
model = qat_quantizer.prepare(model)

# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)
```

Initial evaluation results indicate that QAT in torchao can recover up to 96% of quantized accuracy degradation on hellaswag and up to 68% of quantized perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). For more details, please refer to the README and this blog post.
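The "fake quantize" operations mentioned above can be sketched in a few lines of pure Python. This is an illustration of the general idea only, not torchao's implementation: `fake_quantize_per_group` is a hypothetical helper that uses symmetric per-group int4 quantization, rounds each weight to the integer grid, and immediately converts it back to float, so the values carry quantization error while the dtype never changes.

```python
def fake_quantize_per_group(weights, group_size=4, bits=4):
    """Quantize then dequantize each group with a symmetric scale; output stays float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # [-8, 7] for int4
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / qmax if absmax > 0 else 1.0
        for w in group:
            q = max(qmin, min(qmax, round(w / scale)))  # index on the integer grid
            out.append(q * scale)  # back to float: same dtype, quantized value
    return out


weights = [0.7, -0.35, 0.1, 0.0, 0.02, -0.01, 0.005, 0.014]
fake = fake_quantize_per_group(weights)
print(fake)
```

Each group has its own scale, so the second group of small weights is not flattened to zero by the larger values in the first group. During training, the model sees these rounded values in the forward pass, which is how it learns to compensate for the quantization error that post-training quantization would otherwise introduce.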

Composing quantization and sparsity (#457, #473)

We've added support for composing int8 dynamic quantization with 2:4 sparsity, using the quantize_ API. We also added SAM benchmarks that show a 7% speedup over standalone sparsity / int8 dynamic quantization here.

```python
from torchao.quantization import quantize_, int8_dynamic_activation_int8_semi_sparse_weight

quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())
```

Community Contributions

low-bit optimizer support (#478, #463, #482, #484, #538)

@gau-nernst added implementations for 4-bit, 8-bit, and FP8 Adam with FSDP2/FSDP support. Our API is a drop-in replacement for torch.optim.Adam and can be used as follows:

```python
from torchao.prototype.low_bit_optim import Adam8bit, Adam4bit, AdamFp8
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8

model = ...
optim = Adam8bit(model.parameters())  # replace with Adam4bit and AdamFp8 for the 4 / fp8 versions
```

For more information about low bit optimizer support please refer to our README.

Improvements to 4-bit quantization (https://github.com/pytorch/ao/pull/517, https://github.com/pytorch/ao/pull/552, https://github.com/pytorch/ao/pull/544, #479 )

@bdhirsh @jeromeku @yanbing-j @manuelcandales @larryliu0820 added torch.compile support for NF4 Tensor, custom CUDA int4 tinygemm unpacking ops, and several bugfixes to torchao

BC breaking

  • quantize has been renamed to quantize_ https://github.com/pytorch/ao/pull/467

```python
# for torchao 0.4
from torchao.quantization import quantize_, int8_weight_only
quantize_(model, int8_weight_only())

# for torchao 0.3
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())
```

  • `apply_sparse_semi_structured` has been deprecated in favor of `sparsify_`, which matches the `quantize_` API https://github.com/pytorch/ao/pull/473

```python
# for torchao 0.4
from torchao.sparsity import sparsify_, semi_sparse_weight
sparsify_(model, semi_sparse_weight())

# for torchao 0.3
from torchao.sparsity import apply_sparse_semi_structured
apply_sparse_semi_structured(model)
```

Deprecations

New Features

  • Added kv_cache quantization https://github.com/pytorch/ao/pull/532
  • Migrated float8_experimental to torchao.float8, enabling float8 training support https://github.com/pytorch/ao/pull/551 https://github.com/pytorch/ao/pull/529
  • Added FP5 E2M2 https://github.com/pytorch/ao/pull/399
  • Added 4-bit, 8-bit, and FP8 ADAM support https://github.com/pytorch/ao/pull/478 https://github.com/pytorch/ao/pull/463 https://github.com/pytorch/ao/pull/482
  • Added FSDP2 support for low-bit optimizers https://github.com/pytorch/ao/pull/484
  • [prototype] mixed-precision quantization and eval framework https://github.com/pytorch/ao/pull/531
  • Added int4 weight-only QAT support https://github.com/pytorch/ao/pull/555, https://github.com/pytorch/ao/pull/383
  • Added custom CUDA tinygemm unpacking ops https://github.com/pytorch/ao/pull/415

Improvements

  • Composing quantization and sparsity now uses the unified AQT Layout https://github.com/pytorch/ao/pull/498
  • Added default inductor config settings https://github.com/pytorch/ao/pull/423
  • Better dtype and device handling for Int8DynActInt4WeightQuantizer and Int4WeightOnlyQuantizer https://github.com/pytorch/ao/pull/475 https://github.com/pytorch/ao/pull/479
  • Enable model.to for int4/int8 weight only quantized models https://github.com/pytorch/ao/pull/486 https://github.com/pytorch/ao/pull/522
  • Added more logging to TensorCoreTiledAQTLayout https://github.com/pytorch/ao/pull/520
  • Added general fake_quantize_affine op with mask support https://github.com/pytorch/ao/pull/492 https://github.com/pytorch/ao/pull/500
  • QAT now uses the shared fake_quantize_affine primitive https://github.com/pytorch/ao/pull/527
  • Improve FSDP support for low-bit optimizers https://github.com/pytorch/ao/pull/538
  • Custom op and inductor decomp registration now uses a decorator https://github.com/pytorch/ao/pull/434
  • Updated torch version to no longer require unwrap_tensor_subclass https://github.com/pytorch/ao/pull/595

Bug fixes

  • Fixed import for TORCH_VERSION_AFTER_* https://github.com/pytorch/ao/pull/433
  • Fixed crash when PYTORCH_VERSION is not defined https://github.com/pytorch/ao/pull/455
  • Added torch.compile support for NF4Tensor https://github.com/pytorch/ao/pull/544
  • Added fbcode check to fix torchtune in Genie https://github.com/pytorch/ao/pull/480
  • Fixed int4pack_mm error https://github.com/pytorch/ao/pull/517
  • Fixed cuda device check https://github.com/pytorch/ao/pull/536
  • Weight shuffling now runs on CPU for int4 quantization due to a MPS memory issue https://github.com/pytorch/ao/pull/552
  • Scale and input now are the same dtype for int8 weight only quantization https://github.com/pytorch/ao/pull/534
  • Fixed FP6-LLM API https://github.com/pytorch/ao/pull/595

Performance

  • Added segment-anything-fast benchmarks for composed quantization + sparsity https://github.com/pytorch/ao/pull/457
  • Updated low-bit Adam benchmark https://github.com/pytorch/ao/pull/481

Docs

  • Updated README.md https://github.com/pytorch/ao/pull/583 https://github.com/pytorch/ao/pull/438 https://github.com/pytorch/ao/pull/445 https://github.com/pytorch/ao/pull/460
  • Updated installation instructions https://github.com/pytorch/ao/pull/447 https://github.com/pytorch/ao/pull/459
  • Added more docs for int4_weight_only API https://github.com/pytorch/ao/pull/469
  • Added developer guide notebook https://github.com/pytorch/ao/pull/588
  • Added optimized model serialization/deserialization doc https://github.com/pytorch/ao/pull/524 https://github.com/pytorch/ao/pull/525
  • Added new float8 feature tracker https://github.com/pytorch/ao/pull/557
  • Added static quantization tutorial for calibration-based techniques https://github.com/pytorch/ao/pull/487

Devs

  • Fix numpy version in CI https://github.com/pytorch/ao/pull/537
  • trymerge now uploads merge records to s3 https://github.com/pytorch/ao/pull/448
  • Updated python version to 3.9 https://github.com/pytorch/ao/pull/488
  • torchao no longer depends on torch https://github.com/pytorch/ao/pull/449
  • benchmark_model now accepts args and kwargs and supports cpu and mps backends https://github.com/pytorch/ao/pull/586 https://github.com/pytorch/ao/pull/406
  • Add git version suffix to package name https://github.com/pytorch/ao/pull/547
  • Added validations to torchao https://github.com/pytorch/ao/pull/453 https://github.com/pytorch/ao/pull/454
  • Parallel test support with pytest-xdist https://github.com/pytorch/ao/pull/518
  • Quantizer now uses logging instead of print https://github.com/pytorch/ao/pull/472

Not user facing

  • Refactored _replace_linear_8da4w https://github.com/pytorch/ao/pull/451
  • Remove unused code from AQT implementation https://github.com/pytorch/ao/pull/476 https://github.com/pytorch/ao/pull/440 https://github.com/pytorch/ao/pull/441 https://github.com/pytorch/ao/pull/471
  • Improved error message for lm_eval script https://github.com/pytorch/ao/pull/444
  • Updated HF_TOKEN env variable https://github.com/pytorch/ao/pull/427
  • Fixed typo in Quant-LLM https://github.com/pytorch/ao/pull/450
  • Add a test for map_location="cpu" https://github.com/pytorch/ao/pull/497
  • Removed sparse test collection warning https://github.com/pytorch/ao/pull/489
  • Refactored layout implementation https://github.com/pytorch/ao/pull/491
  • Refactored LinearActQuantizedTensor https://github.com/pytorch/ao/pull/542

New Contributors

  • @qingquansong made their first contribution in https://github.com/pytorch/ao/pull/433
  • @Hanxian97 made their first contribution in https://github.com/pytorch/ao/pull/451
  • @larryliu0820 made their first contribution in https://github.com/pytorch/ao/pull/472
  • @SLR722 made their first contribution in https://github.com/pytorch/ao/pull/480
  • @jainapurva made their first contribution in https://github.com/pytorch/ao/pull/406
  • @bdhirsh made their first contribution in https://github.com/pytorch/ao/pull/544
  • @yanbing-j made their first contribution in https://github.com/pytorch/ao/pull/517
  • @manuelcandales made their first contribution in https://github.com/pytorch/ao/pull/552
  • @Valentine233 made their first contribution in https://github.com/pytorch/ao/pull/534

Full Changelog: https://github.com/pytorch/ao/compare/v0.3.1-rc1...v0.4.0-rc1

We were able to close about 60% of tasks for 0.4.0, which will now spill over into upcoming releases. We will post a list for 0.5.0 next, which we aim to release at the end of August 2024. We want to follow a monthly release cadence until further notice.

- Python
Published by jcaip almost 2 years ago

torchao - v0.3.1

v0.3.1

Highlights

We are excited to announce the 0.3 release of torchao! This release adds support for a new quantize API, MX format, FP6 dtype and bitpacking, 2:4 sparse accelerated training and benchmarking infra for llama2/llama3 models.

quantize API (https://github.com/pytorch/ao/pull/256)

We added a tensor subclass based quantization API; see the docs and README for usage details. It is planned to replace all existing quantization APIs in torchao for torch 2.4 and later.

Accelerated training with 2:4 sparsity (#184)

You can now accelerate training with 2:4 sparsity, using the runtime pruning + compression kernels written by xFormers. These kernels process a 4x4 sub-tile to be 2:4 sparse in both directions, to handle both the forward and backward pass when training. We see a 1.3x speedup for the MLP layers of ViT-L across a forward and backwards pass.
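
As a rough illustration of what the 2:4 pattern means (this is not the xFormers kernels, which run on the GPU and also handle the transposed direction), the sketch below keeps the two largest-magnitude values in every group of four along a row and zeroes the rest. The function name `prune_2_4` is hypothetical.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in each group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes; ties resolve to the lower index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]
print(prune_2_4(row))
```

Every group of four in the result has at most two non-zero entries, which is the structure that sparse tensor cores can exploit.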

MX support (https://github.com/pytorch/ao/pull/264)

We added prototype support for the MX formats for training and inference, with a reference native PyTorch implementation of the training and inference primitives needed for MX accelerated matrix multiplications. The MX numerical formats are new low precision formats recently accepted into the OCP spec: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
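
The core idea of a microscaling format, a block of low precision elements sharing one power-of-two scale, can be sketched in plain Python. This is a simplified illustration and not the torchao prototype: it uses integer elements with an assumed maximum of 127 and a short block, and the names `mx_quantize_block` and `mx_dequantize_block` are hypothetical. Consult the OCP spec linked above for the actual block size and element formats.

```python
import math


def mx_quantize_block(block, elem_max=127):
    """Quantize one block to integers that share a single power-of-two scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent whose scale keeps the largest magnitude within elem_max.
    shared_exp = math.ceil(math.log2(amax / elem_max))
    scale = 2.0 ** shared_exp
    elements = [max(-elem_max, min(elem_max, round(x / scale))) for x in block]
    return shared_exp, elements


def mx_dequantize_block(shared_exp, elements):
    """Recover approximate values from the shared exponent and integer elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


block = [0.5, -1.0, 0.25, 3.0]
shared_exp, elements = mx_quantize_block(block)
print(shared_exp, elements, mx_dequantize_block(shared_exp, elements))
```

Because only one exponent is stored per block, the per-element storage cost stays close to the element width while the block can still cover a wide dynamic range.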

Benchmarking (https://github.com/pytorch/ao/pull/276, https://github.com/pytorch/ao/pull/374)

We added a stable way to benchmark llama2 and llama3 models that includes perf/accuracy comparisons. See torchao/_models/llama/benchmarks.sh for more details.

🌟 💥 Community Contributions 🌟 💥

FP6 support (https://github.com/pytorch/ao/pull/279, https://github.com/pytorch/ao/pull/283, https://github.com/pytorch/ao/pull/358)

@gau-nernst added support for the FP6 dtype and a mixed FP16 x FP6 matmul kernel with support for torch.compile. Benchmark results show a 2.3x speedup over the BF16 baseline for meta-llama/Llama-2-7b-chat-hf.

Bitpacking (https://github.com/pytorch/ao/pull/307, https://github.com/pytorch/ao/pull/282)

@vayuda, @melvinebenezer, @CoffeeVampir3 and @andreaskoepf added support for packing/unpacking lower bit dtypes, leveraging torch.compile to generate the kernels, and added UInt2 and Bitnet tensors based on this approach.
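
To make the packing idea concrete, here is a minimal pure-Python sketch that stores four 2-bit values in one byte and recovers them again. It only shows the bit layout and is not the torchao implementation, which works on tensors and relies on torch.compile for the kernels. The function names and the choice to place the first value in the lowest bits are assumptions made for illustration.

```python
def pack_uint2(values):
    """Pack 2-bit integers (0..3) into bytes, four values per byte."""
    if len(values) % 4 != 0:
        raise ValueError("number of values must be a multiple of 4")
    packed = bytearray()
    for start in range(0, len(values), 4):
        byte = 0
        for slot, v in enumerate(values[start:start + 4]):
            if not 0 <= v <= 3:
                raise ValueError("each value must fit in 2 bits")
            byte |= v << (2 * slot)  # first value lands in the lowest two bits
        packed.append(byte)
    return bytes(packed)


def unpack_uint2(packed):
    """Inverse of pack_uint2: expand each byte back into four 2-bit values."""
    return [(byte >> (2 * slot)) & 0b11 for byte in packed for slot in range(4)]


values = [1, 2, 3, 0, 3, 3, 0, 1]
packed = pack_uint2(values)
print(list(packed), unpack_uint2(packed))
```

Eight values occupy two bytes here instead of eight, which is the 4x storage saving that motivates sub-byte dtypes.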

FP8 split-gemm kernel https://github.com/pytorch/ao/pull/263

Added the kernel written by @AdnanHoque to torchao with speedups compared to the cuBLAS kernel for batch size <=16

BC Breaking

Deprecations

  • Deprecate top level quantization APIs https://github.com/pytorch/ao/pull/344

1. int8 weight only quantization

apply_weight_only_int8_quant(model) or change_linear_weights_to_int8_woqtensors(model)

-->

```python
# for torch 2.4+
from torchao.quantization import quantize, int8_weight_only
quantize(model, int8_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_woqtensors
change_linear_weights_to_int8_woqtensors(model)
```

2. int8 dynamic quantization

apply_dynamic_quant(model) or change_linear_weights_to_int8_dqtensors(model)

-->

```python
import torch

# Fuse the int8*int8 -> int32 matmul and subsequent mul op avoiding materialization of the int32 intermediary tensor
torch._inductor.config.force_fuse_int_mm_with_mul = True

# for torch 2.4+
from torchao.quantization import quantize, int8_dynamic_activation_int8_weight
quantize(model, int8_dynamic_activation_int8_weight())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int8_dqtensors
change_linear_weights_to_int8_dqtensors(model)
```

3. int4 weight only quantization

change_linear_weights_to_int4_woqtensors(model)

-->

```python
# for torch 2.4+
from torchao.quantization import quantize, int4_weight_only
quantize(model, int4_weight_only())

# for torch 2.2.2 and 2.3
from torchao.quantization.quant_api import change_linear_weights_to_int4_woqtensors
change_linear_weights_to_int4_woqtensors(model)
```
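
For readers who want a picture of what weight-only int8 quantization does numerically, the following pure-Python sketch quantizes each row of a weight matrix symmetrically to int8 with one scale per row. It is a simplified illustration under assumed conventions (symmetric range of -127..127, per-row scales) and not torchao's implementation. The function names are hypothetical.

```python
def quantize_int8_per_row(weight):
    """Symmetric int8 quantization with one floating point scale per row."""
    q_rows, scales = [], []
    for row in weight:
        amax = max(abs(v) for v in row)
        scale = amax / 127 if amax > 0 else 1.0  # avoid dividing by zero for all-zero rows
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_row(q_rows, scales):
    """Multiply each int8 row by its scale to recover approximate weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[12.7, -6.4, 0.1], [0.0, 0.0, 0.0]]
q_rows, scales = quantize_int8_per_row(weight)
print(q_rows, scales, dequantize_per_row(q_rows, scales))
```

The activations stay in their original dtype in a weight-only scheme; only the stored weights shrink, and they are rescaled when used.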

New Features

  • Add quantize https://github.com/pytorch/ao/pull/256
  • Add a prototype of MX format training and inference https://github.com/pytorch/ao/pull/264
  • [FP6-LLM] Port splitK map from DeepSpeed https://github.com/pytorch/ao/pull/283
  • Improve FP6-LLM 2+4bit weight splitting + user API https://github.com/pytorch/ao/pull/279
  • Bitpacking https://github.com/pytorch/ao/pull/291
  • training acceleration via runtime semi-structured sparsity https://github.com/pytorch/ao/pull/184
  • Bitpackingv2 https://github.com/pytorch/ao/pull/307
  • Add FP6-LLM doc and move FP6-LLM to prototype https://github.com/pytorch/ao/pull/358
  • Added first bits of Uint2Tensor and BitnetTensor https://github.com/pytorch/ao/pull/282

Improvements

  • Improve primitives for FP6 quant https://github.com/pytorch/ao/pull/248
  • Extract eval code from GPTQ for more general usage https://github.com/pytorch/ao/pull/275
  • Factor out the specific configurations to helper functions https://github.com/pytorch/ao/pull/286
  • Add support for AQTLayout, PlainAQTLayout and TensorCoreTiledAQTLayout https://github.com/pytorch/ao/pull/278
  • Graceful handling of cpp extensions https://github.com/pytorch/ao/pull/296
  • Refactor int8 dynamic quantization with call to quantize https://github.com/pytorch/ao/pull/294
  • [NF4][FSDP] return contiguous quantization_factor https://github.com/pytorch/ao/pull/298
  • Refactor int4 and int8 weight only quantization to use quantize https://github.com/pytorch/ao/pull/301
  • Adding a quick way for users to test model eval for hf models https://github.com/pytorch/ao/pull/328
  • Wrap torch.ops.quantized_decomposed to improve import errors https://github.com/pytorch/ao/pull/310
  • [NF4Tensor] Switch to save for backward since are now a tensor input https://github.com/pytorch/ao/pull/323
  • Refactor rest of tinygemm quant primitive ops https://github.com/pytorch/ao/pull/321
  • Move some util functions from quantization.utils to torchao.utils https://github.com/pytorch/ao/pull/337
  • Clean up FP6-LLM https://github.com/pytorch/ao/pull/304
  • Move quant ops to utils.py https://github.com/pytorch/ao/pull/331
  • FP6-LLM clean up (again) https://github.com/pytorch/ao/pull/339
  • Improving hf_eval.py https://github.com/pytorch/ao/pull/342
  • Generalize Model Size Code https://github.com/pytorch/ao/pull/364
  • Minor upgrades to bit pack https://github.com/pytorch/ao/pull/347
  • Factor out dispatch and layout registration table https://github.com/pytorch/ao/pull/360
  • Add register_apply_tensor_subclass https://github.com/pytorch/ao/pull/366
  • Refactor custom FPx cast https://github.com/pytorch/ao/pull/363
  • Remove all dependencies except torch https://github.com/pytorch/ao/pull/369
  • Enable a test for loading state_dict with tensor subclasses https://github.com/pytorch/ao/pull/389
  • 073 scripts for benchmarks https://github.com/pytorch/ao/pull/372
  • Add WOQ int8 test with Inductor Freeze https://github.com/pytorch/ao/pull/362
  • Benchmarking updates for semi-structured sparse training https://github.com/pytorch/ao/pull/398
  • add FSDP QLoRA test and revert failing PR https://github.com/pytorch/ao/pull/403
  • Refactor the API for quant method argument for quantize function https://github.com/pytorch/ao/pull/400
  • eval script fixes https://github.com/pytorch/ao/pull/414

Bug Fixes

  • Fixed the HQQ import skip https://github.com/pytorch/ao/pull/262
  • fixing autoquant bug https://github.com/pytorch/ao/pull/265
  • Fix eval import after #275 https://github.com/pytorch/ao/pull/290
  • Fixed f-string printing of NF4Tensors https://github.com/pytorch/ao/pull/297
  • Check and fix dequantize_affine is idempotent https://github.com/pytorch/ao/pull/309
  • Update old pretrained TorchVision API in ao tutorials (#313) https://github.com/pytorch/ao/pull/314
  • Fix dimension issues for int4 weight only quant path https://github.com/pytorch/ao/pull/330
  • Fix compile in hf_eval.py https://github.com/pytorch/ao/pull/341
  • task_list to tasks in hf_eval https://github.com/pytorch/ao/pull/343
  • fixing peak memory stats for benchmark https://github.com/pytorch/ao/pull/353
  • Fix inductor config BC change https://github.com/pytorch/ao/pull/382
  • fixing scripts https://github.com/pytorch/ao/pull/395

Performance

  • FP8 splitgemm user defined triton kernel https://github.com/pytorch/ao/pull/263
  • sparse benchmarking numbers https://github.com/pytorch/ao/pull/303
  • Fix FP6-LLM benchmark https://github.com/pytorch/ao/pull/312
  • Adding Llama to TorchAO https://github.com/pytorch/ao/pull/276
  • Generalize Model Size Code https://github.com/pytorch/ao/pull/364
  • eval script for llama https://github.com/pytorch/ao/pull/374
  • 077 autoquant gpt fast https://github.com/pytorch/ao/pull/361

Docs

  • add static folder for images + fix links https://github.com/pytorch/ao/pull/271
  • Fix Readme and remove unused kernel https://github.com/pytorch/ao/pull/270
  • Kernel docs https://github.com/pytorch/ao/pull/274
  • Quantization Docstrings https://github.com/pytorch/ao/pull/273
  • Add AffineQuantizedTensor based workflow doc and examples https://github.com/pytorch/ao/pull/277
  • Add AUTOQUANT_CACHE docs for reusing the same quantization plan https://github.com/pytorch/ao/pull/329
  • Update nightly build instructions https://github.com/pytorch/ao/pull/334
  • add link to benchmarking script https://github.com/pytorch/ao/pull/355
  • New README https://github.com/pytorch/ao/pull/392
  • Minor README updates https://github.com/pytorch/ao/pull/401
  • Add quantize to doc page https://github.com/pytorch/ao/pull/367
  • Add link to new custom op tutorial https://github.com/pytorch/ao/pull/424

Devs

  • ci: Add push trigger for binary build workflows https://github.com/pytorch/ao/pull/259
  • Make fp8 test explicit https://github.com/pytorch/ao/pull/266
  • Move AffineQuantizedTensor to torchao/dtypes https://github.com/pytorch/ao/pull/272
  • Add suffix to package version https://github.com/pytorch/ao/pull/293
  • Re-enable AOTI tests https://github.com/pytorch/ao/pull/212
  • Add fused QKV HQQ triton_mm test https://github.com/pytorch/ao/pull/306
  • Pin CUDA nightly to mitigate regression https://github.com/pytorch/ao/pull/322
  • Unpin CUDA nightly https://github.com/pytorch/ao/pull/333
  • Add architecture to index postfix for nightly builds https://github.com/pytorch/ao/pull/336
  • Update regression test to python 3.8 https://github.com/pytorch/ao/pull/340
  • Remove test_ops.py warning spew https://github.com/pytorch/ao/pull/267
  • Add torchao.__version__ https://github.com/pytorch/ao/pull/359
  • make torchao test discovery pass in fbcode https://github.com/pytorch/ao/pull/351
  • use pytorch version env variable https://github.com/pytorch/ao/pull/373
  • Update prebuildscript.sh https://github.com/pytorch/ao/pull/390
  • Add support for building CUDA extension on Windows https://github.com/pytorch/ao/pull/396
  • Add trymerge https://github.com/pytorch/ao/pull/388
  • Fix github CI error https://github.com/pytorch/ao/pull/409
  • Fix missing dependencies in trymerge workflow https://github.com/pytorch/ao/pull/413
  • Setup trymerge secrets https://github.com/pytorch/ao/pull/416
  • Pin CUDA nightlies for mx failures https://github.com/pytorch/ao/pull/428
  • fix mx triton kernel after PyTorch triton pin change https://github.com/pytorch/ao/pull/431

Untopiced

  • Print the code when the check failed https://github.com/pytorch/ao/pull/254
  • Retry of D58015187 Move AsyncCompile to a different file by @jamesjwu in https://github.com/pytorch/ao/pull/302
  • Revert "Clean up FP6-LLM" https://github.com/pytorch/ao/pull/338
  • Update version to 0.3.0 https://github.com/pytorch/ao/pull/348
  • Add torchao.__version__ https://github.com/pytorch/ao/pull/359

New Contributors

  • @seemethere made their first contribution in https://github.com/pytorch/ao/pull/259
  • @yiliu30 made their first contribution in https://github.com/pytorch/ao/pull/262
  • @vkuzo made their first contribution in https://github.com/pytorch/ao/pull/264
  • @vayuda made their first contribution in https://github.com/pytorch/ao/pull/291
  • @awgu made their first contribution in https://github.com/pytorch/ao/pull/297
  • @jamesjwu made their first contribution in https://github.com/pytorch/ao/pull/302
  • @kit1980 made their first contribution in https://github.com/pytorch/ao/pull/314
  • @RobinKa made their first contribution in https://github.com/pytorch/ao/pull/329
  • @andreaskoepf made their first contribution in https://github.com/pytorch/ao/pull/282
  • @clee2000 made their first contribution in https://github.com/pytorch/ao/pull/388

Full Changelog: https://github.com/pytorch/ao/compare/v0.2.0...v0.3.0-rc1

We were able to close about 60% of tasks for 0.3.0, which will now spill over into upcoming releases. We will post a list for 0.4.0 next, which we aim to release at the end of July 2024. We want to follow a monthly release cadence until further notice.

EDIT: We made a patch release for 0.3.1 to include 2 more PRs so now ao has no runtime dependencies https://github.com/pytorch/ao/pull/449 and https://github.com/pytorch/ao/pull/455

- Python
Published by supriyar almost 2 years ago

torchao - v0.2.0

What's Changed

Highlights

Custom CPU/CUDA extension to ship CPU/CUDA binaries.

PyTorch core recently shipped a new custom op registration mechanism in torch.library. Its main benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably by not causing graph breaks with torch.compile().

We've added documentation on how to register your own custom ops at https://github.com/pytorch/ao/tree/main/torchao/csrc. If you learn better by example, you can follow PR https://github.com/pytorch/ao/pull/135 to add your own custom ops to torchao.

Most notably these instructions were leveraged by @gau-nernst to integrate some new custom ops for fp6 support https://github.com/pytorch/ao/pull/223

One key benefit of integrating your kernels directly in torchao is that, thanks to our manylinux GPU support, we can ensure the CPU/CUDA kernels you add work on as many devices and CUDA versions as possible https://github.com/pytorch/ao/pull/176

A lot of prototype and community contributions

@jeromeku was our community champion, merging support for:

1. GaLore, our first pretraining kernel, which allows you to finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch

2. DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly https://github.com/pytorch/ao/tree/main/torchao/prototype/dora

3. Fused int4/fp16 quantized matmul, which is particularly useful for compute bound kernels, showing 4x speedups over tinygemm for larger batch sizes such as 512 https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq

@gau-nernst merged fp6 support showing up to 8x speedups on an fp16 baseline for small batch size inference https://github.com/pytorch/ao/pull/223

NF4 support for upcoming FSDP2

@weifengpy merged support for composing FSDP2 with NF4, which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for how to compose smaller dtypes with FSDP https://github.com/pytorch/ao/pull/150, most notably by implementing torch.chunk(). We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research, and that it inspires many more studies such as the ones done by Answer.ai https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html

BC breaking

Deprecations

New Features

  • Match autoquant API with torch.compile (https://github.com/pytorch/ao/pull/109, https://github.com/pytorch/ao/pull/162, https://github.com/pytorch/ao/pull/175)
  • [Prototype] 8da4w QAT (https://github.com/pytorch/ao/pull/138, https://github.com/pytorch/ao/pull/199, https://github.com/pytorch/ao/pull/198, https://github.com/pytorch/ao/pull/211, https://github.com/pytorch/ao/pull/154, https://github.com/pytorch/ao/pull/157, https://github.com/pytorch/ao/pull/229)
  • [Prototype] GaLore (https://github.com/pytorch/ao/pull/95)
  • [Prototype] DoRA (https://github.com/pytorch/ao/pull/216)
  • [Prototype] HQQ (https://github.com/pytorch/ao/pull/153, https://github.com/pytorch/ao/pull/185)
  • [Prototype] 2:4 sparse + int8 sparse subclass (https://github.com/pytorch/ao/pull/36)
  • [Prototype] Unified quantization primitives (https://github.com/pytorch/ao/pull/159, https://github.com/pytorch/ao/pull/201, https://github.com/pytorch/ao/pull/193, https://github.com/pytorch/ao/pull/220, https://github.com/pytorch/ao/pull/227, https://github.com/pytorch/ao/pull/173, https://github.com/pytorch/ao/pull/210)
  • [Prototype] Pruning primitives (https://github.com/pytorch/ao/pull/148, https://github.com/pytorch/ao/pull/194)
  • [Prototype] AffineQuantizedTensor subclass (https://github.com/pytorch/ao/pull/214, https://github.com/pytorch/ao/pull/230, https://github.com/pytorch/ao/pull/243, https://github.com/pytorch/ao/pull/247, https://github.com/pytorch/ao/pull/251)
  • [Prototype] Add Int4WeightOnlyQuantizer (https://github.com/pytorch/ao/pull/119)
  • Custom CUDA extensions (https://github.com/pytorch/ao/pull/135, https://github.com/pytorch/ao/pull/186, https://github.com/pytorch/ao/pull/232)
  • [Prototype] Add FP6 Linear (https://github.com/pytorch/ao/pull/223)

Improvements

  • FSDP2 support for NF4Tensor (https://github.com/pytorch/ao/pull/118, https://github.com/pytorch/ao/pull/150, https://github.com/pytorch/ao/pull/207)
  • Add save/load of int8 weight only quantized model (https://github.com/pytorch/ao/pull/122)
  • Add int_scaled_mm on CPU (https://github.com/pytorch/ao/pull/121)
  • Add cpu and gpu in int4wo and int4wo-gptq quantizer (https://github.com/pytorch/ao/pull/131)
  • Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (https://github.com/pytorch/ao/pull/146, https://github.com/pytorch/ao/pull/226, https://github.com/pytorch/ao/pull/213)
  • Remove is_gpt_fast specialization from GPTQ (https://github.com/pytorch/ao/pull/172)
  • Common benchmark and profile utils (https://github.com/pytorch/ao/pull/238)

Bug fixes

  • Fix padding in GPTQ (https://github.com/pytorch/ao/pull/119, https://github.com/pytorch/ao/pull/120)
  • Fix Int8DynActInt4WeightLinear module swap (https://github.com/pytorch/ao/pull/151)
  • Fix NF4Tensor.to to use device kwarg (https://github.com/pytorch/ao/pull/158)
  • Fix quantize_activation_per_token_absmax perf regression (https://github.com/pytorch/ao/pull/253)

Performance

  • Chunk NF4Tensor construction to reduce memory spike (https://github.com/pytorch/ao/pull/196)
  • Fix intmm benchmark script (https://github.com/pytorch/ao/pull/141)

Docs

  • Update READMEs (https://github.com/pytorch/ao/pull/140, https://github.com/pytorch/ao/pull/142, https://github.com/pytorch/ao/pull/169, https://github.com/pytorch/ao/pull/155, https://github.com/pytorch/ao/pull/179, https://github.com/pytorch/ao/pull/187, https://github.com/pytorch/ao/pull/188, https://github.com/pytorch/ao/pull/200, https://github.com/pytorch/ao/pull/217, https://github.com/pytorch/ao/pull/245)
  • Add https://pytorch.org/ao (https://github.com/pytorch/ao/pull/136, https://github.com/pytorch/ao/pull/145, https://github.com/pytorch/ao/pull/163, https://github.com/pytorch/ao/pull/164, https://github.com/pytorch/ao/pull/165, https://github.com/pytorch/ao/pull/168, https://github.com/pytorch/ao/pull/177, https://github.com/pytorch/ao/pull/195, https://github.com/pytorch/ao/pull/224)

CI

  • Add A10G support in CI (https://github.com/pytorch/ao/pull/176)
  • General CI improvements (https://github.com/pytorch/ao/pull/161, https://github.com/pytorch/ao/pull/171, https://github.com/pytorch/ao/pull/178, https://github.com/pytorch/ao/pull/180, https://github.com/pytorch/ao/pull/183, https://github.com/pytorch/ao/pull/107, https://github.com/pytorch/ao/pull/215, https://github.com/pytorch/ao/pull/244, https://github.com/pytorch/ao/pull/257, https://github.com/pytorch/ao/pull/235, https://github.com/pytorch/ao/pull/242)
  • Add expecttest to requirements.txt (https://github.com/pytorch/ao/pull/225)
  • Push button binary support (https://github.com/pytorch/ao/pull/241, https://github.com/pytorch/ao/pull/240, https://github.com/pytorch/ao/pull/250)

Not user facing

Security

Untopiced

  • Version bumps (https://github.com/pytorch/ao/pull/125, https://github.com/pytorch/ao/pull/234)
  • Don't import _C in fbcode (https://github.com/pytorch/ao/pull/218)

New Contributors

  • @Xia-Weiwen made their first contribution in https://github.com/pytorch/ao/pull/121
  • @jeromeku made their first contribution in https://github.com/pytorch/ao/pull/95
  • @weifengpy made their first contribution in https://github.com/pytorch/ao/pull/118
  • @aakashapoorv made their first contribution in https://github.com/pytorch/ao/pull/179
  • @UsingtcNower made their first contribution in https://github.com/pytorch/ao/pull/194
  • @Jokeren made their first contribution in https://github.com/pytorch/ao/pull/217
  • @gau-nernst made their first contribution in https://github.com/pytorch/ao/pull/223
  • @janeyx99 made their first contribution in https://github.com/pytorch/ao/pull/245
  • @huydhn made their first contribution in https://github.com/pytorch/ao/pull/250
  • @lancerts made their first contribution in https://github.com/pytorch/ao/pull/238

Full Changelog: https://github.com/pytorch/ao/compare/v0.2.0...v0.2.1

We were able to close about half of tasks for 0.2.0, which will now spill over into upcoming releases. We will post a list for 0.3.0 next, which we aim to release at the end of May 2024. We want to follow a monthly release cadence until further notice.

- Python
Published by cpuhrsch about 2 years ago

torchao - TorchAO 0.1.0: First Release

Highlights

We’re excited to announce the release of TorchAO v0.1.0! TorchAO is a repository that hosts architecture optimization techniques such as quantization and sparsity, along with performance kernels for different backends such as CUDA and CPU. In this release, we added support for quantization techniques such as int4 weight only GPTQ quantization, nf4 dtype support for QLoRA, and sparsity features such as WandaSparsifier. We also added an autotuner that can tune Triton integer matrix multiplication kernels on CUDA.

Note: TorchAO is currently in a pre-release state and under extensive development. The public APIs should not be considered stable. But we welcome you to try out our APIs and offerings and provide any feedback on your experience.

torchao 0.1.0 will be compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0 and TorchTune 0.1.0.

New Features

Quantization

  • Added tensor subclass based quantization APIs: change_linear_weights_to_int8_dqtensors, change_linear_weights_to_int8_woqtensors and change_linear_weights_to_int4_woqtensors (#1)
  • Added module based quantization APIs for int8 dynamic and weight only quantization apply_weight_only_int8_quant and apply_dynamic_quant (#1)
  • Added module swap version of int4 weight only quantization Int4WeightOnlyQuantizer and Int4WeightOnlyGPTQQuantizer used in TorchTune (#119, #116)
  • Added int8 dynamic activation and int4 weight quantization Int8DynActInt4WeightQuantizer and Int8DynActInt4WeightGPTQQuantizer, used in ExecuTorch (#74) (available in torch 2.3.0 and later)

Sparsity

  • Added WandaSparsifier that prunes both weights and activations (#22)

Kernels

  • Added autotuner for int mm Triton kernels (#41)

dtypes

  • nf4 tensor subclass and nf4 linear (#37, #40, #62)
  • Added uint4 dtype tensor subclass (#13)

Improvements

  • Setup github workflow for regression testing (#50)
  • Setup github workflow for torchao-nightly release (#54)

Documentation

  • Added tutorials for quantizing vision transformer model (#60)
  • Added tutorials for how to add an op for nf4 tensor (#54)

Notes

  • We are still debugging the accuracy problem for Int8DynActInt4WeightGPTQQuantizer
  • Save and load does not work well for tensor subclass based APIs yet
  • We will consolidate tensor subclass and module swap based quantization APIs later
  • uint4 tensor subclass is going to be merged into pytorch core in the future
  • Quantization ops in quant_primitives.py will be deduplicated with similar quantize/dequantize ops in PyTorch later

- Python
Published by jerryzh168 about 2 years ago