Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ✓ Academic publication links: arxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.2%) to scientific vocabulary
Repository
Bag of Tricks for NN Quantization
Basic Info
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Bag of Tricks for NN Quantization
Neural network quantization is the process of compressing the weights of a neural network into a smaller number representation. This makes the network smaller, both on disk and in memory, and can make computation cheaper on accelerators, typically by using small integer coefficients. At the same time, it reduces the precision of the computations, so careful algorithm design is necessary to maintain good quality.
This repository contains tools to research post-training neural network quantization, with methods to improve over the current state of the art. It is purely for analysis purposes: complete implementations will be made available in other repositories. Our main contributions are two simple improvements that are compatible with most quantization methods: an improved scaling method, and making better use of the bias during quantization.
Quantization method
Sleekit uses a very generic quantization method. The steps to quantize a layer are:
- gathering sample data: we run the network on some data samples to gather statistical information for each layer;
- choosing a codebook: a codebook gives a limited number of values that can be represented, and we round the weights to one of the values in the codebook;
- scaling the weights: we apply a scaling factor so that the weights are close to the chosen codebook;
- optimizing the weights: to maintain good quality for the neural network, we use a specialized algorithm to tweak the weights after rounding.
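As an illustration of the scaling and rounding steps, here is a minimal numpy sketch; the function name and the toy codebook are our own for illustration, not the repository's API:

```python
import numpy as np

def quantize_layer(weights, codebook, scale):
    """Round each scaled weight to its nearest codebook entry.

    `codebook` and `scale` are hypothetical inputs chosen beforehand
    (e.g. an NF4-style value grid and a per-layer scaling factor).
    """
    scaled = weights / scale
    # Nearest-neighbour rounding onto the codebook values.
    idx = np.abs(scaled[..., None] - codebook).argmin(axis=-1)
    return codebook[idx] * scale

# Toy example: a 4-value codebook and a single scaling factor.
codebook = np.array([-1.0, -0.25, 0.25, 1.0])
w = np.array([0.3, -0.9, 0.05, 1.4])
wq = quantize_layer(w, codebook, scale=1.5)
```

Each quantized weight can then be stored as a 2-bit index into the codebook, plus the shared scaling factor.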
Improvements
We present several generic improvements that can be applied to any quantization method. They target both the scaling step, to select better scaling factors, and the weight optimization step, to reduce the layer error.
Methodology: layer-per-layer analysis
To develop our methods we analyze the effect of quantization decisions on a per-layer basis. Although many previous works use network-level metrics, post-training quantization methods minimize the error at the layer level, so analyzing the error there is the natural approach. Moreover, network-level metrics tend to be noisy: they can hide small quantization errors or, on the contrary, be over-sensitive to some layers.
Our baseline for comparison is the GPTQ algorithm with 3-bit and 1.5-bit weights. We use the parameters given by GPTQ for the heuristic (diagonal ordering and 1% dampening). For the layer weights and metrics, we use layer statistics from a full-accuracy run on several smaller networks (OPT-125M, OPT-350M, BLOOM-560M). We compare the error introduced by quantization with and without our methods.
Trick 1: better scaling
A good scaling factor minimizes the error introduced by quantization. The typical method is to choose a scaling factor that minimizes the mean squared error (MSE) on the weights. We introduce a more precise approach that optimizes the layer's output directly.
For weight optimization, we already have access to an accurate measure of the layer's error (the hessian matrix $H$ obtained from input samples). Our idea is to reuse it for scaling optimization. We test four different approaches to scaling, and compare the layer error after applying GPTQ:
- minimizing the mean squared error after round-to-nearest;
- using the full hessian matrix to compute the error, which is computationally expensive;
- using the diagonal of the hessian matrix to compute the error, which has the same computational cost as the MSE;
- using the full weight optimization to compute the error for each scaling value, which is extremely expensive but theoretically optimal.
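To make the cost comparison concrete, here is a hedged sketch of scale selection by grid search, using either the plain MSE or a diagonal-hessian-weighted error; the function names, candidate grid, and random data are illustrative assumptions, not Sleekit's interface:

```python
import numpy as np

def scale_error(w, codebook, scale, diag_h=None):
    """Quantization error for one candidate scaling factor.

    With `diag_h=None` this is the plain MSE on the weights; passing the
    diagonal of the hessian weights each coordinate by its importance,
    at the same O(n) cost per candidate scale.
    """
    q = codebook[np.abs(w[:, None] / scale - codebook).argmin(axis=1)] * scale
    err = (w - q) ** 2
    return err.sum() if diag_h is None else (diag_h * err).sum()

def best_scale(w, codebook, candidates, diag_h=None):
    # Simple grid search over candidate scaling factors.
    return min(candidates, key=lambda s: scale_error(w, codebook, s, diag_h))

rng = np.random.default_rng(0)
w = rng.normal(size=64)
diag_h = rng.uniform(0.1, 10.0, size=64)   # stand-in for diag(H)
codebook = np.array([-1.0, -0.33, 0.33, 1.0])
candidates = np.linspace(0.5, 3.0, 26)
s_mse = best_scale(w, codebook, candidates)
s_diag = best_scale(w, codebook, candidates, diag_h)
```

The two searches differ only in the error measure, which is why the diagonal variant costs the same as the MSE one.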

The usual approach of minimizing the MSE yields results that are far from optimal. Using the full hessian matrix or its diagonal yields similar results, on average much better than the MSE alone. Results remain far from the theoretical optimum, however, and are even slightly degraded for some layers, leaving room for improvement.
Trick 2: combining with bias correction
Bias correction is a method used to reduce the impact of quantization on a layer. Newer quantization methods perform much better on their own, so bias correction has fallen out of use. However, it is compatible with them, and there is no reason not to combine both. The effect of bias correction can even be integrated in the cost function used for weight optimization, using $H=\frac{1}{n} X^\intercal X - M^\intercal M$, where $X$ are the input samples and $M = \frac{1}{n}1^\intercal X$ is the average value of the samples for each input.
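The modified hessian can be computed directly from the input samples. A small sketch of the formula above (the function name is ours); note that it is exactly the biased covariance matrix of the inputs:

```python
import numpy as np

def corrected_hessian(X):
    """H = (1/n) X^T X - M^T M, folding bias correction into the
    layer cost. Equivalent to the (biased) covariance of the inputs."""
    n = X.shape[0]
    M = X.mean(axis=0, keepdims=True)   # 1 x d row of per-input means
    return X.T @ X / n - M.T @ M

X = np.random.default_rng(1).normal(size=(100, 8))
H = corrected_hessian(X)
```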
We test three different ways to update the bias:
- applying weight optimization alone (GPTQ) without bias correction;
- applying bias correction after weight optimization, yielding a slightly smaller layer error;
- taking the effect of bias correction into account during weight optimization.

Adding back bias correction greatly improves certain layers, in particular some attention layers in all networks. Unsurprisingly, it has more impact with more aggressive quantization, and yields better results when taken into account during weight optimization.
Trick 3: adding local search
The weight optimization problem is NP-hard and can only be solved at scale approximately. GPTQ provides a good heuristic for it; however, the heuristic of choice for obtaining good solutions to similar problems (QUBO) is a simple local search. For this reason, we test the effect of applying a few local search moves after GPTQ, in a best-first manner.
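A best-first local search on the quadratic layer error can be sketched as follows. This is an illustrative implementation under the error model $E = (w_q - w)^\intercal H (w_q - w)$ with single-weight moves onto the codebook, not the repository's code:

```python
import numpy as np

def local_search(w_q, w, H, codebook, moves=10):
    """Best-first local search: at each step, apply the single-weight
    change to another codebook value that reduces the quadratic error
    E = (w_q - w)^T H (w_q - w) the most, stopping when no move helps."""
    w_q = w_q.copy()
    d = w_q - w
    g = H @ d                       # H (w_q - w), kept up to date
    for _ in range(moves):
        # Cost change of setting weight i to value v, with delta = v - w_q[i]:
        #   2 * delta * g[i] + delta**2 * H[i, i]
        delta = codebook[None, :] - w_q[:, None]
        change = 2 * delta * g[:, None] + delta**2 * np.diag(H)[:, None]
        i, j = np.unravel_index(change.argmin(), change.shape)
        if change[i, j] >= -1e-12:
            break                   # no improving move left
        step = delta[i, j]
        w_q[i] += step
        d[i] += step
        g += H[:, i] * step         # incremental gradient update
    return w_q

rng = np.random.default_rng(2)
A = rng.normal(size=(50, 8))
H = A.T @ A / 50                    # PSD stand-in for the layer hessian
codebook = np.array([-1.0, -0.33, 0.33, 1.0])
w = rng.normal(size=8)
w_q = codebook[np.abs(w[:, None] - codebook).argmin(axis=1)]
w_opt = local_search(w_q, w, H, codebook, moves=10)
```

Each move is evaluated in closed form from the current gradient, so a handful of moves is cheap relative to the initial GPTQ pass.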

The effect of just a few local search moves is notable on many layers, and applying them after GPTQ can drastically reduce layer error.
Minor tricks
Other tricks yield smaller but useful improvements:
- Using a different ordering for GPTQ. GPTQ makes rounding decisions for the weights in a greedy manner, and obtains a good ordering by sorting the diagonal of the matrix in decreasing order. Instead, we multiply this value by the sum of squares of the quantization error; this better accounts for the effect of saturation.
- Using a different dampening for GPTQ. GPTQ does not behave well on ill-conditioned matrices, and adding a larger penalty term to the matrix paradoxically yields better results. The original paper uses a 1% penalty, but penalties of 3-10% behave better, while removing the penalty altogether degrades results significantly.
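The two tweaks above can be sketched as follows; the helper names are hypothetical, and the score follows the description above (diagonal of the hessian times the per-column summed squared rounding error):

```python
import numpy as np

def gptq_order(H, W, W_rounded):
    """Process input columns in decreasing order of
    diag(H) * sum of squared rounding errors of that column,
    instead of diag(H) alone as in the original GPTQ heuristic."""
    err2 = ((W - W_rounded) ** 2).sum(axis=0)
    return np.argsort(-(np.diag(H) * err2))

def dampened_hessian(H, damp=0.03):
    """Add a larger-than-default penalty (3% here vs GPTQ's 1%) of the
    mean diagonal to stabilize ill-conditioned hessians."""
    return H + damp * np.mean(np.diag(H)) * np.eye(H.shape[0])

rng = np.random.default_rng(3)
W = rng.normal(size=(16, 8))        # toy weight matrix (out x in)
W_rounded = np.round(W)             # stand-in for round-to-nearest
A = rng.normal(size=(50, 8))
H = A.T @ A / 50
order = gptq_order(H, W, W_rounded)
H_damp = dampened_hessian(H)
```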

The many tricks that do not work
The following approaches did not yield promising results and were abandoned:
- Improved codebooks: the data is far from Gaussian-distributed, but training a codebook naively on the data is no better than an NF4 codebook. Good codebook training needs to take the hessian (or its diagonal) into account.
- Entropy coding: it is tempting to combine codebook optimization with entropy coding to reduce storage needs. However, the gain in entropy is not large compared to an error-optimized codebook, and does not seem worth the effort.
- GPTQ reordering: clever heuristic orderings for GPTQ based on the hessian matrix do not reduce the layer error compared to using its diagonal as the original paper does. We tested several variations using the diagonal of the inverse and pivoted Cholesky decompositions.
- More complex algorithms for weight optimization: they simply do not scale, but if you want to go in this direction you probably want to use MQLib as a solver.
Putting it all together
Finally, we put these algorithms together in Sleekit:
- the hessian matrix is modified to represent the effect of bias correction;
- scaling is performed based on the hessian diagonal;
- weight optimization uses our slightly improved ordering and dampening.
At this point, the computational cost of the algorithm is no higher than GPTQ's: this is the "Sleekit light" version.
At the cost of additional computations we add the following for the "Sleekit heavy" version:
- scaling is performed based on a weight optimization computation;
- local search is performed during the final weight optimization for 1000 moves.

Most of the improvement is due to the better scaling method, but the various methods stack well. Together, they yield a reduced error on almost all layers. Moreover, a minority of layers experiences a huge improvement, with the error reduced by 80% or more. It is still unclear what the impact is for the neural network as a whole.
Numerical results
All results are available in the results folder. The geometric mean impact of each trick on the mean squared error against the default GPTQ is shown below.
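Assuming the reported figure is the geometric mean of per-layer error ratios expressed as a signed percentage (our reading of the metric, not confirmed against the repository's code), it could be computed as:

```python
import numpy as np

def geomean_impact(error_ratios):
    """Geometric mean of per-layer error ratios (error with the trick
    divided by the baseline GPTQ error), as a signed percentage:
    negative means the trick reduced the error on average."""
    return (np.exp(np.log(error_ratios).mean()) - 1.0) * 100.0

# One layer improved by 20% and one by 30% average to about -25.2%.
impact = geomean_impact(np.array([0.8, 0.7]))
```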
| Scaling method | 3b | 2b | 1.5b | 1b |
| -------------- | ------- | ------- | ------- | ------- |
| Diagonal | -20.25% | -16.66% | -15.52% | -7.78% |
| Hessian | -20.50% | -18.41% | -16.36% | -19.48% |
| Exhaustive | -29.68% | -29.35% | -26.03% | -30.64% |

| Correction method | 3b | 2b | 1.5b | 1b |
| ------------------- | ------- | ------- | ------- | ------- |
| After optimization | -1.72% | -4.11% | -5.01% | -10.78% |
| During optimization | -4.01% | -6.72% | -7.90% | -13.44% |

| Local search duration | 3b | 2b | 1.5b | 1b |
| --------------------- | ------- | ------- | ------- | ------- |
| 10 moves | -4.51% | -6.05% | -7.07% | -9.57% |
| 100 moves | -9.42% | -13.47% | -15.64% | -20.25% |

| Ordering | 3b | 2b | 1.5b | 1b |
| ------------------------ | ------ | ------ | ------ | ------ |
| Diagonal * Error | -0.57% | -0.62% | -0.59% | -0.50% |
| Diagonal * Squared Error | -1.95% | -1.69% | -1.35% | -1.40% |

| Dampening | 3b | 2b | 1.5b | 1b |
| ---------- | ------- | ------- | ------- | ------- |
| 0.001 | +2.52% | +3.50% | +2.72% | +3.17% |
| 0.003 | +1.29% | +1.73% | +1.57% | +1.63% |
| 0.03 | -0.91% | -1.49% | -1.54% | -1.91% |
| 0.1 | -0.03% | -1.91% | -2.14% | -3.86% |
| 0.3 | +5.42% | +1.42% | +0.45% | -3.67% |
| 1.0 | +19.78% | +12.48% | +10.08% | +1.47% |

| Method | 3b | 2b | 1.5b | 1b |
| ------------------------ | ------- | ------- | ------- | ------- |
| Correction only | -4.01% | -6.72% | -7.90% | -13.44% |
| Diagonal scaling only | -20.25% | -16.66% | -15.52% | -7.78% |
| Sleekit light | -25.04% | -23.90% | -22.43% | -20.50% |
| Sleekit heavy | -34.86% | -36.49% | -34.33% | -41.94% |

References
The algorithms in this repository build on the following works:
- Bias correction and GPTQ for the approach to weight quantization, as well as similar works such as AdaRound, AdaQuant, OBQ or GPTVQ;
- Lloyd and LBG for the choice of quantization grids;
- The GPTQ repository, used for data and testing.
Owner
- Name: Gabriel Gouvine
- Login: Coloquinte
- Kind: user
- Location: Edinburgh
- Company: AMD
- Repositories: 36
- Profile: https://github.com/Coloquinte
Citation (CITATION.cff)
cff-version: 1.2.0
title: Sleekit, Bag of Tricks for NN Quantization
type: software
authors:
- family-names: Gouvine
given-names: Gabriel
orcid: 0000-0003-3404-6659
repository-code: https://github.com/Coloquinte/Sleekit
keywords:
- neural network
- quantization
- inference
license: MIT
date-released: 2024-06-01
GitHub Events
Total
- Watch event: 3
- Delete event: 1
- Push event: 1
Last Year
- Watch event: 3
- Delete event: 1
- Push event: 1
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/checkout v4 composite
- actions/setup-python v5 composite
- numpy *
- torch *