https://github.com/bytedance/abq-llm

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary

Keywords

cuda llm-inference mlsys quantized-networks research
Last synced: 5 months ago

Repository

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

Basic Info
  • Host: GitHub
  • Owner: bytedance
  • License: apache-2.0
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 53.9 MB
Statistics
  • Stars: 221
  • Watchers: 5
  • Forks: 21
  • Open Issues: 14
  • Releases: 0
Topics
cuda llm-inference mlsys quantized-networks research
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

ABQ-LLM

ABQ-LLM is a novel arbitrary-bit quantization scheme that achieves excellent accuracy under various quantization settings while enabling efficient arbitrary-bit computation at inference time.

The current release supports the following features:
  • The ABQ-LLM algorithm for precise weight-only quantization (W8A16, W4A16, W3A16, W2A16) and weight-activation quantization (W8A8, W6A6, W4A4, W3A8, W3A6, W2A8, W2A6).
  • Pre-trained ABQ-LLM model weights for LLMs (LLaMA and LLaMA-2), ready to load and run as quantized models.
  • A set of out-of-the-box arbitrary-bit quantization operators that support arbitrary-bit model inference on Turing and later architectures.
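For intuition, here is a minimal sketch of what a WxAy setting simulates: weights and activations are each mapped to a small integer grid with a scale and zero-point, then dequantized. This is illustrative only (function and variable names are hypothetical), not the repository's implementation.

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int, channel_dim: int = 0) -> torch.Tensor:
    """Simulate asymmetric uniform quantization to n_bits with per-channel ranges (illustrative)."""
    qmax = 2 ** n_bits - 1
    dims = tuple(d for d in range(x.dim()) if d != channel_dim)   # reduce over all other dims
    xmin = x.amin(dim=dims, keepdim=True)
    xmax = x.amax(dim=dims, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    zero_point = torch.round(-xmin / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)  # n_bits-level integer grid
    return (q - zero_point) * scale                                # dequantize for comparison

w = torch.randn(4096, 4096)
print((w - fake_quantize(w, n_bits=4)).abs().mean())   # "W4": error shrinks as n_bits grows
print((w - fake_quantize(w, n_bits=2)).abs().mean())   # "W2": coarser grid, larger error
```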

Install

Installation of the algorithmic runtime environment

```
conda create -n abq-llm python=3.10.0 -y
conda activate abq-llm
git clone https://github.com/bytedance/ABQ-LLM.git
cd ./ABQ-LLM/algorithm
pip install --upgrade pip
pip install -r requirements.txt
```

Installation of the inference engine environment

You can compile and test our quantized inference kernels, but you first need to install the CUDA Toolkit.
  1. Install the CUDA Toolkit (11.8 or 12.1, Linux or Windows). Use the Express Installation option. Installation may require a restart on Windows.
  2. Clone CUTLASS (it is only used for speed comparison):
```
git submodule init
git submodule update
```

ABQ-LLM Model

We provide a pre-trained ABQ-LLM model zoo for multiple model families, including LLaMA-1&2 and OPT. The detailed support list:

| Models  | Sizes  | W4A16 | W3A16 | W2A16 | W2A16g128 | W2A16g64 |
| ------- | ------ | ----- | ----- | ----- | --------- | -------- |
| LLaMA   | 7B/13B | ✅    | ✅    | ✅    | ✅        | ✅       |
| LLaMA-2 | 7B/13B | ✅    | ✅    | ✅    | ✅        | ✅       |

| Models  | Sizes  | W8A8 | W4A8 | W6A6 | W4A6 | W4A4 | W3A8 | W3A6 | W2A8 | W2A6 |
| ------- | ------ | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| LLaMA   | 7B/13B | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   |
| LLaMA-2 | 7B/13B | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   | ✅   |

Usage

Algorithm Testing

We provide the pre-trained ABQ-LLM model weights on Hugging Face; you can verify model performance with the following command:
```
CUDA_VISIBLE_DEVICES=0 python run_pretrain_abq_model.py \
    --model /PATH/TO/LLaMA/llama-7b-ABQ \
    --wbits 4 --abits 4
```
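For reference, the perplexity reported by --eval_ppl is just the exponentiated average token negative log-likelihood on WikiText-2. A rough standalone sketch follows; it is not the repository's evaluation code and assumes the checkpoint loads through the standard transformers API, which may not hold for these quantized checkpoints.

```python
# Minimal perplexity sketch (illustrative; the repo's --eval_ppl path differs in detail).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/PATH/TO/LLaMA/llama-7b-ABQ"   # placeholder local path
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda().eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seqlen, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.shape[1] - seqlen, seqlen):
        chunk = ids[:, i : i + seqlen].cuda()
        nlls.append(model(chunk, labels=chunk).loss.float())  # mean NLL per token
print("ppl:", torch.exp(torch.stack(nlls).mean()).item())
```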

We also provide full scripts for running ABQ-LLM in ./algorithm/scripts/. We use LLaMA-7B as an example here:

  1. Obtain the channel-wise scales and shifts required for initialization:
```
python generate_act_scale_shift.py --model /PATH/TO/LLaMA/llama-7b
```

  2. Weight-only quantization:
```
# W3A16
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /PATH/TO/LLaMA/llama-7b \
    --epochs 20 --output_dir ./log/llama-7b-w3a16 \
    --eval_ppl --wbits 3 --abits 16 --lwc --let

# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /PATH/TO/LLaMA/llama-7b \
    --epochs 20 --output_dir ./log/llama-7b-w3a16g128 \
    --eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --let
```

  3. Weight-activation quantization:
```
# W4A4
CUDA_VISIBLE_DEVICES=0 python main.py \
    --model /PATH/TO/LLaMA/llama-7b \
    --epochs 20 --output_dir ./log/llama-7b-w4a4 \
    --eval_ppl --wbits 4 --abits 4 --lwc --let \
    --tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
```

More detailed and optional arguments:
  • --model: the local model path or Hugging Face model name.
  • --wbits: weight quantization bits.
  • --abits: activation quantization bits.
  • --group_size: group size of weight quantization. If not set, per-channel weight quantization is used by default (see the sketch after this list).
  • --lwc: activate Learnable Weight Clipping (LWC).
  • --let: activate the Learnable Equivalent Transformation (LET).
  • --lwc_lr: learning rate of the LWC parameters, 1e-2 by default.
  • --let_lr: learning rate of the LET parameters, 5e-3 by default.
  • --epochs: number of training epochs. Set it to 0 to evaluate pre-trained ABQ-LLM checkpoints.
  • --nsamples: number of calibration samples, 128 by default.
  • --eval_ppl: evaluate the perplexity of quantized models.
  • --tasks: evaluate zero-shot tasks.
  • --multigpu: run inference for larger networks on multiple GPUs.
  • --real_quant: real quantization, which reduces memory usage. Note that, due to limitations of the AutoGPTQ kernels, real quantization of weight-only models only reduces memory but slows down inference.
  • --save_dir: save the quantized model for further exploration.
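A minimal sketch of the per-channel vs. group-wise distinction that --group_size controls (illustrative only; the function and names are hypothetical, not the repository's code): with group size g, each output channel's weights are split into groups of g input elements, and each group gets its own scale.

```python
from typing import Optional
import torch

def weight_scales(w: torch.Tensor, n_bits: int, group_size: Optional[int] = None) -> torch.Tensor:
    """Symmetric max-abs scales, per output channel or per group of inputs (illustrative)."""
    qmax = 2 ** (n_bits - 1) - 1
    out_ch, in_ch = w.shape
    if group_size is None:
        # Per-channel: one scale per output channel (the default when --group_size is unset).
        return w.abs().amax(dim=1, keepdim=True) / qmax              # (out_ch, 1)
    # Group-wise: one scale per (output channel, group of `group_size` input elements).
    grouped = w.reshape(out_ch, in_ch // group_size, group_size)
    return grouped.abs().amax(dim=2) / qmax                          # (out_ch, in_ch // group_size)

w = torch.randn(4096, 4096)
print(weight_scales(w, n_bits=3).shape)                  # torch.Size([4096, 1])
print(weight_scales(w, n_bits=3, group_size=128).shape)  # torch.Size([4096, 32])
```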

Kernel Benchmark

  1. Compile Kernels.

By default, the w2a2, w3a3, w4a4, w5a5, w6a6, w7a7, and w8a8 kernels are compiled, as well as the w2a4, w2a6, w2a8, and w4a8 quantization combinations. Each quantization scheme corresponds to dozens of kernel implementations that make up its search space.
```
# linux
cd engine
bash build.sh

# windows
cd engine
build.bat
```

  2. Comprehensive benchmark.

For the typical GEMM operations of the LLaMA model, the different quantization combinations (w2a2, w3a3, w4a4, w5a5, w6a6, w7a7, w8a8, w2a4, w2a6, w2a8, w4a8) are benchmarked to find the best-performing kernel in each combination's search space.
```
# linux
bash test.sh

# windows
test.bat
```

  3. Add new quantization combinations (optional).

We restructured quantized matrix multiplication by decomposing it into a series of binary (1-bit) matrix multiplications, and abstracted the templates and computation model to a high degree (see the sketch below).

Building on these abstractions, you can quickly extend the code to support a new quantization combination WpAq: add the corresponding instantiation definition and declaration files under engine/mma_any/aq_wmma_impl and recompile.

The performance upper bound depends on how the search space is defined (that is, which kernel configurations are instantiated). For guidance, refer to the paper or to the existing implementations in this directory.
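A minimal sketch of the decomposition idea (illustrative only, not the CUDA implementation): a p-bit by q-bit integer GEMM can be rewritten as a weighted sum of 1-bit GEMMs over the operands' bit planes, each of which maps onto binary tensor-core instructions in the real kernels.

```python
import torch

def bitplane_matmul(a: torch.Tensor, w: torch.Tensor, a_bits: int, w_bits: int) -> torch.Tensor:
    """Compute a @ w.T for unsigned-integer tensors as a weighted sum of 1-bit matmuls (illustrative)."""
    acc = torch.zeros(a.shape[0], w.shape[0], dtype=torch.int64)
    for i in range(a_bits):
        a_plane = (a >> i) & 1                                  # i-th bit plane of the activations
        for j in range(w_bits):
            w_plane = (w >> j) & 1                              # j-th bit plane of the weights
            # The real kernels run these binary GEMMs on tensor cores; float is used here only for portability.
            acc += (1 << (i + j)) * (a_plane.float() @ w_plane.float().T).long()
    return acc

a = torch.randint(0, 2 ** 4, (8, 64))    # 4-bit activations
w = torch.randint(0, 2 ** 2, (16, 64))   # 2-bit weights
assert torch.equal(bitplane_matmul(a, w, a_bits=4, w_bits=2), a @ w.T)  # matches the integer GEMM
```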

E2E Benchmark

  1. Compile the fastertransformer engine:
```
cd fastertransformer
bash build.sh
```

  2. Configure llama (change the precision in examples/cpp/llama/llama_config.ini; a scripted way to toggle this is sketched after this list):
```
fp16:  int8_mode=0
w8a16: int8_mode=1
w8a8:  int8_mode=2
w4a16: int8_mode=4
w2a8:  int8_mode=5
```

  3. Run llama on a single GPU:
```
cd build_release
./bin/llama_example
```

  4. (Optional) Run on multiple GPUs. Change tensor_para_size=2 in examples/cpp/llama/llama_config.ini:
```
cd build_release
mpirun -n 2 ./bin/llama_example
```
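A small helper for the precision switch in step 2, as a sketch (the file path and the int8_mode key come from the list above; the helper itself is hypothetical): it rewrites the int8_mode value in llama_config.ini with Python's configparser.

```python
# Sketch: flip the int8_mode precision flag in llama_config.ini (hypothetical helper).
import configparser

def set_int8_mode(path: str, mode: int) -> None:
    cfg = configparser.ConfigParser()
    cfg.read(path)
    for section in cfg.sections():
        if "int8_mode" in cfg[section]:        # don't assume a particular section name
            cfg[section]["int8_mode"] = str(mode)
    with open(path, "w") as f:
        cfg.write(f)

set_int8_mode("examples/cpp/llama/llama_config.ini", 2)   # e.g. select the w8a8 path
```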

Results

  • ABQ-LLM achieves SoTA performance in weight-only quantization.
  • ABQ-LLM achieves SoTA performance in weight-activation quantization.
  • ABQ-LLM achieves SoTA performance on zero-shot tasks.
  • On kernel-level inference acceleration, ABQ-LLM achieves performance gains that far exceed those of CUTLASS and cuBLAS.
  • We integrated our ABQKernel into FasterTransformer and compared it with the FP16 version of FasterTransformer and the INT8 version of SmoothQuant. Our approach achieved a 2.8x speedup and 4.8x memory compression over FP16, using only 10 GB of memory on LLaMA-30B, less than what FP16 requires for LLaMA-7B. It also outperformed SmoothQuant with a 1.6x speedup and 2.7x memory compression.

Related Project

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

RPTQ: Reorder-Based Post-Training Quantization for Large Language Models

OmniQuant: A simple and powerful quantization technique for LLMs

Citation

If you use our ABQ-LLM approach in your research, please cite our paper:
```
@article{zeng2024abq,
  title={ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models},
  author={Zeng, Chao and Liu, Songwei and Xie, Yusheng and Liu, Hong and Wang, Xiaojian and Wei, Miao and Yang, Shu and Chen, Fangmin and Mei, Xing},
  journal={arXiv preprint arXiv:2408.08554},
  year={2024}
}
```

Star History


Owner

  • Name: Bytedance Inc.
  • Login: bytedance
  • Kind: organization
  • Location: Singapore

GitHub Events

Total
  • Issues event: 5
  • Watch event: 53
  • Issue comment event: 8
  • Fork event: 6
Last Year
  • Issues event: 5
  • Watch event: 53
  • Issue comment event: 8
  • Fork event: 6

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 54
  • Total Committers: 5
  • Avg Commits per committer: 10.8
  • Development Distribution Score (DDS): 0.463
Past Year
  • Commits: 54
  • Committers: 5
  • Avg Commits per committer: 10.8
  • Development Distribution Score (DDS): 0.463
Top Committers
Name Email Commits
曾超 z****4@b****m 29
xieyusheng.12 x****2@b****m 12
liusongwei.zju l****u@b****m 10
yanchenqian.i y****i@b****m 2
root r****t@n****g 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 20
  • Total pull requests: 0
  • Average time to close issues: 15 days
  • Average time to close pull requests: N/A
  • Total issue authors: 13
  • Total pull request authors: 0
  • Average comments per issue: 1.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 20
  • Pull requests: 0
  • Average time to close issues: 15 days
  • Average time to close pull requests: N/A
  • Issue authors: 13
  • Pull request authors: 0
  • Average comments per issue: 1.75
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Sekri0 (4)
  • RanchiZhao (3)
  • luliyucoordinate (2)
  • aur61 (1)
  • Godlovecui (1)
  • FlyFoxPlayer (1)
  • KoalaYuFeng (1)
  • gdsaikrishna (1)
  • CalebDu (1)
  • Juelianqvq (1)
  • goddice (1)
  • renjie0 (1)
  • Nekofish-L (1)
  • gloritygithub11 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

algorithm/requirements.txt pypi
  • DataProperty ==1.0.1
  • Jinja2 ==3.1.4
  • MarkupSafe ==2.1.5
  • PyYAML ==6.0.1
  • Pygments ==2.18.0
  • absl-py ==2.1.0
  • accelerate ==0.29.3
  • aiohttp ==3.9.5
  • aiosignal ==1.3.1
  • antlr4-python3-runtime ==4.9.3
  • anyio ==4.4.0
  • async-timeout ==4.0.3
  • attrs ==23.2.0
  • certifi ==2024.2.2
  • chardet ==5.2.0
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • cmake ==3.29.3
  • colorama ==0.4.6
  • contourpy ==1.2.1
  • cycler ==0.12.1
  • datasets ==2.14.7
  • dill ==0.3.7
  • distro ==1.9.0
  • docstring_parser ==0.16
  • einops ==0.8.0
  • evaluate ==0.4.2
  • exceptiongroup ==1.2.1
  • filelock ==3.14.0
  • fire ==0.6.0
  • fonttools ==4.51.0
  • frozenlist ==1.4.1
  • fsspec ==2023.10.0
  • h11 ==0.14.0
  • hjson ==3.1.0
  • httpcore ==1.0.5
  • httpx ==0.27.0
  • huggingface-hub ==0.17.3
  • idna ==3.7
  • jieba ==0.42.1
  • joblib ==1.4.2
  • jsonlines ==4.0.0
  • kiwisolver ==1.4.5
  • lit ==18.1.4
  • lxml ==5.2.2
  • markdown-it-py ==3.0.0
  • matplotlib ==3.8.4
  • mbstrdecoder ==1.1.3
  • mdurl ==0.1.2
  • mpmath ==1.3.0
  • multidict ==6.0.5
  • multiprocess ==0.70.15
  • networkx ==3.2
  • ninja ==1.11.1.1
  • nltk ==3.8.1
  • numexpr ==2.10.0
  • numpy ==1.26.4
  • nvidia-cublas-cu11 ==11.10.3.66
  • nvidia-cuda-cupti-cu11 ==11.7.101
  • nvidia-cuda-nvrtc-cu11 ==11.7.99
  • nvidia-cuda-runtime-cu11 ==11.7.99
  • nvidia-cudnn-cu11 ==8.5.0.96
  • nvidia-cufft-cu11 ==10.9.0.58
  • nvidia-curand-cu11 ==10.2.10.91
  • nvidia-cusolver-cu11 ==11.4.0.1
  • nvidia-cusparse-cu11 ==11.7.4.91
  • nvidia-nccl-cu11 ==2.14.3
  • nvidia-nvtx-cu11 ==11.7.91
  • omegaconf ==2.3.0
  • openai ==1.33.0
  • packaging ==24.0
  • pandas ==2.2.2
  • pathvalidate ==3.2.0
  • peft ==0.10.0
  • pillow ==10.3.0
  • portalocker ==2.8.2
  • protobuf ==5.26.1
  • psutil ==5.9.8
  • py-cpuinfo ==9.0.0
  • pyarrow ==16.1.0
  • pyarrow-hotfix ==0.6
  • pybind11 ==2.12.0
  • pycountry ==23.12.11
  • pydantic ==1.10.15
  • pyparsing ==3.1.2
  • pytablewriter ==1.2.0
  • python-dateutil ==2.9.0.post0
  • pytz ==2024.1
  • regex ==2024.5.10
  • requests ==2.31.0
  • rich ==13.7.1
  • rouge-chinese ==1.0.3
  • rouge_score ==0.1.2
  • sacrebleu ==2.4.2
  • safetensors ==0.4.3
  • scikit-learn ==1.4.2
  • scipy ==1.13.0
  • seaborn ==0.13.2
  • sentencepiece ==0.2.0
  • shtab ==1.7.1
  • six ==1.16.0
  • sniffio ==1.3.1
  • sqlitedict ==2.1.0
  • sympy ==1.12
  • tabledata ==1.3.3
  • tabulate ==0.9.0
  • tcolorpy ==0.1.6
  • tensorboardX ==2.6.2.2
  • termcolor ==2.4.0
  • threadpoolctl ==3.5.0
  • tiktoken ==0.6.0
  • tokenizers ==0.14.1
  • torch ==2.0.0
  • tqdm ==4.66.2
  • tqdm-multiprocess ==0.0.11
  • transformers ==4.35.0
  • triton ==2.0.0
  • trl ==0.7.2
  • typepy ==1.3.2
  • typing_extensions ==4.11.0
  • tyro ==0.8.4
  • tzdata ==2024.1
  • urllib3 ==2.2.1
  • uvicorn ==0.29.0
  • xxhash ==3.4.1
  • yarl ==1.9.4
  • zstandard ==0.22.0
algorithm/setup.py pypi
  • numpy *
  • torch *
requirements.txt pypi
  • accelerate ==0.33.0
  • beautifulsoup4 *
  • bs4 *
  • datasets ==2.20.0
  • diffusers ==0.28.0
  • einops *
  • ftfy *
  • gradio ==4.1.1
  • lpips *
  • mmcv ==1.7.0
  • numpy *
  • open_clip_torch *
  • opencv-python *
  • optimum *
  • peft *
  • protobuf ==3.20.2
  • pytorch-fid *
  • sentencepiece *
  • tensorboard *
  • tensorboardX *
  • thop *
  • timm ==0.6.12
  • torch-fidelity *
  • transformers ==4.42.4
  • wcwidth *
  • xformers ==0.0.27
  • yapf ==0.40.1
fastertransformer/examples/pytorch/bert/bert-quantization-sparsity/Dockerfile docker
  • ${FROM_IMAGE_NAME} latest build
  • nvcr.io/nvidia/tritonserver 20.06-v1-py3-clientsdk build
fastertransformer/examples/tensorflow/bert/bert-quantization/Dockerfile docker
  • ${FROM_IMAGE_NAME} latest build
fastertransformer/3rdparty/cutlass/tools/library/scripts/pycutlass/pyproject.toml pypi
fastertransformer/3rdparty/cutlass/tools/library/scripts/pycutlass/setup.py pypi
  • bfloat16 *
  • cuda-python <11.7.0
  • numpy <1.23
  • pybind11 *
  • scikit-build *
  • treelib *
  • typeguard *
  • typing *
fastertransformer/examples/pytorch/bart/requirement.txt pypi
  • SentencePiece *
  • datasets *
  • omegaconf *
  • rouge_score *
  • sacrebleu *
  • tokenizers *
  • transformers *
fastertransformer/examples/pytorch/bert/bert-quantization-sparsity/requirements.txt pypi
  • boto3 *
  • h5py *
  • html2text *
  • ipdb *
  • nltk *
  • onnxruntime *
  • progressbar *
  • requests *
  • six *
  • tqdm *
fastertransformer/examples/pytorch/gpt/requirement.txt pypi
  • accelerate *
  • datasets *
  • fire *
  • omegaconf *
  • rouge_score *
  • transformers *
fastertransformer/examples/pytorch/swin/Swin-Transformer-Quantization/SwinTransformer/kernels/window_process/setup.py pypi
fastertransformer/examples/pytorch/t5/requirement.txt pypi
  • SentencePiece *
  • datasets *
  • omegaconf *
  • rouge_score *
  • sacrebleu *
  • tokenizers *
  • transformers *
fastertransformer/examples/pytorch/vit/ViT-quantization/ViT-pytorch/requirements.txt pypi
  • ml-collections *
  • numpy *
  • tensorboard *
  • torch *
  • tqdm *
fastertransformer/examples/pytorch/vit/requirement.txt pypi
  • ml_collections *
  • pytorch-quantization *
  • termcolor ==1.1.0
  • timm ==0.4.12
  • yacs *
fastertransformer/examples/tensorflow/bert/bert-quantization/ft-tensorflow-quantization/setup.py pypi
fastertransformer/examples/tensorflow/bert/tensorflow_bert/bert/requirements.txt pypi
  • tensorflow >=1.11.0
fastertransformer/examples/tensorflow/deberta/requirement.txt pypi
  • SentencePiece *
  • numpy *
  • transformers *
fastertransformer/examples/tensorflow/requirement.txt pypi
  • fire >=0.1.3
  • opennmt-tf ==1.25.1
  • regex ==2017.4.5
  • requests ==2.21.0
  • tqdm ==4.31.1
fastertransformer/examples/tensorflow/t5/requirement.txt pypi
  • SentencePiece *
  • datasets *
  • omegaconf *
  • rouge_score *
  • sacrebleu *
  • transformers *