https://github.com/cornell-zhang/llm-datatypes

Codebase for ICML'24 paper: Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.9%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: cornell-zhang
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 1.19 MB

Statistics

Stars: 2
Watchers: 6
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs

By Jordan Dotzel, Yuzong Chen, Bahaa Kotb, Sushma Prasad, Gang Wu, Sheng Li, Mohamed S. Abdelfattah, Zhiru Zhang

Introduction

This is the code accompanying the ICML 2024 paper Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs. The work first conducts a large-scale analysis of LLM weights and activations across 30 networks and concludes that most distributions follow a Student’s t-distribution. It then derives a new theoretically optimal format, Student Float (SF4), that improves over NF4 across modern LLMs. Using SF4 as a high-accuracy reference, it proposes augmenting E2M1 with two variants of supernormal support for higher model accuracy. Finally, it explores the quality and efficiency frontier across 11 datatypes by evaluating their model accuracy and hardware complexity. It finds a Pareto curve composed of INT4, E2M1, and E2M1 with supernormal support, which offers a continuous tradeoff between model accuracy and chip area.
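To make the idea of a distribution-derived format concrete, here is a minimal sketch of a quantile-based construction: take 16 evenly spaced probabilities, map them through an inverse CDF, and normalize to [-1, 1]. This is a simplified illustration, not the exact procedure used for NF4 or SF4 in the papers or in this repository. All function names are hypothetical, and the choice of 5 degrees of freedom is an assumption suggested by the sf4_5 name.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, nu: float) -> float:
    """Density of the Student's t-distribution with nu degrees of freedom."""
    norm = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return norm * (1 + x * x / nu) ** (-(nu + 1) / 2)


def t_cdf(x: float, nu: float, steps: int = 400) -> float:
    """CDF via Simpson's rule on [0, |x|], using the symmetry of the density."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p: float, nu: float) -> float:
    """Inverse CDF by bisection; accurate enough for illustration."""
    lo, hi = -50.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def quantile_levels(inv_cdf, n_levels: int = 16) -> list[float]:
    """Quantiles at evenly spaced probabilities, normalized to [-1, 1]."""
    raw = [inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


# Normal-based levels versus Student's t-based levels (nu=5 is an assumption).
normal_levels = quantile_levels(NormalDist().inv_cdf)
student_levels = quantile_levels(lambda p: t_quantile(p, nu=5))
```

Because the t-distribution has heavier tails, its outermost quantiles sit further out, so after normalization the interior levels are pulled closer to zero than in the normal-based table.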

Getting Started

To get started, create a conda environment with the required dependencies and activate it.

conda env create -f requirements.yaml
conda activate llm-datatypes

Then, use run_quant.py to run the quantization and evaluation on desired tasks. For example:

python run_quant.py --model facebook/opt-125m --quantize --batch_size=64 --tasks lambada_openai --bits=4 --dtype=sf4_5 --group_size=128 --algo=RTN

With access to a Slurm cluster, submit the run_quant_slurm.sh script for batched evaluation:

sbatch run_quant_slurm.sh
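When comparing several datatypes without a Slurm cluster, the command above can be generated in a loop. The sketch below only builds command strings and does not execute anything. The helper name build_command is hypothetical, and the lowercase spelling nf4 is an assumption modeled on the sf4_5 spelling in the example command.

```python
import shlex


def build_command(dtype: str, model: str = "facebook/opt-125m",
                  task: str = "lambada_openai", bits: int = 4,
                  group_size: int = 128, algo: str = "RTN",
                  batch_size: int = 64) -> str:
    """Assemble a run_quant.py invocation using the flags from the example above."""
    args = [
        "python", "run_quant.py",
        "--model", model,
        "--quantize",
        f"--batch_size={batch_size}",
        "--tasks", task,
        f"--bits={bits}",
        f"--dtype={dtype}",
        f"--group_size={group_size}",
        f"--algo={algo}",
    ]
    return shlex.join(args)


# Spellings other than sf4_5 are assumptions; check weight_only.py for the real names.
commands = [build_command(dtype) for dtype in ("sf4_5", "nf4")]
for command in commands:
    print(command)
```

Each printed line can be pasted into a shell or passed to subprocess.run with shlex.split.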

Evaluation

Use run_quant.py to quantize and evaluate the model across common datasets. It supports weight and activation quantization, optionally combined with GPTQ [1] and SmoothQuant [2]. Its arguments specify the model, the evaluation tasks, and the quantization settings. The most important arguments are listed below; the full set is in the argparse section of run_quant.py.

Important Arguments

quantize: Enables model quantization.
model: Specifies the model to use (default: EleutherAI/gpt-j-6b).
device: Defines the device to use (default: cuda:0).
seed: Seed for sampling calibration data (default: 42).
tasks: List of tasks for accuracy validation (default: ["lambada_openai", "hellaswag", "winogrande", "piqa", "wikitext"]).

SmoothQuant

SmoothQuant can be used to increase the accuracy of models with weight and activation quantization. The alpha argument allows balancing the quantization error on the weights and activations.

sq: Enables SmoothQuant.
alpha: SmoothQuant parameter (default: 0.5).
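As a rough illustration of what alpha controls, the SmoothQuant paper [2] defines a per-input-channel smoothing scale s_j = max|X_j|^alpha / max|W_j|^(1 - alpha); activations are divided by s_j and the matching weight rows are multiplied by it, which leaves the layer output unchanged. The sketch below follows that formula with plain Python lists. The function name is hypothetical, and the implementation in this repository may differ in detail.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-channel scales s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha)
            for a, w in zip(act_absmax, weight_absmax)]


# Channel 0 has an activation outlier (16.0), channel 1 does not.
act_absmax = [16.0, 1.0]
weight_absmax = [1.0, 1.0]

scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.5)

# After smoothing, the activation range shrinks and the weight range grows.
smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]
print(scales, smoothed_act, smoothed_weight)
```

With alpha=0.5 the outlier channel ends up with the same range on both sides; a larger alpha shifts more of the quantization difficulty from the activations to the weights.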

Quantization

enable_activation: Enables activation quantization.
activation_quantile: Clipping quantile for dynamic activation quantization (default: 1.0).
algo: Specifies the weight-only quantization algorithm (default: RTN, choices: RTN, AWQ, TEQ, GPTQ).
bits: Number of bits for quantization (default: 8).
group_size: Group size for quantization (default: -1).
dtype: Data type for quantization (default: int).
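The interaction between bits and group_size can be sketched with a simple round-to-nearest (RTN) integer quantizer. This is a minimal illustration under stated assumptions, not the repository's implementation: it uses symmetric absmax scaling, and it treats group_size=-1 as a single group covering the whole row, which is an assumption about the meaning of the default.

```python
def quantize_rtn_int(weights, bits=8, group_size=-1):
    """Symmetric round-to-nearest quantization with one absmax scale per group.

    Returns the dequantized values so the error is easy to inspect.
    """
    if group_size == -1:
        group_size = len(weights)  # assumption: -1 means one group per row
    qmax = 2 ** (bits - 1) - 1
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        absmax = max(abs(w) for w in group)
        scale = absmax / qmax if absmax > 0 else 1.0
        dequantized.extend(round(w / scale) * scale for w in group)
    return dequantized


row = [0.1, -0.7, 0.35, 7.0]
per_row = quantize_rtn_int(row, bits=4, group_size=-1)
per_group = quantize_rtn_int(row, bits=4, group_size=2)
print(per_row)
print(per_group)
```

Smaller groups keep a large value such as 7.0 from dictating the scale for the small weights elsewhere in the row, at the cost of storing more scales.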

Datatypes

Use the dtype argument to select a datatype. The most important datatypes are listed below; the full list is in neural-compressor/adapter/torch_utils/weight_only.py. In the code, datatypes are implemented as lists of floating-point values, so additional datatypes can be added and tested easily.

NF4: Normal Float (NF4), as defined in QLoRA [3].
SF4_5: Our proposed Student Float (SF4) format derived from the Student's t-distribution.
FP4_BASIC: The standard E2M1 format with subnormal support.
FP4_RANGE: Our super-range E2M1 variant, which provides higher accuracy, especially on distributions with large spread.
FP4_PREC2: Our super-precision E2M1 variant, which improves accuracy over E2M1 across most distributions.
FP4_LOG: A symmetric 4-bit logarithmic format.
APOT4: A 4-bit Additive-Powers-of-Two format.
APOT4_SP: A 4-bit Additive-Powers-of-Two format with super-precision.
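Because a datatype is just a list of representable values, quantizing to it amounts to scaling the weights into the table's range and snapping each one to the nearest entry. The sketch below illustrates this with the standard E2M1 value set (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). The function and variable names are hypothetical, and the repository's tables may be stored in a different normalization.

```python
# Standard E2M1 magnitudes with subnormal support; +0 and -0 collapse to one entry.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_VALUES = sorted({sign * m for m in E2M1_MAGNITUDES for sign in (-1.0, 1.0)})


def quantize_to_table(weights, table):
    """Scale weights so their absmax maps to the table's absmax, then snap each
    weight to the nearest table entry and scale back."""
    absmax = max(abs(w) for w in weights)
    table_max = max(abs(v) for v in table)
    scale = absmax / table_max if absmax > 0 else 1.0
    snapped = []
    for w in weights:
        nearest = min(table, key=lambda v: abs(v - w / scale))
        snapped.append(nearest * scale)
    return snapped


weights = [0.6, -0.22, 0.1, 0.33]
print(quantize_to_table(weights, E2M1_VALUES))
```

Trying a new format under this scheme only requires passing a different list of values as the table.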

Acknowledgements

This code was built from the Intel Neural Compressor codebase.

References

[1] Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023. https://arxiv.org/abs/2210.17323
[2] Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., & Han, S. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. ICML 2023. https://arxiv.org/abs/2211.10438
[3] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS 2023. https://arxiv.org/abs/2305.14314

Citation

@article{dotzel2024students,
  title={Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs},
  author={Jordan Dotzel and Yuzong Chen and Bahaa Kotb and Sushma Prasad and Gang Wu and Sheng Li and Mohamed S. Abdelfattah and Zhiru Zhang},
  year={2024},
  journal={International Conference on Machine Learning}
}

Owner

Name: Cornell Zhang Research Group
Login: cornell-zhang
Kind: organization

Website: https://zhang.ece.cornell.edu/
Repositories: 12
Profile: https://github.com/cornell-zhang

GitHub Events

Total

Watch event: 6

Last Year

Watch event: 6
