Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to arxiv.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary
Keywords
Repository
LLM KV cache compression made easy
Basic Info
Statistics
- Stars: 599
- Watchers: 17
- Forks: 55
- Open Issues: 8
- Releases: 17
Topics
Metadata Files
README.md

Deploying long-context LLMs is costly due to the linear growth of the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.
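To see where a figure like 330GB comes from, here is a back-of-the-envelope estimate (a sketch; the Llama 3.1-70B values below — 80 layers, 8 KV heads, head dimension 128 — are assumptions based on the published architecture, not part of this repository):
```python
# Rough KV cache size estimate for Llama 3.1-70B in float16 (assumed architecture values).
num_layers = 80        # transformer layers
num_kv_heads = 8       # grouped-query attention KV heads
head_dim = 128         # dimension per head
bytes_per_value = 2    # float16
num_tokens = 1_000_000

# Keys and values are both cached, hence the factor of 2.
kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens
print(f"{kv_cache_bytes / 1e9:.0f} GB")  # ~328 GB, consistent with the ~330GB figure above
```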
Installation
```bash
pip install kvpress
```
For a local installation with all dev dependencies, use uv:
```bash
git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync --all-groups
```
Advanced installation settings
Optional packages can also be installed with uv. To install with flash attention, run:
```bash
git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync --extra flash-attn
```
To install with dependencies for evaluation, run:
```bash
git clone https://github.com/NVIDIA/kvpress.git
cd kvpress
uv sync --extra eval
```
Usage
KVPress provides a set of "presses" that compress the KV cache during the prefilling phase. Each press is associated with a compression_ratio attribute that measures the compression of the cache. The easiest way to use a press is through our custom KVPressTextGenerationPipeline. It is automatically registered as a transformers pipeline with the name "kv-press-text-generation" when kvpress is imported, and it handles chat templates and tokenization for you:
```python
from transformers import pipeline
from kvpress import ExpectedAttentionPress

device = "cuda:0"
model = "meta-llama/Llama-3.1-8B-Instruct"
model_kwargs = {"attn_implementation": "flash_attention_2"}
pipe = pipeline("kv-press-text-generation", model=model, device=device, model_kwargs=model_kwargs)

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context"  # optional

press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
```
In the snippet above, the compression is only applied to the context tokens, so that you can evaluate the compression for different questions. Check the Wikipedia notebook demo for a more detailed example (also available on Colab).
[!IMPORTANT]
We focus on compression during the pre-filling phase, as the KV cache becomes a bottleneck for long-context sequences (100k - 1M tokens), which are essentially long context prompts. This would typically apply to improving prompt caching systems.
[!NOTE]
Use model_kwargs={"attn_implementation": "flash_attention_2"} to enable flash attention. To use the press ObservedAttentionPress, you need to specify model_kwargs={"attn_implementation": "eager"}, as this press requires materializing the attention weights.
Contributing
We welcome contributions! To add a new press, simply open an issue or submit a pull request. Check the new_press.ipynb notebook for a step-by-step guide.
Available presses
All current presses are training-free and inherit from BasePress (source).
Several presses inherit from ScorerPress (source) and rely on a score to prune the KV pairs with the lowest importance (a minimal sketch of this pattern follows the list below):
- RandomPress (source): random score
- KnormPress (source, paper): inverse norm of the key
- SnapKVPress (source, paper): average attention weight of the last queries
- ExpectedAttentionPress (source, notebook): expected attention weight during the generation phase
- StreamingLLMPress (source, paper): keep only the initial and recent tokens
- TOVAPress (source, paper): attention weight of the last query averaged across heads
- ObservedAttentionPress (source, paper): average attention weight observed during the pre-filling phase
- QFilterPress (source, paper): project the key representations on the main SVD component of the query vectors to approximate the attention scores
- PyramidKVPress (source, paper): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers
- LagKVPress (source, paper): leverage KV lag-relative information to compress; query free, attention-weight free, and flash-attention compatible
- KeyDiffPress (source, paper): evict tokens based solely on key similarity
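To make the ScorerPress pattern concrete, here is a minimal sketch of a hypothetical custom press (VnormPress, named by analogy with KnormPress) that scores KV pairs by the norm of their values. The exact `score()` signature and the top-level `ScorerPress` import are assumptions; refer to the ScorerPress source and the new_press.ipynb notebook for the actual interface.
```python
from dataclasses import dataclass

from kvpress import ScorerPress  # assumed to be exported at the package top level


@dataclass
class VnormPress(ScorerPress):
    """Hypothetical press: prune the KV pairs whose value vectors have the smallest norm."""

    def score(self, module, hidden_states, keys, values, attentions, kwargs):
        # Assumed contract: return one score per KV pair, shape (batch, num_kv_heads, seq_len);
        # the pairs with the lowest scores are pruned up to compression_ratio.
        return values.norm(dim=-1)
```
Such a press could then be passed to the pipeline like any built-in one, e.g. `pipe(context, press=VnormPress(compression_ratio=0.5))`.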
Some presses rely on a different logic:
- ThinKPress (source, paper): compress the dimensions of the keys based on the channel attention score on the last queries
- SimLayerKVPress (source, paper): identify "lazy" layers, and apply the StreamingLLM approach to them
- DuoAttentionPress (source, paper): split heads into retrieval heads (no compression) and streaming heads (StreamingLLM approach)
- FinchPress (source, paper): similar to SnapKV with a dynamic window size and key value re-rotation
- KVzipPress (source, paper): identifies redundant KV pairs through context reconstruction. Achieves near-lossless compression at the cost of multiple forward passes.
Finally we provide wrapper presses that can be combined with other presses:
- AdaKVPress (source, paper): prune bottom scores of any ScorerPress but across all heads, achieving head-wise compressions
- PerLayerCompressionPress (source): compress each layer with a different compression ratio (experimental)
- ComposedPress (source): compose multiple presses together by chaining their forward hooks (see the sketch after this list)
- KeyRerotationPress (source): rerotate pruned keys to have continuous RoPE embeddings
- ChunkKVPress (source, paper): compresses by selecting important chunks, preserving semantic coherence
- ChunkPress (source, paper): compress the KV cache on each sequence chunk separately. This can yield more uniform compression across long sequences
- CriticalKVPress and CriticalAdaKVPress (source, paper): refine the scores using the L1 norm of Wo @ values, coupled with a two-stage selection.
- BlockPress (source, paper): segments input sequence into non-overlapping blocks and compresses iteratively.
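As an illustration of how presses can be combined, the sketch below chains two built-in presses with ComposedPress, reusing the pipeline, context, and question from the Usage snippet above. The `presses` argument name and this particular combination are assumptions for illustration only; check the ComposedPress source for the exact constructor.
```python
from kvpress import ComposedPress, ExpectedAttentionPress, StreamingLLMPress

# Hypothetical combination, purely illustrative: keep initial/recent tokens first,
# then prune the remainder by expected attention.
press = ComposedPress(presses=[
    StreamingLLMPress(compression_ratio=0.25),
    ExpectedAttentionPress(compression_ratio=0.25),
])
answer = pipe(context, question=question, press=press)["answer"]
```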
For a detailed list of existing KV cache compression methods, check Awesome-KV-Cache-Compression or Awesome-LLM-Compression.
Evaluation
We provide a simple CLI to evaluate the performance of different presses on several long-context datasets.
Accuracy: Test your method on popular benchmarks directly using our CLI. For a broader comparison, check out our public Hugging Face Leaderboard, where you can see how various methods stack up against each other.
Speed and Memory: The speed_and_memory notebook can help you measure peak memory usage and total time gain.
Please refer to the evaluation directory in this repo for more details and results.
Below we report the average performance on the RULER dataset with 4k context length for different presses.
Quantization
We support KV cache quantization through the transformers QuantizedCache class (see HF blog post). To use it, simply pass a cache object to your pipeline:
```python
from transformers import QuantizedCacheConfig, QuantoQuantizedCache

config = QuantizedCacheConfig(nbits=4)
cache = QuantoQuantizedCache(config)

pipe(..., cache=cache)
```
By default, the DynamicCache is used (no quantization).
[!IMPORTANT]
To use the QuantizedCache, you need to install additional dependencies (e.g. pip install optimum-quanto).
FAQ
### Which models are supported?
Some presses depend on the model architecture (_e.g._ `ExpectedAttentionPress` or `SnapKVPress`), hence they might not work with all models. We tested support for `LlamaForCausalLM`, `MistralForCausalLM`, `Phi3ForCausalLM`, `Qwen2ForCausalLM`, `Qwen3ForCausalLM`, and `Gemma3ForCausalLM`, but many other models might be supported out of the box because their implementation is often similar in transformers.
### How to run inference on multiple GPUs?
kvpress supports multi-GPU inference through [accelerate](https://huggingface.co/docs/accelerate/en/index):
```python
pipe = pipeline("kv-press-text-generation", model=model, device_map="auto")
```
### What are the memory and throughput gains?
Memory usage should be reduced by around `compression_ratio * kv_cache_size`. As the KV cache is smaller, decoding should also be faster. You can measure peak memory usage gain and total time gain using [this notebook](notebooks/speed_and_memory.ipynb).
### How does a press work?
A press registers a forward hook (`press.forward_hook` method) to each attention layer during the pre-filling phase. Registration can be applied by using the press as a context manager (`press.__call__` method):
```python
import torch
from transformers import AutoModelForCausalLM
from kvpress import KnormPress

device = "cuda:0"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(ckpt).to(device)
press = KnormPress(compression_ratio=0.4)
inputs = model.dummy_inputs["input_ids"].to(device)

with torch.no_grad():
    print(model(inputs).past_key_values[0][0].shape)
    # torch.Size([3, 8, 5, 128])

with torch.no_grad(), press(model):
    print(model(inputs).past_key_values[0][0].shape)
    # torch.Size([3, 8, 3, 128])
```
### Why not use model.generate?
In fact, you can use `model.generate` with a press by using the press as a context manager:
```python
with press(model):
    outputs = model.generate(inputs)
```
However, the `generate` method does not allow excluding the question from the compression, which would artificially favor methods such as SnapKV. Ideally, we want a compression method that works whatever comes after the context (_e.g._ for use cases such as chat or document question answering). Finally, the `generate` method does not allow generating answers for multiple questions at once.
Owner
- Name: NVIDIA Corporation
- Login: NVIDIA
- Kind: organization
- Location: 2788 San Tomas Expressway, Santa Clara, CA, 95051
- Website: https://nvidia.com
- Repositories: 342
- Profile: https://github.com/NVIDIA
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use kvpress, please cite it as below."
authors:
  - family-names: "Jegou"
    given-names: "Simon"
  - family-names: "Jeblick"
    given-names: "Maximilian"
  - family-names: "Austin"
    given-names: "David"
title: "kvpress"
date-released: 2024-11-13
year: 2024
url: "https://github.com/NVIDIA/kvpress"
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Simon Jégou | S****g | 21 |
| maxjeblick | m****k | 11 |
| fanqiNO1 | 7****1 | 1 |
| Z | 4****t | 1 |
| Yuan Feng | f****n@g****m | 1 |
| Xiang LIU | 4****4 | 1 |
| NathanGodey | 3****y | 1 |
| Huanxuan Liao | l****3@i****n | 1 |
| Emmanuel Ferdman | e****n@g****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 38
- Total pull requests: 108
- Average time to close issues: 14 days
- Average time to close pull requests: 3 days
- Total issue authors: 21
- Total pull request authors: 21
- Average comments per issue: 2.39
- Average comments per pull request: 2.29
- Merged pull requests: 77
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 38
- Pull requests: 108
- Average time to close issues: 14 days
- Average time to close pull requests: 3 days
- Issue authors: 21
- Pull request authors: 21
- Average comments per issue: 2.39
- Average comments per pull request: 2.29
- Merged pull requests: 77
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- maxjeblick (7)
- giulio98 (6)
- SimJeg (3)
- Dominic789654 (2)
- alessiodevoto (2)
- toilaluan (2)
- FFY0 (2)
- figuremout (2)
- msharmavikram (1)
- PengWenChen (1)
- Janghyun1230 (1)
- lele-zh (1)
- Xnhyacinth (1)
- fanqiNO1 (1)
- ChenHong30 (1)
Pull Request Authors
- SimJeg (37)
- maxjeblick (34)
- alessiodevoto (10)
- neuralsorcerer (6)
- giulio98 (5)
- figuremout (4)
- FFY0 (4)
- Xnhyacinth (4)
- Dominic789654 (2)
- joshua-j-hong (2)
- NathanGodey (2)
- JoelSeniorLiang (2)
- fanqiNO1 (2)
- yuhuixu1993 (2)
- dame-cell (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 3
- Total downloads: pypi 395 last month
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 52
- Total maintainers: 1
proxy.golang.org: github.com/NVIDIA/kvpress
- Documentation: https://pkg.go.dev/github.com/NVIDIA/kvpress#section-documentation
- License: apache-2.0
- Latest release: v0.2.10, published 7 months ago
Rankings
proxy.golang.org: github.com/nvidia/kvpress
- Documentation: https://pkg.go.dev/github.com/nvidia/kvpress#section-documentation
- License: apache-2.0
- Latest release: v0.2.10, published 7 months ago
Rankings
pypi.org: kvpress
Efficiently compress the KV cache of any pretrained transformer
- Documentation: https://kvpress.readthedocs.io/
- License: apache-2.0
- Latest release: 0.3.0, published 6 months ago
Rankings
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish release/v1 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- black ^24.8.0 develop
- flake8 ^7.0.0 develop
- isort ^5.13.2 develop
- mypy ^1.11.2 develop
- pytest ^7.0.0 develop
- pytest-cov ^5.0.0 develop
- pytest-dependency ^0.6.0 develop
- pytest-html >=4.1.1, <5.0.0 develop
- types-pyyaml ^6.0 develop
- accelerate ^1.0.0
- bert-score ^0.3.13
- bs4 ^0.0.2
- datasets ^2.21.0
- fire ^0.6.0
- ipykernel ^6.29.4
- matplotlib ^3.9.0
- nltk ^3.9.1
- numpy ^2.0.0
- nvitop ^1.3.2
- pandas ^2.2.2
- protobuf ^5.27.2
- python >=3.10
- rouge ^1.0.1
- scipy ^1.13.1
- sentencepiece ^0.2.0
- torch ^2.3.1
- tqdm ^4.66.4
- transformers ^4.45.1