Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ishanrev
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 6.04 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created 9 months ago · Last pushed 7 months ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

🔁 Disaggregated KV Caching for LLM Inference (WIP)

Overview

This project explores disaggregated KV cache storage to enable long-context LLM inference beyond GPU memory limits. It introduces a tiered caching architecture that offloads Key/Value tensors from GPU to CPU (and eventually to disk), while maintaining throughput through asynchronous overlap of compute and I/O.

⚠️ Under active development. Expect ongoing optimizations, new features, and performance experiments.


✨ Key Features (Planned & In Progress)

  • ✅ GPU → CPU offloading of KV cache blocks
  • 🔄 Double-buffered async decode using CUDA streams for overlapping data transfer and attention compute
  • 🔄 Three-stage pipeline: load → compute → merge
  • 🚧 KV Block Manager to minimize fragmentation across varying sequence lengths
  • 🚧 Log-sum-exp partial attention across streamed KV blocks
  • 🔄 Future: Disk-tier offload for ultra-long contexts (100K+ tokens)
  • ⚙️ Aim: virtually unlimited context on constrained GPUs, with bounded throughput slowdown (<2×)
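
The double-buffered overlap above can be sketched without any CUDA at all: a background thread prefetches the next KV block while the current one is being processed, so transfer and compute proceed concurrently. The `load`, `compute`, and `merge` callables below are hypothetical stand-ins for the host-to-device copy, the partial-attention kernel, and the merge stage.

```python
import threading
import queue

def pipelined_decode(blocks, load, compute, merge):
    """Overlap block loading with compute using a two-slot (double)
    buffer: while block i is being computed, block i+1 is prefetched
    on a background thread."""
    prefetched = queue.Queue(maxsize=1)  # one slot in flight + one in compute

    def loader():
        for b in blocks:
            prefetched.put(load(b))  # blocks while the slot is occupied
        prefetched.put(None)         # sentinel: no more blocks

    threading.Thread(target=loader, daemon=True).start()

    state = None
    while (buf := prefetched.get()) is not None:
        state = merge(state, compute(buf))
    return state

# Toy usage: "load" doubles, "compute" squares, "merge" sums.
result = pipelined_decode(
    [1, 2, 3],
    load=lambda b: 2 * b,
    compute=lambda x: x * x,
    merge=lambda acc, y: (acc or 0) + y,
)
# result is (2**2) + (4**2) + (6**2) = 56
```

In the real pipeline the bounded queue plays the role of the second buffer: the loader stalls as soon as one block is staged, so at most two blocks occupy memory at once.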

🔬 Architecture Summary

```
[GPU KV Buffer] ←→ [Pinned CPU Memory] ←→ [Disk Backend (planned)]
        ▲                    ▲
 Memory │              Async │
 Stream │        memcpyAsync │
Transfer│                    │
[Attention Engine] ←→ [KV Block Manager]
```

  • KVBlockManager: Manages allocation, eviction, and tier transitions
  • Attention Engine: Custom kernel leveraging cudaMemcpyAsync and double buffers
  • OffloadManager: Schedules data movement across CUDA streams and CPU memory
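
A minimal sketch of the KVBlockManager idea, under the assumption that fixed-size blocks are the unit of allocation (the real manager would operate on GPU tensors; here blocks are just integer ids, and the `victim` parameter is a hypothetical eviction hint):

```python
class KVBlockManager:
    """Toy fixed-size KV block allocator with a CPU offload tier.
    Fixed-size blocks avoid fragmentation across varying sequence
    lengths; evicted blocks move to the CPU tier rather than being lost."""

    def __init__(self, num_gpu_blocks):
        self.free = list(range(num_gpu_blocks))  # free GPU block ids
        self.gpu = {}   # seq_id -> list of GPU block ids
        self.cpu = {}   # seq_id -> count of blocks offloaded to CPU

    def allocate(self, seq_id, victim=None):
        """Grab a free GPU block for seq_id; if the GPU tier is full,
        offload the victim sequence's blocks first."""
        if not self.free:
            if victim is None:
                raise MemoryError("GPU tier full and no eviction victim")
            self.offload(victim)
        block = self.free.pop()
        self.gpu.setdefault(seq_id, []).append(block)
        return block

    def offload(self, seq_id):
        """Move a sequence's GPU blocks to the CPU tier (stand-in for an
        async device-to-host copy) and recycle their GPU slots."""
        blocks = self.gpu.pop(seq_id, [])
        self.cpu[seq_id] = self.cpu.get(seq_id, 0) + len(blocks)
        self.free.extend(blocks)

mgr = KVBlockManager(num_gpu_blocks=2)
mgr.allocate("seq-a")
mgr.allocate("seq-a")
mgr.allocate("seq-b", victim="seq-a")  # evicts seq-a's 2 blocks to CPU
```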

⚙️ Tech Stack

  • 🧠 Frameworks: PyTorch, C++ extensions
  • 🚀 Parallelism: CUDA streams, double buffering, asynchronous memory transfers
  • 🛠️ Languages: Python, C++
  • 📦 Deployment: Docker, AWS

🧪 Benchmarks (Ongoing)

| Context | VRAM Usage | Throughput (tokens/s) | Offload Stage              |
| ------- | ---------- | --------------------- | -------------------------- |
| 8K      | Baseline   | TBD                   | N/A                        |
| 16K     | –30%       | TBD                   | GPU → CPU                  |
| 32K+    | TBD        | TBD                   | GPU → CPU → Disk (planned) |

Goal: Keep throughput slowdown under 2× while scaling context length.
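
The 2× goal follows directly from the overlap model: with perfect double buffering, each decode step costs max(compute, transfer) rather than their sum, so slowdown stays below 2× whenever a block's transfer takes less than twice its compute time. The millisecond figures below are illustrative, not measurements:

```python
def overlapped_slowdown(compute_ms, transfer_ms):
    """With perfect double-buffered overlap, per-step cost is
    max(compute, transfer) instead of compute + transfer."""
    return max(compute_ms, transfer_ms) / compute_ms

# Transfer fully hidden behind compute: no slowdown.
hidden = overlapped_slowdown(10.0, 8.0)    # 1.0
# Transfer-bound at 1.5x compute: 1.5x slowdown, still under the 2x goal.
bound = overlapped_slowdown(10.0, 15.0)    # 1.5
```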


📌 Development Roadmap

  1. Stabilize double-buffered async decode and benchmark against baseline
  2. Implement log-sum-exp accumulation in streamed attention kernel
  3. Add disk-tier offload and adaptive caching policies
  4. Integrate a demo notebook showcasing 32K+ token inference
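
The log-sum-exp accumulation from item 2 can be verified in plain Python: each KV block contributes a running max, an unnormalized softmax sum, and an unnormalized weighted-value accumulator, and merging those partials reproduces exact full-sequence attention. This is a sketch of the math only (scalar scores/values), not the kernel:

```python
import math

def block_attention(scores, values):
    """Partial attention over one KV block: (max, unnormalized softmax
    sum, unnormalized weighted-value accumulator)."""
    m = max(scores)
    s = sum(math.exp(x - m) for x in scores)
    o = sum(math.exp(x - m) * v for x, v in zip(scores, values))
    return m, s, o

def merge_blocks(parts):
    """Log-sum-exp merge of per-block partials into exact attention:
    rescale each block by exp(m_b - m_global) before summing."""
    m = max(p[0] for p in parts)
    s = sum(p[1] * math.exp(p[0] - m) for p in parts)
    o = sum(p[2] * math.exp(p[0] - m) for p in parts)
    return o / s

# Streaming the KV cache in two blocks matches full softmax attention.
scores = [0.5, 2.0, -1.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0]
streamed = merge_blocks([block_attention(scores[:2], values[:2]),
                         block_attention(scores[2:], values[2:])])
m = max(scores)
full = (sum(math.exp(x - m) * v for x, v in zip(scores, values))
        / sum(math.exp(x - m) for x in scores))
```

Because each partial is stored relative to its own block max, no intermediate ever exponentiates a large score, which is what makes the streamed version numerically stable.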

📬 Contact & Collaboration

Maintained by Ishan Revankar
🔗 LinkedIn
📫 Open an issue or reach out for ideas and contributions!

⚠️ Work in progress: expect frequent breaking changes and updates.

Owner

  • Login: ishanrev
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, you can cite it as shown below."
title: "LitGPT"
abstract: "20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale."
date-released: 2023-03-22
authors:
  - name: "The Lightning AI team"
license: "Apache-2.0"
url: "https://github.com/Lightning-AI/litgpt"

GitHub Events

Total
  • Push event: 6
  • Create event: 2
Last Year
  • Push event: 6
  • Create event: 2