Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ishanrev
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 6.04 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created 9 months ago · Last pushed 7 months ago
Metadata Files
  • Readme
  • License
  • Citation

README.md

🔁 Disaggregated KV Caching for LLM Inference (WIP)

Overview

This project explores disaggregated KV cache storage to enable long-context LLM inference beyond GPU memory limits. It introduces a tiered caching architecture that offloads Key/Value tensors from GPU to CPU (and eventually to disk), while maintaining throughput through asynchronous overlap of compute and I/O.

⚠️ Under active development. Expect ongoing optimizations, new features, and performance experiments.


✨ Key Features (Planned & In Progress)

  • ✅ GPU → CPU offloading of KV cache blocks
  • 🔄 Double-buffered async decode using CUDA streams for overlapping data transfer and attention compute
  • 🔄 Three-stage pipeline: load → compute → merge
  • 🚧 KV Block Manager to minimize fragmentation across varying sequence lengths
  • 🚧 Log-sum-exp partial attention across streamed KV blocks
  • 🔄 Future: Disk-tier offload for ultra-long contexts (100K+ tokens)
  • ⚙️ Aim: virtually unlimited context on constrained GPUs, with bounded throughput slowdown (<2×)
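
The double-buffered overlap above can be sketched without any CUDA at all: a background thread prefetches the next KV block while the current one is being processed, so transfer and compute proceed concurrently. The `load`, `compute`, and `merge` callables below are hypothetical stand-ins for the host-to-device copy, the partial-attention kernel, and the merge stage.

```python
import threading
import queue

def pipelined_decode(blocks, load, compute, merge):
    """Overlap block loading with compute using a two-slot (double)
    buffer: while block i is being computed, block i+1 is prefetched
    on a background thread."""
    prefetched = queue.Queue(maxsize=1)  # one slot in flight + one in compute

    def loader():
        for b in blocks:
            prefetched.put(load(b))  # blocks while the slot is occupied
        prefetched.put(None)         # sentinel: no more blocks

    threading.Thread(target=loader, daemon=True).start()

    state = None
    while (buf := prefetched.get()) is not None:
        state = merge(state, compute(buf))
    return state

# Toy usage: "load" doubles, "compute" squares, "merge" sums.
result = pipelined_decode(
    [1, 2, 3],
    load=lambda b: 2 * b,
    compute=lambda x: x * x,
    merge=lambda acc, y: (acc or 0) + y,
)
# result is (2**2) + (4**2) + (6**2) = 56
```

In the real pipeline the bounded queue plays the role of the second buffer: the loader stalls as soon as one block is staged, so at most two blocks occupy memory at once.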

🔬 Architecture Summary

```
[GPU KV Buffer] ←→ [Pinned CPU Memory] ←→ [Disk Backend (planned)]
        ▲                    ▲
 Memory │              Async │
 Stream │        memcpyAsync │
Transfer│                    │
[Attention Engine] ←→ [KV Block Manager]
```

  • KVBlockManager: Manages allocation, eviction, and tier transitions
  • Attention Engine: Custom kernel leveraging cudaMemcpyAsync and double buffers
  • OffloadManager: Schedules data movement across CUDA streams and CPU memory
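
A minimal sketch of the KVBlockManager idea, under the assumption that fixed-size blocks are the unit of allocation (the real manager would operate on GPU tensors; here blocks are just integer ids, and the `victim` parameter is a hypothetical eviction hint):

```python
class KVBlockManager:
    """Toy fixed-size KV block allocator with a CPU offload tier.
    Fixed-size blocks avoid fragmentation across varying sequence
    lengths; evicted blocks move to the CPU tier rather than being lost."""

    def __init__(self, num_gpu_blocks):
        self.free = list(range(num_gpu_blocks))  # free GPU block ids
        self.gpu = {}   # seq_id -> list of GPU block ids
        self.cpu = {}   # seq_id -> count of blocks offloaded to CPU

    def allocate(self, seq_id, victim=None):
        """Grab a free GPU block for seq_id; if the GPU tier is full,
        offload the victim sequence's blocks first."""
        if not self.free:
            if victim is None:
                raise MemoryError("GPU tier full and no eviction victim")
            self.offload(victim)
        block = self.free.pop()
        self.gpu.setdefault(seq_id, []).append(block)
        return block

    def offload(self, seq_id):
        """Move a sequence's GPU blocks to the CPU tier (stand-in for an
        async device-to-host copy) and recycle their GPU slots."""
        blocks = self.gpu.pop(seq_id, [])
        self.cpu[seq_id] = self.cpu.get(seq_id, 0) + len(blocks)
        self.free.extend(blocks)

mgr = KVBlockManager(num_gpu_blocks=2)
mgr.allocate("seq-a")
mgr.allocate("seq-a")
mgr.allocate("seq-b", victim="seq-a")  # evicts seq-a's 2 blocks to CPU
```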

⚙️ Tech Stack

  • 🧠 Frameworks: PyTorch, C++ extensions
  • 🚀 Parallelism: CUDA streams, double buffering, asynchronous memory transfers
  • 🛠️ Languages: Python, C++
  • 📦 Deployment: Docker, AWS

🧪 Benchmarks (Ongoing)

| Context | VRAM Usage | Throughput (tokens/s) | Offload Stage              |
| ------- | ---------- | --------------------- | -------------------------- |
| 8K      | Baseline   | TBD                   | N/A                        |
| 16K     | –30%       | TBD                   | GPU → CPU                  |
| 32K+    | TBD        | TBD                   | GPU → CPU → Disk (planned) |

Goal: Keep throughput slowdown under 2× while scaling context length.
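
The 2× goal follows directly from the overlap model: with perfect double buffering, each decode step costs max(compute, transfer) rather than their sum, so slowdown stays below 2× whenever a block's transfer takes less than twice its compute time. The millisecond figures below are illustrative, not measurements:

```python
def overlapped_slowdown(compute_ms, transfer_ms):
    """With perfect double-buffered overlap, per-step cost is
    max(compute, transfer) instead of compute + transfer."""
    return max(compute_ms, transfer_ms) / compute_ms

# Transfer fully hidden behind compute: no slowdown.
hidden = overlapped_slowdown(10.0, 8.0)    # 1.0
# Transfer-bound at 1.5x compute: 1.5x slowdown, still under the 2x goal.
bound = overlapped_slowdown(10.0, 15.0)    # 1.5
```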


📌 Development Roadmap

  1. Stabilize double-buffered async decode and benchmark against baseline
  2. Implement log-sum-exp accumulation in streamed attention kernel
  3. Add disk-tier offload and adaptive caching policies
  4. Integrate a demo notebook showcasing 32K+ token inference
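
The log-sum-exp accumulation from item 2 can be verified in plain Python: each KV block contributes a running max, an unnormalized softmax sum, and an unnormalized weighted-value accumulator, and merging those partials reproduces exact full-sequence attention. This is a sketch of the math only (scalar scores/values), not the kernel:

```python
import math

def block_attention(scores, values):
    """Partial attention over one KV block: (max, unnormalized softmax
    sum, unnormalized weighted-value accumulator)."""
    m = max(scores)
    s = sum(math.exp(x - m) for x in scores)
    o = sum(math.exp(x - m) * v for x, v in zip(scores, values))
    return m, s, o

def merge_blocks(parts):
    """Log-sum-exp merge of per-block partials into exact attention:
    rescale each block by exp(m_b - m_global) before summing."""
    m = max(p[0] for p in parts)
    s = sum(p[1] * math.exp(p[0] - m) for p in parts)
    o = sum(p[2] * math.exp(p[0] - m) for p in parts)
    return o / s

# Streaming the KV cache in two blocks matches full softmax attention.
scores = [0.5, 2.0, -1.0, 3.0]
values = [1.0, 2.0, 3.0, 4.0]
streamed = merge_blocks([block_attention(scores[:2], values[:2]),
                         block_attention(scores[2:], values[2:])])
m = max(scores)
full = (sum(math.exp(x - m) * v for x, v in zip(scores, values))
        / sum(math.exp(x - m) for x in scores))
```

Because each partial is stored relative to its own block max, no intermediate ever exponentiates a large score, which is what makes the streamed version numerically stable.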

📬 Contact & Collaboration

Maintained by Ishan Revankar
🔗 LinkedIn
📫 Open an issue or reach out for ideas and contributions!

⚠️ Work in progress: expect frequent breaking changes and updates.

Owner

  • Login: ishanrev
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, you can cite it as shown below."
title: "LitGPT"
abstract: "20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale."
date-released: 2023-03-22
authors:
  - name: "The Lightning AI team"
license: "Apache-2.0"
url: "https://github.com/Lightning-AI/litgpt"

GitHub Events

Total
  • Push event: 6
  • Create event: 2
Last Year
  • Push event: 6
  • Create event: 2