ghostcache
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (5.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ishanrev
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 6.04 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
🔁 Disaggregated KV Caching for LLM Inference (WIP)
Overview
This project explores disaggregated KV cache storage to enable long-context LLM inference beyond GPU memory limits. It introduces a tiered caching architecture that offloads Key/Value tensors from GPU to CPU (and eventually to disk), while maintaining throughput through asynchronous overlap of compute and I/O.
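The tiering idea above can be sketched with a toy two-tier cache: a bounded "GPU" tier that spills least-recently-used blocks into an unbounded "CPU" tier, and faults them back on access. All names here (`TieredKVCache`, `put`, `get`) are hypothetical illustrations, not the project's API; the real system moves CUDA tensors with asynchronous copies rather than Python objects.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a bounded 'GPU' tier spills LRU blocks
    to an unbounded 'CPU' tier (hypothetical sketch, not the repo's API)."""

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # block_id -> payload (hot tier, LRU order)
        self.cpu = {}              # block_id -> payload (spill tier)

    def put(self, block_id, block):
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # evict LRU block
            self.cpu[victim] = data                      # offload to CPU tier

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        block = self.cpu.pop(block_id)  # fault the block back into the hot tier
        self.put(block_id, block)
        return block

cache = TieredKVCache(gpu_capacity=2)
for i in range(4):
    cache.put(i, f"kv-block-{i}")
assert set(cache.cpu) == {0, 1}          # oldest blocks were offloaded
assert cache.get(0) == "kv-block-0"      # block 0 faults back, evicting block 2
```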
⚠️ Under active development. Expect ongoing optimizations, new features, and performance experiments.
✨ Key Features (Planned & In Progress)
- ✅ GPU → CPU offloading of KV cache blocks
- ✅ Double-buffered async decode using CUDA streams for overlapping data transfer and attention compute
- 🔄 Three-stage pipeline: load → compute → merge
- 🚧 KV Block Manager to minimize fragmentation across varying sequence lengths
- 🚧 Log-sum-exp partial attention across streamed KV blocks
- 🔄 Future: Disk-tier offload for ultra-long contexts (100K+ tokens)
- ⚙️ Aim: virtually unlimited context on constrained GPUs, with bounded throughput slowdown (<2×)
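The three-stage pipeline listed above can be mimicked in plain Python with one thread per stage and depth-2 queues between them: while compute works on block *i*, the loader is already staging block *i+1*, which is the essence of double buffering. This is a hedged sketch (the function names and the `b * b` "compute" are invented stand-ins); the real project overlaps `cudaMemcpyAsync` transfers with attention kernels on separate CUDA streams.

```python
import threading, queue

def run_pipeline(blocks):
    """Toy load -> compute -> merge pipeline with depth-2 queues,
    mimicking double buffering (illustrative sketch only)."""
    load_q, merge_q = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    SENTINEL = object()

    def loader():
        for b in blocks:
            load_q.put(b)          # stand-in for an async H2D copy of a KV block
        load_q.put(SENTINEL)

    def computer():
        while (b := load_q.get()) is not SENTINEL:
            merge_q.put(b * b)     # stand-in for partial attention over the block
        merge_q.put(SENTINEL)

    threads = [threading.Thread(target=loader), threading.Thread(target=computer)]
    for t in threads:
        t.start()
    results = []
    while (out := merge_q.get()) is not SENTINEL:
        results.append(out)        # merge stage: accumulate partial results in order
    for t in threads:
        t.join()
    return results

print(run_pipeline([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

The bounded queues are the key detail: they cap how far the loader can run ahead, just as a double buffer limits staging to one block beyond the one being computed.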
🔬 Architecture Summary
```
[GPU KV Buffer] ←→ [Pinned CPU Memory] ←→ [Disk Backend (planned)]
        ▲                            ▲
 Memory │                      Async │
 Stream │          memcpyAsync      │ Transfer
        │                            │
[Attention Engine] ←→ [KV Block Manager]
```
- KVBlockManager: manages allocation, eviction, and tier transitions
- Attention Engine: custom kernel leveraging cudaMemcpyAsync and double buffers
- OffloadManager: schedules data movement across CUDA streams and CPU memory
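The log-sum-exp merge the Attention Engine needs can be shown in NumPy: attention is computed one KV block at a time while a running max and normalizer keep the streamed softmax numerically stable. This is a minimal sketch of the standard online-softmax recurrence, not the project's CUDA kernel; the function name and shapes are assumptions for illustration.

```python
import numpy as np

def streamed_attention(q, k_blocks, v_blocks):
    """Attention over KV blocks that arrive one at a time, merged with a
    running max m and normalizer l (online softmax; illustrative sketch)."""
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far
    l = 0.0               # running sum of exp(score - m)
    acc = np.zeros(d)     # running unnormalized weighted sum of values
    for k, v in zip(k_blocks, v_blocks):
        s = (k @ q) / np.sqrt(d)       # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale the old accumulator to the new max
        p = np.exp(s - m_new)
        acc = acc * scale + p @ v
        l = l * scale + p.sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal((12, 8))
v = rng.standard_normal((12, 8))
# merging 3 blocks of 4 rows matches softmax attention over the full KV at once
scores = (k @ q) / np.sqrt(8)
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ v
out = streamed_attention(q, np.split(k, 3), np.split(v, 3))
assert np.allclose(out, ref)
```

Because each block's contribution is rescaled by `exp(m - m_new)` before being added, the result is exact regardless of block order or size, which is what lets KV blocks stream in from CPU or disk.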
⚙️ Tech Stack
- 🧠 Frameworks: PyTorch, C++ extensions
- 🚀 Parallelism: CUDA streams, double buffering, asynchronous memory transfers
- 🛠️ Languages: Python, C++
- 📦 Deployment: Docker, AWS
🧪 Benchmarks (Ongoing)
| Context | VRAM Usage | Throughput (tokens/s) | Offload Stage |
| ------- | ---------- | --------------------- | -------------------------- |
| 8K | Baseline | TBD | N/A |
| 16K | –30% | TBD | GPU → CPU |
| 32K+ | TBD | TBD | GPU → CPU → Disk (planned) |
⚡ Goal: Keep throughput slowdown under 2× while scaling context length.
📌 Development Roadmap
- Stabilize double-buffered async decode and benchmark against baseline
- Implement log-sum-exp accumulation in streamed attention kernel
- Add disk-tier offload and adaptive caching policies
- Integrate a demo notebook showcasing 32K+ token inference
📬 Contact & Collaboration
Maintained by Ishan Revankar
🔗 LinkedIn
📫 Open an issue or reach out for ideas and contributions!
⚠️ Work in progress: expect frequent breaking changes and updates.
Owner
- Login: ishanrev
- Kind: user
- Repositories: 5
- Profile: https://github.com/ishanrev
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, you can cite it as shown below."
title: "LitGPT"
abstract: "20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale."
date-released: 2023-03-22
authors:
  - name: "The Lightning AI team"
license: "Apache-2.0"
url: "https://github.com/Lightning-AI/litgpt"
GitHub Events
Total
- Push event: 6
- Create event: 2
Last Year
- Push event: 6
- Create event: 2