PiKV: KV Cache Management System for Mixture of Experts

Abstract

As large language models continue to grow in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored to MoE architectures. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV Routing to reduce token-to-KV access, and PiKV Scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV Compression modules into the caching pipeline for acceleration. PiKV is publicly available as an open-source software library: https://github.com/NoakLiu/PiKV. PiKV is still a living project, aiming to become a comprehensive KV cache management system for MoE architectures.
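
To make the idea of expert-sharded KV storage concrete, the following is a minimal sketch and not PiKV's implementation. It assumes one bounded shard per expert, a hypothetical `route` function (a modulo rule standing in for a learned router), and plain least-recently-used eviction standing in for PiKV Scheduling, whose actual retention policy the abstract does not specify. The class name `ExpertShardedKVCache` and all parameters are illustrative.

```python
from collections import OrderedDict


class ExpertShardedKVCache:
    """Toy expert-sharded KV cache: one bounded shard per expert (assumption)."""

    def __init__(self, num_experts, capacity_per_shard):
        self.capacity = capacity_per_shard
        # In a real system each shard would live on a different GPU;
        # here every shard is an in-process ordered dictionary.
        self.shards = [OrderedDict() for _ in range(num_experts)]

    def put(self, expert_id, token_id, key, value):
        shard = self.shards[expert_id]
        shard[token_id] = (key, value)
        shard.move_to_end(token_id)  # mark as most recently used
        # LRU eviction: a simple stand-in, not PiKV's scheduling policy.
        while len(shard) > self.capacity:
            shard.popitem(last=False)

    def get(self, expert_id, token_id):
        shard = self.shards[expert_id]
        if token_id not in shard:
            return None
        shard.move_to_end(token_id)  # a hit refreshes recency
        return shard[token_id]

    def shard_sizes(self):
        return [len(shard) for shard in self.shards]


def route(token_id, num_experts):
    """Hypothetical router: modulo assignment instead of a learned gate."""
    return token_id % num_experts


cache = ExpertShardedKVCache(num_experts=4, capacity_per_shard=2)
for token_id in range(10):
    expert_id = route(token_id, 4)
    cache.put(expert_id, token_id, key=[float(token_id)], value=[2.0 * token_id])

print(cache.shard_sizes())
print(cache.get(0, 8))
print(cache.get(0, 0))
```

Because each token's entry lives only in the shard of the expert it was routed to, a lookup touches one shard rather than a globally synchronized cache, and per-shard capacity bounds memory independently on each device.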
