Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Abstract

In modern GPU inference, cache efficiency remains a major bottleneck, and heuristic policies such as LRU can perform far worse than the offline optimum. Existing learning-based caching systems improve hit rates mainly through predictor design, but often follow learned predictions blindly, making performance unreliable when predictions are inaccurate. In contrast, emerging learning-augmented caching algorithms [pmlr-v80-lykouris18a, mitzenmacher2022algorithms] provide performance guarantees by carefully integrating predictions into caching policies, achieving both consistency (near-optimality under perfect predictions) and robustness (bounded worst-case performance under prediction errors). However, deployment remains challenging. A practical algorithm should satisfy strict time and space efficiency constraints, which some theoretical work overlooks, while also incurring low deployment overhead. We propose learning-augmented LRU, a deployment-oriented learning-augmented caching algorithm that guarantees 1-consistency and O(k)-robustness, incurs low time and space overhead, and maintains strong compatibility. We further build a GPU cache, called LCR, on top of learning-augmented LRU to benefit from its theoretical guarantees and translate them into practical performance. In experiments, LCR reduces P99 time-to-first-token (TTFT) by up to 28.3% on LLM workloads and increases throughput by up to 24.2% on deep learning recommendation (DLRM) workloads. Even with poor predictions, performance degrades gracefully and remains close to LRU, demonstrating robustness with practical value.
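
To make the gap between LRU and the offline optimum concrete, the following minimal sketch counts misses for both policies on a small cyclic trace. It is an illustration only and not the paper's learning-augmented LRU or the LCR cache: the function names `lru_misses` and `belady_misses`, the variable `trace`, and the cache size of 3 are assumptions chosen for this example, and the offline optimum is computed with the standard farthest-in-future (Belady) rule for uniform-cost caching.

```python
from collections import OrderedDict


def lru_misses(trace, k):
    """Count misses for an LRU cache of capacity k (hypothetical helper)."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= k:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses


def belady_misses(trace, k):
    """Count misses for the offline farthest-in-future rule (hypothetical helper)."""
    # next_use[i] is the index of the next request for trace[i], or infinity.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # cached key -> index of its next request
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= k:
                # Evict the key whose next request is farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[key] = next_use[i]
    return misses


# Cyclic scan over k + 1 = 4 distinct keys with capacity k = 3 (assumed values).
trace = [i % 4 for i in range(12)]
print("LRU misses:    ", lru_misses(trace, 3))
print("Offline misses:", belady_misses(trace, 3))
```

On a cyclic scan over one more key than the cache can hold, LRU always evicts the key that is requested next, so every request misses, while the offline rule misses far less often. Perfect next-use predictions would let an online policy imitate the offline rule, which is the regime the consistency guarantee refers to; the robustness guarantee covers the case where those predictions are wrong.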
