ProServe: Unified Multi-Priority Request Scheduling for LLM Serving
Abstract
The widespread deployment of large language models (LLMs) for interactive applications necessitates serving systems that can handle thousands of concurrent requests with diverse Service Level Objective (SLO) requirements. A critical yet often overlooked dimension in this context is the inherent priority difference among clients; for instance, business-critical functions demand stronger performance guarantees, because fulfilling such requests yields significantly greater business value. However, existing LLM serving schedulers fail to jointly optimize for both SLO attainment and client-level priorities. To bridge this gap, we first formalize multi-priority request scheduling as a service gain maximization problem, in which satisfying the latency requirements of requests at different priorities contributes different amounts of gain. We then propose ProServe, a unified two-tier scheduling framework designed to maximize overall service gain. At the engine layer, SlideBatching dynamically adapts batch formation under varying loads, employing a sliding boundary mechanism to balance latency and priority differentiation. Because requests may be preempted, an efficient block management scheme adopts asynchronous offloading, pipelined reloading, and adaptive copy-budget control to overlap computation with host-device block transfers. At the service layer, GoRouting performs gain-oriented and capability-aware dispatching across distributed instances, proactively reserving capacity for future high-priority or long requests. Extensive evaluation on four open-source datasets and one industrial dataset shows that ProServe outperforms state-of-the-art baselines, improving system gain by up to 35% and SLO attainment by up to 52%.
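To make the service gain objective concrete, the following is a minimal sketch of how a priority-weighted gain could be computed alongside plain SLO attainment. It is an illustration under stated assumptions, not the paper's formulation: the abstract says only that meeting latency requirements at different priorities contributes different gain, so the `Request` fields, the `GAIN_BY_PRIORITY` weights, and the all-or-nothing gain rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    priority: int      # hypothetical priority class; 0 is the most important
    latency_ms: float  # observed end-to-end latency of the request
    slo_ms: float      # latency target the request must meet


# Hypothetical weights: higher-priority requests contribute more gain when served on time.
GAIN_BY_PRIORITY = {0: 4.0, 1: 2.0, 2: 1.0}


def service_gain(requests):
    """Sum the priority weight of every request that met its latency target.

    Assumption: a request that misses its SLO contributes zero gain.
    """
    return sum(
        GAIN_BY_PRIORITY[r.priority]
        for r in requests
        if r.latency_ms <= r.slo_ms
    )


def slo_attainment(requests):
    """Fraction of requests that met their latency target, ignoring priority."""
    if not requests:
        return 0.0
    met = sum(1 for r in requests if r.latency_ms <= r.slo_ms)
    return met / len(requests)


requests = [
    Request(priority=0, latency_ms=90.0, slo_ms=100.0),   # high priority, on time
    Request(priority=1, latency_ms=250.0, slo_ms=200.0),  # medium priority, late
    Request(priority=2, latency_ms=50.0, slo_ms=100.0),   # low priority, on time
]

print(service_gain(requests))    # gain counts only the on-time requests, weighted by priority
print(slo_attainment(requests))  # attainment treats every request equally
```

The two metrics can diverge: a scheduler that meets many low-priority deadlines while missing a few high-priority ones may score well on attainment yet poorly on gain, which is the gap the abstract describes between SLO attainment and client-level priorities.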