HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL

Abstract

Recent advances in agentic large language models (LLMs) have substantially improved Text-to-SQL, enabling users without database expertise to query databases intuitively. However, deploying agentic LLM-based Text-to-SQL systems in production remains challenging due to multi-stage dependencies, strict latency requirements, and deployment complexity across heterogeneous GPUs in enterprise clusters. Existing LLM serving frameworks are designed mainly for independent inference tasks, leading to suboptimal performance and frequent service-level objective (SLO) violations for Text-to-SQL workloads. In this paper, we introduce HexGen-Flow, a framework for scheduling and executing agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters serving multi-tenant requests. HexGen-Flow adopts a hierarchical scheduler that combines global workload-balanced task dispatching with an adaptive local priority queue, guided by a systematic analysis of agentic Text-to-SQL workflows. We also propose a lightweight simulation-based method to tune key scheduling hyperparameters, improving robustness and adaptability. Evaluations on realistic Text-to-SQL benchmarks show that HexGen-Flow significantly outperforms state-of-the-art LLM serving frameworks. Across all traces, HexGen-Flow reduces P95 tail latency by 1.42× to 1.56× and increases throughput by 1.49× to 1.81×, demonstrating consistent gains under diverse workloads.
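The two-level scheduling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the names `Instance`, `dispatch`, and `pop_next`, the `speed` and `cost` parameters, the "earliest estimated finish" dispatch rule, and the slack-based priority are all assumptions chosen only to show how a global load-aware dispatcher and a per-instance priority queue might fit together.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One serving instance; `speed` is a hypothetical relative throughput."""
    name: str
    speed: float
    pending_work: float = 0.0                   # queued work, arbitrary cost units
    queue: list = field(default_factory=list)   # local min-heap keyed on slack

    def estimated_finish(self, cost: float) -> float:
        # Time to drain the current queue plus the new task on this instance.
        return (self.pending_work + cost) / self.speed


def dispatch(instances, request_id, cost, deadline, now=0.0):
    """Global step (assumed rule): pick the instance with the earliest estimated finish."""
    target = min(instances, key=lambda inst: inst.estimated_finish(cost))
    # Local priority (assumed rule): less slack before the deadline means more urgent.
    slack = deadline - now - cost / target.speed
    heapq.heappush(target.queue, (slack, request_id, cost))
    target.pending_work += cost
    return target


def pop_next(instance):
    """Local step: remove and return the most urgent queued task."""
    _slack, request_id, cost = heapq.heappop(instance.queue)
    instance.pending_work -= cost
    return request_id
```

In this sketch a faster GPU absorbs more work before the dispatcher spills over to a slower one, and each instance serves the task closest to missing its deadline first. A real system would also need to recompute priorities as time passes and to account for dependencies between workflow stages, which this sketch omits.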
