Compass: Co-Exploration of Mapping and Hardware for Heterogeneous Multi-Chiplet Accelerators Targeting LLM Inference Service Workloads

Xuehai Zhou

Abstract

Large language models (LLMs) bring huge computational demands, which makes multi-chiplet accelerators that can integrate large-scale computing resources a powerful solution. However, existing design space exploration (DSE) efforts for such accelerators primarily focus on traditional CNN/Transformer workloads and fall short in supporting the highly dynamic behavior of real-world LLM inference services. This dynamic nature manifests in two key aspects: 1) Mixed request types: the prefill and decode phases exhibit significantly different computational patterns and are frequently interleaved by modern system-level service schedulers; 2) Variable sequence lengths: sequence lengths can differ across requests by several orders of magnitude, rendering padding-based assumptions inefficient. Moreover, many prior works assume homogeneous chiplets and overlook the potentially beneficial interaction between LLM dynamics and heterogeneous chiplet architectures. To bridge this gap, we introduce Compass, a co-exploration framework designed to optimize mapping strategies and hardware design for multi-chiplet accelerators, specifically tailored for dynamic LLM workloads. First, we propose a computation execution graph-based mapping encoding scheme that decouples the micro-batch and layer dimensions, enabling fine-grained execution control on heterogeneous chiplets and flexibly representing various parallelism strategies. Second, based on this scheme, we develop the Compass framework itself, which integrates an evaluation engine, a mapping generation engine based on a genetic algorithm, and a hardware sampling engine based on Bayesian optimization, enabling fast and flexible cross-level co-design. Compared with the state-of-the-art (SOTA) DSE works Gemini and MOHaM, Compass reduces latency by 63.92% and energy by 40.32% on average across various scenarios, with only a 3.11% increase in monetary cost.
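The claim that padding becomes inefficient under variable sequence lengths can be illustrated with simple arithmetic. The sketch below is not part of Compass; the function name `padding_waste` and the example lengths are assumptions chosen for illustration. It computes the fraction of token slots that hold only padding when every request in a batch is padded to the longest one.

```python
def padding_waste(seq_lens):
    """Fraction of token slots wasted when a batch is padded to its longest sequence.

    Hypothetical helper for illustration only; it is not taken from the paper.
    """
    if not seq_lens:
        raise ValueError("seq_lens must contain at least one length")
    padded_slots = max(seq_lens) * len(seq_lens)  # every request padded to the max
    useful_slots = sum(seq_lens)                  # tokens that carry real work
    return 1.0 - useful_slots / padded_slots


# Assumed example: four requests whose lengths span two orders of magnitude.
lengths = [16, 128, 2048, 32]
waste = padding_waste(lengths)
print(f"{waste:.1%} of padded token slots are wasted")  # prints "72.9% of ..."
```

With these assumed lengths, roughly three quarters of the padded slots do no useful work, which is the kind of inefficiency that motivates handling per-request sequence lengths explicitly rather than assuming a uniform padded length.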
