How Much Parallelism Is "Free"? A Principle of Near-Free Parallelism for Parallel Decoding

Aiwei Liu

Abstract

Parallel decoding improves generation efficiency by processing multiple decode positions within a single decode forward pass, but reported speedups conflate algorithmic token utilization with the system cost of executing multiple positions. We isolate the system side by introducing Near-Free Parallelism (NFP), the maximum number of positions executable at near-free latency. Analyzing Dense FFNs, MoE FFNs, and Attention against an idle-compute baseline, we find that NFP is shaped not only by memory-bound resource slack but also by implementation-induced kernel-granularity slack. Based on these mechanisms, we establish a Near-Free Parallelism principle that predicts the NFP boundary from hardware balance and implementation granularity. Validation on representative Dense and MoE models, spanning both diffusion and autoregressive decoding, shows that the principle accurately predicts practical NFP boundaries and reveals that the standard idle-compute intuition can over-predict by up to 23x. The result is a system-side budget for parallelism selection and model-system co-design.
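The abstract does not define the idle-compute baseline. One common reading, assumed here, is the roofline ridge point: the number of positions at which a weight-bound matrix multiply stops being limited by memory bandwidth. The sketch below computes that count. The function name and the hardware figures are hypothetical, and this illustrates only the baseline intuition, not the paper's NFP principle, which also accounts for kernel granularity.

```python
def idle_compute_positions(peak_flops, mem_bandwidth, bytes_per_param):
    """Roofline ridge point of a weight-bound matmul, in decode positions.

    Assumption: each weight is read from memory once per forward pass
    (bytes_per_param bytes) and costs 2 FLOPs (multiply + add) per position.
    Compute time matches memory time when
        2 * n / peak_flops == bytes_per_param / mem_bandwidth
    """
    return peak_flops * bytes_per_param / (2.0 * mem_bandwidth)


# Hypothetical accelerator: 1000 TFLOP/s peak, 2 TB/s bandwidth, 2-byte weights.
n = idle_compute_positions(peak_flops=1000e12, mem_bandwidth=2e12, bytes_per_param=2)
print(n)  # 500.0
```

Under these assumed numbers the baseline suggests roughly 500 positions could run before compute becomes the limit. The abstract reports that estimates of this kind can over-predict the practical boundary by up to 23x.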
