Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search
Abstract
An N-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the context-free model, nodes represent computation stages and edge weights are independently measured instruction costs. In the context-aware model, nodes are expanded to encode the predecessor edge type, so that edge weights capture inter-operation correlations such as cache warming: the cost of operation B depends on which operation A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW (Frigo and Johnson, 1998): that optimal-substructure assumptions break down "because of the different states of the cache." Applied to Apple M1 NEON, the context-free Dijkstra search finds an arrangement at 22.1 GFLOPS (74% of optimal). The context-aware Dijkstra search discovers R4 R2 R4 R4 Fused-8 at 29.8 GFLOPS, a 5.2× improvement over pure radix-2 and 34% faster than the context-free result. This arrangement includes a radix-2 pass sandwiched between radix-4 passes, exploiting cache residuals that only exist in context. No context-free search can discover this.
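The difference between the two models can be illustrated with a toy search. The following is a minimal sketch, not the paper's implementation: the pass set, the function names (`shortest_plan`, `context_free_cost`, `context_aware_cost`), and every cost value are hypothetical and chosen only to make the effect visible. A state is the pair (log2 bits already transformed, previous pass type); when the cost function ignores the previous pass, the expanded graph behaves exactly like the context-free one.

```python
import heapq

# Hypothetical pass types: name -> number of log2(N) bits the pass consumes.
OPS = {"R2": 1, "R4": 2}

# Made-up costs of each pass measured in isolation (arbitrary units).
ISOLATED = {"R2": 3.0, "R4": 5.0}

# Made-up pairwise costs: (previous pass, current pass) -> cost in context.
# They stand in for effects such as cache warming; missing pairs fall back
# to the isolated cost.
IN_CONTEXT = {("R4", "R2"): 1.0, ("R2", "R4"): 4.0}


def context_free_cost(prev_op, op):
    """Edge weight that ignores what ran before."""
    return ISOLATED[op]


def context_aware_cost(prev_op, op):
    """Edge weight that depends on the predecessor pass."""
    return IN_CONTEXT.get((prev_op, op), ISOLATED[op])


def shortest_plan(total_bits, cost):
    """Dijkstra over states (bits_done, previous_op).

    Returns (total cost, list of passes) for the cheapest sequence of
    passes whose widths sum to total_bits.
    """
    start = (0, None)
    best = {start: 0.0}
    parent = {}
    counter = 0  # unique tie-breaker so heap entries never compare states
    heap = [(0.0, counter, start)]
    while heap:
        dist, _, state = heapq.heappop(heap)
        if dist > best[state]:
            continue  # stale heap entry
        bits, prev_op = state
        if bits == total_bits:
            # First goal state popped is optimal; walk parents back to start.
            plan = []
            while state in parent:
                plan.append(state[1])
                state = parent[state]
            return dist, plan[::-1]
        for op, width in OPS.items():
            if bits + width > total_bits:
                continue
            nxt = (bits + width, op)
            new_dist = dist + cost(prev_op, op)
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                parent[nxt] = state
                counter += 1
                heapq.heappush(heap, (new_dist, counter, nxt))
    raise ValueError("no sequence of passes covers total_bits")


if __name__ == "__main__":
    print(shortest_plan(5, context_free_cost))
    print(shortest_plan(5, context_aware_cost))
```

With these invented numbers and a 32-point transform (5 bits), the context-free search can only conclude that two R4 passes and one R2 pass cost 13.0 in any order, whereas the context-aware search singles out R4, R2, R4 at 10.0 because the in-context edges reward that specific ordering. The real measurements and the full pass set (including Fused-8) are those reported in the paper, not the values used here.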