From Roofline to Ruggedness: Decomposing and Smoothing the GEMM Performance Landscape

Abstract

Adjacent GEMM problems that differ by a single 128-element step in N can differ in throughput by 30% on the same GPU. This pervasive performance ruggedness, invisible to roofline analysis and peak-FLOPs intuition yet dominant for every non-peak workload, is the subject of this paper. We propose performance ruggedness analysis as an analytical framework complementary to roofline: rather than summarizing GPU performance with a scalar bound, we treat the full multidimensional performance surface as the object of study, decompose its texture into mechanism-attributable components, and separate software-removable contributions from hardware-bound ones. The framing is directly analogous to deep-learning loss landscapes: a continuous quantity (the idealized time $2MNK / \text{peak compute throughput}$) is made rugged by its interaction with discrete hardware substrates (tiles, sub-groups, cache lines, DRAM channels). We apply the framework to BF16 NN (no-transpose) GEMM on Intel Battlemage (Arc B580, sycl-tla) via a 32,768-configuration sweep over $(M, N, K) \in \{128, 256, \ldots, 4096\}^3$. The peak is 110.8 TFLOPs at the non-square shape M=3840, N=2048, K=4096 with the default tile size; the initial landscape roughness is 16.8 TFLOPs per 128-step against an ideal of 2.0. A two-stage software stack, consisting of (i) best-of-six dynamic tile selection and (ii) a novel dynamic-programming-based padding-and-splitting optimizer with O(1) runtime lookup, reduces roughness by 70% and raises mean throughput by 30%. Cross-tile experiments establish that the residual sawtooth period scales exactly with the software tile size, which rules out cache set conflicts and attributes the remaining variance to four hardware-bound sources: per-kernel base overhead, wave quantization, DPAS atom geometry, and GDDR6 channel-hash interactions.
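The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's code. The roughness definition used here (mean absolute throughput change between adjacent 128-steps along one axis) is an assumption, since the abstract does not give the exact formula. The function names and the tile and execution-unit parameters in `wave_efficiency` are also hypothetical and chosen only to show how a discrete tile count makes a smooth quantity jagged.

```python
import math

STEP = 128
# Sweep axis from the abstract: 128, 256, ..., 4096 (32 values per dimension).
SIZES = list(range(STEP, 4096 + 1, STEP))


def sweep_size():
    """Number of (M, N, K) configurations in the full cubic sweep."""
    return len(SIZES) ** 3


def ideal_time_s(m, n, k, peak_tflops):
    """Idealized GEMM time: 2*M*N*K floating-point operations at peak throughput."""
    return 2.0 * m * n * k / (peak_tflops * 1e12)


def roughness(tflops_along_axis):
    """ASSUMED metric: mean absolute throughput change between adjacent steps."""
    diffs = [abs(b - a) for a, b in zip(tflops_along_axis, tflops_along_axis[1:])]
    return sum(diffs) / len(diffs)


def wave_efficiency(m, n, tile_m, tile_n, num_units):
    """Fraction of execution slots doing useful work under wave quantization.

    tile_m, tile_n and num_units are hypothetical parameters, not values
    taken from the paper or from the Arc B580 specification.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_units)  # the last wave may be partly empty
    return tiles / (waves * num_units)
```

With a hypothetical 128x128 tile and 20 execution units, a 512x640 output yields exactly 20 tiles and fills one wave, while a 384x896 output yields 21 tiles and needs a second, nearly empty wave. This kind of step change is what the abstract calls wave quantization.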
