Sublinear Time Quantum Algorithm for Attention Approximation
Abstract
Given the query, key and value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the attention module is defined as $\mathrm{Att}(Q, K, V) = D^{-1} A V$, where $A = \exp(Q K^\top / \sqrt{d})$ with $\exp(\cdot)$ applied entrywise and $D = \mathrm{diag}(A \mathbf{1}_n)$. The attention module is the backbone of modern transformers and large language models, but explicitly forming the softmax matrix $D^{-1} A$ incurs $\Omega(n^2)$ time, motivating numerous approximation schemes that reduce the runtime to $\widetilde{O}(nd)$ via sparsity or low-rank factorization. We propose a quantum data structure that approximates any row of $\mathrm{Att}(Q, K, V)$ using only row queries to $Q$, $K$ and $V$. Our algorithm preprocesses these matrices in $\widetilde{O}\left( \epsilon^{-1} n^{0.5} \left( s_\lambda^{2.5} + s_\lambda^{1.5} d + \alpha^{0.5} d \right) \right)$ time, where $\epsilon$ is the target accuracy, $s_\lambda$ is the $\lambda$-statistical dimension of the exponential kernel defined by $Q$ and $K$, and $\alpha$ measures the row distortion of $V$ and is at most $d / \mathrm{srank}(V)$, where $\mathrm{srank}(V)$ is the stable rank of $V$. Each row query can then be answered in $\widetilde{O}(s_\lambda^2 + s_\lambda d)$ time. To our knowledge, this is the first quantum data structure that approximates rows of the attention matrix in time sublinear in $n$. Our approach relies on a quantum Nyström approximation of the exponential kernel, quantum multivariate mean estimation for computing $D$, and quantum leverage score sampling for the multiplication with $V$.
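For orientation, the definition above can be written out as a small classical NumPy reference. This is only a sketch of the exact quantity being approximated, not the quantum data structure proposed here. The function names `attention`, `attention_row` and `stable_rank` are illustrative and do not come from the paper, and `stable_rank` assumes the standard definition $\|V\|_F^2 / \|V\|_2^2$.

```python
import numpy as np

def attention(Q, K, V):
    """Exact Att(Q, K, V) = D^{-1} A V, forming the full n x n matrix A."""
    d = Q.shape[1]
    A = np.exp(Q @ K.T / np.sqrt(d))       # entrywise exp, shape (n, n)
    D_diag = A @ np.ones(A.shape[0])       # diagonal of D = diag(A 1_n)
    return (A @ V) / D_diag[:, None]       # row-wise division applies D^{-1}

def attention_row(Q, K, V, i):
    """Row i of Att(Q, K, V) without forming A; classical cost O(nd)."""
    d = Q.shape[1]
    a_i = np.exp(K @ Q[i] / np.sqrt(d))    # row i of A, shape (n,)
    return (a_i @ V) / a_i.sum()           # normalize by D_ii

def stable_rank(V):
    """Stable rank, assuming the standard definition ||V||_F^2 / ||V||_2^2."""
    return np.linalg.norm(V, "fro") ** 2 / np.linalg.norm(V, 2) ** 2

# Small random instance (illustrative sizes only).
rng = np.random.default_rng(0)
n, d = 6, 3
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

full = attention(Q, K, V)
row = attention_row(Q, K, V, 2)
srank = stable_rank(V)
```

Even a single exact row costs time linear in $n$ classically, which is the dependence the preprocessing and query bounds above aim to improve on. For inputs with large entries one would normally subtract the row maximum before exponentiating; that step is omitted here to mirror the formula directly.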