Chiaroscuro Attention: Spending Compute in the Dark
Abstract
We introduce CHIAR-Former (CHIAroscuro Attention-based tRansFormer), an efficient transformer that routes each token to either DCT spectral mixing (O(d log d), sub-quadratic) or full self-attention (O(n^2 d), quadratic in sequence length n) based on per-token spectral entropy H(x) in [0,1], which measures the frequency-domain complexity of each token embedding x. We make three contributions: (1) we identify routing collapse, in which a three-operator system collapses to DCT+Attention, revealing the optimal operator subset; (2) we propose a learned task-level MetaRouter g = sigma(Linear(x-bar)) in [0,1], where x-bar is the batch-mean embedding and g soft-blends the spectral and identity paths end-to-end; and (3) we demonstrate a 35-40% FLOP reduction at 400M parameters at a cost of 3.93 PPL on WikiText-103 (test PPL 27.51 vs. 23.58). Under mixed-dataset training, CHIAR-Former dramatically outperforms full attention on small corpora, confirming the regularisation value of spectral mixing. The MetaRouter stabilises at g ~ 0.22, indicating that at scale the model reaches a robust compute-quality equilibrium: attention layers absorb representational complexity while spectral preprocessing efficiently anchors low-frequency structure.
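The two quantities the abstract names, the per-token spectral entropy H(x) and the MetaRouter gate g, can be illustrated with a minimal NumPy sketch. The abstract does not define either precisely, so everything below is an assumption rather than the authors' implementation: H(x) is taken to be the Shannon entropy of the normalised DCT power spectrum divided by log d, tokens with H above a hypothetical threshold `tau` are assumed to go to attention, and g is assumed to weight the spectral path against the identity path. The names `dct_matrix`, `spectral_entropy`, `route_to_attention`, `meta_router_gate`, and `soft_blend` are illustrative only.

```python
import numpy as np


def dct_matrix(d):
    """Orthonormal DCT-II matrix of shape (d, d)."""
    k = np.arange(d)[:, None]  # frequency index (rows)
    i = np.arange(d)[None, :]  # position index (columns)
    m = np.cos(np.pi * (i + 0.5) * k / d) * np.sqrt(2.0 / d)
    m[0, :] /= np.sqrt(2.0)    # rescale the DC row so that m @ m.T == I
    return m


def spectral_entropy(x, eps=1e-12):
    """Assumed definition of H(x): normalised entropy of the DCT power spectrum.

    x has shape (n, d); the result has shape (n,) and lies in [0, 1]
    up to floating-point error.
    """
    d = x.shape[-1]
    coeffs = x @ dct_matrix(d).T                    # DCT along the embedding axis
    power = coeffs ** 2
    p = power / (power.sum(axis=-1, keepdims=True) + eps)
    h = -(p * np.log(p + eps)).sum(axis=-1)
    return h / np.log(d)                            # divide by the maximum entropy


def route_to_attention(x, tau=0.5):
    """Hypothetical hard routing rule: True means 'send this token to attention'."""
    return spectral_entropy(x) > tau


def meta_router_gate(x, w, b):
    """g = sigma(Linear(x-bar)), with x-bar the mean embedding over axis 0."""
    x_bar = x.mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(x_bar @ w + b)))


def soft_blend(x, g):
    """Assumed blend: g weights the spectral (DCT) path, 1 - g the identity path."""
    spectral = x @ dct_matrix(x.shape[-1]).T
    return g * spectral + (1.0 - g) * x
```

Under this assumed definition, a constant embedding concentrates all of its energy in the DC coefficient and gets H close to 0, while an embedding whose DCT coefficients all have equal magnitude gets H close to 1. The explicit (d, d) matrix costs O(d^2) per token and is used here only for clarity; a fast DCT would be needed to reach the O(d log d) cost quoted in the abstract.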