Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection W_Q may be set to the identity without noticeable performance deterioration. This is possible because attention depends on X only through the products XW_Q, XW_K, and XW_V, so basis transformations can be absorbed by adjacent layers and propagated through the network. We replace W_Q ∈ ℝ^{d×d} with a nonlinear residual of the form Q(X) = X + f_θ(X), where f_θ is a bottleneck MLP with d² + O(d) parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3-small-style models show consistent improvement over the baseline (2.40% lower validation log-loss, 6.81% lower perplexity), comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
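To make the form Q(X) = X + f_θ(X) concrete, the following is a minimal pure-Python sketch of a residual bottleneck query for a single token vector. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the bottleneck width, the activation, or the initialization, so the width r = d/2 (which yields d² + 1.5d parameters, consistent with d² + O(d)), the GELU activation, the zero-initialized up-projection, and all function names are hypothetical choices.

```python
import math
import random


def bottleneck_param_count(d, r):
    """Weights plus biases of a two-layer MLP mapping d -> r -> d."""
    return d * r + r + r * d + d


def init_params(d, r, seed=0):
    """Random down-projection; zero up-projection (an assumed init, not from the paper)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    w_down = [[rng.gauss(0.0, scale) for _ in range(r)] for _ in range(d)]  # d x r
    b_down = [0.0] * r
    w_up = [[0.0] * d for _ in range(r)]  # r x d, zeros so that Q(X) = X at initialization
    b_up = [0.0] * d
    return w_down, b_down, w_up, b_up


def gelu(x):
    """Exact GELU; the activation is an assumption, the abstract does not name one."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def nonlinear_query(x_row, params):
    """Compute Q(x) = x + f_theta(x) for one token vector x of length d."""
    w_down, b_down, w_up, b_up = params
    d, r = len(w_down), len(b_down)
    # Down-project to the bottleneck and apply the nonlinearity.
    hidden = [
        gelu(sum(x_row[i] * w_down[i][j] for i in range(d)) + b_down[j])
        for j in range(r)
    ]
    # Up-project back to width d.
    f = [sum(hidden[j] * w_up[j][k] for j in range(r)) + b_up[k] for k in range(d)]
    # The identity term: the MLP output is added to the input, not substituted for it.
    return [x + fx for x, fx in zip(x_row, f)]
```

With the zero-initialized up-projection assumed here, the query starts out exactly equal to its input, which is one way to read the statement that the identity term anchors the nonlinearity to a known-good prior; training would then move f_θ away from zero only where that lowers the loss.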
