Interleaved Head Attention
Abstract
Multi-Head Attention (MHA) is the core computational primitive underlying modern Large Language Models (LLMs). However, MHA suffers from a fundamental linear scaling limitation: $H$ attention heads produce exactly $H$ independent attention matrices, with no communication between heads during attention computation. This becomes problematic for multi-step reasoning, where correct answers depend on aggregating evidence from multiple parts of the context and composing latent token-to-token relations over a chain of intermediate inferences. To address this, we propose Interleaved Head Attention (IHA), which enables cross-head mixing by constructing $P$ pseudo-heads per head (typically $P=H$), where each pseudo query/key/value is a learned linear combination of all $H$ original queries, keys, and values, respectively. Interactions between pseudo-query and pseudo-key heads induce up to $P^2$ attention patterns per head with a modest parameter overhead of $O(H^2 P)$. We provide theory showing improved efficiency in terms of number of parameters on the synthetic Polynomial task (IHA uses ($kn^2$) parameters vs. ($kn^2$) for MHA) and on the synthetic order-sensitive CPM-3 task (IHA uses $N$ heads vs. $N$ for MHA). On real-world benchmarks, IHA improves Multi-Key retrieval on RULER by 10-20% (4k-16k) and, after fine-tuning for reasoning on OpenThoughts, improves GSM8K by 5.8% and MATH-500 by 2.8% (Majority Vote) over full attention.
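The pseudo-head construction described in the abstract can be illustrated with a minimal NumPy sketch. It covers only the cross-head mixing step and the resulting count of query-key pattern pairs. The abstract does not say how IHA normalizes or combines these patterns into outputs, so the separate softmax per pair, the absence of a causal mask, and all names here (`pseudo_heads`, `pairwise_attention_patterns`, `mix_q`, `mix_k`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pseudo_heads(x, mix):
    # x:   (H, T, d) per-head queries (or keys, or values)
    # mix: (H, P, H) mixing weights (learned in a real model, random here)
    # out[h, p] = sum over g of mix[h, p, g] * x[g], i.e. each pseudo-head
    # is a linear combination of all H original heads.
    return np.einsum("hpg,gtd->hptd", mix, x)

def pairwise_attention_patterns(q, k, mix_q, mix_k):
    # Assumption: every pseudo-query is scored against every pseudo-key of
    # the same head, and each pair gets its own softmax. This yields P * P
    # attention matrices per head.
    d = q.shape[-1]
    pq = pseudo_heads(q, mix_q)  # (H, P, T, d)
    pk = pseudo_heads(k, mix_k)  # (H, P, T, d)
    scores = np.einsum("hptd,hrsd->hprts", pq, pk) / np.sqrt(d)
    return softmax(scores, axis=-1)  # (H, P, P, T, T)

rng = np.random.default_rng(0)
H, P, T, d = 4, 4, 6, 8  # P = H, as the abstract suggests is typical
q = rng.standard_normal((H, T, d))
k = rng.standard_normal((H, T, d))
mix_q = rng.standard_normal((H, P, H))
mix_k = rng.standard_normal((H, P, H))

patterns = pairwise_attention_patterns(q, k, mix_q, mix_k)
print(patterns.shape)  # (4, 4, 4, 6, 6): P*P = 16 patterns for each of 4 heads
print(mix_q.size)      # 64 = H^2 * P mixing weights for the queries alone
```

Under these assumptions, each of the query, key, and value mixing tensors holds $H^2 P$ weights, which matches the $O(H^2 P)$ overhead quoted in the abstract, while standard MHA with the same $H$ would produce only $H$ attention matrices.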