Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights

Abstract

The vanilla self-attention mechanism in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically induced by inputs and whose hidden dimension is equal to the sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but it becomes unscalable for extremely long sequences. Recently, this fast-weight perspective has motivated the Mixture-of-Experts (MoE) attention mechanism, which partitions the sequence into rigid blocks, treats them as fast-weight experts, and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for efficient attention mechanisms, interpreting them as making fast weights scalable through either routing or compression, and organizing them into a five-dimensional taxonomy. Then, we propose Mixture-of-Top-k Attention (MiTA), which employs a small set of landmark queries to gather top-k attended key-value pairs as query-aware and deformable routed experts, while compressing the N-width MLP into a narrower shared expert. Consequently, our MiTA improves the flexibility of prior MoE attention from rigid to deformable fast-weight experts, as well as the scalability of prior top-k attention from query-specific set to reusable top-k set. We conduct extensive experiments on vision tasks showing the superior effectiveness and efficiency of our MiTA, and also uncovering intriguing properties such as an emergent token-pruning effect and easy generalization from standard attention. Code is available at https://github.com/QishuaiWen/MiTA.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…