M2RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

Abstract

Transformers are highly parallel but are limited to computations in the TC0 complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M2RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance of non-linear RNNs is limited by their state size, and show how the state size expansion mechanism enables efficient use of tensor cores. Empirically, M2RNN achieves perfect state tracking generalization at sequence lengths not seen during training. These benefits also translate to large-scale language modeling. In hybrid settings that interleave recurrent layers with attention, Hybrid M2RNN outperforms equivalent Gated DeltaNet hybrids by 0.4-0.5 perplexity points on a 7B MoE model, while using 3× smaller state sizes for the recurrent layers. Notably, replacing even a single recurrent layer with M2RNN in an existing hybrid architecture yields accuracy gains comparable to Hybrid M2RNN with minimal impact on training throughput. Further, the Hybrid Gated DeltaNet models with a single M2RNN layer also achieve superior long-context generalization, outperforming state-of-the-art hybrid linear attention architectures by up to 8 points on LongBench. Together, these results establish non-linear RNN layers as a compelling building block for efficient and scalable language models.
