On the Convergence of Muon and Beyond
Abstract
The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal ergodic convergence rate of O(T-1/4) in stochastic non-convex settings, where T denotes the number of iterations. To study the theoretical limits of Muon, we analyze two momentum-based variance-reduced variants: the one-batch Muon-MVR1 and the two-batch Muon-MVR2. We provide the first rigorous proof that, under horizon-free learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate O(T-1/3), matching the lower bound for this problem class. Under the Polyak--ojasiewicz (PL) condition, we establish anytime guarantees for Muon-MVR1 and Muon-MVR2: they attain best-iterate rates of O(T-1/4) and O(T-1/3) for the expected square-root suboptimality, and, given an additional uniform gradient bound along the iterates, achieve last-iterate rates of O(T-1/4) and O(T-1/3) for the objective gap, respectively. Experiments on CIFAR-10 and C4 support the practical effectiveness of the proposed variance-reduced Muon variants. Code is available at https://github.com/MaeChd/MUON-MVRMuon-MVR Codebase.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.