MuCon: Clipped Muon Updates for LLM Training
Abstract
Muon-style optimizers take a matrix-valued momentum or preconditioned update $B = U \operatorname{diag}(\sigma_1, \dots, \sigma_r) V^\top$ and replace it with its canonical partial polar factor $\operatorname{Pol}(B) = U V^\top$. This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix, $D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}(\min\{\sigma_i, \tau\}) V^\top$, $\tau > 0$. Thus, $\operatorname{MClip}_\tau$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon's polar direction. The Muon/MuCon scaling parameterization used in this work is called SpectralP: it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map $\operatorname{MClip}_\tau$ is the Frobenius projection onto the spectral-norm ball $\{X : \|X\|_2 \le \tau\}$: it leaves singular values at or below $\tau$ unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.
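The two operators defined in the abstract can be sketched with a dense SVD, which is the baseline the paper aims to avoid. The block below is a minimal NumPy reference, not the paper's method: the function names `polar_factor`, `mclip`, and `mclip_via_abs` are hypothetical, and `mclip_via_abs` uses the scalar identity min(σ, τ) = (σ + τ − |σ − τ|)/2, which is assumed here to be one possible form of a polar/absolute-value formula. The abstract does not state the exact identity it records.

```python
import numpy as np


def polar_factor(B, rtol=1e-12):
    """Partial polar factor Pol(B) = U V^T over the nonzero singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Keep only directions whose singular value is numerically nonzero.
    keep = s > rtol * s.max(initial=0.0)
    return U[:, keep] @ Vt[keep, :]


def mclip(B, tau):
    """MClip_tau(B) = U diag(min(sigma_i, tau)) V^T via a dense SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.minimum(s, tau)) @ Vt


def mclip_via_abs(B, tau):
    """Same map, written with min(s, tau) = (s + tau - |s - tau|) / 2.

    Assumption: this is an illustrative absolute-value form. It still calls
    an SVD here; an SVD-free method would replace the polar factor and the
    absolute-value term with iterative matrix-function primitives.
    """
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    pol = U @ Vt                           # polar direction
    abs_term = (U * np.abs(s - tau)) @ Vt  # U |Sigma - tau I| V^T
    return 0.5 * (B + tau * pol - abs_term)


# Small demonstration on a random 6 x 4 matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))
s = np.linalg.svd(B, compute_uv=False)
tau = float(np.median(s))  # two singular values lie above tau, two below

C = mclip(B, tau)
C_abs = mclip_via_abs(B, tau)
```

The term `np.abs(s - tau)` is where the near-threshold difficulty noted in the abstract shows up: when some σ_i is close to τ, the sign of σ_i − τ is decided by a small quantity, so an iterative approximation of that absolute value has little margin for error.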