Muon$^p$: Muon with Fractional Spectral Powers

Will Sawin

Abstract

Muon is an increasingly widely used optimizer that replaces a gradient $G = USV$ with its polar factor $UV$, thereby flattening the singular spectrum. However, full flattening discards singular-value information that may matter for adaptation. We introduce Muon$^p$, a Muon-style optimizer that instead uses fractional spectral-power updates $US^pV$ for rational $p \in (0,1)$, interpolating between Muon and gradient descent. To make it practical, we prove that fractional spectral powers cannot be computed by any fixed univariate polynomial iteration, and furthermore derive low-degree odd bivariate recurrences that approximate $US^pV$ using only matrix multiplications, preserving Muon's matrix-multiplication-only structure and compute complexity. We show that Muon$^p$ maximizes the linear improvement in loss under the Schatten $q$-norm for $q = 1 + \frac{1}{p}$. Empirically, Muon$^p$ is especially effective for finetuning: on billion-scale models, Muon$^p$ improves validation perplexity and downstream task performance. We further analyze when Muon$^p$ is less suitable, through the lens of spectral geometry. Our results reveal important insights on when preserving the singular spectrum can bring significant gains, and introduce a principled way to achieve them.
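To make the interpolation concrete, the following is a minimal reference sketch of the spectral-power map $US^pV$, assuming NumPy. It computes the update through an explicit SVD, which is exactly what the paper's matrix-multiplication-only recurrences are designed to avoid, so it illustrates the target quantity only and is not the authors' algorithm. The function name `spectral_power` is hypothetical.

```python
import numpy as np

def spectral_power(G, p):
    """Reference computation of U S^p V via a full SVD.

    Assumption: this is an illustrative baseline, not the paper's
    bivariate recurrence. NumPy's `Vh` plays the role of the right
    factor written as V in the abstract.
    """
    U, s, Vh = np.linalg.svd(G, full_matrices=False)
    # Scale each column of U by the matching singular value raised to p.
    return (U * s**p) @ Vh

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))       # stand-in for a gradient matrix

polar = spectral_power(G, 0.0)        # p = 0: polar factor (Muon-style update)
same = spectral_power(G, 1.0)         # p = 1: the gradient itself
half = spectral_power(G, 0.5)         # p = 1/2: partially flattened spectrum
```

At $p = 0$ all singular values become 1, at $p = 1$ the gradient is returned unchanged, and for $p = 1/2$ each singular value is replaced by its square root, which compresses the spectrum without erasing it.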
