PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation
Abstract
Muon's matrix-level update couples two distinct effects: spectral control via a polar map, and equivariance under orthogonal changes of multiplicity-space basis (Schur gauge-equivariance). We separate them with PolarAdamW, a controlled hybrid that preserves Muon's polar spectral-norm control but breaks the gauge-equivariance, since AdamW's coordinatewise preconditioner is basis-dependent. Algorithmically, PolarAdamW applies Muon's Newton-Schulz polar map to AdamW's preconditioned direction rather than to raw momentum, at per-iteration wall-time comparable to Muon. We prove that Muon's polar step is Schur gauge-equivariant on multiplicity matrices while AdamW's coordinatewise step is not. On DeiT-Tiny trained from scratch on four independently sampled 100-class subsets of ImageNet-1k, where multiplicity-basis freedom is trivial, PolarAdamW outperforms Muon by +1.93 pp in test accuracy on average and AdamW by +9.5 pp; under the 300-epoch DeiT-style recipe, it remains ahead of Muon by +1.37 pp and AdamW by +5.80 pp on average. On SO(3)-equivariant 3D point-cloud regression, where multiplicity-basis freedom is non-trivial, the ordering reverses: Muon outperforms PolarAdamW at every audited capacity, and the gap widens with capacity. Both matrix-polar optimisers continue to outperform AdamW. This double dissociation separates spectral control from Schur gauge-equivariance: the first composes well with AdamW preconditioning on standard transformers, while the second becomes consequential when multiplicity-basis freedom is structurally non-trivial.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.