SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
Abstract
Fully decentralized Muon is difficult because its nonlinear matrix-sign operator does not commute with linear gossip averaging. This makes decentralized Muon a structural design problem: in designing the algorithm, one must distinguish modular components from non-modular ones. We propose SUDA-Muon, which realizes this separation through a unified primal-dual communication template called SUDA; within this template, ED/D2, EXTRA, and gradient tracking become modular backbone choices. We prove a topology-separated non-asymptotic convergence guarantee in the nuclear-norm geometry: the dominant term scales as $O((1+\sigma/N)K^{-1/4})$ and does not explicitly involve graph quantities, which identifies the communication backbone as the modular axis of the structural design. We then establish two complementary non-modular boundaries. Internally, tracking before polarization is necessary: under heterogeneous objectives, the natural variant without tracking cannot avoid non-stationary fixed points. Externally, in the absence of a central server, a fully decentralized method cannot perform the federated average-then-polarize update; we show that the resulting non-modular local-polarize-then-average design is the essential reason why SUDA-Muon can fail to exhibit linear speedup. Experiments on CIFAR-100 and GPT-2 fine-tuning support the same picture: the unified template makes different communication algorithms directly comparable. In mild near-IID regimes, the resulting variants perform similarly, while in the more difficult long-horizon non-IID CIFAR-100 setting, SUDA-Muon achieves higher accuracy and lower loss than DeMuon.
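The non-commutativity mentioned at the start of the abstract can be seen on a two-node toy problem. The sketch below is illustrative only and is not taken from the paper: the function names (`msign`, `average_then_polarize`, `polarize_then_average`) and the two diagonal "gradients" are hypothetical, and the matrix sign is computed with an exact SVD as an assumption, whereas practical Muon implementations typically approximate it iteratively.

```python
import numpy as np

def msign(G):
    """Matrix sign (polar factor) U @ V^T of G, via an exact SVD.

    Assumption for illustration: an exact SVD stands in for whatever
    approximation a real Muon implementation would use.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def average_then_polarize(grads):
    """Server-style update: average the gradients first, then polarize."""
    return msign(np.mean(grads, axis=0))

def polarize_then_average(grads):
    """Decentralized-style update: polarize locally, then average."""
    return np.mean([msign(G) for G in grads], axis=0)

# Two hypothetical nodes whose gradients disagree in sign on the first direction.
grads = [np.diag([2.0, 1.0]), np.diag([-1.0, 2.0])]

central = average_then_polarize(grads)
local = polarize_then_average(grads)
gap = np.linalg.norm(central - local)  # Frobenius norm of the mismatch

print("average-then-polarize:\n", central)
print("polarize-then-average:\n", local)
print("gap:", gap)
```

In this toy case the averaged gradient is diag(0.5, 1.5), so averaging first yields the identity, while polarizing first yields diag(0, 1) because the two local signs cancel along the first direction. The gap between the two updates is nonzero, which is the sense in which polarization and linear averaging do not commute.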