The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Gal Vardi

Abstract

We study the implicit bias of momentum-based optimizers on smooth homogeneous models. We show that momentum steepest descent algorithms such as Muon (spectral norm), MomentumGD (ℓ₂ norm), and Signum (ℓ∞ norm) follow approximate steepest descent trajectories under a decaying learning rate schedule, which proves that these algorithms are biased towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the ℓ∞ margin, and to Muon-Signum and Muon-Adam, which maximize the margin with respect to a hybrid norm. Our experiments corroborate the theory and show that which margin is maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and on momentum-based optimizers in linear models.
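To make the link between optimizer and norm concrete, the following is a minimal sketch of one momentum steepest descent step under the three norms named in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the function names `steepest_direction` and `momentum_step`, the exponential-moving-average form of the momentum buffer, the normalization of the ℓ₂ update, and the use of an exact SVD in place of whatever orthogonalization routine a practical Muon implementation uses are all choices made here for clarity.

```python
import numpy as np


def steepest_direction(m, norm):
    """Direction d with ||d|| <= 1 that maximizes <m, d> under the given norm.

    Assumption: these are the textbook steepest descent directions; the
    paper's exact update rules may differ in normalization details.
    """
    if norm == "l2":
        # Normalized momentum (Frobenius norm for a matrix argument).
        return m / (np.linalg.norm(m) + 1e-12)
    if norm == "linf":
        # Signum-style update: entrywise sign of the momentum buffer.
        return np.sign(m)
    if norm == "spectral":
        # Muon-style update: orthogonal polar factor U V^T of the momentum.
        # Computed with an exact SVD here purely for readability.
        u, _, vt = np.linalg.svd(m, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown norm: {norm}")


def momentum_step(w, m, grad, lr, beta, norm):
    """One hypothetical momentum steepest descent step on a weight matrix w."""
    m = beta * m + (1.0 - beta) * grad          # momentum buffer (EMA form, assumed)
    w = w - lr * steepest_direction(m, norm)    # move along the steepest direction
    return w, m


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((3, 3))
    m = np.zeros_like(w)
    for t in range(5):
        grad = w                                # gradient of the toy loss 0.5 * ||w||_F^2
        lr = 0.1 / np.sqrt(t + 1)               # a decaying schedule, chosen arbitrarily
        w, m = momentum_step(w, m, grad, lr, beta=0.9, norm="spectral")
    print(np.linalg.norm(w))
```

In each case the inner product between the momentum buffer and the returned direction equals the dual norm of the buffer (ℓ₂, ℓ₁, and nuclear norm respectively), which is what distinguishes the three updates geometrically.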
