When and Why Momentum Accelerates SGD:An Empirical Study

Abstract

Momentum has become a crucial component in deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of ''when'', we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the effective learning rates ηef, a notion unifying the influence of momentum coefficient μ and batch size b over learning rate η. In the comparison of SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when ηef is small, SGDM and SGD experience almost the same empirical training losses; when ηef surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch size. For the question of ``why'', we find that the momentum acceleration is closely related to abrupt sharpening which is to describe a sudden jump of the directional Hessian along the update direction. Specifically, the misalignment between SGD and SGDM happens at the same moment that SGD experiences abrupt sharpening and converges slower. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Together, this study unveils the interplay between momentum, learning rates, and batch sizes, thus improving our understanding of momentum acceleration.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…