Power-law escape rate of SGD

Abstract

Stochastic gradient descent (SGD) is subject to complicated multiplicative noise for the mean-square loss. We use this property of SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a random time change. Using this formalism, we show that the log loss barrier Δ log L = log[L(θs)/L(θ*)] between a local minimum θ* and a saddle θs determines the escape rate of SGD from the local minimum, contrary to previous results borrowed from physics, in which the linear loss barrier ΔL = L(θs) - L(θ*) determines the escape rate. Our escape-rate formula depends strongly on the typical magnitude h* and the number n of the outlier eigenvalues of the Hessian. This result explains the empirical fact that SGD prefers flat minima with low effective dimensions, giving insight into the implicit biases of SGD.
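To make the difference between the two barriers concrete, the following is a minimal sketch that evaluates both for made-up loss values. The function names, the loss values, and the rescaling factors are illustrative assumptions and do not come from the paper. The sketch only shows a basic property: rescaling the loss by a constant changes the linear barrier but leaves the log barrier unchanged. It does not implement the paper's escape-rate formula.

```python
import math

def linear_barrier(loss_saddle, loss_min):
    """Linear loss barrier: L(theta_s) - L(theta_*)."""
    return loss_saddle - loss_min

def log_barrier(loss_saddle, loss_min):
    """Log loss barrier: log[L(theta_s) / L(theta_*)]. Requires positive losses."""
    return math.log(loss_saddle / loss_min)

# Hypothetical loss values at a local minimum and at a saddle (not from the paper).
loss_min, loss_saddle = 0.01, 0.1

# Rescale both losses by the same constant and compare the two barriers.
for scale in (1.0, 10.0, 100.0):
    lin = linear_barrier(scale * loss_saddle, scale * loss_min)
    log_b = log_barrier(scale * loss_saddle, scale * loss_min)
    print(f"scale={scale:6.1f}  linear barrier={lin:8.4f}  log barrier={log_b:.4f}")
```

One way to read the title: any rate of the form exp(-c Δ log L) equals [L(θs)/L(θ*)]^(-c), a power law in the loss ratio. Here c is a placeholder constant used only for this identity; the exponent in the paper's formula is not reproduced here.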
