How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Zeyuan Allen-Zhu

Abstract

Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives f(x). However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when f(x) is convex. If f(x) is convex, to find a point with gradient norm ε, we design an algorithm SGD3 with a near-optimal rate O(ε^{-2}), improving the best known rate O(ε^{-8/3}) of [18]. If f(x) is nonconvex, to find its ε-approximate local minimum, we design an algorithm SGD5 with rate O(ε^{-3.5}), where previously SGD variants only achieve O(ε^{-4}) [6, 15, 33]. This is no slower than the best known stochastic version of Newton's method in all parameter regimes [30].
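
To make the quantity being measured concrete, the following is a minimal sketch of plain SGD that reports the gradient norm at the final iterate rather than the objective value. It is not SGD3 or SGD5, whose details the abstract does not give. The toy objective f(x) = 0.5 * x^2, the Gaussian noise model, and the names `noisy_grad` and `plain_sgd` are all assumptions chosen for illustration.

```python
import random


def noisy_grad(x, rng, sigma=1.0):
    """Stochastic gradient of the toy objective f(x) = 0.5 * x**2.

    The true gradient is x; Gaussian noise with standard deviation sigma
    is an assumed noise model, not one taken from the paper.
    """
    return x + rng.gauss(0.0, sigma)


def plain_sgd(x0, steps, lr, seed=0):
    """Run plain SGD with a constant step size (the baseline, not SGD3/SGD5).

    Returns the final iterate and the norm of the true gradient there.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    x = x0
    for _ in range(steps):
        x -= lr * noisy_grad(x, rng)
    # For f(x) = 0.5 * x**2 the true gradient at x is x itself.
    return x, abs(x)


x_final, grad_norm = plain_sgd(x0=10.0, steps=500, lr=0.05)
print(f"final gradient norm: {grad_norm:.4f}")
```

With a constant step size the iterate settles into a noise-dominated neighborhood of the minimizer, so the gradient norm stops shrinking; how quickly a method can push that norm below a target accuracy is the kind of rate the abstract compares.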
