Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization
Abstract
Stochastic gradient methods (SGMs) have been extensively used for solving stochastic problems or large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases. Most of them require a large number of samples in some or all iterations of the improved SGMs. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm can achieve the optimal complexity result O(-3) to produce a stochastic -stationary solution, if a mean-squared smoothness condition holds. Different from existing optimal methods, PStorm can achieve the O(-3) result by using only one or O(1) samples in every update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or O(1) new observations. In addition, for large-scale machine learning problems, PStorm can generalize better by small-batch training than other optimal methods that require large-batch training and the vanilla SGM, as we demonstrate on training a sparse fully-connected neural network and a sparse convolutional neural network.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.