(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum
Abstract
Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size 1) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \sigma\right)$ convergence when measuring the distance to the optimal solution in the $\ell_2$ norm, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. We prove a lower bound which demonstrates that a $\kappa$ dependence in $b^*$ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \frac{\sigma}{\sqrt{T}}\right)$ rate when measuring the distance to the optimal solution in the $\ell_2$ norm. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an $O\left(\exp\left(-\frac{T}{\kappa}\right) + \frac{\sigma^2}{T}\right)$ rate when measuring the distance to the optimal solution in the squared $\ell_2$ norm. We empirically demonstrate the effectiveness of the proposed algorithms.
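To make the setting concrete, the following is a minimal sketch of mini-batch SHB on a strongly-convex quadratic (least squares). It is an illustration under stated assumptions, not the paper's algorithm: the function name `shb_least_squares`, the least-squares objective, and the use of the classical Polyak heavy-ball step-size and momentum for quadratics are all choices made here, and the paper's exact parameter settings and batch-size threshold $b^*$ may differ.

```python
import numpy as np


def shb_least_squares(X, y, batch_size, num_iters, seed=0):
    """Mini-batch stochastic heavy ball on f(w) = ||Xw - y||^2 / (2n).

    Hypothetical helper for illustration. The step-size and momentum are the
    classical Polyak heavy-ball values for quadratics (an assumption here).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Hessian of the quadratic and its extreme eigenvalues (mu > 0 assumed).
    H = X.T @ X / n
    eigs = np.linalg.eigvalsh(H)
    mu, L = eigs[0], eigs[-1]

    # Deterministic heavy-ball parameters for a quadratic with spectrum in [mu, L].
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
    beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2

    w_prev = np.zeros(d)
    w = np.zeros(d)
    for _ in range(num_iters):
        # Sample a mini-batch without replacement and form its gradient.
        idx = rng.choice(n, size=batch_size, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size

        # Heavy-ball update: gradient step plus momentum from the last move.
        w_next = w - alpha * grad + beta * (w - w_prev)
        w_prev, w = w, w_next
    return w
```

Calling the function with `batch_size` equal to the number of rows of `X` recovers the deterministic heavy-ball method. Smaller batches inject gradient noise, which is the regime where the abstract's batch-size threshold becomes relevant.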