High Probability Convergence of Clipped-SGD Under Heavy-tailed Noise

Abstract

While the convergence behavior of stochastic gradient methods is well understood in expectation, many gaps remain in the understanding of their convergence with high probability, where the convergence rate depends logarithmically on the desired success probability parameter. In the heavy-tailed noise setting, where the stochastic gradient noise only has bounded p-th moments for some p ∈ (1, 2], existing works could only show bounds in expectation for a variant of stochastic gradient descent (SGD) with clipped gradients, or high probability bounds in special cases (such as p = 2) or under extra assumptions (such as the stochastic gradients having bounded non-central moments). In this work, using a novel analysis framework, we present new and time-optimal (up to logarithmic factors) high probability convergence bounds for SGD with clipping under heavy-tailed noise, for both convex and non-convex smooth objectives, using only minimal assumptions.
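For readers unfamiliar with the method, the following is a minimal sketch of SGD with norm clipping. It is an illustration only, not the paper's algorithm or parameter schedule: the function names, the constant step size `step_size`, the constant clipping threshold `tau`, the quadratic test objective, and the Student-t noise model are all assumptions chosen for the example. Student-t noise with 1.5 degrees of freedom has zero mean and finite p-th moments only for p < 1.5, so its variance is infinite, which places it in the heavy-tailed regime the abstract describes.

```python
import numpy as np


def clip(g, tau):
    """Rescale g so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(g)
    if norm <= tau:
        return g
    return g * (tau / norm)


def clipped_sgd(grad_fn, x0, step_size, tau, num_steps, rng):
    """Run SGD where each stochastic gradient is clipped before the update.

    step_size and tau are held constant here for simplicity (an assumption
    of this sketch, not a schedule taken from the paper).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        g = grad_fn(x, rng)                 # noisy gradient estimate
        x = x - step_size * clip(g, tau)    # clipped update
    return x


def noisy_quadratic_grad(x, rng):
    """Gradient of f(x) = 0.5 * ||x||^2 plus heavy-tailed noise.

    Student-t noise with df=1.5 has mean zero but infinite variance.
    """
    return x + rng.standard_t(df=1.5, size=x.shape)


rng = np.random.default_rng(0)
x_final = clipped_sgd(
    noisy_quadratic_grad,
    x0=np.ones(5),
    step_size=0.01,
    tau=5.0,
    num_steps=2000,
    rng=rng,
)
print(np.linalg.norm(x_final))
```

Clipping bounds the size of every update by `step_size * tau`, so a single extreme noise sample cannot move the iterate arbitrarily far. This is the intuition behind why clipped methods admit high probability guarantees where plain SGD may not.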
