Convergence of Stochastic Gradient Descent with mini-batching and infinite variance

Philippe Soulier


Abstract

Stochastic gradient descent (SGD) with mini-batching is a standard tool in large-scale optimization, yet its theoretical properties under heavy-tailed gradient noise remain largely unexplored. In this paper we study SGD with increasing batch sizes when the gradient noise belongs to the domain of attraction of an α-stable law with α ∈ (1, 2). Building on existing results for the finite-variance regime and for heavy-tailed SGD without batching, we establish three main results. First, we derive L^p moment bounds for the SGD error and show that increasing batch sizes lead to faster convergence rates. In particular, batching enables convergence in probability even for a constant stepsize. Second, we prove that the properly normalized SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by an α-stable Lévy process. Third, for Polyak-Ruppert averaging we obtain a stable limit theorem with a normalization that explicitly depends on the batch-size schedule.
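The setting described in the abstract can be illustrated with a minimal simulation sketch. It is not the paper's algorithm or experiments. Everything specific here is an assumption made for illustration: the one-dimensional quadratic objective f(x) = x²/2, the polynomial batch-size schedule b_n = ⌈n^β⌉, the stepsize γ_n = γ₀ n^(−ρ), and the use of Student-t noise with α degrees of freedom (which has a finite mean and infinite variance for α ∈ (1, 2)) as a stand-in for noise in the domain of attraction of an α-stable law. The function names `batch_size` and `run_sgd` are hypothetical.

```python
import numpy as np


def batch_size(n, beta):
    """Hypothetical polynomial batch-size schedule b_n = ceil(n**beta)."""
    return int(np.ceil(n ** beta))


def run_sgd(n_steps=2000, alpha=1.5, beta=0.5, gamma0=0.5, rho=0.0,
            x0=5.0, seed=0):
    """Mini-batch SGD on f(x) = x**2 / 2 with heavy-tailed gradient noise.

    Assumptions (not taken from the paper):
      - noise is Student-t with `alpha` degrees of freedom, so it has
        infinite variance when 1 < alpha < 2;
      - stepsize gamma_n = gamma0 * n**(-rho); rho = 0 gives a constant
        stepsize;
      - batch size grows as ceil(n**beta).
    Returns the last iterate and the Polyak-Ruppert average of the iterates.
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    running_sum = 0.0
    for n in range(1, n_steps + 1):
        b = batch_size(n, beta)
        # Average b independent noisy gradients; the exact gradient is x.
        noise = rng.standard_t(alpha, size=b)
        grad = x + noise.mean()
        gamma = gamma0 * n ** (-rho)
        x = x - gamma * grad
        running_sum += x
    return x, running_sum / n_steps


if __name__ == "__main__":
    last, averaged = run_sgd()
    print("last iterate:", last)
    print("Polyak-Ruppert average:", averaged)
```

Setting `beta=0.0` recovers single-sample SGD, so the two regimes can be compared under the same seed; any such comparison is only a numerical illustration and says nothing about the rates proved in the paper.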
