Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence
Abstract
Heavy-tailed noise in nonconvex stochastic optimization has garnered increasing research interest, as empirical studies, including those on training attention models, suggest it is a more realistic gradient noise condition. This paper studies first-order nonconvex stochastic optimization under heavy-tailed gradient noise in a decentralized setup, where each node can only communicate with its direct neighbors in a predefined graph. Specifically, we consider a class of heavy-tailed gradient noise that is zero-mean and has only a finite p-th moment for p ∈ (1, 2]. We propose GT-NSGDm, Gradient Tracking based Normalized Stochastic Gradient Descent with momentum, which uses normalization, in conjunction with gradient tracking and momentum, to cope with heavy-tailed noise on distributed nodes. We show that, when the communication graph admits primitive and doubly stochastic weights, GT-NSGDm guarantees, for the first time in the literature, that the expected gradient norm converges at an optimal non-asymptotic rate O(1/T^{(p-1)/(3p-2)}), which matches the lower bound in the centralized setup. When the tail index p is unknown, GT-NSGDm attains a non-asymptotic rate O(1/T^{(p-1)/(2p)}) that is, for p < 2, topology independent and has a speedup factor n^{1-1/p} in terms of the number of nodes n. Finally, experiments on nonconvex linear regression with tokenized synthetic data and decentralized training of language models on a real-world corpus demonstrate that GT-NSGDm is more robust and efficient than baselines.
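To make the two rates concrete, the following minimal Python sketch evaluates the exponents (p-1)/(3p-2) and (p-1)/(2p) and the factor n^{1-1/p} stated in the abstract. The function names `rate_exponents` and `node_speedup` are illustrative and not from the paper, and the sketch only evaluates the stated formulas; it does not model constants or how the speedup factor enters the full bound.

```python
from fractions import Fraction


def rate_exponents(p):
    """Exponents of 1/T in the two rates from the abstract.

    Returns (known_p, unknown_p), where
      known_p   = (p - 1) / (3p - 2)   # tail index p known
      unknown_p = (p - 1) / (2p)       # tail index p unknown
    The name and interface are hypothetical, chosen for illustration.
    """
    p = Fraction(p)
    if not (1 < p <= 2):
        raise ValueError("tail index p must lie in (1, 2]")
    known_p = (p - 1) / (3 * p - 2)
    unknown_p = (p - 1) / (2 * p)
    return known_p, unknown_p


def node_speedup(n, p):
    """The factor n^(1 - 1/p) named in the abstract for n nodes."""
    return n ** (1 - 1 / p)


if __name__ == "__main__":
    for p in (Fraction(5, 4), Fraction(3, 2), Fraction(2)):
        known_p, unknown_p = rate_exponents(p)
        print(f"p = {p}: known-p exponent {known_p}, unknown-p exponent {unknown_p}")
    print("speedup factor for n = 16, p = 2:", node_speedup(16, 2))
```

At p = 2 both exponents equal 1/4, while at p = 3/2 they are 1/5 (known p) and 1/6 (unknown p), so not knowing the tail index costs a slower rate in T only when p < 2.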