Stochastic Distributed Learning with Gradient Quantization and Variance Reduction

Abstract

We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g.\ quantize or sparsify) the gradients, thereby introducing additional variance ω ≥ 1 that might slow down convergence. For strongly convex functions with condition number distributed among n machines, we (i) give a scheme that converges in O(( + ωn + ω) (1/ε)) steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in O(( + ωn + ω + m)(1/ε)) steps to arbitrary accuracy ε > 0. These are the first methods that achieve linear convergence for arbitrary quantized updates. We also (iii) give analysis for the weakly convex and non-convex cases and (iv) verify in experiments that our novel variance reduced schemes are more efficient than the baselines.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…