Error Feedback Shines when Features are Rare
Abstract
We provide the first proof that gradient descent (green GD) with greedy sparsification (green TopK) and error feedback (green EF) can obtain better communication complexity than vanilla green GD when solving the distributed optimization problem x∈ Rd f(x)=1nΣi=1n fi(x), where n = # of clients, d = # of features, and f1,…,fn are smooth nonconvex functions. Despite intensive research since 2014 when green EF was first proposed by Seide et al., this problem remained open until now. We show that green EF shines in the regime when features are rare, i.e., when each feature is present in the data owned by a small number of clients only. To illustrate our main result, we show that in order to find a random vector x such that ∇ f(x) 2 ≤ in expectation, green GD with the green Top1 sparsifier and green EF requires O (( L+bluer redcn ( redcn i Li2, 1nΣi=1n Li2 ) ) 1 ) bits to be communicated by each worker to the server only, where L is the smoothness constant of f, Li is the smoothness constant of fi, redc is the maximal number of clients owning any feature (1≤ redc ≤ n), and bluer is the maximal number of features owned by any client (1≤ bluer ≤ d). Clearly, the communication complexity improves as redc decreases (i.e., as features become more rare), and can be much better than the O(bluer L 1) communication complexity of green GD in the same regime.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.