Distributed Data Summarization in Well-Connected Networks
Abstract
We study distributed algorithms for several fundamental problems in data summarization. Given a communication graph $G$ of $n$ nodes, each of which may initially hold a value, we focus on computing $\sum_{i=1}^{N} g(f_i)$, where $f_i$ is the number of occurrences of value $i$ and $g$ is a fixed function. This captures important statistics such as the number of distinct elements, the frequency moments, and the empirical entropy of the data. In the CONGEST model, a simple adaptation of streaming lower bounds shows that computing some of these statistics exactly requires $\tilde{\Omega}(D + n)$ rounds, where $D$ is the diameter of the graph. However, these lower bounds do not hold for well-connected graphs. We give an algorithm that computes $\sum_{i=1}^{N} g(f_i)$ exactly in $\tau_G \cdot 2^{O(\sqrt{\log n})}$ rounds, where $\tau_G$ is the mixing time of $G$. This also has applications in computing the top $k$ most frequent elements. We demonstrate that the GOSSIP model and the CONGEST model are highly similar in well-connected graphs. In particular, we show that each round of the GOSSIP model can be simulated almost perfectly in $O(\tau_G)$ rounds of the CONGEST model. To this end, we develop a new algorithm for the GOSSIP model that $(1 \pm \varepsilon)$-approximates the $p$-th frequency moment $F_p = \sum_{i=1}^{N} f_i^p$ in $O(\varepsilon^{-2} n^{1-k/p})$ rounds, for $p \geq 2$, when the number of distinct elements $F_0$ is at most $O(n^{1/(k-1)})$. This result can be translated back to the CONGEST model with an $O(\tau_G)$-factor blow-up in the number of rounds.
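To make the target quantity concrete, the following minimal sketch evaluates the sum of g(f_i) centrally, on a single machine that already sees every node's value. It is not the distributed algorithm of this paper; it only illustrates how the choice of g yields the statistics named above. The function names are hypothetical, and the sketch assumes g(0) = 0, so that values that never occur can be skipped.

```python
import math
from collections import Counter


def summarize(values, g):
    """Centralized reference for sum_i g(f_i); assumes g(0) = 0."""
    freqs = Counter(values)  # f_i for every value i that occurs at least once
    return sum(g(f) for f in freqs.values())


def distinct_elements(values):
    # F_0: g(f) = 1 for every f > 0
    return summarize(values, lambda f: 1)


def frequency_moment(values, p):
    # F_p: g(f) = f^p
    return summarize(values, lambda f: f ** p)


def empirical_entropy(values):
    # g(f) = (f / m) * log2(m / f), where m is the total number of values held
    m = len(values)
    return summarize(values, lambda f: (f / m) * math.log2(m / f))


def top_k(values, k):
    # The k most frequent values together with their frequencies
    return Counter(values).most_common(k)
```

For instance, with the six node values `[3, 1, 3, 2, 3, 1]` the frequencies are 3, 2 and 1, so `distinct_elements` gives 3 and `frequency_moment(values, 2)` gives 9 + 4 + 1 = 14.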