k-Means for Streaming and Distributed Big Sparse Data

Abstract

We provide the first streaming algorithm for computing a provable approximation to the k-means of sparse Big Data. Here, sparse Big Data is a set of n vectors in R^d, where each vector has O(1) nonzero entries and d ≥ n; examples include the adjacency matrix of a graph and web-link, social-network, document-term, or image-feature matrices. Our streaming algorithm stores at most log n · k^{O(1)} input points in memory. If the stream is distributed among M machines, the running time reduces by a factor of M, while communicating a total of M · k^{O(1)} (sparse) input points between the machines. Our main technical result is a deterministic algorithm for computing a sparse (k,ε)-coreset, which is a weighted subset of k^{O(1)} input points that approximates the sum of squared distances from the n input points to every set of k centers, up to a (1 ± ε) factor, for any given constant ε > 0. This is the first such coreset of size independent of both d and n. Existing algorithms use coresets of size at least polynomial in d, or project the input points onto a subspace, which diminishes their sparsity and thus requires memory and communication Ω(d) = Ω(n) even for k = 2. Experimental results on real public datasets show that our algorithm boosts the performance of existing k-means heuristics even in the off-line setting. Open code is provided for reproducibility.
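The coreset guarantee described in the abstract can be made concrete with a small sketch. This is an illustration of the definition only, not the paper's deterministic construction: the helper names `kmeans_cost` and `relative_error` are hypothetical, and the toy "coreset" is built by collapsing duplicate points into weighted points, which happens to be exact (ε = 0) for any choice of centers.

```python
import numpy as np

def kmeans_cost(points, weights, centers):
    """Weighted sum of squared distances from each point to its nearest center."""
    # Pairwise squared distances, shape (num_points, num_centers).
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return float(np.sum(weights * sq_dists.min(axis=1)))

def relative_error(points, coreset_points, coreset_weights, centers):
    """Relative gap between the full cost and the coreset cost for one set of centers."""
    full = kmeans_cost(points, np.ones(len(points)), centers)
    approx = kmeans_cost(coreset_points, coreset_weights, centers)
    return abs(full - approx) / full

# Toy input (an assumption for illustration): 6 points in R^4,
# each with a single nonzero entry, with repeats.
points = np.array([
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 3],
], dtype=float)

# Stand-in coreset: collapse duplicates into weighted points.
# This is NOT the paper's algorithm; it is only a trivially exact weighted subset.
coreset_points, counts = np.unique(points, axis=0, return_counts=True)
coreset_weights = counts.astype(float)

rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 4))  # an arbitrary query with k = 2 centers
err = relative_error(points, coreset_points, coreset_weights, centers)
print(f"coreset size: {len(coreset_points)}, relative error: {err:.2e}")
```

A (k,ε)-coreset in the abstract's sense requires `relative_error` to be at most ε for every set of k centers, not just the single random query checked here.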
