Sets Clustering
Abstract
The input to the sets-$k$-means problem is an integer $k \geq 1$ and a set $\mathcal{P} = \{P_1, \cdots, P_n\}$ of sets in $\mathbb{R}^d$. The goal is to compute a set $C$ of $k$ centers (points) in $\mathbb{R}^d$ that minimizes the sum $\sum_{P \in \mathcal{P}} \min_{p \in P, c \in C} \| p - c \|^2$ of squared distances to these sets. An $\varepsilon$-core-set for this problem is a weighted subset of $\mathcal{P}$ that approximates this sum up to a $1 \pm \varepsilon$ factor, for every set $C$ of $k$ centers in $\mathbb{R}^d$. We prove that such a core-set of $O(\log^2 n)$ sets always exists, and can be computed in $O(n \log n)$ time, for every input $\mathcal{P}$ and every fixed $d, k \geq 1$ and $\varepsilon \in (0,1)$. The result generalizes easily to any metric space, distances to the power of $z > 0$, and M-estimators that handle outliers. Applying an inefficient but optimal algorithm to this core-set yields the first PTAS ($(1+\varepsilon)$-approximation) for the sets-$k$-means problem that takes time near linear in $n$. This is the first result even for sets-mean on the plane ($k=1$, $d=2$). Open source code and experimental results for document classification and facility locations are also provided.
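To make the objective concrete, the following is a minimal sketch of how the sets-$k$-means cost of a candidate set of centers could be evaluated. It is not the authors' released code: the function names `sq_dist` and `sets_kmeans_cost`, the optional `weights` argument (included so the same routine can score a weighted core-set), and the toy input are all illustrative assumptions.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def sets_kmeans_cost(
    sets: Sequence[Sequence[Point]],
    centers: Sequence[Point],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Sum, over the input sets, of the smallest squared distance
    between any point in the set and any center.

    `weights` is a hypothetical per-set weight (all ones by default),
    so the same function can evaluate a weighted subset of the input.
    """
    if weights is None:
        weights = [1.0] * len(sets)
    total = 0.0
    for w, point_set in zip(weights, sets):
        # Each set is charged only for its closest (point, center) pair.
        total += w * min(sq_dist(p, c) for p in point_set for c in centers)
    return total


# Toy input: n = 3 sets in the plane (d = 2) and k = 2 centers.
example_sets = [
    [(0, 0), (10, 10)],  # contains a center exactly: contributes 0
    [(1, 0)],            # closest center is (0, 0): contributes 1
    [(5, 5), (9, 9)],    # (9, 9) is closest to (10, 10): contributes 2
]
example_centers = [(0, 0), (10, 10)]
example_cost = sets_kmeans_cost(example_sets, example_centers)
```

Under this definition, an $\varepsilon$-core-set is a weighted subset for which `sets_kmeans_cost` on the subset (with its weights) is within a $1 \pm \varepsilon$ factor of the cost on the full input, for every choice of `centers`.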