A simple D^2-sampling based PTAS for k-means and other Clustering Problems
Abstract
Given a set of points P ⊂ R^d, the k-means clustering problem is to find a set of k centers C = {c_1, ..., c_k}, c_i ∈ R^d, such that the objective function Σ_{x ∈ P} d(x, C)^2, where d(x, C) denotes the distance between x and the closest center in C, is minimized. This is one of the most prominent objective functions studied in clustering. D^2-sampling [ArthurV07] is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points P ⊆ R^d, the first point is chosen uniformly at random from P. Subsequently, a point from P is chosen as the next sample with probability proportional to the square of its distance to the nearest previously sampled point. D^2-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii [ArthurV07] show that k points chosen as centers from P using D^2-sampling give an O(log k) approximation in expectation. Ailon et al. [AJMonteleoni09] and Aggarwal et al. [AggarwalDK09] extended the results of [ArthurV07] to show that O(k) points chosen as centers using D^2-sampling give an O(1) approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of D^2-sampling by giving a simple randomized (1 + ε)-approximation algorithm that uses D^2-sampling at its core.
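The sampling procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration of D^2-sampling only, not the paper's (1 + ε)-approximation algorithm. The function names (`sq_dist`, `d2_sample`, `kmeans_cost`) and the uniform fallback for the degenerate case where every point coincides with a chosen center are assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(points, k, rng=None):
    """Pick k centers from `points` by D^2-sampling (hypothetical helper).

    The first center is uniform over `points`; each later center is drawn
    with probability proportional to its squared distance to the nearest
    center chosen so far.
    """
    rng = rng or random.Random()
    if not points or k <= 0:
        return []
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: all points already coincide with a center,
            # so the weights are undefined; fall back to a uniform draw.
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Only the newly added center can shrink a point's nearest distance.
        d2 = [min(old, sq_dist(p, nxt)) for old, p in zip(d2, points)]
    return centers


def kmeans_cost(points, centers):
    """k-means objective: sum over x in P of d(x, C)^2."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

For example, `d2_sample([(0, 0), (0, 0), (5, 5)], 2)` always returns one center at each of the two distinct locations: whichever location is drawn first, every point at that location then has weight zero, so the second draw must come from the other one.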