A tight lower bound instance for k-means++ in constant dimension

Abstract

The k-means++ seeding algorithm is one of the most popular algorithms for choosing the initial k centers for the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: Pick the first center uniformly at random from the given points. For i > 1, pick a point to be the i-th center with probability proportional to the squared Euclidean distance from this point to the closest of the (i-1) previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of Ω(log k) in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say 1/poly(k)). Brunsch and Röglin gave a dataset where the k-means++ seeding algorithm achieves an O(log k) approximation ratio with probability that is exponentially small in k. However, this and all other known lower-bound examples are high dimensional, so an open problem was to understand the behavior of the algorithm on low-dimensional datasets. In this work, we give a simple two-dimensional dataset on which the seeding algorithm achieves an O(log k) approximation ratio with probability exponentially small in k. This solves open problems posed by Mahajan et al. and by Brunsch and Röglin.
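
The sampling procedure described in the abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the names `sq_dist`, `kmeanspp_seed`, and `kmeans_cost` are hypothetical, points are assumed to be tuples of numbers, and the fallback to a uniform pick when every point already coincides with a chosen center is an assumption added only to keep the sketch well defined.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers from `points` by D^2 sampling (k-means++ seeding)."""
    rng = rng or random.Random()
    # First center: uniformly at random from the given points.
    centers = [rng.choice(points)]
    # d2[j] is the squared distance from points[j] to its closest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point sits on a chosen center, pick uniformly.
            nxt = rng.choice(points)
        else:
            # i-th center: probability proportional to squared distance
            # to the closest previously chosen center.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Update closest-center distances with the newly added center.
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

For example, `kmeanspp_seed([(0, 0), (0, 1), (10, 0), (10, 1)], 2)` returns two of the four input points; the approximation ratio discussed in the abstract compares `kmeans_cost` for such sampled centers against the optimal k-means cost.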
