A bad 2-dimensional instance for k-means++
Abstract
The k-means++ seeding algorithm is one of the most popular algorithms for choosing the initial k centers for the k-means heuristic. It is a simple sampling procedure that can be described as follows: "Pick the first center uniformly at random from among the given points. For i > 1, pick a point to be the i-th center with probability proportional to the square of the Euclidean distance from this point to the nearest of the (i-1) previously chosen centers." The k-means++ seeding algorithm is not only simple and fast but also gives an O(log k) approximation in expectation, as shown by Arthur and Vassilvitskii [av07]. There are datasets [av07, adk09] on which this seeding algorithm gives an approximation factor of Ω(log k) in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably large probability (say 1/poly(k)). Brunsch and Röglin [br11] gave a dataset on which the k-means++ seeding algorithm achieves an approximation ratio of (2/3 - ε) · log k only with probability that is exponentially small in k. However, this and all other known lower-bound examples [av07, adk09] are high dimensional. So, an open problem is to understand the behavior of the algorithm on low-dimensional datasets. In this work, we give a simple two-dimensional dataset on which the seeding algorithm achieves an approximation ratio c (for some universal constant c) only with probability exponentially small in k. This is the first step towards solving open problems posed by Mahajan et al. [mnv12] and by Brunsch and Röglin [br11].
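
The sampling procedure quoted above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the paper: the names `kmeanspp_seed` and `sq_dist` are hypothetical, points are assumed to be tuples of numbers, and the fallback to uniform sampling when every point already coincides with a chosen center is an assumption made here so that the function always returns k centers.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Choose k initial centers from `points` by D^2 sampling (illustrative)."""
    rng = rng or random.Random()
    # First center: uniformly at random from the given points.
    centers = [rng.choice(points)]
    # d2[j] holds the squared distance from points[j] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: every point coincides with a center, so sample uniformly.
            c = rng.choice(points)
        else:
            # Sample a point with probability proportional to d2.
            c = rng.choices(points, weights=d2, k=1)[0]
        centers.append(c)
        # Update nearest-center distances to account for the new center.
        d2 = [min(old, sq_dist(p, c)) for old, p in zip(d2, points)]
    return centers
```

Because an already chosen point has weight zero, it is never sampled again while some point remains at positive distance. For example, `kmeanspp_seed([(0, 0), (0, 1), (10, 0), (10, 1)], 2)` returns two distinct points, most often one from each well-separated pair.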