Faster Algorithms for the Constrained k-means Problem
Abstract
The classical center-based clustering problems such as k-means/median/center assume that the optimal clusters satisfy the locality property that points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not follow such a locality property. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters $O_1, \ldots, O_k$ are an arbitrary partition of the dataset and the goal is to output $k$ centers $c_1, \ldots, c_k$ such that the objective function $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$ is minimized. It is not difficult to argue that any algorithm that, without knowing the optimal clusters, outputs a single set of $k$ centers will not perform well with respect to this objective function. However, this does not rule out the existence of algorithms that output a list of such sets of $k$ centers such that at least one set in the list performs well. Given an error parameter $\varepsilon > 0$, let $\ell$ denote the size of the smallest list of $k$-center sets such that at least one of them gives a $(1+\varepsilon)$-approximation w.r.t. the objective function above. In this paper, we show an upper bound on $\ell$ by giving a randomized algorithm that outputs a list of $2^{\tilde{O}(k/\varepsilon)}$ sets of $k$ centers. We also give a closely matching lower bound of $2^{\tilde{\Omega}(k/\sqrt{\varepsilon})}$. Moreover, our algorithm runs in time $O\left(n d \cdot 2^{\tilde{O}(k/\varepsilon)}\right)$. This is a significant improvement over the previous result of Ding and Xu, who gave an algorithm with running time $O\left(n d \cdot (\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)}\right)$ that outputs a list of size $O\left((\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)}\right)$.
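To make the objective function above concrete, here is a minimal Python sketch that evaluates it for a fixed partition. It is only an illustration and not the algorithm from the paper: the helper names (`sq_dist`, `constrained_cost`, `centroid`) and the toy data are assumptions introduced here. The sketch relies on the standard fact that, once a cluster is fixed, its centroid minimizes the sum of squared distances to the cluster's points.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance ||x - c||^2."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def constrained_cost(clusters: List[List[Point]], centers: List[Point]) -> float:
    """Sum over clusters O_i and points x in O_i of ||x - c_i||^2.

    The partition is taken as given (hypothetical input); each point is
    charged to the center of its own cluster, not to its nearest center.
    """
    return sum(
        sq_dist(x, c)
        for cluster, c in zip(clusters, centers)
        for x in cluster
    )


def centroid(cluster: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty cluster."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]


# Toy partition (an assumption for illustration) that violates locality:
# each cluster pairs one point near 0 with one point near 10.
clusters = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(1.0, 0.0), (9.0, 0.0)],
]

centroid_centers = [centroid(c) for c in clusters]
other_centers = [[0.0, 0.0], [10.0, 0.0]]

centroid_cost = constrained_cost(clusters, centroid_centers)
other_cost = constrained_cost(clusters, other_centers)

print(centroid_cost, other_cost)
```

In this toy instance both clusters share the centroid (5, 0), so centers that look natural under the locality assumption do worse than the centroids of the given partition. An algorithm that does not know the partition cannot compute these centroids directly, which is why the abstract considers outputting a list of candidate center sets.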