On the k-Means/Median Cost Function
Abstract
In this work, we study the $k$-means cost function. Given a dataset $X \subseteq \mathbb{R}^d$ and an integer $k$, the goal of the Euclidean $k$-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}^d$ such that $\Phi(C, X) \equiv \sum_{x \in X} \min_{c \in C} \|x - c\|^2$ is minimized. Let $\Delta(X, k) \equiv \min_{C \subseteq \mathbb{R}^d,\, |C| = k} \Phi(C, X)$ denote the cost of the optimal $k$-means solution. For any dataset $X$, $\Delta(X, k)$ decreases as $k$ increases. In this work, we try to understand this behaviour more precisely. For any dataset $X \subseteq \mathbb{R}^d$, integer $k \geq 1$, and precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X, k)$. We show upper and lower bounds on this quantity. Our techniques generalize to the metric $k$-median problem in arbitrary metric spaces, and we give bounds in terms of the doubling dimension of the metric. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O(L(X, k, \varepsilon/c))$ using $D^2$-sampling such that $\Phi(S, X) \leq \varepsilon \cdot \Delta(X, k)$ for some fixed constant $c$. We also discuss some applications of our bounds.
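To make the two central objects concrete, the following is a minimal Python sketch of the k-means cost of a center set and of D²-sampling, read here in its usual k-means++ seeding sense: the first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen. The function names (`sq_dist`, `kmeans_cost`, `d2_sample`), the toy dataset, and the seed are illustrative assumptions, not taken from the paper, and the sketch makes no attempt to reproduce the paper's analysis or its bound on the size of the sampled set.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(centers, points):
    """k-means cost of a center set: each point pays its squared distance
    to the nearest center, and the payments are summed."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def d2_sample(points, num_centers, rng):
    """Pick up to num_centers centers from points by D^2-sampling
    (k-means++-style seeding; an illustrative reading, see lead-in)."""
    centers = [rng.choice(points)]  # first center: uniform at random
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < num_centers:
        if sum(d2) == 0:
            break  # every point already coincides with a chosen center
        new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        d2 = [min(old, sq_dist(x, new_center)) for old, x in zip(d2, points)]
    return centers


# Hypothetical toy dataset: three loose groups in the plane.
X = [(0.0, 0.0), (0.5, 0.2), (10.0, 0.0), (10.3, 0.4), (0.0, 10.0), (0.2, 9.7)]
S = d2_sample(X, 3, random.Random(0))
costs = [kmeans_cost(S[:m], X) for m in range(1, len(S) + 1)]
print(costs)  # cost of the first m sampled centers, for m = 1, 2, 3
```

Adding a center can never increase the cost, so the printed sequence is non-increasing; how quickly the optimal cost drops as more centers are allowed is the quantity the abstract studies.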