Fast k-means Seeding Under The Manifold Hypothesis

Abstract

We study beyond-worst-case analysis for the k-means problem, where the goal is to model typical instances of k-means arising in practice. Existing theoretical approaches provide guarantees under assumptions on the optimal solutions to k-means, which makes them difficult to validate in practice. We propose the manifold hypothesis, where data obtained in ambient dimension $D$ concentrates around a low-dimensional manifold of intrinsic dimension $d$, as a reasonable assumption to model real-world clustering instances. We identify key geometric properties of datasets which have theoretically predictable scaling laws depending on the quantization exponent $\varepsilon = 2/d$, using techniques from optimum quantization theory. We show how to exploit these regularities to design a fast seeding method called Qkmeans, which provides $O(\rho^{-2} \log k)$-approximate solutions to the k-means problem in time $O(nD) + O(\varepsilon^{1+\rho} \rho^{-1} k^{1+\gamma})$, where the exponent $\gamma = \varepsilon + \rho$ for an input parameter $\rho < 1$. This allows us to obtain new runtime-quality tradeoffs. We perform a large-scale empirical study across various domains to validate our theoretical predictions and algorithm performance, bridging theory and practice for beyond-worst-case data clustering.
