Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering

Abstract

We study k-means clustering in a semi-supervised setting. Given an oracle that returns whether two given points belong to the same cluster in a fixed optimal clustering, we investigate the following question: how many oracle queries are sufficient to efficiently recover a clustering that, with probability at least (1 - δ), simultaneously has a cost of at most (1 + ε) times the optimal cost and an accuracy of at least (1 - ε)? We show how to achieve such a clustering on n points with O((k^2 log n) · m(Q, ε^4, δ/(k log n))) oracle queries, when the k clusters can be learned with error ε' and failure probability δ' using m(Q, ε', δ') labeled samples in the supervised setting, where Q is the set of candidate cluster centers. We show that m(Q, ε', δ') is small both for k-means instances in Euclidean space and for those in finite metric spaces. We further show that, for Euclidean k-means instances, we can avoid the dependency on n in the query complexity at the expense of an increased dependency on k: specifically, we give a slightly more involved algorithm that uses O(k^4/(ε^2 δ) + (k^9/ε^4) log(1/δ) + k · m(R^r, ε^4/k, δ)) oracle queries. We also show that the number of queries needed for (1 - ε)-accuracy in Euclidean k-means must depend linearly on the dimension of the underlying Euclidean space, and that for k-means in finite metric spaces it must be at least logarithmic in the number of candidate centers. This shows that our query complexities capture the right dependencies on the respective parameters.
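To make the setting concrete, the following is a minimal sketch of a same-cluster oracle that counts its queries, together with the k-means cost of a labeling. It is not the algorithm from the paper: `SameClusterOracle`, `kmeans_cost`, and `naive_recover` are hypothetical names introduced here, and `naive_recover` is only the obvious baseline that compares each point with one representative per cluster found so far, which uses at most n · k queries. The bounds in the abstract concern doing better than this, in the Euclidean case with no dependence on n at all.

```python
class SameClusterOracle:
    """Hypothetical oracle: answers same-cluster queries against a fixed target clustering."""

    def __init__(self, target_labels):
        self._labels = list(target_labels)
        self.queries = 0  # number of oracle calls made so far

    def same_cluster(self, i, j):
        self.queries += 1
        return self._labels[i] == self._labels[j]


def kmeans_cost(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        cost += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return cost


def naive_recover(n, oracle):
    """Baseline (not the paper's method): at most n * k queries, exact recovery."""
    representatives = []  # one point index per cluster discovered so far
    labels = [-1] * n
    for i in range(n):
        for c, rep in enumerate(representatives):
            if oracle.same_cluster(i, rep):
                labels[i] = c
                break
        else:
            # No representative matched, so point i starts a new cluster.
            labels[i] = len(representatives)
            representatives.append(i)
    return labels


# Toy instance (made-up data): two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (0.0, 0.5)]
oracle = SameClusterOracle([0, 0, 1, 1, 0])
labels = naive_recover(len(points), oracle)
print(labels, oracle.queries, kmeans_cost(points, labels))
```

On this toy instance the baseline recovers the target clustering exactly. The counter on the oracle is the quantity the abstract bounds: a query-efficient algorithm would aim to keep `oracle.queries` small while accepting a (1 + ε) factor in cost and (1 - ε) accuracy.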
