Approximate Clustering with Same-Cluster Queries

Abstract

Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?" There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with additional margin assumption, can be solved efficiently if one is allowed O(k2 k + k n) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting showing that a few of such same-cluster queries enables one to get a polynomial-time (1 + )-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1 + c) for a fixed constant 0 < c < 1. The number of same-cluster queries used is poly(k/) which is independent of the size n of the dataset. Our algorithm is based on the D2-sampling technique. We also give a conditional lower bound on the number of same-cluster queries showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make (kpoly k ) same-cluster queries. Our algorithm can be extended for the case when the oracle is faulty. Another result we show with respect to the k-means++ seeding algorithm is that a small modification to the k-means++ seeding algorithm within the SSAC framework converts it to a constant factor approximation algorithm instead of the well known O(k)-approximation algorithm.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…