Clustering Small Samples with Quality Guarantees: Adaptivity with One2all pps

Abstract

Clustering of data points is a fundamental tool in data analysis. We consider points $X$ in a relaxed metric space, where the triangle inequality holds within a constant factor. The cost of clustering $X$ by $Q$ is $V(Q)=\sum_{x\in X} d_{xQ}$. Two basic tasks, parametrized by $k \geq 1$, are cost estimation, which returns an (approximate) $V(Q)$ for queries $Q$ such that $|Q|=k$, and clustering, which returns an (approximate) minimizer of $V(Q)$ of size $|Q|=k$. With very large data sets $X$, we seek efficient constructions of small samples that act as surrogates to the full data for performing these tasks. Existing constructions that provide quality guarantees are either worst-case, and thus unable to benefit from the structure of real data sets, or make explicit strong assumptions on that structure. We show here how to avoid both of these pitfalls using adaptive designs. At the core of our design is the one2all construction of multi-objective probability-proportional-to-size (pps) samples: given a set $M$ of centroids and $\alpha \geq 1$, one2all efficiently assigns probabilities to points so that the clustering cost of each $Q$ with cost $V(Q) \geq V(M)/\alpha$ can be estimated well from a sample of size $O(\alpha |M| \epsilon^{-2})$. For cost queries, we can obtain a worst-case sample size of $O(k\epsilon^{-2})$ by applying one2all to a bicriteria approximation $M$, but we adaptively balance $|M|$ and $\alpha$ to further reduce the sample size. For clustering, we design an adaptive wrapper that applies a base clustering algorithm to a sample $S$. Our wrapper uses the smallest sample that provides statistical guarantees that the quality of the clustering on the sample carries over to the full data set. We demonstrate experimentally the large gains of using our adaptive methods instead of worst-case ones.
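
To make the idea of estimating V(Q) from a weighted sample concrete, the following is a minimal Python sketch of generic pps-style sampling with inverse-probability weights. It is not the one2all construction itself: the abstract does not give the one2all probabilities, so the per-point weight used here (the point's share of V(M) plus one over the size of its cluster under M) is an illustrative assumption. The sketch also assumes Euclidean points and that d_xQ means the distance from x to its nearest point in Q. All function names and the `target_size` parameter are hypothetical.

```python
import math
import random
from collections import Counter


def dist_to_set(x, Q):
    """Distance from x to its nearest point in Q (assumed meaning of d_xQ)."""
    return min(math.dist(x, q) for q in Q)


def clustering_cost(X, Q):
    """Exact cost V(Q): sum over all x in X of the distance from x to Q."""
    return sum(dist_to_set(x, Q) for x in X)


def sampling_probabilities(X, M, target_size):
    """Assumed pps-style probabilities anchored at centroids M.

    Illustrative weight: cost share d_xM / V(M) plus 1 / |cluster of x|.
    This is NOT claimed to be the one2all formula from the paper.
    """
    # Index of the nearest centroid in M for every point.
    nearest = [min(range(len(M)), key=lambda j: math.dist(x, M[j])) for x in X]
    cluster_size = Counter(nearest)
    v_m = clustering_cost(X, M)
    weights = []
    for x, j in zip(X, nearest):
        cost_share = dist_to_set(x, M) / v_m if v_m > 0 else 0.0
        weights.append(cost_share + 1.0 / cluster_size[j])
    total = sum(weights)
    # Scale so the expected sample size is at most target_size; cap at 1.
    return [min(1.0, target_size * w / total) for w in weights]


def poisson_sample(X, probs, rng):
    """Include each point independently; store its inverse-probability weight."""
    return [(x, 1.0 / p) for x, p in zip(X, probs) if rng.random() < p]


def estimate_cost(sample, Q):
    """Inverse-probability (Horvitz-Thompson style) estimate of V(Q)."""
    return sum(w * dist_to_set(x, Q) for x, w in sample)


if __name__ == "__main__":
    rng = random.Random(0)
    X = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0.0, 10.0) for _ in range(500)]
    M = [(0.0, 0.0), (10.0, 10.0)]   # stand-in centroids
    Q = [(1.0, 1.0), (9.0, 9.0)]     # a query of size k = 2
    probs = sampling_probabilities(X, M, target_size=100)
    sample = poisson_sample(X, probs, rng)
    print(len(sample), estimate_cost(sample, Q), clustering_cost(X, Q))
```

Because each sampled point carries weight 1/p, the estimate is unbiased for any Q. The choice of probabilities only affects the variance, which is where a construction such as one2all would provide its guarantee.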
