Approximation Schemes for Clustering with Outliers
Abstract
Clustering problems are well studied in a variety of fields such as data science, operations research, and computer science. Such problems include variants of centre location problems, k-median, and k-means, to name a few. In some cases, not all data points need to be clustered; some may be discarded for various reasons. We study clustering problems with outliers. More specifically, we look at Uncapacitated Facility Location (UFL), k-Median, and k-Means. In UFL with outliers, we have to open some centres, discard up to z points of the input set X, and assign every other point to the nearest open centre, minimizing the total assignment cost plus centre opening costs. In k-Median and k-Means, we have to open up to k centres but there are no opening costs. In k-Means, the cost of assigning j to i is δ²(j,i). We present several results. Our main focus is on cases where δ is a doubling metric or is the shortest path metric of a graph from a minor-closed family of graphs. For uniform-cost UFL with outliers on such metrics, we show that a simple multiswap local search heuristic yields a PTAS. With a bit more work, we extend this to bicriteria approximations for the k-Median and k-Means problems in the same metrics where, for any constant ε > 0, we can find a solution using (1+ε)k centres whose cost is at most (1+ε) times the optimum and which uses at most z outliers. We also show that natural local search heuristics that do not violate the number of clusters and outliers for k-Median (or k-Means) have an unbounded gap even in Euclidean metrics. Furthermore, we show how our analysis can be extended to general metrics for k-Means with outliers to obtain a (25+ε,1+ε) bicriteria approximation.
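The objective described in the abstract can be illustrated with a minimal sketch. This is not the paper's local search algorithm; it only evaluates the cost of a given set of open centres. The function names `euclidean` and `cost_with_outliers`, the parameters `power` and `opening_cost`, and the choice of a Euclidean δ are assumptions made for illustration. It relies on one fact: once the centres are fixed, discarding the z points with the largest assignment cost is the cheapest choice of outliers.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Illustrative choice of the metric delta (an assumption, not from the paper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cost_with_outliers(
    points: Sequence[Point],
    centres: Sequence[Point],
    z: int,
    power: int = 1,             # 1 for UFL / k-Median, 2 for k-Means
    opening_cost: float = 0.0,  # uniform opening cost; 0 for k-Median / k-Means
    dist: Callable[[Point, Point], float] = euclidean,
) -> float:
    """Cost of a fixed set of open centres when up to z points may be discarded."""
    # Each point is assigned to its nearest open centre.
    assignment_costs = sorted(
        min(dist(p, c) for c in centres) ** power for p in points
    )
    # Drop the z most expensive points; they become the outliers.
    kept = assignment_costs[: max(len(assignment_costs) - z, 0)]
    return sum(kept) + opening_cost * len(centres)
```

For example, with points (0,0), (1,0), (10,0), a single centre at (0,0), and z = 1, the far point is discarded and the k-Median cost is 1; with z = 0 and `power=2` the k-Means cost is 0 + 1 + 100 = 101.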