Consistency and Inconsistency in K-Means Clustering

Abstract

A celebrated result of Pollard proves asymptotic consistency for k-means clustering when the population distribution has finite variance. In this work, we point out that the population-level k-means clustering problem is, in fact, well-posed under the weaker assumption of a finite expectation, and we investigate whether some form of asymptotic consistency holds in this setting. As we illustrate in a variety of negative results, the complete story is quite subtle; for example, the empirical k-means cluster centers may fail to converge even if there exists a unique set of population k-means cluster centers. A detailed analysis of our negative results reveals that inconsistency arises because of an extreme form of cluster imbalance, whereby the presence of outlying samples leads to some empirical k-means clusters possessing very few points. We then give a collection of positive results which show that some forms of asymptotic consistency, under only the assumption of finite expectation, may be recovered by imposing some a priori degree of balance among the empirical k-means clusters.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…