Number of Clusters in a Dataset: A Regularized K-means Approach
Abstract
Finding the number of meaningful clusters in an unlabeled dataset is important in many applications. The regularized k-means algorithm is one approach frequently used to find the correct number of distinct clusters in datasets. The most common formulation of the regularization function is the additive linear term λk, where k is the number of clusters and λ is a positive coefficient. Currently, there are no principled guidelines for setting the value of the critical hyperparameter λ. In this paper, we derive rigorous bounds for λ assuming clusters are ideal. Ideal clusters (defined as d-dimensional spheres with identical radii) are close proxies for k-means clusters (d-dimensional spherically symmetric distributions with identical standard deviations). Experiments show that the k-means algorithm with an additive regularizer often yields multiple solutions. Thus, we also analyze the k-means algorithm with a multiplicative regularizer. The consensus among k-means solutions with additive and multiplicative regularizations reduces the ambiguity of multiple solutions in certain cases. We also present selected experiments that demonstrate the performance of the regularized k-means algorithms as clusters deviate from the ideal assumption.
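To make the additive formulation concrete, the following is a minimal sketch, not the paper's implementation. It assumes the regularized cost is the usual k-means sum of squared distances plus λk, and it selects the k that minimizes that cost. The function names (`kmeans_sse`, `select_k_additive`), the farthest-first initialization, the toy data, and the values of λ are all illustrative assumptions. The multiplicative regularizer is not sketched because the abstract does not state its form.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic initialization (an assumption of this sketch): start from
    the first point, then repeatedly add the point farthest from its nearest
    already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_sse(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squared distances."""
    centers = farthest_first_init(points, k)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def select_k_additive(points, lam, k_max):
    """Return the k in 1..k_max minimizing SSE(k) + lam * k, plus all costs."""
    costs = {k: kmeans_sse(points, k) + lam * k for k in range(1, k_max + 1)}
    return min(costs, key=costs.get), costs


# Toy data (hypothetical): three well-separated, equally sized 2-D clusters.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5))
    for cx, cy in blob_centers
    for _ in range(30)
]

best_k, costs = select_k_additive(points, lam=100.0, k_max=6)
print("selected k:", best_k)
```

On this toy data a moderate λ selects three clusters, while a very large λ (for example `lam=1e6`) makes the penalty dominate and collapses the choice to k = 1. That sensitivity is why bounds on λ matter.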