Evaluation of the number of clusters in a data set using p-values from Multiple Tests of Hypotheses
Abstract
This paper proposes a novel, nonparametric, interpoint distance-based measure to investigate whether there exist any groups in a set of given data, and if so then, how many groups are prevailing in total. It is a cluster accuracy index useful for arbitrary-dimensional data set, in association with any clustering algorithm having the number of groups specified as a priori. We perform univariate, nonparametric, multiple statistical tests of hypotheses, where as many dependent tests as the sample size are carried out using the interpoint distances. They possess p-values to be combined to reach a decision, which is taken in a step-wise process for a possible number of clusters. It reduces the unnecessary computations compared with the other accuracy measures from the literature. Data study establishes the proposed index's efficiency and superiority.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.