Selecting the number of components in PCA via random signflips

Abstract

Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots and parallel analysis) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of "empirical null" matrices generated by flipping the sign of each entry independently with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, we show that FlipPA compares favorably to state-of-the-art methods in numerical simulations and in an application to astronomy data.
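The comparison described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `flippa_sketch`, the parameters `n_trials` and `quantile`, and the sequential stopping rule (keep component k only while the k-th data singular value exceeds the chosen quantile of the k-th singular values of the signflipped matrices) are all assumptions borrowed from the usual parallel analysis convention. The abstract itself specifies only the signflipping step and the comparison of singular values.

```python
import numpy as np


def flippa_sketch(X, n_trials=20, quantile=0.95, seed=0):
    """Estimate the number of components by comparing singular values of X
    with those of randomly signflipped copies of X.

    Hypothetical sketch: `n_trials`, `quantile`, and the stopping rule are
    illustrative assumptions, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    # Singular values of the observed data, in descending order.
    data_sv = np.linalg.svd(X, compute_uv=False)

    # Singular values of the "empirical null" matrices: each entry of X has
    # its sign flipped independently with probability one-half.
    null_sv = np.empty((n_trials, data_sv.size))
    for t in range(n_trials):
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_sv[t] = np.linalg.svd(signs * X, compute_uv=False)

    # Per-index threshold: a quantile of the k-th null singular value.
    thresholds = np.quantile(null_sv, quantile, axis=0)

    # Assumed stopping rule: count leading singular values above threshold.
    k = 0
    for above in data_sv > thresholds:
        if not above:
            break
        k += 1
    return k
```

Calling `flippa_sketch(X)` on an n-by-p data matrix returns an integer between 0 and min(n, p). Signflipping leaves every entry's magnitude unchanged, so each entry's noise variance stays in place while low-rank structure is scrambled. That is the intuition the abstract gives for why this null remains meaningful under heterogeneous noise, where permuting entries would mix variances across positions.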
