Dimension Reduction for Large-Scale Federated Data: Statistical Rate and Asymptotic Inference
Abstract
In light of the rapidly growing large-scale data in federated ecosystems, traditional principal component analysis (PCA) is often not applicable because of privacy protection considerations and a large computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under distributed settings. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension d and the sample size n are ultra-large, by simultaneously performing parallel computing along d and distributed computing along n. Specifically, we utilize L parallel copies of p-dimensional fast sketches to divide the computing burden along d and aggregate the results distributively along the split samples. We present a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI accelerates the computation while enjoying the same non-asymptotic error rate as traditional PCA when Lp ≥ d. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as Lp increases. We perform extensive simulations to empirically validate our theoretical findings, and apply FADI to the 1000 Genomes data to study the population structure.
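The sketch-and-aggregate idea described in the abstract can be illustrated with a small NumPy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which fast sketch is used or how the L results are combined, so Gaussian test matrices and aggregation by stacking the L orthonormal bases are assumptions here, the function name `sketched_distributed_pca` is hypothetical, and the sites are simulated with an in-process loop rather than real distributed workers.

```python
import numpy as np

def sketched_distributed_pca(splits, K, p, L, seed=0):
    """Estimate the top-K principal subspace from row-split data.

    splits : list of (n_s, d) arrays, one per simulated site (assumed centered)
    K      : number of principal components to recover (requires p >= K)
    p      : sketch dimension
    L      : number of parallel sketches
    """
    rng = np.random.default_rng(seed)
    d = splits[0].shape[1]
    n = sum(X.shape[0] for X in splits)

    bases = []
    for _ in range(L):
        # Assumption: a Gaussian test matrix plays the role of the "fast sketch".
        omega = rng.standard_normal((d, p))
        # Each site computes X_s^T (X_s @ omega), a d x p matrix, so the
        # d x d covariance is never formed; the sum aggregates across sites.
        Y = sum(X.T @ (X @ omega) for X in splits) / n
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        bases.append(U[:, :K])

    # Assumption: combine the L bases by stacking them and taking the
    # top-K left singular vectors of the d x (K*L) matrix.
    stacked = np.hstack(bases) / np.sqrt(L)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :K]
```

With d = 30, p = 10, and L = 10, the condition Lp ≥ d from the abstract holds; each sketch only requires an SVD of a 30 x 10 matrix, and the final step works on a 30 x 30 stacked matrix instead of a full covariance eigendecomposition.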