High-Dimensional Robust Mean Estimation with Untrusted Batches
Abstract
We study high-dimensional mean estimation in a collaborative setting where data is contributed by N users in batches of size n. In this environment, a learner seeks to recover the mean μ of a true distribution P from a collection of sources that are both statistically heterogeneous and potentially malicious. We formalize this challenge through a double corruption landscape: an -fraction of users are entirely adversarial, while the remaining ``good'' users provide data from distributions that are related to P, but deviate by a proximity parameter α. Unlike existing work on the untrusted batch model, which typically measures this deviation via total variation distance in discrete settings, we address the continuous, high-dimensional regime under two natural variants for deviation: (1) good batches are drawn from distributions with a mean-shift of α, or (2) an α-fraction of samples within each good batch are adversarially corrupted. In particular, the second model presents significant new challenges: in high dimensions, unlike discrete settings, even a small fraction of sample-level corruption can shift empirical means and covariances arbitrarily. We provide two Sum-of-Squares (SoS) based algorithms to navigate this tiered corruption. Our algorithms achieve the minimax-optimal error rate O(/n + d/nN + α), demonstrating that while heterogeneity α represents an inherent statistical difficulty, the influence of adversarial users is suppressed by a factor of 1/n due to the internal averaging afforded by the batch structure.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.