Sample complexity of population recovery

Abstract

The problem of population recovery refers to estimating a distribution based on incomplete or corrupted samples. Consider a random poll of sample size n conducted on a population of individuals, where each pollee is asked to answer d binary questions. We consider one of two polling impediments: (a) in lossy population recovery, a pollee may skip each question with probability ε; (b) in noisy population recovery, a pollee may lie on each question with probability ε. Given n lossy or noisy samples, the goal is to estimate the probabilities of all $2^d$ binary vectors simultaneously within accuracy δ with high probability. This paper settles the sample complexity of population recovery. For the lossy model, the optimal sample complexity is $\tilde\Theta(\delta^{-2\max\{\frac{\epsilon}{1-\epsilon},1\}})$, improving the state of the art by Moitra and Saks in several ways: a lower bound is established, the upper bound is improved, and the result depends at most on the logarithm of the dimension. Surprisingly, the sample complexity undergoes a phase transition from parametric to nonparametric rate when ε exceeds 1/2. For noisy population recovery, the sharp sample complexity turns out to be more sensitive to dimension and scales as $\exp(\Theta(d^{1/3} \log^{2/3}(1/\delta)))$ except for the trivial cases of ε = 0, 1/2, or 1. For both models, our estimators simply compute the empirical mean of a certain function, which is found by pre-solving a linear program (LP). Curiously, the dual LP can be understood as Le Cam's method for lower-bounding the minimax risk, thus establishing the statistical optimality of the proposed estimators. The value of the LP is determined by complex-analytic methods.
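To make the two observation models concrete, the following is a minimal Python sketch of how lossy and noisy samples could be simulated. It illustrates only the sampling channels described in the abstract, not the paper's LP-based estimator. The function names (`lossy_sample`, `noisy_sample`, `draw_samples`) and the use of `None` to mark a skipped question are hypothetical choices made for this illustration.

```python
import random


def lossy_sample(x, eps, rng):
    """Lossy channel: each answer is skipped independently with probability eps.

    A skipped answer is represented by None (an assumption of this sketch).
    """
    return [None if rng.random() < eps else b for b in x]


def noisy_sample(x, eps, rng):
    """Noisy channel: each binary answer is flipped independently with probability eps."""
    return [b ^ 1 if rng.random() < eps else b for b in x]


def draw_samples(population, weights, n, eps, channel, seed=0):
    """Draw n individuals from the population distribution and pass each through a channel.

    population: list of binary vectors of length d (the support of the distribution)
    weights:    probability of each vector in `population`
    channel:    lossy_sample or noisy_sample
    """
    rng = random.Random(seed)
    truths = rng.choices(population, weights=weights, k=n)
    return [channel(x, eps, rng) for x in truths]


if __name__ == "__main__":
    # Toy population with d = 3 questions and two types of respondents.
    population = [(0, 0, 1), (1, 1, 0)]
    weights = [0.7, 0.3]
    for sample in draw_samples(population, weights, n=5, eps=0.3, channel=lossy_sample):
        print(sample)
```

The estimation task in the abstract is to recover `weights` (to within δ) from the output of `draw_samples` alone, without seeing the uncorrupted vectors.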
