Learning Discrete Distributions from Untrusted Batches
Abstract
We consider the problem of learning a discrete distribution in the presence of an ε fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, p, and each data source provides a batch of k samples, with the guarantee that at least a (1-ε) fraction of the sources draw their samples from a distribution with total variation distance at most η from p. We make no assumptions on the data provided by the remaining ε fraction of sources; this data can even be chosen as an adversarial function of the (1-ε) fraction of "good" batches. We provide two algorithms. The first has runtime exponential in the support size, n, but polynomial in k, 1/ε, and 1/η; it takes O((n+k)/ε^2) batches and recovers p to error O(η+ε/√k). This recovery accuracy is information-theoretically optimal, up to constant factors, even given an infinite number of data sources. Our second algorithm applies to the η = 0 setting and also achieves an O(ε/√k) recovery guarantee, though it runs in poly((nk)^k) time. This second algorithm, which approximates a certain tensor via a rank-1 tensor minimizing ℓ_1 distance, is surprising in light of the hardness of many low-rank tensor approximation problems, and may be of independent interest.
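To make the setting concrete, the following is a minimal simulation sketch of the batch model, not either of the paper's algorithms. It assumes η = 0, a uniform p on four symbols, and a deliberately simple hypothetical adversary that fills every corrupted batch with copies of one symbol; the function names and parameter values are illustrative choices, not taken from the paper. It shows that naively pooling all samples leaves an error on the order of ε that does not shrink as k grows, which is the gap the stated recovery guarantee closes.

```python
import random

def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def sample_batches(p, k, num_batches, eps, rng):
    """Return num_batches batches of k samples each.

    Roughly an eps fraction of the batches is corrupted. The adversary here
    is a simple hypothetical one (k copies of symbol 0 per bad batch); the
    model in the abstract allows arbitrary, adaptive corruptions.
    """
    n = len(p)
    num_bad = int(eps * num_batches)
    batches = []
    for i in range(num_batches):
        if i < num_bad:
            batches.append([0] * k)  # corrupted batch
        else:
            batches.append(rng.choices(range(n), weights=p, k=k))  # good batch
    return batches

def pooled_estimate(batches, n):
    """Naive baseline: empirical distribution of all samples, ignoring batches."""
    counts = [0] * n
    total = 0
    for batch in batches:
        for x in batch:
            counts[x] += 1
            total += 1
    return [c / total for c in counts]

rng = random.Random(0)  # fixed seed for reproducibility
p = [0.25, 0.25, 0.25, 0.25]
batches = sample_batches(p, k=20, num_batches=2000, eps=0.1, rng=rng)
p_hat = pooled_estimate(batches, len(p))
err = total_variation(p, p_hat)
print(f"TV error of pooled estimate: {err:.3f}")
```

Against this adversary the pooled estimate is close to the mixture 0.9·p + 0.1·(point mass on symbol 0), so its error stays near ε times the distance between p and that point mass no matter how large k is. A robust estimator has to exploit the batch structure, since a batch of k samples from a corrupted source is easier to tell apart from a good batch than a single sample would be.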