A Data Prism: Semi-Verified Learning in the Small-Alpha Regime

Abstract

We consider a model of unreliable or crowdsourced data where there is an underlying set of n binary variables, each evaluator contributes a (possibly unreliable or adversarial) estimate of the values of some subset of r of the variables, and the learner is given the true value of a constant number of variables. We show that, provided an α-fraction of the evaluators are "good" (either correct, or with independent noise rate p < 1/2), then the true values of a (1-ε) fraction of the n underlying variables can be deduced as long as α > 1/(2-2p)r. This setting can be viewed as an instance of the semi-verified learning model introduced in [CSV17], which explores the tradeoff between the number of items evaluated by each worker and the fraction of good evaluators. Our results require the number of evaluators to be extremely large, >nr, although our algorithm runs in linear time, Or,ε(n), given query access to the large dataset of evaluations. This setting and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment. This parameter regime where the fraction of reliable data is small, is relevant to a number of practical settings. For example, settings where one has a large dataset of customer preferences, with each customer specifying preferences for a small (constant) number of items, and the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged together with the preferences of the demographic of interest for a constant number of randomly selected items, to recover an accurate estimate of the entire set of preferences. In this sense, our results can be viewed as a "data prism" allowing one to extract the behavior of specific cohorts from a large, mixed, dataset.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…