Debiased machine learning for combining probability and non-probability survey data
Abstract
We consider the problem of estimating the finite population mean $\bar{Y}$ of an outcome variable $Y$ using data from a nonprobability sample and auxiliary information from a probability sample. Existing double robust (DR) estimators of this mean $\bar{Y}$ require the estimation of two nuisance functions: the conditional probability of selection into the nonprobability sample given covariates $X$ that are observed in both samples, and the conditional expectation of $Y$ given $X$. These nuisance functions can be estimated using parametric models, but the resulting estimator of $\bar{Y}$ will typically be biased if both parametric models are misspecified. It would therefore be advantageous to be able to use more flexible data-adaptive / machine-learning estimators of the nuisance functions. Here, we develop a general framework for the valid use of DR estimators of $\bar{Y}$ when the design of the probability sample uses sampling without replacement at the first stage and data-adaptive / machine-learning estimators are used for the nuisance functions. We prove that several DR estimators of $\bar{Y}$, including targeted maximum likelihood estimators, are asymptotically normally distributed when the estimators of the nuisance functions converge faster than the $n^{1/4}$ rate and cross-fitting is used. We present a simulation study that demonstrates good performance of these DR estimators compared to the corresponding DR estimators that rely on at least one correctly specified parametric model.
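To make the role of the two nuisance functions concrete, the following is a minimal sketch of one common form of DR estimator from the nonprobability-sampling literature. The abstract does not state the exact form used in the paper, so this is an illustration under assumptions, not the authors' method. The function name `dr_mean` and all argument names are hypothetical. The nuisance values (fitted selection probabilities and fitted outcome means) are assumed to have been computed elsewhere, ideally by cross-fitting, and the population size is assumed known.

```python
from typing import Sequence


def dr_mean(
    y_a: Sequence[float],   # outcomes observed in the nonprobability sample
    m_a: Sequence[float],   # fitted E[Y | X] for nonprobability-sample units
    pi_a: Sequence[float],  # fitted selection probabilities for those units
    m_b: Sequence[float],   # fitted E[Y | X] for probability-sample units
    d_b: Sequence[float],   # design weights of the probability sample
    n_pop: float,           # population size (assumed known in this sketch)
) -> float:
    """Illustrative doubly robust estimate of a finite population mean.

    Assumed form (not taken from the abstract):
        (1/N) * [ sum_A (y - m) / pi  +  sum_B d * m ]
    """
    if not (len(y_a) == len(m_a) == len(pi_a)):
        raise ValueError("nonprobability-sample inputs must have equal length")
    if len(m_b) != len(d_b):
        raise ValueError("probability-sample inputs must have equal length")
    if any(p <= 0.0 or p > 1.0 for p in pi_a):
        raise ValueError("selection probabilities must lie in (0, 1]")

    # Inverse-probability-weighted residuals from the nonprobability sample.
    correction = sum((y - m) / p for y, m, p in zip(y_a, m_a, pi_a))
    # Design-weighted outcome-model predictions from the probability sample.
    prediction = sum(d * m for d, m in zip(d_b, m_b))
    return (correction + prediction) / n_pop


# Toy inputs (made up): the outcome model fits the nonprobability sample
# exactly, so the residual term vanishes and only the prediction term remains.
estimate = dr_mean(
    y_a=[1.0, 2.0], m_a=[1.0, 2.0], pi_a=[0.5, 0.5],
    m_b=[1.0, 3.0], d_b=[2.0, 2.0], n_pop=4.0,
)
print(estimate)  # 2.0
```

In this form, the residual term corrects the outcome-model predictions using the selection probabilities, which is why the estimator can remain consistent when only one of the two nuisance functions is estimated well. With cross-fitting, each unit's `m` and `pi` values would come from models fitted on folds that exclude that unit.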