Active Subsampling for Measurement-Constrained M-Estimation of Individualized Thresholds with High-Dimensional Data
Abstract
Measurement-constrained problems frequently arise in modern applications such as electronic health record studies. In such problems, despite the availability of large datasets, collecting labeled data can be highly costly or time-consuming, allowing only a small portion of the data to be labeled within a given budget. This raises a critical question: which data points are most beneficial to label given the budget constraint? We study this question in the context of estimating an optimal individualized threshold under a measurement-constrained M-estimation framework. In particular, our goal is to estimate a high-dimensional parameter θ in a linear threshold θᵀZ for a continuous variable X such that the discrepancy between whether X exceeds the threshold θᵀZ and a binary outcome Y is minimized. In the measurement-constrained setting, we propose a novel K-step active subsampling algorithm to estimate θ, which iteratively samples the most informative observations in the dataset and solves a regularized M-estimator. Our theoretical analysis reveals a sharp phase transition phenomenon with respect to β, the smoothness of the conditional density of X given Y and Z. Please see the paper for the full abstract.
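To make the estimation target concrete, the sketch below computes one possible empirical version of the discrepancy between the event "X exceeds the linear threshold" and the binary outcome Y. It is only an illustration under stated assumptions: the unweighted 0-1 disagreement rate, the coding of Y in {0, 1}, the function name `threshold_discrepancy`, and the simulated data are all hypothetical and are not taken from the paper, whose actual loss, regularization, and sampling scheme may differ.

```python
import numpy as np


def threshold_discrepancy(theta, X, Z, Y):
    """Fraction of observations where the indicator 1{X > theta^T Z}
    disagrees with the binary outcome Y (assumed coded as 0/1)."""
    exceeds = X > Z @ theta          # does X exceed the individualized threshold?
    return float(np.mean(exceeds != (Y == 1)))


# Hypothetical simulated data, for illustration only.
rng = np.random.default_rng(0)
n, d = 1000, 5
Z = rng.normal(size=(n, d))                          # covariates defining the threshold
theta_true = np.array([1.0, -0.5, 0.0, 0.0, 0.0])    # sparse parameter (assumed)
X = rng.normal(size=n)                               # continuous variable
Y = (X > Z @ theta_true).astype(int)                 # noiseless labels, by construction

# With noiseless labels, the generating parameter has zero disagreement.
print(threshold_discrepancy(theta_true, X, Z, Y))
# A different parameter (here all zeros) is scored on the same scale.
print(threshold_discrepancy(np.zeros(d), X, Z, Y))
```

In a measurement-constrained setting, Y would be observed only for the labeled subsample, so a criterion of this kind could be evaluated only on those rows.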