Choosing good subsamples for regression modelling

Abstract

A common problem in health research is that we have a large database with many variables measured on a large number of individuals. We are interested in measuring additional variables on a subsample; these measurements may be newly available, or expensive, or simply not considered when the data were first collected. The intended use for the new measurements is to fit a regression model generalisable to the whole cohort (and to its source population). This is a two-phase sampling problem; it differs from some other two-phase sampling problems in the richness of the phase I data and in the goal of regression modelling. In particular, an important special case is measurement-error models, where a variable strongly correlated with the phase II measurements is available at phase I. We will explain how influence functions have been useful as a unifying concept for extending classical results to this setting, and describe the steps from designing for a simple weighted estimator at known parameter values through adaptive multiwave designs and the use of prior information. We will conclude with some comments on the information gap between design-based and model-based estimators in this setting.
