Data-Adaptive Integration With Summary Data
Abstract
Combining an internal individual-level study with readily available external summary statistics promises major efficiency gains at minimal additional cost, yet heterogeneity between sources can bias estimates for the internal target population. We develop a generalized entropy-balancing integration strategy that calibrates external moments to the internal covariate distribution, explicitly permitting a biased external sample. Our estimator of the internal-population mean is doubly robust: it remains consistent when either the outcome-regression model or the entropy-balancing model is correctly specified. When multiple balancing specifications are plausible, we introduce a data-adaptive selection rule. We also provide easy-to-compute, fully estimable diagnostics, based on the Mahalanobis distance and the Pearson chi-square divergence, that pinpoint when integration is guaranteed to strictly outperform the internal sample mean. The approach is implemented in the R package daisy. Simulations and an application to nationwide public-access defibrillation records in Japan demonstrate meaningful precision gains while maintaining bias control under distributional shift.
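To make the balancing idea concrete, the following is a minimal sketch of textbook entropy balancing, not the generalized estimator of the paper and not the API of the daisy package. It finds weights on individual-level rows that are as close to uniform as possible (in Kullback-Leibler divergence) while matching a given vector of covariate means. The function name `entropy_balance`, the simulated data, and the target means are all hypothetical, chosen only for illustration; which sample is reweighted toward which in the paper is not specified by the abstract.

```python
import numpy as np

def _weights(Z, lam):
    # Exponential-tilting weights w_i proportional to exp(Z_i . lam), normalized to sum to 1.
    s = Z @ lam
    s = s - s.max()  # subtract the max for numerical stability
    w = np.exp(s)
    return w / w.sum()

def _dual(Z, lam):
    # Dual objective of entropy balancing: log-sum-exp of Z @ lam.
    s = Z @ lam
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def entropy_balance(X, target, n_iter=100, tol=1e-10):
    """Hypothetical helper: weights w (summing to 1) with w @ X == target."""
    X = np.asarray(X, dtype=float)
    Z = X - np.asarray(target, dtype=float)  # centered so the constraint is w @ Z == 0
    lam = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        w = _weights(Z, lam)
        grad = w @ Z  # gradient of the dual = weighted mean of Z
        if np.max(np.abs(grad)) < tol:
            break
        # Hessian of the dual = weighted covariance of Z
        H = (Z * w[:, None]).T @ Z - np.outer(grad, grad)
        step = np.linalg.solve(H, grad)
        # Backtracking line search keeps the Newton iteration stable.
        t = 1.0
        while t > 1e-8 and _dual(Z, lam - t * step) > _dual(Z, lam) - 1e-4 * t * (grad @ step):
            t *= 0.5
        lam = lam - t * step
    return _weights(Z, lam)

# Simulated covariates and an assumed vector of summary-level means (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
target = np.array([0.2, -0.1])

w = entropy_balance(X, target)
balanced_mean = w @ X  # should reproduce `target` up to numerical tolerance
```

The weights are strictly positive by construction, and a solution exists only when the target lies inside the convex hull of the rows of `X`; a target far from the observed covariates produces highly uneven weights, which is the kind of situation the divergence-based diagnostics mentioned in the abstract are meant to flag.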