Finite-sample bias-variance tradeoff with variables related to trial participation inserted into causal forest models for ensuring generalizability
Abstract
Estimating conditional average treatment effects (CATE) from randomized controlled trials (RCTs) and generalizing them to broader populations is essential for personalizing treatment rules but is complicated by selection bias due to trial participation and potentially high dimensional covariates. We evaluated finite sample bias variance tradeoff for Causal Forest based CATE estimation strategies to address the selection bias. Identification theory suggests unbiased CATE estimation is possible when covariates related to trial participation are included in CATE estimating models. However, simulation studies demonstrated that, under realistic RCT sample sizes, variance inflation from high dimensional covariates often outweighed modest bias reduction. In our data generating process that define individual treatment effect (ITE) in source population and selected trial samples, including more than 3 covariates related to participation in causal forest substantially degraded precision unless sample sizes were large. In contrast, inverse probability weighting (IPW) based methods consistently improved performance across scenarios. Application to a RCT of omega 3 fatty acids and coronary heart disease illustrated how IPW shifts CATE estimates toward source population effects and refines heterogeneity assessments. Our findings highlight that including trial-selection variables for CATE estimating models may inflate estimator variance and reduce ITE prediction performance in applications using medical RCTs. Addressing selection bias separately (e.g. through IPW) would be a reasonable strategy.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.