Confidence Intervals for Random Forest Permutation Importance with Missing Data
Abstract
Random Forests are renowned for their predictive accuracy, but valid inference, particularly about permutation-based feature importances, remains challenging. Existing methods, such as the confidence intervals (CIs) from Ishwaran et al. (2019), are promising but assume that all features are completely observed. Real-world data, however, often contain missing values. In this paper, we investigate how common imputation techniques affect the validity of Random Forest permutation-importance CIs when data are incomplete. Through an extensive simulation and real-world benchmark study, we compare state-of-the-art imputation methods across various missing-data mechanisms and missing rates. Our results show that single-imputation strategies lead to low CI coverage. As a remedy, we adapt Rubin's rule to aggregate feature-importance estimates and their variances over several imputed datasets, thereby accounting for imputation uncertainty. Our numerical results indicate that the adjusted CIs achieve coverage closer to the nominal level.
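The aggregation step mentioned in the abstract can be illustrated with the textbook form of Rubin's rules. The sketch below is a minimal illustration under stated assumptions, not the paper's adapted procedure: it assumes that each of m imputed datasets has already produced a permutation-importance estimate and a variance estimate for one feature, and it uses a normal quantile instead of Rubin's t-based reference distribution for simplicity. The function names `pool_rubin` and `pooled_ci` and the input numbers are hypothetical.

```python
from statistics import NormalDist


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with the standard Rubin's rules.

    estimates: one importance estimate per imputed dataset (assumed given)
    variances: the matching within-imputation variance estimates
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = sum(estimates) / m                        # pooled point estimate
    w_bar = sum(variances) / m                        # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total_var = w_bar + (1 + 1 / m) * b               # total variance
    return q_bar, total_var


def pooled_ci(estimates, variances, level=0.95):
    """Normal-approximation CI around the pooled estimate (a simplification)."""
    q_bar, total_var = pool_rubin(estimates, variances)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * total_var ** 0.5
    return q_bar - half_width, q_bar + half_width


if __name__ == "__main__":
    # Hypothetical importance estimates and variances from three imputed datasets.
    est = [0.10, 0.12, 0.14]
    var = [0.0004, 0.0004, 0.0004]
    print(pool_rubin(est, var))
    print(pooled_ci(est, var))
```

The between-imputation term is what a single-imputation analysis omits: with only one imputed dataset the variance reduces to the within-imputation part, which makes intervals too narrow and is consistent with the low coverage reported for single-imputation strategies.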