Robust Model Selection with Application in Single-Cell Multiomics Data
Abstract
Model selection is critical in modern statistics and machine learning. However, most existing methods do not apply to heavy-tailed data, which are common in real applications such as single-cell multiomics. In this paper, we propose a rank-sum based approach that outputs a confidence set containing the optimal model with guaranteed probability. Motivated by conformal inference, we develop a general method that is applicable without moment or tail assumptions on the data. We demonstrate the advantage of the proposed method through extensive simulations and a real application to the COVID-19 genomics dataset (Stephenson et al., 2021). To perform inference on rank-sum statistics, we derive a general Gaussian approximation theory for high-dimensional two-sample U-statistics, which may be of independent interest to the statistics and machine learning community.