Cross-trait prediction accuracy of high-dimensional ridge-type estimators in genome-wide association studies
Abstract
Marginal association summary statistics have attracted great attention in statistical genetics, mainly because the primary results of most genome-wide association studies (GWAS) are produced by marginal screening. In this paper, we study the prediction accuracy of the marginal estimator in dense (or sparsity-free) high-dimensional settings with (n,p,m) → ∞, m/n → γ ∈ (0,∞), and p/n → ω ∈ (0,∞). We consider a general correlation structure among the p features and allow an unknown subset of m of them to be signals. As the marginal estimator can be viewed as a ridge estimator with regularization parameter λ → ∞, we further investigate a class of ridge-type estimators in a unifying framework, including the popular best linear unbiased prediction (BLUP) in genetics. We find that the influence of λ on out-of-sample prediction accuracy heavily depends on ω. Though selecting an optimal λ can be important when p and n are comparable, it turns out that the out-of-sample R² of ridge-type estimators becomes near-optimal for any λ ∈ (0,∞) as ω increases. For example, when features are independent, the out-of-sample R² is always bounded by 1/ω from above and is largely invariant to λ given large ω (say, ω > 5). We also find that in-sample R² has completely different patterns and depends much more on λ than out-of-sample R². In practice, our analysis delivers useful messages for genome-wide polygenic risk prediction and the computation-accuracy trade-off in dense high-dimensional settings. We numerically illustrate our results in simulation studies and a real data example.
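The remark that the marginal estimator is a ridge estimator with λ → ∞ can be checked numerically. The sketch below is a toy illustration rather than the paper's code. It assumes independent standard Gaussian features, a fully dense signal (m = p), unit noise variance, and ω = p/n = 5; the function names, sample sizes, and the choice λ = 1e8 are all hypothetical. Because (XᵀX + λI)⁻¹Xᵀy behaves like Xᵀy/λ for large λ, the ridge estimate should point in the same direction as the marginal estimate Xᵀy/n, and the scale-invariant out-of-sample R² should agree.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Ridge estimate (X'X + lam * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def marginal_estimator(X, y):
    """Marginal (one-feature-at-a-time) estimate, X'y / n."""
    return X.T @ y / X.shape[0]

def out_of_sample_r2(beta_hat, X_new, y_new):
    """Squared correlation between predictions and held-out outcomes."""
    return np.corrcoef(X_new @ beta_hat, y_new)[0, 1] ** 2

def cosine(a, b):
    """Cosine similarity between two coefficient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed toy setup: independent features, dense signal, omega = p / n = 5.
rng = np.random.default_rng(0)
n, p, n_new = 200, 1000, 2000
beta = rng.standard_normal(p) / np.sqrt(p)   # every feature carries signal
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)
X_new = rng.standard_normal((n_new, p))
y_new = X_new @ beta + rng.standard_normal(n_new)

b_marg = marginal_estimator(X, y)
b_ridge = ridge_estimator(X, y, lam=1e8)     # very large lambda

direction_match = cosine(b_ridge, b_marg)
r2_marg = out_of_sample_r2(b_marg, X_new, y_new)
r2_ridge = out_of_sample_r2(b_ridge, X_new, y_new)

print("cosine(ridge, marginal):", direction_match)
print("out-of-sample R^2, marginal:", r2_marg)
print("out-of-sample R^2, ridge (lambda = 1e8):", r2_ridge)
```

Rerunning with smaller values of `lam` gives a rough feel for how little the out-of-sample R² moves with λ at this ω, though a single simulated data set is only suggestive.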