On identification in ill-posed linear regression

Abstract

A novel framework is introduced to formalize identifiability in well-specified but ill-posed linear regression models. The framework is distribution-free and accommodates highly correlated features that may or may not relate to the response, reflecting typical real-data structures. First, the identifiable parameter is defined as the least-squares solution obtained by regressing the response on the largest subset of relevant features whose condition number does not exceed a specified threshold, and the relative risk incurred by using this predictor instead of the optimal one is quantified. Second, simple, verifiable conditions are provided under which a broad class of linear dimensionality reduction algorithms can estimate identifiable parameters; algorithms satisfying these conditions are termed statistically interpretable. Third, sharp high-probability error bounds are derived for these algorithms, with rates explicitly reflecting the degree of ill-posedness. With heavy-tailed features and sufficiently low effective rank, these algorithms achieve convergence rates that improve upon both the minimax least-squares rate and lower bounds for sparse estimation under sub-Gaussian features. Results are illustrated via simulations and a real-data application, in which effective rank grows logarithmically with dimension. The framework may extend to algorithms modeling nonlinear response-feature dependence.
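To make two quantities in the abstract concrete, the following is a minimal sketch, not the paper's own procedure. It assumes that effective rank means the trace of the covariance divided by its largest eigenvalue, and it uses principal component regression as one example of a linear dimensionality reduction algorithm, keeping the leading principal directions whose eigenvalue ratio stays within a condition-number threshold. The function names `effective_rank` and `truncated_least_squares` and the parameter `kappa_max` are hypothetical, and the sketch does not model the abstract's notion of relevant features.

```python
import numpy as np


def effective_rank(cov):
    """Effective rank, assumed here to be trace(cov) / largest eigenvalue."""
    eigvals = np.linalg.eigvalsh(cov)
    return eigvals.sum() / eigvals.max()


def truncated_least_squares(X, y, kappa_max):
    """Least squares restricted to well-conditioned principal directions.

    Assumption (not from the paper): a direction k is kept when
    lambda_1 / lambda_k <= kappa_max, where lambda_1 is the largest
    eigenvalue of the empirical second-moment matrix X^T X / n.
    Returns the coefficient vector in the original feature space and
    the number of directions kept.
    """
    n = X.shape[0]
    cov = X.T @ X / n
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals * kappa_max >= eigvals[0]    # condition-number threshold
    V = eigvecs[:, keep]
    Z = X @ V                                   # reduced features
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return V @ coef, int(keep.sum())


# Illustrative usage on synthetic data with two nearly identical columns.
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(200), rng.standard_normal(200)
x3 = x1 + 1e-6 * rng.standard_normal(200)       # near-duplicate of x1
X_demo = np.column_stack([x1, x2, x3])
y_demo = x1 + x2
beta_demo, k_demo = truncated_least_squares(X_demo, y_demo, kappa_max=1e3)
print(k_demo, effective_rank(X_demo.T @ X_demo / 200))
```

In this toy design the near-duplicate column makes the full least-squares problem ill-posed, while the truncated solution remains stable because the poorly conditioned direction is discarded before fitting.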
