Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

Abstract

We study the computational and sample complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure, that is, $f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M f_m(\langle x, v_m\rangle)$, where $f_1,f_2,\dots,f_M:\mathbb{R}\to\mathbb{R}$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features $\{v_m\}_{m=1}^M$, and the number of additive tasks $M$ grows with the dimensionality as $M\asymp d^\gamma$ for $\gamma\ge 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural networks, and large-scale pretraining, where the model simultaneously acquires a large number of "skills" that are often localized in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with polynomial statistical and computational complexity that depends on the number of tasks $M$ and the information exponent of $f_m$, despite the unknown link functions and $M$ growing with the dimensionality. We complement this learnability guarantee with a computational hardness result by establishing statistical query (SQ) lower bounds for both correlational SQ and full SQ algorithms.
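To make the additive target concrete, the following is a minimal NumPy sketch of sampling from such a model. It is an illustration only, not the paper's construction. It rests on several assumptions: the prefactor is taken to be 1/sqrt(M), the index features are exactly orthonormal (the abstract only requires near-orthogonality), every link function is the same probabilists' Hermite polynomial, and the inputs are standard Gaussian. The name `make_target` and all parameter values are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval


def make_target(d, M, degree=3, seed=0):
    """Return f_star(X) = M^{-1/2} * sum_m f_m(<x, v_m>) and the directions V."""
    rng = np.random.default_rng(seed)
    # Assumption: exactly orthonormal directions, obtained from a reduced QR
    # factorization of a Gaussian matrix (requires d >= M).
    Q, _ = np.linalg.qr(rng.standard_normal((d, M)))
    V = Q.T  # shape (M, d); row m is the index feature v_m

    # Assumption: every link f_m is the Hermite polynomial He_degree.
    coeffs = np.zeros(degree + 1)
    coeffs[degree] = 1.0

    def f_star(X):
        Z = X @ V.T  # shape (n, M); column m holds the projections <x, v_m>
        return hermeval(Z, coeffs).sum(axis=1) / np.sqrt(M)

    return f_star, V


d, M = 64, 8
f_star, V = make_target(d, M)
# Assumption: isotropic Gaussian inputs.
X = np.random.default_rng(1).standard_normal((1000, d))
y = f_star(X)  # noiseless labels, shape (1000,)
```

With `degree=3` each link is He_3(z) = z^3 - 3z, whose lowest nonzero Hermite coefficient is at degree 3. Under the usual definition, that degree is the link's information exponent, the quantity the abstract says the complexity depends on.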
