On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective
Abstract
We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) using gradient descent (GD) method. Inspired by a recent line of work, we study the evolutions of network prediction errors across GD iterations, which can be neatly described in a matrix form. When the network is sufficiently over-parameterized, these matrices individually approximate an integral operator which is determined by the feature vector distribution only. Consequently, GD method can be viewed as approximately applying the powers of this integral operator on the underlying/target function f* that generates the responses/labels. We show that if f* admits a low-rank approximation with respect to the eigenspaces of this integral operator, then the empirical risk decreases to this low-rank approximation error at a linear rate which is determined by f* and only, i.e., the rate is independent of the sample size n. Furthermore, if f* has zero low-rank approximation error, then, as long as the width of the neural network is (n n), the empirical risk decreases to (1/n). To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization. We provide an application of our general results to the setting where is the uniform distribution on the spheres and f* is a polynomial. Throughout this paper, we consider the scenario where the input dimension d is fixed.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.