Geometric compression of invariant manifolds in neural nets
Abstract
We study how neural networks compress uninformative input space in models where data lie in $d$ dimensions, but whose labels vary only within a linear manifold of dimension $d_\parallel < d$. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) and trained with gradient descent, the first layer of weights evolves to become nearly insensitive to the $d_\perp = d - d_\parallel$ uninformative directions. These are effectively compressed by a factor $\lambda \sim \sqrt{p}$, where $p$ is the size of the training set. We quantify the benefit of such a compression on the test error $\epsilon$. For large initialization of the weights (the lazy training regime), no compression occurs, and for regular boundaries separating labels we find that $\epsilon \sim p^{-\beta}$, with $\beta_\text{Lazy} = d / (3d-2)$. Compression improves the learning curves so that $\beta_\text{Feature} = (2d-1)/(3d-2)$ if $d_\parallel = 1$ and $\beta_\text{Feature} = (d + d_\perp/2)/(3d-2)$ if $d_\parallel > 1$. We test these predictions for a stripe model where boundaries are parallel interfaces ($d_\parallel = 1$) as well as for a cylindrical boundary ($d_\parallel = 2$). Next we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) in time, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the frozen NTK at the end of training outperforms the initial NTK. We confirm these predictions both for a one-hidden-layer fully connected (FC) network trained on the stripe model and for a 16-layer CNN trained on MNIST, for which we also find $\beta_\text{Feature} > \beta_\text{Lazy}$.
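To make the predicted exponents concrete, the following is a minimal sketch that evaluates the learning-curve exponents quoted in the abstract for a given input dimension. The function names `beta_lazy` and `beta_feature` are hypothetical and chosen for illustration; only the formulas come from the abstract, and the argument check assumes $1 \le d_\parallel < d$.

```python
from fractions import Fraction


def beta_lazy(d: int) -> Fraction:
    """Lazy-regime exponent quoted in the abstract: d / (3d - 2)."""
    return Fraction(d, 3 * d - 2)


def beta_feature(d: int, d_par: int) -> Fraction:
    """Feature-regime exponent quoted in the abstract.

    d     : input dimension
    d_par : dimension of the informative linear manifold (assumed 1 <= d_par < d)
    """
    if not 1 <= d_par < d:
        raise ValueError("expected 1 <= d_par < d")
    if d_par == 1:
        # Stripe-like case: (2d - 1) / (3d - 2)
        return Fraction(2 * d - 1, 3 * d - 2)
    # General case: (d + d_perp / 2) / (3d - 2), with d_perp = d - d_par
    d_perp = d - d_par
    return (d + Fraction(d_perp, 2)) / (3 * d - 2)


if __name__ == "__main__":
    for d, d_par in [(5, 1), (3, 2)]:
        print(d, d_par, beta_lazy(d), beta_feature(d, d_par))
```

Exact fractions are used rather than floats so that the exponents can be compared without rounding; in both example settings the feature-regime exponent is larger than the lazy one, in line with the abstract's claim that compression improves the learning curves.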