Why and When Deep is Better than Shallow: Implementation-Agnostic State-Transition Model of Deep Learning
Abstract
Why and when does depth improve generalization? We study this question in an implementation-agnostic state-transition model, where a depth-k predictor is a readout class H composed with the word ball B(k,F) generated by hidden state transitions. Generalization bounds separate implementation error, approximation error, and statistical complexity, and upper bound the depth-dependent variance term by a Dudley entropy integral over B(k,F), with a conditional lower-bound diagnostic under readout separation. We identify geometric and semigroup mechanisms that keep this entropy contribution saturated or polynomial, and contrast them with separation mechanisms that recover the classical exponential-growth obstruction. Coupling these variance upper bounds with approximation rates gives typical depth trade-off patterns, clarifying that depth is statistically favorable when approximation improves rapidly while the transition semigroup remains geometrically tame.
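As a rough illustration of the saturation-versus-growth contrast described in the abstract, the sketch below counts the distinct elements of a word ball for transitions on a small finite state set. It is not the paper's construction. The tuple representation of transitions, the reading of B(k,F) as all compositions of at most k generators from F, and the names `compose` and `word_ball_sizes` are assumptions made for this example. Because a finite set of N maps can be covered by at most N balls at any scale, the logarithm of this count gives only a crude upper bound on the metric entropy of the ball itself.

```python
def compose(f, g):
    """Return the map x -> f(g(x)); maps are tuples over states 0..n-1."""
    return tuple(f[g[x]] for x in range(len(g)))


def word_ball_sizes(generators, max_depth):
    """Return [|B(0,F)|, ..., |B(max_depth,F)|], counting distinct maps.

    B(k,F) is taken here (an assumption) to be the set of all compositions
    of at most k generators, with the identity as the empty word.
    """
    n = len(generators[0])
    identity = tuple(range(n))
    ball = {identity}        # every distinct map found so far
    frontier = {identity}    # maps first reached at the current depth
    sizes = [1]
    for _ in range(max_depth):
        # Only maps that were new at depth k can produce new maps at k + 1.
        new = {compose(f, w) for f in generators for w in frontier} - ball
        ball |= new
        frontier = new
        sizes.append(len(ball))
    return sizes


# A single cyclic shift on 4 states: the ball saturates at 4 elements.
tame = word_ball_sizes([(1, 2, 3, 0)], 6)

# A cycle, a swap, and a collapse on 5 states: the ball keeps growing for
# several depths, bounded above by the 5**5 self-maps of a 5-element set.
rich = word_ball_sizes([(1, 2, 3, 4, 0), (1, 0, 2, 3, 4), (0, 0, 2, 3, 4)], 6)

print("tame:", tame)
print("rich:", rich)
```

In this toy setting the first generator set stops producing new maps after a few compositions, so added depth costs nothing further in this count, while the second keeps enlarging the ball with depth. Any finite example eventually saturates, so it can only hint at the exponential-growth regime the abstract refers to.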