Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis
Abstract
We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality without massive overparameterization. Our in-depth nonasymptotic analysis (i) provides improved bounds on the network size m in terms of the sequence length T, sample size n, and ambient dimension d, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the convergence and network width bounds, which are characterized by a cutoff point that depends on the Lipschitz continuity of the activation function. Remarkably, this analysis reveals that an appropriately initialized recurrent neural network trained with n samples can achieve optimality with a network size m that scales only logarithmically with n. This contrasts sharply with prior works, which require a high-order polynomial dependency of m on n to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and on establishing local smoothness properties of the hidden state with respect to the learnable parameters.
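To make the setting concrete, the following is a minimal sketch of a recurrent network with a diagonal hidden-to-hidden matrix trained by plain gradient descent on a squared loss. It is an illustration under stated assumptions, not the paper's construction: the tanh activation, the fixed random-sign output layer scaled by 1/sqrt(m), the initialization ranges, the single training sequence, and all function names (`init_params`, `forward`, `loss_and_grads`, `gradient_descent`) are hypothetical choices made here.

```python
import numpy as np

def init_params(d, m, seed=0):
    """Random initialization (an assumed scheme, not the paper's)."""
    rng = np.random.default_rng(seed)
    return {
        "U": rng.normal(size=(m, d)) / np.sqrt(d),  # input-to-hidden weights
        "w": rng.uniform(-0.5, 0.5, size=m),        # diagonal of hidden-to-hidden matrix
        "c": rng.choice([-1.0, 1.0], size=m),       # output weights, kept fixed (assumption)
    }

def forward(params, X):
    """X has shape (T, d). Returns outputs (T,) and hidden states (T + 1, m)."""
    T = X.shape[0]
    m = params["w"].shape[0]
    H = np.zeros((T + 1, m))
    for t in range(T):
        # A diagonal recurrence is an elementwise product: w * h_t.
        H[t + 1] = np.tanh(params["w"] * H[t] + params["U"] @ X[t])
    y = H[1:] @ params["c"] / np.sqrt(m)
    return y, H

def loss_and_grads(params, X, target):
    """Mean squared loss over time steps, with gradients via backpropagation through time."""
    y, H = forward(params, X)
    T = X.shape[0]
    m = params["w"].shape[0]
    err = y - target
    loss = 0.5 * np.mean(err ** 2)
    gU = np.zeros_like(params["U"])
    gw = np.zeros_like(params["w"])
    dH_next = np.zeros(m)  # gradient flowing back from later time steps
    for t in reversed(range(T)):
        dH = dH_next + (err[t] / T) * params["c"] / np.sqrt(m)
        dpre = dH * (1.0 - H[t + 1] ** 2)  # derivative of tanh
        gw += dpre * H[t]
        gU += np.outer(dpre, X[t])
        dH_next = dpre * params["w"]
    return loss, {"U": gU, "w": gw}

def gradient_descent(params, X, target, lr=0.05, steps=100):
    """Full-batch gradient descent; updates params in place and returns the loss history."""
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params, X, target)
        losses.append(loss)
        params["U"] -= lr * grads["U"]
        params["w"] -= lr * grads["w"]
    return losses
```

Calling `gradient_descent(init_params(d, m), X, target)` on a single sequence `X` of shape `(T, d)` returns the per-step training loss. Here m, T, and d play the roles of the network size, sequence length, and ambient dimension named in the abstract, and extending the loss to n sequences would amount to averaging `loss_and_grads` over samples.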