Convergence of backpropagation with momentum for network architectures with skip connections
Abstract
We study a class of deep neural networks with networks that form a directed acyclic graph (DAG). For backpropagation defined by gradient descent with adaptive momentum, we show weights converge for a large class of nonlinear activation functions. The proof generalizes the results of Wu et al. (2008) who showed convergence for a feed forward network with one hidden layer. For an example of the effectiveness of DAG architectures, we describe an example of compression through an autoencoder, and compare against sequential feed forward networks under several metrics.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.