Depth Dependence of μP Learning Rates in ReLU MLPs

Abstract

In this short note we consider random fully connected ReLU networks of width n and depth L equipped with a mean-field weight initialization. Our purpose is to study the dependence on n and L of the maximal update (μP) learning rate, the largest learning rate for which the mean squared change in pre-activations after one step of gradient descent remains uniformly bounded at large n and L. As in prior work on μP by Yang et al., we find that this maximal update learning rate is independent of n for all but the first and last layer weights. However, we find that it has a non-trivial dependence on L, scaling like L^{-3/2}.
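
The quantity in the definition above, the mean squared change in pre-activations after one gradient descent step, can be measured numerically. The following is a minimal NumPy sketch, not the paper's experimental setup. Everything specific in it is an assumption made for illustration: the variance 2/fan_in for hidden weights, the 1/n standard deviation standing in for a mean-field output layer, the squared loss on a single random input, the input dimension of 16, and the function names `init_weights`, `forward`, and `preactivation_change`. Only the hidden-to-hidden weights are updated, since those are the layers the abstract's depth-scaling claim concerns.

```python
import numpy as np

def init_weights(n_in, n, L, rng):
    """L hidden layers of width n plus a scalar output layer (assumed scalings)."""
    weights = [rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n, n_in))]
    for _ in range(L - 1):
        weights.append(rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n)))
    # Assumption: mean-field style output layer with std 1/n instead of 1/sqrt(n).
    weights.append(rng.normal(0.0, 1.0 / n, size=(1, n)))
    return weights

def forward(weights, x):
    """Return pre-activations of every layer and the activations fed into each layer."""
    pre, acts = [], [x]
    for i, W in enumerate(weights):
        z = W @ acts[-1]
        pre.append(z)
        is_output = i == len(weights) - 1
        acts.append(z if is_output else np.maximum(z, 0.0))
    return pre, acts

def preactivation_change(n=64, L=8, eta=0.1, seed=0):
    """Mean squared change of the last hidden layer's pre-activations
    after one gradient descent step on the hidden-to-hidden weights (requires L >= 2)."""
    rng = np.random.default_rng(seed)
    n_in = 16                      # hypothetical input dimension
    x = rng.normal(size=n_in)      # single random input
    y = 1.0                        # arbitrary regression target
    weights = init_weights(n_in, n, L, rng)
    pre, acts = forward(weights, x)

    # Backpropagate the loss 0.5 * (f(x) - y)^2 through the network.
    dz = pre[-1] - y
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads[i] = np.outer(dz, acts[i])
        if i > 0:
            dz = (weights[i].T @ dz) * (pre[i - 1] > 0)

    # One gradient descent step on weights 1..L-1 only (not first or last layer).
    new_weights = [W.copy() for W in weights]
    for i in range(1, L):
        new_weights[i] -= eta * grads[i]

    new_pre, _ = forward(new_weights, x)
    return float(np.mean((new_pre[L - 1] - pre[L - 1]) ** 2))
```

Calling `preactivation_change` over a grid of `L` and `eta` values (and averaging over several seeds) gives a rough empirical view of how large a learning rate can be at each depth before the change grows; a single-input, single-seed run like this one is too noisy to confirm a scaling exponent.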
