Why Do We Need Warm-up? A Theoretical Perspective

Abstract

Learning rate warm-up -- increasing the learning rate at the beginning of training -- has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the (L0, L1)-smoothness condition, which bounds local curvature as a linear function of the loss suboptimality and exhibits desirable closure properties. We show -- both theoretically and empirically -- that this condition is satisfied by common neural architectures and accurately captures the curvature of the optimization landscape early in training. Adapting the learning rate in response to this curvature condition naturally induces a warm-up-like schedule, and we show that this choice yields provably faster convergence guarantees than using a fixed learning rate. Experiments on language and vision models show that the resulting one-parameter warm-up schedule can match tuned linear warm-up and improve over no warm-up.
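
To make the link between the curvature condition and warm-up concrete, here is a minimal sketch on a one-dimensional toy problem. It is not the paper's algorithm or schedule. The assumptions are: the bound takes the form curvature <= K0 + K1 * (f(x) - f*), the step size is set to the reciprocal of that bound, and the names `K0`, `K1`, `F_STAR`, `curvature_bound`, and `run_adaptive_gd` are hypothetical. The toy loss f(x) = cosh(x) - 1 is chosen because its second derivative equals 1 + f(x), so the bound holds with equality for K0 = K1 = 1 and f* = 0.

```python
import math

# Hypothetical constants for the toy loss f(x) = cosh(x) - 1.
# Its curvature is f''(x) = cosh(x) = 1 + (f(x) - f*), with f* = 0,
# so a bound linear in the suboptimality holds with K0 = K1 = 1.
K0, K1, F_STAR = 1.0, 1.0, 0.0


def loss(x):
    return math.cosh(x) - 1.0


def grad(x):
    return math.sinh(x)


def curvature_bound(loss_value):
    # Assumed form of the condition: curvature <= K0 + K1 * (f(x) - f*).
    return K0 + K1 * (loss_value - F_STAR)


def run_adaptive_gd(x0, steps):
    """Gradient descent with step size 1 / (current curvature bound).

    Returns the final iterate plus the learning rate and loss at each step.
    """
    x = x0
    lrs, losses = [], []
    for _ in range(steps):
        f = loss(x)
        lr = 1.0 / curvature_bound(f)  # small while the loss is large
        lrs.append(lr)
        losses.append(f)
        x -= lr * grad(x)
    return x, lrs, losses


if __name__ == "__main__":
    _, lrs, losses = run_adaptive_gd(x0=5.0, steps=10)
    for step, (lr, f) in enumerate(zip(lrs, losses)):
        print(f"step {step}: lr={lr:.4f}  loss={f:.6f}")
```

In this sketch the step size starts small, because the initial loss (and hence the curvature bound) is large, and grows toward 1 / K0 as the loss falls. That is the warm-up-like behaviour the abstract describes, shown here only for a toy loss rather than a neural network.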
