On the $O(\sqrt{d}/K^{1/4})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm

Abstract

As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not theoretically well-understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(x^k)\|_1\right]\leq O\left(\frac{\sqrt{d}\,C}{K^{1/4}}\right)$ for AdamW measured by $\ell_1$ norm, where $K$ represents the iteration number, $d$ denotes the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $\|\nabla f(x)\|_2\ll\|\nabla f(x)\|_1\leq\sqrt{d}\,\|\nabla f(x)\|_2$ for any high-dimensional vector $x$ and $\mathbb{E}\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}\,\mathbb{E}\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is generated from Gaussian distribution $\mathcal{N}(0,1)$. Empirically, our experimental results on real-world deep learning tasks reveal $\|\nabla f(x)\|_1=\Theta(\sqrt{d})\,\|\nabla f(x)\|_2$. Both support that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(x^k)\|_2\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ convergence rate of SGD in the ideal case. We also extend our result to NAdamW, an AdamW variant that employs a double-momentum mechanism, and demonstrate that it maintains the same convergence rate.
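
The relationship between the $\ell_1$ and $\ell_2$ norms stated in the abstract can be checked numerically. The following is a minimal sketch, not code from the paper: it draws vectors with i.i.d. $\mathcal{N}(0,1)$ entries as a stand-in for a gradient, verifies the deterministic bound $\|g\|_2\leq\|g\|_1\leq\sqrt{d}\,\|g\|_2$ on every draw, and compares the ratio of averaged norms with $\sqrt{2d/\pi}$ and $\sqrt{d}$. The function names (`l1_norm`, `l2_norm`, `norm_ratio_stats`), the trial count, and the dimensions are all illustrative assumptions.

```python
import math
import random


def l1_norm(v):
    """Sum of absolute values of the entries."""
    return sum(abs(x) for x in v)


def l2_norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


def norm_ratio_stats(d, trials=100, seed=0):
    """Average ||g||_1 and ||g||_2 over `trials` draws of g ~ N(0, I_d).

    Hypothetical helper for illustration; g is a synthetic stand-in
    for a gradient, not a gradient from a real training run.
    """
    rng = random.Random(seed)
    mean_l1 = 0.0
    mean_l2 = 0.0
    tol = 1 + 1e-9  # slack for floating-point rounding
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        n1, n2 = l1_norm(g), l2_norm(g)
        # Deterministic sandwich: ||g||_2 <= ||g||_1 <= sqrt(d) * ||g||_2.
        assert n2 <= n1 * tol
        assert n1 <= math.sqrt(d) * n2 * tol
        mean_l1 += n1 / trials
        mean_l2 += n2 / trials
    return mean_l1, mean_l2


if __name__ == "__main__":
    for d in (10, 100, 1000):
        m1, m2 = norm_ratio_stats(d)
        # Columns: d, empirical ratio, sqrt(2d/pi), sqrt(d)
        print(d, m1 / m2, math.sqrt(2 * d / math.pi), math.sqrt(d))
```

Under the Gaussian assumption the printed ratio should sit close to $\sqrt{2d/\pi}\approx 0.8\sqrt{d}$, which is the sense in which an $\ell_1$ bound of order $\sqrt{d}\,C/K^{1/4}$ is comparable to an $\ell_2$ bound of order $C/K^{1/4}$. For sparse vectors the ratio can be far smaller, so this sketch says nothing about gradients that are not dense.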
