Convergence Rate Analysis of the AdamW-style Shampoo: Unifying One-Sided and Two-Sided Preconditioning

Zhouchen Lin

Abstract

This paper studies AdamW-style Shampoo, an effective variant of the classical Shampoo optimizer that won the external tuning track of the AlgoPerf neural network training competition. Our analysis unifies one-sided and two-sided preconditioning. When the exponents of the two preconditioners sum to 1/2, we establish the convergence rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(X_k)\|_*\right] \le O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$, where $\|\cdot\|_*$ denotes the nuclear norm, $K$ is the number of iterations, $(m,n)$ are the dimensions of the matrix-valued parameters, and $C$ matches the constant appearing in the optimal convergence rate of SGD. Theoretically, the nuclear norm and the Frobenius norm satisfy $\|\nabla f(X)\|_F \le \|\nabla f(X)\|_* \le \sqrt{\min\{m,n\}}\,\|\nabla f(X)\|_F$, which suggests that our rate is analogous to the optimal rate $\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[\|\nabla f(X_k)\|_F\right] \le O\left(\frac{C}{K^{1/4}}\right)$ of SGD in the ideal case where $\|\nabla f(X)\|_* = \Theta\left(\sqrt{\min\{m,n\}}\right)\|\nabla f(X)\|_F$ and $m$ and $n$ are of comparable magnitude. We then extend the analysis to settings where the preconditioning exponents do not sum to 1/2, and establish convergence with an explicit but more involved rate.
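The norm comparison in the abstract can be checked numerically. The sketch below is only an illustration, not code from the paper. It assumes NumPy is available, and the helper name norm_ratio, the dimensions (m, n) = (6, 4), and the three test matrices are hypothetical stand-ins for a gradient. It computes the ratio of the nuclear norm to the Frobenius norm and shows the two extremes: a rank-one matrix attains the lower bound 1, while a matrix whose singular values are all equal attains the upper bound sqrt(min{m, n}), which is the "ideal case" the abstract refers to.

```python
import numpy as np


def norm_ratio(G):
    """Return ||G||_* / ||G||_F for a 2-D array G (hypothetical helper)."""
    nuclear = np.linalg.norm(G, ord="nuc")    # sum of singular values
    frobenius = np.linalg.norm(G, ord="fro")  # sqrt of sum of squared singular values
    return nuclear / frobenius


# Illustrative dimensions, chosen arbitrarily for this sketch.
m, n = 6, 4
rng = np.random.default_rng(0)

# A generic random matrix standing in for a gradient.
G_random = rng.standard_normal((m, n))

# Rank-one matrix: a single nonzero singular value, so the ratio is 1.
G_rank_one = np.outer(rng.standard_normal(m), rng.standard_normal(n))

# Orthonormal columns: all min(m, n) singular values equal 1,
# so the ratio equals sqrt(min(m, n)).
G_flat, _ = np.linalg.qr(rng.standard_normal((m, n)))

upper = np.sqrt(min(m, n))
for name, G in [("random", G_random), ("rank-one", G_rank_one), ("flat spectrum", G_flat)]:
    r = norm_ratio(G)
    # The sandwich 1 <= ratio <= sqrt(min(m, n)) holds up to rounding error.
    assert 1.0 - 1e-9 <= r <= upper + 1e-9
    print(f"{name}: ratio = {r:.4f}, bounds = [1, {upper:.4f}]")
```

The closer the gradient's singular values are to being equal, the closer the ratio is to sqrt(min{m, n}), and the closer the nuclear-norm rate is to the Frobenius-norm rate of SGD when m and n are comparable.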
