Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

Abstract

Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective (cross-entropy loss with $L_2$ regularization) by proving that it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss $F$ is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition $-\Delta F + \frac{1}{s}\|\nabla F\|^2 \to \infty$ as $\|\theta\| \to \infty$ for all $s > 0$. From this structure, we derive explicit log-Sobolev and Poincaré constants, $C_{LS} \le \lambda^{-1} + d/\lambda^2$, linking the regularization strength $\lambda$ and model dimension $d$ to finite-time convergence guarantees for noisy stochastic gradient descent and to PAC-Bayesian generalization bounds that tighten with increasing $\lambda$. To validate our theory, we introduce a scalable Villani diagnostic, $-\Delta F + s^{-1}\|\nabla F\|^2$ evaluated at $\theta$, and estimate it efficiently using Hutchinson trace probes in models with over 100M parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of this diagnostic, spectral inflation of the Hessian, and exponential convergence behavior consistent with our log-Sobolev analysis. These results demonstrate that weight decay not only improves generalization empirically but also establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning.
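
As a rough illustration of the diagnostic named in the abstract, the sketch below estimates $-\Delta F + s^{-1}\|\nabla F\|^2$ with Hutchinson trace probes on a toy objective. Everything here is an assumption made for illustration and is not taken from the paper: the objective $F(\theta) = \sum_i \log\cosh(\theta_i) + \frac{\lambda}{2}\|\theta\|^2$ stands in for a regularized loss, the function names are hypothetical, and the Hessian-vector product uses central differences of the gradient, where a real model would more likely use automatic differentiation.

```python
import math
import random


def grad_F(theta, lam):
    """Gradient of the toy objective (an assumption, not the paper's loss):
    F(theta) = sum_i log(cosh(theta_i)) + (lam / 2) * ||theta||^2."""
    return [math.tanh(t) + lam * t for t in theta]


def hvp(theta, v, lam, eps=1e-4):
    """Approximate the Hessian-vector product H v by central differences
    of the gradient: (grad(theta + eps v) - grad(theta - eps v)) / (2 eps)."""
    plus = grad_F([t + eps * vi for t, vi in zip(theta, v)], lam)
    minus = grad_F([t - eps * vi for t, vi in zip(theta, v)], lam)
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def hutchinson_laplacian(theta, lam, num_probes=8, rng=None):
    """Hutchinson estimate of the Laplacian, trace(H) ~ mean of v^T H v
    over random Rademacher (+1/-1) probe vectors v."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in theta]
        Hv = hvp(theta, v, lam)
        total += sum(vi * hi for vi, hi in zip(v, Hv))
    return total / num_probes


def villani_diagnostic(theta, lam, s=1.0, num_probes=8, rng=None):
    """Estimate -Laplacian(F) + ||grad F||^2 / s at the point theta."""
    g = grad_F(theta, lam)
    grad_sq = sum(gi * gi for gi in g)
    return -hutchinson_laplacian(theta, lam, num_probes, rng) + grad_sq / s


if __name__ == "__main__":
    lam = 0.1
    # Evaluate along a ray of growing norm; the L2 term makes ||grad F||^2
    # grow quadratically while the Laplacian of this toy objective stays bounded.
    for radius in (1.0, 10.0, 100.0):
        theta = [radius] * 4
        print(radius, villani_diagnostic(theta, lam))
```

Because the toy Hessian is diagonal, Rademacher probes recover its trace exactly (up to finite-difference error), so the number of probes matters little here; for a dense Hessian the estimate is only unbiased and its variance shrinks as probes are added.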
