Temperature is All You Need for Generalization in Langevin Dynamics and other Markov Processes

Abstract

We analyze the generalization gap (the gap between training and test errors) when training a potentially over-parametrized model using a Markovian stochastic training algorithm, initialized from some distribution $\theta_0 \sim p_0$. We focus on Langevin dynamics with a positive temperature $\beta^{-1}$, i.e., gradient descent on a training loss $L$ with infinitesimal step size, perturbed with $\beta^{-1}$-variance Gaussian noise, and lightly regularized or bounded. There, we bound the generalization gap, at any time during training, by $\sqrt{(\beta \mathbb{E} L(\theta_0) + \log(1/\delta))/N}$ with probability $1-\delta$ over the dataset, where $N$ is the sample size, and $\mathbb{E} L(\theta_0) = O(1)$ with standard initialization scaling. In contrast to previous guarantees, our bound has no dependence on training time, no reliance on mixing, and no dependence on dimensionality, gradient norms, or any other properties of the loss or model. This guarantee follows from a general analysis of any Markov-process-based training that has a Gibbs-style stationary distribution. The proof is surprisingly simple once we observe that the divergence of the marginal distribution from initialization remains bounded, as implied by a generalized second law of thermodynamics.
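To make the two objects in the abstract concrete, here is a minimal Python sketch of a noisy gradient step and of the bound. It rests on several assumptions that are not taken from the paper:

- The function names `langevin_step` and `generalization_bound` are hypothetical.
- The step is an Euler-Maruyama discretization with a finite step size, whereas the abstract considers the infinitesimal-step limit.
- The noise uses the common convention $d\theta = -\nabla L\,dt + \sqrt{2\beta^{-1}}\,dW$, whose stationary distribution is proportional to $e^{-\beta L}$.
- The bound is read as $\sqrt{(\beta \mathbb{E} L(\theta_0) + \log(1/\delta))/N}$, with any constants in the paper's formal statement omitted.

```python
import math
import random


def langevin_step(theta, grad_fn, beta, step_size, rng):
    """One Euler-Maruyama step of d(theta) = -grad L dt + sqrt(2 / beta) dW.

    Assumption: this finite-step discretization only approximates the
    continuous-time dynamics that the abstract analyzes.
    """
    noise_scale = math.sqrt(2.0 * step_size / beta)
    grad = grad_fn(theta)
    return [
        t - step_size * g + noise_scale * rng.gauss(0.0, 1.0)
        for t, g in zip(theta, grad)
    ]


def generalization_bound(beta, expected_init_loss, n_samples, delta):
    """Evaluate sqrt((beta * E[L(theta_0)] + log(1 / delta)) / N).

    Assumption: constants from the formal theorem are omitted, so this
    gives a scale for the bound, not an exact value.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if beta <= 0.0 or n_samples <= 0:
        raise ValueError("beta and n_samples must be positive")
    return math.sqrt(
        (beta * expected_init_loss + math.log(1.0 / delta)) / n_samples
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy quadratic loss L(theta) = 0.5 * ||theta||^2, so grad L = theta.
    theta = [1.0, -1.0]
    for _ in range(100):
        theta = langevin_step(theta, lambda th: list(th), beta=10.0,
                              step_size=0.01, rng=rng)
    print("theta after 100 steps:", theta)
    print("bound:", generalization_bound(beta=10.0, expected_init_loss=1.0,
                                         n_samples=10_000, delta=0.05))
```

Under this reading, the bound grows with the inverse temperature $\beta$ and shrinks with the sample size $N$. Neither the number of steps taken nor the parameter dimension appears in it, which is the point the abstract emphasizes.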
