SGD at the Edge of Stability: The Stochastic Sharpness Gap

Abstract

When training neural networks with full-batch gradient descent (GD) and step size η, the largest eigenvalue of the Hessian, the sharpness S(θ), rises to 2/η and hovers there, a phenomenon termed the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint S(θ) ≤ 2/η. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below 2/η, with the gap widening as the batch size decreases, but no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below 2/η. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving PGD trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap, ΔS = η β σ_u² / (4α), where α is the progressive sharpening rate, β is the self-stabilization strength, and σ_u² is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions, and it recovers GD when the batch equals the full dataset, where σ_u² vanishes and the gap closes.
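The gap formula can be explored numerically. The Python sketch below is illustrative only and is not the authors' code. It evaluates ΔS = η β σ_u² / (4α) and the resulting equilibrium sharpness 2/η − ΔS across batch sizes. The abstract does not say how σ_u² depends on the batch size, so the sketch assumes sampling without replacement, for which the variance of a mini-batch mean is the per-sample variance times (N − B) / (B (N − 1)). The function names and all numeric values (`eta`, `alpha`, `beta`, `per_sample_var`, `dataset_size`) are hypothetical placeholders.

```python
def noise_variance(per_sample_var, batch_size, dataset_size):
    """Variance of a mini-batch mean drawn WITHOUT replacement.

    Assumption (not stated in the abstract): sigma_u^2 follows the standard
    finite-population formula, which vanishes when batch_size == dataset_size.
    """
    if not 1 <= batch_size <= dataset_size:
        raise ValueError("batch_size must lie in [1, dataset_size]")
    if dataset_size == 1:
        return 0.0
    correction = (dataset_size - batch_size) / (dataset_size - 1)
    return per_sample_var / batch_size * correction


def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Equilibrium sharpness gap from the abstract: eta * beta * sigma_u^2 / (4 * alpha)."""
    return eta * beta * sigma_u_sq / (4.0 * alpha)


def predicted_sharpness(eta, alpha, beta, sigma_u_sq):
    """Predicted equilibrium sharpness: the GD threshold 2/eta minus the gap."""
    return 2.0 / eta - sharpness_gap(eta, alpha, beta, sigma_u_sq)


if __name__ == "__main__":
    # Hypothetical values chosen only to make the trend visible.
    eta, alpha, beta = 0.01, 1.0, 1.0
    per_sample_var, dataset_size = 1000.0, 4096

    for batch_size in (8, 64, 512, 4096):
        var = noise_variance(per_sample_var, batch_size, dataset_size)
        gap = sharpness_gap(eta, alpha, beta, var)
        sharp = predicted_sharpness(eta, alpha, beta, var)
        print(f"B={batch_size:5d}  sigma_u^2={var:9.4f}  gap={gap:7.4f}  sharpness={sharp:9.4f}")
```

Under this assumed noise model the gap shrinks monotonically as the batch grows and is exactly zero at full batch, matching the two qualitative predictions stated in the abstract.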
