Stochastic Bregman Proximal Gradient Method Revisited: Kernel Conditioning and Painless Variance Reduction
Abstract
We investigate stochastic Bregman proximal gradient (SBPG) methods for minimizing a finite-sum nonconvex function $\Psi(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \phi(x)$, where $\phi$ is convex and nonsmooth, while each $f_i$, instead of having a globally Lipschitz continuous gradient, satisfies a smooth-adaptability condition w.r.t. some kernel $h$. Standard acceleration techniques for stochastic algorithms (momentum, shuffling, variance reduction) depend on bounding stochastic errors by gradient differences that are further controlled via the Lipschitz property. Lacking this, existing SBPG results are limited to vanilla stochastic approximation, which cannot yield the optimal $O(\sqrt{n})$ complexity dependence on $n$. Moreover, existing works report complexities under various nonstandard stationarity measures that largely deviate from the standard minimal limiting Fréchet subdifferential measure $\mathrm{dist}(0, \partial\Psi(\cdot))$. Our analysis reveals that these popular stationarity measures are often much smaller than $\mathrm{dist}(0, \partial\Psi(\cdot))$, leading to overstated solution quality and non-stationary output. To resolve these issues, we design a new gradient mapping $D_{\phi,h}^{\lambda}(\cdot)$ from BPG residuals in the dual space and a new kernel-conditioning (KC) regularity, under which the mismatch between $\|D_{\phi,h}^{\lambda}(\cdot)\|$ and $\mathrm{dist}(0, \partial\Psi(\cdot))$ is provably $O(1)$ and instance-free. Moreover, KC-regularity guarantees Lipschitz-like bounds for gradient differences, providing general analysis tools for momentum, shuffling, and variance reduction under smooth-adaptability. We illustrate this point on variance-reduced SBPG methods and establish an $O(\sqrt{n})$ complexity for $\|D_{\phi,h}^{\lambda}(\cdot)\|$, providing instance-free (worst-case) complexity under $\mathrm{dist}(0, \partial\Psi(\cdot))$.
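To make the basic iteration concrete, here is a minimal sketch of one vanilla SBPG step. It is not the variance-reduced method the abstract analyzes, and every specific in it is an assumption made for illustration: the quartic kernel h(x) = ½‖x‖² + ¼‖x‖⁴, the quartic losses f_i(x) = ¼((a_iᵀx)² − b_i)² (whose gradients are not globally Lipschitz), the choice of a zero nonsmooth term, and the hypothetical helper names. With the nonsmooth term set to zero, a step with step size `lam` solves ∇h(x⁺) = ∇h(x) − λg in the dual space, where g is a minibatch gradient.

```python
import numpy as np

def grad_h(x):
    """Gradient of the assumed kernel h(x) = 0.5*||x||^2 + 0.25*||x||^4."""
    return (1.0 + x @ x) * x

def mirror_inverse(p, iters=100):
    """Solve grad_h(x) = p for x.

    The solution has the form x = t * p, where t in (0, 1] is the unique
    root of ||p||^2 * t^3 + t - 1 = 0. The cubic is increasing in t, so
    plain bisection on [0, 1] is enough.
    """
    c = p @ p
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if c * mid**3 + mid - 1.0 > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi) * p

def objective(x, A, b):
    """Assumed smooth part: mean of 0.25 * ((a_i^T x)^2 - b_i)^2."""
    r = (A @ x) ** 2 - b
    return 0.25 * np.mean(r**2)

def minibatch_grad(x, A, b, idx):
    """Average gradient of the assumed losses over the index set idx."""
    z = A[idx] @ x
    return A[idx].T @ ((z**2 - b[idx]) * z) / len(idx)

def sbpg_step(x, A, b, idx, lam):
    """One Bregman step with zero nonsmooth term:
    grad_h(x_new) = grad_h(x) - lam * g, then map back to the primal."""
    g = minibatch_grad(x, A, b, idx)
    return mirror_inverse(grad_h(x) - lam * g)
```

A typical loop would draw `idx` with `rng.choice(n, size=batch, replace=False)` at each iteration and call `sbpg_step` repeatedly. Handling a nonzero nonsmooth term would require solving the full Bregman proximal subproblem rather than only inverting ∇h.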