Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness

Abstract

Recent results in non-convex stochastic optimization demonstrate the convergence of popular adaptive algorithms (e.g., AdaGrad) under the $(L_0, L_1)$-smoothness condition, but the rate of convergence is a higher-order polynomial in problem parameters such as the smoothness constants. The complexity guaranteed by such algorithms to find an $\varepsilon$-stationary point may be significantly larger than the optimal complexity of $\Theta(\Delta L \sigma^2 \varepsilon^{-4})$ achieved by SGD in the $L$-smooth setting, where $\Delta$ is the initial optimality gap and $\sigma^2$ is the variance of the stochastic gradient. However, it is currently not known whether these higher-order dependencies can be tightened. To answer this question, we investigate complexity lower bounds for several adaptive optimization algorithms in the $(L_0, L_1)$-smooth setting, with a focus on the dependence on the problem parameters $\Delta, L_0, L_1$. We provide complexity bounds for three variations of AdaGrad, which show at least a quadratic dependence on the problem parameters $\Delta, L_0, L_1$. Notably, we show that the decorrelated variant of AdaGrad-Norm requires at least $\Omega(\Delta^2 L_1^2 \sigma^2 \varepsilon^{-4})$ stochastic gradient queries to find an $\varepsilon$-stationary point. We also provide a lower bound for SGD with a broad class of adaptive stepsizes. Our results show that, for certain adaptive algorithms, the $(L_0, L_1)$-smooth setting is fundamentally more difficult than the standard smooth setting in terms of the initial optimality gap and the smoothness constants.
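For readers unfamiliar with the setting, the following is a minimal sketch of the standard AdaGrad-Norm update, run on a one-dimensional toy problem. It is an illustration under stated assumptions, not the construction used in the paper: the abstract does not define the decorrelated variant or the other AdaGrad variations it analyzes, so only the commonly used form $x_{t+1} = x_t - \eta\, g_t / \sqrt{b_0^2 + \sum_{s \le t} \|g_s\|^2}$ is shown. The objective $f(x) = \cosh(x)$ is chosen here because $f''(x) = \cosh(x) \le 1 + |\sinh(x)| = 1 + |f'(x)|$, so it satisfies the usual $(L_0, L_1)$-smoothness inequality $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$ with $L_0 = L_1 = 1$ while not being $L$-smooth for any fixed $L$. The names `adagrad_norm` and `noisy_grad` and all parameter values are hypothetical.

```python
import math
import random


def adagrad_norm(grad_oracle, x0, eta=1.0, b0=1.0, steps=1000):
    """Run standard AdaGrad-Norm on a 1-D problem; return the final iterate.

    Assumed update (common form, not taken from the paper):
        x <- x - eta * g / sqrt(b0^2 + sum of squared gradients so far)
    """
    x = x0
    accum = b0 ** 2  # b0^2 plus the running sum of squared stochastic gradients
    for _ in range(steps):
        g = grad_oracle(x)       # one stochastic gradient query
        accum += g * g           # the current gradient enters the stepsize
        x -= eta / math.sqrt(accum) * g
    return x


# Toy objective f(x) = cosh(x), whose gradient is sinh(x).
# Additive Gaussian noise stands in for a stochastic gradient oracle
# with variance sigma^2 (illustrative choice).
rng = random.Random(0)


def noisy_grad(x, sigma=0.1):
    return math.sinh(x) + rng.gauss(0.0, sigma)


x_final = adagrad_norm(noisy_grad, x0=3.0)
grad_norm = abs(math.sinh(x_final))  # true gradient magnitude at the output
```

Here each call to `noisy_grad` counts as one stochastic gradient query, which is the quantity the complexity bounds in the abstract measure; `grad_norm` is the quantity that must fall below $\varepsilon$ for `x_final` to be an $\varepsilon$-stationary point.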
