The Adaptive Complexity of Finding a Stationary Point
Abstract
In large-scale applications, such as machine learning, it is desirable to design non-convex optimization algorithms with a high degree of parallelization. In this work, we study the adaptive complexity of finding a stationary point, which is the minimal number of sequential rounds required to achieve stationarity given polynomially many queries executed in parallel at each round. For the high-dimensional case, i.e., d = (-(2 + 2p)/p), we show that for any (potentially randomized) algorithm, there exists a function with Lipschitz p-th order derivatives such that the algorithm requires at least -(p+1)/p iterations to find an -stationary point. Our lower bounds are tight and show that even with poly(d) queries per iteration, no algorithm has better convergence rate than those achievable with one-query-per-round algorithms. In other words, gradient descent, the cubic-regularized Newton's method, and the p-th order adaptive regularization method are adaptively optimal. Our proof relies upon novel analysis with the characterization of the output for the hardness potentials based on a chain-like structure with random partition. For the constant-dimensional case, i.e., d = (1), we propose an algorithm that bridges grid search and gradient flow trapping, finding an approximate stationary point in constant iterations. Its asymptotic tightness is verified by a new lower bound on the required queries per iteration. We show there exists a smooth function such that any algorithm running with ( (1/)) rounds requires at least ((1/)(d-1)/2) queries per round. This lower bound is tight up to a logarithmic factor, and implies that the gradient flow trapping is adaptively optimal.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.