Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension
Abstract
We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. With a budget of T iterations, it was recently shown that an accelerated 1/T² rate is possible by choosing a large stepsize η = Θ(γ²T) (where γ is the dataset's margin) despite the resulting non-monotonicity of the loss. In this paper, we provide a tighter analysis of gradient descent for this problem when the data is two-dimensional: we show that GD with a sufficiently large learning rate η finds a point with loss smaller than O(1/(ηγ²T)), as long as T ≥ Ω(n/γ + 1/γ²), where n is the dataset size. Our improved rate comes from a tighter bound on the time τ that it takes for GD to transition from unstable (non-monotonic loss) to stable (monotonic loss), via a fine-grained analysis of the oscillatory dynamics of GD in the subspace orthogonal to the max-margin classifier. We also provide a lower bound of τ matching our upper bound up to logarithmic factors, showing that our analysis is tight.
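The setting described in the abstract can be illustrated with a small sketch: plain gradient descent on the average logistic loss for a separable two-dimensional dataset, run once with a small stepsize and once with a large one. Everything concrete here is an assumption chosen for illustration and does not come from the paper: the two-point dataset (margin γ = 0.2, max-margin direction along the first coordinate), the stepsizes 1 and 100, the budget T = 200, and the helper names `logistic_loss`, `gradient`, and `run_gd`. It is a toy demonstration of the unstable-then-stable behavior, not the paper's analysis or experimental setup.

```python
import math

# Hypothetical toy dataset: two separable points in 2D, given as ((x1, x2), y).
# With these values, y * x is (0.2, 0.9) and (0.2, -0.5), so the max-margin
# direction is (1, 0) and the margin is 0.2.
DATA = [((0.2, 0.9), +1), ((-0.2, 0.5), -1)]


def logistic_loss(w, data):
    """Average logistic loss (1/n) * sum_i log(1 + exp(-y_i * <w, x_i>))."""
    total = 0.0
    for (x1, x2), y in data:
        z = y * (w[0] * x1 + w[1] * x2)
        # Numerically stable form of log(1 + exp(-z)).
        total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(data)


def gradient(w, data):
    """Gradient of the average logistic loss with respect to w."""
    g0 = g1 = 0.0
    for (x1, x2), y in data:
        z = y * (w[0] * x1 + w[1] * x2)
        # s = sigmoid(-z), computed without overflowing exp().
        if z >= 0:
            e = math.exp(-z)
            s = e / (1.0 + e)
        else:
            s = 1.0 / (1.0 + math.exp(z))
        g0 -= s * y * x1
        g1 -= s * y * x2
    n = len(data)
    return g0 / n, g1 / n


def run_gd(data, eta, T):
    """Run T steps of constant-stepsize GD from w = 0; return (w, loss history)."""
    w = (0.0, 0.0)
    losses = [logistic_loss(w, data)]
    for _ in range(T):
        g = gradient(w, data)
        w = (w[0] - eta * g[0], w[1] - eta * g[1])
        losses.append(logistic_loss(w, data))
    return w, losses


# Illustrative stepsizes (assumptions, not values from the paper).
_, small_losses = run_gd(DATA, eta=1.0, T=200)
_, large_losses = run_gd(DATA, eta=100.0, T=200)

print("small stepsize: final loss", small_losses[-1])
print("large stepsize: loss after 1 step", large_losses[1])
print("large stepsize: final loss", large_losses[-1])
```

On this toy problem the small-stepsize run decreases the loss at every step, while the large-stepsize run first pushes the loss above its initial value log 2, oscillates along the second coordinate (the direction orthogonal to the max-margin classifier) for a few steps, and then settles into a monotone phase with a lower final loss.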