Non-Stationary Bandit Convex Optimization: An Optimal Algorithm with Two-Point Feedback
Abstract
This paper studies bandit convex optimization in non-stationary environments with two-point feedback, using dynamic regret as the performance measure. We propose an algorithm based on bandit mirror descent that extends naturally to non-Euclidean settings. Let $T$ be the total number of iterations and $P_{T,p}$ the path variation with respect to the $\ell_p$-norm. In Euclidean space, our algorithm matches the optimal regret bound $O(\sqrt{dT(1+P_{T,2})})$, improving upon Zhao et al. (2021) by a factor of $O(\sqrt{d})$. Beyond Euclidean settings, our algorithm achieves an upper bound of $O(\sqrt{d\log(d)\,T\log(T)\,(1+P_{T,1})})$ on the simplex, which is optimal up to logarithmic factors. For the cross-polytope, the bound reduces to $O(\sqrt{d\log(d)\,T(1+P_{T,p})})$ with $p = 1 + 1/\log(d)$.
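To make the two central quantities concrete, here is a minimal Python sketch. It rests on assumptions not stated in the abstract: the path variation is taken to be the usual sum of p-norm distances between consecutive comparators, and the gradient estimate is the textbook two-point spherical estimator for the Euclidean case. The paper's own estimator, especially in the non-Euclidean settings, may differ. The function names `path_variation`, `random_unit_vector`, and `two_point_gradient` are hypothetical.

```python
import math
import random


def path_variation(comparators, p=2.0):
    """Sum of p-norm distances between consecutive comparators u_1, ..., u_T.

    Assumed definition: sum over t >= 2 of ||u_t - u_{t-1}||_p.
    """
    total = 0.0
    for prev, cur in zip(comparators, comparators[1:]):
        total += sum(abs(a - b) ** p for a, b in zip(cur, prev)) ** (1.0 / p)
    return total


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the Euclidean unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_point_gradient(f, x, delta, rng):
    """Standard two-point estimate: d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u.

    Only two function evaluations are used per round; no gradient is queried.
    """
    d = len(x)
    u = random_unit_vector(d, rng)
    plus = f([xi + delta * ui for xi, ui in zip(x, u)])
    minus = f([xi - delta * ui for xi, ui in zip(x, u)])
    scale = d * (plus - minus) / (2.0 * delta)
    return [scale * ui for ui in u]


if __name__ == "__main__":
    rng = random.Random(0)
    a = [1.0, -2.0, 0.5]  # illustrative linear loss f(x) = <a, x>
    f = lambda x: sum(ai * xi for ai, xi in zip(a, x))
    print(two_point_gradient(f, [0.0, 0.0, 0.0], 1e-3, rng))
    print(path_variation([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0]], p=2.0))
```

For a linear loss the estimate is unbiased, so averaging many draws should approach the true gradient `a`. A mirror descent update would then consume each estimate in place of the true gradient.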