Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization

Abstract

We study online learning with bandit feedback (i.e., the learner has access to only a zeroth-order oracle) where the cost/reward functions $f_t$ admit a "pseudo-1d" structure, i.e., $f_t(w) = \ell_t(g_t(w))$, where the output of $g_t$ is one-dimensional. At each round, the learner observes context $x_t$, plays prediction $g_t(w_t; x_t)$ (e.g., $g_t(\cdot) = \langle x_t, \cdot \rangle$) for some $w_t \in \mathbb{R}^d$, and observes loss $\ell_t(g_t(w_t))$, where $\ell_t$ is a convex Lipschitz-continuous function. The goal is to minimize the standard regret metric. This pseudo-1d bandit convex optimization problem arises frequently in domains such as online decision-making and parameter tuning in large systems. For this problem, we first show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ on the regret of any algorithm, where $T$ is the number of rounds. We then propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively, guaranteeing the optimal regret bound above up to additional logarithmic factors. In contrast, applying state-of-the-art online convex optimization methods leads to $\tilde{O}(\min(d^{9.5}\sqrt{T}, \sqrt{d}\,T^{3/4}))$ regret, which is significantly suboptimal in $d$.
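
To make the feedback model concrete, the following is a minimal simulation sketch of the interaction protocol with a linear prediction (inner product of context and parameter vector). It is not the algorithm proposed in the paper: the learner here is a generic one-point-gradient-estimate baseline, and the absolute loss, the comparator `w_star`, the unit-ball domain, and the tuning of `delta` and `eta` are all assumptions chosen for illustration.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def unit_vector(d, rng):
    """Sample a direction uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def project_ball(w, radius=1.0):
    """Project w onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= radius:
        return w
    return [v * radius / norm for v in w]


def run_protocol(T=2000, d=5, seed=0):
    """Simulate T rounds and return regret against the fixed comparator w_star."""
    rng = random.Random(seed)
    delta = T ** -0.25  # exploration radius (assumed tuning)
    eta = T ** -0.75    # step size (assumed tuning)
    w = [0.0] * d
    # Hypothetical comparator; it attains zero loss in every round below.
    w_star = project_ball(unit_vector(d, rng), 0.5)
    learner_loss = 0.0
    comparator_loss = 0.0
    for _ in range(T):
        x = unit_vector(d, rng)  # context revealed to the learner
        target = dot(w_star, x)

        def loss(y, target=target):
            # Convex, 1-Lipschitz scalar loss; hidden from the learner.
            return abs(y - target)

        # Play a randomly perturbed point; its norm is at most 1.
        u = unit_vector(d, rng)
        w_play = [wi + delta * ui for wi, ui in zip(w, u)]
        y_hat = dot(w_play, x)   # one-dimensional prediction
        observed = loss(y_hat)   # bandit feedback: a single scalar

        # One-point gradient estimate built from the scalar feedback only.
        grad_est = [(d / delta) * observed * ui for ui in u]
        w = project_ball(
            [wi - eta * gi for wi, gi in zip(w, grad_est)], 1.0 - delta
        )

        learner_loss += observed
        comparator_loss += loss(dot(w_star, x))
    return learner_loss - comparator_loss


if __name__ == "__main__":
    print(run_protocol())
```

The learner never evaluates `loss` anywhere except at the single prediction it plays, which is what the zeroth-order (bandit) oracle restriction means; the regret returned is the cumulative loss of the learner minus that of the fixed comparator.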
