Kiefer Wolfowitz Algorithm is Asymptotically Optimal for a Class of Non-Stationary Bandit Problems

Abstract

We consider the problem of designing an allocation rule, or "online learning algorithm", for a class of bandit problems in which the set of control actions available at each time s is a convex, compact subset of R^d. Upon choosing an action x at time s, the algorithm observes a noisy value of the unknown, time-varying function f_s evaluated at x. The "regret" of an algorithm is the gap between its expected reward and the reward earned by a strategy that knows the function f_s at each time s and hence chooses the action x_s that maximizes f_s. For this non-stationary bandit set-up, we consider two variants of the Kiefer Wolfowitz (KW) algorithm: i) KW with a fixed step-size β, and ii) KW with a sliding window of length L. We show that if the number of times the function f_s changes over the horizon T is o(T), and if the learning rates of the proposed algorithms are chosen "optimally", then the regret of the proposed algorithms is o(T), and hence the algorithms are asymptotically efficient.
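To make the first variant concrete, the following is a minimal sketch of a Kiefer Wolfowitz style update with a constant step-size β, written from the textbook finite-difference form of KW rather than from the paper's own pseudocode, which the abstract does not give. The names `kw_fixed_step`, `project_box`, `make_switching_objective`, the perturbation width `delta`, the box-shaped action set, and the quadratic test objective are all assumptions made for illustration. The sketch also queries the function 2d times per time step, a simplification of the bandit setting in which one action is played per step.

```python
import random


def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d (one convex, compact set)."""
    return [min(max(xi, lo), hi) for xi in x]


def kw_fixed_step(noisy_f, x0, horizon, beta=0.05, delta=0.1, lo=-1.0, hi=1.0):
    """Finite-difference stochastic ascent with a constant step size beta.

    noisy_f(s, x) is assumed to return a noisy evaluation of f_s at x.
    Returns the list of iterates, one per time step.
    """
    x = project_box(list(x0), lo, hi)
    path = []
    for s in range(horizon):
        grad = []
        for i in range(len(x)):
            # Two-sided perturbation along coordinate i. The query points may
            # step outside the box by delta; a stricter version would shrink
            # the box used for the iterates by delta.
            x_plus = list(x)
            x_plus[i] += delta
            x_minus = list(x)
            x_minus[i] -= delta
            grad.append((noisy_f(s, x_plus) - noisy_f(s, x_minus)) / (2 * delta))
        # Ascent step (the goal is to maximize f_s), then project back.
        x = project_box([xi + beta * gi for xi, gi in zip(x, grad)], lo, hi)
        path.append(x)
    return path


def make_switching_objective(switch_time, noise_std=0.02, seed=0):
    """Hypothetical test objective: f_s(x) = -||x - c_s||^2 plus Gaussian noise,
    where the maximizer c_s jumps from 0.5 to -0.5 at switch_time."""
    rng = random.Random(seed)

    def noisy_f(s, x):
        c = 0.5 if s < switch_time else -0.5
        return -sum((xi - c) ** 2 for xi in x) + rng.gauss(0.0, noise_std)

    return noisy_f


if __name__ == "__main__":
    f = make_switching_objective(switch_time=200)
    path = kw_fixed_step(f, [0.0, 0.0], horizon=400)
    print("iterate just before the switch:", path[199])
    print("final iterate:", path[-1])
```

Because β is held constant instead of decaying, the iterate keeps moving at a fixed rate and can follow the maximizer after f_s changes, at the cost of a noise floor that does not vanish. This is the usual motivation for a fixed step-size in non-stationary problems, and it suggests why the choice of β matters for the regret.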
