Normal Bandits of Unknown Means and Variances: Asymptotic Optimality, Finite Horizon Regret Bounds, and a Solution to an Open Problem

Abstract

Consider the problem of sampling sequentially from a finite number N ≥ 2 of populations, specified by random variables X^i_k, i = 1, …, N, and k = 1, 2, …, where X^i_k denotes the outcome from population i the k-th time it is sampled. It is assumed that for each fixed i, {X^i_k}_{k ≥ 1} is a sequence of i.i.d. normal random variables with unknown mean μ_i and unknown variance σ_i^2. The objective is to have a policy π for deciding from which of the N populations to sample at any time n = 1, 2, … so as to maximize the expected sum of outcomes of n samples or, equivalently, to minimize the regret due to lack of information about the parameters μ_i and σ_i^2. In this paper, we present a simple inflated sample mean (ISM) index policy that is asymptotically optimal in the sense of Theorem 4 below. This resolves a standing open problem from Burnetas and Katehakis (1996). Additionally, finite horizon regret bounds are given.
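To make the setup concrete, the following is a minimal simulation sketch of sequential sampling from N normal populations under an index policy of the inflated-sample-mean type. The abstract does not state the index itself, so the inflation term used in `ism_index` (sample standard deviation times sqrt(n^(2/(t-2)) - 1)), the three initial samples per population, and all function and parameter names are assumptions made for illustration only, not the paper's confirmed policy. Regret is computed here as the pseudo-regret: the sum, over populations, of the gap to the best mean times the number of samples taken from that population.

```python
import math
import random


def ism_index(mean, std, t, n):
    """Sample mean plus an inflation term that shrinks as t grows.

    Assumed illustrative form (not taken from the abstract); needs t >= 3.
    """
    return mean + std * math.sqrt(n ** (2.0 / (t - 2)) - 1.0)


def simulate(mus, sigmas, horizon, seed=0):
    """Run the sketch policy for `horizon` rounds.

    Returns (counts, pseudo_regret), where counts[i] is the number of
    samples drawn from population i.
    """
    rng = random.Random(seed)
    num_pops = len(mus)
    counts = [0] * num_pops
    sums = [0.0] * num_pops
    sq_sums = [0.0] * num_pops

    for n in range(1, horizon + 1):
        if n <= 3 * num_pops:
            # Assumed initialization: cycle until each population has 3 samples.
            choice = (n - 1) % num_pops
        else:
            best_index = -math.inf
            choice = 0
            for j in range(num_pops):
                t = counts[j]
                mean = sums[j] / t
                # Biased sample variance, clipped at 0 against rounding error.
                var = max(sq_sums[j] / t - mean * mean, 0.0)
                index = ism_index(mean, math.sqrt(var), t, n)
                if index > best_index:
                    best_index = index
                    choice = j

        outcome = rng.gauss(mus[choice], sigmas[choice])
        counts[choice] += 1
        sums[choice] += outcome
        sq_sums[choice] += outcome * outcome

    best_mean = max(mus)
    pseudo_regret = sum((best_mean - mu) * c for mu, c in zip(mus, counts))
    return counts, pseudo_regret


if __name__ == "__main__":
    counts, regret = simulate([0.0, 0.5, 1.0], [1.0, 2.0, 1.0], horizon=2000)
    print("samples per population:", counts)
    print("pseudo-regret:", regret)
```

The policy only sees sample means and sample variances, which mirrors the assumption that both μ_i and σ_i^2 are unknown; the true means enter only when the regret is scored afterwards.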
