Rotting Infinitely Many-armed Bandits

Abstract

We consider the infinitely many-armed bandit problem with rotting rewards, where the mean reward of an arm decreases at each pull of the arm according to an arbitrary trend with maximum rotting rate $\varrho = o(1)$. We show that this learning problem has an $\Omega(\max\{\varrho^{1/3}T, \sqrt{T}\})$ worst-case regret lower bound, where $T$ is the time horizon. We show that a matching upper bound $\tilde{O}(\max\{\varrho^{1/3}T, \sqrt{T}\})$, up to a poly-logarithmic factor, can be achieved by an algorithm that uses a UCB index for each arm and a threshold value to decide whether to continue pulling an arm or remove the arm from further consideration, when the algorithm knows the value of the maximum rotting rate $\varrho$. We also show that an $\tilde{O}(\max\{\varrho^{1/3}T, T^{3/4}\})$ regret upper bound can be achieved by an algorithm that does not know the value of $\varrho$, by using an adaptive UCB index along with an adaptive threshold value.
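The keep-or-discard rule for the known-rate case can be illustrated with a minimal simulation sketch. This is not the paper's exact algorithm: the function name `run_threshold_ucb`, the uniform prior on initial mean rewards, the constant per-pull decay, the Gaussian noise, the confidence width, and the threshold gap `delta = max(rho^(1/3), T^(-1/2))` are all assumptions chosen for illustration.

```python
import math
import random


def run_threshold_ucb(horizon, rho, seed=0):
    """Simulate a threshold-UCB policy on rotting arms.

    Returns (total_reward, arms_used). Hypothetical sketch, not the
    paper's algorithm.
    """
    rng = random.Random(seed)
    # Assumed threshold gap, chosen to mirror the two terms of the bound.
    delta = max(rho ** (1.0 / 3.0), 1.0 / math.sqrt(horizon))
    total_reward = 0.0
    arms_used = 0
    t = 0
    while t < horizon:
        # Draw a fresh arm; initial mean uniform on [0, 1] (assumption).
        mean = rng.random()
        arms_used += 1
        pulls = 0
        reward_sum = 0.0
        while t < horizon:
            # Noisy reward around the arm's current mean (Gaussian noise assumed).
            reward = mean + rng.gauss(0.0, 1.0)
            reward_sum += reward
            total_reward += reward
            pulls += 1
            t += 1
            # Assumed trend: the mean rots by the maximum rate at every pull.
            mean -= rho
            # Past rewards come from means no smaller than the current one,
            # so their average plus a confidence width is an optimistic
            # index for the current mean.
            index = reward_sum / pulls + math.sqrt(2.0 * math.log(horizon) / pulls)
            # Discard the arm once even the optimistic index falls below the threshold.
            if index < 1.0 - delta:
                break
    return total_reward, arms_used
```

For example, `run_threshold_ucb(10_000, 1e-3)` returns the total collected reward and the number of arms tried. A larger `rho` widens the assumed gap `delta`, so arms are held to a lower bar before being discarded.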
