Multi-Armed Bandits With Best-Action Queries
Abstract
We study multi-armed bandits (MABs) augmented with best-action queries, in which the learner may additionally query an oracle that reveals the best arm in the current round. This setting was recently characterized by Russo et al. [2024] in the full-feedback model, where the learner observes the rewards of all arms after each round. They show that, in both stochastic and adversarial environments, $k$ best-action queries reduce the optimal $O(\sqrt{T})$ regret to $O(\min\{T/k, \sqrt{T}\})$. Whether this improvement extends to the more realistic bandit-feedback model -- where the learner observes only the reward of the played arm -- was left as an open problem. We fully resolve this question. When rewards are stochastic but correlated among arms, we show that the full-feedback result does not carry over: any algorithm must incur regret at least $\Omega(\sqrt{T-k})$. This lower bound directly extends to adversarial environments. On the positive side, we show that $O(\min\{T/k, \sqrt{T-k}\})$ regret is still achievable when rewards are stochastic and i.i.d., and establish a matching lower bound, up to logarithmic factors. Together, these results provide a complete characterization of the benefits of best-action queries in the bandit-feedback model.
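To make the interaction protocol concrete, the following is a minimal simulation sketch of the bandit-feedback setting with a budget of best-action queries. It is not the algorithm from the paper: the function name `run_bandit_with_queries`, the Bernoulli reward model, the choice to spend all queries in the first rounds, and the UCB1 fallback are all assumptions made for illustration only.

```python
import math
import random


def run_bandit_with_queries(means, horizon, budget, seed=0):
    """Simulate one run of a hypothetical baseline learner.

    Returns (learner_reward, best_fixed_arm_reward); their difference is the
    regret against the best fixed arm in hindsight.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    pulls = [0] * n_arms             # plays per arm on non-query rounds
    totals = [0.0] * n_arms          # observed reward per arm on those rounds
    arm_cumulative = [0.0] * n_arms  # hidden from the learner, used for regret
    learner_reward = 0.0
    queries_left = budget

    for t in range(1, horizon + 1):
        # Environment: i.i.d. Bernoulli rewards, drawn independently per arm.
        rewards = [1.0 if rng.random() < m else 0.0 for m in means]

        used_query = queries_left > 0
        if used_query:
            # Best-action query: the oracle names an arm with the highest
            # reward in the current round, and the learner plays it.
            arm = max(range(n_arms), key=lambda a: rewards[a])
            queries_left -= 1
        elif 0 in pulls:
            # Play each arm once before using confidence bounds.
            arm = pulls.index(0)
        else:
            # UCB1 index (assumed fallback, not taken from the paper).
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )

        # Bandit feedback: only the played arm's reward is observed.
        reward = rewards[arm]
        learner_reward += reward
        if not used_query:
            # Query rounds are skipped here because the oracle's choice
            # would bias the empirical means upward.
            pulls[arm] += 1
            totals[arm] += reward

        for a in range(n_arms):
            arm_cumulative[a] += rewards[a]

    return learner_reward, max(arm_cumulative)
```

Calling `run_bandit_with_queries([0.3, 0.5, 0.7], horizon=1000, budget=100)` returns the learner's total reward and that of the best fixed arm in hindsight. Varying `budget` gives a rough empirical feel for how queries affect regret under these assumptions, though it says nothing about the bounds stated above.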