Thompson Sampling for Real-Valued Combinatorial Pure Exploration of Multi-Armed Bandit

Masashi Sugiyama

Abstract

We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem. In R-CPE-MAB, a player is given $d$ stochastic arms, and the reward of each arm $s \in \{1, \ldots, d\}$ follows an unknown distribution with mean $\mu_s$. In each time step, the player pulls a single arm and observes its reward. The player's goal is to identify the optimal action $\pi^* = \arg\max_{\pi \in \mathcal{A}} \mu^\top \pi$ from a finite-sized real-valued action set $\mathcal{A} \subset \mathbb{R}^d$ with as few arm pulls as possible. Previous methods for R-CPE-MAB assume that the size of the action set $\mathcal{A}$ is polynomial in $d$. We introduce the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that works even when the size of the action set is exponentially large in $d$. We also introduce a novel problem-dependent sample complexity lower bound for the R-CPE-MAB problem, and show that the GenTS-Explore algorithm achieves the optimal sample complexity up to a problem-dependent constant factor.
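To make the problem statement concrete, the following is a minimal sketch of the R-CPE-MAB setting on a toy instance. It is not GenTS-Explore: it uses a naive uniform-sampling baseline that enumerates the whole action set, which is only feasible when the action set is small. The names `best_action`, `uniform_explore`, `true_means`, `actions`, and the Gaussian reward noise are all illustrative assumptions, not taken from the paper.

```python
import random


def best_action(mu, actions):
    """Return the index of the action pi that maximizes the inner product mu^T pi."""
    return max(
        range(len(actions)),
        key=lambda i: sum(m * p for m, p in zip(mu, actions[i])),
    )


def uniform_explore(true_means, actions, pulls_per_arm, rng, noise_std=1.0):
    """Naive baseline (assumption, not the paper's method).

    Pull every arm the same number of times, estimate each mean by the
    sample average, and return the empirically best action.
    """
    estimates = []
    for mean in true_means:
        # Each pull of an arm yields one noisy reward (Gaussian noise assumed).
        rewards = [rng.gauss(mean, noise_std) for _ in range(pulls_per_arm)]
        estimates.append(sum(rewards) / pulls_per_arm)
    return best_action(estimates, actions), estimates


# Toy instance: d = 3 arms and a small real-valued action set in R^3.
true_means = [1.0, 0.2, -0.5]
actions = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 2.0, 1.0)]

rng = random.Random(0)
chosen, estimates = uniform_explore(true_means, actions, pulls_per_arm=2000, rng=rng)
print("optimal action index:", best_action(true_means, actions))
print("identified action index:", chosen)
```

Uniform sampling spends the same budget on every arm regardless of how much that arm matters for separating the optimal action from its competitors, and the `max` over `actions` costs time linear in the size of the action set. Both points are what an adaptive method for exponentially large action sets has to avoid.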
