Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles

Zhiyi Zeng

Continuum-armed Bandit Optimization with Batch Pairwise Comparison Oracles

Abstract

This paper studies a bandit optimization problem where the goal is to maximize a function f(x) over T periods for some unknown strongly concave function f. We consider a new pairwise comparison oracle, where the decision-maker chooses a pair of actions (x, x') for a consecutive number of periods and then obtains an estimate of f(x)-f(x'). We show that such a pairwise comparison oracle finds important applications to joint pricing and inventory replenishment problems and network revenue management. The challenge in this bandit optimization is twofold. First, the decision-maker not only needs to determine a pair of actions (x, x') but also a stopping time n (i.e., the number of queries based on (x, x')). Second, motivated by our inventory application, the estimate of the difference f(x)-f(x') is biased, which is different from existing oracles in stochastic optimization literature. To address these challenges, we first introduce a discretization technique and local polynomial approximation to relate this problem to linear bandits. Then we developed a tournament successive elimination technique to localize the discretized cell and run an interactive batched version of LinUCB algorithm on cells. We establish regret bounds that are optimal up to poly-logarithmic factors. Furthermore, we apply our proposed algorithm and analytical framework to the two operations management problems and obtain results that improve state-of-the-art results in the existing literature.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Or compile a full topic from this idea

Discussion (0)

Sign in to join the discussion.

Loading comments…