Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?
Abstract
Multi-objective bandits have attracted increasing attention for their broad applicability, with \(d\)-dimensional reward vectors inducing Pareto regret. There has been a subtle debate over whether this added structure makes the problem fundamentally harder than single-objective bandits. We answer this by showing that, in terms of Pareto regret, it is surprisingly no harder: Pareto regret scales inversely with \(g\), the largest objective-wise suboptimality gap, and thus matches the smallest objective-wise classical regret. We formalize this idea via a novel method with upper and lower confidence-bound estimators for every arm-objective pair. It uses top-two races to compare arms within each objective and an uncertainty-greedy rule to allocate exploration toward the largest objective-wise gap \(g\), until the corresponding Pareto-optimal arm is committed to. We prove that it achieves Pareto regret of \(O( Tg)\), where \(T\) is the horizon, with no dependence on \(d\). A matching lower bound of \(( Tg)\) implies optimality. We evaluate the method on synthetic and real-world datasets, confirming the theory and achieving order-of-magnitude reductions in Pareto regret over baselines. Real-world results further show that our method commits to a Pareto optimal arm, possibly at the cost of empirical fairness, suggesting a potential hardness absent in single-objective bandits.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.