Adversarial Multi-dueling Bandits
Abstract
We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner selects $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm, which is determined by an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback, which is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting, which demonstrates that our proposed algorithm is near-optimal, that is, optimal up to a logarithmic factor in $K$.
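To make the setting concrete, the following is a minimal Python sketch of one round of multi-dueling feedback and of the Borda scores that define the comparator. It is an illustration under stated assumptions, not the MiDEX algorithm. The function names, the example matrix `P`, and the selected arms are hypothetical. The Borda score is taken in its usual form (the average probability of beating another arm), and the winner distribution is one common form of a pairwise-subset choice model, which the abstract does not spell out.

```python
import random

def borda_scores(P):
    """Borda score of arm i: average of P[i][j] over all other arms j."""
    K = len(P)
    return [sum(P[i][j] for j in range(K) if j != i) / (K - 1) for i in range(K)]

def winner_probabilities(P, selected):
    """Assumed pairwise-subset choice model (not specified in the abstract):
    the arm in position a wins with probability
    2 / (m (m - 1)) * sum over other positions b of P[selected[a]][selected[b]]."""
    m = len(selected)
    norm = 2.0 / (m * (m - 1))
    return [
        norm * sum(P[selected[a]][selected[b]] for b in range(m) if b != a)
        for a in range(m)
    ]

# Hypothetical preference matrix for K = 4 arms; P[i][j] is the probability
# that arm i is preferred to arm j, so P[i][j] + P[j][i] = 1.
P = [
    [0.5, 0.6, 0.7, 0.8],
    [0.4, 0.5, 0.6, 0.7],
    [0.3, 0.4, 0.5, 0.6],
    [0.2, 0.3, 0.4, 0.5],
]

scores = borda_scores(P)
borda_winner = max(range(len(P)), key=lambda i: scores[i])

# One round: the learner picks m = 3 arms and only sees which one wins.
selected = [0, 2, 3]
probs = winner_probabilities(P, selected)
rng = random.Random(0)
observed_winner = rng.choices(selected, weights=probs, k=1)[0]
```

In the adversarial setting the matrix `P` may change from round to round, fixed in advance by an oblivious adversary; the sketch holds it constant only to keep the example short.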