Stochastic Matching Bandits with Rare Optimization Updates

Abstract

We introduce a bandit framework for stochastic matching under the multinomial logit (MNL) choice model. In our setting, N agents on one side are assigned to K arms on the other side, where each arm stochastically selects an agent from its assigned pool according to unknown preferences and yields a corresponding reward over a horizon T. The objective is to minimize regret by maximizing the cumulative revenue from successful matches. A naive approach requires solving an NP-hard combinatorial optimization problem at every round, resulting in a prohibitive computational cost. To address this challenge, we propose batched algorithms that strategically limit the number of times matching assignments are updated to ( T) over the entire horizon. By invoking expensive combinatorial optimization only on a vanishing fraction of rounds, our algorithms substantially reduce overall computational overhead while still achieving a regret bound of O(T).

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…