Online Budget Allocation with Censored Semi-Bandit Feedback

Abstract

We study a stochastic budget-allocation problem over K tasks. At each round t, the learner chooses an allocation X_t = (X_{t,1}, …, X_{t,K}) across the tasks. Task k succeeds with probability F_k(X_{t,k}), where F_1, …, F_K are nondecreasing budget-to-success curves, and upon success yields a random reward with unknown mean μ_k. The learner observes which tasks succeed, and observes a task's reward only upon success (censored semi-bandit feedback). This model captures, for instance, splitting payments across crowdsourcing workers or distributing bids across simultaneous auctions, and subsumes stochastic multi-armed bandits and semi-bandits. We design an optimism-based algorithm that operates under censored semi-bandit feedback. Our main result shows that in diminishing-returns regimes, the regret of this algorithm scales polylogarithmically with the horizon T without any ad hoc tuning. For general nondecreasing curves, we prove that the same algorithm (with the same tuning) achieves a worst-case regret upper bound of O(K√T). Finally, we establish a matching worst-case regret lower bound of Ω(K√T) that holds even for full-feedback algorithms, highlighting the intrinsic hardness of our problem outside diminishing returns.
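
The feedback model described above can be illustrated with a minimal simulation of a single round. This is only a sketch under stated assumptions, not the paper's algorithm: the function name `censored_round`, the two example curves, the Bernoulli reward distribution, and the unit total budget are all hypothetical choices made for illustration.

```python
import random


def censored_round(allocation, curves, reward_means, rng):
    """Simulate one round of censored semi-bandit feedback.

    Returns one (success, reward) pair per task; reward is None
    whenever the task fails, since it is then not observed.
    """
    feedback = []
    for x_k, f_k, mu_k in zip(allocation, curves, reward_means):
        # Task k succeeds with probability F_k(x_k).
        success = rng.random() < f_k(x_k)
        if success:
            # Assumption: rewards are Bernoulli with mean mu_k.
            reward = 1.0 if rng.random() < mu_k else 0.0
        else:
            reward = None  # censored: nothing is revealed on failure
        feedback.append((success, reward))
    return feedback


# Hypothetical nondecreasing budget-to-success curves.
curves = [lambda x: min(1.0, x ** 0.5), lambda x: min(1.0, 0.5 * x)]
reward_means = [0.6, 0.9]  # unknown to the learner in the actual problem

rng = random.Random(0)
# Assumption: the learner splits a total budget of 1 across two tasks.
feedback = censored_round([0.25, 0.75], curves, reward_means, rng)
```

Each entry of `feedback` records whether a task succeeded and, only in that case, its realized reward. A learner in this setting would have to estimate both the curves F_k and the means μ_k from such partially observed pairs.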
