On Pareto Optimality for Parametric Choice Bandits

Abstract

We study online assortment optimization under stochastic choice when a decision maker simultaneously values cumulative revenue performance and the quality of post-hoc inference on revenue contrasts. We analyze a forced-exploration optimism-in-the-face-of-uncertainty (OFU) scheme that combines two regularized maximum-likelihood estimators: one based on all observations for sequential decision making, and one based only on exploration rounds for inference. Our general theory is developed under predictable score proxies and per-round action-dependent curvature domination. Under these conditions we establish a self-normalized concentration inequality, a likelihood-based ellipsoidal confidence-set theorem, and a regret bound for approximate optimistic actions that explicitly accounts for optimization error. For the multinomial logit (MNL) model we derive explicit score and curvature proxies and show that a balanced spaced singleton-exploration schedule yields realized coordinate coverage, implying regret (nT + T/nT) and revenue-contrast error (1/nT) up to fixed problem-dependent factors. A hard two-assortment subclass yields a matching lower bound at the product level. Consequently, within the polynomial exploration family nT Tα, the regret and inference rates become (T\α,1-α/2\) and (T-α/2), respectively; hence α∈[2/3,1) is the rate-wise Pareto-undominated interval and α=2/3 is the unique balancing point that minimizes the regret exponent. Finally, for the Exponomial Choice and Nested Logit models we state verifiable sufficient conditions that would instantiate the general framework.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…