Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback

Abstract

We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among $K$ candidate links/servers (arms), each of whose performance is a stochastic $d$-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is probe-then-commit (PtC): the agent may probe up to $q>1$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime interpolates between classical bandits ($q=1$) and full-information experts ($q=K$), yet existing multi-objective learning theory largely focuses on these two extremes. We develop PtC-P-UCB, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode: it selects the $q$ probes by approximately maximizing a hypervolume-inspired frontier-coverage potential, and it commits by marginal hypervolume gain to directly expand the attained Pareto region. We prove a dominated-hypervolume frontier error of $\tilde{O}(K_P d/\sqrt{qT})$, where $K_P$ is the Pareto-frontier size and $T$ is the horizon, and a scalarized regret of $\tilde{O}(L_\phi d\sqrt{(K/q)T})$, where $\phi$ is the scalarizer. These bounds quantify a transparent $1/\sqrt{q}$ acceleration from limited probing. We further extend to multi-modal probing: each probe returns $M$ modalities (e.g., CSI, queue, compute telemetry), and uncertainty fusion yields variance-adaptive versions of the above bounds via an effective noise scale.
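
To make the probing and commit rules described in the abstract concrete, the following is a minimal sketch, not the paper's PtC-P-UCB. It assumes d = 2, objectives that are maximized, a reference point at the origin, and optimistic (UCB) vectors that are already computed and passed in. The function names `hypervolume_2d`, `greedy_probe_set`, and `commit` are hypothetical, and greedy selection stands in for whatever approximate maximization of the coverage potential the paper actually uses.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` relative to `ref` (maximization, d = 2 only)."""
    area, best_y = 0.0, ref[1]
    # Sweep from the largest first coordinate down, adding each new strip.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if x > ref[0] and y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def greedy_probe_set(ucb, q):
    """Pick up to q arms greedily by marginal hypervolume of their UCB vectors.

    Assumption: greedy selection is used here as a simple stand-in for
    approximately maximizing a frontier-coverage potential.
    """
    chosen = []
    for _ in range(min(q, len(ucb))):
        base = hypervolume_2d([ucb[i] for i in chosen])
        gains = {
            i: hypervolume_2d([ucb[j] for j in chosen] + [ucb[i]]) - base
            for i in range(len(ucb))
            if i not in chosen
        }
        # Ties are broken in favor of the lowest arm index.
        chosen.append(max(gains, key=gains.get))
    return chosen


def commit(observed, attained):
    """Commit to the probed arm whose observed outcome adds the most
    hypervolume to the set of outcomes attained so far.

    `observed` maps arm index -> observed 2-D outcome of this round's probes.
    """
    base = hypervolume_2d(attained)
    return max(
        observed,
        key=lambda i: hypervolume_2d(attained + [observed[i]]) - base,
    )


# Illustrative values only (not from the paper).
ucb_vectors = [(1.0, 0.5), (0.5, 1.0), (0.4, 0.4), (0.9, 0.45)]
probes = greedy_probe_set(ucb_vectors, q=2)
choice = commit({0: (0.8, 0.3), 1: (0.3, 0.9)}, attained=[(0.7, 0.6)])
```

In this toy instance the two probes go to the arms whose optimistic vectors jointly cover the most area, and arms dominated by an already-chosen probe contribute zero marginal gain. For d > 2 the sweep in `hypervolume_2d` no longer applies and a general hypervolume routine would be needed.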
