Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Abstract

We study episodic reinforcement learning with fixed reward and transition functions but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, \( \sum_{k=1}^{K} \big[ V^{*,M_k} - V^{\pi_k,M_k} \big] \), where \( M_k \) denotes the action context in the \( k \)-th episode. We show that the MVP algorithm extends naturally to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of \( O(\sqrt{SAH^3 K \log L}) \) for adversarial contexts, where \( L \) denotes the number of possible contexts. This result implies a regret bound of \( O(\sqrt{SAH^3 K}) \) for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of \( O(SAH^3/\varepsilon^2) \) for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ O\Big( \inf_{p \in [0,1)} \Big( \frac{1}{\Delta_p} + pK\Delta_p \Big) \log K \cdot \mathrm{poly}(S,A,H) \Big), \] where \( \Delta_p \) is the global \( p \)-trimmed positive-gap floor over suboptimal \( (h,s,a) \) triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
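To make the regret definition concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes a small tabular MDP whose rewards and transitions are also stationary across steps within an episode (an extra simplification), and it represents each episode's action context as a boolean mask over state-action pairs. The names `optimal_value`, `policy_value`, and `cumulative_regret` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def optimal_value(P, R, mask, H):
    """Optimal H-step value when only actions with mask[s, a] == True are allowed.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    mask: (S, A) boolean array with at least one True per state (assumed).
    """
    V = np.zeros(R.shape[0])
    for _ in range(H):
        Q = R + P @ V                      # (S, A) one-step lookahead
        Q = np.where(mask, Q, -np.inf)     # exclude inadmissible actions
        V = Q.max(axis=1)
    return V

def policy_value(P, R, policy, H):
    """Value of a deterministic policy; policy[h, s] is the action at step h."""
    S = R.shape[0]
    idx = np.arange(S)
    V = np.zeros(S)
    for h in reversed(range(H)):
        a = policy[h]
        V = R[idx, a] + P[idx, a] @ V
    return V

def cumulative_regret(P, R, masks, policies, H, s0=0):
    """Sum over episodes of the optimal value under that episode's action
    set minus the value of the policy actually played, from state s0."""
    total = 0.0
    for mask, policy in zip(masks, policies):
        total += optimal_value(P, R, mask, H)[s0] - policy_value(P, R, policy, H)[s0]
    return total

# Toy instance (assumed): one state, two actions with rewards 1.0 and 0.5.
P = np.ones((1, 2, 1))
R = np.array([[1.0, 0.5]])
H = 3
masks = [np.array([[True, True]]), np.array([[False, True]])]
always_second = np.ones((H, 1), dtype=int)   # plays action 1 at every step
regret = cumulative_regret(P, R, masks, [always_second, always_second], H)
```

In this toy instance the learner loses value only in the first episode, where the better action was admissible; in the second episode the comparator is itself restricted to the remaining action, so that episode contributes zero regret.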
