On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
Abstract
We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain, i.e., $\mathrm{dom}(u) \neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of the state-action space, and for a more restricted class of utilities, we derive lower bounds that make the dependence on the effective horizon $1/(1-\gamma)$ explicit. Specifically, for $\mathrm{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $1/\tau^2$, thus improving by a factor of $1/\tau$ over the state of the art, although our bound has a suboptimal dependence on $1/(1-\gamma)$.
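For readers new to the OCE, the following sketch illustrates the standard static definition for rewards, OCE_u(X) = sup_λ { λ + E[u(X − λ)] }, on a small discrete distribution. It is only an illustration under stated assumptions: the abstract does not fix a sign convention, so the reward-based form is assumed here; the function name `oce`, the search interval, and the example distribution are hypothetical; and the code evaluates a one-step OCE, not the recursive objective or the model-based method analyzed in the paper.

```python
import math


def oce(values, probs, u, lo, hi, iters=200):
    """Approximate OCE_u(X) = sup_lam { lam + E[u(X - lam)] } by ternary search.

    Assumptions (not taken from the paper): u is concave, so the objective
    is concave in lam, and a maximizer lies in the interval [lo, hi].
    """
    def objective(lam):
        return lam + sum(p * u(x - lam) for x, p in zip(values, probs))

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1  # a maximizer lies to the right of m1
        else:
            hi = m2  # a maximizer lies to the left of m2
    return objective((lo + hi) / 2)


# Hypothetical example: a reward uniform on {0, 1, 2, 3}.
values = [0.0, 1.0, 2.0, 3.0]
probs = [0.25, 0.25, 0.25, 0.25]

# CVaR at level tau corresponds to the utility u(t) = min(t, 0) / tau.
tau = 0.5
cvar = oce(values, probs, lambda t: min(t, 0.0) / tau, min(values), max(values))

# Entropic risk with parameter beta corresponds to u(t) = (1 - exp(-beta t)) / beta,
# for which the OCE has the closed form -(1/beta) * log E[exp(-beta X)].
beta = 1.0
entropic = oce(
    values, probs, lambda t: (1.0 - math.exp(-beta * t)) / beta,
    min(values), max(values),
)
entropic_closed_form = -math.log(
    sum(p * math.exp(-beta * x) for x, p in zip(values, probs))
) / beta
```

With tau = 0.5, `cvar` approximates the mean of the worst half of the outcomes, and `entropic` should agree with `entropic_closed_form` up to numerical tolerance. Ternary search is used only because the objective is concave in λ; any one-dimensional concave maximizer would serve equally well.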