Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

Abstract

We present an algorithm based on the Optimism in the Face of Uncertainty (OFU) principle that learns efficiently in reinforcement learning (RL) problems modeled as a Markov decision process (MDP) with a finite state-action space. By evaluating the state-pair difference of the optimal bias function $h^*$, the proposed algorithm achieves a regret bound of $\tilde{O}(\sqrt{SAHT})$ (where $\tilde{O}$ denotes $O$ with logarithmic factors ignored) for an MDP with $S$ states and $A$ actions, provided that an upper bound $H$ on the span of $h^*$, i.e., $sp(h^*)$, is known. This result improves on the best previous regret bound of $\tilde{O}(S\sqrt{AHT})$ [fruit2019improved] by a factor of $\sqrt{S}$. Furthermore, this regret bound matches the lower bound of $\Omega(\sqrt{SAHT})$ [jaksch2010near] up to a logarithmic factor. As a consequence, we show that there is a near-optimal regret bound of $\tilde{O}(\sqrt{SADT})$ for MDPs with a finite diameter $D$, compared to the lower bound of $\Omega(\sqrt{SADT})$ [jaksch2010near].
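
To make the claimed improvement concrete, the following minimal sketch compares the two rates numerically, taking the new rate as $\sqrt{SAHT}$ and the previous rate as $S\sqrt{AHT}$, with constants and logarithmic factors dropped. The function names and the sample values of S, A, H, and T are illustrative assumptions, not taken from the paper; the code only evaluates the rates and does not implement the algorithm.

```python
import math


def new_rate(S, A, H, T):
    """sqrt(S*A*H*T): the rate of the proposed bound, constants and logs dropped."""
    return math.sqrt(S * A * H * T)


def previous_rate(S, A, H, T):
    """S*sqrt(A*H*T): the rate of the previous bound, constants and logs dropped."""
    return S * math.sqrt(A * H * T)


# Hypothetical problem sizes, chosen only for illustration.
S, A, H, T = 100, 5, 10, 10**6  # states, actions, span bound, time steps

ratio = previous_rate(S, A, H, T) / new_rate(S, A, H, T)

# The ratio of the two rates is sqrt(S), independent of A, H, and T.
print(f"previous / new = {ratio:.4f}, sqrt(S) = {math.sqrt(S):.4f}")
```

Because the ratio depends only on S, the gap between the two rates widens as the state space grows and is unaffected by the number of actions, the span bound, or the number of time steps.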
