Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

Abstract

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE), that is, the problem of evaluating a new policy using historical data obtained by different behavior policies, under the model of nonstationary episodic Markov Decision Processes (MDPs) with a long horizon and a large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of

$$\frac{1}{n} \sum_{t=1}^{H} \mathbb{E}_{\mu}\left[ \frac{d_t^{\pi}(s_t)^2}{d_t^{\mu}(s_t)^2} \, \mathrm{Var}_{\mu}\left[ \frac{\pi_t(a_t \mid s_t)}{\mu_t(a_t \mid s_t)} \left( V_{t+1}^{\pi}(s_{t+1}) + r_t \right) \,\middle|\, s_t \right] \right] + O(n^{-1.5}),$$

where $\mu$ and $\pi$ are the logging and target policies, $d_t^{\mu}(s_t)$ and $d_t^{\pi}(s_t)$ are the marginal distributions of the state at the $t$-th step, $H$ is the horizon, $n$ is the sample size, and $V_{t+1}^{\pi}$ is the value function of the MDP under $\pi$. The result matches the Cramér-Rao lower bound in Jiang and Li (2016) up to a multiplicative factor of $H$. To the best of our knowledge, this is the first OPE estimation error bound with a polynomial dependence on $H$. Besides theory, we show the empirical superiority of our method in time-varying, partially observable, and long-horizon RL environments.
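The recursive estimation of the state marginals described above can be illustrated for a small tabular MDP. The following is a minimal sketch under stated assumptions, not the authors' implementation: states and actions are assumed to be integer-coded, both policies are assumed known and stored as arrays of shape `(H, S, A)`, and the function name `mis_estimate` and its arguments are hypothetical. At each step it forms importance-weighted estimates of the per-state reward and the state-to-state transition under the target policy, then pushes the estimated marginal forward one step.

```python
import numpy as np

def mis_estimate(states, actions, rewards, pi, mu, num_states):
    """Tabular marginalized-IS sketch (hypothetical helper, illustration only).

    states, actions: int arrays of shape (n, H); rewards: float array (n, H).
    pi, mu: target and logging policies, arrays of shape (H, S, A).
    """
    n, H = states.shape
    # Step-1 marginal: the initial state distribution does not depend on the policy.
    d_pi = np.bincount(states[:, 0], minlength=num_states) / n
    value = 0.0
    for t in range(H):
        s, a, r = states[:, t], actions[:, t], rewards[:, t]
        # Single-step importance weight pi_t(a|s) / mu_t(a|s) for each episode.
        rho = pi[t, s, a] / mu[t, s, a]
        counts = np.bincount(s, minlength=num_states)
        safe = np.maximum(counts, 1)  # avoid division by zero for unvisited states
        # Importance-weighted mean reward per state under the target policy.
        r_hat = np.bincount(s, weights=rho * r, minlength=num_states) / safe
        value += d_pi @ r_hat
        if t + 1 < H:
            # Importance-weighted state-to-state transition estimate under pi.
            P_hat = np.zeros((num_states, num_states))
            np.add.at(P_hat, (s, states[:, t + 1]), rho)
            P_hat /= safe[:, None]
            # Recursive update of the estimated state marginal.
            d_pi = d_pi @ P_hat
    return value
```

One sanity check on this sketch: when the target and logging policies coincide, every weight equals 1 and the estimate reduces to the plain average of the observed returns. Because only single-step weights appear, no product of H importance ratios is ever formed, which is the intuition behind avoiding the exponential dependence on the horizon.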
