Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

Abstract

We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named DUCRL2 to obtain a near-optimal policy for this setting and show that it achieves a regret bound of O(DSAT + d (SA)3) where S and A are the sizes of the state and action spaces, respectively, D is the diameter of the MDP, d is a parameter upper bounded by the maximum reward delay, and T denotes the time horizon. This demonstrates the optimality of the bound in the order of T, and an additive impact of the delay.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…