Adversarial Online Multi-Task Reinforcement Learning

Nishant A. Mehta

Abstract

We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(K/\lambda^2)$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains $\tilde{O}(K/\lambda^2)$ sample complexity guarantee for the clustering phase and $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $1/\lambda^2$ is tight.
