MinMaxMin Q-learning

Abstract

MinMaxMin Q-learning is a novel optimistic Actor-Critic algorithm that addresses the overestimation bias (Q-estimates exceeding the true Q-values) inherent in conservative RL algorithms. Its core formula relies on the disagreement among Q-networks, measured as the min-batch MaxMin Q-networks distance, which is added to the Q-target and used as the sampling rule for prioritized experience replay. We implement MinMaxMin on top of TD3 and TD7 and test it against state-of-the-art continuous-space algorithms (DDPG, TD3, and TD7) on popular MuJoCo and Bullet environments. The results show a consistent performance improvement of MinMaxMin over DDPG, TD3, and TD7 across all tested tasks.
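
The disagreement term can be illustrated with a minimal sketch. The abstract gives no formula, so this reads the name literally: for each sample in a batch, take the gap between the largest and smallest Q-network estimate (MaxMin), then take the minimum of those gaps over the batch. The function names, the clipped double-Q style bootstrap, the way the bonus enters the target, and the reuse of one set of Q-values for both steps are all assumptions made for illustration, not the authors' implementation. The prioritized-replay sampling rule is not attempted here.

```python
from typing import List, Sequence


def maxmin_gaps(q_values: Sequence[Sequence[float]]) -> List[float]:
    """Per-sample disagreement: max over networks minus min over networks.

    q_values[i][b] is the estimate of (hypothetical) Q-network i for sample b.
    """
    per_sample = zip(*q_values)  # regroup as one tuple of estimates per sample
    return [max(qs) - min(qs) for qs in per_sample]


def minmaxmin(q_values: Sequence[Sequence[float]]) -> float:
    """Minimum over the batch of the per-sample MaxMin gaps (assumed reading)."""
    return min(maxmin_gaps(q_values))


def optimistic_targets(
    rewards: Sequence[float],
    next_q_values: Sequence[Sequence[float]],
    dones: Sequence[bool],
    gamma: float,
    bonus: float,
) -> List[float]:
    """Illustrative target: pessimistic (min over networks) bootstrap plus a bonus.

    Adding the bonus as a plain additive term is an assumption.
    """
    targets = []
    for b, (reward, done) in enumerate(zip(rewards, dones)):
        pessimistic = min(q[b] for q in next_q_values)
        not_done = 0.0 if done else 1.0
        targets.append(reward + gamma * not_done * pessimistic + bonus)
    return targets


# Two hypothetical Q-networks evaluated on a batch of three samples.
q = [
    [1.0, 2.0, 3.0],
    [1.5, 2.25, 5.0],
]
gaps = maxmin_gaps(q)   # [0.5, 0.25, 2.0]
bonus = minmaxmin(q)    # 0.25
targets = optimistic_targets(
    rewards=[1.0, 0.0, 2.0],
    next_q_values=q,    # reused here only to keep the example short
    dones=[False, False, True],
    gamma=0.5,
    bonus=bonus,
)                       # [1.75, 1.25, 2.25]
```

Taking the minimum over the batch keeps the bonus small whenever at least one sample shows close agreement among the networks, so under this reading the optimism added to the target is bounded by the least-disputed sample in the batch.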


