Overcoming Environmental Meta-Stationarity in MARL via Adaptive Curriculum and Counterfactual Group Advantage

Abstract

Multi-agent reinforcement learning (MARL) has reached competitive performance on cooperative tasks against scripted adversaries, yet most methods train agents at a single fixed difficulty throughout the entire run. We term this static-difficulty regime environmental meta-stationarity and show that it caps policy generalization and steers learning toward shallow local optima. To break this regime, we propose CL-MARL, a dynamic curriculum learning framework that adapts opponent strength online from win-rate signals, advancing or regressing the task as agents master it. Its scheduler, FlexDiff, fuses momentum-based trend estimation with sliding-window dual-curve monitoring of training and evaluation returns, yielding stable difficulty transitions without manual tuning. Because a moving curriculum amplifies non-stationarity and sparsifies global rewards, we introduce the Counterfactual Group Relative Policy Advantage (CGRPA), which extends GRPO-style group-relative optimization with counterfactual baselines to disentangle each agent's contribution under shifting team dynamics. On the StarCraft Multi-Agent Challenge (SMAC), CL-MARL attains a 40% mean win rate on the super-hard maps with an average episode return of 17.85, exceeding the QMIX, OW-QMIX, DER, EMC, and MARR baselines by +2.94 on average, while reaching its peak win rate roughly 1.28faster on 8mvs9m and 1.42 faster on 3s5zvs3s6z than the strongest baseline. The implementation is publicly available at https://github.com/NICE-HKU/CL2MARL-SMAC.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…