Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training

Abstract

We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gammaeff, that quantifies the balance-quality tradeoff. Tracking gammaeff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gammaeff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gammaeff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). The game is not a better predictor; it reveals what the temperature means and, critically, how that temperature evolves. We complement the dynamics with an effective congestion decomposition, a multi-type extension that improves load prediction via token clustering on all 16 layers (mean: 30%), scope diagnostics (K/M, epsilonl), and robustness verification across four independent quality estimators (r >= 0.89). All confidence intervals are from bootstrap resampling over 50 independent text batches.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…