DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Abstract
Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-k routing imposes a rigid sparsity pattern that ignores the intrinsic variance in token difficulty and layer-specific computational needs. Top-p routing is more adaptive because it selects experts until their cumulative routing probability reaches a threshold, allowing confident tokens to use fewer experts and ambiguous tokens to recruit more. However, we demonstrate that existing naive Top-p implementations with fixed global probability thresholds provide only marginal gains over Top-k, suffer from hyperparameter sensitivity, and result in uncontrolled computational costs. In this paper, we propose **DTop-p**, a sparsity-controllable dynamic routing mechanism that learns the Top-p probability threshold with a Proportional-Integral controller and uses dynamic routing normalization to support layer-wise expert selection under a global sparsity constraint. Extensive experiments on Large Language Models and Diffusion Transformers demonstrate that **DTop-p** consistently outperforms both Top-k and fixed Top-p baselines while matching the average FLOPs of Top-k MoE. Our analysis confirms that **DTop-p** exhibits strong scaling properties across expert granularity, total expert capacity, model size, and dataset size, offering a robust and efficient MoE framework for foundation model pre-training.
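To make the routing rule concrete, the following is a minimal pure-Python sketch of Top-p expert selection for a single token, together with a generic Proportional-Integral update that nudges the threshold toward a target average expert count. It is an illustration under stated assumptions, not the paper's implementation: the names `top_p_select` and `PIThreshold`, the gains `kp` and `ki`, the clamping bounds, and the choice of "average experts per token" as the controlled signal are all hypothetical, since the abstract does not specify the controller's exact form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(probs, p):
    """Return expert indices, most probable first, taken until their
    cumulative routing probability reaches the threshold p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen


class PIThreshold:
    """Hypothetical PI controller for the Top-p threshold (an assumption,
    not the paper's formulation). It raises p when tokens use fewer
    experts than the target on average, and lowers p when they use more."""

    def __init__(self, p_init, target_k, kp=0.01, ki=0.001,
                 p_min=0.05, p_max=0.99):
        self.p_init = p_init
        self.p = p_init
        self.target_k = target_k
        self.kp = kp
        self.ki = ki
        self.p_min = p_min
        self.p_max = p_max
        self.integral = 0.0

    def update(self, observed_avg_k):
        """Position-form PI update: p = p_init + kp * e + ki * sum(e)."""
        error = self.target_k - observed_avg_k
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        self.p = min(self.p_max, max(self.p_min, raw))
        return self.p


# A confident token needs one expert; an ambiguous token recruits more.
confident = top_p_select([0.7, 0.2, 0.1], p=0.6)
ambiguous = top_p_select([0.4, 0.35, 0.25], p=0.6)
```

With the same threshold of 0.6, the confident distribution is covered by a single expert while the flatter one needs two, which is the adaptivity the abstract describes. The controller part shows only the general idea of trading off the threshold against a sparsity target so that average compute can be held near that of a Top-k baseline.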