Soft-to-Hard Routing in Sparse Mixture-of-Experts Models
Abstract
Softmax routing approaches hard top-1 routing as the temperature tends to zero, but the limiting passage is singular at router ties. This paper develops a boundary-layer calculus for this soft-to-hard limit in population squared-loss mixture-of-experts regression. For a router with logits a_k(x; ϕ), the relevant local quantity is the top-two margin Δ(x; ϕ), and the relevant global quantity is the boundary mass P(Δ(X; ϕ) ≤ w). Under smoothness and transversality assumptions, coarea and tubular-neighborhood estimates show how this mass scales with the slab width w; in the binary case the leading coefficient is an explicit surface integral over the routing interface. These geometric estimates give quantitative bounds between the soft objective L_τ and the hard objective L_0, including an O(τ^α) uniform comparison under a margin-tail condition, and yield Γ-convergence of the soft objectives on compact parameter spaces. The main conclusion is that the zero-temperature approximation is controlled by the probability carried by an O(τ) neighborhood of the routing interfaces, not by temperature alone. After isolating this boundary-layer part of the problem, we record a conditional landscape-transfer theorem from hard to small-temperature soft routing and a reduced two-expert Gaussian calculation illustrating local symmetry breaking. Synthetic diagnostics are included only as controlled checks of the boundary-layer predictions.
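The quantities named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's code or experimental setup: the function names, the toy binary router with logits (x, -x), the standard normal inputs, and the reading of boundary mass as P(Δ ≤ w). It computes softmax routing weights at temperature τ, the hard top-1 assignment, the top-two margin, and a Monte Carlo estimate of the boundary mass.

```python
import math
import random

def soft_route(logits, tau):
    """Softmax routing weights at temperature tau > 0 (max-shifted for stability)."""
    m = max(logits)
    exps = [math.exp((a - m) / tau) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_route(logits):
    """One-hot top-1 routing; ties go to the lowest index (an arbitrary choice)."""
    k = max(range(len(logits)), key=lambda i: logits[i])
    return [1.0 if i == k else 0.0 for i in range(len(logits))]

def top_two_margin(logits):
    """Gap between the largest and second-largest logit."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second

def boundary_mass(samples, router, w):
    """Monte Carlo estimate of P(margin(X) <= w) over the given samples."""
    hits = sum(1 for x in samples if top_two_margin(router(x)) <= w)
    return hits / len(samples)

def toy_router(x):
    """Hypothetical binary router on scalar inputs: margin is 2|x|, interface at x = 0."""
    return [x, -x]

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Boundary mass shrinks as the slab width w shrinks.
for w in (0.4, 0.2, 0.1):
    print("w =", w, "boundary mass ~", boundary_mass(samples, toy_router, w))

# At a fixed input away from the interface, soft weights approach the one-hot vector.
logits = toy_router(0.5)  # margin 1.0
for tau in (1.0, 0.1, 0.01):
    soft, hard = soft_route(logits, tau), hard_route(logits)
    print("tau =", tau, "max weight gap =", max(abs(s - h) for s, h in zip(soft, hard)))
```

In the two-expert case the soft weight on the losing expert is 1 / (1 + exp(Δ/τ)), so the soft and hard assignments differ appreciably only where the margin Δ is of order τ. This is why, in this toy setting, the discrepancy is governed by how much probability sits near the interface rather than by τ alone.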