Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference
Abstract
Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such a model for online inference remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, task information is often unavailable during online inference, making task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, Tanbr, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, Tanbr estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, Tanbr employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weights to model performance and to decide the optimal expert merging. We prove that Tanbr achieves a sublinear regret bound of $\mathcal{O}(\sqrt{T}\log(T))$ over $T$ rounds, despite operating over a continuous decision space, matching the regret bounds of existing methods. Extensive experiments show that Tanbr reduces inference latency by at least 45% and memory usage by up to 25%, while maintaining high accuracy compared to many state-of-the-art methods.
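To make the tree-based search over merging weights concrete, the following is a minimal sketch, not the paper's implementation. It assumes a one-dimensional merging weight w in [0, 1] that blends two experts by a convex combination of their parameters, whereas the abstract describes a larger continuous weight space. The names `Node`, `refine`, `merge_experts`, and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the neural bandit's learned estimate of model performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One cell of the partition: an interval [lo, hi] of merging weights."""
    lo: float
    hi: float
    depth: int = 0

    @property
    def center(self) -> float:
        # The candidate merging weight represented by this cell.
        return (self.lo + self.hi) / 2.0

    def split(self):
        # Halve the interval, producing two finer cells one level deeper.
        mid = self.center
        return (Node(self.lo, mid, self.depth + 1),
                Node(mid, self.hi, self.depth + 1))


def refine(leaves, index):
    """Replace the leaf at `index` with its two children."""
    left, right = leaves[index].split()
    return leaves[:index] + [left, right] + leaves[index + 1:]


def merge_experts(expert_a, expert_b, w):
    """Convex combination of two flat parameter lists (assumed merge rule)."""
    return [w * a + (1.0 - w) * b for a, b in zip(expert_a, expert_b)]


def toy_score(w):
    # Hypothetical stand-in for the bandit's performance estimate:
    # a smooth function that peaks at w = 0.3.
    return -(w - 0.3) ** 2


# Start from the whole weight space and repeatedly split the most promising cell.
leaves = [Node(0.0, 1.0)]
for _ in range(3):
    best = max(range(len(leaves)), key=lambda i: toy_score(leaves[i].center))
    leaves = refine(leaves, best)

candidates = [leaf.center for leaf in leaves]
merged = merge_experts([1.0, 2.0], [3.0, 4.0], 0.5)
```

In this sketch the partition becomes finer only around weights that score well, so the candidate set stays small while resolution grows where it matters. A real router would replace `toy_score` with a learned model and would need an exploration term rather than a purely greedy choice.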