RDMACell: Token-Based Flowcell-Level RDMA Load Balancing for Large-Scale AI Training

Abstract

Remote Direct Memory Access (RDMA) is a core technology for high-performance data center networks. However, its default Equal-Cost Multi-Path (ECMP) load-balancing mechanism often suffers from severe performance degradation due to hash collisions and elephant flows. Existing solutions have obvious limitations: flowlet-based approaches lack the necessary time gaps to trigger flowlet switching, while packet-level approaches suffer from severe packet reordering problems. On the other hand, although in-network programmable solutions can achieve fine-grained control, they are difficult to deploy in commercial general-purpose infrastructures. This paper presents RDMACell, a host-side flowcell-level load-balancing system. It proactively splits flows into multiple flowcells, distributes them across multiple paths, and achieves real-time feedback through atomic dual Work Queue Elements (WQEs). With token-based receiver-side control, RDMACell enables microsecond-level path switching and extremely low packet loss without modifying NIC firmware or switch hardware. The ns-3 simulations in a fat-tree topology show that under 80% network load with all-to-all traffic, RDMACell reduces 99th percentile FCT by 44% compared with ECMP and by 42.2% compared with LetFlow, while achieving tail latency performance comparable to state-of-the-art in-network solutions.
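
To illustrate why splitting flows into flowcells can even out path load compared with per-flow ECMP hashing, the following is a minimal toy sketch and not the RDMACell implementation. Everything in it is an assumption made for illustration: the function names, the CRC32 stand-in for a switch's 5-tuple hash, the fixed flowcell size, and the "least-loaded path" rule, which is used here in place of the token-based receiver-side feedback that the abstract mentions but does not specify.

```python
import zlib
from typing import Dict, List


def ecmp_load(flows: Dict[str, int], n_paths: int) -> List[int]:
    """Per-flow hashing: every byte of a flow lands on one hashed path."""
    load = [0] * n_paths
    for flow_id, size in flows.items():
        # CRC32 is a hypothetical stand-in for a switch's 5-tuple hash.
        path = zlib.crc32(flow_id.encode()) % n_paths
        load[path] += size
    return load


def flowcell_load(flows: Dict[str, int], n_paths: int, cell_bytes: int) -> List[int]:
    """Split each flow into cells of at most cell_bytes and spread them.

    Assumption: each cell goes to the currently least-loaded path. This
    greedy rule only approximates feedback-driven path selection.
    """
    load = [0] * n_paths
    for size in flows.values():
        remaining = size
        while remaining > 0:
            cell = min(cell_bytes, remaining)
            path = min(range(n_paths), key=load.__getitem__)
            load[path] += cell
            remaining -= cell
    return load


# Hypothetical workload: four equal elephant flows over four paths.
flows = {f"flow-{i}": 1_000_000 for i in range(4)}
ecmp = ecmp_load(flows, n_paths=4)
cells = flowcell_load(flows, n_paths=4, cell_bytes=64 * 1024)

print("ECMP per-path bytes:    ", ecmp)
print("Flowcell per-path bytes:", cells)
```

Under per-flow hashing, the busiest path always carries at least one whole elephant flow, and more whenever two flows hash to the same path. With the greedy flowcell rule in this sketch, the gap between the busiest and idlest path never exceeds one flowcell.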
