Optimizing Allreduce Operations for Modern Heterogeneous Architectures with Multiple Processes per GPU

Abstract

Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing 4 GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This paper presents novel optimizations to large GPU-aware all-reduce operations by extending the lane-aware algorithm to heterogeneous architectures and notably using multiple CPU cores per GPU to accelerate these operations. Using GPUDirect RDMA and host copy communications respectively, these multi-CPU-accelerated GPU-aware all-reduces yield speedups over system MPI of up to 3x on LLNL's Tuolumne supercomputer and up to 2.45x for large MPI all-reduces across the NVIDIA A100 GPUs of NCSA's Delta supercomputer.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…