Scalable Multi-node Fast Fourier Transform on GPUs
Abstract
In this paper, we present the details of our multi-node GPU-FFT library, as well as its scaling on the Selene HPC system. Our library employs slab decomposition for data division and MPI for communication among GPUs. We performed GPU-FFT on $1024^3$, $2048^3$, and $4096^3$ grids using a maximum of 512 A100 GPUs. We observed good scaling for the $4096^3$ grid with 64 to 512 GPUs. We report that the timing of a multicore FFT of a $1536^3$ grid with 196608 cores of a Cray XC40 is comparable to that of GPU-FFT of a $2048^3$ grid with 128 GPUs. The efficiency of GPU-FFT is due to the fast computation capabilities of the A100 card and efficient communication via NVLink.
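To make the slab decomposition mentioned in the abstract concrete, the following is a minimal single-process sketch in NumPy. It is not the library's code: the function name `slab_fft3d` and the parameter `nprocs` are hypothetical, each list entry merely stands in for the slab held by one GPU, and the list reshuffle in step 3 stands in for the all-to-all exchange that a real multi-node code would perform with MPI.

```python
import numpy as np


def slab_fft3d(grid, nprocs):
    """Simulate a slab-decomposed 3D FFT on one process.

    Illustrative assumption: each element of `slabs` plays the role of the
    data owned by one rank (GPU). No MPI or GPU calls are made here.
    """
    n = grid.shape[0]
    assert n % nprocs == 0, "grid size must be divisible by the number of ranks"

    # Step 1: divide the grid into slabs along axis 0, one slab per rank.
    slabs = np.split(grid, nprocs, axis=0)  # each slab: (n/nprocs, n, n)

    # Step 2: each rank transforms the two axes it holds completely.
    slabs = [np.fft.fftn(s, axes=(1, 2)) for s in slabs]

    # Step 3: transpose between ranks (the all-to-all step).
    # Rank r cuts its slab into nprocs chunks along axis 1 and "sends"
    # chunk q to rank q, so every rank ends up with the full axis 0.
    chunks = [np.split(s, nprocs, axis=1) for s in slabs]  # chunks[r][q]
    slabs = [
        np.concatenate([chunks[r][q] for r in range(nprocs)], axis=0)
        for q in range(nprocs)
    ]  # each slab: (n, n/nprocs, n)

    # Step 4: each rank transforms the remaining axis, now fully local.
    slabs = [np.fft.fft(s, axis=0) for s in slabs]

    # Gather the result only so it can be compared with a reference FFT.
    return np.concatenate(slabs, axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8, 8))
    print(np.allclose(slab_fft3d(x, nprocs=4), np.fft.fftn(x)))
```

In this sketch the only step that moves data between ranks is the transpose in step 3, which is why the abstract's emphasis on communication bandwidth matters for scaling.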