Joint Training on AMD and NVIDIA GPUs

Zhendong Yu

Joint Training on AMD and NVIDIA GPUs

Abstract

As large language models continue to scale, training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-Forwarding Communication, with differentiated communication back-end selection across parallel groups and multi-NIC parallel data transfer. To achieve higher performance, we further propose another Device-Direct Communication approach, integrating a CPU-offloading P2P mechanism to enable direct cross-vendor GPU data transfer without host-memory staging. Experiments on LLaMA-8B and Qwen2-7B demonstrate that the proposed Device-Direct Communication approach achieves up to 98% of the throughput of an NVIDIA homogeneous system, while preserving training stability and correctness.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Or open the topic learn hub

Discussion (0)

Sign in to join the discussion.

Loading comments…