Understanding Communication Backends in Cross-Silo Federated Learning

Salman Avestimehr

Understanding Communication Backends in Cross-Silo Federated Learning

Abstract

Federated learning (FL) has emerged as a practical means for privacy-preserving distributed machine learning. FL's versatile design makes it suitable for various training settings, from IoT edge devices in cross-device FL to powerful servers in cross-silo FL. A key consequence of this versatility is the high level of diversity found in the networking configuration of FL applications. Coupled with the rising demand for large-scale models such as large language models, well-informed selection and configuration of communication backends become crucial for ensuring optimal performance in FL systems. This work focuses on cross-silo federated learning, presenting in-depth benchmarks of various communication backends, including MPI, gRPC, and PyTorch RPC. In addition, we introduce gRPC+S3, a hybrid backend designed to overcome the limitations of existing approaches, particularly for transmitting large models across geo-distributed deployments, achieving up to 3.8× end-to-end speedup over gRPC. Our benchmarks examine point-to-point and end-to-end performance for a broad range of model sizes running under realistic network conditions. Our findings provide practical insights for selecting and configuring suitable communication backends tailored to the specific federated learning tasks and network configurations.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Or compile a full topic from this idea

Discussion (0)

Sign in to join the discussion.

Loading comments…