Scaling Distributed Machine Learning with In-Network Aggregation

SwitchML is a system for distributed machine learning that accelerates data-parallel training using P4 switches. SwitchML uses in-network aggregation to reduce data transfer during synchronization phases of training. By co-designing the software networking end-host stack and the switch pipeline, SwitchML greatly reduces the volume of exchanged data and the latency of all-to-all communication, speeding up training by up to 300%.

SwitchML talk at P4 Workshop 2019

Distributed machine learning (ML) has become common practice, given the increasing complexity of Deep Neural Network (DNN) models and the sheer size of real-world datasets. While GPUs and other accelerators have massively increased compute power, networks have not improved at the same pace. As a result, in large deployments with several parallel workers, distributed ML training is increasingly network bounded.

The network bottleneck for training workloads arises from the need to periodically distribute model updates in order to synchronize workers. This communication is usually done with collective operations (i.e., all-reduce) or enabled by parameter servers. Both require many CPU cycles to aggregate updates and are oblivious to the physical network topology. SwitchML co-designs the switch processing logic, end-host protocols, and ML frameworks to offload aggregation directly to switches. This reduces the amount of data transmitted during synchronization phases, speeding up training time.

In this project, we address three main problems with using programmable switches in such a fashion: limited computational power, maintaining synchronization, and dealing with failures. We address these challenges by appropriately dividing the functionality between hosts and switches, resulting in an efficient and reliable streaming aggregation protocol. Our technical report linked below desribes this in detail.

We analyze the performance of SwitchML using standard benchmarks on popular DNNs in TensorFlow, trained over ImageNet (except for AlexNet, which uses synthetic data). We trained on 8 machines using Horovod, each with 1 NVidia P100 16 GB GPU, dual Xeon E5-2630 v4’s, 128 GB RAM, and dual Intel 82599ES 10Gbps and Mellanox Connect-X 5 100Gbps NICs. We use a 64 x 100 Gbps switch with Barefoot Networks’ Tofino chip. The figure below shows the training performance speedup compared to TensorFlow using the NCCL library. Overall, SwitchML's speedups range between 20%-300%. As expected, different models benefit from in-network aggregation differently, depending on how network-bound they are.

Training speedup

This project is a collaboration with colleagues at Microsoft Research, University of Washington and Barefoot Networks (now Intel).

Marco Canini
Associate Professor of Computer Science

My current interest is in designing better systems support for AI/ML and provide practical implementations deployable in the real-world.


Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training …