NeuronaBox

It’s ModelNet for Distributed DNN Training!

A Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

Modern DNN training clusters are remarkable engineering feats that more closely resemble high-performance specialized computing environments – and the large costs that these entail – than their mainstream counterparts in commodity cloud computing datacenters. Optimizing resource utilization and overall efficiency is paramount to maximizing the performance of training workloads and minimizing associated costs. Therefore, it is highly desirable to explore the large space of potential design considerations, performance optimizations, and configuration tunings, and ideally to do so without incurring the time, energy, and monetary costs of profiling training workloads at scale on actual hardware.

Conducting in-depth “what if” analyses is essential to making informed decisions and beneficial in a variety of scenarios. But it is not practical to profile the training workload on thousands of HW accelerators (GPUs, TPUs, etc.) for each candidate strategy and configuration. Recent work has shown the potential of simulation and analytical methods to gain insights into DNN training behavior. However, these approaches suffer from at least one of three limitations: (1) they require significant effort to transform the actual workloads into an input model for the simulator; (2) they require explicit models of parallelization strategies, and incorporating new ones entails non-trivial development of new simulation models; and (3) the fidelity of their results is limited by how faithful the underlying analytical models of computation and communication are, which are notoriously difficult to get right at scale.

In this project, we address the above shortcomings by proposing NeuronaBox – a flexible, user-friendly, and high-fidelity approach to emulate DNN training workloads. We argue that, to accurately observe performance, it suffices to execute the training workload on a subset of real nodes while emulating the networked execution environment, including the collective communication operations. An illustration of our approach is provided in the figure below.

NeuronaBox Concept

Our goal is to enable any subset N of nodes in a distributed DNN training job to execute the workload as if it were running on the entire set of nodes and resources. We propose to achieve this goal by emulating the interactions between N and its networked environment, which in a sense can be viewed as a virtualization of the rest of the job's nodes. We argue that, under certain assumptions, by observing the performance of N, we can analyze and extrapolate the behavior of the entire job with high fidelity. In our design, we envision that NeuronaBox (1) should be usable without any modification to existing user code; and (2) should be flexible enough to seamlessly adapt to changes in parallelization strategies, including new ones that may emerge in the future. The high-level workflow of NeuronaBox is illustrated in the figure below.
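
To make the idea concrete, here is a minimal, toy Python-level sketch of how a collective operation issued by a real node could be answered by a delay model instead of by remote peers. This is an illustration only, not NeuronaBox's actual mechanism; the world size, bandwidth, and ring-style cost model are assumptions chosen for the example.

```python
# Toy illustration: the real node runs its compute as usual, while a
# collective operation is answered by a delay computed from a simple
# analytical model instead of by (non-existent) remote peers.
# All parameter values below are illustrative assumptions.

import time
import torch

EMULATED_WORLD_SIZE = 64   # size of the job the real node pretends to be part of
LINK_BW_GBPS = 400.0       # assumed per-node network bandwidth

def modeled_allreduce_seconds(nbytes: int) -> float:
    # Ring all-reduce: each rank moves roughly 2*(n-1)/n of the buffer over its link.
    traffic_bytes = 2 * (EMULATED_WORLD_SIZE - 1) / EMULATED_WORLD_SIZE * nbytes
    return traffic_bytes * 8 / (LINK_BW_GBPS * 1e9)

def emulated_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Stand-in for torch.distributed.all_reduce on the real node."""
    nbytes = tensor.numel() * tensor.element_size()
    time.sleep(modeled_allreduce_seconds(nbytes))
    return tensor  # values are left untouched; only the timing is emulated

# Example: "reducing" the gradients of a 100M-parameter fp16 bucket.
grads = torch.empty(100_000_000, dtype=torch.float16)
emulated_all_reduce(grads)
```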

NeuronaBox Architecture

First, the user provides the training script, the job configuration (e.g., world size, nodes in N, HW resources, etc.), and optionally a set of what-if conditions to experiment with (e.g., a degraded network link). Second, NeuronaBox initializes the emulation environment by synthesizing the network topology and instantiating a communication model that computes delay times for collective operations within the emulated environment. Third, the training script is launched (e.g., via torchrun), while the desired performance metrics, such as iteration time and resource utilization, are gathered on the nodes in N. Traces of collective communication (e.g., NCCL traces) can also be collected.
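
For concreteness, a hypothetical job description for an emulation run might look as follows. The field names and values are illustrative only and do not reflect NeuronaBox's actual configuration schema.

```python
# Hypothetical job description for an emulation run; field names and values
# are illustrative, not NeuronaBox's actual configuration schema.

job_config = {
    "world_size": 64,              # size of the full (mostly emulated) job
    "real_nodes": [0],             # ranks in N that run on real hardware
    "gpus_per_node": 8,
    "collective_backend": "nccl",
    # Optional what-if conditions applied inside the emulated environment,
    # e.g. degrading the modeled inter-node link bandwidth.
    "what_if": {"link_bandwidth_gbps": 200},
}

# The unmodified training script is then launched as usual, e.g.:
#   torchrun --nnodes=1 --nproc_per_node=8 train.py
# while NeuronaBox synthesizes the topology of the remaining emulated nodes
# and answers their collective operations with modeled completion times.
```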

Early Results

We built a proof-of-concept implementation of NeuronaBox by developing an end-to-end training system using the PyTorch DNN framework and NCCL as the collective communication library, chosen because of their popularity. Our implementation can run two-node training using either a distributed data parallel (DDP) or a fully sharded data parallel (FSDP) strategy. We evaluate our prototype by conducting training experiments with four real-world DNN models: BERT, ResNet152, DeepLight, and T5. As the main evaluation metric, we consider the difference between the emulated iteration time and the iteration time observed on the real system (without emulation). Initial results, presented in the table below, show that NeuronaBox replicates the behavior of actual systems with high accuracy, with an error of less than 1% between the emulated measurements and the real system.

Model       Observed iteration time (ms)   Emulated iteration time (ms)
BERT        628 ± 1.1                      629 ± 3.0
ResNet152   1063 ± 16.3                    1061 ± 19.8
DeepLight   726 ± 13.8                     727 ± 15.0
T5-base     4174 ± 4.4                     4175 ± 0.7
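
As a quick check of the sub-1% claim, the relative error implied by the mean iteration times in the table can be computed directly:

```python
# Relative error between emulated and observed mean iteration times (ms),
# taken from the table above.
observed = {"BERT": 628, "ResNet152": 1063, "DeepLight": 726, "T5-base": 4174}
emulated = {"BERT": 629, "ResNet152": 1061, "DeepLight": 727, "T5-base": 4175}

for model, t_obs in observed.items():
    err = abs(emulated[model] - t_obs) / t_obs * 100
    print(f"{model}: {err:.2f}%")
# BERT: 0.16%, ResNet152: 0.19%, DeepLight: 0.14%, T5-base: 0.02%
```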

Ongoing Work

The proof-of-concept implementation currently emulates only a two-node cluster. We intend to extend it to emulate many nodes simultaneously by intercepting the topology detection and memory allocation interfaces of the DNN framework and adapting them to the workflow of NeuronaBox. In addition, the current implementation emulates only the amount of data transmitted over the network, not the actual data values. This suffices to observe the time of each training iteration, but not the training progress per iteration, which matters especially for DNN training speed-up techniques such as lossy quantization and compression.
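
As a rough, purely hypothetical illustration of the interception direction (the actual interception points and mechanisms in NeuronaBox may differ), one could make the framework on the real node report an inflated job size by wrapping the call it uses to discover the world size:

```python
# Purely illustrative sketch of the interception idea: make the framework on
# the real node believe it is part of a larger job by wrapping the call used
# to discover the job size. Not NeuronaBox's actual implementation.

import torch.distributed as dist

EMULATED_WORLD_SIZE = 64
_real_get_world_size = dist.get_world_size  # keep the original around

def spoofed_get_world_size(group=None):
    # Report the emulated job size instead of the number of real processes.
    return EMULATED_WORLD_SIZE

dist.get_world_size = spoofed_get_world_size
```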

Marco Canini
Associate Professor of Computer Science

My current interest is in designing better systems support for AI/ML and providing practical implementations that are deployable in the real world.