Systems for AI

The Systems for AI reading group is a forum designed to explore impactful ML Systems research, exchange ideas, and connect with others who share similar research interests. It aims to foster engaging discussions on cutting-edge research, offering participants the chance to dive deeply into selected papers and share their insights and perspectives. For each session, participants are asked to read the paper in advance and bring their thoughts, questions, and ideas for discussion. Suggestions for future papers are welcome and can be submitted during or outside sessions.

Session Details:

Below is the initial schedule of papers to be discussed. This schedule will be updated as we progress.

Machine learning models are distributed across multiple nodes using numerous parallelism strategies. The resulting collective communication is often on the critical path due to a lack of independent coarse-grained computation kernels available to execute. In this work, we propose fusing computation with its subsequent collective communication and leveraging GPUs' massive parallelism, along with GPU-initiated communication, to overlap communication and computation. Specifically, thread blocks/workgroups (WGs) immediately communicate their results to remote GPUs after completing their computation, while other WGs within the same kernel perform computation. We developed three prototype fused operators (embedding+All-to-All, GEMV+AllReduce, and GEMM+All-to-All) to address the communication overheads in DLRM, Transformer, and MoE model architectures. We expose the fused kernels as new PyTorch operators and extend the Triton framework to demonstrate their practicality. Our evaluations show that our approach effectively overlaps communication with computation, reducing their combined execution time by 12%-31% across all three operators.
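
The host-side sketch below is only a coarse-grained illustration of the overlap idea for the GEMV+AllReduce case, using PyTorch's asynchronous collectives on an already-initialized process group; the paper's operators fuse this at thread-block granularity with GPU-initiated communication, which this snippet does not reproduce. The function name and chunking scheme are illustrative assumptions, not the paper's API.

```python
import torch
import torch.distributed as dist

def chunked_gemv_allreduce(weight_shard, x_shard, num_chunks=4):
    # weight_shard: (rows, local_cols), x_shard: (local_cols,).
    # Each rank holds a column shard, so per-rank partial GEMV results
    # must be summed across ranks.
    out = weight_shard.new_empty(weight_shard.shape[0])
    handles = []
    for w_chunk, o_chunk in zip(weight_shard.chunk(num_chunks, dim=0),
                                out.chunk(num_chunks, dim=0)):
        # Partial GEMV for one slice of the output rows.
        torch.mv(w_chunk, x_shard, out=o_chunk)
        # Immediately start reducing this slice across ranks while the
        # next slice's GEMV runs.
        handles.append(dist.all_reduce(o_chunk, async_op=True))
    for h in handles:
        h.wait()
    return out
```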

As deep learning model sizes expand and new GPUs are released every year, the need for distributed training on heterogeneous GPUs rises, to fully harness under-utilized low-end GPUs and reduce the cost of purchasing expensive high-end GPUs. In this paper, we introduce Metis, a system designed to automatically find efficient parallelism plans for distributed training on heterogeneous GPUs. Metis holistically optimizes several key system components, such as the profiler, cost estimator, and planner, which were previously limited to a single GPU type, so that they efficiently leverage the compute power and memory capacity of diverse GPU types. This enables Metis to distribute training workloads across heterogeneous GPUs at fine granularity, improving resource efficiency. However, the search space for automatic parallelism at this level of complexity would be prohibitively expensive to navigate. To address this issue, Metis develops a new search algorithm that efficiently prunes large search spaces and balances loads with heterogeneity-awareness, while preferring data parallelism over tensor parallelism within a pipeline stage to take advantage of its superior computation and communication trade-offs. Our evaluation with three large models (GPT-3, MoE, and Wide-ResNet) on combinations of three types of GPUs demonstrates that Metis finds better parallelism plans than traditional methods, with 1.05-8.43× training speed-up, while requiring less profiling and search time. Compared to oracle planning that delivers the fastest parallel training, Metis finds near-optimal solutions while reducing profiling and search overheads by orders of magnitude.
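
As a toy illustration of one ingredient of heterogeneity-aware load balancing (not Metis's planner, whose search also spans pipeline and tensor parallelism), the sketch below splits a global batch across data-parallel workers in proportion to profiled per-GPU throughput; the worker names and numbers are made up.

```python
def balance_batch(global_batch, throughputs):
    # throughputs: profiled samples/sec per worker, e.g. {"A100-0": 900, ...}
    total = sum(throughputs.values())
    shares = {w: int(global_batch * tp / total) for w, tp in throughputs.items()}
    # Hand any rounding remainder to the fastest worker.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += global_batch - sum(shares.values())
    return shares

# Faster GPUs receive proportionally larger per-step shares of the batch.
print(balance_batch(1024, {"A100-0": 900, "V100-0": 400, "T4-0": 130}))
```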

Deep learning training (DLT), e.g., large language model (LLM) training, has become one of the most important services in multi-tenant cloud computing. By deeply studying in-production DLT jobs, we observed that communication contention among different DLT jobs seriously degrades overall GPU computation utilization, resulting in low efficiency of the training cluster. In this paper, we present Crux, a communication scheduler that aims to maximize GPU computation utilization by mitigating the communication contention among DLT jobs. Maximizing GPU computation utilization for DLT, nevertheless, is NP-complete; thus, we formulate and prove a novel theorem to approach this goal through GPU-intensity-aware communication scheduling. Then, we propose an approach that prioritizes the DLT flows with high GPU computation intensity, reducing potential communication contention. Our 96-GPU testbed experiments show that Crux improves GPU computation utilization by 8.3% to 14.8%. Large-scale production trace-based simulation further shows that Crux increases GPU computation utilization by up to 23% compared with alternatives including Sincronia, TACCL, and CASSINI.
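
A toy sketch of the prioritization idea (not Crux's scheduler or its theorem): rank jobs by an approximate GPU computation intensity, here taken as GPU compute time per byte communicated, and map them onto a few network priority classes. The job names, numbers, and the intensity proxy are illustrative assumptions.

```python
def assign_priorities(jobs, num_classes=4):
    # jobs: {name: (gpu_seconds_per_iteration, comm_bytes_per_iteration)}
    # Higher GPU compute per byte of traffic -> higher priority (class 0).
    ranked = sorted(jobs, key=lambda j: jobs[j][0] / jobs[j][1], reverse=True)
    return {job: min(i * num_classes // len(ranked), num_classes - 1)
            for i, job in enumerate(ranked)}

print(assign_priorities({"llm-pretrain": (512.0, 2e9),
                         "vision-cnn": (8.0, 5e8),
                         "moe-train": (256.0, 4e9)}))
```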

Deep Learning (DL) workloads have rapidly increased in popularity in enterprise clusters, and several new cluster schedulers have been proposed in recent years to support these workloads. With rapidly evolving DL workloads, it is challenging to quickly prototype and compare scheduling policies across workloads. Further, as prior systems target different aspects of scheduling (resource allocation, placement, elasticity, etc.), it is also challenging to combine these techniques and understand the overall benefits. To address these challenges, we propose Blox, a modular toolkit which allows developers to compose individual components and realize diverse scheduling frameworks. We identify a set of core abstractions for DL scheduling, implement several existing schedulers using these abstractions, and verify the fidelity of these implementations by reproducing results from prior research. We also highlight how we can evaluate and compare existing schedulers in new settings: different workload traces, higher cluster load, and changes in DNN workloads and deployment characteristics. Finally, we showcase Blox's extensibility by composing policies from different schedulers and implementing novel policies with minimal code changes. Blox is available at https://github.com/msr-fiddle/blox.
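
To give a flavor of the composition idea, here is a hypothetical sketch in which a scheduler is assembled from independently swappable ordering and placement components; the class and method names are illustrative and are not Blox's actual abstractions (see the repository for those).

```python
class FifoPolicy:
    def order(self, jobs):
        # Admission order: earliest-arriving job first.
        return sorted(jobs, key=lambda j: j["arrival"])

class FirstFitPlacement:
    def place(self, job, free_gpus):
        # Take the first GPUs that satisfy the request, or nothing.
        return free_gpus[: job["gpus"]] if len(free_gpus) >= job["gpus"] else []

class Scheduler:
    def __init__(self, policy, placement):
        self.policy, self.placement = policy, placement

    def step(self, pending_jobs, free_gpus):
        allocations = {}
        for job in self.policy.order(pending_jobs):
            gpus = self.placement.place(job, free_gpus)
            if gpus:
                allocations[job["name"]] = gpus
                free_gpus = [g for g in free_gpus if g not in gpus]
        return allocations

# Swapping FifoPolicy for, say, a shortest-remaining-time policy would not
# touch the placement or allocation logic.
sched = Scheduler(FifoPolicy(), FirstFitPlacement())
print(sched.step([{"name": "jobA", "gpus": 2, "arrival": 1.0}],
                 ["gpu0", "gpu1", "gpu2"]))
```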

Inference serving for large language models (LLMs) is the key to unleashing their potential in people's daily lives. However, efficient LLM serving remains challenging today because the requests are inherently heterogeneous and unpredictable in terms of resource and latency requirements, as a result of the diverse applications and the dynamic execution nature of LLMs. Existing systems are fundamentally limited in handling these characteristics and cause problems such as severe queuing delays, poor tail latencies, and SLO violations. We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling across multiple model instances. Similar to context switching across CPU cores in modern operating systems, Llumnix reschedules requests to improve load balancing and isolation, mitigate resource fragmentation, and differentiate request priorities and SLOs. Llumnix implements the rescheduling with an efficient and scalable live migration mechanism for requests and their in-memory states, and exploits it in a dynamic scheduling policy that unifies the multiple rescheduling scenarios elegantly. Our evaluations show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5×, and delivers up to 36% cost savings while achieving similar tail latencies, compared against state-of-the-art LLM serving systems. Llumnix is publicly available at https://github.com/AlibabaPAI/llumnix.
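
The sketch below is only a toy rendering of the rescheduling decision (the instance names, threshold, and load metric are assumptions); Llumnix's contribution is the live migration mechanism that moves a request together with its in-memory KV-cache state, which plain code like this does not capture.

```python
def rebalance(instances, threshold=4):
    # instances: {instance_id: [queued/running request ids]}
    src = max(instances, key=lambda i: len(instances[i]))
    dst = min(instances, key=lambda i: len(instances[i]))
    if len(instances[src]) - len(instances[dst]) > threshold:
        request = instances[src].pop()
        # In Llumnix this step is a live migration: KV-cache state is
        # transferred while the request keeps running on the source.
        instances[dst].append(request)
        return request, src, dst
    return None

print(rebalance({"inst0": list(range(9)), "inst1": [100, 101]}))
```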

Modern machine learning (ML) workloads heavily depend on distributing tasks across clusters of server CPUs and specialized accelerators, such as GPUs and TPUs, to achieve optimal performance. Nonetheless, prior research has highlighted the inefficient utilization of computing resources in distributed ML, leading to suboptimal performance. This inefficiency primarily stems from CPU bottlenecks and suboptimal accelerator scheduling. Although numerous proposals have been put forward to address these issues individually, none have effectively tackled both inefficiencies simultaneously. In this paper, we introduce Conspirator, an innovative control plane design aimed at alleviating both bottlenecks by harnessing the enhanced computing capabilities of SmartNICs. Following the evolving role of SmartNICs, which have transitioned from their initial function of offloading standard networking tasks to serving as programmable connectors between disaggregated computing resources, Conspirator facilitates efficient data transfer without the involvement of host CPUs and hence circumvents the potential bottlenecks there. Conspirator further integrates a novel scheduling algorithm that takes into consideration the heterogeneity of accelerators and adapts to changing workload dynamics, enabling the flexibility to mitigate the second bottleneck. Our evaluation demonstrates that Conspirator may provide a 15% end-to-end completion time reduction compared to RDMA-based alternatives, while being 17% more cost-effective and 44% more power-efficient. Our proposed scheduler also helps to save 33% of GPU hours compared to naive GPU-sharing schedulers by making close-to-optimal decisions while taking much less time than the optimal NP-hard scheduler.
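
As a rough illustration of heterogeneity-aware accelerator scheduling in general (not Conspirator's algorithm, and ignoring the SmartNIC data path entirely), the sketch below greedily places each task on the accelerator with the earliest estimated finish time given per-device throughput estimates; all names and numbers are invented.

```python
def greedy_schedule(tasks, accelerators):
    # tasks: [(name, flops)]; accelerators: {device_id: flops_per_second}
    busy_until = {dev: 0.0 for dev in accelerators}
    plan = {}
    for name, flops in sorted(tasks, key=lambda t: -t[1]):  # longest task first
        # Pick the device that would finish this task earliest.
        best = min(busy_until, key=lambda d: busy_until[d] + flops / accelerators[d])
        busy_until[best] += flops / accelerators[best]
        plan[name] = best
    return plan

print(greedy_schedule([("gemm", 4e12), ("embedding", 1e12)],
                      {"A100-0": 3.0e14, "T4-0": 6.5e13}))
```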