SAILOR: fast, cost-effective ML training in the cloud

Abstract

Cloud providers offer a wide variety of resources, differing in GPU type, number of GPUs per VM, GPU-to-CPU ratio, availability zone, and VM reliability option (spot vs. on-demand machines). The configuration of a cluster significantly impacts the performance and cost of ML workloads. For instance, a homogeneous cluster of one GPU type might deliver better performance but at a higher cost than a cluster of another GPU type, or might necessitate a different workload partitioning. Furthermore, the current shortage of GPUs in the public cloud means that ML workloads might need to use heterogeneous resources spread across cloud zones and regions to facilitate training, and seamlessly adapt to changes in resource availability. Thus, to optimize the performance and cost of jobs in the public cloud, a system needs to jointly optimize resource allocation and the job parallelization plan, and enable training over heterogeneous, dynamic environments. Existing works solve only part of this problem, such as identifying optimal cluster configurations for generic workloads, or partitioning an ML training workload given a fixed cluster setup. To address all of these challenges, we are developing SAILOR, a system that automates large-scale ML training in the public cloud. SAILOR consists of two components: a Planner that co-optimizes cluster allocation and workload parallelization, and an elastic, fault-tolerant system that enables seamless training in dynamic, heterogeneous environments.

Date
Location
Building 3-5220

Bio

Foteini Strati is a third-year PhD student in the Systems Group at ETH Zurich, supervised by Prof. Ana Klimovic, working on Systems for Machine Learning. Currently, she is working on increasing GPU utilization and fault tolerance of ML workloads. During her PhD, she has completed internships at Microsoft Research and NVIDIA. She obtained an MSc degree in Computer Science from ETH Zurich, and a Diploma in Electrical and Computer Engineering from the National Technical University of Athens.