Fall 2024 - SANDS Seminar Series - S3S

This thesis introduces Thanos, a novel pruning algorithm designed to improve computational efficiency in large language models by removing less important weights while preserving accuracy. Unlike traditional methods, Thanos uses a block-wise pruning strategy to optimize weight removal, reducing the need for fine-grained updates. A key feature of Thanos is its adaptive pruning mask, which dynamically adjusts to changes in model weights, enabling flexible pruning that enhances generalization. It also supports structured sparsity formats such as n:m sparsity, which are optimized for modern hardware acceleration. Experiments show that Thanos outperforms existing approaches on smaller models, such as OPT-125M and OPT-350M, improving both perplexity and zero-shot performance. For larger models, such as LLaMA-2 and LLaMA-3, it achieves results comparable to SparseGPT and Wanda while pruning faster on models of up to 3 billion parameters. This thesis provides a thorough evaluation of Thanos across diverse tasks and models, demonstrating its adaptability and efficiency. The publicly available implementation establishes Thanos as a high-performing, flexible solution for resource-efficient language model deployment.
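As a rough illustration of the n:m structured sparsity format mentioned above, the sketch below keeps the n largest-magnitude weights in every group of m. This is only background on the sparsity format that accelerators can exploit, not the Thanos criterion itself, which uses block-wise updates and an adaptive mask.

    # Minimal sketch of n:m structured sparsity (here 2:4) by weight magnitude.
    # Illustrative only; not the Thanos pruning criterion.
    import numpy as np

    def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
        """Keep the n largest-magnitude weights in every group of m along the last axis."""
        rows, cols = weights.shape
        assert cols % m == 0, "column count must be divisible by m"
        groups = weights.reshape(rows, cols // m, m)
        # indices of the (m - n) smallest-magnitude weights in each group
        drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
        mask = np.ones_like(groups, dtype=bool)
        np.put_along_axis(mask, drop, False, axis=-1)
        return (groups * mask).reshape(rows, cols)

    W = np.random.randn(4, 8)
    W_sparse = prune_n_m(W)   # exactly 2 nonzeros per group of 4 columns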

Traffic monitoring is foundational to the network control loop, ensuring that modern networks remain secure, reliable, and performant. Fine-grained monitoring can enable powerful failure mitigation and intrusion detection, but achieving this level of visibility at scale requires either computationally intensive in-network processing or centralized solutions capable of massive data ingestion. These demands often force operators to reduce network visibility, thereby weakening control loop effectiveness. In this talk, Jonatan will introduce Direct Telemetry Access, a high-speed telemetry collection solution capable of line-rate data ingestion, and discuss his recent work on spatiotemporal sketch disaggregation — a method that delivers accurate, network-wide flow-level statistics, even in highly restrictive and heterogeneous environments.
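For readers unfamiliar with sketch-based monitoring, the toy count-min sketch below shows how approximate per-flow statistics can be kept in bounded memory. It is generic background only, not the spatiotemporal sketch disaggregation method presented in the talk; the width and depth values are illustrative.

    # Toy count-min sketch for per-flow packet counts (generic background only).
    import hashlib

    class CountMinSketch:
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, row, flow_key):
            h = hashlib.blake2b(f"{row}:{flow_key}".encode(), digest_size=8)
            return int.from_bytes(h.digest(), "little") % self.width

        def update(self, flow_key, count=1):
            for row in range(self.depth):
                self.table[row][self._index(row, flow_key)] += count

        def query(self, flow_key):
            # over-estimates only: take the minimum across rows
            return min(self.table[row][self._index(row, flow_key)]
                       for row in range(self.depth))

    cms = CountMinSketch()
    cms.update("10.0.0.1->10.0.0.2:443", 3)
    print(cms.query("10.0.0.1->10.0.0.2:443"))   # >= 3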

Training large language models is plagued by intense compute costs and limited hardware memory. Low-precision (LP) datatypes can not only reduce training costs by accelerating compute-bound matrix multiplications, but also improve byte-wise memory efficiency. However, directly using LP datatypes as drop-in replacements for higher-precision datatypes (e.g., FP32) can result in degraded models or even divergence. Existing works carefully adopt LP at different stages and components of LLM training, for example LP matrix multiplication (e.g., Transformer Engine, MS-AMP) and LP weights, optimizer states, and optimization (e.g., MS-AMP, Collage). In this talk, we will first revisit some state-of-the-art low-precision strategies for LLM training, centered on floating-point types, and then go over recent research from our group at AWS AI, including i) Collage (https://arxiv.org/abs/2405.03637), a lightweight low-precision optimization framework; ii) how stochastic rounding chips in to help in LLM training; and iii) lossless LLM training with 4-bit floats.
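To make the stochastic rounding idea concrete, the toy example below rounds a value up or down with probability proportional to its distance from the two neighbouring representable values, so the rounding error is zero in expectation. It rounds to a fixed step size for clarity; the actual FP8/FP4 schemes are described in the cited papers.

    # Toy illustration of stochastic rounding (unbiased in expectation).
    import random

    def stochastic_round(x: float, step: float) -> float:
        lo = (x // step) * step          # nearest representable value below x
        frac = (x - lo) / step           # in [0, 1): distance to the lower value
        return lo + step if random.random() < frac else lo

    # In expectation, the rounded values recover the original mean.
    vals = [0.3] * 10000
    avg = sum(stochastic_round(v, step=1.0) for v in vals) / len(vals)
    print(avg)   # close to 0.3, whereas round-to-nearest would give 0.0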

Cloud providers offer a wide variety of resources, such as different GPU types, numbers of GPUs per VM, GPU-to-CPU ratios, availability zones, and VM reliability options (i.e., spot vs. on-demand machines). The configuration of a cluster significantly impacts the performance and cost of ML workloads. For instance, a homogeneous cluster of one GPU type might be better for performance but worse for cost than one of another GPU type, or might require a different workload partitioning. Furthermore, the current shortage of GPUs in the public cloud means that ML workloads might need to use heterogeneous resources, spread across cloud zones and regions to facilitate training, and seamlessly adapt to changes in resource availability. Thus, to optimize the performance and cost of jobs in the public cloud, a system needs to jointly optimize resource allocation and the job parallelization plan, and enable training over heterogeneous, dynamic environments. Existing works solve only part of this problem, such as identifying optimal cluster configurations for generic workloads, or partitioning an ML training workload given a fixed cluster setup. To address all of these challenges, we are developing SAILOR, a system that automates large-scale ML training in the public cloud. SAILOR consists of two components: a Planner that co-optimizes cluster allocation and the workload parallelization plan, and an elastic, fault-tolerant runtime that enables seamless training in dynamic, heterogeneous environments.
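The sketch below conveys the flavor of the planning problem: enumerate candidate (GPU type, count) configurations and pick the cheapest one that meets a throughput target. SAILOR's actual Planner also co-optimizes the parallelization plan and handles heterogeneous, dynamic resources; the prices and throughput numbers here are made up for illustration.

    # Toy cluster-planning search (illustrative numbers, not SAILOR's Planner).
    CANDIDATES = [
        # (gpu_type, gpus, $/gpu-hour, samples/sec per gpu)
        ("A100", 8, 4.10, 210.0),
        ("A100", 16, 4.10, 195.0),   # scaling efficiency drops with cluster size
        ("V100", 16, 2.48, 90.0),
        ("V100", 32, 2.48, 80.0),
    ]

    def plan(target_samples_per_sec: float):
        feasible = []
        for gpu, n, price, per_gpu in CANDIDATES:
            throughput = n * per_gpu
            if throughput >= target_samples_per_sec:
                feasible.append((n * price, gpu, n, throughput))
        return min(feasible) if feasible else None

    print(plan(1400.0))   # cheapest configuration meeting the target throughput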

Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, normalised latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of the user-facing performance that is crucial for real-time applications such as chat and translation. In this talk, we will discuss the pitfalls of current performance metrics for evaluating LLM inference systems, and then present Etalon, a comprehensive performance evaluation framework that includes fluidity-index, a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience.
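For context on the conventional metrics mentioned above, the snippet below computes them from per-token arrival timestamps. The fluidity-index metric itself is defined in the Etalon work and is not reproduced here; the timestamps are illustrative.

    # Conventional decode-phase metrics from per-token arrival timestamps (seconds).
    def decode_metrics(request_time: float, token_times: list[float]) -> dict:
        ttft = token_times[0] - request_time                          # time to first token
        tbt = [b - a for a, b in zip(token_times, token_times[1:])]   # time between tokens
        tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)  # avg time per output token
        return {"TTFT": ttft, "max_TBT": max(tbt) if tbt else 0.0, "TPOT": tpot}

    print(decode_metrics(0.0, [0.45, 0.50, 0.58, 0.61, 0.90]))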