Fall 2024 - SANDS Seminar Series - S3S

This thesis introduces Thanos, a novel pruning algorithm designed to improve computational efficiency in large language models by removing less important weights while preserving accuracy. Unlike traditional methods, Thanos uses a block-wise pruning strategy to optimize weight removal, reducing the need for fine-grained updates. A key feature of Thanos is its adaptive pruning mask, which dynamically adjusts to changes in model weights, enabling flexible pruning that enhances generalization. It also supports structured sparsity formats such as n:m sparsity, which are optimized for modern hardware acceleration. Experiments show that Thanos outperforms existing approaches on smaller models, such as OPT-125M and OPT-350M, improving both perplexity and zero-shot performance. For larger models, such as LLaMA-2 and LLaMA-3, it achieves results comparable to SparseGPT and Wanda while pruning faster on models of up to 3 billion parameters. This thesis provides a thorough evaluation of Thanos across diverse tasks and models, demonstrating its adaptability and efficiency. The publicly available implementation establishes Thanos as a high-performing, flexible solution for resource-efficient language model deployment.
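As a rough illustration of the n:m structured sparsity format mentioned above, the sketch below keeps the n largest-magnitude weights in every group of m. This is only background on the sparsity format that accelerators can exploit, not the Thanos criterion itself, which uses block-wise updates and an adaptive mask.

    # Minimal sketch of n:m structured sparsity (here 2:4) by weight magnitude.
    # Illustrative only; not the Thanos pruning criterion.
    import numpy as np

    def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
        """Keep the n largest-magnitude weights in every group of m along the last axis."""
        rows, cols = weights.shape
        assert cols % m == 0, "column count must be divisible by m"
        groups = weights.reshape(rows, cols // m, m)
        # indices of the (m - n) smallest-magnitude weights in each group
        drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
        mask = np.ones_like(groups, dtype=bool)
        np.put_along_axis(mask, drop, False, axis=-1)
        return (groups * mask).reshape(rows, cols)

    W = np.random.randn(4, 8)
    W_sparse = prune_n_m(W)   # exactly 2 nonzeros per group of 4 columns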

Traffic monitoring is foundational to the network control loop, ensuring that modern networks remain secure, reliable, and performant. Fine-grained monitoring can enable powerful failure mitigation and intrusion detection, but achieving this level of visibility at scale requires either computationally intensive in-network processing or centralized solutions capable of massive data ingestion. These demands often force operators to reduce network visibility, thereby weakening control loop effectiveness. In this talk, Jonatan will introduce Direct Telemetry Access, a high-speed telemetry collection solution capable of line-rate data ingestion, and discuss his recent work on spatiotemporal sketch disaggregation — a method that delivers accurate, network-wide flow-level statistics, even in highly restrictive and heterogeneous environments.
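For readers unfamiliar with sketch-based monitoring, the toy count-min sketch below shows how approximate per-flow statistics can be kept in bounded memory. It is generic background only, not the spatiotemporal sketch disaggregation method presented in the talk; the width and depth values are illustrative.

    # Toy count-min sketch for per-flow packet counts (generic background only).
    import hashlib

    class CountMinSketch:
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, row, flow_key):
            h = hashlib.blake2b(f"{row}:{flow_key}".encode(), digest_size=8)
            return int.from_bytes(h.digest(), "little") % self.width

        def update(self, flow_key, count=1):
            for row in range(self.depth):
                self.table[row][self._index(row, flow_key)] += count

        def query(self, flow_key):
            # over-estimates only: take the minimum across rows
            return min(self.table[row][self._index(row, flow_key)]
                       for row in range(self.depth))

    cms = CountMinSketch()
    cms.update("10.0.0.1->10.0.0.2:443", 3)
    print(cms.query("10.0.0.1->10.0.0.2:443"))   # >= 3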

Training large language models is plagued by intense compute costs and limited hardware memory. Low-precision (LP) datatypes can not only reduce training costs by accelerating compute-bound matrix multiplications, but also improve byte-wise memory efficiency. However, directly using LP datatypes as drop-in replacements for higher-precision datatypes (e.g., FP32) can result in degraded models or even divergence. Existing works carefully adopt LP at different stages and components of LLM training, for example LP matrix multiplication (e.g., Transformer Engine, MS-AMP) and LP weights, optimizer states, and optimization (e.g., MS-AMP, Collage). In this talk, we will first revisit some state-of-the-art low-precision strategies for LLM training, centered on floating-point types, and then go over recent research from our group at AWS AI, including i) Collage (https://arxiv.org/abs/2405.03637), a lightweight low-precision optimization framework; ii) how stochastic rounding chips in to help in LLM training; and iii) lossless LLM training with 4-bit floats.
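To make the stochastic rounding idea concrete, the toy example below rounds a value up or down with probability proportional to its distance from the two neighbouring representable values, so the rounding error is zero in expectation. It rounds to a fixed step size for clarity; the actual FP8/FP4 schemes are described in the cited papers.

    # Toy illustration of stochastic rounding (unbiased in expectation).
    import random

    def stochastic_round(x: float, step: float) -> float:
        lo = (x // step) * step          # nearest representable value below x
        frac = (x - lo) / step           # in [0, 1): distance to the lower value
        return lo + step if random.random() < frac else lo

    # In expectation, the rounded values recover the original mean.
    vals = [0.3] * 10000
    avg = sum(stochastic_round(v, step=1.0) for v in vals) / len(vals)
    print(avg)   # close to 0.3, whereas round-to-nearest would give 0.0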

Cloud providers offer a wide variety of resources, such as different GPU types, numbers of GPUs per VM, GPU-to-CPU ratios, availability zones, and VM reliability options (i.e., spot vs. on-demand machines). The configuration of a cluster significantly impacts the performance and cost of ML workloads. For instance, a homogeneous cluster of one GPU type might be better for performance but worse for cost than one of another GPU type, or might require a different workload partitioning. Furthermore, the current shortage of GPUs in the public cloud means that ML workloads might need to use heterogeneous resources, spread across cloud zones and regions to facilitate training, and seamlessly adapt to changes in resource availability. Thus, to optimize the performance and cost of jobs in the public cloud, a system needs to jointly optimize resource allocation and the job parallelization plan, and enable training over heterogeneous, dynamic environments. Existing works solve only part of this problem, such as identifying optimal cluster configurations for generic workloads, or partitioning an ML training workload given a fixed cluster setup. To address all of these challenges, we are developing SAILOR, a system that automates large-scale ML training in the public cloud. SAILOR consists of two components: a Planner that co-optimizes cluster allocation and the workload parallelization plan, and an elastic, fault-tolerant runtime that enables seamless training in dynamic, heterogeneous environments.
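The sketch below conveys the flavor of the planning problem: enumerate candidate (GPU type, count) configurations and pick the cheapest one that meets a throughput target. SAILOR's actual Planner also co-optimizes the parallelization plan and handles heterogeneous, dynamic resources; the prices and throughput numbers here are made up for illustration.

    # Toy cluster-planning search (illustrative numbers, not SAILOR's Planner).
    CANDIDATES = [
        # (gpu_type, gpus, $/gpu-hour, samples/sec per gpu)
        ("A100", 8, 4.10, 210.0),
        ("A100", 16, 4.10, 195.0),   # scaling efficiency drops with cluster size
        ("V100", 16, 2.48, 90.0),
        ("V100", 32, 2.48, 80.0),
    ]

    def plan(target_samples_per_sec: float):
        feasible = []
        for gpu, n, price, per_gpu in CANDIDATES:
            throughput = n * per_gpu
            if throughput >= target_samples_per_sec:
                feasible.append((n * price, gpu, n, throughput))
        return min(feasible) if feasible else None

    print(plan(1400.0))   # cheapest configuration meeting the target throughput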

Serving large language models (LLMs) in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, normalised latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of the user-facing performance that is crucial for real-time applications such as chat and translation. In this talk, we will discuss the pitfalls of current performance metrics for evaluating LLM inference systems, and then present Etalon, a comprehensive performance evaluation framework that includes fluidity-index, a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience.
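For context on the conventional metrics mentioned above, the snippet below computes them from per-token arrival timestamps. The fluidity-index metric itself is defined in the Etalon work and is not reproduced here; the timestamps are illustrative.

    # Conventional decode-phase metrics from per-token arrival timestamps (seconds).
    def decode_metrics(request_time: float, token_times: list[float]) -> dict:
        ttft = token_times[0] - request_time                          # time to first token
        tbt = [b - a for a, b in zip(token_times, token_times[1:])]   # time between tokens
        tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)  # avg time per output token
        return {"TTFT": ttft, "max_TBT": max(tbt) if tbt else 0.0, "TPOT": tpot}

    print(decode_metrics(0.0, [0.45, 0.50, 0.58, 0.61, 0.90]))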