Thanos: A Block-wise Pruning Algorithm for Efficient Large Language Model Compression

Abstract

This thesis introduces Thanos, a novel pruning algorithm designed to improve the computational efficiency of large language models by removing less important weights while preserving accuracy. Unlike traditional methods, Thanos uses a block-wise pruning strategy to optimize weight removal, reducing the need for fine-grained updates. A key feature of Thanos is its adaptive pruning mask, which dynamically adjusts to changes in model weights, enabling flexible pruning that enhances generalization. It also supports structured sparsity formats such as n:m sparsity, which are optimized for modern hardware acceleration. Experiments show that Thanos outperforms existing methods on smaller models, such as OPT-125M and OPT-350M, improving both perplexity and zero-shot performance. On larger models, such as LLaMA-2 and LLaMA-3, it achieves results comparable to SparseGPT and Wanda while pruning models of up to 3 billion parameters faster. This thesis provides a thorough evaluation of Thanos across diverse tasks and models, demonstrating its adaptability and efficiency. The publicly available implementation establishes Thanos as a high-performing, flexible solution for resource-efficient language model deployment.
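
For readers unfamiliar with the n:m sparsity format mentioned in the abstract, the sketch below is a minimal, hypothetical illustration of a plain magnitude-based 2:4 mask in PyTorch. It shows only the structured-sparsity format that modern accelerators exploit (at most n nonzero weights in every group of m consecutive weights), not the Thanos weight-selection or update rule itself; the function name and shapes are illustrative assumptions.

```python
import torch

def nm_sparsity_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return a 0/1 mask keeping the n largest-magnitude weights in each group of m."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "row length must be divisible by m"
    # View each row as consecutive groups of m weights.
    groups = weight.abs().reshape(out_features, in_features // m, m)
    # Indices of the n largest-magnitude entries within every group.
    topk = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(out_features, in_features)

# Example: a 2:4 mask yields exactly 50% sparsity, 2 weights kept per group of 4.
W = torch.randn(8, 16)
W_pruned = W * nm_sparsity_mask(W, n=2, m=4)
```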

Date
Location: Building 4-5220

Bio

Ivan Ilin studies Computer Science in the MS/PhD program at King Abdullah University of Science and Technology (KAUST). He is a member of the KAUST GenAI Center of Excellence, advised by Peter Richtárik. His academic work experience includes: Undergraduate Research Assistant, Budker Institute of Nuclear Physics, Novosibirsk, Russia (2020); Deep Learning Junior Researcher, ExpaSoft, Novosibirsk, Russia (2020–2023); and Lavrentyev Institute of Hydrodynamics, Novosibirsk, Russia (2018–2019). He is also the founder of the educational platform ilinblog.ru (2017–Present) and the online school vectozavr.ru (2022–Present).