Revisiting Low-Precision Strategies for LLM Training

Abstract

Large language model training is plagued by intense compute costs and limited hardware memory. Low-precision (LP) datatypes can not only reduce training costs by accelerating compute-bound matrix multiplications, but also improve byte-wise memory efficiency. However, directly using LP datatypes as drop-in replacements for higher-precision datatypes (e.g., FP32) can result in degraded models or even divergence. Existing works carefully adopt LP for LLM training at different stages and components, for example LP matrix multiplication (e.g., Transformer Engine, MS-AMP) and LP weights, optimizer states, and optimization (e.g., MS-AMP, Collage). In this talk, we will first revisit some state-of-the-art low-precision strategies for LLM training, centered on floating-point types, and then go over recent research from our group at AWS AI, including i) Collage (https://arxiv.org/abs/2405.03637), a lightweight low-precision optimization framework; ii) how stochastic rounding chips in to help LLM training; and iii) lossless LLM training with 4-bit floats.
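
To give a flavor of item ii), the sketch below illustrates the general idea of stochastic rounding, not the speaker's implementation: each value is rounded down or up to a neighboring point on a low-precision grid with probability proportional to its distance, so the quantized value is unbiased in expectation, unlike round-to-nearest. The quantize_stochastic helper and the uniform grid spacing eps are illustrative assumptions, not part of any cited system.

    import numpy as np

    def quantize_stochastic(x, eps, rng):
        """Round each entry of x to a multiple of the grid spacing eps.

        Rounds down with probability 1 - p and up with probability p, where p is
        the fractional distance to the lower grid point, so the result equals x
        in expectation (unbiased), unlike deterministic round-to-nearest.
        """
        scaled = np.asarray(x, dtype=np.float64) / eps
        lower = np.floor(scaled)
        p = scaled - lower                           # distance above the lower grid point
        round_up = rng.random(size=scaled.shape) < p
        return (lower + round_up) * eps

    rng = np.random.default_rng(0)
    x = np.full(100_000, 0.3)                        # lies between grid points for eps = 0.25
    q = quantize_stochastic(x, eps=0.25, rng=rng)
    print(q.mean())  # ~0.3: unbiased, whereas round-to-nearest would always return 0.25

This unbiasedness is what makes stochastic rounding attractive for accumulating many small gradient updates in low precision, where round-to-nearest would systematically discard them.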

Date
Location
Building 3-5220

Bio

Tao Yu is currently a Scientist at AWS AI. He graduated in 2024 from the Computer Science department at Cornell University, where he was advised by Prof. Christopher De Sa. His research focuses on building efficient machine learning systems, such as low-precision algorithms that enable fast and memory-efficient LLM training.