Training large language models (LLMs) is plagued by intense compute cost and limited hardware memory. Low-precision (LP) datatypes can not only reduce training cost by accelerating compute-bound matrix multiplications, but also improve byte-wise memory efficiency. However, directly using LP datatypes as drop-in replacements for higher-precision datatypes (e.g., FP32) can degrade model quality or even cause divergence. Existing works carefully adopt LP at different stages and components of LLM training, for example, LP matrix multiplications (e.g., Transformer Engine, MS-AMP) and LP weights, optimizer states & optimization (e.g., MS-AMP, Collage). In this talk, we will first revisit state-of-the-art low-precision strategies for LLM training, centered on floating-point types, and then go over recent research from our group at AWS AI, including i) Collage (
https://arxiv.org/abs/2405.03637), a lightweight low-precision optimization framework; ii) how stochastic rounding chips in to help LLM training; and iii) lossless LLM training with 4-bit floats.
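As background for point ii), here is a minimal NumPy sketch of stochastic rounding: instead of rounding to the nearest representable value, a value is rounded up with probability proportional to its distance from the lower grid point, so the rounding error is zero in expectation and tiny gradient updates are not systematically lost. The fixed-point-style grid and step size below are illustrative assumptions, not the specific LP formats covered in the talk.

```python
import numpy as np

def stochastic_round(x, step=2.0**-8, rng=None):
    """Round x to multiples of `step`, rounding up with probability equal to the
    distance to the lower grid point, so E[stochastic_round(x)] == x."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(x) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                      # fractional part in [0, 1)
    rounded = lower + (rng.random(scaled.shape) < prob_up)
    return rounded * step

# Example: many tiny updates that nearest rounding would discard entirely,
# because 1e-4 is smaller than half the grid step (2^-9 ~ 2e-3).
w = np.zeros(10_000)
for _ in range(100):
    w = stochastic_round(w + 1e-4)
print(w.mean())   # ~0.01, close to the exact accumulated sum 100 * 1e-4
```

With round-to-nearest under the same grid, `w` would stay at zero forever; stochastic rounding recovers the accumulated update on average, which is the intuition behind its use for low-precision weight and optimizer-state updates.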