Fantastic Pretraining Optimizers and Where to Find Them
Overview
Paper Summary
This paper benchmarks 11 optimizers for large language model pretraining and finds that while some, such as Muon and SOAP, do offer a speedup over AdamW, it is smaller (up to 1.4x) than previously claimed and shrinks as model size increases. The authors also find that optimal hyperparameters differ substantially between optimizers, so comparisons that share a single hyperparameter setting are unfair, and that early checkpoints can be misleading because optimizer rankings can shift over the course of training.
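The fairness point about hyperparameters can be made concrete with a small sketch. The Python snippet below is not from the paper; `train_and_eval` and its fake loss surface are hypothetical stand-ins for full pretraining runs. It contrasts comparing optimizers at one shared hyperparameter setting with comparing each optimizer at its own tuned best, which is the protocol the paper argues for.

```python
# Minimal sketch (not the paper's code): why per-optimizer tuning matters.
# `train_and_eval` is a hypothetical stand-in for a full pretraining run
# that returns a final validation loss for a given optimizer and settings.
from itertools import product

def train_and_eval(optimizer: str, lr: float, weight_decay: float) -> float:
    # Placeholder: in a real benchmark this would launch a pretraining run.
    # Here it returns a fake loss so the sketch is runnable; the "optimal"
    # learning rates below are invented purely for illustration.
    fake_optimum = {"adamw": 3e-4, "muon": 2e-2, "soap": 3e-3}[optimizer]
    return abs(lr - fake_optimum) / fake_optimum + 0.1 * weight_decay

optimizers = ["adamw", "muon", "soap"]
lrs = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2]
wds = [0.0, 0.1]

# Unfair comparison: one shared hyperparameter setting for every optimizer.
shared_lr, shared_wd = 3e-4, 0.1
shared = {opt: train_and_eval(opt, shared_lr, shared_wd) for opt in optimizers}

# Fairer comparison: sweep each optimizer over its own grid, report its best.
tuned = {
    opt: min(train_and_eval(opt, lr, wd) for lr, wd in product(lrs, wds))
    for opt in optimizers
}

print("shared hyperparameters:", shared)  # can misrank optimizers
print("per-optimizer tuning:  ", tuned)   # ranking under each optimizer's best
```

Under the shared setting, an optimizer whose sweet spot lies elsewhere looks artificially weak, which is exactly the evaluation pitfall the paper highlights.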
Explain Like I'm Five
Some new, fancy ways to train AI models work faster than the old way, but the gains are less than hyped and disappear as models get huge.
Possible Conflicts of Interest
The authors acknowledge support from Google, which has a vested interest in efficient large language model training, but this seems appropriately disclosed and does not obviously bias the research.
Identified Limitations
The study is limited to relatively modest model sizes, and the measured speedups shrink as models grow, so the conclusions may not extrapolate to frontier-scale training.
Rating Explanation
This is a strong study with rigorous methodology addressing a relevant problem. The hyperparameter tuning, scaling analysis, and identification of misleading evaluation practices are valuable. The limited model size is a notable weakness that prevents a 5, but the findings are important for current-scale models and motivate further research at larger scales.