
Fantastic Pretraining Optimizers and Where to Find Them

★ ★ ★ ★ ☆

Paper Summary

Paperzilla title
Muon and Soap Reign Supreme...But Only for Small Language Models

This paper benchmarks 11 optimizers for large language model pretraining and finds that while some, such as Muon and Soap, do offer a speedup over AdamW, it is smaller than previously claimed (at most about 1.4x) and diminishes as model size increases. The authors also find that optimal hyperparameters vary significantly between optimizers, so comparisons that share hyperparameters across optimizers are unfair, and that early checkpoints can be misleading because optimizer rankings can shift over the course of training.
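To make the "1.4x speedup" figure concrete, the sketch below shows one common way such a number can be computed: as the ratio of training tokens each optimizer needs to reach the same target validation loss. This is an illustrative sketch only; the function names, loss curves, and numbers are hypothetical placeholders, not the paper's code or data.

```python
# Illustrative sketch (not the paper's code): quantify an optimizer "speedup"
# as the ratio of training tokens needed to reach the same target loss.

def tokens_to_reach(loss_curve, target_loss):
    """Return the first token count at which the loss curve hits target_loss.

    loss_curve: list of (tokens_seen, validation_loss) pairs, ordered by tokens.
    """
    for tokens_seen, loss in loss_curve:
        if loss <= target_loss:
            return tokens_seen
    return None  # target never reached within the training budget


def speedup_over_baseline(candidate_curve, baseline_curve, target_loss):
    """Speedup = baseline tokens / candidate tokens to reach the same loss."""
    baseline_tokens = tokens_to_reach(baseline_curve, target_loss)
    candidate_tokens = tokens_to_reach(candidate_curve, target_loss)
    if baseline_tokens is None or candidate_tokens is None:
        return None
    return baseline_tokens / candidate_tokens


# Hypothetical example: a candidate optimizer reaching loss 3.0 with fewer
# tokens than an AdamW baseline would report a ~1.4x speedup.
adamw_curve = [(1e9, 3.4), (2e9, 3.1), (2.8e9, 3.0)]
candidate_curve = [(1e9, 3.2), (2e9, 3.0)]
print(speedup_over_baseline(candidate_curve, adamw_curve, target_loss=3.0))  # 1.4
```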

Explain Like I'm Five

Some new, fancy ways to train AI models work faster than the old way, but the gains are less than hyped and disappear as models get huge.

Possible Conflicts of Interest

The authors acknowledge support from Google, which has a vested interest in efficient large language model training, but this seems appropriately disclosed and does not obviously bias the research.

Identified Limitations

Limited model sizes tested
The largest model tested is 1.2B parameters, leaving open the question of how these optimizers perform on the much larger models (7B+ parameters) that dominate current research and applications. The paper extrapolates its results to suggest the speedup vanishes at larger scales, but empirical validation is missing.
Focus on pretraining
The study evaluates optimizers only on pretraining, not on fine-tuning or downstream tasks. While pretraining is a major cost, what ultimately matters is performance on the specific tasks the models are used for.

Rating Explanation

This is a strong study with rigorous methodology addressing a relevant problem. The hyperparameter tuning, scaling analysis, and identification of misleading evaluation practices are valuable. The limited model size is a notable weakness that prevents a five-star rating, but the findings matter for current-scale models and motivate further research at larger scales.



File Information

Original Title: Fantastic Pretraining Optimizers and Where to Find Them
Uploaded: September 04, 2025 at 06:22 PM
Privacy: Public