PAPERZILLA
Crunching Academic Papers into Bite-sized Insights.

Fantastic Pretraining Optimizers and Where to Find Them

Paper Summary

Paperzilla title
Muon and SOAP Reign Supreme... But Only for Small Language Models
This paper benchmarks 11 optimizers for large language model pretraining and finds that while some, such as Muon and SOAP, do offer a speedup over AdamW, it is smaller (up to 1.4x) than previously claimed and diminishes as model size increases. The authors also find that optimal hyperparameters vary significantly between optimizers, making comparisons that share hyperparameters across optimizers unfair, and that early checkpoints can be misleading because optimizer rankings can shift during training.
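
To make a "1.4x speedup" figure concrete, below is a minimal sketch of one way such a number could be computed: the ratio of training tokens each optimizer needs to reach the same target loss, with each optimizer tuned independently. The loss curves and numbers are hypothetical placeholders, not results from the paper.

```python
# Hypothetical sketch: token-to-target-loss speedup of one optimizer over AdamW.
import numpy as np

def tokens_to_reach(loss_curve, tokens, target_loss):
    """Return the first token count at which the loss curve drops to target_loss or below."""
    below = np.where(np.asarray(loss_curve) <= target_loss)[0]
    return tokens[below[0]] if len(below) else None

# Placeholder loss curves logged every 1B tokens for two independently tuned optimizers.
tokens = np.arange(1, 101) * 1e9                      # 1B ... 100B tokens
adamw_loss = 3.5 * (tokens / 1e9) ** -0.15            # placeholder power-law decay
muon_loss = 3.4 * (tokens / 1e9) ** -0.155            # placeholder: slightly faster decay

target = adamw_loss[-1]                               # AdamW's final loss after 100B tokens
t_adamw = tokens_to_reach(adamw_loss, tokens, target)
t_muon = tokens_to_reach(muon_loss, tokens, target)
print(f"Estimated speedup: {t_adamw / t_muon:.2f}x")  # ~1.4x with these placeholder curves
```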

Possible Conflicts of Interest

The authors acknowledge support from Google, which has a vested interest in efficient large language model training, but this seems appropriately disclosed and does not obviously bias the research.

Identified Weaknesses

Limited model sizes tested
The largest model tested is 1.2B parameters, leaving open the question of how these optimizers perform on the truly massive models (7B+ parameters) that dominate current research and applications. The paper extrapolates its results to suggest the speedup disappears at larger sizes (illustrated by the sketch below), but empirical validation is missing.
Focus on pretraining
The study evaluates optimizers only on pretraining, not on fine-tuning or downstream tasks. While pretraining is a major cost, what ultimately matters is performance on the tasks a model is actually used for.
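
As a rough illustration of the kind of extrapolation mentioned above, the sketch below fits speedup against log model size and projects it to a larger model. The data points are hypothetical placeholders, not the paper's measurements, and the log-linear form is an assumption made only for illustration.

```python
# Hypothetical sketch: extrapolating measured speedups to a larger model size.
import numpy as np

# Placeholder (model size in parameters, speedup over AdamW) pairs.
sizes = np.array([130e6, 300e6, 520e6, 1.2e9])
speedups = np.array([1.38, 1.30, 1.24, 1.16])

# Fit speedup as a linear function of log10(parameters) and extrapolate to 7B.
slope, intercept = np.polyfit(np.log10(sizes), speedups, 1)
predicted_7b = slope * np.log10(7e9) + intercept
print(f"Extrapolated speedup at 7B params: {predicted_7b:.2f}x")  # ~1x: advantage largely gone
```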

Rating Explanation

This is a strong study with rigorous methodology addressing a relevant problem. The careful hyperparameter tuning, the scaling analysis, and the identification of misleading evaluation practices are all valuable. The limited model sizes are a notable weakness that prevents a top rating of 5, but the findings matter for current-scale models and motivate follow-up work at larger scales.

Topic Hierarchy

Physical Sciences › Computer Science › Artificial Intelligence

File Information

Original Title: Fantastic Pretraining Optimizers and Where to Find Them
File Name: paper_1102.pdf
File Size: 2.17 MB
Uploaded: September 04, 2025 at 06:22 PM
Privacy: 🌐 Public