
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

★ ★ ★ ★ ☆

Paper Summary

Paperzilla title
AI: Are You Overthinking This? New Benchmark Tests if Language Models Think Too Much or Too Little

This paper introduces OptimalThinkingBench, a new benchmark designed to evaluate both overthinking (spending too many tokens on simple queries) and underthinking (not reasoning enough on complex tasks) in large language models (LLMs). The authors' findings suggest that current LLMs struggle to balance thinking effort with task complexity, often overthinking simple questions without accuracy gains while underthinking on more challenging reasoning tasks. They explore various methods to encourage optimal thinking, including efficient reasoning techniques and routing between thinking and non-thinking modes, but find that substantial improvement remains an open challenge for future work.

Explain Like I'm Five

This paper introduces a benchmark to test whether AI models think too much or too little for different tasks. It finds current models aren't great at balancing thinking effort with problem difficulty.

Possible Conflicts of Interest

The authors are affiliated with Meta/FAIR, which has a vested interest in the development and evaluation of large language models. However, the research appears to have been conducted objectively, with a fair comparison of open-source and closed models.

Identified Limitations

Limited Model Diversity
While the benchmark is a novel contribution, the evaluation primarily focuses on a limited set of closed and open-source LLMs, potentially overlooking the diverse landscape of models and approaches in the field.
Synthetic Data Limitations
The reliance on synthetically generated datasets, although scalable, might not fully capture the nuances and complexities of real-world scenarios, limiting the generalizability of findings to practical applications.
Novel Metrics
Although the paper proposes novel metrics for evaluating overthinking and underthinking, their long-term effectiveness and broader adoption by the research community remain to be seen.
Preliminary Exploration of Optimal Thinking Methods
The exploration of methods to encourage optimal thinking is preliminary, lacking extensive comparisons with existing techniques or in-depth analyses of their strengths and weaknesses in different reasoning domains.

Rating Explanation

This paper presents a valuable contribution by introducing a unified benchmark to assess both overthinking and underthinking in LLMs, along with novel metrics and a comprehensive evaluation of numerous models. Although there are some limitations regarding model diversity and reliance on synthetic data, the overall methodology is sound and the findings point to insightful directions for future research. The authors' affiliation with Meta is noted as a potential conflict of interest.


File Information

Original Title: OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Uploaded: August 19, 2025 at 04:20 AM
Privacy: Public