OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
Overview
Paper Summary
This paper introduces OptimalThinkingBench, a new benchmark designed to evaluate both overthinking (spending too many tokens on simple queries) and underthinking (not reasoning enough on complex tasks) in large language models (LLMs). The authors find that current LLMs struggle to match thinking effort to task complexity: they often overthink simple questions without accuracy gains while underthinking harder reasoning tasks. The paper explores several methods to encourage more optimal thinking, including efficient-reasoning techniques and routing between thinking and non-thinking modes, but substantial improvement is left to future work.
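To make the routing idea concrete, here is a minimal sketch of routing between thinking and non-thinking modes. This is not the paper's implementation: the `estimate_difficulty` heuristic and the `generate` placeholder are hypothetical helpers introduced only for illustration; a real router would use a learned difficulty estimator and call an actual model API.

```python
# Hypothetical sketch of routing between thinking and non-thinking modes.
# `estimate_difficulty` and `generate` are assumed helpers, not the paper's API.

def estimate_difficulty(query: str) -> float:
    """Toy difficulty proxy: longer, math-heavy queries score higher (0.0-1.0)."""
    math_markers = sum(query.count(c) for c in "+-*/=^")
    length_score = min(len(query.split()) / 100.0, 1.0)
    return min(1.0, 0.5 * length_score + 0.1 * math_markers)

def generate(query: str, thinking: bool) -> str:
    """Placeholder for an LLM call; a real router would invoke a model here."""
    mode = "thinking" if thinking else "non-thinking"
    return f"[{mode} mode response to: {query!r}]"

def route_and_generate(query: str, threshold: float = 0.3) -> str:
    """Send easy queries to the fast non-thinking mode, hard ones to thinking mode."""
    if estimate_difficulty(query) < threshold:
        return generate(query, thinking=False)  # avoid overthinking simple queries
    return generate(query, thinking=True)       # avoid underthinking complex tasks
```

The design choice the benchmark probes is exactly this trade-off: a good router saves tokens on easy inputs without sacrificing accuracy on hard ones.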
Explain Like I'm Five
This paper introduces a benchmark to test whether AI models think too much or too little for different tasks. It finds current models aren't great at balancing thinking effort with problem difficulty.
Possible Conflicts of Interest
The authors are affiliated with Meta/FAIR, which has a vested interest in the development and evaluation of large language models. However, the research appears to be conducted objectively, with a fair comparison of various open-source and closed-source models.
Identified Limitations
- The set of evaluated models, while large, may not capture the full diversity of architectures and training regimes.
- The benchmark relies in part on synthetically generated questions, which may not fully reflect real-world query distributions.
Rating Explanation
This paper makes a valuable contribution by introducing a unified benchmark that assesses both overthinking and underthinking in LLMs, along with novel metrics and a comprehensive evaluation of many models. Although there are some limitations regarding model diversity and reliance on synthetic data, the overall methodology is sound and the findings point to useful directions for future research. The authors' affiliation with Meta is noted as a potential conflict of interest.