PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD
Overview
Paper Summary
This study evaluated eight large language models (LLMs) on the 2025 USA Mathematical Olympiad (USAMO), a challenging competition that demands rigorous, fully justified proofs. All models performed poorly, with the best scoring under 25% on average, exposing substantial limitations in logical reasoning and proof generation.
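As a rough illustration of the scoring arithmetic behind percentages like the one above, here is a minimal sketch assuming the standard USAMO format of six problems graded out of 7 points each; every score in it is a made-up placeholder, not data from the paper.

```python
# Minimal sketch (not the paper's code): converting per-problem rubric scores
# into a percentage, assuming the standard USAMO grading scheme of six
# problems worth 7 points each. All scores below are hypothetical placeholders.

MAX_POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6

def usamo_percentage(problem_scores: list[int]) -> float:
    """Convert six per-problem scores (0-7 each) into a percentage of the maximum."""
    assert len(problem_scores) == NUM_PROBLEMS
    return 100.0 * sum(problem_scores) / (MAX_POINTS_PER_PROBLEM * NUM_PROBLEMS)

# Hypothetical grading of one model across two independent attempts.
attempts = [
    [2, 0, 1, 3, 0, 1],  # placeholder per-problem scores, each out of 7
    [1, 0, 2, 2, 0, 0],
]
average = sum(usamo_percentage(a) for a in attempts) / len(attempts)
print(f"average score: {average:.1f}%")
```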
Explain Like I'm Five
Even the smartest computer programs struggle with really hard math problems that need creative ideas and careful step-by-step explanations. They are good at getting answers, but not so good at showing exactly why those answers are right, which is what a proof requires.
Possible Conflicts of Interest
None identified
Identified Limitations
The evaluation covers only the 2025 USAMO, so the findings may not generalize to other mathematical reasoning tasks.
Rating Explanation
This paper presents a well-structured evaluation of LLM capabilities on a challenging mathematical benchmark. While the study highlights significant limitations in current LLMs, it provides valuable insights into areas for improvement; the methodology is sound, and the analysis of failure modes is informative. However, the evaluation covers only the USAMO, which limits how far the findings generalize to other mathematical reasoning tasks, and it would benefit from testing the models on a broader range of problems.