Paper Summary
Paperzilla title
Large Language Models Fail Math Olympiad: Proof or Bluff?
This study evaluated eight large language models (LLMs) on the 2025 USA Mathematical Olympiad (USAMO), a challenging competition that requires rigorous, complete proofs. The models performed poorly, with the best achieving an average score below 25%, revealing substantial limitations in logical reasoning and proof generation.
Possible Conflicts of Interest
None identified
Identified Weaknesses
Limited Problem Solving Ability
The best-performing LLM achieved an average score of only 24.4% on the USAMO problems, suggesting that current LLMs are not yet proficient at solving complex, proof-based mathematical problems.
Lack of Creativity and Reasoning
The LLMs often applied incorrect logic, made unjustified assumptions, and failed to explore alternative solution strategies, indicating a lack of genuine understanding and problem-solving creativity.
Over-Reliance on Pattern Recognition
The models exhibited a tendency to overgeneralize patterns from simple cases to more complex ones without formal proof, demonstrating a superficial approach to mathematical reasoning.
Issues with Solution Clarity and Structure
Many of the models produced disorganized, hard-to-follow solutions, which made grading difficult and highlights the need for clearer solution presentation.
Artifacts from Optimization Strategies
The models consistently "boxed" final answers even when the problems called for proofs rather than numerical results, and sometimes generated non-existent citations, revealing unintended artifacts of their training and optimization strategies.
Rating Explanation
This paper presents a well-structured evaluation of LLM capabilities on a challenging mathematical benchmark. The methodology is sound, and the analysis of failure modes offers valuable insight into where current models fall short. However, the study focuses solely on the 2025 USAMO, which limits how far the findings generalize to other mathematical reasoning tasks; evaluating the models on a broader range of proof-based problems would strengthen the conclusions.
Good to know
This is our free standard analysis. Paperzilla Pro fact-checks every citation, researches author backgrounds and funding sources, and uses advanced AI reasoning for more thorough insights.
File Information
Original Title:
PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD
Uploaded:
September 10, 2025 at 06:08 PM
© 2025 Paperzilla. All rights reserved.