

PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD



Paper Summary

Paperzilla title
Large Language Models Fail Math Olympiad: Proof or Bluff?
This study evaluated eight large language models (LLMs) on the 2025 USAMO, a challenging math competition requiring rigorous proofs. The models performed poorly, with the best achieving an average score of less than 25%, revealing limitations in logical reasoning and proof generation.
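
For context on how a headline figure like "under 25%" is computed: the sketch below assumes the standard USAMO setup of six problems graded 0-7 each (42 points total) and several independent runs per model, with the overall score taken as points awarded over points possible. The paper's exact grading pipeline may differ, and all numbers in the example are illustrative, not the paper's data.

# Illustrative sketch (not from the paper): how an "average score" like
# 24.4% can be aggregated. Assumes six USAMO problems graded 0-7 each,
# as in the real olympiad, with several independent runs per model.

MAX_POINTS_PER_PROBLEM = 7

def average_percentage(grades: list[list[int]]) -> float:
    """grades[p][r] = points (0-7) awarded to run r on problem p."""
    awarded = sum(sum(runs) for runs in grades)
    possible = sum(len(runs) for runs in grades) * MAX_POINTS_PER_PROBLEM
    return 100.0 * awarded / possible

# Made-up grades for one hypothetical model, four runs per problem:
example = [
    [7, 7, 6, 5],  # problem 1
    [2, 1, 0, 1],  # problem 2
    [0, 0, 1, 0],  # problem 3
    [3, 2, 2, 1],  # problem 4
    [0, 0, 0, 0],  # problem 5
    [1, 0, 2, 0],  # problem 6
]
print(f"{average_percentage(example):.1f}%")  # 24.4% for these made-up grades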

Possible Conflicts of Interest

None identified

Identified Weaknesses

Limited Problem-Solving Ability
The best-performing LLM achieved an average score of only 24.4% on the USAMO problems, suggesting that current LLMs are not yet proficient at solving complex, proof-based mathematical problems.
Lack of Creativity and Reasoning
The LLMs often applied incorrect logic, made unjustified assumptions, and failed to explore alternative solution strategies, indicating a lack of genuine understanding and problem-solving creativity.
Over-Reliance on Pattern Recognition
The models exhibited a tendency to overgeneralize patterns from simple cases to more complex ones without formal proof, demonstrating a superficial approach to mathematical reasoning.
Issues with Solution Clarity and Structure
Many LLMs produced chaotic and difficult-to-interpret solutions, hindering effective evaluation and highlighting the need for improvement in solution presentation.
Artifacts from Optimization Strategies
Behaviors such as consistently "boxing" final answers (even though USAMO problems call for full proofs rather than single numerical answers) and citing non-existent references were observed, revealing unintended side effects of training and optimization strategies.

Rating Explanation

This paper presents a well-structured evaluation of LLM capabilities on a challenging mathematical benchmark. While the study highlights significant limitations in current LLMs, it provides valuable insights into areas for improvement. The methodology is sound, and the analysis of failure modes is informative. However, the study focuses solely on the USAMO, which limits how far the findings generalize to other mathematical reasoning tasks; evaluating the models on a wider range of problems would strengthen the conclusions.

Good to know

This is our free standard analysis. Paperzilla Pro fact-checks every citation, researches author backgrounds and funding sources, and uses advanced AI reasoning for more thorough insights.

Topic Hierarchy

Physical Sciences › Computer Science › Artificial Intelligence

File Information

Original Title: PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD
File Name: paper_1363.pdf
File Size: 0.64 MB
Uploaded: September 10, 2025 at 06:08 PM
Privacy: 🌐 Public