PROOF OR BLUFF? EVALUATING LLMS ON 2025 USA MATH OLYMPIAD
Overview
Paper Summary
This study evaluated eight large language models (LLMs) on the 2025 USA Mathematical Olympiad (USAMO), a challenging competition that demands rigorous, fully justified proofs. All models performed poorly, with the best scoring under 25% on average, exposing substantial limitations in logical reasoning and proof generation.
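As a rough illustration of the scoring arithmetic behind percentages like the one above, here is a minimal sketch assuming the standard USAMO format of six problems graded out of 7 points each; every score in it is a made-up placeholder, not data from the paper.

```python
# Minimal sketch (not the paper's code): converting per-problem rubric scores
# into a percentage, assuming the standard USAMO grading scheme of six
# problems worth 7 points each. All scores below are hypothetical placeholders.

MAX_POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6

def usamo_percentage(problem_scores: list[int]) -> float:
    """Convert six per-problem scores (0-7 each) into a percentage of the maximum."""
    assert len(problem_scores) == NUM_PROBLEMS
    return 100.0 * sum(problem_scores) / (MAX_POINTS_PER_PROBLEM * NUM_PROBLEMS)

# Hypothetical grading of one model across two independent attempts.
attempts = [
    [2, 0, 1, 3, 0, 1],  # placeholder per-problem scores, each out of 7
    [1, 0, 2, 2, 0, 0],
]
average = sum(usamo_percentage(a) for a in attempts) / len(attempts)
print(f"average score: {average:.1f}%")
```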
Explain Like I'm Five
Even the smartest computer programs struggle with really hard math problems that need creative ideas and careful step-by-step explanations. They are good at getting answers, but not so good at showing exactly why those answers are right, which is what a proof requires.
Possible Conflicts of Interest
None identified
Identified Limitations
The evaluation covers only the 2025 USAMO, so the findings may not generalize to other mathematical reasoning tasks.
Rating Explanation
This paper presents a well-structured evaluation of LLM capabilities on a challenging mathematical benchmark. While the study highlights significant limitations in current LLMs, it provides valuable insights into areas for improvement; the methodology is sound, and the analysis of failure modes is informative. However, the evaluation covers only the USAMO, which limits how far the findings generalize to other mathematical reasoning tasks, and it would benefit from testing the models on a broader range of problems.