AI Models Struggle with US Math Olympiad Problems: A Study
Introduction to the Challenge
The U.S. Math Olympiad (USAMO) is a prestigious mathematical competition that serves as a qualifier for the International Math Olympiad. Contestants face demanding problems that require complete proofs, graded on correctness, completeness, and clarity. Unlike earlier qualifying tests such as the American Invitational Mathematics Examination (AIME), which asks only for final numerical answers, the USAMO challenges students to construct and justify full mathematical arguments. A recent study evaluated several AI reasoning models on the 2025 USAMO problems, shedding light on the current limits of AI in complex mathematical reasoning.
AI Models Tested
Researchers evaluated several AI reasoning models on the newly released 2025 USAMO problems shortly after the competition took place. The models tested included:
- Qwen’s QwQ-32B
- DeepSeek R1
- Google’s Gemini 2.0 Flash Thinking (Experimental)
- Gemini 2.5 Pro
- OpenAI’s o1-pro and o3-mini-high
- Anthropic’s Claude 3.7 Sonnet with Extended Thinking
- xAI’s Grok 3
Running the evaluation so soon after the competition helped ensure that the specific problems had not appeared in the models’ training data.
Performance Overview
Among the AI models assessed, Google’s Gemini 2.5 Pro stood out, achieving an average score of 10.1 out of 42 points, roughly 24 percent of the available credit. The results for the other models revealed significant shortcomings:
- DeepSeek R1 and Grok 3: 2.0 points each
- Google’s Flash Thinking: 1.8 points
- Anthropic’s Claude 3.7: 1.5 points
- Qwen’s QwQ and OpenAI’s o1-pro: 1.2 points each
- OpenAI’s o3-mini-high: 0.9 points (~2.1 percent)
Importantly, no model produced a perfect solution to any of the problems tested. Newer models such as OpenAI’s o3 and o4-mini-high were not part of this evaluation; separate benchmarks put them at 21.73 percent and 19.05 percent, respectively, though those figures may have been influenced by prior visibility of the contest solutions.
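For context, a USAMO paper consists of six problems, each graded on a 0-to-7 scale, for a maximum of 42 points. A brief sketch in Python, with the averages quoted above hard-coded purely for illustration, shows how those raw scores map to the percentages cited:

```python
# Convert average USAMO scores (out of 42) into percentages of the maximum.
# The 42-point ceiling comes from six problems graded 0-7 each.
MAX_SCORE = 6 * 7  # 42 points per paper

average_scores = {
    "Gemini 2.5 Pro": 10.1,
    "DeepSeek R1": 2.0,
    "Grok 3": 2.0,
    "Gemini Flash Thinking": 1.8,
    "Claude 3.7 Sonnet": 1.5,
    "QwQ-32B": 1.2,
    "o1-pro": 1.2,
    "o3-mini-high": 0.9,
}

for model, score in average_scores.items():
    pct = 100 * score / MAX_SCORE
    print(f"{model}: {score:.1f}/42 ({pct:.1f}%)")
```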
Understanding Failure Patterns
The study identified several recurring failure patterns in the models’ attempts at USAMO problems. Responses commonly contained logical gaps, lacked sufficient mathematical justification, and relied on unproven assumptions, sometimes producing conclusions that contradicted earlier steps even while the overall write-up sounded plausible.
A concrete example involved USAMO 2025 Problem 5, which asked contestants to determine all positive integers "k" satisfying a given condition. Qwen’s QwQ model correctly identified the necessary conditions during its reasoning, but then mistakenly ruled out non-integer possibilities and arrived at an incorrect final answer. This instance underscores the difficulty AI models have in sustaining rigorous mathematical reasoning from start to finish.
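To see why this kind of failure matters, it helps to contrast empirical checking with proof. The sketch below uses an entirely hypothetical condition, a divisibility property of a sum of binomial-coefficient powers chosen only for illustration and not taken from the contest; the point is that a finite search can make a value of k look correct over many cases, while the "for every positive integer n" claim that a USAMO solution requires remains unproven.

```python
from math import comb

def holds_for(k: int, n: int) -> bool:
    """Hypothetical condition: (n + 1) divides the sum of C(n, i)**k."""
    total = sum(comb(n, i) ** k for i in range(n + 1))
    return total % (n + 1) == 0

def candidates(max_k: int, max_n: int) -> list[int]:
    """Values of k that survive a finite check; passing is evidence, not proof."""
    return [k for k in range(1, max_k + 1)
            if all(holds_for(k, n) for n in range(1, max_n + 1))]

if __name__ == "__main__":
    # A finite search can only rule candidates out; it cannot establish
    # a claim about every positive integer n, which is what a proof must do.
    print(candidates(max_k=10, max_n=40))
```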
Conclusion: Implications for the Future
The study’s results underline a significant gap in AI’s current mathematical capabilities, particularly on tasks that demand higher-level reasoning. The models’ limited success suggests that while AI can assist in many fields, rigorous mathematical problem-solving remains a formidable hurdle. As these technologies develop, continued research and refinement will be needed to equip AI with stronger formal reasoning skills.
The findings could also influence how educators approach integrating AI into mathematics instruction, emphasizing a complementary relationship rather than complete reliance on the technology. As education and technology evolve together, understanding these limitations will be crucial for guiding the development of intelligent systems capable of matching or surpassing human performance in advanced mathematical reasoning.