DeepSeek’s Abysmal Performance with the AIME 2025 Math Benchmark
What Is the AIME Benchmark?
The American Invitational Mathematics Examination (AIME) is the second in the series of exams used to select the team that represents the US at the International Mathematical Olympiad. While most AIME participants are high school students, some bright middle school students also qualify each year. The exam is administered by the Mathematical Association of America.
This year’s AIME was held February 6th, and the problems and answers were published immediately afterwards—and are already being bandied about on various YouTube channels, forums, and blogs.
MathArena’s AIME Leaderboard
What Is It?
The MathArena team jumped on this dataset and worked against the clock to run evaluations on the 2025 problems before models could start training on them. Because the problems are genuinely hard and the answers are integers rather than multiple-choice selections, the exam makes an excellent benchmark: models have to reason through complex problems, with far less opportunity to land on the correct answer by chance.
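To make the scoring concrete: each AIME answer is an integer from 0 to 999, so an evaluation reduces to exact matching against the official answer key. Below is a minimal sketch of that idea in Python; the function names and the answer-extraction heuristic are my own illustration, not MathArena's actual harness.

```python
import re
from typing import Optional

def extract_integer_answer(response: str) -> Optional[int]:
    """Naively pull the last 0-999 integer that appears in a model's response."""
    matches = re.findall(r"\b\d{1,3}\b", response)
    return int(matches[-1]) if matches else None

def aime_accuracy(responses: list[str], answer_key: list[int]) -> float:
    """Exact-match accuracy: no partial credit, no multiple-choice options to guess from."""
    correct = sum(
        1 for resp, ans in zip(responses, answer_key)
        if extract_integer_answer(resp) == ans
    )
    return correct / len(answer_key)

# Example: a model that solves 9 of the 15 problems on one exam scores 0.60.
```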
As soon as I got the news about the new leaderboard, I added it to my AI Strategy tool, which you can view filtered for the ‘Solve math problems’ task. As always, I provide tips, links, and benchmark definitions in the leaderboard modal.
What I Like About It
A few things I appreciate about MathArena’s leaderboard:
Actual Model Performance
Below is a screenshot of the leaderboard table, with the larger DeepSeek models ranging from 50-65%. Not in o1/o3 territory, but nipping at their heels. However, DeepSeek-R1-Distill-Qwen-1.5B, a distillation of DeepSeek-R1 into Qwen's much smaller 1.5B model, did not perform well at all.
Only two models—gpt-4o and claude-3.5-sonnet—performed worse. I wasn’t surprised about Claude. I personally experienced its math deficiency as recently as Wednesday, when I was adding the new leaderboard to my strategy app.
To be fair though, it’s been my experience that Claude 3.5 Sonnet is by far the best model for coding tasks. I love it because it doesn’t pontificate. Working with it feels like having a coding mentor who can walk me through any bug, feature clash, or faulty logic.
So Why Call Out DeepSeek?
When I did a search to see when the AIME 2025 problems were released, I found several posts extolling the virtues of DeepScaleR-1.5B-Preview, a language model fine-tuned from the DeepSeek-R1-Distill-Qwen-1.5B model. The emphasis is on how well it performs with only 1.5B parameters (a very small model), but the references are to the AIME 2024 test questions.
DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B using distributed reinforcement learning (RL) to scale up to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, representing a 15% improvement over the base model (28.8%) and surpassing OpenAI’s O1-Preview performance with just 1.5B parameters.
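For context on the Pass@1 figure quoted above: reasoning-model reports typically sample each problem several times and use the standard pass@k estimator (popularized by OpenAI's HumanEval paper), which for k = 1 is just the average fraction of correct samples. Whether DeepScaleR computed it exactly this way isn't stated here, so treat this as an illustration, assuming n samples per problem of which c are correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples
    is correct, given that c of the n samples drawn were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples on one problem, 7 correct -> pass@1 = 7/16 = 0.4375.
print(pass_at_k(n=16, c=7, k=1))
```

Averaging that per-problem value over the full problem set yields the headline accuracy.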
To be fair, the DeepScaleR-1.5B-Preview model wasn't included in the AIME 2025 leaderboard, so those posts couldn't have referenced it. But the pressing question, for me at least, is why anyone is talking about a model's performance on a dataset that has been publicly available for a year. Even if these models didn't intentionally train on AIME problems and solutions from previous years (which are readily available on Art of Problem Solving's website as well as Kaggle), the chances of those problems turning up in other training sources are quite high.
So there was only a small window of opportunity in which this benchmark would be relevant and largely free from model contamination, with some hope that accuracy scores would generalize to unseen test data. But touting performance on last year's test while the 2025 results are hot off the presses (and are not much of a flex for DeepSeek's 1.5B model) is odd at best and a case of smoke and mirrors at worst.
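For readers wondering how contamination even gets checked, a common first pass is to look for long verbatim word n-gram overlaps between benchmark problems and training documents. The sketch below is illustrative only; the 8-word threshold and the helper names are my assumptions, not any lab's actual deduplication pipeline.

```python
def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All lowercase word n-grams of length n in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def possibly_contaminated(problem: str, training_doc: str, n: int = 8) -> bool:
    """Flag a problem if it shares any length-n word n-gram with a training document."""
    return bool(word_ngrams(problem, n) & word_ngrams(training_doc, n))
```

Real pipelines add normalization, hashing, and fuzzy matching, but even this crude check shows why a year-old exam is hard to keep out of a web-scale corpus.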
Conclusion
We need to be a little more skeptical about model creators' claims when they are not verified by a third-party benchmark organizer. Also, the MathArena team said that they plan to run evals on more models. I personally think this is unfortunate, as the chances of model contamination increase with each day that passes, and leaderboards often don't disclose whether they allow models to self-report. With an arms race underway to be the first to lay claim to math mastery, one of the more difficult frontiers for LLMs, it would take an extraordinary amount of self-control not to consume these test problems and solutions as soon as they become publicly available.
That said, I'm glad that DeepSeek is applying pressure to US model creators to bring their model costs under control. I have several posts scheduled that address OpenAI's out-of-touch price points for o1 on a number of fronts. Below is a sneak peek into just one example comparing its o1-2024-12-17 agent to Google's gemini-2.0-flash-001 agent. Gemini outperforms it at 1% of o1's estimated cost (source).
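For anyone who wants to sanity-check that kind of cost claim, the arithmetic is simple: multiply token counts by each model's per-million-token prices. The token counts and prices below are illustrative placeholders (substitute the current published rates for o1-2024-12-17 and gemini-2.0-flash-001), so the printed ratio is an example, not a quoted figure.

```python
def run_cost(input_tokens: int, output_tokens: int,
             price_in_per_m: float, price_out_per_m: float) -> float:
    """Estimated dollar cost of one eval run at the given per-1M-token prices."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000

# Hypothetical run: 2M input tokens, 8M output tokens, with placeholder prices.
o1_cost = run_cost(2_000_000, 8_000_000, price_in_per_m=15.00, price_out_per_m=60.00)
flash_cost = run_cost(2_000_000, 8_000_000, price_in_per_m=0.10, price_out_per_m=0.40)
print(f"flash cost as a share of o1 cost: {flash_cost / o1_cost:.1%}")
```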
Photo credit: Elimende Inagella