DeepSeek’s Abysmal Performance with the AIME 2025 Math Benchmark
To be clear, I’m talking about the DeepSeek-R1-Distill-Qwen-1.5B model, not R1.

What Is the AIME Benchmark?

The American Invitational Mathematics Examination (AIME) is the second exam in the series of exams used to challenge mathletes competing for a spot on the team that represents the US at the International Mathematical Olympiad. While most AIME participants are high school students, some bright middle school students also qualify each year. The exam is administered by the Mathematical Association of America.

This year’s AIME was held on February 6th, and the problems and answers were published immediately afterwards; they’re already being bandied about on various YouTube channels, forums, and blogs.

MathArena’s AIME Leaderboard

What Is It?

The MathArena team jumped on this dataset and worked against the clock to run evaluations on the 2025 problems before models could start training on them. Since these are challenging math problems, the set makes for an excellent benchmark of how well models reason through complex, multi-step problems. And because every AIME answer is an integer from 0 to 999 rather than a multiple-choice selection, there’s far less opportunity to get the answer right by chance than on many benchmarks.

As soon as I got the news about the new leaderboard, I added it to my AI Strategy tool, which you can view filtered for the ‘Solve math problems’ task. As always, I provide tips, links, and benchmark definitions in the leaderboard modal.


What I Like About It

A few things I appreciate about MathArena’s leaderboard:

  • At least at this time, the organizers run these evaluations themselves and don’t allow self-reporting.
  • They published their leaderboard early enough that it’s unlikely models had a chance to train on the dataset (i.e., the problems and answers). Model creators aren’t supposed to do this anyway since it causes contamination, but some clearly do it regardless. I’ve addressed this issue here and here (and have more posts on it scheduled). One leaderboard actually calls out models it suspects of contamination, which I can’t love enough; I use it as a proxy when some of these models make outrageous claims. Another leaderboard discloses which models self-report, which I also use as a reference when deciding how many grains of salt I’m willing to allocate to a model creator’s claim.
  • They run each problem through the model four times and calculate the average accuracy score (see the evaluation sketch after this list). The answers from DeepSeek’s 1.5B-parameter model varied wildly from one pass to the next, which suggests it was taking stabs in the dark rather than consistently following a well-reasoned path to the solution.
  • They turned the table into a heatmap. You can get a breakdown of the legend in the leaderboard modal in my strategy app or in the leaderboard’s FAQ section, but this functionality is quite clever imo, as is their four-pass methodology.
  • Instead of estimating the cost, they record the actual input and output tokens as well as the actual cost (the cost sketch after this list shows the arithmetic). The input tokens range from 190 to 218, but the output tokens range from 549 to 14,623, which is just wild. Also, o1’s pricing logic seems pretty out of touch compared to the other models.
  • DeepSeek’s larger R1 model performed decently: not great, but in the acceptable range. My criticism here targets the 1.5B model.
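
To make that four-pass methodology concrete, here’s a minimal sketch of what such an evaluation loop might look like in Python. This is my own illustration, not MathArena’s code: query_model is a hypothetical stand-in for a real model call, and the toy problem statements and answers are made up.

import random
from statistics import mean

def query_model(problem: str) -> int:
    # Hypothetical stand-in for a real model/API call.
    # AIME answers are integers from 0 to 999, so we fake one here.
    return random.randint(0, 999)

def evaluate(problems: dict[str, int], passes: int = 4) -> float:
    # Run every problem `passes` times and average the per-pass accuracy.
    per_pass_scores = []
    for _ in range(passes):
        correct = sum(
            query_model(statement) == answer
            for statement, answer in problems.items()
        )
        per_pass_scores.append(correct / len(problems))
    return mean(per_pass_scores)

if __name__ == "__main__":
    # Placeholder statements and answers, not real AIME data.
    toy_problems = {"Problem 1: ...": 123, "Problem 2: ...": 456}
    print(f"Average accuracy over 4 passes: {evaluate(toy_problems):.1%}")

Running each problem multiple times also surfaces the consistency issue: a model that actually reasons its way to the answer lands on the same value pass after pass, while a model that’s guessing swings wildly between passes.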
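
Along the same lines, because the leaderboard records actual input and output tokens, the cost column is easy to sanity-check yourself. Below is a tiny sketch; the token counts are the extremes mentioned in the bullet above, and the per-million-token rates are placeholders for illustration only, not any provider’s actual pricing.

def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    # Dollar cost of a single request, given per-million-token rates.
    return (
        input_tokens / 1_000_000 * input_rate_per_m
        + output_tokens / 1_000_000 * output_rate_per_m
    )

# 218 input and 14,623 output tokens (the extremes above), placeholder rates.
print(f"${request_cost(218, 14_623, input_rate_per_m=15.0, output_rate_per_m=60.0):.4f}")

Because output tokens dwarf input tokens for reasoning models, the output rate dominates the bill, which is why a 14,000-token chain of thought gets expensive fast.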

Actual Model Performance

Below is a screenshot of the leaderboard table, with the larger DeepSeek models ranging from 50% to 65%. Not in o1/o3 territory, but nipping at their heels. However, the much smaller DeepSeek-R1-Distill-Qwen-1.5B did not perform well at all.

Only two models—gpt-4o and claude-3.5-sonnet—performed worse. I wasn’t surprised about Claude. I personally experienced its math deficiency as recently as Wednesday, when I was adding the new leaderboard to my strategy app.


To be fair though, it’s been my experience that Claude 3.5 Sonnet is by far the best model for coding tasks. I love it because it doesn’t pontificate. Working with it feels like having a coding mentor who can walk me through any bug, feature clash, or faulty logic.

So Why Call Out DeepSeek?

When I did a search to see when the AIME 2025 problems were released, I found several posts extolling the virtues of DeepScaleR-1.5B-Preview, a language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B. The emphasis is on how well it performs with only 1.5B parameters (a very small model), but their references are to the AIME 2024 test questions.

DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B using distributed reinforcement learning (RL) to scale up to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, representing a 15% improvement over the base model (28.8%) and surpassing OpenAI’s O1-Preview performance with just 1.5B parameters.

To be fair, the DeepScaleR-1.5B-Preview model wasn’t included in the AIME 2025 leaderboard, so they couldn’t have referenced it. But the pressing question, for me at least, is why anyone is talking about a model’s performance on a dataset that’s been publicly available for a year. Even if these models didn’t intentionally train on AIME problems and solutions from previous years (which are readily available on Art of Problem Solving’s website as well as Kaggle), the chances of those problems turning up in other training sources are quite high.

So there was a very small window of opportunity in which this benchmark would be relevant and largely free from model contamination, with some hope that accuracy scores would generalize to unseen data. But touting a model’s performance on last year’s test with the 2025 results hot off the presses, and not much of a flex for DeepSeek’s 1.5B model, is odd at best and a case of smoke and mirrors at worst.

Conclusion

We need to be a little more skeptical about model creators’ claims if they aren’t verified by a third-party benchmark organizer. The MathArena team also said they plan to run evals on more models. I personally think this is unfortunate, as the chances of model contamination increase with each day that passes, and leaderboards often don’t disclose whether they allow models to self-report. With an arms race underway to be the first to lay claim to math mastery, one of the more difficult frontiers for LLMs, it would take an extraordinary amount of self-control not to consume these test problems and solutions as soon as they become publicly available.

That said, I’m glad that DeepSeek is applying pressure to US model creators to bring their model costs under control. I have several posts scheduled that address OpenAI’s out-of-touch price points for o1 on a number of fronts. Below is a sneak peek at just one example comparing its o1-2024-12-17 agent to Google’s gemini-2.0-flash-001 agent. Gemini outperforms it at 1% of o1’s estimated cost (source).


Photo credit: Elimende Inagella
