DeepSeek’s Abysmal Performance with the AIME 2025 Math Benchmark
To be clear, I’m talking about the DeepSeek-R1-Distill-Qwen-1.5B model, not R1.

What Is the AIME Benchmark?

The American Invitational Mathematics Examination (AIME) is the second exam in the series of exams used to challenge mathletes competing for a spot on the team that represents the US at the International Mathematical Olympiad. While most AIME participants are high school students, some bright middle school students also qualify each year. The exam is administered by the Mathematical Association of America.

This year’s AIME was held on February 6th, and the problems and answers were published immediately afterwards; they’re already being bandied about on various YouTube channels, forums, and blogs.

MathArena’s AIME Leaderboard

What Is It?

The MathArena team jumped on this dataset and worked against the clock to run evaluations on the 2025 problems before models could start training on them. Since these are challenging math problems, the set makes for an excellent benchmark of how well models reason through complex, multi-step problems. And because every AIME answer is an integer from 0 to 999 rather than a multiple-choice selection, there’s far less opportunity to get the answer right by chance than on many benchmarks.

As soon as I got the news about the new leaderboard, I added it to my AI Strategy tool, which you can view filtered for the ‘Solve math problems’ task. As always, I provide tips, links, and benchmark definitions in the leaderboard modal.


What I Like About It

A few things I appreciate about MathArena’s leaderboard:

  • At least at this time, the organizers run these evaluations themselves and don’t allow self-reporting.
  • They published their leaderboard early enough that it’s unlikely models had a chance to train on the dataset (i.e., the problems and answers). Model creators aren’t supposed to do this anyway since it causes contamination, but some clearly do it regardless. I’ve addressed this issue here and here (and have more posts on it scheduled). One leaderboard actually calls out models it suspects of contamination, which I can’t love enough; I use it as a proxy when some of these models make outrageous claims. Another leaderboard discloses which models self-report, which I also use as a reference when deciding how many grains of salt I’m willing to allocate to a model creator’s claim.
  • They run each problem through the model four times and calculate the average accuracy score (see the evaluation sketch after this list). The answers from DeepSeek’s 1.5B-parameter model varied wildly from one pass to the next, which suggests it was taking stabs in the dark rather than consistently following a well-reasoned path to the solution.
  • They turned the table into a heatmap. You can get a breakdown of the legend in the leaderboard modal in my strategy app or in the leaderboard’s FAQ section, but this functionality is quite clever imo, as is their four-pass methodology.
  • Instead of estimating the cost, they record the actual input and output tokens as well as the actual cost (the cost sketch after this list shows the arithmetic). The input tokens range from 190 to 218, but the output tokens range from 549 to 14,623, which is just wild. Also, o1’s pricing logic seems pretty out of touch compared to the other models.
  • DeepSeek’s larger R1 model performed decently: not great, but in the acceptable range. My criticism here targets the 1.5B model.
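
To make that four-pass methodology concrete, here’s a minimal sketch of what such an evaluation loop might look like in Python. This is my own illustration, not MathArena’s code: query_model is a hypothetical stand-in for a real model call, and the toy problem statements and answers are made up.

import random
from statistics import mean

def query_model(problem: str) -> int:
    # Hypothetical stand-in for a real model/API call.
    # AIME answers are integers from 0 to 999, so we fake one here.
    return random.randint(0, 999)

def evaluate(problems: dict[str, int], passes: int = 4) -> float:
    # Run every problem `passes` times and average the per-pass accuracy.
    per_pass_scores = []
    for _ in range(passes):
        correct = sum(
            query_model(statement) == answer
            for statement, answer in problems.items()
        )
        per_pass_scores.append(correct / len(problems))
    return mean(per_pass_scores)

if __name__ == "__main__":
    # Placeholder statements and answers, not real AIME data.
    toy_problems = {"Problem 1: ...": 123, "Problem 2: ...": 456}
    print(f"Average accuracy over 4 passes: {evaluate(toy_problems):.1%}")

Running each problem multiple times also surfaces the consistency issue: a model that actually reasons its way to the answer lands on the same value pass after pass, while a model that’s guessing swings wildly between passes.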
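
Along the same lines, because the leaderboard records actual input and output tokens, the cost column is easy to sanity-check yourself. Below is a tiny sketch; the token counts are the extremes mentioned in the bullet above, and the per-million-token rates are placeholders for illustration only, not any provider’s actual pricing.

def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    # Dollar cost of a single request, given per-million-token rates.
    return (
        input_tokens / 1_000_000 * input_rate_per_m
        + output_tokens / 1_000_000 * output_rate_per_m
    )

# 218 input and 14,623 output tokens (the extremes above), placeholder rates.
print(f"${request_cost(218, 14_623, input_rate_per_m=15.0, output_rate_per_m=60.0):.4f}")

Because output tokens dwarf input tokens for reasoning models, the output rate dominates the bill, which is why a 14,000-token chain of thought gets expensive fast.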

Actual Model Performance

Below is a screenshot of the leaderboard table, with the larger DeepSeek models ranging from 50% to 65%. Not in o1/o3 territory, but nipping at their heels. However, the much smaller DeepSeek-R1-Distill-Qwen-1.5B did not perform well at all.

Only two models—gpt-4o and claude-3.5-sonnet—performed worse. I wasn’t surprised about Claude. I personally experienced its math deficiency as recently as Wednesday, when I was adding the new leaderboard to my strategy app.


To be fair though, it’s been my experience that Claude 3.5 Sonnet is by far the best model for coding tasks. I love it because it doesn’t pontificate. Working with it feels like having a coding mentor who can walk me through any bug, feature clash, or faulty logic.

So Why Call Out DeepSeek?

When I did a search to see when the AIME 2025 problems were released, I found several posts extolling the virtues of DeepScaleR-1.5B-Preview, a language model fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B. The emphasis is on how well it performs with only 1.5B parameters (a very small model), but their references are to the AIME 2024 test questions.

DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B using distributed reinforcement learning (RL) to scale up to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, representing a 15% improvement over the base model (28.8%) and surpassing OpenAI’s O1-Preview performance with just 1.5B parameters.

To be fair, the DeepScaleR-1.5B-Preview model wasn’t included in the AIME 2025 leaderboard, so they couldn’t have referenced it. But the pressing question, for me at least, is why anyone is talking about a model’s performance on a dataset that’s been publicly available for a year. Even if these models didn’t intentionally train on AIME problems and solutions from previous years (which are readily available on Art of Problem Solving’s website as well as Kaggle), the chances of those problems turning up in other training sources are quite high.

So there was a very small window of opportunity in which this benchmark would be relevant and largely free from model contamination, with some hope that accuracy scores would generalize to unseen data. But touting a model’s performance on last year’s test with the 2025 results hot off the presses, and not much of a flex for DeepSeek’s 1.5B model, is odd at best and a case of smoke and mirrors at worst.

Conclusion

We need to be a little more skeptical about model creators’ claims if they aren’t verified by a third-party benchmark organizer. The MathArena team also said they plan to run evals on more models. I personally think this is unfortunate, as the chances of model contamination increase with each day that passes, and leaderboards often don’t disclose whether they allow models to self-report. With an arms race underway to be the first to lay claim to math mastery, one of the more difficult frontiers for LLMs, it would take an extraordinary amount of self-control not to consume these test problems and solutions as soon as they become publicly available.

That said, I’m glad that DeepSeek is applying pressure to US model creators to bring their model costs under control. I have several posts scheduled that address OpenAI’s out-of-touch price points for o1 on a number of fronts. Below is a sneak peek at just one example comparing its o1-2024-12-17 agent to Google’s gemini-2.0-flash-001 agent. Gemini outperforms it at 1% of o1’s estimated cost (source).


Photo credit: Elimende Inagella
