DeepSeek r1 vs OpenAI o1

It seems that a bandwagon is currently whizzing past me, so I suppose I should get on it and post something about the current hype generated by DeepSeek. Many news articles I've read have said that DeepSeek's r1 model is almost equivalent to OpenAI's o1 in terms of reasoning and responses. I've conducted a few tests of my own, and I'll present my findings here.

First, though, I want to mention the incredible feat of bringing such a full-featured LLM to market. By all accounts, this was a side project (!) for the founder, Liang Wenfeng, and it reportedly cost just $6 million to build and train the model. If those reports are accurate, then it's easy to see why so much money has been wiped off US tech stocks - especially after the announcement of the $500 billion AI infrastructure fund. That could build DeepSeek 83,000 times over!
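
That multiple is just the one reported figure divided by the other. A quick back-of-the-envelope check in Python (both numbers are the reported ones, not audited costs):

```python
# Sanity check of the "83,000 times over" claim, using the reported figures.
infrastructure_fund = 500_000_000_000  # the $500 billion AI infrastructure fund
deepseek_cost = 6_000_000              # reported cost to build and train the model

print(f"{infrastructure_fund / deepseek_cost:,.0f}x")  # 83,333x
```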

Is r1 any good?

This is the five-hundred-billion-dollar question. I gave the same tasks to both r1 and o1 with identical prompts. If you're doing this yourself, make sure you've clicked the DeepThink (R1) button; otherwise, you're using the quicker but less capable model.

The first task was to summarise a document and to come up with an outline for a fictitious essay. Both summaries were very good, with all of the main points covered. The essay outlines were surprisingly similar. I suppose my prompt could have been considered a little leading, but the outputs were exactly what I was looking for. If these two LLMs had been students in my class, I'd have assumed one was copying the other. So, after round 1, I think it's all even. A point to both r1 and o1.

The second task was to optimise and refactor a PowerShell script. Personally, I consider PowerShell to have all the elegance and beauty of Perl after someone has smashed its face in with a brick, but fortunately LLMs don't have such qualms.

Again, the outputs were very similar - even down to the same choice of variable names. So, for bonus points, I asked them to add some additional functionality to the script. Here is where things got interesting. Both LLMs added the new functionality, and both realised that this would also affect some existing code. o1 surrounded the affected code with an if clause and added the new functionality at the end. r1 thought for a lot longer - it printed its thinking on screen for over 5 minutes before coming up with a solution that was elegant and outside the box, but it did leave in some code that was rendered superfluous by its refactoring.

Both scripts would (and did) run, but o1's output didn't contain redundant code and was cleaner and more user-friendly (as user-friendly as a PowerShell script can ever get, anyway).
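
To make that difference concrete, here's a purely hypothetical sketch of the two patterns. The actual script was PowerShell and isn't reproduced here; every name and function below is invented for illustration.

```python
def transform(item):
    """Stand-in for the script's existing per-item work (invented for this sketch)."""
    return item * 2

def annotate(result):
    """Stand-in for the newly requested functionality (also invented)."""
    return {"value": result, "flagged": True}

# o1's pattern: guard the change with an if clause, leave the existing code
# untouched, and bolt the new functionality on at the end.
def process_o1(items, new_feature=False):
    results = [transform(item) for item in items]  # existing behaviour, unchanged
    if new_feature:
        results = [annotate(r) for r in results]   # new functionality appended
    return results

# r1's pattern: restructure the flow into a pipeline (more elegant), but leave
# behind a helper that nothing calls any more - the superfluous code.
def process_r1(items, new_feature=False):
    pipeline = [transform] + ([annotate] if new_feature else [])
    results = list(items)
    for step in pipeline:
        results = [step(r) for r in results]
    return results

def legacy_process(items):  # dead code after the refactor, but still in the file
    return [transform(item) for item in items]

print(process_o1([1, 2, 3], new_feature=True))
print(process_r1([1, 2, 3], new_feature=True))  # same output, different structure
```

Both versions produce the same result, which mirrors what happened in the test: both scripts ran, but only one left dead weight behind.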

After round 2, then, both earn a point for working scripts, but I have to give an additional point to o1 for the cleaner output. So, r1 now has 2 points, while o1 has 3.

It's still all to play for in the final round! For this, I decided to use my tried, tested and definitely unscientific method of asking them cryptic crossword clues. I started with the same prompt I'd used with o1 in the linked article:

[Screenshot: r1 solving the crossword clue - "The return of the wicked ox"]

As you can see, it thought for a long time - 52 seconds compared to o1's 12 seconds - but the final answer was correct.

[Screenshot: the correct answer, X MARKS THE SPOT. X marks the spot again!]

In fact, r1 was able to solve a clue that o1 couldn't: a tricky one from The Guardian a few weeks ago, "Podcast's out a fraction (6)". Initially, r1 and o1 both gave the same response: they decided that it was an anagram clue and tried to form a 6-letter anagram from "Podcast". However, r1 realised there was an error in its thinking. I gave both models the same hint to try to redirect their thinking. Here is where o1 doubled down and told me I was wrong, while r1 thought for another mammoth amount of time before finally giving me the correct answer and an explanation of why it thought the clue was problematic. (The answer is FOURTH, by the way. You can check the comments at Fifteensquared for the reasons why.)
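
You can see why the anagram reading was doomed with a quick letter count: FOURTH needs four letters that "Podcast's" simply doesn't contain. A throwaway check (this helper is mine, not something either model produced):

```python
from collections import Counter

def letters_available(candidate: str, fodder: str) -> bool:
    """Can every letter of `candidate` be drawn from `fodder`?"""
    need = Counter(candidate.upper())
    have = Counter(c for c in fodder.upper() if c.isalpha())
    return all(have[ch] >= n for ch, n in need.items())

print(letters_available("FOURTH", "Podcast's"))  # False - F, U, R and H are missing
```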

The next test was to ask r1 to create a clue for the word SKITTER, as I did in my previous article. You might remember that o1's solution was good but slightly inaccurate, because it left a T unaccounted for. Here was r1's response:

[Screenshot: r1 has a go at making a clue for SKITTER]

It's... not bad. I think "end of litter" is a little imprecise, but it included all of the letters. Objectively, I think it has done a better job than o1 in the cryptic crossword round. I'll give o1 1 point for this round and r1 gets 2 points (although I am actually keen to give it more).

The overall result, then, is a draw! 4 points each.

Conclusions

So, which one - if either - is better? That's hard to say. The o1 model is definitely faster, but r1 seems more thorough. And when you compare the costs ($0 per month for DeepSeek against $20 per month for ChatGPT), the choice seems clearer still. Is o1 worth $20 per month more than r1? Not in my book.

Matt's rule of AI still applies - don't ask it anything that you don't already know (or understand) the answer to. I'm still going to triple-check any code it produces, but I'm going to give r1 a shot as my daily-driver LLM to see how it performs longer term.

Mirko Perkusich

Founder @ Omni Academy | Software Engineering Researcher, Agile & Scrum Trainer

1 month

I liked your Matt rule of AI: "Don't ask it anything that you don't already know (or understand) the answer to."

Yoni Lavi

Doing what I can to help people make the most of tech

1 month

That was a really cool cryptic clue! I tried to replicate your experiment with o1, and it took it a whopping 7m4s, but it actually did manage to solve it for me, and I'm impressed at how it persevered (I would have abandoned it much sooner): https://gist.github.com/yoniLavi/615f61c4fca12a22a7d7d3fa9d4f988d

Barrie Millar

Junior Alteryx Developer at Crowe UK | Data Analytics & Automation | 20+ Years of Programming and IT Technical Experience

1 month

That was really interesting and matches up with what I've read online from other sources too.

Ali Elhaj

Fullstack software/web development trainee | This month's project: "Cinema booking site Frontend/Backend"

1 month

Thank you for this, Matt! I'm curious about which of the two would perform better in terms of remembering a given piece of info or rule throughout the session and still sticking to it.

Joe Ashton

Founder, Innovator & Software Developer

1 month

Matt Rudge - really interesting stuff. When testing using previously published crossword clues, how can we differentiate between reasoning and clever archive retrieval?


More articles by Matt Rudge

  • LLMs vs The Cryptic Crossword

    Some of you may know that I love solving, and occasionally setting, cryptic crossword puzzles. In fact, my connection…

  • The Life-Changing Magic of Finding Things Out For Yourself!

    Sometimes it’s hard to believe that we were able to create code without the help of Stack Overflow and other…
