DeepSeek r1 vs OpenAI o1
It seems that a bandwagon is currently whizzing past me, so I suppose I should get on it and post something about the hype generated by DeepSeek. Many news articles I've read have said that DeepSeek's r1 model is almost equivalent to OpenAI's o1 in terms of reasoning and responses. I've conducted a few tests of my own, and I'll present my findings here.
First, though, I want to mention the incredible feat of bringing such a full-featured LLM to market. By all accounts, this was a side project (!) for the founder, Liang Wenfeng, and reportedly cost just $6 million to build and train. If those reports are accurate, then it's easy to see why so much money has been wiped off US tech stocks - especially after the announcement of the $500 billion AI infrastructure fund. That sum could build DeepSeek more than 83,000 times over!
Is r1 any good?
This is the five hundred billion dollar question. I gave both r1 and o1 the same tasks with identical prompts. If you're doing this yourself, make sure you've clicked the DeepThink (R1) button; otherwise you're using the quicker, but less capable, model.
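If you'd rather script the comparison than flick between two browser tabs, both services also expose chat completion APIs. Here's a minimal sketch; the endpoints, model names and environment variable names are my assumptions rather than anything from these tests, so check the current documentation before relying on them.

```powershell
# Illustrative only: send the same prompt to both models over their HTTP APIs.
# Endpoints, model names and environment variables here are assumptions,
# not something taken from the tests above.

$prompt = "Your test prompt goes here"

$targets = @(
    @{ Name = 'DeepSeek r1'; Uri = 'https://api.deepseek.com/chat/completions';
       Model = 'deepseek-reasoner'; Key = $env:DEEPSEEK_API_KEY },
    @{ Name = 'OpenAI o1'; Uri = 'https://api.openai.com/v1/chat/completions';
       Model = 'o1'; Key = $env:OPENAI_API_KEY }
)

foreach ($t in $targets) {
    # Build an OpenAI-style chat completion request body
    $body = @{
        model    = $t.Model
        messages = @(@{ role = 'user'; content = $prompt })
    } | ConvertTo-Json -Depth 5

    $response = Invoke-RestMethod -Uri $t.Uri -Method Post `
        -ContentType 'application/json' `
        -Headers @{ Authorization = "Bearer $($t.Key)" } `
        -Body $body

    # Print each model's answer under a simple header
    "--- $($t.Name) ---"
    $response.choices[0].message.content
}
```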
The first task was to summarise a document and to come up with an outline for a fictitious essay. Both summaries were very good, with all of the main points covered. The essay outlines were actually surprisingly similar. I suppose my prompt could have been considered a little leading, but the outputs were exactly what I was looking for. If these two LLMs had been students in my class, I'd have assumed one was copying the other. So, after round 1, I think it's all even: a point each to r1 and o1.
The second task was to optimise and refactor a PowerShell script. Personally, I consider PowerShell to have all the elegance and beauty of Perl after someone has smashed its face in with a brick, but fortunately LLMs don't have such qualms.
Again, the outputs were very similar - even down to the same choice of variable names. So, for bonus points, I asked them to add some additional functionality to the script. Here is where things got interesting. Both LLMs added the new functionality, and both realised that this would also affect some existing code. o1 surrounded that existing code with an if clause and added the new functionality at the end. r1 thought for a lot longer - it printed its thinking on screen for over 5 minutes - before coming up with a solution that was elegant and outside the box, but it did leave in some code that was rendered superfluous by its refactoring.
Both scripts would (and did) run, but o1's output didn't contain redundant code and was cleaner and more user-friendly (as user-friendly as a PowerShell script can ever get, anyway).
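To make that concrete, here's a minimal, entirely hypothetical sketch of the pattern o1 went for. The task, parameter names and CSV option below are my own inventions for illustration, not the script either model actually produced; the point is simply that the existing behaviour gets guarded by an if clause and the new functionality is added at the end, with nothing left redundant.

```powershell
# Hypothetical example of the "wrap the old code in an if clause, add the new feature" pattern.
param(
    [string]$Path = '.',
    [switch]$ExportCsv   # imagined new feature: write the summary to CSV instead of the console
)

# Existing behaviour: summarise files in a folder by extension
$summary = Get-ChildItem -Path $Path -File |
    Group-Object Extension |
    Select-Object Name, Count,
        @{ N = 'TotalKB'; E = { [math]::Round(($_.Group | Measure-Object Length -Sum).Sum / 1KB, 1) } }

if (-not $ExportCsv) {
    # Original console output, now guarded so it only runs when CSV export isn't requested
    $summary | Format-Table -AutoSize
}
else {
    # New functionality appended at the end
    $csvPath = Join-Path $Path 'summary.csv'
    $summary | Export-Csv -Path $csvPath -NoTypeInformation
    Write-Host "Summary written to $csvPath"
}
```

Run it as, say, .\summary.ps1 -ExportCsv to exercise the new branch; without the switch it behaves exactly as before.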
After round 2, then, both models earn a point for a working script, but o1 gets an additional one for the cleaner result. So r1 now has 2 points, while o1 has 3.
It's still all to play for in the final round! For this, I decided to use my tried, tested and definitely unscientific method of asking them cryptic crossword clues. I tried the same prompt I'd used with o1 in the linked article.
r1 thought for a long time - 52 seconds compared to o1's 12 seconds - but the final answer was correct.
In fact, r1 was able to solve a clue that o1 couldn't: a tricky one from The Guardian a few weeks ago, "Podcast's out a fraction (6)". Initially, r1 and o1 both gave the same response: they decided it was an anagram clue and tried to form a six-letter anagram from "Podcast". However, r1 realised there was an error in its thinking. I gave both models the same hint to try to redirect them; o1 doubled down and told me I was wrong, while r1 thought for another mammoth stretch of time before finally giving me the correct answer and an explanation of why it thought the clue was problematic. (The answer is FOURTH, by the way. You can check the comments at Fifteensquared for the reasons why.)
The next test was to ask r1 to create a clue for the word SKITTER as I did in my previous article. You might remember that o1's solution was good, but slightly inaccurate because it left a T unaccounted for. Here was r1's response:
It's... not bad. I think "end of litter" is a little imprecise, but it accounted for all of the letters. Objectively, I think it did a better job than o1 in the cryptic crossword round, so I'll give o1 1 point for this round and r1 2 points (although I'm actually keen to give it more).
The overall result, then, is a draw: 4 points each!
Conclusions
So, which one - if either - is better? That's hard to say. The o1 model is definitely faster, but r1 seems more thorough. When you compare the costs - $0 per month for DeepSeek against $20 per month for ChatGPT - the choice seems clearer. Is o1 worth $20 per month more than r1? Not in my book.
Matt's rule of AI still applies - don't ask it anything that you don't already know (or understand) the answer to. I'm still going to triple-check any code it produces, but I'm going to give r1 a shot as my daily driver LLM model to see how it performs longer term.
Comments
Founder @ Omni Academy | Software Engineering Researcher, Agile & Scrum Trainer (1 month ago):
I liked your Matt rule of AI: "Don't ask it anything that you don't already know (or understand) the answer to."
Doing what I can to help people make the most of tech (1 month ago):
That was a really cool cryptic clue! I tried to replicate your experiment with o1, and it took a whopping 7m4s, but it actually did manage to solve it for me, and I'm impressed at how it persevered (I would have abandoned it much sooner). https://gist.github.com/yoniLavi/615f61c4fca12a22a7d7d3fa9d4f988d
Junior Alteryx Developer at Crowe UK | Data Analytics & Automation | 20+ Years of Programming and IT Technical Experience (1 month ago):
That was really interesting and matches up with what I've read online from other sources too.
Fullstack software/web development trainee | This month's project: "Cinema booking site Frontend/Backend" (1 month ago):
Thank you for this, Matt! I'm curious which of the two would perform better at remembering a given piece of information or rule throughout a session and sticking to it.
Founder, Innovator & Software Developer (1 month ago):
Matt Rudge - really interesting stuff. When testing using previously published crossword clues, how can we differentiate between reasoning and clever archive retrieval?