Deepseek R1 (full model) v Claude Sonnet: coding, linguistic and big picture tasks
Courtesy of Microsoft Azure, I managed to get access to the full Deepseek R1 671B model last night, without sending any data to Deepseek.
Being the slightly excitable nerd I am, I jumped straight in and decided to spend a few hours comparing Deepseek R1 and Claude Sonnet on some real tasks we are either running every day – or that I have on my to-do list.
In hindsight, I should have documented the exact examples as I went, but I was excited and didn't! Maybe I'll follow up with another article with some actual examples as I do more testing. Either way, I think the takeaways I've written below should prove interesting!
Before I get into it, just a note on cost comparisons. I have looked at costs on the assumption you are "paying as you go" - i.e. API pricing. Obviously, if you have a subscription now or in the future that includes Claude Sonnet or Deepseek R1, this becomes slightly irrelevant. But right now it's the best comparison, and probably a good indicator of what a subscription would cost, relatively speaking.
Tough linguistic/legal tasks (winner: Claude Sonnet)
Our application currently uses Claude Sonnet for a range of fairly tricky linguistic/contract tasks, and we have data on what has worked well and what has not worked well in the past. So it made a great test.
For context, our use of LLMs in our app is EXTREMELY structured. There are no chat windows. We dynamically build prompts to a formula based on user "commands". All data that goes in is structured, and all data that comes out is even more structured. So it's much more about understanding language than writing it.
So, everything is pretty deterministic. An even better test!
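To make the "no chat windows" pattern concrete, here is a minimal sketch of what command-driven, structured LLM usage can look like. Everything in it (the `COMMANDS` table, `build_prompt`, `parse_response`, the example command name) is illustrative, not taken from the actual app: prompts are assembled deterministically from a per-command template, and the model's reply is rejected unless it matches the expected shape.

```python
# Sketch of structured, command-driven LLM usage (illustrative names only):
# a fixed prompt template per "command", and schema validation on the output.
import json

# Each command maps to a rigid prompt template - no free-form chat.
COMMANDS = {
    "extract_parties": (
        "List every party named in the contract text below. "
        'Respond ONLY with JSON: {{"parties": ["..."]}}.\n\n{clause}'
    ),
}

def build_prompt(command: str, clause: str) -> str:
    """Deterministically assemble the prompt for a given command."""
    return COMMANDS[command].format(clause=clause)

def parse_response(raw: str) -> dict:
    """Reject any model output that isn't the structured shape we asked for."""
    data = json.loads(raw)
    if "parties" not in data or not isinstance(data["parties"], list):
        raise ValueError("model output failed schema check")
    return data

prompt = build_prompt(
    "extract_parties",
    "This Agreement is between Acme Ltd and Beta GmbH.",
)
# In a real app the prompt would go to the model API; here we simulate
# a well-formed reply to show the validation step.
simulated_reply = '{"parties": ["Acme Ltd", "Beta GmbH"]}'
result = parse_response(simulated_reply)
print(result["parties"])  # ['Acme Ltd', 'Beta GmbH']
```

Because both the input and the expected output are pinned down this tightly, swapping the underlying model in and out (Claude Sonnet for Deepseek R1, say) becomes a fair like-for-like test.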
How did it go?
TLDR: Claude Sonnet wins here for me. There's no real difference in accuracy or quality of output, but there's a huge difference in speed. I don't think I will be that fazed by the cost difference.
I don't think we'll be using Deepseek R1 for anything in our app anytime soon, unless a use case comes along that plays into its advantages a little more (and there are some - which I found in my other tests below...).
Coding tasks (winner: draw - but with nuances)
This was probably my favourite test, and where I see myself reaching for Deepseek R1 a fair bit.
I personally do not like using LLMs for big coding tasks. It's virtually never in the style that you want, and honestly it just takes me longer to check than it would to write.

So what do I use them for? Scaffolding, refactoring and writing very specific bits of code that are annoying and fiddly to write.

I thought the latter would be most interesting - writing fiddly code. So I tried it on writing some pretty gnarly data processing that involved lots of iteration.
If you aren't a developer, a bit of context: iteration done badly can have a HUGE impact on performance. So quality is important.
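As an illustration of the kind of iteration pitfall this refers to (not the actual code from my test), matching two datasets with a nested loop does quadratic work, while building an index first keeps it linear. This is exactly the sort of thing an LLM can quietly get wrong:

```python
# Illustrative example of iteration done badly vs. well.
# Nested loops over two lists are O(n*m); indexing by key is O(n+m).

records = [{"id": i, "value": i * 2} for i in range(1000)]
updates = [{"id": i, "delta": 1} for i in range(0, 1000, 2)]

def apply_updates_slow(records, updates):
    """Bad: scans every record for every update (quadratic time)."""
    for u in updates:
        for r in records:
            if r["id"] == u["id"]:
                r["value"] += u["delta"]
    return records

def apply_updates_fast(records, updates):
    """Better: build a lookup table once, then apply each update in O(1)."""
    by_id = {r["id"]: r for r in records}
    for u in updates:
        by_id[u["id"]]["value"] += u["delta"]
    return records
```

Both functions produce identical results, but on large inputs the difference in runtime is dramatic - which is why output quality on iteration-heavy code was the thing I was checking for.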
So, if you are prepared to take a little bit of time to write some instructions and think about what context to provide, there's very little in it in terms of output.
Whilst the slowness of Deepseek R1 breaks my flow state, writing instructions and context does as well. So in future I'll probably stick with Claude Sonnet for simple tasks for speed - and reach for Deepseek R1 for gnarlier tasks.
TLDR: I'll probably be using both here. Which is awesome, because it means I can avoid spending $300 per month on ChatGPT Pro (which I was never going to do!).
Big picture tasks (winner: Deepseek R1 (I think?))
Going into this task, I thought this would be the most exciting use case. The results made me both underwhelmed and also excited...
I took two different tasks here - one in the legal domain and another in software engineering:
General observations:
I really need to try Deepseek R1 with some more of these tasks. I was so excited by the legal variant, but the software engineering one was really disappointing. I do however suspect that was a quirk, and if I try it with other software engineering big picture tasks I will get better results...
So in conclusion...
For our application's use case, we'll be sticking with Claude Sonnet. There won't be a huge cost difference given the way we use it, and speed is more important to us.
For everything else, I will probably continue using Claude Sonnet where speed / maintaining flow state is paramount. But I'll definitely be reaching for Deepseek R1 more.
So whilst in places I was pretty underwhelmed by Deepseek R1, I am excited by it. Mostly because it means I don't need to keep umming and ahhing over whether ChatGPT's $300 a month price tag is worth it. I'll instead be paying a few dollars a month for Azure Deepseek R1 API tokens...
I'll definitely be doing some more testing of both - and this time I will keep the examples to share!