Deepseek R1 (full model) v Claude Sonnet: coding, linguistic and big picture tasks

Courtesy of Microsoft Azure, I managed to get access to the full Deepseek R1 671B model last night, without sending any data to Deepseek.

Being the slightly excitable nerd I am, I jumped straight in and decided to spend a few hours comparing Deepseek R1 and Claude Sonnet on some real tasks - ones we either run every day or that sit on my to-do list.

In hindsight, I should have documented the exact examples as I went, but I was excited and didn't! Maybe I'll follow up with another article with some actual examples as I do more testing. Either way, I think the takeaways below should prove interesting!

Before I get into it, just a note on cost comparisons. I have looked at costs on the assumption you are "paying as you go" - i.e. API pricing. Obviously, if you have a subscription now or in the future that includes Claude Sonnet or Deepseek R1, this becomes somewhat irrelevant. But right now it's the best comparison, and probably a good indicator of what a subscription would cost, relatively speaking.

Tough linguistic/legal tasks (winner: Claude Sonnet)

Our application currently uses Claude Sonnet for a range of fairly tricky linguistic/contract tasks, and we have data on what has worked well and what has not worked well in the past. So it made a great test.

For context, our use of LLMs in our app is EXTREMELY structured. There are no chat windows. We dynamically build prompts to a formula based on user "commands". All data that goes in is structured, and all data that comes out is even more structured. So it's much more about understanding language than writing it.

So, everything is pretty deterministic. An even better test!
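To make that concrete, here's a minimal sketch of the pattern - the command name, template and schema are hypothetical, simplified stand-ins rather than our actual code:

```python
import json

# Hypothetical registry mapping a user "command" to a prompt template.
# Every template demands JSON matching a fixed schema - never free-form chat.
COMMAND_TEMPLATES = {
    "classify_clause": (
        "You are reviewing a contract clause.\n"
        "Clause text:\n{clause_text}\n\n"
        "Return ONLY JSON of the form: "
        '{{"clause_type": "<string>", "risk_level": "low|medium|high"}}'
    ),
}

def build_prompt(command: str, **fields) -> str:
    """Dynamically build a prompt from a command plus structured input data."""
    return COMMAND_TEMPLATES[command].format(**fields)

def parse_response(raw: str) -> dict:
    """Validate the model's output against the schema the prompt asked for."""
    result = json.loads(raw)
    if set(result) != {"clause_type", "risk_level"}:
        raise ValueError(f"unexpected keys: {set(result)}")
    return result
```

Because both input and output are structured, the two models can be scored on exact agreement with known-good answers rather than on vibes.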

How did it go?

  • To make it an interesting test, I took a bunch of previous examples that we knew didn't perform well with Claude Sonnet, and stripped out all of the tuning we've since done to the context.
  • I ran those prompts through both Claude Sonnet and Deepseek R1 in parallel (a sketch of that kind of harness follows this list).
  • Speed: Claude Sonnet took less than two seconds. Deepseek R1 took around one minute.
  • Accuracy: Deepseek R1 made all of the same mistakes that Claude Sonnet did, with no exceptions. It didn't do anything better.
  • Writing style: other than one command, this isn't a big thing in our app, so my sample size is small. But Deepseek R1's wording was more clumsy (IMO) - though it didn't affect accuracy.
  • Cost: hard to give exact figures at the moment, as Azure doesn't seem to be charging me! But looking at token usage, and assuming Azure applies a fair uplift to Deepseek's own API pricing, I'd guess Deepseek R1 will be cheaper - but not by as much as I initially thought. For this task we use very few output tokens, which are predominantly what makes Claude Sonnet expensive. Deepseek R1 uses a tonne more output tokens, as its reasoning counts as output.
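If you want to run a similar head-to-head, the sketch below shows the kind of harness involved. It uses the standard Anthropic and Azure AI Inference Python SDKs, but the model names, environment variables and per-million-token prices are placeholder assumptions on my part - check current pricing rather than trusting these numbers:

```python
import os
from concurrent.futures import ThreadPoolExecutor

from anthropic import Anthropic
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholder USD prices per million tokens - NOT real rates.
PRICES = {"sonnet": {"in": 3.00, "out": 15.00}, "r1": {"in": 1.35, "out": 5.40}}

def run_sonnet(prompt: str) -> dict:
    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return {"text": resp.content[0].text,
            "in": resp.usage.input_tokens, "out": resp.usage.output_tokens}

def run_r1(prompt: str) -> dict:
    client = ChatCompletionsClient(
        endpoint=os.environ["AZURE_R1_ENDPOINT"],  # your Azure AI Foundry deployment
        credential=AzureKeyCredential(os.environ["AZURE_R1_KEY"]),
    )
    resp = client.complete(messages=[UserMessage(content=prompt)], model="DeepSeek-R1")
    # R1's reasoning tokens are billed as completion (output) tokens.
    return {"text": resp.choices[0].message.content,
            "in": resp.usage.prompt_tokens, "out": resp.usage.completion_tokens}

def cost(usage: dict, model: str) -> float:
    p = PRICES[model]
    return (usage["in"] * p["in"] + usage["out"] * p["out"]) / 1_000_000

prompt = "..."  # one of the structured prompts described above
with ThreadPoolExecutor() as pool:
    sonnet_f, r1_f = pool.submit(run_sonnet, prompt), pool.submit(run_r1, prompt)
sonnet, r1 = sonnet_f.result(), r1_f.result()
print(f"Sonnet: ${cost(sonnet, 'sonnet'):.4f} vs R1: ${cost(r1, 'r1'):.4f}")
```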

TLDR: Claude Sonnet wins here for me. There's no real difference in accuracy or quality of output, but there's a huge difference in speed. I don't think I will be that fazed by the cost difference.

I don't think we'll be using Deepseek R1 for anything in our app anytime soon, unless a use case comes along that plays into its advantages a little more (and there are some, as I found in my other tests below...).

Coding tasks (winner: draw - but with nuances)

This was probably my favourite test, and where I see myself reaching for Deepseek R1 a fair bit.

I personally do not like using LLMs to do big coding tasks. It's virtually never in the style that you want, and honestly it just takes me longer to check than it would to write.

So what do I use it for? Scaffolding, refactoring and writing very specific bits of code that are annoying and fiddly to write.

I thought the latter would be most interesting - writing fiddly code. So I tried it on some pretty gnarly data processing that involved lots of iteration.

If you aren't a developer, a bit of context: iteration done badly can have a HUGE impact on performance, so quality is important.
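Here's a generic, simplified illustration of the kind of thing I mean (not the actual code from my test) - matching records by re-scanning one list for every item of another, versus building a lookup table once:

```python
# Bad: for every order, scan the whole users list - O(n * m).
# On 100k orders x 1k users that's up to 100 million comparisons.
def join_slow(orders: list[dict], users: list[dict]) -> list[tuple]:
    out = []
    for order in orders:
        for user in users:
            if user["id"] == order["user_id"]:
                out.append((user["name"], order["total"]))
                break
    return out

# Good: build a dict once, then make a single O(n + m) pass.
def join_fast(orders: list[dict], users: list[dict]) -> list[tuple]:
    name_by_id = {user["id"]: user["name"] for user in users}
    return [(name_by_id[o["user_id"]], o["total"]) for o in orders]
```

Both functions produce the same result; the second is dramatically faster on large inputs, and that's exactly the kind of difference I was watching for in the models' output.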

I used a few approaches here, and the results differed depending on the approach. But before I go into those, some general observations:

  • Speed: Claude Sonnet again was orders of magnitude quicker on every task. I found myself getting a little frustrated waiting for Deepseek R1 and can see it breaking my flow state if I use it too much.
  • Cost: again, no actual cost comparison here, as Azure doesn't seem to be charging me. But I can say with near certainty that on an API basis, Deepseek R1 will be much cheaper due to the quantity of output tokens. Though in practice, for these tasks I use Claude Sonnet on a subscription basis, so unless Deepseek R1 makes me discard my subscription, it doesn't really matter.
  • Quality - task with minimal context or instruction: for this test, I just asked them to do X, Y or Z without any guidance on how I wanted it done or how it fit into the bigger picture (other than the actual file it'd be going in). Deepseek R1 did a much better job here, producing code that was not only cleaner and more readable, but architecturally better. I am a bit OCD so I still wanted to re-write it slightly for style, but that's my quirk to deal with.
  • Quality - task with good context and instruction: here, I asked them to do the same X, Y or Z but gave both some context on the wider app and my preferences on how I'd like it done. They came out pretty similar - there was no clear winner between the two. (Hypothetical examples of both prompt styles follow this list.)
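To give a flavour of the difference between the two set-ups, here's the shape of each prompt style - entirely hypothetical examples, not the prompts I actually used:

```python
# Take 1: minimal context - just "do X", plus the file it'll live in.
minimal_prompt = (
    "Write a function that deduplicates these records by email, "
    "keeping the most recent one. It will live in records.py."
)

# Take 2: the same ask, plus wider context and style preferences.
contextual_prompt = minimal_prompt + (
    "\n\nContext: records.py is part of a pipeline that streams batches "
    "of ~10k dicts, so avoid holding everything in memory twice.\n"
    "Preferences: type hints, plain functions over classes, early returns."
)
```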

So, if you are prepared to take a little bit of time to write some instructions and think about what context to provide, there's very little in it in terms of output.

Whilst the slowness of Deepseek R1 breaks my flow state, writing instructions and context does as well. So in future I'll probably stick with Claude Sonnet for simple tasks for speed - and reach for Deepseek R1 for gnarlier tasks.

TLDR: I'll probably be using both here. Which is awesome, because it means I can avoid spending $300 per month on ChatGPT Pro (which I was never going to do!).

Big picture tasks (winner: Deepseek R1 (I think?))

Going into this task, I thought this would be the most exciting use case. The results left me both underwhelmed and excited...

I took two different tasks here - one in the legal domain and another in software engineering:

  • Legal - inspired by Alex Herrity's post yesterday, I asked both to provide some thoughts on how you'd go about creating a taxonomy for contract types and risks (a toy sketch of what such a taxonomy might look like follows this list).
  • Software engineering - we are about to undergo a bit of a re-architecture of our app to make it more developer-friendly. It's really just re-organising files and concepts to make them easier to understand - and to make it easier to share code between different bits of the app where it makes sense. I asked each model for its thoughts on how we should do this, in two different ways.
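To give a flavour of the legal task, the kind of structure I was after looks roughly like this - a heavily simplified, hypothetical sketch, not Deepseek R1's actual answer and not our real taxonomy:

```python
from dataclasses import dataclass, field

@dataclass
class RiskCategory:
    name: str  # e.g. "liability cap", "termination rights", "data protection"
    severity_levels: list[str] = field(
        default_factory=lambda: ["low", "medium", "high"]
    )

@dataclass
class ContractType:
    name: str                  # e.g. "NDA", "MSA", "SaaS subscription"
    parent: str | None = None  # taxonomies are usually hierarchical
    typical_risks: list[RiskCategory] = field(default_factory=list)

nda = ContractType(
    name="NDA",
    parent="commercial",
    typical_risks=[
        RiskCategory("confidentiality scope"),
        RiskCategory("term and survival"),
    ],
)
```

The hard part - and where a model giving you a decent first draft saves real time - is choosing the categories and hierarchy, not writing the code.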

General observations:

  • Speed: same observations as above - Claude Sonnet was orders of magnitude quicker, but honestly I cared less here. These types of tasks are not ones you want to do quickly anyway. You stop and think lots, and iterate.
  • Cost: again, no actual cost comparison here, as Azure doesn't seem to be charging me. But as with coding tasks, I can say with near certainty Deepseek R1 will be cheaper on an API basis due to the quantity of output tokens (which are what make Claude Sonnet quite pricey).
  • Quality - legal: Claude Sonnet was pretty basic here - I wasn't impressed at all. Deepseek R1, however, blew my socks off. It came up with a pretty good starter for ten - and this is something I have spent so much time building and iterating on over the last year. It was far from perfect, and what we have built is a lot more refined, but Deepseek R1 would have given us at least a few days' head start, for sure.
  • Quality - software engineering, take 1: the first time I ran this test, I provided both with our current architecture and a draft explanation of our "target state". Claude Sonnet gave a fairly decent answer with some useful tips around the sides, but I was horrified by how bad Deepseek R1 was. It produced (no exaggeration) a stream of unreadable gobbledygook. I've never seen an LLM behave so weirdly. Looking at the reasoning, it just seemed to be going in circles, getting confused by the target state document. So I tried again...
  • Quality - software engineering, take 2: I re-ran my test without our target state explanation (so, kind of like the coding task, with less context). Claude Sonnet was not very useful here at all - it made sense and was accurate, but it was just a bit generic and basic in its suggestions. Deepseek R1 did much better, and its answer was roughly on par with Claude Sonnet's first attempt (maybe slightly better) - which is impressive given it had much less context.

I really need to try Deepseek R1 with some more of these tasks. I was so excited by the legal variant, but the software engineering one was really disappointing. I do however suspect that was a quirk, and if I try it with other software engineering big picture tasks I will get better results...

So in conclusion...

For our application's use case, we'll be sticking with Claude Sonnet. There won't be a huge cost difference given the way we use it, and speed is more important to us.

For everything else, I will probably continue using Claude Sonnet where speed / maintaining flow state is paramount. But I'll definitely be reaching for Deepseek R1 more.

So whilst in places I was pretty underwhelmed by Deepseek R1, I am excited by it. Mostly because it means I no longer need to umm and ahh over whether ChatGPT's $300 a month price tag is worth it. I'll instead be paying a few dollars a month for Azure Deepseek R1 API tokens...

I'll definitely be doing some more testing of both - and this time I will keep the examples to share!


Liam Gilchrist

Founder at Lexical Labs

4 weeks

Great article Chris, thanks for sharing. Interesting 'legal big picture' test outcome - wasn't expecting that!

Allen Morgan

Founder at Altien | Father | Husband | Legal Technologist | Cyclist | DJ

1 month

Great article, appreciate the share.

Alex Herrity

Director of Legal Solutions at adidas

1 month

Nice one, Chris! Cool to see you use my question!

Great post Chris Bridges. Useful to see the pros and cons at use case level.
