Why the world is talking about DeepSeek R1 and R1-Zero
One reason is cost, which is remarkable and deserves its own article, but for me, it's the example below.
Some time ago I talked about how LLMs are word predictors. That is true, but only partly: these models now have their own way to predict. In effect, they can learn how to learn.
With that we come to the image below, from DeepSeek's latest paper (arXiv:2501.12948). Give it a read; it is very well written, and it's the first time a model like this has been open-sourced.
This is the image of R1-Zero thinking through a problem.
The first "aha moment", as the authors call it, where R1-Zero improved its reasoning capabilities autonomously.
R1-Zero started as a blank slate, with no worked examples of how to solve problems. Training was pure reinforcement learning (RL): the model was simply rewarded when its final answers were correct and properly formatted.
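The RL algorithm behind this is GRPO (Group Relative Policy Optimization), described in the paper: for each question, the model samples a group of answers, each answer gets a reward, and each answer's advantage is just its reward relative to the rest of the group, so no separate value network is needed. Here's a minimal Python sketch of that group-relative normalisation; the rewards are made up for illustration:

```python
import statistics

# A minimal sketch of the group-relative advantage at the heart of GRPO.
# For one question, we sample several answers, score each one, and
# normalise the rewards within the group. The rewards below are made up.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sampled answer relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one question; only the third was correct.
print(group_relative_advantages([0.0, 0.0, 1.0, 0.0]))
# -> roughly [-0.58, -0.58, 1.73, -0.58]: the correct answer is reinforced,
#    the wrong ones are pushed down.
```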
We get a peek into its thinking because R1-Zero was instructed to use tags to separate its reasoning (<think>...</think>) from its final response (<answer>...</answer>).
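The paper describes rule-based rewards built on this: an accuracy reward (is the final answer right?) and a format reward (did the model use the tags?). Below is a rough sketch of how such checks might look; the exact-match comparison, the function names, and the simple sum at the end are my assumptions, not DeepSeek's actual code:

```python
import re

# Sketch of the rule-based rewards described in the paper: an accuracy
# reward (is the final answer correct?) and a format reward (did the
# model wrap its reasoning in the required tags?). Details are illustrative.

TAG_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """1.0 if the output follows <think>...</think><answer>...</answer>, else 0.0."""
    return 1.0 if TAG_PATTERN.fullmatch(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly."""
    match = TAG_PATTERN.fullmatch(completion.strip())
    if not match:
        return 0.0
    return 1.0 if match.group("answer").strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # A simple sum for illustration; the real signal feeds GRPO-style RL.
    return format_reward(completion) + accuracy_reward(completion, ground_truth)

output = "<think>2+2 means adding two and two.</think><answer>4</answer>"
print(total_reward(output, "4"))  # -> 2.0
```

Notice there is no human showing the model how to reason; the reward only looks at the final answer and the formatting, and the reasoning in between is left entirely to the model.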
In the beginning, R1-Zero's attempts were clumsy, like a child's first attempts at writing: a lot of trial and error, getting better bit by bit.
Over time, this process led to R1-Zero developing its own way of tackling problems. It began to show sophisticated behaviours like revisiting its earlier thoughts to rethink its approach, much as a human learns to check their own work.
This was the lightbulb moment, the "aha moment," when R1-Zero learned to give more "thinking time" to a problem in order to find a solution. This is where the image above comes in: the model allocated more time and "reasoned" its way to solving problems it had not tackled before.
However, R1-Zero also had challenges. Its reasoning wasn't always easy to follow; it read like someone thinking out loud in a very messy way, sometimes even mixing different languages mid-thought. This made its thought process hard for others to understand. This is crazy: just as AlphaGo Zero started making moves that made no sense to experienced Go players yet still outperformed everyone, we now do not understand what R1-Zero is thinking before it spits out the answer.
And it is already close to o1 on benchmarks (although, it must be noted, we are running out of good benchmarks, which is a separate topic of its own).
R1, by contrast, started with a small set of high-quality, human-friendly examples that showed the model clear reasoning processes and how answers should be structured. This helped R1 avoid the instability R1-Zero experienced early in training. R1 then went through the same reasoning-oriented RL process, rewarded for correct answers, which improved the quality of its reasoning. It was like a child learning from a good teacher, building on initial lessons to become better.
To avoid the language-mixing issues that R1-Zero faced, the researchers introduced a language consistency reward during this phase, encouraging R1 to keep its reasoning in the same language as the problem. After the RL training converged, R1 went through supervised fine-tuning (SFT) on data from diverse domains, to improve its writing, fact recall, and self-cognition, like a child broadening their understanding. Finally, R1 was given another phase of RL on diverse prompts to refine its capabilities further and make it more helpful.
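The paper defines the language consistency reward as the proportion of target-language words in the chain of thought. Here's a rough sketch assuming English as the target language; the ASCII check standing in for "is this word English?" is purely illustrative:

```python
# Sketch of a language-consistency reward: the fraction of words in the
# chain of thought that are in the target language (English here). The
# ASCII heuristic for detecting the language is deliberately crude.

def language_consistency_reward(chain_of_thought: str) -> float:
    """Fraction of whitespace-separated words that look like the target language."""
    words = chain_of_thought.split()
    if not words:
        return 0.0
    in_target = sum(1 for w in words if w.isascii())
    return in_target / len(words)

print(language_consistency_reward("Let me think step by step"))  # 1.0
print(language_consistency_reward("Let me 思考 step by step"))    # ~0.83, mixing is penalised
```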
With R1 and R1-Zero open-sourced, I see distillation becoming the default way we train smaller models. The distilled models performed far better than peers trained with RL alone, highlighting the efficiency of distillation.
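Distillation here is refreshingly simple: the big teacher (R1) generates reasoning traces, and a small student is fine-tuned on them with plain supervised learning, no RL required. A minimal sketch below, assuming a Hugging Face causal LM as the student; the model name and the toy trace are placeholders, not the actual setup:

```python
# Minimal sketch of distillation via supervised fine-tuning on teacher
# traces. In practice the teacher (R1) generates hundreds of thousands of
# (prompt, reasoning + answer) pairs, filtered for correctness; here one
# toy example stands in for that dataset.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_traces = [
    {"prompt": "What is 17 * 24?",
     "completion": "<think>17*24 = 17*20 + 17*4 = 340 + 68 = 408</think><answer>408</answer>"},
]

# Placeholder student model, not the actual distillation target.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

for example in teacher_traces:
    text = example["prompt"] + "\n" + example["completion"]
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token cross-entropy on the teacher's trace.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The striking part is that this ordinary fine-tuning loop, fed with a strong teacher's reasoning, beat small models trained with RL from scratch.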
Here's an audio conversation explaining the key ideas behind the paper. NotebookLM is getting very good, very fast.