Why the world is talking about DeepSeek R1 and R1-Zero
One reason is cost, which is remarkable and deserves its own article, but for me, it's the example below.
Some time ago I talked about how LLMs are word predictors. That is true, but only partly: these models now have their own way to predict. In effect, they can learn how to learn.
With that we come to the image below, from DeepSeek's latest paper (arXiv:2501.12948). Give it a read; it is very well written, and it's the first time a model like this has been open-sourced.
This is the image of R1-Zero thinking through a problem.
The first "aha moment", as the authors call it, where R1-Zero improved its reasoning capabilities autonomously.
R1-Zero started as a blank slate, with no worked examples of how to solve problems. Training was pure reinforcement learning (RL): the model was simply rewarded when its final answers were correct and properly formatted.
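The RL algorithm behind this is GRPO (Group Relative Policy Optimization), described in the paper: for each question, the model samples a group of answers, each answer gets a reward, and each answer's advantage is just its reward relative to the rest of the group, so no separate value network is needed. Here's a minimal Python sketch of that group-relative normalisation; the rewards are made up for illustration:

```python
import statistics

# A minimal sketch of the group-relative advantage at the heart of GRPO.
# For one question, we sample several answers, score each one, and
# normalise the rewards within the group. The rewards below are made up.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each sampled answer relative to its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one question; only the third was correct.
print(group_relative_advantages([0.0, 0.0, 1.0, 0.0]))
# -> roughly [-0.58, -0.58, 1.73, -0.58]: the correct answer is reinforced,
#    the wrong ones are pushed down.
```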
We get a peek into its thinking because R1-Zero was instructed to use tags to separate its reasoning (<think>...</think>) from its final response (<answer>...</answer>).
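The paper describes rule-based rewards built on this: an accuracy reward (is the final answer right?) and a format reward (did the model use the tags?). Below is a rough sketch of how such checks might look; the exact-match comparison, the function names, and the simple sum at the end are my assumptions, not DeepSeek's actual code:

```python
import re

# Sketch of the rule-based rewards described in the paper: an accuracy
# reward (is the final answer correct?) and a format reward (did the
# model wrap its reasoning in the required tags?). Details are illustrative.

TAG_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def format_reward(completion: str) -> float:
    """1.0 if the output follows <think>...</think><answer>...</answer>, else 0.0."""
    return 1.0 if TAG_PATTERN.fullmatch(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the reference exactly."""
    match = TAG_PATTERN.fullmatch(completion.strip())
    if not match:
        return 0.0
    return 1.0 if match.group("answer").strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # A simple sum for illustration; the real signal feeds GRPO-style RL.
    return format_reward(completion) + accuracy_reward(completion, ground_truth)

output = "<think>2+2 means adding two and two.</think><answer>4</answer>"
print(total_reward(output, "4"))  # -> 2.0
```

Notice there is no human showing the model how to reason; the reward only looks at the final answer and the formatting, and the reasoning in between is left entirely to the model.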
In the beginning, R1-Zero's attempts were clumsy, like a child's first attempts at writing: a lot of trial and error, getting better bit by bit.
Over time, this process led to R1-Zero developing its own way of tackling problems. It began to show sophisticated behaviours like revisiting its earlier thoughts to rethink its approach, much as a human learns to check their own work.
This was the lightbulb moment, the "aha moment," when R1-Zero learned to give more "thinking time" to a problem in order to find a solution. This is where the image above comes in: the model allocated more time and "reasoned" its way to solving problems it had not tackled before.
However, R1-Zero also had challenges. Its reasoning wasn't always easy to follow; it read like someone thinking out loud in a very messy way, sometimes even mixing different languages mid-thought. This made its thought process hard for others to understand. This is crazy: just as AlphaGo Zero started making moves that made no sense to experienced Go players yet still outperformed everyone, we now do not understand what R1-Zero is thinking before it spits out the answer.
And it is already close to o1 on benchmarks (although, it must be noted, we are running out of good benchmarks, which is a separate topic of its own).
R1, by contrast, started with a small set of high-quality, human-friendly examples that showed the model clear reasoning processes and how answers should be structured. This helped R1 avoid the instability R1-Zero experienced early in training. R1 then went through the same reasoning-oriented RL process, rewarded for correct answers, which improved the quality of its reasoning. It was like a child learning from a good teacher, building on initial lessons to become better.
To avoid the language-mixing issues that R1-Zero faced, the researchers introduced a language consistency reward during this phase, encouraging R1 to keep its reasoning in the same language as the problem. After the RL training converged, R1 went through supervised fine-tuning (SFT) on data from diverse domains, to improve its writing, fact recall, and self-cognition, like a child broadening their understanding. Finally, R1 was given another phase of RL on diverse prompts to refine its capabilities further and make it more helpful.
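The paper defines the language consistency reward as the proportion of target-language words in the chain of thought. Here's a rough sketch assuming English as the target language; the ASCII check standing in for "is this word English?" is purely illustrative:

```python
# Sketch of a language-consistency reward: the fraction of words in the
# chain of thought that are in the target language (English here). The
# ASCII heuristic for detecting the language is deliberately crude.

def language_consistency_reward(chain_of_thought: str) -> float:
    """Fraction of whitespace-separated words that look like the target language."""
    words = chain_of_thought.split()
    if not words:
        return 0.0
    in_target = sum(1 for w in words if w.isascii())
    return in_target / len(words)

print(language_consistency_reward("Let me think step by step"))  # 1.0
print(language_consistency_reward("Let me 思考 step by step"))    # ~0.83, mixing is penalised
```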
With R1 and R1-Zero open-sourced, I see distillation becoming the default way we train smaller models. The distilled models performed far better than peers trained with RL alone, highlighting the efficiency of distillation.
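Distillation here is refreshingly simple: the big teacher (R1) generates reasoning traces, and a small student is fine-tuned on them with plain supervised learning, no RL required. A minimal sketch below, assuming a Hugging Face causal LM as the student; the model name and the toy trace are placeholders, not the actual setup:

```python
# Minimal sketch of distillation via supervised fine-tuning on teacher
# traces. In practice the teacher (R1) generates hundreds of thousands of
# (prompt, reasoning + answer) pairs, filtered for correctness; here one
# toy example stands in for that dataset.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_traces = [
    {"prompt": "What is 17 * 24?",
     "completion": "<think>17*24 = 17*20 + 17*4 = 340 + 68 = 408</think><answer>408</answer>"},
]

# Placeholder student model, not the actual distillation target.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

for example in teacher_traces:
    text = example["prompt"] + "\n" + example["completion"]
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token cross-entropy on the teacher's trace.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The striking part is that this ordinary fine-tuning loop, fed with a strong teacher's reasoning, beat small models trained with RL from scratch.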
Here's an audio conversation explaining the key ideas behind the paper. NotebookLM is getting very good, very fast.