A Deep Dive into DeepSeek R1 - technical version

Hello everyone, and welcome to this newsletter edition where we explore some of the most important concepts in AI model training—and how a novel large language model called DeepSeek R1 brings them all together.

We’ll look at foundational ideas such as Reinforcement Learning, Supervised Fine-Tuning (SFT), Knowledge Distillation, and more. Then we’ll dive into the design of DeepSeek R1, examining how it uses “Chain-of-Thought” reasoning, advanced RL methods, and a multi-stage training process.


Part 1: Key Concepts in Modern AI Training

1. Reinforcement Learning (RL)

Reinforcement Learning is a branch of machine learning in which an AI model learns by taking actions in an environment and receiving rewards or penalties based on those actions. The goal is to maximize rewards over time.

Example: Imagine teaching a robot to play a simple video game. Each time it performs a beneficial action—like scoring a point—it receives a positive reward. When it makes a mistake—like losing a point—it’s penalized. Over many attempts, the robot identifies the actions that yield the highest score, improving its skill at the game.
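
To make this concrete, here is a minimal sketch of that reward loop in Python, assuming a toy game environment; the states, actions, and reward rule below are invented purely for illustration and are not part of DeepSeek's work:

```python
import random

# A minimal sketch of a reward-driven learning loop (tabular Q-learning)
# on a hypothetical toy game with a handful of states and two actions.
states = range(5)
actions = ["left", "right"]
q_table = {(s, a): 0.0 for s in states for a in actions}

def play_step(state, action):
    """Hypothetical game dynamics: +1 for a 'good' move, -1 otherwise."""
    reward = 1.0 if (state % 2 == 0 and action == "right") else -1.0
    next_state = (state + 1) % 5
    return next_state, reward

alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

state = 0
for _ in range(1000):
    # Explore occasionally, otherwise take the best-known action
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: q_table[(state, a)])
    next_state, reward = play_step(state, action)
    # Nudge the value estimate toward the observed reward plus future value
    best_next = max(q_table[(next_state, a)] for a in actions)
    q_table[(state, action)] += alpha * (reward + gamma * best_next - q_table[(state, action)])
    state = next_state
```

Over many iterations, the table of action values shifts toward the moves that earn the highest score, which is the same trial-and-error dynamic an RL-trained language model applies to its outputs.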


2. Supervised Fine-Tuning (SFT)

Fine-tuning is the process of taking a pre-trained AI model and making small adjustments so it performs better on a specific task. Instead of training from scratch, we “tweak” the model using additional data to improve performance.

SFT (Supervised Fine-Tuning) refers to a particular kind of fine-tuning that relies on a labeled dataset. The model sees examples (like text or images) paired with the correct labels (i.e., the “right answers”). This helps the model learn how to predict accurately in situations similar to those examples.

Example: Suppose you have a large language model (LLM) and you want it to do better at handling customer-support requests. You collect a labeled dataset of common customer questions and correct answers, then fine-tune the model so it becomes far more accurate at providing helpful responses. If you have a substantial amount of labeled data, SFT is an excellent approach.
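
As a rough illustration, here is a minimal supervised fine-tuning loop, assuming the Hugging Face transformers library and PyTorch; the base model name and the two support questions are placeholders, not anything from DeepSeek's pipeline:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Tiny labeled dataset: (question, correct answer) pairs
support_data = [
    ("How do I reset my password?", "Go to Settings > Account > Reset Password."),
    ("Where can I find my invoice?", "Invoices are listed under Billing > History."),
]

model.train()
for question, answer in support_data:
    # Concatenate question and correct answer; the labels are the same tokens,
    # so the model learns to reproduce the desired response.
    text = f"Customer: {question}\nAgent: {answer}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```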


3. Knowledge Distillation

Knowledge Distillation is a method of transferring knowledge from a large, complex “teacher” model to a smaller, simpler “student” model. The aim is to retain most of the performance of the larger model while gaining significant benefits in speed and efficiency.

Why do it?

• Smaller model size

• Faster inference

• Lower memory footprint

This technique is valuable when you need to deploy models in resource-constrained environments while still maintaining strong performance.
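
As a sketch of how this knowledge transfer is commonly expressed in code, here is a standard distillation loss: a temperature-softened KL divergence between teacher and student outputs. The random logits below stand in for real model outputs:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Random logits standing in for real teacher/student outputs (batch of 4, 32k vocab)
teacher_logits = torch.randn(4, 32000)
student_logits = torch.randn(4, 32000, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # the student is updated to imitate the teacher's distribution
```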


4. Cold Start Data

Cold start data is a minimal labeled dataset used to give the model a basic understanding of a task. It’s especially useful when you don’t have large-scale labeled data. For instance, if you’re building a chatbot but only have a small FAQ list scraped from your website, you can use that as your cold start data to conduct a simple fine-tuning step.

5. Multi-Stage Training

Multi-stage training involves training the model in carefully planned phases, each focusing on a specific improvement, such as accuracy, alignment, or robustness. A common example is:

1. Train a base model on general text data.

2. Refine it using RL with user feedback to improve its conversational or decision-making ability.

By breaking the training process into stages, you can systematically target each aspect of performance you care about.


6. Rejection Sampling

Rejection sampling refers to generating multiple outputs (responses or solutions) from a model, and only selecting those that meet specific criteria (e.g., correctness, quality, relevance) to feed back into further training.

Example: In an RL process, the model may produce multiple potential answers. Only those that are judged valuable for retraining are kept. This filters out noise and accelerates learning.
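
Here is a minimal sketch of that filter on a toy arithmetic task, where a deliberately noisy "model" stands in for a real LLM sampler:

```python
import random

def generate_answer(question: tuple[int, int]) -> int:
    """Toy stand-in for an LLM: sometimes right, sometimes off by one."""
    a, b = question
    return a + b + random.choice([-1, 0, 0, 1])

def passes_checks(question: tuple[int, int], answer: int) -> bool:
    """Correctness criterion; a real system might also score quality or relevance."""
    return answer == sum(question)

def rejection_sample(question, num_samples: int = 16):
    candidates = [generate_answer(question) for _ in range(num_samples)]
    # Keep only candidates meeting the criterion; these become new
    # "labeled" examples for further training.
    return [c for c in candidates if passes_checks(question, c)]

kept = rejection_sample((17, 25))
print(f"kept {len(kept)} of 16 samples")
```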


Part 2: DeepSeek R1’s Key Technologies

DeepSeek R1 is a new large language model developed by a research team in China. It showcases three major pillars:

1. Chain-of-Thought: Encouraging the model to “think out loud.”

2. Reinforcement Learning: Letting the model learn from self-guided exploration and feedback.

3. Distillation: Compressing a large model into a smaller one without sacrificing much performance.


Let’s see how these ideas come together.

1. Chain-of-Thought

Most AI models output an answer straight away, with no indication of the reasoning behind it. If the answer is wrong, you don’t know where it went off-track.

Chain-of-Thought forces the model to break down its thought process step by step. This is helpful because:

• You can easily pinpoint errors if the result is incorrect.

• The model itself can spot mistakes and self-correct as it goes, leading to better overall accuracy.

In DeepSeek’s paper, there’s an example of solving a math problem: the model identifies a miscalculation in its own step-by-step process and revises its answer in real time. This approach fosters more robust and transparent reasoning.
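
Below is a minimal sketch of what a Chain-of-Thought prompt can look like. The complete() function is a hypothetical placeholder for whatever LLM API you use, and the prompt wording is illustrative rather than taken from the DeepSeek paper:

```python
def complete(prompt: str) -> str:
    """Hypothetical placeholder: call whichever LLM API you use here."""
    raise NotImplementedError

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

cot_prompt = (
    f"Question: {question}\n"
    "Work through the problem step by step, showing each calculation, "
    "then give the final result on its own line starting with 'Answer:'."
)

# Because the reasoning is written out, a wrong final answer can be traced
# back to the exact step where the calculation went off-track.
# response = complete(cot_prompt)
```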


2. Reinforcement Learning (RL)

Traditionally, AI models are trained in a supervised manner: show them a question and the correct answer, repeat many times. DeepSeek R1, however, takes a cue from how infants learn. Infants try, fail, adjust, and try again—this is the essence of Reinforcement Learning.

DeepSeek R1 improves its reasoning by exploring multiple ways to answer a prompt, then comparing outcomes. It uses a specialized method called Group Relative Policy Optimization (GRPO) that eliminates the need for a separate “critic” model and instead relies on predefined criteria (like coherence or fluency) to score its own moves. If the new attempt is better, DeepSeek R1 updates its internal strategy.

Result?

• Lower labeling costs

• The ability to learn from its own mistakes

• A model that continuously refines itself over time


3. Distillation

DeepSeek R1 is massive: its full-sized version has 671 billion parameters. That’s far beyond the reach of most organizations’ hardware budgets.

The solution is distillation. After training the giant “teacher” model, DeepSeek’s creators transfer its capabilities into smaller “student” models based on Llama and Qwen. These lighter versions retain much of the original’s reasoning ability, and on some benchmarks they even outperform much larger general-purpose models, all while running on a single GPU. This dramatically expands access to the technology and empowers more people to use high-level AI models.


Part 3: GRPO RL Framework and Multi-Stage Training

The GRPO RL Framework

In many RL setups for Large Language Models, a “critic” model is used to provide feedback based on labeled data. But if that labeled data is incomplete or biased, the critic’s feedback is limited.

Group Relative Policy Optimization (GRPO) sidesteps this by removing the critic. The model’s outputs are compared to predefined rules—for instance, measuring coherence or fluency—and then rated relative to each other. The model learns to optimize its strategy by aiming to beat its own average performance.
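
Here is a minimal sketch of that group-relative scoring idea, assuming PyTorch and using made-up reward values in place of the real rule-based scores:

```python
import torch

# One score per sampled output for a single prompt (placeholder values)
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7, 0.1, 0.8])

# Advantage = how much better (or worse) each output is than the group mean,
# normalised by the group's standard deviation. No separate critic model is
# needed: the group itself provides the baseline.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Outputs with positive advantage are reinforced in the policy update;
# those with negative advantage are discouraged.
print(advantages)
```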

Training Process Overview

DeepSeek R1 is trained in multiple phases, each of which tackles a different goal:

1. Cold Start Fine-Tuning: Use a few thousand labeled data points (“cold start data”) to establish a solid foundation. Compared to typical supervised learning requiring millions or billions of data points, this is relatively small.

2. Pure RL (like R1-Zero): Improve problem-solving with a self-guided approach, letting the model attempt tasks and learn from rewards or penalties.

3. Rejection Sampling: Once RL has progressed, the model generates multiple solutions and keeps only the best as new “labeled” data for itself (synthetic data).

4. Combine Synthetic and Supervised Data: Merge the newly created outputs with data from DeepSeek-V3-Base covering writing quality, factual checks, and self-awareness. This step ensures the model learns from high-quality outputs and a broad range of domain knowledge.

5. Final RL Pass: With the new data in hand, the model goes through one more RL phase across a variety of prompts and scenarios.

Each stage builds on the one before it, amplifying the model’s abilities systematically.
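
To tie the stages together, here is a highly simplified sketch of how such a pipeline could be orchestrated; every helper function is a hypothetical stub standing in for a real training component, not DeepSeek's actual code:

```python
def fine_tune(model, dataset):
    return model   # placeholder supervised fine-tuning step

def run_rl(model, prompts):
    return model   # placeholder reinforcement learning step

def rejection_sample_dataset(model, prompts):
    return []      # placeholder: keep only outputs that pass quality checks

def merge_datasets(*datasets):
    return [example for dataset in datasets for example in dataset]

def train_pipeline(base_model, cold_start_data, supervised_data, prompts):
    model = fine_tune(base_model, cold_start_data)         # 1. cold start fine-tuning
    model = run_rl(model, prompts)                         # 2. pure RL (R1-Zero style)
    synthetic = rejection_sample_dataset(model, prompts)   # 3. rejection sampling
    combined = merge_datasets(synthetic, supervised_data)  # 4. synthetic + supervised data
    model = fine_tune(model, combined)
    return run_rl(model, prompts)                          # 5. final RL pass
```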


Why DeepSeek R1 Matters:

DeepSeek R1 weaves together Chain-of-Thought, Reinforcement Learning, and Model Distillation into a single training pipeline, resulting in a powerful new LLM. What makes it stand out?

• Transparency: Chain-of-Thought reveals the model’s reasoning process.

• Continuous Improvement: RL helps DeepSeek R1 iteratively refine its responses.

• Accessibility: Distillation shrinks the colossal original model into versions that can run on far more modest hardware.

Even if you’re not an AI researcher, you can appreciate how crucial these capabilities are. From writing assistants to advanced research tools, DeepSeek R1’s innovations signal how AI might soon evolve across many real-world domains.


Frequently Asked Questions

1. What is DeepSeek R1?

DeepSeek R1 is a new large language model developed by a Chinese research team. It is noteworthy because it performs comparably to other leading models (like OpenAI’s o1) on complex tasks such as mathematics, coding, and scientific reasoning. Its innovations in reinforcement learning and distillation could make AI more efficient and accessible.

2. How does DeepSeek R1 use “Chain-of-Thought” prompts?

DeepSeek R1 is encouraged to “think out loud,” providing step-by-step explanations of its reasoning. When solving a math problem, for example, it lists each step of the logic. This helps identify (and correct) errors more easily, and allows the model to refine its own reasoning if it spots a contradiction or miscalculation.

3. What is special about its Reinforcement Learning approach?

DeepSeek R1’s RL strategy mimics how humans (especially infants) learn by repeated trials and adjustments. Instead of using a separate critic model, it leverages GRPO—a method that compares its answers against predefined criteria and its own average performance. This yields more flexibility and reduces the need for extensive labeled data.


In summary:

DeepSeek R1 is significant not just because of what it can do, but because of how it does it. By combining transparent reasoning (Chain-of-Thought), incremental self-improvement (Reinforcement Learning), and resource-efficient deployment (Distillation), it points to an exciting new direction in AI research and application.

In the coming months and years, we may see these innovations filter down into widely available tools—making powerful, transparent, and ever-improving AI assistants a part of daily life. Keep an eye on DeepSeek R1 and related advancements if you’re eager to stay at the cutting edge of AI.

Thanks for reading! If you have further questions, feel free to reach out or stay tuned for our next deep dive into emerging AI technologies.
