What's Deep about DeepSeek?

DeepSeek has taken the LLM world by storm, achieving parity with the latest models from OpenAI at a fraction of the stated cost and with much smaller models. I am sure folks are wondering how they did all that! I delved into their paper, given here.

Here is what I learnt!

The most significant part of the DeepSeek approach is the use of Reinforcement Learning -- a reward system -- to help the model learn which reasoning paths are better than others.


The biggest difference is in their reinforcement learning algorithm. They call it Group Relative Policy Optimization, or GRPO, as opposed to the Proximal Policy Optimization (PPO) that is typically used; the comparison shown below is taken from their earlier paper here.

[Figure: side-by-side comparison of PPO and GRPO]




In PPO, along with the reference and reward models, you also train a Value Model that is roughly the same size as the policy model, which adds a significant computational and memory burden. This Value Model is what provides the baseline for computing advantages.

In GRPO, on the other hand, the Value Model is removed entirely, and the baseline is instead "the average reward of multiple sampled outputs, produced in response to the same question."
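In the paper, each sampled output's reward is normalised against its group's mean and standard deviation, and that normalised score becomes the advantage for all of its tokens. Here is a rough sketch of what that looks like in Python (my own illustration, not DeepSeek's code):

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of sampled outputs.

    rewards: scalar rewards, one per output sampled for the *same*
             question from the old policy.
    Returns one advantage per output: its reward normalised by the
    group mean and standard deviation, so the group average acts as
    the baseline instead of a learned Value Model.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four outputs sampled for the same question
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))

Outputs that beat their own group's average get positive advantages and are reinforced; the rest are pushed down.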

"More specifically, for each question ??, GRPO samples a group of outputs {??1, ??2, · · · , ????} from the old policy ?????????? and then optimizes the policy model by maximizing the following objective:"


Don't be too scared by the function. The first part is simply an expectation: for each question q, a group of outputs is sampled from the old policy. The second part is a ratio of the probabilities that the current and the old policy models assign to each output, with A_i,t being an advantage calculated from the relative rewards of the outputs within each group; the ratio is also clipped so that no single update moves the policy too far. Think of it as nudging the model, in measured steps, towards the outputs that scored better than their group's average.

The last part, D_KL, the Kullback-Leibler divergence, is a penalty that measures how far the current policy model has drifted from the reference model and keeps that drift in check.
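To tie the pieces together, here is a minimal sketch of the per-token objective in PyTorch (my own simplification, with illustrative defaults for the clipping range and KL weight, not the official implementation):

import torch

def grpo_token_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps=0.2, kl_beta=0.04):
    """Negated per-token GRPO objective for one sampled output.

    logp_new, logp_old, logp_ref: log-probabilities of the chosen tokens
        under the current, old, and reference policies (1-D tensors).
    advantages: the output's group-relative advantage, broadcast to
        every one of its tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                     # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)                  # PPO-style clipping

    # KL penalty against the reference model, using the non-negative
    # estimator from the paper: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    # Maximizing the objective is the same as minimizing its negative
    return -(surrogate - kl_beta * kl).mean()

The clipping is what keeps each update measured, and the KL term is what keeps the new policy close to the reference model.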

And this is what gives the model the ability to rapidly discover new reasoning pathways through chain-of-thought refinement, saving the huge amounts of labelled training data that supervised fine-tuning would otherwise require.



And THIS is what has made DeepSeek so powerful: with far less training data than OpenAI's o1, it still matches that model's benchmark performance.

In a way, this makes sense. DeepSeek is training the model the way a human being learns: by adapting, making mistakes, and learning from the feedback on those errors.

Paraphrasing an old Chinese saying, 'we do live in interesting times!'


Yashan Kumar

IT professional

1mo

Insightful

Anand Prahlad

Technology Risk|Information Security|Business Continuity|Enterprise Software|Products

1mo

Good one Arun Krishnan. So, their approach is a "hey LLM, get trained in a group setting" versus the old value-model way, where it is 1-on-1 tutoring...? Their cost savings are on the training side, right? Not on the inference or running-the-model side? Pray, do educate us non-math folks some more, kindly

Anand K.

Applied data science

1mo

Hi Arun, thanks for sharing this paper and your summary. For PPO, softmax assigns the probabilities, so how can a multi-group reward model create better results? Fascinating, and I look forward to experimentation.


Thanks for explaining in "simple" words

Elayaraja Eswaran

Chief Architect | Senior Vice President – Data & Analytics | Microsoft at iLink Digital

1mo

Nice Summary. Great progress for a 2 year old startup. On a lighter note, "China products are comparatively cheap while offering unique features". Still, real-world use cases will be the true test.
