What's Deep about DeepSeek?
Arun Krishnan
DeepSeek has taken the LLM world by storm, achieving parity with the latest models from OpenAI at a fraction of the stated cost and with much smaller models. I am sure folks are wondering how they did it! I delved into their paper, given here.
Here is what I learnt!
The most significant part of the DeepSeek approach is the use of Reinforcement Learning -- a reward system -- to help the model understand which reasoning paths are better than others.
The biggest difference is in their reinforcement learning algorithm. They call it Group Relative Policy Optimization, or GRPO, as opposed to the Proximal Policy Optimization (PPO) that is typically used, shown below and taken from their earlier paper here.
In PPO, along with the reference and reward models, you have a value model that is comparable in size to the policy model, which adds a substantial computational and memory burden. Moreover, this value model provides the baseline used in the advantage calculation.
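For reference, the standard PPO clipped surrogate objective, written here as my own transcription roughly in the paper's notation, maximizes:

```latex
\mathcal{J}_{PPO}(\theta) =
  \mathbb{E}\!\left[q \sim P(Q),\, o \sim \pi_{\theta_{old}}(O\mid q)\right]
  \frac{1}{|o|}\sum_{t=1}^{|o|}
  \min\!\left[
    \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})}\, A_t,\;
    \mathrm{clip}\!\left(\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{old}}(o_t \mid q, o_{<t})},\, 1-\varepsilon,\, 1+\varepsilon\right) A_t
  \right]
```

Here the advantage A_t is estimated with the help of the value model, which is exactly the piece GRPO does away with.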
In GRPO, on the other hand, the value model is removed entirely, and the baseline is instead "the average reward of multiple sampled outputs, produced in response to the same question."
"More specifically, for each question ??, GRPO samples a group of outputs {??1, ??2, · · · , ????} from the old policy ?????????? and then optimizes the policy model by maximizing the following objective:"
Don't be too scared by the function. The first part is an expectation: questions q are drawn from the data, and for each question a group of outputs is sampled from the old policy. The second part is a ratio of the token probabilities assigned by the current and the old policy models, weighted by A_i,t, an advantage computed from the relative rewards of the outputs within each group. Think of it as nudging the policy toward the outputs that scored better than their group's average, and away from those that scored worse.
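Concretely, for outcome-level rewards, as I read the paper, the advantage assigned to every token of output i is simply the group-normalized reward:

```latex
\hat{A}_{i,t} = \frac{r_i - \mathrm{mean}\left(\{r_1, \dots, r_G\}\right)}{\mathrm{std}\left(\{r_1, \dots, r_G\}\right)}
```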
The last part, D_KL, is the Kullback-Leibler divergence, a penalty that measures how far the current policy model has drifted from the reference model and keeps it from straying too far.
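To make the pieces concrete, here is a minimal PyTorch-style sketch of a GRPO-style loss for a single question. The function name, tensor shapes and default hyperparameters (clip_eps, kl_beta) are my own illustrative choices, not DeepSeek's code; only the overall structure (group-normalized advantages, clipped ratio, KL penalty against the reference model) follows the objective above.

```python
import torch

def grpo_loss(new_logps, old_logps, ref_logps, rewards, mask,
              clip_eps=0.2, kl_beta=0.04):
    """Illustrative GRPO-style loss for one question with G sampled outputs.

    new_logps, old_logps, ref_logps: (G, T) per-token log-probs of the G
        sampled outputs under the current, old and reference policies.
    rewards: (G,) scalar reward for each sampled output (G should be > 1).
    mask:    (G, T) 1.0 for real tokens, 0.0 for padding.
    """
    # Group-relative advantage: each output's reward normalized against its
    # own group, replacing the learned value-model baseline used in PPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    # Probability ratio between current and old policy, clipped as in PPO.
    ratio = torch.exp(new_logps - old_logps)                    # (G, T)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # Simple per-token estimate of D_KL(pi_theta || pi_ref):
    # exp(ref - new) - (ref - new) - 1, which is always >= 0.
    log_diff = ref_logps - new_logps
    kl = torch.exp(log_diff) - log_diff - 1.0                   # (G, T)

    per_token = surrogate - kl_beta * kl
    # Average over each output's tokens, then over the group; negate to minimize.
    per_output = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_output.mean()
```

In an actual training loop, new_logps would be computed with gradients enabled while old_logps and ref_logps come from frozen forward passes, and the rewards would come from a rule-based or learned reward model scoring each sampled answer.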
And this is what gives the model the ability to rapidly discover new reasoning pathways through its own chain-of-thought, without the huge amounts of labelled training data that supervised fine-tuning would otherwise require.
And THIS is what has made DeepSeek so powerful: with far less training data than OpenAI's o1, they are still able to meet the same benchmark standards.
In a way, this makes sense. DeepSeek is training the model the way a human being learns: by adapting, making mistakes, and learning from the feedback on those errors.
Paraphrasing an old Chinese saying, 'we do live in interesting times!'
Comments

IT professional (1 mo): Insightful

Technology Risk | Information Security | Business Continuity | Enterprise Software | Products (1 mo): Good one Arun Krishnan. So their approach is "hey LLM, get trained in a group setting" versus the old value-model way, where it is 1-on-1 tutoring...? Their cost savings is on the training side, right? Not on the inference or running-the-model side? Pray, do educate us non-math folks some more, kindly.

Applied data science (1 mo): Hi Arun, thanks for sharing this paper and your summary. For PPO, softmax assigns the probabilities, so how can a multi-group reward model create better results? Fascinating, and I look forward to experimentation.

(1 mo): Thanks for explaining in "simple" words.

Chief Architect | Senior Vice President - Data & Analytics | Microsoft at iLink Digital (1 mo): Nice summary. Great progress for a two-year-old startup. On a lighter note, "China products are comparatively cheap while offering unique features". Still, real-world use cases will be the true test.