DeepSeek's AI improves when rewarded

Reinforcement Learning (RL) has been a popular approach to training and improving AI models. Here's an oversimplification of how it works: have the AI agent check its generated answers against a "reward function" - that is, some way of judging how good or bad an answer is in the context of the AI's goals. There's more to it than that, including updating the model so that it learns to maximize the rewards it receives.
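To make the idea concrete, here is a toy sketch of a reward function in Python. It is not how any real model is trained (the gradient update that actually changes the model is omitted); it just shows what "scoring an answer" can look like.

```python
# Toy sketch of a reward function (illustrative only; the model update
# that uses these scores during RL training is omitted).

def reward_fn(answer: str, reference: str) -> float:
    """Score an answer: full credit for matching the reference, partial credit for overlap."""
    answer_tokens = set(answer.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    return len(answer_tokens & reference_tokens) / len(reference_tokens)

candidate_answers = ["Paris is the capital of France", "The capital is Lyon"]
reference = "Paris is the capital of France"

for answer in candidate_answers:
    print(answer, "->", reward_fn(answer, reference))
```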

In agentic AI, a similar approach is used to improve the output generated by Large Language Models: have agents check the output of other agents and pass judgment on how good that output is. Based on that judgment, the agents essentially "reward" an LLM when it produces better output. Although not technically a reward function, agents can use different mechanisms to improve the generated output, including peer review, voting, feedback loops, supervisor review, and automated output testing.
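A rough sketch of that "agents judging agents" idea is below. The two judge functions are simple stand-in heuristics of my own; in a real agentic system each judge would typically be another LLM call or an automated test.

```python
# Hypothetical sketch of agents "rewarding" another agent's output by voting.
# The judges below are stand-in heuristics, not real LLM reviewers.

from statistics import mean

def judge_length(output: str) -> float:
    """Prefer concise answers (illustrative heuristic)."""
    return 1.0 if len(output.split()) <= 50 else 0.5

def judge_keywords(output: str) -> float:
    """Check that required terms appear (illustrative heuristic)."""
    required = {"reward", "policy"}
    found = sum(1 for term in required if term in output.lower())
    return found / len(required)

def peer_review(output: str) -> float:
    """Average the judges' scores into a single 'reward' for the output."""
    return mean(judge(output) for judge in (judge_length, judge_keywords))

draft = "The policy is updated to maximize the reward signal."
print("peer-review score:", peer_review(draft))
```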

Reward functions are typically set up and chosen by humans, possibly in combination with automated iterative testing and refinement. Some AI models are trained using Group Relative Policy Optimization (GRPO), an approach where a group of candidate outputs is sampled for each prompt and scored against one another, and the model is refined during training to maximize the prescribed reward. However, the underlying reward functions are typically pre-defined and fixed, and they are generally only used during the model's training.
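The "group relative" part of GRPO can be illustrated in a few lines of Python: sample several outputs for the same prompt, score them, and rate each one relative to its group. This sketch covers only the scoring step; the policy-gradient update that uses these advantages is omitted.

```python
# Minimal sketch of GRPO's group-relative scoring step: each sampled output's
# advantage is its reward measured against the rest of the group.

from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the group's mean and spread."""
    baseline = mean(rewards)
    spread = pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - baseline) / spread for r in rewards]

group_rewards = [0.2, 0.9, 0.5, 0.4]  # e.g. scores for 4 sampled answers to one prompt
print(group_relative_advantages(group_rewards))
```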

Now, DeepSeek R1 has extended the use of GRPO to generating output after the model has been trained. Simply put, using GRPO to continuously evaluate generated output enables it to find and select the "best" output. This use of GRPO also makes DeepSeek R1 more efficient while improving the quality of the output.
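Conceptually, that looks like best-of-group selection at inference time. The sketch below assumes some scoring function is available after training; the placeholder scorer is mine, not DeepSeek's.

```python
# Sketch of "best-of-group" selection at inference time: score several
# candidate outputs and keep the highest-scoring one.

def select_best(candidates: list[str], score) -> str:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

def score(answer: str) -> float:
    """Placeholder scorer: prefer answers that show their reasoning."""
    return 1.0 if "because" in answer.lower() else 0.0

candidates = [
    "42.",
    "42, because 6 * 7 = 42.",
]
print(select_best(candidates, score))
```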

However, for me, the real kicker is how DeepSeek R1 selects and uses reward functions as part of GRPO. Instead of relying on a fixed reward function determined ahead of time, R1 selects the reward function(s) that best represent the task or goals of the prompt. Furthermore, reward functions can be added for intermediate goals that the agent identifies as it progresses. Which reward functions to use is driven not only by the current context but also by how well each reward function is performing against the goal of the prompt.
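Here is a hypothetical sketch of what selecting reward functions per task could look like. The registry, the task detection, and the reward heuristics are all illustrative assumptions on my part, not DeepSeek's actual code.

```python
# Hypothetical sketch of picking reward functions based on the task implied
# by the prompt. Everything here is illustrative.
import ast

def accuracy_reward(output: str) -> float:
    """Illustrative: reward outputs that state a clearly marked final answer."""
    return 1.0 if "answer:" in output.lower() else 0.0

def brevity_reward(output: str) -> float:
    """Illustrative: reward shorter outputs."""
    return 1.0 / (1.0 + len(output.split()))

def code_parses_reward(output: str) -> float:
    """Illustrative: reward outputs that at least parse as Python."""
    try:
        ast.parse(output)
        return 1.0
    except SyntaxError:
        return 0.0

REWARD_REGISTRY = {
    "coding": [code_parses_reward, brevity_reward],
    "general": [accuracy_reward, brevity_reward],
}

def select_reward_fns(prompt: str):
    """Pick reward functions based on a crude guess at the task type."""
    task = "coding" if "code" in prompt.lower() or "def " in prompt else "general"
    return REWARD_REGISTRY[task]

print([fn.__name__ for fn in select_reward_fns("Write code to sort a list")])
```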

This approach also allows multiple reward functions to be applied to a reasoning flow, such as using separate reward functions to balance accuracy and speed. Compliance and safety reward functions can also be injected to detect and penalize undesired output.
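Blending several reward functions, including a safety penalty, can be as simple as a weighted sum. The weights and the banned-terms check in this sketch are illustrative assumptions, not anything from the R1 paper.

```python
# Sketch of combining multiple reward signals (accuracy, speed, safety)
# into one score. Weights and checks are illustrative.

BANNED_TERMS = {"password", "ssn"}

def safety_penalty(output: str) -> float:
    """Penalize outputs that leak obviously sensitive terms."""
    return -1.0 if any(term in output.lower() for term in BANNED_TERMS) else 0.0

def combined_reward(output: str, accuracy: float, latency_s: float) -> float:
    """Weighted blend of accuracy and speed, plus a safety penalty."""
    speed = 1.0 / (1.0 + latency_s)  # faster responses score higher
    return 0.7 * accuracy + 0.3 * speed + safety_penalty(output)

print(combined_reward("The answer is 42.", accuracy=1.0, latency_s=0.8))
```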

This approach results in better output and the ability to "see" into these GRPO-driven reasoning decisions. DeepSeek R1 makes those decision points more visible, helping increase not only our understanding of how it arrived at the output but also trust in its accuracy and validity.

Of course, this is only another step towards better AI. Many things can still go wrong, such as 'reward hacking' (i.e., finding loopholes). And, as I pointed out in my last note, the AI still doesn't know when to ask and dynamically learn from humans.

#ArtificialIntelligence #LLM #LargeLanguageModels #LargeConceptModels #AIResearch #MachineLearning #NaturalLanguageProcessing #TechInnovation #AITrends #SemanticAI #FutureOfAI


[Rick Munoz started working in AI at Symbolics, Inc. in the 1980s. He went on to incorporate AI components like Expert Systems, Natural Language Processing, and Fuzzy Logic into multiple systems. He currently designs and implements large cloud-based applications that include AI capabilities.]

Felipe Andrés Barrera Lobo

Strategy, Growth, and New Business | Startups | Acceleration and management of corporate and entrepreneurial innovation | Public and private capital raising

1 month ago

Working with AI teams, I keep noticing how these systems get surprisingly creative in finding shortcuts to maximize rewards - not always in ways we intended.
