DeepSeek's AI improves when rewarded
Reinforcement Learning (RL) has been a popular approach to training and improving AI models. Here's an oversimplification of how it works: have the AI agent check its generated answers against a "reward function" - that is, some way of judging how good or bad the answer is in the context of the AI's goals. There's more to it than that, of course, including updating the model so that it learns to maximize the rewards it receives.
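To make that concrete, here is a minimal, purely illustrative sketch of the idea: a toy reward function scores candidate answers against a known target, and a real training loop would then nudge the model toward the higher-scoring ones. The function names and scoring rule here are my own simplifications, not any production RL setup.

```python
def reward_function(answer: str, target: str) -> float:
    """Toy reward: full credit for an exact match, partial credit for word overlap."""
    if answer == target:
        return 1.0
    overlap = len(set(answer.split()) & set(target.split()))
    return overlap / max(len(target.split()), 1)

def pick_best(candidate_answers: list[str], target: str) -> str:
    """Score each candidate; a real RL loop would also update the model's
    parameters to make high-reward answers more likely next time."""
    return max(candidate_answers, key=lambda a: reward_function(a, target))

if __name__ == "__main__":
    candidates = ["Paris is the capital", "Paris is the capital of France", "Lyon"]
    print(pick_best(candidates, "Paris is the capital of France"))
```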
In Agentic AI, a similar approach is taken to improve the output generated by Large Language Models: have agents check the output of other agents and pass judgment on 'how good' that output is. Based on that, the agents essentially "reward" an LLM when it produces better output. Although not technically using a reward function, agents can use different methods to improve the generated output, including mechanisms like peer review, voting, feedback loops, supervisor review, and automated output testing.
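A rough sketch of what that judging step can look like, with two made-up reviewer "agents" voting on drafts. The reviewers and their scoring rules are placeholders for illustration, not any specific agent framework.

```python
from statistics import mean

# Hypothetical reviewer "agents": each returns a score in [0, 1] for a draft answer.
def reviewer_factual(draft: str) -> float:
    return 0.9 if "France" in draft else 0.4        # stand-in for a factuality check

def reviewer_completeness(draft: str) -> float:
    return min(len(draft.split()) / 20, 1.0)        # stand-in for a completeness check

def peer_review(drafts: list[str]) -> str:
    """Average the reviewers' votes for each draft and return the winner,
    mimicking a voting / supervisor-review loop between agents."""
    reviewers = [reviewer_factual, reviewer_completeness]
    scores = {d: mean(r(d) for r in reviewers) for d in drafts}
    return max(scores, key=scores.get)

drafts = ["Paris.", "The capital of France is Paris, located on the Seine."]
print(peer_review(drafts))
```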
Reward functions are typically set up and chosen by humans, possibly in combination with automated iterative testing and refinement. Some AI models are trained using Group Relative Policy Optimization (GRPO), an approach where a group of candidate outputs is scored and compared against one another to refine the model during training toward maximizing the prescribed reward. However, the underlying reward functions are typically pre-defined and fixed. Additionally, they are generally used only during the model's training.
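For readers who want to see the mechanics, here is a simplified sketch of the group-relative scoring step at the heart of GRPO. The full algorithm also includes a clipped policy-ratio objective and a KL penalty against a reference model, which I'm leaving out here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Group-relative scoring: each sampled output's reward is expressed relative
    to its group (reward minus the group mean, divided by the group's standard
    deviation). Outputs that beat their peers get a positive advantage and are
    reinforced when the policy is updated."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0   # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Four sampled answers to the same prompt, scored by a fixed reward function.
rewards = [0.2, 0.9, 0.5, 0.4]
print(group_relative_advantages(rewards))  # the 0.9 sample gets the largest positive advantage
```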
Now, DeepSeek R1 has extended the use of GRPO to generating output after the model has been trained. Simply put, using GRPO to continuously evaluate generated output enables it to find and select the "best" output. This use of GRPO also makes DeepSeek R1 more efficient while driving up the quality of the output.
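In spirit, that looks something like the best-of-a-group selection below. This is my own illustration of the pattern, not DeepSeek's actual inference code; the generator and reward are stand-ins.

```python
def generate_candidates(prompt: str, n: int = 4) -> list[str]:
    """Stand-in for sampling n answers from the model for the same prompt."""
    return [f"candidate {i} for: {prompt}" for i in range(n)]

def reward(answer: str) -> float:
    """Stand-in reward; a real one would judge correctness, reasoning quality, etc."""
    return float(len(answer))

def best_of_group(prompt: str) -> str:
    """Score a group of candidate answers and keep the highest-scoring one -
    group-style evaluation applied at inference time rather than only during training."""
    return max(generate_candidates(prompt), key=reward)

print(best_of_group("Explain reinforcement learning"))
```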
However, for me, the real kicker is how DeepSeek R1 selects and uses reward functions as part of GRPO. Instead of having a fixed reward function determined ahead of time, R1 selects the reward function(s) that best represent the task or goals of the prompt. Furthermore, reward functions can be added for intermediate goals that the agent identifies as it progresses. Determining which reward functions to use is driven not only by the current context but also by how well those reward functions are performing as the model works toward the goal of the prompt.
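To show the flavor of that, here is a toy sketch of prompt-driven reward-function selection. The reward library, its keys, and the matching rules are all invented for illustration.

```python
from typing import Callable

RewardFn = Callable[[str], float]

# A hypothetical library of reward functions keyed by the kind of task they judge.
REWARD_LIBRARY: dict[str, RewardFn] = {
    "math":   lambda out: 1.0 if out.strip().endswith("42") else 0.0,   # stand-in correctness check
    "code":   lambda out: 1.0 if "def " in out else 0.0,                # stand-in "is it code" check
    "safety": lambda out: 0.0 if "password" in out.lower() else 1.0,    # stand-in policy check
}

def select_reward_fns(prompt: str) -> list[RewardFn]:
    """Pick the reward functions that match the task implied by the prompt;
    intermediate goals discovered mid-reasoning could append further entries."""
    selected = [REWARD_LIBRARY["safety"]]                 # always-on guardrail
    if any(w in prompt.lower() for w in ("solve", "calculate")):
        selected.append(REWARD_LIBRARY["math"])
    if "write a function" in prompt.lower():
        selected.append(REWARD_LIBRARY["code"])
    return selected

print(len(select_reward_fns("Please calculate the answer")), "reward functions selected")
```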
This approach also allows multiple reward functions to be applied to a reasoning flow, such as using separate reward functions to balance accuracy and speed. Compliance and safety reward functions can also be injected to detect and penalize undesired output.
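Balancing several reward functions can be as simple as a weighted sum plus penalties. The weights and checks below are placeholders, just to show the shape of it.

```python
def combined_reward(output: str, latency_s: float) -> float:
    """Blend several reward signals: reward an accuracy proxy, reward speed,
    and apply a hard compliance penalty for disallowed content."""
    accuracy = 1.0 if "because" in output else 0.5        # stand-in for an accuracy/justification check
    speed = max(0.0, 1.0 - latency_s / 10.0)              # faster answers score higher
    compliance_penalty = -5.0 if "credit card" in output.lower() else 0.0
    return 0.7 * accuracy + 0.3 * speed + compliance_penalty

print(combined_reward("X happens because of Y", latency_s=2.0))
```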
This approach results in better generated output and the ability to "see" into these GRPO-driven reasoning decisions. DeepSeek R1 makes those decision points more visible, helping increase not only the understanding of how it arrived at the output but also the trust in its accuracy and validity.
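One simple way to get that kind of visibility is to log every reward evaluation alongside the final choice. This trace-keeping sketch is my own illustration of the idea, not how DeepSeek exposes it.

```python
import json

def score_with_trace(candidates: list[str], reward_fns: dict) -> dict:
    """Keep a record of every reward evaluation so the final selection can be audited."""
    trace = []
    for c in candidates:
        scores = {name: fn(c) for name, fn in reward_fns.items()}
        trace.append({"candidate": c, "scores": scores, "total": sum(scores.values())})
    best = max(trace, key=lambda t: t["total"])
    return {"selected": best["candidate"], "trace": trace}

reward_fns = {
    "detail": lambda c: len(c) / 100,                      # stand-in detail score
    "cited":  lambda c: 1.0 if "source:" in c else 0.0,    # stand-in citation check
}
print(json.dumps(score_with_trace(["short answer", "longer answer, source: docs"], reward_fns), indent=2))
```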
Of course, this is only another step towards better AI. Many things can still go wrong, such as 'reward hacking' (i.e., finding loopholes). And, as I pointed out in my last note, the AI still doesn't know when to ask and dynamically learn from humans.
#ArtificialIntelligence
#LLM #LargeLanguageModels
#LargeConceptModels
#AIResearch
#MachineLearning
#NaturalLanguageProcessing
#TechInnovation
#AITrends
#SemanticAI
#FutureOfAI
[Rick Munoz started working in AI at Symbolics, Inc. in the 1980s. He went on to incorporate AI components like Expert Systems, Natural Language Processing, and Fuzzy Logic into multiple systems. He currently designs and implements large cloud-based applications that include AI capabilities.]