DeepSeek's AI improves when rewarded

Reinforcement Learning (RL) has been a popular approach to training and improving AI models. Here's an oversimplification of how it works: have the AI agent check its generated answers against a "reward function" - that is, some way of judging how good or bad an answer is in the context of the AI's goals. There's more to it than that, including updating the model so that it learns to maximize the rewards it receives.
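To make the idea concrete, here is a toy sketch of a reward function in Python. It is not how any real model is trained (the gradient update that actually changes the model is omitted); it just shows what "scoring an answer" can look like.

```python
# Toy sketch of a reward function (illustrative only; the model update
# that uses these scores during RL training is omitted).

def reward_fn(answer: str, reference: str) -> float:
    """Score an answer: full credit for matching the reference, partial credit for overlap."""
    answer_tokens = set(answer.lower().split())
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    return len(answer_tokens & reference_tokens) / len(reference_tokens)

candidate_answers = ["Paris is the capital of France", "The capital is Lyon"]
reference = "Paris is the capital of France"

for answer in candidate_answers:
    print(answer, "->", reward_fn(answer, reference))
```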

In agentic AI, a similar approach is used to improve the output generated by Large Language Models: have agents check the output of other agents and pass judgment on how good that output is. Based on that judgment, the agents essentially "reward" an LLM when it produces better output. Although not technically a reward function, agents can use different mechanisms to improve the generated output, including peer review, voting, feedback loops, supervisor review, and automated output testing.
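A rough sketch of that "agents judging agents" idea is below. The two judge functions are simple stand-in heuristics of my own; in a real agentic system each judge would typically be another LLM call or an automated test.

```python
# Hypothetical sketch of agents "rewarding" another agent's output by voting.
# The judges below are stand-in heuristics, not real LLM reviewers.

from statistics import mean

def judge_length(output: str) -> float:
    """Prefer concise answers (illustrative heuristic)."""
    return 1.0 if len(output.split()) <= 50 else 0.5

def judge_keywords(output: str) -> float:
    """Check that required terms appear (illustrative heuristic)."""
    required = {"reward", "policy"}
    found = sum(1 for term in required if term in output.lower())
    return found / len(required)

def peer_review(output: str) -> float:
    """Average the judges' scores into a single 'reward' for the output."""
    return mean(judge(output) for judge in (judge_length, judge_keywords))

draft = "The policy is updated to maximize the reward signal."
print("peer-review score:", peer_review(draft))
```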

Reward functions are typically set up and chosen by humans, possibly in combination with automated iterative testing and refinement. Some AI models are trained using Group Relative Policy Optimization (GRPO), an approach where a group of candidate outputs is sampled for each prompt and scored against one another, and the model is refined during training to maximize the prescribed reward. However, the underlying reward functions are typically pre-defined and fixed, and they are generally only used during the model's training.
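The "group relative" part of GRPO can be illustrated in a few lines of Python: sample several outputs for the same prompt, score them, and rate each one relative to its group. This sketch covers only the scoring step; the policy-gradient update that uses these advantages is omitted.

```python
# Minimal sketch of GRPO's group-relative scoring step: each sampled output's
# advantage is its reward measured against the rest of the group.

from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the group's mean and spread."""
    baseline = mean(rewards)
    spread = pstdev(rewards) or 1.0  # avoid division by zero when all rewards match
    return [(r - baseline) / spread for r in rewards]

group_rewards = [0.2, 0.9, 0.5, 0.4]  # e.g. scores for 4 sampled answers to one prompt
print(group_relative_advantages(group_rewards))
```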

Now, DeepSeek R1 has extended the use of GRPO to generating output after the model has been trained. Simply put, using GRPO to continuously evaluate generated output enables it to find and select the "best" output. This use of GRPO also makes DeepSeek R1 more efficient while improving the quality of the output.
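Conceptually, that looks like best-of-group selection at inference time. The sketch below assumes some scoring function is available after training; the placeholder scorer is mine, not DeepSeek's.

```python
# Sketch of "best-of-group" selection at inference time: score several
# candidate outputs and keep the highest-scoring one.

def select_best(candidates: list[str], score) -> str:
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

def score(answer: str) -> float:
    """Placeholder scorer: prefer answers that show their reasoning."""
    return 1.0 if "because" in answer.lower() else 0.0

candidates = [
    "42.",
    "42, because 6 * 7 = 42.",
]
print(select_best(candidates, score))
```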

However, for me, the real kicker is how DeepSeek R1 selects and uses reward functions as part of GRPO. Instead of relying on a fixed reward function determined ahead of time, R1 selects the reward function(s) that best represent the task or goals of the prompt. Furthermore, reward functions can be added for intermediate goals that the agent identifies as it progresses. Which reward functions to use is driven not only by the current context but also by how well each reward function is performing against the goal of the prompt.
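Here is a hypothetical sketch of what selecting reward functions per task could look like. The registry, the task detection, and the reward heuristics are all illustrative assumptions on my part, not DeepSeek's actual code.

```python
# Hypothetical sketch of picking reward functions based on the task implied
# by the prompt. Everything here is illustrative.
import ast

def accuracy_reward(output: str) -> float:
    """Illustrative: reward outputs that state a clearly marked final answer."""
    return 1.0 if "answer:" in output.lower() else 0.0

def brevity_reward(output: str) -> float:
    """Illustrative: reward shorter outputs."""
    return 1.0 / (1.0 + len(output.split()))

def code_parses_reward(output: str) -> float:
    """Illustrative: reward outputs that at least parse as Python."""
    try:
        ast.parse(output)
        return 1.0
    except SyntaxError:
        return 0.0

REWARD_REGISTRY = {
    "coding": [code_parses_reward, brevity_reward],
    "general": [accuracy_reward, brevity_reward],
}

def select_reward_fns(prompt: str):
    """Pick reward functions based on a crude guess at the task type."""
    task = "coding" if "code" in prompt.lower() or "def " in prompt else "general"
    return REWARD_REGISTRY[task]

print([fn.__name__ for fn in select_reward_fns("Write code to sort a list")])
```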

This approach also allows multiple reward functions to be applied to a reasoning flow, such as using separate reward functions to balance accuracy and speed. Compliance and safety reward functions can also be injected to detect and penalize undesired output.
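Blending several reward functions, including a safety penalty, can be as simple as a weighted sum. The weights and the banned-terms check in this sketch are illustrative assumptions, not anything from the R1 paper.

```python
# Sketch of combining multiple reward signals (accuracy, speed, safety)
# into one score. Weights and checks are illustrative.

BANNED_TERMS = {"password", "ssn"}

def safety_penalty(output: str) -> float:
    """Penalize outputs that leak obviously sensitive terms."""
    return -1.0 if any(term in output.lower() for term in BANNED_TERMS) else 0.0

def combined_reward(output: str, accuracy: float, latency_s: float) -> float:
    """Weighted blend of accuracy and speed, plus a safety penalty."""
    speed = 1.0 / (1.0 + latency_s)  # faster responses score higher
    return 0.7 * accuracy + 0.3 * speed + safety_penalty(output)

print(combined_reward("The answer is 42.", accuracy=1.0, latency_s=0.8))
```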

This approach results in better output and the ability to "see" into these GRPO-driven reasoning decisions. DeepSeek R1 makes those decision points more visible, helping increase not only our understanding of how it arrived at the output but also trust in its accuracy and validity.

Of course, this is only another step towards better AI. Many things can still go wrong, such as 'reward hacking' (i.e., finding loopholes). And, as I pointed out in my last note, the AI still doesn't know when to ask and dynamically learn from humans.

#ArtificialIntelligence #LLM #LargeLanguageModels #LargeConceptModels #AIResearch #MachineLearning #NaturalLanguageProcessing #TechInnovation #AITrends #SemanticAI #FutureOfAI


[Rick Munoz started working in AI at Symbolics, Inc. in the 1980s. He went on to incorporate AI components like Expert Systems, Natural Language Processing, and Fuzzy Logic into multiple systems. He currently designs and implements large cloud-based applications that include AI capabilities.]

Felipe Andrés Barrera Lobo

Strategy, Growth, and New Business | Startups | Acceleration and management of corporate and entrepreneurial innovation | Public and private capital raising

1 month ago

Working with AI teams, I keep noticing how these systems get surprisingly creative in finding shortcuts to maximize rewards - not always in ways we intended.
