Policy Gradient Theorem for Continuing Tasks - RL


Welcome again!

In today's article, we will walk through the proof of the policy gradient theorem. The policy gradient theorem plays a vital role in calculating the gradient of a new objective for continuing tasks in RL: the average reward r(π).

First of all, let's discuss our objective, r(π).


Average Reward Objective



Imagine the agent has interacted with the environment for h steps. r(π) is the reward it has received on average across those h steps, taken in the limit as h grows.
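Since the original equation image is not shown here, a sketch of the objective as defined in Sutton and Barto (Chapter 13, Section 6), with p denoting the environment dynamics and μ the stationary state distribution under π:

\[
r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[ R_t \mid S_0, A_{0:t-1} \sim \pi \big]
       = \sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
\]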

To better understand this objective, let's break it down:

The objective is a sum over all states of the product of 1) the stationary distribution μ(s) and 2) the expected reward under the policy π from a particular state s.

1) μ is also called the state visitation rate: the fraction of time each state is visited under the policy (keep this in mind).

μ is a probability distribution across states.

Another important property of μ: if you start in μ and select actions according to π, the distribution over states remains μ. In other words, μ is stationary under π.
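Written out as an equation (a reconstruction of the stationarity condition, since the original image is not shown here):

\[
\sum_{s} \mu(s) \sum_{a} \pi(a \mid s, \boldsymbol{\theta})\, p(s' \mid s, a) = \mu(s'), \quad \text{for all } s'
\]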

Keep this equation in mind; let's call it equation X.

2) The expected reward under the policy π from a particular state s depends on both the policy and the environment dynamics, and can be seen more clearly through a backup diagram.


Backup Diagram

The expected reward from this state would be (0.25)(0.3·1 + 0.7·2) + (0.75)(0.8·2 + 0.2·(−2)) = 0.25·1.7 + 0.75·1.2 = 1.325: each action's expected reward is weighted by the probability that the policy selects that action.
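In general, this per-state expected reward, call it r_π(s), follows the same pattern (a sketch based on the definitions above):

\[
r_\pi(s) = \sum_{a} \pi(a \mid s, \boldsymbol{\theta}) \sum_{s', r} p(s', r \mid s, a)\, r
\]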


Average Reward Objective Gradient

Back to our objective: we need to find its gradient with respect to the adjustable parameter θ. We can use the product rule, which yields two terms.


The gradient of the average reward objective depends on the gradient of μ
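A reconstruction of the product-rule expansion (here and below, ∇ denotes the gradient with respect to θ):

\[
\nabla r(\pi) = \sum_{s} \Big[ \nabla \mu(s)\, r_\pi(s) + \mu(s)\, \nabla r_\pi(s) \Big]
\]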

Recall that μ, the state visitation rate, depends on the policy we are trying to optimize, so μ is also a function of the tunable parameter θ. Computing the gradient of μ is a difficult task; luckily, the policy gradient theorem can help.


Policy Gradient Theorem

The policy gradient theorem gives a simpler expression for this gradient, one that does not involve the gradient of μ.
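For continuing tasks, the theorem states (as in Sutton and Barto, Chapter 13):

\[
\nabla r(\boldsymbol{\theta}) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s, \boldsymbol{\theta})\, q_\pi(s, a)
\]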


To walk through the proof, we first compute the gradient of the state-value function v_π(s), written in terms of q_π(s, a). Using the product rule gives two terms:
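A reconstruction of this step, starting from v_π(s) = Σ_a π(a|s) q_π(s, a):

\[
\nabla v_\pi(s) = \sum_{a} \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla q_\pi(s, a) \Big]
\]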

We can then expand q_π using the Bellman equation for the continuing setting: the immediate reward minus the average reward (the differential reward) plus the state value of the next state, and take the gradient of that expression.
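A sketch of this expansion; the reward values and the dynamics p do not depend on θ, so only r(θ) and v_π(s') contribute to the gradient:

\[
\nabla q_\pi(s, a) = \nabla \Big[ \sum_{s', r} p(s', r \mid s, a)\big( r - r(\boldsymbol{\theta}) + v_\pi(s') \big) \Big]
                 = -\nabla r(\boldsymbol{\theta}) + \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')
\]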

Re-arranging terms, we get an expression for ∇r(θ):
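Substituting the previous expression back into ∇v_π(s) and solving for ∇r(θ) gives (a reconstruction of the missing equation):

\[
\nabla r(\boldsymbol{\theta}) = \sum_{a} \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a)
    + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big] - \nabla v_\pi(s)
\]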

Notice that the LHS does not depend on the state s, so the equality holds for every s; therefore we can weight both sides by μ(s) and sum over states without changing anything. We can also split the resulting expression into three terms:
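A reconstruction of the weighted and split expression:

\[
\nabla r(\boldsymbol{\theta}) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s)\, q_\pi(s, a)
    + \sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')
    - \sum_{s} \mu(s)\, \nabla v_\pi(s)
\]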

Returning to equation X, we can refactor the second term, yielding:
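By equation X, the sums over s and a in the second term collapse to μ(s'):

\[
\sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')
    = \sum_{s'} \mu(s')\, \nabla v_\pi(s')
\]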

Cancelling the second and third terms against each other yields a much simpler expression for our gradient:
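The result is exactly the policy gradient theorem for the average reward objective:

\[
\nabla r(\boldsymbol{\theta}) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s, \boldsymbol{\theta})\, q_\pi(s, a)
\]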

Notice how the new expression contains only 1) q_π, which we can estimate with many methods, and 2) the gradient of π, which we can compute exactly. Computing the full sum over states is impractical, though, so in practice we sample and use stochastic gradient ascent.

Stochastic gradient ascent update

This is what the stochastic gradient ascent update looks like for a parameterized policy.
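A reconstruction of the update: sampling S_t from μ and A_t from π turns the double sum into an expectation of q_π(S_t, A_t) ∇ln π(A_t | S_t, θ), so a single-sample update with an estimate q̂ (weights w) of q_π is:

\[
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, \hat{q}(S_t, A_t, \mathbf{w})\, \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta}_t)
\]

(Actor-critic variants replace q̂ with the TD error.) As a minimal illustration, and not something from the original article, here is a sketch in Python of this update for a softmax policy with linear action preferences; the feature constructor, step size, and q̂ value are hypothetical placeholders:

import numpy as np

# Hypothetical stacked features x(s, a): the state feature vector placed in the
# block that corresponds to action a (for illustration only).
def action_features(s, a, num_actions):
    dim = len(s)
    x = np.zeros(num_actions * dim)
    x[a * dim:(a + 1) * dim] = s
    return x

# Softmax policy: pi(a | s, theta) proportional to exp(theta^T x(s, a)).
def policy_probs(s, theta, num_actions):
    prefs = np.array([theta @ action_features(s, a, num_actions)
                      for a in range(num_actions)])
    prefs -= prefs.max()  # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# For a softmax-linear policy:
# grad ln pi(a | s, theta) = x(s, a) - sum_b pi(b | s, theta) x(s, b)
def grad_log_pi(s, a, theta, num_actions):
    probs = policy_probs(s, theta, num_actions)
    expected_x = sum(probs[b] * action_features(s, b, num_actions)
                     for b in range(num_actions))
    return action_features(s, a, num_actions) - expected_x

# One stochastic gradient *ascent* step on the average reward objective,
# using an estimated action value q_hat for the sampled (s, a) pair.
def policy_gradient_step(theta, s, a, q_hat, alpha, num_actions):
    return theta + alpha * q_hat * grad_log_pi(s, a, theta, num_actions)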

References

Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, Chapter 13, Section 6.

"Prediction and Control with Function Approximation" UAlberta course on Coursera


