Policy Gradient Theorem for Continuing Tasks - RL


Welcome again!

In today's article, we will walk through the proof of the policy gradient theorem. The policy gradient theorem plays a vital role in calculating the gradient of a new objective for continuing tasks in RL: the average reward r(π).

First of all, let's discuss our objective, r(π).


Average Reward Objective



Imagine the agent has interacted with the environment for h steps. r(π) is the reward it has received on average across those h steps, taken in the limit as h grows.
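Since the original equation image is not shown here, a sketch of the objective as defined in Sutton and Barto (Chapter 13, Section 6), with p denoting the environment dynamics and μ the stationary state distribution under π:

\[
r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[ R_t \mid S_0, A_{0:t-1} \sim \pi \big]
       = \sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r
\]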

To better understand this objective, let's break it down:

The objective is a sum over all states of the product of 1) the stationary distribution μ(s) and 2) the expected reward under the policy π from a particular state s.

1) μ is also called the state visitation rate: the fraction of time each state is visited under the policy (keep this in mind).

μ is a probability distribution across states.

Another important property of μ: if you start in μ and select actions according to π, the distribution over states remains μ. In other words, μ is stationary under π.
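Written out as an equation (a reconstruction of the stationarity condition, since the original image is not shown here):

\[
\sum_{s} \mu(s) \sum_{a} \pi(a \mid s, \boldsymbol{\theta})\, p(s' \mid s, a) = \mu(s'), \quad \text{for all } s'
\]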

Keep this equation in mind; let's call it equation X.

2) The expected reward under the policy π from a particular state s depends on both the policy and the environment dynamics, and can be seen more clearly through a backup diagram.


Backup Diagram

The expected reward from this state would be (0.25)(0.3·1 + 0.7·2) + (0.75)(0.8·2 + 0.2·(−2)) = 0.25·1.7 + 0.75·1.2 = 1.325: each action's expected reward is weighted by the probability that the policy selects that action.
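In general, this per-state expected reward, call it r_π(s), follows the same pattern (a sketch based on the definitions above):

\[
r_\pi(s) = \sum_{a} \pi(a \mid s, \boldsymbol{\theta}) \sum_{s', r} p(s', r \mid s, a)\, r
\]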


Average Reward Objective Gradient

Back to our objective: we need to find its gradient with respect to the adjustable parameter θ. We can use the product rule, which yields two terms.


The gradient of the average reward objective depends on the gradient of μ
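A reconstruction of the product-rule expansion (here and below, ∇ denotes the gradient with respect to θ):

\[
\nabla r(\pi) = \sum_{s} \Big[ \nabla \mu(s)\, r_\pi(s) + \mu(s)\, \nabla r_\pi(s) \Big]
\]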

Recall that μ, the state visitation rate, depends on the policy we are trying to optimize, so μ is also a function of the tunable parameter θ. Computing the gradient of μ is a difficult task; luckily, the policy gradient theorem can help.


Policy Gradient Theorem

The policy gradient theorem gives a simpler expression for this gradient, one that does not involve the gradient of μ.
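For continuing tasks, the theorem states (as in Sutton and Barto, Chapter 13):

\[
\nabla r(\boldsymbol{\theta}) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s, \boldsymbol{\theta})\, q_\pi(s, a)
\]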


To walk through the proof, we first compute the gradient of the state-value function v_π(s), written in terms of q_π(s, a). Using the product rule gives two terms:
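A reconstruction of this step, starting from v_π(s) = Σ_a π(a|s) q_π(s, a):

\[
\nabla v_\pi(s) = \sum_{a} \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla q_\pi(s, a) \Big]
\]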

We can then expand q_π using the Bellman equation for the continuing setting: the immediate reward minus the average reward (the differential reward) plus the state value of the next state, and take the gradient of that expression.
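A sketch of this expansion; the reward values and the dynamics p do not depend on θ, so only r(θ) and v_π(s') contribute to the gradient:

\[
\nabla q_\pi(s, a) = \nabla \Big[ \sum_{s', r} p(s', r \mid s, a)\big( r - r(\boldsymbol{\theta}) + v_\pi(s') \big) \Big]
                 = -\nabla r(\boldsymbol{\theta}) + \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')
\]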

Re-arranging terms, we get an expression for ∇r(θ):
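Substituting the previous expression back into ∇v_π(s) and solving for ∇r(θ) gives (a reconstruction of the missing equation):

\[
\nabla r(\boldsymbol{\theta}) = \sum_{a} \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a)
    + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big] - \nabla v_\pi(s)
\]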

Notice that the LHS does not depend on the state s, so the equality holds for every s; therefore we can weight both sides by μ(s) and sum over states without changing anything. We can also split the resulting expression into three terms:
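A reconstruction of the weighted and split expression:

\[
\nabla r(\boldsymbol{\theta}) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s)\, q_\pi(s, a)
    + \sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')
    - \sum_{s} \mu(s)\, \nabla v_\pi(s)
\]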

Returning to equation X, we can refactor the second term, yielding:
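By equation X, the sums over s and a in the second term collapse to μ(s'):

\[
\sum_{s} \mu(s) \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s')
    = \sum_{s'} \mu(s')\, \nabla v_\pi(s')
\]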

Cancelling the second and third terms against each other yields a much simpler expression for our gradient:
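The result is exactly the policy gradient theorem for the average reward objective:

\[
\nabla r(\boldsymbol{\theta}) = \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s, \boldsymbol{\theta})\, q_\pi(s, a)
\]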

Notice how the new expression contains only 1) q_π, which we can estimate with many methods, and 2) the gradient of π, which we can compute exactly. Computing the full sum over states is impractical, though, so in practice we sample and use stochastic gradient ascent.

Stochastic gradient ascent update

This is what the stochastic gradient ascent update looks like for a parameterized policy.
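A reconstruction of the update: sampling S_t from μ and A_t from π turns the double sum into an expectation of q_π(S_t, A_t) ∇ln π(A_t | S_t, θ), so a single-sample update with an estimate q̂ (weights w) of q_π is:

\[
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, \hat{q}(S_t, A_t, \mathbf{w})\, \nabla \ln \pi(A_t \mid S_t, \boldsymbol{\theta}_t)
\]

(Actor-critic variants replace q̂ with the TD error.) As a minimal illustration, and not something from the original article, here is a sketch in Python of this update for a softmax policy with linear action preferences; the feature constructor, step size, and q̂ value are hypothetical placeholders:

import numpy as np

# Hypothetical stacked features x(s, a): the state feature vector placed in the
# block that corresponds to action a (for illustration only).
def action_features(s, a, num_actions):
    dim = len(s)
    x = np.zeros(num_actions * dim)
    x[a * dim:(a + 1) * dim] = s
    return x

# Softmax policy: pi(a | s, theta) proportional to exp(theta^T x(s, a)).
def policy_probs(s, theta, num_actions):
    prefs = np.array([theta @ action_features(s, a, num_actions)
                      for a in range(num_actions)])
    prefs -= prefs.max()  # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

# For a softmax-linear policy:
# grad ln pi(a | s, theta) = x(s, a) - sum_b pi(b | s, theta) x(s, b)
def grad_log_pi(s, a, theta, num_actions):
    probs = policy_probs(s, theta, num_actions)
    expected_x = sum(probs[b] * action_features(s, b, num_actions)
                     for b in range(num_actions))
    return action_features(s, a, num_actions) - expected_x

# One stochastic gradient *ascent* step on the average reward objective,
# using an estimated action value q_hat for the sampled (s, a) pair.
def policy_gradient_step(theta, s, a, q_hat, alpha, num_actions):
    return theta + alpha * q_hat * grad_log_pi(s, a, theta, num_actions)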

References

Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, Chapter 13, Section 6.

"Prediction and Control with Function Approximation" UAlberta course on Coursera


