Eligibility Traces: A Spectrum of New Learning Algorithms

Hello there!

Last time, we talked about states that are not observable and how we can deal with them in an RL context. If you missed it, check it out here.

Today, we are going to discuss Eligibility Traces, a technique that generalizes learning algorithms and also improves their efficiency. The Eligibility Traces technique relies heavily on a solid understanding of both Monte Carlo and Temporal Difference methods.

Temporal Difference (TD) methods

Let's discuss the simplest form of the temporal difference spectrum, which is TD(0).

TD(0) update equation
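In case the original image does not render here, the TD(0) update rule in standard notation (α is the step size, γ the discount factor) is:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
```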


As we can see in the update equation of the TD(0) algorithm, the value function of the current state is 'estimated' using a target consisting of the value function of the next 'sampled' state, discounted by gamma, plus the immediate reward received.

The quoted words indicate that this algorithm is a bootstrapping plus sampling technique. Bootstrapping means that the update relies heavily on an estimate, which introduces a high bias into our algorithm. Note that sampling here takes the form of a one-step lookahead, so very little variance is introduced.

Drawbacks of TD methods

  1. TD relies on biased estimates, which can produce slow convergence and even stagnation, getting stuck in a local optimum.
  2. TD(0) has slow credit assignment, as after each update the reward is propagated back by only one step.

Note that there are other variants of TD that mitigate these drawbacks, such as n-step TD, in which the value function of the immediate next state is not used in the update equation. Instead, the target accumulates the sequence of rewards obtained over an n-step lookahead, bootstrapping only on the state reached after n steps. Here is an example where n = 2; keep it in mind because we will need it later on :D

Return used as the target in the 2-step TD update equation
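Reproducing the missing equation in the same notation, the two-step return used as the target is:

```latex
G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} V(S_{t+2})
```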


General update formula for n-step bootstrapping algorithms
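In general (again reconstructing the missing image in standard notation), the n-step return and the corresponding update are:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]
```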


Monte Carlo Methods (MC)

MC Backup diagram


For MC methods, the value function of the current state is updated using a target consisting of Gt, where Gt is the total return. MC learning cannot take place in non-episodic tasks, since the updates are done using the return Gt, which can only be obtained at the end of each episode. Also, as we can see from the backup diagram, the algorithm relies on a long sequence of sampling events, which introduces variance into our estimate.
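For completeness (the backup diagram image may not be visible), the MC target and update for an episode that terminates at time T are:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots + \gamma^{T-t-1} R_{T},
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]
```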


Drawbacks of MC methods

  1. High variance leads to increased uncertainty in the estimated values, making it difficult to obtain accurate and reliable value function estimates. Consequently, it requires a large number of samples to reduce the variance adequately, resulting in slower learning and reduced sample efficiency.
  2. The algorithm in its vanilla form is incapable of solving non-episodic tasks.


Offline λ-return: balancing the bias-variance tradeoff

The drawbacks of each algorithm mentioned above were the motivation to introduce a new algorithm that resolves the bias-variance tradeoff, but which still faces the "wait till the end of the episode" problem.

Remember that the high-bias drawback of TD can be mitigated by using an n-step update rather than the one-step lookahead, and we illustrated the target equation for n = 2. The question that arises now is: which n should we use? Why not use an interpolation between all the possible n-step returns?

Geometric weighting of n-step returns


Why not use a geometric weighting of the n-step returns according to a factor λ? By weighting the future steps this way, the weights decay according to λ, so returns obtained from nearby lookaheads count more than the far upcoming ones.

The used decaying mechanism
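Spelling out the decaying mechanism shown in the figure: the n-step return receives the weight (1 − λ)λ^(n−1), and for 0 ≤ λ < 1 these weights form a geometric series that sums to one:

```latex
\sum_{n=1}^{\infty} (1-\lambda)\,\lambda^{\,n-1} = 1
```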


The target in our algorithm is then replaced by the following return:

Lambda return target
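Written out (a reconstruction of the missing image, following Sutton & Barto's notation), the λ-return is:

```latex
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_{t:t+n}
```

Setting λ = 0 recovers the TD(0) target, while λ = 1 (in episodic tasks) recovers the full Monte Carlo return, so λ interpolates smoothly between the two extremes.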


Eligibility Traces

So far, we have addressed the bias-variance tradeoff, but we are still stuck with the offline setting of the λ-return algorithm: we are still constrained to having the full trajectory. So what is the solution? The solution is pretty simple. Instead of looking forward in time at the upcoming steps (the "forward view"), we can use a "backward view" in which the previously visited states are updated based on what we are observing now.

Forward view used in all previous algorithms


Backward view used in the Eligibility Traces


Let's break down that last figure. The eligibility traces are contained in the vector z_t. An eligibility trace acts like a short-term memory that keeps track of the states visited so far, with each trace decaying over time according to γλ. The quantity δ_t is the TD error, which plays the role of the "reinforcing event". Whenever a reinforcing event occurs, the values of all states with a non-zero trace are updated in proportion to their trace.
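Concretely, for the tabular case (a sketch of the standard accumulating-trace TD(λ) updates described in Sutton & Barto, Chapter 12), the quantities involved at every step t are:

```latex
z_t(s) = \gamma \lambda\, z_{t-1}(s) + \mathbb{1}\{S_t = s\} \quad \text{for all } s,
\qquad
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t),
\qquad
V(s) \leftarrow V(s) + \alpha\, \delta_t\, z_t(s) \quad \text{for all } s
```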

  • One advantage of the new online setting is that the algorithm is more computationally efficient, since we only have to store a single trace vector.
  • Another advantage of this method is credit assignment: the reward is now assigned and propagated not only to the most recently visited state but to all the states that directly contributed to that reward.

For the detailed algorithm of Eligibility Traces, also known as TD(λ), please see Chapter 12 of the Sutton and Barto book "Reinforcement Learning: An Introduction" (Second Edition).
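As an illustration only (a minimal sketch, not the book's exact pseudocode), here is how tabular TD(λ) prediction with accumulating traces might look in Python; the `env`/`policy` interface and all hyperparameter values are assumptions made for the example:

```python
# Minimal tabular TD(lambda) prediction sketch with accumulating traces.
# Assumes a hypothetical `env` with reset()/step(action) -> (next_state, reward, done)
# and a `policy` callable mapping a state to an action.
from collections import defaultdict

def td_lambda_prediction(env, policy, episodes=500,
                         alpha=0.1, gamma=0.99, lam=0.9):
    V = defaultdict(float)              # state-value estimates
    for _ in range(episodes):
        z = defaultdict(float)          # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)

            # TD error: the "reinforcing event" broadcast to past states
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]

            # Bump the trace of the state we just left (accumulating trace)
            z[state] += 1.0

            # Update every traced state in proportion to its trace,
            # then decay all traces by gamma * lambda
            for s in list(z.keys()):
                V[s] += alpha * delta * z[s]
                z[s] *= gamma * lam

            state = next_state
    return V
```

Note how a single dictionary of traces is the only extra memory required, and how each TD error updates every recently visited state at once, which is exactly the credit-assignment advantage described above.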


References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Second Edition). MIT Press.

