Eligibility Traces: A Spectrum of New Learning Algorithms

Hello there!

Last time, we talked about states that are not observable and how we can deal with them in an RL context. If you missed it, check it out here.

Today, we are going to discuss Eligibility Traces, a technique that generalizes learning algorithms and also improves their efficiency. The Eligibility Traces technique relies heavily on a solid understanding of both Monte Carlo and Temporal Difference methods.

Temporal Difference (TD) methods

Let's discuss the simplest form of the temporal difference spectrum, which is TD(0).

TD(0) update equation
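In case the original image does not render here, the TD(0) update rule in standard notation (α is the step size, γ the discount factor) is:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
```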


As we can see in the update equation of the TD(0) algorithm, the value function of the current state is 'estimated' using a target consisting of the value function of the next 'sampled' state, discounted by gamma, plus the immediate reward received.

The quoted words indicate that this algorithm is a bootstrapping plus sampling technique. Bootstrapping means that the update relies heavily on an estimate, which introduces a high bias into our algorithm. Note that sampling here takes the form of a one-step lookahead, so very little variance is introduced.

Drawbacks of TD methods

  1. TD relies on biased estimates, which can produce slow convergence and even stagnation, getting stuck in a local optimum.
  2. TD(0) has slow credit assignment, as after each update the reward is propagated back by only one step.

Note that there are other variants of TD that mitigate these drawbacks, such as n-step TD, in which the value function of the immediate next state is not used in the update equation. Instead, the target accumulates the sequence of rewards obtained over an n-step lookahead, bootstrapping only on the state reached after n steps. Here is an example where n = 2; keep it in mind because we will need it later on :D

Return used as the target in the 2-step TD update equation
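Reproducing the missing equation in the same notation, the two-step return used as the target is:

```latex
G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} V(S_{t+2})
```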


General update formula for n-step bootstrapping algorithms
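In general (again reconstructing the missing image in standard notation), the n-step return and the corresponding update are:

```latex
G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_{t:t+n} - V(S_t) \right]
```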


Monte Carlo Methods (MC)

MC Backup diagram


For MC methods, the value function of the current state is updated using a target consisting of Gt, where Gt is the total return. MC learning cannot take place in non-episodic tasks, since the updates are done using the return Gt, which can only be obtained at the end of each episode. Also, as we can see from the backup diagram, the algorithm relies on a long sequence of sampling events, which introduces variance into our estimate.
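For completeness (the backup diagram image may not be visible), the MC target and update for an episode that terminates at time T are:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots + \gamma^{T-t-1} R_{T},
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]
```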


Drawbacks of MC methods

  1. High variance leads to increased uncertainty in the estimated values, making it difficult to obtain accurate and reliable value function estimates. Consequently, it requires a large number of samples to reduce the variance adequately, resulting in slower learning and reduced sample efficiency.
  2. The algorithm in its vanilla form is incapable of solving non-episodic tasks.


Offline λ-return: balancing the bias-variance tradeoff

The drawbacks of each algorithm mentioned above were the motivation to introduce a new algorithm that resolves the bias-variance tradeoff, but which still faces the "wait till the end of the episode" problem.

Remember that the high-bias drawback of TD can be mitigated by using an n-step update rather than the one-step lookahead, and we illustrated the target equation for n = 2. The question that arises now is: which n should we use? Why not use an interpolation between all the possible n-step returns?

Geometric weighting of n-step returns


Why not use a geometric weighting of the n-step returns according to a factor λ? By weighting the future steps this way, the weights decay according to λ, so returns obtained from nearby lookaheads count more than the far upcoming ones.

The used decaying mechanism
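Spelling out the decaying mechanism shown in the figure: the n-step return receives the weight (1 − λ)λ^(n−1), and for 0 ≤ λ < 1 these weights form a geometric series that sums to one:

```latex
\sum_{n=1}^{\infty} (1-\lambda)\,\lambda^{\,n-1} = 1
```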


The target in our algorithm is then replaced by the following return:

Lambda return target
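Written out (a reconstruction of the missing image, following Sutton & Barto's notation), the λ-return is:

```latex
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_{t:t+n}
```

Setting λ = 0 recovers the TD(0) target, while λ = 1 (in episodic tasks) recovers the full Monte Carlo return, so λ interpolates smoothly between the two extremes.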


Eligibility Traces

So far, we have addressed the bias-variance tradeoff, but we are still stuck with the offline setting of the λ-return algorithm: we are still constrained to having the full trajectory. So what is the solution? The solution is pretty simple. Instead of looking forward in time at the upcoming steps (the "forward view"), we can use a "backward view" in which the previously visited states are updated based on what we are observing now.

Forward view used in all previous algorithms


Backward view used in the Eligibility Traces


Let's break down that last figure. The eligibility traces are contained in the vector z_t. An eligibility trace acts like a short-term memory that keeps track of the states visited so far, with each trace decaying over time according to γλ. The quantity δ_t is the TD error, which plays the role of the "reinforcing event". Whenever a reinforcing event occurs, the values of all states with a non-zero trace are updated in proportion to their trace.
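Concretely, for the tabular case (a sketch of the standard accumulating-trace TD(λ) updates described in Sutton & Barto, Chapter 12), the quantities involved at every step t are:

```latex
z_t(s) = \gamma \lambda\, z_{t-1}(s) + \mathbb{1}\{S_t = s\} \quad \text{for all } s,
\qquad
\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t),
\qquad
V(s) \leftarrow V(s) + \alpha\, \delta_t\, z_t(s) \quad \text{for all } s
```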

  • One advantage of the new online setting is that the algorithm is more computationally efficient, since we only have to store a single trace vector.
  • Another advantage of this method is credit assignment: the reward is now assigned and propagated not only to the most recently visited state but to all the states that directly contributed to that reward.

For the detailed algorithm of Eligibility Traces, also known as TD(λ), please see Chapter 12 of the Sutton and Barto book "Reinforcement Learning: An Introduction" (Second Edition).
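As an illustration only (a minimal sketch, not the book's exact pseudocode), here is how tabular TD(λ) prediction with accumulating traces might look in Python; the `env`/`policy` interface and all hyperparameter values are assumptions made for the example:

```python
# Minimal tabular TD(lambda) prediction sketch with accumulating traces.
# Assumes a hypothetical `env` with reset()/step(action) -> (next_state, reward, done)
# and a `policy` callable mapping a state to an action.
from collections import defaultdict

def td_lambda_prediction(env, policy, episodes=500,
                         alpha=0.1, gamma=0.99, lam=0.9):
    V = defaultdict(float)              # state-value estimates
    for _ in range(episodes):
        z = defaultdict(float)          # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)

            # TD error: the "reinforcing event" broadcast to past states
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]

            # Bump the trace of the state we just left (accumulating trace)
            z[state] += 1.0

            # Update every traced state in proportion to its trace,
            # then decay all traces by gamma * lambda
            for s in list(z.keys()):
                V[s] += alpha * delta * z[s]
                z[s] *= gamma * lam

            state = next_state
    return V
```

Note how a single dictionary of traces is the only extra memory required, and how each TD error updates every recently visited state at once, which is exactly the credit-assignment advantage described above.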


References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (Second Edition). MIT Press.

