Q-Learning vs SARSA (Model-Free RL Algorithms)
This article defines Q-Learning and SARSA and then compares them. First, we should understand what these algorithms are and the intuition behind them.
By looking at their update equations (which we will see later), we will find that both aim to solve the famous Bellman equation, which gives us an action-value function (called the Q-function) that corresponds to the utility of taking a specific action in a given state, in terms of the expected future payoff.
Q-Learning
Definition
Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given MDP (Markov decision process). In a nutshell, it maps states to actions by computing the value of each action in each state. And this is the Q-learning equation:
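In its standard form it is the Bellman optimality equation for the action-value function:

Q(s, a) \;=\; \mathbb{E}_{s'} \big[\, r(s, a) + \gamma \max_{a'} Q(s', a') \,\big]

where the expectation is taken over the next state s' that the environment produces after taking action a in state s.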
As we can see, there is an expectation term, which appears because of the randomness of the environment: for a given state and action, we don't know which next state (and reward) we will end up in. In principle this expectation could be computed by integrating (or summing) over all possible next states, but that requires the transition probabilities, which we don't have in the model-free setting. So we have to use a trick to approximate this expectation.
Solving the Expectation
As usual, when we can't compute something exactly, we approximate it, which can be done by sampling trajectories. This means we take, say, 5 trajectories (a trajectory here is a tuple of state, action, reward, and next state) and average over them. Often the only option is to take a single trajectory, as in a self-driving car: you have to wait for the car to take its action and observe what happens next, and that gives you the one sample you can use to estimate the expectation. But a single sample is very noisy, so we need the temporal difference method, which uses an exponentially weighted moving average to learn from one sample at a time, as follows:
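In its standard form, this temporal-difference update rule is:

Q(s, a) \;\leftarrow\; (1 - \alpha)\, Q(s, a) + \alpha \big(\, r + \gamma \max_{a'} Q(s', a') \,\big)

which is equivalent to nudging the old estimate towards the new sampled target: Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big).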
Here alpha (\alpha) is called the learning rate: you learn from the new noisy sample slowly, blending it with the old Q-value to keep the estimate reliable. The equation above is called the update rule.
Q-Learning Algorithm steps
So, now we know almost everything. The remaining steps (sketched in code right after this list) are:
- Initialize Q-values with zeros
- Collect a state, an action, the immediate reward produced by that action, and the next state sampled from the decision process (this can be obtained from a single environment step).
- Plug this trajectory into the update rule above, using the maximum Q-value over the next actions as the target, and repeat.
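As an illustration, here is a minimal tabular Q-learning sketch. It assumes a small Gym-style environment with discrete states and actions; names such as `env`, `n_states`, and `n_actions` are placeholders for illustration, not something from the original post.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning on a hypothetical Gym-style environment with discrete states/actions."""
    Q = np.zeros((n_states, n_actions))                  # step 1: initialize Q-values with zeros
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice (explained in the next section)
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)    # explore: random action
            else:
                action = int(np.argmax(Q[state]))        # exploit: best known action
            next_state, reward, done, _ = env.step(action)   # step 2: sample one trajectory
            # step 3: temporal-difference update using the max over next actions
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
            state = next_state
    return Q
```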
Choosing actions
But how do we choose our actions? The answer is the classic trade-off of "exploration vs exploitation". Q-Learning typically uses the epsilon-greedy technique: generate a random number between 0 and 1 and, depending on whether it falls below epsilon, either explore (take a random action) or exploit (take the best action learnt so far). Because the agent follows this exploratory policy while the update rule bootstraps from the greedy (maximum) action, Q-Learning is an Off-Policy algorithm, and we will get to that later.
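For reuse in the SARSA sketch later, the same epsilon-greedy choice can be written as a small helper (the name `select_action` is my own, not from the original post):

```python
import numpy as np

def select_action(Q, state, n_actions, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # exploration: a random action
    return int(np.argmax(Q[state]))           # exploitation: the best learnt action
```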
Let's take a look at SARSA.
SARSA
Definition
From its name, it is clear that SARSA uses the state, action, reward, next state, and next action. That is the main difference from Q-Learning: instead of maximizing over the next actions, SARSA uses the Q-value of the next action actually chosen by the exploration policy (which, in expectation, averages over the policy's action distribution). So basically, all you have to add is one more step, to obtain the next action before updating.
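In the same notation as the Q-Learning update, the SARSA update rule reads:

Q(s, a) \;\leftarrow\; (1 - \alpha)\, Q(s, a) + \alpha \big(\, r + \gamma\, Q(s', a') \,\big)

where a' is the next action the agent actually takes from s'.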
SARSA Algorithm steps
The procedure is quite similar to the Q-Learning algorithm: we simply remember the next action that the agent actually takes, and use it to perform the update.
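A rough sketch under the same assumptions as the Q-learning code above (hypothetical Gym-style environment, and the `select_action` helper from the previous section):

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: like Q-learning, but the update uses the next action actually taken."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        action = select_action(Q, state, n_actions, epsilon)   # choose the first action
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # pick the next action now, and use *its* Q-value in the target
            next_action = select_action(Q, next_state, n_actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
            state, action = next_state, next_action            # remember the committed action
    return Q
```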
Of course, using a single sampled next action adds noise, especially when the action space is large. This can be mitigated with Expected SARSA, which basically takes the average of the Q-values of the possible next actions, weighted by the probabilities the policy assigns to them.
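Concretely, the Expected SARSA update replaces the sampled Q(s', a') with its expectation under the policy \pi:

Q(s, a) \;\leftarrow\; (1 - \alpha)\, Q(s, a) + \alpha \Big(\, r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s', a') \,\Big)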
So, the question is what is the difference between them?
Differences between Q-Learning and SARSA
Actually, if you look at the Q-Learning algorithm, you will realize that it computes the shortest path without checking whether the actions along it are safe to take, while SARSA discounts the value of risky actions, which lets it discover a safer path.
Consider the classic cliff-world example, where the agent has to walk from a start cell to a goal cell along the edge of a cliff, and stepping into the cliff incurs a large negative reward.
Q-Learning would take the path right along the cliff edge (the red arrow), because that path is optimal if we ignore the fact that exploratory actions occasionally push the agent over the edge.
SARSA would take the path of the green arrow: the occasional exploratory falls lower the Q-values of the actions along the red path, which pushes it towards a safer route farther from the edge.
Looking at the mean reward over time during training, SARSA clearly achieves a higher mean reward and a better policy while exploration is on, because Q-Learning keeps taking the shorter path, which sometimes leads to falling off the cliff and a large negative reward.
On-policy vs Off-policy
One more difference is that Q-Learning is off-policy, while SARSA is on-policy. Put simply, off-policy means the agent learns about one policy while behaving according to another: in Q-Learning we follow an epsilon-greedy policy during training, but the update targets the greedy policy, and at test time we act from the Q-table only. On-policy means the agent always follows and improves its own policy, i.e. the same strategy is used for both training and testing. SARSA, for example, assumes its experience comes from the agent itself and improves that policy online as the agent interacts: the agent plays, improves its policy, and plays again. In other words, it estimates the return for state-action pairs assuming the current policy continues to be followed.
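For example, a trained Q-Learning agent is typically evaluated by acting greedily with respect to its Q-table, with exploration switched off entirely (the function name here is just an illustrative placeholder):

```python
import numpy as np

def greedy_action(Q, state):
    """Test-time behaviour of an off-policy (Q-Learning) agent: pure exploitation."""
    return int(np.argmax(Q[state]))   # no epsilon, just the best action in the Q-table
```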
Summary
To recap, here are some key points to remember.
Q-Learning
- Takes the maximum Q-value over next actions in its update target.
- Computes the shortest path which might not be safe.
- Lower mean reward during training (while exploring).
- Off-policy.
SARSA
- Uses the Q-value of the next action actually taken (in expectation, an average over the policy's action distribution).
- Computes a safer path (Can be the shortest if it is safe).
- Higher mean reward during training.
- On-policy.
In conclusion, Q-Learning and SARSA are both algorithms that solve the Bellman equation to compute a policy. So whenever you want to solve an RL problem, you should check which one you need, keeping in mind your goal and the differences between them.
References
- Practical Reinforcement Learning course by HSE on Coursera.org.
- Article on Reinforcement Learning algorithms.
- My implementation of Q-Learning and SARSA on the OpenAI Gym cliff-world environment, comparing the two.