Q-Learning vs SARSA (Model-Free RL Algorithms)
This article defines Q-Learning and SARSA and then compares them. First, we should understand what these algorithms are and the intuition behind them.
By looking at their update equations (which we will see later), we will find that both aim to solve the famous Bellman equation, which gives us an action-value function (called the Q-function) that corresponds to the utility of taking a specific action in a given state, in terms of the expected future payoff.
Q-Learning
Definition
Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given MDP (Markov decision process). In a nutshell, it maps states to actions by computing the value of each action in each state. And this is the Q-learning equation:
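In its standard form it is the Bellman optimality equation for the action-value function:

Q(s, a) \;=\; \mathbb{E}_{s'} \big[\, r(s, a) + \gamma \max_{a'} Q(s', a') \,\big]

where the expectation is taken over the next state s' that the environment produces after taking action a in state s.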
As we can see, there is an expectation term, which appears because of the randomness of the environment: for a given state and action, we don't know which next state (and reward) we will end up in. In principle this expectation could be computed by integrating (or summing) over all possible next states, but that requires the transition probabilities, which we don't have in the model-free setting. So we have to use a trick to approximate this expectation.
Solving the Expectation
As usual, when we can't compute something exactly, we approximate it, which can be done by sampling trajectories. This means we take, say, 5 trajectories (a trajectory here is a tuple of state, action, reward, and next state) and average over them. Often the only option is to take a single trajectory, as in a self-driving car: you have to wait for the car to take its action and observe what happens next, and that gives you the one sample you can use to estimate the expectation. But a single sample is very noisy, so we need the temporal difference method, which uses an exponentially weighted moving average to learn from one sample at a time, as follows:
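In its standard form, this temporal-difference update rule is:

Q(s, a) \;\leftarrow\; (1 - \alpha)\, Q(s, a) + \alpha \big(\, r + \gamma \max_{a'} Q(s', a') \,\big)

which is equivalent to nudging the old estimate towards the new sampled target: Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big).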
Here alpha (\alpha) is called the learning rate: you learn from the new noisy sample slowly, blending it with the old Q-value to keep the estimate reliable. The equation above is called the update rule.
Q-Learning Algorithm steps
So, now we know almost everything. The remaining steps (sketched in code right after this list) are:
- Initialize Q-values with zeros
- Collect a state, an action, the immediate reward produced by that action, and the next state sampled from the decision process (this can be obtained from a single environment step).
- Plug this trajectory into the update rule above, using the maximum Q-value over the next actions as the target, and repeat.
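As an illustration, here is a minimal tabular Q-learning sketch. It assumes a small Gym-style environment with discrete states and actions; names such as `env`, `n_states`, and `n_actions` are placeholders for illustration, not something from the original post.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning on a hypothetical Gym-style environment with discrete states/actions."""
    Q = np.zeros((n_states, n_actions))                  # step 1: initialize Q-values with zeros
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice (explained in the next section)
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)    # explore: random action
            else:
                action = int(np.argmax(Q[state]))        # exploit: best known action
            next_state, reward, done, _ = env.step(action)   # step 2: sample one trajectory
            # step 3: temporal-difference update using the max over next actions
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
            state = next_state
    return Q
```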
Choosing actions
But how do we choose our actions? The answer is the classic trade-off of "exploration vs exploitation". Q-Learning typically uses the epsilon-greedy technique: generate a random number between 0 and 1 and, depending on whether it falls below epsilon, either explore (take a random action) or exploit (take the best action learnt so far). Because the agent follows this exploratory policy while the update rule bootstraps from the greedy (maximum) action, Q-Learning is an Off-Policy algorithm, and we will get to that later.
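For reuse in the SARSA sketch later, the same epsilon-greedy choice can be written as a small helper (the name `select_action` is my own, not from the original post):

```python
import numpy as np

def select_action(Q, state, n_actions, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # exploration: a random action
    return int(np.argmax(Q[state]))           # exploitation: the best learnt action
```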
Let's take a look at SARSA.
SARSA
Definition
From its name, it is clear that SARSA uses the state, action, reward, next state, and next action. That is the main difference from Q-Learning: instead of maximizing over the next actions, SARSA uses the Q-value of the next action actually chosen by the exploration policy (which, in expectation, averages over the policy's action distribution). So basically, all you have to add is one more step, to obtain the next action before updating.
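In the same notation as the Q-Learning update, the SARSA update rule reads:

Q(s, a) \;\leftarrow\; (1 - \alpha)\, Q(s, a) + \alpha \big(\, r + \gamma\, Q(s', a') \,\big)

where a' is the next action the agent actually takes from s'.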
SARSA Algorithm steps
The procedure is quite similar to the Q-Learning algorithm: we simply remember the next action that the agent actually takes, and use it to perform the update.
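A rough sketch under the same assumptions as the Q-learning code above (hypothetical Gym-style environment, and the `select_action` helper from the previous section):

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: like Q-learning, but the update uses the next action actually taken."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        action = select_action(Q, state, n_actions, epsilon)   # choose the first action
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            # pick the next action now, and use *its* Q-value in the target
            next_action = select_action(Q, next_state, n_actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * target
            state, action = next_state, next_action            # remember the committed action
    return Q
```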
Of course, using a single sampled next action adds noise, especially when the action space is large. This can be mitigated with Expected SARSA, which basically takes the average of the Q-values of the possible next actions, weighted by the probabilities the policy assigns to them.
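Concretely, the Expected SARSA update replaces the sampled Q(s', a') with its expectation under the policy \pi:

Q(s, a) \;\leftarrow\; (1 - \alpha)\, Q(s, a) + \alpha \Big(\, r + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s', a') \,\Big)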
So, the question is what is the difference between them?
Differences between Q-Learning and SARSA
Actually, if you look at the Q-Learning algorithm, you will realize that it computes the shortest path without checking whether the actions along it are safe to take, while SARSA discounts the value of risky actions, which lets it discover a safer path.
Consider the classic cliff-world example, where the agent has to walk from a start cell to a goal cell along the edge of a cliff, and stepping into the cliff incurs a large negative reward.
Q-Learning would take the path right along the cliff edge (the red arrow), because that path is optimal if we ignore the fact that exploratory actions occasionally push the agent over the edge.
SARSA would take the path of the green arrow: the occasional exploratory falls lower the Q-values of the actions along the red path, which pushes it towards a safer route farther from the edge.
Looking at the mean reward over time during training, SARSA clearly achieves a higher mean reward and a better policy while exploration is on, because Q-Learning keeps taking the shorter path, which sometimes leads to falling off the cliff and a large negative reward.
On-policy vs Off-policy
One more difference is that Q-Learning is off-policy, while SARSA is on-policy. Put simply, off-policy means the agent learns about one policy while behaving according to another: in Q-Learning we follow an epsilon-greedy policy during training, but the update targets the greedy policy, and at test time we act from the Q-table only. On-policy means the agent always follows and improves its own policy, i.e. the same strategy is used for both training and testing. SARSA, for example, assumes its experience comes from the agent itself and improves that policy online as the agent interacts: the agent plays, improves its policy, and plays again. In other words, it estimates the return for state-action pairs assuming the current policy continues to be followed.
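For example, a trained Q-Learning agent is typically evaluated by acting greedily with respect to its Q-table, with exploration switched off entirely (the function name here is just an illustrative placeholder):

```python
import numpy as np

def greedy_action(Q, state):
    """Test-time behaviour of an off-policy (Q-Learning) agent: pure exploitation."""
    return int(np.argmax(Q[state]))   # no epsilon, just the best action in the Q-table
```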
Summary
To recap, here are some key points to remember.
Q-Learning
- Takes the maximum Q-value over next actions in its update target.
- Computes the shortest path which might not be safe.
- Lower mean reward during training (while exploring).
- Off-policy.
SARSA
- Uses the Q-value of the next action actually taken (in expectation, an average over the policy's action distribution).
- Computes a safer path (Can be the shortest if it is safe).
- Higher mean reward during training.
- On-policy.
In conclusion, Q-Learning and SARSA are both algorithms that solve the Bellman equation to compute a policy. So whenever you want to solve an RL problem, you should check which one you need, keeping in mind your goal and the differences between them.
References
- Practical Reinforcement Learning course by HSE on Coursera.org.
- Article on Reinforcement Learning algorithms.
- My implementation of Q-Learning and SARSA on the OpenAI Gym cliff-world environment, comparing the two.