Reinforcement Learning Approaches for beginners

RL algorithms have evolved through a series of continuous improvements, from Q-learning to SARSA to Deep Q-Networks (DQN) to DDPG. Q-learning is an off-policy algorithm but lacks generality; SARSA is on-policy but also lacks generality. To address the generality issue, a deep neural network, better known as a DQN, is used to estimate the Q-value, so it can produce Q-values even for unseen states. It works well when the action space is small and discrete.

Imagine you are building a game player (an agent) that can make the best decision in every situation (state).

The agent needs to choose the best action in each state so as to maximize the reward by the end of the game. In short, the goal of the agent is to learn the best policy, the one that maximizes the total reward received from the environment.

“When the agent is in some state, what is the best action to take?”

The answer lies in the Q-table.

Q-learning is all about building a good Q-table over states and actions. Using the Q-value update formula, we can compute the Q-value of a given state and action, given the discount factor and the reward scheme. It learns iteratively. Its drawback is that it cannot produce Q-values for unseen states, and maintaining the table becomes cumbersome as the number of possible states or actions grows.
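As a concrete illustration, here is a minimal sketch of one tabular Q-learning step in Python. The environment interface (env.step(action) returning next_state, reward, done) and all sizes and hyperparameters below are assumptions chosen only for illustration, not part of the original article.

    import numpy as np

    # Illustrative sizes and hyperparameters (assumed, not from the article).
    n_states, n_actions = 16, 4
    alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

    Q = np.zeros((n_states, n_actions))      # the Q-table: rows = states, columns = actions

    def q_learning_step(env, state):
        """One Q-learning update using an epsilon-greedy behaviour policy."""
        # Explore with probability epsilon, otherwise exploit the current table.
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, done = env.step(action)   # assumed Gym-like environment interface

        # Q-learning target uses the best next action (this is what makes it off-policy).
        target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        return next_state, done

Looping this step over many episodes is what "learning in an iterative way" means in practice: the table entries gradually converge toward the true Q-values.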


This Q-table is generally updated throughout the agent's lifetime, so an action that once looked best may no longer look so great after the agent has gained some experience.

Rows are states and columns are actions. For each state-action combination, the algorithm eventually arrives at a Q-value after many iterations.
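For example, with four states and two actions the table is just a 4x2 array; the numbers below are made up purely for illustration.

    import numpy as np

    # Rows = states, columns = actions; values are illustrative only.
    Q = np.array([[0.1, 0.5],
                  [0.7, 0.2],
                  [0.0, 0.0],    # a state the agent has never visited or updated
                  [0.3, 0.9]])

    best_action_in_state_1 = int(np.argmax(Q[1]))   # -> 0, the action with the highest Q-value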

For unseen states, the Q-table may not give a good suggestion.

Note that in Q-learning, the agent does not know the state transition probabilities or the reward function. It only learns that a certain reward follows from moving from one state to another via a given action. The value-iteration method, by contrast, relies on known state transition probabilities for the available actions.

===========================================================

  1. Value (V): The expected long-term return with discounting (not the short-term reward R). Vπ(s), the value of state s under policy π, is the expected long-term return starting from state s and following π.
  2. Q-value (Q): Also known as the action-value. The Q-value depends on the action a as well as the state s. Qπ(s, a) is the expected long-term return starting from state s, taking action a, and thereafter following policy π (both definitions are written out formally just after this list).
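In standard notation (these are the textbook definitions, added here for reference rather than taken from the original article), with r the reward and γ the discount factor:

    V_\pi(s)    = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \;\middle|\; s_t = s\right]

    Q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a\right]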


Difference between Q-learning and Value Iteration

With value iteration, the agent learns the expected return of being in a state x; with Q-learning, the agent learns the expected discounted return of being in a state x and applying an action a.
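To make the contrast concrete, here is a minimal value-iteration sketch. Note that it needs the transition probabilities P[s][a] and expected rewards R[s][a] up front, which Q-learning does not; the tabular model format below is an assumption made only for this illustration.

    import numpy as np

    def value_iteration(P, R, gamma=0.99, tol=1e-6):
        """P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
        expected reward. Requires a known model of the environment, unlike Q-learning."""
        n_states = len(P)
        n_actions = len(P[0])
        V = np.zeros(n_states)
        while True:
            V_new = np.empty(n_states)
            for s in range(n_states):
                # Bellman optimality backup: best action under the known model.
                V_new[s] = max(
                    R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in range(n_actions)
                )
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new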

============================================================

To overcome the issue of generality, neural-network-based Q-value estimation was introduced. It is called Deep Q-Networks (DQN).

DQN can estimate Q-values for unseen states as well, because it learns with a neural network as the function approximator. DQN has since been improved with ideas such as Double DQN, dueling DQN, and prioritized experience replay.
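Here is a minimal sketch of two key DQN ingredients, a neural Q-function and a target network used to form the TD target. PyTorch is my choice of framework here, and the layer sizes and discount factor are illustrative assumptions, not values from the article.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one Q-value per discrete action."""
        def __init__(self, state_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, state):
            return self.net(state)

    def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
        """DQN target: y = r + gamma * max_a' Q_target(s', a'), with y = r at terminal states."""
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q

The online network is then trained to regress its Q(s, a) toward these targets on minibatches sampled from a replay buffer, while the target network is only periodically synchronized with it.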

The Deep Deterministic Policy Gradient (DDPG) algorithm borrows the ideas of experience replay and a separate target network from DQN. It performs especially well in continuous environments with large action spaces. Adding noise in the parameter space or the action space boosts DDPG's exploration.
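For instance, noise on the action space can be as simple as adding Gaussian noise to the actor's output and clipping to the valid action range (a common choice; the original DDPG paper used Ornstein-Uhlenbeck noise, and the bounds below are assumed for illustration).

    import numpy as np

    def noisy_action(actor_action, noise_std=0.1, low=-1.0, high=1.0):
        """Add Gaussian exploration noise to a continuous action and clip to valid bounds."""
        noise = np.random.normal(0.0, noise_std, size=np.shape(actor_action))
        return np.clip(actor_action + noise, low, high)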

DDPG has a basic actor-critic architecture: the actor tunes the parameters of the policy function, which decides the best action for a given state, and the critic evaluates the policy estimated by the actor using the temporal-difference (TD) error. DDPG suffers from convergence problems, in particular the step-size issue. This motivated newer ideas such as TRPO and PPO, where the update of the policy parameters is handled far more carefully.
A concept called the advantage is also introduced.
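Sketching the core of that actor-critic loop in PyTorch: the critic is pulled toward a bootstrapped target built from the target networks, and the actor is pushed to maximize the critic's score of its own actions. The actor/critic call signatures and the target networks are assumptions for this illustration, not a full implementation.

    import torch
    import torch.nn.functional as F

    def ddpg_losses(actor, critic, actor_target, critic_target,
                    states, actions, rewards, next_states, dones, gamma=0.99):
        # Critic target: y = r + gamma * Q'(s', mu'(s')), computed with the target networks.
        with torch.no_grad():
            next_actions = actor_target(next_states)
            y = rewards + gamma * (1.0 - dones) * critic_target(next_states, next_actions)

        # Critic minimizes the TD error between Q(s, a) and the target y.
        critic_loss = F.mse_loss(critic(states, actions), y)

        # Actor maximizes Q(s, mu(s)), i.e. minimizes its negative.
        actor_loss = -critic(states, actor(states)).mean()
        return critic_loss, actor_loss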


Note that the expected Q-value of a state under the policy is called its Value. Since there are many possible actions in a given state, we need an indicator, known as the advantage, that can differentiate between actions: the advantage is the Q-value of an action (in a given state) minus the Value of that state. It measures how much better a particular action is than the policy's average behaviour, and in TRPO/PPO it is used to judge how good the new policy is relative to the old one.
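In practice the Q-value is usually not computed directly. A common one-step estimate of the advantage from a learned value function is shown below (my illustration; many PPO implementations use the more elaborate generalized advantage estimation instead).

    import numpy as np

    def one_step_advantages(rewards, values, next_values, dones, gamma=0.99):
        """A(s, a) ~= r + gamma * V(s') - V(s): how much better this action turned out
        than what the critic expected from the state on average."""
        rewards = np.asarray(rewards, dtype=np.float64)
        not_done = 1.0 - np.asarray(dones, dtype=np.float64)
        return rewards + gamma * not_done * np.asarray(next_values) - np.asarray(values)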

TRPO (Trust Region Policy Optimization) tackles the main problem with DDPG: the lack of monotonic improvement in performance. It does so using the concept of a trust region: we maximize the expected surrogate objective subject to a KL-divergence constraint, so that the policy parameters cannot change too much in a single update. TRPO's downside is its extremely complicated computation and implementation, owing to the KL divergence and its second-order derivatives. A conjugate-gradient algorithm is used in TRPO to avoid forming the second-order derivatives explicitly, but it complicates the overall implementation.
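Written out, the trust-region problem is roughly the following (standard formulation, not quoted from the article), where \hat{A}_t is the estimated advantage and \delta the trust-region size:

    \max_{\theta} \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \, \hat{A}_t\right]
    \quad \text{subject to} \quad
    \mathbb{E}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta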


This complexity problem is solved by PPO (Proximal Policy Optimization), which uses a clipped surrogate objective function. It modifies TRPO's objective by penalizing overly large policy updates and removing the costly constraint. In short, PPO improves both performance and ease of implementation.
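A minimal sketch of that clipped surrogate loss in PyTorch follows; the clipping range 0.2 is the commonly used default, and the per-sample input tensors are assumptions for this illustration.

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        """L_CLIP = -E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],
        where r_t is the probability ratio between the new and old policies."""
        ratio = torch.exp(log_probs_new - log_probs_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Taking the elementwise minimum removes the incentive for too large an update.
        return -torch.min(unclipped, clipped).mean()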


Q-learning, DQN, and DDPG are model-free, off-policy algorithms, while TRPO and PPO are model-free but on-policy. Model-free means the agent does not estimate a model of the environment (its transition probabilities and rewards); knowledge is updated through trial and error. SARSA is model-free and on-policy, since it learns values based on the action its current policy actually takes.
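The on-policy/off-policy distinction is easiest to see in the two tabular update rules (standard textbook form): Q-learning bootstraps from the best next action, while SARSA bootstraps from the next action the current policy actually chose.

    \text{Q-learning (off-policy):}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]

    \text{SARSA (on-policy):}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]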


Upcoming articles will cover Advantage Actor-Critic and its improved version, Asynchronous Advantage Actor-Critic, in addition to simulation, RL code, a game player, and RL in NLP and robotics.
