Reinforcement Learning Approaches for beginners

RL algorithms have evolved through a series of continuous improvements, from Q-learning to SARSA to Deep Q-Networks (DQN) to DDPG. Q-learning is an off-policy algorithm but lacks generality; SARSA is on-policy but also lacks generality. To address the generality issue, a deep neural network, better known as a DQN, is used to estimate the Q-value, so it can produce Q-values even for unseen states. It works well when the action space is small and discrete.

Imagine you are building a game player (an agent) that can make the best decision in every situation (state).

The agent needs to choose the best action in each state so as to maximize the reward by the end of the game. In short, the goal of the agent is to learn the best policy, the one that maximizes the total reward received from the environment.

“When the agent is in some state, what is the best action to take?”

The answer lies in the Q-table.

Q-learning is all about building a good Q-table over states and actions. Using the Q-value update formula, we can compute the Q-value of a given state and action, given the discount factor and the reward scheme. It learns iteratively. Its drawback is that it cannot produce Q-values for unseen states, and maintaining the table becomes cumbersome as the number of possible states or actions grows.
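As a concrete illustration, here is a minimal sketch of one tabular Q-learning step in Python. The environment interface (env.step(action) returning next_state, reward, done) and all sizes and hyperparameters below are assumptions chosen only for illustration, not part of the original article.

    import numpy as np

    # Illustrative sizes and hyperparameters (assumed, not from the article).
    n_states, n_actions = 16, 4
    alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

    Q = np.zeros((n_states, n_actions))      # the Q-table: rows = states, columns = actions

    def q_learning_step(env, state):
        """One Q-learning update using an epsilon-greedy behaviour policy."""
        # Explore with probability epsilon, otherwise exploit the current table.
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, done = env.step(action)   # assumed Gym-like environment interface

        # Q-learning target uses the best next action (this is what makes it off-policy).
        target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        return next_state, done

Looping this step over many episodes is what "learning in an iterative way" means in practice: the table entries gradually converge toward the true Q-values.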


This Q-table is generally updated throughout the agent's lifetime, so an action that once looked best may no longer look so great after the agent has gained some experience.

Rows are states and columns are actions. For each state-action combination, the algorithm eventually arrives at a Q-value after many iterations.
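For example, with four states and two actions the table is just a 4x2 array; the numbers below are made up purely for illustration.

    import numpy as np

    # Rows = states, columns = actions; values are illustrative only.
    Q = np.array([[0.1, 0.5],
                  [0.7, 0.2],
                  [0.0, 0.0],    # a state the agent has never visited or updated
                  [0.3, 0.9]])

    best_action_in_state_1 = int(np.argmax(Q[1]))   # -> 0, the action with the highest Q-value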

For unseen states, the Q-table may not give a good suggestion.

Note that in Q-learning, the agent does not know the state transition probabilities or the reward function. It only learns that a certain reward follows from moving from one state to another via a given action. The value-iteration method, by contrast, relies on known state transition probabilities for the available actions.

===========================================================

  1. Value (V): The expected long-term return with discounting (not the short-term reward R). Vπ(s), the value of state s under policy π, is the expected long-term return starting from state s and following π.
  2. Q-value (Q): Also known as the action-value. The Q-value depends on the action a as well as the state s. Qπ(s, a) is the expected long-term return starting from state s, taking action a, and thereafter following policy π (both definitions are written out formally just after this list).
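In standard notation (these are the textbook definitions, added here for reference rather than taken from the original article), with r the reward and γ the discount factor:

    V_\pi(s)    = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \;\middle|\; s_t = s\right]

    Q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1} \;\middle|\; s_t = s,\; a_t = a\right]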


Difference between Q-learning and Value Iteration

With value iteration, the agent learns the expected return of being in a state x; with Q-learning, the agent learns the expected discounted return of being in a state x and applying an action a.
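To make the contrast concrete, here is a minimal value-iteration sketch. Note that it needs the transition probabilities P[s][a] and expected rewards R[s][a] up front, which Q-learning does not; the tabular model format below is an assumption made only for this illustration.

    import numpy as np

    def value_iteration(P, R, gamma=0.99, tol=1e-6):
        """P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
        expected reward. Requires a known model of the environment, unlike Q-learning."""
        n_states = len(P)
        n_actions = len(P[0])
        V = np.zeros(n_states)
        while True:
            V_new = np.empty(n_states)
            for s in range(n_states):
                # Bellman optimality backup: best action under the known model.
                V_new[s] = max(
                    R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in range(n_actions)
                )
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new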

============================================================

To overcome the issue of generality, neural-network-based Q-value estimation was introduced. It is called Deep Q-Networks (DQN).

DQN can estimate Q-values for unseen states as well, because it learns with a neural network as the function approximator. DQN has since been improved with ideas such as Double DQN, dueling DQN, and prioritized experience replay.
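Here is a minimal sketch of two key DQN ingredients, a neural Q-function and a target network used to form the TD target. PyTorch is my choice of framework here, and the layer sizes and discount factor are illustrative assumptions, not values from the article.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one Q-value per discrete action."""
        def __init__(self, state_dim, n_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, n_actions),
            )

        def forward(self, state):
            return self.net(state)

    def td_targets(target_net, rewards, next_states, dones, gamma=0.99):
        """DQN target: y = r + gamma * max_a' Q_target(s', a'), with y = r at terminal states."""
        with torch.no_grad():
            next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * next_q

The online network is then trained to regress its Q(s, a) toward these targets on minibatches sampled from a replay buffer, while the target network is only periodically synchronized with it.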

The Deep Deterministic Policy Gradient (DDPG) algorithm borrows the ideas of experience replay and a separate target network from DQN. It performs especially well in continuous environments with large action spaces. Adding noise in the parameter space or the action space boosts DDPG's exploration.
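For instance, noise on the action space can be as simple as adding Gaussian noise to the actor's output and clipping to the valid action range (a common choice; the original DDPG paper used Ornstein-Uhlenbeck noise, and the bounds below are assumed for illustration).

    import numpy as np

    def noisy_action(actor_action, noise_std=0.1, low=-1.0, high=1.0):
        """Add Gaussian exploration noise to a continuous action and clip to valid bounds."""
        noise = np.random.normal(0.0, noise_std, size=np.shape(actor_action))
        return np.clip(actor_action + noise, low, high)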

DDPG has a basic actor-critic architecture: the actor tunes the parameters of the policy function, which decides the best action for a given state, and the critic evaluates the policy estimated by the actor using the temporal-difference (TD) error. DDPG suffers from convergence problems, in particular the step-size issue. This motivated newer ideas such as TRPO and PPO, where the update of the policy parameters is handled far more carefully.
A concept called the advantage is also introduced.
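Sketching the core of that actor-critic loop in PyTorch: the critic is pulled toward a bootstrapped target built from the target networks, and the actor is pushed to maximize the critic's score of its own actions. The actor/critic call signatures and the target networks are assumptions for this illustration, not a full implementation.

    import torch
    import torch.nn.functional as F

    def ddpg_losses(actor, critic, actor_target, critic_target,
                    states, actions, rewards, next_states, dones, gamma=0.99):
        # Critic target: y = r + gamma * Q'(s', mu'(s')), computed with the target networks.
        with torch.no_grad():
            next_actions = actor_target(next_states)
            y = rewards + gamma * (1.0 - dones) * critic_target(next_states, next_actions)

        # Critic minimizes the TD error between Q(s, a) and the target y.
        critic_loss = F.mse_loss(critic(states, actions), y)

        # Actor maximizes Q(s, mu(s)), i.e. minimizes its negative.
        actor_loss = -critic(states, actor(states)).mean()
        return critic_loss, actor_loss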


Note that the expected Q-value of a state under the policy is called its Value. Since there are many possible actions in a given state, we need an indicator, known as the advantage, that can differentiate between actions: the advantage is the Q-value of an action (in a given state) minus the Value of that state. It measures how much better a particular action is than the policy's average behaviour, and in TRPO/PPO it is used to judge how good the new policy is relative to the old one.
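In practice the Q-value is usually not computed directly. A common one-step estimate of the advantage from a learned value function is shown below (my illustration; many PPO implementations use the more elaborate generalized advantage estimation instead).

    import numpy as np

    def one_step_advantages(rewards, values, next_values, dones, gamma=0.99):
        """A(s, a) ~= r + gamma * V(s') - V(s): how much better this action turned out
        than what the critic expected from the state on average."""
        rewards = np.asarray(rewards, dtype=np.float64)
        not_done = 1.0 - np.asarray(dones, dtype=np.float64)
        return rewards + gamma * not_done * np.asarray(next_values) - np.asarray(values)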

TRPO (Trust Region Policy Optimization) tackles the main problem with DDPG: the lack of monotonic improvement in performance. It does so using the concept of a trust region: we maximize the expected surrogate objective subject to a KL-divergence constraint, so that the policy parameters cannot change too much in a single update. TRPO's downside is its extremely complicated computation and implementation, owing to the KL divergence and its second-order derivatives. A conjugate-gradient algorithm is used in TRPO to avoid forming the second-order derivatives explicitly, but it complicates the overall implementation.
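Written out, the trust-region problem is roughly the following (standard formulation, not quoted from the article), where \hat{A}_t is the estimated advantage and \delta the trust-region size:

    \max_{\theta} \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \, \hat{A}_t\right]
    \quad \text{subject to} \quad
    \mathbb{E}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta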


This complexity problem is solved by PPO (Proximal Policy Optimization), which uses a clipped surrogate objective function. It modifies TRPO's objective by penalizing overly large policy updates and removing the costly constraint. In short, PPO improves both performance and ease of implementation.
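A minimal sketch of that clipped surrogate loss in PyTorch follows; the clipping range 0.2 is the commonly used default, and the per-sample input tensors are assumptions for this illustration.

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        """L_CLIP = -E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],
        where r_t is the probability ratio between the new and old policies."""
        ratio = torch.exp(log_probs_new - log_probs_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Taking the elementwise minimum removes the incentive for too large an update.
        return -torch.min(unclipped, clipped).mean()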


Q-learning, DQN, and DDPG are model-free, off-policy algorithms, while TRPO and PPO are model-free but on-policy. Model-free means the agent does not estimate a model of the environment (its transition probabilities and rewards); knowledge is updated through trial and error. SARSA is model-free and on-policy, since it learns values based on the action its current policy actually takes.
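The on-policy/off-policy distinction is easiest to see in the two tabular update rules (standard textbook form): Q-learning bootstraps from the best next action, while SARSA bootstraps from the next action the current policy actually chose.

    \text{Q-learning (off-policy):}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]

    \text{SARSA (on-policy):}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right]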


Upcoming articles will cover Advantage Actor-Critic and its improved version, Asynchronous Advantage Actor-Critic, in addition to simulation, RL code, a game player, and RL in NLP and robotics.
