Introduction to Reinforcement Learning
Shailendra Singh Kathait
Co-Founder & Chief Data Scientist @ Valiance
Machine Learning can be broadly classified into 3 categories:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised learning is a type of learning in which the target variable is known and this information is explicitly used during training (hence "supervised"), that is, the model is trained under the supervision of a teacher (the target). For example, if we want to build a classification model for handwritten digits, the input will be the set of images (training data) and the target variable will be the labels assigned to these images, i.e., their classes from 0 to 9.
Unsupervised learning is a type of learning algorithm that is used to draw inferences from datasets consisting of input data without knowing the target. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.
Reinforcement learning is a type of learning algorithm in which the machine decides what actions to take, given a certain situation/environment, so as to maximize a reward. The key difference between supervised and reinforcement learning is the reward signal, which simply indicates whether the action taken by the agent was good or bad; it does not say what the best action would have been. In this type of learning, we have neither training data nor target variables.
Reinforcement Learning:
Reinforcement learning is a type of Machine Learning that is influenced by behaviorist psychology. It is concerned with how software agents ought to take action in an environment so as to maximize some notion of cumulative reward.
It is learning what to do, how to map situations to actions, so as to maximize a numerical reward signal. Unlike other learning methods, it does not make use of any training dataset to learn patterns. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the distinguishing features of reinforcement learning.
The reinforcement learning model consists of:
1. A set of environment and agent states S.
2. A set of actions A of the agent.
3. Rules of transitioning between states.
4. Rules that determine the scalar immediate reward of a transition.
5. Rules that describe what the agent observes.
A task is defined by a set of states, s∈S, a set of actions, a∈A, a state-action transition function, T: S×A→S, and a reward function, R: S×A→ℝ. At each time step, the learner (also called the agent) selects an action and, as a result, receives a reward and its new state. The goal of reinforcement learning is to learn a policy, a mapping from states to actions, π: S→A, that maximizes the sum of its rewards over time.
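To make that loop concrete, here is a minimal Python sketch of an agent following a fixed policy; the `env` object with `reset()` and `step()` methods is a hypothetical stand-in for whatever environment is being modeled, not something defined in this article:

```python
# Minimal agent-environment interaction loop (illustrative sketch only).
# Assumes env.reset() returns an initial state and env.step(action) returns
# (next_state, reward); policy is a dict mapping states to actions.

def run_episode(env, policy, max_steps=100):
    """Follow a fixed policy pi: S -> A and accumulate the reward over time."""
    state = env.reset()                   # initial state s in S
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy[state]            # pi(s): mapping from states to actions
        state, reward = env.step(action)  # environment yields new state and reward
        total_reward += reward
    return total_reward
```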
In machine learning, the environment is typically formulated as a Markov decision process (MDP), as many reinforcement learning algorithms for this setting utilize dynamic programming techniques.
Examples:
To gain more insight into reinforcement learning, let us consider some examples:
1. A master chess player makes a move. The choice is informed both by planning (anticipating possible replies and counter-replies) and by immediate, intuitive judgments of the desirability of particular positions and moves.
2. An adaptive controller adjusts parameters of a petroleum refinery’s operation in real time. The controller optimizes the yield/cost/quality trade-off on the basis of specified marginal costs without sticking strictly to the set points originally suggested by engineers.
3. A gazelle calf struggles to its feet minutes after being born. Half an hour later it is running at 20 miles per hour.
4. A self-driving car is a well-known example of reinforcement learning.
5. Playing tic-tac-toe against a computer that has been trained through reinforcement learning.
Elements of Reinforcement Learning:
Beyond the agent and the environment, a reinforcement learning system has four sub-elements:
1. Policy: It defines the learning agent’s way of behaving at a given time.
2. Reward function: It defines the goal of the reinforcement learning problem.
3. Value function: It specifies what is good in the long run.
4. Model of the environment (optional): Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced.
Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward.
Reinforcement learning is all about trying to understand the optimal way of making decisions/actions so that we maximize the reward R. This reward is a feedback signal that shows how well the agent is doing at a given time step. The action A that an agent takes at every time step is a function of both the reward and the state S, which is a description of the environment the agent is in. The mapping from environment states to actions is the policy π. The policy basically defines the agent's way of behaving at a certain time, given a certain situation. We also have a value function V, which is a measure of how good each state is. This differs from the reward in that the reward signal indicates what is good in the immediate sense, while the value function indicates how good it is to be in a state in the long run. Finally, we have a model M, which is the agent's representation of the environment, that is, the agent's model of how it thinks the environment is going to behave.
The whole reinforcement learning environment can be described with an MDP.
Markov Decision Process (MDP):
A Markov decision process (MDP) is a mathematical framework used to model decision making in situations where outcomes are partly random and partly under the control of a decision maker.
MDPs are useful for studying a wide range of optimization problems that can be solved by dynamic programming and reinforcement learning. An MDP consists of a finite set of states, value functions for those states, a finite set of actions, a policy, and a reward function.
Such an MDP can be illustrated by a diagram with three states (green circles) and two actions (orange circles), with two rewards (yellow arrows).
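For concreteness, here is a toy MDP of roughly that shape written as plain data structures; the state names, transitions, and reward values are invented purely for illustration and use the deterministic transition function T: S×A→S from the task definition above:

```python
# A hypothetical 3-state, 2-action MDP expressed as plain dictionaries (illustrative only).
# transitions[(state, action)] gives the next state; rewards[(state, action)] gives
# the immediate scalar reward (pairs not listed yield a reward of 0).

states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]

transitions = {
    ("s0", "a0"): "s1", ("s0", "a1"): "s2",
    ("s1", "a0"): "s0", ("s1", "a1"): "s2",
    ("s2", "a0"): "s2", ("s2", "a1"): "s0",
}

rewards = {
    ("s1", "a1"): 5.0,    # only two transitions carry a non-zero reward
    ("s2", "a1"): -1.0,
}
```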
The value function can be defined in terms of two functions:
1. State-value function V: The state-value function V is defined as the expected return from being in a state s and following a policy π, that is, the expected sum of the discounted rewards at each future time step, where γ (gamma) is a constant discount factor with a value between 0 and 1 (a small computational sketch of this return follows this list). It is represented by the following equation:
Vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
2. Action-value function Q: The value of taking action a in state s under a policy π is the expected return starting from that state, taking that action, and thereafter following π:
Qπ(s, a) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s, At = a]
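Both definitions rest on the same discounted return. As a small illustrative sketch (not from the article), the return for one observed sequence of rewards can be computed directly, and averaging such returns over many episodes starting from a state s gives a simple Monte Carlo estimate of Vπ(s):

```python
# Discounted return G = Rt+1 + gamma*Rt+2 + gamma^2*Rt+3 + ...
# computed from a finite list of observed rewards (illustrative sketch).

def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r   # each later reward is discounted by one more factor of gamma
    return g

# Averaging these returns over many episodes that start in state s and follow
# policy pi approximates the state value Vpi(s).
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```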
In practice, the action-value estimates can be learned incrementally with the update commonly known as the Q-learning equation:
Q(st, at) ← Q(st, at) + α[rt+1 + γ·maxa Q(st+1, a) − Q(st, at)]
where rt+1 is the reward observed after performing at in st, α (alpha) is the learning rate, and γ (gamma) is a number between 0 and 1 called the discount factor.
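A minimal sketch of that tabular update, assuming the Q-table is stored as a Python dictionary keyed by (state, action) pairs (an illustrative choice, not prescribed by the article), might look like this:

```python
# One tabular Q-learning update (illustrative sketch).
# Q maps (state, action) -> current value estimate; missing entries default to 0.

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)  # max over a' of Q(s', a')
    td_target = reward + gamma * best_next                         # r + gamma * max ...
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
    return Q
```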
By solving the MDP we obtain the optimal policy through the use of dynamic programming, specifically through policy iteration. The idea is that we take some initial policy π1 and evaluate the state-value function for that policy. We solve it by using the Bellman expectation equation, given as:
Vπ(s) = Eπ[Rt+1 + γ·Vπ(St+1) | St = s]
This equation says that the value function, given the policy π, can be decomposed into the expected sum of the immediate reward Rt+1 and the discounted value of the successor state St+1. This is equivalent to the value function definition used in the previous section. The policy evaluation step uses this equation. In order to get a better policy, we then use a policy improvement step, where we simply act greedily with respect to the value function; in other words, the agent takes the action that maximizes value.
Now, in order to reach the optimal policy, we repeat these two steps, policy evaluation and policy improvement, one after the other, until we converge to the optimal policy π.
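As a rough sketch of these alternating steps, assuming the small deterministic MDP dictionaries (`states`, `actions`, `transitions`, `rewards`) from the earlier example, a bare-bones policy iteration might look like this (full policy evaluation would sum over stochastic transition probabilities; the deterministic case is used here only to keep the sketch short):

```python
# Bare-bones policy iteration for the small deterministic MDP sketched earlier.

GAMMA = 0.9

def evaluate_policy(policy, sweeps=100):
    """Iterative policy evaluation: V(s) <- R(s, pi(s)) + gamma * V(T(s, pi(s)))."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            a = policy[s]
            V[s] = rewards.get((s, a), 0.0) + GAMMA * V[transitions[(s, a)]]
    return V

def improve_policy(V):
    """Greedy improvement: pick the action with the highest one-step lookahead value."""
    return {
        s: max(actions, key=lambda a: rewards.get((s, a), 0.0) + GAMMA * V[transitions[(s, a)]])
        for s in states
    }

policy = {s: actions[0] for s in states}   # arbitrary initial policy pi1
while True:
    V = evaluate_policy(policy)            # policy evaluation step
    new_policy = improve_policy(V)         # policy improvement step
    if new_policy == policy:               # no change: converged to the optimal policy
        break
    policy = new_policy
```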
Summary:
Reinforcement learning is a computational approach used to understand and automate goal-directed learning and decision-making. It is distinguished from other computational approaches by its emphasis on learning by an individual agent from direct interaction with its environment, without relying on a predefined labeled dataset. Reinforcement learning addresses the computational issues that arise when learning from interaction with the environment so as to achieve long-term goals.
RL uses a formal framework that defines the interaction between a learning agent and its environment in terms of states, actions, and rewards. The framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, a sense of uncertainty and non-determinism, and the existence of explicit goals.
References:
1. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
2. Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement Learning: A Survey." Journal of Artificial Intelligence Research 4 (1996): 237-285.
3. Deep Learning Research Review Week 2: Reinforcement Learning: https://adeshpande3.github.io/