Markov Decision Process

Definition

The Markov Decision Process (MDP) is a mathematical framework for decision-making in discrete, stochastic, sequential environments. Like a Markov chain (MC), an MDP predicts the next state based only on the information in the current state; in addition, it incorporates actions and rewards. At each time step, the decision maker, or agent, takes an action in the current state. In response, the environment transitions randomly to a new state and emits an immediate reward to the agent.


MDP Model

MDPs are commonly used to describe dynamical systems and to represent the environment in the Reinforcement Learning (RL) framework.

An MDP is a tuple ⟨S, A, P, R, γ⟩, where:

  • S: the set of states.
  • A: the set of actions.
  • P: the set of transition probabilities.
  • R: the set of immediate rewards associated with the state-action pairs.
  • γ, 0 ≤ γ ≤ 1: the discount factor.

The agent (decision maker) interacts continually with its environment by performing actions sequentially at each discrete time step. Each action can change the state of the environment, and in response the environment returns a numerical reward to the agent.
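To make the tuple concrete, here is a minimal Python sketch of an MDP and of the agent-environment loop; every state name, action, probability, and reward below is invented purely for illustration.

import random

# A toy MDP, sketched as plain Python data structures. The states, actions,
# probabilities, and rewards are invented purely for illustration.
states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9  # discount factor, 0 <= gamma <= 1

# P[(s, a)] maps each successor state s' to Pr[s' | s, a].
P = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.1, "s1": 0.9},
    ("s1", "a0"): {"s0": 0.4, "s1": 0.6},
    ("s1", "a1"): {"s0": 0.8, "s1": 0.2},
}

# R[(s, a)] is the immediate reward for taking action a in state s.
R = {
    ("s0", "a0"): 0.0, ("s0", "a1"): 1.0,
    ("s1", "a0"): 2.0, ("s1", "a1"): -1.0,
}

def step(state, action):
    """Environment dynamics: sample s' ~ P(. | s, a), return (s', reward)."""
    dist = P[(state, action)]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, R[(state, action)]

# The agent-environment loop: act, then observe the new state and the reward.
state = "s0"
for t in range(5):
    action = random.choice(actions)  # placeholder policy: act at random
    state, reward = step(state, action)
    print(f"t={t}: action={action}, next state={state}, reward={reward}")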

Transition Probability

A Markov Process is defined by (S, P), where S is the set of states and P is the state-transition probability. The process consists of a series of random states S₁, S₂, etc., in which every state obeys the Markov property.

The transition probabilities describe the dynamics of the MDP: for each action a, they give the probability of moving from any state s to any successor state s′. P is a set of |A| matrices, one per action, each of dimension |S| × |S|, whose (s, s′) entry reads

[Pᵃ]ₛₛ′ = Pr[Sₜ₊₁ = s′ | Sₜ = s, Aₜ = a].

One can verify that each row of these matrices sums to one.
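As a quick check, the sketch below builds one such matrix per action (with made-up values) and verifies the row-sum property.

import numpy as np

# One |S| x |S| matrix per action (values invented for illustration).
# Row s holds Pr[s' | s, a] for every successor s', so each row sums to 1.
P_a0 = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
P_a1 = np.array([[0.1, 0.9],
                 [0.8, 0.2]])

for name, P_a in [("a0", P_a0), ("a1", P_a1)]:
    assert np.allclose(P_a.sum(axis=1), 1.0), f"rows of P[{name}] must sum to 1"
    print(f"P[{name}] is a valid (row-stochastic) transition matrix")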

Application of Markov Chains

I would like to cite an example of a real-world application of a Markov chain given by Prateek Sharma & Priya Chetty:

Markov chain and its use in solving real world problems (projectguru.in)

Suppose there are two types of weather in an area, 'sunny' and 'cloudy'. A news channel wants to predict the weather for the next week in its broadcast.

The channel hires a weather forecast company to find out the weather for the next few weeks. Currently, the weather is 'sunny' in that area.

The probabilities for the following week are as given below:

  • Staying 'sunny' the following week = 80%
  • Changing from 'sunny' to 'cloudy' over a week = 20%
  • Staying 'cloudy' the following week = 70%
  • Changing from 'cloudy' to 'sunny' over a week = 30%

Although it is predicted to be 'sunny' the whole week, one cannot be fully sure about the next week without doing the transition calculations.

The transition matrix below encodes these probabilities (rows give the current week's weather, columns the next week's):

        S     C
  S    0.8   0.2
  C    0.3   0.7

Current State × Transition Matrix = Next State

S = Sunny; C = Cloudy

Calculation for the weather in the following week: the current week is 'sunny', so the current state vector is [1, 0], and

  [1, 0] × Transition Matrix = [0.8, 0.2]

We conclude that there is an 80% chance that next week will be 'sunny' and a 20% chance that it will be 'cloudy'. This calculation is an application of a Markov chain.

If the transition matrix doesn't change with time, one can also predict the weather for further weeks using the same equation.

Calculation for the weather forecast two weeks out: applying the transition matrix to the one-week result,

  [0.8, 0.2] × Transition Matrix = [0.70, 0.30]

so the forecast two weeks from now is 70% 'sunny' and 30% 'cloudy'.
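These forecasts are a one-line matrix multiplication in NumPy; the sketch below reproduces both calculations from the transition matrix above.

import numpy as np

# Transition matrix from the example; rows and columns are ordered [S, C].
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])

current = np.array([1.0, 0.0])  # the current week is sunny

one_week = current @ T    # -> [0.8, 0.2]
two_weeks = one_week @ T  # -> [0.70, 0.30]
# Equivalently: current @ np.linalg.matrix_power(T, 2)

print("Next week: sunny %.0f%%, cloudy %.0f%%" % tuple(100 * one_week))
print("Two weeks: sunny %.0f%%, cloudy %.0f%%" % tuple(100 * two_weeks))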

MDP in AI/ML

A machine learning (ML) algorithm may be tasked with an optimization problem. Using reinforcement learning (RL), the algorithm tries to optimize the actions taken within an environment so as to maximize the potential reward. Whereas supervised learning techniques require correct input/output pairs to build a model, RL uses MDPs to achieve an optimal balance between exploration and exploitation. When the transition probabilities and rewards are not specified in advance, ML can still solve the MDP through RL.
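To illustrate that balance, here is a minimal tabular Q-learning sketch (my own illustration, not from the cited sources): with probability ε the agent explores a random action, and otherwise it exploits its current value estimates, learning from rewards without ever being told P or R. It reuses the toy states, actions, and step() function from the MDP sketch earlier in this article; the hyperparameters are placeholders.

import random

# Hyperparameters are placeholders; states, actions, and step() come from
# the toy MDP sketched earlier in this article.
alpha, gamma, epsilon = 0.1, 0.9, 0.2
Q = {(s, a): 0.0 for s in states for a in actions}

state = "s0"
for t in range(10000):
    if random.random() < epsilon:
        action = random.choice(actions)  # explore: try a random action
    else:
        action = max(actions, key=lambda a: Q[(state, a)])  # exploit
    next_state, reward = step(state, action)
    # Update the value estimate from the observed reward and next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state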
