Markov Decision Process
Definition
The Markov Decision Process (MDP) is a mathematical framework, or model, for decision-making in discrete, stochastic, sequential environments. Like the Markov chain (MC), an MDP predicts the future state based only on the information provided by the current state. In addition, an MDP incorporates actions and rewards. At each time step, the decision maker, or agent, takes an action in the current state. In response to that action, the environment transitions randomly to a new state, which determines the immediate reward obtained by the decision maker.
MDP Model
MDPs are commonly used to describe dynamical systems and to represent the environment in the Reinforcement Learning (RL) framework.
An MDP is a tuple (S, A, P, R, γ):
• S: the set of states.
• A: the set of actions.
• P: the set of state-transition probabilities.
• R: the set of immediate rewards associated with the state-action pairs.
• γ: the discount factor, with 0 ≤ γ ≤ 1.
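As a minimal sketch of this tuple as a data structure (the field names and container types here are illustrative assumptions, not a standard API), in Python:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """Container mirroring the tuple (S, A, P, R, gamma)."""
    states: list        # S: the set of states
    actions: list       # A: the set of actions
    transitions: dict   # P: (s, a) -> {s_next: probability}
    rewards: dict       # R: (s, a) -> immediate reward
    gamma: float        # discount factor, 0 <= gamma <= 1
```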
The Agent (decision maker) interacts continually with its Environment by performing actions sequentially at each discrete time step. Each action can change the state of the Environment, and in return the Agent receives a numerical reward from the Environment.
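A small sketch of that interaction loop, reusing the MDP container above with a hypothetical two-state problem (all numbers are made up for illustration):

```python
import random

# Hypothetical two-state, two-action MDP.
mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    transitions={
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.5, "s1": 0.5},
    },
    rewards={("s0", "stay"): 0.0, ("s0", "move"): 1.0,
             ("s1", "stay"): 0.5, ("s1", "move"): 0.0},
    gamma=0.9,
)

state = "s0"
for t in range(5):
    action = random.choice(mdp.actions)        # agent acts (here: at random)
    dist = mdp.transitions[(state, action)]    # environment dynamics
    state_next = random.choices(list(dist), weights=list(dist.values()))[0]
    reward = mdp.rewards[(state, action)]      # numerical reward to the agent
    print(t, state, action, reward, state_next)
    state = state_next
```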
Transition Probability
A Markov Process is defined by (S, P), where S is the set of states and P is the state-transition probability. The process consists of a series of random states S1, S2, etc., all of which obey the Markov property: the probability of moving to the next state depends only on the current state, not on the sequence of states that preceded it.
The transition probability describes the dynamics of the MDP: for each action a, it gives the probability of moving from any state s to any successor state s′. P is therefore a set of n_a transition matrices, one per action, each of dimension n_s × n_s, whose (s, s′) entry reads

[P^a]_{ss′} = Pr[S_{t+1} = s′ | S_t = s, A_t = a].

One can verify that each row of such a matrix sums to one.
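For example (a toy 3-state matrix with made-up entries, using NumPy), the row-sum property can be checked directly:

```python
import numpy as np

# One transition matrix P^a for a hypothetical 3-state MDP.
# Row s holds the distribution over successor states s' under action a.
P_a = np.array([
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
    [0.3, 0.3, 0.4],
])

# Each row is a probability distribution, so every row sum equals one.
assert np.allclose(P_a.sum(axis=1), 1.0)
```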
Application of Markov Chain
I would like to cite a real-world example of a Markov chain application given by Prateek Sharma and Priya Chetty.
Suppose there are two types of weather in an area, 'sunny' and 'cloudy'. A news channel wants to predict the weather for the next week in its broadcast, so it hires a weather forecasting company to find out the weather for the next few weeks. Currently, the weather in that area is 'sunny'.
The probabilities for the following week are as follows: if the current week is 'sunny', there is a 0.8 chance that the next week stays 'sunny' and a 0.2 chance that it turns 'cloudy'.
Although it is predicted to be 'sunny' for the whole week, one cannot be fully sure about the next week without making some transition calculations.
The matrix below explains the transition:
Current State * Transition Matrix = Final State
S = Sunny; C = Cloudy
We conclude that there is an 80% chance that next week will be 'sunny' and a 20% chance that the weather will turn 'cloudy'. This kind of state-to-state calculation is what the Markov chain models.
If the transition matrix does not change with time, one can also predict the weather for further weeks by applying the same equation repeatedly.
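The calculation can be reproduced in a few lines of NumPy. The 'sunny' row (0.8, 0.2) follows from the figures above; the 'cloudy' row is an assumed placeholder, since those numbers are not reproduced here:

```python
import numpy as np

# States ordered as [sunny, cloudy].
T = np.array([
    [0.8, 0.2],   # sunny  -> sunny, cloudy (from the example)
    [0.6, 0.4],   # cloudy -> sunny, cloudy (assumed for illustration)
])

current = np.array([1.0, 0.0])          # current state: sunny
print(current @ T)                      # next week: [0.8, 0.2]

# With a time-invariant T, later weeks come from repeated application:
print(current @ np.linalg.matrix_power(T, 2))   # week after next
```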
MDP in AI/ML
A machine learning (ML) algorithm may be tasked with an optimization problem. Using reinforcement learning (RL), the algorithm tries to optimize the actions taken within an environment so as to maximize the potential reward. Whereas supervised learning techniques require correct input/output pairs to build a model, RL uses MDPs to achieve an optimal balance of exploration and exploitation. When the transition probabilities and rewards are unspecified or unknown, ML can still tackle the problem with RL, learning from interaction with the environment rather than from a known MDP model.
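As a hedged sketch of this idea, here is minimal tabular Q-learning with epsilon-greedy exploration; the `step(s, a) -> (s_next, reward)` environment interface and all hyperparameter values are assumptions for the example, not a fixed recipe:

```python
import random
from collections import defaultdict

def q_learning(step, actions, episodes=500, horizon=50,
               alpha=0.1, gamma=0.9, epsilon=0.1, start="s0"):
    """Learn Q-values from interaction only; P and R are never accessed."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r = step(s, a)
            # Temporal-difference update toward the Bellman target.
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```

The balance between exploration and exploitation mentioned above is controlled here by epsilon: with probability epsilon the agent tries a random action, and otherwise it exploits its current value estimates.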