Partially Observable MDPs

Hello again ^^, long time no see!

Last time, we talked about some interesting ways of building RL agents with gifts/disorders. If you missed it, check it out here.

Today, we are going to discuss one of the crucial topics in optimization problems: mathematical modelling, that is, mathematically defining our problem before even trying to find ways to solve it. The RL problem is a problem of sequential decision making under uncertainty. We usually model the decision making of such a dynamic system using a mathematical framework called the Markov Decision Process (MDP), so let's dive deep into our RL problem and MDPs.


RL Problems as MDPs

An RL problem is generally an agent-environment type of problem: the agent is initially in a certain state and has to take an action A, which results in a transition of the state, and the environment produces a reward as a feedback signal to the agent. The goal of the RL problem is to maximize the expected total reward obtained.

The RL problem is well-suited to be modeled as an MDP. An MDP mainly consists of a tuple <S,A,P,r,γ> in which (a small code sketch follows the list below):

States (S): The set of all possible states the environment can be in. Mathematically, S represents the state space.

Actions (A): The set of all possible actions the agent can take. Mathematically, A represents the action space.

Transition Probability (P(s'|s,a)): The probability of transitioning from the current state s to the next state s' after taking action a. This defines the dynamics of the environment.

Reward (r(s,a,s')): The immediate reward the agent receives when transitioning from state s to state s' after taking action a. The reward function represents the objective the agent is trying to maximize.

Discount Factor (γ): A factor between 0 and 1 that determines the relative importance of future rewards compared to immediate rewards. It helps to balance the trade-off between short-term and long-term rewards.
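To make the tuple concrete, here is a minimal Python sketch of a toy MDP; the states, actions, probabilities, and rewards are made up purely for illustration and are not from any specific environment:

import random

# A toy MDP: states, actions, transition probabilities P(s'|s,a),
# a reward function r(s,a,s') and a discount factor gamma.
states = ["low_battery", "high_battery"]
actions = ["wait", "recharge"]

# P[(s, a)] maps each next state s' to its probability
P = {
    ("low_battery", "wait"):      {"low_battery": 0.9, "high_battery": 0.1},
    ("low_battery", "recharge"):  {"low_battery": 0.1, "high_battery": 0.9},
    ("high_battery", "wait"):     {"low_battery": 0.3, "high_battery": 0.7},
    ("high_battery", "recharge"): {"low_battery": 0.0, "high_battery": 1.0},
}

def reward(s, a, s_next):
    # r(s, a, s'): reward for ending up charged, small cost for recharging
    return (1.0 if s_next == "high_battery" else 0.0) - (0.2 if a == "recharge" else 0.0)

gamma = 0.95  # discount factor

def step(s, a):
    # Sample the next state s' ~ P(.|s, a) and return (s', r)
    probs = P[(s, a)]
    s_next = random.choices(list(probs.keys()), weights=list(probs.values()))[0]
    return s_next, reward(s, a, s_next)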

The MDP rests on one very important assumption, the 'Markovian Property', which is:

Markovian Property


The MDP assumes that the current state is sufficient for planning, discarding the entire history of interaction with the environment ("throwing away all previous states").
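Written out, the Markovian property says that the next state depends only on the current state and action, not on the rest of the history:

P(St+1 | St, At) = P(St+1 | S1, A1, ..., St, At)

i.e., given the present, the future is independent of the past.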



THE STATE

Let's talk more about the state. Imagine that you are designing an RL agent to land a rocket: what states are needed in order to make decisions? Do we need information about every atom forming the rocket? I mean, that could help, but it is not necessary in order to secure good decision making. It is practically impossible to obtain all the information about our environment, 'the Environment State'. What happens in reality is that we observe partial information or readings about the environment using sensors; this is called an 'Observation'. Our agent also constructs a belief about the current state of the environment, which is called the 'Agent State'.

Different states for same problem


This generalization, differentiating between the different types of states, introduces a problem: it breaks the Markovian property. This is because what is really available to the agent is the observations, which may be incomplete or, in other words, may not be sufficient for planning. This new, more general framework is called the Partially Observable MDP (POMDP). In partially observable MDPs, the environment state is generally not equal to either the observations or the agent state.
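For reference, the POMDP is commonly formalized by extending the MDP tuple with an observation space and an observation model, giving <S,A,P,r,Ω,O,γ>, where Ω is the set of possible observations and O(o | s', a) is the probability of receiving observation o after taking action a and landing in state s'.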

But does this mean that we can't use algorithms developed for classic MDPs within our new, more general framework?

Solutions

Observation concatenation

One way to avoid breaking the Markovian property in the case of partial observability is to buffer a certain number of observations and use that buffer as a surrogate state. This method can be helpful in very tiny, non-complex environments, but it does not scale to complex ones such as dynamics problems. Its appeal is that it is trivial to use and is not computationally expensive (a small sketch follows the figure below).

Concatenation of observations in decision making
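As a rough illustration of the idea, here is a minimal Python sketch that stacks the last k observations into a single input for the agent; the buffer size and observation shape are arbitrary placeholders:

from collections import deque
import numpy as np

class ObservationStacker:
    # Keeps the last k observations and concatenates them into one surrogate state
    def __init__(self, k, obs_shape):
        self.k = k
        self.buffer = deque([np.zeros(obs_shape)] * k, maxlen=k)

    def reset(self, first_obs):
        # Fill the buffer with the first observation at the start of an episode
        self.buffer = deque([first_obs] * self.k, maxlen=self.k)
        return np.concatenate(list(self.buffer))

    def step(self, obs):
        # Drop the oldest observation, append the newest, return the stacked input
        self.buffer.append(obs)
        return np.concatenate(list(self.buffer))

# Usage: feed stacker.step(obs) to the policy instead of the raw observation
stacker = ObservationStacker(k=4, obs_shape=(8,))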



Belief Monitoring

A very common solution to this problem is to try to estimate the states, which are now widely called 'hidden' states. This method relies heavily on Bayes' rule. A Bayesian network can be drawn in order to simplify the problem and can be represented as follows:

Bayesian Network and Belief update equation
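In standard discrete Bayes-filter notation, this kind of belief update is usually written as

bt(St) = η · P(Ot | St) · Σ over St-1 of [ P(St | St-1, At-1) · bt-1(St-1) ]

where η is the normalizing factor and P(Ot | St) is the sensor model.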

The belief update equation can be easily interpreted if we divide it into two small chunks:

1) The predicted-state part: the predicted state can be interpreted as the probability of transitioning from state St-1 to St, given a certain belief bt-1 of initially being in state St-1 (the product of the two, summed over the possible previous states).

2) The estimated-state part: the estimated state is a corrected predicted state, in which the predicted state is multiplied by both a) the normalizing factor and b) the probability of observing a certain observation Ot given that I am in state St, which is essentially the sensor model.
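Putting the two parts together, here is a minimal Python sketch of one belief update step, assuming the transition model and the sensor model are given as NumPy arrays (in practice they may have to be learned or approximated):

import numpy as np

def belief_update(belief, action, observation, P, O):
    # belief: shape (n_states,), the previous belief b_{t-1}
    # P: shape (n_actions, n_states, n_states), P[a, s, s'] = P(s'|s,a)
    # O: shape (n_states, n_obs), O[s', o] = P(o|s')
    # 1) Prediction: push the old belief through the transition model
    predicted = belief @ P[action]          # sum over s of b_{t-1}(s) * P(s'|s,a)
    # 2) Correction: weight by the sensor model, then normalize
    unnormalized = O[:, observation] * predicted
    return unnormalized / unnormalized.sum()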

This method has several advantages over the previously mentioned one: we no longer have to track or buffer many observations, and the belief update can live in a separate module that plugs into the overall RL agent architecture.

One limitation of this method is that we now depend on the environment's dynamics model, which may not be accessible. Another drawback is the computational complexity.


Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are useful for Partially Observable Markov Decision Processes (POMDPs) because they can maintain an internal state and incorporate past observations, allowing them to model the temporal and sequential nature of the problem. Two huge advantages of RNNs over the previously mentioned methods are (a small sketch follows the list below):

A) RNNs can be trained end-to-end, directly mapping observations to actions, without the need for explicit state estimation or belief tracking modules.

B) RNNs can be integrated with other neural network components, such as convolutional layers for processing visual inputs, making them a versatile and scalable choice for complex POMDP problems.
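As a rough sketch (not a full training setup), a recurrent policy in PyTorch could look like the following, where the hidden state of the GRU plays the role of the agent state and all sizes are arbitrary placeholders:

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    # GRU-based policy: the hidden state acts as a learned summary of past observations
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden carries memory across steps
        out, hidden = self.gru(obs_seq, hidden)
        return self.head(out), hidden       # action logits for every time step

policy = RecurrentPolicy(obs_dim=8, hidden_dim=64, n_actions=4)
logits, h = policy(torch.zeros(1, 10, 8))   # dummy rollout of 10 observations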

But there is no free lunch: as a method, RNNs suffer from the common disadvantages of deep learning algorithms, like vanishing/exploding gradients, sensitivity to hyperparameters, and data inefficiency.

