Partially Observable MDPs

Hello again ^^, long time no see!

Last time, we talked about some interesting ways of building RL agents with gifts/disorders. If you missed it, check it out here.

Today, we are going to discuss one of the crucial topics in optimization problems: mathematical modelling, that is, mathematically defining our problem before even trying to find ways to solve it. The RL problem is a problem of sequential decision making under uncertainty. We usually model the decision making of such a dynamic system using a mathematical framework called the Markov Decision Process (MDP), so let's dive deep into our RL problem and MDPs.


RL Problems as MDPs

An RL problem is generally an agent-environment type of problem: the agent is initially in a certain state and has to take an action A, which results in a transition of the state, and the environment produces a reward as a feedback signal to the agent. The goal of the RL problem is to maximize the expected total reward obtained.

The RL problem is well-suited to be modeled as an MDP. An MDP mainly consists of a tuple <S,A,P,r,γ> in which (a small code sketch follows the list below):

States (S): The set of all possible states the environment can be in. Mathematically, S represents the state space.

Actions (A): The set of all possible actions the agent can take. Mathematically, A represents the action space.

Transition Probability (P(s'|s,a)): The probability of transitioning from the current state s to the next state s' after taking action a. This defines the dynamics of the environment.

Reward (r(s,a,s')): The immediate reward the agent receives when transitioning from state s to state s' after taking action a. The reward function represents the objective the agent is trying to maximize.

Discount Factor (γ): A factor between 0 and 1 that determines the relative importance of future rewards compared to immediate rewards. It helps to balance the trade-off between short-term and long-term rewards.
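To make the tuple concrete, here is a minimal Python sketch of a toy MDP; the states, actions, probabilities, and rewards are made up purely for illustration and are not from any specific environment:

import random

# A toy MDP: states, actions, transition probabilities P(s'|s,a),
# a reward function r(s,a,s') and a discount factor gamma.
states = ["low_battery", "high_battery"]
actions = ["wait", "recharge"]

# P[(s, a)] maps each next state s' to its probability
P = {
    ("low_battery", "wait"):      {"low_battery": 0.9, "high_battery": 0.1},
    ("low_battery", "recharge"):  {"low_battery": 0.1, "high_battery": 0.9},
    ("high_battery", "wait"):     {"low_battery": 0.3, "high_battery": 0.7},
    ("high_battery", "recharge"): {"low_battery": 0.0, "high_battery": 1.0},
}

def reward(s, a, s_next):
    # r(s, a, s'): reward for ending up charged, small cost for recharging
    return (1.0 if s_next == "high_battery" else 0.0) - (0.2 if a == "recharge" else 0.0)

gamma = 0.95  # discount factor

def step(s, a):
    # Sample the next state s' ~ P(.|s, a) and return (s', r)
    probs = P[(s, a)]
    s_next = random.choices(list(probs.keys()), weights=list(probs.values()))[0]
    return s_next, reward(s, a, s_next)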

The MDP rests on one very important assumption, the 'Markovian Property', which is:

Markovian Property


The MDP assumes that the current state is sufficient for planning, discarding the entire history of interaction with the environment ("throwing away all previous states").
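Written out, the Markovian property says that the next state depends only on the current state and action, not on the rest of the history:

P(St+1 | St, At) = P(St+1 | S1, A1, ..., St, At)

i.e., given the present, the future is independent of the past.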



THE STATE

Let's talk more about the state. Imagine that you are designing an RL agent to land a rocket: what states are needed in order to make decisions? Do we need information about every atom forming the rocket? I mean, that could help, but it is not necessary in order to secure good decision making. It is practically impossible to obtain all the information about our environment, 'the Environment State'. What happens in reality is that we observe partial information or readings about the environment using sensors; this is called an 'Observation'. Our agent also constructs a belief about the current state of the environment, which is called the 'Agent State'.

Different states for same problem


This generalization, differentiating between the different types of states, introduces a problem: it breaks the Markovian property. This is because what is really available to the agent is the observations, which may be incomplete or, in other words, may not be sufficient for planning. This new, more general framework is called the Partially Observable MDP (POMDP). In partially observable MDPs, the environment state is generally not equal to either the observations or the agent state.
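For reference, the POMDP is commonly formalized by extending the MDP tuple with an observation space and an observation model, giving <S,A,P,r,Ω,O,γ>, where Ω is the set of possible observations and O(o | s', a) is the probability of receiving observation o after taking action a and landing in state s'.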

But does this mean that we can't use algorithms developed for classic MDPs within our new, more general framework?

Solutions

Observation concatenation

One way to avoid breaking the Markovian property in the case of partial observability is to buffer a certain number of observations and use that buffer as a surrogate state. This method can be helpful in very tiny, non-complex environments, but it does not scale to complex ones such as dynamics problems. Its appeal is that it is trivial to use and is not computationally expensive (a small sketch follows the figure below).

Concatenation of observations in decision making
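As a rough illustration of the idea, here is a minimal Python sketch that stacks the last k observations into a single input for the agent; the buffer size and observation shape are arbitrary placeholders:

from collections import deque
import numpy as np

class ObservationStacker:
    # Keeps the last k observations and concatenates them into one surrogate state
    def __init__(self, k, obs_shape):
        self.k = k
        self.buffer = deque([np.zeros(obs_shape)] * k, maxlen=k)

    def reset(self, first_obs):
        # Fill the buffer with the first observation at the start of an episode
        self.buffer = deque([first_obs] * self.k, maxlen=self.k)
        return np.concatenate(list(self.buffer))

    def step(self, obs):
        # Drop the oldest observation, append the newest, return the stacked input
        self.buffer.append(obs)
        return np.concatenate(list(self.buffer))

# Usage: feed stacker.step(obs) to the policy instead of the raw observation
stacker = ObservationStacker(k=4, obs_shape=(8,))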



Belief Monitoring

A very common solution to this problem is to try to estimate the states, which are now widely called 'hidden' states. This method relies heavily on Bayes' rule. A Bayesian network can be drawn in order to simplify the problem and can be represented as follows:

Bayesian Network and Belief update equation
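In standard discrete Bayes-filter notation, this kind of belief update is usually written as

bt(St) = η · P(Ot | St) · Σ over St-1 of [ P(St | St-1, At-1) · bt-1(St-1) ]

where η is the normalizing factor and P(Ot | St) is the sensor model.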

The belief update equation can be easily interpreted if we divide it into two small chunks:

1) The predicted-state part: the predicted state can be interpreted as the probability of transitioning from state St-1 to St, given a certain belief bt-1 of initially being in state St-1 (the product of the two, summed over the possible previous states).

2) The estimated-state part: the estimated state is a corrected predicted state, in which the predicted state is multiplied by both a) the normalizing factor and b) the probability of observing a certain observation Ot given that I am in state St, which is essentially the sensor model.
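Putting the two parts together, here is a minimal Python sketch of one belief update step, assuming the transition model and the sensor model are given as NumPy arrays (in practice they may have to be learned or approximated):

import numpy as np

def belief_update(belief, action, observation, P, O):
    # belief: shape (n_states,), the previous belief b_{t-1}
    # P: shape (n_actions, n_states, n_states), P[a, s, s'] = P(s'|s,a)
    # O: shape (n_states, n_obs), O[s', o] = P(o|s')
    # 1) Prediction: push the old belief through the transition model
    predicted = belief @ P[action]          # sum over s of b_{t-1}(s) * P(s'|s,a)
    # 2) Correction: weight by the sensor model, then normalize
    unnormalized = O[:, observation] * predicted
    return unnormalized / unnormalized.sum()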

This method has several advantages over the previously mentioned one: we no longer have to track or buffer many observations, and the belief update can live in a separate module that plugs into the overall RL agent architecture.

One limitation of this method is that we now depend on the environment's dynamics model, which may not be accessible. Another drawback is the computational complexity.


Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are useful for Partially Observable Markov Decision Processes (POMDPs) because they can maintain an internal state and incorporate past observations, allowing them to model the temporal and sequential nature of the problem. Two huge advantages of RNNs over the previously mentioned methods are (a small sketch follows the list below):

A) RNNs can be trained end-to-end, directly mapping observations to actions, without the need for explicit state estimation or belief tracking modules.

B) RNNs can be integrated with other neural network components, such as convolutional layers for processing visual inputs, making them a versatile and scalable choice for complex POMDP problems.
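As a rough sketch (not a full training setup), a recurrent policy in PyTorch could look like the following, where the hidden state of the GRU plays the role of the agent state and all sizes are arbitrary placeholders:

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    # GRU-based policy: the hidden state acts as a learned summary of past observations
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim); hidden carries memory across steps
        out, hidden = self.gru(obs_seq, hidden)
        return self.head(out), hidden       # action logits for every time step

policy = RecurrentPolicy(obs_dim=8, hidden_dim=64, n_actions=4)
logits, h = policy(torch.zeros(1, 10, 8))   # dummy rollout of 10 observations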

But there is no free lunch: as a method, RNNs suffer from the common disadvantages of deep learning algorithms, like vanishing/exploding gradients, sensitivity to hyperparameters, and data inefficiency.

