Harmonizing Reinforcement Learning and Maximum Likelihood Estimation

A Journey into Intelligent Decision-Making: Part-1

Introduction

In artificial intelligence, where machines learn to make decisions akin to human thought processes, the fusion of Reinforcement Learning (RL) and Maximum Likelihood Estimation (MLE) stands as an emblem of the field's remarkable progress. As we navigate a world increasingly influenced by intelligent machines, the quest for optimal decision-making has driven the relentless pursuit of sophisticated algorithms that power our devices, robots, and self-improving systems. At the heart of this endeavor lies the symbiotic relationship between Reinforcement Learning and Maximum Likelihood Estimation. In this article, we will understand how the unity of these two concepts has revolutionized how machines learn and adapt in complex environments.

But let us first understand what Reinforcement Learning and Maximum Likelihood Estimation are, and how they benefit a self-learning system such as an AI agent.

What is Reinforcement Learning?

Reinforcement Learning is a branch of machine learning in which an agent (often a deep neural network) learns patterns within the data through a reward system. The agent explores the various inputs (also known as states) to see which of them yields an output (also known as an action) with the highest reward. At its core, RL revolves around the interplay between an agent, the entity making decisions, and an environment, the context within which these decisions unfold and produce rewards.

In simpler words, the agent tries to explore an unseen environment by taking certain actions. The idea is to find an action that yields the highest or optimal reward.

Mathematically, RL can be framed as a Markov Decision Process (MDP), denoted by a tuple (S,A,P,R) where:

  • S represents the set of possible states in the environment, which is the input.
  • A symbolizes the set of available actions the agent can take, which is the output.
  • P embodies the transition dynamics, capturing the probabilities of transitioning from one state to another given an action.
  • R signifies the reward function, quantifying the immediate benefit an agent gains by taking a certain action in a specific state (a concrete code mapping of this tuple follows below).
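To make the tuple concrete, here is a minimal sketch that maps it onto the FrozenLake environment we use later in this article (the sizes shown in the comments are properties of that particular environment):

import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=True)

S = env.observation_space.n   # 16 states: one per cell of the default 4x4 grid
A = env.action_space.n        # 4 actions: left, down, right and up
# P and R live inside the environment: env.step(action) samples the next state
# from the transition dynamics P and returns the reward defined by R.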

A Markov Decision Process relies on a probabilistic model P that predicts the future state s1 given the current state s0. It yields the probability of the agent moving from state s0 to the next state s1. Imagine we have a simple 4x4 grid environment where each cell represents a unique state. The agent can navigate through this world by taking actions: up, down, left, or right.

The Markov Decision Process leverages probability scores, known as transition probabilities, to determine the course of the next action. The transition probabilities help us understand where the agent is likely to end up after taking a specific action.

For instance, imagine we're tracking weather conditions with a Markov chain, as in the table below. Let us assume that we have three states: cloudy, rainy, and windy. To understand how an agent will move between these states, we will use the Markov table.

Now, let's decode this table:

  • Cloudy to Rainy: Starting from a cloudy day, according to the table there’s a 60% chance it will become rainy and a 30% chance it will become windy.
  • Rainy to Cloudy: If it's rainy, it usually sticks around - there's an 80% probability of it remaining rainy and a 20% chance of it becoming cloudy.
  • Windy to Rainy: Windy days always shift to rainy without fail, with a 100% probability.

This Markov table is our compass, telling us the most likely paths between weather states. It's like predicting tomorrow's weather based on today's conditions. Understanding these probabilities helps us make informed decisions.
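As a minimal sketch, the weather table can be written as a transition matrix and sampled with PyTorch. Note that the cloudy row is not fully specified above, so the remaining 10% is assumed here to be the chance of staying cloudy:

import torch

# Transition matrix of the weather Markov chain: rows are the current state,
# columns are the next state, in the order [cloudy, rainy, windy].
P = torch.tensor([
    [0.1, 0.6, 0.3],   # cloudy -> cloudy / rainy / windy (10% assumed to stay cloudy)
    [0.2, 0.8, 0.0],   # rainy  -> cloudy / rainy / windy
    [0.0, 1.0, 0.0],   # windy  -> cloudy / rainy / windy
])

states = ['cloudy', 'rainy', 'windy']
current = 0                                           # start on a cloudy day
next_state = torch.multinomial(P[current], 1).item()  # sample tomorrow's weather
print(states[next_state])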

To have a more hands-on experience we can use the Gymnasium library along with PyTorch.

import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=True)
state, info = env.reset()                  # reset returns the initial state and an info dict
action = env.action_space.sample()         # sample a random action from the action space
observation, reward, terminated, truncated, info = env.step(action)

print(info)
>> {'prob': 0.3333333333333333}

Essentially, the code block above does the following:

  • Environment Setup: Import the Gymnasium library and create a 'FrozenLake-v1' environment with 'is_slippery=True' for an added challenge. 'FrozenLake-v1' is a popular environment provided by the OpenAI Gym toolkit. It's a simple grid-world environment where an agent (like a character in a game) moves on a grid of tiles. The objective is to reach a goal tile while avoiding holes in the ice.
  • Initial State and Action: Reset the environment to get the initial state, and randomly sample an action from the available action space.
  • Taking a Step: Execute the selected action in the environment to transition to a new state.
  • Observations and Rewards: Receive the new state, a reward value, and the episode status (terminated or truncated).
  • Information Display: Print additional insights from the 'info' dictionary. In this case, it returns the transition probability.

Source: https://spinningup.openai.com/en/latest/spinningup/rl_intro.html


The image above is a very simple but crucial representation of “reinforcement learning”. This diagram will serve you well as we unfold the various topics below. It is important to remember that:

  1. An agent, or policy, is a neural network that takes a state at time t and returns a probability distribution over actions.
  2. The best action is taken and then evaluated against the reward, because each action yields a reward.
  3. Based on the reward, the agent is fine-tuned and optimized (a minimal code sketch of this loop follows below).
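Here is a minimal sketch of that loop, assuming policy_net is a small network (constructed in the later sections) that maps a one-hot encoded state to a probability distribution over actions:

import gymnasium as gym
import torch

env = gym.make('FrozenLake-v1', is_slippery=True)
state, info = env.reset()
done = False
while not done:
    # Encode the discrete state as a one-hot vector for the policy network
    state_one_hot = torch.nn.functional.one_hot(
        torch.tensor(state), num_classes=env.observation_space.n
    ).float()
    action_probs = policy_net(state_one_hot)             # distribution over actions
    action = torch.multinomial(action_probs, 1).item()   # sample an action
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated                       # the reward is later used to fine-tune the policy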

In the coming sections, we will see how a policy, or policy network, which is a simple neural network, can be trained to take optimal actions. For consistency, we will denote the policy network as:

π_θ(a | s), where θ denotes the learnable parameters of the network.


Each section will essentially help you understand the fundamentals of reinforcement learning through equations and code. So let’s get started.

What is Maximum Likelihood Estimation?

Maximum Likelihood Estimation (MLE) is a fundamental statistical technique used to estimate the parameters of a statistical model based on observed data. It operates on the principle of finding the parameter values that maximize the likelihood of observing the given data under the assumed model. In simpler terms, MLE aims to find the values of the model's parameters that would make the observed data most probable.

In other words, MLE seeks to answer the question: What parameter values would make the observed data most probable according to the assumed model?
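As a quick, self-contained illustration with made-up data, here is MLE for a coin-flip (Bernoulli) model: we estimate the probability of heads by minimizing the negative log-likelihood with gradient descent, and the estimate converges towards the sample mean:

import torch

flips = torch.tensor([1., 1., 0., 1., 0., 1., 1., 0., 1., 1.])  # hypothetical observed flips
logit = torch.tensor(0.0, requires_grad=True)                    # unconstrained parameter
optimizer = torch.optim.Adam([logit], lr=0.1)

for _ in range(300):
    optimizer.zero_grad()
    theta = torch.sigmoid(logit)   # keeps the probability of heads in (0, 1)
    # Negative log-likelihood of the flips under a Bernoulli(theta) model
    nll = -(flips * torch.log(theta) + (1 - flips) * torch.log(1 - theta)).sum()
    nll.backward()
    optimizer.step()

print(torch.sigmoid(logit).item())  # approaches 0.7, the fraction of heads in the data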

Now, assuming that we have a mathematical (probabilistic) model, our aim is to find the best parameters, i.e. the values that make the observed data, and hence the optimal actions, most likely. We can define this as:

θ* = argmax_θ P(data | θ)        (1)

Since we are dealing with a neural network, we replace the probabilistic model with a network with learnable parameters θ. This network takes states as input and yields actions as output. When we rewrite equation (1) in terms of the network, we get a log-likelihood objective over the observed state-action pairs:

θ* = argmax_θ Σ log π_θ(a | s)

Let’s assume that the policy network is a linear network and we want to find optimal parameters for each state. Then we can use the following code:

import torch

# PolicyNetwork is assumed to be a small network (e.g. a linear layer followed by a
# softmax) that maps a one-hot encoded state to a probability distribution over actions.
policy_net = PolicyNetwork(input_dim, output_dim)
state_tensor = torch.tensor(state_one_hot).float()   # one-hot encoding of the current state
action_probs = policy_net(state_tensor)              # probability of each action given the state
action = torch.multinomial(action_probs, 1).item()   # sample an action from the distribution

observation, reward, terminated, truncated, info = env.step(action)

loss = -torch.log(action_probs[action])              # negative log-likelihood of the chosen action
loss.backward()

Keep in mind that the sampling performed by torch.multinomial(action_probs, 1).item() is essential. When you have a stochastic policy (a policy that outputs a probability distribution over actions), you often want to sample from that distribution rather than just take the action with the highest probability (i.e., using argmax).

But, if you were working with a deterministic policy or in a scenario where you specifically want to use the action with the highest probability, then argmax would be suitable. In practice, the choice between argmax and sampling depends on the problem and the desired behavior of the agent.
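To make the contrast concrete, reusing action_probs from the snippet above:

# Stochastic policy: sample an action from the distribution (encourages exploration)
action = torch.multinomial(action_probs, 1).item()

# Deterministic policy: always pick the most probable action (pure exploitation)
action = torch.argmax(action_probs).item()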

Policy

The agent's objective is to learn a policy pi(a|s), which maps states s to actions a in a manner that maximizes the expected cumulative reward. A reward is a scalar signal that represents how good a certain action is in a given state.

In an iterative process, these rewards are stored to calculate the expected cumulative reward.
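As a minimal sketch of that calculation, assuming gamma is the discount factor and rewards holds the rewards collected during a short episode:

gamma = 0.99
rewards = [0.0, 0.0, 1.0]      # hypothetical rewards from a three-step episode

cumulative_reward = 0.0
for r in reversed(rewards):    # accumulate from the last step backwards
    cumulative_reward = r + gamma * cumulative_reward
print(cumulative_reward)       # 0.9801 for this example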

Now, it is fair to assume that the policy is a learnable function, which is nothing but a neural network. This neural network takes in a state s as the input and yields probable actions a as the output.

So essentially, a policy network or a policy function can be implemented as a neural network. For explanation purposes, let it be a single linear layer.

import torch.nn as nn
Policy = nn.Linear(state_dim, action_dim)

This linear layer yields raw action scores, which can be problematic because the agent becomes deterministic. In other words, the network is rigid rather than flexible: it does not encourage the agent to explore various possibilities or learn the patterns within the given state. The output of the linear layer is just a linear transformation and does not form a probability distribution, and to estimate something we need a probabilistic distribution. A probabilistic distribution encourages the agent to learn patterns and representations and to explore various possibilities. Exploration is a key mechanism that lets the agent try different actions that may yield the optimal reward.

One of the ways in which we ask the policy network to explore and learn patterns and representations is to feed the linear output to the Softmax function.

action_prob = torch.softmax(nn.Linear(state_dim, action_dim)(state_tensor), dim=-1)

Or, equivalently, reusing the Policy layer defined above:

action_prob = torch.softmax(Policy(state_tensor), dim=-1)

Now that we have dealt with how to construct the policy network, we need to understand how to optimize it using the MLE objective from equation (1). Concretely, the policy loss below is the negative sum of the log-probabilities of the chosen actions, scaled by the total reward collected:

log_prob = torch.log(action_prob[action])      # log-probability of the chosen action
log_prob_list.append(log_prob)                 # log-probabilities accumulated over the episode
# `reward_list` holds the per-step rewards of the episode as tensors
policy_loss = -torch.stack(log_prob_list).sum() * torch.stack(reward_list).sum()
policy_loss.backward()

Action-value function

The action-value function, or Q-function, Q(s, a), estimates the expected cumulative discounted reward obtained by taking action a in state s and following the policy thereafter:

Q(s, a) = E[ r_t + γ·r_{t+1} + γ²·r_{t+2} + ... | s_t = s, a_t = a ]

Bear in mind that Q-values can be calculated in various ways. One of the most popular methods is using a Q-table. You can also use a learnable neural network for the same.
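As a rough sketch of the neural-network alternative (the hidden-layer width is an arbitrary choice; state_size, action_size, and state come from the surrounding code):

import torch
import torch.nn as nn

# Instead of a lookup table, a small network maps a state encoding to one Q-value per action
q_net = nn.Sequential(
    nn.Linear(state_size, 64),
    nn.ReLU(),
    nn.Linear(64, action_size),
)

state_one_hot = torch.nn.functional.one_hot(torch.tensor(state), num_classes=state_size).float()
q_values = q_net(state_one_hot)   # one estimated Q-value per action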

But in any case, it is essential to expand the equation. Upon doing that we get the following:

Q_π(s, a) = E_{s'} [ R(s, a) + γ · E_{a'~π} [ Q_π(s', a') ] ]        (6)

Now, there is another Bellman equation for the Q-function, this time for the optimal policy:

Q*(s, a) = E_{s'} [ R(s, a) + γ · max_{a'} Q*(s', a') ]        (7)

For convenience, let me label equation (6) as Bellman Equation for Action-Values with Policy π and equation (7) as Bellman Equation for Action-Values with Optimal Policy.

Now, the question arises: which one should we choose? Q-learning answers this by nudging the current estimate towards the optimal-policy target from equation (7):

Q(s, a) ← Q(s, a) + α · [ R(s, a) + γ · max_{a'} Q(s', a') - Q(s, a) ]        (8)

where α is the learning rate.

Equation (8) allows us to update the Q-table (or a Q-network) towards the optimal value estimates.

We can code equation (8) as follows:

# Temporal-difference target minus the current estimate (the bracketed term in equation (8))
delta = reward + gamma * torch.max(qtable[new_state, :]) - qtable[state, action]

# Move the current Q-value towards the target by a step of size learning_rate
q_update = qtable[state, action] + learning_rate * delta

As you can see from the snippet above, there is a term called qtable recurring quite often. That is, as the name suggests, a table. This table stores all the Q-values that we get when we solve equation (8). It is essentially a lookup table that helps an RL agent make decisions by storing and updating the expected cumulative rewards (Q-values) for each possible combination of states and actions in an environment.

We initialize the qtable as a matrix of zeros and update it with each iteration. It is important to remember that the table must be of size (state_size, action_size).

qtable = torch.zeros((state_size, action_size))        

We can update the Q-table as shown below.

state = env.reset(seed=params.seed)[0]                    # `params` is assumed to hold the config
state_one_hot = torch.nn.functional.one_hot(torch.tensor(state), num_classes=state_size).float()
action_prob = net(state_one_hot.unsqueeze(0))             # `net` is the policy network
action = action_prob.multinomial(1).item()                # sample an action
new_state, reward, terminated, truncated, info = env.step(action)

qtable[state, action] = learner.update(state, action, reward, new_state)   # `learner` applies equation (8)

If all Q-values for the current state are equal, we choose a random action; otherwise we take the action with the maximum Q-value using torch.argmax(), as shown below.
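A minimal sketch of that selection rule, reusing the qtable and environment from above:

q_values = qtable[state, :]
if torch.all(q_values == q_values[0]):
    action = env.action_space.sample()          # all Q-values equal: pick a random action
else:
    action = torch.argmax(q_values).item()      # otherwise exploit the best-known action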

We then compute the policy loss, i.e. we sum the log probabilities and the Q-values over all time steps and take the product of the two sums. Thereafter we perform backpropagation.

# `qas_list` holds the Q-values Q(s, a) collected at each time step of the episode
policy_loss = -torch.stack(log_prob_list).sum() * torch.stack(qas_list).sum()

policy_loss.backward(retain_graph=True)

Importance of Bellman equation:

  • Core of Dynamic Programming: The Bellman equation is at the heart of dynamic programming, enabling optimal decision-making and value estimation in sequential scenarios.
  • Optimal Decisions: It guides us to make optimal choices by balancing immediate rewards with long-term gains, enhancing decision quality.
  • Iterative Algorithms: It powers value iteration and policy improvement algorithms, refining estimates for better decision strategies.
  • Generalization: The equation's recursion lets solutions extend to various situations, offering a versatile framework.
  • RL Foundations: In reinforcement learning, it forms the basis for modeling agent-environment interactions mathematically.
  • Key RL Algorithms: Q-learning and SARSA build upon the Bellman equation, updating action-value estimates iteratively.
  • Temporal Difference Learning: Agents use it for updating values based on current and future estimates, aiding adaptive learning.
  • Real-World Impact: Its principles impact domains like robotics, finance, and game AI, guiding optimal decisions.
  • Theoretical and Practical: It combines theoretical insight with practical algorithms, enhancing decision-making in complex environments.

Results

Below are the results that this simple network produced. Essentially, the last frame of each run is shown, capturing the steps the agent took to:

  1. Avoid the obstacles.
  2. Reach the goal.

To make the task more challenging, the map size is increased: the algorithm is run on 4x4, 7x7, 9x9, and 11x11 maps.

Final frames on the 4x4, 7x7, 9x9, and 11x11 maps (images by the author).

All branches of AI leverage mathematical equations, which allows them to generate promising results. Although the example we saw is simple, these principles, when scaled up, can produce agents such as self-driving systems, LLMs, et cetera.

The fundamentals of reinforcement learning start with the MDP, which leverages transition probabilities. But efficiency and effectiveness come from applying other core concepts such as:

  1. MLE to find optimal parameters that map states to actions.
  2. Incorporating equations such as the policy loss to find better parametric solutions.
  3. Upgrading the policy loss with a guiding factor such as the Bellman equation, which adds flexibility to the algorithm.

My aim is to explore these concepts to build a more sophisticated system. In the coming weeks, I will introduce new concepts and convey their importance and applications in real-world scenarios.

References

  1. OpenAI: Key Concepts in RL
  2. Sutton, Richard S.; Barto, Andrew G. (2018) [1998]. Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN 978-0-262-03924-6.
  3. Frozenlake benchmark
  4. Deep Reinforcement Learning With Python
  5. Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
