Harmonizing Reinforcement Learning and Maximum Likelihood Estimation
A Journey into Intelligent Decision-Making: Part-1
Introduction
In artificial intelligence, where machines learn to make decisions akin to human thought processes, the fusion of Reinforcement Learning (RL) and Maximum Likelihood Estimation (MLE) stands as an emblem of the field's remarkable progress. As we navigate a world increasingly influenced by intelligent machines, the quest for optimal decision-making has driven the relentless pursuit of sophisticated algorithms that power our devices, robots, and self-improving systems. At the heart of this endeavor lies the symbiotic relationship between Reinforcement Learning and Maximum Likelihood Estimation. In this article, we will understand how the unity of these two concepts has revolutionized how machines learn and adapt in complex environments.
But let us first understand what Reinforcement Learning and Maximum Likelihood Estimation are, and how they benefit a self-learning system such as an AI agent.
What is Reinforcement Learning?
Reinforcement Learning is a branch of machine learning in which an agent (often implemented with a deep neural network) learns from data through a reward signal. The agent explores various inputs (each known as a state) to discover which output (known as an action) yields the highest reward. At its core, RL revolves around the interplay between an agent, the entity making decisions, and an environment, the context within which those decisions unfold and from which the rewards are received.
In simpler words, the agent explores an unseen environment by taking actions, and the idea is to find the actions that yield the highest (optimal) reward.
Mathematically, RL can be framed as a Markov Decision Process (MDP), denoted by a tuple (S, A, P, R) where S is the set of states, A is the set of actions, P is the state-transition probability function, and R is the reward function.
A Markov Decision Process uses a probabilistic model P that predicts the future state s1 given the current state s0; it yields the probability of the agent moving from state s0 to the next state s1. Imagine a simple 4x4 grid environment where each cell represents a unique state. The agent can navigate through this world by taking actions: up, down, left, or right.
The probabilities the MDP assigns to the possible next states are known as transition probabilities. They help us understand where the agent is likely to end up after taking a specific action.
For instance, imagine we are tracking weather conditions with a Markov chain and we have three states: cloudy, rainy, and windy. To understand how an agent moves between these states, we use the Markov (transition) table below.
Now, let's decode this table:
This Markov table is our compass, telling us the most likely paths between weather states. It's like predicting tomorrow's weather based on today's conditions. Understanding these probabilities helps us make informed decisions.
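To make this concrete, here is a minimal sketch in PyTorch that samples the next weather state from a transition matrix. The probability values below are made up purely for illustration and are not taken from the table above.

import torch

# Illustrative transition matrix (rows: current state, columns: next state).
# States: 0 = cloudy, 1 = rainy, 2 = windy. Values are assumptions for demonstration only.
transition_matrix = torch.tensor([
    [0.5, 0.3, 0.2],   # from cloudy
    [0.4, 0.4, 0.2],   # from rainy
    [0.3, 0.3, 0.4],   # from windy
])

current_state = 0  # cloudy
# Sample the next state according to the transition probabilities of the current state.
next_state = torch.multinomial(transition_matrix[current_state], 1).item()
print(next_state)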
To get more hands-on experience, we can use the Gymnasium library along with PyTorch.
import gymnasium as gym
# Create the FrozenLake environment; is_slippery=True makes transitions stochastic.
env = gym.make('FrozenLake-v1', is_slippery=True)
# reset() returns the initial observation and an info dictionary.
state, info = env.reset()
# Sample a random action and step the environment with it.
action = env.action_space.sample()
observation, reward, terminated, truncated, info = env.step(action)
print(info)
>> {'prob': 0.3333333333333333}
Essentially, the code block above creates the FrozenLake environment, resets it to an initial state, samples a random action, steps the environment with that action, and prints the info dictionary, which contains the transition probability of the move that was just taken.
The classic agent-environment loop is a very simple but crucial representation of reinforcement learning, and it will serve you well as we unfold the topics below: the agent observes a state, takes an action, and the environment returns a reward together with the next state.
In the coming sections, we will see how a policy, or a policy network (a simple neural network), can be trained to take optimal actions. For consistency, we will denote the policy network as $\pi_{\theta}(a \mid s)$, where $\theta$ are the learnable parameters.
Each section will essentially help you understand the fundamentals of reinforcement learning through equations and code. So let's get started.
What is Maximum Likelihood Estimation?
Maximum Likelihood Estimation (MLE) is a fundamental statistical technique used to estimate the parameters of a statistical model based on observed data. It operates on the principle of finding the parameter values that maximize the likelihood of observing the given data under the assumed model. In simpler terms, MLE aims to find the parameter values that make the observed data most probable.
In other words, MLE seeks to answer the question: What parameter values would make the observed data most probable according to the assumed model?
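As a quick, self-contained illustration (not yet tied to RL), consider estimating the bias of a coin from observed flips. The closed-form MLE is simply the fraction of heads, and the sketch below recovers it numerically by minimizing the negative log-likelihood with PyTorch; the flip data and learning rate are assumptions made purely for demonstration.

import torch

# Synthetic observations: 1 = heads, 0 = tails (made-up data; the MLE is flips.mean() = 0.7).
flips = torch.tensor([1., 1., 0., 1., 0., 1., 1., 0., 1., 1.])

theta = torch.tensor(0.5, requires_grad=True)  # initial guess for P(heads)
optimizer = torch.optim.SGD([theta], lr=0.01)

for _ in range(500):
    optimizer.zero_grad()
    # Negative log-likelihood of the observed flips under a Bernoulli(theta) model.
    nll = -(flips * torch.log(theta) + (1 - flips) * torch.log(1 - theta)).sum()
    nll.backward()
    optimizer.step()

print(theta.item())  # approaches the sample mean, i.e. roughly 0.7 for this data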
Now, assuming that we have a mathematical (probabilistic) model $p_{\theta}$, our aim is to find the parameters under which the observed data is most likely. We can define this as:

$$\theta^{*} = \arg\max_{\theta} \sum_{i} \log p_{\theta}(x_i) \quad (1)$$
Since we are dealing with a neural network, we replace the probabilistic model with a network with learnable parameters $\theta$. This network takes states as input and yields action probabilities as output. When we rewrite equation (1) in terms of this policy network we get:

$$\theta^{*} = \arg\max_{\theta} \sum_{t} \log \pi_{\theta}(a_t \mid s_t) \quad (2)$$
Let’s assume that the policy network is a linear network and we want to find optimal parameters for each state. Then we can use the following code:
# PolicyNetwork maps a state to a probability distribution over actions.
policy_net = PolicyNetwork(input_dim, output_dim)
# One-hot encode the current state and convert it to a float tensor.
state_tensor = torch.tensor(state_one_hot, dtype=torch.float32)
# Forward pass: probabilities over the available actions.
action_probs = policy_net(state_tensor)
# Sample an action from the distribution rather than taking the argmax.
action = torch.multinomial(action_probs, 1).item()
observation, reward, terminated, truncated, info = env.step(action)
# Negative log-likelihood of the chosen action serves as the loss (cf. equation (2)).
loss = -torch.log(action_probs[action])
loss.backward()
Keep in mind that the sampling done by torch.multinomial(action_probs, 1).item() is essential. When you have a stochastic policy (a policy that outputs a probability distribution over actions), you often want to sample from that distribution rather than just taking the action with the highest probability (i.e., using argmax).
But, if you were working with a deterministic policy or in a scenario where you specifically want to use the action with the highest probability, then argmax would be suitable. In practice, the choice between argmax and sampling depends on the problem and the desired behavior of the agent.
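As a minimal sketch of the two options, assuming action_probs is the softmax output of the policy network from the snippet above:

# Stochastic policy: sample an action from the probability distribution (encourages exploration).
sampled_action = torch.multinomial(action_probs, 1).item()

# Deterministic choice: always pick the most probable action (pure exploitation).
greedy_action = torch.argmax(action_probs).item()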
Policy
The agent's objective is to learn a policy $\pi(a \mid s)$, which maps states s to actions in a manner that maximizes the expected cumulative reward. A reward is a scalar signal that indicates how good a certain action is in a given state. In an iterative process, these rewards are stored and used to compute the expected cumulative reward.
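For example, here is a minimal sketch of turning a list of stored rewards into a discounted cumulative return; the reward values and the discount factor gamma are illustrative assumptions.

# Rewards collected over one episode (illustrative values).
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
gamma = 0.99  # discount factor

# Discounted cumulative return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
cumulative_reward = sum((gamma ** t) * r for t, r in enumerate(rewards))
print(cumulative_reward)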
Now, it is fair to treat the policy as a learnable function, which is nothing but a neural network: it takes the state s as input and yields a distribution over probable actions a as output. So essentially, a policy network (or policy function) can be implemented as a neural network. For the sake of explanation, let it be a single linear layer.
Policy = nn.Linear(state_dim, action_dim)
This linear layer yields raw, unnormalized scores (logits) rather than a probability distribution, and using them directly makes the agent effectively deterministic: the network becomes rigid and inflexible, and it does not encourage the agent to explore different possibilities or learn the patterns behind a given state. To estimate anything probabilistically, we need a proper probability distribution over actions. Such a distribution encourages the agent to learn patterns and representations and to explore, and exploration is the key mechanism that lets the agent try different actions in search of the optimal reward.
One of the ways in which we ask the policy network to explore and learn patterns and representations is to feed the linear output to the Softmax function.
# Apply the linear layer to the state, then normalize the logits with softmax.
action_prob = torch.softmax(nn.Linear(state_dim, action_dim)(state_tensor), dim=-1)
Or,
action_prob = torch.softmax(Policy(state_tensor), dim=-1)
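Tying these pieces together, one possible sketch of the PolicyNetwork used earlier is a single linear layer followed by a softmax. This is only an assumption about its internals; any architecture that ends in a probability distribution over actions would do.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.linear = nn.Linear(state_dim, action_dim)

    def forward(self, state):
        # Turn the raw logits from the linear layer into a probability distribution over actions.
        logits = self.linear(state)
        return torch.softmax(logits, dim=-1)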
Now that we have dealt with how to construct the policy network, we need to understand how to optimize it using the maximum-likelihood objective from equations (1) and (2) above.
# Log-probability of the action that was actually taken.
log_prob = torch.log(action_prob[action])
log_prob_list.append(log_prob)
# reward is the list of rewards collected over the episode.
policy_loss = -torch.stack(log_prob_list).sum() * torch.stack(reward).sum()
policy_loss.backward()
Action-value function
The action-value function Q(s, a) estimates the expected cumulative reward of taking action a in state s and following the policy thereafter. Bear in mind that Q-values can be calculated in various ways: one of the most popular methods is a Q-table, and you can also use a learnable neural network instead.
But in any case, it is essential to expand the definition of the Q-function recursively. Upon doing so we get the Bellman equation for action-values under a policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \,\right] \quad (6)$$
Now, there is another Bellman equation for the Q-function, written in terms of the optimal policy:

$$Q^{*}(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\right] \quad (7)$$
For convenience, let me label equation (6) as Bellman Equation for Action-Values with Policy π and equation (7) as Bellman Equation for Action-Values with Optimal Policy.
Now, the question arises: which one should we choose? For learning from experience, the optimal-policy form (7) is turned into an iterative update rule, commonly known as the Q-learning update:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right] \quad (8)$$

where $\alpha$ is the learning rate.
Equation (8) allows us to iteratively update the Q-values toward the optimal value estimates.
We can code equation (8) as follows:
# Temporal-difference error from equation (8).
delta = reward + gamma * torch.max(qtable[new_state, :]) - qtable[state, action]
# Updated Q-value for the (state, action) pair.
q_update = qtable[state, action] + learning_rate * delta
As you can see from the snippet above, the term qtable recurs quite often. That is, as the name suggests, a table that stores all the Q-values we get when solving equation (8). It is essentially a lookup table that helps an RL agent make decisions by storing and updating the expected cumulative rewards (Q-values) for each possible combination of state and action in the environment.
We initialize the qtable as a matrix of zeros and update it at each iteration. It is important to remember that the table must be of shape (state_size, action_size).
qtable = torch.zeros((state_size, action_size))
We can update the Q-table as shown below.
# Reset the environment; index 0 of the returned tuple is the initial observation.
state = env.reset(seed=params.seed)[0]
# Forward pass (state is assumed to be a tensor here, e.g. one-hot encoded as earlier).
action_prob = net(state.float().unsqueeze(0))
# Sample an action from the predicted distribution.
action = action_prob.multinomial(1)
# learner.update applies the update rule from equation (8) and returns the new Q-value.
qtable[state, action] = learner.update(state, action, reward, new_state)
If all Q-values are the same for this state, we choose a random action; otherwise we take the action with the maximum Q-value using torch.argmax().
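A minimal sketch of that selection rule, assuming qtable holds the current Q-value estimates and env is the FrozenLake environment from earlier:

q_values = qtable[state, :]
if torch.all(q_values == q_values[0]):
    # All Q-values are identical (e.g. an unvisited state): explore with a random action.
    action = env.action_space.sample()
else:
    # Otherwise exploit: take the action with the highest Q-value.
    action = torch.argmax(q_values).item()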
We then compute the policy loss: we sum the log probabilities and the Q-values over the time steps, take the product of the two sums, and negate it. Thereafter we perform backpropagation.
policy_loss = -torch.stack(log_prob_list).sum() * torch.stack(qas_list).sum()
policy_loss.backward(retain_graph=True)
Importance of the Bellman equation: it recursively decomposes the value of a state-action pair into the immediate reward plus the discounted value of the next state, which is exactly what makes iterative updates such as equation (8) possible.
Results
Below are the results that this simple network produced. Essentially, the last frame of each run is shown, which depicts the steps the agent took to reach the goal.
To make the task more challenging, the map size is increased: 4x4, 7x7, 9x9, and 11x11 (a sketch of how such maps can be generated follows the result frames below).
4x4
7x7
9x9
11x11
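One way to generate these larger maps is Gymnasium's built-in random map generator. The sketch below is only illustrative and assumes the training loop from the previous sections is reused unchanged:

from gymnasium.envs.toy_text.frozen_lake import generate_random_map

for map_size in (4, 7, 9, 11):
    # Build a random FrozenLake layout of the requested size and train on it.
    env = gym.make('FrozenLake-v1', desc=generate_random_map(size=map_size), is_slippery=True)
    state, info = env.reset()
    # ... run the training loop from the previous sections on this environment ...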
All branches of AI leverage mathematical formulations that allow them to generate promising results. Although the example we saw is simple, these principles, when scaled up, can produce agents such as self-driving systems, LLM-based assistants, and so on.
The foundation of reinforcement learning starts with the MDP, which leverages transition probabilities. But efficiency and effectiveness come from applying other core concepts such as policies, action-value functions, the Bellman equation, and the balance between exploration and exploitation.
My aim is to explore these concepts further and build toward a more sophisticated system. In the coming weeks, I will introduce new concepts and show their importance and applications in real-world scenarios.