BxD Primer Series: Deep Q-Network (DQN) Reinforcement Learning Models
Hey there!
Welcome to the BxD Primer Series, where we are covering topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Deep Q-Network Reinforcement Learning Models. Let’s get started:
The What:
Deep Q-Network (DQN) combines deep neural networks with the Q-learning algorithm to approximate the optimal action-value (Q-value) function of an agent in a given environment. Check our edition on Q-Learning here.
Since we have already covered Q-Learning and don’t want to go too heavy on neural networks in this edition (later there will be ~30 editions on neural networks themselves), we will discuss the improvements to Q-Learning that make it a Deep Q-Network.
Basic Architecture:
Both Q-Learning and DQN learn to approximate the optimal Q-value function. As you can observe from the above diagrams, Q-Learning requires both the current state and the action taken in that state as inputs to calculate the Q-value for that state-action pair. In contrast, DQN only requires the current state as input to predict the Q-value for each possible action.
This is because in DQN, the neural network learns to approximate the Q-value function directly, given only the current state as input. The Q-value for each action is then obtained by simply reading off the corresponding output of the neural network. This makes DQN more computationally efficient than Q-Learning, since it doesn't require the additional step of computing the Q-value separately for each possible action in each state.
The How:
Let's consider a DQN with one hidden layer, the simplest version of the architecture (practical DQNs often use more layers, but the mechanics are the same).
Let s be the state representation, which is a vector of dimensionality d, and a be the action taken in the current state.
Let Q(s, a; θ) be the Q-value function, which is parameterized by the weights θ of the neural network.
• DQN takes the state s as input and passes it through a fully connected layer with n neurons.
Let W be the weight matrix of this layer, and b be the bias vector. The output of this layer is the hidden layer activation h, which can be written as:
h = relu(Ws + b)
where relu is the rectified linear activation function.
• The output layer of DQN takes the hidden layer activation h as input and passes it through another fully connected layer with m neurons, where m is the number of possible actions in the environment.
Let U be the weight matrix of this layer, and c be the bias vector. The output of this layer is the Q-value vector Q(s, a; θ), which can be written as:
Q(s, a; θ) = U h + c
where a is a one-hot vector that represents the selected action, i.e., a[i] = 1 if action i is selected and a[j] = 0 for all j ≠ i.
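To make these two layers concrete, here is a minimal sketch of such a network in PyTorch. It is only an illustration of the equations above: the class name QNetwork and the dimensions (state_dim=4, n_hidden=64, n_actions=2) are assumed for the example, not prescribed by DQN itself.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """One-hidden-layer DQN: h = relu(W s + b), Q(s, .; θ) = U h + c."""
    def __init__(self, state_dim, n_hidden, n_actions):
        super().__init__()
        self.hidden = nn.Linear(state_dim, n_hidden)  # weights W, bias b
        self.out = nn.Linear(n_hidden, n_actions)     # weights U, bias c

    def forward(self, s):
        h = torch.relu(self.hidden(s))  # hidden activation h
        return self.out(h)              # one Q-value per action

# Usage: evaluate Q-values for a single state and pick the greedy action
q_net = QNetwork(state_dim=4, n_hidden=64, n_actions=2)
state = torch.randn(4)            # placeholder state vector
q_values = q_net(state)           # Q(s, a; θ) for every action a
greedy_action = int(torch.argmax(q_values))

Note that a single forward pass produces the Q-values of all m actions at once, which is exactly the efficiency point made earlier.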
• During training, DQN uses the Bellman equation to update Q-values based on the observed rewards and next states.
Let r be the reward for the current state-action pair, and s' be the next state. The target Q-value for the current state-action pair is computed as:
target = r + gamma * max_a' Q_target(s', a'; θ_target)
Where,
gamma is the discount factor that weighs future rewards, and Q_target(s', a'; θ_target) is the Q-value predicted by a separate target network with weights θ_target, maximized over all possible next actions a'.
• DQN then uses an optimization technique (typically a variant of stochastic gradient descent) to minimize the mean squared error between the predicted Q-value and the target Q-value. The loss function can be written as:
loss = (Q(s, a; θ) - target)^2
• DQN updates the weights θ of the network by backpropagating the gradients of the loss with respect to the weights and biases, using the chain rule. The target network is updated periodically to keep it in sync with the current network.
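Putting the pieces together, below is a minimal sketch of one training update, reusing the QNetwork from the earlier sketch. The function name dqn_update, the single-transition input, and the default gamma=0.99 are assumptions made for illustration; real implementations usually operate on mini-batches sampled from the replay buffer described below.

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, transition, gamma=0.99):
    """One DQN update from a single (s, a, r, s', done) transition."""
    state, action, reward, next_state, done = transition

    # Predicted Q-value of the action actually taken: Q(s, a; θ)
    q_pred = q_net(state)[action]

    # Target: r + gamma * max_a' Q_target(s', a'; θ_target).
    # No gradient flows through the target network.
    with torch.no_grad():
        q_next = target_net(next_state).max()
        target = reward + gamma * q_next * (1.0 - done)

    # Mean squared error between prediction and target
    loss = F.mse_loss(q_pred, target)

    optimizer.zero_grad()
    loss.backward()    # backpropagate gradients of the loss w.r.t. θ
    optimizer.step()
    return loss.item()

Here done is 1.0 when s' is terminal (so no future reward is bootstrapped) and 0.0 otherwise.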
Function Approximation:
The key difference between Q-Learning and DQNs is the representation of the Q-value function. In Q-Learning, the Q-value function is represented as a table, whereas in DQNs it is represented as a neural network.
• The Q-Learning update rule for updating the Q-value of a state-action pair (s, a) is given by:
Q(s, a) ← Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
Where,
alpha is the learning rate, gamma is the discount factor, r is the observed reward, and s' is the next state.
• In contrast, the DQN update rule for updating the Q-value of a state-action pair (s, a) is given by:
Q(s, a) = r + gamma * max_a' Q_target(s', a'; θ_target)
where Q_target(s', a'; θ_target) is the Q-value for the next state s' and action a', as predicted by the target neural network with weights θ_target. The target network is updated periodically to keep it in sync with the current network.
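For contrast with the network-based update above, a tabular Q-Learning update is just an in-place edit of one cell of a Q-table. A minimal sketch, where the dictionary-based table and the variable names are assumptions made for illustration:

from collections import defaultdict

Q = defaultdict(float)  # Q-table: maps (state, action) pairs to Q-values, default 0.0

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Tabular rule: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage with toy integer states and two actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])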
Experience Replay:
Experience replay involves storing a large number of experiences (i.e., state-action pairs and their associated rewards) in a replay buffer and randomly sampling from this buffer to train the network. This helps to reduce the “correlation between successive training examples”.
For example, in robotic control tasks, the agent generates a sequence of state-action pairs as it interacts with the environment. These sequences can be highly correlated, since small changes in state tend to lead to similar actions being taken. This causes overfitting and instability in the learning process.
By using experience replay, the agent can break the correlation between successive training examples and learn more efficiently.
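A minimal sketch of such a replay buffer, assuming transitions are stored as (state, action, reward, next_state, done) tuples; the class name and capacity are illustrative choices, not fixed by the algorithm:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions; old experiences are evicted first."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions collected by the agent.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)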
Target Network:
Since the Q-values are used as targets to train the DQN neural network, they change with each iteration of the training process as the weights of the network are updated. This means that the targets used to train the network are not fixed. This can lead to a phenomenon called "catastrophic forgetting", where the network forgets what it has learned in previous iterations and oscillates between different sets of weights.
To address this issue, the DQN algorithm uses a separate target network, which is a copy of the neural network used to approximate the Q-value function. The Q-values used to train the main network are generated by the target network and are held fixed for a certain number of iterations. This provides stability to the training process and helps to prevent overestimation of Q-values.
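A minimal sketch of how the two networks are kept in sync, using the q_net and target_net names from the earlier sketches. The periodic hard copy is the classic DQN scheme; the soft (Polyak) update shown as an option is a common later variation, not part of the original algorithm.

def sync_target(q_net, target_net, tau=None):
    """Copy main-network weights into the target network.

    tau=None        -> hard update: exact copy, done every C training steps.
    tau=0.005 (say) -> soft (Polyak) update: θ_target ← tau*θ + (1-tau)*θ_target.
    """
    if tau is None:
        target_net.load_state_dict(q_net.state_dict())
    else:
        for p_t, p in zip(target_net.parameters(), q_net.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)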
Note: Target networks are also used to perform transfer learning. A target network is pre-trained on a large dataset and then fine-tuned on a smaller dataset specific to the task at hand. This allows the model to leverage the pre-trained knowledge and improve performance on the new task. We will cover transfer learning in dedicated editions.
Exploration Strategy:
While both Q-Learning and DQNs use an epsilon-greedy exploration strategy to balance exploration and exploitation, DQNs can use more sophisticated exploration strategies, such as Boltzmann exploration, to encourage exploration of less certain actions.
Rather than choosing the action with maximum Q-value (as in epsilon-greedy exploration), Boltzmann exploration samples an action from a probability distribution that is proportional to the exponential of Q-value divided by a temperature parameter.
The probability of selecting action a at state s using Boltzmann exploration can be written as:
P(a | s) = exp(Q(s, a) / T) / Σ_a' exp(Q(s, a') / T)
Where,
T is the temperature parameter that controls the degree of exploration: a high temperature makes the action probabilities more uniform (more exploration), while a low temperature concentrates probability on the actions with the highest Q-values (more exploitation).
Boltzmann exploration encourages more systematic exploration of action space when there are many actions with similar Q-values.
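A minimal sketch of Boltzmann action selection in PyTorch, reusing the Q-values produced by the QNetwork sketched earlier; the function name and the example temperature are illustrative:

import torch

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    probs = torch.softmax(q_values / temperature, dim=-1)  # Boltzmann distribution
    return int(torch.multinomial(probs, num_samples=1))

# Usage: actions with similar Q-values get similar selection probabilities
action = boltzmann_action(torch.tensor([1.2, 1.1, -0.5]), temperature=0.5)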
Non-Deterministic Environments:
Tabular Q-Learning is easiest to reason about in deterministic environments, where the same action taken in the same state always results in the same next state and reward. However, many real-world scenarios are non-deterministic, meaning that the same action may lead to different outcomes.
DQNs can handle non-deterministic environments: because the network is trained on many sampled transitions, its Q-value estimates average over the randomness in rewards and next states, and a stochastic exploration strategy such as Boltzmann exploration maps states to a distribution over actions rather than a single fixed choice.
Continuous Action Spaces:
Continuous action spaces refer to environments in which an agent must select an action from a continuous range of possible values, rather than a discrete set of options. This can be the case in many real-world scenarios, such as controlling throttle of a vehicle or pitch and roll of a drone.
In traditional Q-Learning, the Q-value function is represented as a lookup table that maps state-action pairs to Q-values. However, with a continuous action space, the number of possible actions can be infinite, making it impractical to store a separate Q-value for every action.
To handle continuous action spaces, there are multiple approaches. One is to use a separate neural network, called a policy network, that maps states directly to actions. The output of the policy network is a probability distribution over possible actions, from which a specific action is sampled. This idea underlies policy-gradient and actor-critic methods (such as Deep Deterministic Policy Gradient, DDPG); Double DQN, by contrast, is a technique for reducing overestimation of Q-values and does not itself handle continuous actions.
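For illustration, here is a minimal sketch of such a policy network that outputs a Gaussian distribution over a continuous action; the class name, layer sizes, and the Gaussian parameterization are assumptions made for the example, not the only way to do it:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps a state to a Gaussian distribution over a continuous action."""
    def __init__(self, state_dim, action_dim, n_hidden=64):
        super().__init__()
        self.hidden = nn.Linear(state_dim, n_hidden)
        self.mean = nn.Linear(n_hidden, action_dim)           # centre of the action distribution
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned spread

    def forward(self, s):
        h = torch.relu(self.hidden(s))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

# Usage: sample a throttle value in a continuous range for a given state
policy = GaussianPolicy(state_dim=4, action_dim=1)
action = policy(torch.randn(4)).sample()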
The Why:
Reasons to use DQNs:
The Why Not:
Reasons not to use DQNs:
Time for you to support:
In the coming posts, we will cover two more Reinforcement Learning models: Genetic Algorithm and Multi-Agent.
Let us know your feedback!
Until then,
Have a great time!