BxD Primer Series: Deep Q-Network (DQN) Reinforcement Learning Models

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a 'one-post-one-topic' format. Today's post is on Deep Q-Network Reinforcement Learning Models. Let's get started:

The What:

Deep Q-Network (DQN) combines deep neural networks with the Q-learning algorithm to approximate the optimal action-value (Q) function of an agent in a given environment. Check our edition on Q-Learning here.


Since we have already covered Q-Learning and don't want to go too heavy on neural networks in this edition (there will be ~30 editions on neural networks themselves later), we will focus on the improvements over Q-Learning that make it a Deep Q-Network.


Basic Architecture:

Both Q-Learning and DQN learn to approximate the optimal Q-value function, but their inputs and outputs differ. Q-Learning requires both the current state and the action taken in that state to look up the Q-value for that state-action pair. In contrast, DQN only requires the current state as input and predicts the Q-value for each possible action.

This is because in DQN, the neural network learns to approximate the Q-value function directly, given only the current state as input. The Q-value for each action is then obtained by simply reading the corresponding entry of the network's output. This makes DQN more computationally efficient, since a single forward pass yields the Q-values for all actions at once.
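To make the contrast concrete, here is a minimal sketch in Python/NumPy. The sizes, the toy linear "network", and all variable names are illustrative assumptions, not something from this post:

import numpy as np

n_states, n_actions = 5, 3

# Q-Learning: Q-values live in a table, indexed by BOTH state and action.
q_table = np.zeros((n_states, n_actions))
q_sa = q_table[2, 1]                       # one lookup per (state, action) pair

# DQN: a function of the state alone returns Q-values for ALL actions at once.
def q_network(state_vec):                  # toy stand-in for a neural network
    theta = np.ones((n_actions, state_vec.size))
    return theta @ state_vec               # vector with one Q-value per action

q_all = q_network(np.array([0.2, -0.1, 0.7, 0.0]))
greedy_action = int(np.argmax(q_all))      # best action from a single forward pass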

The How:

Let's consider a DQN with one hidden layer, which is one of the simplest architectures and enough to illustrate the idea.

Let s be the state representation, which is a vector of dimensionality d, and a be the action taken in the current state.

Let Q(s, a; θ) be the Q-value function, which is parameterized by the weights θ of the neural network.

→ DQN takes the state s as input and passes it through a fully connected layer with n neurons.

Let W be the weight matrix of this layer, and b be the bias vector. The output of this layer is the hidden layer activation h, which can be written as:

h = relu(Ws + b)

where relu is the rectified linear activation function.

→ The output layer of DQN takes the hidden layer activation h as input and passes it through another fully connected layer with m neurons, where m is the number of possible actions in the environment.

Let U be the weight matrix of this layer, and c be the bias vector. The output of this layer is the Q-value vector Q(s, a; θ), which can be written as:

Q(s, a; θ) = U h + c

where a is a one-hot vector that represents the selected action, i.e., a[i] = 1 if action i is selected and a[j] = 0 for all j ≠ i.
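Putting the two layers together, a bare-bones forward pass might look like the following NumPy sketch. The dimensions d, n, m, the random initialization, and the example state are assumed toy values:

import numpy as np

rng = np.random.default_rng(0)
d, n, m = 4, 16, 2                         # state dim, hidden neurons, number of actions (toy values)

W, b = 0.1 * rng.normal(size=(n, d)), np.zeros(n)   # hidden layer parameters
U, c = 0.1 * rng.normal(size=(m, n)), np.zeros(m)   # output layer parameters

def forward(s):
    h = np.maximum(0.0, W @ s + b)         # h = relu(Ws + b)
    return U @ h + c                       # Uh + c: one Q-value per action

s = rng.normal(size=d)                     # example state
q = forward(s)                             # Q-value vector over all m actions

a = np.array([0.0, 1.0])                   # one-hot selected action (action index 1)
q_sa = float(a @ q)                        # Q(s, a; θ) for the chosen action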

→ During training, DQN uses the Bellman equation to update Q-values based on the observed rewards and next states.

Let r be the reward for the current state-action pair, and s' be the next state. The target Q-value for the current state-action pair is computed as:

target = r + gamma * max_a' Q_target(s', a'; θ_target)

Where,

  • gamma is the discount factor
  • θ_target are the weights of the target network, which is a copy of the DQN used to generate target Q-values.
  • The max operation is taken over all possible actions a' in the next state s'.

→ DQN then uses an optimization technique (typically gradient descent) to minimize the mean squared error between the predicted Q-value and the target Q-value. The loss function can be written as:

loss = (Q(s, a; θ) - target)^2

→ DQN updates the weights θ of the network by backpropagating the gradients of the loss with respect to the weights and biases, using the chain rule. The target network is updated periodically to keep it in sync with the current network.
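The pieces above (forward pass, Bellman target, MSE loss, backpropagation, target-network sync) fit together in a single training step roughly like the PyTorch sketch below. PyTorch is not prescribed by this post; the class and variable names, batch size, learning rate, and the random toy batch are all assumptions made purely for illustration:

import copy
import torch
import torch.nn as nn

class QNet(nn.Module):
    # One hidden layer, matching the architecture described above.
    def __init__(self, d, n, m):
        super().__init__()
        self.hidden = nn.Linear(d, n)   # weights W, bias b
        self.out = nn.Linear(n, m)      # weights U, bias c
    def forward(self, s):
        return self.out(torch.relu(self.hidden(s)))   # Q(s, ·; θ), one value per action

d, n, m, gamma = 4, 16, 2, 0.99
q_net = QNet(d, n, m)
target_net = copy.deepcopy(q_net)                      # θ_target starts as a copy of θ
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One training step on a toy random batch of transitions (s, a, r, s', done).
batch = 32
s, a = torch.randn(batch, d), torch.randint(0, m, (batch,))
r, s2 = torch.randn(batch), torch.randn(batch, d)
done = torch.zeros(batch)                              # 1.0 where the episode ended (extra detail, not covered above)

with torch.no_grad():                                  # targets are held fixed
    target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values

pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; θ) for the actions taken
loss = nn.functional.mse_loss(pred, target)            # mean of (Q(s, a; θ) - target)^2

optimizer.zero_grad()
loss.backward()                                        # backpropagate the loss through θ
optimizer.step()

# Every C steps (not every step), sync the target network with the online network.
target_net.load_state_dict(q_net.state_dict())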


Function Approximation:

The key difference between Q-Learning and DQNs is the representation of the Q-value function. In Q-Learning, the Q-value function is represented as a table, whereas in DQNs, it is represented as a neural network.

→ The Q-Learning update rule for a state-action pair (s, a) is given by (a minimal code sketch follows the list below):

Q(s, a) ← Q(s, a) + alpha * (r + gamma * max(Q(s', a')) - Q(s, a))

Where,

  • Q(s, a) is the Q-value for state-action pair (s, a)
  • alpha is the learning rate
  • r is the reward for taking action a in state s
  • gamma is the discount factor
  • s' is the next state
  • a' is the action that maximizes the Q-value for the next state s'.
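As referenced above, the tabular update is a one-liner once the Q-table exists. This minimal Python/NumPy sketch uses made-up toy values for the transition and hyperparameters:

import numpy as np

n_states, n_actions = 6, 3
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

# One tabular Q-Learning update for an observed transition (s, a, r, s').
s, a, r, s_next = 0, 2, 1.0, 4
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])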

→ In contrast, the DQN update rule for a state-action pair (s, a) is given by:

Q(s, a) = r + gamma * max_a' Q_target(s', a'; θ_target)

where Q_target(s', a'; θ_target) is the Q-value for the next state s' and action a', as predicted by the target neural network with weights θ_target. The target network is updated periodically to keep it in sync with the current network.

Experience Replay:

Experience replay involves storing a large number of experiences (i.e., state-action pairs together with their rewards and next states) in a replay buffer and randomly sampling from this buffer to train the network. This helps to reduce the correlation between successive training examples.

For example, in robotic control tasks, the agent generates a sequence of state-action pairs as it interacts with the environment. These sequences can be highly correlated, as small changes in state can lead to similar actions being taken. This causes overfitting and instability in learning process.

By using experience replay, the agent can break the correlation between successive training examples and learn more efficiently.
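A replay buffer can be as simple as the Python sketch below; the capacity and batch size are arbitrary illustrative defaults, not values from this post:

import random
from collections import deque

class ReplayBuffer:
    # Fixed-size store of (state, action, reward, next_state, done) transitions.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions collected from the environment.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)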

Target Network:

Since the Q-values used as targets to train the DQN change with each training iteration as the network's weights are updated, the targets used to train the network are not fixed. This can lead to instability, sometimes described as "catastrophic forgetting", where the network forgets what it has learned in previous iterations and oscillates between different sets of weights.

To address this issue, the DQN algorithm uses a separate target network, which is a copy of the neural network used to approximate the Q-value function. The Q-values used to train the main network are generated by the target network and are kept fixed for a certain number of iterations. This stabilizes the training process and helps to prevent overestimation of Q-values.
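In code, keeping the target network in sync usually comes down to one of two small helpers like the PyTorch sketch below (continuing the earlier illustrative sketch). The "soft" Polyak update is a common variant in the literature, not something described in this post:

import torch

def hard_update(target_net: torch.nn.Module, q_net: torch.nn.Module) -> None:
    # Copy the online network's weights into the target network every C steps.
    target_net.load_state_dict(q_net.state_dict())

def soft_update(target_net: torch.nn.Module, q_net: torch.nn.Module, tau: float = 0.005) -> None:
    # Polyak averaging: blend the weights a little at every step instead of copying.
    with torch.no_grad():
        for tp, p in zip(target_net.parameters(), q_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)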

Note: A closely related idea is transfer learning, where a network is pre-trained on a large dataset and then fine-tuned on a smaller dataset specific to the task at hand. This allows the model to leverage the pre-trained knowledge and improve performance on the new task. We will cover transfer learning in dedicated editions.

Exploration Strategy:

While both Q-Learning and DQNs use an epsilon-greedy exploration strategy to balance exploration and exploitation, DQNs can use more sophisticated exploration strategies, such as Boltzmann exploration, to encourage exploration of less certain actions.

Rather than choosing the action with maximum Q-value (as in epsilon-greedy exploration), Boltzmann exploration samples an action from a probability distribution that is proportional to the exponential of Q-value divided by a temperature parameter.

The probability of selecting action a at state s using Boltzmann exploration can be written as:

P(a | s) = exp(Q(s, a) / τ) / Σ_a' exp(Q(s, a') / τ)

Where,

  • Q(s, a) is the Q-value of action a in state s
  • τ is the temperature parameter
  • Σ is the sum over all possible actions a' in state s.

The temperature parameter controls the degree of exploration:

  • When temperature is high, the probability distribution over actions is more uniform, which encourages exploration of less certain actions.
  • As temperature decreases, the probability distribution becomes more peaked around the action with the highest Q-value, which encourages exploitation of the learned policy.
  • It can be annealed (gradually decreased) over the course of training to shift the agent's behavior from exploration to exploitation as it gains more experience.

Boltzmann exploration encourages more systematic exploration of action space when there are many actions with similar Q-values.
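A short sketch of Boltzmann action selection in Python/NumPy (the function name and the example Q-values are made up; subtracting the max is a standard numerical-stability trick, not part of the formula above):

import numpy as np

def boltzmann_action(q_values, tau=1.0, rng=np.random.default_rng(0)):
    # Sample an action with probability proportional to exp(Q(s, a) / tau).
    z = q_values / tau
    z = z - z.max()                        # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()    # softmax over actions
    return int(rng.choice(len(q_values), p=probs))

q = np.array([1.0, 1.1, 0.2])
boltzmann_action(q, tau=5.0)     # high temperature: near-uniform, favors exploration
boltzmann_action(q, tau=0.05)    # low temperature: almost always the argmax action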

Non-Deterministic Environments:

Q-Learning is often presented assuming a deterministic environment, meaning that the same action taken in the same state will always result in the same next state and reward. However, many real-world scenarios are non-deterministic, meaning that the same action may lead to different outcomes.

  • In robotics, motion of the robot is affected by factors such as sensor noise, actuator errors, and unpredictable disturbances in the environment. As a result, the same control actions applied in the same initial state can lead to different outcomes.
  • In autonomous driving, the behavior of other vehicles and pedestrians is unpredictable, and the sensors may not always provide accurate information. This makes it challenging to plan a safe and efficient trajectory for the vehicle, hence more expressive models such as DQNs are used.

DQNs can handle non-deterministic environments by learning a stochastic policy that maps from states to a distribution over actions.

Continuous Action Spaces:

Continuous action spaces refer to environments in which an agent must select an action from a continuous range of possible values, rather than a discrete set of options. This can be the case in many real-world scenarios, such as controlling throttle of a vehicle or pitch and roll of a drone.

In traditional Q-Learning, the Q-value function is represented as a lookup table that maps state-action pairs to Q-values. However, with a continuous action space, the number of possible actions can be infinite, making it impractical to store a separate Q-value for every action.

To handle continuous action spaces, there are multiple approaches. One is to use a separate neural network, called a policy network, that maps states directly to actions. The output of the policy network is either an action itself or a probability distribution over possible actions, from which a specific action is selected. This approach underlies actor-critic methods such as Deep Deterministic Policy Gradient (DDPG). (Double DQN, by contrast, is a modification of DQN that reduces overestimation of Q-values in discrete action spaces.)
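As a hedged illustration of the policy-network idea (not code from this post), a DDPG-style deterministic actor in PyTorch might look like the sketch below; the layer sizes and scaling are assumptions, and a stochastic policy would instead output the parameters of a distribution over actions:

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Maps a state directly to a continuous action (actor sketch; sizes are assumed).
    def __init__(self, state_dim=8, action_dim=2, hidden=64, action_scale=1.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # squash outputs to [-1, 1]
        )
        self.action_scale = action_scale                # rescale to the environment's range

    def forward(self, state):
        return self.action_scale * self.body(state)

actor = PolicyNet()
action = actor(torch.randn(1, 8))   # e.g. continuous throttle and steering, no discretization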


The Why:

Reasons to use DQNs:

  1. Better performance: DQNs have been shown to outperform traditional Q-Learning methods in environments with high-dimensional state spaces and large action spaces.
  2. Efficient training: DQNs use function approximation to learn the Q-value function, which allows for better generalization across similar states.
  3. Improved exploration: DQNs can use exploration policies such as Boltzmann exploration to balance exploitation and exploration, allowing the agent to learn more about the environment and discover optimal policies faster.
  4. Transfer learning: DQNs can be pre-trained on similar environments and then fine-tuned on new environments, allowing for faster learning.
  5. Wide range of applications: DQNs are used in domains from game playing to robotics to finance, and have achieved state-of-the-art results in many of them.

The Why Not:

Reasons to not use DQNs:

  1. Complex implementation: DQNs require careful design of the neural network architecture and tuning of hyperparameters, which takes experience to get right.
  2. Large data requirements: DQNs need a lot of interaction data to train effectively, which can be a challenge in domains where data is expensive or difficult to obtain.
  3. DQNs are susceptible to overfitting, which can occur when the model becomes too complex and starts to fit noise in the data rather than the underlying patterns.
  4. DQNs are considered a black-box approach to reinforcement learning, meaning that it is difficult to understand how the model makes decisions and which factors are most important in determining its behavior.

Time for you to support:

  1. Reply to this article with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In the coming posts, we will cover two more Reinforcement Learning models: Genetic Algorithm and Multi-Agent.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #Deep #QNetwork #DQN #Reinforcement #Learning #primer

