BxD Primer Series: Deep Q-Network (DQN) Reinforcement Learning Models
Hey there!
Welcome to the BxD Primer Series, where we are covering topics such as Machine Learning models, Neural Nets, GPT, Ensemble models, and Hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Deep Q-Network Reinforcement Learning Models. Let’s get started:
The What:
Deep Q-Network (DQN) combines deep neural networks with the Q-learning algorithm to approximate the optimal action-value (Q-value) function of an agent in a given environment. Check our edition on Q-Learning here.
Since we have already covered Q-Learning and don’t want to go too heavy on neural networks in this edition (later there will be ~30 editions on neural networks themselves), we will discuss the improvements to Q-Learning that make it a Deep Q-Network.
Basic Architecture:
Both Q-Learning and DQN learn to approximate the optimal Q-value function. As you can observe from the above diagrams, Q-Learning requires both the current state and the action taken in that state as inputs to calculate the Q-value for that state-action pair. In contrast, DQN only requires the current state as input to predict the Q-value for each possible action.
This is because in DQN, the neural network learns to approximate the Q-value function directly, given only the current state as input. The Q-value for each action is then obtained by simply reading off the corresponding output of the neural network. This makes DQN more computationally efficient than Q-Learning, since it doesn't require the additional step of computing the Q-value separately for each possible action in each state.
The How:
Let's consider a DQN with one hidden layer, the simplest version of the architecture (practical DQNs often use more layers, but the mechanics are the same).
Let s be the state representation, which is a vector of dimensionality d, and a be the action taken in the current state.
Let Q(s, a; θ) be the Q-value function, which is parameterized by the weights θ of the neural network.
• DQN takes the state s as input and passes it through a fully connected layer with n neurons.
Let W be the weight matrix of this layer, and b be the bias vector. The output of this layer is the hidden layer activation h, which can be written as:
h = relu(Ws + b)
where relu is the rectified linear activation function.
• The output layer of DQN takes the hidden layer activation h as input and passes it through another fully connected layer with m neurons, where m is the number of possible actions in the environment.
Let U be the weight matrix of this layer, and c be the bias vector. The output of this layer is the Q-value vector Q(s, a; θ), which can be written as:
Q(s, a; θ) = U h + c
where a is a one-hot vector that represents the selected action, i.e., a[i] = 1 if action i is selected and a[j] = 0 for all j ≠ i.
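To make these two layers concrete, here is a minimal sketch of such a network in PyTorch. It is only an illustration of the equations above: the class name QNetwork and the dimensions (state_dim=4, n_hidden=64, n_actions=2) are assumed for the example, not prescribed by DQN itself.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """One-hidden-layer DQN: h = relu(W s + b), Q(s, .; θ) = U h + c."""
    def __init__(self, state_dim, n_hidden, n_actions):
        super().__init__()
        self.hidden = nn.Linear(state_dim, n_hidden)  # weights W, bias b
        self.out = nn.Linear(n_hidden, n_actions)     # weights U, bias c

    def forward(self, s):
        h = torch.relu(self.hidden(s))  # hidden activation h
        return self.out(h)              # one Q-value per action

# Usage: evaluate Q-values for a single state and pick the greedy action
q_net = QNetwork(state_dim=4, n_hidden=64, n_actions=2)
state = torch.randn(4)            # placeholder state vector
q_values = q_net(state)           # Q(s, a; θ) for every action a
greedy_action = int(torch.argmax(q_values))

Note that a single forward pass produces the Q-values of all m actions at once, which is exactly the efficiency point made earlier.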
• During training, DQN uses the Bellman equation to update Q-values based on the observed rewards and next states.
Let r be the reward for the current state-action pair, and s' be the next state. The target Q-value for the current state-action pair is computed as:
target = r + gamma * max_a' Q_target(s', a'; θ_target)
Where,
gamma is the discount factor that weighs future rewards, and Q_target(s', a'; θ_target) is the Q-value predicted by a separate target network with weights θ_target, maximized over all possible next actions a'.
• DQN then uses an optimization technique (typically a variant of stochastic gradient descent) to minimize the mean squared error between the predicted Q-value and the target Q-value. The loss function can be written as:
loss = (Q(s, a; θ) - target)^2
• DQN updates the weights θ of the network by backpropagating the gradients of the loss with respect to the weights and biases, using the chain rule. The target network is updated periodically to keep it in sync with the current network.
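Putting the pieces together, below is a minimal sketch of one training update, reusing the QNetwork from the earlier sketch. The function name dqn_update, the single-transition input, and the default gamma=0.99 are assumptions made for illustration; real implementations usually operate on mini-batches sampled from the replay buffer described below.

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, transition, gamma=0.99):
    """One DQN update from a single (s, a, r, s', done) transition."""
    state, action, reward, next_state, done = transition

    # Predicted Q-value of the action actually taken: Q(s, a; θ)
    q_pred = q_net(state)[action]

    # Target: r + gamma * max_a' Q_target(s', a'; θ_target).
    # No gradient flows through the target network.
    with torch.no_grad():
        q_next = target_net(next_state).max()
        target = reward + gamma * q_next * (1.0 - done)

    # Mean squared error between prediction and target
    loss = F.mse_loss(q_pred, target)

    optimizer.zero_grad()
    loss.backward()    # backpropagate gradients of the loss w.r.t. θ
    optimizer.step()
    return loss.item()

Here done is 1.0 when s' is terminal (so no future reward is bootstrapped) and 0.0 otherwise.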
Function Approximation:
The key difference between Q-Learning and DQNs is the representation of the Q-value function. In Q-Learning, the Q-value function is represented as a table, whereas in DQNs it is represented as a neural network.
• The Q-Learning update rule for updating the Q-value of a state-action pair (s, a) is given by:
Q(s, a) ← Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
Where,
alpha is the learning rate, gamma is the discount factor, r is the observed reward, and s' is the next state.
• In contrast, the DQN update rule for updating the Q-value of a state-action pair (s, a) is given by:
Q(s, a) = r + gamma * max_a' Q_target(s', a'; θ_target)
where Q_target(s', a'; θ_target) is the Q-value for the next state s' and action a', as predicted by the target neural network with weights θ_target. The target network is updated periodically to keep it in sync with the current network.
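For contrast with the network-based update above, a tabular Q-Learning update is just an in-place edit of one cell of a Q-table. A minimal sketch, where the dictionary-based table and the variable names are assumptions made for illustration:

from collections import defaultdict

Q = defaultdict(float)  # Q-table: maps (state, action) pairs to Q-values, default 0.0

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Tabular rule: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage with toy integer states and two actions
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])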
Experience Replay:
Experience replay involves storing a large number of experiences (i.e., state-action pairs and their associated rewards) in a replay buffer and randomly sampling from this buffer to train the network. This helps to reduce the “correlation between successive training examples”.
For example, in robotic control tasks, the agent generates a sequence of state-action pairs as it interacts with the environment. These sequences can be highly correlated, since small changes in state tend to lead to similar actions being taken. This causes overfitting and instability in the learning process.
By using experience replay, the agent can break the correlation between successive training examples and learn more efficiently.
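A minimal sketch of such a replay buffer, assuming transitions are stored as (state, action, reward, next_state, done) tuples; the class name and capacity are illustrative choices, not fixed by the algorithm:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions; old experiences are evicted first."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions collected by the agent.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)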
Target Network:
Since the Q-values are used as targets to train the DQN neural network, they change with each iteration of the training process as the weights of the network are updated. This means that the targets used to train the network are not fixed. This can lead to a phenomenon called "catastrophic forgetting", where the network forgets what it has learned in previous iterations and oscillates between different sets of weights.
To address this issue, the DQN algorithm uses a separate target network, which is a copy of the neural network used to approximate the Q-value function. The Q-values used to train the main network are generated by the target network and are held fixed for a certain number of iterations. This provides stability to the training process and helps to prevent overestimation of Q-values.
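A minimal sketch of how the two networks are kept in sync, using the q_net and target_net names from the earlier sketches. The periodic hard copy is the classic DQN scheme; the soft (Polyak) update shown as an option is a common later variation, not part of the original algorithm.

def sync_target(q_net, target_net, tau=None):
    """Copy main-network weights into the target network.

    tau=None        -> hard update: exact copy, done every C training steps.
    tau=0.005 (say) -> soft (Polyak) update: θ_target ← tau*θ + (1-tau)*θ_target.
    """
    if tau is None:
        target_net.load_state_dict(q_net.state_dict())
    else:
        for p_t, p in zip(target_net.parameters(), q_net.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)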
Note: Target networks are also used to perform transfer learning. A target network is pre-trained on a large dataset and then fine-tuned on a smaller dataset specific to the task at hand. This allows the model to leverage the pre-trained knowledge and improve performance on the new task. We will cover transfer learning in dedicated editions.
Exploration Strategy:
While both Q-Learning and DQNs use an epsilon-greedy exploration strategy to balance exploration and exploitation, DQNs can use more sophisticated exploration strategies, such as Boltzmann exploration, to encourage exploration of less certain actions.
Rather than choosing the action with maximum Q-value (as in epsilon-greedy exploration), Boltzmann exploration samples an action from a probability distribution that is proportional to the exponential of Q-value divided by a temperature parameter.
The probability of selecting action a at state s using Boltzmann exploration can be written as:
P(a | s) = exp(Q(s, a) / T) / Σ_a' exp(Q(s, a') / T)
Where,
T is the temperature parameter that controls the degree of exploration: a high temperature makes the action probabilities more uniform (more exploration), while a low temperature concentrates probability on the actions with the highest Q-values (more exploitation).
Boltzmann exploration encourages more systematic exploration of action space when there are many actions with similar Q-values.
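A minimal sketch of Boltzmann action selection in PyTorch, reusing the Q-values produced by the QNetwork sketched earlier; the function name and the example temperature are illustrative:

import torch

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    probs = torch.softmax(q_values / temperature, dim=-1)  # Boltzmann distribution
    return int(torch.multinomial(probs, num_samples=1))

# Usage: actions with similar Q-values get similar selection probabilities
action = boltzmann_action(torch.tensor([1.2, 1.1, -0.5]), temperature=0.5)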
Non-Deterministic Environments:
Tabular Q-Learning is easiest to reason about in deterministic environments, where the same action taken in the same state always results in the same next state and reward. However, many real-world scenarios are non-deterministic, meaning that the same action may lead to different outcomes.
DQNs can handle non-deterministic environments: because the network is trained on many sampled transitions, its Q-value estimates average over the randomness in rewards and next states, and a stochastic exploration strategy such as Boltzmann exploration maps states to a distribution over actions rather than a single fixed choice.
Continuous Action Spaces:
Continuous action spaces refer to environments in which an agent must select an action from a continuous range of possible values, rather than a discrete set of options. This can be the case in many real-world scenarios, such as controlling throttle of a vehicle or pitch and roll of a drone.
In traditional Q-Learning, the Q-value function is represented as a lookup table that maps state-action pairs to Q-values. However, with a continuous action space, the number of possible actions can be infinite, making it impractical to store a separate Q-value for every action.
To handle continuous action spaces, there are multiple approaches. One is to use a separate neural network, called a policy network, that maps states directly to actions. The output of the policy network is a probability distribution over possible actions, from which a specific action is sampled. This idea underlies policy-gradient and actor-critic methods (such as Deep Deterministic Policy Gradient, DDPG); Double DQN, by contrast, is a technique for reducing overestimation of Q-values and does not itself handle continuous actions.
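For illustration, here is a minimal sketch of such a policy network that outputs a Gaussian distribution over a continuous action; the class name, layer sizes, and the Gaussian parameterization are assumptions made for the example, not the only way to do it:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps a state to a Gaussian distribution over a continuous action."""
    def __init__(self, state_dim, action_dim, n_hidden=64):
        super().__init__()
        self.hidden = nn.Linear(state_dim, n_hidden)
        self.mean = nn.Linear(n_hidden, action_dim)           # centre of the action distribution
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned spread

    def forward(self, s):
        h = torch.relu(self.hidden(s))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

# Usage: sample a throttle value in a continuous range for a given state
policy = GaussianPolicy(state_dim=4, action_dim=1)
action = policy(torch.randn(4)).sample()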
The Why:
Reasons to use DQNs:
The Why Not:
Reasons not to use DQNs:
Time for you to support:
In the coming posts, we will cover two more Reinforcement Learning models: Genetic Algorithm and Multi-Agent.
Let us know your feedback!
Until then,
Have a great time!