A Primer on Reinforcement Learning
Minh Trinh
Managing Partner at Centaurs Fabs, Author of "Foundations of Artificial Intelligence Finance" and "The Artificial Intelligence Handbook Series", Organizer of "Artificial Intelligence for Good (New York)"
What is Reinforcement Learning
Reinforcement Learning (RL) (Sutton and Barto, 2018) lies at the intersection of many fields: computer science, machine learning, operations research, applied mathematics, psychology, cognitive science, game theory, economics, and finance. It deals with decision-making and pursuing an optimal course of action with the aim of collecting future rewards.
In RL, it is assumed that the goal of the agent is to take a series of actions that maximizes an expected cumulative sum of rewards. The reward can arrive immediately after an action or much later. A discount factor can be applied to rewards to represent the time value of money: a reward tomorrow is worth less than a reward today. The reward is not necessarily a monetary payment; it can be an indicator of a final state, such as 1 if the agent wins a game and 0 if it loses, or it can take the form of a utility function.
RL has natural applications to games (backgammon, checkers, Go, Atari), where there are clear states, actions, and rewards, but also to real-life situations that involve optimal control: piloting a vehicle, controlling a power plant, scheduling servers in the cloud, or optimizing recommender systems such as choosing individualized ads to display on a website. It is also very applicable to economics and finance, as these fields study agents pursuing the maximization of some inter-temporal objective (their utility function or a discounted sum of cash flows) by taking actions such as consuming, saving, trading, or investing.
The agent takes actions in response to a state of the environment; after the agent takes an action, it receives a reward and the environment moves to another state. This is represented in Figure 1:
Figure 1. Reinforcement Learning framework: the agent observes the state of the environment S(t), performs an action A(t), receives a reward R(t+1), and observes a new state S(t+1)
In finance, the agent could be a trading engine that makes trade decisions, the environment could be the market and the set of asset prices, and the reward could be the P&L (profit and loss) of the trading system. In economics, the agent is a representative consumer who makes sequential consumption and saving decisions, the environment is the interest rate and the prices of consumption goods, and the reward takes the form of a utility function.
The environment can be deterministic or stochastic, and can be discrete with a finite number of states or continuous. The environment produces a new state after an agent takes an action. The state can be influenced by the agent's action but does not have to be. The actions the agent can take can be discrete (invest everything or nothing, or in increments of a number of assets), usually from a finite set, or continuous (invest a fraction of total wealth in different assets).
The agent can fully observe, not observe, or only partially observe the environment. For instance, an agent might know the state of the economy only imperfectly, based on lagged economic indicators. The agent might or might not know the dynamics of the environment, such as the distribution of future states given the current state and the agent's actions. The time horizon of the agent can be finite, in which case the agent's interaction with the environment is episodic with a starting state and a terminal state, or infinite with no terminal state.
Markov Decision Processes
The environment for reinforcement learning is usually represented by a Markov Decision Process (MDP). An MDP model consists of a finite set of states S, a finite set of actions A, transition dynamics P, a reward function R, and a time discount factor γ: (S,A,P,R,γ).
At each period t, the environment is in state s(t), an agent takes action a(t) and receives a reward (or utility) r(s(t),a(t)). A new state s(t+1) occurs after action a(t) is taken.
A state s(t) in an MDP has the Markov property: a future state s(t+1) depends only on the current state s(t) and not on any previous states. The current state contains all the useful information to predict the future state.
P(s'|s,a) represents the transition probability from state-action pair (s,a) to state s'. (S,P) is a Markov Process. In a discrete and finite environment, P is represented by a state transition matrix. P represents how the world works; it is a model representation of the world.
With the addition of a reward function R, (S,P,R) becomes a Markov Reward Process. With the addition of an action space A, (S,A,P,R) becomes a Markov Decision Process. To account for the time dimension we add the discount factor γ and the MDP becomes (S,A,P,R,γ).
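To make the (S,A,P,R,γ) tuple concrete, here is a minimal sketch in Python of a toy two-state, two-action MDP; the state and action names and the numbers are made up purely for illustration.

```python
# Minimal sketch of an MDP (S, A, P, R, gamma) as plain Python data.
# The states, actions, and numbers are hypothetical, chosen only for illustration.
states = ["low", "high"]
actions = ["hold", "invest"]

# P[s][a] maps each next state s' to its transition probability P(s'|s,a).
P = {
    "low":  {"hold": {"low": 0.9, "high": 0.1}, "invest": {"low": 0.6, "high": 0.4}},
    "high": {"hold": {"low": 0.2, "high": 0.8}, "invest": {"low": 0.5, "high": 0.5}},
}

# R[s][a] is the expected immediate reward r(s, a).
R = {
    "low":  {"hold": 0.0, "invest": -1.0},
    "high": {"hold": 1.0, "invest": 2.0},
}

gamma = 0.95  # discount factor
```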
Taxonomy of Reinforcement Learning
Reinforcement Learning can be divided into model-based and model-free Reinforcement Learning. Model-based (Model-free) reinforcement learning can be seen as a method for solving a dynamic programming problem when the MDP is (not) known.
Reinforcement Learning uses terminology that distinguishes between prediction, estimating future payoffs under a given policy (optimal or not), and control, finding the actions the agent needs to take. This is summarized in Figure 2.
Figure 2. Terminology in Reinforcement Learning
Figure 3. Taxonomy of Reinforcement Learning
Model-Based RL: Planning
Planning (also called control) is making decisions based on the agent's knowledge of the environment. It involves prediction and sequential decisions. The agent knows the model driving all the state variables that matter and can use the model to predict the effects of its decisions. Having a model is the standard approach for economists. The dynamics of asset prices are usually assumed to be known, such as a Brownian motion process, or the way monetary policy is conducted, such as targeting inflation and the output gap.
Model-based RL assumes that the agent knows the dynamics of the environment in the form of a model. There is only planning involved and no learning, because the agent knows precisely the transition probabilities from one state to another given the history of states and actions. This is a pure planning exercise since we know how the agent's actions influence future states and how much reward she can collect. She can use dynamic programming or search algorithms to find the optimal policy.
An alternative to planning and model-based RL is learning, where the agent does not know the environment dynamics and has to make decisions based on interactions with the environment and observations of the states. She has to learn the impact of her actions and the rewards she will receive. She is then learning in a model-free RL framework.
Model-Free RL: Learning and Planning
Model-free RL assumes that the agent does not know the model. The agent interacts with the environment but doesn't know the intrinsic dynamics that drive it. The agent has to learn and predict the rewards received from his actions from experience and come up with the optimal behavior. The experience can be online (live), stored in memory, retrieved from human experts, or simulated.
There are three approaches to solve a model-free RL problem:
- Value-based methods evaluate a value function: the present value of future rewards for each state (a State Value Function V) or each state-action pair (a State-Action Value Function Q) is estimated. The agent then uses the estimated value function to plan his actions.
- Policy-based methods evaluate a policy function: a Policy, which is a deterministic or stochastic function that maps each state to an action or a set of action probabilities, is estimated. The agent then uses the estimated policy function directly to plan his actions.
- Actor-critic methods evaluate both a value and policy function. A model (the Critic) evaluates a Value Function, the present value of future rewards, and a model (the Actor) evaluates the optimal action in the form of a Policy. These two models combine to form an Actor-Critic model. The agent then uses the estimated policy and value functions to plan his actions.
Figure 4. Actor-Critic Model
Goal of an RL Agent
Given an MDP (S,A,P,R,γ), an agent who wants to maximize his lifetime cumulative rewards has a policy π, a state-value function Vπ, and a state-action value function Qπ. The policy maps the current state to an action and, combined with the transition dynamics, defines the transition probability Pπ(s'|s) from state s to state s'. Note that the states s and s' are the states observed by the agent; they are called information states. The environment might have other environmental states unobserved by the agent, but the MDP assumes that these states are the same.
A state-value Vπ(s) represents the expected cumulative sum of rewards that the agent will receive if the policy π is followed starting from state s. A state-action value Qπ(s,a) represents the expected cumulative sum of rewards that the agent will receive if the policy π is followed starting from state s and action a. The policy π can be deterministic, assigning to each state s an action a=π(s), or it can be stochastic, with a conditional probability distribution π(a|s).
Eπ[.] is the expectation operator; here it is the expectation given the policy π and the current state s(t). The policy π defines the actions a(t),...,a(T).
Vπ(s(t))=Eπ[r(s(t),a(t))+γr(s(t+1),a(t+1))+...+γ^(T-t)r(s(T),a(T))]
where a(t+i)=π(s(t+i)), i=0,...,T-t
We also have that
Qπ(s(t),a(t))=Eπ[r(s(t),a(t))+γr(s(t+1),a(t+1))+...+γ^(T-t)r(s(T),a(T))]
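As a quick numerical illustration of these discounted sums, here is a short sketch in Python; the reward values are made up.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ... for a list of rewards."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Made-up rewards collected from time t to the terminal time T.
print(discounted_return([1.0, 0.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9**3 * 2.0 = 2.458
```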
Dynamic Programming
Policy Evaluation
The state and action are known at time t, and we can rewrite this equation in a recursive manner as: Vπ(s(t))=r(s(t),π(s(t)))+γEπ[Vπ(s(t+1))] (Bellman's consistency, expectation, or backup equation)
The previous equation can be used to evaluate a policy: how much total reward will the agent receive if the policy is consistently followed?
If we define the Bellman backup operator T applied to Vπ as: TVπ(s(t))=r(s(t),π(s(t)))+γEπ[Vπ(s(t+1))]
Then we want to solve: Vπ=TVπ and thus Vπ is a fixed point of this equation.
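A minimal sketch of iterative policy evaluation, repeatedly applying the Bellman backup until V reaches its fixed point; it reuses the toy MDP sketched earlier, and the stopping tolerance is an arbitrary choice.

```python
def policy_evaluation(states, P, R, gamma, policy, tol=1e-8):
    """Iteratively apply the backup V(s) <- r(s, pi(s)) + gamma * E[V(s') | s, pi(s)]."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            backup = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

# Evaluate the (hypothetical) policy that always holds, on the toy MDP defined earlier.
print(policy_evaluation(states, P, R, gamma, {"low": "hold", "high": "hold"}))
```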
Policy Optimization
The agent wants to maximize its cumulative reward: MaxπVπ(s).
Dynamic programming is concerned with solving such a problem, i.e. finding the optimal policy that maximizes the value function when the MDP is known. With an MDP, the Bellman optimality principle shows that the optimal policy can be found with the following Bellman optimality equations:
V*(s)=Maxa[r(s,a)+γEs'V*(s')]
Or
Q*(s,a)=r(s,a)+γEs'[Maxa'Q*(s',a')]
where s' follows the distribution P(.|s,a).
V* and Q* indicate the optimal value and action value along the optimal decision path.
We can solve for V* and Q* by iteration. Each time we update the value Vπ we are performing a policy evaluation step. After each evaluation we can improve the policy by solving π'(s)=argmaxa Q(s,a); we say that the agent acts greedily with respect to Q, and this is the policy improvement step. We alternate between policy evaluation and policy improvement until convergence to V*, Q*, and π*.
It is possible to solve for the optimal value V* without solving for the optimal policy at each step. In that case, we only use value iteration.
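A minimal sketch of value iteration on the same kind of tabular MDP; the greedy policy is read off the converged values at the end.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a [ r(s,a) + gamma * E[V(s') | s, a] ] until convergence."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(actions,
                     key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in states}
    return V, policy
```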
One drawback of dynamic programming is that if the number of states is very large, each iteration is computationally expensive. Solving for V*, Q*, and π* with the Bellman equations also assumes that we know P, the model of the environment. If we don't know P, then we do model-free reinforcement learning.
Model-free Reinforcement Learning
In model-free reinforcement learning, the agent does not know the MDP but has to evaluate and optimize his policy by interacting with the environment. The agent can do policy evaluation using three methods: Monte-Carlo, Temporal Difference learning TD(0), and TD(λ).
Policy Evaluation
Monte-Carlo evaluation
With Monte-Carlo and a given policy, an agent goes through multiple sequences of state, action, and reward and receives the cumulative reward. This is only possible if the agent’s task is episodic and has an end (otherwise the Monte-Carlo run cannot complete).
R(t) is a sample sum of discounted rewards from time t for a Monte-Carlo run.
The first time a state s(t) is visited in a run, we add R(t) to the total rewards for that state and divide by the total number of trajectories that visited that state. This is first-visit Monte-Carlo. Because the trajectories are independent, the average cumulative reward is guaranteed to converge to the true cumulative reward thanks to the Law of Large Numbers.
An alternative is every-visit Monte-Carlo, where we count the rewards of all trajectories that visit that state, counting them multiple times if they visit that state more than once. The average cumulative reward is no longer guaranteed to converge to the true cumulative reward because the returns from a trajectory visiting the same state several times are not independent of one another. We expect more visits per state, so the variance of the cumulative reward estimate will be lower, but with some possibility of bias (the average will not exactly equal the true cumulative reward).
An alternative to updating the value by recalculating the average is to perform an incremental update (α is a positive value smaller than 1):
V(s(t))=V(s(t))+α[R(t)-V(s(t))]
This way, each time R(t) is higher than V(s(t)), V(s(t)) is adjusted upward.
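A minimal sketch of this incremental update for every-visit Monte-Carlo evaluation; the episode format (a list of (state, reward) pairs generated under the fixed policy, where the reward is the one received after leaving that state) is an assumption for illustration.

```python
def mc_evaluate(episodes, alpha=0.1, gamma=0.95):
    """Incremental every-visit Monte-Carlo evaluation of a fixed policy.
    Each episode is a list of (state, reward) pairs; reward is received after leaving the state."""
    V = {}
    for episode in episodes:
        G = 0.0
        returns = []
        for state, reward in reversed(episode):   # accumulate the discounted return backwards
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in returns:
            V.setdefault(state, 0.0)
            V[state] += alpha * (G - V[state])    # V(s) <- V(s) + alpha [R(t) - V(s)]
    return V
```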
Another important aspect of Monte-Carlo is that updates can only be done at the end of each run (or episode) and not during a run. The Temporal-Difference Learning method can perform updates after each time step.
Temporal-Difference (TD) Learning
An alternative to Monte-Carlo for evaluating a policy is Temporal Difference (TD) Learning, also called TD(0). TD(0) updates more often than Monte-Carlo and will be less noisy but more biased. Contrary to Monte-Carlo methods, it is applicable to non-episodic tasks.
TD(0)
To evaluate a policy π, at each time step the agent at state s(t) takes an action a(t)=π(s(t)) and receives reward r(t+1); the value V(s(t)) is updated toward the estimated return r(t+1)+γV(s(t+1)) (instead of the full return R(t) as in Monte-Carlo):
V(s(t))=V(s(t))+α[r(t+1)+γV(s(t+1))-V(s(t))]
The algorithm is as follows:
TD(0) Algorithm
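The original algorithm listing is not reproduced here. Below is a minimal sketch of tabular TD(0) evaluation in Python; the env.reset() / env.step() interface is an assumption in the style of common RL toolkits, not part of the text.

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.95):
    """Tabular TD(0) policy evaluation.
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            s = s_next
    return dict(V)
```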
TD(λ)
TD(λ) is an extension of TD(0) where we update the state values using returns estimated over several steps instead of only one step. TD(λ) combines these n-step estimated returns into one single target return, with weights controlled by λ. When λ=0, TD(λ) is the same as TD(0). When λ=1, TD(λ) is the same as the Monte-Carlo method.
Importance Sampling: Off-policy evaluation
It is possible to evaluate a policy π using samples generated by another policy μ through importance sampling. The value update formula uses the ratio of the two policies' probability densities. This ratio needs to be defined for all attainable states.
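A minimal sketch of the trajectory importance-sampling ratio, which is one common form of this correction; pi(s, a) and mu(s, a) are assumed functions returning action probabilities.

```python
def importance_weight(trajectory, pi, mu):
    """Product over the trajectory of pi(a|s) / mu(a|s).
    mu(s, a) must be nonzero for every state-action pair that pi can reach."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi(s, a) / mu(s, a)
    return rho

# The Monte-Carlo return of a trajectory generated under mu is reweighted by rho
# to estimate the value under pi.
```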
Policy Optimization
Monte-Carlo control
With Monte-Carlo control, the policy is improved using the state-action value function Q(s(t),a(t)).
Q(s(t),a(t)) can be updated as follows:
Q(s(t),a(t))=Q(s(t),a(t))+α[R(t)-Q(s(t),a(t))]
Then the policy is greedily improved by π(s(t))=argmaxa Q(s(t),a)
To facilitate exploration, we can use an ε-greedy approach: with probability ε the action is uniformly random over the action space, and with probability (1-ε) it is the greedy solution π(s(t))=argmaxa Q(s(t),a).
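A minimal sketch of ε-greedy action selection over a tabular Q function (stored here as a dictionary keyed by (state, action) pairs, an arbitrary representation choice):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise act greedily with respect to Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```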
Sarsa: On-policy TD control
The model has an agent who is in state s(t) and can perform an action a(t) that pays a reward r(t+1) in the next period. In the next period the state transitions to a new state s(t+1). Q(s(t),a(t)) can be updated using Temporal Difference as in TD(0):
Q(s(t),a(t))=Q(s(t),a(t))+α[r(t+1)+γQ(s(t+1),a(t+1))-Q(s(t),a(t))]
Then we can use the same ε-greedy approach: the new action is chosen uniformly at random over the action space with probability ε, and with probability (1-ε) it is the greedy solution π(s(t+1))=argmaxa Q(s(t+1),a).
The sequence of state, action, and reward becomes s(t),a(t),r(t+1),s(t+1),a(t+1), hence the name Sarsa. The Sarsa algorithm is described below:
Sarsa algorithm
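The original algorithm listing is not reproduced here; the following is a minimal sketch of tabular Sarsa, reusing the hypothetical env.reset()/env.step() interface and the epsilon_greedy helper from the earlier sketches.

```python
from collections import defaultdict

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Sarsa (on-policy TD control) with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```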
Sarsa(λ)
Instead of using TD(0) we can use TD(λ) to update the state-action value function. We then have Sarsa(λ) instead of Sarsa.
Q Learning: Off-policy TD control
The Q-Learning model (Watkins and Dayan, 1992) has an agent who is in state s and can perform an action a that pays a reward r in the next period. In the next period the state transitions to a new state s’.
Q(s(t),a(t))=Q(s(t),a(t))+α[r(s(t),a(t))+γmaxa Q(s(t+1),a)-Q(s(t),a(t))]
Q-Learning Algorithm
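Again, the original listing is not reproduced; below is a minimal sketch of tabular Q-learning under the same assumed interface, where the target uses the max over actions regardless of the action actually taken next.

```python
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control) with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            target = r if done else r + gamma * best_next
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```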
RL with Function Approximation
Curse of Dimensionality
The previous techniques are applicable to finite and low-dimensional state spaces. They suffer from the curse of dimensionality, a term introduced by Bellman: the number of states increases exponentially with the number of features and actions, making the optimization problem intractable. A way to address this is to move to continuous spaces and use parameterized, approximated value and policy functions.
Q Learning with Linear Function Approximation
Q-Learning can be revisited with linear function approximation. It uses a feature vector φ(s,a)=[φ1(s,a),...,φn(s,a)]' and the state-action value function is approximated by the linear function θ'φ(s(t),a(t)).
With Q(s(t),a(t))≈θ'φ(s(t),a(t)), the parameter vector θ is updated instead of the table entry:
θ=θ+α[r(s(t),a(t))+γmaxa θ'φ(s(t+1),a)-θ'φ(s(t),a(t))]φ(s(t),a(t))
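A minimal sketch of one such semi-gradient update with a linear approximation; phi(s, a) is an assumed feature function returning a NumPy vector of the same length as theta.

```python
import numpy as np

def linear_q_update(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.95):
    """One semi-gradient Q-learning step with Q(s,a) approximated by theta . phi(s,a).
    theta is a NumPy parameter vector; phi(s, a) returns a feature vector of the same length."""
    q_sa = np.dot(theta, phi(s, a))
    q_next = max(np.dot(theta, phi(s_next, a2)) for a2 in actions)
    td_error = r + gamma * q_next - q_sa
    return theta + alpha * td_error * phi(s, a)   # the gradient of theta . phi(s,a) is phi(s,a)
```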
Deep Q Networks (DQN)
DQN uses a variant of the Q-learning algorithm to train a neural network, using stochastic gradient descent, to control an agent. In the case of Mnih et al. (Mnih et al., 2013), a convolutional neural network captures video input data of the environment in Atari games. They also use experience replay to sample previous transitions and behaviors, which limits the correlation and non-stationarity of the data.
In DQN, an agent interacts with an environment (e.g. an Atari video-game emulator), observes a current state, and takes a discrete (legal) action that leads to a new state (a video image) and a reward (game score points). The agent wants to maximize the sum of future rewards. The simplest reward can be just 1 if it wins and 0 if it loses, or it could be a total game score.
DQN is a model-free off-policy reinforcement learning algorithm. It uses only samples of the environment and does not attempt to model it. It learns about the optimal strategy by mixing a greedy-strategy with random exploratory strategies.
In this environment, there is an optimal action-value function Q*(s,a), which is the maximum expected sum of rewards obtainable after taking action a in state s.
To solve for the optimal strategy, we solve by iteration a Bellman equation that relates the current Q*(s,a) value to the current reward and the future Q*(s',a') values.
Q*(s,a)=Es'[r+γmaxa'Q*(s',a')|s,a]
A function approximator with parameter θ is used in practice to estimate Q*(s,a):
Q(s,a;θ)≈Q*(s,a)
Q(s,a;θ) is the Q-network. The parameter θ is estimated by minimizing a sequence of loss functions by stochastic gradient descent.
The algorithm is as follows:
DQN Algorithm
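The original algorithm box is not reproduced here. Below is a minimal sketch, assuming PyTorch, of a DQN-style Q-network and loss; the small fully connected network and the batch format are simplifying assumptions (Mnih et al. used a convolutional network on image frames), and the replay buffer and target-network updates are assumed to live elsewhere.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network returning one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, state):
        return self.net(state)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error on a batch (s, a, r, s_next, done) sampled from the replay buffer."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta)
    with torch.no_grad():                                         # the target uses frozen parameters
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```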
Policy Gradient Methods
Policy-based reinforcement learning deals with the policy function instead of the action-state value functions. In value-based reinforcement learning methods, the value function is solved with Bellman-inspired equations (TD, Sarsa, Q-Learning) and the action is derived from the state and the optimal value function; this implicitly defines the policy function.
By optimizing the policy directly, it is possible to find a deterministic or stochastic policy that a value-based method might not. An ε-greedy policy, for instance, alternates between an optimal policy and a random policy and might never become deterministic or sufficiently stochastic.
A policy can be parametrized by a parameter vector θ and written in the form of a probability distribution over actions a given state s:
πθ(s,a)=Prob(a|s,θ)
To optimize the policy, the optimal θ is found by iterating with the gradient of an objective function. The action is now defined by the policy function. The objective function is a performance measure J(θ):
J(θ)=vπθ(s(0))
where s(0) is the start state.
The objective is to find the θ that maximizes the performance J(θ). A common method to find the optimal value is to use gradient ascent. To maximize J(θ), we need to compute the gradient of J(θ) with respect to θ. The optimal parameter vector is found by iterating:
θ(t+1)=θ(t)+αgradθJ(θ(t))
α is a positive step-size parameter: if the gradient is positive, θ is increased at each time step; if it is negative, θ is decreased. This method will usually find a local maximum, not necessarily a global maximum. If the step size is too large, it might overshoot the local maximum.
We assume that the policy πθ is differentiable whenever it is nonzero. The Policy Gradient Theorem gives an expression of the gradient as a function of the score gradθ logπθ(s(t),a(t)) and the policy objective (future reward) v(t).
Policy Gradient Theorem
For any differentiable policy πθ(s,a) and any policy objective function J(θ), the policy gradient is:
gradθJ(θ)=Eπθ[gradθ logπθ(s(t),a(t)) v(t)]
where v(t) is the return (future reward) from time t.
Monte-Carlo Policy Gradient (REINFORCE)
The gradient result is used for the REINFORCE algorithm, which was introduced by Williams (Williams, 1992).
The algorithm is as follows:
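The original listing is not reproduced here; the following is a minimal sketch of a REINFORCE update, assuming PyTorch, a policy network that returns action logits, and an episode stored as (state tensor, action index, reward) triples; these representation choices are illustrative, not from the text.

```python
import torch

def reinforce_update(policy_net, optimizer, episode, gamma=0.99):
    """One Monte-Carlo policy gradient (REINFORCE) update from a single episode."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):             # discounted return from each time step
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    loss = 0.0
    for (s, a, _), G in zip(episode, returns):
        log_prob = torch.log_softmax(policy_net(s), dim=-1)[a]
        loss = loss - log_prob * G                # ascend on E[ grad log pi(a|s) * return ]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```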
Advanced Policy Gradients
Several gradient methods have been proposed to improve policy optimization.
Trust Region Policy Optimization (TRPO)
This is a model-free RL approach for policy optimization, introduced by Schulman et al. (Schulman et al., 2017a).
TRPO is an iterative practical approximation of a procedure that has been justified theoretically. It uses the same gradient methods applied to large neural networks. TRPO provides monotonic improvement of the policy after each iteration.
It applies to an infinite-horizon Markov Decision Process (MDP), with a finite set of states, a finite set of actions, a transition probability distribution, a reward function, a distribution of initial state, and a discount factor.
The MDP is associated with a state-action value function, a value function and an advantage function. The advantage function is the difference between the state-action value function and the value function.
The expected return of a new policy can be written as the expected return of an existing policy plus a discounted sum of future advantage-function values. If a new policy has a nonnegative advantage everywhere, with a strictly positive advantage at some state-action pairs visited with positive probability, it will improve on the existing policy.
A local approximation of the expected return function is used. The local approximation is expressed as the expected return of the existing policy plus a weighted sum of advantage-function values. The difference is that it uses the discounted state visitation frequencies of the existing policy rather than those of the new one.
The main result of TRPO is that maximizing the local approximation minus a divergence penalty between the new policy and the existing policy monotonically improves the expected return of the new policy, for general stochastic policies. The divergence term can be substituted with a KL divergence term.
When the policies are parametrized, this becomes an optimization problem, maximizing the local approximation of expected returns over policy parameters with some constraint on the average KL divergence.
The expectations are replaced by sample averages. This can be done on a single path or over multiple trajectories. With the vine approach, multiple states are chosen along these trajectories (the rollout set) and actions are sampled at those states from the policy. Q-values can then be estimated by Monte-Carlo from the rolled-out state-action pairs.
Averaging over samples gives an estimate of the objective function and the constraint. Solving the problem by calculating the conjugate gradient gives an update on the policy parameter.
Proximal Policy Optimization (PPO)
PPO is a family of policy gradient methods based on Schulman et al. (Schulman et al., 2017b). This is a model-free RL approach.
In policy gradient methods, policy gradient estimators are calculated and stochastic gradient ascent methods are used to maximize the expected return of the policy.
Like in TRPO, PPO uses surrogate objective functions to optimize. The surrogate function depends on the ratio of policy probability density function with the new parameters over the policy probability density function with the old parameters, and the advantage function. Furthermore, PPO clips this ratio of probability densities to be close to 1. The result is the clipped surrogate objective function.
Then multiple steps of stochastic gradient ascent are performed on the clipped surrogate objective function.
If the policy and value functions share the same parameters, we add a term that depends on the squared difference between the value function and its target. We also add an entropy term to encourage exploration.
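A minimal sketch of the clipped surrogate objective, assuming PyTorch tensors of per-sample log probabilities under the new and old policies and of advantage estimates; the clipping constant here is only an illustrative value.

```python
import torch

def ppo_clipped_objective(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized by gradient ascent)."""
    ratio = torch.exp(log_prob_new - log_prob_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return torch.min(unclipped, clipped).mean()
```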
Deep Deterministic Policy Gradient (DDPG)
This is a model-free RL approach, proposed by Lillicrap et al. (Lillicrap et al., 2019). The method was developed to handle continuous actions. It is an actor-critic method with both a critic (value) network and an actor (policy) network, as well as target critic and actor networks. It uses a replay buffer to store past transitions {state, action, reward, future state}. It also uses batch normalization (Ioffe and Szegedy, 2015) to stabilize learning, and encourages exploration through a noise process added to the actor policy.
Actor-Critic Methods
Actor-Critic methods combine the approach of the value-based method (critic) and the policy-based approach (actor). A3C is an actor-critic method.
Asynchronous Methods for Deep Reinforcement Learning (A3C)
A3C is introduced by Mnih et al. (Mnih et al., 2016). It is a parallel reinforcement learning algorithm. A3C stands for asynchronous advantage actor-critic.
A3C is an alternative to the experience replay used in DQN. It uses parallel asynchronous agents running on multiple instances of the environment.
We use asynchronous actor-learners working in parallel on a single machine. Each actor-learner can use a different exploration policy to increase diversity and make the learning of the policy parameters more stable. Training time is reduced compared with a single learner, and on-policy reinforcement learning methods can be used because there is no experience replay.
This method can be applied to one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor critic (A3C).
We define θ as the parameter vector for the policy and θv as the parameter vector for the value function.
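A minimal sketch, assuming PyTorch, of the per-step losses typically used in advantage actor-critic methods; the loss coefficients and the single-step advantage R - V(s) are illustrative simplifications of the n-step returns used in the paper.

```python
import torch

def actor_critic_loss(logits, value, action, target_return, entropy_coef=0.01):
    """Combined per-step loss (logits, value, and target_return are tensors):
    policy gradient for the actor, squared error for the critic, and an entropy bonus."""
    advantage = target_return - value                           # A = R - V(s)
    log_probs = torch.log_softmax(logits, dim=-1)
    policy_loss = -log_probs[action] * advantage.detach()       # actor: maximize log pi(a|s) * A
    value_loss = advantage.pow(2)                               # critic: regress V(s) toward R
    entropy = -(torch.softmax(logits, dim=-1) * log_probs).sum()
    return policy_loss + 0.5 * value_loss - entropy_coef * entropy
```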
Advantage Actor Critic (A2C)
A2C (“OpenAI Baselines,” 2017) is a synchronous and deterministic variation of A3C. It waits for all agents to finish their segment of experience and then updates by averaging across all the agents. It performs better than A3C.
Soft Actor Critic (SAC)
SAC was introduced by Haarnoja et al. (Haarnoja et al., 2018).
SAC uses an actor-critic architecture, works off-policy and includes entropy maximization to encourage stability and exploration. It is applicable to continuous states and action spaces.
We consider an infinite-horizon Markov Decision Process (MDP) with a continuous state space and action space, a transition probability of the next state given the present state and action, and a reward function.
We maximize an expected sum of future rewards augmented by an expected policy entropy term to favor stochastic policies and more exploration. The entropy is weighted by a temperature parameter that determines its relative importance relative to the reward.
A soft policy iteration algorithm alternates between a soft policy evaluation step and a soft policy improvement step until convergence. Repeated application of soft policy iteration converges to an optimal policy that has the highest Q values among all policies.
SAC is a practical approximation of the soft policy iteration. Function approximators based on neural networks are used for the Q-function and the policy. Both neural networks are optimized by stochastic gradient descent.
The parametrized state value function V(s) approximates the soft value. A parametrized state-action function Q(s,a) approximates the soft Q value. A parametrized policy function π(a|s) approximates the soft policy function.
The value function's parameter vector minimizes a squared residual error term. The Q-value parameter vector minimizes a soft Bellman residual. To update the Q value, a target value network V(s) is used, whose parameter vector is an exponentially moving average of the value network's parameters. The policy parameter vector minimizes an expected KL divergence term.
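A minimal sketch of the two targets just described; these are plain scalar formulas (they work on floats or elementwise on tensors), with the temperature parameter written explicitly.

```python
def soft_value_target(q_value, log_prob, temperature=0.2):
    """Target for the value network: V(s) ~ Q(s,a) - temperature * log pi(a|s),
    with the action a sampled from the current policy."""
    return q_value - temperature * log_prob

def soft_q_target(reward, next_value, done, gamma=0.99):
    """Target for the soft Bellman residual: Q(s,a) ~ r + gamma * V_target(s')."""
    return reward + gamma * (1.0 - done) * next_value
```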
References
Haarnoja, T., Tang, H., Abbeel, P., Levine, S., 2017. Reinforcement Learning with Deep Energy-Based Policies. arXiv:1702.08165.
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S., 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv:1801.01290.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2019. Continuous control with deep reinforcement learning. arXiv:1509.02971.
Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T.P., Harley, T., Silver, D., Kavukcuoglu, K., 2016. Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602.
OpenAI Baselines: ACKTR & A2C, 2017. OpenAI. https://openai.com/blog/baselines-acktr-a2c/ (accessed 11.13.20).
Schaul, T., Quan, J., Antonoglou, I., Silver, D., 2016. Prioritized Experience Replay. arXiv:1511.05952.
Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P., 2017a. Trust Region Policy Optimization. arXiv:1502.05477.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017b. Proximal Policy Optimization Algorithms. arXiv:1707.06347.
Sutton, R.S., Barto, A.G., 2018. Reinforcement Learning: An Introduction, second edition. MIT Press (Adaptive Computation and Machine Learning series).
Wang, R., Foster, D.P., Kakade, S.M., 2020. What are the Statistical Limits of Offline RL with Linear Function Approximation? arXiv:2010.11895.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., de Freitas, N., 2016. Dueling Network Architectures for Deep Reinforcement Learning. arXiv:1511.06581.
Watkins, C.J.C.H., Dayan, P., 1992. Q-learning. Mach. Learn. 8, 279–292. https://doi.org/10.1007/BF00992698
Williams, R.J., 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8, 229–256. https://doi.org/10.1007/BF00992696