Reinforcement Learning and game playing: the big picture of the method, and information backpropagation in iterations
Reinforcement Learning (RL) has been a hot topic since AlphaGo beat human Go players. It is very interesting and (together with Deep Learning) has been considered 'the path to artificial intelligence'. However, it is also difficult at first glance, because it comes with a lot of new concepts (state, value, action, reward, policy, etc.), and the logic of the method is different from what we are familiar with (supervised or unsupervised learning).
In this blog, I'll give you a simple introduction to RL by covering:
- The big picture of RL, compared with supervised learning
- The key components in Reinforcement Learning
- Information backpropagation in iterative methods
- Flappy bird and rocket landing projects (link)
1. Big picture of Reinforcement Learning
Let's start with something we are familiar with: classification in supervised learning!
Suppose we are trying to identify dogs and cats from a bunch of inputs with extracted features (not raw images). I deliberately start with extracted features because, in this simple case, you can see the basic structure of RL. We will mention DQN, which can read raw pixels, in a moment.
Figure 1. Classification in supervised learning.
In Figure 1, we have a traditional classification setup, which requires labeled data and extracted features to train a classification model (decision tree, SVM, random forest, neural nets, etc.). The model gives us a mapping from 'Input Features' to 'Labels', and we use this mapping to predict the labels of target features whose labels are unknown.
In Reinforcement Learning, we don't have 'Labeled Data', at least not at first.
Figure 2. Reinforcement Learning picks the action (or actions) that maximize the accumulated rewards - the 'Value'. In stochastic cases like gambling, we need to maximize E(Value). We'll come back to the details later.
We have four more differences. First, the mapping from 'Input Features' to 'Labels' is replaced by a mapping from 'Environment States' to 'Actions'. OK, only the names have changed. Second, 'Rewards' for the actions and states are introduced to measure the results of the actions. Third, a single action or a series of actions is needed to change the environment state from the 'Start State' to the 'Terminal State'. Fourth, what we want is the mapping from 'Environment States' to 'Actions' that maximizes the sum, or the discounted sum, of the rewards (call it the 'Value').
Before jumping into details like state transitions, discounted rewards, or policy iteration, let's assume we know how to get the right 'Actions' and continue with the 'Big Picture'.
Now suppose we have already found the best actions, which lead to the maximum Value, for the environment states we have seen. What about the environment states we haven't seen?
OK, let me rephrase: for some environment states, we have found the actions we want to take; for other environment states, we don't have them yet, and we want to find what actions we should take. Sounds familiar? YES, this is where supervised learning, or function approximation, comes in.
Figure 3. Once we find the actions for existing environment states, we can use them to approximate and predict the actions for new environment states. The actions for environment states 1-4 are known (A1-A4). The action for environment state 99 is unknown. Finding Action99 is a supervised learning problem.
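To make Figure 3 concrete, here is a minimal sketch that treats the known (state, action) pairs as labeled data and fits a classifier to predict the action for an unseen state. The feature vectors, the action labels, and the choice of a nearest-neighbour classifier are all illustrative assumptions, not the actual setup of any project in this blog.

```python
# A minimal sketch: approximate the state -> action mapping with supervised learning.
from sklearn.neighbors import KNeighborsClassifier

known_states  = [[0.1, 0.9], [0.4, 0.7], [0.8, 0.2], [0.9, 0.1]]   # states 1-4 (made-up features)
known_actions = ["A1", "A2", "A3", "A4"]                           # their known best actions

model = KNeighborsClassifier(n_neighbors=1)
model.fit(known_states, known_actions)                             # 'train' on the known pairs

state_99 = [[0.82, 0.18]]                                          # an unseen environment state
print(model.predict(state_99))    # -> ['A3'], the action of the most similar known state
```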
Up to now, we have been using engineered or extracted features as the input. It takes a lot of time to find and tune these features, and the learned features are usually not transferable, which means that in another RL case we need to find a new set of features. DQN is a method that can extract features from images automatically. When we have images as the input, we can use DQN to read the raw pixels, let deep neural nets (CNNs, etc.) extract the features, then find the best actions corresponding to these features, and approximate the actions for unknown environment states using the features.
Figure 4. With DQN (deep learning + Q-learning), we can read raw pixels and extract features automatically (deep learning), use Q-learning to find the actions corresponding to these environment states (features), and then approximate and predict the actions in new environment states.
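As a rough sketch (not the full DQN algorithm, which also needs experience replay and a target network), the network part of Figure 4 can look like the PyTorch module below: a small CNN reads raw pixels and outputs one Q value per action. The input shape (4 stacked 84x84 frames) and the layer sizes follow common DQN setups and are assumptions here.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Raw pixels in, one Q value per action out."""
    def __init__(self, n_actions=3):                 # e.g. up-forward / forward / down-forward
        super().__init__()
        self.features = nn.Sequential(               # deep learning part: extract features from pixels
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(                   # Q-learning part: features -> Q values
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, pixels):                       # pixels: (batch, 4, 84, 84)
        return self.head(self.features(pixels))

q_values = QNetwork()(torch.zeros(1, 4, 84, 84))     # a dummy batch of blank frames
print(q_values.shape)                                # torch.Size([1, 3])
```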
That is the 'Big Picture' of reinforcement learning.
2. The Components in Reinforcement Learning
In the big picture, we left one important part out: how do we find the best actions, the ones resulting in the highest 'Value'? It is easy in a single-step scenario: just pick the action leading to the maximum value, or the maximum expected value (for stochastic cases).
When multiple actions are needed, things become more complicated. How can we find the best series of actions?
Let's ask this question in the context of flappy bird:
Figure 5. The bird starts at state (1,1). It has to fly through the gap between the pipes in 4 steps. In each step, it has three choices: up-forward, forward, or down-forward. If it passes through the gap, the bird gets a +1 reward. Crashing into a pipe or dropping out of the boundary gives a -1 punishment.
Our goal is to teach the bird to fly through the gap between the two pipes in 4 steps.
2.1 State transition
The future state depends on the current state and the action the bird picks. The bird can fly up-forward, forward, or down-forward in each step. So, from state (2,2), it can reach (1,3), (2,3), or (3,3) in one step.
The property that 'the future state depends only on the current state' is called the Markov property. Don't panic at this big name. There are a lot of fancy names ('irreducible', 'ergodic', and so on) in stochastic processes, and getting familiar with them won't help you understand RL. For now it is just a name, and it only means: given the current state, the bird can reach the adjacent states (not all states) by taking one action. That's it.
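To make this concrete, here is a minimal sketch of the deterministic transition for the grid in Figure 5. The (row, column) layout and the row change per action are my reading of the figure.

```python
# Every action moves the bird one column forward and changes the row by -1, 0, or +1.
ACTIONS = {"up-forward": -1, "forward": 0, "down-forward": +1}

def step(state, action):
    """Markov property in action: the next state depends only on the
    current state and the chosen action, nothing else."""
    row, col = state
    return (row + ACTIONS[action], col + 1)

print(step((2, 2), "up-forward"))    # (1, 3)
print(step((2, 2), "forward"))       # (2, 3)
print(step((2, 2), "down-forward"))  # (3, 3)
```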
2.2 Policy
What I care about is: which action to take at each environment state, or which environment state I should move to in the next step. This is called the policy. It doesn't matter whether I have a smooth model describing the policy over the entire set of environment states; I can simply make a table listing all possible environment states and make decisions based on that table.
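For example, a table-based policy can be as simple as a dictionary from state to action (the entries below are made up for illustration, not a learned policy):

```python
# A policy as a plain lookup table: state -> action.
policy = {
    (1, 1): "down-forward",
    (2, 2): "forward",
    (1, 3): "down-forward",
    # ... one entry per environment state we care about
}

def act(state):
    return policy.get(state, "forward")   # fall back to some default action

print(act((2, 2)))                        # 'forward'
```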
2.3 Value and Discounted Rewards
For flappy bird, the bird gets a '+1' reward for passing the pipe, or a '-1' punishment for crashing or flying out of the boundary. These rewarding or punishing moments are terminal states. For non-terminal states, it gets very little or no reward.
The discounted rewards and their sum (the 'Value') are especially useful for multi-step scenarios like flappy bird.
For multi-step problems, you usually won't know the final reward until you hit the terminal state. So how do we measure the result of the decisions made in the earlier steps? The discounted reward is used to record and backpropagate the final 'Big Reward' along the path of state-action pairs, and to build up a reference for decision making in these steps, with a smaller impact on states further away. I'll talk about the mechanism of reward backpropagation in Section 3.
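As a small sketch, the 'Value' of one path is just the discounted sum of its rewards. With a flappy-bird-style reward of +1 only at the final step (and a discount of 0.9, an assumed value), states further from the terminal state feel a smaller impact:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards along a path, each discounted by gamma per step."""
    value, discount = 0.0, 1.0
    for r in rewards:
        value += discount * r
        discount *= gamma
    return value

# no reward until the terminal step, then +1
print(discounted_return([0, 0, 0, 1]))   # 0.9**3 = 0.729
```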
2.4 'Best Policy' and its 'Value'
Instead of thinking forwards, let's first get through the game from the end to the start.
Figure 6. Playing flappy bird backwards. (a,c,e,g) are the policy maps; (b,d,f,h) are the corresponding value maps. Each arrow in a policy map represents an action. The numbers in a value map are the values for its corresponding policy map. The 'Best Policy' map is derived from the value map by picking the action leading to the maximum value, and the value map is calculated directly by following the current policy map. The discount for the rewards is 'a'.
In Figure 6(a), which is the last step, if the little bird is in state (2,4), (3,4), or (4,4), it can fly through the gap and get the '+1' reward by flying down-forward, forward, or up-forward, respectively. From state (1,4), however, the bird cannot fly through the gap at all; it will get the '-1' punishment no matter which action it chooses.
Figure 6(b) summarizes the value of each state under its best policy for the last step. Based on Figure 6(b) (plus the -1 punishment for flying out of the boundary), we can derive the best policy for the second-to-last step (Figure 6(c)). For example, in state (1,3) of Figure 6(c), the bird can fly to (0,4) - which is out of bounds - (1,4), or (2,4). By checking the values in Figure 6(b), it knows that flying to (0,4) or (1,4) will lead to a punishment, but (2,4) will lead to a reward. So the bird decides to fly from (1,3) to (2,4) based on the value map 6(b).
We can get the entire 'Best Policy' map 6(g) by alternating between 'Policy' and 'Value' from the end to the start. The reward in the final step is discounted and propagated back to the start states.
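Here is a compact sketch of that backward sweep. The grid layout (rows 1-4 as heights, columns 1-4 as flight steps, the pipe at column 5 with its gap at row 3, rows 0 and 5 out of bounds) and the discount value 0.9 are my assumptions based on Figures 5 and 6.

```python
GAMMA = 0.9                                        # the discount 'a' in Figure 6 (assumed value)
ACTIONS = {"up-forward": -1, "forward": 0, "down-forward": +1}

def outcome(row, col):
    """Reward and 'terminal?' flag for landing on cell (row, col)."""
    if row < 1 or row > 4:                         # flew out of the boundary
        return -1.0, True
    if col == 5:                                   # reached the pipe column
        return (1.0, True) if row == 3 else (-1.0, True)
    return 0.0, False                              # ordinary cell, no reward yet

V = {(row, col): 0.0 for row in range(1, 5) for col in range(1, 5)}
policy = {}
for col in range(4, 0, -1):                        # sweep backwards: last step first
    for row in range(1, 5):
        best_a, best_q = None, float("-inf")
        for name, drow in ACTIONS.items():
            nrow, ncol = row + drow, col + 1
            rew, done = outcome(nrow, ncol)
            q = rew + (0.0 if done else GAMMA * V[(nrow, ncol)])
            if q > best_q:
                best_a, best_q = name, q
        policy[(row, col)], V[(row, col)] = best_a, best_q

print(policy[(1, 3)], round(V[(1, 3)], 2))         # 'down-forward' 0.9, matching the text above
```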
This example may help you understand 'Policy' and 'Value'. However, this is not the way we actually solve RL problems.
First of all, in most cases we need to solve the problem forwards, from the start to the end. For example, when you play a game, you play it from the beginning to the end. You may fail several times, learn how to avoid the traps, and then pass. You won't know how to pass at step 1, unless you cheat.
Secondly, the 'Policy' map derived here at each time step is the 'Best Policy' you may eventually get, not the one you will get in the first few runs (maybe hundreds or thousands) of playing forwards. The same goes for the 'Value' map.
Here comes the question: how do we get the 'Best Policy' map and the corresponding 'Value' map by playing forwards?
The answer is: Iteration.
2.5 Iteration
I think iteration is the key to Reinforcement Learning. In Sutton's textbook, a lot of iterative methods appear in a chapter named 'Dynamic Programming'. With all due respect, I would rather call that chapter 'Iterative Methods' instead of 'DP'. In fact, whether it is value iteration, policy iteration, temporal difference, or Q-learning, they are all iterative methods.
What is an iterative method? It uses the current approximate solution to calculate an improved approximate solution. We use iterative methods a lot when solving systems of linear equations and differential equations numerically.
Let's see an example of an iterative method for solving linear equations.
Iterative methods for solving systems of linear equations
Given a system of linear equations Ax = b, the non-iterative approach is to calculate the inverse of matrix A and multiply it by b: x = A^(-1) * b. However, calculating the inverse of a matrix is computationally intensive, especially for a large matrix.
To avoid calculating the inverse of A, an iterative method generates an approximate solution by updating an initial guess repeatedly. For example, take the following system of linear equations:
When [b1, b2, b3] = [10,18,-21], the analytical solution is [x1,x2,x3] = [1, 2, 3].
To use the iterative method, we transform the equations to:
Or more precisely, the following iterative form:
Given the initial guess [0,0,0] and updating the solution with the iterative form above again and again, we approximate the exact solution in fewer than 10 iterations.
Figure 7. Solving the equations with the iterative method. Starting from [0,0,0], it converges to the analytical solution [1,2,3] after 7 iterations.
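Below is a minimal Jacobi-style sketch of the iteration shown in Figure 7. Since the original 3x3 system appears only in a figure, the matrix A here is a hypothetical, diagonally dominant stand-in chosen so that A multiplied by [1, 2, 3] gives [10, 18, -21]; the exact number of iterations needed therefore depends on the matrix.

```python
import numpy as np

# Hypothetical stand-in for the system in the figure: A @ [1, 2, 3] = [10, 18, -21].
A = np.array([[5.0, 1.0, 1.0],
              [1.0, 7.0, 1.0],
              [1.0, 1.0, -8.0]])
b = np.array([10.0, 18.0, -21.0])

x = np.zeros(3)                       # initial guess [0, 0, 0]
D = np.diag(A)                        # diagonal entries of A
R = A - np.diagflat(D)                # off-diagonal part of A
for k in range(10):
    x = (b - R @ x) / D               # update every component from the previous x
    print(k + 1, np.round(x, 4))      # converges toward [1, 2, 3]
```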
The iteration between 'Value' and 'Policy': value -> policy -> value -> policy -> ...
In general, a similar kind of iterative method is used to update the initial guesses of the 'Policy' and the 'Value' repeatedly.
The iterative method starts with an all-zero value map and the 'Policy' map derived from it by picking, at each state, the action leading to the maximum value. After a round of iterative updating, the approximate value map changes. Once the 'Value' map is updated, the best action at a state may no longer be the same as in the previous round; as a result, the approximate 'Policy' map is recalculated and updated. With the new 'Policy' map, a new round of value updates is carried out. In this way, the 'Value' and 'Policy' maps are updated iteratively.
When the new 'Value' map is the same as the old one, or the change in the 'Value' map no longer affects the choice of actions in the 'Policy' map at any environment state, we stop updating the 'Value' and 'Policy' maps. We say the iteration has reached its fixed point, and the result is the optimal 'Policy' map.
Figure 8. Concept illustration of value and policy iteration, from Sutton and Barto (2017). (v, pi) are the initial value map and policy map. v_pi is the value map for policy pi. pi = greedy(v) is the policy map whose action at any state is the one leading to the state with the maximum value. v_star and pi_star are the optimal value map and policy map, which are also the fixed point of the iteration process.
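A compact sketch of this alternation (one evaluation sweep under the current policy, then a greedy improvement, repeated until nothing changes) is below. The tiny deterministic MDP is made up for illustration, not the flappy bird game; P[s][a] gives (next state, reward) and 'T' is the terminal state.

```python
GAMMA = 0.9
P = {
    "s1": {"a": ("s2", 0.0), "b": ("s3", 0.0)},
    "s2": {"a": ("T", 1.0), "b": ("s3", 0.0)},
    "s3": {"a": ("T", -1.0), "b": ("T", -1.0)},
}

V = {s: 0.0 for s in P}
V["T"] = 0.0
policy = {s: "a" for s in P}                     # initial guess of the policy map

for _ in range(20):                              # alternate until the maps stop changing
    # value update: one evaluation sweep under the current policy
    for s in P:
        ns, r = P[s][policy[s]]
        V[s] = r + GAMMA * V[ns]
    # policy update: make the policy greedy with respect to the new values
    for s in P:
        policy[s] = max(P[s], key=lambda a: P[s][a][1] + GAMMA * V[P[s][a][0]])

print(policy)                                    # fixed point: {'s1': 'a', 's2': 'a', 's3': 'a'}
print({s: round(v, 2) for s, v in V.items()})    # {'s1': 0.9, 's2': 1.0, 's3': -1.0, 'T': 0.0}
```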
What I want to emphasize is the information backpropagation that happens while updating the 'Value' map.
3. Information backpropagation in iterative methods
When playing backwards, I demonstrated how the rewards propagate back to the starting steps. In this part, let's take a look at how the rewards backpropagate when we use iterative methods and play forwards.
3.1 Information backpropagation
We'll use Q-learning as an example.
Q-learning:
Figure 9. The Q-learning algorithm, from Y. Li (2017). Q(s,a) is the action-value function, r is the reward, and gamma is the discount rate. When updating the current Q(s,a), the algorithm uses the current versions of Q(s,a) and Q(s',a') to iteratively obtain the next version of Q(s,a).
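The core of that algorithm is the standard one-step Q-learning update, where alpha is the learning rate that controls how far each update moves the old estimate toward the new target:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$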
In Q-learning, the agent:
- Finds the action it wants to take and the next environment state it moves to (using the epsilon-greedy strategy; I'll talk about it in the next section), based on the current policy/value map.
- Uses the current version of Q(s,a), the instant reward R, and Q(s',a') to get an updated estimate of Q(s,a) (iterative updating). State s' is the successor of state s.
- Repeats steps 1 and 2 until it hits a terminal state. From the starting state to the terminal state, this chain of transitions builds up an iteration path (s,a) -> (s',a') -> (s'',a'') -> (s''',a''') -> ... (a code sketch follows Figure 10).
Figure 10. State transition and value update in Q-learning. Q(s,a) is the action-value function, representing the value of choosing action (a) at state (s). s, s', s'', s''' are environment states; a, a', a'', a''' are actions. The definition of Q(s,a) can be found in Sutton and Barto (2017).
After it hits a terminal state, the algorithm starts from another initial state and updates the value map along the new path, repeating this over and over, thousands of times.
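Putting the three steps together, here is a minimal tabular Q-learning sketch on the flappy-bird grid. It reuses the layout assumptions from the earlier snippets, and the values of alpha, gamma, and epsilon are chosen for illustration.

```python
import random
from collections import defaultdict

ACTIONS = [-1, 0, +1]                  # row change: up-forward, forward, down-forward
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1      # learning rate, discount, exploration rate
Q = defaultdict(float)                 # Q[(state, action)], all zeros at the start

def outcome(row, col):
    """Reward and 'terminal?' flag for landing on cell (row, col)."""
    if row < 1 or row > 4:
        return -1.0, True                                   # out of the boundary
    if col == 5:
        return (1.0, True) if row == 3 else (-1.0, True)    # gap vs. pipe
    return 0.0, False

for episode in range(5000):
    s = (random.randint(1, 4), 1)                           # start somewhere in column 1
    done = False
    while not done:
        # step 1: epsilon-greedy action choice from the current Q estimates
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = (s[0] + a, s[1] + 1)
        r, done = outcome(*s2)
        # step 2: one-step Q-learning update of Q(s,a) using r and Q(s',a')
        target = r + (0.0 if done else GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2                                              # step 3: keep going until terminal

print(round(Q[((1, 3), +1)], 2))       # down-forward from (1,3) toward the gap: approx 0.9
```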
Then how does information flow backwards?
When the bird moves from s to s', the information (rewards and approximate values) moves from (s') to (s) (Figure 10). However, in the flappy bird case there are no rewards at the earlier states/actions - only a '+1' reward at the terminal state for passing through the gap, and a '-1' for crashing or flying out of the boundary. As a result, in the first round of iterations there is no value update at all until the bird hits a terminal state, and the only state/action whose value gets updated is the one a single step before the terminal state.
When the bird starts again on a new path (consisting of a sequence of updates) and hits the previously updated state (call it state U1), the state just before U1 along the path gets its value updated. After hitting a terminal state, it starts yet another path.
In this way, the information (rewards) at the terminal states is slowly backpropagated to the earlier states, one step further back per path.
Figure 11. Information backpropagation along the iterative updating path. The cyan blocks in the PATH maps represent the iterative updating path. The second row shows the value maps (V0 - V4). The value maps show how the iterative path draws in the impact of the rewards at the terminal states, little by little, and propagates it backwards along the path.
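Here is a tiny numerical sketch of the 'one step further back per path' effect: a 5-state corridor with a single +1 reward at the end, updated with the same bootstrapped rule (the learning rate is set to 1 to make the pattern easy to read).

```python
GAMMA, ALPHA = 0.9, 1.0
V = [0.0] * 5                          # V[4] is the terminal state (its value stays 0)
for path in range(1, 5):
    for s in range(4):                 # walk forward along the path s -> s+1
        r = 1.0 if s + 1 == 4 else 0.0
        V[s] += ALPHA * (r + GAMMA * V[s + 1] - V[s])
    print(path, [round(v, 2) for v in V[:4]])
# path 1: [0.0, 0.0, 0.0, 1.0]   only the state next to the terminal one is updated
# path 2: [0.0, 0.0, 0.9, 1.0]   the reward's influence moves one state back
# path 3: [0.0, 0.81, 0.9, 1.0]
# path 4: [0.73, 0.81, 0.9, 1.0]
```

Each pass pushes the terminal reward's influence exactly one more state back along the path, which is the pattern Figure 11 illustrates.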
After a large number of iterations, the value map has gathered enough information from the environment (and from the terminal states, of course). When further iterations no longer update the approximate value map, or no longer change the policy map, we have reached the fixed point and obtained the optimal policy.
3.2 Explore vs. Exploit
One key strategy for generating the paths is the epsilon-greedy strategy. When picking an action, the epsilon-greedy strategy chooses the action with the highest value with probability 1-epsilon, and picks one of the other, non-optimal actions with probability epsilon. It thus does what the textbook and David Silver describe: balance exploration (trying more actions or states, even if they are not the optimal choice in the short term, so that you may find a better solution overall) and exploitation (using actions already proven to be useful, to reduce futile search in a large space).
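In code, the epsilon-greedy choice is just a biased coin flip between exploring and exploiting. A minimal sketch, assuming Q is a dictionary keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                              # explore: try something else
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit: best known action
```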
With the epsilon-greedy strategy, Q-learning builds up paths in the environment state space, lets the rewards at the terminal states backpropagate along these paths, and improves the approximation of the value map, which in turn improves the approximation of the policy map.
4. Code for flappy bird and rocket landing
The projects for flappy bird and rocket landing are on my GitHub RL project.
The flappy bird project is a simplified RL project. In its code, I didn't use Q-learning; I only distribute the final rewards along the flight path. The Q-learning and DQN versions of flappy bird are here.
For the rocket landing project, the code is forked and modified from llSourcell. Siraj has a lot of wonderful videos on AI and machine learning, but the code in his repository had some small issues, and you can't run it without modifications. Also, you need to install several tools before running the model. The rocket landing project is fancier and is built on OpenAI's gym environment. There are other projects built in the gym environment; I'll try them later.
Thanks for reading such a long blog and I appreciate your feedback.
THE END.
PS1. So far, the most successful applications of RL have been in playing games. Although it is claimed to be used in robot control simulations, you may be a little surprised to hear that Boston Dynamics hasn't applied any RL, or Machine Learning at all, in their robot dogs.
PS2. The information propagation from terminal states to earlier states looks remarkably similar to the propagation of boundary conditions when solving partial differential equations (PDEs) with finite difference methods. In a PDE, the information in the initial condition and the boundary condition helps determine the solution of the governing equation. In the discretized scheme of the PDE, only when the inner points have received enough information from the initial and boundary conditions, propagated iteratively, can the numerical solution mimic the behavior of the real PDE in both the temporal and spatial dimensions.
PS3. The flappy bird example is a highly simplified Reinforcement Learning case. Stochastic cases like gambling are more complicated.
Reference:
Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction. https://doi.org/10.1109/TNN.1998.712192
Silver, D. (n.d.). UCL Course on Reinforcement Learning. Retrieved from https://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html
Li, Y. (2017). Deep Reinforcement Learning: An Overview, 1–70. https://doi.org/10.1007/978-3-319-56991-8_32