Reinforcement Learning and game playing: the big picture of the method, and information backpropagation in iterations

Reinforcement Learning (RL) has been a hot topic since AlphaGo beat top human Go players. It is fascinating and (together with Deep Learning) has been called 'the path to artificial intelligence'. However, it can also be difficult at first glance, because it comes with a lot of new concepts (state, value, action, reward, policy, etc.), and the logic of the method is different from what we are already familiar with (supervised or unsupervised learning).

In this blog, I'll give you a simple introduction to RL by

  1. comparing it with the big picture of supervised learning
  2. walking through the key components of Reinforcement Learning
  3. explaining information backpropagation in iterative methods
  4. pointing to the flappy bird and rocket landing projects (link)

1. Big picture of Reinforcement Learning

Let's start with something we are familiar with: classification in supervised learning!

Suppose we are trying to identify dogs and cats from a bunch of inputs with extracted features (not raw images). I deliberately start with extracted features because, in this simple case, you can see the basic structure of RL. We will mention DQN, which can read raw pixels, in a moment.

Figure 1. Classification in supervised learning.

In Figure 1, we have a traditional classification case, which requires labeled data and extracted features to train a classification model (decision tree, SVM, random forest, neural nets, etc.). This gives us a mapping from 'Input Features' to 'Labels', and we use this mapping to predict the labels of target features whose labels are unknown.
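
To make Figure 1 concrete, here is a minimal sketch of that setup. The features, labels, and the choice of a decision tree are all illustrative assumptions, not taken from the post:

```python
# Minimal supervised-learning sketch for Figure 1 (illustrative features and labels).
from sklearn.tree import DecisionTreeClassifier

# Extracted features (say, weight in kg and ear length in cm) with known labels
X_train = [[4.0, 6.5], [30.0, 10.0], [3.5, 7.0], [25.0, 9.0]]
y_train = ["cat", "dog", "cat", "dog"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Target features whose label is unknown: the learned mapping predicts it
print(model.predict([[28.0, 9.5]]))   # expected to print ['dog']
```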

In Reinforcement Learning, we don't have 'Labeled Data', at least at first.

Figure 2. Reinforcement Learning picks the action(s) that maximize the accumulated rewards, the 'Value'. In stochastic cases like gambling, we need to maximize E(Value). We'll come back to the details later.

There are four more differences. First, the mapping from 'Input Features' to 'Labels' is replaced by a mapping from 'Environment States' to 'Actions'. OK, it's only the names that have changed. Second, 'Rewards' for the actions and states are introduced to measure the results of the actions. Third, a single action or a series of actions is needed to change the environment state, from the 'Start State' to the 'Terminated State'. Fourth, what we want is to find the mapping from 'Environment States' to 'Actions' that maximizes the sum, or the discounted sum, of the rewards (call it the 'Value').

Before jumping into details like state transitions, discounted rewards, or policy iteration, let's assume we know how to get the right 'Actions' and continue with the 'Big Picture'.

Now suppose we have already found the best actions, the ones leading to the max Value, for the environment states we have seen. What about the environment states we haven't seen?

OK, let me rephrase: for some environment states, we have found the actions we want to take; for other environment states, we don't have them yet and want to find which actions we should take. Sounds familiar? YES, this is where supervised learning, or function approximation, comes in.

Figure 3. Once we have found the actions for known environment states, we can use them to approximate and predict actions for new environment states. The actions for environment states 1-4 are known (A1-A4). The action for environment state 99 is unknown. Finding Action99 is a supervised learning problem.
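
Here is a minimal sketch of what Figure 3 describes, with made-up state features and a nearest-neighbour classifier standing in for whatever function approximator you prefer:

```python
# Sketch of Figure 3: known (state -> best action) pairs are used as labeled data,
# and a classifier generalizes to a new, unseen environment state.
from sklearn.neighbors import KNeighborsClassifier

known_states = [[1, 1], [2, 2], [3, 3], [4, 4]]   # environment states 1-4 (illustrative features)
known_actions = ["A1", "A2", "A3", "A4"]          # their known best actions

model = KNeighborsClassifier(n_neighbors=1).fit(known_states, known_actions)

state_99 = [[3, 4]]                               # an unseen environment state
print(model.predict(state_99))                    # the predicted Action99
```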

So far, we have been using engineered or extracted features as the input. It takes a lot of time to find and tune these features, and they are usually not transferable, which means that for another RL problem we need to find a new set of features. DQN is a method that can extract features from images automatically. When we have images as the input, we can use DQN to read the raw pixels and let a deep neural net (a CNN, etc.) extract the features, then find the best actions corresponding to these features, and approximate unknown environment states using the features.

Figure 4. With DQN (deep learning + Q-learning), we can read raw pixels and extract features automatically (deep learning), then use Q-learning to find the actions corresponding to these environment states (features), and approximate and predict the actions in new environment states.
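
The sketch below shows the general shape of such a DQN-style Q-network in PyTorch. The layer sizes, the 4-frame 84x84 input, and the two actions are illustrative assumptions, not the post's exact architecture:

```python
# Minimal sketch of a DQN-style Q-network: a CNN extracts features from raw pixels,
# and a linear head outputs one Q-value per action.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # input: 4 stacked 84x84 frames
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),                  # 9x9 feature map after the convs
            nn.Linear(256, n_actions),
        )

    def forward(self, pixels):
        return self.head(self.features(pixels))                    # Q-values, one per action

q_net = QNetwork(n_actions=2)                                       # e.g. flappy bird: flap or glide
q_values = q_net(torch.zeros(1, 4, 84, 84))                         # a dummy batch of raw pixels
best_action = q_values.argmax(dim=1)                                # pick the action with max Q
```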

That is the 'Big Picture' of reinforcement learning.

2. The Components in Reinforcement Learning

In the big picture, we left out one important part: how do we find the best actions, the ones resulting in the highest 'Value'? This is easy in a single-step scenario: just pick the action leading to the max value, or the max expected value (for stochastic cases).
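
For the single-step case, that choice is literally one line of code (the numbers below are made up for illustration):

```python
# Single-step decision: pick the action with the highest (expected) reward.
expected_reward = {"up-forward": 0.2, "forward": 0.7, "down-forward": -0.5}   # illustrative values
best_action = max(expected_reward, key=expected_reward.get)
print(best_action)   # 'forward'
```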

When multiple actions are needed, things become more complicated. How can we find the best series of actions?

Let's ask this question in the context of flappy bird:

Figure 5. The bird starts at state (1,1). It has to fly through the gap between the pipes in 4 steps. At each step, it has three choices: up-forward, forward, or down-forward. When it passes the gap, the bird gets a +1 reward. Crashing into a pipe or dropping out of bounds gives a -1 punishment.

Our goal is to teach the bird to fly through the gap between the two pipes in 4 steps.

2.1 State transitions

The future state depends on the current state and which action the bird picks. The bird can fly up-forward, forward, or down-forward at each step. So, from state (2,2), it can reach (1,3), (2,3), or (3,3) in one step.
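
Here is a tiny sketch of these transitions on the Figure 5 grid, using (row, column) coordinates as an assumed encoding of the states:

```python
# Sketch of the transitions in Figure 5: the bird always moves one column forward and,
# depending on the action, one row up, straight, or one row down.
ACTIONS = {"up-forward": -1, "forward": 0, "down-forward": +1}

def step(state, action):
    row, col = state
    return (row + ACTIONS[action], col + 1)

print(step((2, 2), "up-forward"))    # (1, 3)
print(step((2, 2), "forward"))       # (2, 3)
print(step((2, 2), "down-forward"))  # (3, 3)
```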

The property that 'the future state depends only on the current state' is called the Markov property. Don't panic at this big name. There are a lot of fancy names ('irreducible', 'ergodic', and so on) in stochastic processes, but memorizing them won't help you understand RL. For now, the Markov property just means: given the current state, the bird can reach the adjacent states (not all states) by taking one action. That's it.

2.2 Policy

What I care more about is: which action to take in each environment state, or which environment state to move to in the next step. This is called the policy. It doesn't matter whether I have a smooth model describing the policy over the entire space of environment states; I can simply make a table listing all possible environment states and base my decisions on that table, as sketched below.
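
For example, a table-based policy can be as simple as a dictionary (the entries below are illustrative):

```python
# A policy as a lookup table: environment state -> action.
policy = {
    (1, 1): "down-forward",
    (2, 2): "forward",
    (1, 3): "down-forward",
}

def act(state):
    return policy[state]          # decide by table lookup, no smooth model needed

print(act((2, 2)))                # 'forward'
```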

2.3 Value and Discounted Rewards

The flappy bird gets a '+1' reward for passing the pipes, or a '-1' punishment for crashing or flying out of bounds. These rewarding or punishing moments are the terminated states. Non-terminated states carry very little or no reward.

Discounted rewards and their sum (the 'Value') are especially useful in multi-step scenarios like flappy bird.

For multi-step problems, you usually won't know the final reward until you hit the terminated state. How do we measure the result of the decisions made in the previous steps? Discounted rewards are used to record and backpropagate the final 'big reward' along the path of state-action pairs, building up a reference for decision making at those steps, with smaller impact on states farther from the end. I'll talk about the mechanism of reward backpropagation in Section 3.
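
As a tiny illustration, a 'Value' computed as a discounted sum of rewards looks like this (the discount rate 0.9 is an assumed number, not one from the post):

```python
# Sketch: the 'Value' is the discounted sum of the rewards collected along the way.
def discounted_value(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A flappy-bird-like episode: no reward until the final '+1' at the terminated state
print(discounted_value([0, 0, 0, 1], gamma=0.9))   # 0.9**3 = 0.729
```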

2.4 'Best Policy' and its 'Value'

Instead of thinking forwards, let's first get through the game from the end to the start.

Figure 6. Playing flappy bird backwards. (a,c,e,g) are the policy maps; (b,d,f,h) are the corresponding value maps. Each arrow in a policy map represents an action. The numbers in a value map are the values under its corresponding policy map. The 'Best Policy' map is derived from the value map by picking the action leading to the max value, and the value map is calculated directly by following the current policy map. The discount rate for the rewards is 'a'.

In Figure 6(a), which shows the last step, if the little bird is in state (2,4), (3,4), or (4,4), it can fly through the gap and get the '+1' reward by flying down-forward, forward, or up-forward, respectively. From state (1,4), however, the bird cannot fly through the gap at all; it will get the '-1' punishment no matter which action it chooses.

Figure 6(b) summarizes the value of each state under its best policy for the last step. Based on Figure 6(b) (plus the -1 punishment for flying out of bounds), we can derive the best policy for the second-to-last step (Figure 6(c)). For example, in state (1,3) of Figure 6(c), the bird can fly to (0,4), which is out of bounds, (1,4), or (2,4). By checking the values in Figure 6(b), it knows that flying to (0,4) or (1,4) will lead to a punishment, while (2,4) will lead to a reward. So the bird decides to fly from (1,3) to (2,4), based on the value map in 6(b).

We can get the entire 'Best Policy' map, Figure 6(g), by alternating between 'Policy' and 'Value' from the end to the start. The reward at the final step is discounted and propagated back to the start states. A minimal sketch of this backward pass is given below.
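
The sketch below works through that backward pass in code. The grid encoding (rows 1-4, columns 1-4, the gap at row 3 of the pipe column) and the discount value 0.9 are assumptions made for illustration; the post's Figure 6 simply calls the discount 'a'.

```python
# Backward construction of the 'Best Policy' and 'Value' maps (illustrative encoding of Figure 5/6).
GAMMA = 0.9                                   # assumed discount rate
ROWS, COLS, GAP_ROW = range(1, 5), range(1, 5), 3
MOVES = {"up-forward": -1, "forward": 0, "down-forward": +1}

def terminal_reward(row):
    return 1.0 if row == GAP_ROW else -1.0    # +1 through the gap, -1 for crash / out of bounds

value, policy = {}, {}
for col in reversed(COLS):                    # from the last step back to the start
    for row in ROWS:
        best = None
        for action, d in MOVES.items():
            nrow, ncol = row + d, col + 1
            if ncol == 5:                     # next move ends the episode at the pipe column
                v = terminal_reward(nrow)
            elif nrow not in ROWS:            # flies out of bounds before reaching the pipes
                v = -1.0
            else:
                v = GAMMA * value[(nrow, ncol)]   # no intermediate reward, only discounted future value
            if best is None or v > best[0]:
                best = (v, action)
        value[(row, col)], policy[(row, col)] = best

print(policy[(1, 1)], value[(1, 1)])          # best first action from the start state (1,1) and its value
```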

This example may help you understand 'Policy' and 'Value'. However, this is not the way we actually solve RL problems.

First of all, in most cases we need to solve the problem forwards, from the start to the end. For example, when you play a game, you have to play it from the beginning to the end. You may fail several times, learn how to avoid the traps, and then pass. You won't know how to pass at step 1 unless you cheat.

Secondly, the 'Policy' map derived here at each time step is the 'Best Policy' you may eventually get, not the one you will get in the first few (maybe hundreds or thousands of) forward attempts. The same goes for the 'Value' map.

Here comes the question: how do we get the 'Best Policy' and the corresponding 'Value' map by playing forwards?

The answer is: Iteration.

2.5 Iteration

I think iteration is the key to Reinforcement Learning. In Sutton's textbook, many iterative methods appear in a chapter named 'Dynamic Programming'. With all due respect, I would rather call that chapter 'Iterative Methods' instead of 'DP'. In fact, whether it is value iteration, policy iteration, temporal difference, or Q-learning, they are all iterative methods.

What is an iterative method? It uses the current approximate solution to calculate an improved approximate solution. We use iterative methods a lot when solving systems of linear equations and differential equations numerically.

Let's look at an example of an iterative method for solving linear equations.

Iterative methods for solving systems of linear equations

Given the system of linear equations Ax = b, the non-iterative approach is to calculate the inverse of matrix A and multiply it by b: x = A^(-1) * b. However, calculating the inverse of a matrix is computationally intensive, especially for a large matrix.

To avoid calculating the inverse of A, an iterative method generates an approximate solution by updating an initial guess repeatedly. For example, consider a 3x3 system of linear equations with [b1, b2, b3] = [10, 18, -21] whose analytical solution is [x1, x2, x3] = [1, 2, 3].

To use the iterative method, we rearrange each equation so that it solves for its own unknown in terms of the others, which gives an iterative form of the type

x_i_new = ( b_i - sum of a_ij * x_j_old over j != i ) / a_ii

Given the initial guess [0, 0, 0] and applying this iterative form again and again, the solution approaches the exact one in fewer than 10 iterations.

Figure 7. Solving the equations with the iterative method. Starting from [0,0,0], it converges to the analytical solution [1,2,3] after 7 iterations.
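
Here is a sketch of that iteration in code. The original 3x3 system appears in the post only as an image, so the matrix below is an assumed example chosen to be consistent with b = [10, 18, -21] and the exact solution [1, 2, 3] (it is diagonally dominant, so the iteration converges):

```python
# Jacobi-style iteration for Ax = b (illustrative matrix, chosen so the solution is [1, 2, 3]).
import numpy as np

A = np.array([[5.0, 1.0,  1.0],
              [1.0, 7.0,  1.0],
              [1.0, 1.0, -8.0]])
b = np.array([10.0, 18.0, -21.0])

x = np.zeros(3)                    # initial guess [0, 0, 0]
D = np.diag(A)                     # diagonal entries a_ii
R = A - np.diag(D)                 # off-diagonal part
for k in range(10):
    x = (b - R @ x) / D            # x_i <- (b_i - sum_{j != i} a_ij * x_j) / a_ii
    print(k + 1, x)                # quickly approaches [1, 2, 3]
```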

The iteration for the 'Policy' and 'Value': value -> policy -> value -> policy ...

In RL, a similar kind of iterative method is used to repeatedly update initial guesses of the 'Policy' and the 'Value'.

The iterative method starts with an all-zero value map and the 'policy' map derived from it by picking, at each state, the action leading to the max value. After a round of iterative updating, the approximate value map changes. Once the 'value' map is updated, the best action at a state may no longer be the same as in the previous round, so the approximate 'policy' map is recalculated and updated. With the new 'policy' map, a new round of value updating is carried out. In this way the 'value' and 'policy' maps are updated iteratively.

When the new 'value' map is the same as the old 'value' map, or the change in the 'value' map no longer affects the choice of actions in the 'policy' map for any environment state, the updating stops. We say it has reached the fixed point, and the resulting 'policy' map is the optimal one.
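
The sketch below shows the shape of this value/policy alternation on a made-up 2-state toy problem; the transition table, rewards, and discount are all illustrative, just to make the loop concrete:

```python
# Value <-> policy alternation on a toy deterministic MDP, stopping at the fixed point.
STATES, ACTIONS, GAMMA = ["s0", "s1"], ["left", "right"], 0.9
NEXT = {("s0", "left"): "s0", ("s0", "right"): "s1",
        ("s1", "left"): "s0", ("s1", "right"): "s1"}
REWARD = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
          ("s1", "left"): 0.0, ("s1", "right"): 2.0}

value = {s: 0.0 for s in STATES}                       # initial all-zero value map
policy = {s: ACTIONS[0] for s in STATES}               # arbitrary initial policy map
while True:
    # one round of value updating under the current policy
    value = {s: REWARD[(s, policy[s])] + GAMMA * value[NEXT[(s, policy[s])]] for s in STATES}
    # greedy policy with respect to the updated value map
    new_policy = {s: max(ACTIONS, key=lambda a: REWARD[(s, a)] + GAMMA * value[NEXT[(s, a)]])
                  for s in STATES}
    if new_policy == policy:                           # fixed point: the policy stops changing
        break
    policy = new_policy

print(policy, value)
```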

Figure 8. Concept illustration of value and policy iteration, from Sutton and Barto (2017). (v, pi) are the initial value map and policy map. v_pi is the value map for policy pi. pi = greedy(v) is the policy map whose action in every state is the one leading to the state with max value. v_star and pi_star are the optimal value map and policy map, which are also the fixed point of the iteration process.

What I want to emphasize is the information backpropagation in updating the 'value' map.

3. Information backpropagation in iterative methods

When playing backwards, I demonstrated how the rewards propagate back to the starting steps. In this part, let's take a look at how the rewards backpropagate when we use iterative methods and play forwards.

3.1 Information backpropagation

We'll use Q-learning as an example.

Q-learning:

Figure 9. The Q-learning algorithm, from Li (2017). Q(s,a) is the action-value function, r is the reward, and gamma is the discount rate. When updating Q(s,a), it uses the current versions of Q(s,a) and Q(s',a') to iteratively obtain the next version of Q(s,a).

In Q-learning, the algorithm:

  1. Finds the action to take and the next environment state to move to (using the epsilon-greedy strategy; I'll talk about it in the next section), based on the current policy/value map.
  2. Uses the current versions of Q(s,a), the instant reward R, and Q(s',a') to get an updated estimate of Q(s,a) (iterative updating). State s' is the successor of state s.
  3. Repeats steps 1 and 2 until it hits the terminated state. From the starting state to the terminated state, this chain of state transitions builds up an iteration path: (s,a) -> (s',a') -> (s'',a'') -> (s''',a''') -> ....

Figure 10. State transitions and value updates in Q-learning. Q(s,a) is the action-value function representing the value of choosing action a at state s. s, s', s'', s''' are the environment states; a, a', a'', a''' are the actions. The definition of Q(s,a) can be found in Sutton and Barto (2017).
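
A minimal sketch of the tabular update from Figure 9 is shown below; the learning rate and discount values are assumed, and the states reuse the grid coordinates from the flappy bird example:

```python
# Tabular Q-learning update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
from collections import defaultdict

ACTIONS = ["up-forward", "forward", "down-forward"]
Q = defaultdict(float)                     # Q[(state, action)], all zeros at the start
alpha, gamma = 0.1, 0.9                    # learning rate and discount rate (assumed values)

def q_update(s, a, r, s_next, terminal=False):
    best_next = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# One transition that reaches the terminated state and collects the '+1' reward:
q_update((3, 4), "forward", +1.0, s_next=None, terminal=True)
print(Q[((3, 4), "forward")])              # 0.1 -- the final reward starts flowing into Q
```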

After it hits the terminated state, the algorithm starts from another initial state and updates the value map along the new path, repeating this over and over, thousands of times.

Then how does information flow backwards?

When the bird moves from s to s', the information (rewards and approximate values) moves from s' to s (Figure 10). However, in the flappy bird case there are no rewards at intermediate states/actions, only a '+1' reward at the terminated state for passing through the gap and a '-1' for crashing or flying out of bounds. As a result, in the first round of iterations there is no value update until the bird hits the terminated state, and the only state/action whose value gets updated is the one a single step ahead of the terminated state.

If the bird starts again on a new path (which consists of a sequence of iterations) and hits the previously updated state (call it state U1), then the state just ahead of U1 along the path gets a value update. After hitting the terminated state, it starts a new path again.

In this way, the information (rewards) at the terminated states is slowly backpropagated to the previous states, one step further back for each path of iterations.

Figure 11. Information backpropagation along the iterative updating path. The cyan blocks in the PATH map represent the iterative updating path. The second row shows the value maps (V0 - V4). The value maps show how the iterative path draws in the impact of the rewards at the terminated states, little by little, and propagates it backwards along the path.

After a large number of iterations, the value map will have gathered enough information from the environment (and from the terminated states, of course). When further iterations no longer update the approximate value map, or no longer change the policy map, we have reached the fixed point and obtained the optimal policy.

3.2 Explore vs. Exploit

One key strategy in generating the paths is the epsilon-greedy strategy. When picking actions, the epsilon-greedy strategy chooses the action with the highest value with probability 1-epsilon, and picks one of the other, non-greedy actions with probability epsilon. Thus, it does what the textbook and David Silver describe: it balances exploration (trying more actions or states even if they are not the optimal choice in the short term, so that you may find a better solution overall) and exploitation (using the actions already proven to be useful, to reduce futile search in a large space).
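
In code, the epsilon-greedy choice looks like this (a sketch; Q is assumed to be a table of action values keyed by (state, action), as in the Q-learning snippet above):

```python
# Epsilon-greedy action selection: explore with probability epsilon,
# otherwise exploit the action with the highest current Q-value.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit
```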

With the epsilon-greedy strategy, Q-learning builds up paths in the space of environment states, lets the rewards at the terminated states backpropagate along these paths, and improves the approximation of the value map, which in turn improves the approximation of the policy map.

4. Code for flappy bird and rocket landing

The flappy bird and rocket landing projects are in my GitHub RL project.

The flappy bird project is a simplified RL project. In the code, I didn't use Q-learning; I only distribute the final rewards along the flight path. A Q-learning and DQN version of flappy bird is here.

For the rocket landing project, the code is forked and modified from llSourcell. Siraj has a lot of wonderful videos on AI and machine learning, but the code in his repository had some small issues, and you can't run it without modifications. Also, you need to install several tools before running the model. The rocket landing project is fancier and is built on OpenAI's gym environment. There are other projects built in the gym environment; I'll try them later.

Thanks for reading such a long blog and I appreciate your feedback.

THE END.

PS1. So far, the most successful applications of RL have been in playing games. Although it is said to be used in robot control simulations, you may be a little surprised to hear that Boston Dynamics hasn't applied any RL, or Machine Learning at all, in their robot dogs.

PS2. The propagation of information from the terminated states to the previous states looks very similar to the propagation of boundary conditions when solving partial differential equations (PDEs) with numerical methods such as finite difference schemes. In a PDE, the information in the initial and boundary conditions determines the particular solution of the governing equation. In the discretized scheme, only when the inner points have received enough information from the initial and boundary conditions, propagated iteratively, can the numerical solution mimic the behavior of the real PDE in both the temporal and spatial dimensions.

PS3. The flappy bird example is a highly simplified Reinforcement Learning case. Stochastic cases like gambling are more complicated.

Reference:

Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction. https://doi.org/10.1109/TNN.1998.712192

Silver, D. (n.d.). UCL Course on Reinforcement Learning. Retrieved from https://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

Li, Y. (2017). Deep Reinforcement Learning: An Overview, 1–70. https://doi.org/10.1007/978-3-319-56991-8_32
