Part 1 - Applied Reinforcement Learning with 2048
Reinforcement Learning Agent Plays Breakout. (Daniel Takeshi 2016)


Hi there and welcome back!

It has been quite a long pause (over a year, actually) since my previous Artificial Intelligence / Machine Learning (AI/ML) post, so a bit of housekeeping is in order to catch up on what's been happening in the meantime.

A bit over half a year ago, I joined Clobotics, a Computer Vision (CV) and ML startup based in Shanghai. Departing from my previous job at a cross-border e-commerce company, where I started the whole Applied AI/ML article-writing shenanigan, and moving back to my family in Shanghai was a welcome change, although I will miss the cozy little team of fellow Finnish colleagues back in Shenzhen.

Looking back at these past couple of years, especially at my first ever article about the career jump from a film and animation producer into the world of AI/ML, it certainly feels like a dream coming true little by little. I actually got to know the folks at Clobotics in a very serendipitous way, while working on my second hobby AI project on beating a casual jumping game featured on WeChat.

And it was in a WeChat discussion group about AI, where I posted some problems I had encountered during that project, that I received a helping hand from none other than the founder and CEO of the company. A year or so later, here I am. Who would have thought? It's both a humbling and sobering experience, as the road certainly hasn't been as straightforward and rosy as it might have seemed just from looking at these social media posts spaced months, even a year, apart.

Part of the original question in the WeChat AI discussion group.

What I realized during this period is that I have to work even harder and improve even more to achieve what I'm aiming for and not let down those kind people who gave me opportunities time and again. So with that in mind, let's embark on yet another side project. As if to celebrate our WeChat conquest, this one is also (mainly) about games, although its applications can and should extend far broader. We are talking about Reinforcement Learning (RL).

Basics of Reinforcement Learning

I've actually long wanted to learn about RL, but apart from reading some simple basics about it, I never made it past the idea to actually applying it myself. Until now, that is. However, before jumping headfirst into the project, let's review what we know about those basics. I am liberally referring to people much smarter than I am, so if I botched anything badly, all the errors are on me.

At its core, reinforcement learning refers to a subset of machine learning where an entity capable of making some decisions about its actions (called an agent) is released into a world (called an environment - though it's usually very limited, and not our world per se). The agent may start with zero knowledge about its environment, or some simple rules can be coded into its memory.

From there on, the job of the agent is to learn, through trial and error, how the environment works and eventually figure out a way to "win". Success is tracked and measured by some method of scoring, so the third key element in the RL vernacular is an objective function - that is, a way to convert the perceived state of the environment into a score metric that the agent is trying to maximize.

The simplest agent - environment - objective function combinations are easily and readily extracted from the world of games, so it's no wonder that RL research has many of its roots in this form of pastime. As I mentioned in the introduction above, our project also revolves around a game. However, the idea of autonomous agents operating in an open-world setting may eventually be how higher-level artificial intelligence manifests in our real lives. That would be a topic for another day, though.

2048

It's not a year, although it might very well mark something in the not-too-distant future of AI development. For us, it's a relatively simple computer game that was quite popular a few years back. The simple aim is to slide and combine tiles of ever-doubling value until one reaches the 2048 tile. What is especially fascinating to programmers, of course, is that the tile values come in powers of two: 2, 4, 8, 16 and so on... One can even continue playing after the 2048 level is achieved, as larger tiles are theoretically attainable.

Screenshot of one of the many possible starting states of 2048.

This next part will be a short tangent, but I think it's important to know the whys, or the motivators, of me launching into this project. The screenshot above is captured from a website that sells a VPN service called Astrill. Due to various reasons, we need that in China to access the internet properly. They did not pay me to say anything, and there are many other VPN service providers that essentially do the same thing. Astrill may even be one of the more expensive ones. But they happen to have a coupon system where whoever reaches the 2048 tile is given a discount code randomly valued at 5%, 10%, 20%, 25% or 50%.

Now, I am not a very good player of this game. I mean, I have it on my phone and have even beaten it once, so I tried my hand at this challenge once again in the hopes of a discount. But I got stuck at around the 512 or 1024 level. Mind you, the levels get progressively tougher, because getting from 1024 to 2048 essentially requires repeating all the work it took to go from zero to 1024, so that you end up with two such tiles that can combine into the game-winning 2048 tile. So I thought, "screw it!", let the computer handle it.

Yeah, essentially I have embarked on a wild goose chase of learning RL to apply it for a potentially just 5% discount off my next subscription renewal. And what was your own excuse, again? :)

Step 1: Setting Up the Game and Basic Interactions

Many of the readily available reinforcement learning training systems referred to by online tutorials are part of the OpenAI research laboratory's reinforcement learning toolkit called Gym, which includes environments like the classic Atari games, one example being the Breakout game depicted in the cover image of this post.

But as people who have read my articles in the past know, I kind of don't like playing with toy datasets and problems, and would rather build something from the ground up. So instead of applying one of these ready-made RL implementations, my conviction was to create an RL agent to learn the game 2048 from scratch and get me that VPN discount.

To start from the very beginning, we first need to make our agent aware of the game state. Thus, we need to give it a set of "eyes" to interpret the board. Luckily, I've already played around more than a little with the OpenCV library during my WeChat game project, and also have some Optical Character Recognition (OCR) experience from my license plate auction project, so I was quickly able to build a rudimentary image capturing and OCR pipeline to read the board state and save it into a 2-dimensional numpy array.

The process also involves adjusting the website scroll so the 2048 game sits just so in the browser, aligning the image capture area to read the very pixels that represent each tile, then doing some image processing to get rid of the backgrounds and enhance the numbers before feeding them to the pytesseract OCR library for the numeric reading.
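To make that concrete, here is a minimal sketch of what such a capture-and-read pipeline can look like. The bounding box, the threshold value and the read_board helper name are illustrative placeholders rather than my exact code; the real numbers depend entirely on your screen and how the page is laid out.

    import numpy as np
    import cv2
    import pytesseract
    from PIL import ImageGrab

    # Hypothetical bounding box of the 4x4 board on screen: (left, top, right, bottom)
    BOARD_BBOX = (100, 200, 500, 600)

    def read_board():
        # Grab the board area from the screen and convert it to grayscale
        img = np.array(ImageGrab.grab(bbox=BOARD_BBOX))
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        # Crude cleanup: threshold away the tile backgrounds so mostly the digits remain
        # (the exact preprocessing depends on the site's colors)
        _, clean = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)

        board = np.zeros((4, 4), dtype=int)
        cell_h, cell_w = clean.shape[0] // 4, clean.shape[1] // 4
        for row in range(4):
            for col in range(4):
                cell = clean[row * cell_h:(row + 1) * cell_h,
                             col * cell_w:(col + 1) * cell_w]
                # Read the digits on this tile; an empty tile yields an empty string
                text = pytesseract.image_to_string(
                    cell, config='--psm 7 -c tessedit_char_whitelist=0123456789')
                board[row, col] = int(text.strip()) if text.strip().isdigit() else 0
        return board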

To express the screenshot a few paragraphs up in a format that can be fed into the computer for further processing, we can display the 2D array like this:

    [[4 0 0 0]
     [0 0 0 2]
     [0 0 0 0]
     [0 0 0 0]]
        

Some of you who have played a bit more with computer vision may recognize that this is basically also how a 2-dimensional grayscale image would be represented in the format that the computer understands - a feature that we might be able to exploit later. But so far, let's just be content that we have built the "eyes" of the agent. Yay!

Next, we are going to let the agent interact with the game. Luckily still, back in the WeChat project, we already played around with simulating human inputs via the virtual mouse. Do you see a pattern here, by the way? The past projects we did and the experience we accrued add up, so things get easier and faster now, down the road! Anyway, now we just need to fashion our code with some more bells and whistles to make the keyboard work, but everything follows the same basic principle so it was easy to set up.

The past projects we did and the experience we accrued add up, so things get easier and faster now, down the road!

I used the pynput library this time, but I believe a similar effect can be achieved with the pyautogui library that the Grand Theft Auto V autonomous driving tutorial used. By the way, I learned many great things from that tutorial series, as mentioned in the WeChat project article - so shout out to the sentdex YouTube guy over at Python Programming.

With pynput, we created five basic actions, and these are the very core ones that the agent will later utilize:

  1. Press the UP arrow key
  2. Press the DOWN arrow key
  3. Press the LEFT arrow key
  4. Press the RIGHT arrow key
  5. Move the mouse to a certain location on the screen, namely the red New Game button, and click (again, this involved a bit of trial and error to find a suitable pixel coordinate to move the cursor to, and it also depends on carefully scrolling the game window to a specific height in the browser, as mentioned above)

Actually, the movement buttons were also prefaced by clicking on an empty spot in the browser in order to gain the focus of the window before hitting the arrow keys, making sure we are sending the commands to the right place.
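For reference, a rough sketch of these five actions with pynput could look something like the following. The coordinates for the focus click and the New Game button are made-up placeholders, as are the helper names; your values will differ.

    import time
    from pynput.keyboard import Key, Controller as KeyboardController
    from pynput.mouse import Button, Controller as MouseController

    keyboard = KeyboardController()
    mouse = MouseController()

    # Made-up coordinates - they depend on your screen and how the page is scrolled
    FOCUS_POINT = (50, 400)        # an empty spot in the browser window
    NEW_GAME_BUTTON = (760, 320)   # the red New Game button

    def click(position):
        mouse.position = position
        mouse.click(Button.left, 1)

    def press_arrow(key):
        # Click an empty spot first so the game window has keyboard focus
        click(FOCUS_POINT)
        keyboard.press(key)
        keyboard.release(key)
        time.sleep(0.2)  # give the sliding animation a moment to finish

    # The five basic actions the agent can take
    move_up    = lambda: press_arrow(Key.up)
    move_down  = lambda: press_arrow(Key.down)
    move_left  = lambda: press_arrow(Key.left)
    move_right = lambda: press_arrow(Key.right)
    new_game   = lambda: click(NEW_GAME_BUTTON)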

With the ability to "see" and "touch" the game now, our agent is fully armed to face the challenges of the 2048 game. Actually, at this point, I spent some time looking through the internet for some kind of solver that would take as input the game state, and would output the most optimal move automagically. Sadly, I wasn't able to find such a thing. So on we march!

Step 2: Q-Learning

I have to admit, just seeing this word (actually, two words?) was something that pretty much put me off looking deeper into RL before I had this inner motivation to learn more. The letter Q somehow sounded foreign, there were strange mathematical formulas involved, and so I gave up before I even started to understand. Spoiler alert: it turns out to be a very simple idea, if only someone had bothered to explain it past that formula.

So if this is where you stop reading, I totally get it. No hard feelings, and welcome back any day. But I attempt to explain it in layman's terms below.

The scary Q-learning formula, image shamelessly stolen from Valohai's excellent RL tutorial blog
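For those who prefer text to images, the Q-learning update is commonly written as:

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

Here alpha (the learning rate) controls how strongly new information overwrites the old value in the table, and gamma (the discount factor) controls how much of the future score is allowed to flow back to the past state. Both will come up again below.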

For the rest of you that braved the formula and scrolled down here, congratulations. Valohai's Reinforcement Learning tutorial did a superb job of explaining the concept so that even I could get a grasp of it, and below I attempt to dumb it down even further. So here's the secret nobody bothered to tell you:

The Q-table is nothing but a simple lookup table (think "Cheat Sheet"), and the Q-learning algorithm is just a clever way of calculating and filling in its values, that's it.

Now what do I mean by that? Let's break it down further. The Q-learning method involves letting the agent loose in the environment, performing some random actions, checking how the state of the environment has changed (if at all), and recording that information into a big-ass Excel spreadsheet (as it were) so it can be looked up later as accumulated past experience. The information that gets saved is the score, calculated from the game state through a specially designed objective function, as mentioned earlier.

There are no rules on how to formulate the objective function, but in general, it should direct the agent towards success, whatever you take that to mean. You can imagine this is harder to do in real life, but in games it's fairly easy to keep score, and some games even literally display it onscreen. This 2048 implementation doesn't, so for our case, I set the scoring function as follows:

score = 10 * highest tile value + sum of all tile values

The idea is to give a big boost to whatever is currently the highest-valued tile, directing the learning towards obtaining bigger tiles, while also keeping tabs on the overall progression of the game through the slowly accruing sum of all the tiles combined. I might have taken this idea from some article on 2048 reinforcement learning that I read, and will update the reference if I bump into it again. But as mentioned, there are probably infinitely many other ways to define the score, and mine isn't necessarily even that good.
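In code, with the board stored as the 4x4 numpy array from earlier, my scoring rule boils down to a one-liner (a minimal sketch, assuming the read_board output above):

    import numpy as np

    def game_score(board):
        # board is the 4x4 numpy array of tile values read from the screen
        # Heavily reward the current highest tile, plus the slowly accruing sum of all tiles
        return 10 * int(board.max()) + int(board.sum())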

Now let's stop for a while and try to further break down the scary looking Q-learning formula. As I mentioned above, it's just a clever way of entering the information that the agent learns in the course of interacting with the environment into the cheat sheet, this so-called Q-table. Imagine that this lookup table looks like a giant Excel sheet with thousands of rows. Each row is a unique possible state of the game, encoded by the 2D numpy array representation, for example.

The columns, on the other hand, are the four different directional actions that the agent can take for each game state. (For the sake of simplicity, we don't consider resetting the game, as it's an action that takes place outside of the game. We would only invoke it if we detect that the agent got stuck without any valid moves left, essentially ending the game.) So any cell of the table holds the score of the game after a given action was performed in a given game state. Essentially, what the Q-learning algorithm does is this:

  1. After any performed activity, check the current state of the game and calculate the current score.
  2. Multiply the score by some ratio to control how much the future score information should flow back to the past state's cheat sheet lookup cell.
  3. Save the processed score as the learned future experience into the past state's row, in the respective action's cell.

Simple as that. As the agent explores the environment and accidentally happens to do something that increases the score, this future information will flow back to the past state, so that in the next iteration of the game, upon entering the same state, the agent will have information available that it didn't have before. And iteration by iteration, the information at the very end of the game, say the score of the 2048 tile, will eventually trickle back to the beginning states as the agent traverses the chain of states that lead up to the ending. It would have to do this many times, each time sending the score information further back by one state, and marking the action that led to the higher reward.

It's like checking out the winning lottery numbers and sending it back in time to another you in the week before the draw

Theoretically, once we have much of the state space covered, the behavior of the agent can be changed from a reckless random walker to a highly optimized performer who always chooses the highest expected score action at any given junction of the game's state. Or if it encounters a new state that didn't have any previously traveled cheat info, it can still perform a random action just the same and keep recording the outcome to the Q-table. It's like checking out the winning lottery numbers and sending it back in time to another you in the week before the draw. Well, not quite, but you probably got the hang of it by now.

As to the actual implementation, it's just a matter of following the math and coding it in Python. There was an even simpler implementation of the Q-learning algorithm presented in this RL tutorial on Towards Data Science, although instead of an actual two-dimensional array, I used the dictionary data structure for easy storage and lookup, with the unique game states as the keys and the values being a list of the expected future scores of the actions: [up, down, left, right]. If a game state doesn't exist in the dictionary, it simply means the agent has not yet seen it, and should treat it as uncharted ground.
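To give a flavor of it, below is a stripped-down sketch of the dictionary-based Q-table, the random-versus-exploit action choice, and the update from the numbered steps above. The constants and helper names are illustrative; treat this as a rough outline rather than my exact code.

    import random
    import numpy as np

    ACTIONS = ['up', 'down', 'left', 'right']
    EPSILON = 0.5        # explore (act randomly) half the time, exploit the other half
    LEARNING_RATE = 0.1  # how strongly new information overwrites the old value
    DISCOUNT = 0.9       # the "ratio" controlling how much future scores flow back

    q_table = {}  # key: serialized game state, value: [up, down, left, right] scores

    def state_key(board):
        # Serialize the 4x4 numpy array into a hashable dictionary key
        return board.tobytes()

    def choose_action(board):
        key = state_key(board)
        if key not in q_table or random.random() < EPSILON:
            # Unseen state, or an exploration turn: act randomly
            return random.randrange(len(ACTIONS))
        # Otherwise exploit: pick the action with the highest expected score
        return int(np.argmax(q_table[key]))

    def update_q(old_board, action, reward, new_board):
        # reward: e.g. game_score(new_board) from above, or the change in it
        old_key, new_key = state_key(old_board), state_key(new_board)
        q_table.setdefault(old_key, [0.0] * len(ACTIONS))
        # Best score the cheat sheet currently predicts for the new state
        future_best = max(q_table.get(new_key, [0.0] * len(ACTIONS)))
        old_value = q_table[old_key][action]
        # The Q-learning update: nudge the old value towards reward + discounted future
        q_table[old_key][action] = old_value + LEARNING_RATE * (
            reward + DISCOUNT * future_best - old_value)

In the main loop, you would call choose_action before each move, then update_q once the new board has been read back from the screen.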

That's it. Now we have the agent and the environment all set up, and we even pretend to understand something about this Q-learning thing to actually make it all work! In reality, of course, I faced tons of problems with the implementation, and even coded the Q-learning algorithm wrong at first, getting some weird results for the longest time during my earlier runs.

A short video of the agent playing, for the lulz. Still pretty much running on random.

Q-Learning Efficiency Problems

I've now run the agent over one weekend with some incremental improvements. Such improvements include recording not just one but four different states and associated actions after each move, by flipping the board horizontally, vertically, and both horizontally and vertically at the same time (obviously also mirroring the chosen action along the flipped axis before saving the data). However, I have still run into efficiency issues.
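As a sketch of that symmetry trick: flipping the board left-right swaps the LEFT and RIGHT actions, flipping it up-down swaps UP and DOWN, and flipping both ways swaps both pairs. Assuming the action ordering [up, down, left, right] from before, the augmentation could look roughly like this:

    import numpy as np

    UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3

    def augmented_experiences(board, action):
        # Equivalent (board, action) pairs under the board's mirror symmetries
        lr_action = {LEFT: RIGHT, RIGHT: LEFT}.get(action, action)
        ud_action = {UP: DOWN, DOWN: UP}.get(action, action)
        both_action = {LEFT: RIGHT, RIGHT: LEFT, UP: DOWN, DOWN: UP}.get(action, action)
        return [
            (board, action),                              # original board and move
            (np.fliplr(board), lr_action),                # mirrored left-right
            (np.flipud(board), ud_action),                # mirrored up-down
            (np.flipud(np.fliplr(board)), both_action),   # mirrored both ways
        ]

Each of the four pairs then gets written into the Q-table, quadrupling the experience gained per move.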

Although I set the agent to do a random action 50% of the time, and exploit past knowledge the other half of the time, I rarely actually saw the agent revisiting any previously traveled states for an efficient exploitation strategy. Instead, almost all of the states seemed to be unique and new. Was there an issue with saving the game states in the Q-table? Hardly, as I was tracking the size of the table, and it had reached almost 145k unique states by now. The only times we occasionally saw a previously traveled state were at the very beginning of the game, maybe on the very first or second move.

[For] a game that reaches the level of 2048 on the standard 4x4 board, there are 44,096,709,674,720,289 unique board states

This led me to check out the numerical properties of the 2048 game, and I was quickly devastated. According to this article, although the boards tend to look similar to a human player, for a game that reaches the 2048 level on the standard 4x4 board there are 44,096,709,674,720,289 unique board states. It's such a large number that I don't even know how to say it. By playing randomly, my agent would usually get stuck around the 128 tile level. But even then, the state space is a whopping 33,080,342,678,945 states. This one I can at least take a stab at naming: over 33 trillion unique states.

Turns out, the initial state space is small but grows exponentially larger with each additional level. Even with over 145k unique states traversed, I've still only covered an infinitesimal corner of the entire ocean of possible states. And that's even just 1/16 of the target level. No wonder the agent almost never kept looping back into its previous traversals.

This problem is further compounded by how the Q-learning algorithm itself works. As mentioned, each time a known state is revisited, the information from the future is only pushed back by one state. Which means that even if we somehow used a billion years to accidentally reach that 2048 state by exploring and exploiting almost the entirety of the possible state space, we would still need to do it thousands of times more to effectively bring that knowledge back to the beginning of the game.

So the vanilla Q-learning approach seems to be highly inefficient in my case, and that VPN offer may not wait that long for me. Sounds vaguely familiar, right? I reached a similar impasse during the early stages of my WeChat game project. The solution then? Go Deeper.

Coming up...

In the next part of this article series, I am going to try implementing a deep neural network architecture onto the Q-learning algorithm, also known as a Deep Q-Network, or DQN. This method should let me approximate the estimated future score of a given game state from a limited number of actual historical experiences without actually forcing the agent to really play Pokémon and collect them all. As we know, even though the numbers on the tiles may vary, essentially the basic idea of the game is the same.

The promise of the DQN is to improve the efficiency of the Q-learning method radically by exploiting the inherent statistical similarities in the otherwise very unique game states.

Or so they say. We'll see :)

As always, I am writing these articles to encourage and tell people who are interested but perhaps a bit put off by AI and ML that it's not really that daunting of a task. All it takes is a curious heart and some hacker mentality. Feedback of any kind, be it about my writing style, content, or research methodology, is all warmly welcome.

My heartfelt thanks to anyone who liked, shared and left me messages on my previous articles. Thank you for the support! Also, if you have some ideas about your own problem that you think Machine Learning could be of help to solve, please feel free to let me know! I'd love to discuss and exchange thoughts about AI and ML.
