Part 1 - Applied Computer Vision: Using OpenCV to Beat the WeChat Jumping Minigame
The AI algorithm at work, playing the WeChat minigame

Those of you who read my last write-up about a basic Machine Learning application for predicting movie box office revenues will know that I have been looking for other avenues of research. One such case presented itself around Christmas, when a new in-app minigame called "跳一跳," or Jump a Little, was released inside the hugely popular WeChat app in China.

The game itself is pretty simple: you jump onto ever harder-to-land boxes, and the only input is pressing your phone's touch screen for a certain length of time, which corresponds to the amount of power the "player" character puts into its jump. The distance to the next target varies, so different amounts of power are required. Too little power and your jump falls short; too much and you overshoot the target. Both outcomes mean failure, and the player has to start over. The game randomizes the location, size and shape of the next targets, so each session is a little bit different. So easy that surely even a cat could play it, right?

As the game goes on, the randomization starts to favor shapes that gradually shrink the actual area of a successful landing, making progress beyond a certain point quite difficult. As a human player, I was able to get to 70+ points (corresponding to some 70 successful jumps, although landing in the dead center of a target earns extra points, and certain easter egg targets award bonus points if you know to stay on top of them for a little while before jumping on). But I had seen some of my friends get scores in the range of 200-300, so I wondered whether I could beat them with an AI-assisted player.

The Right Amount of Challenge

The reason I became interested in developing a Computer Vision (CV) enabled AI to help me play the WeChat minigame was that it seemed to be just the right amount of challenge. It was definitely not too easy, because I had zero prior experience with any CV applications whatsoever, but it also wasn't too difficult, because the game itself is quite easy to understand, and I felt that the problems, although still unknown at the start, would be within my reach if I went just a little beyond my comfort zone. In essence, I figured I would only need to build three main elements to make everything work (a rough code sketch of this loop follows the list):

  1. Some kind of setup to capture the state of the game (perhaps in the form of a screenshot) and feed it into the CV application I would write
  2. The actual CV sub-routines, where both the player character and the next jump target would be identified and their locations extracted from the game state, so that the distance between the two points can be calculated and converted into the exact amount of time the screen needs to be pressed
  3. A way to relay touch screen input back into the game in order to execute the jump, and then start the cycle again
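
To make the structure concrete, here is a minimal Python sketch of how these three elements could fit together in a single loop. The function names (grab_screen, detect_player_and_target and so on) are placeholders for the pieces described in the rest of this article, and the timings are assumptions to be tuned, not measured values.

```python
import time

def play_forever():
    """Main loop: capture -> detect -> jump, then repeat."""
    while True:
        frame = grab_screen()                              # element 1: screenshot of the emulator
        player, target = detect_player_and_target(frame)   # element 2: CV sub-routines
        if player is None or target is None:
            time.sleep(0.2)                                # nothing reliable detected yet, retry
            continue
        press_ms = distance_to_press_time(player, target)  # convert pixel distance to press time
        press_screen(press_ms)                             # element 3: simulated touch input
        time.sleep(1.5)                                    # wait for the jump animation to settle
```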

A colleague of mine had experimented with a CV library called OpenCV before, and he recommended it to me. I checked out its Python bindings and they seemed reasonably simple and robust to start playing around with, even though I had practically zero prior experience with Computer Vision. At this point, I have to give huge credit to a massively entertaining and deeply insightful video/article series on the Python Programming website about teaching the computer to play Grand Theft Auto V based on CV algorithms alone.

Although the application domain of the tutorial was vastly different from my own project, the main challenge of creating an AI player for GTA V was similar and boiled down to the same three main steps described above. By following the tutorial, I learned the basic framework of how each of the stages (image capture, the main CV sub-routines, and finally feeding input back into the game) was built and made to work together. It gave me a clear template for structuring my own project when the time came.

In addition to the great GTA V tutorial, the Python Programming website also featured a more basic overview of the OpenCV library itself which, together with the technical documentation of the library, formed a kind of "handbook" that I would constantly return to in order to check how each of the tools worked. Armed with these tutorial series, I felt I was ready to take on the WeChat minigame challenge myself.

Basic Building Blocks

Dividing the project into sub-tasks and problems to solve was easy after the GTA video. But before writing a single line of code, I had to figure out how to actually get the game running on a computer so that I could plug it into the program I was going to build. GTA was easy in that regard because it already runs on PC. WeChat, on the other hand, is a mobile application that requires a native phone or tablet environment. I knew there had to be mobile emulators, though.

To save you the trouble of trying different emulators, let me jump straight to the conclusion: you want Bluestacks. It was the only emulator I tried that worked not only for the WeChat app itself but also for the jumping minigame within it. I tried several others before it that either did not run properly at all or failed at the last mile when I tried to launch the minigame. Even then, the Bluestacks emulator is not very stable and crashed multiple times during development and testing.

First win: The minigame running on top of WeChat running on the Bluestacks emulator.

The second part, and my first programming challenge, was to get the screen grab working in Python. This is the fundamental step, because we need to feed graphical information into the algorithm for it to process. Luckily, the process here was more or less exactly the same as in the GTA tutorial. The only thing that required tweaking was the size of the game window; an emulated mobile app is obviously more likely to be vertically oriented. Nothing dramatic here.
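
For reference, here is a minimal screen-grab sketch along the lines of the tutorial, assuming Pillow and OpenCV are installed; the bounding box values are placeholders that have to be measured for your own Bluestacks window.

```python
import numpy as np
import cv2
from PIL import ImageGrab

# (left, top, right, bottom) of the emulator window in screen pixels.
# These values are illustrative and must be adjusted to your setup.
GAME_WINDOW = (0, 40, 540, 1000)

def grab_screen():
    """Capture the emulator region and return an OpenCV-friendly BGR array."""
    img = ImageGrab.grab(bbox=GAME_WINDOW)
    return cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
```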

Next up was the actual meat of the application: playing around with OpenCV and trying various methods to slice and dice the image in order to extract the most useful information to determine how much "energy" to use for the next jump.

For our particular jumping problem, the hypothesized solution is as follows: we need a way to unambiguously determine the location of the next jump target, and preferably extract the precise on-screen X, Y coordinates of a point on top of the target, well away from the edges where you can still easily fall off even if you initially manage to land.

Next, we also need to determine the X, Y coordinates of the player character. Finally, we can calculate the distance between the two coordinates and somehow translate that distance into the length of time to touch the screen (or press the mouse) in order to gather the proper amount of "energy" for the jump.

Let's break all of the above into even smaller problems. First of all, how do we determine what the next target is? Luckily, the game generally progresses upwards. Because the point of view is isometric, the next object always appears to the upper left or upper right of the player. In OpenCV, cropping out a region of interest is straightforward, so to clean up the image we simply leave everything below the player character out of consideration. We also crop away the very top part of the image, since new blocks never appear beyond a certain distance. This cropping optimizes object detection and avoids detecting past targets that are irrelevant to finding the next one.
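
In code, the region of interest is just a NumPy slice of the frame. The fraction below is an illustrative guess rather than the exact crop I used:

```python
def crop_region_of_interest(frame, player_y):
    """Keep only the band between a fixed top margin and the player's vertical position.

    Everything below the player and the very top of the image is irrelevant
    for finding the next target, so it is sliced away before detection.
    """
    top_margin = int(frame.shape[0] * 0.25)  # illustrative fraction of the frame height
    return frame[top_margin:player_y, :]
```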

I tried several ways to extract reliable coordinate information from the image. Some worked better than others, and some did not work at all. For example, I tried reducing the image to edges and corners and using, say, the average of the coordinates of the top 3-4 corner points (topmost, because that maximizes the likelihood that they belong to the next target and not some other random point on the screen). The problem was that sometimes the algorithm would not detect the corners correctly, or the middle/average point between them would still be too close to an edge of the object.
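
As a rough illustration of this corner-based attempt, here is a sketch using OpenCV's goodFeaturesToTrack; the detector parameters and the number of corners to average are assumptions, not the values I actually used.

```python
import cv2
import numpy as np

def topmost_corner_estimate(gray_roi, num_corners=4):
    """Average the topmost few detected corners to estimate the next target's position."""
    corners = cv2.goodFeaturesToTrack(gray_roi, maxCorners=30,
                                      qualityLevel=0.1, minDistance=10)
    if corners is None:
        return None
    pts = corners.reshape(-1, 2)
    pts = pts[np.argsort(pts[:, 1])]   # smallest y = closest to the top of the image
    top = pts[:num_corners]
    return (int(top[:, 0].mean()), int(top[:, 1].mean()))
```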

Buggy corner detection, where the algorithm finds corners (the highlighted round points in the image on the right) even on the edges of the region of interest, breaking the "topmost corners denote the next target" logic.

Another early prototype, showing template detection for the player character and corner detection for the target object at work. Due to the suboptimal corner detection, the jump location (denoted by the red dot) is far from the ideal, which would be the center of the object.

I also tried creating templates, which are predetermined images that the algorithm compares against the screen grab and tries to locate as accurately as possible. But since the game randomly varies the color, pattern and even the size and shape of the objects, it proved practically impossible to create templates for all the different objects.

Since the majority of shapes the game throws at you have simple square or round tops (boxes or cylinders), I finally found a somewhat reliable way to get started by creating a negative template of the top part of both round and square objects, essentially their silhouettes.

Square and round negative silhouettes. Note that for the square silhouette, I even included a little bit of the left edge to encourage the algorithm to place the template on top of an entire square instead of latching onto the first, too-small triangle on the upper side of the shape.

For these silhouettes to work, I had to get rid of the background (the algorithm actually matches against the black parts). There were two problems with that. First, the background has a slight gradient, meaning it is not a single color. Second, the background color changes over the course of the game, also at random. I first found this out the hard way after carefully identifying and removing one shade, only to see a few jumps later that all my work had gone to waste when the game rotated the background to another color.

To add some intelligence to the background removal sub-routine, I first converted the captured color image to grayscale, then computed a histogram to find the shade of gray with the highest pixel count in the entire image, which denotes the bulk of the background. Then, to take the gradient into account, I removed not only the shade with the top pixel count but also several shades around it. Through experimentation I found that I had to remove a band of about 25 consecutive shades of gray to make the entire background disappear, no matter which background color the game decided to throw at me.
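
A minimal sketch of that background removal routine, assuming a grayscale input frame; the band width of 25 shades is the value mentioned above, everything else is illustrative.

```python
import cv2
import numpy as np

def remove_background(gray, band=25):
    """Blank out the dominant gray shade (plus neighboring shades) in the image."""
    # The tallest histogram bin is assumed to be the background shade.
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).flatten()
    bg_shade = int(np.argmax(hist))
    lo = max(bg_shade - band // 2, 0)
    hi = min(bg_shade + band // 2, 255)
    cleaned = gray.copy()
    cleaned[(gray >= lo) & (gray <= hi)] = 0   # remove the whole band around the peak
    return cleaned
```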

Player character detection, on the other hand, was much easier. A template solution worked really well, because the character never changes its shape or even its color, so a cleaned-up screenshot of it worked perfectly as long as I fed the algorithm a grayscale image of the game state without any edge or corner detection applied. At times the game would also throw in predetermined shapes that never change in size or color. There are at least two: a take-away coffee cup and a bottle of some sort. There is also something that looks like a toilet paper roll, but its size can vary, so it is not static.
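
Template matching in OpenCV is nearly a one-liner; here is a hedged sketch of how the player character (or any of the static templates) could be located. The threshold and the choice of returning the bottom-center point are assumptions.

```python
import cv2

def find_template(gray_frame, template_path, threshold=0.8):
    """Locate a grayscale template in the frame, or return None if the match is weak."""
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(gray_frame, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < threshold:
        return None
    h, w = template.shape
    # For the player character, the bottom-center of the match roughly
    # corresponds to the point it is standing on.
    return (max_loc[0] + w // 2, max_loc[1] + h)
```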

Template gallery: player character, bottle, coffee cup, paper roll, and various twists on the more generic blocks and cylinders. I could have kept creating these until I decided that enough was enough!

Further on, as the game progressed, it would radically alter the shapes and sizes of the cylinders and boxes, making them almost impossible to recognize from their original, larger versions. This was mentioned earlier in the article, but you can see it in the last five objects in the template gallery above, which are all variations of the regular target boxes. The silhouette detection does not work well for these objects, and I spent quite a lot of time creating custom templates whenever a new twist on a shape or color appeared.

But we are getting a little ahead of ourselves, because first I still had to solve one last problem: calculating the distance and power of the jump, and executing it by feeding the emulated touch input back into the game. Since there were multiple object detectors working at the same time, I had to create a logic for how they would compete and which result to use for each jump. Again, I followed the age-old wisdom that "next is up": I compared each detector's resulting target X, Y coordinates and chose the one closest to the top of the screen. I would then take that point and calculate its on-screen distance to the point found by matching the player character.
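
The selection itself is tiny; a sketch of the "next is up" rule, with candidate points expressed in image coordinates where y grows downwards:

```python
def choose_target(candidates):
    """Pick the detection closest to the top of the screen ("next is up")."""
    valid = [c for c in candidates if c is not None]
    return min(valid, key=lambda p: p[1]) if valid else None
```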

Now, with the distance between the two points known, I could translate it linearly into a length of time in milliseconds to hold down the mouse button (simulating a press on the touch screen), making sure the mouse cursor was on top of the game window. A little trial and error was enough to find a good ratio for converting distance into time. Sending the press to the game was again quite easy, actually even easier than in the GTA V tutorial, probably because GTA V runs on a DirectX graphics driver, which makes injecting input a little trickier than simply sending it through the Bluestacks Android emulator.
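
A minimal sketch of that last step, assuming pyautogui for the simulated press; the milliseconds-per-pixel ratio below is a made-up placeholder for the value found by trial and error.

```python
import math
import time
import pyautogui  # assumed library for simulating the mouse press

MS_PER_PIXEL = 1.35  # placeholder ratio; tune by trial and error

def jump(player_xy, target_xy):
    """Convert the pixel distance into a press duration and execute the jump."""
    distance = math.hypot(target_xy[0] - player_xy[0], target_xy[1] - player_xy[1])
    press_ms = distance * MS_PER_PIXEL
    pyautogui.mouseDown()               # the cursor must already be over the game window
    time.sleep(press_ms / 1000.0)
    pyautogui.mouseUp()
```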

Optimize, Optimize, Optimize!

All done and dusted? Wrong! The first version of the AI, which I hacked together in about 8-10 hours, was only able to get to a little over 100 points. Still better than my own 70-something record, but plenty of human players scored much more. This first version was actually running on the corner detector alone, because at that point I had not even gotten the silhouette templates to work properly (I had not figured out the background removal trick yet).

The next day I was determined to improve my algorithm enough to beat all of the human players. I got the silhouette detection working and made various other templates work as well. I also discovered one more template on day two, perhaps the most important one: sometimes when you land perfectly on the dead center of the target object, the game not only gives you bonus points but also reveals the center point of the next object by marking it with a small, round white spot.

The so-called aim dot.

Unfortunately the dot does not show up well against surfaces that are already white, but for all other objects, having what is essentially an aiming point made progress much easier. I just had to create a template of this aim dot and try to find it in the picture first. If the aim dot was not available, the next best alternative was to match against the silhouettes and other templates. The last resort was corner detection if everything else failed (although at that point the corner detector would usually fail to acquire a proper target coordinate as well).
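
That priority order is easy to express as a small cascade; the detector functions below are placeholders for the aim dot, silhouette/template and corner routines described above.

```python
def locate_target(gray_frame):
    """Try the detectors in order of reliability and return the first hit."""
    for detector in (find_aim_dot, find_silhouette_or_template, find_corners):
        point = detector(gray_frame)
        if point is not None:
            return point
    return None
```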

Other optimizations included a smoother way to update the target coordinates after a jump. At first, new coordinates were accepted immediately after each jump, but sometimes they were wrong because the game state had not yet fully settled. I created an averaging method where the program constantly saves the best coordinates from the last n captured frames and compares their average to the latest coordinate. If the coordinates are still moving around too much right after a jump, the AI does not accept them as final. Only after the movement has settled and the average across the previous n frames is close enough to the latest coordinate (say, less than a 3-5 pixel difference) is it accepted as the true target coordinate. Making the AI wait for n frames also has the added bonus of automatically collecting the easter egg bonuses that you get by pausing for a while on certain target objects, such as the supermarket, the manhole cover, the Rubik's cube or the turntable.
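
A sketch of that settling logic; n and the pixel tolerance are the kind of values mentioned above but would need tuning in practice.

```python
from collections import deque
import math

class TargetStabilizer:
    """Accept a target coordinate only after it has stopped moving between frames."""

    def __init__(self, n=5, tolerance=4):
        self.history = deque(maxlen=n)
        self.tolerance = tolerance

    def update(self, point):
        self.history.append(point)
        if len(self.history) < self.history.maxlen:
            return None
        avg_x = sum(p[0] for p in self.history) / len(self.history)
        avg_y = sum(p[1] for p in self.history) / len(self.history)
        # Accept only once the running average is close to the latest reading.
        if math.hypot(point[0] - avg_x, point[1] - avg_y) < self.tolerance:
            return point
        return None
```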

I also noticed (perhaps related to the coordinate oscillation right after the screen changes) that sometimes, right after one jump, the AI would mistakenly think the next coordinates were very close by and execute a very short jump. Often it would be so short that the character landed back on the same target it jumped from; other times it would jump into the emptiness between objects and the game would end. I weeded out these twitchy, buggy jumps by setting a threshold so that the AI ignores jumps shorter than a certain number of pixels, even if all other prerequisites for a jump are met.

By combining all of the various targeting methods, creating new templates as the game ramped up the challenge, and optimizing the algorithm to avoid bugs leading to sudden deaths, I was able to edge the score all the way up to 1,136 by the end of day two, much better than anyone else on my leaderboard at the time. You can see the results in my tweet. I felt really accomplished, and that was pretty much all the free time I had left of my New Year's weekend.

The winning leaderboard entry after my own 48-hour hackathon.

However, the AI was still far from optimal. One thing I noticed later was that by always pausing the game when it hit some unknown obstacle (usually a new twist on existing blocks that it could not identify), creating custom templates and resuming from there, I had created too many things for the algorithm to look for. When I later restarted the game, the templates I had created for the later stages began to interfere with the detection of the much simpler objects early in the game. The AI was never able to get beyond the 1,136 score on its own.

Coming up ...

So far my AI has relied almost entirely on customized templates and other rule-based methods to determine what constitutes a "target" to jump on. We have seen that this approach only gets us so far, and the seams of our hack are already bursting. We need a way for the AI to learn by itself what a target is.

When I was studying Machine Learning on the legendary Coursera Stanford ML course, the instructor Andrew Ng briefly touched upon more advanced neural network architectures in the context of deep learning: for example, that recurrent neural networks (RNNs) work well for text and speech recognition and understanding, things with a clear temporal structure, while convolutional neural networks (CNNs) work well for image recognition tasks. So a thought started to grow in the back of my head: build a deep CNN architecture and, through supervised learning, train my AI to recognize the target objects on its own. In the words of Mr. Cobb in the much-memefied motion picture Inception, "We need to go deeper."

This is also pretty exciting for me, as I had never experimented with neural networks beyond the simple three-layer MLP I used for the box office predictor, my previous ML project, and CNNs for Computer Vision were totally new to me. Have I bitten off more than I can chew this time? We shall see.

Update: Part 2 has been published! Head over and continue reading.

As always, I write these articles to encourage people who are interested in AI and ML but perhaps a bit put off by them, and to show that it is not really that daunting a task. All it takes is a curious heart and some hacker mentality. Feedback of any kind, be it about my writing style, content, or research methodology, is warmly welcome.

My heartfelt thanks to anyone who liked, shared and left me messages on my previous articles. Thank you for the support! Also, if you have some ideas about your own problem that you think Machine Learning could be of help to solve, please feel free to let me know! I'd love to discuss and exchange thoughts about AI and ML.
