How to effectively hide an object in the background
Introduction
Machine learning can help automate the creation of video games. Procedural content generation (PCG) is one of the many ways to do it. It can take the form of a sophisticated cascade of AI tools generating levels on the fly, as well as simple mechanisms that speed up content preparation. Recent success in Computer Vision (CV) has led to the incorporation of its techniques into games too, which is very natural, since video games operate heavily with visual content.
There are many genres AI can be applied to. But the particular question of this post is: how can we automate the creation of a hidden-object game? Our assumption is that we could at least place an object in a specific part of a background automatically (Figure 1). From the mathematical perspective, this means finding 2 relative coordinates (x, y) of the object image on the plane of the background image. Additionally, we can rotate the object a bit as well as scale it down. Hence, we need to construct a function:
f: (BG, OBJ) ---> (x, y, rotation_angle, scale_factor)
Can we train a model (e.g. a neural network) that represents such a function? Of course, it is possible with supervised learning. The training data set in this case could look like {<BG, OBJ>, <x, y, rotation_angle, scale_factor>}. For example, it could be a generative model, which provides an adequate distribution of (x, y, rotation_angle, scale_factor) conditioned on <BG, OBJ>:
p (x, y, rotation_angle, scale_factor | BG, OBJ)
But there is the usual problem: we do not have such a training data set... To create it in a straightforward way, we would have to ask people to act as "content designers" and place tons of object images into tons of background images. But can we solve the problem in an unsupervised manner? To do so, we would like to find an "appropriate" position for an object image in a background, using some similarity measure between the object image and a local part of the background. Since the goal is to "hide" the object, it should be a kind of "mimicry" similarity. But does such a specific thing even exist?
Figure 1. Placement of object images onto a background image.
Contribution. First of all, we set up a task which, to the best of our knowledge, had not been set up before. In order to solve it, we created a special and well-interpretable metric for image similarity ("Grad, color & contour"). We also proposed completely new methods for measuring this similarity (like "Joint Color Quantization"). We studied which layers of ResNet (K. He et al., 2016) are appropriate for the task and how to measure the similarity between two images based on them. The quality in the final experiments corresponds to production level. And to top it all off, we managed to use a no-training approach.
Structure. The post is organized as follows. We set the task of placing object images onto a background image. The next sections present existing methods for measuring the similarity between two images, a description of our experiments and methods, and some ideas for future work.
Problem setup
We would like to have a similarity measure between two images (usually of equal sizes) that reflects the human perception of the hiddenness of one image in another. Having such a similarity, we are able to search for the best coordinates of the object image as those that correspond to the highest value of the measure (the score).
The search itself can be constructed in various ways. There can be a global-search stage, which finds the best (x, y) using strides over the background coordinates. The second stage is a local search: it is similar to the global one but runs in a local area (found in the first stage) with a small stride (1 pixel), rotating and scaling the object down. For the sake of acceleration, the search can be done in the frequency space, e.g. by applying the Fourier transform (though this is not applicable to all methods).
The main purpose of this post is to study the following similarity measure:
f: X × Y ---> score, X, Y ∈ ℝ^(n×m), score ∈ [0, 1]
Here X and Y are two images, represented by 2D matrices (or 3D arrays, if there are several channels of information). The score is a scalar that gets closer to 1 as one image gets closer to the other in terms of the considered similarity. What makes this similarity specific is that it should indicate a good place for an object image in hidden-object games.
Existing methods for measuring similarity between two images
It seems that nobody has solved the task in this particular formulation. On the other hand, there is a large number of image similarity metrics, and many surveys on measuring similarity for image registration/alignment/stitching exist, for example (Nag S., 2017) or (Szeliski R., 2006). More details on such algorithms can also be found in the (Szeliski R., 2010) book. All approaches can be grouped as pixel-based, morphological (geometrical), distribution-based, hash-based, semantic similarity, etc.
Pixel-based
The most straightforward approaches are pixel-based. This means that the colors of two image pixels with the same coordinates (x, y) are compared. These are various measures between pixel intensities (the actual brightness values of each channel) at the corresponding coordinates. When considering color, each channel can be handled separately, or one can take the average value.
Such a measure can be based on the mean squared error: the mean of squared residuals over each pixel in the two images, the L2 measure. In the computer vision community, this approach is sometimes referred to as the sum of squared differences (SSD). One can also use the mean absolute error, the L1 measure. Of course, the Minkowski distance (Lq) can be used too.
Another pixel-based approach is the normalized cross-correlation between pixel intensities. It can be treated as the cosine similarity of two matrices (images).
These methods are used in image stitching, alignment, and registration (Szeliski R., 2006).
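To make this concrete, here is a minimal NumPy sketch of these pixel-based scores for two equal-size grayscale images. The function names and the squashing of distances into (0, 1] are our own illustrative choices, not a standard library API:

```python
import numpy as np

def ssd_score(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared difference (L2 / SSD), squashed so that 1.0 means identical."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 1.0 / (1.0 + mse)  # maps [0, inf) into (0, 1]

def l1_score(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute difference (L1), squashed into (0, 1]."""
    mae = np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64)))
    return 1.0 / (1.0 + mae)

def ncc_score(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized cross-correlation: cosine similarity of mean-centered images."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom else 0.0
```

Note that the NCC score lies in [-1, 1]; it can be rescaled to [0, 1] if a common score range is needed.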
Morphological (Geometrical)
Another idea is to find the similarity between the shapes of objects in two images. In practice, this usually requires defining the main contours of the objects. There are numerous algorithms for that; the most famous are Canny and Otsu, or one can even use a simple threshold to binarize the image. Since the contour is all we need, it makes sense to convert the image into grayscale as a first step. After that, the contours are defined, and the whole image is binarized: e.g. 1 for the contours, 0 for the rest.
Then the two contours (of the two images) must be compared. One of the methods is the Hausdorff distance (Huttenlocher et al., 1993, Taha & Hanbury, 2015), which is the longest of all distances from a point in one set to the nearest point in the other set.
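As an illustration, a sketch of this comparison with OpenCV's Canny detector and SciPy's directed Hausdorff distance might look as follows (the Canny thresholds are arbitrary placeholder values):

```python
import cv2
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_between_contours(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between Canny edge points of two grayscale (uint8) images."""
    edges_a = cv2.Canny(img_a, 100, 200)
    edges_b = cv2.Canny(img_b, 100, 200)
    pts_a = np.column_stack(np.nonzero(edges_a))  # (row, col) of edge pixels
    pts_b = np.column_stack(np.nonzero(edges_b))
    if len(pts_a) == 0 or len(pts_b) == 0:
        return float("inf")  # no contours to compare
    # The symmetric Hausdorff distance is the max of both directed distances.
    return max(directed_hausdorff(pts_a, pts_b)[0],
               directed_hausdorff(pts_b, pts_a)[0])
```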
Other ideas were proposed by (Pytiev & Chulikiv, 2010) - Pytiev morphology - and applied to matching 2D image shapes by (Vizilter & Zheltov, 2012). The framework is based on geometric and algebraic reasoning: images are treated as piecewise-constant 2D functions, a tessellation of the image frame by a set of non-intersecting connected regions determines the "shape" of the image, and an image can be projected onto the shape of another image. The comparison of morphological images is performed using normalized morphological correlation coefficients, which estimate the closeness of one image to the shape of another. This method of image analysis can be described as intensity-to-geometry matching.
Distribution-based
Every image can also be considered as a distribution of pixel intensities (for each channel). Hence, we can compare such distributions of two images. One of the methods for this is the Wasserstein distance (or Earth Mover's Distance, EMD) (Dobrushin, 1970, Lavin et al., 1998, Rubner et al., 1998). Intuitively, if each distribution is viewed as a unit amount of earth (soil) piled on the metric space, the EMD is the minimum "cost" of turning one pile into the other, i.e. the amount of earth that needs to be moved times the mean distance it has to be moved. Another method in this group applies the idea of Mutual Information to 2D objects.
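The full 2D EMD is relatively expensive to compute; as a hedged illustration, the idea can be approximated per channel with SciPy's 1D Wasserstein distance over flattened pixel intensities:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_per_channel(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Mean 1D Wasserstein distance between per-channel intensity distributions.

    A cheap stand-in for the full 2D EMD: each channel of each (H, W, C) image
    is flattened into a 1D sample of pixel intensities before comparison.
    """
    dists = [
        wasserstein_distance(img_a[..., c].ravel(), img_b[..., c].ravel())
        for c in range(img_a.shape[-1])
    ]
    return float(np.mean(dists))
```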
Hash-based
The next group of methods involves computing a hash function of the image and measuring a distance (e.g. the Hamming distance) between two hashes. Specifically, it can be perceptual hashing, which uses low-frequency information, because low frequencies carry the information about shape and main colors. In order to find identical images (with possibly small differences in tones), Haar cascades can be used too. A Haar cascade is a sequence of small arrays (usually with 2 coordinates only), which are used as kernels to calculate convolutions with the original image.
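For illustration, here is a minimal sketch of the simplest perceptual-hash variant (average hash) with a Hamming distance; a DCT-based pHash would isolate the low-frequency information more explicitly:

```python
import numpy as np
from PIL import Image

def average_hash(img: Image.Image, hash_size: int = 8) -> np.ndarray:
    """Average hash: shrink, grayscale, threshold by the mean.

    Shrinking to hash_size x hash_size discards high frequencies,
    keeping only the rough shape and brightness layout.
    """
    small = img.convert("L").resize((hash_size, hash_size), Image.LANCZOS)
    pixels = np.asarray(small, dtype=np.float64)
    return (pixels > pixels.mean()).flatten()

def hamming_distance(h1: np.ndarray, h2: np.ndarray) -> int:
    """Number of differing bits between two boolean hash vectors."""
    return int(np.count_nonzero(h1 != h2))
```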
Semantics-based
If both pictures show cars, then the proximity should be high. So, it is about semantics. It is usually hard to determine the exact semantics of an image. Recent practical success in CV is connected to artificial neural networks applied to CV tasks, e.g. image classification. Although it is a black-box approach in general, it is believed that the early blocks of such classifiers usually represent local features (texture), while the later ones represent high-level concepts (belonging to a semantic class). The whole spectrum of blocks can be tried for automating the hiding of objects.
Experiments and Results
It seems that the considered similarity is about color, shape, texture, silhouette, geometry... So, we need to play with all such metrics, build our own, and see which works best. With that in mind, we construct the following experiments.
The overall flow of the experiments
We started with a straightforward approach: moving along the BG image with some window (stride) in order to find the best place for the OBJ image. To speed things up, the BG image is resized to a height of 300-600 pixels. The stride - the step with which the window moves over the BG image - is 10-30 pixels. For each such piece (part) of the BG, one of the similarity metrics (see below) is applied to the piece and the OBJ image. In this way we get a so-called heat-map: it contains the similarity scores between the OBJ image and the BG parts. This process is called the global search. Having the heat-map, we can sample from a Boltzmann distribution built on these scores to obtain adequate coordinates for the OBJ image. A small temperature (for calculating the Boltzmann distribution) means sampling close to the maximum score.
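A minimal sketch of this global search and Boltzmann sampling might look as follows (score_fn stands for any of the similarity metrics in this post; names and defaults are illustrative):

```python
import numpy as np

def global_search(bg: np.ndarray, obj: np.ndarray, score_fn, stride: int = 10):
    """Slide over the BG with the given stride and score each BG piece vs. the OBJ.

    `score_fn(piece, obj) -> float` is any similarity metric from this post.
    Returns the heat-map and the top-left coordinates of each scored piece.
    """
    h, w = obj.shape[:2]
    bh, bw = bg.shape[:2]
    ys = range(0, bh - h + 1, stride)
    xs = range(0, bw - w + 1, stride)
    heat = np.array([[score_fn(bg[y:y + h, x:x + w], obj) for x in xs] for y in ys])
    return heat, list(ys), list(xs)

def boltzmann_sample(heat: np.ndarray, temperature: float = 0.01):
    """Sample heat-map indices (i, j) from a Boltzmann distribution over the scores.

    A small temperature concentrates the probability mass near the maximum score.
    """
    logits = (heat - heat.max()) / temperature  # shift by max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    idx = np.random.choice(heat.size, p=probs.ravel())
    return np.unravel_index(idx, heat.shape)
```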
The second part of the algorithm is the local search. We use a similar procedure in a local area found at the global-search stage, but with a small stride (typically 1 pixel). During the local search, we also apply some rotation (from -30 to +30 degrees) and down-scaling (with a factor of 1.0-0.75).
This method can be applied to as many random pairs of BG and OBJ images as possible. And we can use a random walk in the space of all parameters (BG resizing factor, stride, similarity metric, temperature, etc.).
The outcome of every experiment is a hidden-object game stage: a final image with the BG and the placed OBJ. It is validated by 3 people. The validation procedure is simple: the experts mark every final image in the experiment as appropriate for a hidden-object game (HOG) or not. As a result, each similarity metric gets its own success rate.
Our similarity metrics
It makes sense to start simply with pixel-based metrics like L2 (SSD), L1, Normalized Cross-Correlation, etc. A special type of pixel-based method, which is at the same time close to the geometrical approaches, is gradient methods. Instead of applying a pixel-based metric to a raw image, we can apply it to its gradient map. The expectation is that the gradient map carries information about contours, which can be useful for the task. Of course, we can directly use one of the geometrical metrics, like the Hausdorff distance. We can also use Mutual Information (MI) as well as the Wasserstein distance as the distribution-based metrics.
In addition, we can construct our own metrics by combining several basic methods. For example, since the task seems to be related to colors as well as contours, we can use a weighted average of gradient and color (L2, L1, etc.) metrics, or we can add MI to this combination, as in the sketch below.
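Here is a hedged sketch of such a combined metric: a weighted average of a gradient-map L2 score and a raw-color L2 score. The 0.5/0.5 weights are placeholders; our best combination, described later, uses a 4:3:3 weighting with an additional contour term:

```python
import cv2
import numpy as np

def gradient_map(img: np.ndarray) -> np.ndarray:
    """Sobel gradient magnitude of a BGR (uint8) image."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    return np.sqrt(gx ** 2 + gy ** 2)

def grad_color_score(piece: np.ndarray, obj: np.ndarray,
                     w_grad: float = 0.5, w_color: float = 0.5) -> float:
    """Weighted average of a gradient-map L2 score and a raw-color L2 score."""
    def l2_score(a, b):
        # Squash the mean squared difference into (0, 1], 1.0 = identical.
        return 1.0 / (1.0 + np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))
    return w_grad * l2_score(gradient_map(piece), gradient_map(obj)) \
         + w_color * l2_score(piece, obj)
```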
Finally, we can invent something totally new. For instance, we utilize the idea of color quantization. On the one hand, color quantization removes high-frequency signals, so it is easier to apply a pixel-based metric (like L1) afterwards, since the image becomes less noisy. On the other hand, this approach emphasizes the contours, which is also useful for the task. But it is important to compare the 2 images (the part of the BG and the OBJ) with their colors reduced to a shared palette: tones of pink in both images must be replaced with one basic pink, the same for both images. This can be achieved by applying the k-means algorithm (a common implementation of color quantization) jointly to both images (as to a single data set).
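A sketch of this Joint Color Quantization with scikit-learn's k-means, fitted on the pooled pixels of both images, might look as follows (the palette size of 8 is an illustrative choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def jointly_quantize(piece: np.ndarray, obj: np.ndarray, n_colors: int = 8):
    """Quantize both (H, W, 3) images to one shared palette.

    k-means is fitted on the pooled pixels of both images, so e.g. all pink
    tones map to the same basic pink in both of them.
    """
    pixels = np.vstack([piece.reshape(-1, 3), obj.reshape(-1, 3)]).astype(np.float64)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    palette = km.cluster_centers_
    q_piece = palette[km.predict(piece.reshape(-1, 3).astype(np.float64))].reshape(piece.shape)
    q_obj = palette[km.predict(obj.reshape(-1, 3).astype(np.float64))].reshape(obj.shape)
    return q_piece, q_obj  # compare them afterwards with e.g. an L1 score
```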
In order to run experiments with semantic similarity metrics, we take a pre-trained ResNet neural network. We use the outputs of its hidden layers (blocks #2, #3, #4, #5) as image embedding vectors. Having such embeddings, one for each image (the part of the BG and the OBJ), we can use cosine similarity to obtain a score value.
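As a sketch (assuming a Keras ResNet50, whose block names match those used later in the post), the embedding and scoring could look like this:

```python
import numpy as np
import tensorflow as tf

# Truncate a pre-trained ResNet50 at one of the studied blocks ("2c" here).
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False)
extractor = tf.keras.Model(base.input, base.get_layer("conv2_block3_out").output)

def embed(img: np.ndarray) -> np.ndarray:
    """Flatten the block's output feature map into a single embedding vector."""
    x = tf.keras.applications.resnet.preprocess_input(img[np.newaxis].astype(np.float32))
    return extractor(x).numpy().ravel()

def cosine_score(piece: np.ndarray, obj: np.ndarray) -> float:
    """Cosine similarity between embeddings; piece and obj must be the same size."""
    a, b = embed(piece), embed(obj)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```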
Experiments with all available metrics and raw parameters
We started with a global stride of 30 pixels and a BG image scaled down to 300 pixels in height. The sampling temperature was about 0.01-0.1. The metrics (we started with a lot of them) applied at this stage can be found in Figure 2. You can also see that the rate of properly hidden objects is about 5-16%. Another observation is that the mixed metrics (combinations of pixel-based, geometric, etc.) show better results. This supports our assumption that the desired metric relates to similar colors and contours. We can also see that block #4 of the ResNet is among the leaders, while block #5 (the highest level of abstraction and, hence, the provider of truly semantic embeddings) shows low results. We should not draw conclusions from these results yet; first, we should tune the parameters.
Figure 2. Results of the experiments with all available metrics and raw parameters.
Experiments with selected metrics and tuned parameters
This time we used a global stride of 10 pixels, so the chance of finding better coordinates was increased. The sampling temperature was 0.001, so the coordinates with better similarity scores got a higher chance of being sampled. And we scaled the BG image height to 600 pixels. We selected only the best similarity metrics from the previous stage and made some new combinations. You can see the results in Figure 3. The rate of well-hidden objects is about 40-70%, which is much better: the changes in parameters made the global search more thorough. Block #4 of ResNet is among the leaders again.
Figure 3. Results of the experiments with selected metrics and tuned parameters.
Experiments with aligned metrics
The flow of the first two stages involved sampling the similarity metrics for the global search and for the local search independently. But could we make things better by using a single similarity metric for both the global and the local search? The results are indeed better: the rate is about 50-90% (Figure 4).
Figure 4. Results of the experiments with aligned metrics.
Experiments with the ResNet layers
We have made sure that using embeddings from ResNet provides good results. But which layer is best? To answer this question, we ran an additional experiment, focusing only on embeddings from ResNet layers. We built feature vectors for the BG part and the OBJ from the outputs of 4 ResNet blocks: "2c" ("conv2_block3_out"), "3d" ("conv3_block4_out"), "4f" ("conv4_block6_out"), and "5c" ("conv5_block3_out").
You can see (Figure 5) that the "2c" layer turns out to be the most appropriate for the task. We can interpret this in the following way: it is not the best choice to hide a banana among bananas (semantic similarity); it is better to hide it among some yellow elongated stuff (shape/color similarity). The "2c" layer wins by a wide margin, and later layers show decreasing efficiency. The small increase at "5c" may indicate that, in our data set, strong semantic similarity ("5c" is the very last block) can sometimes be a good indicator of a place where the object can be hidden.
Figure 5. Results of the experiments with the ResNet layers.
Thus, it seems that true semantics is not the best choice for a HOG. It makes almost no sense to hide a wall picture among other wall pictures: a user will probably pay more attention to wall pictures once they are mentioned in a task... It is better to hide an object among something semantically different but similar in shape and color. That is why metrics of similarity in texture are a better choice. The "2c" layer of the ResNet seems to be the best choice for solving the task.
Examples of hidden objects
Figure 6. Aloe.
Figure 7. Cherub.
Figure 8. Rocking Horse.
Figure 9. Maracas.
Method
So, we have made the choice: the "2c" layer of the ResNet is the best feature extractor for the task. Now we should create a production-level solution that provides the best coordinates (or even scores for all possible coordinates - a heat-map). It is not very hard to do. We can restructure the original ResNet into our own specific model with the following design (Figure 10). Two inputs (one for a BG and one for an OBJ) are passed into the pure ResNet model (truncated at the "2c" layer). The spatial size of the output feature maps is reduced by 4x: if an OBJ has a size of 80x80 pixels, its feature map has a size of 20x20 (with 256 channels). In order to compare the OBJ feature map with all parts of the BG, we apply average pooling with a pool size of 20x20 (the size of the OBJ feature map) to the BG feature map, using a stride of 1 (or slightly bigger). The OBJ feature map, in turn, goes through global average pooling. Each normalized feature vector of the pooled BG feature map is then multiplied (dot product) with the normalized OBJ feature vector in order to get a map of cosine similarities. The latter is our desired heat-map. A minimal Keras sketch of this design is given after Figure 10.
Figure 10. ResNet-based DL model for getting the best coordinates by only one passing BG and OBJ.
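A minimal Keras sketch of this design, under the assumption of a ResNet50 backbone and the 80x80 OBJ from the example above (with the BG resized to a fixed 600x800 for the sake of the sketch), might look as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumption: ResNet50 backbone, truncated at the "2c" block ("conv2_block3_out").
backbone = tf.keras.applications.ResNet50(weights="imagenet", include_top=False)
to_2c = Model(backbone.input, backbone.get_layer("conv2_block3_out").output)

bg_in = layers.Input(shape=(600, 800, 3))
obj_in = layers.Input(shape=(80, 80, 3))

bg_feat = to_2c(bg_in)    # (None, 150, 200, 256): spatial size reduced by 4x
obj_feat = to_2c(obj_in)  # (None, 20, 20, 256)

# Average-pool the BG features with a window equal to the OBJ feature map size.
bg_pooled = layers.AveragePooling2D(pool_size=(20, 20), strides=1)(bg_feat)
# Global average pooling turns the OBJ feature map into a single 256-d vector.
obj_vec = layers.GlobalAveragePooling2D()(obj_feat)

# L2-normalize both sides and take dot products => a map of cosine similarities.
bg_norm = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(bg_pooled)
obj_norm = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(obj_vec)
heatmap = layers.Lambda(
    lambda ts: tf.einsum("bhwc,bc->bhw", ts[0], ts[1])
)([bg_norm, obj_norm])

model = Model([bg_in, obj_in], heatmap)  # one forward pass yields the heat-map
```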
Future Work
So, we have created a tool for placing a small object image into a background image (without any other changes). What can we do next? Since we work with 2 different images (BG and OBJ), we could generate some art details to make the object fit better. It could be a kind of stylization of the OBJ to bring it closer to the BG in terms of the BG's colors, textures, etc. To train our own model for providing the best coordinates, we could collect data directly inside a game, taking into account player behavior, e.g. the time spent finding an object. The model could be built as a Variational Autoencoder (VAE) or a Generative Adversarial Network (GAN). The reason for using a generative model is the desire to obtain a distribution of coordinates appropriate for the task of hiding an object. It is also possible to fine-tune the ResNet, training the model on the collected data. In addition, we can try building the system on top of other known DL/CV designs, not only ResNet.
Conclusion
Thus, we have finished working on a task which, to the best of our knowledge, had not been posed before: automatically hiding a small object image inside a background image. We started the research by applying a variety of approaches, mainly based on calculating the similarity of two images, and moved step by step along the empirical road of selecting the methods that showed better results. Besides the ResNet-based method, there were several custom hybrid metrics among the winners. The so-called "Grad, color & contour (4:3:3)" method performs as well as the ResNet-based one. It is a big achievement, and not only because it is our custom idea. The "Grad, color & contour (4:3:3)", in contrast to the ResNet-based method, is fully interpretable: we can clearly see that the coordinates best for hiding an object depend on similarities of gradients, colors, and contours, and we even see the particular contribution of each component. It also means that we empirically arrived at a plausible interpretation of the ResNet function: it may compute something close to the "Grad, color & contour (4:3:3)" algorithm. So, we have partially opened the black box :). We also found which specific layer of ResNet is the most appropriate for the task, which was another confirmation that low-level features (texture/structure) correspond to the earlier layers of image-classification neural nets. Finally, we created a production-level neural-net solution, which allows us to get a placement heat-map very fast.
Acknowledgments
This work was a part of internal research at Arkadium and was supported by Jessica Rovello (CEO and Co-Founder), Kenny Rosenblatt (President and Co-Founder), Stanislav Tkachenko (VP of Engineering), Tom Rassweiler (SVP of Product), and others. Special thanks to Stanislav Tkachenko for the critical review of the post.
References
R.L. Dobrushin. Definition of a system of random variables by conditional distributions. Teor. Verojatnost. i Primenen., 15 (1970), pp. 469-497 (In Russian)
K. He, X. Zhang, S. Ren and J. Sun. Deep Residual Learning for Image Recognition. 2016. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
D.P. Huttenlocher, G.A. Klanderman, and W.J. Rucklidge. Comparing images using the Hausdorff distance. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 9, pp. 850-863, Sept. 1993, doi: 10.1109/34.232073.
Y. Lavin, R. Kumar Batra, and L. Hesselink. Feature Comparisons of Vector Fields Using Earth Mover’s Distance. In Proceedings IEEE Visualization’98, pages 103–110, 1998.
Sayan Nag. Image Registration Techniques: A Survey. 2017. URL https://arxiv.org/ftp/arxiv/papers/1712/1712.07540.pdf
U.P. Pytiev and A.I. Chulikiv. Methods for morphological analysis of images. 2010. (In Russian)
Y. Rubner, C. Tomasi, and L. J. Guibas. A Metric for Distributions with Applications to Image Databases. Sixth International Conference on Computer Vision, 4-7 Jan 1998, pp. 59-66, 1998.
Richard Szeliski. Image Alignment and Stitching: A Tutorial. 2006. URL https://www.microsoft.com/en-us/research/wp-content/uploads/2004/10/tr-2004-92.pdf
Richard Szeliski. Computer Vision: Algorithms and Applications. 2010.
A.A. Taha and A. Hanbury. An efficient algorithm for calculating the exact Hausdorff distance. IEEE Transactions On Pattern Analysis And Machine Intelligence, vol. 37 pp. 2153-63, 2015.
Yuriy Vizilter and Sergey Zheltov. Geometrical Correlation and Matching of 2D Image Shapes. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences. 2012. I-3. 191-196. 10.5194/isprsannals-I-3-191-2012.