DCGAN Hyperparameter Tuning - Part 2
Images obtained from my own DCGAN Face Generation Project

DCGAN stands for Deep Convolutional Generative Adversarial Network. It is an artificial intelligence algorithm that creates images employing Deep Learning tools (Convolutional Neural Networks) to enhance its performance.

A Generative Adversarial Network (GAN) is based on game theory. It contains two players, the Generator and the Discriminator, which play different roles during training (a short sketch of the corresponding loss functions appears right after the list):

  • Generator: Produces fake images, from scratch, trying to fool the Discriminator. It is trained in an unsupervised fashion using noise vectors as "unlabeled" data; however, its learning is guided by the Discriminator. There is no method to compute the accuracy of this network, hence human observation is required to evaluate the quality of the generated images.
  • Discriminator: Classifies input images as real or fake, trying not to be fooled by the Generator. Real images are the ones from the training set, whereas fake images are the ones produced by the Generator. It is trained in the usual supervised fashion; however, its labels are so simple that their creation can be fully automated. It is also possible to measure the accuracy of the Discriminator.
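
To make the two roles concrete, below is a minimal sketch of the standard adversarial loss functions, written in TensorFlow 1.x style. The tensor names (d_logits_real, d_logits_fake) are placeholders of my own, not taken from the project code.

    import tensorflow as tf

    def gan_losses(d_logits_real, d_logits_fake):
        """Standard GAN losses; the inputs are the discriminator's raw outputs
        for a batch of real images and a batch of generated (fake) images."""
        # Discriminator: real images are labeled 1, fake images are labeled 0.
        d_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=d_logits_real, labels=tf.ones_like(d_logits_real)))
        d_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))
        d_loss = d_loss_real + d_loss_fake
        # Generator: tries to make the discriminator label its fakes as real (1).
        g_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))
        return d_loss, g_loss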

This article is the second part of the report on my DCGAN Face Generation project, developed for my Udacity Deep Learning Foundation Nanodegree. To better comprehend the explanations in this article, it is strongly recommended to read "DCGAN Hyperparameter Tuning: Part 1" first. The output images shown in the previous article were created by the "1st Version of my DCGAN Face Generation project".

For more interesting descriptive information on GANs and their potential, read "How an A.I. ‘Cat-and-Mouse Game’ Generates Believable Fake Photos", published by The New York Times.

The operating principle of AI algorithms based on neural networks is the minimization of a function called the "Loss" or "Cost"; a common choice is the mean squared error (MSE) of the network's scores over a training batch. The error term is the difference between the label and the network's score for a given training sample in the batch. Thus, in every training iteration, the loss is obtained by averaging the squared errors over all the samples in the batch.
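
As a small worked example of this batch-averaged loss (the numbers below are purely illustrative, not taken from the project):

    import numpy as np

    labels = np.array([1.0, 0.0, 1.0, 1.0])  # targets for a batch of 4 samples
    scores = np.array([0.9, 0.2, 0.7, 0.4])  # network outputs for the same batch

    errors = labels - scores           # per-sample error terms
    loss = np.mean(errors ** 2)        # batch loss = mean of the squared errors
    print(loss)                        # 0.125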

From the explanation above, we can conclude that the greater the "batch size", the more precise the computation of the network loss will be. However, we cannot make the "batch size" equal to the total number of training samples, because the DCGAN model would memorize the classification of the training data (composed of only one batch) instead of learning it. In AI, this problem is known as "overfitting": the network fails to find properties of the training set that generalize well to the testing set. In that case, during the training stage our discriminator model would perform well only when classifying the real images, but poorly when classifying the fake images. Since the GAN algorithm eventually reaches the "Nash Equilibrium", the better the discriminator performs classification, the better the quality of the images rendered by the generator will be.

During the learning process, not only precision is required, but also variety between batches. It would also probably take a lot longer to train a model using only one batch. Thus, there is a tradeoff between score precision and model generalization, both required by the discriminator, such that there is an optimum value for the "batch size".
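
The precision side of this tradeoff is easy to verify numerically: the spread of the batch-averaged loss shrinks roughly in proportion to one over the square root of the batch size. A quick NumPy sketch using synthetic per-sample losses (not the project's data):

    import numpy as np

    np.random.seed(0)
    per_sample_loss = np.random.exponential(scale=1.0, size=100000)  # synthetic losses

    for batch_size in (8, 64, 512):
        n_batches = per_sample_loss.size // batch_size
        batches = per_sample_loss[:n_batches * batch_size].reshape(n_batches, batch_size)
        # The standard deviation of the batch loss drops as the batch grows.
        print(batch_size, round(batches.mean(axis=1).std(), 4))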

As explained in "Part 1", the training process requires the configuration of several hyperparameters whose behaviors are interconnected. This characteristic makes the tuning process a great challenge.

In the beginning, the GAN model learns the common features of the images. Therefore, the fake pictures have to look almost the same after every batch iteration. During this stage, the artificial intelligence has a lot to learn (as it goes from knowing nothing to knowing almost all the common characteristics), so it is convenient that the model learns very fast throughout this period. This part of the training I call the "transient state".

Next, the GAN model learns the small details that make each generated face unique. Therefore, now the resultant fake images have to present differences after every batch iteration. During this stage, the changes in the weights between layers of neurons have to be subtle in both adversarial networks, otherwise the knowledge acquired during the "transient state" gets lost. Therefore, it is convenient that the model learns really slowly throughout this period. This part of the training I call "steady state".

As an example that helps us to understand what was described in the two previous paragraphs, let me show the process followed in the "2nd Version of my DCGAN Face Generation project". In this project version, I used 0.992 for 'beta2' in the TensorFlow implementation of the Adam Optimizer. In "Part 1", I also explained that 'beta2' determines the beginning of the "steady state" of this function's variable factor (no matter what value is chosen for 'beta1'). Below, I plotted a chart, according to the formula shown in the "Tensorflow API Documentation of AdamOptimizer", for the case where 'beta1' and 'beta2' are 0.6 and 0.992 respectively, to illustrate this strategy.
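
The chart can be reproduced with a few lines of NumPy/Matplotlib. I am assuming here that the "variable factor" discussed in "Part 1" is the effective step-size multiplier sqrt(1 - beta2^t) / (1 - beta1^t) that appears in the TensorFlow AdamOptimizer documentation:

    import numpy as np
    import matplotlib.pyplot as plt

    beta1, beta2 = 0.6, 0.992
    t = np.arange(1, 1001)  # training iterations

    # Effective step-size multiplier from the Adam update rule:
    # lr_t = learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
    factor = np.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)

    plt.plot(t, factor)
    plt.xlabel('iteration')
    plt.ylabel('variable factor')
    plt.title('Adam step-size multiplier (beta1=0.6, beta2=0.992)')
    plt.show()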

In the figure above, borrowing the "time response analysis" terminology from Control Theory, we have a "transient state" approximately during iterations 1 through 400 and a "steady state" from iteration 400 on. During the "transient state", the gradients are large and the variable factor grows quickly, resulting in fast learning (large updates during the backpropagation phase). During the "steady state", the gradients are small and the variable factor is almost constant, resulting in slow learning (small updates during the backpropagation phase).

Now, all I had to do was tweak the "beta1", "z_dim", "batch_size" and "learning_rate" hyperparameters so that the generator would start rendering unique faces at, approximately, the 400th iteration. Again, "Part 1" helps with this task by describing the correlation between these hyperparameters.

To depict what was described in the previous paragraph, the first 7 output images of the "2nd Version of my DCGAN Face Generation project" are shown below. Each image was created after another 100 iterations and contains 25 generated faces.

STEPS = 100

STEPS = 200

STEPS = 300

STEPS = 400

STEPS = 500

STEPS = 600

STEPS = 700

At the 1800th step I reached the result shown below and I finished the training because I could not get significant improvements beyond this iteration.

Most of the walkthrough presented so far shows the results obtained using the techniques developed by Ian Goodfellow et al. in 2014. However, there have been many improvements since then, which are called GAN Hacks. These are predominantly implemented as additions to the original GAN algorithm. As we will see, inserting these hacks really boosts the results, and all the hyperparameters will have to be tuned again, this time in a completely different manner. But the whole previous procedure was necessary, because we have to make sure that everything is working properly before we go on.

Some of these hacks I had already applied to the "1st Version of my DCGAN Face Generation project", but I had to add several others to reach the "Final Version of my DCGAN Face Generation project". Most of the tricks and tuning hints shown next were provided to me by Udacity project reviewers. They are described as follows, and a condensed code sketch illustrating several of them appears right after the list:

  • Leaky ReLU activation function helps with the gradient flow and alleviates the problem of sparse gradients (gradients that are almost 0). I used it in both the generator and the discriminator. Max pooling generates sparse gradients, which affects the stability of GAN training; that is why I chose not to use pooling.
  • Batch normalization stabilizes GAN training by reducing internal covariate shift. TensorFlow provides a high-level implementation of this function (tf.layers.batch_normalization). Keep in mind that when batch norm is applied with training=True, the moving mean and variance need to be updated before optimization, so we add a control dependency on the update ops before optimizing the network. More info here: https://ruishu.io/2016/12/27/batchnorm/
  • Sigmoid activation function assigned to the discriminator output layer, which produces probability-like values between 0 and 1.
  • "truncated_normal_initializer" with stddev=0.02, which improves overall generated image quality, like in the DCGAN paper written by Alec Radford & Luke Metz and Soumith Chintala.
  • Dropout layer after the dense layer. Applying dropout discourages the networks from over-learning (memorizing) the training distribution. I used it in both the generator and the discriminator. Using dropout in the generator makes it less prone to memorizing the data distribution and helps it avoid generating images that look like noise.
  • Hyperbolic tangent as the activation of the generator's output layer. After adding this improvement, the generator output lies between -1 and 1. Thus, I also had to normalize the real images to be between -1 and 1 in the train function, so that the input to the discriminator (be it from the generator or a real image) lies within the same range.
  • DCGAN models produce better results when the generator is bigger than the discriminator, so I used 4 conv2d_transpose layers in the generator and 3 conv2d layers in the discriminator.
  • Label smoothing for discriminator loss, also referred to as one-sided label smoothing, makes the discriminator generalize in a better way by preventing it from being too strong.
  • Double execution of the optimization step for the generator. This helps ensure that the discriminator loss does not go to 0 and impede learning.
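
To show how several of these hacks fit together, here is a condensed TensorFlow 1.x sketch. It is not the project's actual code: the layer counts, filter sizes and the constants at the top are illustrative assumptions, but the hacks themselves (leaky ReLU, batch normalization with update-op control dependencies, truncated normal initialization, dropout, tanh generator output, sigmoid discriminator output, one-sided label smoothing and the double generator step) appear where the list above places them.

    import tensorflow as tf

    KERNEL_INIT = tf.truncated_normal_initializer(stddev=0.02)  # DCGAN-style init
    ALPHA = 0.1          # leaky ReLU slope (illustrative value)
    DROPOUT_RATE = 0.6   # illustrative value
    SMOOTH = 0.1         # one-sided label smoothing factor (illustrative value)

    def discriminator(images, reuse=False, training=True):
        with tf.variable_scope('discriminator', reuse=reuse):
            x = tf.layers.conv2d(images, 64, 5, strides=2, padding='same',
                                 kernel_initializer=KERNEL_INIT)
            x = tf.nn.leaky_relu(x, alpha=ALPHA)          # no pooling anywhere
            x = tf.layers.conv2d(x, 128, 5, strides=2, padding='same',
                                 kernel_initializer=KERNEL_INIT)
            x = tf.layers.batch_normalization(x, training=training)
            x = tf.nn.leaky_relu(x, alpha=ALPHA)
            x = tf.layers.flatten(x)
            x = tf.layers.dropout(x, rate=DROPOUT_RATE, training=training)
            logits = tf.layers.dense(x, 1, kernel_initializer=KERNEL_INIT)
            return tf.sigmoid(logits), logits             # probability-like output + raw logits

    def generator(z, training=True):
        with tf.variable_scope('generator'):
            x = tf.layers.dense(z, 7 * 7 * 256, kernel_initializer=KERNEL_INIT)
            x = tf.layers.dropout(x, rate=DROPOUT_RATE, training=training)  # dropout after dense
            x = tf.reshape(x, (-1, 7, 7, 256))
            x = tf.layers.batch_normalization(x, training=training)
            x = tf.nn.leaky_relu(x, alpha=ALPHA)
            x = tf.layers.conv2d_transpose(x, 128, 5, strides=2, padding='same',
                                           kernel_initializer=KERNEL_INIT)
            x = tf.layers.batch_normalization(x, training=training)
            x = tf.nn.leaky_relu(x, alpha=ALPHA)
            x = tf.layers.conv2d_transpose(x, 1, 5, strides=2, padding='same',
                                           kernel_initializer=KERNEL_INIT)
            return tf.tanh(x)                             # output in [-1, 1]

    def model_opt(real_images, z, learning_rate, beta1):
        fake_images = generator(z)
        _, d_logits_real = discriminator(real_images)
        _, d_logits_fake = discriminator(fake_images, reuse=True)

        # One-sided label smoothing: real labels become (1 - SMOOTH) instead of 1.
        d_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=d_logits_real, labels=tf.ones_like(d_logits_real) * (1 - SMOOTH)))
        d_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=d_logits_fake, labels=tf.zeros_like(d_logits_fake)))
        d_loss = d_loss_real + d_loss_fake
        g_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            logits=d_logits_fake, labels=tf.ones_like(d_logits_fake)))

        d_vars = [v for v in tf.trainable_variables() if v.name.startswith('discriminator')]
        g_vars = [v for v in tf.trainable_variables() if v.name.startswith('generator')]

        # Batch norm stores its moving mean/variance updates in UPDATE_OPS,
        # so they must run together with each optimization step.
        with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
            d_opt = tf.train.AdamOptimizer(learning_rate, beta1=beta1).minimize(
                d_loss, var_list=d_vars)
            g_opt = tf.train.AdamOptimizer(learning_rate, beta1=beta1).minimize(
                g_loss, var_list=g_vars)
        return d_loss, g_loss, d_opt, g_opt

    # Inside the training loop, the generator step is run twice per discriminator step:
    #   sess.run(d_opt, feed_dict=...)
    #   sess.run(g_opt, feed_dict=...)
    #   sess.run(g_opt, feed_dict=...)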

As for the new tuning strategy, it now consists of following a series of good practices, summarized in a small configuration sketch after the list:

  • Experiment with various values of alpha (the slope of the leaky ReLU, as stated in the DCGAN paper) between 0.06 and 0.18 and compare your results.
  • If the discriminator ends up dominating the generator, reduce the discriminator learning rate and increase its dropout. Ref.: F. Chollet, "Deep Learning with Python", chapt. 8.32. For example: logits = tf.layers.dropout(logits, rate=0.6).
  • It is recommended to use values between 0.2 and 0.5 for "beta1". Here's a good post explaining the importance of beta values and which value might be empirically better.
  • An important point to note: batch size and learning rate are linked. If the batch size is too small, the gradients will become more unstable and you will need to reduce the learning rate. A starting point for experimenting with batch size would be somewhere between 16 and 32.
  • Experiment with more epochs.
  • You can also go through the original DCGAN paper to choose hyperparameters.
  • If you want to generate varied face shapes, experiment with the value of z_dim (probably in the range 128 - 256).
  • Experiment with different values for "kernel_size" and "strides". The best values for me were 5 and 2, respectively.
  • Even though making "dropout_rate"=0.6 worked fine for me, in most cases it is recommended to use a dropout rate between 20% and 50%.
  • Even though making "learning_rate"=0.0125 worked fine for me, in most cases it is recommended to use a learning rate between 0.0002 and 0.0008.
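
To keep the ranges above in one place, a starting configuration could look like the sketch below. The specific numbers are illustrative picks within the recommended ranges (or values mentioned in this article), not the project's final settings:

    # Illustrative starting-point hyperparameters gathered from the tips above.
    hyperparams = {
        'alpha': 0.1,             # leaky ReLU slope; try values between 0.06 and 0.18
        'beta1': 0.4,             # Adam momentum term; 0.2 - 0.5 recommended
        'batch_size': 32,         # start somewhere between 16 and 32
        'z_dim': 128,             # 128 - 256 for more varied face shapes
        'kernel_size': 5,         # worked best for me
        'strides': 2,             # worked best for me
        'dropout_rate': 0.5,      # usually 0.2 - 0.5 (0.6 also worked for me)
        'learning_rate': 0.0004,  # usually 0.0002 - 0.0008 (0.0125 also worked for me)
    }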

Below I present some of the images generated by the "Final Version of my DCGAN Face Generation project".

From the above, we can conclude that the stability the GAN Hacks bring to the training is such that, even if discouraged values are used for some of the hyperparameters, there is still a good chance that the model will learn properly. The generated fake faces turned out to be very detailed and realistic, despite the training images being only 28x28 pixels.

I would like to mention that I had to experiment with several architectures and hyperparameter values to achieve a fair result. As soon as I solved this puzzle, I trained the model once again, but this time for a little longer (approximately 1 hour), to get some extra enhancements before turning in my project to the Udacity reviewers.

For me, the main purposes of this project were to obtain my Deep Learning certificate and to update my previous knowledge to the most important cutting-edge AI breakthroughs. To achieve these ends, not only my personal effort was required, but also financial expenses. All the training procedures were run in cloud computers provided by FloydHub, therefore I had to pay for GPU computing hours. The hardware used was a Tesla K80 (12 GB Memory / 61 GB RAM / 100 GB SSD). To check out this and other Deep Learning projects, go to my portfolio at FloydHub.

In the future, I intend to post more articles like this to my LinkedIn account. They certainly give a clearer understanding of what goes on in the process of developing ground-breaking technologies.

Those who are interested in the subject can also watch the talk on "How to train a GAN" by Soumith Chintala, one of the authors of the original DCGAN paper.

There are many uses for computer-generated images, not only in games, films and other media, but also in AI itself, by creating samples to be used to train other algorithms. It is also possible to make a few changes to the GAN model to create a multi-class image classifier that can learn from mostly unlabeled data, so that it learns in a semi-supervised fashion (which greatly improves its performance). For those eager to see some code examples on this topic, I have also made available a GAN Semi-Supervised Learning Project.

