Ch. 14.1: Types of GANs with Math
what’s up folks?? how are you doin’??
In the last article I talked about GANs with math, so in this article I'm gonna talk about the different types of GANs that have been invented since the vanilla GAN.
Let’s roll baby!
GANs were introduced by Ian Goodfellow et al. in 2014. Since then, a lot of researchers from big companies have come up with a lot of cool ideas to improve GAN training and performance.
Before I talk about them, let me just give you a quick recap from the last article.
GANs have two networks:
- The generator G takes some random noise and tries to produce images that look like the real training images, fooling the discriminator D.
- The discriminator D tries to classify its input as real if it comes from the training data and fake if it comes from the generator G.
During training, these two networks compete with each other to get stronger at their jobs.
I highly recommend you check out that article first, since this one follows the same pattern.
Okay, I assume you've already read that article, so let's move on.
Types of GANs
Important notes:
- I explain each type of GAN and then show only the code that changes w.r.t. the algorithm; you can find the full code on my GitHub, as I don't get into explaining the code here.
- The order of the GAN types is not necessarily the order in which they were invented; I explain them in the order that seems easiest to understand.
- The code presented is trained only briefly, with simple architectures (feel free to train longer with different architectures).
1. Deep Convolutional GANs (DCGAN)
The core idea is that we use convolutional neural networks instead of vanilla (fully connected) neural networks for both the discriminator D and the generator G.
The discriminator D is a stack of convolutional layers with strided convolutions, so it downsamples the input image at every conv layer.
As the input passes through every layer, the network learns a deep representation of the input images to classify them (fake or real).
Since the discriminator D is just a classifier, any well-designed image classifier architecture can serve as the discriminator for GAN training.
The generator G is a stack of convolutional layers with fractional-strided convolutions (transposed convolutions), so it upsamples its input at every conv layer.
As the noise passes through every layer, the network increases the image size up to the real image size.
And the same math applies to DCGANs as we discussed in the last article.
This type of GAN was introduced in 2015 in this paper.
The authors also used additional techniques in the conv nets (e.g., batch normalization in both networks, ReLU activations in G, and LeakyReLU in D).
For the training details, please check out the paper; here are the results on MNIST.
Since the CNN architecture works well, we use the DCGAN setup for all the other types below.
Code changes snippets
The only change is the network architectures: here we just use convolutional neural networks for both G and D. The remaining logic, cost, and training process are the same as in the vanilla GAN.
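To make the downsampling/upsampling concrete, here is a small sketch (my own, not from the original notebook) that checks the spatial sizes a 28×28 MNIST image goes through with kernel 4, stride 2, padding 1 convolutions, and how transposed convolutions reverse them:

```python
def conv_out(size, kernel, stride, pad):
    # output size of a strided convolution (downsampling)
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride, pad):
    # output size of a transposed ("fractional-strided") convolution (upsampling)
    return (size - 1) * stride - 2 * pad + kernel

# Discriminator path: a 28x28 MNIST image shrinks at every conv layer
d_sizes = [28]
for _ in range(2):
    d_sizes.append(conv_out(d_sizes[-1], kernel=4, stride=2, pad=1))
print(d_sizes)  # [28, 14, 7]

# Generator path: a 7x7 feature map grows back to 28x28
g_sizes = [7]
for _ in range(2):
    g_sizes.append(deconv_out(g_sizes[-1], kernel=4, stride=2, pad=1))
print(g_sizes)  # [7, 14, 28]
```

Notice the transposed convolution with the same kernel/stride/padding exactly undoes the strided convolution, which is why G mirrors D.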
you can find the full notebook code here.
2. Conditional GANs (CGAN)
GANs can be extended to a conditional model if both the generator and discriminator are conditioned on some extra information y.
y can typically be a class label or tag.
The core idea is to train a GAN with a condition; we can perform the conditioning by feeding y into both the discriminator and the generator as an additional input layer.
e.g., for MNIST data, the condition y is the one-hot label vector (10-length).
We concatenate this vector y with the real input x and feed the result to the discriminator D; likewise, we concatenate y with the noise z and feed the result to the generator G.
The authors of the paper slightly modified the loss function
At the discriminator D: given input x and class y → classify whether the input image is fake or real.
At the generator G: given noise z and class y → generate an image that is conditioned on y.
After training, if we give a one-hot vector for the digit 5 as y along with noise z to the generator G, the learned generator produces an output that looks like a handwritten 5.
Code changes snippets
The condition is added as y, which is the label we take from the dataset.
At the generator G and the discriminator D we concatenate the input with y.
We calculate the loss the same way as we did for vanilla GANs,
so during training we give the corresponding labels along with the inputs to both G and D.
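Here is a minimal numpy sketch of just that conditioning step (the shapes are my own assumptions: flattened 28×28 MNIST images, a 100-dim noise vector, and a 10-length one-hot label):

```python
import numpy as np

batch = 16
x = np.random.rand(batch, 784)            # real images, flattened 28x28
z = np.random.randn(batch, 100)           # random noise
labels = np.random.randint(0, 10, batch)  # class labels from the dataset
y = np.eye(10)[labels]                    # one-hot condition vector (10-length)

d_input = np.concatenate([x, y], axis=1)  # condition fed to the discriminator
g_input = np.concatenate([z, y], axis=1)  # condition fed to the generator
print(d_input.shape, g_input.shape)       # (16, 794) (16, 110)
```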
After training, the model is able to produce output based on the condition we give.
For example, if we give random noise z and the digit 3 (one-hot) as the label y, then we get output like this:
you can find the full notebook code here.
3. Least Squares GANs (LSGAN)
As we know from the last article, "regular GANs may lead to the vanishing gradients problem," which slows the learning process.
LSGAN attempts to overcome this problem by adopting the least squares loss function instead of the sigmoid cross entropy loss for the discriminator.
The authors of the paper trained this on different datasets and observed two benefits of LSGANs over regular GANs. First, LSGANs are able to generate higher quality images than regular GANs. Second, LSGANs perform more stably during the learning process.
So what's the reason to use the L2 loss?
As we know, the log loss only cares about whether the sample is classified correctly or not; it does not care much about the distance between the fake distribution (believed real) and the real distribution.
This happens when updating the generator using fake samples that are on the correct side of the decision boundary but still far from the real data (observe picture (b) from the paper below to understand more).
The illustrations of different behaviors of two loss functions. (a) Decision boundaries of two loss functions. Note that the decision boundary should go across the real data distribution for a successful GANs learning. Otherwise, the learning process is saturated. (b) The decision boundary of the sigmoid cross entropy loss function. It gets very small errors for the fake samples (in magenta) for updating G as they are on the correct side of the decision boundary. (C) Decision boundary of the least squares loss function. It penalizes the fake samples (in magenta), and as a result, it forces the generator to generate samples toward decision boundary
In (b), the fake samples (magenta) are believed to be real but are very far from the actual real distribution, so at G we get almost no loss, as they are correctly classified.
The L2 loss pulls those believed-real fake samples close to the real distribution, because it penalizes samples that lie far from the decision boundary.
The key idea is using the loss function which is able to move the fake samples toward the decision boundary.
So how is the LSGAN loss function formed?
Minimizing the objective function of LSGAN yields minimizing the Pearson χ² divergence (I'll go deeper into these divergences in later articles),
where a and b are the labels for fake data and real data, respectively, and c denotes the value that G wants D to believe for fake data.
Using the 0–1 binary coding scheme, c = b = 1 and a = 0, we get the following objective functions.
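With c = b = 1 and a = 0, the LSGAN objectives can be written as:

```latex
\min_D V(D) = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}}\big[(D(x) - 1)^2\big]
            + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z)))^2\big]

\min_G V(G) = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(G(z)) - 1)^2\big]
```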
The intuition is: in LSGAN the label for real data is 1 and the label for generated data is 0.
And of course the generator wants the target label for its generated data to be 1.
Another way to choose the values of a, b, and c is to satisfy the conditions b − c = 1 and b − a = 2, which gives a = −1, b = 1, and c = 0.
The equations change a bit, but this does not affect the idea or the process.
In terms of code, the whole idea is the same as for regular GANs except for a change in the loss function:
we don't use the sigmoid cross entropy; instead we use the L2 (least squares) loss.
The least squares loss function is flat only at one point, while the sigmoid cross entropy loss function saturates when x is relatively large.
Code changes snippets
The only change in this GAN is the loss function.
The remaining process and training are the same as in the traditional GAN.
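A tiny numpy sketch of that loss change (the function names are mine), using the 0–1 scheme a = 0, b = c = 1:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # least squares loss: push D(real) toward label b = 1, D(fake) toward a = 0
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # G wants D to output c = 1 for its samples
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

d_real = np.array([0.9, 0.8])  # D's outputs on real samples
d_fake = np.array([0.1, 0.2])  # D's outputs on fake samples
print(round(lsgan_d_loss(d_real, d_fake), 4))  # 0.025
print(round(lsgan_g_loss(d_fake), 4))          # 0.3625
```

Note there is no sigmoid on D's output here; the L2 loss acts directly on the raw scores.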
you can find the full notebook code here.
4. Auxiliary Classifier GAN (ACGAN)
In ACGANs, every generated sample has a corresponding class label c ~ p_c in addition to the noise z.
G uses both (noise z and class c) to generate images.
The discriminator gives both a probability distribution over sources and a probability distribution over the class labels:
D(X) = (P(S | X), P(C | X))
where X is the set of fake and real samples, S is the source distribution (real or fake), and C is the class-label distribution.
The objective function has two parts: the log-likelihood of the correct source, L_S, and the log-likelihood of the correct class, L_C.
D is trained to maximize L_S + L_C, while G is trained to maximize L_C − L_S.
In terms of architecture, think of ACGANs as conditional GANs (CGANs), except there is no concatenation at D, and an additional head is added at D to output the class probabilities.
Code changes snippets
The main changes are: at D we get additional class-probability outputs, and the new class loss is added at both G and D.
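A numpy sketch of the two log-likelihood terms (the names and toy numbers are mine; the notebook may organize this differently):

```python
import numpy as np

def log_lik_source(p_real, target):
    # L_S: log-likelihood of the correct source (target = 1 real, 0 fake)
    return np.mean(target * np.log(p_real) + (1 - target) * np.log(1 - p_real))

def log_lik_class(class_probs, labels):
    # L_C: log-likelihood of the correct class label
    return np.mean(np.log(class_probs[np.arange(len(labels)), labels]))

class_probs = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1]])   # D's class head (softmax outputs)
labels = np.array([0, 1])                   # true class labels
lc = log_lik_class(class_probs, labels)
ls = log_lik_source(np.array([0.9, 0.8]), np.array([1.0, 1.0]))

d_objective = ls + lc   # D maximizes L_S + L_C
g_objective = lc - ls   # G maximizes L_C - L_S
```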
The training is very much the same!
After training, we can also predict the digit classes along with generating the images.
you can find the full notebook code here.
5. infoGAN
As we know, in GANs after training we just take some noise z and use the trained generator G to produce fake (real-looking) images.
This kind of learning is called an entangled representation.
Here we don't know what's going on inside (in fact, neural networks are like that in general), and the only control we have is the random noise z (which of course doesn't give us any handle to do something different at the input).
The question is: can we add something to the input, other than noise, that lets us control the generator's output?
The answer is yes: by giving a condition (CGAN, above) or by giving latent codes (infoGAN).
InfoGANs are able to learn a disentangled representation in an unsupervised way.
A disentangled representation means that the individual neurons in the network somehow each learn a complete concept on their own.
In conditional GANs (CGAN), we give the condition c (the label y) manually, as above, but infoGANs try to learn it automatically.
E.g., for MNIST data, we don't need to provide labels to an infoGAN in a supervised way.
Ok how does it work???
Just like CGANs, the infoGAN splits the generator input into two parts: 1. the noise vector and 2. a new "latent code" vector.
Let's say the latent code is c; in CGANs we give this in a supervised way, but here assume c is unknown.
The authors of the paper denote the set of structured latent variables by c1, c2, . . . , cL. In its simplest form, we may assume a factored distribution, given by P(c1, c2, . . . , cL) = ∏_{i=1}^{L} P(ci).
from the MNIST dataset, it would be ideal if the model automatically chose to allocate a discrete random variable to represent the numerical identity of the digit (0–9), and chose to have two additional continuous variables that represent the digit’s angle and thickness of the digit’s stroke. It is the case that these attributes are both independent and salient, and it would be useful if we could recover these concepts without any supervision.
So c1 can be the digit-label vector, while c2 and c3 represent the digit's angle and stroke thickness.
Note: let's focus only on one code c, the digit label.
Since c is unknown, the authors solve this by maximizing the mutual information between the latent variable c and the generator's output, and they use that information term to train the infoGAN.
In information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables.
More specifically, it quantifies the “amount of information” obtained about one random variable through observing the other random variable.
In our case, the "amount of information" obtained about the latent variable c comes from observing the generator's output (x_fake).
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
which means if we know Y then we use that Y to know something about X or vice versa.
in our case if we know generator’s output , we then use that to know something about the latent variable c.
In our case, I(c; G(z, c)) plays the role of I(X; Y).
if both variables are completely independent then I(X;Y) = 0.
with that they propose to solve the following information-regularized minimax game:
V_gan(D, G) is the vanilla GAN cost, and λ (lambda) is the regularization weight, typically set to 1.
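Written out, that information-regularized objective is:

```latex
\min_G \max_D \; V_I(D, G) = V(D, G) - \lambda\, I\big(c;\, G(z, c)\big)
```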
so how do we calculate the I(c; G(z, c)) ??
ok let’s define some terms to understand better,
H(c) → entropy of the latent variable (high, because c is random and unknown)
G(z, c) → generator's output, with z being random noise and c a random digit-label vector (size 10 in the case of MNIST)
so I(c; G(z, c)) = H(c) − H(c | G(z, c))
H(c | G(z, c)) → the entropy of c given the generator's output G(z, c) as evidence.
We have to maximize I(c; G(z, c)) to learn as much as possible about c from G(z, c). This is hard to maximize directly, as it requires the posterior P(c|x) with x = G(z, c); fortunately, we can obtain a lower bound on it by defining an auxiliary distribution Q(c|x) to approximate P(c|x).
P(c|x) represents the likelihood of the latent code c given the generated input x.
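The resulting variational lower bound from the paper, which is maximized in place of the intractable mutual information, is:

```latex
L_I(G, Q) = \mathbb{E}_{c \sim P(c),\; x \sim G(z,c)}\big[\log Q(c \mid x)\big] + H(c) \;\le\; I\big(c;\, G(z, c)\big)
```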
OK, in English, it's like this:
1. Take a separate neural network called Q.
2. Sample a latent variable c at random and give it to the generator: G(z, c).
3. While the generator and discriminator do their jobs, take that G(z, c) (i.e., x_fake) and feed it to the Q network, with c as the label, to train Q. Here Q(c|x) is the likelihood of the latent code c given x.
4. Repeat this process to train all three networks (G, D, Q) for as long as you wish.
So the ultimate result is that we learn the label from the generator's output in an unsupervised way while G and D fight each other.
OK, let's code. The theory takes some time to settle, but the code is super easy.
Code changes snippets
A new network Q has been added, and a latent variable (called y in the code) is added to the generator's input.
Note: here I take only one latent variable for the sake of understanding, but the authors also propose two more continuous variables (one for digit thickness and one for angle).
The new network's loss is Q_loss.
The optimizers for the Q network and for G both get updated during backpropagation.
During training we sample the codes y randomly (which is why this is unsupervised).
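A minimal numpy sketch of the extra Q loss: minimizing −E[log Q(c|x)], a cross-entropy between the randomly sampled code and Q's prediction, maximizes the lower bound L_I (names and toy numbers are mine):

```python
import numpy as np

def q_loss(q_probs, c_idx):
    # -E[log Q(c|x)]: minimize this to maximize the lower bound L_I
    return -np.mean(np.log(q_probs[np.arange(len(c_idx)), c_idx]))

# sample the latent code c at random -- no dataset labels, fully unsupervised
c_idx = np.random.randint(0, 10, size=32)
c_onehot = np.eye(10)[c_idx]  # fed to G along with noise z; also Q's target

# toy Q outputs (softmax over the 10 codes) for 2 generated images
q_probs = np.array([[0.5] + [0.5 / 9] * 9,
                    [0.9] + [0.1 / 9] * 9])
print(round(q_loss(q_probs, np.array([0, 0])), 4))  # 0.3993
```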
After training, if we give random noise z and any one-hot code vector (here I took 2), then we get results like this.
And this happens completely in an unsupervised manner.
you can find the full notebook code here.
Alright! with that I am gonna call it a day.
Well, if you have any problem understanding any of the concepts discussed, let me know on LinkedIn (I am happy to help).
So far we've discussed only 5 different variations of GANs; there are a lot of other advanced versions available, and we will discuss them all in the next articles.
Original Papers
DCGAN paper · CGAN paper · LSGAN paper · ACGAN paper · infoGAN paper
The following things we discuss in next articles
- All divergences (KL, JS, Wasserstein)
- WGAN, STGAN, RSGAN, CycleGAN, BEGAN, etc.
- image-to-image and video-to-video translation with GANs
Questions/Suggestion/Mistakes/Comments are always Welcome :)