Generative Adversarial Networks: What They Are, How They Work, and My Experiments


Over the past month, I've been studying Generative Adversarial Networks (GANs). I started with the basics, using the MNIST dataset, and advanced through implementing a Conditional GAN (CGAN) with a Deep Convolutional GAN (DCGAN) architecture. Here’s a breakdown of my progress, key insights, challenges, and future goals:

Understanding the Basics of GANs

At a high level, GANs are composed of two neural networks that compete with each other:

  1. Generator: This network takes random noise as input and generates synthetic data. Its goal is to create realistic data to fool the second network, the Discriminator.
  2. Discriminator: This network receives both real data (from a dataset) and fake data (from the Generator). Its objective is to classify the real data as real correctly and the fake data as fake.

The competition between these networks creates a feedback loop: as the Generator gets better at creating convincing fakes, the Discriminator has to become more discerning. Training them simultaneously requires a delicate balance—if one network gets too strong too fast, the other struggles to improve.

Discriminative vs. Generative Models: A Quick Overview

As I explored more about GANs and how they work, I needed to understand the difference between discriminative and generative models - since GANs fall into the generative category.

Discriminative Models:

  • Focus on predicting labels by learning the decision boundaries between classes; mainly used for classification.
  • Learn how to distinguish between classes, which is why they are often called classifiers.
  • Examples include Logistic Regression and Support Vector Machines (SVMs).

Generative Models:

  • Aim to model the entire data distribution and generate new data samples.
  • Learn how to make realistic representations of classes.
  • Take a random input called "noise" and try to create a set of features that looks like the class.
  • Examples include Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs).

Generator:

Purpose:

The primary role of the Generator is to create fake data that resembles real data. In the case of image generation, it takes random noise as input and transforms it into images that ideally should look indistinguishable from real images in the training dataset.

How It Works:

  1. Input: The Generator takes a random noise vector (usually sampled from a normal distribution) as its input. This noise serves as a seed from which the Generator can create various outputs.
  2. Architecture: The Generator typically employs a series of transposed convolutional layers (also known as deconvolutional layers) to upsample the input noise. These layers reverse the downsampling process of convolutional layers used in the Discriminator.
     • Activation Functions: The final layer usually employs a Tanh activation function to scale the output pixel values between -1 and 1 (commonly used in image generation). Intermediate layers often use ReLU or Leaky ReLU to maintain non-linearity and improve the flow of gradients.
  3. Training Process: During training, the Generator aims to maximize the probability of the Discriminator making a mistake by producing fake images that look real. It learns through backpropagation, receiving feedback from the Discriminator about how "real" or "fake" its outputs are.
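The Generator steps above can be sketched in a few lines. This is a minimal illustration that assumes PyTorch and 28x28 grayscale images; the framework, the class name `Generator`, the noise size `z_dim`, and the layer widths are all assumptions made for the example, not the exact code used in these experiments.

```python
import torch
from torch import nn


class Generator(nn.Module):
    """Maps a noise vector to a 1x28x28 image with values in [-1, 1]."""

    def __init__(self, z_dim=64):  # z_dim is an assumed noise size
        super().__init__()
        self.net = nn.Sequential(
            # (z_dim, 1, 1) -> (128, 7, 7)
            nn.ConvTranspose2d(z_dim, 128, kernel_size=7, stride=1, padding=0),
            nn.ReLU(),
            # (128, 7, 7) -> (64, 14, 14)
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            # (64, 14, 14) -> (1, 28, 28)
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # squashes pixel values into [-1, 1]
        )

    def forward(self, z):
        # Reshape (batch, z_dim) into (batch, z_dim, 1, 1) so it can be upsampled.
        return self.net(z.view(z.size(0), -1, 1, 1))


noise = torch.randn(8, 64)       # noise sampled from a normal distribution
images = Generator(z_dim=64)(noise)
print(images.shape)              # batch of 8 single-channel 28x28 images
```

Each transposed convolution roughly doubles the spatial size, which is how a single noise vector grows into a full image.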

Discriminator

Purpose:

The Discriminator's role is to differentiate between real data and the fake data generated by the Generator. It acts as a binary classifier, aiming to maximize its accuracy in identifying the source of the input data.

How It Works:

  1. Input: The Discriminator takes both real images from the training dataset and fake images generated by the Generator as input.
  2. Architecture: It typically uses convolutional layers followed by pooling layers to downsample the input. These layers help the model learn spatial hierarchies and extract important features.
     • Activation Functions: The output layer generally uses a sigmoid activation function to produce a probability score between 0 (fake) and 1 (real).
  3. Training Process: The Discriminator is trained to maximize the probability of correctly classifying real and fake images. It learns through backpropagation based on the errors it makes during predictions. The Discriminator’s feedback helps the Generator improve by letting it know how “real” or “fake” its generated outputs are.
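A matching sketch of the Discriminator described above, again assuming PyTorch and 28x28 grayscale inputs. The class name, channel counts, and the 0.2 slope for LeakyReLU are illustrative assumptions.

```python
import torch
from torch import nn


class Discriminator(nn.Module):
    """Binary classifier: outputs a probability that each image is real."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (1, 28, 28) -> (32, 28, 28)
            nn.LeakyReLU(0.2),
            nn.MaxPool2d(2),                              # -> (32, 14, 14)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> (64, 14, 14)
            nn.LeakyReLU(0.2),
            nn.MaxPool2d(2),                              # -> (64, 7, 7)
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 1),
            nn.Sigmoid(),                                 # score in [0, 1]: 0 = fake, 1 = real
        )

    def forward(self, images):
        return self.net(images)


batch = torch.randn(8, 1, 28, 28)   # stand-in for a mix of real and fake images
scores = Discriminator()(batch)
print(scores.shape)                 # one probability per image
```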

For my first GAN, I used the MNIST dataset, which has grayscale images of handwritten digits (0-9). I implemented a training loop for 200 epochs, focusing on optimizing both the Generator and Discriminator. The losses reported after 200 epochs were as follows:

  • Generator Loss: 1.3830
  • Discriminator Loss: 0.4462

These results revealed that while the Discriminator was performing well, the Generator struggled to produce convincing images. This discrepancy is a common issue in GAN training, where an imbalanced training dynamic can hinder the Generator's ability to improve.
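For readers who want to see what one iteration of such a training loop looks like, here is a hedged sketch of a single BCE training step. It assumes PyTorch, uses tiny fully connected networks, and feeds random tensors in place of MNIST so it runs on its own; the learning rate, Adam betas, and layer sizes are assumptions, not the settings behind the losses reported above.

```python
import torch
from torch import nn

torch.manual_seed(0)
z_dim, img_dim, batch = 16, 28 * 28, 32   # assumed sizes

gen = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                    nn.Linear(128, img_dim), nn.Tanh())
# The Discriminator returns raw scores; BCEWithLogitsLoss applies the sigmoid internally.
disc = nn.Sequential(nn.Linear(img_dim, 128), nn.LeakyReLU(0.2),
                     nn.Linear(128, 1))

criterion = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4, betas=(0.5, 0.999))


def train_step(real):
    n = real.size(0)

    # 1) Discriminator update: real images are labeled 1, fakes are labeled 0.
    fake = gen(torch.randn(n, z_dim)).detach()   # detach: do not update the Generator here
    d_loss = (criterion(disc(real), torch.ones(n, 1)) +
              criterion(disc(fake), torch.zeros(n, 1))) / 2
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator update: it wants the Discriminator to label its fakes as 1.
    fake = gen(torch.randn(n, z_dim))
    g_loss = criterion(disc(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return g_loss.item(), d_loss.item()


real_batch = torch.rand(batch, img_dim) * 2 - 1   # placeholder for MNIST scaled to [-1, 1]
g_loss, d_loss = train_step(real_batch)
print(f"Generator loss: {g_loss:.4f}, Discriminator loss: {d_loss:.4f}")
```

Alternating the two updates inside every step is what keeps the networks training simultaneously, as described earlier.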


Fake Images
Real Images

Stepping Up with DCGAN

After experimenting with a basic GAN, I decided to upgrade to a Deep Convolutional GAN (DCGAN). DCGANs are a significant advancement in the GAN framework, proposed by Radford, Metz, and Chintala in their 2015 paper, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks".


Architectural drawing of a generator from DCGAN

Why DCGAN?

The basic GAN I started with relies on fully connected layers, which don’t do a great job of capturing the spatial relationships in images. DCGANs, on the other hand, use convolutional layers, which are much better at recognizing features like edges and textures.

Key Changes:

  • I replaced fully connected layers with convolutional layers in both the Generator and the Discriminator.
  • The Generator used transposed convolutions to upsample the data, creating more detailed images.
  • I added batch normalization to stabilize the training and avoid wild fluctuations.
  • For the activation functions, I went with LeakyReLU in the Discriminator to ensure better gradient flow and Tanh in the Generator’s output layer to keep pixel values between -1 and 1.
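The changes above can be sketched as two small building blocks. This assumes PyTorch and 28x28 MNIST-sized images; the helper names `gen_block` and `disc_block`, the channel counts, and the 0.2 LeakyReLU slope are hypothetical choices for illustration, and the sketch does not reproduce every detail of the paper or of the model trained here.

```python
import torch
from torch import nn


def gen_block(in_ch, out_ch, kernel_size=4, stride=2, padding=1, final=False):
    """Transposed conv -> BatchNorm -> ReLU, or transposed conv -> Tanh for the last layer."""
    layers = [nn.ConvTranspose2d(in_ch, out_ch, kernel_size, stride, padding)]
    if final:
        layers.append(nn.Tanh())
    else:
        layers += [nn.BatchNorm2d(out_ch), nn.ReLU()]
    return nn.Sequential(*layers)


def disc_block(in_ch, out_ch, kernel_size=4, stride=2, padding=1, final=False):
    """Strided conv -> BatchNorm -> LeakyReLU, or a bare conv for the last layer."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding)]
    if not final:
        layers += [nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)


z_dim = 64  # assumed noise size
generator = nn.Sequential(
    gen_block(z_dim, 128, kernel_size=7, stride=1, padding=0),  # 1x1 -> 7x7
    gen_block(128, 64),                                         # 7x7 -> 14x14
    gen_block(64, 1, final=True),                               # 14x14 -> 28x28
)
discriminator = nn.Sequential(
    disc_block(1, 64),                                          # 28x28 -> 14x14
    disc_block(64, 128),                                        # 14x14 -> 7x7
    disc_block(128, 1, kernel_size=7, stride=1, padding=0, final=True),  # 7x7 -> 1x1 raw score
    nn.Flatten(),
)

fake = generator(torch.randn(16, z_dim, 1, 1))
scores = discriminator(fake)
print(fake.shape, scores.shape)
```

Here the Discriminator downsamples with strided convolutions instead of pooling, and no fully connected layer appears in either network.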

I found the DCGAN paper to be really interesting and helpful. It provided valuable insights into the architectural changes and the rationale behind them, making it easier to understand why these modifications lead to improved performance. I definitely recommend reading it if you’re interested in GANs!

Challenges with the Binary Cross-Entropy Cost Function

1. Issues with Binary Cross-Entropy (BCE) Loss: In traditional GANs, the loss function typically used is binary cross-entropy (BCE). While it provides a clear metric for distinguishing between real and fake images, it can lead to problems during training.

Specifically, BCE can result in poor gradient flow when the Discriminator becomes too confident and assigns very low probability scores to generated samples. This overconfidence can halt the learning process for the Generator, making it difficult to improve and contributing to other challenges like mode collapse.

2. Mode Collapse: Mode collapse is a phenomenon where the Generator produces a limited variety of outputs, often generating the same or very similar images for different inputs.

This issue can severely restrict the diversity of the generated data, undermining the GAN's ability to learn and replicate the underlying distribution of the training data. Mode collapse is particularly problematic in applications where diversity is essential, such as in image synthesis.

3. Vanishing Gradients: Another issue was vanishing gradients, which can occur when the Discriminator becomes too powerful relative to the Generator. When the Discriminator learns to distinguish real from fake images too effectively, the Generator receives minimal gradient feedback, which is essential for updating its weights.

This situation can lead to stagnation in the Generator's learning, further exacerbating mode collapse and hindering overall model performance.
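A small numeric illustration of why a confident Discriminator starves the Generator. It assumes the original minimax form of the Generator loss, log(1 - D(G(z))), with D = sigmoid(a) for a raw score a, and compares it with the common non-saturating alternative, -log D(G(z)). Which of the two forms was used in these experiments is not stated, so treat this as a sketch of the general effect rather than a diagnosis of this model. The function names are made up for the example.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)): the minimax Generator loss."""
    return -sigmoid(a)


def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)): the non-saturating Generator loss."""
    return -(1.0 - sigmoid(a))


# More negative scores mean the Discriminator is more confident the sample is fake.
for a in (0.0, -2.0, -5.0, -10.0):
    print(f"score={a:6.1f}  D={sigmoid(a):.5f}  "
          f"saturating grad={grad_saturating(a):.5f}  "
          f"non-saturating grad={grad_non_saturating(a):.5f}")
```

As the score becomes more negative, the gradient of the minimax loss shrinks toward zero, while the non-saturating loss keeps a gradient close to -1. That is the vanishing-gradient effect described above in miniature.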

Training Results for DCGAN:

After 20 epochs, the training results for the DCGAN with Binary Cross-Entropy (BCE) loss were:

  • Generator Loss: 0.7231
  • Discriminator Loss: 0.6941

These losses suggest that neither network is overpowering the other, with both actively pushing each other to improve. However, as the samples show, the Generator is not yet producing the digits correctly. While the images look sharper and more detailed than those from the standard GAN, the results are still not convincing.


Generated Images
Real Images

Solution: From BCE to Earth Mover's Distance

One solution to the limitations of Binary Cross-Entropy (BCE) is the Earth Mover's Distance (EMD), also known as Wasserstein distance. EMD provides a better way to compare the distributions of real and generated data.

What is Earth Mover's Distance (EMD)? EMD measures the minimum "cost" to change one distribution into another. Think of it like moving a pile of dirt (representing generated samples) to create a new pile that looks like another distribution (representing real samples). EMD calculates how much effort it takes to make this transformation, considering how far each piece of dirt has to move.
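To make the "moving dirt" picture concrete, here is a tiny pure-Python sketch for the simplest possible case: two one-dimensional sample sets of equal size with equal weights. In that special case the cheapest plan is to match the sorted samples in order, so the EMD is just the average distance between matched pairs. The function name `emd_1d` is made up for this example, and real image distributions are far higher-dimensional than this.

```python
def emd_1d(xs, ys):
    """Earth Mover's Distance between two equal-sized 1D sample sets."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut only works for equal-sized sample sets")
    # Matching sorted samples in order is the cheapest way to move the mass in 1D.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


generated = [0.0, 1.0, 3.0]   # the "pile of dirt" we start with
real = [5.0, 6.0, 8.0]        # the pile we want to end up with
print(emd_1d(generated, real))   # every sample has to move 5 units, so this prints 5.0
```

Notice that the distance keeps shrinking smoothly as the generated samples slide toward the real ones, even when the two sets do not overlap at all.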

The EMD between two probability distributions can be expressed as:


Earth Mover’s distance (Source: Demystified: Wasserstein GANs (WGAN))
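For reference, the standard definition of the Earth Mover's (Wasserstein-1) distance between the real distribution and the generated distribution can be written as below. The symbols P_r, P_g, and γ follow the usual WGAN notation and are an assumption here; they may not match the figure exactly.

```latex
W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \; \mathbb{E}_{(x, y) \sim \gamma} \big[ \lVert x - y \rVert \big]
```

Here Π(P_r, P_g) is the set of all joint distributions whose marginals are P_r and P_g. Each γ is one "transport plan" for moving the dirt, and the infimum picks the cheapest one.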

Using EMD as a loss function in GANs has several benefits over BCE:

  1. Improved Gradient Flow: EMD gives clearer gradients when the Discriminator and Generator are far apart. Unlike BCE, where gradients can disappear if the Discriminator is too confident, EMD helps the Generator receive consistent feedback, even if its outputs aren't great.
  2. Mitigating Mode Collapse: EMD encourages the Generator to explore more of the output space. This helps prevent mode collapse, where the Generator produces only a limited range of outputs.
  3. Meaningful Training Progress: Since EMD focuses on the distance between distributions rather than simple yes-or-no classifications, it better reflects how well the Generator is performing. This leads to a more stable and understandable training process.

Wasserstein Loss (w-loss)

In GANs, EMD is usually applied through the Wasserstein loss (w-loss). This loss function approximates EMD directly and makes training GANs more practical.

Why w-loss Works:

  1. Stronger Convergence: w-loss stabilizes training by providing clear signals for both the Discriminator and Generator. It helps both models learn better, especially when dealing with difficult data.
  2. Avoiding Vanishing Gradients: Even when the Discriminator is very confident, w-loss keeps the loss meaningful. This prevents the Generator from stalling during training by ensuring it gets useful gradients.
  3. Lipschitz Constraint Requirement: w-loss requires the Discriminator to be 1-Lipschitz. This means the Discriminator's output should change smoothly: the norm of its gradient with respect to the input should be at most 1 everywhere. We can enforce this constraint using methods like gradient penalty, which helps create stable training dynamics.

By switching from BCE to EMD and using w-loss, I noticed significant improvements in training. The process became more stable, and the Generator produced more diverse and realistic outputs.
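In code, w-loss is very short. The sketch below assumes PyTorch and a Discriminator (often called a "critic" in this setting) that outputs an unbounded real-valued score instead of a probability, so there is no sigmoid on the output. The function names are illustrative.

```python
import torch


def critic_loss(critic_real, critic_fake):
    """The critic wants high scores on real data and low scores on fakes."""
    # Minimizing this is the same as maximizing mean(real) - mean(fake).
    return critic_fake.mean() - critic_real.mean()


def generator_loss(critic_fake):
    """The Generator wants the critic to score its fakes as high as possible."""
    return -critic_fake.mean()


critic_real = torch.tensor([2.0, 4.0])    # made-up critic scores for real images
critic_fake = torch.tensor([-1.0, 1.0])   # made-up critic scores for generated images
print(critic_loss(critic_real, critic_fake).item())   # 0 - 3 = -3.0
print(generator_loss(critic_fake).item())
```

Because the scores are not squashed into [0, 1], the loss keeps changing as the fakes get better or worse, which is what keeps the gradients useful.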

Lipschitz Constraint and Solutions

To tackle stability and convergence issues, I looked into ways to enforce the Lipschitz constraint. Here are two main methods I found:

  1. Weight Clipping: This simple approach involves limiting the weights of the Discriminator to a specific range (like [-0.01, 0.01]) after each update. While this method does enforce the Lipschitz constraint, it has drawbacks. Clipping tends to push the weights toward the edges of the allowed range, which limits what the Discriminator can learn, and a poorly chosen range can lead to exploding or vanishing gradients.
  2. Gradient Penalty: This method adds a penalty to the Discriminator's loss based on how much the output changes with respect to its input. The gradient penalty can be expressed as:

The gradient penalty encourages smoother transitions in the Discriminator’s decisions. Gradient penalty is generally more effective because it keeps the Discriminator flexible while ensuring stable training.

Gradient penalty has become a popular choice, especially in Wasserstein GANs (WGANs), because it reduces problems like mode collapse and vanishing gradients. It allows the Generator to receive meaningful gradients, which helps it learn better and produce diverse outputs.


Wasserstein Loss with Gradient Penalty (Source: GAN: Wasserstein GAN & WGAN-GP by Jonathan Hui)
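A hedged PyTorch sketch of the gradient penalty term. It follows the usual WGAN-GP recipe: mix each real image with a fake one using a random weight, measure the norm of the critic's gradient at that mixed image, and penalize its squared distance from 1. The function name, the tiny linear critic, and the weight `lam = 10.0` (the default suggested in the WGAN-GP paper) are assumptions for illustration, and the inputs are assumed to be 4D image batches.

```python
import torch
from torch import nn


def gradient_penalty(critic, real, fake):
    """Mean of (||grad of critic at mixed images||_2 - 1)^2 over the batch."""
    # One random mixing weight per image, broadcast over channels and pixels.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)

    scores = critic(mixed)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=mixed,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,   # keep the graph so the penalty itself can be backpropagated
    )[0]

    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)   # one norm per image
    return ((grad_norm - 1) ** 2).mean()


# Usage with a stand-in critic and random tensors in place of real and generated images.
critic = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1))
real = torch.randn(8, 1, 28, 28)
fake = torch.randn(8, 1, 28, 28)
lam = 10.0
loss = critic(fake).mean() - critic(real).mean() + lam * gradient_penalty(critic, real, fake)
loss.backward()
print(loss.item())
```

The penalty is computed at points between real and fake images because that is where the Generator's samples have to travel during training.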

Exploring Conditional GAN with DCGAN Architecture

After experimenting with DCGAN, I decided to take on Conditional GANs (CGANs), which offer an exciting twist on the traditional GAN framework.

What is a CGAN?

Conditional GANs extend the GAN concept by allowing control over the generated output through conditioning. This means that both the Generator and Discriminator receive additional information (in the form of labels or other data) during training. This conditioning mechanism enables the model to produce outputs that are more aligned with specific criteria. For instance, when working with the MNIST dataset, I could specify which digit I wanted to generate (like “3” or “7”), and the model would respond by producing a corresponding image of that digit. This added layer of control makes CGANs incredibly powerful for tasks where output variety and specificity are required.

What Makes a CGAN Special?

The Conditional GAN introduces an additional input: conditioning information. This information can be anything that adds context to what the image should look like, such as:

  1. A label (like a number indicating which digit to generate),
  2. A class (like "cat" or "dog"),
  3. Even a text description (like "sunset over a mountain").

The core idea is to give both the Generator and the Discriminator some context, enabling the system to generate more targeted and relevant images.

Architecture Changes in CGAN

Generator Changes

In a standard GAN, the Generator takes a random noise vector z and outputs an image.

In a CGAN, the Generator takes two inputs:

1. Random noise z.

2. Conditioning information y (like a label, e.g., "2" for generating a handwritten digit "2").

These inputs are concatenated together into a single input vector, which then gets processed through the Generator network to produce an image that should match the given condition.
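A minimal sketch of that concatenation, assuming PyTorch, a noise size of 64, and one-hot labels for the 10 MNIST digits. The function name `generator_input` and the sizes are assumptions for the example.

```python
import torch
import torch.nn.functional as F


def generator_input(noise, labels, n_classes=10):
    """Concatenate noise z with a one-hot version of the label y."""
    one_hot = F.one_hot(labels, num_classes=n_classes).float()   # (batch, n_classes)
    return torch.cat([noise, one_hot], dim=1)                    # (batch, z_dim + n_classes)


noise = torch.randn(4, 64)
labels = torch.tensor([2, 2, 7, 3])     # ask for the digits 2, 2, 7, and 3
combined = generator_input(noise, labels)
print(combined.shape)                   # 64 noise values followed by 10 label values per sample
```

The first layer of the Generator then has to accept `z_dim + n_classes` inputs instead of `z_dim`.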

Discriminator Changes

In a CGAN, the Discriminator also receives the conditioning information y along with the image. The image and the label are combined, often by concatenating the label as an extra channel in the image. This setup forces the Discriminator to not only determine if an image is real or fake but also whether it matches the condition.
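One common way to combine the image and the label, sketched here under the assumption of PyTorch and one-hot labels: each class gets its own extra channel, filled with ones for the matching class and zeros otherwise. With 10 digit classes this adds 10 channels rather than a single one. The function name `discriminator_input` is hypothetical.

```python
import torch
import torch.nn.functional as F


def discriminator_input(images, labels, n_classes=10):
    """Stack one constant-valued channel per class on top of the image channels."""
    one_hot = F.one_hot(labels, num_classes=n_classes).float()       # (batch, n_classes)
    # Stretch each label value across the full height and width of the image.
    label_maps = one_hot[:, :, None, None].expand(
        -1, -1, images.size(2), images.size(3))                      # (batch, n_classes, H, W)
    return torch.cat([images, label_maps], dim=1)                    # (batch, C + n_classes, H, W)


images = torch.randn(4, 1, 28, 28)
labels = torch.tensor([2, 2, 7, 3])
combined = discriminator_input(images, labels)
print(combined.shape)   # 1 image channel plus 10 label channels
```

The first convolution of the Discriminator then takes `1 + n_classes` input channels, so every filter sees both the pixels and the claimed label.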

By adding conditioning, we can:

  • Directly Control the Output: You decide what kind of image you want by providing a condition, making the process predictable.
  • Improve Training Stability: Conditioning can help prevent mode collapse (when the Generator produces the same type of output repeatedly) by providing diverse training data through different labels.
  • Generate Diverse Outputs: A CGAN can create varied images for the same condition since the noise input allows for creativity, while the condition keeps the result relevant.

Implementation:

To implement the CGAN architecture, I started by utilizing one-hot encoded labels for conditioning. This approach allows the model to interpret the label data effectively:

  • Generator Modifications: In the Generator, I concatenated the noise vector with the one-hot encoded label. This means that when the model generates an image, it doesn’t just rely on random noise but also considers the label, guiding the generation process. This concatenation effectively integrates the information about which digit to produce, making the Generator's task more straightforward.
  • Discriminator Modifications: Similarly, in the Discriminator, the label was treated as an additional input channel. By including the label in the Discriminator’s input, I ensured that it could evaluate whether the generated image correctly corresponds to the specified digit. This adjustment helps the Discriminator learn the association between the label and the image, making it more discerning during the training process.

Training:

At epoch 20, the training showed encouraging progress, with the following results:

  • Generator Loss: 0.9469
  • Discriminator Loss: 0.6120

These loss values provide valuable insight into the model's performance:

Interpreting the Loss Values

  • Generator Loss (0.9469): This loss value indicates how well the Generator is doing in creating images that can deceive the Discriminator; the lower it is, the more often the Discriminator is fooled. A loss of 0.9469 suggests the Generator is fooling the Discriminator some of the time but not consistently. There is still room for improvement to ensure the generated images are more convincing and accurately align with the conditioning information (like labels).
  • Discriminator Loss (0.6120): With a loss value of 0.6120, the Discriminator is still distinguishing real from generated images somewhat better than chance. If the loss is BCE averaged over real and fake batches, a Discriminator that is fooled about half the time outputs roughly 0.5 for everything, which gives a loss of ln 2 ≈ 0.693 rather than 0.5. A value slightly below that suggests the Discriminator is doing a solid job but still finds some generated images challenging to classify.
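A quick pure-Python check of what a "balanced" Discriminator loss looks like. It assumes the loss is binary cross-entropy averaged over one real and one fake prediction, which is an assumption about how the numbers above were computed. The helper `bce` is written out by hand for the example.

```python
import math


def bce(p, target):
    """Binary cross-entropy for one predicted probability p and a 0/1 target."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


# A Discriminator that cannot tell real from fake outputs 0.5 for everything.
balanced = (bce(0.5, 1) + bce(0.5, 0)) / 2
print(f"loss when D outputs 0.5 everywhere: {balanced:.4f}")   # equals ln 2

# A Discriminator that is right with 70% confidence on both real and fake images.
confident = (bce(0.7, 1) + bce(0.3, 0)) / 2
print(f"loss when D is 70% confident and correct: {confident:.4f}")
```

Under that assumption, the reference point for a Discriminator that is fooled half the time is ln 2 ≈ 0.693, and values below it mean the Discriminator is doing better than chance.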


Generated Images
Real Images
Loss Curve

This stage of training highlights that the CGAN is on the right track, with both networks improving and pushing each other toward better results.

Conclusion:

In conclusion, diving into Generative Adversarial Networks has been a fun experiment that improved my understanding of deep learning. I ran into some challenges while training these models and picking the right loss functions. I learned about different types of GANs, like Deep Convolutional GANs and Conditional GANs, and how they create realistic data. This experience taught me how important it is to experiment, as even small tweaks can make a big difference in how well the models work.


Erica Firouzbehi

Student Regulatory Affairs Officer at The Centre for Biosecurity ~ Molecular Biology and Genetics CO-OP Student at University of Guelph

3 weeks ago

I love how accessible you made this guide! It was very easy to follow along despite my sparse background knowledge. Happy learning!

Noureldeen Ahmed

Seeking 2025 New Grad Positions | Software Engineering @ University of Guelph

1 month ago

The generated results are really convincing! I'm guessing the primary purpose of this technique is to train an effective generator, and the discriminator is just an artifact/bonus of the training process?
