BxD Primer Series: Variational Autoencoder (VAE) Neural Networks

Hey there!

Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a ‘one-post-one-topic’ format. Today’s post is on Variational Autoencoder Neural Networks. Let’s get started:

The What:

Variational Autoencoders (VAEs) are a class of generative models that learn to generate new data samples similar to the training data. They are unsupervised learning models that can be used for tasks such as image and speech synthesis, data compression, and anomaly detection.

VAEs are composed of two parts: an encoder and a decoder. The encoder takes in input data and transforms it into a latent representation, a compressed and abstract representation of the input. The decoder then takes the latent representation as input and reconstructs the original data.

The key innovation of VAEs is the use of a probabilistic framework for encoding and decoding the data. They use a probabilistic latent variable model that assumes the latent representation follows a certain probability distribution, typically a Gaussian. This probabilistic model allows VAEs to sample from the latent space and generate new data samples.

Training a VAE involves optimizing a loss function that consists of two parts: a reconstruction loss that measures how well the decoder can reconstruct the input data from the latent representation, and a regularization loss that encourages the latent representation to follow the assumed probability distribution.

During training, the VAE learns to encode input data into a compressed and abstract representation, while also learning to decode the latent representation back into the original data.

Once the VAE has been trained, it can generate new data by sampling from the learned distribution in the latent space and passing the samples through the decoder to produce new data in the original data space. Because the VAE has learned a probability distribution over the latent space, it can generate diverse and realistic samples of data.

Note: Concepts common to both autoencoders and VAEs have already been covered in a previous edition; check here.

Applications of VAEs:

Variational Autoencoder (VAE) Neural Networks have a wide range of applications across domains:

  1. Image and Video Generation: By sampling from the learned probability distribution in latent space, VAEs can generate realistic and diverse samples of images or videos.
  2. Anomaly Detection: Inputs that result in a high reconstruction error are treated as anomalies, since they indicate unusual or unexpected patterns in the input data (a minimal scoring sketch follows this list).
  3. Data Compression: VAEs encode input data into a lower-dimensional latent representation. This compression is useful for reducing storage requirements and the computational cost of downstream machine learning tasks.
  4. Denoising: VAEs can be trained to learn a noise model and then use it to remove noise from new images. This is useful for improving the quality of noisy images, such as those taken in low-light conditions.
  5. Representation Learning: VAEs learn a compressed and disentangled representation of input data. This representation can be useful for downstream tasks such as clustering, classification, and regression.
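
As a rough sketch of the anomaly-detection idea in point 2 above, a trained VAE can score each input by its reconstruction error and flag the highest-scoring ones. This assumes a model whose forward pass returns the reconstruction along with the latent parameters (as in the PyTorch sketch later in this post); the names and the thresholding step are illustrative, not a fixed recipe.

import torch

def anomaly_scores(model, x):
    # Score each input by reconstruction error; higher error = more anomalous
    model.eval()
    with torch.no_grad():
        x_recon, _, _ = model(x)                 # assumes forward() returns (reconstruction, mu, logvar)
        return ((x - x_recon) ** 2).mean(dim=1)  # per-sample mean squared error

# Illustrative usage: flag samples whose score exceeds a threshold tuned on validation data
# scores = anomaly_scores(vae, batch)
# anomalies = scores > threshold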

Basic Architecture:

Basic components of Variational Autoencoder (VAE) neural networks:

  1. Encoder Network takes in the input data and maps it to the latent space. It typically consists of several convolutional or fully connected layers that progressively reduce the input dimensionality. The final layers of the encoder network produce the mean (μ) and standard deviation (σ) vectors, which define the parameters of the latent space distribution.
  2. Latent Space is a lower-dimensional representation of the input data. It is characterized by a probability distribution, typically assumed to be Gaussian.
  3. Sampling: From the mean (μ) and standard deviation (σ) vectors produced by the encoder, a random sample is drawn using the reparameterization trick. This sampling step introduces stochasticity into the model and allows diverse data points to be generated during training.
  4. Decoder Network takes the sampled latent vector as input and aims to reconstruct the original data. Similar to the encoder, it consists of several layers that gradually upsample and increase the dimensionality of the latent vector. The final layer of the decoder produces the reconstructed output, which should ideally be as close as possible to the original input. (A minimal code sketch of these components follows this list.)

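To make these components concrete, here is a minimal PyTorch sketch of a VAE for flattened inputs (e.g. 28×28 images). The fully connected layers, layer sizes, and latent dimensionality are illustrative assumptions, not a prescribed architecture.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder: input -> hidden -> (mean, log-variance)
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mean vector μ
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)   # log(σ²) vector
        # Decoder: latent vector -> hidden -> reconstructed input
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = μ + σ ⊙ ε with ε ~ N(0, I): the reparameterization trick
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.dec(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar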

Difference from a Traditional Autoencoder:

The main difference between a VAE and a traditional autoencoder lies in how the latent space is learned. In a traditional autoencoder, the encoder maps the input data to a single point in a fixed-dimensional latent space, and the decoder maps that point back to the input space. The goal of the autoencoder is to learn a compressed representation of the input data, which can be used for tasks such as data compression, denoising, or dimensionality reduction.

In contrast, a VAE models a probability distribution over the latent space. The encoder maps the input data to the mean and standard deviation of a normal distribution in the latent space, rather than to a single deterministic latent vector. The decoder then maps samples drawn from this distribution back to the input space. This probabilistic formulation allows the VAE to generate new data samples by sampling from the learned distribution in the latent space.

In summary, the main differences between a VAE and a traditional autoencoder are:

  1. VAEs learn a probability distribution over the latent space, while traditional autoencoders learn a deterministic latent representation.
  2. VAEs use a regularization term to encourage a smooth and continuous latent space, while traditional autoencoders do not.
  3. VAEs can generate new data samples by sampling from the learned distribution in latent space, while traditional autoencoders cannot.

Latent space in a VAE:

The latent space is a lower-dimensional representation of the input data that is learned during training. Its dimensionality is typically much smaller than that of the input data, making it a compressed representation of the input.

The goal of learning a latent space is to capture the important features of the input data in a way that makes it easier to model and manipulate. For example, if the input data consists of images of faces, the learned latent space might encode features such as the angle of the face, the presence of glasses, or the hair color.

The latent space is represented by a set of continuous variables, which can be thought of as the mean and variance of a probability distribution. During training, the VAE learns to map each input data point to a distribution in the latent space using the encoder network. This distribution is then sampled to generate a latent code, which is fed into the decoder network to produce a reconstructed version of the original input.

The continuous nature of the latent space allows interpolation between different samples in data space, enabling the generation of new and diverse samples. This is a powerful feature of VAEs: it allows them to generate new data samples that are similar to the input data, but not identical. The compression of the input data also enables data denoising, data compression, and anomaly detection.

The latent space is often visualized as a scatter plot, where each point represents an input data point projected into the latent space. Points that are close together in the latent space represent inputs that are similar to each other in some way.
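
As a small illustration, the latent means of a batch of inputs can be plotted directly when the first two latent dimensions are used as coordinates; this sketch assumes the VAE module defined earlier, a batch x, and labels used only to color the points. For larger latent spaces, one would typically apply PCA or t-SNE first.

import torch
import matplotlib.pyplot as plt

def plot_latent_space(model, x, labels):
    # Encode inputs to their latent means and plot the first two latent dimensions
    model.eval()
    with torch.no_grad():
        mu, _ = model.encode(x)
    mu = mu.cpu().numpy()
    plt.scatter(mu[:, 0], mu[:, 1], c=labels, s=5, cmap='tab10')
    plt.xlabel('latent dimension 1')
    plt.ylabel('latent dimension 2')
    plt.show()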

Objective function used to train a VAE:

The objective function of a VAE can be expressed as the sum of two terms:

  1. The reconstruction loss term, which measures the difference between the original input data and the reconstructed output of the decoder. This term is typically a mean squared error (MSE) loss, a cross-entropy loss, or some other distance metric.
  2. The regularization term, which measures the divergence between the learned distribution in the latent space and the prior distribution. This term is typically the Kullback-Leibler (KL) divergence.

The KL divergence measures the difference between two probability distributions and is used to ensure that the learned distribution in the latent space stays close to a simple prior, such as a Gaussian. The regularization term prevents overfitting and keeps the learned latent space smooth and continuous, which is essential for generating realistic and coherent samples of data.

The KL divergence encourages the learned distribution to match the target distribution, which is typically a standard Gaussian. Since the prior distribution p(z) is assumed to be a standard Gaussian with zero mean and identity covariance, the KL divergence loss is defined as:

L_KL = D_KL( q(z|x) || p(z) ) = 0.5 * ( tr(Σ) + μᵀμ − k − log det(Σ) )

Where,

  • q(z|x) is the learned distribution of latent variable z given the input x
  • p(z) is the target distribution (a standard Gaussian in most cases)
  • μ and Σ are the mean and covariance estimated by the encoder network q(z|x)
  • k is the dimensionality of the latent space

The overall objective of training a VAE is to maximize the evidence lower bound (ELBO), which is a lower bound on the log-likelihood of the data under the model. Maximizing the ELBO is equivalent to simultaneously minimizing the reconstruction loss and the regularization term.
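
As a rough illustration of this objective, here is a minimal PyTorch version of the loss, assuming the VAE module sketched in the architecture section, inputs scaled to [0, 1], and binary cross-entropy as the reconstruction term (the β weight is introduced formally in Step 7 below):

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input
    recon_loss = F.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL term in closed form for q(z|x) = N(mu, diag(sigma^2)) vs. p(z) = N(0, I)
    kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Minimizing this sum corresponds to maximizing the ELBO (for beta = 1)
    return recon_loss + beta * kl_loss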

The How:

Below is a step-by-step explanation of how VAEs work:

Step 1: Encoder Network maps the input data, denoted as x, to the parameters of the latent space distribution. The output of the encoder network, denoted as h_enc, is computed as:

h_enc = f_enc(x)

where f_enc represents the encoder network's transformation.

Step 2: Latent Space Distribution: We assume that the latent space follows a multivariate Gaussian distribution with a mean vector μ and a diagonal covariance matrix Σ (a diagonal covariance simplifies the computation). These parameters are estimated by the encoder network. The mean vector μ and the logarithm of the diagonal elements of the covariance matrix, denoted as log(Σ), are computed as:

μ = f_μ(h_enc),   log(Σ) = f_Σ(h_enc)

where f_μ and f_Σ are the final (typically linear) layers of the encoder.

Step 3: Sampling: To generate a sample from the latent space distribution, we introduce a random noise vector ε ~ N(0, I), where I is the identity matrix. The latent vector z is computed as:

z = μ + ε ⊙ exp(0.5 · log(Σ)),   where ⊙ denotes element-wise multiplication

The term ε ⊙ exp(0.5 · log(Σ)) applies the reparameterization trick, which allows the model to back-propagate through the sampling process.

Step 4: Decoder Network takes the sampled latent vector z and maps it back to the data space to reconstruct the original input. The output of the decoder network, denoted as h_dec, is computed as:

h_dec = f_dec(z)

where f_dec represents the decoder network's transformation.

Step 5: The reconstructed output, denoted as x′, is obtained by applying a suitable activation function g_dec to the decoder output h_dec:

x′ = g_dec(h_dec)

Step 6: Loss Function: As explained in the previous section, the loss function is a combination of the reconstruction loss L_rec and the Kullback-Leibler (KL) divergence loss L_KL.

The reconstruction loss measures the discrepancy between the reconstructed output x′ and the original input x. It can be defined using a suitable distance metric such as mean squared error (MSE) or binary cross-entropy (BCE):

L_rec = ||x − x′||²   (MSE)
L_rec = −Σ_i [ x_i · log(x′_i) + (1 − x_i) · log(1 − x′_i) ]   (BCE)

The KL divergence loss quantifies the difference between the learned latent distribution and the assumed prior distribution (usually a standard Gaussian):

L_KL = D_KL( q(z|x) || p(z) ) = −0.5 · Σ_i ( 1 + log(σ_i²) − μ_i² − σ_i² )

Where,

  • q(z|x) represents the encoder's approximation of the true posterior distribution
  • q(z|x) = N(z | μ, Σ)
  • p(z) represents the prior distribution, which is assumed to be N(0, I)

Step 7: Overall Loss, denoted as L, is the sum of the reconstruction loss and the KL divergence loss, with the latter weighted by a hyper-parameter β:

L = L_rec + β · L_KL

The hyper-parameter β controls the trade-off between reconstruction accuracy and latent space regularization.

Step 8: Training: During training, the VAE aims to minimize the overall loss L by adjusting the weights and biases of the encoder and decoder networks. This is typically done using back-propagation with a gradient descent optimizer.
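
A minimal training loop sketch, assuming the VAE module and vae_loss function from the earlier snippets, plus a data_loader yielding flattened input batches in [0, 1] and a num_epochs value (both assumed here purely for illustration):

import torch

model = VAE()                                    # module sketched in the architecture section
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epochs):                  # num_epochs: assumed to be defined elsewhere
    for x in data_loader:                        # data_loader: assumed to yield flattened inputs
        optimizer.zero_grad()
        x_recon, mu, logvar = model(x)           # Steps 1-5: encode, sample, decode
        loss = vae_loss(x, x_recon, mu, logvar)  # Steps 6-7: reconstruction + KL
        loss.backward()                          # back-propagate through the reparameterized sample
        optimizer.step()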

Step 9: Generation: Once the VAE is trained, it can generate new samples by sampling from the prior distribution and passing the samples through the decoder network. Given a random vector z ~ N(0, I), the generated output is obtained as:

x′ = g_dec( f_dec(z) )

This allows the generation of new data samples that resemble the characteristics of the training data.
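
A sketch of this generation step, assuming the trained model from the snippets above and a latent dimensionality of 20:

import torch

model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)        # 16 random latent vectors drawn from N(0, I)
    x_generated = model.decode(z)  # map the prior samples back to data space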

Step 10: Inference and Latent Space Interpolation: Given an input x, the encoder network produces the mean and covariance parameters of the approximate posterior distribution q(z|x). The latent vector z can be sampled from this approximate posterior to obtain a representation in the latent space.


Additionally, the VAE allows for smooth interpolation in the latent space. By interpolating between two latent vectors z1 and z2, new latent vectors z_interpolate are generated. These interpolated latent vectors are then passed through the decoder network to produce data samples with gradual transitions between the original samples.
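
A sketch of latent-space interpolation, assuming the same model and two inputs x1 and x2 (illustrative names):

import torch

model.eval()
with torch.no_grad():
    mu1, _ = model.encode(x1)                        # latent mean of the first input
    mu2, _ = model.encode(x2)                        # latent mean of the second input
    for alpha in torch.linspace(0, 1, steps=8):
        z_interp = (1 - alpha) * mu1 + alpha * mu2   # convex combination in latent space
        x_interp = model.decode(z_interp)            # gradual transition between the two inputs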

Note 1: If the latent space is too small or too restrictive, the generated samples may be limited and uninteresting. On the other hand, if the latent space is too large or too unconstrained, the generated samples may be unrealistic or uninterpretable.

Note 2: Choosing the appropriate size of the latent space is crucial for the performance of a VAE.

  • One common approach is to run a grid search or random search over a range of latent space sizes and evaluate the performance of the model on a validation set.
  • Another approach is to use annealed importance sampling (AIS), a method for estimating the partition function of a VAE. AIS allows the performance of a VAE to be evaluated for different latent space sizes without retraining the model multiple times.

Note 3: Interpreting the learned latent space is important for understanding what information is encoded in it and how it can be used for downstream tasks.

  • One way is to visualize it using techniques such as t-SNE or PCA. This helps to identify clusters of data points in the latent space that correspond to specific features of the input data. For example, in an image dataset, clusters in the latent space could correspond to different object categories, colors, or textures.
  • Another way is to perform arithmetic operations in the latent space and observe the corresponding changes in the generated data. For example, with a VAE trained on a dataset of faces, we can manipulate the latent space to generate new faces with specific attributes by adding or subtracting latent vectors that correspond to features such as a smile, glasses, or a hairstyle (a sketch of this idea follows the list).
  • Interpreting the learned latent space can also help to identify the limitations and biases of the model. For example, if certain features of the input data are not captured well in the latent space, it could indicate that the model is unable to learn those features effectively. Similarly, if certain regions of the latent space are over- or under-represented, it could indicate that the model is biased towards certain features or attributes.
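
A sketch of the latent-arithmetic idea from the second bullet above: an attribute direction can be estimated as the difference between the mean latent codes of examples with and without the attribute. The batches x_smiling_faces and x_neutral_faces, the single input x_face, and the scaling factor 1.5 are all hypothetical placeholders.

import torch

model.eval()
with torch.no_grad():
    mu_smiling, _ = model.encode(x_smiling_faces)             # faces with the attribute (hypothetical batch)
    mu_neutral, _ = model.encode(x_neutral_faces)             # faces without it (hypothetical batch)
    smile_direction = mu_smiling.mean(dim=0) - mu_neutral.mean(dim=0)
    mu_face, _ = model.encode(x_face)                         # a single face to edit (hypothetical)
    x_edited = model.decode(mu_face + 1.5 * smile_direction)  # add the attribute and decode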

Re-parameterization Trick:

The reparameterization trick separates the latent sample into two parts: a deterministic part and a stochastic part. The stochastic part is sampled from a standard Gaussian distribution and is then scaled and shifted by the deterministic part, which is computed by the encoder network. By doing so, gradients can be computed with respect to the parameters of the encoder and decoder networks, allowing end-to-end training using SGD.

Mathematically,

  • Let z be a random variable distributed according to some distribution q(z|x)
  • Let ε be a random variable distributed according to a standard Gaussian distribution N(0, I).

We can express z as z = μ(x) + σ(x) * ε.

Where,

  • μ(x) and σ(x) are the outputs of the encoder network for input x
  • ε is a random variable sampled from the standard Gaussian distribution

By using this formulation, we can back-propagate gradients through the encoder and decoder networks to optimize the VAE's objective function.

Without the trick, the presence of stochastic nodes in the computation graph would prevent gradient-based optimization. The trick is what allows the VAE to learn a smooth, continuous latent space.
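
A tiny sketch of why this matters: writing z as a deterministic function of (μ, σ) and an externally sampled ε lets gradients flow back to μ and σ. The tensors below are illustrative placeholders rather than outputs of a real encoder.

import torch

mu = torch.tensor([0.0, 0.0], requires_grad=True)         # deterministic part (encoder output in a real VAE)
log_sigma = torch.tensor([0.0, 0.0], requires_grad=True)
eps = torch.randn(2)                                       # stochastic part, sampled outside the graph
z = mu + torch.exp(log_sigma) * eps                        # reparameterized sample
loss = (z ** 2).sum()                                      # any downstream loss
loss.backward()
print(mu.grad, log_sigma.grad)                             # gradients reach mu and log_sigma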

Under-complete and Over-complete VAE:

The term "undercomplete" or "overcomplete" refers to the size of latent space compared to size of input data.

An undercomplete VAE has a latent space smaller than the input data dimensionality.

  • This means that the latent space cannot fully capture all the information present in the input data and retains only the most important information.
  • Undercomplete VAEs are commonly used for data compression or feature extraction tasks.

An overcomplete VAE has a latent space larger than the input data dimensionality.

  • This means that the latent space has more dimensions than necessary to represent the input data, which can lead to overfitting and poor generalization.
  • Overcomplete VAEs are used for tasks where a richer, more complex representation of data is needed, such as in image or speech synthesis.

Deterministic vs. Probabilistic Encoder:

The encoder network maps input data to a probability distribution in latent space. This probability distribution is represented by two vectors: a mean vector and a standard deviation vector.

  • The mean vector is a point in the latent space that represents the most likely encoding of the input data
  • The standard deviation vector represents the uncertainty or variance in that encoding

In a deterministic encoder, the mean vector is the only output of the encoder network, and it is used directly as the encoding of the input data.

In a probabilistic encoder, both the mean vector and the standard deviation vector are used to sample from the probability distribution in the latent space. This sampling introduces stochasticity into the encoding process and allows the model to capture uncertainty or variability in the input data.

The use of a probabilistic encoder in a VAE has several advantages:

  1. It allows the model to capture the inherent uncertainty in the input data, which improves the robustness and generalization of the model.
  2. It enables the generation of new data by sampling from the probability distribution in the latent space. This is particularly useful for generative modeling tasks, where the goal is to generate new data similar to the training data.

The Why:

Reasons for using VAEs:

  1. VAEs are generative models that can produce new samples of data similar to the training data, which is useful for applications such as image and text generation.
  2. VAEs learn a latent space representation of the input data, which can be useful for downstream tasks such as clustering, classification, and visualization.
  3. VAEs use variational inference to learn the parameters of the model, which makes them more computationally efficient than other generative models such as GANs.
  4. VAEs can be used with a wide range of data types, including continuous and discrete data, and can be adapted to different settings such as semi-supervised learning, disentanglement, and flow-based modeling.
  5. VAEs are inherently probabilistic, which makes them suitable for tasks where uncertainty in the data is important, such as medical diagnosis or anomaly detection.
  6. VAEs regularize the latent space by penalizing deviations from a prior distribution. This helps to prevent overfitting and encourages the model to learn a more robust representation of the data.

The Why Not:

Reasons for not using VAEs:

  1. Limited expressive power compared to other generative models such as GANs, which can result in lower-quality generated samples.
  2. Limited diversity in generated samples, which makes them less suitable for applications such as art and design where diversity is desirable.
  3. Prone to mode collapse, where the model generates samples that are all similar to a few modes of data distribution, ignoring other modes.
  4. Limited resolution when generating images compared to other generative models such as GANs, often resulting in blurry or pixelated images.
  5. Difficulty in handling discrete data such as text or categorical data, as they rely on continuous latent variables.

Time for you to support:

  1. Reply to this email with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In next edition, we will cover Markov Chain Neural Networks.

Let us know your feedback!

Until then,

Have a great time!

#businessxdata #bxd #Variational #Autoencoder #neuralnetworks #primer

