BxD Primer Series: Variational Autoencoder (VAE) Neural Networks
Hey there!
Welcome to the BxD Primer Series, where we cover topics such as machine learning models, neural nets, GPT, ensemble models, and hyper-automation in a 'one-post-one-topic' format. Today's post is on Variational Autoencoder Neural Networks. Let's get started:
The What:
Variational Autoencoders (VAEs) are a class of generative models that learn to generate new data samples similar to the training data. They are unsupervised models that can be used for tasks such as image and speech synthesis, data compression, and anomaly detection.
VAEs are composed of two parts: an encoder and a decoder. The encoder takes in input data and transforms it into a latent representation, which is a compressed and abstract representation of the input. The decoder then takes the latent representation as input and reconstructs the original data.
The key innovation of VAEs is the use of a probabilistic framework for encoding and decoding the data. They use a probabilistic latent variable model that assumes the latent representation follows a certain probability distribution, typically a Gaussian. This probabilistic model allows VAEs to sample from the latent space and generate new data samples.
Training a VAE involves optimizing a loss function with two parts: a reconstruction loss that measures how well the decoder can reconstruct the input data from the latent representation, and a regularization loss that encourages the latent representation to follow the assumed probability distribution.
During training, the VAE learns to encode input data into a compressed and abstract representation, while also learning to decode the latent representation back into the original data.
Once the VAE has been trained, it can generate new data by sampling from the learned distribution in latent space and passing the samples through the decoder to produce new data in the original data space. Because the VAE has learned a probability distribution over the latent space, it can generate diverse and realistic samples.
Note: Concepts common to autoencoders and VAEs have already been covered in a previous edition; check here.
Applications of VAEs:
Variational Autoencoder (VAE) neural networks have a wide range of applications across domains, including image and speech synthesis, data compression, denoising, and anomaly detection.
Basic Architecture:
The basic components of a Variational Autoencoder (VAE) are an encoder network, a latent space distribution with a sampling step, and a decoder network; a minimal sketch of these pieces follows below.
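Below is a minimal sketch of these components in PyTorch. The layer sizes, the 784-dimensional input (e.g. flattened 28x28 images), and the 20-dimensional latent space are illustrative assumptions, not prescriptions.

```python
# Minimal VAE architecture sketch in PyTorch (illustrative layer sizes).
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder: maps x to the parameters (mean, log-variance) of q(z|x)
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: maps a latent vector z back to data space
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + eps * sigma, with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        return mu + eps * torch.exp(0.5 * logvar)

    def decode(self, z):
        return self.dec(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```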
Difference from a Traditional Autoencoder:
The main difference between a VAE and a traditional autoencoder lies in how the latent space is learned. In a traditional autoencoder, the encoder maps the input data to a fixed-dimensional latent vector, and the decoder maps that vector back to the input space. The goal of the autoencoder is to learn a compressed representation of the input data, which can be used for tasks such as data compression, denoising, or dimensionality reduction.
In contrast, a VAE models the probability distribution of the input data in latent space. The encoder maps the input data to the mean and standard deviation of a normal distribution in latent space, rather than to a single fixed latent vector. The decoder then maps samples drawn from this latent distribution back to the input space. This probabilistic formulation allows the VAE to generate new data samples by sampling from the learned distribution in latent space.
In summary, a traditional autoencoder learns a deterministic mapping from each input to a single latent vector and is mainly useful for compression and dimensionality reduction, whereas a VAE learns the parameters of a probability distribution over the latent space, which makes it a generative model capable of producing new samples.
Latent space in a VAE:
The latent space is a lower-dimensional representation of the input data that is learned during the training process. Its dimensionality is typically much smaller than that of the input data, making it a compressed representation of the input.
The goal of learning a latent space is to capture the important features of the input data in a way that makes them easier to model and manipulate. For example, if the input data consists of images of faces, the learned latent space might encode features such as the angle of the face, the presence of glasses, or the color of the hair.
The latent space is represented by a set of continuous variables, which can be thought of as the mean and variance of a probability distribution. During training, the VAE learns to map each input data point to a distribution in latent space using the encoder network. That distribution is then sampled to produce a latent code, which is fed into the decoder network to generate a reconstructed version of the original input.
The continuous nature of the latent space allows for interpolation between different samples in data space, enabling the generation of new and diverse samples. This is a powerful feature of VAEs, as it lets them generate new data samples that are similar to the input data but not identical. The compression of the input data also enables data denoising, data compression, and anomaly detection.
The latent space is often visualized as a scatter plot, where each point represents a different input data point projected into the latent space. Points that are close together in latent space represent input data points that are similar to each other in some way.
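For illustration, the snippet below renders such a scatter plot for a 2-D latent space with matplotlib. The random codes and labels are stand-ins; in practice the points would be the encoder's mean vectors for a batch of real inputs, colored by their class labels.

```python
# Sketch: visualizing a 2-D latent space as a scatter plot.
# Random codes stand in for encoded means; in practice use the encoder output.
import numpy as np
import matplotlib.pyplot as plt

latent_codes = np.random.randn(500, 2)       # stand-in for encoded mean vectors
labels = np.random.randint(0, 10, size=500)  # stand-in for class labels

plt.scatter(latent_codes[:, 0], latent_codes[:, 1], c=labels, cmap="tab10", s=8)
plt.xlabel("z[0]")
plt.ylabel("z[1]")
plt.title("Latent space projection (nearby points encode similar inputs)")
plt.colorbar(label="class")
plt.show()
```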
Objective function used to train a VAE:
The objective function of a VAE is the sum of two terms: a reconstruction loss and a regularization (KL divergence) loss.
The KL divergence is a measure of the difference between two probability distributions; here it ensures that the learned distribution in latent space stays close to a simple prior, such as a Gaussian. This regularization term is important for preventing overfitting and for keeping the learned latent space smooth and continuous, which matters for generating realistic and coherent samples.
The KL divergence encourages the learned distribution to be similar to the target distribution, typically a standard Gaussian. Since the prior distribution p(z) is assumed to be a standard Gaussian with zero mean and unit covariance, the KL divergence loss takes the closed form:
L_KL = -0.5 * Σ_j ( 1 + log(σ_j²) − μ_j² − σ_j² )
Where,
μ_j and σ_j are the mean and standard deviation of the j-th latent dimension produced by the encoder, and the sum runs over all latent dimensions.
The overall objective of training a VAE is to maximize the evidence lower bound (ELBO), which is a lower bound on the log-likelihood of the data under the model. The ELBO combines a reconstruction term and a regularization term, so maximizing it amounts to minimizing the reconstruction loss and the KL divergence simultaneously. A sketch of this loss in code follows below.
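Here is a sketch of this objective in PyTorch, assuming the encoder outputs the mean mu and the log-variance logvar as in the architecture sketch above. The beta argument is an optional weighting (discussed in Step 7 below); beta = 1 corresponds to the standard ELBO.

```python
# Sketch of the VAE training objective: reconstruction loss plus KL regularizer.
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input
    # (binary cross-entropy suits inputs scaled to [0, 1]; MSE also works).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and N(0, I):
    # -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Minimizing this total is equivalent to maximizing the ELBO.
    return recon + beta * kl
```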
The How:
Below is a step-by-step explanation of how VAEs work:
Step 1: The Encoder Network maps the input data, denoted as x, to the parameters of the latent space distribution. The output of the encoder network, denoted as h_enc, is computed as:
h_enc = f_enc(x)
where f_enc represents the encoder network's transformation.
Step 2: Latent Space Distribution: We assume the latent space follows a multivariate Gaussian distribution with mean vector μ and a diagonal covariance matrix Σ (a diagonal covariance simplifies the computation). These parameters are estimated by the encoder network, typically with two output layers applied to h_enc. The mean vector μ and the logarithm of the diagonal elements of the covariance matrix, denoted log(Σ), are computed as:
μ = W_μ h_enc + b_μ
log(Σ) = W_Σ h_enc + b_Σ
where W_μ, W_Σ, b_μ, b_Σ are the weights and biases of the corresponding output layers.
Step 3: Sampling: To generate a sample from the latent space distribution, we introduce a random noise vector ε ~ N(0, I), where I is the identity matrix. The latent vector z is computed as:
z = μ + ε * exp(0.5 * log(Σ))
The term ε * exp(0.5 * log(Σ)) applies the reparameterization trick, which allows the model to back-propagate through the sampling process.
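In code, this sampling step is a single line. In the sketch below, the zero-valued mu and logvar are placeholders for the encoder's actual outputs.

```python
# The sampling step of Step 3, assuming mu and logvar come from the encoder.
import torch

mu = torch.zeros(1, 20)       # example mean vector (latent_dim = 20)
logvar = torch.zeros(1, 20)   # example log-variance vector
eps = torch.randn_like(mu)                 # eps ~ N(0, I)
z = mu + eps * torch.exp(0.5 * logvar)     # reparameterized sample from N(mu, Sigma)
```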
Step 4: The Decoder Network takes the sampled latent vector z and maps it back to the data space to reconstruct the original input. The output of the decoder network, denoted as h_dec, is computed as:
h_dec = f_dec(z)
where f_dec represents the decoder network's transformation.
Step 5: The reconstructed output, denoted as x′, is obtained by applying a suitable activation function g_dec to the decoder output h_dec:
x′ = g_dec(h_dec)
Step 6: Loss Function: As explained in the previous section, the loss function is a combination of the reconstruction loss L_rec and the Kullback-Leibler (KL) divergence loss L_KL.
The reconstruction loss measures the discrepancy between the reconstructed output x′ and the original input x. It can be defined using a suitable distance metric such as mean squared error (MSE) or binary cross-entropy (BCE); with MSE, for example, L_rec = ||x − x′||².
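For illustration, both metrics are one-liners in PyTorch. The random x and x_recon tensors below are placeholders for a real batch and the decoder's sigmoid-activated output.

```python
# Two common choices for the reconstruction loss (Step 6).
import torch
import torch.nn.functional as F

x = torch.rand(8, 784)        # example batch of inputs scaled to [0, 1]
x_recon = torch.rand(8, 784)  # example decoder outputs (sigmoid-activated)

recon_mse = F.mse_loss(x_recon, x, reduction="sum")              # mean squared error
recon_bce = F.binary_cross_entropy(x_recon, x, reduction="sum")  # binary cross-entropy
```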
The KL divergence loss quantifies the difference between the learned latent distribution and the assumed prior distribution (usually a standard Gaussian), and takes the same closed form given earlier:
L_KL = -0.5 * Σ_j ( 1 + log(σ_j²) − μ_j² − σ_j² )
Where,
μ_j and σ_j are the mean and standard deviation of the j-th latent dimension produced by the encoder.
Step 7: The Overall Loss, denoted as L, is the sum of the reconstruction loss and the KL divergence loss, weighted by a hyper-parameter β:
L = L_rec + β * L_KL
The hyper-parameter β controls the trade-off between reconstruction accuracy and latent space regularization (β = 1 recovers the standard VAE objective).
Step 8: Training: During training, the VAE aims to minimize the overall loss L by adjusting the weights and biases of the encoder and decoder networks. This is typically done using back-propagation and gradient descent optimization.
Step 9: Generation: Once the VAE is trained, it can generate new samples by sampling from the prior distribution and passing the samples through the decoder network. Given a random vector z ~ N(0, I), the generated output is obtained as:
x′ = g_dec(f_dec(z))
This allows the generation of new data samples that resemble the characteristics of the training data.
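A sketch of this generation step, assuming the illustrative VAE class from the architecture sketch above; a real workflow would load trained weights rather than use a freshly initialized model.

```python
# Step 9 in code: decode random latent vectors drawn from the prior.
import torch

model = VAE()                       # assumption: the sketch class above; load trained weights in practice
model.eval()
with torch.no_grad():
    z = torch.randn(16, 20)         # z ~ N(0, I), latent_dim = 20
    samples = model.decode(z)       # decoded samples in data space, shape (16, 784)
```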
Step 10: Inference and Latent Space Interpolation: Given an input x, the encoder network produces the mean and covariance parameters of the approximate posterior distribution q(z|x). The latent vector z can be sampled from this approximate posterior to obtain a representation in latent space.
Additionally, the VAE allows for smooth interpolation in latent space. By interpolating between two latent vectors z1 and z2, for example z_interpolate = (1 − α) * z1 + α * z2 with α in [0, 1], new latent vectors are generated. These interpolated latent vectors are then passed through the decoder network to produce data samples with gradual transitions between the original samples.
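A sketch of latent interpolation, again assuming the illustrative VAE class sketched earlier; here z1 and z2 are random stand-ins for the encodings of two real inputs.

```python
# Latent space interpolation (Step 10): blend two latent codes and decode each blend.
import torch

model = VAE()                       # assumption: the sketch class above, ideally trained
model.eval()
with torch.no_grad():
    z1 = torch.randn(1, 20)         # stand-in for the encoding of input 1
    z2 = torch.randn(1, 20)         # stand-in for the encoding of input 2
    for alpha in torch.linspace(0, 1, steps=8):
        z_interpolate = (1 - alpha) * z1 + alpha * z2   # linear blend of the two codes
        x_interp = model.decode(z_interpolate)          # gradual transition in data space
```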
Note 1: If the latent space is too small or too restrictive, the generated samples may be limited and uninteresting. On the other hand, if the latent space is too large or too unconstrained, the generated samples may be unrealistic or uninterpretable.
Note 2: Choosing an appropriate size for the latent space is therefore crucial for the performance of a VAE.
Note 3: Interpreting the learned latent space is important for understanding what information is encoded in it and how it can be used for downstream tasks.
Re-parameterization Trick:
The reparameterization trick separates the latent sample into two parts: a deterministic part and a stochastic part. The stochastic part is sampled from a standard Gaussian distribution and then scaled and shifted by the deterministic part, which is computed by the encoder network. By doing so, the gradient can be computed with respect to the parameters of the encoder and decoder networks, allowing end-to-end training using SGD.
Mathematically, we can express z as:
z = μ(x) + σ(x) * ε
Where,
μ(x) and σ(x) are the mean and standard deviation vectors computed by the encoder for input x, and ε ~ N(0, I) is a noise vector sampled from a standard Gaussian.
By using this formulation, we can back-propagate gradients through the encoder and decoder networks to optimize the VAE's objective function.
Without the trick, the presence of stochastic nodes in the computation graph would prevent gradient descent from being used for optimization. The trick is what allows the VAE to learn a smooth, continuous latent space; a tiny check of the gradient flow is shown below.
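The snippet below is a minimal check of this claim: because the noise is sampled outside the computation graph, gradients reach both the mean and the log-variance.

```python
# Gradient flow through the reparameterized sample z = mu + eps * sigma.
import torch

mu = torch.zeros(5, requires_grad=True)
logvar = torch.zeros(5, requires_grad=True)
eps = torch.randn(5)                          # noise is sampled outside the graph
z = mu + eps * torch.exp(0.5 * logvar)        # reparameterized sample
z.sum().backward()                            # back-propagate through the sampling step
print(mu.grad, logvar.grad)                   # both gradients are defined
```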
Under-complete and Over-complete VAE:
The term "undercomplete" or "overcomplete" refers to the size of latent space compared to size of input data.
An undercomplete VAE has a smaller latent space than input data dimensionality.
An overcomplete VAE has a larger latent space than input data dimensionality.
Deterministic v/s Probabilistic encoder:
The encoder network maps the input data to a probability distribution in latent space, represented by two vectors: a mean vector and a standard deviation vector.
In a deterministic encoder, the mean vector is the only output of the encoder network, and it is used directly as the encoding of the input data.
In a probabilistic encoder, both the mean vector and the standard deviation vector are used to sample from the probability distribution in latent space. This sampling introduces stochasticity into the encoding process and allows the model to capture uncertainty or variability in the input data.
Using a probabilistic encoder in a VAE has several advantages: it captures uncertainty and variability in the input data, it encourages a smooth and continuous latent space, and it enables sampling from the latent space to generate new data.
The Why:
Reasons for using VAEs:
The Why Not:
Reasons for not using VAEs:
Time for you to support:
In the next edition, we will cover Markov Chain Neural Networks.
Let us know your feedback!
Until then,
Have a great time!