KL Divergence: Prerequisite to Variational AutoEncoder (VAE)

KL Divergence

The Kullback-Leibler divergence (KL divergence) measures the inefficiency of approximating the true probability distribution (P) with a predicted one (Q). Denoted by D_KL(P || Q), it quantifies the additional information, on average, required to describe reality when the predicted distribution is used in place of the true one. A higher KL divergence therefore indicates a larger discrepancy between predicted and actual outcomes. Later, we will see how this measure serves as a penalty term in the VAE loss function.

Figure 1. Formula of KL divergence for continuous distributions
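Since the figure itself is an image, here is the standard continuous-distribution form of the formula in text:

D_KL(P || Q) = ∫ p(x) ln( p(x) / q(x) ) dx

where p(x) and q(x) are the density functions of P and Q.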

The KL divergence formula above penalizes poor predicted distributions (Q) through the logarithm of the ratio p(x)/q(x). When Q significantly underestimates the probability that the actual distribution (P) assigns, the ratio inflates the logarithm, magnifying the discrepancy. In addition, the formula weights each term by P, emphasizing penalties in regions where the actual distribution places high probability. This ensures that KL divergence prioritizes accurate predictions for frequently occurring events.
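As a quick illustration of this weighting effect (a minimal sketch, not from the article, assuming NumPy is available), underestimating a frequent outcome costs far more than underestimating a rare one, even when the rare outcome is off by a larger ratio:

import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P || Q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = [0.7, 0.2, 0.1]                    # actual distribution; the first outcome is frequent
q_misses_frequent = [0.4, 0.35, 0.25]  # underestimates the frequent outcome (0.7 -> 0.4)
q_misses_rare = [0.75, 0.22, 0.03]     # underestimates the rare outcome (0.1 -> 0.03)

print(kl_divergence(p, q_misses_frequent))  # ~0.188 nats: large penalty
print(kl_divergence(p, q_misses_rare))      # ~0.053 nats: small penalty, despite the bigger ratio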

Properties of KL divergence

  1. D_KL(P || Q) ≥ 0 and D_KL(Q || P) ≥ 0, with equality only when P and Q are identical (non-negativity)
  2. D_KL(P || Q) ≠ D_KL(Q || P) in general, so KL divergence is not symmetric and is not a true distance metric
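Both properties are easy to check numerically; here is a small sketch with two hand-picked two-outcome distributions (values chosen purely for illustration):

import math

p, q = [0.5, 0.5], [0.9, 0.1]

kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # ~0.511 nats
kl_qp = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))  # ~0.368 nats

print(kl_pq >= 0 and kl_qp >= 0)  # True  (property 1)
print(kl_pq != kl_qp)             # True  (property 2: not symmetric)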

Discrete Distributions

Let's work through an example with discrete distributions:

Figure 2. Discrete distributions P (uniform distribution) and Q

D_KL(P || Q) = 0.25 ln(0.25/0.18) + 0.25 ln(0.25/0.23) + 0.25 ln(0.25/0.15) + 0.25 ln(0.25/0.44) ≈ 0.0893 nats
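The same number can be reproduced in a couple of lines, assuming SciPy is installed (scipy.stats.entropy returns the relative entropy in nats when two distributions are passed):

from scipy.stats import entropy

p = [0.25, 0.25, 0.25, 0.25]  # uniform distribution P from Figure 2
q = [0.18, 0.23, 0.15, 0.44]  # predicted distribution Q from Figure 2

print(entropy(p, q))  # ~0.0893 nats, matching the calculation above
print(entropy(q, p))  # a different value, illustrating the asymmetry noted earlier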

Multivariate Normal Distributions

We now have two multivariate normal distributions, P and Q, with means μ1 and μ2 and covariance matrices Σ1 and Σ2, respectively. Here, x is a vector of length k.

Figure 3. P and Q are multivariate normal distributions

For these distributions, the KL divergence evaluates in closed form to:

Figure 4. KL divergence for multivariate normal distributions
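Since Figure 4 is an image, here is a small NumPy sketch of the standard closed-form expression it refers to, D_KL = 0.5 [ tr(Σ2⁻¹ Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) − k + ln(det Σ2 / det Σ1) ], assuming both covariance matrices are full rank; the example means and covariances are made up for illustration:

import numpy as np

def kl_mvn(mu1, Sigma1, mu2, Sigma2):
    """D_KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) in nats."""
    k = mu1.shape[0]
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    trace_term = np.trace(Sigma2_inv @ Sigma1)
    quad_term = diff @ Sigma2_inv @ diff
    # slogdet is numerically safer than log(det(...))
    log_det_term = np.linalg.slogdet(Sigma2)[1] - np.linalg.slogdet(Sigma1)[1]
    return 0.5 * (trace_term + quad_term - k + log_det_term)

mu1, Sigma1 = np.zeros(2), np.eye(2)
mu2, Sigma2 = np.array([1.0, 0.0]), np.array([[2.0, 0.3], [0.3, 1.0]])

print(kl_mvn(mu1, Sigma1, mu2, Sigma2))  # positive value: the distributions differ
print(kl_mvn(mu1, Sigma1, mu1, Sigma1))  # 0.0 when P and Q are identical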

