KL Divergence: Prerequisite to Variational AutoEncoder (VAE)
Shradha Agarwal
SWE ops, Newfold Digital | IIITD MTech CSE (AI) | Devops | AWS Certified SA-Associate
KL Divergence
The Kullback-Leibler divergence (KL divergence) measures the inefficiency of approximating the true probability distribution P with a predicted one Q. Denoted D_KL(P || Q), it quantifies the additional information, on average, required to describe reality using the predicted distribution. A higher KL divergence therefore indicates a larger discrepancy between predicted and actual outcomes. We will later see how this measure serves as a penalty term in the VAE loss function.
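For discrete distributions, it is defined as

$$D_{KL}(P \parallel Q) = \sum_{x} P(x)\,\ln\frac{P(x)}{Q(x)}$$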
The above KL divergence formula penalizes poor predicted distributions Q through the logarithm of the ratio P/Q. When Q significantly underestimates the probability of an outcome relative to the actual distribution P, the ratio P/Q inflates the logarithm, magnifying the discrepancy. Additionally, the formula weights each term by P(x), emphasizing penalties in regions where the actual distribution places high probability. This ensures that KL divergence prioritizes accurate predictions for frequently occurring events.
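A quick worked example of this weighting: if P(x) = 0.5 and Q(x) = 0.05, that single term contributes 0.5 ln(0.5/0.05) = 0.5 ln 10 ≈ 1.151 nats. With the same tenfold underestimate at a rare outcome, P(x) = 0.01 and Q(x) = 0.001, the term contributes only 0.01 ln 10 ≈ 0.023 nats.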
Properties of KL divergence
Two properties are worth keeping in mind:
- Non-negativity: D_KL(P || Q) ≥ 0, with equality if and only if P = Q.
- Asymmetry: in general, D_KL(P || Q) ≠ D_KL(Q || P), so KL divergence is not a true distance metric.
Discrete Distributions
Let's take an example for discrete distributions. Suppose the true distribution P is uniform over four outcomes, P = (0.25, 0.25, 0.25, 0.25), and the predicted distribution is Q = (0.18, 0.23, 0.15, 0.44). Then
D_KL(P || Q) = 0.25 ln(0.25/0.18) + 0.25 ln(0.25/0.23) + 0.25 ln(0.25/0.15) + 0.25 ln(0.25/0.44) ≈ 0.0893 nats
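A minimal NumPy sketch of this computation, using the P and Q from the example above; computing the reverse direction also illustrates the asymmetry property:

```python
import numpy as np

# True distribution P and predicted distribution Q from the example above
p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.18, 0.23, 0.15, 0.44])

# D_KL(P || Q) = sum_x P(x) * ln(P(x) / Q(x)), in nats
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(f"KL(P || Q) = {kl_pq:.4f} nats")  # ~0.0893
print(f"KL(Q || P) = {kl_qp:.4f} nats")  # ~0.0938, not equal: KL is asymmetric
```

The same quantity is available as scipy.stats.entropy(p, q), which returns the KL divergence in nats when a second distribution is passed.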
Multivariate Normal Distributions
We have two multivariate normal distributions with means μ1 and μ2 and covariance matrices Σ1 and Σ2, where x is a vector of length k.
For the above distributions, the KL divergence has a closed form:
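$$D_{KL}\left(\mathcal{N}(\mu_1, \Sigma_1) \parallel \mathcal{N}(\mu_2, \Sigma_2)\right) = \frac{1}{2}\left[\operatorname{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1) - k + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right]$$

A small NumPy sketch of this closed form (the function name kl_mvn and the 2-D example values are illustrative, not from the original):

```python
import numpy as np

def kl_mvn(mu1, sigma1, mu2, sigma2):
    """D_KL(N(mu1, sigma1) || N(mu2, sigma2)) in nats, via the closed form."""
    k = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    trace_term = np.trace(sigma2_inv @ sigma1)      # tr(Sigma2^-1 Sigma1)
    quad_term = diff @ sigma2_inv @ diff            # (mu2-mu1)^T Sigma2^-1 (mu2-mu1)
    logdet_term = (np.linalg.slogdet(sigma2)[1]
                   - np.linalg.slogdet(sigma1)[1])  # ln(det Sigma2 / det Sigma1)
    return 0.5 * (trace_term + quad_term - k + logdet_term)

# Illustrative 2-D check: identical distributions give a divergence of 0
mu, sigma = np.zeros(2), np.eye(2)
print(kl_mvn(mu, sigma, mu, sigma))                 # 0.0
print(kl_mvn(mu, sigma, np.array([1.0, 0.0]), 2 * np.eye(2)))
```

Using slogdet rather than det avoids overflow and underflow of the determinant when k is large.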