Sunday, December 08, 2024

Short Explanation of KL Divergence

What is KL divergence? It measures how different one probability distribution is from another. To be more precise: given two distributions P and Q, with P as the target distribution, the KL divergence is KL(P|Q) = CrossEntropy(P|Q) - Entropy(P).

What are Entropy and Cross Entropy? Entropy is a special case of Cross Entropy where the two distributions are the same.

As explained in the video tutorial (see the reference below), start with entropy. Given a distribution with probability function p(x), draw some actual samples (say, 10) to get a sequence. The probability of getting that exact sequence is the product of p(xi) over all the steps. To turn the product into a summation, take the log of both sides, so each sample contributes ln(p(xi)). Group the samples that share the same value together and divide by the total count; as the sampling is repeated, that fraction approaches p(x) itself. The average log probability of the sequence therefore approaches ∑ p(x) ln(p(x)), and entropy is its negative: Entropy(P) = -∑ p(x) ln(p(x)).

Note that the p(x) factor outside the log comes from the frequency of the actual samples, while the p(x) inside the log is the probability assigned by the distribution itself.
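
Here is a minimal numpy sketch of this sampling argument, using a small made-up discrete distribution (rather than a Gaussian) so that grouping equal sample values is exact; the probabilities and the sample count are illustration values, not from the video.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([0, 1, 2])
p = np.array([0.5, 0.3, 0.2])          # target distribution P (made-up)

N = 100_000
samples = rng.choice(values, size=N, p=p)

# log-probability of the whole sampled sequence, turned into a sum by the log
log_prob_sequence = np.log(p[samples]).sum()

# dividing by N groups equal samples together: each value x contributes
# (count(x)/N) * ln p(x), and count(x)/N approaches p(x) as N grows
avg_log_prob = log_prob_sequence / N

entropy = -(p * np.log(p)).sum()        # Entropy(P) = -sum p(x) ln(p(x))
print(avg_log_prob, -entropy)           # the two numbers should nearly match
```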

Take another distribution Q with probability function q(x) and replace ln(p(x)) with ln(q(x)); this gives CrossEntropy(P|Q) = -∑ p(x) ln(q(x)). It measures how likely the different distribution Q is to generate the same sequence that was sampled from P. This value is never smaller than Entropy(P), and it is strictly larger whenever Q differs from P.

KL divergence is then KL(P|Q) = CrossEntropy(P|Q) - Entropy(P), which works out to ∑ p(x) ln(p(x)/q(x)).
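
A quick numeric check of that identity, with two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # target distribution P (made-up)
q = np.array([0.4, 0.4, 0.2])   # approximating distribution Q (made-up)

entropy_p = -(p * np.log(p)).sum()
cross_entropy_pq = -(p * np.log(q)).sum()
kl_pq = (p * np.log(p / q)).sum()

print(cross_entropy_pq - entropy_p, kl_pq)   # both equal KL(P|Q)
print(cross_entropy_pq >= entropy_p)         # cross entropy is never smaller
```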


References:

KL Divergence: https://youtu.be/sjgZxuCm_8Q?si=CAU5g6_DGxto0h0J

Short Explanation of Variational Autoencoder (VAE) and Controlled VAE

A Variational Autoencoder is an autoencoder with a twist.

An Autoencoder is a network that takes in a large input (for example, an image), encodes it into a smaller vector, and decodes it back into the original input. The smaller intermediate vector is called the "latent vector".

A Variational Autoencoder takes in a large input and encodes it into a Gaussian distribution in the latent vector space (so it encodes into a mean and a standard deviation). To decode, randomly sample from this Gaussian distribution to get a latent vector, and the decoder should bring it back to the original input.
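
Below is a minimal sketch of that encode, sample, decode path, assuming PyTorch; the flattened 28x28 input size, the 16-dimensional latent space, and the layer widths are made-up illustration values, not details from the video.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mean = nn.Linear(256, latent_dim)      # mean of the Gaussian
        self.to_logvar = nn.Linear(256, latent_dim)    # log variance of the Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)        # random sample from N(0, I)
        z = mean + eps * std               # sampled latent vector
        return self.decoder(z), mean, logvar

x = torch.rand(8, 784)                     # a fake batch of 8 "images"
recon, mean, logvar = TinyVAE()(x)
print(recon.shape, mean.shape)             # (8, 784) and (8, 16)
```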

Note that since the latent vector is not exactly the same every time, nearby values in the Gaussian distribution should all decode to an output close to the original input. If the network was trained to encode 2 images, a point midway between the two distributions will have characteristics of both images.

A VAE is usually trained with a weighted sum of a KL divergence term and the difference between the actual output and the desired output (where the input and the desired output are the same in this case). The KL divergence term measures how far the encoder's Gaussian distribution is from a standard normal distribution; because that reference distribution is zero-centered and normalized, the term reduces to a simple formula in the mean and standard deviation rather than requiring many sample points. More intuitively, the further away the sampling lands, the more tolerance there is for the difference between the desired output and the actual output.
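
Continuing the sketch above, the weighted training loss could look like the following; the mean-squared reconstruction error, the beta weight, and the toy tensors are assumptions rather than details from the video.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mean, logvar, beta=1.0):
    # difference between the decoder output and the desired output
    recon_loss = F.mse_loss(recon, target, reduction="sum")
    # closed-form KL(N(mean, std) || N(0, I)), summed over the latent dimensions
    kl = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon_loss + beta * kl

# toy tensors standing in for the outputs of the TinyVAE sketch above
recon, target = torch.rand(8, 784), torch.rand(8, 784)
mean, logvar = torch.zeros(8, 16), torch.zeros(8, 16)
print(vae_loss(recon, target, mean, logvar, beta=0.5))
```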

In the case of a Controlled VAE, a text prompt is also encoded into a latent vector, and the text latent vector and the image latent vector are concatenated as the input to the decoder.
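
A small sketch of that concatenation step, with made-up latent sizes (16 for the image, 32 for the text); the text encoder itself is not shown.

```python
import torch

image_latent = torch.randn(8, 16)   # sampled from the image encoder's Gaussian
text_latent = torch.randn(8, 32)    # produced by some text encoder (assumed)
decoder_input = torch.cat([image_latent, text_latent], dim=1)
print(decoder_input.shape)          # torch.Size([8, 48]); the decoder now takes 48 inputs
```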


References:

VAE: https://youtu.be/X73mKrIuLbs?si=e7tYZRoWm8QO60R1