Sunday, December 08, 2024

Short Explanation of Variational Autoencoder (VAE) and Controlled VAE

A Variational Autoencoder is an autoencoder with a twist.

An autoencoder is a network that takes in a large input (e.g. an image), encodes it into a smaller vector, and decodes it back into the original input. The smaller intermediate vector is called the "latent vector".
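
A minimal sketch in PyTorch, assuming flattened 28x28 grayscale images and an illustrative 32-dimensional latent vector (the layer sizes here are assumptions, not a prescription):

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)      # the "latent vector"
        return self.decoder(z)   # reconstruction of the original input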

A Variational Autoencoder takes in a large input and encodes it into a Gaussian distribution in the latent space (so it encodes into a mean and a standard deviation). To decode, randomly sample from this Gaussian to get a latent vector; the decoder should bring it back to the original input.
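
A sketch of the encoder half, under the same assumed sizes. The sampling is done with the standard "reparameterization trick" (mean + std * noise), so the random sample stays differentiable for training:

import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mean = nn.Linear(256, latent_dim)    # mean of the Gaussian
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance, more stable than raw std

    def forward(self, x):
        h = self.hidden(x)
        mean, logvar = self.to_mean(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        # Reparameterization trick: sample as mean + std * noise so that
        # gradients can flow back through the sampling step.
        z = mean + std * torch.randn_like(std)
        return z, mean, logvar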

Note that since the sampled latent vector is slightly different every time, all the nearby values in the Gaussian distribution should decode to an output close to the original input. If your network was trained to encode two images, a point midway between the two distributions will have characteristics of both images.
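
One way to see this is to interpolate between the latent means of two encoded images. A hypothetical helper, reusing the VAEEncoder sketch above together with any matching trained decoder:

def interpolate(encoder, decoder, img_a, img_b, alpha=0.5):
    # encoder is assumed to return (z, mean, logvar) like the VAEEncoder
    # sketch above; decoder maps a latent vector back to an image.
    _, mean_a, _ = encoder(img_a)
    _, mean_b, _ = encoder(img_b)
    z = (1 - alpha) * mean_a + alpha * mean_b  # alpha=0.5 is the midpoint
    return decoder(z)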

A VAE is usually trained with a weighted sum of a KL divergence term and a reconstruction term (the difference between the actual output and the desired output, where input and desired output are the same in this case). The KL divergence measures how far the encoded Gaussian is from a fixed standard normal prior, which is zero-centered with unit variance. Because the prior is fixed and both distributions are Gaussian, the KL term has a simple closed-form formula in the encoded mean and standard deviation and needs no sample point. More intuitively, the KL term keeps the encoded distributions wide and centered instead of collapsing to single points, which forces nearby latent values to decode to similar outputs.
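
A sketch of that loss, assuming a mean-squared-error reconstruction term (binary cross-entropy is another common choice) and the closed-form KL against a standard normal prior; kl_weight is the weighting mentioned above:

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mean, logvar, kl_weight=1.0):
    # Reconstruction term: how far the actual output is from the desired
    # output (here, the original input x).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # Closed-form KL divergence between N(mean, var) and the standard
    # normal N(0, I): 0.5 * sum(mean^2 + var - 1 - log(var)).
    kl = 0.5 * torch.sum(mean.pow(2) + logvar.exp() - 1.0 - logvar)
    return recon + kl_weight * kl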

In the case of a Controlled VAE (essentially a conditional VAE), a text input is also encoded into a latent vector, and the text latent vector and the image latent vector are concatenated as the input to the decoder.
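
A sketch of that decoder, with all dimensions assumed for illustration (the text encoder producing z_text is omitted):

import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    def __init__(self, image_latent_dim=32, text_latent_dim=32, output_dim=784):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(image_latent_dim + text_latent_dim, 256), nn.ReLU(),
            nn.Linear(256, output_dim), nn.Sigmoid())

    def forward(self, z_image, z_text):
        # Concatenate the text latent and the image latent, then decode.
        z = torch.cat([z_image, z_text], dim=-1)
        return self.decoder(z)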


References:

VAE: https://youtu.be/X73mKrIuLbs?si=e7tYZRoWm8QO60R1

