Variational Autoencoders (VAEs)

Variational autoencoders for generation


About

Variational Autoencoders (VAEs) are models that can be used to generate new images similar to ones seen in a training dataset.

While VAEs themselves are rarely used directly in modern generative models, they form the basis for many of them, including diffusion models.

Latent Variables

Since latent variables are at the heart of VAEs, it is important to define what they actually are.

If you have heard about latent variables before, you might have heard them described as hidden variables that determine the observed variables. I had a hard time wrapping my head around this, so I will give a few examples:

  1. Say you have a bunch of students who took an exam for the course MATH 101. Trivially, the observed variables are the exam scores. However, a latent variable would be how much of the MATH 101 material each student actually learned.

  2. Plato’s Allegory of the Cave. Let’s say that you were born and raised inside a cave and have never left it your entire life. Every day at dusk, people walk by the outside of the cave, and the sun casts their shadows into the cave. You never know what the people casting the shadows look like; you can only observe the shadows. Here, the latent variables are the humans outside the cave casting/controlling the shadows.

  3. Let’s say you have 2 variables $x, y$. These 2 variables are used to produce 5-dimensional vectors that are functions of the 2 variables, for example $[x^2,\, x+2,\, y^3 x,\, x^3+y^2,\, \sin(x)+e^y]$. We only observe the 5-dimensional vectors for different values of $x$ and $y$. You can think of $x$ and $y$ as controlling the changes in the 5-dimensional vectors. Therefore, $x$ and $y$ are the latent variables. (A small code sketch of this setup follows the list.)
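To make example 3 concrete, here is a minimal sketch (with made-up latent values): an observer only ever sees the 5-dimensional vectors, while the $x, y$ that generated them stay hidden.

import numpy as np

def observe(x, y):
    # 5-dimensional observed vector, fully determined by the latent pair (x, y)
    return np.array([x**2, x + 2, y**3 * x, x**3 + y**2, np.sin(x) + np.exp(y)])

# the latent variables are drawn here but never shown to the observer
latents = np.random.uniform(-1, 1, size=(3, 2))
observations = np.stack([observe(x, y) for x, y in latents])
print(observations.shape)  # (3, 5): all we ever get to see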

High Level Overview

In VAEs, we essentially train and “organize” a latent space (a vector space of latent variables).

We do this by using a neural network to take a high dimensional input vector $x$ to a lower dimensional vector $z$ in the latent space. We then use another neural network to take $z$ back to a higher dimensional vector $\hat{x}$, which has the same dimension as $x$. Our network is trained to both “organize” the latent space (I will expand on what “organize” means later) and reduce the difference between $x$ and $\hat{x}$.

To generate new data, we simply take a random sample from the latent space and use the second neural network to produce a vector in the input space.

Overview of VAE architecture. Source: Lilian Weng

https://lilianweng.github.io/posts/2018-08-12-vae/

Autoencoders

Before we talk about VAEs, we should talk about autoencoders.

Illustration of an autoencoder. Source: Lilian Weng

https://lilianweng.github.io/posts/2018-08-12-vae/

Autoencoders are models that try to learn low dimensional representations of input vectors. The low dimensional representations should contain the “most important” features of the input vector. This allows you to (approximately) reconstruct the original input vector from the low dimensional vector.

Observe how this is very similar to PCA. Recall that PCA uses a linear transformation (a matrix) to go from a high dimensional vector to a low dimensional vector, and then a second linear transformation to go from the low dimensional vector back to a high dimensional vector. PCA then tries to minimize the difference (reconstruction loss) between the two high dimensional vectors. Autoencoders do the same, but instead of moving between dimensions with linear transformations, they use neural networks: one neural network (the encoder) decreases the dimension and another neural network (the decoder) increases it. This is why a lot of people refer to autoencoders as a kind of non-linear PCA.

In autoencoders, the space of low dimensional vectors can be thought of as the space of latent variables i.e. the latent space.
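As a rough illustration (a minimal sketch, not the VAE built later; the layer sizes here are arbitrary), a plain autoencoder for flattened MNIST-sized inputs could look like this:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=12):
        super().__init__()
        # encoder: input space -> latent space
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # decoder: latent space -> input space
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # low dimensional representation
        return self.decoder(z)   # reconstruction of x

Training minimizes a reconstruction loss (e.g. MSE) between the input and the output; there is no constraint yet on how the latent vectors are arranged.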

Autoencoders for Generation

You may now be thinking that to generate new content, we can simply take a random sample of the latent space and use the decoder neural network to get a vector that is the same dimension as our input vectors.

However, this won’t work because the latent space isn’t “organized”, meaning a random sample is likely to produce gibberish.

The above image tries to model a latent space. The gradient colors represent regions in the latent space that, if sampled and passed through the decoder neural network, will produce some meaningful reconstruction (like the triangle, square, and circle). However, notice how taking a random sample of this space is likely not going to fall in one of these gradients. Decoding a sample from the white space is going to produce gibberish, like the purple squiggly line.

An irregular latent space prevents us from using an autoencoder for new content generation. Source:

https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

We need some way to “organize” this space so that the majority of samples fall in these gradients.

VAEs

Definition

A variational autoencoder can be described as a model that not only tries to learn a good latent space, but also “organizes” it in such a way that random samples of the space produce meaningful results.

💡

VAEs achieve this by trying to learn the probability distribution of the latent space.

VAEs differ from plain autoencoders in that they use probabilistic encoders/decoders instead of deterministic ones. This means that instead of using a neural network to map a vector in the input space to a single vector in the latent space (and vice-versa), they use neural networks to produce a probability distribution over the latent space given a vector in the input space (and vice-versa).

Math

Objective

For all observed data $x$, we assume they are i.i.d. and come from some underlying probability distribution $p(x)$. We then try to learn a model that defines a probability distribution $p_\theta$, for model parameters $\theta$, such that the likelihood $p_\theta(x)$ is maximized for all observed data points. This approach is called likelihood maximization.

We can imagine the latent variables $z$ of the observed data as modeled by a joint probability distribution $p_\theta(x,z)$. By the chain rule of probability, we have $p_\theta(x)=\frac{p_\theta(x,z)}{p_\theta(z|x)}$. We also have $p_\theta(x)=\int p_\theta(x,z)\,dz$ by marginalization. Directly computing $p_\theta(x)$ and trying to maximize it is difficult, because you would either have to have access to the ground truth distribution of the latent encoder, $p_\theta(z|x)$, or integrate all latent variables $z$ out (intractable for high dimensional $z$).

Instead, we try to approximate $p_\theta(z|x)$.

Let us first formally define our objective. As with many other probabilistic models, we try to maximize the log-likelihood of the observed data.

$$\argmax_\theta \log p_\theta(x) = \argmax_\theta \log \frac{p_\theta(x,z)}{p_\theta(z|x)}$$

Since we don’t have access to $p_\theta(z|x)$, we try to estimate it with a variational distribution $q_\phi(z|x)$ with parameters $\phi$. More specifically,

$$
\begin{aligned}
\argmax_\theta \log p_\theta(x) &= \argmax_\theta \log\int p_\theta(x,z)\,dz \\
&= \argmax_{\theta,\phi} \log\int p_\theta(x,z)\frac{q_\phi(z|x)}{q_\phi(z|x)}\,dz \\
&= \argmax_{\theta,\phi} \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right] && \text{(definition of expectation)} \\
&\ge \argmax_{\theta,\phi} \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] && \text{(Jensen's inequality)}
\end{aligned}
$$

The last step follows from Jensen’s inequality. The last term, $\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$, is formally known as the evidence lower bound (ELBO). Since it is a lower bound on the log-likelihood $\log p_\theta(x)$, maximizing it also pushes the log-likelihood up.

The ELBO can be simplified even further, which makes the objective function of VAEs more apparent:

$$
\begin{aligned}
\mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x|z)\,p_\theta(z)}{q_\phi(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) && \text{(definition of KL-divergence)}
\end{aligned}
$$

Here, $D_{\text{KL}}(q_\phi(z|x)\,\|\,p_\theta(z))$ is the KL-divergence, a measure of how close 2 probability distributions are. Note that the KL-divergence is not symmetric, but that is not too important right now.

We want $q_\phi(z|x)$ to match the true posterior $p_\theta(z|x)$. Maximizing the ELBO accomplishes exactly this: the gap between $\log p_\theta(x)$ and the ELBO is $D_{\text{KL}}(q_\phi(z|x)\,\|\,p_\theta(z|x))$ (see the short derivation below), and since $\log p_\theta(x)$ does not depend on $\phi$, maximizing the ELBO over $\phi$ is the same as minimizing the KL-divergence between $q_\phi(z|x)$ and $p_\theta(z|x)$.
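For completeness, here is a short, standard derivation of that identity, using only the chain rule of probability and the definition of the KL-divergence:

$$
\begin{aligned}
\log p_\theta(x) &= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x)\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
&= \text{ELBO} + D_{\text{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right)
\end{aligned}
$$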

Note that $q_\phi(z|x)$ gives a probability distribution over latent variables $z$ given an observed variable $x$. Therefore, $q_\phi(z|x)$ can be thought of as a probabilistic encoder. Similarly, $p_\theta(x|z)$ gives a probability distribution over observed variables $x$ given a latent variable $z$. Therefore, $p_\theta(x|z)$ can be thought of as a probabilistic decoder.

Probabilistic encoder and decoders. Source:

https://arxiv.org/pdf/2208.11970

Once again, our objective is to maximize the ELBO over parameters $\theta, \phi$, where the ELBO is:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right)$$

Optimizing

We first look at the second term of the ELBO, the KL-divergence term.

In VAEs, when trying to maximize the ELBO, we assume the prior $p(z)$ to be the standard multivariate Gaussian:

$$p(z) = N(0, I)$$

We also assume $q_\phi(z|x)$ to be a multivariate Gaussian with diagonal covariance:

$$q_\phi(z|x) = N\!\left(\mu_\phi(x),\, \sigma^2_\phi(x) I\right)$$

This means we learn models $\mu_\phi(x)$ and $\sigma_\phi(x)$, which are smaller neural networks, in order to produce a mean vector and a standard deviation vector given a datapoint $x$.

Since we are trying to reduce the KL-divergence and bring $q_\phi(z|x)$ as close to $p(z)$ as possible, and $p(z)$ is the standard normal distribution, we are essentially pulling the distribution of all latent variables toward the standard normal distribution. This is how we “organize” the latent space.

The returned distributions of VAEs have to be “organized” to obtain a latent space with good properties. Source:

https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

We now look at the first term of the ELBO: $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$. This term tells the model to maximize the log-likelihood of seeing the input data $x$ given its latent representation $z$.

Since we don’t have access to all possible latent variables $z$, we instead estimate this expectation using a Monte Carlo estimate. Specifically, we do:

$$\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] \approx \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x|z_l)$$

This means that for some relatively small $L$, we sample $z_1, \dots, z_L \sim q_\phi(z|x)$ and compute $\frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x|z_l)$. In practice, if the dataset is large, we can use $L = 1$.
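As a small standalone sketch (the function log_p_x_given_z here is a hypothetical stand-in for the decoder's log-likelihood, not something defined later), the Monte Carlo estimate is just an average over samples drawn from $q_\phi(z|x)$:

import torch

def monte_carlo_reconstruction_term(log_p_x_given_z, mu, std, L=1):
    # estimate E_{q_phi(z|x)}[log p_theta(x|z)] with L samples from N(mu, std^2 I)
    total = 0.0
    for _ in range(L):
        z = mu + std * torch.randn_like(std)  # one sample from q_phi(z|x)
        total = total + log_p_x_given_z(z)
    return total / L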

There is still one small problem, though. Since sampling is a non-differentiable operation, we can’t simply generate samples $z_i \sim q_\phi(z|x) = N(\mu_\phi(x), \sigma^2_\phi(x) I)$ directly, because we want to be able to differentiate $\mu_\phi$ and $\sigma_\phi$ during backpropagation.

Instead, we employ a method called the reparameterization trick. This trick rewrites $q_\phi(z|x)$ as a deterministic function and offloads the randomness/sampling to a random variable $\zeta \sim N(0, I)$. Note that $\zeta$ doesn’t need to be differentiated because it is not learned.

Therefore, instead of sampling $z \sim q_\phi(z|x)$, we first sample $\zeta \sim N(0, I)$ and then set

$$z = \mu_\phi(x) + \sigma_\phi(x) \cdot \zeta$$
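A minimal standalone sketch of the trick (with made-up mu and std tensors standing in for $\mu_\phi(x)$ and $\sigma_\phi(x)$) showing that gradients flow back to the distribution parameters:

import torch

mu = torch.zeros(4, requires_grad=True)   # stand-in for mu_phi(x)
std = torch.ones(4, requires_grad=True)   # stand-in for sigma_phi(x)

zeta = torch.randn(4)                     # randomness lives outside the computation graph
z = mu + std * zeta                       # deterministic function of mu, std and zeta

z.sum().backward()                        # gradients reach mu and std
print(mu.grad)                            # all ones
print(std.grad)                           # equals zeta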

That’s it! We now have a well defined objective function to optimize. Maximizing the $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ term of the ELBO tells our model to look for latent representations that allow the reconstruction to be as close to the initial input as possible, and the $D_{\text{KL}}(q_\phi(z|x)\,\|\,p_\theta(z))$ term tells our model to “organize” the latent space for good sampling during generation.

In practice, however, we don’t use the probabilistic decoder $p_\theta(x|z)$ to explicitly model the distribution and compute $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$. Instead, we directly compute $\hat{x}$ (a reconstruction of $x$) from $z$ using the probabilistic decoder neural network and try to make it as close to $x$ as possible. That is, we try to minimize $l(x, \hat{x})$ for some loss function $l$ instead of maximizing $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$.

Overview of VAE architecture. Source: Lilian Weng

https://lilianweng.github.io/posts/2018-08-12-vae/

Code

The following shows how we can create a VAE on the MNIST dataset.

We use the following libraries and starter code:

import torch
import torch.nn as nn
import math
import torch.optim as optim
import torchvision.datasets as datasets
import matplotlib.pyplot as plt
import torchvision.transforms as transforms

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"

Loading the data

# download the MNIST dataset
train = datasets.MNIST(root='./data', train=True, transform=transforms.ToTensor(), download=True)
test = datasets.MNIST(root='./data', train=False, transform=transforms.ToTensor(), download=True)

# binarize the data
train.data = torch.where(train.data > 128, 1, 0)
test.data = torch.where(test.data > 128, 1, 0)

# plot samples
sample_train = train.data[:2]
sample_test = test.data[:2]
for sample in sample_train:
    plt.figure()
    plt.imshow(sample.numpy(), cmap='gray')
for sample in sample_test:
    plt.figure()
    plt.imshow(sample.numpy(), cmap='gray')

Loss term

Recall that we are trying to maximize the ELBO which is formally:

$$\argmax_{\theta,\phi}\; \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\text{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right)$$

Let’s look at the first term, $\argmax_{\theta,\phi}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$. Recall that in practice we don’t use the decoder neural network to model the distribution $p_\theta(x|z)$, but instead use it to directly compute $\hat{x}$ from $z$, which is a reconstruction of $x$. Maximizing this term is therefore equivalent to minimizing the reconstruction loss between $\hat{x}$ and $x$, so we use the binary cross entropy loss, which is a good loss for measuring reconstruction (other losses, such as the MSE loss, can also be used). So instead of $\argmax_{\theta,\phi}\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ we do:

$$\argmin_{\theta,\phi} \text{BCE-Loss}(\hat{x}, x)$$
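As a quick standalone check (not part of the training code), nn.BCELoss with reduction='sum' is exactly the negative Bernoulli log-likelihood $-\sum_i\left[x_i \log \hat{x}_i + (1 - x_i)\log(1 - \hat{x}_i)\right]$, which is why minimizing it corresponds to maximizing $\log p_\theta(x|z)$ for binarized pixels:

import torch
import torch.nn as nn

x = torch.tensor([1., 0., 1., 1.])          # binarized pixels
x_hat = torch.tensor([0.9, 0.2, 0.7, 0.6])  # decoder outputs in (0, 1)

bce = nn.BCELoss(reduction='sum')(x_hat, x)
manual = -(x * torch.log(x_hat) + (1 - x) * torch.log(1 - x_hat)).sum()
print(torch.isclose(bce, manual))           # True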

Now let’s look at the second term, $D_{\text{KL}}(q_\phi(z|x)\,\|\,p_\theta(z))$. Instead of maximizing the negative of the KL-divergence, we simply minimize it. Recall that we assume $p_\theta(z)$ to be the standard normal distribution $N(0, I)$ and $q_\phi(z|x)$ to be the normal distribution $N(\mu_\phi(x), \sigma^2_\phi(x) I)$. Using the formula for the KL-divergence between $k$-dimensional multivariate Gaussians and simplifying based on the fact that $p_\theta(z) = N(0, I)$, we get that the KL-divergence term becomes:

$$\argmin_\phi \frac{1}{2}\left[\mu_\phi(x)^T\mu_\phi(x) + \operatorname{tr}\!\left(\sigma^2_\phi(x) I\right) - k - \sum_{i=1}^{k}\log\left(\sigma^2_\phi(x)_i\right)\right]$$

Note that the $-k$ can be omitted because it is a constant.

Putting this into code, we get the following:

def elbo_loss(x_hat, x, mu, std):
    # reconstruction term: binary cross entropy between the reconstruction and the input
    bce_loss = nn.BCELoss(reduction='sum')
    # closed-form KL-divergence between N(mu, std^2 I) and N(0, I), with the constant -k dropped
    kl_divergence = 0.5 * torch.sum(mu ** 2 + std ** 2 - torch.log(std ** 2))
    return bce_loss(x_hat, x) + 5 * kl_divergence

Approximated ELBO loss function

Note that we chose to multiply the KL-divergence by a positive constant (here 5) in order to tell the model to put more weight on organizing the latent space. This allows for slightly more meaningful generation.
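As an optional sanity check (a standalone snippet, not needed for training), the full closed-form KL-divergence, including the constant $-k$ that the loss above drops, can be compared against torch.distributions:

import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(12)
std = torch.rand(12) + 0.1  # keep the standard deviations positive

manual = 0.5 * torch.sum(mu ** 2 + std ** 2 - 1 - torch.log(std ** 2))
library = kl_divergence(Normal(mu, std), Normal(torch.zeros(12), torch.ones(12))).sum()
print(torch.allclose(manual, library))  # True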

VAE Model

The VAE model is essentially just a bunch of neural networks.

More specifically, for a latent space of dimension $d$, the probabilistic encoder is a neural network that takes an input vector and produces a $d$-dimensional mean vector (representing $\mu_\phi(x)$) and a $d$-dimensional standard deviation vector (representing $\sigma_\phi(x)$).

The probabilistic decoder is another neural network that takes a $d$-dimensional vector in the latent space and outputs a vector in the input space.

For the model, we chose to use a mix of non-linearities. The most important parts are the $\tanh$ non-linearities at the end of our probabilistic encoder network layers (in order to constrain the outputs between -1 and 1) and a sigmoid non-linearity at the end of the decoder (so its outputs are valid inputs for the $\log$ term in the loss function).

class VAE(nn.Module):
    def __init__(self, input_dim):
        super(VAE, self).__init__()
        self.input_dim = input_dim
        self.latent_dim = 12
        self.encoder = nn.Sequential(
            nn.Linear(self.input_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 64),
            nn.ReLU(),
        )
        self.mu = nn.Sequential(
            nn.Linear(64, self.latent_dim),
            nn.Tanh()
        )
        self.std = nn.Sequential(
            nn.Linear(64, self.latent_dim),
            nn.Tanh()
        )
        self.decoder = nn.Sequential(
            nn.Linear(self.latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 512),
            nn.ReLU(),
            nn.Linear(512, self.input_dim),
            nn.Sigmoid()
        )

    def get_latent_distribution(self, x):
        # the encoder features are shared by the mean and standard deviation heads
        h = self.encoder(x)
        return self.mu(h), self.std(h)

    def sample_latent_distribution(self, mu, std):
        # Samples z using the reparameterization trick: z = mu + std * zeta,
        # with independent noise zeta ~ N(0, I) for every latent dimension.
        # This is a Monte Carlo estimate with 1 sample, which is sufficient
        # since our dataset is large.
        return mu + std * torch.randn_like(std)

    def encode(self, x):
        mu, std = self.get_latent_distribution(x)
        return self.sample_latent_distribution(mu, std), mu, std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        z, mu, std = self.encode(x)
        return self.decode(z), mu, std

    def generate(self, n=1):
        # Take n random samples z under the assumption that the latent space
        # is the standard gaussian, then decode them into images
        mu = torch.zeros(n, self.latent_dim).to(device)
        std = torch.ones(n, self.latent_dim).to(device)
        z = self.sample_latent_distribution(mu, std)
        size = int(math.sqrt(self.input_dim))
        return self.decode(z).reshape(n, size, size)

VAE model

Training Loop

The training loop is defined as usual:

vae = VAE(28**2).to(device)
optimizer = optim.Adam(vae.parameters(), lr=1e-3)

batch_size = 32
train_loader = torch.utils.data.DataLoader(train.data, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test.data, batch_size=100, shuffle=True)

epochs = 15
for epoch in range(epochs):
    epoch_loss = 0
    for batch_idx, x in enumerate(train_loader):
        optimizer.zero_grad()
        x = torch.flatten(x.float(), start_dim=1).to(device)
        x_hat, mu, std = vae(x)
        loss = elbo_loss(x_hat, x, mu, std)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print('Epoch: {}, Loss: {}'.format(epoch + 1, epoch_loss / len(train_loader)))
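After training, a similar loop (a small sketch using the test_loader defined above) can report the loss on held-out data:

vae.eval()
with torch.no_grad():
    test_loss = 0
    for x in test_loader:
        x = torch.flatten(x.float(), start_dim=1).to(device)
        x_hat, mu, std = vae(x)
        test_loss += elbo_loss(x_hat, x, mu, std).item()
print('Test loss: {}'.format(test_loss / len(test_loader)))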

Results

The results and all the code can be found at:

Decoding

The following images are reconstructed from a single pass through the model. The image on the left of each figure is the original image from the dataset and the image on the right of each figure is the reconstruction.
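A sketch of how such side-by-side figures can be produced with the trained model (the plotting details here are just one reasonable choice):

vae.eval()
with torch.no_grad():
    originals = test.data[:4].float()
    flat = torch.flatten(originals, start_dim=1).to(device)
    reconstructions, _, _ = vae(flat)
    reconstructions = reconstructions.reshape(-1, 28, 28).cpu()

for orig, recon in zip(originals, reconstructions):
    fig, (ax_left, ax_right) = plt.subplots(1, 2)
    ax_left.imshow(orig.numpy(), cmap='gray')    # original image
    ax_right.imshow(recon.numpy(), cmap='gray')  # reconstruction
plt.show()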

Generation

The following set of images are generated by randomly sampling vectors in the latent space and decoding them using the model.

Generation of random samples from latent space.
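For reference, a minimal sketch of how these samples can be produced with the generate method defined above (the 4x4 grid is an arbitrary layout choice):

vae.eval()
with torch.no_grad():
    samples = vae.generate(16).cpu()  # decode 16 random samples z ~ N(0, I)

fig, axes = plt.subplots(4, 4)
for ax, img in zip(axes.flatten(), samples):
    ax.imshow(img.numpy(), cmap='gray')
    ax.axis('off')
plt.show()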

Thanks for reading! If you have any questions or notice something wrong, please email me at

varchi [at] seas [dot] upenn [dot] edu.