Variational Autoencoders
Variational Autoencoders (VAEs) are a powerful class of generative models that merge deep learning with the principles of Bayesian inference. They allow you to learn complex, high-dimensional data distributions—like images, sounds, or text—and generate new, plausible samples from them. Beyond mere generation, their structured latent space provides a robust framework for tasks like data compression, anomaly detection, and feature discovery.
Core Architecture: From Autoencoder to VAE
A standard autoencoder consists of two neural networks: an encoder that compresses input data $x$ into a lower-dimensional latent code $z$, and a decoder that reconstructs the input from this code. The goal is to minimize reconstruction error. However, this deterministic mapping often leads to an irregular, "hole-ridden" latent space where sampling can produce nonsensical outputs.
The VAE fundamentally reimagines this process. Instead of mapping an input to a single point in latent space, the encoder outputs the parameters of a probability distribution, typically a Gaussian. For a given input $x$, the encoder defines an approximate posterior $q_\phi(z \mid x) = \mathcal{N}\big(z; \mu_\phi(x), \sigma^2_\phi(x) I\big)$, with mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. The decoder then models the likelihood $p_\theta(x \mid z)$, the probability of the data given a latent code. The latent variable itself is sampled from the distribution defined by the encoder: $z \sim q_\phi(z \mid x)$.
This probabilistic approach forces the model to learn a continuous, structured latent space where every point is likely to decode into valid data.
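The encoder's role of emitting distribution parameters rather than a single code can be sketched as follows. This is a minimal toy example: the linear "encoder" weights and the 4-to-2 dimensions are assumptions for illustration, not part of any real architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Toy linear 'encoder': maps input x to the parameters
    (mean, log-variance) of a diagonal Gaussian q(z|x)."""
    mu = x @ W_mu
    logvar = x @ W_logvar
    return mu, logvar

# Hypothetical dimensions: 4-D inputs, 2-D latent space, batch of 3.
x = rng.normal(size=(3, 4))
W_mu = rng.normal(size=(4, 2))
W_logvar = rng.normal(size=(4, 2))
mu, logvar = encode(x, W_mu, W_logvar)
print(mu.shape, logvar.shape)  # (3, 2) (3, 2)
```

In a real VAE the linear maps would be deep networks, but the output contract is the same: one mean and one log-variance per latent dimension.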
The Reparameterization Trick and ELBO Derivation
The central challenge in training this probabilistic model is backpropagation through the stochastic sampling operation $z \sim q_\phi(z \mid x)$. Sampling is a non-differentiable process. The reparameterization trick provides an elegant solution. Instead of sampling directly, we express it as a deterministic function of the distribution parameters and an auxiliary noise variable.
Specifically, we sample the auxiliary noise from a standard normal distribution, $\epsilon \sim \mathcal{N}(0, I)$. We then calculate the latent variable as:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$$

Here, $\odot$ denotes element-wise multiplication. This reformulation allows gradients to flow through the deterministic parameters $\mu_\phi(x)$ and $\sigma_\phi(x)$ during backpropagation, while the stochasticity comes from the independent variable $\epsilon$.
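The trick itself is a one-line transformation. A minimal numpy sketch (the batch size and the choice of parameterizing log-variance rather than variance are illustrative conventions):

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).
    The sample is a deterministic function of mu and logvar, so
    gradients can flow through them; randomness is isolated in eps."""
    sigma = np.exp(0.5 * logvar)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Sanity check: with mu = 0 and sigma = 2, samples should have std ~ 2.
mu = np.zeros((1000, 2))
logvar = np.log(np.full((1000, 2), 4.0))  # sigma^2 = 4
z = reparameterize(mu, logvar, rng)
print(z.std(axis=0))  # close to 2 in each dimension
```

Parameterizing the log-variance (rather than the variance directly) keeps the network output unconstrained while guaranteeing a positive standard deviation.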
To train the model, we need an objective function. We wish to maximize the log-likelihood of the data, $\log p_\theta(x)$, but this is computationally intractable. VAEs instead maximize a tractable surrogate called the Evidence Lower Bound (ELBO). The derivation proceeds by introducing the approximate posterior $q_\phi(z \mid x)$:

$$\log p_\theta(x) = \log \int p_\theta(x, z) \, dz$$

We can then multiply and divide by $q_\phi(z \mid x)$ inside the integral and apply Jensen's inequality to obtain the ELBO:

$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathcal{L}(\theta, \phi; x)$$

Expanding the joint probability as $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$, we arrive at the standard, interpretable form of the ELBO:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
The first term is the reconstruction loss, encouraging the decoder to output data similar to the input $x$. The second term is the Kullback-Leibler (KL) divergence, which acts as a regularizer. It pushes the encoder's distribution toward a simple prior $p(z)$, which we define as a standard normal distribution $\mathcal{N}(0, I)$. This regularization is what enforces a smooth, organized latent space.
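With a diagonal Gaussian posterior and a standard normal prior, the KL term has a well-known closed form, and the negative ELBO becomes a simple sum of two losses. A sketch, assuming a Gaussian decoder so that the reconstruction term reduces to squared error:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ),
    summed over latent dimensions:
    KL = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1)

def neg_elbo(x, x_recon, mu, logvar):
    """Negative ELBO = reconstruction loss + KL regularizer."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    return recon + gaussian_kl(mu, logvar)

# The KL term is exactly zero when q(z|x) already equals the prior:
mu = np.zeros((1, 2))
logvar = np.zeros((1, 2))  # sigma^2 = 1
print(gaussian_kl(mu, logvar))  # [0.]
```

Minimizing this quantity is equivalent to maximizing the ELBO; in practice the reconstruction term is often a per-pixel cross-entropy instead of squared error, depending on the chosen likelihood model.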
Properties of the Learned Latent Space
The structure imposed by the KL divergence term leads to two powerful emergent properties: smooth interpolation and disentanglement.
Latent space interpolation becomes meaningful. Because the latent space is continuous and probabilistic, you can take two data points, encode them to get their latent distributions, and smoothly interpolate between their latent means. Decoding these interpolated points yields a coherent sequence of transitions between the two original data types (e.g., morphing a handwritten '7' into a '1').
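The interpolation itself is just a convex combination of the two latent means, decoded point by point. A minimal sketch (the 2-D latent vectors and step count are illustrative; a real pipeline would pass each point through the trained decoder):

```python
import numpy as np

def interpolate(z_a, z_b, steps=5):
    """Linear interpolation between two latent codes.
    Decoding each intermediate point yields a smooth morph
    between the two original inputs."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - a) * z_a + a * z_b for a in alphas])

z_a = np.array([0.0, 0.0])   # e.g. latent mean of a '7'
z_b = np.array([1.0, -1.0])  # e.g. latent mean of a '1'
path = interpolate(z_a, z_b, steps=5)
print(path[2])  # midpoint: [ 0.5 -0.5]
```

Spherical interpolation is sometimes preferred over linear, since it better respects the geometry of a Gaussian prior, but the linear version conveys the idea.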
Disentangled representations are a highly desired, though not guaranteed, outcome. A disentangled latent space is one where individual latent dimensions correspond to single, interpretable generative factors of the data variation (e.g., one dimension controls object size, another controls rotation, another controls color). While the standard VAE objective encourages some disentanglement, achieving it often requires modifications, such as weighting the KL divergence term more heavily (the $\beta$-VAE framework).
Key Applications: Generation and Beyond
The most direct application of VAEs is image generation. By sampling a latent vector $z$ from the prior $p(z)$ and passing it through the trained decoder, you can synthesize novel images that resemble the training data. While early VAE-generated images were often blurrier than those from Generative Adversarial Networks (GANs), their training stability and principled probabilistic framework remain major advantages.
A critically important application is anomaly detection. The ELBO provides a natural anomaly score. For a new data point $x$, you compute its ELBO. A well-represented, "normal" point will have a high ELBO (high reconstruction likelihood and low KL cost). An anomalous point, which the model has not seen during training, will be poorly reconstructed and/or mapped to a region of latent space far from the prior, resulting in a low ELBO score. This makes VAEs particularly useful for detecting defects, fraud, or system failures.
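Using the negative ELBO as a score, this logic can be sketched as follows. The example feeds in precomputed reconstructions and encoder outputs rather than a trained model, so the specific values are purely illustrative:

```python
import numpy as np

def anomaly_score(x, x_recon, mu, logvar):
    """Negative ELBO as an anomaly score (higher = more anomalous).
    Combines reconstruction error with the KL cost of the encoding."""
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1)
    return recon + kl

x = np.array([[1.0, 2.0]])
prior_params = (np.zeros((1, 2)), np.zeros((1, 2)))  # mu = 0, sigma = 1

# Perfect reconstruction, encoding at the prior: minimal score.
normal = anomaly_score(x, x, *prior_params)
# Poor reconstruction: higher score.
anomalous = anomaly_score(x, x + 3.0, *prior_params)
print(normal[0] < anomalous[0])  # True
```

In practice the threshold separating normal from anomalous scores is calibrated on a held-out set of normal data.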
Common Pitfalls
- Posterior Collapse: In some scenarios, especially with powerful decoders, the KL divergence term can vanish. The encoder learns to ignore the input and simply outputs the prior, $q_\phi(z \mid x) \approx p(z)$. The latent code then carries no information about the input, and the model degenerates into an unconditional decoder, losing all the benefits of a structured latent space. Correction: Techniques like KL annealing (gradually increasing the weight of the KL term) or using a more expressive encoder architecture can mitigate this.
- Blurry Reconstructions and Samples: VAEs, particularly with a Gaussian decoder modeling pixel-wise mean-squared error, often produce averaged, blurry outputs. This is because they minimize a reconstruction loss that averages over all possibilities. Correction: Using a different likelihood model (e.g., a Bernoulli distribution for binarized images or a discretized logistic distribution for color images) can yield sharper results.
- Misinterpreting the Latent Space as Perfectly Disentangled: The standard VAE objective does not explicitly enforce disentanglement; it encourages a factorized aggregate posterior. Strong disentanglement usually requires architectural constraints or modified objectives. Correction: Use and understand specialized variants like $\beta$-VAE or FactorVAE if disentanglement is the primary goal.
- Ignoring the Limitations of the Gaussian Prior: The choice of a standard normal prior is convenient but can be overly restrictive, potentially creating a "holes" problem where regions of low prior probability are not used by the encoder. Correction: Consider more flexible priors, such as a mixture of Gaussians, or employ hierarchical latent variable models.
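The KL-annealing correction mentioned above amounts to a simple weight schedule on the KL term. A minimal sketch; the linear ramp and the warmup length are assumptions, not fixed by the VAE framework:

```python
def kl_weight(step, warmup_steps=10_000):
    """Linear KL annealing: ramp the KL coefficient from 0 to 1 over
    warmup_steps, so the decoder cannot collapse the posterior before
    the latent code has become informative."""
    return min(1.0, step / warmup_steps)

# Inside a (hypothetical) training loop:
#   loss = recon_loss + kl_weight(step) * kl_loss
print(kl_weight(0), kl_weight(5_000), kl_weight(20_000))  # 0.0 0.5 1.0
```

Cyclical schedules, which repeatedly reset the weight to zero and ramp it back up, are a common variation on the same idea.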
Summary
- VAEs are probabilistic generative models that learn to encode data into a distribution over a latent space and decode from it, enabling both data reconstruction and novel sample generation.
- The reparameterization trick enables gradient-based training by making the stochastic sampling step differentiable, a cornerstone of the VAE framework.
- The model is trained by maximizing the Evidence Lower Bound (ELBO), which balances a reconstruction term (fidelity to the data) and a KL divergence term (regularization of the latent space).
- The regularized latent space enables meaningful interpolation and can, under the right conditions, learn disentangled representations where single latent dimensions control interpretable data features.
- Beyond image generation, VAEs are highly effective for anomaly detection, using the ELBO as a principled measure of how well a new data point fits the learned model.