Variational Autoencoders — Learning Latent Representations
The reparameterisation trick, KL divergence loss, and why VAEs enable controllable generation through structured latent spaces.
A regular autoencoder compresses images to a point. A VAE compresses images to a region — a probability distribution. That one change makes the latent space smooth, structured, and usable for generation.
A standard autoencoder has an encoder that maps an image to a fixed latent vector z, and a decoder that maps z back to an image. Trained to minimise reconstruction error, it learns an efficient compression. But the latent space it creates is fragmented — arbitrary points in it decode to garbage because the model was never trained to handle points other than the exact codes it memorised for training images.
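The standard autoencoder described above can be sketched in a few lines of PyTorch. This is a minimal sketch: the 784-pixel flattened input and the layer sizes are illustrative assumptions, not prescribed by the text.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Standard autoencoder: each image maps to ONE fixed point z."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),               # a single deterministic code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),  # pixels back in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)          # fixed point, no distribution
        return self.decoder(z)

# Trained only to minimise reconstruction error, e.g.
#   loss = nn.MSELoss()(model(x), x)
# Nothing constrains the space BETWEEN the codes, hence the fragmentation.
```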
A VAE changes the encoder's output from a single point to a probability distribution — specifically a Gaussian defined by mean μ and variance σ². During training, the latent code z is sampled from this distribution rather than fixed. A regularisation term (KL divergence) forces all these distributions to stay close to a standard normal N(0, I). The result: the entire latent space is covered continuously — any point you sample from N(0, I) decodes to a meaningful image.
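A sketch of that change: the encoder now outputs two vectors, μ and log σ², and z is sampled via the reparameterisation trick (z = μ + σ × ε, ε ~ N(0, I)) so that gradients still flow through μ and σ. Layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Encoder that outputs a distribution (mu, log sigma^2) per image,
    not a single point. Sketch assuming flattened 784-pixel inputs."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)       # mean of the Gaussian
        self.logvar = nn.Linear(256, latent_dim)   # log sigma^2 (unconstrained)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in eps,
    so sampling stays differentiable with respect to mu and sigma."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```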
A regular autoencoder is like a library where each book has a specific assigned shelf location. The shelves between books are empty — if you reach between two books you get nothing. A VAE is like a library organised by topic, with smooth transitions between subjects — books on cricket shade gradually into books on other sports, then into general fitness. Any point on the shelf has something meaningful. You can navigate by sliding from one location to another and find related content throughout.
The KL divergence term is the librarian enforcing this organisation. Without it, the encoder would cram all books into tiny clusters and leave most of the shelf empty — efficient but not navigable.
Encoder, reparameterisation, decoder — every component explained
ELBO — Evidence Lower Bound — reconstruction loss plus KL divergence
The VAE is trained to maximise the ELBO (Evidence Lower Bound) — a lower bound on the log likelihood of the data. Maximising ELBO is equivalent to minimising two terms: the reconstruction loss (how well does the decoder reconstruct the input) and the KL divergence (how close is the encoder's distribution to N(0, I)). These two terms are in tension — the KL term wants to collapse all encodings to N(0, I) which would lose all information, while the reconstruction term wants to preserve all information. The balance between them creates the structured latent space.
- Reconstruction term: binary cross-entropy for image pixels in [0, 1] (MSE is also common). Maximising it means the decoder gets better at reconstruction.
- KL term, closed form: −0.5 × Σ(1 + log σ² − μ² − σ²). Minimising it keeps the encoder's distributions near the standard normal.
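The two terms combine into a single training loss (the negative ELBO). A minimal PyTorch sketch, assuming pixel values in [0, 1] so BCE applies:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Negative ELBO: reconstruction term plus KL divergence to N(0, I).
    Assumes pixels in [0, 1]; swap BCE for MSE if that fits your data better."""
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over the batch:
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```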
Complete training pipeline with KL annealing
A critical practical detail: if you start training with the full KL term, the encoder immediately collapses all posteriors to N(0, I) because that minimises KL loss trivially — the reconstruction loss hasn't had time to build useful representations yet. KL annealing fixes this: start with β=0 (pure reconstruction), gradually increase β to 1 over the first 10–20 epochs. The encoder first learns to reconstruct, then learns to organise the latent space.
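The annealing schedule itself is tiny. A sketch, assuming a linear ramp; the 15-epoch warm-up is one choice inside the 10–20 range suggested above:

```python
def kl_anneal_beta(epoch, warmup_epochs=15):
    """Linear KL annealing: beta ramps from 0 to 1 over the warm-up epochs,
    then stays at 1. warmup_epochs=15 is an illustrative assumption."""
    return min(1.0, epoch / warmup_epochs)

# Inside the training loop the annealed loss would look like:
#   beta = kl_anneal_beta(epoch)
#   loss = recon_loss + beta * kl_loss
```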
Interpolation, generation, and anomaly detection — the three VAE superpowers
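The first two of these are a few lines each once a decoder is trained. A sketch; `decoder` stands in for any trained VAE decoder, and the function names are illustrative:

```python
import torch

def interpolate(decoder, z1, z2, steps=8):
    """Decode evenly spaced blends between two encoded images' latent codes."""
    alphas = torch.linspace(0.0, 1.0, steps)
    zs = torch.stack([(1 - a) * z1 + a * z2 for a in alphas])
    return decoder(zs)

def generate(decoder, n, latent_dim=32):
    """Generation: because of the KL regularisation, any z ~ N(0, I)
    decodes to a plausible image."""
    z = torch.randn(n, latent_dim)
    return decoder(z)
```

Anomaly detection is the same machinery run backwards: encode and decode an input, and flag it when the reconstruction error is far above what the training (all-normal) data produced.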
β-VAE — disentangled representations where each dimension has meaning
In a standard VAE (β=1), the latent dimensions are not necessarily interpretable — dimension 7 might encode a mixture of colour, texture, and shape simultaneously. β-VAE increases the KL weight (β > 1), forcing the encoder to use each latent dimension more independently. With enough pressure, individual dimensions learn to represent single factors of variation — one dimension for colour, one for shape, one for size. This is called disentanglement.
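In code, β-VAE is a one-line change to the loss. A sketch; β = 4 is one commonly tried value for partial disentanglement, traded against reconstruction quality:

```python
import torch

def beta_vae_loss(recon_loss, kl_loss, beta=4.0):
    """beta = 1 recovers the standard VAE; beta > 1 up-weights the KL term,
    pressuring each latent dimension toward an independent factor."""
    return recon_loss + beta * kl_loss
```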
Every common VAE mistake — explained and fixed
You understand latent variable models. Next: the architecture that generates the sharpest images ever produced by AI.
GANs are sharp but unstable. VAEs are stable but blurry. Diffusion models get the best of both — they are stable to train, produce sharp photorealistic outputs, and avoid mode collapse entirely. Module 63 explains the forward noising process, the reverse denoising network, and how Stable Diffusion uses a VAE latent space to make diffusion fast enough for practical use.
Forward noise, reverse denoising, DDPM, latent diffusion — how Stable Diffusion generates photorealistic images from text.
🎯 Key Takeaways
- ✓ A regular autoencoder maps each image to a fixed point in latent space — the space between points is empty and decodes to garbage. A VAE maps each image to a probability distribution (Gaussian with mean μ and variance σ²) and regularises all distributions to stay near N(0, I). Any point sampled from N(0, I) decodes to a meaningful image.
- ✓ The reparameterisation trick makes VAE training possible: instead of sampling z ~ N(μ, σ²) directly (which breaks gradients), compute z = μ + σ × ε where ε ~ N(0, I). The random ε is independent of the parameters — gradients flow through μ and σ normally.
- ✓ ELBO loss has two terms: reconstruction loss (BCE or MSE — how well does the decoder reproduce the input) and KL divergence (−0.5 × Σ(1 + log σ² − μ² − σ²) — how close is the encoder distribution to N(0, I)). These are in tension — the balance creates a structured, navigable latent space.
- ✓ KL annealing is essential for stable training: start at β=0 (pure reconstruction) and linearly increase to β=1 over 10–20 epochs. Without annealing the KL term causes posterior collapse — the encoder ignores the input and outputs N(0, I) trivially, and the decoder learns to generate average blurry images without using z.
- ✓ β-VAE (β > 1) increases the KL weight to encourage disentanglement — individual latent dimensions learn to represent independent factors (colour, shape, size). β=1 gives best reconstruction quality. β=4 gives partial disentanglement. β≥10 gives strong disentanglement but noticeably blurry outputs.
- ✓ Three production applications: interpolation (smooth transition between two encoded images by linearly blending their latent vectors), anomaly detection (high reconstruction error = unusual item — train only on normal items), and attribute manipulation (compute direction vectors in latent space for specific attributes like colour and add them to new encodings).