Diffusion Models and Stable Diffusion
Forward noise, reverse denoising, DDPM, latent diffusion — how Stable Diffusion generates photorealistic images from text prompts.
Diffusion models learn one thing: given a slightly noisy image, predict the noise that was added. Run this backwards 1000 times starting from pure noise and you get a photorealistic image.
GANs generate images in one forward pass — fast but unstable. VAEs generate via a compressed latent code — stable but blurry. Diffusion models take a third path: learn to reverse a gradual noising process. The training objective is deceptively simple — take a real image, add a known amount of Gaussian noise, ask the model to predict what noise was added. Repeat this for every noise level from slightly noisy to pure noise. At generation time, start from pure Gaussian noise and iteratively denoise, guided by what the model learned.
The results are extraordinary — diffusion models produce images that are sharper, more diverse, and more faithful to text prompts than any previous approach. Stable Diffusion, DALL-E 3, Midjourney, and Google Imagen are all diffusion models. At Indian startups: Meesho uses Stable Diffusion fine-tuned on Indian fashion to generate product variations. Adobe's Firefly (used by Indian creative agencies) is diffusion-based. Every modern text-to-image system is built on this foundation.
Imagine teaching someone to restore old damaged photographs. You take a pristine photo and progressively scratch it — first a tiny scratch, then more, then more, until it is completely unrecognisable static. You train a restorer to undo each level of damage. After enough practice they can take completely random static and restore it step by step into a meaningful photograph. The key insight: each restoration step is easy — remove a small amount of noise. But chaining 1000 easy steps produces something remarkable.
The model never needs to generate from nothing. It only ever needs to answer: "given this noisy image at this noise level, what noise should I remove?" That is a much simpler task than "generate a photorealistic image from scratch."
Adding noise — the Markov chain from image to pure noise
The forward process is fixed — not learned. It gradually adds Gaussian noise to an image over T timesteps (typically T=1000). At each timestep t, a small amount of noise is added according to a noise schedule β₁, β₂, …, β_T. By timestep T the image is indistinguishable from pure Gaussian noise. The key mathematical property: you can jump directly to any timestep t without simulating all steps sequentially. This is what makes training efficient.
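The closed-form jump can be sketched in a few lines. This is a minimal illustration assuming PyTorch and the linear β schedule from the DDPM paper (β from 1e-4 to 0.02); the function and variable names are ours, not from any library.

```python
import torch

# Linear beta schedule over T = 1000 steps (the DDPM default).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product ᾱ_t

def q_sample(x0, t, noise):
    """Jump directly to timestep t: x_t = √ᾱ_t · x0 + √(1−ᾱ_t) · ε."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * noise

x0 = torch.randn(2, 3, 32, 32)   # stand-in for a batch of images
t = torch.tensor([10, 900])      # one early timestep, one late
noise = torch.randn_like(x0)
xt = q_sample(x0, t, noise)
# By t=900 almost all signal is gone: ᾱ_900 is close to 0.
```

Note that no loop over timesteps is needed: `q_sample` produces x_t for any t in one call, which is exactly the property that makes training efficient.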
The U-Net denoiser — predict the noise, not the image
The learnable part of a diffusion model is a neural network that takes a noisy image x_t and a timestep t as input, and predicts the noise ε that was added. The architecture is a U-Net with time conditioning — the timestep t is embedded into a sinusoidal positional encoding and injected into every residual block via addition or cross-attention. The network must learn to denoise differently for each noise level — removing a tiny amount of noise at t=10 is very different from recovering structure at t=900.
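The sinusoidal timestep embedding mentioned above is small enough to show in full. A minimal sketch (the dimension and function name are illustrative choices, not a specific library's API):

```python
import math
import torch

def timestep_embedding(t, dim=128):
    """Sinusoidal embedding of the timestep, as used in DDPM's U-Net.

    Each timestep maps to a dim-sized vector of sines and cosines at
    geometrically spaced frequencies, so the network can tell t=10
    apart from t=900 and denoise accordingly.
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

emb = timestep_embedding(torch.tensor([10, 900]))
# In the U-Net this vector is typically passed through a small MLP and
# added to the feature maps inside every residual block.
```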
DDPM training loop and reverse process sampling
Training is remarkably simple: sample a random image from the dataset, sample a random timestep t, add the corresponding amount of noise, ask the model to predict the noise, compute MSE loss. That is the entire training algorithm. No adversarial game, no posterior collapse, no mode collapse. This simplicity is why diffusion models train so reliably compared to GANs.
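The whole algorithm fits in one function. In this sketch a tiny MLP over flattened 8×8 "images" stands in for the real time-conditioned U-Net (the training loop itself is unchanged); all names here are illustrative.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

# Stand-in denoiser: the real model is a time-conditioned U-Net.
model = nn.Sequential(nn.Linear(64 + 1, 256), nn.SiLU(), nn.Linear(256, 64))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x0):
    t = torch.randint(0, T, (x0.shape[0],))        # random timestep per image
    eps = torch.randn_like(x0)                     # the noise we add
    a = alpha_bar[t].sqrt()[:, None]
    s = (1 - alpha_bar[t]).sqrt()[:, None]
    xt = a * x0 + s * eps                          # noisy input (closed form)
    inp = torch.cat([xt, t.float()[:, None] / T], dim=1)
    loss = nn.functional.mse_loss(model(inp), eps)  # predict the noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

loss = train_step(torch.randn(16, 64))
```

Everything the section describes is visible here: random timestep, known noise, one MSE loss, one optimizer step. There is no discriminator and no adversarial game anywhere in the loop.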
Sampling (generation) runs the reverse process: start from pure Gaussian noise x_T, iteratively denoise using the trained model, and arrive at a clean image x_0 after T steps. Each denoising step predicts the noise at the current timestep and subtracts a scaled version of it, producing a slightly cleaner image. Running the full T=1000 steps is slow — DDIM (denoising diffusion implicit models) achieves similar quality in 20–50 steps.
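The reverse loop can be sketched as follows, using the DDPM ancestral-sampling update. A dummy model that predicts zero noise stands in for the trained U-Net just to make the loop runnable; the names are ours.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(model, shape):
    """Ancestral sampling: start from pure noise, denoise T times."""
    x = torch.randn(shape)                       # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model(x, t)                        # predicted noise ε_θ(x_t, t)
        coef = betas[t] / (1 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean                             # final step is deterministic
    return x

# Dummy "model" that always predicts zero noise, just to run the loop.
sample = ddpm_sample(lambda x, t: torch.zeros_like(x), (1, 8))
```

DDIM replaces this stochastic update with a deterministic one that can skip timesteps, which is how it gets away with 20–50 steps instead of 1000.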
Why Stable Diffusion runs on consumer GPUs — diffusion in latent space
Running DDPM directly on 512×512 images requires 1000 U-Net forward passes on high-resolution feature maps — enormously expensive. Stable Diffusion's key insight: run diffusion in the latent space of a pretrained VAE, not in pixel space. A VAE encodes a 512×512×3 image into a 64×64×4 latent tensor — an 8× reduction per spatial dimension and 48× fewer values overall. Diffusion in this compressed space is roughly 48× cheaper per step with negligible loss in final quality, because the VAE decoder restores full resolution at the end. This is the Latent Diffusion Models (LDM) approach.
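The 48× figure is just the ratio of element counts between the two spaces:

```python
# Pixel space vs. VAE latent space, per image.
pixel = 512 * 512 * 3    # 786,432 values the U-Net would see in pixel space
latent = 64 * 64 * 4     # 16,384 values in the Stable Diffusion latent
ratio = pixel / latent   # 48.0 — each denoising step processes 48× fewer values
```

Note the spatial downsampling is 8× per side (64× in area), but the channel count grows from 3 to 4, which is where the net 48× comes from.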
DreamBooth and LoRA — fine-tuning on your own images
Pretrained Stable Diffusion generates generic content. For Indian fashion product images, architectural styles, or brand-specific visual language, you need to fine-tune. Two efficient methods: DreamBooth fine-tunes the entire U-Net on 3–30 images of a specific concept and teaches the model a new token that refers to it. LoRA (Module 51) trains only a tiny fraction (well under 1%) of the U-Net's parameters — it achieves similar results with roughly 10× less memory and training time.
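The core LoRA idea — a frozen weight matrix plus a small trainable low-rank update — is a few lines. This is a generic sketch of the mechanism, not the diffusers/PEFT API; the class name, rank, and layer size are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B·A.

    Only A and B train: r·(in + out) parameters instead of in·out,
    which is why LoRA's trainable fraction of the U-Net is so small.
    """
    def __init__(self, base: nn.Linear, r=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t()) @ self.B.t()

layer = LoRALinear(nn.Linear(320, 320))  # e.g. a cross-attention projection
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

Because B is initialized to zero, the adapted layer starts out exactly equal to the pretrained one — fine-tuning moves it away gradually, which is part of why LoRA training is so stable.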
You understand how images are generated. Next: how the largest language models are built and aligned.
Diffusion models generate images by learning to reverse a noising process. LLMs generate text by learning to predict the next token — but at a scale and with emergent capabilities that make them qualitatively different from anything before. Module 64 covers how GPT, Claude, and Gemini are built: next-token pretraining at scale, RLHF alignment, DPO, instruction tuning, and the scaling laws that predict capability from compute.
🎯 Key Takeaways
- ✓ Diffusion models learn to reverse a fixed noising process. The forward process gradually adds Gaussian noise to an image over T=1000 steps until it becomes pure noise. The reverse process trains a U-Net to predict the noise at each step. Generation = start from pure noise, run the reverse process T times.
- ✓ The closed-form forward process lets you jump to any timestep t directly: x_t = √ᾱ_t × x_0 + √(1−ᾱ_t) × ε where ᾱ_t is the cumulative product of (1−β_s). Training samples random t values and asks the model to predict ε from x_t — the entire training algorithm is this MSE loss.
- ✓ The denoising U-Net takes a noisy image x_t and a timestep t as input. Timestep t is converted to a sinusoidal embedding and injected into every residual block. The architecture is the same as a segmentation U-Net, plus time conditioning — skip connections preserve spatial detail for precise denoising.
- ✓ Stable Diffusion runs diffusion in the 64×64×4 latent space of a pretrained VAE, not in 512×512 pixel space. This 48× compression makes each denoising step roughly 48× cheaper with negligible quality loss. The VAE decoder restores full resolution at the end. This is Latent Diffusion Models (LDM).
- ✓ Classifier-free guidance (CFG) runs the U-Net twice per step: once with the text prompt and once without. The final prediction is: eps_uncond + scale × (eps_text − eps_uncond). guidance_scale=7.5 is the standard. Higher scale = more prompt adherence but potential distortion. Lower scale = more diversity but weaker prompt adherence.
- ✓ Fine-tuning options by cost: Textual Inversion (learn one new token, 100KB, 5 images, weakest) → LoRA (train 0.09% of U-Net, 50MB, 10–50 images, strong) → DreamBooth (fine-tune full U-Net, 4GB, 5–30 images, strongest). Always use prior preservation loss in DreamBooth to prevent catastrophic forgetting of general knowledge.