What is Generative AI?
GANs, VAEs, diffusion, and LLMs — what makes each one generative, and when each one is the right architecture. The shift from recognising to creating.
Every model you have built so far maps input to label. Generative models learn the data distribution itself — then sample from it to create new data that never existed.
Sections 5 through 9 covered discriminative models — they draw a boundary between classes. Given an image, predict "kurta" or "jeans." Given a sentence, predict "positive" or "negative." The model learns P(label | data) — the probability of a label given the data. It never learns what data looks like, only how to classify it.
Generative models learn P(data) — the probability distribution of the data itself. A model that has learned P(data) for fashion images can answer: "what does a typical kurta look like?" and then generate one. It can synthesise new kurta images that are statistically indistinguishable from real ones — because it has learned the underlying distribution, not just the boundary between categories.
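The distinction can be made concrete with a toy 1-D example. The sketch below uses made-up sleeve-length data for two garment classes: the discriminative view learns only a decision boundary, while the generative view fits a distribution to one class and samples new data from it. All numbers and names here are illustrative assumptions, not from any real catalogue.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy 1-D data: sleeve lengths (cm) for two garment classes (made-up numbers).
kurta_sleeves = rng.normal(loc=60.0, scale=3.0, size=500)
tshirt_sleeves = rng.normal(loc=20.0, scale=3.0, size=500)

# Discriminative view: learn only the boundary, i.e. P(label | data).
threshold = (kurta_sleeves.mean() + tshirt_sleeves.mean()) / 2.0

def classify(x):
    return "kurta" if x > threshold else "t-shirt"

# Generative view: learn P(data) for kurtas, then sample new data from it.
mu, sigma = kurta_sleeves.mean(), kurta_sleeves.std()
new_kurta_sleeves = rng.normal(mu, sigma, size=5)  # synthetic "kurta" samples

print(classify(58.0))               # the critic assigns a label: kurta
print(new_kurta_sleeves.round(1))   # the artist creates new, plausible data
```

The classifier can only label inputs it is given; the fitted Gaussian can produce sleeve lengths that were never in the training set but are statistically plausible, which is the whole point of learning P(data).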
Why does this matter for Indian tech? Myntra uses generative models to create product variations — same design, different colours — without photographing each one. Swiggy uses them to generate synthetic training data for rare dish categories with few real photos. Razorpay uses LLMs (the largest generative models) to draft merchant communications. Every use case involves creating new content from a learned distribution.
A discriminative model is a critic: shown a painting, it says "Monet" or "Picasso." It has learned the boundaries between styles but cannot paint. A generative model is an artist: it has studied thousands of Monet paintings so deeply that it can create a new painting that looks authentically Monet, even though that exact painting never existed.
The critic learns P(style | painting). The artist learns P(painting) in Monet's style — the full distribution of what Monet paintings look like — and samples from it. That is the fundamental difference.
Four generative model families — what each one does and how
Latent space — the compressed representation all generative models share
Every generative model learns to compress data into a lower-dimensional latent space and decode from it. A 224×224 RGB image has 150,528 dimensions. A well-trained VAE compresses this to 128 or 256 latent dimensions that capture all meaningful variation — colour scheme, shape, texture, style — while discarding irrelevant pixel-level noise. The latent space is a map of the data distribution.
Nearby points in latent space correspond to similar images. You can interpolate between two points and get a smooth transition between two images. You can add and subtract directions: the famous example from Word2Vec — king − man + woman ≈ queen — works in image latent spaces too: kurta_latent + blue_colour_vector ≈ blue_kurta_latent. This is what makes latent spaces useful for creative applications.
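Both operations, interpolation and vector arithmetic, are a few lines of numpy. The sketch below uses random vectors as stand-in latent codes and a hypothetical `blue_direction`; in a real system the codes would come from an encoder, and the direction from averaging the latents of blue items minus non-blue items.

```python
import numpy as np

LATENT_DIM = 128  # assumed latent dimensionality, as in the text

def lerp(z1, z2, t):
    """Linear interpolation between two latent points (t in [0, 1])."""
    return (1.0 - t) * z1 + t * z2

rng = np.random.default_rng(42)
z_kurta = rng.normal(size=LATENT_DIM)   # stand-in latent code of a kurta image
z_jeans = rng.normal(size=LATENT_DIM)   # stand-in latent code of a jeans image

# A smooth path through latent space: decoding each step yields a morph sequence.
path = [lerp(z_kurta, z_jeans, t) for t in np.linspace(0.0, 1.0, 5)]

# Latent arithmetic: shift the kurta code along a hypothetical colour direction.
blue_direction = rng.normal(size=LATENT_DIM) * 0.1
z_blue_kurta = z_kurta + blue_direction  # kurta_latent + blue_colour_vector
```

Feeding each point in `path` through the decoder would produce the smooth kurta-to-jeans transition described above; decoding `z_blue_kurta` would (in a well-structured latent space) produce a blue kurta.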
Latent arithmetic works most reliably in VAE latent spaces, which are explicitly regularised to be smooth. Vanilla GAN latent spaces are less structured, although vector arithmetic can work there too (the DCGAN paper famously demonstrated it).
FID, IS, and CLIP score — evaluating generative models
You cannot use accuracy to evaluate a generative model — there is no correct answer. How do you measure whether a generated image is "good"? Three metrics are standard: Fréchet Inception Distance (FID) measures how similar the distribution of generated images is to real images. Inception Score (IS) measures diversity and quality together. CLIP score measures how well an image matches a text description.
- FID: distance between the real and generated image distributions in InceptionV3 feature space (lower is better).
- IS: exp(E[KL(p(y|x) || p(y))]) — rewards diversity across classes and confidence per image (higher is better).
- CLIP score: cosine similarity between the CLIP image embedding and the CLIP text embedding (higher is better).
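FID follows directly from its definition: fit a Gaussian to the real features and one to the generated features, then compute the Fréchet distance between them. A minimal numpy/scipy sketch, using random vectors as stand-ins for InceptionV3 features (in practice you would extract real features with a pretrained network, or use a library such as torchmetrics):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 @ S2)^(1/2))."""
    diff = mu1 - mu2
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):   # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

def stats(features):
    """Mean and covariance of a (num_samples, feature_dim) array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

# Stand-in features: both sets drawn from the same N(0, I), so FID should be near 0.
rng = np.random.default_rng(0)
feats_real = rng.normal(size=(1000, 8))
feats_fake = rng.normal(size=(1000, 8))

fid = frechet_distance(*stats(feats_real), *stats(feats_fake))
print(fid)  # small, since both sets come from the same distribution
```

The further the generated distribution drifts from the real one, in mean or in covariance, the larger this number grows, which is why lower FID is better.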
Which generative model for which task — a practical framework
You understand the generative landscape. Next: the adversarial game that started it all.
This module introduced all four families at a high level. The next four modules go deep on each one in turn. Module 61 builds a GAN from scratch — generator, discriminator, the adversarial training loop, and why training is so unstable. Understanding GANs first builds the intuition that makes VAEs, diffusion, and LLMs click into place.
Two networks in adversarial competition. Mode collapse, training instability, Wasserstein distance — the honest account of what makes GANs hard to train.
🎯 Key Takeaways
- ✓Discriminative models learn P(label | data) — they classify. Generative models learn P(data) — the full data distribution — and can sample new data from it. This shift from recognising to creating is the core of generative AI.
- ✓Four generative model families: GANs (adversarial training, sharpest images, unstable), VAEs (smooth latent space, stable training, blurry outputs), Diffusion models (best quality and diversity, slow inference, powers Stable Diffusion), LLMs (autoregressive text generation, emergent capabilities, powers GPT and Claude).
- ✓All generative models share a key concept: the latent space — a compressed lower-dimensional representation of the data distribution. Nearby points in latent space correspond to similar outputs. You can interpolate, add, and subtract direction vectors to control generation.
- ✓The reparameterisation trick is what makes VAE training work: instead of sampling z directly (which breaks gradients), sample epsilon ~ N(0,I) and compute z = mean + std × epsilon. This makes the sampling operation differentiable so gradients can flow through the encoder.
- ✓Evaluate generative models with FID (lower = better, measures distribution similarity to real data), IS (higher = better, measures quality and diversity), and CLIP score (higher = better, measures text-image alignment). Never use accuracy — there is no single correct output.
- ✓Architecture selection: LLMs for any text task, Diffusion for text-to-image and image editing, GANs for fast single-pass image synthesis, VAEs for anomaly detection and structured latent space applications. Diffusion has overtaken GANs for image quality; LLMs have overtaken rule-based systems for text.
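The reparameterisation trick in the takeaways is short enough to sketch directly. A minimal PyTorch fragment with assumed shapes (a real VAE's encoder would produce `mu` and `log_var` from the input image):

```python
import torch

# Assumed shapes: batch of 4 images, 16 latent dimensions.
mu = torch.zeros(4, 16, requires_grad=True)
log_var = torch.zeros(4, 16, requires_grad=True)

# Sampling z ~ N(mu, sigma^2) directly is not differentiable w.r.t. mu/log_var.
# Reparameterise: draw noise from a fixed N(0, I), then shift and scale it.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps   # z = mean + std * epsilon

# Any loss on z now backpropagates into mu and log_var through this formula.
z.sum().backward()
print(mu.grad is not None, log_var.grad is not None)  # True True
```

The randomness lives entirely in `eps`, which has no learnable parameters, so the path from the loss back to the encoder's outputs is an ordinary differentiable expression.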