
What is Generative AI?

GANs, VAEs, diffusion, and LLMs — what makes each one generative, and when each one is the right architecture. The shift from recognising to creating.

25–30 min · March 2026
Before any formula — discriminative vs generative

Every model you have built so far maps input to label. Generative models learn the data distribution itself — then sample from it to create new data that never existed.

Sections 5 through 9 covered discriminative models — they draw a boundary between classes. Given an image, predict "kurta" or "jeans." Given a sentence, predict "positive" or "negative." The model learns P(label | data) — the probability of a label given the data. It never learns what data looks like, only how to classify it.

Generative models learn P(data) — the probability distribution of the data itself. A model that has learned P(data) for fashion images can answer: "what does a typical kurta look like?" and then generate one. It can synthesise new kurta images that are statistically indistinguishable from real ones — because it has learned the underlying distribution, not just the boundary between categories.
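The distinction can be made concrete with a toy one-dimensional example. The sketch below (a hypothetical illustration, not any production system) "learns P(data)" by fitting a Gaussian to samples, then draws new samples from the learned distribution — something a classifier that only learns P(label | data) can never do:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: a 1-D feature, e.g. sleeve length in cm for kurtas
data = rng.normal(loc=60.0, scale=4.0, size=500)

# Generative view: learn P(data) by fitting a Gaussian to the samples...
mu, sigma = data.mean(), data.std()

# ...then SAMPLE from the learned distribution to create new data
new_samples = rng.normal(mu, sigma, size=5)
print(f"Learned P(data): mean={mu:.1f}, std={sigma:.1f}")
print(f"Generated sleeve lengths: {np.round(new_samples, 1)}")

# A discriminative model would instead learn a decision boundary —
# P(label | data), e.g. a threshold separating "kurta" from "jeans"
# measurements — and could never produce a new measurement on its own.
```

Real generative models learn far richer distributions than a single Gaussian, but the principle is identical: fit the distribution, then sample.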

Why does this matter for Indian tech? Myntra uses generative models to create product variations — same design, different colours — without photographing each one. Swiggy uses them to generate synthetic training data for rare dish categories with few real photos. Razorpay uses LLMs (the largest generative models) to draft merchant communications. Every use case involves creating new content from a learned distribution.

🧠 Analogy — read this first

A discriminative model is a critic — shown a painting, it says "Monet" or "Picasso." It has learned the boundaries between styles but cannot paint. A generative model is an artist — one who has studied thousands of Monet paintings so deeply that they can create a new painting that looks authentically Monet, even though that exact painting never existed.

The critic learns P(style | painting). The artist learns P(painting) in Monet's style — the full distribution of what Monet paintings look like — and samples from it. That is the fundamental difference.

The landscape

Four generative model families — what each one does and how

Generative model families — approach, strength, weakness
GANs — Generative Adversarial Networks (2014)

Two networks compete: Generator creates fake data, Discriminator distinguishes real from fake. Generator improves until Discriminator cannot tell the difference.

Sharpest, most photorealistic images. Fast inference once trained.
Notoriously unstable training. Mode collapse — generates only a few types of outputs. Hard to evaluate quality objectively.
Use for: Image synthesis, style transfer, face generation, data augmentation.
Produces: Images directly in one forward pass
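The adversarial game above can be sketched in a few lines. This is a minimal toy version (tiny linear networks on 2-D points, not real image GANs — the architecture and hyperparameters here are illustrative assumptions); it shows one discriminator update and one generator update:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy networks — real GANs use deep convolutional architectures
G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))  # noise → fake point
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # point → real/fake logit

bce   = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)

real = torch.randn(64, 2) + 3.0    # "real" data: a cluster centred at (3, 3)
z    = torch.randn(64, 16)         # random noise input for the generator

# Discriminator step: push D(real) → 1 and D(fake) → 0
fake   = G(z).detach()             # detach so G is not updated here
d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: fool D — make its fakes classified as real
g_loss = bce(D(G(z)), torch.ones(64, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()

print(f"D loss: {d_loss.item():.3f}  G loss: {g_loss.item():.3f}")
```

Training alternates these two steps for thousands of iterations; the instability the next module dissects comes from this very alternation.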
VAEs — Variational Autoencoders (2013)

Encoder compresses data to a smooth latent space. Decoder reconstructs from latent vectors. Latent space is regularised to be continuous — you can interpolate between points.

Stable training. Smooth, structured latent space enables interpolation and editing. Well-understood theoretically.
Generated images are blurry — optimising pixel-wise reconstruction loss averages across modes.
Use for: Drug discovery, anomaly detection, representation learning, controllable generation.
Produces: Reconstructions from compressed latent codes
Diffusion Models (2020)

Forward process gradually adds Gaussian noise to data over T steps until pure noise. Reverse process trains a neural network to denoise step by step. Generation = start from noise, run denoising T times.

Best image quality of any method. Stable training. Diverse outputs — no mode collapse. Powers Stable Diffusion, DALL-E, Midjourney.
Very slow inference — requires T denoising steps (typically 20–1000). Computationally expensive.
Use for: Text-to-image generation, image editing, video generation, audio synthesis.
Produces: Images via iterative denoising from pure Gaussian noise
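The forward (noising) process has a convenient closed form: you can jump straight to any step t with x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of (1−β). A minimal numpy sketch (the linear schedule and vector size here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas      = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # ᾱ_t — cumulative signal retention

x0 = rng.normal(size=64)                  # a "clean" data vector

def noised(x0, t):
    """Closed-form forward process: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε"""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps

for t in [0, 250, 500, 999]:
    x_t = noised(x0, t)
    print(f"t={t:>4}: signal fraction={np.sqrt(alphas_bar[t]):.4f}  std={x_t.std():.2f}")
```

By t = T the signal fraction is essentially zero — the sample is indistinguishable from pure Gaussian noise, which is exactly why generation can start from noise and run the learned reverse process.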
LLMs — Large Language Models (2018+)

Transformer trained to predict the next token given all previous tokens. Trained on hundreds of billions of tokens. Generation = sample from the predicted distribution at each step.

Generalises to any text task with prompting. Emergent capabilities at scale. Code, math, reasoning, conversation.
Requires enormous compute to train. Hallucination. Context window limits. No persistent memory by default.
Use for: Text generation, code, Q&A, summarisation, translation, agents.
Produces: Text token by token via autoregressive sampling
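"Sample from the predicted distribution at each step" usually means temperature-scaled softmax sampling over the model's logits. A toy sketch with a made-up five-word vocabulary and hand-picked logits (both are illustrative assumptions, not real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab  = ['the', 'kurta', 'is', 'blue', 'red']    # toy vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.2, 0.1])      # model's next-token scores

def sample(logits, temperature):
    if temperature == 0:                          # greedy decoding
        return vocab[int(np.argmax(logits))]
    scaled = logits / temperature                 # low T sharpens, high T flattens
    probs  = np.exp(scaled - scaled.max())        # stable softmax
    probs /= probs.sum()
    return vocab[rng.choice(len(vocab), p=probs)]

print("temperature=0.0 →", sample(logits, 0.0))   # always 'the' (highest logit)
for t in [0.5, 1.0, 2.0]:
    draws = [sample(logits, t) for _ in range(10)]
    print(f"temperature={t}: {draws}")
```

Higher temperature flattens the distribution, so rarer tokens are drawn more often — the same knob exposed as `temperature` in LLM APIs.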
The key concept

Latent space — the compressed representation all generative models share

Every generative model learns to compress data into a lower-dimensional latent space and decode from it. A 224×224 RGB image has 150,528 dimensions. A well-trained VAE compresses this to 128 or 256 latent dimensions that capture all meaningful variation — colour scheme, shape, texture, style — while discarding irrelevant pixel-level noise. The latent space is a map of the data distribution.

Nearby points in latent space correspond to similar images. You can interpolate between two points and get a smooth transition between two images. You can add and subtract directions: the famous example from Word2Vec — king − man + woman ≈ queen — works in image latent spaces too: kurta_latent + blue_colour_vector ≈ blue_kurta_latent. This is what makes latent spaces useful for creative applications.

Latent space arithmetic — creative generation by vector manipulation
encode(image) → z (latent vector, e.g. 128-dim)
decode(z) → image (reconstruct)
decode(z + noise) → slightly different image (variation)
decode(lerp(z1, z2, t)) → smooth interpolation between two images
lerp = linear interpolation: z_t = (1−t)×z1 + t×z2 for t ∈ [0, 1]
This works most cleanly in VAE latent spaces. GAN latent spaces also support interpolation, but a plain GAN has no encoder — you cannot map a given image back to its latent code without extra machinery.
python
import torch
import torch.nn as nn
import numpy as np

# ── Minimal VAE to illustrate latent space ────────────────────────────
class SimpleVAE(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        # Encoder: input → mean and log_variance of latent distribution
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.fc_mean    = nn.Linear(128, latent_dim)
        self.fc_log_var = nn.Linear(128, latent_dim)

        # Decoder: latent vector → reconstructed input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def encode(self, x):
        h       = self.encoder(x)
        mean    = self.fc_mean(h)
        log_var = self.fc_log_var(h)
        return mean, log_var

    def reparameterise(self, mean, log_var):
        """
        The reparameterisation trick: sample z = mean + std × epsilon
        epsilon ~ N(0, I) — random noise
        This makes sampling differentiable — gradients can flow through.
        Without this trick, sampling breaks backpropagation.
        """
        std     = torch.exp(0.5 * log_var)
        epsilon = torch.randn_like(std)
        return mean + std * epsilon

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mean, log_var = self.encode(x)
        z             = self.reparameterise(mean, log_var)
        x_recon       = self.decode(z)
        return x_recon, mean, log_var

def vae_loss(x_recon, x, mean, log_var):
    """
    ELBO loss = Reconstruction loss + KL divergence
    Reconstruction: how well did we reconstruct the input?
    KL divergence:  how close is the latent distribution to N(0, I)?
                    This regularises the latent space to be smooth.
    """
    recon_loss = nn.functional.binary_cross_entropy(x_recon, x, reduction='sum')
    # KL divergence between N(mean, var) and N(0, 1):
    # −0.5 × Σ(1 + log_var − mean² − exp(log_var))
    kl_loss    = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
    return recon_loss + kl_loss

# ── Shape demonstration ───────────────────────────────────────────────
torch.manual_seed(42)
vae   = SimpleVAE(input_dim=784, latent_dim=32)
x     = torch.randn(8, 784)   # batch of 8 flattened 28×28 images

x_recon, mean, log_var = vae(x)
loss = vae_loss(x_recon, torch.sigmoid(x), mean, log_var)

print(f"VAE shapes:")
print(f"  Input:        {tuple(x.shape)}")
print(f"  Latent mean:  {tuple(mean.shape)}   ← 32-dim latent space")
print(f"  Latent logvar:{tuple(log_var.shape)}")
print(f"  Reconstructed:{tuple(x_recon.shape)}")
print(f"  ELBO loss:    {loss.item():.2f}")

# ── Latent space operations ───────────────────────────────────────────
with torch.no_grad():
    # Encode two images
    mean1, _ = vae.encode(x[0:1])
    mean2, _ = vae.encode(x[1:2])

    # Interpolate between them
    print("\nLatent space interpolation:")
    for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
        z_interp = (1 - t) * mean1 + t * mean2
        decoded  = vae.decode(z_interp)
        print(f"  t={t:.2f}: z_norm={z_interp.norm():.3f}  decoded_mean={decoded.mean():.4f}")

    # Sample new images from prior
    print("\nSampling from prior N(0, I):")
    for i in range(3):
        z_random = torch.randn(1, 32)   # sample from standard normal
        new_img  = vae.decode(z_random)
        print(f"  Sample {i+1}: mean={new_img.mean():.4f}  std={new_img.std():.4f}")
How do you measure quality?

FID, IS, and CLIP score — evaluating generative models

You cannot use accuracy to evaluate a generative model — there is no correct answer. How do you measure whether a generated image is "good"? Three metrics are standard: Fréchet Inception Distance (FID) measures how similar the distribution of generated images is to real images. Inception Score (IS) measures diversity and quality together. CLIP score measures how well an image matches a text description.

FID — Fréchet Inception Distance

Distance between real and generated image distributions in InceptionV3 feature space

Lower is better. FID=0 means generated = real. FID<10 is excellent. FID>100 is poor.
Use: Standard for image generation quality. Used in GAN and diffusion model papers.
Limitation: Requires large sample (10k+ images) for reliable estimates. Not aligned with human perception.
IS — Inception Score

exp(E[KL(p(y|x) || p(y))]) — measures diversity across classes and confidence per image

Higher is better. Measures both quality (confident predictions) and diversity (many different classes).
Use: Quick sanity check for unconditional image generation.
Limitation: Does not compare to real images — a model generating diverse blurry images can score well.
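The IS formula above is easy to compute directly from a matrix of per-image class probabilities. A sketch with simulated probabilities (the 0.91/0.01 confidence values are made-up stand-ins for real InceptionV3 outputs) that also shows why mode collapse drives IS towards 1:

```python
import numpy as np

def inception_score(probs):
    """IS = exp( E_x[ KL(p(y|x) || p(y)) ] ) for an (N, classes) probability matrix."""
    p_y = probs.mean(axis=0)                      # marginal class distribution p(y)
    kl  = np.sum(probs * (np.log(probs + 1e-12) - np.log(p_y + 1e-12)), axis=1)
    return float(np.exp(kl.mean()))

rng = np.random.default_rng(0)

# Diverse + confident: each image strongly predicted as one of 10 classes
confident = np.full((1000, 10), 0.01)
confident[np.arange(1000), rng.integers(0, 10, 1000)] = 0.91

# Mode collapse: every image confidently the SAME class
collapsed = np.full((1000, 10), 0.01)
collapsed[:, 0] = 0.91

print(f"Diverse + confident IS: {inception_score(confident):.2f}")   # high
print(f"Mode-collapsed IS:      {inception_score(collapsed):.2f}")   # ≈ 1
```

When every image gets the same prediction, p(y|x) equals the marginal p(y), the KL term is zero, and IS collapses to exp(0) = 1 — the metric's floor.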
CLIP Score

Cosine similarity between CLIP image embedding and CLIP text embedding

Higher is better. Measures how well generated image matches its text prompt.
Use: Text-to-image evaluation (Stable Diffusion, DALL-E). Human preference correlation.
Limitation: High CLIP score does not mean photorealistic — stylised images can score well.
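The CLIP score calculation itself is just a clipped cosine similarity between the two embeddings, commonly scaled by 100 in library implementations. The sketch below uses random stand-in vectors, not real CLIP embeddings — it only illustrates the arithmetic:

```python
import numpy as np

def clip_score(image_emb, text_emb, w=100.0):
    """Score = w · max(0, cosine similarity) between image and text embeddings."""
    a = image_emb / np.linalg.norm(image_emb)
    b = text_emb / np.linalg.norm(text_emb)
    return w * max(0.0, float(a @ b))

rng  = np.random.default_rng(0)
text = rng.normal(size=512)                      # stand-in for a CLIP text embedding

matching  = text + rng.normal(size=512) * 0.3    # "image" aligned with the prompt
unrelated = rng.normal(size=512)                 # "image" of something else entirely

print(f"Matching image:  {clip_score(matching, text):.1f}")
print(f"Unrelated image: {clip_score(unrelated, text):.1f}")
```

In practice both embeddings come from CLIP's image and text encoders; the max(0, ·) clamp just prevents negative scores.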
python
import torch
import numpy as np
from scipy import linalg

# ── FID from scratch — the core calculation ───────────────────────────
def compute_fid(real_features: np.ndarray,
                fake_features: np.ndarray) -> float:
    """
    Compute FID between real and generated image features.
    Features are typically extracted from InceptionV3's penultimate layer.
    real_features: (N, 2048) — features from real images
    fake_features: (N, 2048) — features from generated images
    """
    # Compute mean and covariance of each distribution
    mu_real   = real_features.mean(axis=0)
    mu_fake   = fake_features.mean(axis=0)
    sigma_real = np.cov(real_features, rowvar=False)
    sigma_fake = np.cov(fake_features, rowvar=False)

    # Fréchet distance between two multivariate Gaussians:
    # ||mu_r - mu_f||² + Tr(sigma_r + sigma_f - 2√(sigma_r × sigma_f))
    diff       = mu_real - mu_fake
    mean_term  = diff @ diff   # ||mu_r - mu_f||²

    # Matrix square root of sigma_r × sigma_f
    covmean, _ = linalg.sqrtm(sigma_real @ sigma_fake, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # numerical correction

    trace_term = np.trace(sigma_real + sigma_fake - 2 * covmean)
    fid        = mean_term + trace_term
    return float(fid)

# ── Simulate feature distributions ────────────────────────────────────
np.random.seed(42)
N_SAMPLES    = 1000
FEATURE_DIM  = 2048

# Perfect model: generated features = real features (FID ≈ 0)
real_features    = np.random.randn(N_SAMPLES, FEATURE_DIM)
perfect_features = real_features + np.random.randn(N_SAMPLES, FEATURE_DIM) * 0.01

# Good model: small distribution shift
good_features    = real_features + np.random.randn(N_SAMPLES, FEATURE_DIM) * 0.5

# Poor model: large distribution shift
poor_features    = np.random.randn(N_SAMPLES, FEATURE_DIM) * 2 + 3

print("FID scores (lower = better, 0 = perfect):")
print(f"  Perfect model: {compute_fid(real_features, perfect_features):.2f}")
print(f"  Good model:    {compute_fid(real_features, good_features):.2f}")
print(f"  Poor model:    {compute_fid(real_features, poor_features):.2f}")

print("\nFID benchmarks in literature:")
benchmarks = [
    ('StyleGAN2 (FFHQ 256)',     2.8,  'State of the art face generation'),
    ('Stable Diffusion 2.1',     8.6,  'Text-to-image, COCO benchmark'),
    ('DALL-E 2',                 10.4, 'Text-to-image'),
    ('GAN with mode collapse',   80.0, 'Poor diversity'),
    ('Random noise baseline',   300.0, 'No learning'),
]
for name, fid, note in benchmarks:
    bar = '█' * int(min(fid, 100) / 4)
    print(f"  {name:<35}: FID={fid:>6.1f}  {bar}  {note}")
Decision guide

Which generative model for which task — a practical framework

Generative model selection — task to architecture mapping
Task | Best architecture | Why
Text generation / chat | LLM (GPT/LLaMA) | Autoregressive token prediction is the natural formulation
Code generation | LLM (CodeLlama) | LLMs generalise to code as a language domain
Text-to-image | Diffusion (Stable Diffusion) | Best quality + diversity. CLIP-guided conditioning
Image editing | Diffusion (InstructPix2Pix) | Inpainting and instruction-following built in
Fast image synthesis | GAN (StyleGAN) | Single forward pass — 100× faster than diffusion
Anomaly detection | VAE | Reconstruction error on normal data flags anomalies
Molecular generation | VAE or Diffusion | Smooth latent space enables drug candidate search
Data augmentation | GAN or Diffusion | Generate rare class examples for training
Audio synthesis | Diffusion (AudioLDM, DiffWave) | Diffusion now dominates audio generation quality
Video generation | Diffusion (Sora-style) | Temporal diffusion with 3D attention
python
# ── Quick API tour — generate content with each model family ─────────

# ── 1. LLM generation — text ──────────────────────────────────────────
from groq import Groq
import os

client = Groq(api_key=os.environ.get('GROQ_API_KEY', 'demo'))

# Autoregressive generation — sample one token at a time
# response = client.chat.completions.create(
#     model='llama-3.3-70b-versatile',
#     messages=[{'role': 'user', 'content': 'Write a product description for a Meesho kurta'}],
#     temperature=0.8,   # higher = more creative/diverse
#     max_tokens=200,
# )
# text = response.choices[0].message.content
print("LLM generation: autoregressive sampling, token by token")
print("  temperature=0.0 → deterministic (always picks highest probability token)")
print("  temperature=1.0 → samples from full distribution")
print("  temperature>1.0 → more random, creative, possibly incoherent")

# ── 2. Diffusion generation — image from text ─────────────────────────
print("\nDiffusion generation (Stable Diffusion):")
print("  pip install diffusers torch")
print("""
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
).to('cuda')

image = pipe(
    prompt='A beautiful silk saree in red and gold, product photography',
    negative_prompt='blurry, low quality, distorted',
    num_inference_steps=30,    # denoising steps — more = better quality, slower
    guidance_scale=7.5,        # how strongly to follow the text prompt
    height=512, width=512,
).images[0]
image.save('saree.png')
""")

# ── 3. VAE — encode and decode ────────────────────────────────────────
print("VAE generation:")
print("  Encode image → 128-dim latent vector")
print("  Perturb latent vector → decode → variation of original image")
print("  Interpolate between two latent vectors → smooth transition")

# ── 4. GAN — single forward pass ─────────────────────────────────────
print("\nGAN generation (StyleGAN2):")
print("""
import torch
# Load pretrained generator
# G = StyleGAN2Generator(...)
# z = torch.randn(1, 512)    # random latent
# img = G(z)                  # single forward pass → 1024×1024 face
# Total: one matrix multiplication chain, ~10ms on GPU
""")
What comes next

You understand the generative landscape. Next: the adversarial game that started it all.

This module introduced all four families at a high level. The next four modules go deep on each one in turn. Module 61 builds a GAN from scratch — generator, discriminator, the adversarial training loop, and why training is so unstable. Understanding GANs first builds the intuition that makes VAEs, diffusion, and LLMs click into place.

Next — Module 61 · Generative AI
GANs — Generator vs Discriminator

Two networks in adversarial competition. Mode collapse, training instability, Wasserstein distance — the honest account of what makes GANs hard to train.

coming soon

🎯 Key Takeaways

  • Discriminative models learn P(label | data) — they classify. Generative models learn P(data) — the full data distribution — and can sample new data from it. This shift from recognising to creating is the core of generative AI.
  • Four generative model families: GANs (adversarial training, sharpest images, unstable), VAEs (smooth latent space, stable training, blurry outputs), Diffusion models (best quality and diversity, slow inference, powers Stable Diffusion), LLMs (autoregressive text generation, emergent capabilities, powers GPT and Claude).
  • All generative models share a key concept: the latent space — a compressed lower-dimensional representation of the data distribution. Nearby points in latent space correspond to similar outputs. You can interpolate, add, and subtract direction vectors to control generation.
  • The reparameterisation trick is what makes VAE training work: instead of sampling z directly (which breaks gradients), sample epsilon ~ N(0,I) and compute z = mean + std × epsilon. This makes the sampling operation differentiable so gradients can flow through the encoder.
  • Evaluate generative models with FID (lower = better, measures distribution similarity to real data), IS (higher = better, measures quality and diversity), and CLIP score (higher = better, measures text-image alignment). Never use accuracy — there is no single correct output.
  • Architecture selection: LLMs for any text task, Diffusion for text-to-image and image editing, GANs for fast single-pass image synthesis, VAEs for anomaly detection and structured latent space applications. Diffusion has overtaken GANs for image quality; LLMs have overtaken rule-based systems for text.