GANs — Generator vs Discriminator
Two networks in adversarial competition. Mode collapse, training instability, Wasserstein distance — the honest account of what makes GANs hard to train.
A forger tries to create fake currency that fools the bank. The bank trains detectors to catch fakes. The forger studies the detector's failures and improves. Both get better in lockstep. That is a GAN.
Ian Goodfellow invented GANs in 2014 — the idea came to him at a bar in Montreal after a friend suggested using a neural network to generate images. The insight: instead of hand-crafting a loss function that measures image quality (which is impossible to define), learn the loss function itself using a second neural network. Let the Discriminator define what "real" looks like, and let the Generator learn to fool it.
The Generator takes random noise as input and produces an image. It never sees real images directly. The Discriminator takes an image — either real or generated — and outputs a probability that it is real. The two networks are trained simultaneously with opposing objectives: the Generator wants to maximise the Discriminator's error, the Discriminator wants to minimise it. At equilibrium — the Nash equilibrium — the Generator produces images indistinguishable from real ones.
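The two interfaces can be made concrete with a minimal sketch. This is not a trainable architecture, just the input/output contract: all sizes here (latent dimension 64, flattened 28×28 images, 256 hidden units) are illustrative choices, not prescribed by the text.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64      # size of the noise vector (illustrative)
IMG_SIZE = 28 * 28   # flattened 28x28 image

# Generator: random noise in, image out. It never sees real images.
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, IMG_SIZE),
    nn.Tanh(),       # pixel values in [-1, 1]
)

# Discriminator: image in, probability-of-real out.
discriminator = nn.Sequential(
    nn.Linear(IMG_SIZE, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),    # P(image is real)
)

z = torch.randn(16, LATENT_DIM)   # batch of 16 noise vectors
fake = generator(z)               # shape (16, 784)
p_real = discriminator(fake)      # shape (16, 1), each value in (0, 1)
```

Note the asymmetry: the Generator's only window into the real data is the gradient signal that flows back through the Discriminator.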
A counterfeiter (Generator) and a detective (Discriminator) playing an arms race. The counterfeiter starts producing terrible fakes — the detective catches them all. The counterfeiter studies which fake features gave them away and improves. The detective trains on the new fakes and gets better. After 100,000 rounds, the counterfeiter produces fakes so good even experts cannot tell them apart. Neither was told what perfect currency looks like — they learned from each other.
The catch: if the detective gets too good too fast, the counterfeiter receives no useful signal — all attempts score equally badly. If the counterfeiter gets too good too fast, the detective gives up and labels everything as fake. Balance is everything — and maintaining balance is why GAN training is notoriously hard.
Minimax game — what each network optimises and why
The GAN objective is a minimax game. The Discriminator D maximises its ability to distinguish real from fake. The Generator G minimises D's ability — equivalently, maximises the probability that D mistakes its outputs for real.
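Formally, the game is min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))]. In code this becomes an alternating loop: one Discriminator update, then one Generator update. Here is a minimal sketch assuming PyTorch, with toy linear layers standing in for real networks and all sizes invented for illustration:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch is self-contained; real models
# would be deeper networks.
G = nn.Linear(8, 32)                        # noise (8) -> "image" (32)
D = nn.Sequential(nn.Linear(32, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

real = torch.randn(16, 32)                  # placeholder for a real batch
ones, zeros = torch.ones(16, 1), torch.zeros(16, 1)

# Discriminator step: maximise log D(x) + log(1 - D(G(z)))
z = torch.randn(16, 8)
fake = G(z).detach()                        # detach: no gradient into G here
loss_d = bce(D(real), ones) + bce(D(fake), zeros)
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# Generator step: fool D, i.e. push D(G(z)) towards 1
z = torch.randn(16, 8)
loss_g = bce(D(G(z)), ones)
opt_g.zero_grad()
loss_g.backward()
opt_g.step()
```

The detach() call is what keeps the game adversarial: when D learns to reject fakes, that update must not simultaneously teach G how to be rejected.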
DCGAN — Deep Convolutional GAN — the architecture that made GANs work for images
The original GAN used fully connected layers and worked convincingly only on tiny, simple images such as 28×28 digits. DCGAN (Radford et al., 2015) replaced them with convolutional layers and introduced a set of architectural guidelines that made GAN training dramatically more stable. These guidelines are still followed in modern GANs.
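A sketch of a DCGAN-style pair for 64×64 RGB images, following the guidelines (strided convolutions rather than pooling, BatchNorm except on the Discriminator's first and Generator's last layer, LeakyReLU(0.2) in D, ReLU and Tanh in G, Normal(0, 0.02) initialisation). The channel counts are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

Z = 100  # latent size

# Generator: strided transposed convolutions upsample noise to 64x64.
netG = nn.Sequential(
    nn.ConvTranspose2d(Z, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),    # 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 16x16
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 32x32
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                          # 64x64
)

# Discriminator: strided convolutions instead of pooling; LeakyReLU(0.2);
# no BatchNorm on the first layer.
netD = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),                           # 32x32
    nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),    # 16x16
    nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),   # 8x8
    nn.Conv2d(256, 1, 8), nn.Sigmoid(),                                     # 1x1
)

def init_weights(m):
    # DCGAN initialisation: Normal(mean=0, std=0.02) for conv weights
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, 0.0, 0.02)

netG.apply(init_weights)
netD.apply(init_weights)

img = netG(torch.randn(4, Z, 1, 1))   # shape (4, 3, 64, 64)
```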
Mode collapse, vanishing gradients, and the Wasserstein fix
Two failure modes plague vanilla GAN training. Mode collapse: the Generator finds a single image (or a small set) that always fools the Discriminator and stops exploring. You get 1,000 generated images that all look nearly identical. Vanishing gradients: when the Discriminator becomes too good, it outputs probabilities near 0 for all fakes — the gradient of log(1 − D(G(z))) saturates and the Generator receives no signal.
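The vanishing-gradient failure can be seen numerically. Write D(G(z)) = sigmoid(s), where s is the logit the Discriminator assigns to a fake; a confident Discriminator drives s very negative. The gradient of the original generator objective log(1 − sigmoid(s)) with respect to s is −sigmoid(s), which goes to zero exactly when the fake is bad, whereas the non-saturating alternative −log sigmoid(s) keeps a strong gradient:

```python
import torch

# Logit the Discriminator assigns to a fake; D(G(z)) = sigmoid(s).
# A strong Discriminator drives s very negative, so D(G(z)) is near 0.
s1 = torch.tensor([-7.0], requires_grad=True)
saturating = torch.log(1 - torch.sigmoid(s1))   # original G objective
saturating.backward()

s2 = torch.tensor([-7.0], requires_grad=True)
non_saturating = -torch.log(torch.sigmoid(s2))  # "fool D" reformulation
non_saturating.backward()

print(abs(s1.grad.item()))   # ~0.0009: almost no learning signal
print(abs(s2.grad.item()))   # ~0.999: strong signal
```

This is why practical implementations train G to maximise log D(G(z)) rather than minimise log(1 − D(G(z))): the two objectives have the same fixed point but very different gradients early in training.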
Wasserstein GAN (WGAN, 2017) addresses both by replacing the Jensen-Shannon divergence objective with the Wasserstein distance — a metric that provides meaningful gradients even when the generated and real distributions do not overlap. WGAN removes the Sigmoid from the Discriminator (now called Critic), clips weights to enforce a Lipschitz constraint, and trains the Critic for several steps per Generator step.
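A sketch of the original WGAN recipe with weight clipping, using tiny placeholder networks (the sizes, the clip range of 0.01, and five critic steps per generator step follow the paper; RMSprop was the paper's optimiser choice):

```python
import torch
import torch.nn as nn

# Critic: no Sigmoid. It outputs an unbounded score, not a probability.
critic = nn.Sequential(nn.Linear(32, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
gen = nn.Linear(8, 32)                         # toy generator stand-in
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

real = torch.randn(16, 32)                     # placeholder real batch

for _ in range(5):                             # n_critic steps per G step
    fake = gen(torch.randn(16, 8)).detach()
    # Critic maximises score(real) - score(fake), an estimate of the
    # Wasserstein distance, so we minimise the negative.
    loss_c = critic(fake).mean() - critic(real).mean()
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
    # Weight clipping: a crude way to enforce the Lipschitz constraint
    for p in critic.parameters():
        p.data.clamp_(-0.01, 0.01)

# Generator step: maximise the Critic's score on fakes
loss_g = -critic(gen(torch.randn(16, 8))).mean()
```

Weight clipping works but is blunt; WGAN-GP later replaced it with a gradient penalty, which is the variant summarised in the takeaways below.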
Conditional GAN — generate specific classes on demand
Vanilla GANs generate random samples from the full data distribution. Conditional GANs (cGAN) condition generation on a label — generate a kurta specifically, not a random fashion item. Both Generator and Discriminator receive the class label as additional input. The Generator learns to produce images for each class. The Discriminator learns to judge whether an image matches its label — not just whether it looks real.
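The conditioning is usually implemented with an embedding layer: the integer class label becomes a dense vector that is concatenated with the noise (for G) or the image (for D). A sketch with invented sizes (10 classes, 64-dim noise, 16-dim label embedding):

```python
import torch
import torch.nn as nn

N_CLASSES, LATENT, EMB = 10, 64, 16     # illustrative sizes

label_emb = nn.Embedding(N_CLASSES, EMB)   # integer label -> dense vector

# Conditional Generator input: noise concatenated with the label embedding
z = torch.randn(8, LATENT)
labels = torch.randint(0, N_CLASSES, (8,))           # e.g. one class per item
g_input = torch.cat([z, label_emb(labels)], dim=1)   # shape (8, 80)

# Conditional Discriminator input: flattened image plus the same label
# embedding, so D can judge label consistency as well as realism
img = torch.randn(8, 28 * 28)
d_input = torch.cat([img, label_emb(labels)], dim=1)  # shape (8, 800)
```

Because the same labels are fed to both networks, D can punish a realistic image paired with the wrong label, which is what forces G to respect the condition.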
Every common GAN failure — explained and fixed
You understand adversarial training. Next: a smoother path to generation via structured latent spaces.
GANs generate sharp images but training is unstable and mode collapse is a constant risk. Variational Autoencoders take a different path — instead of adversarial competition, they use a principled probabilistic framework that guarantees a smooth, structured latent space. Module 62 builds a VAE from scratch, derives the ELBO loss, and shows the reparameterisation trick that makes it trainable.
The reparameterisation trick, KL divergence loss, and why VAEs enable controllable generation through structured latent spaces.
🎯 Key Takeaways
- ✓ A GAN pits two networks against each other: the Generator maps random noise to fake data, the Discriminator classifies real vs fake. The Generator is trained to fool the Discriminator; the Discriminator is trained to catch fakes. At Nash equilibrium the Generator produces data indistinguishable from real.
- ✓ The training loop alternates: train Discriminator on real (label 1) and fake (label 0), then train Generator to make Discriminator output 1 on fakes. Use detach() when training D to prevent gradients flowing back through G. Use betas=(0.5, 0.999) for Adam — lower momentum reduces oscillation.
- ✓ DCGAN architectural rules that stabilise training: strided convolutions instead of pooling, BatchNorm everywhere except first D layer and last G layer, LeakyReLU(0.2) in D, ReLU in G, Tanh output. Weight initialisation: normal distribution with mean=0, std=0.02.
- ✓ Two main failure modes: mode collapse (Generator outputs same image repeatedly — fix with minibatch discrimination, WGAN-GP, or larger latent dim) and vanishing gradients (Discriminator wins too easily — fix with non-saturating G loss, label smoothing, or WGAN-GP).
- ✓ WGAN-GP replaces the Discriminator with a Critic (no Sigmoid), uses Wasserstein distance instead of BCE loss, and enforces the Lipschitz constraint with gradient penalty (LAMBDA_GP=10) instead of weight clipping. Train critic 5 steps per generator step. Use betas=(0.0, 0.9) instead of (0.5, 0.999).
- ✓ Conditional GANs add class labels as input to both G and D — the Generator learns to produce images of a specific class, the Discriminator judges both realism and label consistency. Implement via nn.Embedding: embed integer class label to a dense vector and concatenate with noise (G) or image (D).
Discussion
Have a better approach? Found something outdated? Share it — your knowledge helps everyone learning here.