Semantic Segmentation — Pixel-Level Classification
U-Net architecture, skip connections, and how segmentation powers medical imaging and autonomous vehicles. Label every pixel in one forward pass.
Object detection draws rectangles. Semantic segmentation colours every pixel with its class — no rectangles, no approximations, pixel-perfect boundaries.
A radiologist reading a chest X-ray does not draw a box around the tumour and call it done. They need to know the exact boundary — how many cubic centimetres, which tissue is affected, where does it end. A bounding box cannot answer these questions. Semantic segmentation can. It produces a mask: every pixel labelled as "tumour", "healthy tissue", "background."
Practical examples in India: Ola and Uber's dashcam systems segment road, vehicles, pedestrians, and lane markings pixel-by-pixel for driver safety scoring. Agri-tech startups segment satellite images into crop types for yield forecasting. Quality control systems at garment factories segment defect regions in fabric images to measure defect area precisely.
The output of segmentation is a mask — a 2D array of the same height and width as the input image, where each value is a class index. For a 3-class problem (background=0, road=1, vehicle=2), the mask contains integers 0, 1, or 2 at every pixel position.
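The mask representation described above can be sketched with a toy example (the class names and the 4×6 shape here are illustrative):

```python
import numpy as np

# Toy 4x6 mask for the 3-class problem above:
# 0 = background, 1 = road, 2 = vehicle.
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 2, 2, 1, 1],
    [1, 1, 2, 2, 1, 1],
], dtype=np.int64)

print(mask.shape)        # (4, 6): same H x W as the image, no channel axis
print(np.unique(mask))   # [0 1 2]: integer class indices, not colours

# Per-class pixel counts, showing the imbalance typical of segmentation
counts = np.bincount(mask.ravel(), minlength=3)
print(dict(zip(["background", "road", "vehicle"], counts)))
```

Note that the mask stores class indices, not RGB colours; colouring happens only at visualisation time by mapping each index to a palette entry.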
Colouring a map. Detection is like placing stickers on a map — one sticker per city, approximately where each city is. Segmentation is like colouring the map by region — every pixel of India is coloured by state, every coastline is traced exactly, every river is coloured blue. Much more precise, much more useful for geography.
The challenge: to colour pixels precisely, the model needs to understand both the broad context (what is in the image) and fine spatial detail (exactly where boundaries are). Pooling layers in CNNs lose spatial detail. U-Net's skip connections restore it — that is the key architectural insight.
Semantic vs instance vs panoptic — what each one produces
Semantic segmentation assigns a class to every pixel but does not separate objects: two touching cars merge into one "vehicle" region. Instance segmentation produces a separate mask per object, so each car gets its own. Panoptic segmentation combines the two: every pixel receives a class label, and countable objects additionally receive instance IDs.
U-Net — encoder, bottleneck, decoder, and skip connections
U-Net (Ronneberger et al., 2015) was designed for medical image segmentation with very few training images. Its key insight: the encoder (contracting path) captures what is in the image by progressively downsampling. The decoder (expanding path) restores spatial resolution. Skip connections copy feature maps directly from encoder to decoder at each scale — providing fine spatial detail that pooling destroyed. The result: precise pixel boundaries even from a small dataset.
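The encoder, bottleneck, decoder, and skip-connection pattern can be sketched in PyTorch. This is a hypothetical scaled-down variant, not the paper's exact configuration: two pooling levels rather than four, BatchNorm added, and "same" padding so the output mask matches the input size.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 conv + BN + ReLU layers: the basic U-Net block."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        # Encoder (contracting path): downsample, learn "what"
        self.enc1 = double_conv(in_ch, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(64, 128)
        # Decoder (expanding path): upsample, restore "where"
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = double_conv(128, 64)   # 64 upsampled + 64 from skip
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = double_conv(64, 32)    # 32 upsampled + 32 from skip
        self.head = nn.Conv2d(32, num_classes, 1)  # 1x1 conv -> per-pixel logits

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        b = self.bottleneck(self.pool(e2))   # 1/4 resolution
        # Skip connections: concatenate encoder maps with upsampled decoder maps
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                 # (N, num_classes, H, W)

model = MiniUNet()
logits = model(torch.randn(1, 3, 64, 64))   # input divisible by 2^2 = 4
print(logits.shape)                         # torch.Size([1, 3, 64, 64])
pred = logits.argmax(dim=1)                 # (1, 64, 64) mask of class indices
```

The `torch.cat` calls are the skip connections: without them, the decoder would have to reconstruct boundaries from the heavily pooled bottleneck alone.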
Loss functions, masks, and the complete training pipeline
Segmentation training is similar to classification but operates at the pixel level. The target is not a single integer per image — it is a 2D mask of shape (H, W) where each value is a class index. The loss is cross-entropy computed over all pixels simultaneously. Class imbalance is severe in segmentation — background pixels vastly outnumber foreground pixels in most tasks. Weighted loss or Dice loss addresses this.
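A minimal sketch of the pixel-level loss, assuming a `(N, C, H, W)` logits tensor and a `(N, H, W)` integer target mask; the class weights and the soft-Dice formulation below are illustrative choices, not the only ones:

```python
import torch
import torch.nn.functional as F

N, C, H, W = 2, 3, 8, 8
logits = torch.randn(N, C, H, W)
target = torch.randint(0, C, (N, H, W))   # dtype long, class indices per pixel

# Plain cross-entropy averages over all N*H*W pixels at once
ce = F.cross_entropy(logits, target)

# Class-weighted cross-entropy: upweight rare foreground classes
# (these weights are illustrative, not tuned)
weights = torch.tensor([0.2, 1.0, 2.0])
weighted_ce = F.cross_entropy(logits, target, weight=weights)

def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss: overlap-based, robust to class imbalance."""
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])  # (N, H, W, C)
    one_hot = one_hot.permute(0, 3, 1, 2).float()             # (N, C, H, W)
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

loss = weighted_ce + dice_loss(logits, target)   # a common combined objective
```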
Pixel accuracy, IoU, and mIoU — the segmentation metric family
Pixel accuracy — fraction of correctly classified pixels — is misleading when classes are imbalanced. A model that predicts "background" for every pixel gets 90% pixel accuracy on a dataset where 90% of pixels are background. The correct metrics are per-class IoU and mean IoU (mIoU).
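The failure mode above can be reproduced in a few lines (toy data; the `iou` helper is a hypothetical name for illustration):

```python
import numpy as np

# Toy 2-class problem: 90% background (class 0), 10% foreground (class 1)
target = np.zeros(100, dtype=np.int64)
target[:10] = 1                             # 10 foreground pixels

pred_lazy = np.zeros(100, dtype=np.int64)   # predicts background everywhere

pixel_acc = (pred_lazy == target).mean()
print(pixel_acc)   # 0.9: looks great, but the model found nothing

def iou(pred, target, cls):
    """Intersection over union for one class."""
    inter = np.logical_and(pred == cls, target == cls).sum()
    union = np.logical_or(pred == cls, target == cls).sum()
    return inter / union if union else float("nan")

ious = [iou(pred_lazy, target, c) for c in (0, 1)]
print(ious)               # [0.9, 0.0]: foreground IoU is zero
print(np.nanmean(ious))   # mIoU = 0.45, exposing the collapse
```

Because mIoU averages over classes rather than pixels, a class with ten pixels counts as much as a class with ninety, which is exactly what makes it honest under imbalance.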
DeepLab and SegFormer — pretrained segmentation models for fine-tuning
U-Net trained from scratch requires thousands of labelled images. For most production tasks, fine-tune a pretrained segmentation model instead. DeepLabV3+ (Google) and SegFormer (NVIDIA) are the two most widely used pretrained models — both available via HuggingFace with ImageNet-pretrained backbones and heads pretrained on segmentation datasets such as ADE20K and Cityscapes.
Every common segmentation mistake — explained and fixed
You can segment any image. Next: get ImageNet-level features without ImageNet-level compute.
You have built segmentation from scratch and used pretrained models. Both required labelled masks — expensive to collect. Module 59 covers transfer learning for vision: how to use a ResNet or EfficientNet backbone pretrained on ImageNet as a feature extractor for your own task, freezing early layers and fine-tuning later layers. The same technique powers every production computer vision system at Indian startups today — building on ImageNet representations instead of training from scratch.
Feature extraction vs fine-tuning, layer freezing, and choosing the right backbone for your task.
🎯 Key Takeaways
- ✓ Semantic segmentation assigns a class label to every pixel — output is a 2D mask of shape (H, W) with integer class indices. Unlike detection (bounding boxes) it traces exact boundaries. Unlike classification (one label per image) it works at pixel granularity.
- ✓ U-Net has two paths: the encoder (downsampling with MaxPool) captures what is in the image, the decoder (upsampling) restores spatial resolution. Skip connections copy encoder feature maps directly to the decoder at each scale — providing fine spatial detail that pooling destroyed. This is why U-Net produces sharp precise boundaries.
- ✓ Input dimensions must be divisible by 2^(number of pooling layers). U-Net with 4 pooling layers requires input divisible by 16. Use F.pad in the decoder to handle any size mismatches between encoder skip connections and upsampled decoder features.
- ✓ Never use ToTensor() on segmentation masks — it adds a channel dimension and normalises to [0, 1], destroying integer class indices. Convert masks with: torch.tensor(np.array(mask_pil), dtype=torch.long) for shape (H, W) with correct integer values.
- ✓ Pixel accuracy is misleading for imbalanced datasets — always use mIoU (mean Intersection over Union). A model predicting all-background gets high pixel accuracy but near-zero mIoU. Use class-weighted CrossEntropyLoss or Dice loss to prevent the model from collapsing to predicting the majority class.
- ✓ For production: fine-tune SegFormer or DeepLabV3 pretrained on ADE20K or Cityscapes. Requires far fewer labelled images than training U-Net from scratch. SegFormer outputs at 1/4 resolution — always upsample with F.interpolate(logits, size=(H,W), mode="bilinear") before argmax for the final prediction.
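The divisibility point above can be made concrete: with a 100-pixel input and 4 pooling layers (100 → 50 → 25 → 12 → 6), upsampling back by ×2 four times yields 96, not 100, so the decoder must pad before concatenating. A sketch (the original U-Net paper crops the encoder map instead; padding is the common modern variant):

```python
import torch
import torch.nn.functional as F

skip = torch.randn(1, 32, 100, 100)   # encoder feature map (skip connection)
up = torch.randn(1, 32, 96, 96)       # upsampled decoder feature map

# Pad the decoder features to match the skip connection's H and W
dh = skip.shape[2] - up.shape[2]
dw = skip.shape[3] - up.shape[3]
up = F.pad(up, (dw // 2, dw - dw // 2, dh // 2, dh - dh // 2))

merged = torch.cat([up, skip], dim=1)
print(merged.shape)   # torch.Size([1, 64, 100, 100])
```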