Transfer Learning — Fine-Tuning Pretrained Vision Models
Feature extraction vs fine-tuning, layer freezing, and choosing the right backbone. Get ImageNet-level features without ImageNet-level compute.
Training ResNet50 from scratch on ImageNet takes roughly 29 hours on 8 V100 GPUs. Transfer learning uses those pretrained weights as a starting point and fine-tunes to your task in around 20 minutes on a single GPU. This is how nearly every production vision system, including those at Indian startups, is built.
Module 46 introduced the concept — take a pretrained backbone, replace the classifier head, fine-tune. This module goes deep on everything that matters in practice: which layers to freeze, which learning rates to use per layer group, how to choose the right backbone for your constraints, and when feature extraction beats fine-tuning.
The intuition: the first layers of any image model learn universal low-level features — edges, textures, gradients. These are the same whether the task is classifying fashion products or detecting defects in circuit boards. Later layers learn task-specific high-level features. Transfer learning reuses the universal layers and only retrains the task-specific ones.
A civil engineer who has spent 10 years building roads in India knows structural principles, material properties, load calculations — universal engineering knowledge. To now build bridges, they do not redo their entire education. They learn bridge-specific design on top of existing expertise. That is transfer learning.
Early layers of a pretrained CNN have learned to detect edges, curves, and textures from 1.2 million ImageNet images. That knowledge transfers directly to detecting fabric defects, product damage, or medical anomalies. Only the final task-specific layers need to learn from your small dataset.
Feature extraction vs fine-tuning — when each one wins
Differential learning rates — tiny lr for backbone, normal lr for head
The biggest mistake in fine-tuning: using the same learning rate for all layers. Early backbone layers contain universal features learned from 1.2 million images — they need tiny updates (lr ≈ 1e-5) to preserve that knowledge. The new classification head is randomly initialised and needs large updates (lr ≈ 1e-3) to learn quickly. Using 1e-3 on backbone layers causes catastrophic forgetting. Using 1e-5 on the head causes slow convergence. Differential learning rates solve both.
ResNet vs EfficientNet vs ViT — which backbone for which constraint
The backbone is the pretrained feature extractor. Choosing the right one depends on your accuracy requirement, inference latency budget, available GPU memory, and dataset size. There is no universal best — EfficientNet-B0 is better than ResNet50 for mobile deployment, ViT-B/16 is better for large datasets and highest accuracy, ResNet18 is better when training data is very scarce.
Complete transfer learning pipeline — Meesho product classification
Every common transfer learning mistake — explained and fixed
The Computer Vision section is complete. Section 10 — Generative AI — begins next.
You have completed the full Computer Vision section: image fundamentals, data augmentation, object detection, semantic segmentation, and transfer learning. You can build, train, evaluate, and deploy any standard vision system. Section 10 shifts from recognising images to generating them — GANs, VAEs, diffusion models, and the architecture behind Stable Diffusion.
GANs, VAEs, diffusion, and LLMs — what makes each one generative, and when each one is the right architecture.
🎯 Key Takeaways
- ✓ Transfer learning reuses weights from a model pretrained on a large dataset (ImageNet) as the starting point for a new task. Early layers capture universal features (edges, textures) that transfer across all vision tasks. Only later layers and the classification head need retraining on your specific data.
- ✓ Feature extraction freezes all backbone layers and only trains the new head — best for very small datasets (<500 images) or when the domain is very similar to ImageNet. Fine-tuning unfreezes later backbone layers — best for moderate datasets (500–50k) and when highest accuracy is needed.
- ✓ Differential learning rates are essential for fine-tuning: early layers get lr × 0.01, mid layers lr × 0.1, late layers lr × 1.0, head lr × 10. Using the same lr for all layers causes catastrophic forgetting in early layers and slow convergence in the head simultaneously.
- ✓ Backbone selection depends on constraints: ResNet18 for very small datasets or extreme latency, ResNet50 for the default production choice, EfficientNet-B3 for best accuracy at similar parameter count, EfficientNet-B0 for mobile/edge deployment, ViT-B/16 for large datasets needing highest accuracy.
- ✓ Always apply ImageNet normalisation (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]) for any ImageNet-pretrained backbone. Missing normalisation is the single most common reason a fine-tuned model fails to learn — activations are in the wrong range and the pretrained features are meaningless.
- ✓ Export trained models to ONNX (opset_version=17) for production deployment. ONNX runs on CPU, GPU, and edge devices without a PyTorch dependency. Always validate with onnx.checker.check_model() after export. Use dynamic_axes to support variable batch sizes in deployment.