CNNs — Meesho Product Image Classification
Filters, feature maps, pooling, and how CNNs learn to recognise objects at any position in an image. Built from scratch then scaled with transfer learning.
A 224×224 RGB image has 150,528 input values (224 × 224 pixels × 3 colour channels). An MLP treating each value as a separate input needs millions of parameters just for the first layer — and still cannot recognise a shirt if it appears in a different corner of the image.
Meesho lists millions of fashion products. Each listing needs a category tag — kurta, saree, jeans, sneakers. An MLP flattens the image to a vector of 150,528 numbers and connects every pixel to every neuron. A first hidden layer of 512 neurons needs 77 million weights. It trains on images of shirts centred in frame, then fails on shirts shifted slightly to the left — because it memorised pixel positions, not the concept of "shirt."
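The parameter explosion above is quick to verify with plain arithmetic (layer sizes are the ones from the text):

```python
# Parameter count for the first layer of an MLP on raw 224x224 RGB images.
inputs = 224 * 224 * 3      # 150,528 pixel values per image
hidden = 512                # width of the first hidden layer

weights = inputs * hidden   # one weight per (input value, neuron) pair
biases = hidden
total = weights + biases

print(f"inputs = {inputs:,}")              # 150,528
print(f"first-layer params = {total:,}")   # 77,070,848 — roughly 77 million
```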
CNNs solve both problems with one idea: instead of connecting every pixel to every neuron, slide a small filter (typically 3×3 pixels) across the entire image. The same filter weights are reused at every position — this is weight sharing. A filter that detects a vertical edge detects it whether the edge is in the top-left or bottom-right corner. This gives CNNs two critical properties: far fewer parameters, and translation equivariance — a shifted input produces a correspondingly shifted feature map, and pooling then makes the final prediction approximately translation invariant.
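Weight sharing makes a convolutional layer's cost independent of image size. A quick sketch in PyTorch (the choice of 32 filters is illustrative, not from the text):

```python
import torch.nn as nn

# 32 filters of size 3x3 over a 3-channel input. The same small set of
# weights is reused at every spatial position (weight sharing).
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)

params = sum(p.numel() for p in conv.parameters())
# weights: 32 filters * (3 channels * 3 * 3) = 864, plus 32 biases = 896,
# regardless of whether the input is 224x224 or 1024x1024.
print(params)  # 896
```

Compare that to the 77 million weights the MLP's first layer needs for the same input.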
Imagine inspecting a large fabric roll for defects. You would not look at the entire roll simultaneously — you would use a small magnifying glass and slide it across, looking for the same pattern (a tear, a stain) at every position. You use the same visual skill (the same filter weights) everywhere.
A CNN's convolutional layer is that magnifying glass — a small filter sliding across the image, applying the same weights at every position. Early layers detect edges and textures. Middle layers detect shapes. Deep layers detect objects. The hierarchy of features is learned automatically from labelled images.
Convolution — sliding a filter across an image
A convolution takes a filter (a small matrix of learnable weights, e.g. 3×3) and slides it across the input image. At each position, the element-wise product between the filter and the overlapping image patch is computed and summed into a single number. Collecting these numbers over all positions forms the feature map; multiple filters produce multiple feature maps, one per filter. (Strictly, deep learning frameworks compute cross-correlation — the filter is not flipped — but the name "convolution" has stuck.)
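The sliding computation can be checked by hand against `F.conv2d`. This toy example (image and filter values are illustrative) uses a classic vertical-edge filter on a tiny image whose left half is dark and right half is bright:

```python
import torch
import torch.nn.functional as F

# A 4x4 single-channel "image": dark on the left, bright on the right.
img = torch.tensor([[0., 0., 1., 1.],
                    [0., 0., 1., 1.],
                    [0., 0., 1., 1.],
                    [0., 0., 1., 1.]])
# A 3x3 vertical-edge filter.
filt = torch.tensor([[-1., 0., 1.],
                     [-1., 0., 1.],
                     [-1., 0., 1.]])

# One position by hand: element-wise product with the top-left patch, summed.
patch = img[0:3, 0:3]
by_hand = (patch * filt).sum()    # 3.0 — a strong vertical edge here

# The same thing with F.conv2d. Shapes are (batch, channels, H, W);
# with no padding, output side = (4 - 3) / 1 + 1 = 2.
out = F.conv2d(img.view(1, 1, 4, 4), filt.view(1, 1, 3, 3))
print(out.shape)    # torch.Size([1, 1, 2, 2])
assert out[0, 0, 0, 0] == by_hand
```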
Pooling, flattening, and the full CNN pipeline
A complete CNN stacks three types of layers. Convolutional layers detect features — edges, textures, shapes — and produce feature maps. Pooling layers reduce spatial dimensions — typically MaxPool2d(2,2) halves H and W, keeping the most prominent feature in each 2×2 region. This builds translation invariance and reduces computation. Fully connected layers at the end combine all detected features to make the final classification decision.
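A minimal version of that pipeline, with the spatial size traced layer by layer (the widths — 16 and 32 filters, 6 output classes — are illustrative choices):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 224x224 -> 224x224
    nn.BatchNorm2d(16),                          # after Conv, before ReLU
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                          # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),                          # 112 -> 56
    nn.AdaptiveAvgPool2d(1),                     # 56x56 -> 1x1, any input size
    nn.Flatten(),                                # (batch, 32)
    nn.Linear(32, 6),                            # one score per category
)

x = torch.randn(4, 3, 224, 224)   # a batch of 4 RGB images
print(model(x).shape)             # torch.Size([4, 6])
```

The `AdaptiveAvgPool2d(1)` step is what lets the same network accept images of different resolutions without changing the FC layer.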
Full training loop — data augmentation, class weighting, early stopping
Training a CNN from scratch on images requires one additional technique not needed for tabular data: data augmentation. Images of the same product can be flipped, rotated, cropped, or colour-jittered without changing their category. Applying random transformations during training artificially multiplies the dataset size and teaches the network that these variations should produce the same prediction. Without augmentation, CNNs overfit rapidly on small datasets.
Transfer learning — take ResNet50 pretrained on ImageNet, fine-tune the head
Training a CNN from scratch requires hundreds of thousands of labelled images and days of GPU compute. Meesho does not do this. Nobody does this for product classification. Instead, they use a model pretrained on ImageNet — a dataset of 1.2 million images across 1,000 categories. That model has already learned to detect edges, textures, shapes, and objects. Replace only the final classification layer with one that outputs your 6 categories, then fine-tune. This is transfer learning — and it produces better results with 1,000 images than training from scratch with 100,000.
- Head only: Freeze all pretrained layers. Train only the new classification head. Fast — only a small number of parameters update. Best when your dataset is small (<1,000 images) and similar to ImageNet.
- Partial fine-tuning: Freeze early layers (edge/texture detectors — universal). Unfreeze later layers (task-specific features). Train head + later layers with a small lr. Best balance of speed and accuracy.
- Full fine-tuning: Unfreeze all layers. Train the entire network with a very small lr (1e-5). Early layers need only tiny updates — they are already good. Risk of catastrophic forgetting if the lr is too high.
Every common CNN mistake — explained and fixed
You can classify images. Next: model sequences — text, time series, audio.
CNNs exploit spatial structure in images. But many real-world problems involve sequences — a sentence is a sequence of words, a stock price is a sequence of daily values, a user session is a sequence of actions. Sequences have temporal structure: what came earlier affects what comes later. A CNN's filters see only a fixed local window and carry no memory of earlier inputs, so they cannot model long-range order dependencies. Module 47 covers RNNs and LSTMs — architectures designed specifically to process sequences by maintaining a hidden state that carries information forward across time steps.
Hidden states, vanishing gradients across time, and how LSTMs use gates to selectively remember and forget.
🎯 Key Takeaways
- ✓CNNs solve two fundamental problems with MLPs on images: too many parameters (a 224×224 image needs 150k inputs × hidden units) and no spatial invariance (an MLP memorises pixel positions, not visual patterns). Convolutional filters slide across the image reusing the same weights everywhere — weight sharing dramatically reduces parameters and lets the same pattern be detected at any position.
- ✓A convolutional layer applies multiple small filters (typically 3×3) across the input, producing one feature map per filter. Output size = (H + 2×padding − kernel) / stride + 1. padding=1 with a 3×3 filter keeps spatial dimensions unchanged. MaxPool2d(2,2) halves H and W, keeping the strongest activation in each 2×2 region.
- ✓A complete CNN stacks: Conv+BN+ReLU blocks (feature extraction) → MaxPool (spatial reduction) → AdaptiveAvgPool (fixed output size) → Flatten → FC layers (classification). BatchNorm2d is placed after Conv2d and before ReLU for stable training.
- ✓Data augmentation is essential for CNN training — random flips, rotations, colour jitter, and random erasing artificially expand the dataset and teach the network that these variations do not change the category. Apply augmentation only during training, never at validation or test time.
- ✓Transfer learning is how all production image classifiers are built. Take a model pretrained on ImageNet, replace the final FC layer with one matching your number of classes, and fine-tune. Use differential learning rates: very small (1e-5) for pretrained backbone layers, normal (1e-3) for the new head. Always apply ImageNet normalisation when using pretrained models.
- ✓The four CNN gotchas: input must be (batch, channels, H, W) — use unsqueeze(0) for single images. Always normalise pixel values to 0–1 before training. When using pretrained models always apply ImageNet mean/std normalisation. Reduce batch size or use gradient accumulation when hitting CUDA OOM errors.
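The first three gotchas in the list above, in a few lines (the ImageNet mean/std constants are the standard published values):

```python
import torch

# Gotcha 1: conv layers expect (batch, channels, H, W). A single image
# loaded as (3, 224, 224) needs a batch dimension first.
img = torch.rand(3, 224, 224)     # pixel values already scaled to 0-1
batch = img.unsqueeze(0)
print(batch.shape)                # torch.Size([1, 3, 224, 224])

# Gotchas 2 and 3: scale pixels to 0-1, then apply the ImageNet per-channel
# mean/std whenever a pretrained backbone is used. This is exactly what
# transforms.Normalize does under the hood.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
normed = (img - mean) / std
```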