
CNNs — Meesho Product Image Classification

Filters, feature maps, pooling, and how CNNs learn to recognise objects at any position in an image. Built from scratch then scaled with transfer learning.

40–45 min · March 2026
Before any formula — why MLPs fail on images

A 224×224 image has 150,528 pixels. An MLP treating each pixel as a separate input needs millions of parameters just for the first layer — and still cannot recognise a shirt if it appears in a different corner of the image.

Meesho lists millions of fashion products. Each listing needs a category tag — kurta, saree, jeans, sneakers. An MLP flattens the image to a vector of 150,528 numbers and connects every pixel to every neuron. A first hidden layer of 512 neurons needs 77 million weights. It trains on images of shirts centred in frame, then fails on shirts shifted slightly to the left — because it memorised pixel positions, not the concept of "shirt."

CNNs solve both problems with one idea: instead of connecting every pixel to every neuron, slide a small filter (typically 3×3 pixels) across the entire image. The same filter weights are reused at every position — this is weight sharing. A filter that detects a vertical edge detects it whether the edge is in the top-left or bottom-right corner. This gives CNNs two critical properties: far fewer parameters, and translation invariance.
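The parameter gap is easy to verify directly. A minimal sketch, using the illustrative sizes from this module (a 512-unit MLP layer versus a 32-filter conv layer):

```python
import torch.nn as nn

# MLP first layer: every one of the 150,528 pixel values connects to 512 neurons
mlp_layer = nn.Linear(224 * 224 * 3, 512)
mlp_params = sum(p.numel() for p in mlp_layer.parameters())

# Conv first layer: 32 filters of 3x3x3 shared weights, reused at every position
conv_layer = nn.Conv2d(3, 32, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv_layer.parameters())

print(f"MLP first layer:  {mlp_params:,} parameters")   # 150,528*512 + 512 biases
print(f"Conv first layer: {conv_params:,} parameters")  # 3*3*3*32 + 32 biases = 896
```

Roughly 77 million parameters versus 896 — the entire difference comes from weight sharing.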

🧠 Analogy — read this first

Imagine inspecting a large fabric roll for defects. You would not look at the entire roll simultaneously — you would use a small magnifying glass and slide it across, looking for the same pattern (a tear, a stain) at every position. You use the same visual skill (the same filter weights) everywhere.

A CNN's convolutional layer is that magnifying glass — a small filter sliding across the image, applying the same weights at every position. Early layers detect edges and textures. Middle layers detect shapes. Deep layers detect objects. The hierarchy of features is learned automatically from labelled images.

🎯 Pro Tip
This module builds a CNN from scratch in PyTorch, trains it on a simulated Meesho-style product classification task, then shows transfer learning — using a pretrained ResNet and fine-tuning only the final layer. Transfer learning is how most production image classifiers are built today.
The core operation

Convolution — sliding a filter across an image

A convolution takes a filter (a small matrix of learnable weights, e.g. 3×3) and slides it across the input image. At each position the filter is placed, the element-wise product between the filter and the overlapping image patch is computed and summed. The result — one number per position — forms the feature map. Multiple filters produce multiple feature maps, one per filter.

3×3 filter sliding across a 5×5 input — one step of convolution

INPUT (5×5)              FILTER (3×3)        OUTPUT (3×3)
1  2  3  0  1
0  1  2  3  1            1   0  -1           -4  -2   4
1  0  1  2  0      ×     1   0  -1     =      0  -4  -1
2  1  0  1  3            1   0  -1            0  -1  -1
0  1  2  0  1

Top-left output value: (1×1)+(2×0)+(3×−1) + (0×1)+(1×0)+(2×−1) + (1×1)+(0×0)+(1×−1) = −4. This filter detects vertical edges — positive response where the left side of the patch is brighter than the right.

Key CNN vocabulary — four terms you must know
Filter (kernel): small weight matrix (3×3, 5×5) that slides across the input. One filter = one learned feature detector. 32 filters → 32 feature maps.
Feature map: output of one filter sliding across the input. Shape: (H_out × W_out), one per filter. Represents "where this feature appears in the image."
Stride: how many pixels the filter jumps each step. stride=1: dense output. stride=2: halves the spatial dimensions (H and W). Controls output size.
Padding: zeros added around the input border. padding=1 with a 3×3 filter keeps the output the same spatial size as the input ("same" padding). padding=0 shrinks it.
python
import numpy as np
import torch
import torch.nn as nn

# ── Manual 2D convolution — see every step ────────────────────────────
def conv2d_manual(input_2d, kernel, stride=1, padding=0):
    """
    input_2d: (H, W)
    kernel:   (kH, kW)
    Returns:  feature map (H_out, W_out)
    """
    H, W   = input_2d.shape
    kH, kW = kernel.shape

    if padding > 0:
        input_2d = np.pad(input_2d, padding, mode='constant')
        H, W     = input_2d.shape

    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1
    output = np.zeros((H_out, W_out))

    for i in range(H_out):
        for j in range(W_out):
            patch         = input_2d[i*stride:i*stride+kH, j*stride:j*stride+kW]
            output[i, j]  = (patch * kernel).sum()
    return output

# ── Test: vertical edge detector ──────────────────────────────────────
image = np.array([
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
    [0, 0, 0, 255, 255],
], dtype=float)

vertical_edge = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
horizontal_edge = np.array([[-1,-1,-1], [0,0,0], [1,1,1]], dtype=float)

print("Vertical edge filter response:")
print(conv2d_manual(image, vertical_edge))
print("\nHorizontal edge filter response:")
print(conv2d_manual(image, horizontal_edge))

# ── Output size formula ────────────────────────────────────────────────
def output_size(H, kernel, stride, padding):
    return (H + 2*padding - kernel) // stride + 1

print("\nOutput size examples (input=224):")
for k, s, p in [(3,1,0),(3,1,1),(3,2,1),(5,1,2),(7,2,3)]:
    out = output_size(224, k, s, p)
    print(f"  kernel={k} stride={s} pad={p}: output={out}×{out}")
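As a sanity check, the manual loop should agree with PyTorch's built-in convolution. A standalone sketch (conv2d_manual is repeated here so the snippet runs on its own) comparing it against torch.nn.functional.conv2d on a random input:

```python
import numpy as np
import torch
import torch.nn.functional as F

def conv2d_manual(input_2d, kernel, stride=1, padding=0):
    # same manual sliding-window convolution as above
    H, W   = input_2d.shape
    kH, kW = kernel.shape
    if padding > 0:
        input_2d = np.pad(input_2d, padding, mode='constant')
        H, W = input_2d.shape
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            patch     = input_2d[i*stride:i*stride+kH, j*stride:j*stride+kW]
            out[i, j] = (patch * kernel).sum()
    return out

rng    = np.random.default_rng(0)
img    = rng.standard_normal((7, 7))
kernel = rng.standard_normal((3, 3))

manual = conv2d_manual(img, kernel, stride=2, padding=1)

# F.conv2d expects (batch, in_channels, H, W) input and (out_ch, in_ch, kH, kW) weights
torch_out = F.conv2d(
    torch.tensor(img).view(1, 1, 7, 7),
    torch.tensor(kernel).view(1, 1, 3, 3),
    stride=2, padding=1,
).squeeze().numpy()

print(np.allclose(manual, torch_out))  # True: PyTorch "convolution" is
                                       # cross-correlation, same as our loop
```

Note the detail the check exposes: deep learning frameworks do not flip the kernel, so what they call convolution is mathematically cross-correlation — exactly what the manual loop computes.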
Building blocks

Pooling, flattening, and the full CNN pipeline

A complete CNN stacks three types of layers. Convolutional layers detect features — edges, textures, shapes — and produce feature maps. Pooling layers reduce spatial dimensions — typically MaxPool2d(2,2) halves H and W, keeping the most prominent feature in each 2×2 region. This builds translation invariance and reduces computation. Fully connected layers at the end combine all detected features to make the final classification decision.
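Both pooling effects can be seen on a toy tensor. A minimal sketch: MaxPool2d(2,2) halves H and W, and a one-pixel shift of a feature often leaves the pooled output unchanged because the maximum stays inside the same 2×2 window:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, 2)

x = torch.zeros(1, 1, 8, 8)
x[0, 0, 2, 2] = 1.0              # a single bright "feature" pixel

x_shift = torch.zeros(1, 1, 8, 8)
x_shift[0, 0, 2, 3] = 1.0        # same feature, shifted one pixel right

print(pool(x).shape)                        # torch.Size([1, 1, 4, 4]): spatial dims halved
print(torch.equal(pool(x), pool(x_shift)))  # True: the shift stayed inside one 2x2 window
```

This is the "translation invariance" pooling contributes: small shifts that stay within a pooling window produce identical downstream activations.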

CNN pipeline — data shape at each stage for Meesho product images
Input → (batch, 3, 128, 128): RGB image, 3 channels, 128×128 pixels
Conv1 (32 filters, 3×3, pad=1) + ReLU → (batch, 32, 128, 128): 32 edge/texture detectors, same spatial size
MaxPool (2×2) → (batch, 32, 64, 64): halve spatial dims, keep strongest activations
Conv2 (64 filters, 3×3, pad=1) + ReLU → (batch, 64, 64, 64): 64 shape/pattern detectors
MaxPool (2×2) → (batch, 64, 32, 32): halve again
Conv3 (128 filters, 3×3, pad=1) + ReLU → (batch, 128, 32, 32): 128 complex feature detectors
MaxPool (2×2) → (batch, 128, 16, 16): halve again
AdaptiveAvgPool → (batch, 128, 4, 4): fixed output regardless of input size
Flatten → (batch, 2048): 128 × 4 × 4 = 2048 features
FC (512) + ReLU + Dropout → (batch, 512): dense classification head
FC (n_classes) → (batch, 6): 6 logits, one per product category
python
import torch
import torch.nn as nn

# ── CNN from scratch for Meesho product classification ─────────────────
# 6 categories: kurta, saree, jeans, sneakers, watch, handbag

class MeeshoCNN(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()

        # Feature extractor — conv layers learn visual features
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (B,3,128,128) → (B,32,128,128)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                            # → (B,32,64,64)

            # Block 2: 32 → 64 channels
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # → (B,64,64,64)
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                            # → (B,64,32,32)

            # Block 3: 64 → 128 channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1), # → (B,128,32,32)
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                            # → (B,128,16,16)

            # Block 4: 128 → 256 channels
            nn.Conv2d(128, 256, kernel_size=3, padding=1),# → (B,256,16,16)
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),                  # → (B,256,4,4) fixed
        )

        # Classifier head — fully connected layers make final decision
        self.classifier = nn.Sequential(
            nn.Flatten(),                                  # → (B, 4096)
            nn.Linear(256 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.4),
            nn.Linear(512, n_classes),                    # → (B, 6) raw logits
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

model = MeeshoCNN(n_classes=6)

# ── Parameter count ───────────────────────────────────────────────────
total  = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,}")

# ── Shape check — forward pass with dummy batch ───────────────────────
dummy = torch.randn(4, 3, 128, 128)   # batch of 4 RGB images at 128×128
out   = model(dummy)
print(f"\nInput shape:  {dummy.shape}")
print(f"Output shape: {out.shape}  ← (4 samples, 6 class logits)")

# ── Inspect intermediate feature map shapes ───────────────────────────
print("\nFeature map shapes through each block:")
x = dummy
for i, layer in enumerate(model.features):
    x = layer(x)
    if isinstance(layer, (nn.MaxPool2d, nn.AdaptiveAvgPool2d)):
        print(f"  After {layer.__class__.__name__}: {x.shape}")
Training from scratch

Full training loop — data augmentation, class weighting, early stopping

Training a CNN from scratch on images requires one additional technique not needed for tabular data: data augmentation. Images of the same product can be flipped, rotated, cropped, or colour-jittered without changing their category. Applying random transformations during training artificially multiplies the dataset size and teaches the network that these variations should produce the same prediction. Without augmentation, CNNs overfit rapidly on small datasets.

python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as T
import numpy as np
from torch.utils.data import Dataset, DataLoader
import copy, warnings
warnings.filterwarnings('ignore')

torch.manual_seed(42)
np.random.seed(42)

# ── Simulate Meesho product image dataset ─────────────────────────────
# In production: torchvision.datasets.ImageFolder pointing to your image directory
# Here: synthetic RGB images with class-specific colour patterns

CATEGORIES = ['kurta', 'saree', 'jeans', 'sneakers', 'watch', 'handbag']
N_CLASSES  = len(CATEGORIES)

class SyntheticMeeshoDataset(Dataset):
    def __init__(self, n_samples=2000, img_size=64, transform=None):
        self.n  = n_samples
        self.sz = img_size
        self.transform = transform
        np.random.seed(42)
        self.labels = np.random.randint(0, N_CLASSES, n_samples)
        # Each class has a distinct colour bias — simulates real category difference
        self.color_bias = np.array([
            [0.8, 0.3, 0.2],   # kurta    — warm red
            [0.2, 0.7, 0.5],   # saree    — green/teal
            [0.2, 0.3, 0.8],   # jeans    — blue
            [0.7, 0.7, 0.2],   # sneakers — yellow/white
            [0.5, 0.5, 0.5],   # watch    — grey/silver
            [0.6, 0.2, 0.6],   # handbag  — purple
        ])

    def __len__(self): return self.n

    def __getitem__(self, idx):
        label = self.labels[idx]
        bias  = self.color_bias[label]
        # Generate image with class-specific colour + noise
        img = np.random.randn(3, self.sz, self.sz) * 0.15
        for c in range(3):
            img[c] += bias[c]
        img = np.clip(img, 0, 1).astype(np.float32)
        img = torch.FloatTensor(img)
        if self.transform:
            img = self.transform(img)
        return img, label

# ── Data augmentation for training ────────────────────────────────────
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.1),
    T.RandomRotation(degrees=15),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2),
    T.RandomErasing(p=0.2),       # randomly mask patches — prevents overfit
])
val_transform = None   # no augmentation at validation time

train_ds = SyntheticMeeshoDataset(1600, transform=train_transform)
val_ds   = SyntheticMeeshoDataset(400,  transform=val_transform)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True,  num_workers=0)
val_loader   = DataLoader(val_ds,   batch_size=64, shuffle=False, num_workers=0)

# ── Model, loss, optimiser ─────────────────────────────────────────────
class MeeshoCNN(nn.Module):
    def __init__(self, n=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3,32,3,padding=1), nn.BatchNorm2d(32), nn.ReLU(True), nn.MaxPool2d(2),
            nn.Conv2d(32,64,3,padding=1), nn.BatchNorm2d(64), nn.ReLU(True), nn.MaxPool2d(2),
            nn.Conv2d(64,128,3,padding=1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.AdaptiveAvgPool2d((4,4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128*16,256),
            nn.ReLU(True), nn.Dropout(0.4), nn.Linear(256,n),
        )
    def forward(self, x): return self.classifier(self.features(x))

model     = MeeshoCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

# ── Training loop with early stopping ─────────────────────────────────
best_acc, best_wts, patience_count = 0.0, None, 0
PATIENCE = 8

print(f"Training MeeshoCNN from scratch:")
print(f"{'Epoch':>6} {'Train loss':>12} {'Val acc':>10} {'LR':>12}")
print("─" * 44)

for epoch in range(1, 31):
    model.train()
    total_loss = 0
    for Xb, yb in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(Xb), yb)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    scheduler.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for Xb, yb in val_loader:
            correct += (model(Xb).argmax(1) == yb).sum().item()
    val_acc = correct / len(val_ds)
    lr_now  = optimizer.param_groups[0]['lr']

    if epoch % 5 == 0:
        print(f"  {epoch:>4}  {total_loss/len(train_loader):>12.4f}  {val_acc:>10.4f}  {lr_now:>12.6f}")

    if val_acc > best_acc:
        best_acc, best_wts = val_acc, copy.deepcopy(model.state_dict())
        patience_count = 0
    else:
        patience_count += 1
        if patience_count >= PATIENCE:
            print(f"  Early stop at epoch {epoch}")
            break

model.load_state_dict(best_wts)
print(f"\nBest val accuracy: {best_acc:.4f}")
How production image classifiers are actually built

Transfer learning — take ResNet50 pretrained on ImageNet, fine-tune the head

Training a CNN from scratch requires hundreds of thousands of labelled images and days of GPU compute. Meesho does not do this. Nobody does this for product classification. Instead, they use a model pretrained on ImageNet — a dataset of 1.2 million images across 1,000 categories. That model has already learned to detect edges, textures, shapes, and objects. Replace only the final classification layer with one that outputs your 6 categories, then fine-tune. This is transfer learning — and it often produces better results with 1,000 images than training from scratch with 100,000.

Feature extraction

Freeze all pretrained layers. Train only the new classification head. Fast — only a small number of parameters update. Best when your dataset is small (<1,000 images) and similar to ImageNet.

Use when: Small dataset, limited GPU, quick prototype.
Fine-tuning (partial)

Freeze early layers (edge/texture detectors — universal). Unfreeze later layers (task-specific features). Train head + later layers with a small lr. Best balance of speed and accuracy.

Use when: Medium dataset (1k–100k images). Standard production approach.
Full fine-tuning

Unfreeze all layers. Train entire network with a very small lr (1e-5). Early layers need tiny updates — they are already good. Risk of catastrophic forgetting if lr is too high.

Use when: Large dataset (100k+ images) or domain very different from ImageNet.
python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
import numpy as np
from torch.utils.data import Dataset, DataLoader
import warnings
warnings.filterwarnings('ignore')

torch.manual_seed(42)

CATEGORIES = ['kurta', 'saree', 'jeans', 'sneakers', 'watch', 'handbag']
N_CLASSES  = 6

class SyntheticMeeshoDataset(torch.utils.data.Dataset):
    def __init__(self, n=500):
        np.random.seed(42)
        self.labels = np.random.randint(0, N_CLASSES, n)
        self.bias   = np.array([[.8,.3,.2],[.2,.7,.5],[.2,.3,.8],
                                 [.7,.7,.2],[.5,.5,.5],[.6,.2,.6]])
    def __len__(self): return len(self.labels)
    def __getitem__(self, i):
        b   = self.bias[self.labels[i]]
        img = np.random.randn(3,64,64)*.15
        for c in range(3): img[c] += b[c]
        return torch.FloatTensor(np.clip(img,0,1).astype(np.float32)), self.labels[i]

train_ds = SyntheticMeeshoDataset(400)
val_ds   = SyntheticMeeshoDataset(100)
train_ld = DataLoader(train_ds, 32, shuffle=True)
val_ld   = DataLoader(val_ds,   32)

# ── Strategy 1: Feature extraction — freeze backbone ──────────────────
backbone = models.resnet18(weights=None)   # in production: weights=models.ResNet18_Weights.IMAGENET1K_V1
# Freeze all layers
for param in backbone.parameters():
    param.requires_grad = False
# Replace the final FC layer — only this trains
backbone.fc = nn.Linear(backbone.fc.in_features, N_CLASSES)
# Only fc parameters have requires_grad=True
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
total     = sum(p.numel() for p in backbone.parameters())
print(f"Feature extraction: {trainable:,} / {total:,} parameters trainable")

# ── Strategy 2: Partial fine-tuning — unfreeze layer4 + fc ───────────
backbone2 = models.resnet18(weights=None)
for param in backbone2.parameters():
    param.requires_grad = False
# Unfreeze only the last residual block and fc
for param in backbone2.layer4.parameters():
    param.requires_grad = True
backbone2.fc = nn.Linear(backbone2.fc.in_features, N_CLASSES)
for param in backbone2.fc.parameters():
    param.requires_grad = True
trainable2 = sum(p.numel() for p in backbone2.parameters() if p.requires_grad)
print(f"Partial fine-tuning: {trainable2:,} / {total:,} parameters trainable")

# ── Train feature extraction version ──────────────────────────────────
# In production, resize inputs to 224×224 — the size ResNet was trained on.
# ResNet's built-in AdaptiveAvgPool2d lets our 64×64 images pass through unchanged.
model = backbone
criterion = nn.CrossEntropyLoss()
# Only pass trainable parameters to optimiser — good practice
optimizer = optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-3, weight_decay=0.01,
)

print("\nFine-tuning classification head only:")
for epoch in range(1, 11):
    model.train()
    for Xb, yb in train_ld:
        optimizer.zero_grad()
        criterion(model(Xb), yb).backward()
        optimizer.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for Xb, yb in val_ld:
            correct += (model(Xb).argmax(1) == yb).sum().item()
    if epoch % 2 == 0:
        print(f"  Epoch {epoch:2d}: val acc = {correct/len(val_ds):.4f}")

# ── Differential learning rates — best practice for fine-tuning ───────
# Backbone layers: very small lr (they are already good)
# Classification head: normal lr (it is randomly initialised)
backbone3 = models.resnet18(weights=None)
backbone3.fc = nn.Linear(backbone3.fc.in_features, N_CLASSES)

optimizer_diff = optim.AdamW([
    {'params': backbone3.layer1.parameters(), 'lr': 1e-5},
    {'params': backbone3.layer2.parameters(), 'lr': 1e-5},
    {'params': backbone3.layer3.parameters(), 'lr': 1e-4},
    {'params': backbone3.layer4.parameters(), 'lr': 1e-4},
    {'params': backbone3.fc.parameters(),     'lr': 1e-3},   # head: normal lr
], weight_decay=0.01)

print("\nDifferential lr param groups:")
for group in optimizer_diff.param_groups:
    n = sum(p.numel() for p in group['params'])
    print(f"  lr={group['lr']:.0e}  params={n:,}")
Errors you will hit

Every common CNN mistake — explained and fixed

RuntimeError: Expected 4D tensor but got 3D tensor for input
Why it happens

PyTorch Conv2d expects input of shape (batch, channels, H, W) — 4 dimensions. You passed a single image of shape (channels, H, W) — 3 dimensions, missing the batch dimension. This happens when you load one image and pass it directly without adding a batch dimension, or when DataLoader is bypassed.

Fix

Add a batch dimension with unsqueeze: image = image.unsqueeze(0) — converts (3, H, W) to (1, 3, H, W). When using DataLoader this is handled automatically. Also check your Dataset __getitem__ returns (C, H, W) not (H, W, C) — PyTorch uses channels-first format while PIL and numpy use channels-last. Use transforms.ToTensor() to convert PIL images — it handles both the channel order and the 0–255 to 0–1 scaling.
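A minimal reproduction of the fix, using this module's layer sizes (note: recent PyTorch versions also accept an unbatched (C, H, W) input for Conv2d, but most training code still assumes the batch dimension is present):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)

single_image = torch.rand(3, 128, 128)   # one image: (C, H, W), no batch dim

batched = single_image.unsqueeze(0)      # → (1, 3, 128, 128): 4D, batch of one
out = conv(batched)
print(out.shape)                         # torch.Size([1, 32, 128, 128])

pred = out.squeeze(0)                    # drop the batch dim again if needed
print(pred.shape)                        # torch.Size([32, 128, 128])
```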

CNN trains to ~16% accuracy (random) and never improves — all classes predicted equally
Why it happens

The input images are not normalised. Raw pixel values 0–255 are fed directly to the network. The first Conv2d weights are initialised near zero — multiplied by 200-range pixel values, the pre-activations in the first layer are enormous, causing saturated activations, near-zero gradients, and no learning. Also check that your DataLoader is shuffling — if all class 0 images come in the first epoch and all class 1 images in the second, BatchNorm running statistics will be corrupted.

Fix

Always normalise images: divide by 255 to get 0–1 range, then apply per-channel mean/std normalisation. For ImageNet-pretrained models use: transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]). For custom datasets compute mean and std from your training set. Set shuffle=True in DataLoader.

Transfer learning model performs worse than training from scratch
Why it happens

The pretrained model's input normalisation was not applied. ResNet, EfficientNet, and all ImageNet-pretrained models expect inputs normalised with ImageNet mean and std. Without this normalisation the pretrained features are computed from out-of-distribution inputs and produce garbage. Also: learning rate too high during fine-tuning — catastrophic forgetting overwrites pretrained features.

Fix

Always apply ImageNet normalisation when using pretrained models: transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225]). Use a much smaller learning rate for pretrained layers (1e-5) than for the new head (1e-3). Freeze the backbone entirely for the first few epochs, then gradually unfreeze from the later layers backward.

CUDA out of memory during training — RuntimeError: CUDA out of memory
Why it happens

Batch size is too large for the available GPU memory. A single 224×224 RGB image in float32 takes 224×224×3×4 = 600KB. A batch of 64 images = 38MB, plus the network activations (which are stored for backprop) can multiply this by 5–10×. A ResNet50 with batch size 64 needs ~8GB VRAM.

Fix

Reduce batch_size — halving it roughly halves memory usage. Use gradient accumulation to simulate large batches: accumulate gradients over N steps before calling optimizer.step(). Use torch.cuda.empty_cache() between epochs. For inference use with torch.no_grad() — this does not store activations and uses much less memory. Switch to mixed precision training: torch.cuda.amp.autocast() halves memory with float16.
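Gradient accumulation can be sketched like this (illustrative: a tiny linear model and random micro-batches stand in for the real CNN and DataLoader):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model     = nn.Linear(10, 2)            # stand-in for the CNN
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

ACCUM_STEPS = 4   # effective batch = micro-batch size * ACCUM_STEPS

w0 = model.weight.detach().clone()      # snapshot to confirm updates happened

optimizer.zero_grad()
for step in range(8):
    Xb = torch.randn(16, 10)            # micro-batch small enough to fit in memory
    yb = torch.randint(0, 2, (16,))
    loss = criterion(model(Xb), yb) / ACCUM_STEPS   # scale so gradients average
    loss.backward()                     # .grad accumulates across backward() calls
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                # one update per 4 micro-batches
        optimizer.zero_grad()

print("2 optimiser steps over 8 micro-batches")
```

Dividing the loss by ACCUM_STEPS makes the accumulated gradient equal the average over the micro-batches, so the update matches what one large batch would produce (up to BatchNorm statistics, which still see only the micro-batch).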

What comes next

You can classify images. Next: model sequences — text, time series, audio.

CNNs exploit spatial structure in images. But many real-world problems involve sequences — a sentence is a sequence of words, a stock price is a sequence of daily values, a user session is a sequence of actions. Sequences have temporal structure: what came earlier affects what comes later. CNNs treat every position independently and cannot model this dependency. Module 47 covers RNNs and LSTMs — architectures designed specifically to process sequences by maintaining a hidden state that carries information forward across time steps.

Next — Module 47 · Deep Learning
RNNs and LSTMs — Sequence Modelling

Hidden states, vanishing gradients across time, and how LSTMs use gates to selectively remember and forget.

coming soon

🎯 Key Takeaways

  • CNNs solve two fundamental problems with MLPs on images: too many parameters (a 224×224 image needs 150k inputs × hidden units) and no spatial invariance (an MLP memorises pixel positions, not visual patterns). Convolutional filters slide across the image reusing the same weights everywhere — weight sharing dramatically reduces parameters and gives translation invariance.
  • A convolutional layer applies multiple small filters (typically 3×3) across the input, producing one feature map per filter. Output size = (H + 2×padding − kernel) / stride + 1. padding=1 with a 3×3 filter keeps spatial dimensions unchanged. MaxPool2d(2,2) halves H and W, keeping the strongest activation in each 2×2 region.
  • A complete CNN stacks: Conv+BN+ReLU blocks (feature extraction) → MaxPool (spatial reduction) → AdaptiveAvgPool (fixed output size) → Flatten → FC layers (classification). BatchNorm2d is placed after Conv2d and before ReLU for stable training.
  • Data augmentation is essential for CNN training — random flips, rotations, colour jitter, and random erasing artificially expand the dataset and teach the network that these variations do not change the category. Apply augmentation only during training, never at validation or test time.
  • Transfer learning is how all production image classifiers are built. Take a model pretrained on ImageNet, replace the final FC layer with one matching your number of classes, and fine-tune. Use differential learning rates: very small (1e-5) for pretrained backbone layers, normal (1e-3) for the new head. Always apply ImageNet normalisation when using pretrained models.
  • The four CNN gotchas: input must be (batch, channels, H, W) — use unsqueeze(0) for single images. Always normalise pixel values to 0–1 before training. When using pretrained models always apply ImageNet mean/std normalisation. Reduce batch size or use gradient accumulation when hitting CUDA OOM errors.