Image Fundamentals — Pixels, Channels and Tensors
How computers see images. Pixel values, colour channels, image tensors, normalisation, and the preprocessing pipeline every vision model expects.
You see a photo of a Meesho kurta. A computer sees a 3D array of integers — height × width × channels, each value between 0 and 255. Everything in computer vision starts from this representation.
A digital image is a grid of pixels. Each pixel is a tiny square of colour. For a 224×224 RGB image there are 50,176 pixels. Each pixel has three values — one for red intensity, one for green, one for blue — each ranging from 0 (none) to 255 (full intensity). The full image is represented as a 3D array: 224 rows × 224 columns × 3 channels = 150,528 numbers.
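In NumPy-style code this representation is literally a 3D array (a minimal sketch — the image here is synthetic, not a loaded photo):

```python
import numpy as np

# A 224x224 RGB image: a 3D array of unsigned 8-bit integers.
# Shape is (height, width, channels); every value lies in [0, 255].
image = np.zeros((224, 224, 3), dtype=np.uint8)

image[0, 0] = [255, 0, 0]   # top-left pixel: pure red

print(image.shape)          # (224, 224, 3)
print(image.size)           # 150528 numbers in total
print(image[0, 0])          # [255   0   0]
```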
Every computer vision operation — loading, resizing, cropping, normalising, augmenting — is a transformation of this array. Every vision model — CNN, ViT, CLIP — takes this array as input. Getting the array into exactly the right shape, dtype, and value range before the model sees it is the preprocessing pipeline. Most computer vision bugs are preprocessing bugs — wrong channel order, wrong value range, wrong normalisation statistics.
Think of a crossword puzzle grid. Each cell has a letter. The grid has rows, columns, and one layer of content. An image is like three crossword grids stacked on top of each other — one grid for red intensity, one for green, one for blue. Each cell has a number 0–255 instead of a letter. Read all three grids together and you reconstruct the colour.
PyTorch uses channels-first format: (channels, height, width). Pillow and OpenCV use channels-last: (height, width, channels). Mixing these two formats silently produces garbage predictions — this is the single most common computer vision bug.
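The conversion between the two layouts is a pure axis permutation — no pixel value changes, only how the array is indexed. A sketch with NumPy's `transpose` (in PyTorch the equivalent call is `tensor.permute(2, 0, 1)`):

```python
import numpy as np

# PIL/OpenCV layout: (height, width, channels)
hwc = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# PyTorch layout: move channels to the front.
chw = hwc.transpose(2, 0, 1)    # (3, 224, 224)

# And back again, e.g. for display with matplotlib:
back = chw.transpose(1, 2, 0)   # (224, 224, 3)

assert chw.shape == (3, 224, 224)
assert np.array_equal(back, hwc)
```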
Pixels, channels, and image tensors — from file to array
Colour spaces — RGB, greyscale, HSV, and when each matters
RGB is the default but not always the best representation. Greyscale (single channel) uses a third of the memory of RGB and speeds up models when colour is not informative — document OCR, X-ray analysis, fingerprint matching. HSV separates hue (colour type), saturation (colour intensity), and value (brightness) — useful for colour-based detection where you want to detect "red objects" regardless of lighting. LAB separates luminance from colour and is approximately perceptually uniform — equal distances in LAB roughly correspond to equal perceived colour differences — and is used in medical imaging.
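As a concrete illustration, RGB to greyscale is just a weighted sum of the three channels. A sketch using the standard BT.601 luminance weights — the same formula PIL's `convert("L")` applies internally:

```python
import numpy as np

def rgb_to_grey(img: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) RGB array into one (H, W) channel using
    the BT.601 luminance weights: grey = 0.299 R + 0.587 G + 0.114 B."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    grey = 0.299 * r + 0.587 * g + 0.114 * b
    return np.rint(grey).astype(np.uint8)

white = np.full((4, 4, 3), 255, dtype=np.uint8)
print(rgb_to_grey(white)[0, 0])   # 255 — pure white stays at full intensity
```

Green gets the largest weight because the human eye is most sensitive to it — greyscale conversion is perceptual, not a plain average.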
Resize, crop, and pad — getting every image to the same shape
Every model expects a fixed input size — ResNet expects 224×224, EfficientNet-B4 expects 380×380, ViT-B/16 expects 224×224. Real-world images come in all sizes and aspect ratios. Transforming them to the required size without distorting the content requires understanding the trade-offs between resize, centre crop, random crop, and padding.
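The standard validation recipe — resize the short side, then cut the centre out — can be sketched in plain NumPy (nearest-neighbour resize for illustration only; real pipelines use bilinear interpolation via torchvision or PIL):

```python
import numpy as np

def resize_short_side(img: np.ndarray, short: int = 256) -> np.ndarray:
    """Resize so the shorter side equals `short`, preserving aspect ratio.
    Nearest-neighbour sampling: pick source rows/columns by index."""
    h, w = img.shape[:2]
    scale = short / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return img[rows[:, None], cols]

def center_crop(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Cut a size x size window out of the middle of the image."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

photo = np.zeros((480, 640, 3), dtype=np.uint8)   # landscape input
resized = resize_short_side(photo, 256)            # (256, 341, 3)
cropped = center_crop(resized, 224)                # (224, 224, 3)
```

Resizing the short side first avoids distortion; the centre crop then discards the overhanging edges of the longer side rather than squashing them.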
Normalisation — why ImageNet statistics are used everywhere and when to recompute them
After converting to float32 and dividing by 255 (values in 0–1), you must normalise each channel to zero mean and unit standard deviation. Without normalisation the network's first layer receives values in [0, 1] — a very different distribution from what the pretrained weights expect. The result: predictions that look random even from a perfectly fine model.
For any model pretrained on ImageNet, use the ImageNet statistics: mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]. These numbers were computed over the entire 1.2M ImageNet training set. For custom datasets (medical images, satellite imagery, product photos) compute your own statistics — ImageNet numbers may be far off.
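Computing your own statistics is one reduction per channel once the images are scaled to [0, 1] (a sketch that holds everything in memory; for large datasets, accumulate running sums over batches instead):

```python
import numpy as np

def channel_stats(images: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-channel mean and std over a stack of images.
    `images`: (N, H, W, 3) float array already scaled to [0, 1]."""
    mean = images.mean(axis=(0, 1, 2))   # average over all pixels of all images
    std = images.std(axis=(0, 1, 2))
    return mean, std

# Hypothetical dataset stand-in: 8 random images.
dataset = np.random.rand(8, 64, 64, 3).astype(np.float32)
mean, std = channel_stats(dataset)
print(mean.shape, std.shape)   # (3,) (3,) — one value per channel
```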
Production image preprocessing pipeline — from file on disk to model input
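A minimal sketch of the core of that pipeline, mirroring what `T.ToTensor()` followed by `T.Normalize()` produce (file decoding and resizing omitted; the input is assumed to be an already-decoded uint8 RGB array):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_u8: np.ndarray) -> np.ndarray:
    """uint8 (H, W, 3) RGB -> normalised float32 (3, H, W) model input."""
    x = img_u8.astype(np.float32) / 255.0    # [0, 255] -> [0, 1], like ToTensor
    x = (x - IMAGENET_MEAN) / IMAGENET_STD   # per-channel normalisation
    return x.transpose(2, 0, 1)              # HWC -> CHW for PyTorch

img = np.zeros((224, 224, 3), dtype=np.uint8)
img[..., 0] = 255                            # a pure-red image
out = preprocess(img)
print(out.shape)                             # (3, 224, 224)
```

A batch for the model is then `np.stack([...])` over several such arrays — shape (N, 3, 224, 224) — exactly what a PyTorch DataLoader delivers.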
Every common image preprocessing mistake — explained and fixed
Images are tensors. Next: multiply your training data without collecting a single new image.
You now understand the complete representation of an image and how to preprocess it for any vision model. The next challenge is data — vision models need thousands of labelled images but collecting and labelling them is expensive. Data augmentation synthetically multiplies your dataset by applying random transformations that preserve the label. Module 56 covers every augmentation technique used in production and explains exactly what each one teaches the model.
Flips, crops, colour jitter, mixup, cutout — and how each one affects what the model learns.
🎯 Key Takeaways
- ✓A digital image is a 3D array: (height, width, channels). RGB images have 3 channels, each pixel value 0–255. PyTorch uses channels-first format (C, H, W). PIL and OpenCV use channels-last (H, W, C). Mixing these formats silently produces wrong results — always permute when converting.
- ✓The standard loading pipeline: PIL Image.open().convert("RGB") → T.ToTensor() divides by 255 and converts to (C, H, W) float32 → T.Normalize() applies per-channel mean/std normalisation. Never skip the convert("RGB") call — PNGs may carry an alpha channel (RGBA, 4 channels) and greyscale images have only 1.
- ✓For any ImageNet-pretrained model always use ImageNet normalisation statistics: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]. Skipping normalisation is the most common reason a pretrained model predicts the same class for every image.
- ✓Training and validation transforms are different. Training: RandomResizedCrop + RandomHorizontalFlip + ColorJitter + ToTensor + Normalize. Validation: Resize(256) + CenterCrop(224) + ToTensor + Normalize. Never apply random operations to the validation set — randomness makes validation metrics non-reproducible between runs.
- ✓Memory scales quadratically with image side length. A batch of 32 RGB images at 224×224 in float32 is 32 × 3 × 224 × 224 × 4 bytes ≈ 19MB. At 384×384 it is ≈57MB; at 512×512 it is ≈101MB. Always check memory requirements before choosing image size. pin_memory=True and num_workers=4 are standard DataLoader settings for GPU training.
- ✓OpenCV loads images as BGR not RGB. Always convert after loading: cv2.cvtColor(img, cv2.COLOR_BGR2RGB). Or use PIL exclusively: Image.open(path).convert("RGB") always gives RGB. Swapped channels degrade colour-sensitive model performance and are notoriously hard to debug because the image looks correct when displayed.