Object Detection — YOLO and Feature Pyramids
Anchor boxes, IoU, non-maximum suppression, and why YOLO became the production standard for real-time detection. Built from concepts to code.
Classification asks: what is in this image? Object detection asks: what is in this image, where exactly is it, and how many of them are there? That extra "where" changes everything about the architecture.
Meesho's quality control system needs to detect defective stitching in product photos — not just classify "defective" vs "good" but locate exactly where the defect is. Swiggy's delivery verification system detects and reads the order number on the packaging. A traffic camera system counts vehicles and detects lane violations. All require object detection — bounding boxes around every object of interest in the image.
The output of a detection model is a list of bounding boxes. Each box has four coordinates (x_center, y_center, width, height), a confidence score (how sure is the model an object is here), and a class label (what kind of object). One image can produce dozens of boxes — one per detected object instance.
Classification is multiple choice: "Is this a cat, dog, or bird?" Object detection is open-ended search: "Find every animal in this photo, draw a box around each one, and tell me what kind it is." The search problem is fundamentally harder — the model must check every possible location and scale simultaneously.
YOLO's insight: instead of searching sequentially, divide the image into a grid and have every grid cell predict boxes simultaneously. One forward pass, all detections at once. Fast enough for real-time video at 30+ frames per second.
Bounding boxes and IoU — how detection accuracy is measured
A bounding box is defined by four numbers. Two formats are common: corner format (x_min, y_min, x_max, y_max) and centre format (x_center, y_center, width, height). YOLO uses centre format. COCO annotations use corner format. Always know which format your data and model expect.
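Converting between the two formats is simple arithmetic. A minimal sketch (function names are illustrative, not from any library):

```python
def corner_to_centre(x_min, y_min, x_max, y_max):
    """Corner format (COCO-style) -> centre format (YOLO-style)."""
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)

def centre_to_corner(x_c, y_c, w, h):
    """Centre format -> corner format."""
    return (x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)

# A 100x200-pixel box with its top-left corner at (50, 40):
print(corner_to_centre(50, 40, 150, 240))        # (100.0, 140.0, 100, 200)
print(centre_to_corner(100.0, 140.0, 100, 200))  # (50.0, 40.0, 150.0, 240.0)
```

Mixing up the two formats silently produces garbage boxes, so convert explicitly at the boundary between your data pipeline and the model.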
Intersection over Union (IoU) measures how well a predicted box matches the ground truth box. It is the area of overlap divided by the area of the union. IoU = 1.0 is a perfect match. IoU = 0.0 means no overlap. The standard threshold is IoU ≥ 0.5 for a detection to count as correct (True Positive).
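The IoU definition above translates directly into code. A minimal sketch for corner-format boxes:

```python
def iou(box_a, box_b):
    """IoU for two boxes in corner format (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0 — perfect match
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 — no overlap
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # 0.333... — half-width shift
```

Note the `max(0, ...)` clamp: without it, non-overlapping boxes would produce a negative "intersection" and a nonsense IoU.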
Non-Maximum Suppression — keep the best box, discard the rest
A detection model produces hundreds of candidate boxes. Multiple boxes often overlap on the same object — the model detected the same car 15 times at slightly different positions. Non-Maximum Suppression (NMS) removes duplicates: keep the box with the highest confidence score, remove all other boxes that overlap it significantly (IoU > threshold). Repeat for the remaining boxes.
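The greedy loop described above can be sketched in a few lines (pure-Python for clarity; production code uses vectorised versions such as torchvision's NMS):

```python
def iou(a, b):
    """IoU for two corner-format boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS. Returns indices of the boxes to keep."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-confidence box survives
        keep.append(best)
        # Discard remaining boxes that overlap the survivor too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Two boxes on the same car plus one distant box:
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] — the near-duplicate at index 1 is suppressed
```

NMS runs per class: a car box should never suppress a pedestrian box just because they overlap.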
The NMS IoU threshold controls how aggressively duplicates are removed:
- High IoU threshold (e.g. 0.7): permissive — more boxes survive, duplicates may slip through
- Low IoU threshold (e.g. 0.3): aggressive — fewer boxes, but legitimate nearby objects may be suppressed
A default of 0.45 balances the two.
YOLO — You Only Look Once — grid-based detection in a single pass
YOLO divides the input image into an S×S grid. Each grid cell predicts B bounding boxes and their confidence scores, plus C class probabilities. All predictions are made simultaneously in one forward pass — hence "You Only Look Once." This makes YOLO dramatically faster than two-stage detectors (like Faster R-CNN) which first propose regions then classify them.
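The size of the prediction tensor follows directly from S, B, and C. Using the original YOLOv1 configuration on PASCAL VOC (S=7, B=2, C=20):

```python
# YOLOv1's published configuration: 7x7 grid, 2 boxes per cell, 20 classes.
S, B, C = 7, 2, 20
per_cell = B * 5 + C          # each box: 4 coordinates + 1 confidence; plus C class probs
output_shape = (S, S, per_cell)
print(output_shape)           # (7, 7, 30)
print(S * S * per_cell)       # 1470 numbers from a single forward pass
```

Later versions change the details (three grid scales from v3, anchor-free heads in v8), but the core idea — a fixed grid of simultaneous predictions — is unchanged.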
YOLO version history — what each version improved
YOLOv8 with Ultralytics — inference, fine-tuning, and deployment
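Fine-tuning with Ultralytics starts with labels in YOLO format: one `.txt` file per image, one line per box, with centre-format coordinates normalised to 0–1 by the image dimensions. A minimal sketch of that conversion (the helper name is illustrative; the training call it prepares for is the Ultralytics `model.train(data=..., epochs=...)` API):

```python
def yolo_label_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """One line of a YOLO .txt label file: class plus normalised centre-format box."""
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A defect box at pixels (100, 200)-(300, 400) in a 640x640 product photo:
print(yolo_label_line(0, 100, 200, 300, 400, 640, 640))
# -> "0 0.312500 0.468750 0.312500 0.312500"

# Once labels mirror the images folder (same filenames, .txt extension) and
# dataset.yaml points at them, training is:
#   from ultralytics import YOLO
#   YOLO("yolov8n.pt").train(data="dataset.yaml", epochs=50)
```

Forgetting to normalise — writing pixel coordinates into the label file — is one of the most common causes of a model that trains without errors but detects nothing.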
mAP — mean Average Precision — the standard detection metric
Accuracy does not apply to detection. The standard metric is mAP (mean Average Precision). For each class, you compute Average Precision (AP) — the area under the precision-recall curve across all confidence thresholds. mAP is the mean AP across all classes. mAP@0.5 uses IoU ≥ 0.5 as the threshold for a True Positive. mAP@0.5:0.95 averages mAP across IoU thresholds from 0.5 to 0.95 — the COCO standard, much harder.
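A simplified sketch of per-class AP (raw area under the precision-recall curve, without the interpolation COCO applies; detections are assumed pre-sorted by descending confidence and pre-matched to ground truth at IoU ≥ 0.5):

```python
def average_precision(tp_flags, n_ground_truth):
    """AP for one class. tp_flags[i] is True if detection i (sorted by
    descending confidence) matched an unclaimed ground-truth box."""
    tp = fp = 0
    ap = 0.0
    for flag in tp_flags:
        if flag:
            tp += 1
            precision = tp / (tp + fp)
            ap += precision / n_ground_truth  # area added at each recall step
        else:
            fp += 1
    return ap

# 4 detections for a class with 3 ground-truth boxes: TP, TP, FP, TP
print(average_precision([True, True, False, True], n_ground_truth=3))
# -> 0.9166... (1/3 * (1.0 + 1.0 + 0.75))
```

mAP is then the mean of these per-class AP values; mAP@0.5:0.95 repeats the whole computation at ten IoU thresholds and averages again.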
Every common object detection mistake — explained and fixed
You can detect objects. Next: label every pixel in the image.
Object detection draws bounding boxes — rectangular approximations of object locations. Semantic segmentation goes further: it assigns a class label to every single pixel in the image. Instead of "there is a person at coordinates (100, 200)–(180, 400)" it produces "every pixel that belongs to a person, exactly." Module 58 covers U-Net — the architecture that powers medical image segmentation — and how skip connections preserve fine spatial detail lost during downsampling.
U-Net architecture, skip connections, and how segmentation powers medical imaging and autonomous vehicles.
🎯 Key Takeaways
- ✓ Object detection predicts bounding boxes (location + size), confidence scores, and class labels for every object in an image — not just one label for the whole image. Output format: [x_center, y_center, width, height, confidence, class_probs] per predicted box.
- ✓ IoU (Intersection over Union) measures overlap between predicted and ground truth boxes. IoU = intersection_area / union_area. IoU ≥ 0.5 is the standard threshold for a True Positive. IoU is used for both NMS (removing duplicates) and mAP evaluation (judging correctness).
- ✓ Non-Maximum Suppression (NMS) removes duplicate detections. Sort by confidence → keep highest box → remove all overlapping boxes with IoU above threshold → repeat. Use iou_threshold=0.45 as default. Too high: duplicates survive. Too low: legitimate nearby objects suppressed.
- ✓ YOLO divides the image into an S×S grid. Each cell predicts B bounding boxes simultaneously in one forward pass. Multi-scale detection (YOLOv3+) uses three grid sizes to detect small, medium, and large objects. YOLOv8 (anchor-free) is the current production standard — use the Ultralytics library.
- ✓ mAP (mean Average Precision) is the standard detection metric. mAP@0.5 uses IoU≥0.5 as the TP threshold. mAP@0.5:0.95 averages across IoU thresholds 0.5 to 0.95 — the harder COCO standard. Typical production targets: mAP@0.5 > 0.7 for good detection, > 0.85 for excellent.
- ✓ Fine-tuning YOLOv8: prepare YOLO-format labels (normalised coordinates 0–1), create dataset.yaml, call model.train(data=yaml, epochs=50). All box coordinates must be normalised by image dimensions. The labels folder must mirror the images folder structure with identical filenames but the .txt extension.