Neural Networks from Scratch
Forward pass, backpropagation, and gradient descent built in NumPy before touching PyTorch. The foundation every deep learning framework is built on.
Every algorithm in Section 5 required you to hand-craft features. A neural network learns its own features directly from raw data. That is the entire revolution.
When Swiggy wants to predict delivery time, you manually build features: distance, traffic score, restaurant prep time, time of day. You encode your domain knowledge into numbers. The model learns relationships between those numbers and the target. The quality of your model is bounded by the quality of your features.
Now imagine Swiggy wants to detect damaged packaging from a photo. What features do you hand-craft from an image? Pixel brightness? Edge patterns? Colour distributions? You do not know which pixel combinations indicate damage. A neural network does not need you to know. It learns the relevant features — edges, shapes, textures — directly from thousands of labelled photos. The layers of a network are a hierarchy of learned feature detectors, going from raw pixels to abstract concepts without any human guidance.
This module builds a neural network from scratch in NumPy — no PyTorch, no TensorFlow. Every operation is explicit. You will understand exactly what a forward pass does, what backpropagation computes, and why gradient descent works. After this module, PyTorch becomes obvious — it automates exactly what you will code by hand here.
Imagine teaching a child to recognise cats. You could write down rules: "furry, four legs, pointed ears, whiskers." That is classical ML — hand-crafted features. Or you could show the child 10,000 photos of cats and non-cats and let them figure out the pattern themselves. They learn features you never named — the specific curve of an ear, the texture of fur, the shape of eyes. That is a neural network.
The child's brain adjusts internal connections after each photo — strengthening what was right, weakening what was wrong. A neural network does exactly this: adjust weights after each prediction based on how wrong it was. That adjustment process is backpropagation. The rule for how much to adjust is gradient descent.
One neuron — weighted sum plus activation
A single neuron does two things. First it computes a weighted sum of its inputs — each input multiplied by a weight, all added together, plus a bias term. Then it applies an activation function to that sum — a non-linear transformation that lets the network learn non-linear patterns. Without activation functions, stacking many neurons would still only produce a linear model.
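The two steps can be sketched directly in NumPy (the input, weights, and bias below are illustrative values, not anything the module prescribes):

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: weighted sum of inputs plus bias, then a ReLU activation."""
    z = np.dot(w, x) + b        # weighted sum: sum(w_i * x_i) + b
    return np.maximum(0.0, z)   # ReLU activation

# Example: 3 inputs with hand-picked weights
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 0.1
print(neuron(x, w, b))  # 0.5 - 0.4 + 0.3 + 0.1 = 0.5
```

Swap `np.maximum` for any other activation and the structure stays the same: the non-linearity is applied after the weighted sum.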
Activation functions — why they matter and which to use
- **ReLU**: outputs zero for negative z, z for positive. Simple, fast, does not saturate for positive values. Default choice for hidden layers.
- **Sigmoid**: squashes output to (0, 1). Interpretable as a probability. Saturates at both ends — gradients vanish for large |z|.
- **Softmax**: converts a vector of scores to probabilities summing to 1. Each output is the probability of that class.
- **Tanh**: squashes to (−1, 1). Zero-centred — better gradient flow than sigmoid. Still saturates for large |z|.
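A minimal NumPy sketch of the four activations described above (the max-subtraction in softmax is a standard numerical-stability trick, not required by the math):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z))      # [0. 0. 2.]
print(sigmoid(z))   # each value in (0, 1)
print(softmax(z))   # probabilities summing to 1
```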
The forward pass — data flows through layers, one matrix multiply at a time
A neural network is multiple neurons stacked into layers. Every neuron in one layer connects to every neuron in the next — a fully connected (dense) layer. The forward pass computes a prediction by passing data from the input layer through each hidden layer to the output layer. Each layer is one matrix multiplication plus an activation.
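A possible forward pass for a small three-layer network, following the matrix-multiply-plus-activation pattern above (the layer sizes — 4 inputs, hidden layers of 8 and 5, one output — are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 input features -> 8 -> 5 -> 1 output
W1, b1 = rng.normal(0, 0.1, (4, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 0.1, (8, 5)), np.zeros(5)
W3, b3 = rng.normal(0, 0.1, (5, 1)), np.zeros(1)

def relu(z):
    return np.maximum(0.0, z)

def forward(X):
    """Each layer: one matrix multiply plus bias, then an activation."""
    Z1 = X @ W1 + b1; A1 = relu(Z1)
    Z2 = A1 @ W2 + b2; A2 = relu(Z2)
    A3 = A2 @ W3 + b3                 # linear output, suitable for regression
    return A3

X = rng.normal(size=(32, 4))   # batch of 32 samples flows through at once
print(forward(X).shape)        # (32, 1)
```

Because the inputs are stacked as rows of a matrix, one forward pass computes predictions for the whole batch simultaneously.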
Backpropagation — the chain rule applied backwards through the network
The forward pass produces a prediction. The prediction is wrong. We compute the loss — how wrong it is. Now we need to know: how should each weight change to make the prediction less wrong? The answer is the gradient of the loss with respect to each weight — ∂Loss/∂W.
Backpropagation computes these gradients efficiently using the chain rule from calculus. The loss depends on the output, which depends on layer 3, which depends on layer 2, which depends on layer 1, which depends on the weights. Backprop unrolls this chain from output back to input — hence "backward" propagation.
MSE loss for regression: L = mean((y_pred − y_true)²). The gradient of MSE with respect to the prediction is: ∂L/∂A3 = 2 × (y_pred − y_true) / n
∂L/∂W3 = A2ᵀ @ ∂L/∂A3 — how much does the output layer weight contribute to the loss? ∂L/∂A2 = ∂L/∂A3 @ W3ᵀ — how much does the signal from layer 2 contribute?
ReLU kills gradients for negative pre-activations. ∂L/∂Z2 = ∂L/∂A2 × relu_derivative(Z2) — element-wise multiply. Zero where Z2 was negative, pass-through where Z2 was positive.
Apply the same pattern for layer 1. Each layer produces two gradients: one for its weights (∂L/∂W) and one to pass backward (∂L/∂A_prev).
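Putting the gradient formulas above together, a sketch of the full backward pass for a three-layer regression network (same W/A/Z naming as above; `forward` is repeated here so it can cache the intermediates `backward` needs, and all sizes are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    return (z > 0).astype(z.dtype)

def forward(X, params):
    W1, b1, W2, b2, W3, b3 = params
    Z1 = X @ W1 + b1; A1 = relu(Z1)
    Z2 = A1 @ W2 + b2; A2 = relu(Z2)
    A3 = A2 @ W3 + b3                       # linear output for regression
    return A3, (X, Z1, A1, Z2, A2)

def backward(A3, y_true, params, cache):
    """Chain rule from the output back to the input, mirroring the formulas above."""
    W1, b1, W2, b2, W3, b3 = params
    X, Z1, A1, Z2, A2 = cache
    n = y_true.shape[0]

    dA3 = 2.0 * (A3 - y_true) / n           # dL/dA3 for MSE loss
    dW3 = A2.T @ dA3; db3 = dA3.sum(axis=0)
    dA2 = dA3 @ W3.T                        # gradient passed backward to layer 2
    dZ2 = dA2 * relu_derivative(Z2)         # ReLU kills negative pre-activations
    dW2 = A1.T @ dZ2; db2 = dZ2.sum(axis=0)
    dA1 = dZ2 @ W2.T                        # same pattern for layer 1
    dZ1 = dA1 * relu_derivative(Z1)
    dW1 = X.T @ dZ1; db1 = dZ1.sum(axis=0)
    return dW1, db1, dW2, db2, dW3, db3

rng = np.random.default_rng(0)
params = (rng.normal(0, 0.1, (4, 8)), np.zeros(8),
          rng.normal(0, 0.1, (8, 5)), np.zeros(5),
          rng.normal(0, 0.1, (5, 1)), np.zeros(1))
X = rng.normal(size=(16, 4))
y = rng.normal(size=(16, 1))
A3, cache = forward(X, params)
grads = backward(A3, y, params, cache)
# Each gradient has the same shape as the parameter it will update
```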
Gradient descent — update weights, repeat until convergence
Backpropagation computes the direction of steepest increase in the loss. Gradient descent moves weights in the opposite direction — subtracting a fraction of the gradient called the learning rate. One forward pass + one backward pass + one weight update = one training step. Repeat over the entire dataset many times (epochs) until the loss converges.
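The update rule itself is one line per parameter. A toy example, minimising a single-parameter quadratic whose gradient is known in closed form (the function and learning rate are illustrative), shows the loop converging:

```python
import numpy as np

def sgd_update(params, grads, lr=0.01):
    """Move each parameter a small step against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

# Toy problem: minimise L(w) = (w - 3)^2, whose gradient is 2(w - 3)
w = np.array([0.0])
for _ in range(200):
    grad = 2.0 * (w - 3.0)
    (w,) = sgd_update([w], [grad], lr=0.1)
print(w)  # converges toward 3.0, the minimum of the quadratic
```

In the real network, `params` and `grads` would be the weight matrices and the gradients returned by the backward pass; the loop body stays identical.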
Three variants of gradient descent
- **Batch gradient descent**: compute the gradient on the entire dataset per step. Exact gradient. Slow on large datasets. Rarely used in deep learning.
- **Stochastic gradient descent (SGD)**: compute the gradient on one sample per step. Very noisy — the gradient direction jumps randomly. Can escape local minima. Very fast per step.
- **Mini-batch gradient descent**: compute the gradient on a batch of 32–256 samples. Best of both — stable enough to converge, fast enough for large datasets. What every deep learning framework uses by default.
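One way to produce shuffled mini-batches each epoch (the batch size and array shapes are illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size=32, rng=None):
    """Shuffle the sample order, then yield batches of batch_size samples."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.permutation(len(X))            # fresh shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

X = np.arange(200, dtype=float).reshape(100, 2)
y = np.zeros((100, 1))
n_batches = sum(1 for _ in minibatches(X, y, batch_size=32))
print(n_batches)  # 100 samples / 32 per batch -> 4 batches (the last one smaller)
```

Reshuffling each epoch is what prevents the model from memorising the sample order.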
The same network in PyTorch — autograd handles backpropagation for you
Everything you just coded by hand — forward pass, loss computation, backward pass, weight updates — PyTorch automates with one call to loss.backward(). Its autograd engine traces all operations in the forward pass and automatically computes gradients for every parameter. The code becomes dramatically shorter without changing what happens.
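A sketch of an equivalent network in PyTorch, with layer sizes assumed to match the illustrative from-scratch version:

```python
import torch
import torch.nn as nn

# Same shape of architecture as the from-scratch sketch (sizes illustrative)
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 5), nn.ReLU(),
    nn.Linear(5, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(32, 4)
y = torch.randn(32, 1)

for _ in range(100):
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(X), y)   # forward pass + loss
    loss.backward()               # autograd computes every dLoss/dW
    optimizer.step()              # gradient descent weight update
```

The four lines in the loop body correspond one-to-one to the steps coded by hand above: reset, forward, backward, update.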
You built a neural network from scratch. Now: make it train faster and better.
The network you just built works — but plain SGD is the slowest, least reliable optimizer available. Module 41 covers the training techniques that make modern deep learning practical: Adam optimizer (adaptive learning rates per parameter), batch normalisation (stabilise activations between layers), dropout (prevent overfitting), and learning rate schedules (reduce lr as training progresses). These four techniques take a network from "trains but slowly" to "trains fast and generalises well."
The four techniques that separate a network that trains from one that trains well. Used in every production deep learning system.
🎯 Key Takeaways
- ✓ A neural network learns its own features from raw data — stacked layers of weighted sums followed by non-linear activations. No manual feature engineering needed. Each layer learns increasingly abstract representations.
- ✓ One neuron: z = Σ(wᵢxᵢ) + b, a = activation(z). One layer: Z = X @ W + b, A = activation(Z). Matrix multiplication makes the computation efficient for batches of samples simultaneously.
- ✓ Use ReLU (max(0, z)) as the default activation for hidden layers. It does not saturate for positive values, is fast to compute, and produces sparse activations. Use sigmoid only at the output for binary classification, softmax for multi-class, linear for regression.
- ✓ Backpropagation applies the chain rule backwards through the network to compute ∂Loss/∂W for every weight. Each layer produces two gradients: one to update its own weights and one to pass backward to the previous layer. A gradient check (compare analytical vs numerical gradients) verifies correctness.
- ✓ Mini-batch gradient descent — process batches of 32–256 samples per update — is the correct trade-off between noisy single-sample updates and slow full-dataset updates. Shuffle data each epoch to prevent the model from memorising the order.
- ✓ PyTorch automates backpropagation via autograd. loss.backward() computes all gradients, optimizer.step() applies them. The from-scratch implementation is identical in logic — PyTorch just removes the manual gradient code so you can focus on architecture design.