To fully understand backpropagation, we need to trace how gradients flow through a network, which comes down to calculus.
Derivatives and Chain Rule
The chain rule is key to computing gradients in neural networks:
∂L/∂x = ∂L/∂z × ∂z/∂x
where L is the loss, z is an intermediate variable, and x is a weight or activation.
In neural networks, we apply this recursively through layers.
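To make this concrete, here is a minimal scalar sketch of the chain rule (the tiny model z = wx with loss L = (z − y)² and all numeric values are made up for illustration): the chain-rule product ∂L/∂z × ∂z/∂x should agree with a finite-difference estimate.

```python
# Minimal chain-rule sketch: z = w * x, L = (z - y)^2,
# so dL/dx = dL/dz * dz/dx = 2 * (z - y) * w.
def chain_rule_demo(w=3.0, x=2.0, y=1.0):
    z = w * x                # intermediate variable
    dL_dz = 2 * (z - y)      # derivative of the squared error w.r.t. z
    dz_dx = w                # derivative of z = w * x w.r.t. x
    dL_dx = dL_dz * dz_dx    # chain rule: multiply the local derivatives

    # Finite-difference check: (L(x+h) - L(x-h)) / (2h) should match dL_dx.
    h = 1e-6
    L = lambda x_: (w * x_ - y) ** 2
    numeric = (L(x + h) - L(x - h)) / (2 * h)
    return dL_dx, numeric

print(chain_rule_demo())  # the two values should agree to several decimal places
```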
Example: Two-layer Network
Let's say we have:
z₁ = w₁x + b₁ → a₁ = ReLU(z₁)
z₂ = w₂a₁ + b₂ → ŷ = sigmoid(z₂)
Loss: L = MSE(ŷ, y)
To compute the gradient of L with respect to w₁, we multiply the local derivatives along the chain (a worked sketch follows the list):
- ∂L/∂ŷ
- ∂ŷ/∂z₂
- ∂z₂/∂a₁
- ∂a₁/∂z₁
- ∂z₁/∂w₁
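The gradient ∂L/∂w₁ is the product of these five factors. The sketch below computes them for the two-layer network above and checks the product against a finite-difference estimate; the parameter values are arbitrary, the helper names (forward, grad_w1) are made up for this example, and MSE is taken as (ŷ − y)² for a single sample.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def forward(w1, b1, w2, b2, x, y):
    z1 = w1 * x + b1
    a1 = max(0.0, z1)          # ReLU
    z2 = w2 * a1 + b2
    y_hat = sigmoid(z2)
    loss = (y_hat - y) ** 2    # MSE for a single example
    return z1, a1, z2, y_hat, loss

def grad_w1(w1=0.5, b1=0.1, w2=-0.3, b2=0.2, x=1.5, y=1.0):
    z1, a1, z2, y_hat, _ = forward(w1, b1, w2, b2, x, y)

    dL_dyhat  = 2 * (y_hat - y)          # ∂L/∂ŷ
    dyhat_dz2 = y_hat * (1 - y_hat)      # ∂ŷ/∂z₂ (sigmoid derivative)
    dz2_da1   = w2                       # ∂z₂/∂a₁
    da1_dz1   = 1.0 if z1 > 0 else 0.0   # ∂a₁/∂z₁ (ReLU derivative)
    dz1_dw1   = x                        # ∂z₁/∂w₁

    dL_dw1 = dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

    # Finite-difference check on w1.
    h = 1e-6
    loss_plus  = forward(w1 + h, b1, w2, b2, x, y)[-1]
    loss_minus = forward(w1 - h, b1, w2, b2, x, y)[-1]
    numeric = (loss_plus - loss_minus) / (2 * h)
    return dL_dw1, numeric

print(grad_w1())  # analytic and numeric gradients should match closely
```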
Matrix Form
In vectorized networks, backpropagation uses matrix calculus. The same chain rule ∂L/∂W = ∂L/∂Z × ∂Z/∂W becomes, for a linear layer Z = WA, the concrete expression ∂L/∂W = (∂L/∂Z) Aᵀ.
Frameworks like PyTorch or TensorFlow handle this automatically using autograd.
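As a rough illustration (assuming PyTorch is installed; the shapes, random values, and MSE loss are made up for this sketch), the manual matrix-form gradient (∂L/∂Z) Aᵀ for a single linear layer can be checked against what autograd computes:

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3, requires_grad=True)  # weight matrix (out_dim x in_dim)
A = torch.randn(3, 5)                      # activations from the previous layer (in_dim x batch)
Y = torch.randn(4, 5)                      # targets, just to get a concrete loss

Z = W @ A
loss = ((Z - Y) ** 2).mean()               # MSE loss
loss.backward()                            # autograd fills W.grad

# Manual matrix-calculus version of the same gradient:
# dL/dZ for the mean squared error, then dL/dW = (dL/dZ) @ A^T.
dL_dZ = 2 * (Z - Y) / Z.numel()
dL_dW_manual = dL_dZ @ A.T

print(torch.allclose(W.grad, dL_dW_manual, atol=1e-6))  # expected: True
```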
Intuition
Backprop tells us how a small change in a weight affects the final loss.
The goal of gradient descent is to follow the negative gradient to minimize loss.
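A minimal gradient-descent sketch on a one-parameter model ŷ = wx with squared-error loss (the data point, starting weight, and learning rate are all illustrative) shows the weight sliding toward the value that minimizes the loss:

```python
def gradient_descent(x=2.0, y=6.0, w=0.0, lr=0.1, steps=20):
    for _ in range(steps):
        y_hat = w * x
        grad = 2 * (y_hat - y) * x   # dL/dw for L = (y_hat - y)^2
        w -= lr * grad               # step in the negative gradient direction
    return w

print(gradient_descent())  # converges toward w = 3, where the loss is zero
```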
Gradient Flow Issues
- Vanishing gradients: sigmoid and tanh saturate, so their local derivatives are small (at most 0.25 for sigmoid) and their product across many layers shrinks toward zero (see the sketch below)
- Exploding gradients: repeated multiplication by large local derivatives or weights can make gradients blow up in deep networks
- Solutions: ReLU activations, normalization layers such as LayerNorm, and residual connections
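The sketch below shows why sigmoid invites vanishing gradients: backprop multiplies local derivatives layer by layer, and since sigmoid′(z) is at most 0.25, a chain of such factors shrinks geometrically with depth. The depth of 30 and the input of 0 are arbitrary choices for illustration.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1 - s)

depth = 30
grad_sigmoid = 1.0
grad_relu = 1.0
for _ in range(depth):
    grad_sigmoid *= sigmoid_derivative(0.0)  # 0.25, the best case for sigmoid
    grad_relu *= 1.0                         # ReLU derivative in its active region

print(f"after {depth} layers: sigmoid path ~ {grad_sigmoid:.2e}, ReLU path ~ {grad_relu:.1f}")
```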
Visualization
Picture the computation graph: the error signal starts at the loss and flows backward through the layers, and the gradient at each node accumulates the product of the local derivatives along its path.
Self-Check
- What does the chain rule allow us to compute?
- What causes vanishing gradients?
- How do frameworks calculate gradients?