Gradients and Derivatives: Backpropagation Deep Dive
To fully understand backpropagation, we need to trace how gradients flow through a network, which takes a little calculus.
🧮 Derivatives and the Chain Rule
The chain rule is key to computing gradients in neural networks:
∂L/∂x = ∂L/∂z × ∂z/∂x
where L is the loss, z is an intermediate variable, and x is a weight or activation.
In neural networks, we apply this recursively through layers.
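As a concrete illustration, here is a minimal chain-rule check on a single composed function; the function, the values of w, x, and y, and the variable names are purely illustrative, and the analytic gradient is compared against a finite-difference approximation.

```python
import math

# Illustrative composition: z = w * x, then L = (sigmoid(z) - y)^2
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, x, y = 0.8, 1.5, 0.0           # toy values, not from the article

z = w * x
s = sigmoid(z)
L = (s - y) ** 2

# Chain rule: dL/dx = dL/dz * dz/dx
dL_ds = 2 * (s - y)               # derivative of the squared error
ds_dz = s * (1 - s)               # derivative of the sigmoid
dL_dz = dL_ds * ds_dz             # dL/dz through the intermediate s
dz_dx = w                         # derivative of z = w*x with respect to x
dL_dx = dL_dz * dz_dx

# Sanity check against a finite-difference approximation
eps = 1e-6
L_plus = (sigmoid(w * (x + eps)) - y) ** 2
print(dL_dx, (L_plus - L) / eps)  # the two numbers should agree closely
```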
🔁 Example: Two-Layer Network
Let's say we have:
z₁ = w₁x + b₁ → a₁ = ReLU(z₁)
z₂ = w₂a₁ + b₂ → ŷ = sigmoid(z₂)
Loss: L = MSE(ŷ, y)
To compute the gradient of L w.r.t. w₁, multiply the local derivatives along this chain (a worked numeric sketch follows the list):
- ∂L/∂ŷ
- ∂ŷ/∂z₂
- ∂z₂/∂a₁
- ∂a₁/∂z₁
- ∂z₁/∂w₁
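Putting these factors together, the sketch below runs the forward pass with made-up scalar values for x, y, the weights, and the biases (all illustrative), multiplies the five local derivatives to obtain ∂L/∂w₁, and checks the result numerically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, x, y, b1, w2, b2):
    # Forward pass of the two-layer network
    z1 = w1 * x + b1
    a1 = max(0.0, z1)            # ReLU
    z2 = w2 * a1 + b2
    y_hat = sigmoid(z2)
    return (y_hat - y) ** 2      # MSE for a single example

# Toy scalar values (illustrative only)
x, y = 2.0, 1.0
w1, b1 = 0.5, 0.1
w2, b2 = -0.3, 0.2

# Forward pass, keeping intermediates for the backward pass
z1 = w1 * x + b1
a1 = max(0.0, z1)
z2 = w2 * a1 + b2
y_hat = sigmoid(z2)

# The five local derivatives from the list above
dL_dyhat = 2 * (y_hat - y)
dyhat_dz2 = y_hat * (1 - y_hat)        # sigmoid'
dz2_da1 = w2
da1_dz1 = 1.0 if z1 > 0 else 0.0       # ReLU'
dz1_dw1 = x

dL_dw1 = dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Finite-difference check
eps = 1e-6
numeric = (loss(w1 + eps, x, y, b1, w2, b2) - loss(w1, x, y, b1, w2, b2)) / eps
print(dL_dw1, numeric)                 # should agree closely
```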
🔢 Matrix Form
In vectorized networks, backpropagation uses matrix calculus.
∂L/∂W = ∂L/∂Z × ∂Z/∂W, which for a linear layer Z = WA + b works out to ∂L/∂W = (∂L/∂Z) Aᵀ.
Frameworks such as PyTorch and TensorFlow handle this automatically through reverse-mode automatic differentiation (autograd in PyTorch, GradientTape in TensorFlow).
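A minimal PyTorch sketch, assuming PyTorch is installed and using a toy linear layer with a sum loss (both illustrative), that compares autograd's gradient against the manual matrix-calculus result (∂L/∂Z) Aᵀ:

```python
import torch

torch.manual_seed(0)
A = torch.randn(4, 3)                       # layer input: 4 features, batch of 3 columns
W = torch.randn(2, 4, requires_grad=True)   # weight matrix
Z = W @ A                                   # linear layer, Z = W A (bias omitted)
L = Z.sum()                                 # simple scalar loss, so dL/dZ is all ones

L.backward()                                # autograd fills W.grad

dL_dZ = torch.ones_like(Z)                  # gradient of sum(Z) with respect to Z
manual = dL_dZ @ A.T                        # matrix form: dL/dW = (dL/dZ) Aᵀ
print(torch.allclose(W.grad, manual))       # True
```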
🧠 Intuition
Backprop tells us how a small change in a weight affects the final loss.
The goal of gradient descent is to follow the negative gradient to minimize loss.
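As a toy illustration of following the negative gradient, the sketch below minimizes a one-dimensional quadratic; the function, starting point, and learning rate are all arbitrary choices.

```python
# Minimize f(w) = (w - 3)^2 by repeatedly stepping against the gradient
w = 0.0                     # arbitrary starting point
lr = 0.1                    # learning rate (illustrative)
for step in range(50):
    grad = 2 * (w - 3)      # df/dw
    w -= lr * grad          # step in the direction of the negative gradient
print(w)                    # converges toward 3, the minimizer
```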
⚠️ Gradient Flow Issues
- Vanishing gradients: saturating activations like sigmoid and tanh have derivatives well below 1, so their product across many layers shrinks toward zero (a small demo follows this list)
- Exploding gradients: repeated multiplication by large weight matrices can make gradients grow uncontrollably in deep networks
- Solutions: ReLU activations, LayerNorm, and residual connections
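A tiny, deliberately simplified demo of the vanishing-gradient effect: it multiplies only the sigmoid's local derivative across 20 layers (weights are ignored), at the pre-activation where that derivative is largest. A ReLU's derivative is 1 in its active region, so the same product would stay at 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Multiply sigmoid's local derivative across 20 layers (weights ignored for simplicity).
# Even at z = 0, where sigmoid'(z) peaks at 0.25, the product shrinks geometrically.
grad = 1.0
z = 0.0
for layer in range(20):
    s = sigmoid(z)
    grad *= s * (1 - s)     # sigmoid'(z) = s * (1 - s) <= 0.25
print(grad)                 # about 0.25**20 ≈ 9e-13: the signal has effectively vanished
```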
📉 Visualization
A typical backprop diagram shows the error signal moving backward through the layers, accumulating each layer's local derivative along the way.
Self-Check
- What does the chain rule allow us to compute?
- What causes vanishing gradients?
- How do frameworks calculate gradients?