Gradients and Derivatives: Backpropagation Deep Dive

Summary

A deeper look into how backpropagation works using calculus and partial derivatives.

advanced
neural-network-basics

To fully understand backpropagation, we need to explore how gradients flow using calculus.


🧮 Derivatives and Chain Rule

The chain rule is key to computing gradients in neural networks:

∂L/∂x = ∂L/∂z × ∂z/∂x

Where L is the loss, z is an intermediate variable, and x is a weight or activation.

In neural networks, we apply this recursively through layers.
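
To see the formula in action, here is a tiny Python check; the functions L(z) = z² and z(x) = 3x + 1 are arbitrary choices for illustration, not taken from the network example below. The analytic chain-rule product matches a finite-difference estimate of ∂L/∂x.

```python
# Minimal chain-rule check with arbitrary illustrative functions:
# L(z) = z**2 and z(x) = 3*x + 1.

def z(x):
    return 3.0 * x + 1.0

def L(z_val):
    return z_val ** 2

x = 2.0

# Chain rule: dL/dx = dL/dz * dz/dx
dL_dz = 2.0 * z(x)   # derivative of z**2
dz_dx = 3.0          # derivative of 3*x + 1
analytic = dL_dz * dz_dx

# Finite-difference approximation of dL/dx for comparison
eps = 1e-6
numeric = (L(z(x + eps)) - L(z(x - eps))) / (2 * eps)

print(analytic, numeric)  # both ≈ 42.0
```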


๐Ÿ” Example: Two-layer Network

Let's say we have:

zโ‚ = wโ‚x + bโ‚ โ†’ aโ‚ = ReLU(zโ‚)
zโ‚‚ = wโ‚‚aโ‚ + bโ‚‚ โ†’ ลท = sigmoid(zโ‚‚)

Loss: L = MSE(ลท, y)

To compute the gradient of L with respect to w₁, we multiply five local derivatives along the chain:

  1. ∂L/∂ŷ
  2. ∂ŷ/∂z₂
  3. ∂z₂/∂a₁
  4. ∂a₁/∂z₁
  5. ∂z₁/∂w₁

so that ∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂w₁, as worked through in the sketch below.
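
Below is a minimal plain-Python sketch of this computation; the input, target, weights, and biases are made-up scalar values, and MSE is taken as (ŷ − y)² for a single example. The product of the five local derivatives gives ∂L/∂w₁.

```python
import math

# Made-up scalar values, purely for illustration
x, y = 1.5, 1.0
w1, b1 = 0.8, 0.1
w2, b2 = -1.2, 0.3

# Forward pass
z1 = w1 * x + b1                      # z1 = w1*x + b1
a1 = max(0.0, z1)                     # a1 = ReLU(z1)
z2 = w2 * a1 + b2                     # z2 = w2*a1 + b2
y_hat = 1.0 / (1.0 + math.exp(-z2))   # ŷ = sigmoid(z2)
loss = (y_hat - y) ** 2               # L = squared error for one example

# Backward pass: the five chain-rule factors
dL_dyhat = 2.0 * (y_hat - y)          # ∂L/∂ŷ
dyhat_dz2 = y_hat * (1.0 - y_hat)     # ∂ŷ/∂z2 (sigmoid derivative)
dz2_da1 = w2                          # ∂z2/∂a1
da1_dz1 = 1.0 if z1 > 0 else 0.0      # ∂a1/∂z1 (ReLU derivative)
dz1_dw1 = x                           # ∂z1/∂w1

dL_dw1 = dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1
print(dL_dw1)
```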

📉 Matrix Form

In vectorized networks, backpropagation uses matrix calculus.

∂L/∂W = ∂L/∂Z × ∂Z/∂W
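
As a concrete (and standard) instance of this, consider a single linear layer Z = W·A + b, where A holds the previous layer's activations; the chain rule then gives ∂L/∂W = (∂L/∂Z)·Aᵀ. The NumPy sketch below uses made-up shapes and a random upstream gradient purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up shapes: 3 input features, 2 output units, batch of 4
A = rng.standard_normal((3, 4))   # previous-layer activations (features x batch)
W = rng.standard_normal((2, 3))   # weight matrix
b = rng.standard_normal((2, 1))   # bias
Z = W @ A + b                     # forward: Z = W·A + b

# Pretend the upstream gradient ∂L/∂Z has arrived from the layers above
dL_dZ = rng.standard_normal(Z.shape)

# Chain rule in matrix form for this layer
dL_dW = dL_dZ @ A.T                              # ∂L/∂W = (∂L/∂Z) Aᵀ
dL_db = dL_dZ.sum(axis=1, keepdims=True)         # ∂L/∂b sums over the batch
dL_dA = W.T @ dL_dZ                              # gradient passed back to the previous layer

print(dL_dW.shape, dL_db.shape, dL_dA.shape)     # (2, 3) (2, 1) (3, 4)
```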

Frameworks like PyTorch or TensorFlow handle this automatically using autograd.
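
For example, in PyTorch the two-layer network above can be written directly on tensors with requires_grad=True; calling loss.backward() makes autograd apply the chain rule through the recorded computation graph and fill in w1.grad. The scalar values are the same made-up ones used earlier.

```python
import torch

# Same two-layer example, with made-up scalar values
x = torch.tensor(1.5)
y = torch.tensor(1.0)
w1 = torch.tensor(0.8, requires_grad=True)
b1 = torch.tensor(0.1, requires_grad=True)
w2 = torch.tensor(-1.2, requires_grad=True)
b2 = torch.tensor(0.3, requires_grad=True)

a1 = torch.relu(w1 * x + b1)
y_hat = torch.sigmoid(w2 * a1 + b2)
loss = (y_hat - y) ** 2

loss.backward()    # autograd walks the graph backward, applying the chain rule
print(w1.grad)     # matches the hand-computed ∂L/∂w1
```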


🧠 Intuition

Backprop tells us how a small change in a weight affects the final loss.

The goal of gradient descent is to follow the negative gradient to minimize loss.
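
As a toy illustration (the quadratic loss and the learning rate here are arbitrary choices), repeatedly stepping against the gradient drives a simple one-parameter loss to its minimum:

```python
# Toy gradient descent on L(w) = (w - 3)**2, whose gradient is 2*(w - 3)
w = 0.0      # arbitrary starting point
lr = 0.1     # arbitrary learning rate

for step in range(50):
    grad = 2.0 * (w - 3.0)   # ∂L/∂w
    w -= lr * grad           # move against the gradient
print(w)                     # ≈ 3.0, the minimizer of L
```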


⚙️ Gradient Flow Issues

  • Vanishing gradients: Sigmoid/Tanh cause very small derivatives
  • Exploding gradients: Can happen in deep networks
  • Solutions: ReLU, LayerNorm, Residual connections

📊 Visualization

Picture the error signal flowing backward from the loss: at each layer it is multiplied by that layer's local derivative, and these products accumulate into the gradient for every weight.


✅ Self-Check

  • What does the chain rule allow us to compute?
  • What causes vanishing gradients?
  • How do frameworks calculate gradients?