Gradients and Derivatives: Backpropagation Deep Dive

Level 401 · Advanced
10 mins

To fully understand backpropagation, we need to explore how gradients flow using calculus.


🧮 Derivatives and Chain Rule

The chain rule is key to computing gradients in neural networks:

∂L/∂x = ∂L/∂z × ∂z/∂x

Where L is the loss, z is an intermediate variable, and x is a weight or activation.

In neural networks, we apply this recursively through layers.
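As a minimal sketch of the rule in isolation, the snippet below uses an illustrative scalar composition (z = x², L = sin(z), values chosen arbitrarily, not from the course) and checks the chain-rule product against a finite-difference estimate:

```python
import math

# Illustrative scalar example: z = x**2 and L = sin(z),
# so dL/dx = dL/dz * dz/dx = cos(z) * 2x.
x = 1.5
z = x ** 2
L = math.sin(z)

dL_dz = math.cos(z)    # derivative of sin(z) with respect to z
dz_dx = 2 * x          # derivative of x**2 with respect to x
dL_dx = dL_dz * dz_dx  # chain rule

# Check against a finite-difference approximation.
eps = 1e-6
L_shifted = math.sin((x + eps) ** 2)
print(dL_dx, (L_shifted - L) / eps)  # the two numbers should agree closely
```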


🔍 Example: Two-layer Network

Let's say we have:

zโ‚ = wโ‚x + bโ‚ โ†’ aโ‚ = ReLU(zโ‚)
zโ‚‚ = wโ‚‚aโ‚ + bโ‚‚ โ†’ ลท = sigmoid(zโ‚‚)

Loss: L = MSE(ลท, y)

To compute the gradient of L with respect to w₁, we chain together five local derivatives (a worked numeric sketch follows the list):

  1. ∂L/∂ŷ
  2. ∂ŷ/∂z₂
  3. ∂z₂/∂a₁
  4. ∂a₁/∂z₁
  5. ∂z₁/∂w₁

Multiplying these factors gives ∂L/∂w₁ = ∂L/∂ŷ × ∂ŷ/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂w₁.

📉 Matrix Form

In vectorized networks, backpropagation uses matrix calculus.

∂L/∂W = ∂L/∂Z × ∂Z/∂W

Frameworks like PyTorch or TensorFlow handle this automatically using autograd.
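As a concrete sketch (the shapes and the toy loss below are arbitrary illustrative choices), the snippet lets PyTorch's autograd compute ∂L/∂W for a linear map Z = W·A, then compares it with the hand-derived matrix-calculus result ∂L/∂W = (∂L/∂Z)·Aᵀ:

```python
import torch

# Toy shapes chosen only for illustration.
torch.manual_seed(0)
W = torch.randn(3, 4, requires_grad=True)
A = torch.randn(4, 5)

Z = W @ A                  # linear map
loss = (Z ** 2).sum()      # toy loss, so dL/dZ = 2 * Z

loss.backward()            # autograd fills in W.grad = dL/dW

# Hand-derived matrix-calculus result for this layer: dL/dW = (dL/dZ) @ A.T
dL_dZ = 2 * Z.detach()
manual_grad = dL_dZ @ A.T

print(torch.allclose(W.grad, manual_grad))  # True: autograd matches the manual gradient
```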


🧠 Intuition

Backprop tells us how a small change in each weight affects the final loss; in other words, it computes ∂L/∂w for every parameter in the network.

The goal of gradient descent is to follow the negative gradient to minimize loss.
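A minimal sketch of that update rule on a hypothetical one-dimensional loss L(w) = (w − 3)², with an illustrative learning rate:

```python
# Gradient-descent sketch on a hypothetical 1-D loss L(w) = (w - 3)**2.
# The target value 3 and the learning rate are illustrative choices.
w = 0.0
learning_rate = 0.1
for step in range(100):
    grad = 2 * (w - 3)          # dL/dw
    w -= learning_rate * grad   # step along the negative gradient
print(w)  # approaches the minimizer w = 3
```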


⚙️ Gradient Flow Issues

  • Vanishing gradients: Sigmoid/Tanh cause very small derivatives
  • Exploding gradients: Can happen in deep networks
  • Solutions: ReLU, LayerNorm, Residual connections

📊 Visualization

A backpropagation diagram shows how error signals move backward through the network and accumulate, layer by layer, via the chain rule.


Self-Check

  • What does the chain rule allow us to compute?
  • What causes vanishing gradients?
  • How do frameworks calculate gradients?

