Gradient Descent: Math, Derivatives, Optimizers

Summary

Dive deeper into the math behind gradient descent, including partial derivatives and popular optimization methods.

Tags: advanced · core-ai

Let’s break down the mathematical side of gradient descent and explore key optimizers like Adam, RMSProp, and Momentum.


🔣 Partial Derivatives

In machine learning, the loss depends on many parameters (weights and biases), not just one.

We use partial derivatives to compute the slope of the loss with respect to each parameter while holding the others fixed:

∂L / ∂w₁, ∂L / ∂w₂, ..., ∂L / ∂b

The gradient is the vector of these partial derivatives. It points in the direction of steepest increase of the loss, which is why gradient descent steps in the opposite direction.
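
For concreteness, here is a minimal sketch (not part of the original lesson) that estimates the partial derivatives of a made-up two-parameter loss with central differences; the loss function, starting point, and `eps` value are all illustrative choices.

```python
import numpy as np

# Toy loss with two parameters: L(w1, w2) = (w1 - 3)^2 + (w2 + 1)^2
def loss(w):
    return (w[0] - 3) ** 2 + (w[1] + 1) ** 2

def numerical_gradient(f, w, eps=1e-6):
    """Estimate each partial derivative ∂L/∂wᵢ with a central difference."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

w = np.array([0.0, 0.0])
print(numerical_gradient(loss, w))  # ≈ [-6.  2.]  (∂L/∂w1, ∂L/∂w2)
```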


🔁 Gradient Update Rule (Vector Form)

θ = θ - η * ∇L(θ)

Where:

  • θ is a vector of parameters (weights and biases)
  • ∇L(θ) is the gradient vector
  • η is the learning rate

This update is repeated for every batch during training.
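
A bare-bones sketch of this update rule in code, assuming the same toy loss as above with its gradient written out by hand; the learning rate and number of iterations are arbitrary illustrative values.

```python
import numpy as np

# Hand-derived gradient of the toy loss L(θ) = (θ1 - 3)^2 + (θ2 + 1)^2
def grad_L(theta):
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

theta = np.array([0.0, 0.0])  # initial parameters θ
eta = 0.1                     # learning rate η

for step in range(100):       # one update per "batch" in this toy setting
    theta = theta - eta * grad_L(theta)

print(theta)  # converges toward [3, -1], the minimum of the toy loss
```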


⚙️ Popular Optimizers

Different optimizers modify how each update step is taken:

1. SGD (Stochastic Gradient Descent)

  • Computes the gradient on a mini-batch of examples rather than the full dataset
  • Simple, but the updates are noisy (a minimal sketch follows below)
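
The following sketch illustrates mini-batch SGD on a tiny synthetic linear-regression problem; the data, batch size, and learning rate are made up purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # 200 samples, 2 features
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.1 * rng.normal(size=200)   # noisy linear targets

w = np.zeros(2)
eta, batch_size = 0.05, 32

for epoch in range(20):
    order = rng.permutation(len(X))           # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this mini-batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= eta * grad                       # θ = θ - η ∇L(θ)

print(w)  # noisy path, but ends close to [1.5, -2.0]
```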

2. Momentum

  • Adds a velocity term that accumulates past gradients
  • Helps smooth noisy updates and push through plateaus and shallow local minima

v = γ * v - η * ∇L(θ)
θ = θ + v

Here γ is the momentum coefficient (commonly around 0.9); a code sketch follows below.
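
A minimal sketch of the Momentum update above, reusing the toy quadratic loss; γ = 0.9 and η = 0.1 are illustrative values, not prescriptions.

```python
import numpy as np

def grad_L(theta):
    # Same toy loss as before: L(θ) = (θ1 - 3)^2 + (θ2 + 1)^2
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

theta = np.array([0.0, 0.0])
v = np.zeros_like(theta)       # velocity
eta, gamma = 0.1, 0.9          # learning rate η, momentum coefficient γ

for step in range(200):
    v = gamma * v - eta * grad_L(theta)   # accumulate a decaying sum of gradients
    theta = theta + v                     # move along the velocity

print(theta)  # ≈ [3, -1]
```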

3. RMSProp

  • Adapts the learning rate for each parameter individually
  • Divides each gradient by the square root of a running average of squared gradients (see the sketch below)
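
A sketch of one common formulation of RMSProp on the same toy loss; the decay rate β, learning rate, and ε are illustrative defaults, not the only reasonable choices.

```python
import numpy as np

def grad_L(theta):
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

theta = np.array([0.0, 0.0])
s = np.zeros_like(theta)         # running average of squared gradients
eta, beta, eps = 0.01, 0.9, 1e-8

for step in range(1000):
    g = grad_L(theta)
    s = beta * s + (1 - beta) * g ** 2            # exponential moving average
    theta = theta - eta * g / (np.sqrt(s) + eps)  # per-parameter scaled step

print(theta)  # close to [3, -1]
```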

4. Adam

  • Combines the ideas of Momentum (a running average of gradients) and RMSProp (a running average of squared gradients), plus bias correction
  • The default optimizer in many deep learning libraries (a sketch follows below)
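
A sketch of the Adam update (first and second moment estimates with bias correction) on the same toy loss; all hyperparameter values here are illustrative.

```python
import numpy as np

def grad_L(theta):
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

theta = np.array([0.0, 0.0])
m = np.zeros_like(theta)   # first moment (Momentum-like average of gradients)
v = np.zeros_like(theta)   # second moment (RMSProp-like average of squared gradients)
eta, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

for t in range(1, 2001):   # t starts at 1 so the bias correction is well defined
    g = grad_L(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # close to [3, -1]
```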

🧪 Visualization

Plotting the path each optimizer takes on the same loss surface makes the differences concrete: plain gradient descent tends to zig-zag across narrow, elongated valleys, while momentum-based methods take smoother routes toward the minimum.
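
The interactive visualization is not reproduced here, but the following sketch gives a rough equivalent: it plots the paths taken by plain gradient descent and Momentum on a hand-picked, ill-conditioned quadratic bowl. The loss surface, starting point, and hyperparameters are all assumptions chosen only to make the two paths visibly different.

```python
import numpy as np
import matplotlib.pyplot as plt

# Elongated quadratic bowl: L(x, y) = x^2 + 10 * y^2 (deliberately ill-conditioned)
def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

def run(step_fn, steps=60):
    p, state = np.array([-2.0, 1.0]), {}
    path = [p.copy()]
    for _ in range(steps):
        p = step_fn(p, grad(p), state)
        path.append(p.copy())
    return np.array(path)

def gd_step(p, g, state, eta=0.08):
    return p - eta * g

def momentum_step(p, g, state, eta=0.08, gamma=0.8):
    v = gamma * state.get("v", np.zeros_like(p)) - eta * g
    state["v"] = v
    return p + v

xs, ys = np.meshgrid(np.linspace(-2.5, 2.5, 200), np.linspace(-1.5, 1.5, 200))
plt.contour(xs, ys, xs**2 + 10 * ys**2, levels=20, cmap="gray")
for name, step_fn in [("Gradient descent", gd_step), ("Momentum", momentum_step)]:
    path = run(step_fn)
    plt.plot(path[:, 0], path[:, 1], marker="o", markersize=3, label=name)
plt.legend()
plt.title("Optimizer paths on an ill-conditioned quadratic")
plt.show()
```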


🧠 Summary

| Optimizer | Key Feature                  |
|-----------|------------------------------|
| SGD       | Simple updates per batch     |
| Momentum  | Smooth updates with velocity |
| RMSProp   | Scaled learning per param    |
| Adam      | Fast and adaptive            |


✅ Self-Check

  • What is the difference between a gradient and a partial derivative?
  • Why is Adam often a good default optimizer for deep learning?
  • What role does momentum play?