Let’s break down the mathematical side of gradient descent and explore key optimizers like Adam, RMSProp, and Momentum.
🔣 Partial Derivatives
In machine learning, a model usually has many parameters (weights and biases), so the loss L is a function of all of them.
We use partial derivatives to measure the slope of L with respect to each parameter separately:
∂L / ∂w₁, ∂L / ∂w₂, ..., ∂L / ∂b
The gradient is a vector of these partial derivatives.
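To make this concrete, here is a minimal sketch (illustrative values, single-example squared-error loss) that computes the partial derivatives analytically and checks one of them with a finite difference:

```python
import numpy as np

# Toy setup (illustrative values): one training example with two features.
x = np.array([1.0, 2.0])   # inputs x1, x2
y = 3.0                    # target
w = np.array([0.5, -0.3])  # weights w1, w2
b = 0.1                    # bias

def loss(w, b):
    # Squared-error loss for a single example: L = (w·x + b - y)²
    return (w @ x + b - y) ** 2

# Analytic partial derivatives:
# ∂L/∂wᵢ = 2 · (w·x + b - y) · xᵢ   and   ∂L/∂b = 2 · (w·x + b - y)
residual = w @ x + b - y
grad_w = 2 * residual * x      # vector of ∂L/∂w₁, ∂L/∂w₂
grad_b = 2 * residual          # ∂L/∂b

# Numerical check of ∂L/∂w₁ with a finite difference
eps = 1e-6
approx_dw1 = (loss(w + np.array([eps, 0.0]), b) - loss(w, b)) / eps

print(grad_w, grad_b)   # the gradient: all partial derivatives stacked together
print(approx_dw1)       # should be close to grad_w[0]
```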
🔁 Gradient Update Rule (Vector Form)
θ = θ - η * ∇L(θ)
Where:
- θ is the vector of parameters (weights and biases)
- ∇L(θ) is the gradient vector
- η is the learning rate
This is repeated for every batch during training.
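A minimal sketch of this update loop, using a simple quadratic loss as a stand-in for a real model's batch loss (the target vector and learning rate are illustrative):

```python
import numpy as np

# Gradient-descent loop for the update rule θ = θ - η * ∇L(θ).
def grad_L(theta):
    # Gradient of L(θ) = ||θ - target||², which is 2 · (θ - target)
    target = np.array([1.0, -2.0, 0.5])
    return 2.0 * (theta - target)

theta = np.zeros(3)   # parameter vector (weights and biases flattened together)
eta = 0.1             # learning rate η

for step in range(100):               # one update per batch in real training
    theta = theta - eta * grad_L(theta)

print(theta)  # converges toward [1.0, -2.0, 0.5]
```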
⚙️ Popular Optimizers
Different optimizers change how each update step is computed from the gradient:
1. SGD (Stochastic Gradient Descent)
- Uses mini-batches
- Simple but noisy updates
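A minimal mini-batch SGD sketch on a toy linear-regression problem (the data, batch size, and learning rate are illustrative choices, not a prescribed setup):

```python
import numpy as np

# Mini-batch SGD: each update uses the gradient of the loss on a small batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(3)
eta, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))            # shuffle the data each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # ∇ of mean squared error on the batch
        w -= eta * grad                              # noisy but cheap update

print(w)  # ≈ true_w
```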
2. Momentum
- Adds velocity to updates
- Smooths oscillations and helps push through plateaus and shallow local minima
v = γ * v - η * ∇L(θ)
θ = θ + v
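A minimal sketch of these two lines in code, using an illustrative quadratic loss and illustrative hyperparameters:

```python
import numpy as np

# Momentum update exactly as written above: v = γ·v - η·∇L(θ), then θ = θ + v.
def grad_L(theta):
    return 2.0 * theta          # gradient of L(θ) = ||θ||²

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)        # velocity starts at zero
gamma, eta = 0.9, 0.05          # momentum coefficient γ, learning rate η

for step in range(200):
    v = gamma * v - eta * grad_L(theta)   # accumulate velocity
    theta = theta + v                     # move along the velocity

print(theta)  # driven toward the minimum at the origin
```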
3. RMSProp
- Adapts learning rate for each parameter
- Divides by running average of squared gradients
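A minimal RMSProp sketch, assuming the usual decay rate ρ = 0.9 and a small ε for numerical stability; the badly scaled quadratic loss is only there to show the per-parameter scaling at work:

```python
import numpy as np

# RMSProp: scale each parameter's step by a running average of its squared gradients.
def grad_L(theta):
    # Badly scaled quadratic: L = θ₀² + 100·θ₁², where a single learning rate struggles
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

theta = np.array([3.0, 3.0])
avg_sq = np.zeros_like(theta)   # running average of squared gradients, E[g²]
eta, rho, eps = 0.01, 0.9, 1e-8

for step in range(500):
    g = grad_L(theta)
    avg_sq = rho * avg_sq + (1 - rho) * g ** 2         # E[g²] ← ρ·E[g²] + (1-ρ)·g²
    theta = theta - eta * g / (np.sqrt(avg_sq) + eps)  # per-parameter step size

print(theta)  # both coordinates shrink toward 0 despite very different curvatures
```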
4. Adam
- Combines Momentum + RMSProp
- Default optimizer in many deep learning libraries
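A minimal Adam sketch with the commonly quoted defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the learning rate and the ill-scaled quadratic loss (the same one as in the RMSProp sketch) are illustrative choices:

```python
import numpy as np

# Adam: Momentum-style first moment + RMSProp-style second moment, with bias correction.
def grad_L(theta):
    return np.array([2.0 * theta[0], 200.0 * theta[1]])   # gradient of L = θ₀² + 100·θ₁²

theta = np.array([3.0, 3.0])
m = np.zeros_like(theta)        # first moment: moving average of gradients
v = np.zeros_like(theta)        # second moment: moving average of squared gradients
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad_L(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)            # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # both coordinates approach 0 at a similar pace
```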
🧪 Visualization
Different optimizers trace different paths over the same loss surface: plain SGD tends to zigzag across steep directions, Momentum smooths the trajectory, and RMSProp and Adam take larger steps along flat directions and smaller steps along steep ones.
🧠 Summary
| Optimizer | Key Feature                        |
|-----------|------------------------------------|
| SGD       | Simple updates per batch           |
| Momentum  | Smooth updates with velocity       |
| RMSProp   | Scaled learning rate per parameter |
| Adam      | Fast and adaptive                  |
✅ Self-Check
- What’s the difference between gradient and partial derivative?
- Why is Adam often the default optimizer for deep learning?
- What role does momentum play?