Let’s break down the mathematical side of gradient descent and explore key optimizers like Adam, RMSProp, and Momentum.
🔣 Partial Derivatives
In machine learning, a model usually has many parameters (weights and biases), so the loss L is a function of all of them.
We use partial derivatives to measure the slope of L with respect to each parameter separately:
∂L / ∂w₁, ∂L / ∂w₂, ..., ∂L / ∂b
The gradient is a vector of these partial derivatives.
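To make this concrete, here is a minimal sketch (illustrative values, single-example squared-error loss) that computes the partial derivatives analytically and checks one of them with a finite difference:

```python
import numpy as np

# Toy setup (illustrative values): one training example with two features.
x = np.array([1.0, 2.0])   # inputs x1, x2
y = 3.0                    # target
w = np.array([0.5, -0.3])  # weights w1, w2
b = 0.1                    # bias

def loss(w, b):
    # Squared-error loss for a single example: L = (w·x + b - y)²
    return (w @ x + b - y) ** 2

# Analytic partial derivatives:
# ∂L/∂wᵢ = 2 · (w·x + b - y) · xᵢ   and   ∂L/∂b = 2 · (w·x + b - y)
residual = w @ x + b - y
grad_w = 2 * residual * x      # vector of ∂L/∂w₁, ∂L/∂w₂
grad_b = 2 * residual          # ∂L/∂b

# Numerical check of ∂L/∂w₁ with a finite difference
eps = 1e-6
approx_dw1 = (loss(w + np.array([eps, 0.0]), b) - loss(w, b)) / eps

print(grad_w, grad_b)   # the gradient: all partial derivatives stacked together
print(approx_dw1)       # should be close to grad_w[0]
```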
🔁 Gradient Update Rule (Vector Form)
θ = θ - η * ∇L(θ)
Where:
- θ is the vector of parameters (weights and biases)
- ∇L(θ) is the gradient vector
- η is the learning rate
This is repeated for every batch during training.
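A minimal sketch of this update loop, using a simple quadratic loss as a stand-in for a real model's batch loss (the target vector and learning rate are illustrative):

```python
import numpy as np

# Gradient-descent loop for the update rule θ = θ - η * ∇L(θ).
def grad_L(theta):
    # Gradient of L(θ) = ||θ - target||², which is 2 · (θ - target)
    target = np.array([1.0, -2.0, 0.5])
    return 2.0 * (theta - target)

theta = np.zeros(3)   # parameter vector (weights and biases flattened together)
eta = 0.1             # learning rate η

for step in range(100):               # one update per batch in real training
    theta = theta - eta * grad_L(theta)

print(theta)  # converges toward [1.0, -2.0, 0.5]
```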
⚙️ Popular Optimizers
Different optimizers change how each update step is computed from the gradient:
1. SGD (Stochastic Gradient Descent)
- Uses mini-batches
- Simple but noisy updates
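A minimal mini-batch SGD sketch on a toy linear-regression problem (the data, batch size, and learning rate are illustrative choices, not a prescribed setup):

```python
import numpy as np

# Mini-batch SGD: each update uses the gradient of the loss on a small batch.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(3)
eta, batch_size = 0.1, 32

for epoch in range(20):
    perm = rng.permutation(len(X))            # shuffle the data each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # ∇ of mean squared error on the batch
        w -= eta * grad                              # noisy but cheap update

print(w)  # ≈ true_w
```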
2. Momentum
- Adds velocity to updates
- Smooths oscillations and helps push through plateaus and shallow local minima
v = γ * v - η * ∇L(θ)
θ = θ + v
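A minimal sketch of these two lines in code, using an illustrative quadratic loss and illustrative hyperparameters:

```python
import numpy as np

# Momentum update exactly as written above: v = γ·v - η·∇L(θ), then θ = θ + v.
def grad_L(theta):
    return 2.0 * theta          # gradient of L(θ) = ||θ||²

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)        # velocity starts at zero
gamma, eta = 0.9, 0.05          # momentum coefficient γ, learning rate η

for step in range(200):
    v = gamma * v - eta * grad_L(theta)   # accumulate velocity
    theta = theta + v                     # move along the velocity

print(theta)  # driven toward the minimum at the origin
```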
3. RMSProp
- Adapts learning rate for each parameter
- Divides by running average of squared gradients
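A minimal RMSProp sketch, assuming the usual decay rate ρ = 0.9 and a small ε for numerical stability; the badly scaled quadratic loss is only there to show the per-parameter scaling at work:

```python
import numpy as np

# RMSProp: scale each parameter's step by a running average of its squared gradients.
def grad_L(theta):
    # Badly scaled quadratic: L = θ₀² + 100·θ₁², where a single learning rate struggles
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

theta = np.array([3.0, 3.0])
avg_sq = np.zeros_like(theta)   # running average of squared gradients, E[g²]
eta, rho, eps = 0.01, 0.9, 1e-8

for step in range(500):
    g = grad_L(theta)
    avg_sq = rho * avg_sq + (1 - rho) * g ** 2         # E[g²] ← ρ·E[g²] + (1-ρ)·g²
    theta = theta - eta * g / (np.sqrt(avg_sq) + eps)  # per-parameter step size

print(theta)  # both coordinates shrink toward 0 despite very different curvatures
```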
4. Adam
- Combines Momentum + RMSProp
- Default optimizer in many deep learning libraries
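A minimal Adam sketch with the commonly quoted defaults (β₁ = 0.9, β₂ = 0.999, ε = 1e-8); the learning rate and the ill-scaled quadratic loss (the same one as in the RMSProp sketch) are illustrative choices:

```python
import numpy as np

# Adam: Momentum-style first moment + RMSProp-style second moment, with bias correction.
def grad_L(theta):
    return np.array([2.0 * theta[0], 200.0 * theta[1]])   # gradient of L = θ₀² + 100·θ₁²

theta = np.array([3.0, 3.0])
m = np.zeros_like(theta)        # first moment: moving average of gradients
v = np.zeros_like(theta)        # second moment: moving average of squared gradients
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad_L(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)            # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # both coordinates approach 0 at a similar pace
```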
🧪 Visualization
Different optimizers trace different paths over the same loss surface: plain SGD tends to zigzag across steep directions, Momentum smooths the trajectory, and RMSProp and Adam take larger steps along flat directions and smaller steps along steep ones.
🧠 Summary
| Optimizer | Key Feature                        |
|-----------|------------------------------------|
| SGD       | Simple updates per batch           |
| Momentum  | Smooth updates with velocity       |
| RMSProp   | Scaled learning rate per parameter |
| Adam      | Fast and adaptive                  |
✅ Self-Check
- What’s the difference between gradient and partial derivative?
- Why is Adam often the default optimizer for deep learning?
- What role does momentum play?