25 March 2026
Gradient Descent Optimizers: From SGD to Adam
A practical guide to the most widely used optimization algorithms in deep learning — what they compute, why they differ, and when to use each one.
The Core Problem
Training a neural network means minimising a loss function L(θ) over potentially billions of parameters θ. Gradient descent does this iteratively:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$$
where η is the learning rate. The challenge is that computing the full gradient over the entire dataset is prohibitively expensive — which is why we use stochastic variants.
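As a minimal sketch of this update rule, assume a toy one-dimensional loss L(θ) = (θ − 3)², chosen purely for illustration; the function names `grad` and `gradient_descent` are hypothetical.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Starting from 0, the iterate approaches the minimiser theta = 3.
theta_final = gradient_descent(theta0=0.0)
```

On this toy loss each step shrinks the distance to the minimum by a constant factor (1 − 2η), which is why a learning rate that is too large (here, above 1.0) would diverge instead.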
Stochastic Gradient Descent (SGD)
SGD computes the gradient on a random mini-batch of size B:
$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L_B(\theta_t)$$
This introduces noise, but that noise can act as an implicit regulariser and help the iterates escape sharp minima. With momentum, we also accumulate a velocity vector:
$$v_{t+1} = \mu v_t - \eta \nabla_\theta L_B(\theta_t), \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
Momentum (μ≈0.9) smooths updates and accelerates convergence along consistent gradient directions.
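The mini-batch and momentum updates can be sketched in plain Python. This is an illustrative toy, not a production implementation: the one-parameter least-squares model, the noise-free data, and the names `minibatch_grad` and `sgd_momentum` are all assumptions made for the example.

```python
import random


def minibatch_grad(w, batch):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def sgd_momentum(data, lr=0.05, mu=0.9, batch_size=4, steps=500, seed=0):
    rng = random.Random(seed)
    w, v = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # random mini-batch of size B
        g = minibatch_grad(w, batch)
        v = mu * v - lr * g                   # velocity update
        w = w + v                             # parameter update
    return w


# Toy data generated from y = 2x, so w = 2 is the exact minimiser.
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 21)]
w_fit = sgd_momentum(data)
```

Each step sees only four of the twenty points, yet the velocity averages the noisy gradients into a consistent direction.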
AdaGrad
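AdaGrad adapts the learning rate per parameter based on the accumulated squared gradients. Writing $g_t = \nabla_\theta L_B(\theta_t)$ for the mini-batch gradient, with squares and square roots taken element-wise:
$$G_t = \sum_{\tau=1}^{t} g_\tau^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$$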
Parameters that receive large gradients get a smaller effective learning rate. This helps sparse features (e.g. word embeddings), but because $G_t$ only ever grows, the effective learning rate decays toward zero and training can stall on long runs.
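To make the decay concrete, here is a small sketch that tracks the effective step size η / √(G_t + ε) for a single parameter. The helper name `adagrad_effective_lrs` and the constant-gradient input are assumptions for illustration.

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return the effective step size lr / sqrt(G_t + eps) after each gradient."""
    G = 0.0
    rates = []
    for g in grads:
        G += g * g                           # the accumulator never shrinks
        rates.append(lr / math.sqrt(G + eps))
    return rates


# With a constant gradient of 1.0 the effective rate falls like lr / sqrt(t).
rates = adagrad_effective_lrs([1.0] * 100)
```

After 100 identical gradients the effective rate has dropped by a factor of ten, even though the gradient itself never changed.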
RMSProp
RMSProp fixes AdaGrad’s decay problem with an exponential moving average:
$$G_t = \rho G_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$$
A typical value is ρ = 0.99. Because old squared gradients are gradually forgotten, the effective learning rate stabilises rather than decaying indefinitely.
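The same experiment with an exponential moving average shows the difference. This is a sketch under the same assumptions as before: a single parameter, a constant gradient, and a hypothetical helper name.

```python
import math


def rmsprop_effective_lrs(grads, lr=0.1, rho=0.99, eps=1e-8):
    """Return the effective step size lr / sqrt(G_t + eps) after each gradient."""
    G = 0.0
    rates = []
    for g in grads:
        G = rho * G + (1.0 - rho) * g * g    # moving average, not a running sum
        rates.append(lr / math.sqrt(G + eps))
    return rates


# With a constant gradient of 1.0, G_t tends to 1 and the rate settles near lr.
rates = rmsprop_effective_lrs([1.0] * 2000)
```

Note that the first few rates are larger than the settled value, because the average starts at zero and underestimates the squared gradient early on.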
Adam
Adam (Adaptive Moment Estimation) combines momentum and RMSProp:
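$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t && \text{(first moment)} \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 && \text{(second moment)}
\end{aligned}
$$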
Because $m_0 = v_0 = 0$, early estimates are biased toward zero. Bias correction:
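$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t} + \epsilon}\, \hat m_t$$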
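A short worked check of the first step, using only the definitions above, shows what the correction does. At t = 1 the raw moments are scaled down by (1 − β), and the correction undoes exactly that:

$$
\begin{aligned}
m_1 &= (1-\beta_1)\, g_1, & \hat m_1 &= \frac{m_1}{1-\beta_1} = g_1, \\
v_1 &= (1-\beta_2)\, g_1^2, & \hat v_1 &= \frac{v_1}{1-\beta_2} = g_1^2 .
\end{aligned}
$$

The first update is therefore −η g₁ / (|g₁| + ϵ), which has magnitude close to η whatever the gradient scale. Without the correction the ratio m₁ / √v₁ would be (1 − β₁) / √(1 − β₂), about 3.16 for β₁ = 0.9 and β₂ = 0.999, so the first step would be roughly three times larger.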
The defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ work well across a broad range of tasks.
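Putting the pieces together, a scalar Adam loop might look like the following. It is a sketch for a single parameter, assuming the same toy loss L(θ) = (θ − 3)² used for illustration; `adam_minimise` is a hypothetical name, and the learning rate of 0.1 is picked for the toy problem rather than recommended in general.

```python
import math


def adam_minimise(grad_fn, theta0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):              # t starts at 1 for bias correction
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g      # first moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g  # second moment estimate
        m_hat = m / (1.0 - beta1 ** t)         # bias-corrected moments
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Gradient of the toy loss (theta - 3)**2 is 2 * (theta - 3).
theta_final = adam_minimise(lambda th: 2.0 * (th - 3.0), theta0=0.0)
```

A real implementation would keep `m` and `v` as tensors with one entry per parameter; the arithmetic per entry is the same.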
AdamW
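Adam has a subtle flaw when combined with L2 regularisation: the penalty gradient $\lambda\theta_t$ is added to $g_t$ and then rescaled by the adaptive denominator, so parameters with a history of large gradients are regularised less than intended. AdamW decouples weight decay from the gradient update by applying it directly to the weights: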
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t} + \epsilon}\, \hat m_t - \eta \lambda \theta_t$$
This is the standard choice for training Transformers and modern LLMs.
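The difference between the two variants is easiest to see on a single step. The sketch below is an illustrative scalar version, not PyTorch's implementation; the function name `adam_like_step` and its `decoupled` flag are hypothetical.

```python
import math


def adam_like_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, decoupled=True):
    """One scalar step of Adam + L2 (decoupled=False) or AdamW (decoupled=True)."""
    if not decoupled:
        g = g + weight_decay * theta          # L2 penalty folded into the gradient
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    update = lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        update += lr * weight_decay * theta   # decay applied directly to the weight
    return theta - update, m, v


# With a zero loss gradient, only the regulariser moves the weight.
theta_l2, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
theta_w, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
```

In this toy step the L2 variant moves the weight by roughly η (about 1e-3) whatever the value of λ, because the penalty gradient is normalised by its own magnitude. The decoupled variant shrinks the weight by ηλθ (1e-5 here), which is the decay that was actually requested.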
PyTorch Usage
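```python
import torch.nn as nn
import torch.optim as optim

# Placeholder model so the snippet runs on its own; substitute your network.
model = nn.Linear(10, 1)

# SGD with momentum (here weight_decay is a classic L2 penalty added to the gradient)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdamW (preferred for Transformers)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Learning rate scheduler, attached to the last optimizer defined above.
# T_max is the number of scheduler.step() calls over which the rate anneals.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
```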
Learning Rate Schedules
Whichever optimizer you choose, the learning rate schedule strongly affects both training stability and final accuracy:
| Schedule | Description | Use Case |
|---|---|---|
| Constant | Fixed η | Quick experiments |
| Step decay | Multiply by γ every k epochs | ResNet-style training |
| Cosine annealing | η follows a cosine curve | General deep learning |
| Warmup + cosine | Linear warmup then cosine | Transformers, LLMs |
| OneCycleLR | Fast ramp up, slow ramp down | Short training runs |
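The "Warmup + cosine" row can be written as a plain function of the step index. This is a minimal sketch: the name `warmup_cosine_lr`, its parameters, and the example values (1000 steps, 100 warmup steps, peak rate 3e-4) are assumptions for illustration, not a library API.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [
    warmup_cosine_lr(s, total_steps=1000, warmup_steps=100, base_lr=3e-4)
    for s in range(1000)
]
```

The rate climbs during the first 100 steps, peaks at the base rate, and then follows half a cosine period down toward `min_lr`.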
When to Use What
- SGD + momentum: Computer vision (ResNets, ConvNets); often generalises better than Adam when paired with a well-tuned schedule
- Adam/AdamW: NLP, Transformers, any task with sparse gradients
- RMSProp: RNNs, reinforcement learning
- AdaGrad: Sparse input features, NLP with bag-of-words representations
Key Takeaways
- SGD with momentum is a strong baseline for vision; Adam and AdamW dominate for language
- Decouple weight decay from adaptive gradient scaling — use AdamW, not Adam + L2
- The learning rate schedule often matters as much as the optimizer itself
- Warmup prevents instability at the start of Transformer training