25 March 2026

Gradient Descent Optimizers: From SGD to Adam

A practical guide to the most widely used optimization algorithms in deep learning — what they compute, why they differ, and when to use each one.

Deep Learning, Optimization, PyTorch, Training

The Core Problem

Training a neural network means minimising a loss function $\mathcal{L}(\theta)$ over potentially billions of parameters $\theta$. Gradient descent does this iteratively:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)$$

where $\eta$ is the learning rate. Computing the full gradient over the entire dataset at every step is prohibitively expensive for large datasets, which is why we use stochastic variants.
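
To make the update rule concrete, here is a minimal sketch of plain gradient descent on a one-dimensional toy loss. The function names and the quadratic loss $(\theta - 3)^2$ are illustrative assumptions, not part of any library.

```python
def grad_descent(grad_fn, theta0, lr=0.1, steps=100):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta


# Hypothetical toy loss L(theta) = (theta - 3)^2, with gradient 2 * (theta - 3).
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


theta_final = grad_descent(toy_grad, theta0=0.0, lr=0.1, steps=100)
print(theta_final)  # close to the minimiser theta = 3
```

With this learning rate the distance to the minimum shrinks by a factor of 0.8 per step. A rate above 1.0 would make this particular loss diverge.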

Stochastic Gradient Descent (SGD)

SGD computes the gradient on a random mini-batch of size $B$:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}_B(\theta_t)$$

This introduces gradient noise, which can act as an implicit regulariser and may help the iterates escape sharp minima. With momentum, we accumulate a velocity vector:

$$v_{t+1} = \mu v_t - \eta \nabla_\theta \mathcal{L}_B(\theta_t)$$

$$\theta_{t+1} = \theta_t + v_{t+1}$$

Momentum ($\mu \approx 0.9$) smooths updates and accelerates convergence along consistent gradient directions.
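
The mini-batch and momentum updates can be sketched in a few lines of plain Python. The one-parameter regression problem, the function name `sgd_momentum`, and the hyperparameter values below are assumptions chosen purely for illustration.

```python
import random


def sgd_momentum(data, lr=0.01, mu=0.9, batch_size=4, steps=1000, seed=0):
    """Fit y = w * x by mini-batch SGD with momentum (toy setup)."""
    rng = random.Random(seed)
    w, v = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # random mini-batch
        # Gradient of the batch mean of 0.5 * (w * x - y)^2 with respect to w.
        g = sum((w * x - y) * x for x, y in batch) / batch_size
        v = mu * v - lr * g  # velocity update
        w = w + v            # parameter update
    return w


# Noise-free data generated from y = 2x, so the minimiser is w = 2.
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 21)]
w_fit = sgd_momentum(data)
print(w_fit)  # approaches 2.0
```

Note that PyTorch's `SGD` uses a slightly different formulation, $v_{t+1} = \mu v_t + g_t$ followed by $\theta_{t+1} = \theta_t - \eta v_{t+1}$. The two behave the same for a fixed learning rate but differ when the learning rate changes during training.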

AdaGrad

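Writing $g_t = \nabla_\theta \mathcal{L}_B(\theta_t)$ for the mini-batch gradient, AdaGrad adapts the learning rate per parameter based on the accumulated squared gradients (all operations are element-wise):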

$$G_t = \sum_{\tau=1}^{t} g_\tau^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

Parameters that receive large gradients get a smaller effective learning rate. This helps with sparse features (e.g. word embeddings), but because $G_t$ increases monotonically, the effective learning rate shrinks toward zero over long training runs.
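
The decay is easy to see numerically. The sketch below tracks the effective step size $\eta / \sqrt{G_t + \epsilon}$ for a single parameter. The helper name and the constant gradient stream are assumptions for illustration only.

```python
import math


def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Return lr / sqrt(G_t + eps) after each gradient in the stream."""
    G = 0.0
    effective = []
    for g in grads:
        G += g * g  # accumulate squared gradients, never forgetting old ones
        effective.append(lr / math.sqrt(G + eps))
    return effective


# A constant gradient of 1.0 gives G_t = t, so the step shrinks like lr / sqrt(t).
adagrad_lrs = adagrad_effective_lrs([1.0] * 100)
print(adagrad_lrs[0], adagrad_lrs[-1])  # roughly 0.1, then roughly 0.01
```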

RMSProp

RMSProp fixes AdaGrad’s decay problem with an exponential moving average:

$$G_t = \rho G_{t-1} + (1 - \rho) g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} g_t$$

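A typical value is $\rho = 0.99$ (the default for `alpha` in PyTorch's `RMSprop`). Because old gradients are forgotten exponentially, the effective learning rate stabilises rather than decaying indefinitely.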
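
As a rough check, the sketch below tracks the effective step size under the moving average for a constant gradient stream. The helper name and inputs are illustrative assumptions.

```python
import math


def rmsprop_effective_lrs(grads, lr=0.1, rho=0.99, eps=1e-8):
    """Return lr / sqrt(G_t + eps), with G_t an exponential moving average."""
    G = 0.0
    effective = []
    for g in grads:
        G = rho * G + (1.0 - rho) * g * g  # old gradients decay geometrically
        effective.append(lr / math.sqrt(G + eps))
    return effective


# With a constant gradient of 1.0, G_t = 1 - rho^t, which converges to 1.
rmsprop_lrs = rmsprop_effective_lrs([1.0] * 2000)
print(rmsprop_lrs[0], rmsprop_lrs[-1])  # starts near 1.0, settles near lr = 0.1
```

The large first value comes from starting at $G_0 = 0$ without any correction.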

Adam

Adam (Adaptive Moment Estimation) combines momentum and RMSProp:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment)}$$

Because $m_0 = v_0 = 0$, early estimates are biased toward zero. Bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ work well across a broad range of tasks.
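
The equations above translate almost line for line into code. This is a minimal scalar sketch, not PyTorch's implementation. The function name `adam_step` and the example gradient are assumptions.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g        # first moment
    v = beta2 * v + (1.0 - beta2) * g * g    # second moment
    m_hat = m / (1.0 - beta1 ** t)           # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=50.0, m=m, v=v, t=1)
print(1.0 - theta)  # about 1e-3: the first step has size close to lr
```

On the first step the bias-corrected ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ is the sign of the gradient, so the step size is approximately $\eta$ whether the gradient is 50 or 0.05.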

AdamW

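Adam has a subtle flaw when weight decay is implemented as an L2 penalty added to the gradient: the decay term is then rescaled by the adaptive factor $1 / (\sqrt{\hat{v}_t} + \epsilon)$ along with everything else, so parameters with a large gradient history are decayed less than intended. AdamW decouples weight decay from the gradient update: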

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda \theta_t$$

AdamW is a widely used default for training Transformers and modern LLMs.
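
The difference between the two variants shows up even in a single step. The sketch below compares Adam with an L2 penalty folded into the gradient against the decoupled AdamW update. All function names and the example values are assumptions for illustration, and the code handles a scalar parameter only.

```python
import math


def adam_moments(g, m, v, t, beta1=0.9, beta2=0.999):
    """Update both moments and return them with their bias-corrected versions."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    return m, v, m / (1.0 - beta1 ** t), v / (1.0 - beta2 ** t)


def adam_l2_step(theta, grad, m, v, t, lr=1e-3, wd=0.01, eps=1e-8):
    """Adam with an L2 penalty: the decay term is folded into the gradient."""
    g = grad + wd * theta
    m, v, m_hat, v_hat = adam_moments(g, m, v, t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(theta, grad, m, v, t, lr=1e-3, wd=0.01, eps=1e-8):
    """AdamW: the decay term bypasses the adaptive scaling."""
    m, v, m_hat, v_hat = adam_moments(grad, m, v, t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v


# Same weight and same large loss gradient; only the handling of decay differs.
theta0, grad = 2.0, 100.0
theta_l2, _, _ = adam_l2_step(theta0, grad, 0.0, 0.0, t=1)
theta_w, _, _ = adamw_step(theta0, grad, 0.0, 0.0, t=1)
print(theta0 - theta_l2)  # about lr: the L2 term is absorbed by the normalisation
print(theta0 - theta_w)   # about lr + lr * wd * theta0
```

In the L2 variant the penalty changes the gradient from 100.0 to 100.02 and is then normalised away. In AdamW the weight shrinks by an extra $\eta \lambda \theta_t$ regardless of gradient scale.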

PyTorch Usage

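import torch.nn as nn
import torch.optim as optim

# Placeholder model so the snippet runs on its own; substitute your own network.
model = nn.Linear(10, 2)

# SGD with momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdamW (preferred for Transformers)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Learning rate scheduler, attached to whichever optimizer was created last.
# T_max is the number of scheduler.step() calls over which the rate anneals to its minimum.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)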

Learning Rate Schedules

No matter the optimizer, the learning rate schedule has a large impact:

| Schedule | Description | Use Case |
| --- | --- | --- |
| Constant | Fixed $\eta$ | Quick experiments |
| Step decay | Multiply by $\gamma$ every $k$ epochs | ResNet-style training |
| Cosine annealing | $\eta$ follows a cosine curve | General deep learning |
| Warmup + cosine | Linear warmup then cosine | Transformers, LLMs |
| OneCycleLR | Fast ramp up, slow ramp down | Short training runs |
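
As an illustration of the "warmup + cosine" row, here is one way the schedule could be written as a plain function of the step index. The function name, the default values, and the choice to count warmup from `step + 1` are assumptions; real training code would usually rely on a library scheduler instead.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=3e-4, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [warmup_cosine_lr(s, total_steps=1000) for s in range(1001)]
print(schedule[0], max(schedule), schedule[-1])  # small start, peak at base_lr, ends at min_lr
```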

When to Use What

Key Takeaways