SGD, Adam, AdamW, and the math intuition behind gradient-based optimization
Training deep neural networks is an optimization problem: find weights that minimize loss. The optimizer determines how we update weights given gradients. The choice of optimizer—and its hyperparameters—significantly affects training dynamics, final performance, and convergence speed.
This guide explains the mathematics behind major optimizers, when each performs well, and how learning rate scheduling interacts with optimizer choice.
All neural network optimizers are variations on gradient descent: move weights in the direction that reduces loss most rapidly.
θ_{t+1} = θ_t - η × ∇L(θ_t)
Where:
θ = model parameters
η = learning rate
∇L = gradient of loss w.r.t. parameters
The gradient points uphill; we move in the opposite direction. The learning rate η controls step size—too large and we overshoot; too small and we converge slowly.
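To make the update rule concrete, here is a minimal NumPy sketch of plain gradient descent on a toy quadratic loss (the loss function, starting point, and learning rate are illustrative choices, not from the text above):
# Vanilla gradient descent on a toy quadratic loss (illustrative)
import numpy as np
def loss(theta):
    return 0.5 * np.sum(theta ** 2)          # L(θ) = ½‖θ‖², minimized at θ = 0
def grad(theta):
    return theta                              # ∇L(θ) = θ for this loss
theta = np.array([3.0, -2.0])                 # initial parameters θ₀
eta = 0.1                                     # learning rate η
for step in range(200):
    theta = theta - eta * grad(theta)         # θ_{t+1} = θ_t - η × ∇L(θ_t)
print(theta, loss(theta))                     # both ≈ 0 after convergence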
SGD with momentum is still widely used, especially for computer vision:
# SGD with Momentum
v_t = β × v_{t-1} + (1 - β) × ∇L(θ_t) # velocity
θ_{t+1} = θ_t - η × v_t # update
Where:
β = momentum coefficient (typically 0.9)
v = exponentially weighted moving average of gradients
Momentum acts like a heavy ball rolling down the loss surface: gradients accumulate into a velocity, so updates smooth out oscillations across steep ravines while building speed along directions of consistent descent. Key hyperparameters:
| Hyperparameter | Typical Values | Effect |
|---|---|---|
| Learning rate | 0.001 - 0.1 | Primary determinant of convergence speed |
| Momentum | 0.9 - 0.999 | Higher = smoother updates, less oscillation |
| Weight decay | 1e-5 - 1e-3 | L2 regularization (when properly implemented) |
| Nesterov | True/False | Lookahead gradient for better momentum |
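These hyperparameters map directly onto PyTorch's built-in SGD optimizer; a minimal usage sketch follows (the linear model and random batch are placeholders). Note that PyTorch's momentum update omits the (1 - β) factor shown above, which only changes the effective scale of the step.
# SGD with momentum in PyTorch, using values from the table above
import torch
model = torch.nn.Linear(10, 1)                       # placeholder model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                                          # learning rate η
    momentum=0.9,                                    # momentum coefficient β
    weight_decay=1e-4,                               # L2 regularization
    nesterov=True,                                   # lookahead momentum
)
x, y = torch.randn(32, 10), torch.randn(32, 1)       # dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                      # compute gradients
optimizer.step()                                     # apply one update
optimizer.zero_grad()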
Adam combines momentum and per-parameter adaptive learning rates:
# Adam
m_t = β₁ × m_{t-1} + (1 - β₁) × ∇L # First moment (mean gradient)
v_t = β₂ × v_{t-1} + (1 - β₂) × (∇L)² # Second moment (variance of gradient)
m̂_t = m_t / (1 - β₁^t) # Bias correction for m
v̂_t = v_t / (1 - β₂^t) # Bias correction for v
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε)
Parameters with large gradients get smaller effective learning rates; parameters with small gradients get larger rates. This adapts to the geometry of the loss surface.
| Parameter | Default | Typical Range | Notes |
|---|---|---|---|
| β₁ (momentum) | 0.9 | 0.9 | Usually not tuned |
| β₂ (RMSprop decay) | 0.999 | 0.9 - 0.999 | Lower for non-stationary targets |
| ε (epsilon) | 1e-8 | 1e-8 to 1e-3 | Larger values damp the adaptive step, improving stability |
| Learning rate | 0.001 | 1e-5 to 1e-3 | Primary tuning parameter |
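A minimal NumPy sketch of the Adam loop, following the equations above (the gradient is a toy stand-in):
# Adam in NumPy, following the equations above (toy gradient)
import numpy as np
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)                      # first moment m₀ = 0
v = np.zeros_like(theta)                      # second moment v₀ = 0
eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = theta                                 # stand-in gradient ∇L(θ) = θ
    m = beta1 * m + (1 - beta1) * g           # biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2      # biased second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)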
AdamW (Loshchilov and Hutter, 2019) fixes how Adam handles weight decay:
# Adam with L2 regularization (wrong)
g_t = ∇L + λ × θ_t                            # weight decay folded into the gradient...
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε)          # ...so it gets rescaled by the adaptive step
# AdamW (correct)
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε) - η × λ × θ_t
#                                     ^ weight decay applied directly to the weights, not to the gradient
The difference matters: in Adam with L2, the adaptive learning rates reduce the regularization effect for parameters with large gradients. In AdamW, weight decay is applied uniformly regardless of gradient magnitude.
In plain SGD, L2 regularization and weight decay produce identical updates, but under Adam's per-parameter adaptive step sizes they do not, so the distinction matters in practice.
Use AdamW for transformer training—it's the standard optimizer for BERT, GPT, and similar models.
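In code, this usually just means choosing the decoupled implementation; a hedged PyTorch sketch follows (the model and hyperparameter values are placeholders):
# AdamW in PyTorch: weight decay is decoupled from the adaptive update
import torch
model = torch.nn.Linear(768, 768)             # stand-in for a transformer layer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,                        # applied directly to the weights
)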
RMSprop predates Adam and inspired its second-moment estimation:
v_t = β × v_{t-1} + (1 - β) × (∇L)² # Same as Adam's v_t
θ_{t+1} = θ_t - η × ∇L / √(v_t + ε) # No momentum term
Where:
β = decay rate (typically 0.9)
RMSprop is essentially Adam without the first-moment (momentum) term and without bias correction. It's particularly effective for RNNs and non-stationary problems.
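A minimal NumPy sketch of the RMSprop update, with a toy gradient standing in for ∇L:
# RMSprop in NumPy (toy gradient, illustrative values)
import numpy as np
theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
eta, beta, eps = 1e-3, 0.9, 1e-8
for t in range(1000):
    g = theta                                   # stand-in gradient
    v = beta * v + (1 - beta) * g ** 2          # running average of squared gradients
    theta = theta - eta * g / np.sqrt(v + eps)  # no momentum term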
# AdaGrad
v_t = v_{t-1} + (∇L)²
θ_{t+1} = θ_t - η × ∇L / √(v_t + ε)
AdaGrad accumulates squared gradients in v_t, dividing the effective learning rate. This automatically reduces learning rates for frequently occurring features and increases rates for rare features.
Problem: v_t only grows, so learning rates monotonically decrease. Training can stall completely after many iterations. AdaGrad is good for sparse data but rarely used for deep learning today.
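The shrinking step size is easy to see in a short NumPy sketch (toy gradient; the printed value is the per-parameter effective learning rate):
# AdaGrad in NumPy: the accumulator only grows, so effective steps shrink
import numpy as np
theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
eta, eps = 0.1, 1e-8
for t in range(1000):
    g = theta                                   # stand-in gradient
    v = v + g ** 2                              # accumulate squared gradients, never decays
    theta = theta - eta * g / np.sqrt(v + eps)
print(eta / np.sqrt(v + eps))                   # effective learning rate η / √(v_t + ε), which only shrinks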
| Optimizer | Best For | Strengths | Weaknesses |
|---|---|---|---|
| SGD + Momentum | CV, large batches | Often better generalization; scales well to large batches | Requires careful LR tuning |
| Adam | Quick experiments, NLP | Robust defaults, fast convergence | Often worse generalization |
| AdamW | Transformers, LLMs | Correct weight decay, stable | Similar to Adam |
| RMSprop | RNNs, RL | Handles non-stationary targets | No momentum |
| Lion | Emerging research | Memory efficient, simple | New, less validated |
The learning rate schedule is as important as the optimizer. Common schedules:
LR = LR₀ × drop^{floor(epoch / drop_epochs)}
Example: LR₀ = 0.1, drop = 0.1, drop_epochs = 30
Epochs 0-29: LR = 0.1
Epochs 30-59: LR = 0.01
Epochs 60+: LR = 0.001
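A one-line Python helper reproduces this schedule; in PyTorch the equivalent built-in is torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1).
# Step decay schedule, matching the example above
def step_decay_lr(epoch, lr0=0.1, drop=0.1, drop_epochs=30):
    return lr0 * drop ** (epoch // drop_epochs)
print(step_decay_lr(0), step_decay_lr(30), step_decay_lr(60))   # 0.1, 0.01, 0.001 (up to float rounding)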
LR_t = LR_min + (LR_max - LR_min) × (1 + cos(π × t / T)) / 2
Where:
t = current step
T = total steps
LR_max = maximum learning rate
LR_min = minimum learning rate (often 0)
Cosine annealing provides smooth, gradual learning rate reduction. With warm restarts (cosine annealing with periodic resets), it can escape local minima.
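The schedule is a few lines of Python (LR_min defaults to 0, matching the common case):
# Cosine annealing schedule, following the formula above
import math
def cosine_lr(t, T, lr_max, lr_min=0.0):
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t / T)) / 2
# cosine_lr(0, 10_000, 3e-4) == 3e-4; cosine_lr(10_000, 10_000, 3e-4) ≈ 0.0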
For t < warmup_steps:
LR_t = LR_target × (t / warmup_steps)
For t ≥ warmup_steps:
LR_t = LR_target
Warmup prevents large gradient updates early in training before the model has reasonable weights. It's standard for transformer training.
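A direct translation of the piecewise definition:
# Linear warmup schedule, following the piecewise definition above
def warmup_lr(t, warmup_steps, lr_target):
    if t < warmup_steps:
        return lr_target * t / warmup_steps     # ramp linearly from 0 to the target
    return lr_target                            # then hold the target rate
# warmup_lr(500, 2000, 3e-4) == 7.5e-5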
The OneCycle schedule combines warmup and annealing in a single run:
Phase 1 (warmup): Increase LR from ~1e-7 to LR_max
Phase 2 (annealing): Decrease LR from LR_max to LR_min
Total steps = exactly one cycle
Often combined with momentum scheduling
OneCycle (Smith and Topin, 2019) can achieve super-convergence: reaching the same or better final accuracy in far fewer training iterations, largely because the time spent at very large learning rates acts as a strong regularizer.
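PyTorch ships a OneCycle implementation; a hedged usage sketch (the model, step count, and rates are placeholders):
# OneCycle schedule via PyTorch's built-in scheduler
import torch
model = torch.nn.Linear(10, 1)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                                       # peak learning rate LR_max
    total_steps=1000,                                 # exactly one cycle over the run
    pct_start=0.3,                                    # fraction of steps in the warmup phase
)
# OneCycleLR also cycles momentum by default (cycle_momentum=True)
# training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()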
Recommended starting configurations:
Large language model pretraining:
Optimizer: AdamW
Learning rate: 1e-4 to 3e-4 (for 7B+ models)
β₁: 0.9
β₂: 0.95 (more stable than 0.999 for long training)
ε: 1e-8
Weight decay: 0.1
Schedule: Cosine with warmup (2000 steps for large models)
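A hedged PyTorch sketch of this configuration (the model and step counts are placeholders; the warmup-plus-cosine schedule is expressed through LambdaLR):
# AdamW with warmup + cosine decay, approximating the configuration above
import math
import torch
model = torch.nn.Linear(4096, 4096)                    # stand-in for a transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
)
warmup_steps, total_steps = 2000, 100_000              # illustrative step counts
def lr_scale(step):
    if step < warmup_steps:                            # linear warmup phase
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))    # cosine decay toward 0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)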
ImageNet-scale CNN training (e.g., ResNet):
Optimizer: SGD with momentum
Learning rate: 0.1 (for 256 batch), scale linearly with batch size
Momentum: 0.9
Weight decay: 1e-4
Schedule: Step decay by 10x at epochs 30, 60, 90 (for 90 epoch training)
Or cosine annealing
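And a matching sketch for the CNN recipe (placeholder model; MultiStepLR implements the step decay at epochs 30, 60, and 90):
# SGD with momentum and step decay, approximating the configuration above
import torch
model = torch.nn.Linear(2048, 1000)                    # stand-in for a ResNet classifier head
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4,
)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1,     # drop LR by 10x at these epochs
)
# call scheduler.step() once per epoch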
Fine-tuning and smaller-scale training:
Optimizer: AdamW or Lion
Learning rate: 1e-3 (Adam), 1e-4 (Lion)
Weight decay: 0.01
Schedule: Constant or cosine
# Gradient clipping by norm (most common)
∇L = ∇L × (max_norm / ||∇L||)   if ||∇L|| > max_norm
# Gradient clipping by value
∇L = clamp(∇L, -clip_value, +clip_value)
Gradient clipping prevents exploding gradients, especially important for RNNs and transformers early in training. Clip at global norm of 1.0 for transformers—it's a standard stability technique.
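Both flavors are one call in PyTorch, applied after backward() and before the optimizer step (the model and batch are placeholders):
# Gradient clipping in PyTorch (clip after backward, before optimizer.step)
import torch
model = torch.nn.Linear(10, 1)                          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 10), torch.randn(32, 1)          # dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip by global norm
# alternative: torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
optimizer.step()
optimizer.zero_grad()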
No single optimizer dominates all others. Adam/AdamW converges quickly and works well for most problems—it's the default choice for experimentation. SGD with momentum often generalizes better for computer vision, which is why many production vision models use it despite slower initial convergence.
The optimizer and learning rate schedule are intertwined: Adam works well with constant or cosine learning rates, while SGD typically requires step or cosine decay. Choose the combination that fits your problem domain, and tune the primary hyperparameters, above all the learning rate, before any finer-grained optimization.