SGD, Adam, AdamW, and the math intuition behind gradient-based optimization
Training deep neural networks is an optimization problem: find weights that minimize loss. The optimizer determines how we update weights given gradients. The choice of optimizer—and its hyperparameters—significantly affects training dynamics, final performance, and convergence speed.
This guide explains the mathematics behind major optimizers, when each performs well, and how learning rate scheduling interacts with optimizer choice.
All neural network optimizers are variations on gradient descent: move weights in the direction that reduces loss most rapidly.
θ_{t+1} = θ_t - η × ∇L(θ_t)
Where:
θ = model parameters
η = learning rate
∇L = gradient of loss w.r.t. parameters
The gradient points uphill; we move in the opposite direction. The learning rate η controls step size—too large and we overshoot; too small and we converge slowly.
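To make the update rule concrete, here is a minimal NumPy sketch of plain gradient descent on a toy quadratic loss (the loss function, starting point, and learning rate are illustrative choices, not from the text above):
# Vanilla gradient descent on a toy quadratic loss (illustrative)
import numpy as np
def loss(theta):
    return 0.5 * np.sum(theta ** 2)          # L(θ) = ½‖θ‖², minimized at θ = 0
def grad(theta):
    return theta                              # ∇L(θ) = θ for this loss
theta = np.array([3.0, -2.0])                 # initial parameters θ₀
eta = 0.1                                     # learning rate η
for step in range(200):
    theta = theta - eta * grad(theta)         # θ_{t+1} = θ_t - η × ∇L(θ_t)
print(theta, loss(theta))                     # both ≈ 0 after convergence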
SGD with momentum is still widely used, especially for computer vision:
# SGD with Momentum
v_t = β × v_{t-1} + (1 - β) × ∇L(θ_t) # velocity
θ_{t+1} = θ_t - η × v_t # update
Where:
β = momentum coefficient (typically 0.9)
v = exponentially weighted moving average of gradients
Momentum acts like a heavy ball rolling down the loss surface: gradients accumulate into a velocity, so updates smooth out oscillations across steep ravines while building speed along directions of consistent descent. Key hyperparameters:
| Hyperparameter | Typical Values | Effect |
|---|---|---|
| Learning rate | 0.001 - 0.1 | Primary determinant of convergence speed |
| Momentum | 0.9 - 0.999 | Higher = smoother updates, less oscillation |
| Weight decay | 1e-5 - 1e-3 | L2 regularization (when properly implemented) |
| Nesterov | True/False | Lookahead gradient for better momentum |
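These hyperparameters map directly onto PyTorch's built-in SGD optimizer; a minimal usage sketch follows (the linear model and random batch are placeholders). Note that PyTorch's momentum update omits the (1 - β) factor shown above, which only changes the effective scale of the step.
# SGD with momentum in PyTorch, using values from the table above
import torch
model = torch.nn.Linear(10, 1)                       # placeholder model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                                          # learning rate η
    momentum=0.9,                                    # momentum coefficient β
    weight_decay=1e-4,                               # L2 regularization
    nesterov=True,                                   # lookahead momentum
)
x, y = torch.randn(32, 10), torch.randn(32, 1)       # dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                      # compute gradients
optimizer.step()                                     # apply one update
optimizer.zero_grad()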
Adam combines momentum and per-parameter adaptive learning rates:
# Adam
m_t = β₁ × m_{t-1} + (1 - β₁) × ∇L # First moment (mean gradient)
v_t = β₂ × v_{t-1} + (1 - β₂) × (∇L)² # Second moment (variance of gradient)
m̂_t = m_t / (1 - β₁^t) # Bias correction for m
v̂_t = v_t / (1 - β₂^t) # Bias correction for v
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε)
Parameters with large gradients get smaller effective learning rates; parameters with small gradients get larger rates. This adapts to the geometry of the loss surface.
| Parameter | Default | Typical Range | Notes |
|---|---|---|---|
| β₁ (momentum) | 0.9 | 0.9 | Usually not tuned |
| β₂ (RMSprop decay) | 0.999 | 0.9 - 0.999 | Lower for non-stationary targets |
| ε (epsilon) | 1e-8 | 1e-8 to 1e-3 | Larger values damp the adaptive step, improving stability |
| Learning rate | 0.001 | 1e-5 to 1e-3 | Primary tuning parameter |
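A minimal NumPy sketch of the Adam loop, following the equations above (the gradient is a toy stand-in):
# Adam in NumPy, following the equations above (toy gradient)
import numpy as np
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)                      # first moment m₀ = 0
v = np.zeros_like(theta)                      # second moment v₀ = 0
eta, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for t in range(1, 1001):
    g = theta                                 # stand-in gradient ∇L(θ) = θ
    m = beta1 * m + (1 - beta1) * g           # biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2      # biased second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)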
AdamW (Loshchilov and Hutter, 2019) fixes how Adam handles weight decay:
# Adam with L2 regularization (wrong)
g_t = ∇L + λ × θ_t                            # weight decay folded into the gradient...
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε)          # ...so it gets rescaled by the adaptive step
# AdamW (correct)
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε) - η × λ × θ_t
#                                     ^ weight decay applied directly to the weights, not to the gradient
The difference matters: in Adam with L2, the adaptive learning rates reduce the regularization effect for parameters with large gradients. In AdamW, weight decay is applied uniformly regardless of gradient magnitude.
In plain SGD, L2 regularization and weight decay produce identical updates, but under Adam's per-parameter adaptive step sizes they do not, so the distinction matters in practice.
Use AdamW for transformer training—it's the standard optimizer for BERT, GPT, and similar models.
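In code, this usually just means choosing the decoupled implementation; a hedged PyTorch sketch follows (the model and hyperparameter values are placeholders):
# AdamW in PyTorch: weight decay is decoupled from the adaptive update
import torch
model = torch.nn.Linear(768, 768)             # stand-in for a transformer layer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,                        # applied directly to the weights
)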
RMSprop predates Adam and inspired its second-moment estimation:
v_t = β × v_{t-1} + (1 - β) × (∇L)² # Same as Adam's v_t
θ_{t+1} = θ_t - η × ∇L / √(v_t + ε) # No momentum term
Where:
β = decay rate (typically 0.9)
RMSprop is essentially Adam without the first-moment (momentum) term and without bias correction. It's particularly effective for RNNs and non-stationary problems.
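A minimal NumPy sketch of the RMSprop update, with a toy gradient standing in for ∇L:
# RMSprop in NumPy (toy gradient, illustrative values)
import numpy as np
theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
eta, beta, eps = 1e-3, 0.9, 1e-8
for t in range(1000):
    g = theta                                   # stand-in gradient
    v = beta * v + (1 - beta) * g ** 2          # running average of squared gradients
    theta = theta - eta * g / np.sqrt(v + eps)  # no momentum term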
# AdaGrad
v_t = v_{t-1} + (∇L)²
θ_{t+1} = θ_t - η × ∇L / √(v_t + ε)
AdaGrad accumulates squared gradients in v_t, dividing the effective learning rate. This automatically reduces learning rates for frequently occurring features and increases rates for rare features.
Problem: v_t only grows, so learning rates monotonically decrease. Training can stall completely after many iterations. AdaGrad is good for sparse data but rarely used for deep learning today.
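The shrinking step size is easy to see in a short NumPy sketch (toy gradient; the printed value is the per-parameter effective learning rate):
# AdaGrad in NumPy: the accumulator only grows, so effective steps shrink
import numpy as np
theta = np.array([3.0, -2.0])
v = np.zeros_like(theta)
eta, eps = 0.1, 1e-8
for t in range(1000):
    g = theta                                   # stand-in gradient
    v = v + g ** 2                              # accumulate squared gradients, never decays
    theta = theta - eta * g / np.sqrt(v + eps)
print(eta / np.sqrt(v + eps))                   # effective learning rate η / √(v_t + ε), which only shrinks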
| Optimizer | Best For | Strengths | Weaknesses |
|---|---|---|---|
| SGD + Momentum | CV, large batches | Often better generalization; scales well to large batches | Requires careful LR tuning |
| Adam | Quick experiments, NLP | Robust defaults, fast convergence | Often worse generalization |
| AdamW | Transformers, LLMs | Correct weight decay, stable | Similar to Adam |
| RMSprop | RNNs, RL | Handles non-stationary targets | No momentum |
| Lion | Emerging research | Memory efficient, simple | New, less validated |
The learning rate schedule is as important as the optimizer. Common schedules:
LR = LR₀ × drop^{floor(epoch / drop_epochs)}
Example: LR₀ = 0.1, drop = 0.1, drop_epochs = 30
Epochs 0-29: LR = 0.1
Epochs 30-59: LR = 0.01
Epochs 60+: LR = 0.001
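A one-line Python helper reproduces this schedule; in PyTorch the equivalent built-in is torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1).
# Step decay schedule, matching the example above
def step_decay_lr(epoch, lr0=0.1, drop=0.1, drop_epochs=30):
    return lr0 * drop ** (epoch // drop_epochs)
print(step_decay_lr(0), step_decay_lr(30), step_decay_lr(60))   # 0.1, 0.01, 0.001 (up to float rounding)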
LR_t = LR_min + (LR_max - LR_min) × (1 + cos(π × t / T)) / 2
Where:
t = current step
T = total steps
LR_max = maximum learning rate
LR_min = minimum learning rate (often 0)
Cosine annealing provides smooth, gradual learning rate reduction. With warm restarts (cosine annealing with periodic resets), it can escape local minima.
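The schedule is a few lines of Python (LR_min defaults to 0, matching the common case):
# Cosine annealing schedule, following the formula above
import math
def cosine_lr(t, T, lr_max, lr_min=0.0):
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t / T)) / 2
# cosine_lr(0, 10_000, 3e-4) == 3e-4; cosine_lr(10_000, 10_000, 3e-4) ≈ 0.0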
For t < warmup_steps:
LR_t = LR_target × (t / warmup_steps)
For t ≥ warmup_steps:
LR_t = LR_target
Warmup prevents large gradient updates early in training before the model has reasonable weights. It's standard for transformer training.
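A direct translation of the piecewise definition:
# Linear warmup schedule, following the piecewise definition above
def warmup_lr(t, warmup_steps, lr_target):
    if t < warmup_steps:
        return lr_target * t / warmup_steps     # ramp linearly from 0 to the target
    return lr_target                            # then hold the target rate
# warmup_lr(500, 2000, 3e-4) == 7.5e-5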
The OneCycle schedule combines warmup and annealing in a single run:
Phase 1 (warmup): Increase LR from ~1e-7 to LR_max
Phase 2 (annealing): Decrease LR from LR_max to LR_min
Total steps = exactly one cycle
Often combined with momentum scheduling
OneCycle (Smith and Topin, 2019) can achieve super-convergence: reaching the same or better final accuracy in far fewer training iterations, largely because the time spent at very large learning rates acts as a strong regularizer.
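PyTorch ships a OneCycle implementation; a hedged usage sketch (the model, step count, and rates are placeholders):
# OneCycle schedule via PyTorch's built-in scheduler
import torch
model = torch.nn.Linear(10, 1)                        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,                                       # peak learning rate LR_max
    total_steps=1000,                                 # exactly one cycle over the run
    pct_start=0.3,                                    # fraction of steps in the warmup phase
)
# OneCycleLR also cycles momentum by default (cycle_momentum=True)
# training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()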
Recommended starting configurations:
Large language model pretraining:
Optimizer: AdamW
Learning rate: 1e-4 to 3e-4 (for 7B+ models)
β₁: 0.9
β₂: 0.95 (more stable than 0.999 for long training)
ε: 1e-8
Weight decay: 0.1
Schedule: Cosine with warmup (2000 steps for large models)
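A hedged PyTorch sketch of this configuration (the model and step counts are placeholders; the warmup-plus-cosine schedule is expressed through LambdaLR):
# AdamW with warmup + cosine decay, approximating the configuration above
import math
import torch
model = torch.nn.Linear(4096, 4096)                    # stand-in for a transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1,
)
warmup_steps, total_steps = 2000, 100_000              # illustrative step counts
def lr_scale(step):
    if step < warmup_steps:                            # linear warmup phase
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))    # cosine decay toward 0
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)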
ImageNet-scale CNN training (e.g., ResNet):
Optimizer: SGD with momentum
Learning rate: 0.1 (for 256 batch), scale linearly with batch size
Momentum: 0.9
Weight decay: 1e-4
Schedule: Step decay by 10x at epochs 30, 60, 90 (for 90 epoch training)
Or cosine annealing
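And a matching sketch for the CNN recipe (placeholder model; MultiStepLR implements the step decay at epochs 30, 60, and 90):
# SGD with momentum and step decay, approximating the configuration above
import torch
model = torch.nn.Linear(2048, 1000)                    # stand-in for a ResNet classifier head
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4,
)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1,     # drop LR by 10x at these epochs
)
# call scheduler.step() once per epoch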
Fine-tuning and smaller-scale training:
Optimizer: AdamW or Lion
Learning rate: 1e-3 (Adam), 1e-4 (Lion)
Weight decay: 0.01
Schedule: Constant or cosine
# Gradient clipping by norm (most common)
∇L = ∇L × (max_norm / ||∇L||)   if ||∇L|| > max_norm
# Gradient clipping by value
∇L = clamp(∇L, -clip_value, +clip_value)
Gradient clipping prevents exploding gradients, especially important for RNNs and transformers early in training. Clip at global norm of 1.0 for transformers—it's a standard stability technique.
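Both flavors are one call in PyTorch, applied after backward() and before the optimizer step (the model and batch are placeholders):
# Gradient clipping in PyTorch (clip after backward, before optimizer.step)
import torch
model = torch.nn.Linear(10, 1)                          # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 10), torch.randn(32, 1)          # dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip by global norm
# alternative: torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
optimizer.step()
optimizer.zero_grad()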
No single optimizer dominates all others. Adam/AdamW converges quickly and works well for most problems—it's the default choice for experimentation. SGD with momentum often generalizes better for computer vision, which is why many production vision models use it despite slower initial convergence.
The optimizer and learning rate schedule are intertwined: Adam works well with constant or cosine learning rates, while SGD typically requires step or cosine decay. Choose the combination that fits your problem domain, and tune the primary hyperparameters, above all the learning rate, before any finer-grained optimization.