Deep Learning Optimizers Comparison

SGD, Adam, AdamW, and the math intuition behind gradient-based optimization

Published: January 2026 | Reading Time: 15 minutes | Category: AI & Machine Learning

Training deep neural networks is an optimization problem: find weights that minimize loss. The optimizer determines how we update weights given gradients. The choice of optimizer—and its hyperparameters—significantly affects training dynamics, final performance, and convergence speed.

This guide explains the mathematics behind major optimizers, when each performs well, and how learning rate scheduling interacts with optimizer choice.

Gradient Descent: The Foundation

All neural network optimizers are variations on gradient descent: move weights in the direction that reduces loss most rapidly.

θ_{t+1} = θ_t - η × ∇L(θ_t)

Where:
  θ = model parameters
  η = learning rate
  ∇L = gradient of loss w.r.t. parameters
    

The gradient points uphill; we move in the opposite direction. The learning rate η controls step size—too large and we overshoot; too small and we converge slowly.
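The update rule can be sketched in a few lines. This is a toy example (function name and quadratic loss are illustrative, not from a library): minimizing L(θ) = (θ - 3)², whose gradient is 2(θ - 3) and whose minimum is θ = 3.

```python
# Vanilla gradient descent on a 1-D toy quadratic L(θ) = (θ - 3)².
def gradient_descent(theta, lr, steps):
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)  # ∇L(θ_t)
        theta -= lr * grad          # θ_{t+1} = θ_t - η × ∇L(θ_t)
    return theta

theta = gradient_descent(theta=0.0, lr=0.1, steps=100)  # converges toward 3.0
```

With η = 0.1 the error shrinks by a constant factor each step; with η > 1 on this quadratic the iterates would overshoot and diverge, illustrating the step-size trade-off.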

The Three Variants

SGD: The Workhorse

SGD with momentum is still widely used, especially for computer vision:

# SGD with Momentum
v_t = β × v_{t-1} + (1 - β) × ∇L(θ_t)  # velocity
θ_{t+1} = θ_t - η × v_t                  # update

Where:
  β = momentum coefficient (typically 0.9)
  v = exponentially weighted moving average of gradients
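The momentum update above translates directly to code. A minimal sketch on the same toy quadratic L(θ) = (θ - 3)² (values are illustrative; note that some frameworks, e.g. PyTorch's SGD, use v = βv + ∇L without the (1 - β) factor, which only rescales the effective learning rate):

```python
# SGD with momentum (EMA form, matching the equations above)
# on the toy quadratic L(θ) = (θ - 3)².
def sgd_momentum(theta, lr=0.1, beta=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        grad = 2.0 * (theta - 3.0)        # ∇L(θ_t)
        v = beta * v + (1 - beta) * grad  # v_t: EMA of gradients
        theta -= lr * v                   # θ_{t+1} = θ_t - η × v_t
    return theta
```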
    

Intuition: The Ball Analogy

Momentum simulates a ball rolling down a loss surface: the gradient accelerates the ball, and the accumulated velocity carries it through small bumps and noisy regions. This damps oscillation across steep, narrow valleys while speeding progress along directions where gradients point consistently.

SGD Hyperparameters

Hyperparameter   Typical Values   Effect
Learning rate    0.001 - 0.1      Primary determinant of convergence speed
Momentum         0.9 - 0.999      Higher = smoother updates, less oscillation
Weight decay     1e-5 - 1e-3      L2 regularization (when properly implemented)
Nesterov         True/False       Lookahead gradient for better momentum

When SGD Wins

Generalization Puzzle: Adaptive methods (Adam, RMSprop) often converge faster but SGD generalizes better. This remains an active research area. For production models where generalization matters, SGD with momentum is worth considering despite slower initial convergence.

Adam: Adaptive Moment Estimation

Adam combines momentum and per-parameter adaptive learning rates:

# Adam
m_t = β₁ × m_{t-1} + (1 - β₁) × ∇L    # First moment (mean gradient)
v_t = β₂ × v_{t-1} + (1 - β₂) × (∇L)²  # Second moment (variance of gradient)

m̂_t = m_t / (1 - β₁^t)    # Bias correction for m
v̂_t = v_t / (1 - β₂^t)    # Bias correction for v

θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε)
    

Intuition

Parameters with large gradients get smaller effective learning rates; parameters with small gradients get larger rates. This adapts to the geometry of the loss surface.
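The full update loop is short enough to write out. A hedged sketch on the same toy quadratic L(θ) = (θ - 3)², using the common default hyperparameters (the function name and toy loss are illustrative):

```python
import math

# Adam on the toy quadratic L(θ) = (θ - 3)², following the equations above.
def adam(theta, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * (theta - 3.0)        # ∇L(θ_t)
        m = b1 * m + (1 - b1) * g      # first moment (mean gradient)
        v = b2 * v + (1 - b2) * g * g  # second moment (uncentered variance)
        m_hat = m / (1 - b1 ** t)      # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

Because m̂/√v̂ has magnitude near 1 when gradients are consistent, early steps move roughly η per iteration regardless of gradient scale, which is why Adam is relatively insensitive to the raw magnitude of ∇L.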

Adam Hyperparameters

Parameter            Default   Typical Range    Notes
β₁ (momentum)        0.9       0.9              Usually not tuned
β₂ (RMSprop decay)   0.999     0.9 - 0.999      Lower for non-stationary targets
ε (epsilon)          1e-8      1e-8 to 1e-3     Larger = more regularization
Learning rate        0.001     1e-5 to 1e-3     Primary tuning parameter

AdamW: Decoupling Weight Decay

AdamW (Loshchilov and Hutter, 2019) fixes how Adam handles weight decay:

# Adam with L2 regularization (decay coupled to adaptive scaling)
g_t = ∇L + λ × θ_t                       # decay folded into the gradient,
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε)   # so λθ_t passes through m and v

# AdamW (decoupled weight decay)
θ_{t+1} = θ_t - η × m̂_t / (√v̂_t + ε) - η × λ × θ_t
                                         _____________
                                               |
                                   Weight decay applied directly
                                   to θ_t, bypassing the adaptive
                                   moment estimates
    

The difference matters: in Adam with L2, the adaptive learning rates reduce the regularization effect for parameters with large gradients. In AdamW, weight decay is applied uniformly regardless of gradient magnitude.
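A single decoupled update step can be written out explicitly. This sketch follows the AdamW equation above, applying decay to θ_t directly (the function name and toy defaults are illustrative; framework implementations may order the decay and adaptive step slightly differently):

```python
import math

# One AdamW step: adaptive update plus decoupled weight decay on θ_t.
def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g            # first moment (decay NOT folded in)
    v = b2 * v + (1 - b2) * g * g        # second moment
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = (theta
             - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step
             - lr * wd * theta)                       # decoupled decay
    return theta, m, v
```

With a zero gradient the adaptive step vanishes and the parameter still shrinks by the factor (1 - ηλ), which is exactly the uniform decay that Adam-with-L2 fails to deliver.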

Weight Decay vs L2 Regularization

Despite similar mathematics, they're different in Adam: the L2 term λθ enters the gradient and is rescaled by the adaptive denominator √v̂_t + ε, so parameters with large gradient histories are regularized less. Decoupled weight decay shrinks every parameter at the same relative rate ηλ, independent of gradient magnitude.

Use AdamW for transformer training—it's the standard optimizer for BERT, GPT, and similar models.

RMSprop: Root Mean Square Propagation

RMSprop predates Adam and inspired its second-moment estimation:

v_t = β × v_{t-1} + (1 - β) × (∇L)²  # Same as Adam's v_t
θ_{t+1} = θ_t - η × ∇L / √(v_t + ε)   # No momentum term

Where:
  β = decay rate (typically 0.9)
    

RMSprop omits Adam's first-moment (momentum) term and bias correction. It's particularly effective for RNNs and non-stationary problems such as reinforcement learning.
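For comparison with the Adam sketch, here is the same toy quadratic L(θ) = (θ - 3)² under the RMSprop update above (toy values, illustrative names):

```python
import math

# RMSprop on the toy quadratic L(θ) = (θ - 3)²: per-parameter scaling, no momentum.
def rmsprop(theta, lr=0.01, beta=0.9, eps=1e-8, steps=2000):
    v = 0.0
    for _ in range(steps):
        g = 2.0 * (theta - 3.0)
        v = beta * v + (1 - beta) * g * g      # EMA of squared gradients
        theta -= lr * g / (math.sqrt(v) + eps) # step scaled by gradient history
    return theta
```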

AdaGrad: Adaptive Gradient

v_t = v_{t-1} + (∇L)²
θ_{t+1} = θ_t - η × ∇L / √(v_t + ε)
    

AdaGrad accumulates all past squared gradients in v_t and divides each step by √v_t. This automatically reduces learning rates for frequently updated parameters while keeping them larger for rare features.

Problem: v_t only grows, so learning rates monotonically decrease. Training can stall completely after many iterations. AdaGrad is good for sparse data but rarely used for deep learning today.
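The stall is easy to demonstrate numerically. With a constant gradient, v_t grows linearly in t, so the effective step shrinks like η/√t (hypothetical helper, toy values):

```python
import math

# Effective AdaGrad step sizes under a constant gradient g:
# v_t = t × g², so the step is η × g / √(t × g²) ≈ η / √t.
def adagrad_steps(g=1.0, lr=0.1, eps=1e-8, steps=10000):
    v, sizes = 0.0, []
    for _ in range(steps):
        v += g * g                               # accumulator only grows
        sizes.append(lr * g / (math.sqrt(v) + eps))
    return sizes

sizes = adagrad_steps()  # sizes[0] ≈ 0.1, sizes[9999] ≈ 0.001
```

After 10,000 iterations the step is 100× smaller than at the start, and it keeps shrinking, which is exactly why long training runs stall.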

Optimizer Comparison

Optimizer        Best For                Strengths                          Weaknesses
SGD + Momentum   CV, large batches       Better generalization, scales      Requires careful LR tuning
Adam             Quick experiments, NLP  Robust defaults, fast convergence  Often worse generalization
AdamW            Transformers, LLMs      Correct weight decay, stable       Similar to Adam
RMSprop          RNNs, RL                Handles non-stationary targets     No momentum
Lion             Emerging research       Memory efficient, simple           New, less validated

Learning Rate Scheduling

The learning rate schedule is as important as the optimizer. Common schedules:

Step Decay

LR = LR₀ × drop^{floor(epoch / drop_epochs)}

Example: LR₀ = 0.1, drop = 0.1, drop_epochs = 30
Epochs 0-29:  LR = 0.1
Epochs 30-59: LR = 0.01
Epochs 60+:   LR = 0.001
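The schedule is a one-liner; this sketch mirrors the example values above (function name is illustrative):

```python
# Step decay: multiply by `drop` every `drop_epochs` epochs.
def step_decay(lr0, drop, drop_epochs, epoch):
    return lr0 * drop ** (epoch // drop_epochs)

# step_decay(0.1, 0.1, 30, epoch) gives 0.1 for epochs 0-29,
# 0.01 for 30-59, and 0.001 from epoch 60 on.
```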
    

Cosine Annealing

LR_t = LR_min + (LR_max - LR_min) × (1 + cos(π × t / T)) / 2

Where:
  t = current step
  T = total steps
  LR_max = maximum learning rate
  LR_min = minimum learning rate (often 0)
    

Cosine annealing provides smooth, gradual learning rate reduction. With warm restarts (cosine annealing with periodic resets), it can escape local minima.
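In code, with an LR_min floor (setting it to 0 recovers the simpler form; the function name is illustrative):

```python
import math

# Cosine annealing from lr_max down to lr_min over T steps.
def cosine_lr(t, T, lr_max, lr_min=0.0):
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t / T)) / 2

# cosine_lr(0, T, 0.1) == 0.1; halfway it is 0.05; at t == T it reaches lr_min.
```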

Warmup

For t < warmup_steps:
  LR_t = LR_target × (t / warmup_steps)
For t ≥ warmup_steps:
  LR_t = LR_target
    

Warmup prevents large gradient updates early in training before the model has reasonable weights. It's standard for transformer training.
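The piecewise rule above as a function (illustrative name; in practice warmup is usually chained into a decay schedule rather than held constant):

```python
# Linear warmup to lr_target over warmup_steps, then constant.
def warmup_lr(t, warmup_steps, lr_target):
    if t < warmup_steps:
        return lr_target * t / warmup_steps
    return lr_target
```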

OneCycle

Phase 1 (warmup):   Increase LR from ~1e-7 to LR_max
Phase 2 (annealing): Decrease LR from LR_max to LR_min

Total steps = exactly one cycle
Often combined with momentum scheduling
    

OneCycle (Smith and Topin, 2019) targets super-convergence: reaching the same or better final accuracy in far fewer epochs, largely by spending most of training at large learning rates, which also act as a regularizer.
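One possible sketch of the two phases, with a linear ramp for the first 30% of steps and cosine annealing afterward. The 30% split and the function name are illustrative choices, not prescribed by the schedule:

```python
import math

# OneCycle sketch: linear warmup to lr_max, then cosine anneal to lr_min.
def onecycle_lr(t, total, lr_max, lr_min=1e-7, pct_warmup=0.3):
    warm = int(total * pct_warmup)
    if t < warm:                                   # phase 1: ramp up
        return lr_min + (lr_max - lr_min) * t / warm
    frac = (t - warm) / (total - warm)             # phase 2: anneal down
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * frac)) / 2
```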

Practical Recommendations

Transformer / LLM Training

Optimizer:    AdamW
Learning rate: 1e-4 to 3e-4 (for 7B+ models)
β₁:           0.9
β₂:           0.95 (more stable than 0.999 for long training)
ε:            1e-8
Weight decay:  0.1
Schedule:     Cosine with warmup (2000 steps for large models)
    

Computer Vision (ResNet, etc.)

Optimizer:    SGD with momentum
Learning rate: 0.1 (for 256 batch), scale linearly with batch size
Momentum:     0.9
Weight decay:  1e-4
Schedule:     Step decay by 10x at epochs 30, 60, 90 (for 90 epoch training)
              Or cosine annealing
    

Quick Experiment / Prototyping

Optimizer:    AdamW or Lion
Learning rate: 1e-3 (Adam), 1e-4 (Lion)
Weight decay:  0.01
Schedule:     Constant or cosine
    

Gradient Clipping

# Gradient clipping by norm (most common)
if ||∇L|| > max_norm:
    ∇L = ∇L × (max_norm / ||∇L||)

# Gradient clipping by value
∇L = clamp(∇L, -clip_value, +clip_value)
    

Gradient clipping prevents exploding gradients, especially important for RNNs and transformers early in training. Clip at global norm of 1.0 for transformers—it's a standard stability technique.
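Clipping by global norm is a few lines over a flat gradient vector (illustrative helper; real frameworks compute the norm across all parameter tensors at once):

```python
import math

# Rescale the gradient vector if its L2 norm exceeds max_norm,
# preserving its direction.
def clip_by_norm(grad, max_norm):
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grad]
    return grad

# clip_by_norm([3.0, 4.0], 1.0) has norm 5, so it is scaled to norm 1;
# a gradient already within the bound is returned unchanged.
```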

Conclusion

No single optimizer dominates all others. Adam/AdamW converges quickly and works well for most problems—it's the default choice for experimentation. SGD with momentum often generalizes better for computer vision, which is why many production vision models use it despite slower initial convergence.

The optimizer and learning rate schedule are intertwined: Adam works with constant or cosine learning rates; SGD typically requires step or cosine decay. Choose the combination that fits your problem domain and experiment with the primary hyperparameters before deep optimization.