Feb 27

Deep Learning Optimizers: SGD, Adam, and Beyond

Mindli Team

AI-Generated Content

Training a neural network is an exercise in high-dimensional optimization: finding the model parameters that minimize a loss function across a vast and complex landscape. The algorithm you choose to navigate this landscape—your optimizer—profoundly impacts training speed, final model performance, and stability. From the foundational Stochastic Gradient Descent to sophisticated adaptive methods like Adam, understanding these tools is crucial for efficient and effective deep learning.

The Foundation: Stochastic Gradient Descent and Momentum

At its core, Stochastic Gradient Descent (SGD) is the workhorse of neural network training. Instead of computing the gradient using the entire dataset, which is computationally prohibitive, SGD estimates it using a small, randomly sampled batch. This introduces noise that can help escape shallow local minima. The update rule is simple:

w_{t+1} = w_t − η ∇L(w_t)

where w_t are the weights at step t, η is the learning rate, and ∇L(w_t) is the gradient of the loss with respect to the weights.
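As a concrete illustration, here is a minimal NumPy sketch of this update rule on a toy one-parameter loss f(w) = w² (the toy loss and the learning rate are illustrative choices, not from the article):

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """One SGD update: w <- w - lr * grad."""
    return w - lr * grad

# Toy loss f(w) = w^2, whose gradient is 2w; in real training `grad`
# would be estimated from a random mini-batch.
w = np.array([1.0])
for _ in range(100):
    w = sgd_step(w, 2.0 * w, lr=0.1)
```

Each step multiplies w by (1 − 2η), so the iterate shrinks geometrically toward the minimum at zero.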

The primary challenge with vanilla SGD is its propensity to oscillate in ravines of the loss landscape (areas with steep curvature in one dimension and a gentle slope in another). This leads to slow convergence. SGD with momentum addresses this by introducing a velocity term, v_t, that accumulates past gradients, giving the optimization process inertia. The update equations are:

v_{t+1} = β v_t + ∇L(w_t)
w_{t+1} = w_t − η v_{t+1}

Here, β is the momentum coefficient, typically set to 0.9. Think of it as a heavy ball rolling down a hill; it builds up speed in consistent directions and dampens oscillations, allowing for faster traversal of flat regions and smoother descent through ravines.
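A minimal NumPy sketch of these two update equations on the same kind of toy quadratic (the hyperparameter values and the toy problem are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """SGD with momentum: accumulate a velocity, then step along it."""
    v = beta * v + grad  # velocity builds up in consistent gradient directions
    w = w - lr * v
    return w, v

# Toy quadratic f(w) = w^2, gradient 2w
w, v = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, v = momentum_step(w, v, 2.0 * w)
```

With β = 0.9, the long-run effective step size is roughly η/(1 − β), which is why the raw learning rate is usually set lower than for vanilla SGD.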

Adaptive Learning Rates: The RMSProp Approach

While momentum adjusts the direction of updates based on past gradients, the learning rate remains a single, global hyperparameter. This is problematic when features have different frequencies or scales. RMSProp (Root Mean Square Propagation) introduces per-parameter adaptive learning rates.

RMSProp maintains a moving average of the squared gradients for each weight. This average, s_t, estimates the magnitude of recent gradients for that parameter. The update scales the learning rate for each weight inversely by the square root of this moving average:

s_{t+1} = ρ s_t + (1 − ρ) (∇L(w_t))²
w_{t+1} = w_t − η ∇L(w_t) / (√s_{t+1} + ε)

The decay rate ρ is often set to 0.9, and ε is a small constant (e.g., 10⁻⁸) for numerical stability. Parameters with large, frequent gradients (steep slopes) get a reduced effective learning rate, while parameters with small, infrequent gradients get a larger one. This helps navigate ill-conditioned landscapes but can be overly aggressive, leading to premature convergence in sharp minima.
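A NumPy sketch of these updates on an ill-conditioned toy quadratic, where one direction is 100× steeper than the other (the problem and hyperparameters are illustrative):

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: scale each parameter's step by its recent gradient magnitude."""
    s = rho * s + (1 - rho) * grad**2        # moving average of squared gradients
    w = w - lr * grad / (np.sqrt(s) + eps)   # per-parameter effective learning rate
    return w, s

# Ill-conditioned toy loss f(w) = 100*w0^2 + w1^2
w, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(1000):
    grad = np.array([200.0 * w[0], 2.0 * w[1]])
    w, s = rmsprop_step(w, s, grad)
```

Because each coordinate's step is normalized by its own gradient scale, the steep and gentle directions make progress at comparable rates despite the 100× curvature gap.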

The Adaptive Moment Estimator: Combining Ideas in Adam

Adam (Adaptive Moment Estimation) is arguably the most widely used optimizer, as it elegantly combines the concepts of momentum and adaptive learning rates. It maintains two moving averages for each parameter: the first moment (the mean of gradients, m_t) and the second moment (the uncentered variance of gradients, v_t).

The algorithm estimates these moments with bias correction to account for their initialization at zero. The update rules are:

m_{t+1} = β₁ m_t + (1 − β₁) ∇L(w_t)
v_{t+1} = β₂ v_t + (1 − β₂) (∇L(w_t))²
m̂ = m_{t+1} / (1 − β₁^{t+1}),   v̂ = v_{t+1} / (1 − β₂^{t+1})
w_{t+1} = w_t − η m̂ / (√v̂ + ε)

Standard values are β₁ = 0.9 and β₂ = 0.999. Adam acts like momentum-guided SGD (m̂ provides direction and velocity) with per-parameter learning rates scaled by √v̂ (like RMSProp). This makes it robust to the choice of learning rate and typically converges very quickly.
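Putting these pieces together, a minimal NumPy sketch of one Adam step (the toy problem and iteration count are illustrative):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum-like mean)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (RMSProp-like scale)
    m_hat = m / (1 - beta1**t)               # bias correction: m and v start at zero
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy quadratic f(w) = w^2; t is 1-indexed so the bias correction is defined
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, 2.0 * w, t)
```

Note that when the gradient sign is consistent, m̂/√v̂ is close to ±1, so each step moves roughly η per iteration regardless of the raw gradient scale.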

Advanced Refinements: Scheduling, Decoupling, and Scaling

Modern optimizers build upon Adam and SGD with key refinements for better generalization and scalability.

Learning rate scheduling is critical. Instead of a fixed η, the learning rate is decayed over time. Cosine annealing with warm restarts is a powerful strategy where the learning rate follows a cosine curve down to a minimum value over a set number of epochs (the period), then jumps back to a high value (a "warm restart"). This restarting mechanism helps the model escape saddle points or local minima, acting as a simulated annealing process.
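A minimal sketch of a fixed-period version of this schedule (the original SGDR recipe typically lengthens the period after each restart; base_lr, min_lr, and period here are illustrative choices):

```python
import math

def cosine_restarts_lr(step, base_lr=0.1, min_lr=0.001, period=100):
    """Cosine-annealed learning rate that jumps back to base_lr every `period` steps."""
    t = step % period  # position within the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / period))

# Step 0 and step 100 (a warm restart) both give base_lr;
# halfway through a cycle the rate is midway between base_lr and min_lr.
```

In a training loop you would call this once per step (or per epoch) and assign the result to the optimizer's learning rate before the parameter update.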

Weight decay is traditionally implemented as an L2 penalty added to the loss. In Adam, however, that penalty becomes entangled with the adaptive learning-rate scaling, making its effect dependent on the gradient history. AdamW fixes this by decoupling weight decay from the adaptive gradient update. The decay term is added directly to the final update, independent of v̂:

w_{t+1} = w_t − η (m̂ / (√v̂ + ε) + λ w_t)

where λ is the weight decay coefficient. This leads to more effective regularization and often better generalization performance.
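The one-line difference from plain Adam can be sketched as follows (NumPy, illustrative hyperparameters; the key point is that the decay term sits outside the √v̂ scaling):

```python
import numpy as np

def adamw_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """Adam with decoupled weight decay (AdamW)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Decoupled decay: wd * w is added to the update directly,
    # so it is NOT rescaled by the gradient history stored in v_hat.
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w), m, v

# With a zero gradient, the update reduces to pure weight decay:
w, m, v = adamw_step(np.array([1.0]), np.zeros(1), np.zeros(1),
                     np.zeros(1), t=1)
```

With grad = 0, the step is exactly −η·λ·w, which is the behavior coupled L2-in-the-loss weight decay fails to reproduce under Adam's adaptive scaling.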

For large-batch training, such as in distributed settings, simply scaling up the batch size can cause optimization instability. LAMB (Layer-wise Adaptive Moments optimizer for Batch training) adapts Adam's per-layer learning rate by normalizing the update size by the norm of that layer's weights. This allows extreme batch-size scaling (e.g., batches of 32,000+ examples) without losing accuracy, because the per-layer trust ratio keeps each layer's update magnitude calibrated to the scale of its weights.
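The core of LAMB's layer-wise adaptation is a trust ratio comparing a layer's weight norm to its update norm; a minimal sketch (the function name and the neutral fallback of 1.0 are my own choices, not from the article):

```python
import numpy as np

def trust_ratio(w_layer, update):
    """LAMB-style layer-wise scaling: ||w|| / ||update||,
    with a neutral fallback of 1.0 when either norm is zero."""
    w_norm = np.linalg.norm(w_layer)
    u_norm = np.linalg.norm(update)
    return w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

# The per-layer step becomes: w <- w - lr * trust_ratio(w, u) * u,
# where u is the usual bias-corrected Adam update for that layer.
r = trust_ratio(np.array([3.0, 4.0]), np.array([1.0, 0.0]))
```

Layers whose proposed update is large relative to their weights get scaled down, which is what keeps very large batch sizes from destabilizing individual layers.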

Common Pitfalls

  1. Using Adam as a Default Without Thought: While Adam converges quickly, it can sometimes generalize worse than SGD with momentum on certain tasks, particularly in computer vision. For convolutional networks, SGD with momentum and a good learning rate schedule often produces state-of-the-art results, though it requires more careful tuning.
  2. Neglecting Learning Rate Scheduling: A constant learning rate often stalls progress late in training. Always use a schedule. Start with simple step decay or linear warmup, and experiment with cosine annealing for more complex tasks.
  3. Confusing Weight Decay in Adam and AdamW: Using the traditional weight decay implementation (adding it to the loss) with the Adam optimizer is less effective. If your framework's "Adam" has a weight_decay parameter, verify if it implements the correct AdamW decoupling. If not, use a separate implementation.
  4. Overlooking Batch Size and Optimizer Pairing: When dramatically increasing batch size for faster training, be aware of the optimizer's limitations. Vanilla SGD and Adam may become unstable; consider using LAMB or adjusting other hyperparameters like learning rate scaling.

Summary

  • The optimizer's role is to efficiently navigate the high-dimensional loss landscape. Stochastic Gradient Descent (SGD) with momentum adds inertia to smooth out oscillations and speed up convergence in consistent directions.
  • Adaptive learning rate methods like RMSProp and Adam scale the update for each parameter individually, making them less sensitive to the initial global learning rate and excellent for sparse or noisy data.
  • Adam combines momentum and adaptive learning rates, but its standard form can generalize poorly compared to SGD+Momentum on some architectures. Its improved variant, AdamW, correctly decouples weight decay for better regularization.
  • Learning rate scheduling, especially advanced methods like cosine annealing with warm restarts, is non-negotiable for achieving peak performance, as it lets the model explore broadly early on and then settle into good minima.
  • Your choice should be informed by model architecture and dataset characteristics. Experiment: use Adam/AdamW as a robust starting point, but for CNNs, always benchmark against tuned SGD with momentum. For large batch training, employ optimizers like LAMB designed for stability at scale.
