Subgradient Methods
When optimizing real-world problems in machine learning, engineering, or economics, you will inevitably encounter functions that are not smoothly differentiable. Hinge losses, L1 regularization, and ReLU activations are cornerstone examples. The subgradient method is the fundamental workhorse for minimizing such nonsmooth convex functions. Unlike gradient descent, which fails when the gradient does not exist, subgradient methods provide a robust, iterative framework for navigating the slopes of functions with "kinks" and corners, trading off the speed of smooth optimization for broad applicability.
Subgradients and Subdifferentials: The Foundation
For a smooth convex function $f$, the gradient at a point $x$, $\nabla f(x)$, defines a unique supporting hyperplane to the function's graph. For a nonsmooth convex function, this unique tangent is replaced by a set of supporting hyperplanes. A vector $g \in \mathbb{R}^n$ is called a subgradient of $f$ at a point $x$ if it satisfies the supporting hyperplane inequality for all $y$:

$$f(y) \ge f(x) + g^\top (y - x).$$

Geometrically, each subgradient defines a line (or plane) that lies completely below the graph of $f$. The set of all subgradients at $x$ is called the subdifferential, denoted $\partial f(x)$. This is always a closed, convex set. At a point where $f$ is differentiable, the subdifferential contains only the gradient: $\partial f(x) = \{\nabla f(x)\}$. At a "kink," like the vertex of the absolute value function at $x = 0$, the subdifferential is an interval of slopes. For $f(x) = |x|$, we have $\partial f(0) = [-1, 1]$.
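The subgradient inequality is easy to verify numerically. This small sketch (the function, test grid, and helper name are illustrative choices) checks that every slope in $[-1, 1]$ satisfies the supporting hyperplane inequality for $f(x) = |x|$ at $x = 0$, while a slope outside the interval fails somewhere:

```python
import numpy as np

# For f(x) = |x|, g is a subgradient at x = 0 iff |y| >= g * y for all y,
# which holds exactly when g lies in the interval [-1, 1].
def is_subgradient_at_zero(g, ys):
    return all(abs(y) >= g * y - 1e-12 for y in ys)

ys = np.linspace(-5.0, 5.0, 101)
for g in (-1.0, -0.3, 0.0, 0.7, 1.0):
    assert is_subgradient_at_zero(g, ys)    # slopes inside the subdifferential

assert not is_subgradient_at_zero(1.5, ys)  # 1.5 is not a subgradient at 0
```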
Understanding the subdifferential is crucial because it generalizes the first-order optimality condition. For a convex function, a point $x^\star$ is a global minimizer if and only if $0 \in \partial f(x^\star)$. This is the nonsmooth counterpart to $\nabla f(x^\star) = 0$.
The Subgradient Method Algorithm and Step Size Rules
The basic subgradient method mimics gradient descent but uses an arbitrary subgradient. Starting from an initial point $x^{(0)}$, the iteration for $k = 0, 1, 2, \dots$ is:

$$x^{(k+1)} = x^{(k)} - \alpha_k g^{(k)}, \qquad g^{(k)} \in \partial f(x^{(k)}),$$

where $\alpha_k > 0$ is the step size. A critical distinction from gradient descent is that $-g^{(k)}$ is not necessarily a descent direction. The function value can increase from the previous iteration. Therefore, the algorithm must keep track of the best point found so far, $f^{(k)}_{\text{best}} = \min_{0 \le i \le k} f(x^{(i)})$.
The choice of step size $\alpha_k$ is paramount and follows different rules than smooth optimization due to the lack of a guaranteed descent direction. Common, theoretically sound step size rules include:
- Constant Step Size: $\alpha_k = \alpha$ for all $k$. Simple, but converges only to within a neighborhood of the optimum.
- Diminishing Step Sizes: Step sizes that satisfy the conditions $\alpha_k \ge 0$, $\lim_{k \to \infty} \alpha_k = 0$, and $\sum_{k=0}^{\infty} \alpha_k = \infty$. A typical choice is $\alpha_k = a/\sqrt{k+1}$. These rules guarantee asymptotic convergence to the optimum.
- Square Summable but Not Summable: A stricter diminishing rule satisfying $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$ and $\sum_{k=0}^{\infty} \alpha_k = \infty$, such as $\alpha_k = a/(b + k)$.
Unlike in gradient descent, you cannot use line searches to choose $\alpha_k$ adaptively, as the function may not decrease along $-g^{(k)}$.
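The iteration and best-iterate tracking above fit in a few lines. Here is a minimal sketch, with a diminishing step rule $\alpha_k = 1/\sqrt{k+1}$, a toy objective $f(x) = |x - 3|$, and the function/parameter names chosen for illustration:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, n_iters=500):
    """Basic subgradient method with diminishing steps alpha_k = 1/sqrt(k+1).

    Tracks the best point seen so far, since individual steps need not descend.
    """
    x = x0
    x_best, f_best = x0, f(x0)
    for k in range(n_iters):
        alpha = 1.0 / np.sqrt(k + 1)   # diminishing, nonsummable step rule
        x = x - alpha * subgrad(x)     # step along (minus) any subgradient
        if f(x) < f_best:
            x_best, f_best = x, f(x)
    return x_best, f_best

# Minimize f(x) = |x - 3|; sign(x - 3) is a valid subgradient everywhere
# (at the kink, sign gives 0, which lies in the subdifferential [-1, 1]).
x_best, f_best = subgradient_method(lambda x: abs(x - 3.0),
                                    lambda x: np.sign(x - 3.0),
                                    x0=10.0)
assert abs(x_best - 3.0) < 0.5
```

Note that the iterate oscillates around the kink with amplitude roughly $\alpha_k$, which is why only the best iterate, not the last one, is returned.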
Convergence Analysis and Rates
The convergence theory for the subgradient method is more nuanced than for gradient descent. A standard result states that for a convex function $f$ with minimum value $f^\star$, using a diminishing step size rule, we have $f^{(k)}_{\text{best}} \to f^\star$ as $k \to \infty$.
More informatively, we can analyze the rate. A fundamental inequality drives the analysis:

$$\|x^{(k+1)} - x^\star\|_2^2 \le \|x^{(k)} - x^\star\|_2^2 - 2\alpha_k \left(f(x^{(k)}) - f^\star\right) + \alpha_k^2 \|g^{(k)}\|_2^2.$$

Assuming the subgradients are bounded, $\|g^{(k)}\|_2 \le G$, this inequality can be manipulated to show an $O(1/\sqrt{k})$ convergence rate for the best function value. Specifically, with a step size rule $\alpha_k = R/(G\sqrt{k})$, where $R \ge \|x^{(0)} - x^\star\|_2$, you can achieve:

$$f^{(k)}_{\text{best}} - f^\star \le O\!\left(\frac{RG}{\sqrt{k}}\right).$$
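The manipulation can be spelled out in two lines. Telescoping the basic inequality over iterations $0, \dots, k-1$, using $\|g^{(i)}\|_2 \le G$ and $R = \|x^{(0)} - x^\star\|_2$, and lower-bounding each $f(x^{(i)}) - f^\star$ by the best value gives:

```latex
\begin{aligned}
0 \le \|x^{(k)} - x^\star\|_2^2
  &\le R^2 - 2\sum_{i=0}^{k-1}\alpha_i\,\bigl(f(x^{(i)}) - f^\star\bigr)
        + G^2 \sum_{i=0}^{k-1}\alpha_i^2 \\
\Longrightarrow \quad
f^{(k)}_{\text{best}} - f^\star
  &\le \frac{R^2 + G^2 \sum_{i=0}^{k-1}\alpha_i^2}{2\sum_{i=0}^{k-1}\alpha_i}.
\end{aligned}
```

Minimizing the right-hand side over constant step sizes yields $\alpha = R/(G\sqrt{k})$ and the bound $RG/\sqrt{k}$.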
This is markedly slower than the $O(1/k)$ rate for gradient descent on smooth convex functions or the $O(1/k^2)$ rate with acceleration. This is the price paid for nonsmoothness. The subgradient method is robust but inherently slow.
Advanced Methods: Bundle Techniques
To improve upon the slow convergence of the basic subgradient method, more sophisticated bundle methods were developed. These methods leverage memory—a "bundle" of past subgradients—to build a better local model of the function.
The core idea is to use past iterates and their subgradients to construct a cutting-plane model of $f$:

$$\hat{f}_k(y) = \max_{i \in I_k} \left\{ f(x^{(i)}) + g^{(i)\top}\bigl(y - x^{(i)}\bigr) \right\},$$

where $I_k$ is an index set of past iterates. This model is a piecewise-linear, convex function that underestimates $f$. Bundle methods then minimize this model (plus a stabilizing quadratic term to keep the next iterate close to the current best point) to compute a candidate point. If the candidate point yields significant descent, a "serious step" is taken. Otherwise, a "null step" is taken, but the subgradient information from the candidate point is added to the bundle, enriching the model.
This use of history allows bundle methods to detect the structure of the nonsmooth function, often leading to faster practical convergence and even finite termination for piecewise linear problems, a significant advantage over the basic subgradient method.
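The cutting-plane model itself is simple to build. This sketch (bundle entries and the grid search are illustrative; real bundle methods minimize the stabilized model with a QP solver) constructs the model for $f(x) = |x|$ from three evaluated points and shows it both minorizes $f$ and locates its minimizer:

```python
import numpy as np

# Each bundle entry (x_i, f_i, g_i) contributes the affine minorant
# f_i + g_i * (y - x_i); the cutting-plane model is their pointwise maximum.
bundle = [(-2.0, 2.0, -1.0),   # at x = -2: f = 2, subgradient -1
          ( 1.5, 1.5,  1.0),   # at x = 1.5: f = 1.5, subgradient +1
          ( 0.0, 0.0,  0.0)]   # at the kink: 0 is a valid subgradient

def model(y):
    return max(f_i + g_i * (y - x_i) for x_i, f_i, g_i in bundle)

ys = np.linspace(-3.0, 3.0, 601)
# The model never exceeds f, since it is built from supporting hyperplanes...
assert all(model(y) <= abs(y) + 1e-12 for y in ys)
# ...and minimizing it (here by grid search) recovers the minimizer of |x|.
y_star = min(ys, key=model)
assert abs(y_star) < 1e-6
```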
Application to Machine Learning with Non-differentiable Losses
Subgradient methods are directly applicable to many core machine learning problems. Consider L1-regularized empirical risk minimization, such as the Lasso:

$$\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 + \lambda \|w\|_1.$$
The regularizer $\lambda \|w\|_1$ is nonsmooth at any $w_j = 0$. Its subgradient for component $j$ is $\lambda s_j$, where $s_j = \operatorname{sign}(w_j)$ if $w_j \neq 0$, and $s_j \in [-1, 1]$ if $w_j = 0$. A subgradient method can be applied directly to the entire objective. Writing $L$ for the smooth data-fitting term, the update for component $j$ becomes:

$$w_j^{(k+1)} = w_j^{(k)} - \alpha_k \left( \nabla_j L(w^{(k)}) + \lambda s_j \right),$$

with $s_j = 0$ a common choice at $w_j^{(k)} = 0$. This can be interpreted as a form of soft-thresholding when combined with specific step sizes.
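The component update above can be sketched on a small synthetic Lasso problem (data dimensions, step sizes, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

# Synthetic Lasso problem: y = X @ w_true with a sparse w_true.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ w_true
lam = 0.1

def lasso_obj(w):
    return 0.5 * np.mean((X @ w - y) ** 2) + lam * np.abs(w).sum()

def lasso_subgradient(w):
    grad_smooth = X.T @ (X @ w - y) / n   # gradient of the quadratic term
    s = np.sign(w)                        # subgradient choice: s_j = 0 at w_j = 0
    return grad_smooth + lam * s

w = np.zeros(d)
best = lasso_obj(w)
for k in range(2000):
    w = w - 0.5 / np.sqrt(k + 1) * lasso_subgradient(w)
    best = min(best, lasso_obj(w))

assert best < lasso_obj(np.zeros(d))      # improved on the starting point
```

Note that `np.sign` implements exactly the stated subgradient choice: it returns $\pm 1$ off zero and $0$ at zero.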
Similarly, training a linear support vector machine (SVM) with the hinge loss involves a nonsmooth objective. The subgradient of the hinge loss with respect to the model parameters is straightforward to compute, making the subgradient method a simple, though not always the fastest, baseline solver.
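A hinge-loss subgradient is equally short to write down. In this sketch (the synthetic data and step schedule are illustrative choices), each margin-violating example contributes $-y_i x_i / n$, and the zero choice is taken at the kink where the margin equals exactly one:

```python
import numpy as np

# Subgradient of the averaged hinge loss L(w) = mean(max(0, 1 - y_i * w.x_i)).
def hinge_subgradient(w, X, y):
    active = y * (X @ w) < 1.0            # examples with positive loss
    return -(X[active] * y[active, None]).sum(axis=0) / len(y)

def hinge_loss(w, X, y):
    return np.maximum(0.0, 1.0 - y * (X @ w)).mean()

# Synthetic, roughly separable two-class data.
rng = np.random.default_rng(1)
n = 100
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * np.array([2.0, 0.5]) + 0.3 * rng.normal(size=(n, 2))

w = np.zeros(2)
for k in range(200):
    w = w - hinge_subgradient(w, X, y) / np.sqrt(k + 1)

assert hinge_loss(w, X, y) < hinge_loss(np.zeros(2), X, y)  # loss at 0 is 1.0
```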
Common Pitfalls
- Expecting Monotonic Descent: The most frequent mistake is expecting $f(x^{(k+1)}) < f(x^{(k)})$ on every iteration. This is not guaranteed. You must track the best iterate separately to monitor convergence.
- Using Inappropriate Step Sizes: Applying a line search or an adaptive step-size rule designed for smooth functions to a raw subgradient method often fails, as the direction is not a descent direction. Stick to theoretically justified rules such as diminishing step sizes, or switch to a method designed for nonsmooth problems.
- Misidentifying the Subdifferential: For complex functions, correctly computing an element $g \in \partial f(x)$ is essential. A common error is to use a gradient formula at a point of non-differentiability. Always refer to the definition: $g$ is valid only if it satisfies the supporting hyperplane inequality $f(y) \ge f(x) + g^\top (y - x)$ for all $y$.
- Overlooking Better Alternatives: While the subgradient method is universally applicable, it is often slow. For specific problems like L1 regularization, more efficient proximal gradient methods exist, which handle the nonsmooth part explicitly and elegantly. Always check if a specialized solver is available for your problem structure.
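The first pitfall is easy to reproduce. In this toy run (the starting point and constant step are chosen to make the effect visible), $f(x) = |x|$ with a constant step of 0.4 hops across the kink and the function value rises on every other step, while the running best value stays monotone by construction:

```python
import numpy as np

# Constant-step subgradient iteration on f(x) = |x|, started at x = 0.5.
x, alpha = 0.5, 0.4
values = []
for _ in range(6):
    x = x - alpha * np.sign(x)
    values.append(abs(x))

# values is approximately [0.1, 0.3, 0.1, 0.3, 0.1, 0.3]: f increases on
# every other iteration, so monitoring f(x_k) alone is misleading.
assert values[1] > values[0]                      # a genuine increase in f
running_best = [min(values[:i + 1]) for i in range(len(values))]
assert all(b <= a for a, b in zip(running_best, running_best[1:]))  # monotone
```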
Summary
- Subgradients generalize gradients to nonsmooth convex functions, and the subdifferential is the set of all subgradients at a point. The condition $0 \in \partial f(x^\star)$ characterizes a global minimum.
- The subgradient method iteratively updates parameters using any subgradient and a pre-defined step size rule. Crucially, it does not guarantee descent at each step, requiring you to track the best iterate found.
- Convergence is robust but slow, with a theoretical rate of $O(1/\sqrt{k})$ for diminishing step sizes, which is a fundamental consequence of nonsmoothness.
- Bundle methods improve performance by maintaining a history of subgradients to build an approximating cutting-plane model, allowing for more intelligent step choices.
- These methods are directly applicable to key machine learning problems involving non-differentiable components like the hinge loss for SVMs and L1 regularization for sparse model induction.