Mar 11

Linear Discriminant Analysis for Classification

Mindli Team

AI-Generated Content


Linear Discriminant Analysis (LDA) is a cornerstone of statistical pattern recognition, providing a powerful and elegant framework for supervised classification. At its core, LDA finds the optimal linear boundaries between classes by leveraging assumptions about the underlying data distribution. Understanding LDA not only gives you a robust classification tool but also deepens your insight into dimensionality reduction, Bayes' optimal decisions, and the fundamental trade-offs in model complexity.

Foundational Assumptions: Gaussian Classes with Shared Covariance

LDA is built upon a specific probabilistic model. It assumes that the data for each class is generated from a multivariate Gaussian (normal) distribution. This means the probability density function for a feature vector x given class k is:

f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right)

where p is the number of features.

The critical assumption here is that all classes share the same covariance matrix \Sigma. Think of it as assuming each class forms an identically shaped and oriented "cloud" in feature space, but each cloud is centered at its own class mean \mu_k. This assumption of homoscedasticity (equal covariance) is what leads to linear decision boundaries.

To classify a new point x, LDA applies Bayes' theorem. It computes the posterior probability P(y = k \mid x)—the probability of x belonging to class k given the observed features. The classifier then assigns x to the class with the highest posterior probability. After applying the Gaussian density and simplifying with logarithms, the discriminant function for class k becomes:

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k

where \pi_k is the prior probability of class k. You classify x by calculating \delta_k(x) for all classes and choosing the k that yields the largest value. The function \delta_k(x) is linear in x, hence the name Linear Discriminant Analysis. The decision boundary between any two classes is found where their discriminant functions are equal, which forms a hyperplane.
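The whole procedure—estimate class means, a pooled covariance, and priors, then take the argmax of \delta_k(x)—can be sketched in a few lines of NumPy. The data here is a hypothetical two-class toy set used only for illustration:

```python
import numpy as np

# Hypothetical toy data: two Gaussian classes with a shared unit covariance.
rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X1 = rng.normal([2.0, 2.0], 1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Estimate class means, priors, and the pooled (shared) covariance matrix.
classes = np.unique(y)
means = np.array([X[y == k].mean(axis=0) for k in classes])
priors = np.array([np.mean(y == k) for k in classes])
pooled = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k]) for k in classes)
pooled = pooled / (len(X) - len(classes))
Sigma_inv = np.linalg.inv(pooled)

def discriminants(x):
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    return np.array([
        x @ Sigma_inv @ means[k]
        - 0.5 * means[k] @ Sigma_inv @ means[k]
        + np.log(priors[k])
        for k in classes
    ])

# Classify a new point by taking the argmax over the class discriminants.
pred = int(np.argmax(discriminants(np.array([1.8, 2.1]))))
```

A point near (2, 2) lands in class 1, as expected from the estimated means.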

Fisher's Discriminant: The Geometric Perspective

An equivalent and highly intuitive view of LDA is provided by Fisher's Linear Discriminant. This approach frames the problem as a projection. Instead of modeling distributions, Fisher asked: "What one-dimensional projection (i.e., line) of the data maximizes the separation between the classes?"

He defined separation as the ratio of the between-class variance to the within-class variance for the projected data. You want the class means to be as far apart as possible (large between-class variance) while keeping the data from each class as tight as possible (small within-class variance). This ratio is called Fisher's criterion.

Mathematically, for a two-class problem, we seek a projection vector w that maximizes:

J(w) = \frac{\left( w^T (\mu_1 - \mu_2) \right)^2}{w^T S_W w}

where S_W is the within-class scatter matrix.

The solution for the optimal w is proportional to S_W^{-1} (\mu_1 - \mu_2). For K classes, Fisher's method finds up to K - 1 projection directions that maximize class separation in a lower-dimensional subspace. This projection is consistent with the classification rule derived from the Gaussian assumption. You project the data onto these directions and then perform classification in this reduced space, which is excellent for visualization and combating the curse of dimensionality.
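The two-class solution is short enough to compute directly. This sketch (on hypothetical synthetic data) forms the within-class scatter S_W, solves for w \propto S_W^{-1}(\mu_1 - \mu_2), and measures how well separated the projected class means are relative to the projected spread:

```python
import numpy as np

# Hypothetical two-class data with class means 3 apart in x and 1 apart in y.
rng = np.random.default_rng(1)
X0 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))
X1 = rng.normal([3.0, 1.0], 1.0, size=(100, 2))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: sum of the per-class scatter matrices.
S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# Optimal Fisher direction: w proportional to S_W^{-1} (mu1 - mu0).
w = np.linalg.solve(S_W, mu1 - mu0)
w = w / np.linalg.norm(w)

# Project both classes onto w and compute a separation score:
# distance between projected means, in units of the projected spread.
z0, z1 = X0 @ w, X1 @ w
separation = abs(z1.mean() - z0.mean()) / np.sqrt(0.5 * (z0.var() + z1.var()))
```

Because this direction maximizes Fisher's criterion, the projected means end up several standard deviations apart, which is what makes the 1-D projection useful for classification and visualization.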

Extension to Quadratic Discriminant Analysis (QDA)

What if the assumption of a shared covariance matrix is too restrictive? In many real-world datasets, one class might be more spread out than another. Quadratic Discriminant Analysis (QDA) relaxes this key LDA assumption. It models each class with its own multivariate Gaussian distribution, characterized by a unique mean \mu_k and a unique covariance matrix \Sigma_k.

Following the same Bayes' rule procedure, the discriminant function for QDA becomes:

\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k

Notice the function is not linear in x, because x appears inside a quadratic form with a class-specific \Sigma_k^{-1}. This results in quadratic decision boundaries, which are more flexible than the hyperplanes from LDA. QDA can model more complex, curved shapes separating the classes. The trade-off is increased model complexity: QDA must estimate many more parameters (a separate covariance matrix for each class), which requires more data to avoid high variance in the estimates.
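A quick way to see why the class-specific covariance matters: take two hypothetical classes with the same mean but very different spreads. LDA's shared-covariance discriminants cannot tell them apart, but the QDA discriminant above can, because the \log|\Sigma_k| and quadratic terms differ per class:

```python
import numpy as np

# Hypothetical classes sharing a mean at the origin but with different spreads:
# class 0 is tight (std 0.5), class 1 is wide (std 2.0).
rng = np.random.default_rng(2)
X0 = rng.normal([0.0, 0.0], 0.5, size=(200, 2))
X1 = rng.normal([0.0, 0.0], 2.0, size=(200, 2))

def qda_discriminant(x, mu, Sigma, prior):
    # delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k
    d = x - mu
    return (-0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d @ np.linalg.solve(Sigma, d)
            + np.log(prior))

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0, S1 = np.cov(X0.T), np.cov(X1.T)

# A point far from the origin: only the wide class plausibly generated it.
x = np.array([4.0, 4.0])
d0 = qda_discriminant(x, mu0, S0, 0.5)
d1 = qda_discriminant(x, mu1, S1, 0.5)
```

Here d1 > d0, so QDA assigns the distant point to the wide class, while LDA (equal means, pooled covariance) would have no basis for distinguishing the two classes at all.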

LDA in Comparison with Logistic Regression

Both LDA and logistic regression are linear classification methods, but they originate from different philosophical approaches. Logistic regression models the posterior probabilities P(y = k \mid x) directly using the logistic (or softmax) function. It makes no explicit assumptions about the distribution of the features x. It is a discriminative model because it focuses on modeling the boundary between classes.

LDA, in contrast, is a generative model. It models the full joint distribution P(x, y) by specifying the class-conditional distributions P(x \mid y = k) and the class priors \pi_k. Classification is performed by applying Bayes' rule to these modeled distributions.

This distinction leads to practical differences:

  • When Assumptions Hold: When the Gaussian class-conditional assumption with shared covariance is approximately true, LDA tends to be more statistically efficient than logistic regression, especially with small sample sizes. It uses the data more effectively because it's based on a "correct" model of the data generation process.
  • Robustness to Model Misspecification: Logistic regression is more robust. If the features are clearly non-Gaussian (e.g., binary features), logistic regression will typically outperform LDA because it doesn't rely on those distributional assumptions.
  • Stability with Few Data Points: With well-separated classes or very few data points, the parameters in logistic regression can become unstable (tending toward infinity), while LDA's parameter estimates remain stable.

In practice, the performance of the two methods is often very similar. The choice can come down to the nature of your data and your comfort with the underlying assumptions.
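The "often very similar" claim is easy to check empirically. This sketch (assuming scikit-learn is available, on hypothetical Gaussian data where LDA's assumptions hold exactly) fits both classifiers and compares held-out accuracy:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical Gaussian classes with a shared covariance: LDA's model is
# exactly correct here, so both linear methods should do comparably well.
rng = np.random.default_rng(3)
X0 = rng.normal([0.0, 0.0], 1.0, size=(300, 2))
X1 = rng.normal([2.0, 2.0], 1.0, size=(300, 2))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

acc_lda = LinearDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)
acc_logreg = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
```

On data like this the two accuracies typically differ by only a few percentage points; the interesting divergences appear when the Gaussian assumption is badly violated or when classes are perfectly separable.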

Common Pitfalls

  1. Assuming Linearity is Always Sufficient: Applying LDA to data where classes have vastly different covariance structures or are separated by a highly non-linear boundary will lead to poor performance. Always visualize your data or use techniques like cross-validation to compare LDA with QDA or non-linear classifiers. If a simple quadratic boundary (QDA) performs significantly better, your data violates LDA's core assumption.
  2. Ignoring the Prior Probabilities (\pi_k): The default in many software implementations is to estimate priors from the class proportions in the training set. However, if your operational environment has a different class prevalence (e.g., diagnosing a rare disease), you must explicitly set the priors to reflect the true base rates. Failing to do so will bias predictions toward the more common class in your training data, which may not be appropriate.
  3. Using LDA with Categorical or Sparse Features: LDA's Gaussian assumption is poorly suited for binary, count, or extremely high-dimensional sparse data (like text bag-of-words). The estimates for the covariance matrix become meaningless or singular. In these cases, models like logistic regression, Naive Bayes (with appropriate distributions), or regularized classifiers are far more appropriate.
  4. Confusing Classification LDA with Dimensionality Reduction LDA: It's important to distinguish between LDA as a classifier (which outputs class labels) and Fisher's LDA as a supervised dimensionality reduction technique (which outputs projections). They are derived from the same math, but the latter is often used as a preprocessing step for visualization or before applying another classifier.
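Pitfall 2 is worth seeing concretely. In scikit-learn, LinearDiscriminantAnalysis accepts a priors argument; the sketch below uses a hypothetical rare-disease scenario where the training set is balanced 50/50 but the deployment prevalence is assumed to be 1%:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical 1-D screening feature: healthy ~ N(0, 1), disease ~ N(2, 1).
# The training sample is artificially balanced (500 of each class).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 1)),
               rng.normal(2.0, 1.0, size=(500, 1))])
y = np.repeat([0, 1], 500)

# Default: priors estimated from training proportions (0.5 / 0.5).
default_lda = LinearDiscriminantAnalysis().fit(X, y)

# Corrected: priors set to the assumed deployment base rates (99% / 1%).
rare_lda = LinearDiscriminantAnalysis(priors=[0.99, 0.01]).fit(X, y)

# A borderline measurement: flagged as disease under balanced priors,
# but not once the 1% prevalence shifts the decision boundary.
x_new = np.array([[1.2]])
pred_default = int(default_lda.predict(x_new)[0])
pred_rare = int(rare_lda.predict(x_new)[0])
```

The log \pi_k term in the discriminant is all that changes, but it translates the decision boundary far enough to flip borderline predictions, which is exactly the behavior you want when positives are genuinely rare.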

Summary

  • LDA is a generative, probabilistic classifier that assumes each class follows a Gaussian distribution with a class-specific mean but a common covariance matrix across all classes. This leads to optimal linear decision boundaries.
  • Fisher's Discriminant provides a geometric interpretation, finding projections that maximize between-class separation relative to within-class spread. This is identical to the classification LDA solution and is a powerful tool for supervised dimensionality reduction.
  • Quadratic Discriminant Analysis (QDA) relaxes the equal-covariance assumption, allowing each class its own covariance matrix. This results in more flexible quadratic decision boundaries at the cost of requiring more data to reliably estimate more parameters.
  • Compared to logistic regression, LDA is often more efficient if its Gaussian assumptions are met, especially with limited data. Logistic regression is more robust when these assumptions are violated, as it makes no distributional claims about the features.
  • Successful application requires diagnosing your data's structure. Use visualization and model comparison to check if the linearity and homoscedasticity assumptions are reasonable before choosing LDA as your classifier.
