LDA for Dimensionality Reduction
In the world of machine learning, having too many features can be as paralyzing as having too few; it leads to the curse of dimensionality, where model performance degrades and computational cost soars. While unsupervised methods like PCA reduce dimensions, they ignore a crucial piece of information: class labels. Linear Discriminant Analysis (LDA) is a powerful supervised dimensionality reduction technique that finds the axes maximizing the separation between multiple classes. By projecting data onto these new, discriminative axes, LDA preserves class separability, often leading to more effective and efficient classifiers.
The Intuition Behind LDA: Maximizing Separation
At its core, LDA seeks a lower-dimensional projection of your data that maximizes the distance between the means of different classes while minimizing the spread (variance) within each class. Imagine you have data points from two classes plotted on a graph. An effective projection would be a new line (axis) where, when you project all points onto it, the clusters for each class are as far apart from each other as possible and each cluster is as tight as possible. This intuitive goal is formalized by Fisher's criterion.
For a projection vector $\mathbf{w}$, Fisher defined the objective as maximizing the ratio of the between-class scatter to the within-class scatter of the projected data. In simpler terms, we want to maximize the distance between the projected class means (between-class) while minimizing the variance within each projected class (within-class). The mathematical formulation of this ratio is what drives the LDA algorithm to find the optimal projection.
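This intuition can be made concrete on a toy two-class problem. The sketch below (synthetic data, hypothetical `fisher_ratio` helper) computes Fisher's ratio for a candidate direction: the squared gap between projected class means divided by the summed projected variances. A direction aligned with the mean difference scores far higher than one roughly orthogonal to it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 2-D classes with the same spread but different means
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(100, 2))

def fisher_ratio(w, X1, X2):
    """Between-class separation over within-class spread along direction w."""
    w = w / np.linalg.norm(w)
    p1, p2 = X1 @ w, X2 @ w            # 1-D projections of each class
    between = (p1.mean() - p2.mean()) ** 2
    within = p1.var() + p2.var()
    return between / within

# A direction aligned with the mean difference separates far better
good = fisher_ratio(X2.mean(0) - X1.mean(0), X1, X2)
bad = fisher_ratio(np.array([-1.0, 2.0]), X1, X2)  # roughly orthogonal to it
print(good > bad)  # the discriminative axis yields a higher Fisher ratio
```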
The Mathematical Engine: Scatter Matrices and Eigenvectors
The execution of LDA's intuition relies on constructing two key matrices from the data. First, the within-class scatter matrix $S_W$ measures the spread of data within each class. It is calculated as the sum of the (unnormalized) scatter matrices of each class. A small $S_W$ indicates that samples of the same class are close to their class mean.

$$S_W = \sum_{c=1}^{C} \sum_{\mathbf{x} \in \mathcal{D}_c} (\mathbf{x} - \boldsymbol{\mu}_c)(\mathbf{x} - \boldsymbol{\mu}_c)^T$$

where $C$ is the number of classes, $\mathcal{D}_c$ is the set of samples in class $c$, and $\boldsymbol{\mu}_c$ is the mean vector of class $c$.
Second, the between-class scatter matrix $S_B$ measures the separation between the means of different classes. It accumulates the weighted outer product of the difference between each class mean and the overall mean.

$$S_B = \sum_{c=1}^{C} N_c (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^T$$

Here, $N_c$ is the number of samples in class $c$, and $\boldsymbol{\mu}$ is the overall mean vector of all data.
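The two definitions above translate directly into NumPy. This sketch (hypothetical `scatter_matrices` helper, not from the original text) builds $S_W$ and $S_B$ exactly as the formulas state; a useful sanity check is that they sum to the total scatter matrix of the data.

```python
import numpy as np

def scatter_matrices(X, y):
    """Compute within-class (S_W) and between-class (S_B) scatter matrices."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = Xc - mu_c
        S_W += diff.T @ diff                         # spread within class c
        mean_diff = (mu_c - overall_mean).reshape(-1, 1)
        S_B += len(Xc) * (mean_diff @ mean_diff.T)   # weighted mean separation
    return S_W, S_B
```

A well-known identity, $S_W + S_B = S_T$ (the total scatter), makes a convenient unit test for this decomposition.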
LDA's objective is to find a projection matrix $W$ whose columns maximize Fisher's criterion:

$$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$$

The solution to this optimization problem is found by solving a generalized eigenvalue problem: $S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w}$. The eigenvectors corresponding to the largest eigenvalues become the new axes of our discriminative subspace. Crucially, because $S_B$ is built from only $C$ class means, it has rank at most $C - 1$, so for $C$ classes LDA can find at most $C - 1$ meaningful projection directions.
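A minimal end-to-end sketch of this eigendecomposition, on synthetic data (all names and the three-class setup are illustrative assumptions): build the scatter matrices, solve $S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w}$, and keep the eigenvectors with the largest eigenvalues. With $C = 3$ classes, only $C - 1 = 2$ eigenvalues are non-negligible.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three 4-D Gaussian classes -> at most C - 1 = 2 discriminative directions
means = [np.zeros(4), np.array([3., 0., 0., 0.]), np.array([0., 3., 0., 0.])]
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in means])
y = np.repeat([0, 1, 2], 50)

d = X.shape[1]
mu = X.mean(axis=0)
S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
for c in range(3):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    S_W += (Xc - mu_c).T @ (Xc - mu_c)
    diff = (mu_c - mu)[:, None]
    S_B += len(Xc) * diff @ diff.T

# Generalized eigenvalue problem: S_W^{-1} S_B w = lambda w
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:2]]       # keep the top C - 1 = 2 directions
X_lda = X @ W                        # data projected into the LDA subspace

print(np.sum(eigvals.real > 1e-8))   # only C - 1 eigenvalues are non-negligible
```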
LDA vs. PCA: Supervised vs. Unsupervised Goals
A fundamental understanding requires contrasting LDA with its well-known unsupervised cousin, Principal Component Analysis (PCA). While both are linear transformation techniques, their objectives are fundamentally different.
| Aspect | Linear Discriminant Analysis (LDA) | Principal Component Analysis (PCA) |
|---|---|---|
| Goal | Maximize class separability (supervised). | Maximize variance retained (unsupervised). |
| Information Used | Uses feature data and class labels. | Uses only feature data. |
| Focus | Finds directions of maximum between-class vs. within-class scatter. | Finds directions of maximum overall variance. |
| Output Axes | At most $C - 1$ components for $C$ classes. | Up to $d$ components for $d$ original features. |
| Result | Axes are discriminative and useful for classification. | Axes are orthogonal and capture global data shape. |
Consider a dataset where the direction of maximum overall variance (PCA's first component) runs along the shared elongation of two overlapping classes rather than between them. PCA would choose this direction, despite it being poor for separation. LDA would ignore this high-variance but non-discriminative direction and find a different axis that cleanly separates the class means, even if the spread along that axis is smaller.
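This scenario is easy to reproduce with scikit-learn. The synthetic data below (an illustrative construction, not from the original text) has two elongated classes whose large shared variance runs along $x$ while the class separation lies along $y$: PCA's first component captures the former, LDA's the latter.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# Two elongated classes: large shared variance along x, separation along y
n = 200
X1 = np.column_stack([rng.normal(0, 5, n), rng.normal(0, 0.3, n)])
X2 = np.column_stack([rng.normal(0, 5, n), rng.normal(2, 0.3, n)])
X = np.vstack([X1, X2])
y = np.repeat([0, 1], n)

pca_proj = PCA(n_components=1).fit_transform(X)                   # ignores labels
lda_proj = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

def separation(z, y):
    """Gap between class means relative to overall spread, on a 1-D projection."""
    z = z.ravel()
    return abs(z[y == 0].mean() - z[y == 1].mean()) / z.std()

print(separation(pca_proj, y) < separation(lda_proj, y))  # LDA separates better
```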
Limitations and Assumptions of LDA
LDA is a powerful tool, but it operates under specific assumptions that limit its applicability. Understanding these limitations is key to using it effectively.
First, LDA assumes that each class has a Gaussian (normal) distribution. It performs best when the data for each class is roughly bell-shaped. Second, it assumes all classes share a common covariance matrix. This means the shape and orientation of the spread of data points should be similar across different classes. When these assumptions hold, LDA is optimal. When they are violated, its performance can degrade.
A major practical limitation is that LDA is a linear method. It can only find straight-line boundaries and projections. It struggles with non-linear data where classes are separated by curved boundaries or are nested within one another. For such problems, kernelized versions (Kernel LDA) or other non-linear methods may be necessary. Furthermore, LDA suffers in very high-dimensional settings where the number of features far exceeds the number of samples, as the within-class scatter matrix becomes singular and non-invertible, requiring regularization techniques.
Practical Applications: From Faces to Text
Despite its linearity, LDA has proven highly effective in numerous real-world domains. In face recognition, techniques like Fisherfaces are built upon LDA. Here, each person is a class. The algorithm projects facial images onto a subspace that maximizes the difference between individuals while minimizing the difference between different images of the same person, making recognition more robust to lighting and expression changes.
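A Fisherfaces-style pipeline is commonly sketched as PCA followed by LDA: PCA first compresses the raw pixels (and removes the null space that would make $S_W$ singular), then LDA finds the discriminative axes. The example below uses scikit-learn's bundled digits dataset as a small stand-in for face images; the component counts are illustrative choices, not prescribed values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

# 10 classes of 8x8 images (64 pixel features) standing in for face images
X, y = load_digits(return_X_y=True)

fisher = make_pipeline(
    PCA(n_components=40),                        # compress pixels, drop null space
    LinearDiscriminantAnalysis(n_components=9),  # at most C - 1 = 9 axes
)
Z = fisher.fit_transform(X, y)
print(Z.shape)  # (1797, 9)
```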
In text classification and natural language processing, LDA is frequently used for supervised topic modeling and dimensionality reduction. Documents are represented as high-dimensional vectors of word counts (e.g., TF-IDF). LDA can project these documents onto a lower-dimensional space where documents about the same topic (e.g., "sports," "politics") are clustered together, improving the efficiency and accuracy of classifiers like logistic regression or SVMs.
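A toy version of the text workflow, assuming scikit-learn's `TfidfVectorizer` and a hand-made six-document corpus (the documents and labels are invented for illustration). Note that scikit-learn's LDA needs a dense matrix, which is fine here but costly for large vocabularies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Tiny illustrative corpus: two topics, "sports" and "politics"
docs = [
    "the team won the match with a late goal",
    "the coach praised the players after the game",
    "fans cheered as the striker scored again",
    "parliament passed the new budget bill today",
    "the senator debated the proposed election law",
    "voters questioned the minister about policy",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = sports, 1 = politics

# LDA requires a dense matrix; acceptable for a toy corpus
X = TfidfVectorizer().fit_transform(docs).toarray()

# Two classes -> at most one discriminative component
lda = LinearDiscriminantAnalysis(n_components=1)
Z = lda.fit_transform(X, labels)
print(Z.shape)  # (6, 1)
```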
Common Pitfalls
- Applying LDA to Non-Gaussian or Heteroscedastic Data: Using LDA when classes have wildly different covariance structures or are non-normally distributed is a recipe for poor performance. Correction: Always visualize your data first. Use Q-Q plots to check for normality. Consider quadratic discriminant analysis (QDA) if covariances differ, or move to non-parametric classifiers like random forests if assumptions are severely violated.
- Expecting More Than $C - 1$ Components: A user trying to reduce 1000 features to 100 dimensions using LDA for a 3-class problem will be disappointed. Correction: Remember the hard limit: for $C$ classes, you get at most $C - 1$ discriminative components (only 2 in this case). If you need more dimensions, supplement LDA's discriminative components with additional PCA components, or use a different method altogether.
- Using LDA for Regression or Unsupervised Tasks: LDA is fundamentally a supervised technique for classification. Correction: Do not use it if you lack class labels. For unsupervised dimensionality reduction, use PCA or t-SNE. For regression tasks with a continuous target, look at methods like Partial Least Squares Regression (PLSR).
- Ignoring the Singular Matrix Problem in High Dimensions: With datasets where features >> samples (e.g., genomics, text), the matrix $S_W$ is singular, and the standard eigenvalue decomposition fails. Correction: Apply regularization (e.g., add a small multiple of the identity matrix to $S_W$, a technique called Regularized LDA) or first use PCA for initial dimensionality reduction before applying LDA.
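The last pitfall has a direct remedy in scikit-learn: the `eigen` solver with `shrinkage="auto"` blends the within-class covariance toward a multiple of the identity (Ledoit-Wolf estimation), making it invertible even when features outnumber samples. The data below is a synthetic illustration of that regime.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# More features (50) than samples per class (20): S_W alone is singular
X = rng.normal(size=(40, 50))
X[20:, :5] += 2.0                 # shift class 1 along the first 5 features
y = np.repeat([0, 1], 20)

# Shrinkage regularizes the within-class covariance so it can be inverted;
# it requires the 'lsqr' or 'eigen' solver ('eigen' supports transform)
lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto", n_components=1)
Z = lda.fit_transform(X, y)
print(Z.shape)  # (40, 1)
```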
Summary
- LDA is a supervised dimensionality reduction technique designed to find feature projections that maximize separation between predefined classes, formalized by maximizing Fisher's criterion.
- It operates by mathematically constructing and optimizing the ratio of the between-class scatter matrix (separation of means) to the within-class scatter matrix (internal variance).
- Unlike the unsupervised PCA, which seeks directions of maximum variance, LDA explicitly uses class labels to find discriminative directions, resulting in at most $C - 1$ components for $C$ classes.
- Its main limitations include assumptions of Gaussian distributions and equal covariance across classes, and an inherent linearity that fails with complex, non-linear data boundaries.
- LDA remains highly effective in applied fields like face recognition (Fisherfaces) and text classification, where its ability to preserve class structure in a lower-dimensional space enhances computational efficiency and classifier performance.