Expectation-Maximization Algorithm
Finding patterns in data is straightforward when you can see everything. But what happens when crucial pieces are missing, or when the data you observe is generated by underlying, hidden processes you cannot directly measure? The Expectation-Maximization (EM) algorithm is a powerful iterative method designed precisely for this challenge: finding maximum likelihood estimates for model parameters when your data has missing values or latent variables. It is the engine behind training models for clustering, sequence prediction, and countless other tasks where the true data-generating story is partially obscured.
The Incomplete Data Problem and EM Intuition
Many statistical models become tractable if we could only observe some hidden, or latent, variable $Z$. For instance, in a population with several height distributions, knowing each person's genetic subgroup (the latent variable) would make estimating each subgroup's average height simple. Without it, the problem is hard. We have observed data $X$, but the complete data would be $(X, Z)$. The goal is to estimate parameters $\theta$, such as means and variances, that maximize the likelihood $p(X \mid \theta)$.
Directly maximizing $p(X \mid \theta)$ is often impossible because it requires summing or integrating over all possible values of $Z$: $p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)$. The EM algorithm cleverly sidesteps this intractable calculation through an iterative two-step process, guaranteed to climb the likelihood hill. Think of it as a sophisticated treasure hunt where you first guess the treasure's location (E-step) and then dig there (M-step), repeatedly updating your guess based on what you find.
Derivation of the General EM Algorithm
The algorithm's beauty lies in optimizing a surrogate function easier to handle than the true likelihood. We start from the log-likelihood of the observed data, $\log p(X \mid \theta)$. Using the law of total probability and introducing a distribution $q(Z)$ over the latent variable, we can write:

$$\log p(X \mid \theta) = \log \sum_{Z} q(Z) \frac{p(X, Z \mid \theta)}{q(Z)}$$
Jensen's inequality tells us that for a concave function like $\log$, the log of the expectation is greater than or equal to the expectation of the log. Applying it here gives us a lower bound on our log-likelihood, known as the Evidence Lower BOund (ELBO):

$$\log p(X \mid \theta) \geq \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} =: \mathcal{L}(q, \theta)$$
The EM algorithm proceeds by alternately maximizing this lower bound with respect to $q$ (E-step) and $\theta$ (M-step).
- Expectation Step (E-step): Given the current parameter estimate $\theta^{(t)}$, we make the lower bound as tight as possible. This is achieved by setting $q(Z) = p(Z \mid X, \theta^{(t)})$, the posterior distribution of the latent variables given the data and current parameters. This step computes the expected value of the complete-data log-likelihood, $Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\left[\log p(X, Z \mid \theta)\right]$.
- Maximization Step (M-step): We now maximize the $Q$ function computed in the E-step with respect to $\theta$ to find a new estimate: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})$.
By choosing $q(Z) = p(Z \mid X, \theta^{(t)})$ in the E-step, the inequality becomes an equality at $\theta = \theta^{(t)}$, and the M-step is guaranteed to find a $\theta^{(t+1)}$ that increases (or at worst leaves unchanged) the true log-likelihood $\log p(X \mid \theta)$.
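The two abstract steps can be made concrete with the classic two-coin mixture, an illustrative toy not drawn from the text above: each trial records the number of heads in $n$ flips of one of two coins with unknown biases, and which coin was flipped is the latent variable. All function and variable names below are hypothetical; this is a minimal sketch, not a reference implementation.

```python
import numpy as np

def coin_em(heads, n, theta_init=(0.6, 0.5), iters=50):
    """EM for a mixture of two biased coins, assuming a uniform prior over coins."""
    theta_a, theta_b = theta_init
    for _ in range(iters):
        # E-step: posterior responsibility that each trial used coin A
        like_a = theta_a ** heads * (1 - theta_a) ** (n - heads)
        like_b = theta_b ** heads * (1 - theta_b) ** (n - heads)
        r_a = like_a / (like_a + like_b)
        # M-step: responsibility-weighted fraction of heads for each coin
        theta_a = np.sum(r_a * heads) / np.sum(r_a * n)
        theta_b = np.sum((1 - r_a) * heads) / np.sum((1 - r_a) * n)
    return theta_a, theta_b

heads = np.array([9, 8, 7, 2, 1, 2])  # heads out of n=10 flips per trial
theta_a, theta_b = coin_em(heads, n=10)
```

Starting from a nearly symmetric guess, the responsibilities quickly become near-deterministic and the two biases separate toward the empirical head rates of the two trial groups.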
Convergence to a Local Maximum
The convergence proof of EM follows directly from the construction of the $Q$ function and Jensen's inequality. Writing $\mathcal{L}(q, \theta)$ for the ELBO and $q^{(t)}(Z) = p(Z \mid X, \theta^{(t)})$ for the E-step posterior, after an M-step we have:

$$\log p(X \mid \theta^{(t+1)}) \;\geq\; \mathcal{L}(q^{(t)}, \theta^{(t+1)}) \;\geq\; \mathcal{L}(q^{(t)}, \theta^{(t)}) \;=\; \log p(X \mid \theta^{(t)})$$
The first inequality holds because the ELBO is a lower bound for any $\theta$. The second inequality is due to the maximization in the M-step. The final equality holds because, as noted, when $q(Z) = p(Z \mid X, \theta^{(t)})$, the lower bound touches the true log-likelihood at $\theta = \theta^{(t)}$. Therefore, each EM iteration never decreases the observed-data log-likelihood. This monotonic increase guarantees convergence to a stationary point (typically a local maximum) of the likelihood function, barring pathological cases.
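The monotonicity guarantee is easy to check numerically. The toy below (my own illustrative example, not from the text) estimates the mean of a unit-variance Gaussian when some observations are missing: the E-step imputes each missing value with its expectation under the current mean, the M-step re-averages, and the observed-data log-likelihood never decreases across iterations.

```python
import numpy as np

def em_missing_mean(x_obs, n_missing, mu=10.0, iters=25):
    """EM for the mean of a unit-variance Gaussian with n_missing unobserved values."""
    n = len(x_obs) + n_missing
    lls = []
    for _ in range(iters):
        # Observed-data log-likelihood at the current estimate
        lls.append(-0.5 * np.sum((x_obs - mu) ** 2 + np.log(2 * np.pi)))
        # E-step: each missing value has expectation mu under the current model.
        # M-step: the new mean averages observed values and imputed ones.
        mu = (x_obs.sum() + n_missing * mu) / n
    return mu, lls

x_obs = np.array([1.2, 0.8, 1.1, 0.9])
mu, lls = em_missing_mean(x_obs, n_missing=4)
```

Even from a deliberately bad starting point, the log-likelihood trace is non-decreasing and the estimate converges to the observed-data maximum likelihood solution (the mean of the observed values).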
Application 1: Gaussian Mixture Models (GMMs)
A canonical application of EM is learning the parameters of a Gaussian Mixture Model, a soft clustering algorithm. Here, the latent variable $z_i$ for each data point $x_i$ is a one-hot vector indicating which of the $K$ Gaussian components generated it. The parameters are the mixture weights $\pi_k$, means $\mu_k$, and covariance matrices $\Sigma_k$.
- E-step: Compute the responsibility $\gamma_{ik}$, which is the posterior probability that component $k$ generated point $x_i$: $\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$. This is a "soft" cluster assignment.
- M-step: Update the parameters using weighted statistics based on the responsibilities. The new mean $\mu_k = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$ is the responsibility-weighted average of all points. The new mixture weight $\pi_k = \frac{1}{N} \sum_i \gamma_{ik}$ is the average responsibility for component $k$. These updates are intuitive: they assign each component parameters based on the data points it is probabilistically responsible for.
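These E- and M-step updates can be sketched directly in NumPy for the one-dimensional case. The function and variable names are hypothetical, and a production implementation would add log-domain densities and a convergence check; this is only a sketch of the updates described above.

```python
import numpy as np

def gmm_em(x, k, iters=100):
    """Fit a 1-D Gaussian mixture by EM (illustrative sketch)."""
    n = len(x)
    pi = np.full(k, 1.0 / k)                           # uniform initial weights
    mu = np.quantile(x, (np.arange(k) + 1) / (k + 1))  # spread-out initial means
    var = np.full(k, np.var(x))                        # broad initial variances
    for _ in range(iters):
        # E-step: responsibilities gamma[i, j] = p(component j | x_i)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted weights, means, and variances
        nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / n
    return pi, mu, var

# Two well-separated clusters around -3 and +3
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
pi, mu, var = gmm_em(x, k=2)
```

Initializing the means at spread-out quantiles of the data is a cheap way to avoid placing two components on the same cluster; k-means initialization is the more common choice in practice.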
Application 2: Hidden Markov Model Training
For Hidden Markov Models (HMMs), the observed data is a sequence of emissions $x_1, \dots, x_T$, and the latent data is the sequence of hidden states $z_1, \dots, z_T$. The parameters are the transition matrix, emission probabilities, and initial state distribution. The EM algorithm applied here is known as the Baum-Welch algorithm.
- E-step: Using the forward-backward algorithm, compute two key posterior distributions for all states and times: the probability of being in a state at time $t$ (the gamma variable $\gamma_t(i)$) and the probability of transitioning between two states at consecutive times (the xi variable $\xi_t(i, j)$). This step efficiently calculates the expected state occupancies and transition counts.
- M-step: Update the HMM parameters using these expected counts. The new transition probability from state $i$ to $j$ is the expected number of $i \to j$ transitions divided by the total expected number of transitions out of $i$. Emission probabilities are updated similarly. This trains the HMM to best explain the observed sequence.
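A single Baum-Welch iteration for a discrete-emission HMM can be sketched as follows. The names are hypothetical, and the alpha/beta recursions here are unscaled, so this toy is only safe for short sequences; real implementations rescale per time step or work in the log domain.

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One Baum-Welch (EM) update for a discrete-emission HMM.
    A: (S, S) transitions, B: (S, M) emissions, pi: (S,) initial distribution."""
    T, S = len(obs), A.shape[0]
    # Forward pass: alpha[t, i] = p(o_1..o_t, state_t = i)
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # Backward pass: beta[t, i] = p(o_{t+1}..o_T | state_t = i)
    beta = np.ones((T, S))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    evidence = alpha[-1].sum()  # p(o_1..o_T)
    # E-step: gamma[t, i] = state occupancy, xi[t, i, j] = transition posterior
    gamma = alpha * beta / evidence
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / evidence
    # M-step: re-estimate parameters from expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for m in range(B.shape[1]):
        new_B[:, m] = gamma[obs == m].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi

obs = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # a short observation sequence
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # initial transition guess
B = np.array([[0.9, 0.1], [0.2, 0.8]])     # initial emission guess
pi = np.array([0.5, 0.5])
A2, B2, pi2 = baum_welch_step(obs, A, B, pi)
```

Iterating this step until the evidence stops improving is the full Baum-Welch training loop; the re-estimated rows remain valid probability distributions by construction.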
Common Pitfalls
- Convergence to Local Maxima: EM is guaranteed to converge, but only to a local maximum of the likelihood. The final solution is highly sensitive to the initialization of parameters. In practice, you must run the algorithm multiple times with different random seeds and choose the solution with the highest final likelihood.
- Slow Convergence in Plateaus: The algorithm's convergence rate can be painfully slow, especially when the likelihood function has flat regions or long "ridges." Near the optimum, progress can become glacial. Techniques like accelerated EM or simply setting a sensible convergence tolerance (e.g., a minimal change in log-likelihood) are necessary.
- Numerical Stability with Responsibilities: In the E-step for models like GMMs, computing responsibilities involves evaluating Gaussian densities, which can underflow to zero. This is especially true in high dimensions or with many components. Always compute responsibilities in the log domain (using log probabilities) and use the log-sum-exp trick for stable normalization.
- Misidentifying the Latent Structure: EM will find parameters that maximize likelihood for the model you specify. If you choose an inappropriate number of components (e.g., $K = 2$ for data with 4 true clusters) or an incorrect model family, the results will be misleading. Model selection (e.g., using the Bayesian Information Criterion) is a separate, critical step.
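The log-domain computation recommended above might look like this for a 1-D GMM. This is a sketch with hypothetical names, using the log-sum-exp trick to normalize responsibilities without ever exponentiating an unshifted log-density.

```python
import numpy as np

def log_responsibilities(x, pi, mu, var):
    """Stable responsibilities gamma[i, k] for a 1-D GMM, computed in the log domain."""
    # log pi_k + log N(x_i | mu_k, var_k), shape (n, k)
    log_joint = np.log(pi) - 0.5 * ((x[:, None] - mu) ** 2 / var
                                    + np.log(2 * np.pi * var))
    # log-sum-exp: subtract each row's max before exponentiating
    m = log_joint.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
    return np.exp(log_joint - log_norm)

# A point hundreds of standard deviations from both means: naive densities
# underflow to zero (0/0 = NaN), but the log-domain version stays well defined.
r = log_responsibilities(np.array([500.0]), np.array([0.5, 0.5]),
                         np.array([0.0, 1.0]), np.array([1.0, 1.0]))
```

Subtracting the row maximum guarantees the largest exponent is exactly zero, so at least one term survives exponentiation and the normalizer is never zero.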
Summary
- The Expectation-Maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models with latent variables or missing data.
- It operates by alternating an E-step, which computes a lower bound (the expected complete-data log-likelihood, or $Q$ function) using the current posterior of latent variables, and an M-step, which maximizes this bound to update the parameters.
- The algorithm is monotonically increasing in the observed-data likelihood, guaranteeing convergence to a local maximum, though careful initialization is required to avoid poor local optima.
- Its applications are foundational, including unsupervised clustering via Gaussian Mixture Models and sequence modeling through the training of Hidden Markov Models (the Baum-Welch algorithm).
- Practical implementation requires attention to initialization strategies, numerical stability (e.g., working in the log domain), and awareness of its sometimes slow convergence rate.