Mar 2

Bayesian Model Comparison

Mindli Team

AI-Generated Content

Choosing the right model is one of the most consequential decisions in statistical analysis. In the Bayesian paradigm, model comparison moves beyond simple goodness-of-fit to a formal evaluation of predictive performance and evidence, balancing a model's ability to explain observed data with its inherent complexity. Core tools such as Bayes factors, WAIC, and LOO-CV are used to compare models using posterior samples, and strategies like model averaging and stacking explore ways to combine models for better predictions.

Core Concepts in Bayesian Comparison

The foundation of Bayesian model comparison is the marginal likelihood, sometimes called the model evidence. For a model M_k with parameters θ_k, it is the probability of the observed data y under that model, computed by averaging the likelihood over the prior: p(y | M_k) = ∫ p(y | θ_k, M_k) p(θ_k | M_k) dθ_k. It represents how well the model, as a whole, predicts the data it was fit to, with a built-in penalty for complexity (overly flexible models spread their predictive probability thinly, resulting in a lower average).
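
The complexity penalty can be seen in a brute-force Monte Carlo sketch: average the likelihood over draws from the prior. Everything below is a hypothetical toy setup (data y ~ Normal(mu, 1), prior mu ~ Normal(0, prior_sd)), not a method you would use for real models.

```python
import math
import random

random.seed(0)

# Hypothetical toy data, roughly centered near 0.9.
y = [0.8, 1.1, 0.4, 1.3, 0.9]

def log_likelihood(mu, data):
    # Sum of Normal(mu, 1) log densities over the data.
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2
               for yi in data)

def log_evidence(data, prior_sd, n_draws=50_000):
    # Monte Carlo estimate of log p(y | M): average the likelihood
    # over prior draws, with log-sum-exp for numerical stability.
    logs = [log_likelihood(random.gauss(0.0, prior_sd), data)
            for _ in range(n_draws)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs) / n_draws)

# A tight prior that covers the data yields higher evidence than a
# very wide prior, which wastes prior mass on poor-likelihood regions.
print(log_evidence(y, prior_sd=1.0))
print(log_evidence(y, prior_sd=100.0))  # noticeably lower
```

The second call shows the "lower average" in action: the same likelihood is diluted over a far wider prior, so the evidence drops by roughly log of the widening factor.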

The Bayes factor is the classical tool for comparing two models, M_1 and M_2. It is the ratio of their marginal likelihoods: BF_12 = p(y | M_1) / p(y | M_2). A Bayes factor greater than 1 favors M_1; less than 1 favors M_2. Interpretation often uses heuristic scales (e.g., 3-20: positive evidence; >150: very strong evidence). However, computing the marginal likelihood can be challenging for complex models, and the result is highly sensitive to diffuse priors. A very wide, uninformative prior can artificially penalize a model by placing substantial prior mass in regions with poor likelihood, making the Bayes factor difficult to calculate and interpret in high-dimensional settings.
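
The prior sensitivity is easy to demonstrate in a conjugate toy comparison (assumed here, not from the text above): M_1 fixes mu = 0, while M_2 puts mu ~ Normal(0, tau^2) on Normal(mu, 1) data, so the log Bayes factor has a closed form and we can watch it react as the prior widens.

```python
import math

def log_bf_21(ybar, n, tau):
    # Closed-form log BF of M2 (mu ~ Normal(0, tau^2)) versus
    # M1 (mu = 0), for n Normal(mu, 1) observations with mean ybar.
    a = n + 1.0 / tau**2
    return -0.5 * math.log(1.0 + n * tau**2) + (n * ybar) ** 2 / (2.0 * a)

ybar, n = 0.5, 20  # hypothetical sample mean and size
for tau in [1.0, 10.0, 1000.0]:
    print(tau, log_bf_21(ybar, n, tau))
# Widening the prior steadily drives the Bayes factor toward the
# point-null model, even though the data never change.
```

This is the classic diffuse-prior pathology: the -0.5·log(1 + n·tau^2) term grows without bound as tau increases, so an "uninformative" prior quietly decides the comparison.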

Modern approaches often bypass the marginal likelihood in favor of out-of-sample predictive accuracy, estimated from posterior samples. The key quantity is the log pointwise predictive density (lppd). For observed data points y_1, …, y_n, it is computed using the posterior distribution of the parameters: lppd = Σ_{i=1}^{n} log( (1/S) Σ_{s=1}^{S} p(y_i | θ^(s)) ), where θ^(1), …, θ^(S) are posterior draws. The lppd measures how likely the observed data is under the model, using the posterior. However, it is an in-sample measure and tends to be overly optimistic about a model's predictive skill.
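
A minimal sketch of the lppd computation, assuming a Normal(mu, 1) likelihood and posterior draws of mu (faked here with random numbers; in practice they come from your sampler). Note that the average is over the density, not the log density, hence the log-sum-exp.

```python
import math
import random

random.seed(1)

y = [0.8, 1.1, 0.4, 1.3, 0.9]                      # hypothetical data
draws = [random.gauss(0.9, 0.3) for _ in range(4000)]  # stand-in posterior draws

def log_norm_pdf(x, mu):
    # Normal(mu, 1) log density.
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

def lppd(data, posterior_draws):
    total = 0.0
    s = len(posterior_draws)
    for yi in data:
        # Average the *density* over draws, then take the log
        # (log-sum-exp keeps this numerically stable).
        logs = [log_norm_pdf(yi, mu) for mu in posterior_draws]
        m = max(logs)
        total += m + math.log(sum(math.exp(l - m) for l in logs) / s)
    return total

print(lppd(y, draws))
```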

Information Criteria: WAIC and LOO-CV

To correct for overfitting, we use information criteria that penalize model complexity. The Widely Applicable Information Criterion (WAIC) is a fully Bayesian criterion. It is computed as WAIC = −2 (lppd − p_WAIC). Here, p_WAIC = Σ_i Var_s[ log p(y_i | θ^(s)) ] is an estimate of the effective number of parameters, quantifying model flexibility. A lower WAIC indicates better expected out-of-sample predictive performance. WAIC is advantageous because it uses the full posterior and is computationally straightforward from posterior samples.
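
Both terms come directly from the matrix of pointwise log densities, as this sketch shows (same assumed Normal(mu, 1) toy model, with faked posterior draws standing in for sampler output):

```python
import math
import random

random.seed(2)

y = [0.8, 1.1, 0.4, 1.3, 0.9]                      # hypothetical data
draws = [random.gauss(0.9, 0.3) for _ in range(4000)]  # stand-in posterior draws

def log_norm_pdf(x, mu):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

def waic(data, posterior_draws):
    s = len(posterior_draws)
    lppd_total, p_waic = 0.0, 0.0
    for yi in data:
        logs = [log_norm_pdf(yi, mu) for mu in posterior_draws]
        # lppd term: log of the mean density across draws.
        m = max(logs)
        lppd_total += m + math.log(sum(math.exp(l - m) for l in logs) / s)
        # Penalty term: posterior variance of the log density.
        mean_log = sum(logs) / s
        p_waic += sum((l - mean_log) ** 2 for l in logs) / (s - 1)
    return -2.0 * (lppd_total - p_waic)

print(waic(y, draws))  # lower is better
```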

Leave-One-Out Cross-Validation (LOO-CV) is arguably the gold standard for estimating predictive accuracy. The Bayesian LOO estimate of the out-of-sample lppd is lppd_loo = Σ_i log p(y_i | y_{-i}), where p(y_i | y_{-i}) is the posterior predictive distribution given all data except point i. Computing this directly requires fitting the model n times, which is prohibitive. Fortunately, efficient approximations like Pareto-smoothed importance sampling (PSIS-LOO) allow us to compute LOO-CV using only a single set of posterior draws from the full model. Like WAIC, we often report the LOO Information Criterion (LOOIC) as LOOIC = −2 · lppd_loo for comparison with other criteria.
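
PSIS itself is involved, but the importance-sampling idea underneath it fits in a few lines. The sketch below (same assumed Normal(mu, 1) toy model, faked posterior draws) computes the plain IS-LOO estimate, where the weights 1/p(y_i | θ^(s)) collapse into a per-point harmonic mean; PSIS additionally stabilizes the largest weights with a Pareto-tail fit.

```python
import math
import random

random.seed(3)

y = [0.8, 1.1, 0.4, 1.3, 0.9]                      # hypothetical data
draws = [random.gauss(0.9, 0.3) for _ in range(4000)]  # stand-in posterior draws

def log_norm_pdf(x, mu):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2

def loo_lppd(data, posterior_draws):
    s = len(posterior_draws)
    total = 0.0
    for yi in data:
        logs = [log_norm_pdf(yi, mu) for mu in posterior_draws]
        # IS estimate of p(yi | y_-i) with weights 1/p(yi | theta):
        # the harmonic mean of the pointwise densities, computed
        # via log-sum-exp over the negated log densities.
        m = max(-l for l in logs)
        log_harmonic = math.log(s) - (m + math.log(
            sum(math.exp(-l - m) for l in logs)))
        total += log_harmonic
    return total

elpd_loo = loo_lppd(y, draws)
print(elpd_loo, -2.0 * elpd_loo)  # LOO lppd and the corresponding LOOIC
```

In practice you would not hand-roll this: the raw harmonic-mean weights have heavy tails, which is exactly the instability PSIS was designed to fix.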

Combining Models: Averaging and Stacking

Instead of selecting a single "best" model, we can often make better predictions by combining several. Bayesian Model Averaging (BMA) is a formal method for this. If we have a set of candidate models with prior model probabilities, BMA combines their predictions by weighting each model's posterior predictive distribution by its posterior model probability (which is derived from the marginal likelihood and prior). BMA is optimal when one of the candidate models is the true data-generating process, but it can perform poorly when the "true model" is not in the set or when models make very similar predictions.
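
As a sketch with hypothetical numbers (the log evidences and predictive densities below are made up for illustration): posterior model probabilities come from each model's log evidence and prior probability via a normalized softmax, and the BMA prediction is the probability-weighted mixture of the per-model predictive densities.

```python
import math

# Hypothetical log marginal likelihoods and equal prior model probabilities.
log_evidence = {"M1": -12.3, "M2": -14.1}
prior_prob = {"M1": 0.5, "M2": 0.5}

# Posterior model probabilities via a numerically stable softmax.
shift = max(log_evidence[k] + math.log(prior_prob[k]) for k in log_evidence)
unnorm = {k: math.exp(log_evidence[k] + math.log(prior_prob[k]) - shift)
          for k in log_evidence}
z = sum(unnorm.values())
post_prob = {k: v / z for k, v in unnorm.items()}

# Hypothetical predictive densities p(y_new | y, M_k) at a new point.
pred_density = {"M1": 0.42, "M2": 0.15}

# BMA predictive density: the posterior-probability-weighted mixture.
bma_density = sum(post_prob[k] * pred_density[k] for k in post_prob)
print(post_prob, bma_density)
```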

A more robust alternative for pure predictive performance is Bayesian stacking. Stacking finds the optimal linear combination of predictive distributions from different models to maximize the leave-one-out predictive density. The resulting stacking weights are chosen to minimize the divergence between the combined prediction and the true (unknown) data-generating process. Unlike BMA weights, stacking weights are not probabilities and can be zero for models that don't improve the ensemble. Stacking is particularly useful when models are complementary, each capturing different aspects of the data.
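
A two-model sketch of the stacking objective, using hypothetical per-point LOO predictive densities: pick the weight on model 1 that maximizes the summed log of the mixed LOO densities. A crude grid search suffices here; real implementations solve this convex problem over the simplex with a proper optimizer.

```python
import math

# Hypothetical LOO predictive densities p(y_i | y_-i, M_k). Model 1 fits
# most points well; model 2 rescues point 3 — a complementary pair.
loo_dens_m1 = [0.40, 0.35, 0.05, 0.30, 0.38]
loo_dens_m2 = [0.10, 0.12, 0.45, 0.11, 0.09]

def stacking_objective(w):
    # Summed log LOO density of the mixture w*M1 + (1-w)*M2.
    return sum(math.log(w * a + (1 - w) * b)
               for a, b in zip(loo_dens_m1, loo_dens_m2))

# Grid search over the interior of [0, 1].
best_w = max((i / 1000 for i in range(1, 1000)), key=stacking_objective)
print(best_w, stacking_objective(best_w))
```

Because the models are complementary, the optimal weight lands strictly inside (0, 1): the ensemble beats either model alone on the LOO criterion.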

Common Pitfalls

  1. Over-reliance on a Single Number: No single criterion (BF, WAIC, LOO) is infallible. They answer slightly different questions: Bayes factors assess relative evidence, while WAIC and LOO estimate predictive accuracy. Use multiple criteria, check for consistency, and always base final decisions on the scientific context and model checking (e.g., posterior predictive checks).
  2. Ignoring Prior Sensitivity in Bayes Factors: A common and critical mistake is computing a Bayes factor using default, overly diffuse priors without conducting a sensitivity analysis. The result can be meaningless or wildly misleading. Always check how the Bayes factor changes with reasonable, substantive prior choices.
  3. Misinterpreting WAIC/LOOIC Differences: Small differences (e.g., <2-5 points) in WAIC or LOOIC between models are generally not meaningful. Focus on larger differences and the associated standard errors of these estimates. Furthermore, these criteria assume the likelihood factorizes into pointwise contributions from conditionally independent observations, so applying them to structured data such as time series or spatial models requires care.
  4. Forgetting the Goal: Prediction vs. Explanation: If your goal is causal explanation or parameter inference, selecting a single best-fitting model may be necessary. If your goal is purely the most accurate prediction, model averaging via stacking is almost always superior. Don't let a comparison tool dictate an inappropriate objective.

Summary

  • Bayes factors provide a direct measure of relative evidence between two models based on marginal likelihood, but they are computationally challenging and highly sensitive to prior choice, especially with diffuse priors.
  • WAIC and LOO-CV estimate out-of-sample predictive accuracy directly from posterior samples. WAIC is efficient to compute, while LOO-CV (via PSIS) is a more robust approximation to cross-validation and is generally preferred.
  • The log pointwise predictive density (lppd) is the core in-sample measure, which information criteria like WAIC adjust with a penalty for effective parameters to estimate out-of-sample performance.
  • Instead of choosing one model, Bayesian model averaging weights models by their posterior probabilities, while stacking finds optimal weights to maximize predictive performance, often yielding more robust combined predictions.
  • Always use model comparison tools as guides, not arbiters. Validate models with posterior predictive checks and base final decisions on your research goals.
