Propensity Score Matching

When you cannot randomly assign participants to treatment and control groups, as is the case in most observational studies in economics, medicine, and the social sciences, comparing outcomes between groups is hazardous. The primary threat is selection bias, where systematic differences between the groups, rather than the treatment itself, explain any observed effect. Propensity score matching (PSM) is a statistical method that tries to approximate the conditions of a randomized experiment by constructing a comparison group that is balanced on observed covariates, thereby reducing this bias and providing more credible estimates of causal effects.

The Core Idea: From Many Confounders to a Single Score

At its heart, PSM is a dimensionality reduction technique. In an observational study, individuals in the treatment group (e.g., those who received a drug, enrolled in a job training program) often differ from those in the control group across many confounding variables (e.g., age, income, disease severity). Matching directly on all these variables simultaneously is often impractical—a problem known as the "curse of dimensionality."

The genius of PSM, introduced by Paul Rosenbaum and Donald Rubin in 1983, is to collapse these many confounders into a single metric: the propensity score. Formally, the propensity score $e(X)$ is the conditional probability of receiving the treatment given the observed pre-treatment covariates $X$:

$e(X_{i}) = P(T_{i} = 1 \mid X_{i})$

Here, $T_{i} = 1$ indicates treatment assignment for individual $i$. The core theoretical result is that, conditional on the propensity score, the distribution of observed covariates $X$ is independent of treatment assignment. This is known as the balancing property. In simpler terms, treated and untreated individuals with the same or very similar propensity scores are, on average, comparable in all their observed pre-treatment characteristics. Any remaining difference in their outcomes can therefore be more plausibly attributed to the treatment.
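
In symbols, the balancing property can be written as follows. The $\perp$ notation for independence is an assumption of this note rather than something the text above defines:

$X_{i} \perp T_{i} \mid e(X_{i})$

Equivalently, among individuals who share a propensity score, the covariate distribution is the same in both groups:

$P(X_{i} \mid T_{i} = 1, e(X_{i})) = P(X_{i} \mid T_{i} = 0, e(X_{i}))$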

Estimating the Propensity Score

In practice, the true propensity score is unknown and must be estimated. The most common method is logistic regression, where the treatment indicator (1/0) is the dependent variable, and all relevant observed pre-treatment covariates are independent variables. The predicted probabilities from this model become the estimated propensity scores for each individual in your sample.

Model specification is critical. You must include all variables that influence both the treatment assignment and the outcome. Omitting a key confounder will leave residual bias. However, you should not include variables that are themselves affected by the treatment (post-treatment variables), as this will bias your effect estimate. The study outcome does not appear anywhere in this model; it is purely a model of selection into treatment.
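
As a rough illustration of this estimation step, the sketch below fits a logistic regression of treatment on two covariates by plain gradient ascent and returns the fitted probabilities as propensity scores. The simulated data, the function names (`fit_propensity_model`, `propensity_scores`), and the learning-rate settings are all assumptions made for this example; in practice you would normally rely on a statistics library rather than a hand-written fitter.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_predictor(beta, x):
    """Intercept plus the dot product of coefficients and covariates."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def fit_propensity_model(X, T, lr=1.0, n_iter=300):
    """Fit a logistic regression of treatment T (0/1) on covariate rows X.

    Uses gradient ascent on the mean log-likelihood and returns
    [intercept, b1, ..., bk]. The outcome is deliberately not an input.
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for x, t in zip(X, T):
            err = t - sigmoid(linear_predictor(beta, x))
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def propensity_scores(X, beta):
    """Predicted probability of treatment for every row of X."""
    return [sigmoid(linear_predictor(beta, x)) for x in X]


# Simulated (hypothetical) data: treatment is more likely when x1 and x2 are high.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(300)]
T = [1 if random.random() < sigmoid(-0.5 + 1.0 * x[0] + 0.5 * x[1]) else 0
     for x in X]

beta = fit_propensity_model(X, T)
scores = propensity_scores(X, beta)
```

Each entry of `scores` is the estimated probability that the corresponding individual is treated, which is what the matching step consumes.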

Assessing Balance: The Proof is in the Comparison

Estimating scores is just the first step. The crucial test of whether PSM is successful is whether it achieves balance in the covariates between the treated and matched control groups. After matching, you must systematically check covariate balance. Common methods include:

Standardized Mean Differences (SMD): For each covariate, calculate the difference in means between groups divided by a pooled standard deviation. An SMD below 0.1 (10%) after matching is generally considered good balance.
Variance Ratios: The ratio of variances for each covariate in the treatment vs. control group should be close to 1 after matching.
Visual Inspection: Overlapping histograms or kernel density plots of the propensity score distribution for both groups show whether the common support condition (overlap) is met.

If balance is not achieved, you may need to re-specify your propensity score model by adding interaction or polynomial terms before proceeding to estimate treatment effects.
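
The SMD and variance-ratio checks described in this section can be sketched in a few lines. The pooled standard deviation here is the square root of the average of the two group variances, which is one common convention and an assumption of this example; the age values are made up for illustration.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """Difference in means divided by a pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Ratio of sample variances, treated over control."""
    return statistics.variance(treated) / statistics.variance(control)


# Hypothetical covariate (age) for the treated group, all controls,
# and the subset of controls selected by matching.
age_treated = [52, 47, 60, 55, 49, 58]
age_control_all = [35, 41, 38, 50, 44, 30, 47, 39]
age_control_matched = [52, 46, 60, 55, 50, 57]

smd_before = standardized_mean_difference(age_treated, age_control_all)
smd_after = standardized_mean_difference(age_treated, age_control_matched)
vr_after = variance_ratio(age_treated, age_control_matched)

print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
print(f"Variance ratio after matching: {vr_after:.2f}")
```

In a real analysis you would compute these statistics for every covariate in the propensity score model and tabulate them before and after matching.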

Matching Algorithms and Estimating the Treatment Effect

Once you have estimated propensity scores, you match each treated individual to one or more untreated individuals with a similar score, and then check balance on the resulting matched sample. Different algorithms handle the matching step:

Nearest Neighbor Matching: Pairs each treated individual with the untreated individual whose score is closest. You can match with or without replacement.
Caliper Matching: Imposes a tolerance level on the maximum score distance allowed for a match (e.g., 0.2 standard deviations of the logit of the propensity score). This avoids poor matches.
Stratification (Subclassification): Divides the sample into strata (e.g., quintiles) based on the propensity score and calculates the treatment effect within each stratum before averaging.
Kernel Matching: Uses a weighted average of all control participants, with weights that shrink as a control's score moves farther from the treated individual's score, as determined by a kernel function and a bandwidth.
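
To make the first two algorithms concrete, here is a minimal sketch of greedy 1:1 nearest-neighbor matching without replacement, with a caliper of 0.2 standard deviations of the logit of the propensity score. The function name `caliper_nn_match`, the example scores, and the choice to compute the standard deviation over the whole sample are assumptions of this sketch. Greedy matching also depends on the order in which treated units are processed, which a production implementation would handle more carefully.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def caliper_nn_match(scores, treated, caliper_sd=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    scores:  estimated propensity scores, each in (0, 1)
    treated: 0/1 treatment indicators, same length as scores
    Returns a list of (treated_index, control_index) pairs. Treated units
    with no control inside the caliper are left unmatched.
    """
    logits = [logit(p) for p in scores]
    caliper = caliper_sd * statistics.stdev(logits)
    available = [i for i, t in enumerate(treated) if t == 0]
    pairs = []
    for i, t in enumerate(treated):
        if t != 1 or not available:
            continue
        # Closest remaining control on the logit scale.
        j = min(available, key=lambda c: abs(logits[c] - logits[i]))
        if abs(logits[j] - logits[i]) <= caliper:
            pairs.append((i, j))
            available.remove(j)  # without replacement
    return pairs


# Hypothetical scores: indices 0-3 are treated, 4-8 are controls.
scores = [0.95, 0.80, 0.62, 0.35, 0.78, 0.60, 0.30, 0.10, 0.36]
treated = [1, 1, 1, 1, 0, 0, 0, 0, 0]

pairs = caliper_nn_match(scores, treated)
print(pairs)
```

In this example the treated unit with score 0.95 has no control within the caliper and is dropped, which illustrates how a caliper trades sample size for match quality.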

After creating the matched sample, the Average Treatment Effect on the Treated (ATT) is estimated simply as the mean difference in outcomes between the treated and their matched controls. This is the effect for those who actually received the treatment. Standard errors must account for the matching process. Bootstrapping is often used for this, although the standard bootstrap is known to be unreliable for nearest-neighbor matching with replacement.
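
The ATT calculation on a matched sample can be sketched as follows. The `matches` mapping (treated index to a list of matched control indices), the outcome values, and the function name `att_from_matches` are hypothetical. The sketch returns only a point estimate and makes no attempt at a standard error.

```python
import statistics


def att_from_matches(matches, outcome):
    """Mean over treated units of (own outcome - mean outcome of matched controls).

    matches: dict mapping each matched treated index to a non-empty list
             of control indices
    outcome: list of observed outcomes, indexed like the original sample
    """
    diffs = [
        outcome[t] - statistics.mean(outcome[c] for c in controls)
        for t, controls in matches.items()
    ]
    return statistics.mean(diffs)


# Hypothetical matched sample: treated units 0, 1, 2; controls 3 to 6.
# Treated unit 1 is matched to two controls (1:2 matching).
matches = {0: [3], 1: [4, 5], 2: [6]}
outcome = [12.0, 9.5, 11.0, 10.0, 8.0, 9.0, 9.0]

att = att_from_matches(matches, outcome)
print(f"Estimated ATT: {att:.3f}")
```

With 1:1 matching every list holds a single control and the estimate reduces to the mean of the pairwise outcome differences.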

Reporting, Sensitivity Analysis, and Critical Assumptions

A robust PSM analysis does not end with the ATT estimate. You must transparently report the matching process, show balance tables for key covariates before and after matching, and, most importantly, conduct sensitivity analyses.

PSM rests on two critical assumptions. The first is ignorability (or unconfoundedness), which states that all relevant confounders are observed and included in the propensity score model. The second is overlap (common support), meaning every individual has a probability of treatment strictly between 0 and 1, that is, $0 < e(X) < 1$, so that comparable treated and untreated individuals exist. The ignorability assumption is untestable. Sensitivity analysis probes how strong an unmeasured confounder would need to be to explain away your estimated effect, providing a measure of the finding's robustness.
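
Overlap, unlike ignorability, can be inspected in the data. One simple convention, assumed here for illustration, is the min-max rule: keep only individuals whose estimated score lies between the larger of the two group minimums and the smaller of the two group maximums. The function names and scores below are hypothetical.

```python
def common_support_bounds(scores, treated):
    """Return (lower, upper) bounds of the score range shared by both groups."""
    t_scores = [s for s, t in zip(scores, treated) if t == 1]
    c_scores = [s for s, t in zip(scores, treated) if t == 0]
    lower = max(min(t_scores), min(c_scores))
    upper = min(max(t_scores), max(c_scores))
    return lower, upper


def restrict_to_common_support(scores, treated):
    """Indices of individuals whose score falls inside the shared range."""
    lower, upper = common_support_bounds(scores, treated)
    return [i for i, s in enumerate(scores) if lower <= s <= upper]


# Hypothetical scores: indices 0-3 are treated, 4-8 are controls.
scores = [0.95, 0.80, 0.62, 0.35, 0.78, 0.60, 0.30, 0.10, 0.36]
treated = [1, 1, 1, 1, 0, 0, 0, 0, 0]

bounds = common_support_bounds(scores, treated)
keep = restrict_to_common_support(scores, treated)
print(bounds, keep)
```

Dropping treated individuals outside the shared range changes the population the ATT describes, so the number of excluded units is worth reporting alongside the estimate.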

Common Pitfalls

Matching on the Propensity Score Alone and Calling it a Day: Failing to check and report covariate balance is a cardinal sin. The propensity score is a means to achieve balance, not an end in itself. Always present pre-match and post-match balance statistics.
Incorrect Propensity Score Model Specification: Including post-treatment variables or excluding important confounders invalidates the analysis. Use subject-matter knowledge to build the model, not stepwise automatic selection.
Ignoring the Common Support Region: Attempting to estimate effects for treated individuals who have no comparable controls in the data (i.e., outside the region of overlap) leads to extrapolation and biased estimates. Always visualize and restrict analysis to the region of common support.
Misinterpreting the Result as Proof of Causality: PSM reduces bias from observed confounders. It is not a magic bullet for causality. Your conclusion is only as good as the data and theory that went into the model. Always clearly state the assumption of no unmeasured confounding as a major limitation.

Summary

Propensity score matching reduces selection bias in observational studies by matching treated and control units on their probability of receiving treatment, creating a more apples-to-apples comparison.
The propensity score is typically estimated via logistic regression using all relevant pre-treatment confounders, and its success is validated by checking covariate balance (e.g., Standardized Mean Differences < 0.1) after matching.
Different matching algorithms (nearest neighbor, caliper, kernel) can be used to form the matched sample, from which the Average Treatment Effect on the Treated (ATT) is calculated.
The validity of PSM depends on the strong ignorability assumption (no unmeasured confounding). Therefore, sensitivity analysis is essential to assess how vulnerable the results are to a potential hidden bias.
PSM is a powerful tool for improving causal inference from non-experimental data, but it requires careful execution, transparent reporting, and a sober understanding of its limitations.
