Causal Inference with Propensity Score Matching
In observational studies across fields like healthcare, economics, and social science, researchers often seek to estimate causal effects—for example, does a new drug improve patient outcomes? Unlike randomized controlled trials, observational data is plagued by selection bias, where treated and control groups differ systematically. Propensity score matching is a powerful statistical technique that reduces this bias by mimicking randomization, allowing you to draw more credible causal conclusions from non-experimental data.
Foundations: Propensity Scores and Estimation with Logistic Regression
At its core, a propensity score is the probability that a unit (e.g., a patient, customer, or school) receives a treatment, given its observed characteristics or covariates. Formally, for a binary treatment $T$ (where $T = 1$ means treated and $T = 0$ means control) and a vector of covariates $X$, the propensity score is defined as $e(X) = P(T = 1 \mid X)$. By summarizing all covariates into a single score, we can balance the treated and control groups on observed factors, much like random assignment would.
The most common method to estimate propensity scores is logistic regression. You model the treatment assignment as a function of the covariates: $\log \frac{P(T = 1 \mid X)}{1 - P(T = 1 \mid X)} = \beta_0 + \beta^\top X$. Solving for the probability gives the estimated propensity score: $\hat{e}(X) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top X)}}$. Think of it as creating a "similarity index" based on pre-treatment variables. For instance, in a study on job training programs, covariates might include age, education, and prior income; logistic regression calculates each person's likelihood of enrolling in the program based on these factors.
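As a concrete sketch, the estimation step might look like the following, using scikit-learn on simulated data. The covariates (age, education, prior income), the coefficients that drive selection, and the sample size are all hypothetical assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical pre-treatment covariates: age, years of education, prior income
X = np.column_stack([
    rng.normal(35, 10, n),        # age
    rng.normal(12, 2, n),         # education
    rng.normal(30_000, 8_000, n)  # prior income
])

# Simulate selection bias: treatment probability depends on the covariates
true_logits = -8 + 0.05 * X[:, 0] + 0.3 * X[:, 1] + 1e-5 * X[:, 2]
T = rng.binomial(1, 1 / (1 + np.exp(-true_logits)))

# Fit logistic regression of treatment on covariates
model = LogisticRegression(max_iter=1000).fit(X, T)

# Estimated propensity scores: P(T = 1 | X) for each unit
e_hat = model.predict_proba(X)[:, 1]
```

Each entry of `e_hat` is that unit's "similarity index": two units with nearly equal scores had nearly equal chances of treatment, regardless of how different their raw covariates look.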
Implementing Matching and Computing Treatment Effects
Once propensity scores are estimated, the next step is to match treated and control units with similar scores. The goal is to pair each treated unit with one or more control units that are statistical twins in terms of their probability of treatment. A refined approach is caliper matching, which sets a maximum allowable distance (the caliper) between propensity scores for a match—often a fraction of the standard deviation of the logit of the propensity score. This prevents poor matches by ensuring units are only paired if their scores are sufficiently close, improving the quality of the comparison.
After matching, you compute the average treatment effect (ATE) or, more commonly in matching, the average treatment effect on the treated (ATT). The ATT estimates the effect for those who actually received the treatment. For a matched set, if you have $n_t$ treated units, and for each treated unit $i$ with outcome $Y_i$ you have matched control units with outcomes $Y_j$, $j \in \mathcal{J}(i)$, the ATT is calculated as $\widehat{\text{ATT}} = \frac{1}{n_t} \sum_{i : T_i = 1} \big( Y_i - \frac{1}{M} \sum_{j \in \mathcal{J}(i)} Y_j \big)$, where $M$ is the number of controls matched to each treated unit. In a healthcare example, if matched patients show a 10-point higher recovery score than their controls, that difference averaged across all pairs estimates the drug's effect.
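The two steps above can be sketched as 1:1 nearest-neighbor matching on the logit of the propensity score, followed by averaging matched-pair outcome differences. The 0.2-standard-deviation caliper is a common rule of thumb, the tiny data set is invented for illustration, and controls may be reused (matching with replacement):

```python
import numpy as np

def caliper_match_att(e_hat, T, Y, caliper_mult=0.2):
    """1:1 nearest-neighbor matching (with replacement) on the logit of
    the propensity score, with a caliper; returns (ATT, n matched)."""
    logit = np.log(e_hat / (1 - e_hat))
    caliper = caliper_mult * logit.std()
    treated = np.flatnonzero(T == 1)
    controls = np.flatnonzero(T == 0)
    diffs = []
    for i in treated:
        dist = np.abs(logit[controls] - logit[i])
        j = dist.argmin()
        if dist[j] <= caliper:  # discard treated units with no close control
            diffs.append(Y[i] - Y[controls[j]])
    return np.mean(diffs), len(diffs)

# Tiny illustrative example (hypothetical scores and outcomes)
e_hat = np.array([0.80, 0.60, 0.79, 0.61, 0.20])
T = np.array([1, 1, 0, 0, 0])
Y = np.array([10.0, 8.0, 7.0, 6.0, 0.0])
att, n_matched = caliper_match_att(e_hat, T, Y)
```

Here each treated unit finds a control with a nearly identical score, so the ATT is simply the mean of the within-pair outcome differences; treated units whose nearest control falls outside the caliper are dropped rather than forced into a bad match.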
Assessing Balance and Sensitivity to Unmeasured Confounders
Matching is only valid if it achieves covariate balance, meaning the distribution of covariates is similar between treated and control groups after matching. You assess this by comparing standardized mean differences, variances, or graphical plots like love plots. A standardized mean difference below 0.1 for each covariate is often considered balanced. If balance isn't achieved, you may need to refine the propensity score model or matching method—balance checks are a critical diagnostic tool, not an optional step.
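A minimal balance diagnostic can be computed per covariate, before and after matching. The pooled-standard-deviation definition used here is one common convention (the hypothetical treated/control samples are for illustration):

```python
import numpy as np

def standardized_mean_diff(x_treated, x_control):
    """Absolute standardized mean difference: the gap in covariate means
    scaled by the pooled standard deviation of the two groups."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return abs(x_treated.mean() - x_control.mean()) / pooled_sd

# Identical distributions -> SMD of 0; shifted by one pooled SD -> SMD of 1
smd_balanced = standardized_mean_diff(np.array([1.0, 2.0, 3.0]),
                                      np.array([1.0, 2.0, 3.0]))
smd_shifted = standardized_mean_diff(np.array([2.0, 3.0, 4.0]),
                                     np.array([1.0, 2.0, 3.0]))
```

In practice you would apply this to every covariate in the matched sample and flag any with an SMD above the 0.1 threshold mentioned above.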
Even with perfect balance on observed covariates, unmeasured confounders could bias results. Sensitivity analysis quantifies how strong an unobserved variable would need to be to overturn your conclusions. One approach uses Rosenbaum bounds, which provide a threshold for the odds ratio of treatment assignment due to hidden bias. For example, if a sensitivity analysis shows that an unmeasured confounder would need to double the odds of treatment to nullify the effect, your finding is relatively robust; if a small bias could do so, interpret results with caution.
Alternatives and When Matching Outperforms Regression
Inverse propensity weighting (IPW) is a popular alternative to matching. Instead of pairing units, IPW weights each observation by the inverse of its probability of receiving the treatment it actually received: treated units get weight $1/\hat{e}(X)$ and control units get weight $1/(1 - \hat{e}(X))$. This creates a pseudo-population in which treatment assignment is independent of covariates, allowing direct estimation of the ATE. However, IPW can be unstable with extreme propensity scores, requiring trimming or careful implementation.
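A minimal IPW sketch, assuming a Hájek-style (normalized-weight) estimator and a simple symmetric trimming rule; both are common stabilizing choices but not the only options:

```python
import numpy as np

def ipw_ate(e_hat, T, Y, trim=0.05):
    """IPW estimate of the ATE. Trimming scores outside (0.05, 0.95) is a
    common but arbitrary guard against extreme weights."""
    keep = (e_hat > trim) & (e_hat < 1 - trim)
    e, t, y = e_hat[keep], T[keep], Y[keep]
    # Normalized (Hajek) weighted means: weights 1/e for treated, 1/(1-e) for controls
    mu_treated = np.sum(t * y / e) / np.sum(t / e)
    mu_control = np.sum((1 - t) * y / (1 - e)) / np.sum((1 - t) / (1 - e))
    return mu_treated - mu_control

# Sanity check with constant scores: reduces to a simple difference in means
ate = ipw_ate(np.full(4, 0.5),
              np.array([1, 1, 0, 0]),
              np.array([3.0, 5.0, 1.0, 1.0]))
```

Normalizing the weights, rather than using raw inverse-probability sums, keeps the estimate bounded by the observed outcomes even when a few scores are close to the trimming threshold.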
A key practical decision is when matching outperforms regression adjustment. Regression adjustment controls for covariates directly in an outcome model, but it assumes a correct functional form (e.g., linearity). Matching excels when there is sufficient overlap in propensity scores between groups but the outcome model might be misspecified. For instance, if the relationship between covariates and outcome is complex or unknown, matching on propensity scores nonparametrically can reduce bias more effectively than a misspecified regression. Matching also provides transparency by showing you exactly which units are compared, whereas regression extrapolates across all data.
Common Pitfalls
- Ignoring the Overlap Assumption: Propensity score matching requires substantial overlap in scores between treated and control groups. If overlap is poor—for example, some treated units have no comparable controls—matching will discard data and estimates may be unreliable. Always inspect the propensity score distribution before matching; consider trimming or using methods like IPW if overlap is limited.
- Failing to Assess Covariate Balance Adequately: Simply matching on propensity scores doesn't guarantee balance. A common mistake is to skip formal balance checks. Use multiple metrics (e.g., standardized differences, variance ratios) and visualize results. If balance isn't achieved, revisit your propensity score model by adding interaction terms or using machine learning methods like boosted regression.
- Confusing ATE and ATT: Matching often estimates ATT, while regression or IPW might target ATE. ATT answers "What was the effect for those treated?", while ATE answers "What would be the effect if everyone were treated?". Misinterpreting which estimand you're calculating can lead to incorrect policy implications. Clearly define your causal question from the start.
- Overlooking Sensitivity to Unmeasured Confounding: Assuming that matching eliminates all bias is dangerous. Observational studies always carry the risk of hidden variables. Neglecting sensitivity analysis can overstate the confidence in your findings. Always conduct and report sensitivity tests to acknowledge the limitations of your data.
Summary
- Propensity scores, typically estimated via logistic regression, condense pre-treatment covariates into a single probability of treatment, enabling bias reduction in observational studies.
- Matching methods like caliper matching pair treated and control units with similar scores, after which you compute average treatment effects such as ATT by comparing outcomes within matched sets.
- Covariate balance assessment is mandatory to verify that matching succeeded in making groups comparable, while sensitivity analysis evaluates how robust results are to potential unmeasured confounders.
- Inverse propensity weighting offers an alternative weighting approach, but matching often outperforms regression adjustment when functional form assumptions are questionable and overlap between groups is good.
- Always check for overlap, clearly define your estimand (ATE vs. ATT), and incorporate sensitivity analyses to responsibly communicate the strength of causal evidence.