Feb 27

PR Curves for Imbalanced Data

Mindli Team

AI-Generated Content


When your machine learning model must distinguish rare events—like fraudulent transactions, tumors in medical scans, or defective parts on an assembly line—traditional evaluation metrics can be dangerously misleading. In these imbalanced datasets, where the class of interest (the positive class) is vastly outnumbered, the Precision-Recall (PR) curve becomes an essential diagnostic tool. It moves beyond accuracy and even the popular ROC curve to deliver a clear, actionable picture of your model's performance where it matters most: its ability to correctly identify positive cases without being flooded by false alarms.

Why PR Curves Trump ROC Curves for Imbalanced Data

To understand the supremacy of the PR curve for imbalanced problems, you must first grasp the limitations of the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (Recall or Sensitivity) against the False Positive Rate (FPR) at various classification thresholds. The FPR is calculated as FPR = FP / (FP + TN), where FP is the number of false positives and TN the number of true negatives.

In a severely imbalanced dataset, the number of true negatives (TN) is enormous. This massive denominator makes the FPR overly optimistic and insensitive to changes in the number of false positives, as adding a few hundred false positives to a pool of millions of true negatives barely moves the needle. Consequently, a model can produce a deceptively great-looking ROC curve and a high Area Under the ROC Curve (AUC) while performing poorly on the rare class.

The PR curve sidesteps this issue by ignoring the true negatives altogether. It plots Precision against Recall (True Positive Rate). Precision asks, "Of all the instances my model labeled positive, how many are actually correct?" It is defined as Precision = TP / (TP + FP). Recall asks, "Of all the actual positive instances, how many did my model successfully find?" It is defined as Recall = TP / (TP + FN). By focusing solely on the positive class and its associated errors (false positives and false negatives), the PR curve provides a magnified view of performance on the critical minority class, making it the tool of choice for imbalanced scenarios.
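
To see the asymmetry concretely, here is a quick back-of-the-envelope calculation; the confusion-matrix counts are invented for illustration, not from a real model:

```python
# Hypothetical counts for a fraud model on an imbalanced test set:
# 1,000 true positives total, 99,900 true negatives total (assumed numbers).
tp, fn = 800, 200      # positives: 800 caught, 200 missed
fp, tn = 900, 99_000   # negatives: 900 false alarms, 99,000 correctly ignored

fpr = fp / (fp + tn)        # ROC x-axis: diluted by the huge TN pool
precision = tp / (tp + fp)  # PR y-axis: directly exposes the false alarms
recall = tp / (tp + fn)     # shared by both curves

print(f"FPR:       {fpr:.4f}")        # tiny; the ROC curve looks excellent
print(f"Precision: {precision:.4f}")  # under 0.5: every other alert is wrong
print(f"Recall:    {recall:.4f}")
```

The same 900 false positives that barely register in the FPR (about 0.009) drag precision down to about 0.47, which is exactly the story the PR curve surfaces and the ROC curve hides.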

Constructing the Precision-Recall Curve

Building a PR curve is a threshold-sweeping operation, similar to creating an ROC curve but with a different pair of metrics. For a binary classifier that outputs a probability or score for the positive class, you follow these steps:

  1. Collect Scores and True Labels: For your test set, gather the model's predicted probability for the positive class and the true binary label (0 or 1) for each instance.
  2. Sort and Threshold: Sort all instances by their predicted probability in descending order. Now, imagine a sliding threshold. Start with a threshold higher than your maximum probability, where you classify nothing as positive. At this point, TP = 0 and FP = 0, so precision is undefined, and recall = 0.
  3. Calculate Pairs: Move the threshold down to each unique predicted probability value. For each threshold, classify all instances with a score >= threshold as positive. Compute the resulting Confusion Matrix for that threshold, and from it, calculate the (Precision, Recall) pair.
  4. Plot the Curve: Plot all the (Recall, Precision) pairs on a graph, with Recall on the x-axis (from 0 to 1) and Precision on the y-axis (from 0 to 1). Connect the points to form the PR curve.

A perfect classifier would be a point in the top-right corner at (Recall=1, Precision=1). A no-skill classifier, which makes random guesses, will have a horizontal line at a precision equal to the prevalence of the positive class in the dataset. For a highly imbalanced dataset (e.g., 1% positive), this "no-skill" line is at Precision=0.01. Your model's curve must be significantly above this baseline to be useful.
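
The construction steps above can be sketched in a few lines of plain Python; the scores and labels below are made up for illustration:

```python
# Toy threshold sweep: lower the threshold past each score in descending
# order, updating TP/FP incrementally instead of rebuilding the
# confusion matrix from scratch at every step.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    1,    0]   # 4 positives of 8

n_pos = sum(labels)
ranked = sorted(zip(scores, labels), reverse=True)  # descending by score

pairs = []  # (recall, precision) at each threshold
tp = fp = 0
for score, label in ranked:
    if label == 1:          # this instance crosses into "predicted positive"
        tp += 1
    else:
        fp += 1
    pairs.append((tp / n_pos, tp / (tp + fp)))

no_skill = n_pos / len(labels)  # horizontal baseline = prevalence
print(pairs)
print(f"no-skill precision: {no_skill}")
```

Plotting `pairs` with recall on the x-axis gives the raw PR curve; note how precision drops each time a negative instance is admitted, then recovers when the next positive arrives.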

Interpolation and Computing Average Precision (AP)

The raw PR curve is often jagged. To calculate a robust summary metric and create a smoother visualization, we use interpolation. The most common method is to define the interpolated precision at a given recall level r as the maximum precision obtained for any recall value r′ ≥ r: P_interp(r) = max_{r′ ≥ r} P(r′).

This creates a monotonically decreasing step function. The key summary metric is the Average Precision (AP), which approximates the Area Under the PR Curve (AUC-PR). It is computed as the weighted mean of precisions at each recall level, with the increase in recall from the previous threshold used as the weight.

Practically, for a set of thresholds, AP is often calculated using the formula:

AP = Σ_n (R_n − R_{n−1}) · P_n

where P_n and R_n are the precision and recall at the n-th threshold. This is the implementation used by libraries like scikit-learn (whose average_precision_score applies the sum without interpolation, to avoid an overly optimistic estimate). A higher AP (closer to 1) indicates a better overall model across all thresholds. AP provides a single number to summarize the PR curve, making it invaluable for quick model comparison.
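
As a minimal sketch, the weighted-mean AP sum can be computed directly from a list of (recall, precision) pairs ordered by descending threshold; the pairs below are illustrative, not from a real model:

```python
def average_precision(pairs):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n,
    given (recall, precision) pairs in order of descending threshold."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in pairs:
        ap += (recall - prev_recall) * precision  # weight = recall increase
        prev_recall = recall
    return ap

# Illustrative sweep output: recall rises as the threshold drops,
# while precision zigzags.
pairs = [(0.25, 1.0), (0.50, 1.0), (0.50, 0.667), (0.75, 0.75),
         (0.75, 0.6), (0.75, 0.5), (1.0, 0.571), (1.0, 0.5)]
print(round(average_precision(pairs), 3))
```

Steps where recall does not increase contribute zero weight, so the zigzagging precision values at a fixed recall level do not distort the total.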

Comparing Classifiers Using PR Analysis

When evaluating multiple models on an imbalanced task, you must move beyond a single-number metric like accuracy or even ROC-AUC. A systematic PR curve analysis involves a layered comparison:

  1. Visual Curve Inspection: Overlay the PR curves for all classifiers on one plot. The curve that dominates—sitting above and to the right of others across most recall levels—is generally superior. Pay close attention to the region of the curve that matters for your application. A fraud detection system might prioritize very high precision (low false alarms), requiring analysis at the left side of the curve (low recall). A cancer screening tool might prioritize high recall, focusing on the right side.
  2. Compare Average Precision (AP): The model with the highest AP has better overall performance across all possible thresholds. This is your primary numeric indicator.
  3. Threshold Selection Post-Analysis: The PR curve is the perfect tool for threshold tuning. Once you've selected the best model based on its curve and AP, you can examine the curve to find the operating point (threshold) that delivers the precision-recall trade-off mandated by your business or clinical costs. There is no "best" threshold; it is a strategic decision informed by the PR curve.
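
One simple way to operationalize step 3 is to pick, among the swept operating points, the threshold with the highest recall subject to a minimum-precision constraint; the points and the 0.90 precision floor below are hypothetical:

```python
# Hypothetical (threshold, precision, recall) triples from a model sweep.
points = [
    (0.9, 0.98, 0.20),
    (0.7, 0.95, 0.45),
    (0.5, 0.91, 0.60),
    (0.3, 0.80, 0.78),
    (0.1, 0.55, 0.95),
]

MIN_PRECISION = 0.90  # business constraint: at most ~1 in 10 false alarms

# Keep only operating points meeting the floor, then maximize recall.
feasible = [p for p in points if p[1] >= MIN_PRECISION]
best = max(feasible, key=lambda p: p[2])
print(f"chosen threshold={best[0]}, precision={best[1]}, recall={best[2]}")
```

A recall-first application (like screening) would invert the constraint: fix a recall floor and maximize precision over the feasible points.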

Common Pitfalls

  1. Misinterpreting a High ROC-AUC as Good Performance: The most critical mistake is being reassured by a high ROC-AUC (e.g., 0.95) on an imbalanced dataset. Always generate the PR curve to see the true story. A model can achieve a 0.95 ROC-AUC by perfectly classifying the majority class while missing half the positive cases, which would be catastrophic.
  2. Ignoring the No-Skill Baseline: On your PR plot, always plot the horizontal line representing the precision of a random classifier (the prevalence). If your model's curve lingers near this line, it is performing no better than chance on the positive class, regardless of other metrics.
  3. Using AP for Macro-Averaged Multi-Class Problems Incorrectly: In multi-class problems, you often compute a PR curve for each class (treating it as the positive class). A common pitfall is to naively average the AP scores across classes (macro-average) in a highly imbalanced setting. This gives equal weight to the performance on a tiny class and a huge class. Often, a weighted average, which weights each class's AP by its support (the number of true instances), is more representative of the practical reality.
  4. Forgetting That Precision is Not Monotonic: Unlike the ROC curve, the PR curve is not guaranteed to be monotonic. Precision can zigzag up and down as the threshold decreases and more predictions are added, which is why the curve is often interpolated into a step function before plotting or summarizing. Don't be alarmed by a non-smooth raw curve.
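
To make pitfall 3 concrete, here is a toy comparison of macro versus support-weighted averaging; the per-class AP scores and supports are invented:

```python
# Invented per-class Average Precision and class supports for a
# two-class imbalanced problem (99% "common", 1% "rare").
ap = {"common": 0.92, "rare": 0.35}
support = {"common": 9_900, "rare": 100}

# Macro: equal weight per class, regardless of size.
macro = sum(ap.values()) / len(ap)

# Weighted: each class's AP weighted by its share of true instances.
total = sum(support.values())
weighted = sum(ap[c] * support[c] / total for c in ap)

print(f"macro AP:    {macro:.3f}")     # pulled down hard by the tiny class
print(f"weighted AP: {weighted:.3f}")  # dominated by the common class
```

Neither number is "right" in isolation: macro flags the weak rare class, weighted reflects aggregate behavior. Report the one that matches the question you are asking, and ideally both.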

Summary

  • For imbalanced datasets, the Precision-Recall (PR) curve is a more informative and reliable diagnostic tool than the ROC curve because it focuses exclusively on the performance regarding the critical minority class, ignoring the uninformative true negatives.
  • The curve is constructed by plotting the (Precision, Recall) pair calculated at every possible classification threshold, providing a complete view of the trade-off between a model's correctness (precision) and its completeness (recall).
  • The Area Under the PR Curve is summarized by the Average Precision (AP) metric, a threshold-weighted mean of precisions that handles the curve's non-monotonic nature. A higher AP indicates a better model.
  • Comparing classifiers on imbalanced data requires analyzing both the visual dominance of their PR curves and their AP scores, followed by strategic threshold selection based on the operational cost-benefit trade-off highlighted by the curve.
  • Always contextualize your model's PR curve against the no-skill baseline (precision = class prevalence) to avoid being misled by models that perform no better than random guessing on the positive class.
