Mar 1

Math AI HL: Pearson's Correlation and Significance Testing

MT
Mindli Team

AI-Generated Content


Understanding the relationship between two quantitative variables is one of the most powerful tools in data analysis. For IB Math AI HL, mastering Pearson’s product-moment correlation coefficient, denoted r, and its associated significance testing is crucial. This technique allows you to quantify the strength and direction of a linear association, moving beyond visual guesswork from scatter plots to rigorous, statistical conclusions applicable in fields from economics to medicine.

Calculating and Interpreting the Correlation Coefficient

The first step is calculating r. This value, always between -1 and +1, summarizes a linear trend. The formula is:

r = Sxy / √(Sxx × Syy)

where Sxy = Σ(x − x̄)(y − ȳ), Sxx = Σ(x − x̄)², and Syy = Σ(y − ȳ)². In practice, you will use your GDC’s statistics mode. For example, given paired data for hours studied (x) and test score (y), input the lists and calculate the correlation.
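As a sketch, the same calculation can be done by hand in Python. The study-hours data below is made up for illustration; it is not from the course:

```python
# Sketch: Pearson's r from the definition, for hypothetical paired data
# (hours studied x, test score y).
import math

x = [1, 2, 3, 4, 5, 6]
y = [52, 60, 63, 71, 80, 84]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sxy, Sxx, Syy as defined above
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))   # close to +1: strong positive linear association
```

On a GDC you would get the same value by entering the two lists and running the two-variable statistics or linear regression function.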

Interpreting r involves both its sign and magnitude. A positive r (e.g., r = 0.85) indicates a positive linear relationship: as x increases, y tends to increase. A negative r (e.g., r = −0.85) indicates a negative linear relationship: as x increases, y tends to decrease. The magnitude |r| describes strength:

  • |r| > 0.75: Strong correlation.
  • 0.5 < |r| ≤ 0.75: Moderate correlation.
  • 0.25 < |r| ≤ 0.5: Weak correlation.

Crucially, r only measures linear association. A strong parabolic relationship could yield an r near zero, highlighting the need to always examine a scatter plot first.
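A minimal sketch of this pitfall, using a perfect parabola centred at x = 0 (hypothetical data): y is completely determined by x, yet r comes out as exactly zero.

```python
# Sketch: a strong quadratic relationship can still give r = 0.
# Symmetric data on y = x^2 makes Sxy vanish.
import math

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]          # perfect parabola

x_bar = sum(x) / len(x)            # 0
y_bar = sum(y) / len(y)

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / math.sqrt(s_xx * s_yy)
print(r)   # 0.0, even though y is perfectly predictable from x
```

A scatter plot would reveal the U-shape immediately, which is why plotting comes before calculating.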

The Coefficient of Determination

A step deeper in interpretation is the coefficient of determination, r². This is simply the square of the correlation coefficient, but its meaning is profoundly useful. While r tells you the strength and direction of a relationship, r² tells you the proportion of the variance in the dependent variable (y) that is predictable from the independent variable (x).

If you find r = 0.9 for our study hours and test score example, then r² = 0.81. You interpret this as: 81% of the variation in test scores can be explained by the linear relationship with hours studied. The remaining 19% of the variation is due to other factors (e.g., prior knowledge, question difficulty). This moves interpretation from describing association to assessing explanatory power.
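The arithmetic above in one small sketch, using the hypothetical r = 0.9:

```python
# Sketch: from r to r² (coefficient of determination) for r = 0.9.
r = 0.9
r_squared = r ** 2                   # 0.81
explained = round(r_squared * 100)   # % of variance in scores explained by hours
unexplained = 100 - explained        # % due to other factors
print(explained, unexplained)        # 81 19
```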

Hypothesis Testing for Correlation Significance

Finding a non-zero r in your sample data doesn’t prove a relationship exists in the broader population. The calculated r could be due to random chance in your specific sample. This is where hypothesis testing comes in. The test determines if the observed correlation is statistically significant.

The standard procedure is a t-test. You begin by stating your hypotheses:

  • Null Hypothesis (H₀): ρ = 0. (The population correlation coefficient is zero: no linear relationship exists.)
  • Alternative Hypothesis (H₁): ρ ≠ 0 for a two-tailed test, or ρ > 0 / ρ < 0 for one-tailed tests. (A linear relationship exists.)

The test statistic is calculated using your sample r and sample size n:

t = r√(n − 2) / √(1 − r²)

This t-statistic follows a t-distribution with n − 2 degrees of freedom. Your GDC will perform this entire test. You input the paired data and select the linear regression t-test. The critical output is the p-value.
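A minimal sketch of what the GDC computes, assuming a hypothetical sample with r = 0.9 and n = 10; the constant 2.306 is the standard two-tailed critical t value for α = 0.05 with 8 degrees of freedom:

```python
# Sketch: t-test for correlation significance with hypothetical r and n.
import math

r, n = 0.9, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # t = r·√(n−2)/√(1−r²)
df = n - 2                                         # degrees of freedom = 8
t_crit = 2.306                                     # two-tailed, α = 0.05, df = 8
reject_h0 = abs(t) > t_crit
print(round(t, 3), "reject H0:", reject_h0)        # t ≈ 5.84, reject H0: True
```

Here t ≈ 5.84 comfortably exceeds 2.306, which corresponds to the very small p-value a GDC would report for this sample.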

Interpreting the p-value is key to IB success. If the p-value is less than your chosen significance level α (e.g., α = 0.05), you reject H₀. You conclude there is sufficient statistical evidence to suggest a linear correlation in the population. If the p-value is greater than α, you fail to reject H₀, meaning the evidence is insufficient to claim a correlation exists. Remember, "failing to reject" is not the same as proving no correlation exists.

Assumptions, Limitations, and Causation

Pearson’s correlation is powerful but has strict assumptions. Violating these can render your analysis invalid. The core assumptions are:

  1. Linearity: The relationship between x and y must be linear. Check with a scatter plot.
  2. Bivariate Normality: For reliable significance testing, both variables should be approximately normally distributed. For larger samples (roughly n > 30), this is less critical due to the Central Limit Theorem.
  3. Homoscedasticity: The variability of data points around the line of best fit should be constant across all values of x.

The most critical limitation is the distinction between correlation and causation. A significant correlation does not imply that changes in x cause changes in y. There are three alternative explanations:

  1. Reverse Causation: y might actually cause x.
  2. Confounding Variable: A third, unseen variable causes both x and y to vary. For instance, ice cream sales (x) and drowning incidents (y) are correlated, but heat is the confounding variable driving both.
  3. Pure Coincidence: Especially with large datasets, some statistically significant correlations will occur by random chance.

Common Pitfalls

  1. Assuming Correlation Implies Causation: This is the most serious and common error. Always consider and state possible confounding variables when interpreting a significant r. Correlation is a tool for identifying relationships to investigate further, not for drawing causal conclusions on its own.
  2. Ignoring Assumptions, Particularly Linearity: Calculating r for a clear nonlinear relationship (e.g., a U-shaped parabola) will produce a misleading value near zero. You might incorrectly conclude "no relationship" when a strong, non-linear one exists. Always plot your data first.
  3. Overinterpreting a Significant Result with Low r²: With a very large sample size (e.g., n = 10,000), you can get a very small p-value for a very weak correlation (e.g., r = 0.05). While statistically significant, such a relationship has negligible practical importance. Always report and consider both r and the p-value together.
  4. Misusing the Coefficient of Determination: r² explains variance, not direct proportion. An r² of 0.64 does not mean that 64% of y is caused by x; it means 64% of the variation in y is associated with variation in x. Furthermore, a high r² does not validate a model if the underlying assumptions are violated.
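The sample-size effect behind the third pitfall can be sketched directly: since t = r√(n − 2)/√(1 − r²), the test statistic grows with √n even when r stays tiny (the numbers below are hypothetical):

```python
# Sketch: statistical vs practical significance. A tiny r can clear the
# critical value once n is large, because t grows with sqrt(n).
import math

def t_stat(r, n):
    """Test statistic t = r*sqrt(n-2)/sqrt(1-r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.05                              # very weak correlation
print(round(t_stat(r, 30), 2))        # ≈ 0.26: nowhere near significant
print(round(t_stat(r, 10_000), 2))    # ≈ 5.01: "significant", still weak
```

The second result would be declared significant at almost any α, yet r² = 0.0025 means the relationship explains a quarter of one percent of the variance.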

Summary

  • Pearson’s r quantifies the strength and direction of a linear association between two quantitative variables, ranging from -1 (perfect negative) to +1 (perfect positive).
  • The coefficient of determination, r², gives the proportion of variance in one variable explained by the linear relationship with the other.
  • Hypothesis testing (via a t-test) uses a p-value to determine if an observed sample correlation provides sufficient evidence to infer a linear relationship exists in the broader population.
  • Significance does not imply causation. A significant correlation can be due to confounding variables, reverse causation, or coincidence.
  • Valid analysis requires checking key assumptions: linearity, bivariate normality, and homoscedasticity. Ignoring these, especially linearity, invalidates the results.
