AP Statistics: Inference for Linear Regression

Moving beyond simply describing a linear relationship with a correlation coefficient, inference for linear regression allows you to make probabilistic statements about a relationship in a broader population. This is the toolset that transforms an observed pattern in your sample data into a defensible conclusion about whether a real association exists and how strong it might be. It is the bridge between exploratory data analysis and confirmatory, data-driven decision-making, a skill fundamental to all STEM fields and data science.

The Linear Regression Model: From Sample to Population

When you calculate a least-squares regression line from sample data, you are estimating a true, underlying linear relationship that exists in the population. This true relationship is expressed by the population regression model:

$y = β_{0} + β_{1} x + ϵ$

Here, $y$ is the response variable, $x$ is the explanatory variable, $β_{0}$ is the population y-intercept, $β_{1}$ is the population slope, and $ϵ$ represents random error. The key parameter for inference is $β_{1}$. Your sample regression line, $\hat{y} = b_{0} + b_{1} x$, provides the estimates $b_{0}$ and $b_{1}$. The fundamental question of inference is: "Is the observed sample slope $b_{1}$ convincing evidence that the population slope $β_{1}$ is not zero?"

For example, if you collect data on study hours and test scores for 30 students, your calculated slope $b_{1}$ might be 2.5 points per hour. Inference helps you determine if this is likely a real trend for all students, or just a pattern that occurred by chance in your particular sample.
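
As a rough illustration of the sample-versus-population idea, the sketch below simulates one sample of 30 students from a made-up population model and computes the least-squares estimates $b_{0}$ and $b_{1}$. The population values (intercept 58, slope 2.5, error standard deviation 8) and the helper name `fit_line` are assumptions chosen for illustration, not values from real data.

```python
import random

def fit_line(x, y):
    """Return the least-squares estimates (b0, b1) for y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx            # sample slope
    b0 = y_bar - b1 * x_bar     # sample intercept
    return b0, b1

# Hypothetical population model: y = 58 + 2.5x + error (assumed values).
BETA0, BETA1, SIGMA = 58.0, 2.5, 8.0

rng = random.Random(1)  # fixed seed so the run is repeatable
hours = [rng.uniform(0, 10) for _ in range(30)]
scores = [BETA0 + BETA1 * h + rng.gauss(0, SIGMA) for h in hours]

b0, b1 = fit_line(hours, scores)
print(f"sample line: y-hat = {b0:.2f} + {b1:.2f}x (population slope is {BETA1})")
```

Changing the seed produces a different sample and a different $b_{1}$, which is the sample-to-sample variation that inference has to account for.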

Hypothesis Testing for the Slope: The t-Test

The primary method for determining if a statistically significant linear relationship exists is a t-test on the slope. The null hypothesis always states that there is no linear relationship in the population, meaning the population slope is zero. The alternative hypothesis claims a linear relationship exists (the slope is not zero, or is greater/less than zero in a one-tailed test).

Hypotheses: $H_{0} : β_{1} = 0$ vs. $H_{a} : β_{1} \neq 0$ (or $> 0$, or $< 0$).
Test Statistic: The formula is $t = \frac{b_{1} - \text{hypothesized value}}{SE_{b_{1}}}$. Since we are almost always testing if $β_{1} = 0$, this simplifies to $t = \frac{b_{1}}{SE_{b_{1}}}$. The standard error of the slope, $SE_{b_{1}}$, measures how much the sample slope $b_{1}$ typically varies from the true slope $β_{1}$ from sample to sample. A larger standard error creates more uncertainty.
P-value: The p-value is calculated using a t-distribution with $n - 2$ degrees of freedom, where $n$ is the sample size. It represents the probability of obtaining a sample slope as extreme as, or more extreme than, the one you observed, assuming the null hypothesis ( $β_{1} = 0$ ) is true.

Let's apply this. Suppose your study hours vs. score data yields $b_{1} = 2.5$ and $S E_{b_{1}} = 0.8$ with $n = 30$ .

Test Statistic: $t = \frac{2.5}{0.8} = 3.125$
Degrees of Freedom: $df = 30 - 2 = 28$
Using a t-distribution table or calculator, the two-tailed p-value for $t = 3.125$ with $df = 28$ is approximately 0.004.
A small p-value (typically < 0.05) provides strong evidence against $H_{0}$ . You would reject the null hypothesis and conclude there is statistically significant evidence of a linear relationship between study hours and test score.
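
The calculation above can be reproduced with a short script. This is a minimal sketch that uses only the Python standard library: it finds the two-tailed p-value by numerically integrating the t-distribution density with Simpson's rule, which stands in for the table or calculator lookup. The function names `t_pdf` and `two_tailed_p` are illustrative, not part of any library.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p-value: the area outside the range -|t| to |t|."""
    a = abs(t_stat)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

# Values from the study hours example in the text.
b1, se_b1, n = 2.5, 0.8, 30
t_stat = b1 / se_b1
df = n - 2
p_value = two_tailed_p(t_stat, df)

print(f"t = {t_stat:.3f}, df = {df}, two-tailed p = {p_value:.4f}")
```

For a one-tailed alternative in the direction of the observed slope, the p-value would be half of the two-tailed value.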

Confidence Interval for the True Slope

A hypothesis test tells you if a relationship exists, but a confidence interval for the slope tells you what that relationship might be. It provides a range of plausible values for the population slope $β_{1}$ . The formula is:

$b_{1} \pm t_{n - 2}^{*} \cdot S E_{b_{1}}$

Here, $t_{n - 2}^{*}$ is the critical t-value for your desired confidence level (e.g., 95%) with $n - 2$ degrees of freedom. A 95% confidence level means that the method captures the true population slope in about 95% of random samples; for any single interval, you say you are 95% confident that it captures $β_{1}$.

Continuing our example, for a 95% CI with $df = 28$ , $t^{*} \approx 2.048$ .

Margin of Error: $2.048 \cdot 0.8 \approx 1.64$
95% CI for $β_{1}$ : $2.5 \pm 1.64 \to (0.86, 4.14)$ points per hour.

This means we are 95% confident that, for the population, each additional hour studied is associated with an average increase in test score between 0.86 and 4.14 points. Note that this entire interval is positive, which is consistent with our significant t-test result.
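
The interval arithmetic can be checked with a few lines of code. The sketch below takes the critical value $t^{*} = 2.048$ as given from the text rather than computing it; the function name `slope_confidence_interval` is an illustrative choice.

```python
def slope_confidence_interval(b1, se_b1, t_star):
    """Return (lower, upper) for the interval b1 +/- t_star * se_b1."""
    margin = t_star * se_b1
    return b1 - margin, b1 + margin

# Values from the study hours example; t_star is the df = 28 critical value.
lower, upper = slope_confidence_interval(b1=2.5, se_b1=0.8, t_star=2.048)
contains_zero = lower <= 0 <= upper

print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")
print("interval contains 0:", contains_zero)
```

An interval that excludes 0 agrees with rejecting $H_{0} : β_{1} = 0$ in a two-sided test at the matching significance level.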

Conditions and Diagnostics: Checking the Residuals

The validity of the t-test and confidence interval depends on four key conditions, which are checked using the residuals (the differences between observed and predicted y-values).

Linear Relationship: The relationship between $x$ and $y$ must be linear. Check a scatterplot of the data or, more definitively, a residual plot (residuals vs. $x$ ). The plot should show no obvious curved pattern.
Independent Errors: The residuals should be independent of each other. This is most often violated with time-series data. You check this by ensuring the data come from a random sample or randomized experiment.
Equal Variance of Errors (Homoscedasticity): The vertical spread of the residuals should be roughly constant across all values of $x$. In the residual plot, the cloud of points should form a band of roughly even width, with no "fan-in" or "fan-out" pattern.
Normal Distribution of Errors: For small sample sizes ( $n < 30$ ), the residuals should be approximately Normally distributed. Check a histogram or a Normal probability plot of the residuals. For larger sample sizes, the t-procedures are robust to minor departures from Normality.

If these conditions are not reasonably met, the p-values and confidence intervals produced may not be trustworthy.
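
To see what a failed linearity check looks like, the sketch below fits a line to a small made-up data set that is actually quadratic ($y = x^{2}$) and lists the residuals. The data and the helper name `residuals` are assumptions for illustration only.

```python
def residuals(x, y):
    """Residuals (observed y minus predicted y) from the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Made-up curved data: the true relationship is y = x^2, not a line.
x = [0, 1, 2, 3, 4]
y = [xi ** 2 for xi in x]

res = residuals(x, y)
for xi, ri in zip(x, res):
    print(f"x = {xi}: residual = {ri:+.1f}")
```

The residuals come out positive at both ends and negative in the middle. Plotted against $x$, that U-shape is the kind of curved pattern that signals the linear condition is not met.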

Interpreting Standard Computer Output

You will often perform these analyses using software. A standard regression output table will provide all the necessary components:

Predictor	Coef ( $b$ )	SE Coef ( $S E_{b_{1}}$ )	T-Value	P-Value
Constant	58.2	3.1	18.77	0.000
Study Hrs	2.5	0.8	3.13	0.004

"Coef" for your explanatory variable is the slope, $b_{1} = 2.5$ .
"SE Coef" is its standard error, $S E_{b_{1}} = 0.8$ .
"T-Value" is the test statistic for the slope: $t = 2.5/0.8 = 3.13$ .
"P-Value" for the slope (0.004) is used to test $H_{0} : β_{1} = 0$ .

Additional output will often include $R^{2}$ (the coefficient of determination) and $s$ (the standard deviation of the residuals, which estimates the typical prediction error).
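
The quantities in such a table can be computed from raw data. The sketch below uses the standard simple-regression formulas $s = \sqrt{\text{SSE}/(n - 2)}$ and $SE_{b_{1}} = s / \sqrt{\sum (x_{i} - \bar{x})^{2}}$, which the text does not state explicitly. The five data points and the function name `regression_summary` are made up for illustration and are not the study hours data.

```python
import math

def regression_summary(x, y):
    """Compute the main numbers found in simple regression output."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total sum of squares
    s = math.sqrt(sse / (n - 2))    # standard deviation of the residuals
    se_b1 = s / math.sqrt(s_xx)     # standard error of the slope
    return {
        "b0": b0,
        "b1": b1,
        "s": s,
        "se_b1": se_b1,
        "t": b1 / se_b1,
        "r_sq": 1 - sse / sst,
    }

# Small made-up data set (hypothetical values).
summary = regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The p-value is not included here; it would come from comparing `t` to a t-distribution with $n - 2$ degrees of freedom.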

Common Pitfalls

Confusing Statistical Significance with Practical Importance: A very small p-value means the evidence for a linear relationship is strong, but it does not mean the relationship is practically important. With a very large sample size, even a tiny, trivial slope can produce a statistically significant result. Always interpret the slope in context and consider the confidence interval's range. An interval of (0.1, 0.3) for a slope might be statistically significant (not containing 0) but represent a negligible real-world effect.

Misinterpreting the Hypotheses: The null hypothesis $β_{1} = 0$ specifically tests for a linear relationship. Failing to reject $H_{0}$ does not prove "no relationship"; it only means the data do not provide convincing evidence of a linear one. The two variables could still be related in a curved, nonlinear way.

Neglecting the Conditions: Jumping to inference without checking the residual plots is a critical error. Applying a linear regression t-test to data that is clearly curved or has non-constant variance will lead to invalid conclusions. The inference procedures are built on these assumptions.

Incorrectly Stating the Conclusion in a CI Context: Saying "there is a 95% probability that the true slope is between 0.86 and 4.14" is incorrect. The population slope is a fixed, unknown value. The probability statement is about the method: 95% of confidence intervals constructed in this way will capture the true parameter.
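
The claim about the method can be illustrated by simulation. The sketch below repeatedly draws samples of 30 from a made-up population model with a known slope, builds a 95% interval from each sample, and counts how often the interval captures the true slope. The population values (58, 2.5, 8), the fixed set of study hours, and the helper name `slope_and_se` are assumptions for illustration; $t^{*} = 2.048$ is the $df = 28$ critical value used in the text.

```python
import math
import random

def slope_and_se(x, y):
    """Return the sample slope b1 and its standard error."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    return b1, s / math.sqrt(s_xx)

# Hypothetical population model: y = 58 + 2.5x + Normal error (assumed values).
BETA0, BETA1, SIGMA = 58.0, 2.5, 8.0
T_STAR = 2.048          # 95% critical value for df = 28
N, REPS = 30, 2000

rng = random.Random(42)                    # fixed seed for repeatability
hours = [1 + (i % 10) for i in range(N)]   # x-values 1..10, three students each

hits = 0
for _ in range(REPS):
    scores = [BETA0 + BETA1 * h + rng.gauss(0, SIGMA) for h in hours]
    b1, se_b1 = slope_and_se(hours, scores)
    if b1 - T_STAR * se_b1 <= BETA1 <= b1 + T_STAR * se_b1:
        hits += 1

coverage = hits / REPS
print(f"{coverage:.1%} of the {REPS} intervals captured the true slope")
```

The true slope never changes between runs; only the intervals do. The proportion of intervals that capture it should land close to 95%, which is what the confidence level describes.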

Summary

Inference for linear regression uses sample data to make conclusions about the population slope $β_{1}$ , primarily through a t-test for the slope and a confidence interval for the slope.
A t-test evaluates the null hypothesis $H_{0} : β_{1} = 0$ against an alternative. A small p-value provides evidence for a statistically significant linear relationship.
A confidence interval provides a range of plausible values for the true slope, quantifying the uncertainty in your estimate.
The validity of these procedures rests on four key conditions checked using residuals: linearity, independence, constant variance, and Normality of errors.
Always distinguish statistical significance from practical significance. A result can be statistically convincing yet have minimal real-world impact, a distinction often revealed by the confidence interval's width and endpoints.
