Mar 1

Math AI: Linear Regression and Prediction

Mindli Team

AI-Generated Content


Linear regression is one of the most powerful and widely used tools for making sense of the world through data. In IB Math AI, it moves beyond a simple graphing technique to become a formal method for modeling relationships, quantifying uncertainty, and making informed predictions. Mastering it means you can transform raw numbers into meaningful insights about trends, from business forecasts to scientific correlations.

The Foundation: Modeling Relationships with a Straight Line

At its core, linear regression seeks to find the straight line that best summarizes the linear relationship between two variables. We typically denote the independent (or explanatory) variable as x and the dependent (or response) variable as y. The model is expressed by the equation of a line: y = ax + b. Here, a represents the slope, which indicates the average change in the y-variable for a one-unit increase in the x-variable. The value b represents the y-intercept, which is the predicted value of y when x = 0. Crucially, the intercept must always be interpreted within the context of the data; if x = 0 is outside the observed range, the interpretation may not be meaningful.

We visually assess the suitability of a linear model by creating a scatter plot. If the data points appear to cluster around a straight-line pattern, linear regression is appropriate. However, a curved pattern or the presence of significant outliers suggests a linear model may be misleading. This initial graphical check is a critical first step before any calculation.

The Least Squares Method: Defining "Best Fit"

How do we determine the single "best" line through a cloud of points? The least squares method provides the objective criterion. It calculates the line that minimizes the sum of the squares of the residuals. A residual, denoted r, is the vertical distance between an observed data point (x, y) and the point on the regression line (x, ŷ), where ŷ is the predicted value. In formula terms, r = y − ŷ.

The method minimizes the sum SS = Σ r² = Σ (y − ŷ)². By squaring the residuals, we ensure positive and negative errors don't cancel each other out, and we penalize larger errors more heavily. The formulas for the slope and intercept are derived using calculus to find the minimum of SS: a = Sxy / Sxx and b = ȳ − a·x̄. Here, Sxy is the covariance sum Σ (x − x̄)(y − ȳ) and Sxx is the variance sum Σ (x − x̄)², while x̄ and ȳ are the sample means. You are not required to derive these in IB Math AI, but understanding their logic is key.
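
The least squares formulas can be sketched directly in code. The dataset below is invented purely for illustration; the calculation follows the covariance-sum and variance-sum formulas above.

```python
# Least squares slope and intercept from the formulas above.
# The data points are made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sxy: covariance sum; Sxx: variance sum
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

a = s_xy / s_xx        # slope
b = y_bar - a * x_bar  # intercept: the line always passes through (x_bar, y_bar)

print(f"y = {a:.3f}x + {b:.3f}")  # → y = 1.990x + 0.090
```

Note the design consequence of b = ȳ − a·x̄: the least squares line always passes through the point of means (x̄, ȳ), which is a quick sanity check for any fitted line.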

Using Your GDC for Efficient Analysis

Manually calculating the least squares line is impractical for real datasets. Your Graphical Display Calculator (GDC) is essential. The standard process is:

  1. Enter the x and y data into two separate lists.
  2. Perform a linear regression calculation (often LinReg(ax+b) or similar). Your GDC will output the values for a (slope) and b (intercept).
  3. Plot the scatter diagram and then plot the regression line on the same axes to visually confirm the fit.

You must be able to perform this procedure fluently in an exam setting. Furthermore, your GDC can often calculate predicted values directly and provide the correlation coefficient r and the coefficient of determination r², which leads to the next core concept.
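
Outside the exam room, the same three-step procedure can be checked on a computer. This sketch uses NumPy's polyfit as a stand-in for the GDC's LinReg(ax+b); the two lists are hypothetical values, not real data.

```python
import numpy as np

# Hypothetical data standing in for the two GDC lists.
x = np.array([15, 18, 21, 24, 27, 30], dtype=float)
y = np.array([270, 320, 365, 410, 455, 500], dtype=float)

a, b = np.polyfit(x, y, 1)    # degree-1 (linear) fit: slope a, intercept b
r = np.corrcoef(x, y)[0, 1]   # correlation coefficient r

print(f"y = {a:.2f}x + {b:.2f}, r^2 = {r**2:.4f}")
```

As on the GDC, you would then plot the scatter diagram together with the fitted line to confirm visually that a straight line is an appropriate model.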

Interpreting the Model: Slope, Intercept, and r²

Contextual interpretation separates a mechanical calculation from true understanding. For example, if a regression of ice cream sales (S, in dollars) against daily temperature (T, in °C) yields S = 15T + 50, you interpret:

  • Slope (15): For each additional degree Celsius in temperature, the model predicts an average increase of $15 in daily ice cream sales.
  • Intercept (50): When the temperature is 0°C, the model predicts $50 in ice cream sales. This might represent a baseline sale of other frozen items, but care is needed as 0°C is likely outside the data range used to build the model.

The coefficient of determination, r², is arguably the most important statistic for interpreting your model's usefulness. It represents the proportion of the variation in the y-variable that is explained by the linear relationship with the x-variable. An r² value of 0.85 means 85% of the variation in y can be explained by the variation in x via the linear model. The remaining 15% is due to other factors or random variation. A higher r² (closer to 1) indicates a more reliable model for prediction within the data range.
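
The "proportion of variation explained" idea can be made concrete by computing r² as 1 − SSres/SStot, where SSres is the sum of squared residuals around the line and SStot is the total squared variation around the mean. The tiny dataset here is invented for illustration.

```python
# r^2 as explained variation: 1 - SS_res / SS_tot.
# Illustrative data; the line is fitted with the least squares formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
b = y_bar - a * x_bar

ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained variation
ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total variation in y

r_squared = 1 - ss_res / ss_tot
print(f"r^2 = {r_squared:.4f}")
```

Because this data lies close to a straight line, SSres is small relative to SStot and r² comes out near 1; a curved or noisy dataset would leave a much larger unexplained share.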

Making and Evaluating Predictions

The primary purpose of regression is prediction. Using the regression equation y = ax + b, you can substitute a value for x to generate a predicted value ŷ.

  • Interpolation: Making a prediction for an x-value within the range of the original data is generally reliable, provided the linear trend holds.
  • Extrapolation: Making a prediction for an x-value outside the range of the original data is risky and often unreliable. The relationship observed may not continue linearly (or at all) beyond the observed data. For instance, predicting ice cream sales at 45°C based on data collected between 15°C and 30°C is an extrapolation and is likely to be inaccurate.

The reliability of a prediction depends on the strength of the linear association (high r²), the context of the prediction (interpolation vs. extrapolation), and the spread of the original data points around the line. Predictions are estimates, not certainties.
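
One way to keep the interpolation/extrapolation distinction front of mind is to have the prediction code flag it. This sketch reuses the ice cream numbers from the text (model coefficients 15 and 50, data collected between 15°C and 30°C); the function name and return shape are illustrative choices.

```python
# Prediction with an extrapolation flag, using the article's illustrative
# ice cream model (sales = 15 * temp + 50, data range 15-30 °C).
def predict_sales(temp_c, a=15.0, b=50.0, data_range=(15.0, 30.0)):
    """Return (predicted sales in dollars, True if this is interpolation)."""
    lo, hi = data_range
    return a * temp_c + b, lo <= temp_c <= hi

sales, reliable = predict_sales(22.0)  # within the data range
print(sales, reliable)                 # → 380.0 True

sales, reliable = predict_sales(45.0)  # outside the range: treat with caution
print(sales, reliable)                 # → 725.0 False
```

The model happily returns a number in both cases; only the flag tells you that the 45°C figure rests on an assumption the data cannot support.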

Common Pitfalls

  1. Confusing Correlation with Causation: A strong linear regression model shows association, not cause. Just because ice cream sales and shark attacks are correlated (both increase in summer) does not mean one causes the other. They are both linked to a lurking variable: temperature.
  2. Misinterpreting the Intercept: Always ask if x = 0 is within the context of your data. Predicting a car's resale value when its age is zero (brand new) is meaningful. Predicting a student's test score based on hours studied when hours studied is zero is less so, as it's unrealistic.
  3. Ignoring the Limitations of Extrapolation: This is the most common predictive error. The model describes the pattern within your data. Assuming this pattern continues indefinitely into uncharted territory is a major mistake that leads to faulty conclusions.
  4. Using Linear Regression for Non-Linear Data: Fitting a line to data that clearly follows a curve will produce a model with large residuals and a misleading r². Always look at the scatter plot first.

Summary

  • Linear regression uses the least squares method to find the line that minimizes the sum of squared vertical distances (residuals) from data points.
  • The slope (a) indicates the predicted change in y per unit increase in x, and the intercept (b) is the predicted y-value when x = 0, interpreted within context.
  • The coefficient of determination (r²) measures the proportion of variation in y explained by the linear model with x; a value closer to 1 indicates a stronger, more reliable relationship.
  • Use your GDC efficiently to calculate the regression equation, plot the model, and generate predictions.
  • Interpolation (predicting within the data range) is reliable if the model is strong; extrapolation (predicting outside the range) is risky and often invalid.
  • A strong correlation does not imply causation, and a linear model should only be used when a scatter plot indicates a roughly linear trend.
