Linear Regression Assumptions and Diagnostics
Ordinary Least Squares (OLS) linear regression is a foundational tool in statistics and data science, but its power rests on a set of core assumptions. Treating these assumptions as mere formalities can lead to models that are inefficient, biased, or entirely misleading. This guide moves beyond theory to focus on the practical workflow of verifying these assumptions and implementing robust solutions when reality deviates from the ideal.
Core Concepts of OLS Assumptions
A valid and reliable OLS regression model depends on five critical assumptions. Violating any of these can compromise your coefficient estimates, confidence intervals, and hypothesis tests.
1. Linearity and Additivity
The model assumes that the relationship between each predictor variable and the response variable is linear, and that the effects of different predictors are additive. This means a one-unit change in a predictor is associated with a constant change in the response, regardless of the predictor's current value or the values of other predictors.
Diagnostics: The primary tool is visual inspection of residual plots. Plot the model's residuals (observed value minus predicted value) against the predicted values and against each individual predictor. You should see a random scatter of points with no discernible pattern. A clear curved pattern (e.g., a U-shape) indicates a violation of linearity.
Applied Scenario: If you're predicting house price based on square footage, a linear model assumes each additional square foot adds the same dollar amount. In reality, the value per square foot might decrease for very large mansions—a non-linear relationship detectable in a residual plot.
2. Independence of Errors
The residuals (errors) from your model must be independent of each other. This is most commonly violated in time-series data or clustered data, where an observation at one time or in one group is correlated with another.
Diagnostics: For time-series data, the Durbin-Watson test is standard. Its statistic ranges from 0 to 4; a value near 2 suggests no autocorrelation, while values significantly below 2 indicate positive autocorrelation (successive errors are similar), and values above 2 indicate negative autocorrelation. You can also visually inspect a plot of residuals versus the order of data collection (e.g., time sequence).
3. Homoscedasticity (Constant Variance)
The variance of the errors should be constant across all levels of the independent variables. When this assumption is violated, it's called heteroscedasticity. This doesn't bias the coefficient estimates but makes them less efficient and undermines the standard errors, leading to unreliable hypothesis tests.
Diagnostics: Again, the plot of residuals versus predicted values is key. Look for a "fanning" or "cone" shape where the spread of residuals systematically increases or decreases with the predicted value. A formal statistical test is the Breusch-Pagan test, which tests the null hypothesis of constant variance.
Example: Predicting personal savings based on income might show heteroscedasticity. Individuals with low incomes have little room to save, so their savings are clustered near zero (low variance). High-income individuals can save a lot or spend it all, leading to a much wider spread of savings amounts (high variance).
4. Normality of Errors
For the purpose of hypothesis testing and constructing confidence intervals, the residuals should be approximately normally distributed. This assumption is less critical for large sample sizes due to the Central Limit Theorem, but it's important for small samples.
Diagnostics: Use a Normal Probability Plot (Q-Q plot). If the residuals are normally distributed, the points will fall roughly along the 45-degree reference line. Significant deviations, especially in the tails, indicate non-normality. The Shapiro-Wilk test provides a formal check but is sensitive to large sample sizes.
5. No Perfect Multicollinearity
While some correlation between predictors is expected, multicollinearity occurs when predictors are highly correlated with each other. This doesn't violate an OLS assumption in a way that invalidates the model, but it makes it hard to isolate the individual effect of each predictor, inflates the standard errors of coefficients, and can make the model unstable.
Diagnostics: The primary tool is the Variance Inflation Factor (VIF). The VIF for a predictor measures how much the variance of its coefficient estimate is inflated due to correlation with the other predictors (the standard error is inflated by the square root of the VIF). A common rule of thumb is that a VIF > 5 or 10 indicates problematic multicollinearity. You calculate the VIF for predictor j as VIF_j = 1 / (1 − R_j²), where R_j² is the R-squared from regressing that predictor on all the other predictors.
Remedial Measures for Violated Assumptions
When diagnostics reveal a problem, you have several tools to address it. The choice depends on which assumption is violated.
For Non-Linearity: Consider transforming your variables. Applying a log, square root, or polynomial transformation to the predictor or the response variable can often linearize a relationship. The Box-Cox transformation is a more systematic, data-driven method to find the best power transformation for the response variable to achieve normality and linearity.
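As a sketch of the data-driven approach, `scipy.stats.boxcox` estimates the power parameter lambda by maximum likelihood (the lognormal response here is synthetic; lognormal data should yield a lambda near 0, which corresponds to a log transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.lognormal(mean=2.0, sigma=0.8, size=500)  # right-skewed positive response

# boxcox returns the transformed data and the MLE of lambda
y_transformed, lam = stats.boxcox(y)
print(f"Estimated Box-Cox lambda: {lam:.2f}")  # near 0 suggests a log transform
```

Note that Box-Cox requires a strictly positive response; shift or use an alternative (e.g., Yeo-Johnson) otherwise.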
For Heteroscedasticity:
- Weighted Least Squares (WLS): If you can model the pattern of non-constant variance (e.g., variance is proportional to a predictor), you can use WLS, which gives less weight to observations with higher error variance.
- Robust Standard Errors (Huber-White/sandwich estimators): This is often the most practical solution. It calculates standard errors that are valid even in the presence of heteroscedasticity, leaving the original coefficient estimates unchanged. Most statistical software packages can compute these easily.
- Transform the Response Variable: A log transformation of the response variable can often stabilize variance.
For Autocorrelation (Non-Independence):
- Generalized Least Squares (GLS): This method directly models the correlation structure within the errors (e.g., an AR(1) process for time series) to produce more efficient estimates.
- Include Relevant Variables: Sometimes autocorrelation is a symptom of a missing time-related predictor (e.g., a seasonal indicator).
For Multicollinearity:
- Remove one of the highly correlated variables.
- Combine correlated variables into a single index (e.g., through PCA).
- Use regularization techniques like Ridge Regression, which adds a penalty to the model fitting process to shrink coefficients and reduce variance.
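A minimal sketch of the ridge idea using its closed form, beta = (X'X + alpha*I)^(-1) X'y, on synthetic near-collinear predictors (the penalty value is arbitrary for illustration). OLS splits the shared effect between x1 and x2 erratically; the penalty pulls the two coefficients toward a stable, roughly equal split:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ridge(X, y, alpha):
    """Closed-form ridge: beta = (X'X + alpha*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, alpha=0.0)     # plain OLS: unstable split under collinearity
beta_ridge = ridge(X, y, alpha=10.0)  # penalty shrinks and stabilizes the split
print("OLS:  ", np.round(beta_ols, 2))
print("Ridge:", np.round(beta_ridge, 2))
```

In practice you would standardize the predictors and choose the penalty by cross-validation (e.g., scikit-learn's `RidgeCV`) rather than fixing it by hand.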
Common Pitfalls
- Checking Assumptions Only Once: The model diagnostics and remedial measures themselves can affect other assumptions. For example, applying a log transformation to fix non-linearity can also fix heteroscedasticity but might introduce it elsewhere. Always re-run your diagnostics after applying a fix.
- Over-reliance on Formal Tests: Statistical tests like Breusch-Pagan or Shapiro-Wilk are sensitive to sample size. With large datasets, they can detect trivial violations that are not practically significant. Always pair formal tests with visual diagnostics (plots), which give you context about the severity and nature of the violation.
- Ignoring the Independence Assumption: Data scientists often focus on cross-sectional data assumptions but forget that rows in a dataset can be non-independent (e.g., multiple entries from the same user, geographic clustering). Failing to account for this can severely underestimate uncertainty.
- Misinterpreting VIF: A high VIF indicates that the precision of estimating a coefficient is reduced, but it does not mean the variable is unimportant or that the coefficient is biased. The model's overall predictive power might still be good. The problem arises when you need to interpret the individual contribution of each predictor.
Summary
- OLS regression requires verifying five key assumptions: linearity, independence, homoscedasticity, normality of errors, and no perfect multicollinearity. Violations can lead to biased, inefficient, or uninterpretable models.
- Diagnostics are a mix of visual and statistical tools. Residual plots are indispensable for checking linearity, independence, and constant variance. The Durbin-Watson test checks for autocorrelation, the Breusch-Pagan test for heteroscedasticity, a Q-Q plot for normality, and VIF for multicollinearity.
- Practical solutions exist for common violations. These include variable transformations (Box-Cox), using robust standard errors for heteroscedasticity, employing Generalized Least Squares for autocorrelation, and applying regularization or feature removal for multicollinearity.
- The diagnostic process is iterative. After applying a remedial measure, you must re-check all assumptions, as fixing one problem can create another.
- Never automate assumption checking. Context and visual interpretation are crucial, as formal statistical tests can be misleading with very large or very small samples.