Everything we've derived so far—the OLS estimators, their statistical properties, confidence intervals, hypothesis tests—relies on certain assumptions being true. These aren't arbitrary mathematical conveniences; they're substantive claims about how the data was generated.
Violating these assumptions doesn't necessarily make regression useless, but it changes what we can conclude. Understanding assumptions is the difference between wielding regression as a powerful inferential tool and using it as a curve-fitting exercise of unknown validity.
This page examines each assumption in depth: what it means, why it matters, how to check it, what goes wrong when it fails, and what to do about violations.
By the end of this page, you will understand each classical regression assumption (linearity, random sampling, conditional mean zero, homoscedasticity, normality), diagnose violations using residual analysis, assess the severity of different violations, and know when results remain trustworthy despite violated assumptions.
The classical linear regression model makes five key assumptions. Together, they enable the strong theoretical guarantees of OLS. We'll list them first, then examine each in depth.
The Classical Linear Model Assumptions (CLRM)
For the model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:
Linearity in Parameters: The true relationship is $y = \beta_0 + \beta_1 x + \varepsilon$
Random Sampling: The observations $(x_i, y_i)$ are a random sample from the population
Conditional Mean Zero: $E[\varepsilon | x] = 0$ (no systematic error pattern)
Homoscedasticity: $\text{Var}(\varepsilon | x) = \sigma^2$ (constant error variance)
Normality (for inference): $\varepsilon | x \sim N(0, \sigma^2)$ (errors are normally distributed)
Plus a technical condition: there must be variation in $x$ (the $x_i$ cannot all be equal), so that the denominator $\sum_i (x_i - \bar{x})^2$ in the slope formula is nonzero.
| Assumption | Enables | If Violated |
|---|---|---|
| Linearity | Model correctly captures E[y\|x] | Biased estimates, wrong predictions |
| Random sampling | Sample represents population | Selection bias, no generalization |
| E[ε\|x] = 0 | Unbiased estimators | Biased estimates |
| Homoscedasticity | Correct standard errors | Wrong inference, inefficient estimates |
| Normality | Exact t and F tests | Tests are only approximate |
Not all assumptions are equally important. Linearity and E[ε|x]=0 are critical—violations bias estimates. Homoscedasticity and normality affect inference but not necessarily bias. With large samples, normality becomes less important (Central Limit Theorem). Understanding this hierarchy helps prioritize diagnostics.
Statement: The true conditional expectation function is linear:
$$E[y | x] = \beta_0 + \beta_1 x$$
This means the relationship between the expected value of $y$ and $x$ is a straight line—not a curve, not a step function, not some other shape.
Why It Matters
If the true relationship is nonlinear (e.g., quadratic), a linear model is misspecified. The fitted line will be the "best" straight line through curved data, but its predictions will be systematically too high in some ranges of $x$ and too low in others, and the estimated slope won't describe the true relationship at any particular point.
Diagnosing Linearity Violations
Scatterplot of y vs. x: Does the data follow a straight line, or curve?
Residual plot (residuals vs. x): Residuals should scatter randomly around zero. Curved patterns indicate nonlinearity.
Residual plot (residuals vs. ŷ): Same as above, often clearer.
Added variable plots: In multiple regression (advanced).
Common Patterns Indicating Nonlinearity
```python
import numpy as np
import matplotlib.pyplot as plt

# Example: True relationship is quadratic, we fit linear
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y_true = 5 + 2*x - 0.2*x**2  # Quadratic truth
y = y_true + np.random.normal(0, 1, n)

# Fit linear model by OLS
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1 = np.sum((x - x_bar)*(y - y_bar)) / np.sum((x - x_bar)**2)
beta_0 = y_bar - beta_1 * x_bar
y_hat = beta_0 + beta_1 * x
residuals = y - y_hat

# Residual plot reveals nonlinearity
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Scatterplot with fitted line
axes[0].scatter(x, y, alpha=0.7)
x_line = np.linspace(0, 10, 100)
axes[0].plot(x_line, beta_0 + beta_1*x_line, 'r-', linewidth=2)
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Data with Fitted Linear Model')

# Right: Residual plot
axes[1].scatter(x, residuals, alpha=0.7)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs x (Shows Nonlinearity)')
# Note: Curved pattern in residuals indicates violated linearity

plt.tight_layout()
plt.savefig('linearity_diagnostic.png', dpi=100)
print("Saved linearity diagnostic plot")
```

Statement: The observations $(x_1, y_1), \ldots, (x_n, y_n)$ are a random sample from the population of interest.
This means the pairs are drawn independently from the same population: no observation carries information about another, and whether an observation appears in the sample does not depend on its $y$ value.
Why It Matters
Random sampling ensures that sample statistics (like $\hat{\beta}_1$) are unbiased estimates of population parameters. It also ensures that our confidence intervals have their stated coverage probability.
If sampling depends on y (or variables correlated with y), estimates will be biased. Example: Surveying only high-income households about income vs. education. Low-income responses are missing, distorting the observed relationship.
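To make the selection-bias point concrete, here is a minimal simulation (all numbers hypothetical) that fits the same regression twice: once on a full random sample and once on a subsample kept only when $y$ is above its median, mimicking a survey that reaches only high-income households.

```python
import numpy as np

# Hypothetical population: y depends linearly on x with slope 2
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 10, n)
y = 5 + 2 * x + rng.normal(0, 5, n)

def ols_slope(x, y):
    """OLS slope of y on x."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Full random sample: the slope is close to the true value of 2
print(f"Slope, full sample:      {ols_slope(x, y):.3f}")

# Selected sample: keep only high-y observations (selection depends on y)
keep = y > np.median(y)
print(f"Slope, high-y subsample: {ols_slope(x[keep], y[keep]):.3f}  (attenuated toward 0)")
```

Because low-$x$ observations survive the cut only when they happen to have large positive errors, the selected data look flatter than the true relationship, and the slope is biased toward zero.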
Diagnosing Random Sampling Violations
This is often a design issue that cannot be detected from the data alone: ask how the data were collected, who could never have appeared in the sample, and whether inclusion could depend on the outcome or on variables related to it.
Special Case: Time Series
With time series data (observations over time), observations are typically not independent—values depend on past values (autocorrelation). This violates random sampling and requires special methods (time series regression, ARIMA, etc.).
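A quick way to spot this with time-ordered data is to examine the lag-1 autocorrelation of the residuals (the Durbin-Watson statistic is a standard variant of the same idea). A sketch on simulated AR(1) errors, with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulate AR(1) errors: each error depends on the previous one (rho = 0.8)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.8 * eps[t - 1] + rng.normal(0, 1)

time = np.arange(n, dtype=float)
y = 1.0 + 0.5 * time + eps

# Fit OLS of y on time and compute residuals
beta_1 = np.sum((time - time.mean()) * (y - y.mean())) / np.sum((time - time.mean()) ** 2)
beta_0 = y.mean() - beta_1 * time.mean()
resid = y - (beta_0 + beta_1 * time)

# Lag-1 autocorrelation of residuals (near 0 for independent errors)
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
# Durbin-Watson statistic: roughly 2*(1 - lag1); values well below 2 signal positive autocorrelation
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"Lag-1 residual autocorrelation: {lag1:.3f}")
print(f"Durbin-Watson statistic:        {dw:.3f}")
```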
Statement: $E[\varepsilon | x] = 0$ for all values of $x$.
This is arguably the most critical assumption. It says that, conditional on knowing $x$, the expected error is zero. On average, our predictions are neither systematically too high nor too low for any value of $x$.
Equivalent Formulations: $E[\varepsilon | x] = 0$ says the error is mean-independent of $x$; it implies both $E[\varepsilon] = 0$ and $\text{Cov}(x, \varepsilon) = 0$ (exogeneity of $x$).
Why It Matters
If $E[\varepsilon | x] \ne 0$, OLS estimates are biased. The slope $\hat{\beta}_1$ doesn't estimate the true $\beta_1$—even with infinite data. This is a fundamental failure.
Sources of Violation

Omitted Variables: A variable $z$ that affects $y$ and is correlated with $x$ is left out of the model. The resulting bias in the slope is

$$\text{Bias} = \beta_z \cdot \frac{\text{Cov}(x, z)}{\text{Var}(x)}$$

Measurement Error in $x$: If $x$ is observed with noise $\eta$, the OLS slope converges to

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_\eta^2}$$

(Attenuation bias: the slope shrinks toward zero.)
Simultaneity/Reverse Causation: $y$ affects $x$ as well as vice versa.
Model Misspecification: The true functional form is nonlinear.
If a third variable z affects y and is correlated with x, omitting z biases β̂₁. Example: Regressing wages on education without controlling for ability. Ability affects wages AND correlates with education, so the education coefficient absorbs part of ability's effect.
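The simulation below is a sketch with synthetic data and made-up coefficients; it checks the two bias formulas above by first omitting a confounder $z$ and then by adding measurement noise $\eta$ to $x$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # large n so sampling noise is negligible

def ols_slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# --- Omitted variable bias ---
# z is correlated with x and also affects y (beta_z = 3)
z = rng.normal(0, 1, n)
x = 0.5 * z + rng.normal(0, 1, n)                    # Cov(x, z) = 0.5, Var(x) = 1.25
y = 2.0 + 1.0 * x + 3.0 * z + rng.normal(0, 1, n)    # true beta_1 = 1

slope_short = ols_slope(x, y)                        # regression that omits z
predicted_bias = 3.0 * np.cov(x, z)[0, 1] / np.var(x)
print(f"Slope omitting z:        {slope_short:.3f}")
print(f"beta_1 + predicted bias: {1.0 + predicted_bias:.3f}")

# --- Attenuation bias from measurement error in x ---
x_clean = rng.normal(0, 2, n)                        # sigma_x^2 = 4
y2 = 1.0 + 2.0 * x_clean + rng.normal(0, 1, n)       # true beta_1 = 2
x_noisy = x_clean + rng.normal(0, 1, n)              # sigma_eta^2 = 1

slope_noisy = ols_slope(x_noisy, y2)
attenuation = 4.0 / (4.0 + 1.0)                      # sigma_x^2 / (sigma_x^2 + sigma_eta^2)
print(f"Slope with noisy x:      {slope_noisy:.3f}")
print(f"Attenuation prediction:  {2.0 * attenuation:.3f}")
```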
Diagnosing E[ε|x] = 0 Violations
This is challenging because we never observe $\varepsilon$, only the residuals $e$, and OLS forces the residuals to be uncorrelated with $x$ by construction, so the violation leaves no direct trace in a residual plot. The real diagnostic is reasoning about the data-generating process: could a confounder, measurement error, or reverse causation connect $x$ to the error?
Key Point: We can never prove E[ε|x] = 0 from data alone. We can only check for obvious violations and think carefully about the data generating process.
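One reason this is so hard: OLS makes the residuals uncorrelated with $x$ by construction, even when the true errors are strongly correlated with $x$. A minimal illustration on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Errors deliberately correlated with x (violates E[eps | x] = 0)
x = rng.normal(0, 1, n)
eps = 1.5 * x + rng.normal(0, 1, n)
y = 2.0 + 1.0 * x + eps          # true beta_1 = 1

beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
resid = y - (beta_0 + beta_1 * x)

print(f"Estimated slope:      {beta_1:.3f}  (true value is 1; bias of about 1.5)")
print(f"Corr(x, true errors): {np.corrcoef(x, eps)[0, 1]:.3f}")
print(f"Corr(x, residuals):   {np.corrcoef(x, resid)[0, 1]:.2e}  # zero by construction")
```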
Statement: $\text{Var}(\varepsilon | x) = \sigma^2$ for all $x$.
The error variance is constant across all values of $x$. "Homo" = same; "scedasticity" = spread. When this fails, we have heteroscedasticity (different spread).
Why It Matters
Homoscedasticity affects inference, not unbiasedness: OLS estimates remain unbiased, but the usual standard-error formula is wrong, so confidence intervals and t-tests are unreliable, and OLS is no longer the most efficient linear unbiased estimator.
Common Heteroscedasticity Patterns: the error spread often grows with $x$ or with the fitted value $\hat{y}$ (a fan or funnel shape in residual plots), or differs across identifiable groups of observations.
Diagnosing Heteroscedasticity
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Example: Variance increases with x (heteroscedastic data)
np.random.seed(42)
n = 200
x = np.random.uniform(1, 10, n)
# Error variance proportional to x
epsilon = np.random.normal(0, 1, n) * np.sqrt(x)  # SD = sqrt(x)
y = 5 + 2*x + epsilon

# Fit OLS
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1 = np.sum((x - x_bar)*(y - y_bar)) / np.sum((x - x_bar)**2)
beta_0 = y_bar - beta_1 * x_bar
y_hat = beta_0 + beta_1 * x
residuals = y - y_hat

# Residual plot shows fanning
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals vs fitted values
axes[0].scatter(y_hat, residuals, alpha=0.6)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Fitted values (ŷ)')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted (Fan Pattern → Heteroscedasticity)')

# Squared residuals vs x (Breusch-Pagan idea)
axes[1].scatter(x, residuals**2, alpha=0.6)
axes[1].set_xlabel('x')
axes[1].set_ylabel('Squared Residuals')
axes[1].set_title('Squared Residuals vs x')

# Simple Breusch-Pagan test statistic
slope_bp, intercept_bp, r_value, p_value, std_err = stats.linregress(x, residuals**2)
print(f"Breusch-Pagan slope: {slope_bp:.4f}")
print(f"If significantly different from 0, heteroscedasticity is present")
print(f"p-value: {p_value:.4f}")

plt.tight_layout()
plt.savefig('heteroscedasticity_diagnostic.png', dpi=100)
```

Statement: $\varepsilon | x \sim N(0, \sigma^2)$
Conditional on $x$, the errors follow a normal (Gaussian) distribution with mean 0 and variance $\sigma^2$.
Why It Matters
Normality enables exact statistical inference in finite samples: standardized coefficients follow exact $t$ distributions, the overall test follows an exact $F$ distribution, and confidence intervals achieve exactly their stated coverage.
Normality is the least critical assumption for large samples. By the Central Limit Theorem, β̂ is approximately normal regardless of error distribution when n is large. Tests become approximately valid even without normality. For small samples (n < 30), normality matters more.
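A small Monte Carlo sketch (hypothetical setup) of this point: even with strongly skewed errors, the sampling distribution of $\hat{\beta}_1$ is nearly symmetric and bell-shaped at moderate sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, n_sims = 200, 5_000
x = rng.uniform(0, 10, n)          # one fixed design, reused across simulations

slopes = np.empty(n_sims)
for s in range(n_sims):
    # Strongly right-skewed errors (exponential, centered to mean zero)
    eps = rng.exponential(scale=2.0, size=n) - 2.0
    y = 1.0 + 0.5 * x + eps
    slopes[s] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# The sampling distribution of the slope is close to symmetric despite the skewed errors
print(f"Mean of simulated slopes:     {slopes.mean():.3f}  (true slope = 0.5)")
print(f"Skewness of simulated slopes: {stats.skew(slopes):.3f}  (the errors have skewness 2)")
```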
Diagnosing Normality Violations
Histogram of residuals: Should look roughly bell-shaped
Q-Q plot (quantile-quantile): Residuals vs. theoretical normal quantiles. Should be roughly linear.
Shapiro-Wilk test: Formal test of normality (H₀: data is normal)
Jarque-Bera test: Tests skewness and kurtosis
Common Departures: heavy tails (more extreme residuals than a normal distribution allows), skewness, and a handful of large outliers.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate residuals from a fitted model (using earlier data)
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y = 5 + 2*x + np.random.normal(0, 2, n)

# Fit and get residuals
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1 = np.sum((x - x_bar)*(y - y_bar)) / np.sum((x - x_bar)**2)
beta_0 = y_bar - beta_1 * x_bar
residuals = y - (beta_0 + beta_1*x)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram
axes[0].hist(residuals, bins=20, density=True, alpha=0.7, edgecolor='black')
# Overlay normal curve
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
axes[0].plot(x_norm, stats.norm.pdf(x_norm, 0, residuals.std()), 'r-', linewidth=2)
axes[0].set_xlabel('Residuals')
axes[0].set_ylabel('Density')
axes[0].set_title('Histogram of Residuals with Normal Overlay')

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot (Points Should Follow Line)')

# Shapiro-Wilk test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk test: statistic = {stat:.4f}, p-value = {p_value:.4f}")
print(f"If p > 0.05, fail to reject normality (data is consistent with normal)")

plt.tight_layout()
plt.savefig('normality_diagnostic.png', dpi=100)
```

Remedies for Non-Normality

With large samples, the usual remedy is simply to rely on the Central Limit Theorem and proceed; otherwise, transforming $y$ (for example, taking logs of a right-skewed outcome) or using bootstrap-based inference are common options.
Not all assumption violations are equally serious. Here's a practical ranking:
| Priority | Assumption | If Violated | Consequence | Remedy Difficulty |
|---|---|---|---|---|
| 🔴 Critical | Linearity | Model is wrong | Biased estimates | Medium (transforms) |
| 🔴 Critical | E[ε\|x] = 0 | Endogeneity | Biased estimates | Hard (IV, experiments) |
| 🟡 Moderate | Random Sampling | Sample not representative | No generalization | Design issue |
| 🟢 Minor | Homoscedasticity | Wrong SEs | Invalid inference (fixable) | Easy (robust SEs) |
| 🟢 Minor | Normality | Approximate tests | OK for large n | Often unnecessary |
In practice: (1) Always plot residuals vs. x and residuals vs. ŷ—catches linearity and heteroscedasticity. (2) Think hard about omitted variables—this is usually the biggest real-world threat. (3) Use robust standard errors by default unless you're sure of homoscedasticity. (4) Don't obsess over normality with n > 50.
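As a sketch of point (3), heteroscedasticity-robust (HC1-style) standard errors for the slope can be computed by hand; the data here are synthetic, with error spread that grows with $x$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 10, n)
y = 5 + 2 * x + rng.normal(0, 1, n) * x      # error spread grows with x

x_c = x - x.mean()
beta_1 = np.sum(x_c * (y - y.mean())) / np.sum(x_c ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
resid = y - (beta_0 + beta_1 * x)

# Classical SE: assumes one common error variance sigma^2
sigma2_hat = np.sum(resid ** 2) / (n - 2)
se_classical = np.sqrt(sigma2_hat / np.sum(x_c ** 2))

# HC1 robust SE: lets each observation contribute its own squared residual
se_robust = np.sqrt(n / (n - 2) * np.sum(x_c ** 2 * resid ** 2) / np.sum(x_c ** 2) ** 2)

print(f"Classical SE(beta_1): {se_classical:.4f}")
print(f"Robust SE(beta_1):    {se_robust:.4f}")
# Under heteroscedasticity, only the robust version is (asymptotically) valid
```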
The Gauss-Markov Theorem
Under assumptions 1-4 (but NOT requiring normality):
OLS is BLUE (Best Linear Unbiased Estimator)
This is a remarkable result: among all estimators that are linear and unbiased, OLS has the smallest variance. Adding normality (Assumption 5) makes OLS the best among all unbiased estimators, not just linear ones.
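A small Monte Carlo sketch of the "smallest variance" claim, comparing OLS to another linear unbiased estimator of the slope: the grouping (Wald) estimator, which splits the sample at the median of $x$ and connects the two group means. Both are unbiased here; Gauss-Markov predicts OLS has the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_sims = 50, 10_000
x = rng.uniform(0, 10, n)          # one fixed design
high = x > np.median(x)            # split used by the grouping estimator

ols_est, wald_est = np.empty(n_sims), np.empty(n_sims)
for s in range(n_sims):
    y = 1.0 + 2.0 * x + rng.normal(0, 3, n)   # true slope = 2, homoscedastic errors
    # OLS slope
    ols_est[s] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Grouping (Wald) estimator: also linear in y and unbiased, but not OLS
    wald_est[s] = (y[high].mean() - y[~high].mean()) / (x[high].mean() - x[~high].mean())

print(f"OLS:      mean = {ols_est.mean():.3f}, variance = {ols_est.var():.4f}")
print(f"Grouping: mean = {wald_est.mean():.3f}, variance = {wald_est.var():.4f}")
# Both means are near 2 (unbiased); Gauss-Markov says OLS has the smaller variance
```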
The assumptions underlying linear regression are the foundation for valid inference. Let's consolidate: linearity and $E[\varepsilon|x] = 0$ protect the estimates themselves from bias, random sampling protects generalization to the population, and homoscedasticity and normality protect the standard errors and tests built on those estimates.
Module Complete: Simple Linear Regression
Congratulations! You've completed the first module on linear regression, covering the OLS estimators and their derivation, interpretation of the coefficients, the estimators' statistical properties, confidence intervals and hypothesis tests, and the classical assumptions with their diagnostics.
This foundation prepares you for Module 2: Multiple Linear Regression, where we extend to multiple predictors, introducing new challenges and richer interpretations.
You now possess a deep understanding of simple linear regression—not just the formulas, but the reasoning behind them. You can derive OLS, interpret coefficients, check assumptions, and communicate results responsibly. You've built the conceptual foundation for all supervised learning that follows.