Everything we've derived so far—the OLS estimators, their statistical properties, confidence intervals, hypothesis tests—relies on certain assumptions being true. These aren't arbitrary mathematical conveniences; they're substantive claims about how the data was generated.
Violating these assumptions doesn't necessarily make regression useless, but it changes what we can conclude. Understanding assumptions is the difference between wielding regression as a powerful inferential tool and using it as a curve-fitting exercise of unknown validity.
This page examines each assumption in depth: what it means, why it matters, how to check it, what goes wrong when it fails, and what to do about violations.
By the end of this page, you will understand each classical regression assumption (linearity, random sampling, conditional mean zero, homoscedasticity, normality), diagnose violations using residual analysis, assess the severity of different violations, and know when results remain trustworthy despite violated assumptions.
The classical linear regression model makes five key assumptions. Together, they enable the strong theoretical guarantees of OLS. We'll list them first, then examine each in depth.
The Classical Linear Model Assumptions (CLRM)
For the model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:
Linearity in Parameters: The true relationship is $y = \beta_0 + \beta_1 x + \varepsilon$
Random Sampling: The observations $(x_i, y_i)$ are a random sample from the population
Conditional Mean Zero: $E[\varepsilon | x] = 0$ (no systematic error pattern)
Homoscedasticity: $\text{Var}(\varepsilon | x) = \sigma^2$ (constant error variance)
Normality (for inference): $\varepsilon | x \sim N(0, \sigma^2)$ (errors are normally distributed)
Plus a technical condition: there must be variation in $x$ (the $x_i$ cannot all be equal), so that the denominator $\sum_i (x_i - \bar{x})^2$ in the slope formula is nonzero.
| Assumption | Enables | If Violated |
|---|---|---|
| Linearity | Model correctly captures E[y\|x] | Biased estimates, wrong predictions |
| Random sampling | Sample represents population | Selection bias, no generalization |
| E[ε\|x] = 0 | Unbiased estimators | Biased estimates |
| Homoscedasticity | Correct standard errors | Wrong inference, inefficient estimates |
| Normality | Exact t and F tests | Tests are only approximate |
Not all assumptions are equally important. Linearity and E[ε|x]=0 are critical—violations bias estimates. Homoscedasticity and normality affect inference but not necessarily bias. With large samples, normality becomes less important (Central Limit Theorem). Understanding this hierarchy helps prioritize diagnostics.
Statement: The true conditional expectation function is linear:
$$E[y | x] = \beta_0 + \beta_1 x$$
This means the relationship between the expected value of $y$ and $x$ is a straight line—not a curve, not a step function, not some other shape.
Why It Matters
If the true relationship is nonlinear (e.g., quadratic), a linear model is misspecified. The fitted line will be the "best" straight line through curved data, but its predictions will be systematically too high in some ranges of $x$ and too low in others, and the estimated slope won't describe the true relationship at any particular point.
Diagnosing Linearity Violations
Scatterplot of y vs. x: Does the data follow a straight line, or curve?
Residual plot (residuals vs. x): Residuals should scatter randomly around zero. Curved patterns indicate nonlinearity.
Residual plot (residuals vs. ŷ): Same as above, often clearer.
Added variable plots: In multiple regression (advanced).
Common Patterns Indicating Nonlinearity
```python
import numpy as np
import matplotlib.pyplot as plt

# Example: True relationship is quadratic, we fit linear
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y_true = 5 + 2*x - 0.2*x**2  # Quadratic truth
y = y_true + np.random.normal(0, 1, n)

# Fit linear model by OLS
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1 = np.sum((x - x_bar)*(y - y_bar)) / np.sum((x - x_bar)**2)
beta_0 = y_bar - beta_1 * x_bar
y_hat = beta_0 + beta_1 * x
residuals = y - y_hat

# Residual plot reveals nonlinearity
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: Scatterplot with fitted line
axes[0].scatter(x, y, alpha=0.7)
x_line = np.linspace(0, 10, 100)
axes[0].plot(x_line, beta_0 + beta_1*x_line, 'r-', linewidth=2)
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Data with Fitted Linear Model')

# Right: Residual plot
axes[1].scatter(x, residuals, alpha=0.7)
axes[1].axhline(y=0, color='r', linestyle='--')
axes[1].set_xlabel('x')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Residuals vs x (Shows Nonlinearity)')
# Note: Curved pattern in residuals indicates violated linearity

plt.tight_layout()
plt.savefig('linearity_diagnostic.png', dpi=100)
print("Saved linearity diagnostic plot")
```

Statement: The observations $(x_1, y_1), \ldots, (x_n, y_n)$ are a random sample from the population of interest.
This means the pairs are drawn independently from the same population: no observation carries information about another, and whether an observation appears in the sample does not depend on its $y$ value.
Why It Matters
Random sampling ensures that sample statistics (like $\hat{\beta}_1$) are unbiased estimates of population parameters. It also ensures that our confidence intervals have their stated coverage probability.
If sampling depends on y (or variables correlated with y), estimates will be biased. Example: Surveying only high-income households about income vs. education. Low-income responses are missing, distorting the observed relationship.
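To make the selection-bias point concrete, here is a minimal simulation (all numbers hypothetical) that fits the same regression twice: once on a full random sample and once on a subsample kept only when $y$ is above its median, mimicking a survey that reaches only high-income households.

```python
import numpy as np

# Hypothetical population: y depends linearly on x with slope 2
rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 10, n)
y = 5 + 2 * x + rng.normal(0, 5, n)

def ols_slope(x, y):
    """OLS slope of y on x."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Full random sample: the slope is close to the true value of 2
print(f"Slope, full sample:      {ols_slope(x, y):.3f}")

# Selected sample: keep only high-y observations (selection depends on y)
keep = y > np.median(y)
print(f"Slope, high-y subsample: {ols_slope(x[keep], y[keep]):.3f}  (attenuated toward 0)")
```

Because low-$x$ observations survive the cut only when they happen to have large positive errors, the selected data look flatter than the true relationship, and the slope is biased toward zero.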
Diagnosing Random Sampling Violations
This is often a design issue that cannot be detected from the data alone: ask how the data were collected, who could never have appeared in the sample, and whether inclusion could depend on the outcome or on variables related to it.
Special Case: Time Series
With time series data (observations over time), observations are typically not independent—values depend on past values (autocorrelation). This violates random sampling and requires special methods (time series regression, ARIMA, etc.).
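A quick way to spot this with time-ordered data is to examine the lag-1 autocorrelation of the residuals (the Durbin-Watson statistic is a standard variant of the same idea). A sketch on simulated AR(1) errors, with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Simulate AR(1) errors: each error depends on the previous one (rho = 0.8)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.8 * eps[t - 1] + rng.normal(0, 1)

time = np.arange(n, dtype=float)
y = 1.0 + 0.5 * time + eps

# Fit OLS of y on time and compute residuals
beta_1 = np.sum((time - time.mean()) * (y - y.mean())) / np.sum((time - time.mean()) ** 2)
beta_0 = y.mean() - beta_1 * time.mean()
resid = y - (beta_0 + beta_1 * time)

# Lag-1 autocorrelation of residuals (near 0 for independent errors)
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
# Durbin-Watson statistic: roughly 2*(1 - lag1); values well below 2 signal positive autocorrelation
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"Lag-1 residual autocorrelation: {lag1:.3f}")
print(f"Durbin-Watson statistic:        {dw:.3f}")
```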
Statement: $E[\varepsilon | x] = 0$ for all values of $x$.
This is arguably the most critical assumption. It says that, conditional on knowing $x$, the expected error is zero. On average, our predictions are neither systematically too high nor too low for any value of $x$.
Equivalent Formulations: $E[\varepsilon | x] = 0$ says the error is mean-independent of $x$; it implies both $E[\varepsilon] = 0$ and $\text{Cov}(x, \varepsilon) = 0$ (exogeneity of $x$).
Why It Matters
If $E[\varepsilon | x] \ne 0$, OLS estimates are biased. The slope $\hat{\beta}_1$ doesn't estimate the true $\beta_1$—even with infinite data. This is a fundamental failure.
Sources of Violation

Omitted Variables: A variable $z$ that affects $y$ and is correlated with $x$ is left out of the model. The resulting bias in the slope is

$$\text{Bias} = \beta_z \cdot \frac{\text{Cov}(x, z)}{\text{Var}(x)}$$

Measurement Error in $x$: If $x$ is observed with noise $\eta$, the OLS slope converges to

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_\eta^2}$$

(Attenuation bias: the slope shrinks toward zero.)
Simultaneity/Reverse Causation: $y$ affects $x$ as well as vice versa.
Model Misspecification: The true functional form is nonlinear.
If a third variable z affects y and is correlated with x, omitting z biases β̂₁. Example: Regressing wages on education without controlling for ability. Ability affects wages AND correlates with education, so the education coefficient absorbs part of ability's effect.
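The simulation below is a sketch with synthetic data and made-up coefficients; it checks the two bias formulas above by first omitting a confounder $z$ and then by adding measurement noise $\eta$ to $x$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000  # large n so sampling noise is negligible

def ols_slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# --- Omitted variable bias ---
# z is correlated with x and also affects y (beta_z = 3)
z = rng.normal(0, 1, n)
x = 0.5 * z + rng.normal(0, 1, n)                    # Cov(x, z) = 0.5, Var(x) = 1.25
y = 2.0 + 1.0 * x + 3.0 * z + rng.normal(0, 1, n)    # true beta_1 = 1

slope_short = ols_slope(x, y)                        # regression that omits z
predicted_bias = 3.0 * np.cov(x, z)[0, 1] / np.var(x)
print(f"Slope omitting z:        {slope_short:.3f}")
print(f"beta_1 + predicted bias: {1.0 + predicted_bias:.3f}")

# --- Attenuation bias from measurement error in x ---
x_clean = rng.normal(0, 2, n)                        # sigma_x^2 = 4
y2 = 1.0 + 2.0 * x_clean + rng.normal(0, 1, n)       # true beta_1 = 2
x_noisy = x_clean + rng.normal(0, 1, n)              # sigma_eta^2 = 1

slope_noisy = ols_slope(x_noisy, y2)
attenuation = 4.0 / (4.0 + 1.0)                      # sigma_x^2 / (sigma_x^2 + sigma_eta^2)
print(f"Slope with noisy x:      {slope_noisy:.3f}")
print(f"Attenuation prediction:  {2.0 * attenuation:.3f}")
```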
Diagnosing E[ε|x] = 0 Violations
This is challenging because we never observe $\varepsilon$, only the residuals $e$, and OLS forces the residuals to be uncorrelated with $x$ by construction, so the violation leaves no direct trace in a residual plot. The real diagnostic is reasoning about the data-generating process: could a confounder, measurement error, or reverse causation connect $x$ to the error?
Key Point: We can never prove E[ε|x] = 0 from data alone. We can only check for obvious violations and think carefully about the data generating process.
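One reason this is so hard: OLS makes the residuals uncorrelated with $x$ by construction, even when the true errors are strongly correlated with $x$. A minimal illustration on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Errors deliberately correlated with x (violates E[eps | x] = 0)
x = rng.normal(0, 1, n)
eps = 1.5 * x + rng.normal(0, 1, n)
y = 2.0 + 1.0 * x + eps          # true beta_1 = 1

beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
resid = y - (beta_0 + beta_1 * x)

print(f"Estimated slope:      {beta_1:.3f}  (true value is 1; bias of about 1.5)")
print(f"Corr(x, true errors): {np.corrcoef(x, eps)[0, 1]:.3f}")
print(f"Corr(x, residuals):   {np.corrcoef(x, resid)[0, 1]:.2e}  # zero by construction")
```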
Statement: $\text{Var}(\varepsilon | x) = \sigma^2$ for all $x$.
The error variance is constant across all values of $x$. "Homo" = same; "scedasticity" = spread. When this fails, we have heteroscedasticity (different spread).
Why It Matters
Homoscedasticity affects inference, not unbiasedness: OLS estimates remain unbiased, but the usual standard-error formula is wrong, so confidence intervals and t-tests are unreliable, and OLS is no longer the most efficient linear unbiased estimator.
Common Heteroscedasticity Patterns: the error spread often grows with $x$ or with the fitted value $\hat{y}$ (a fan or funnel shape in residual plots), or differs across identifiable groups of observations.
Diagnosing Heteroscedasticity
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Example: Variance increases with x (heteroscedastic data)
np.random.seed(42)
n = 200
x = np.random.uniform(1, 10, n)
# Error variance proportional to x
epsilon = np.random.normal(0, 1, n) * np.sqrt(x)  # SD = sqrt(x)
y = 5 + 2*x + epsilon

# Fit OLS
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1 = np.sum((x - x_bar)*(y - y_bar)) / np.sum((x - x_bar)**2)
beta_0 = y_bar - beta_1 * x_bar
y_hat = beta_0 + beta_1 * x
residuals = y - y_hat

# Residual plot shows fanning
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Residuals vs fitted values
axes[0].scatter(y_hat, residuals, alpha=0.6)
axes[0].axhline(y=0, color='r', linestyle='--')
axes[0].set_xlabel('Fitted values (ŷ)')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted (Fan Pattern → Heteroscedasticity)')

# Squared residuals vs x (Breusch-Pagan idea)
axes[1].scatter(x, residuals**2, alpha=0.6)
axes[1].set_xlabel('x')
axes[1].set_ylabel('Squared Residuals')
axes[1].set_title('Squared Residuals vs x')

# Simple Breusch-Pagan test statistic
slope_bp, intercept_bp, r_value, p_value, std_err = stats.linregress(x, residuals**2)
print(f"Breusch-Pagan slope: {slope_bp:.4f}")
print(f"If significantly different from 0, heteroscedasticity is present")
print(f"p-value: {p_value:.4f}")

plt.tight_layout()
plt.savefig('heteroscedasticity_diagnostic.png', dpi=100)
```

Statement: $\varepsilon | x \sim N(0, \sigma^2)$
Conditional on $x$, the errors follow a normal (Gaussian) distribution with mean 0 and variance $\sigma^2$.
Why It Matters
Normality enables exact statistical inference in finite samples: standardized coefficients follow exact $t$ distributions, the overall test follows an exact $F$ distribution, and confidence intervals achieve exactly their stated coverage.
Normality is the least critical assumption for large samples. By the Central Limit Theorem, β̂ is approximately normal regardless of error distribution when n is large. Tests become approximately valid even without normality. For small samples (n < 30), normality matters more.
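A small Monte Carlo sketch (hypothetical setup) of this point: even with strongly skewed errors, the sampling distribution of $\hat{\beta}_1$ is nearly symmetric and bell-shaped at moderate sample sizes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, n_sims = 200, 5_000
x = rng.uniform(0, 10, n)          # one fixed design, reused across simulations

slopes = np.empty(n_sims)
for s in range(n_sims):
    # Strongly right-skewed errors (exponential, centered to mean zero)
    eps = rng.exponential(scale=2.0, size=n) - 2.0
    y = 1.0 + 0.5 * x + eps
    slopes[s] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# The sampling distribution of the slope is close to symmetric despite the skewed errors
print(f"Mean of simulated slopes:     {slopes.mean():.3f}  (true slope = 0.5)")
print(f"Skewness of simulated slopes: {stats.skew(slopes):.3f}  (the errors have skewness 2)")
```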
Diagnosing Normality Violations
Histogram of residuals: Should look roughly bell-shaped
Q-Q plot (quantile-quantile): Residuals vs. theoretical normal quantiles. Should be roughly linear.
Shapiro-Wilk test: Formal test of normality (H₀: data is normal)
Jarque-Bera test: Tests skewness and kurtosis
Common Departures: heavy tails (more extreme residuals than a normal distribution allows), skewness, and a handful of large outliers.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate residuals from a fitted model (using earlier data)
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)
y = 5 + 2*x + np.random.normal(0, 2, n)

# Fit and get residuals
x_bar, y_bar = np.mean(x), np.mean(y)
beta_1 = np.sum((x - x_bar)*(y - y_bar)) / np.sum((x - x_bar)**2)
beta_0 = y_bar - beta_1 * x_bar
residuals = y - (beta_0 + beta_1*x)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram
axes[0].hist(residuals, bins=20, density=True, alpha=0.7, edgecolor='black')
# Overlay normal curve
x_norm = np.linspace(residuals.min(), residuals.max(), 100)
axes[0].plot(x_norm, stats.norm.pdf(x_norm, 0, residuals.std()), 'r-', linewidth=2)
axes[0].set_xlabel('Residuals')
axes[0].set_ylabel('Density')
axes[0].set_title('Histogram of Residuals with Normal Overlay')

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot (Points Should Follow Line)')

# Shapiro-Wilk test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk test: statistic = {stat:.4f}, p-value = {p_value:.4f}")
print(f"If p > 0.05, fail to reject normality (data is consistent with normal)")

plt.tight_layout()
plt.savefig('normality_diagnostic.png', dpi=100)
```

Remedies for Non-Normality

With large samples, the usual remedy is simply to rely on the Central Limit Theorem and proceed; otherwise, transforming $y$ (for example, taking logs of a right-skewed outcome) or using bootstrap-based inference are common options.
Not all assumption violations are equally serious. Here's a practical ranking:
| Priority | Assumption | If Violated | Consequence | Remedy Difficulty |
|---|---|---|---|---|
| 🔴 Critical | Linearity | Model is wrong | Biased estimates | Medium (transforms) |
| 🔴 Critical | E[ε\|x] = 0 | Endogeneity | Biased estimates | Hard (IV, experiments) |
| 🟡 Moderate | Random Sampling | Sample not representative | No generalization | Design issue |
| 🟢 Minor | Homoscedasticity | Wrong SEs | Invalid inference (fixable) | Easy (robust SEs) |
| 🟢 Minor | Normality | Approximate tests | OK for large n | Often unnecessary |
In practice: (1) Always plot residuals vs. x and residuals vs. ŷ—catches linearity and heteroscedasticity. (2) Think hard about omitted variables—this is usually the biggest real-world threat. (3) Use robust standard errors by default unless you're sure of homoscedasticity. (4) Don't obsess over normality with n > 50.
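As a sketch of point (3), heteroscedasticity-robust (HC1-style) standard errors for the slope can be computed by hand; the data here are synthetic, with error spread that grows with $x$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1, 10, n)
y = 5 + 2 * x + rng.normal(0, 1, n) * x      # error spread grows with x

x_c = x - x.mean()
beta_1 = np.sum(x_c * (y - y.mean())) / np.sum(x_c ** 2)
beta_0 = y.mean() - beta_1 * x.mean()
resid = y - (beta_0 + beta_1 * x)

# Classical SE: assumes one common error variance sigma^2
sigma2_hat = np.sum(resid ** 2) / (n - 2)
se_classical = np.sqrt(sigma2_hat / np.sum(x_c ** 2))

# HC1 robust SE: lets each observation contribute its own squared residual
se_robust = np.sqrt(n / (n - 2) * np.sum(x_c ** 2 * resid ** 2) / np.sum(x_c ** 2) ** 2)

print(f"Classical SE(beta_1): {se_classical:.4f}")
print(f"Robust SE(beta_1):    {se_robust:.4f}")
# Under heteroscedasticity, only the robust version is (asymptotically) valid
```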
The Gauss-Markov Theorem
Under assumptions 1-4 (but NOT requiring normality):
OLS is BLUE (Best Linear Unbiased Estimator)
This is a remarkable result: among all estimators that are linear and unbiased, OLS has the smallest variance. Adding normality (Assumption 5) makes OLS the best among all unbiased estimators, not just linear ones.
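A small Monte Carlo sketch of the "smallest variance" claim, comparing OLS to another linear unbiased estimator of the slope: the grouping (Wald) estimator, which splits the sample at the median of $x$ and connects the two group means. Both are unbiased here; Gauss-Markov predicts OLS has the smaller variance.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_sims = 50, 10_000
x = rng.uniform(0, 10, n)          # one fixed design
high = x > np.median(x)            # split used by the grouping estimator

ols_est, wald_est = np.empty(n_sims), np.empty(n_sims)
for s in range(n_sims):
    y = 1.0 + 2.0 * x + rng.normal(0, 3, n)   # true slope = 2, homoscedastic errors
    # OLS slope
    ols_est[s] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Grouping (Wald) estimator: also linear in y and unbiased, but not OLS
    wald_est[s] = (y[high].mean() - y[~high].mean()) / (x[high].mean() - x[~high].mean())

print(f"OLS:      mean = {ols_est.mean():.3f}, variance = {ols_est.var():.4f}")
print(f"Grouping: mean = {wald_est.mean():.3f}, variance = {wald_est.var():.4f}")
# Both means are near 2 (unbiased); Gauss-Markov says OLS has the smaller variance
```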
The assumptions underlying linear regression are the foundation for valid inference. Let's consolidate: linearity and $E[\varepsilon|x] = 0$ protect the estimates themselves from bias, random sampling protects generalization to the population, and homoscedasticity and normality protect the standard errors and tests built on those estimates.
Module Complete: Simple Linear Regression
Congratulations! You've completed the first module on linear regression, covering the OLS estimators and their derivation, interpretation of the coefficients, the estimators' statistical properties, confidence intervals and hypothesis tests, and the classical assumptions with their diagnostics.
This foundation prepares you for Module 2: Multiple Linear Regression, where we extend to multiple predictors, introducing new challenges and richer interpretations.
You now possess a deep understanding of simple linear regression—not just the formulas, but the reasoning behind them. You can derive OLS, interpret coefficients, check assumptions, and communicate results responsibly. You've built the conceptual foundation for all supervised learning that follows.