Having derived the OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, we now face deeper questions: How good is this estimator?
An estimator isn't just a formula—it's a random variable because it depends on random data. Different samples yield different estimates. To evaluate an estimator, we must understand its statistical properties: Is it centered on the true parameter? How much does it vary from sample to sample? Could any other estimator do better?
These questions lead to the Gauss-Markov theorem, one of the most celebrated results in statistical theory, which establishes OLS as the Best Linear Unbiased Estimator (BLUE).
By the end of this page, you will prove that OLS is unbiased under standard assumptions, derive its variance-covariance matrix, understand the Gauss-Markov theorem, and recognize when OLS achieves optimality.
All statistical properties of OLS depend on specific assumptions about the data-generating process. Let's state them precisely.
The Linear Model Assumptions:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where:

- $\mathbf{y}$ is the $n \times 1$ response vector, $\mathbf{X}$ the $n \times (p+1)$ design matrix, $\boldsymbol{\beta}$ the $(p+1) \times 1$ coefficient vector, and $\boldsymbol{\epsilon}$ the $n \times 1$ error vector.

The four assumptions are:

1. **Linearity:** the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ is correctly specified.
2. **Full rank:** $\mathbf{X}$ has full column rank, so $\mathbf{X}^\top\mathbf{X}$ is invertible.
3. **Exogeneity:** $\mathbb{E}[\boldsymbol{\epsilon} | \mathbf{X}] = \mathbf{0}$.
4. **Spherical errors:** $\text{Var}(\boldsymbol{\epsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n$.
The Gauss-Markov theorem does NOT require normally distributed errors. It only requires zero mean and constant variance. Normality is needed for exact inference (t-tests, F-tests) but not for BLUE properties. We'll add normality later for hypothesis testing.
Understanding spherical errors:
The assumption $\text{Var}(\boldsymbol{\epsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n$ packs two conditions: homoscedasticity (every error has the same variance $\sigma^2$, the constant diagonal) and no autocorrelation (distinct errors are uncorrelated, the zero off-diagonals).
The covariance matrix of $\boldsymbol{\epsilon}$ is: $$\text{Var}(\boldsymbol{\epsilon}) = \sigma^2 \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \sigma^2 \mathbf{I}_n$$
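A small simulation can make this concrete: drawing many i.i.d. error vectors and estimating their covariance should recover $\sigma^2 \mathbf{I}$. This is a sketch with arbitrary $\sigma$ and dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.5, 4
# 200,000 independent draws of a length-4 error vector with i.i.d. N(0, sigma^2) entries
eps = rng.normal(scale=sigma, size=(200_000, n))
emp_cov = np.cov(eps, rowvar=False)
print(np.round(emp_cov, 2))  # close to sigma^2 * I = 2.25 * I
```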
Definition: An estimator $\hat{\boldsymbol{\beta}}$ is unbiased if its expected value equals the true parameter: $$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$$
Theorem: Under assumptions 1-3 (linearity, full rank, exogeneity), the OLS estimator is unbiased.
Proof:
Starting with the OLS formula: $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
Substitute $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$:
$$\begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon} \\ &= \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon} \\ &= \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon} \end{aligned}$$
Taking expectations conditional on $\mathbf{X}$: $$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbb{E}[\boldsymbol{\epsilon} | \mathbf{X}]$$
By assumption 3, $\mathbb{E}[\boldsymbol{\epsilon} | \mathbf{X}] = \mathbf{0}$, so: $$\boxed{\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}}$$
On average, OLS gets the right answer. If we could repeat our experiment infinitely many times with different noise realizations, the average of our estimates would converge to the true β. This is a finite-sample property—it holds for any sample size n.
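This averaging interpretation is easy to check by simulation: fix a design matrix, redraw the noise many times, and average the resulting OLS estimates. A sketch with synthetic data (seed, sample size, and noise scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design: intercept + one feature
beta_true = np.array([1.0, 2.0])

# Refit OLS on 20,000 independent noise realizations and average the estimates
noise = rng.normal(scale=1.5, size=(n, 20_000))        # E[eps | X] = 0
Y = (X @ beta_true)[:, None] + noise
estimates = np.linalg.solve(X.T @ X, X.T @ Y)          # one beta-hat per column
print(estimates.mean(axis=1))                          # close to [1.0, 2.0]
```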
Knowing that OLS is unbiased tells us where it's centered, but not how much it fluctuates. The variance-covariance matrix captures this uncertainty.
Theorem: Under assumptions 1-4, the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ is: $$\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}$$
Proof:
From the unbiasedness proof, we established: $$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon}$$
The variance-covariance matrix is: $$\begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \mathbb{E}[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top | \mathbf{X}] \\ &= \mathbb{E}[(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} | \mathbf{X}] \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\top | \mathbf{X}] \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \end{aligned}$$
By assumption 4, $\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\top | \mathbf{X}] = \sigma^2 \mathbf{I}_n$: $$\begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top (\sigma^2 \mathbf{I}_n) \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1} \cdot \mathbf{I} \\ &= \boxed{\sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}} \end{aligned}$$
The diagonal elements σ²[(X⊤X)⁻¹]ⱼⱼ give Var(β̂ⱼ)—the variance of each coefficient estimate. Off-diagonal elements give covariances between coefficient estimates. Standard errors are √(diagonal elements).
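The formula can also be verified numerically: compute $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ analytically, then compare it with the empirical covariance of $\hat{\boldsymbol{\beta}}$ across many simulated noise draws. A sketch with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 80, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, -0.5, 3.0])

analytic = sigma**2 * np.linalg.inv(X.T @ X)           # sigma^2 (X'X)^{-1}

# Empirical covariance of beta-hat over 50,000 noise realizations
noise = rng.normal(scale=sigma, size=(n, 50_000))
betas = np.linalg.solve(X.T @ X, X.T @ ((X @ beta_true)[:, None] + noise))
empirical = np.cov(betas)                              # rows are coefficients
print(np.max(np.abs(analytic - empirical)))            # small
```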
The formula $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}$ reveals several important insights about what controls estimation precision:
| Factor | Effect on Variance | Intuition |
|---|---|---|
| Error variance σ² | Direct proportionality | More noise → less precise estimates |
| Sample size n | Inverse relationship (via X⊤X) | More data → more precise estimates |
| Feature variance | Inverse relationship | More spread in X → more precise estimates |
| Multicollinearity | Increases variance | Correlated features → unstable estimates |
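The sample-size row of the table can be illustrated directly by computing the analytic slope variance $\sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{11}$ for growing $n$ (a sketch with synthetic data; $\sigma$ and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0
slope_vars = []
for n in [50, 200, 800]:
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    var_slope = sigma**2 * np.linalg.inv(X.T @ X)[1, 1]  # Var(slope) shrinks roughly like 1/n
    slope_vars.append(var_slope)
    print(f"n={n:4d}  Var(slope)={var_slope:.5f}")
```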
The multicollinearity effect:
When features are highly correlated, $(\mathbf{X}^\top\mathbf{X})^{-1}$ has large elements. To see why, consider the simple case of two perfectly correlated features: then $\mathbf{X}^\top\mathbf{X}$ is singular and cannot be inverted.
Near-perfect correlation (multicollinearity) makes $(\mathbf{X}^\top\mathbf{X})^{-1}$ nearly singular with very large entries, inflating coefficient variances. Individual coefficients become unreliable, even though predictions may still be accurate.
Variance Inflation Factor (VIF): $$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the R-squared from regressing feature $j$ on all other features. VIF > 10 indicates problematic multicollinearity.
With multicollinearity, you can have excellent prediction accuracy (low SSE) but terrible coefficient estimates (high variance). The model 'works' but you can't interpret individual coefficients or trust extrapolations.
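The VIF definition translates directly into code: regress each feature on the others and compute $1/(1 - R_j^2)$. A sketch on synthetic data where `x2` is deliberately built as a noisy copy of `x1`:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # near-duplicate of x1 -> multicollinearity
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the rest."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"VIF_{j + 1} = {vif(X, j):.1f}")
```

Here `VIF_1` and `VIF_2` come out far above the rule-of-thumb threshold of 10, while the independent feature `x3` has a VIF near 1.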
We've established that OLS is unbiased with variance $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$. But is there a better unbiased estimator with lower variance? The Gauss-Markov theorem answers definitively: No.
Theorem (Gauss-Markov): Under assumptions 1-4, the OLS estimator is the Best Linear Unbiased Estimator (BLUE) of $\boldsymbol{\beta}$.
Proof (sketch):
Consider any linear unbiased estimator $\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}$ for some matrix $\mathbf{C}$.
Unbiasedness requires: $$\mathbb{E}[\tilde{\boldsymbol{\beta}}] = \mathbb{E}[\mathbf{C}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$$
This must hold for all $\boldsymbol{\beta}$, so $\mathbf{C}\mathbf{X} = \mathbf{I}$.
Write $\mathbf{C} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top + \mathbf{D}$ where $\mathbf{D}\mathbf{X} = \mathbf{0}$.
Then, because $\mathbf{D}\mathbf{X} = \mathbf{0}$ makes the cross terms vanish: $$\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2\mathbf{C}\mathbf{C}^\top = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}^\top$$
Since $\mathbf{D}\mathbf{D}^\top$ is positive semi-definite, the difference between the two covariance matrices is positive semi-definite: $$\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}}_{\text{OLS}})$$
Equality holds only when $\mathbf{D} = \mathbf{0}$, i.e., when $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{\text{OLS}}$.
Among all linear unbiased estimators, OLS has the smallest variance. You cannot do better while remaining linear and unbiased. This is why OLS is the default choice for linear regression.
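The proof's construction can be checked numerically: build a competitor $\mathbf{C} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top + \mathbf{D}$ with $\mathbf{D}\mathbf{X} = \mathbf{0}$ and compare variances over repeated noise draws. A sketch; the scale of $\mathbf{D}$ and all data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

ols_w = np.linalg.solve(X.T @ X, X.T)                       # OLS weights (X'X)^{-1} X'
H = X @ ols_w                                               # hat matrix
D = rng.normal(scale=0.05, size=(2, n)) @ (np.eye(n) - H)   # any D of this form has DX = 0
C = ols_w + D                                               # competitor: CX = I, so still unbiased

noise = rng.normal(size=(n, 20_000))
Y = (X @ beta_true)[:, None] + noise
ols_est, alt_est = ols_w @ Y, C @ Y

print("OLS slope var:", np.var(ols_est[1]))
print("Alt slope var:", np.var(alt_est[1]))   # larger, as Gauss-Markov predicts
```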
The Gauss-Markov theorem is powerful, but its scope is limited. Understanding these boundaries is crucial for modern machine learning.
For linear, correctly-specified models with homoscedastic uncorrelated errors where you want unbiased estimates, OLS is provably optimal.
For high-dimensional settings (p > n), prediction tasks, or when assumptions are violated, regularization and other methods often outperform OLS.
The variance formula $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ involves the unknown $\sigma^2$. We need an estimator for the error variance.
Natural idea: Use the mean squared residual: $$\hat{\sigma}^2_{\text{naive}} = \frac{1}{n}\sum_{i=1}^n e_i^2 = \frac{1}{n}\mathbf{e}^\top\mathbf{e}$$
Unfortunately, this estimator is biased downward: the fitted coefficients adapt to the noise in the particular sample, so the residuals are systematically smaller than the true errors.
Unbiased estimator: $$\hat{\sigma}^2 = \frac{\mathbf{e}^\top\mathbf{e}}{n - p - 1} = \frac{\text{SSE}}{n - p - 1}$$
where $p+1$ is the number of parameters (including intercept).
Why the denominator is n - p - 1:
The residuals $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ satisfy $p+1$ linear constraints (the normal equations). This means the residual vector has only $n - p - 1$ degrees of freedom.
Intuitively: fitting $p+1$ parameters 'uses up' $p+1$ degrees of freedom from the data, leaving $n - p - 1$ independent pieces of information to estimate $\sigma^2$.
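A Monte Carlo sketch makes the degrees-of-freedom correction visible: with a small $n$ and several parameters, SSE/$n$ is noticeably biased low while SSE/$(n-p-1)$ centers on the true $\sigma^2$ (all data here are synthetic and the sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 5                         # small n makes the bias easy to see
sigma2_true = 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.ones(p + 1)

noise = rng.normal(scale=np.sqrt(sigma2_true), size=(n, 30_000))
Y = (X @ beta)[:, None] + noise
E = Y - X @ np.linalg.solve(X.T @ X, X.T @ Y)      # residuals for each replication
sse = np.sum(E**2, axis=0)

print("mean of SSE / n        =", np.mean(sse / n))            # biased low
print("mean of SSE / (n-p-1)  =", np.mean(sse / (n - p - 1)))  # close to 4.0
```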
The estimated variance-covariance matrix: $$\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^\top\mathbf{X})^{-1}$$
Standard errors are the square roots of the diagonal elements.
When n is large relative to p, the difference between n and n-p-1 is negligible. But for small samples or high-dimensional data, using the correct denominator is essential for unbiased variance estimation.
The fitted values and residuals have their own statistical properties that are useful for diagnostics and theory.
Fitted values: $$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{H}\mathbf{y}$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is the hat matrix (because it puts the 'hat' on y).
Residuals: $$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
| Property | Fitted Values ŷ | Residuals e |
|---|---|---|
| Formula | $\mathbf{H}\mathbf{y}$ | $(\mathbf{I} - \mathbf{H})\mathbf{y}$ |
| Expected value | $\mathbf{X}\boldsymbol{\beta}$ | $\mathbf{0}$ |
| Variance | $\sigma^2\mathbf{H}$ | $\sigma^2(\mathbf{I} - \mathbf{H})$ |
| Sum | Not constrained | $\sum e_i = 0$ (with intercept) |
| Correlation with X | Perfectly aligned | Orthogonal: $\mathbf{X}^\top\mathbf{e} = \mathbf{0}$ |
The result X⊤e = 0 is fundamental. It says residuals are orthogonal to all columns of X, including the intercept column (so Σeᵢ = 0). This orthogonality is why they're called 'normal' equations—normal means perpendicular.
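The hat-matrix identities above are easy to confirm numerically (a sketch on arbitrary synthetic data; the identities hold for any response vector):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
e = (np.eye(n) - H) @ y                 # residuals

print(np.allclose(H @ H, H))            # idempotent: True
print(np.allclose(X.T @ e, 0))          # residuals orthogonal to X: True
print(np.isclose(e.sum(), 0))           # sum to zero (intercept column): True
```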
Let's implement computation of OLS statistics, including standard errors and the variance-covariance matrix.
```python
import numpy as np
from scipy import stats


class OLSRegression:
    """
    OLS Regression with full statistical inference.
    """

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept
        self.beta_ = None
        self.sigma2_ = None
        self.var_cov_ = None
        self.std_errors_ = None

    def fit(self, X, y):
        """
        Fit OLS regression and compute all statistics.
        """
        n = X.shape[0]

        # Add intercept column
        if self.fit_intercept:
            X_design = np.column_stack([np.ones(n), X])
        else:
            X_design = X.copy()

        p_plus_1 = X_design.shape[1]  # Number of parameters

        # Compute OLS estimates
        gram = X_design.T @ X_design
        gram_inv = np.linalg.inv(gram)
        self.beta_ = gram_inv @ X_design.T @ y

        # Compute residuals and error variance estimate
        y_pred = X_design @ self.beta_
        residuals = y - y_pred
        sse = residuals @ residuals
        df = n - p_plus_1  # Degrees of freedom
        self.sigma2_ = sse / df

        # Variance-covariance matrix of coefficients
        self.var_cov_ = self.sigma2_ * gram_inv

        # Standard errors (square root of diagonal)
        self.std_errors_ = np.sqrt(np.diag(self.var_cov_))

        # Store for later use
        self.n_ = n
        self.p_ = X.shape[1]
        self.df_ = df
        self.sse_ = sse
        self.X_design_ = X_design
        self.residuals_ = residuals

        return self

    def t_statistics(self):
        """Compute t-statistics for coefficients."""
        return self.beta_ / self.std_errors_

    def p_values(self):
        """Compute p-values for two-sided tests (H0: β_j = 0)."""
        t_stats = self.t_statistics()
        # Two-sided p-value from t-distribution
        return 2 * (1 - stats.t.cdf(np.abs(t_stats), df=self.df_))

    def confidence_intervals(self, alpha=0.05):
        """Compute confidence intervals for coefficients."""
        t_critical = stats.t.ppf(1 - alpha / 2, df=self.df_)
        margin = t_critical * self.std_errors_
        lower = self.beta_ - margin
        upper = self.beta_ + margin
        return np.column_stack([lower, upper])

    def summary(self):
        """Print regression summary."""
        print("=" * 60)
        print("OLS Regression Results")
        print("=" * 60)
        print(f"Observations: {self.n_}")
        print(f"Parameters: {self.p_ + 1}")
        print(f"Residual DF: {self.df_}")
        print(f"σ² estimate: {self.sigma2_:.4f}")
        print(f"RMSE: {np.sqrt(self.sigma2_):.4f}")
        print("-" * 60)
        print(f"{'Coef':>8} {'Estimate':>12} {'Std Err':>10} {'t-stat':>10} {'p-value':>10}")
        print("-" * 60)
        t_stats = self.t_statistics()
        p_vals = self.p_values()
        for j in range(len(self.beta_)):
            name = "Intercept" if j == 0 else f"x{j}"
            print(f"{name:>8} {self.beta_[j]:>12.4f} {self.std_errors_[j]:>10.4f} "
                  f"{t_stats[j]:>10.2f} {p_vals[j]:>10.4f}")
        print("=" * 60)


# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.random.randn(n, p)
true_beta = np.array([5.0, 2.0, -1.5, 0.8])
y = np.column_stack([np.ones(n), X]) @ true_beta + np.random.randn(n) * 2

model = OLSRegression()
model.fit(X, y)
model.summary()

print("True coefficients:", true_beta)
print("Variance-Covariance Matrix:")
print(model.var_cov_)

print("95% Confidence Intervals:")
ci = model.confidence_intervals()
for j, (low, high) in enumerate(ci):
    name = "Intercept" if j == 0 else f"x{j}"
    contains = "✓" if low <= true_beta[j] <= high else "✗"
    print(f"  {name}: [{low:.4f}, {high:.4f}] {contains}")
```

We've established the fundamental statistical properties of the OLS estimator. Let's consolidate:

- **Unbiasedness:** $\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$ under linearity, full rank, and exogeneity.
- **Variance:** $\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, estimated by plugging in $\hat{\sigma}^2 = \text{SSE}/(n - p - 1)$.
- **Optimality:** by the Gauss-Markov theorem, OLS is BLUE: no linear unbiased estimator has smaller variance.
What's next:
We've focused on algebraic and statistical properties. The next page develops the geometric interpretation of OLS—viewing regression as projection onto the column space of X. This perspective illuminates why the normal equations work and connects to core linear algebra concepts.
You now understand the statistical properties of OLS: unbiasedness, variance, and optimality under the Gauss-Markov assumptions. These properties justify OLS's widespread use and reveal its limitations when assumptions are violated.