Having derived the OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, we now face deeper questions: How good is this estimator?
An estimator isn't just a formula—it's a random variable because it depends on random data. Different samples yield different estimates. To evaluate an estimator, we must understand its statistical properties: Is it centered on the true parameter? How much does it vary from sample to sample? Could any other estimator do better?
These questions lead to the Gauss-Markov theorem, one of the most celebrated results in statistical theory, which establishes OLS as the Best Linear Unbiased Estimator (BLUE).
By the end of this page, you will prove that OLS is unbiased under standard assumptions, derive its variance-covariance matrix, understand the Gauss-Markov theorem, and recognize when OLS achieves optimality.
All statistical properties of OLS depend on specific assumptions about the data-generating process. Let's state them precisely.
The Linear Model Assumptions:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where:

- $\mathbf{y}$ is the $n \times 1$ response vector, $\mathbf{X}$ the $n \times (p+1)$ design matrix, $\boldsymbol{\beta}$ the $(p+1) \times 1$ coefficient vector, and $\boldsymbol{\epsilon}$ the $n \times 1$ error vector.

The four assumptions are:

1. **Linearity:** the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ is correctly specified.
2. **Full rank:** $\mathbf{X}$ has full column rank, so $\mathbf{X}^\top\mathbf{X}$ is invertible.
3. **Exogeneity:** $\mathbb{E}[\boldsymbol{\epsilon} | \mathbf{X}] = \mathbf{0}$.
4. **Spherical errors:** $\text{Var}(\boldsymbol{\epsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n$.
The Gauss-Markov theorem does NOT require normally distributed errors. It only requires zero mean and constant variance. Normality is needed for exact inference (t-tests, F-tests) but not for BLUE properties. We'll add normality later for hypothesis testing.
Understanding spherical errors:
The assumption $\text{Var}(\boldsymbol{\epsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n$ packs two conditions: homoscedasticity (every error has the same variance $\sigma^2$, the constant diagonal) and no autocorrelation (distinct errors are uncorrelated, the zero off-diagonals).
The covariance matrix of $\boldsymbol{\epsilon}$ is: $$\text{Var}(\boldsymbol{\epsilon}) = \sigma^2 \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = \sigma^2 \mathbf{I}_n$$
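A small simulation can make this concrete: drawing many i.i.d. error vectors and estimating their covariance should recover $\sigma^2 \mathbf{I}$. This is a sketch with arbitrary $\sigma$ and dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.5, 4
# 200,000 independent draws of a length-4 error vector with i.i.d. N(0, sigma^2) entries
eps = rng.normal(scale=sigma, size=(200_000, n))
emp_cov = np.cov(eps, rowvar=False)
print(np.round(emp_cov, 2))  # close to sigma^2 * I = 2.25 * I
```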
Definition: An estimator $\hat{\boldsymbol{\beta}}$ is unbiased if its expected value equals the true parameter: $$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$$
Theorem: Under assumptions 1-3 (linearity, full rank, exogeneity), the OLS estimator is unbiased.
Proof:
Starting with the OLS formula: $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
Substitute $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$:
$$\begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon} \\ &= \mathbf{I}\boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon} \\ &= \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon} \end{aligned}$$
Taking expectations conditional on $\mathbf{X}$: $$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbb{E}[\boldsymbol{\epsilon} | \mathbf{X}]$$
By assumption 3, $\mathbb{E}[\boldsymbol{\epsilon} | \mathbf{X}] = \mathbf{0}$, so: $$\boxed{\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}}$$
On average, OLS gets the right answer. If we could repeat our experiment infinitely many times with different noise realizations, the average of our estimates would converge to the true β. This is a finite-sample property—it holds for any sample size n.
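This averaging interpretation is easy to check by simulation: fix a design matrix, redraw the noise many times, and average the resulting OLS estimates. A sketch with synthetic data (seed, sample size, and noise scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design: intercept + one feature
beta_true = np.array([1.0, 2.0])

# Refit OLS on 20,000 independent noise realizations and average the estimates
noise = rng.normal(scale=1.5, size=(n, 20_000))        # E[eps | X] = 0
Y = (X @ beta_true)[:, None] + noise
estimates = np.linalg.solve(X.T @ X, X.T @ Y)          # one beta-hat per column
print(estimates.mean(axis=1))                          # close to [1.0, 2.0]
```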
Knowing that OLS is unbiased tells us where it's centered, but not how much it fluctuates. The variance-covariance matrix captures this uncertainty.
Theorem: Under assumptions 1-4, the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ is: $$\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}$$
Proof:
From the unbiasedness proof, we established: $$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon}$$
The variance-covariance matrix is: $$\begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \mathbb{E}[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top | \mathbf{X}] \\ &= \mathbb{E}[(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} | \mathbf{X}] \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\top | \mathbf{X}] \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \end{aligned}$$
By assumption 4, $\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^\top | \mathbf{X}] = \sigma^2 \mathbf{I}_n$: $$\begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top (\sigma^2 \mathbf{I}_n) \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1} \cdot \mathbf{I} \\ &= \boxed{\sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}} \end{aligned}$$
The diagonal elements σ²[(X⊤X)⁻¹]ⱼⱼ give Var(β̂ⱼ)—the variance of each coefficient estimate. Off-diagonal elements give covariances between coefficient estimates. Standard errors are √(diagonal elements).
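The formula can also be verified numerically: compute $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ analytically, then compare it with the empirical covariance of $\hat{\boldsymbol{\beta}}$ across many simulated noise draws. A sketch with arbitrary synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 80, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, -0.5, 3.0])

analytic = sigma**2 * np.linalg.inv(X.T @ X)           # sigma^2 (X'X)^{-1}

# Empirical covariance of beta-hat over 50,000 noise realizations
noise = rng.normal(scale=sigma, size=(n, 50_000))
betas = np.linalg.solve(X.T @ X, X.T @ ((X @ beta_true)[:, None] + noise))
empirical = np.cov(betas)                              # rows are coefficients
print(np.max(np.abs(analytic - empirical)))            # small
```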
The formula $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}$ reveals several important insights about what controls estimation precision:
| Factor | Effect on Variance | Intuition |
|---|---|---|
| Error variance σ² | Direct proportionality | More noise → less precise estimates |
| Sample size n | Inverse relationship (via X⊤X) | More data → more precise estimates |
| Feature variance | Inverse relationship | More spread in X → more precise estimates |
| Multicollinearity | Increases variance | Correlated features → unstable estimates |
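The sample-size row of the table can be illustrated directly by computing the analytic slope variance $\sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{11}$ for growing $n$ (a sketch with synthetic data; $\sigma$ and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0
slope_vars = []
for n in [50, 200, 800]:
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    var_slope = sigma**2 * np.linalg.inv(X.T @ X)[1, 1]  # Var(slope) shrinks roughly like 1/n
    slope_vars.append(var_slope)
    print(f"n={n:4d}  Var(slope)={var_slope:.5f}")
```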
The multicollinearity effect:
When features are highly correlated, $(\mathbf{X}^\top\mathbf{X})^{-1}$ has large elements. To see why, consider the simple case of two perfectly correlated features: then $\mathbf{X}^\top\mathbf{X}$ is singular and cannot be inverted.
Near-perfect correlation (multicollinearity) makes $(\mathbf{X}^\top\mathbf{X})^{-1}$ nearly singular with very large entries, inflating coefficient variances. Individual coefficients become unreliable, even though predictions may still be accurate.
Variance Inflation Factor (VIF): $$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the R-squared from regressing feature $j$ on all other features. VIF > 10 indicates problematic multicollinearity.
With multicollinearity, you can have excellent prediction accuracy (low SSE) but terrible coefficient estimates (high variance). The model 'works' but you can't interpret individual coefficients or trust extrapolations.
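The VIF definition translates directly into code: regress each feature on the others and compute $1/(1 - R_j^2)$. A sketch on synthetic data where `x2` is deliberately built as a noisy copy of `x1`:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # near-duplicate of x1 -> multicollinearity
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the rest."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"VIF_{j + 1} = {vif(X, j):.1f}")
```

Here `VIF_1` and `VIF_2` come out far above the rule-of-thumb threshold of 10, while the independent feature `x3` has a VIF near 1.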
We've established that OLS is unbiased with variance $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$. But is there a better unbiased estimator with lower variance? The Gauss-Markov theorem answers definitively: No.
Theorem (Gauss-Markov): Under assumptions 1-4, the OLS estimator is the Best Linear Unbiased Estimator (BLUE) of $\boldsymbol{\beta}$.
Proof (sketch):
Consider any linear unbiased estimator $\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}$ for some matrix $\mathbf{C}$.
Unbiasedness requires: $$\mathbb{E}[\tilde{\boldsymbol{\beta}}] = \mathbb{E}[\mathbf{C}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon})] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}$$
This must hold for all $\boldsymbol{\beta}$, so $\mathbf{C}\mathbf{X} = \mathbf{I}$.
Write $\mathbf{C} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top + \mathbf{D}$ where $\mathbf{D}\mathbf{X} = \mathbf{0}$.
Then, because $\mathbf{D}\mathbf{X} = \mathbf{0}$ makes the cross terms vanish: $$\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2\mathbf{C}\mathbf{C}^\top = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1} + \sigma^2\mathbf{D}\mathbf{D}^\top$$
Since $\mathbf{D}\mathbf{D}^\top$ is positive semi-definite, the difference between the two covariance matrices is positive semi-definite: $$\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}}_{\text{OLS}})$$
Equality holds only when $\mathbf{D} = \mathbf{0}$, i.e., when $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{\text{OLS}}$.
Among all linear unbiased estimators, OLS has the smallest variance. You cannot do better while remaining linear and unbiased. This is why OLS is the default choice for linear regression.
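The proof's construction can be checked numerically: build a competitor $\mathbf{C} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top + \mathbf{D}$ with $\mathbf{D}\mathbf{X} = \mathbf{0}$ and compare variances over repeated noise draws. A sketch; the scale of $\mathbf{D}$ and all data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

ols_w = np.linalg.solve(X.T @ X, X.T)                       # OLS weights (X'X)^{-1} X'
H = X @ ols_w                                               # hat matrix
D = rng.normal(scale=0.05, size=(2, n)) @ (np.eye(n) - H)   # any D of this form has DX = 0
C = ols_w + D                                               # competitor: CX = I, so still unbiased

noise = rng.normal(size=(n, 20_000))
Y = (X @ beta_true)[:, None] + noise
ols_est, alt_est = ols_w @ Y, C @ Y

print("OLS slope var:", np.var(ols_est[1]))
print("Alt slope var:", np.var(alt_est[1]))   # larger, as Gauss-Markov predicts
```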
The Gauss-Markov theorem is powerful, but its scope is limited. Understanding these boundaries is crucial for modern machine learning.
For linear, correctly-specified models with homoscedastic uncorrelated errors where you want unbiased estimates, OLS is provably optimal.
For high-dimensional settings (p > n), prediction tasks, or when assumptions are violated, regularization and other methods often outperform OLS.
The variance formula $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ involves the unknown $\sigma^2$. We need an estimator for the error variance.
Natural idea: Use the mean squared residual: $$\hat{\sigma}^2_{\text{naive}} = \frac{1}{n}\sum_{i=1}^n e_i^2 = \frac{1}{n}\mathbf{e}^\top\mathbf{e}$$
Unfortunately, this estimator is biased downward: the fitted coefficients adapt to the noise in the particular sample, so the residuals are systematically smaller than the true errors.
Unbiased estimator: $$\hat{\sigma}^2 = \frac{\mathbf{e}^\top\mathbf{e}}{n - p - 1} = \frac{\text{SSE}}{n - p - 1}$$
where $p+1$ is the number of parameters (including intercept).
Why the denominator is n - p - 1:
The residuals $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ satisfy $p+1$ linear constraints (the normal equations). This means the residual vector has only $n - p - 1$ degrees of freedom.
Intuitively: fitting $p+1$ parameters 'uses up' $p+1$ degrees of freedom from the data, leaving $n - p - 1$ independent pieces of information to estimate $\sigma^2$.
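A Monte Carlo sketch makes the degrees-of-freedom correction visible: with a small $n$ and several parameters, SSE/$n$ is noticeably biased low while SSE/$(n-p-1)$ centers on the true $\sigma^2$ (all data here are synthetic and the sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 5                         # small n makes the bias easy to see
sigma2_true = 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.ones(p + 1)

noise = rng.normal(scale=np.sqrt(sigma2_true), size=(n, 30_000))
Y = (X @ beta)[:, None] + noise
E = Y - X @ np.linalg.solve(X.T @ X, X.T @ Y)      # residuals for each replication
sse = np.sum(E**2, axis=0)

print("mean of SSE / n        =", np.mean(sse / n))            # biased low
print("mean of SSE / (n-p-1)  =", np.mean(sse / (n - p - 1)))  # close to 4.0
```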
The estimated variance-covariance matrix: $$\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^\top\mathbf{X})^{-1}$$
Standard errors are the square roots of the diagonal elements.
When n is large relative to p, the difference between n and n-p-1 is negligible. But for small samples or high-dimensional data, using the correct denominator is essential for unbiased variance estimation.
The fitted values and residuals have their own statistical properties that are useful for diagnostics and theory.
Fitted values: $$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{H}\mathbf{y}$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is the hat matrix (because it puts the 'hat' on y).
Residuals: $$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
| Property | Fitted Values ŷ | Residuals e |
|---|---|---|
| Formula | $\mathbf{H}\mathbf{y}$ | $(\mathbf{I} - \mathbf{H})\mathbf{y}$ |
| Expected value | $\mathbf{X}\boldsymbol{\beta}$ | $\mathbf{0}$ |
| Variance | $\sigma^2\mathbf{H}$ | $\sigma^2(\mathbf{I} - \mathbf{H})$ |
| Sum | Not constrained | $\sum e_i = 0$ (with intercept) |
| Correlation with X | Perfectly aligned | Orthogonal: $\mathbf{X}^\top\mathbf{e} = \mathbf{0}$ |
The result X⊤e = 0 is fundamental. It says residuals are orthogonal to all columns of X, including the intercept column (so Σeᵢ = 0). This orthogonality is why they're called 'normal' equations—normal means perpendicular.
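The hat-matrix identities above are easy to confirm numerically (a sketch on arbitrary synthetic data; the identities hold for any response vector):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
e = (np.eye(n) - H) @ y                 # residuals

print(np.allclose(H @ H, H))            # idempotent: True
print(np.allclose(X.T @ e, 0))          # residuals orthogonal to X: True
print(np.isclose(e.sum(), 0))           # sum to zero (intercept column): True
```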
Let's implement computation of OLS statistics, including standard errors and the variance-covariance matrix.
```python
import numpy as np
from scipy import stats


class OLSRegression:
    """
    OLS Regression with full statistical inference.
    """

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept
        self.beta_ = None
        self.sigma2_ = None
        self.var_cov_ = None
        self.std_errors_ = None

    def fit(self, X, y):
        """
        Fit OLS regression and compute all statistics.
        """
        n = X.shape[0]

        # Add intercept column
        if self.fit_intercept:
            X_design = np.column_stack([np.ones(n), X])
        else:
            X_design = X.copy()

        p_plus_1 = X_design.shape[1]  # Number of parameters

        # Compute OLS estimates
        gram = X_design.T @ X_design
        gram_inv = np.linalg.inv(gram)
        self.beta_ = gram_inv @ X_design.T @ y

        # Compute residuals and error variance estimate
        y_pred = X_design @ self.beta_
        residuals = y - y_pred
        sse = residuals @ residuals
        df = n - p_plus_1  # Degrees of freedom
        self.sigma2_ = sse / df

        # Variance-covariance matrix of coefficients
        self.var_cov_ = self.sigma2_ * gram_inv

        # Standard errors (square root of diagonal)
        self.std_errors_ = np.sqrt(np.diag(self.var_cov_))

        # Store for later use
        self.n_ = n
        self.p_ = X.shape[1]
        self.df_ = df
        self.sse_ = sse
        self.X_design_ = X_design
        self.residuals_ = residuals

        return self

    def t_statistics(self):
        """Compute t-statistics for coefficients."""
        return self.beta_ / self.std_errors_

    def p_values(self):
        """Compute p-values for two-sided tests (H0: β_j = 0)."""
        t_stats = self.t_statistics()
        # Two-sided p-value from t-distribution
        return 2 * (1 - stats.t.cdf(np.abs(t_stats), df=self.df_))

    def confidence_intervals(self, alpha=0.05):
        """Compute confidence intervals for coefficients."""
        t_critical = stats.t.ppf(1 - alpha / 2, df=self.df_)
        margin = t_critical * self.std_errors_
        lower = self.beta_ - margin
        upper = self.beta_ + margin
        return np.column_stack([lower, upper])

    def summary(self):
        """Print regression summary."""
        print("=" * 60)
        print("OLS Regression Results")
        print("=" * 60)
        print(f"Observations: {self.n_}")
        print(f"Parameters: {self.p_ + 1}")
        print(f"Residual DF: {self.df_}")
        print(f"σ² estimate: {self.sigma2_:.4f}")
        print(f"RMSE: {np.sqrt(self.sigma2_):.4f}")
        print("-" * 60)
        print(f"{'Coef':>8} {'Estimate':>12} {'Std Err':>10} {'t-stat':>10} {'p-value':>10}")
        print("-" * 60)
        t_stats = self.t_statistics()
        p_vals = self.p_values()
        for j in range(len(self.beta_)):
            name = "Intercept" if j == 0 else f"x{j}"
            print(f"{name:>8} {self.beta_[j]:>12.4f} {self.std_errors_[j]:>10.4f} "
                  f"{t_stats[j]:>10.2f} {p_vals[j]:>10.4f}")
        print("=" * 60)


# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.random.randn(n, p)
true_beta = np.array([5.0, 2.0, -1.5, 0.8])
y = np.column_stack([np.ones(n), X]) @ true_beta + np.random.randn(n) * 2

model = OLSRegression()
model.fit(X, y)
model.summary()

print("True coefficients:", true_beta)
print("Variance-Covariance Matrix:")
print(model.var_cov_)

print("95% Confidence Intervals:")
ci = model.confidence_intervals()
for j, (low, high) in enumerate(ci):
    name = "Intercept" if j == 0 else f"x{j}"
    contains = "✓" if low <= true_beta[j] <= high else "✗"
    print(f"  {name}: [{low:.4f}, {high:.4f}] {contains}")
```

We've established the fundamental statistical properties of the OLS estimator. Let's consolidate:

- **Unbiasedness:** $\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$ under linearity, full rank, and exogeneity.
- **Variance:** $\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, estimated by plugging in $\hat{\sigma}^2 = \text{SSE}/(n - p - 1)$.
- **Optimality:** by the Gauss-Markov theorem, OLS is BLUE: no linear unbiased estimator has smaller variance.
What's next:
We've focused on algebraic and statistical properties. The next page develops the geometric interpretation of OLS—viewing regression as projection onto the column space of X. This perspective illuminates why the normal equations work and connects to core linear algebra concepts.
You now understand the statistical properties of OLS: unbiasedness, variance, and optimality under the Gauss-Markov assumptions. These properties justify OLS's widespread use and reveal its limitations when assumptions are violated.