In the universe of statistical estimation, a remarkable result stands as a beacon of mathematical elegance: the Gauss-Markov theorem. Named after Carl Friedrich Gauss and Andrey Markov, this theorem answers a question that every data scientist should ask but few deeply understand:
Among all possible ways to estimate regression coefficients using linear combinations of the data, why is Ordinary Least Squares (OLS) the best choice?
The Gauss-Markov theorem doesn't merely suggest that OLS is good—it proves that OLS is optimal among a large class of estimators, under certain conditions. This optimality isn't approximate or heuristic; it's an exact mathematical statement that holds whenever the theorem's assumptions are met.
By the end of this page, you will understand the precise mathematical statement of the Gauss-Markov theorem, the five classical assumptions that guarantee OLS optimality, why 'Best Linear Unbiased Estimator' (BLUE) is the correct characterization, and the profound implications this has for practical regression analysis. You will be able to verify whether real datasets satisfy these assumptions and understand what happens when they're violated.
The story of the Gauss-Markov theorem spans two centuries of statistical development and represents one of the earliest rigorous results in estimation theory.
Carl Friedrich Gauss (1777–1855) first developed the method of least squares in 1795, at the age of 18, while working on astronomical calculations. He used it to predict the orbit of the dwarf planet Ceres when it disappeared behind the Sun. His method proved spectacularly successful—when Ceres re-emerged, it was almost exactly where Gauss predicted.
Gauss published his least squares method in 1809 in Theoria Motus Corporum Coelestium (Theory of the Motion of Heavenly Bodies), where he showed that least squares is optimal under normally distributed errors—a result we now call maximum likelihood estimation under Gaussian errors.
Andrey Markov (1856–1922), working a century later, extended Gauss's result in a profound way. Markov showed that OLS is optimal without assuming normality—requiring only finite variance and uncorrelatedness of errors. This made the result far more general and applicable.
Gauss showed OLS is optimal if errors are Gaussian. Markov weakened this to: OLS is optimal if errors are uncorrelated with constant variance—no distributional assumption needed. This generalization is why the theorem bears both names: Gauss contributed the method and initial optimality proof; Markov provided the minimally restrictive conditions.
Why This Matters for Machine Learning:
The Gauss-Markov theorem is not merely a historical curiosity. It provides a provable, distribution-free justification for OLS and a benchmark against which every other linear estimator can be judged. Understanding Gauss-Markov transforms OLS from a 'recipe' into a principled choice.
Before stating the Gauss-Markov theorem, we must precisely define the classical linear regression model and its assumptions. The theorem is only valid when these assumptions hold.
The Model:
We assume the true data-generating process follows:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
Where:

- $\mathbf{y}$ is the $n \times 1$ vector of observed responses,
- $\mathbf{X}$ is the $n \times p$ design matrix (one row per observation, one column per predictor, typically including an intercept column),
- $\boldsymbol{\beta}$ is the $p \times 1$ vector of unknown coefficients,
- $\boldsymbol{\varepsilon}$ is the $n \times 1$ vector of unobserved random errors.
The goal of regression is to estimate $\boldsymbol{\beta}$ from observed data $(\mathbf{X}, \mathbf{y})$.
In the classical setup, X is treated as fixed (non-random)—either by design (controlled experiments) or by conditioning on observed values. All randomness comes from ε. In the random design case, results hold conditional on X. This distinction matters for understanding what 'unbiased' means: E[β̂|X] = β.
The Five Gauss-Markov Assumptions:
The Gauss-Markov theorem requires the following five conditions. We denote these as GM1–GM5 for reference throughout this module.
| Assumption | Mathematical Statement | Intuitive Meaning |
|---|---|---|
| GM1: Linearity | $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ | The true relationship is linear in parameters (not necessarily in variables) |
| GM2: Full Rank | $\text{rank}(\mathbf{X}) = p$ (full column rank) | No perfect multicollinearity; each predictor provides unique information |
| GM3: Exogeneity | $\mathbb{E}[\varepsilon_i \mid \mathbf{X}] = 0$ for all $i$ | Errors have zero mean and are uncorrelated with predictors |
| GM4: Homoscedasticity | $\text{Var}(\varepsilon_i \mid \mathbf{X}) = \sigma^2$ for all $i$ | Error variance is constant across all observations |
| GM5: No Autocorrelation | $\text{Cov}(\varepsilon_i, \varepsilon_j \mid \mathbf{X}) = 0$ for $i \neq j$ | Errors are uncorrelated with each other |
Combined Covariance Structure (GM4 + GM5):
Assumptions GM4 and GM5 together imply a spherical error covariance structure:
$$\text{Var}(\boldsymbol{\varepsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n$$
Where $\mathbf{I}_n$ is the $n \times n$ identity matrix. This says:

- Every error has the same variance $\sigma^2$ (the diagonal entries), per GM4.
- Every pair of distinct errors has zero covariance (the off-diagonal entries), per GM5.

This structure is called 'spherical' because, when the errors are Gaussian, the constant-probability contours of $\boldsymbol{\varepsilon}$ are hyperspheres in $\mathbb{R}^n$.
Critically, the Gauss-Markov theorem does not require normality of errors. The theorem holds for any error distribution with zero mean and finite variance σ². Normality is only needed for exact t-tests and F-tests; asymptotic inference works without it.
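To make this concrete, here is a minimal simulation sketch (all variable names and parameter values are illustrative, not from the original text): with skewed, non-Gaussian errors that have zero mean and finite variance, repeated OLS fits still average out to the true coefficients.

```python
import numpy as np

# Sketch: OLS remains unbiased under non-Gaussian errors.
# Errors are drawn from a centered exponential distribution (skewed,
# zero mean, finite variance) -- no normality anywhere.
rng = np.random.default_rng(0)
n, n_sims = 200, 5_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # fixed design
beta_true = np.array([1.0, 2.0])

estimates = np.empty((n_sims, 2))
for s in range(n_sims):
    eps = rng.exponential(scale=1.0, size=n) - 1.0  # mean 0, variance 1, skewed
    y = X @ beta_true + eps
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)

print("True β:            ", beta_true)
print("Mean of estimates: ", estimates.mean(axis=0))  # ≈ beta_true
```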
The Ordinary Least Squares (OLS) estimator is derived by minimizing the sum of squared residuals:
$$\hat{\boldsymbol{\beta}}_{OLS} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2$$
The solution, under the full rank assumption (GM2), is:
$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Key Properties of the OLS Estimator:

- It has a closed-form solution whenever $\mathbf{X}$ has full column rank (GM2).
- It is a linear function of $\mathbf{y}$ (shown below).
- Under GM1–GM3 it is unbiased, and under GM1–GM5 its variance is $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ (both derived later on this page).
```python
import numpy as np
from numpy.linalg import inv

def ols_estimator(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Compute the OLS estimator: β̂ = (X'X)^(-1) X'y

    Parameters
    ----------
    X : np.ndarray, shape (n, p)
        Design matrix with n observations and p features
    y : np.ndarray, shape (n,) or (n, 1)
        Response vector

    Returns
    -------
    beta_hat : np.ndarray, shape (p,)
        OLS coefficient estimates

    Mathematical Form
    -----------------
    β̂_OLS = (X'X)^(-1) X'y
    This is the unique solution to the normal equations: X'X β = X'y
    """
    XtX = X.T @ X
    Xty = X.T @ y
    beta_hat = inv(XtX) @ Xty
    return beta_hat

# Example: Simple demonstration
np.random.seed(42)
n, p = 100, 3

# Generate design matrix with intercept
X = np.column_stack([
    np.ones(n),           # Intercept column
    np.random.randn(n),   # Feature 1
    np.random.randn(n)    # Feature 2
])

# True coefficients
beta_true = np.array([2.0, -1.5, 0.5])

# Generate response with Gaussian noise
y = X @ beta_true + 0.5 * np.random.randn(n)

# Estimate
beta_hat = ols_estimator(X, y)

print("True β:      ", beta_true)
print("Estimated β̂:", beta_hat)
print("Difference:  ", beta_true - beta_hat)
```

Linearity of the OLS Estimator:
A crucial observation is that $\hat{\boldsymbol{\beta}}_{OLS}$ is a linear function of $\mathbf{y}$:
$$\hat{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}$$
Where $\mathbf{C} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is a fixed $p \times n$ matrix that depends only on $\mathbf{X}$.
This linearity is essential because the Gauss-Markov theorem compares OLS to all linear estimators—those that can be expressed as $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ for some matrix $\mathbf{A}$.
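As a quick illustration (toy data and assumed values), the sketch below forms $\mathbf{C}$ once from $\mathbf{X}$, applies it to different response vectors, and confirms that it acts linearly.

```python
import numpy as np

# Sketch: the OLS estimate is a linear map of y.
# C depends only on the design matrix X, not on y.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
C = np.linalg.inv(X.T @ X) @ X.T          # fixed p x n matrix

y1 = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
y2 = X @ np.array([3.0, 0.0, 1.0]) + rng.standard_normal(n)

# The same C maps any response vector to its OLS coefficients, and
# linearity means C @ (a*y1 + b*y2) equals a*(C @ y1) + b*(C @ y2).
lhs = C @ (2.0 * y1 + 3.0 * y2)
rhs = 2.0 * (C @ y1) + 3.0 * (C @ y2)
print(np.allclose(lhs, rhs))  # True
```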
The first key property of OLS is that it is unbiased—on average, it hits the true parameter values exactly.
Definition: Unbiased Estimator
An estimator $\hat{\boldsymbol{\beta}}$ is unbiased if:
$$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$$
This means: if we could repeat the sampling process infinitely many times (with fixed $\mathbf{X}$), the average of our estimates would equal the true parameters.
Proof of OLS Unbiasedness:
Starting from the OLS formula and substituting the true model:
$$\begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \quad \text{(substituting } \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\text{)} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \\ &= \boldsymbol{\beta} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \end{aligned}$$
Taking expectations conditional on $\mathbf{X}$:
$$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \boldsymbol{\beta} + \mathbf{0} = \boldsymbol{\beta}$$
The last step uses GM3: $\mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \mathbf{0}$. ∎
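The proof can also be checked empirically. Below is a small Monte Carlo sketch (illustrative dimensions and coefficient values): holding $\mathbf{X}$ fixed and redrawing the errors many times, the average of the OLS estimates matches $\boldsymbol{\beta}$ up to simulation noise.

```python
import numpy as np

# Sketch: verify E[β̂ | X] = β by Monte Carlo with a fixed design matrix.
rng = np.random.default_rng(2)
n, n_sims = 100, 10_000
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
beta_true = np.array([2.0, -1.5, 0.5])
XtX_inv_Xt = np.linalg.inv(X.T @ X) @ X.T   # held fixed across replications

estimates = np.empty((n_sims, 3))
for s in range(n_sims):
    eps = 0.5 * rng.standard_normal(n)       # new noise each replication, same X
    estimates[s] = XtX_inv_Xt @ (X @ beta_true + eps)

bias = estimates.mean(axis=0) - beta_true
print("Empirical bias per coefficient:", bias)  # all ≈ 0
```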
The expression β̂ = β + (X'X)⁻¹X'ε is profoundly informative. It shows that our estimate equals the truth plus a 'noise term.' This noise term has mean zero (giving unbiasedness) but non-zero variance (giving estimation error). The Gauss-Markov theorem says this variance is minimal among linear unbiased estimators.
What Unbiasedness Means (and Doesn't Mean):
✅ On average, OLS is correct: No systematic over- or under-estimation
✅ The method is 'fair': If you used OLS repeatedly, your errors would average out
❌ Individual estimates might be far from truth: Unbiasedness says nothing about any single estimate
❌ Low variance isn't guaranteed yet: An unbiased estimator can still have high variance
Unbiasedness alone isn't sufficient—we need the estimator to also have low variance. That's where the 'Best' part of BLUE comes in.
We now arrive at the central result of this page.
Theorem (Gauss-Markov):
Under assumptions GM1–GM5, the OLS estimator $\hat{\boldsymbol{\beta}}_{OLS}$ is the Best Linear Unbiased Estimator (BLUE) of $\boldsymbol{\beta}$.
What 'BLUE' Means:
Let's unpack each word:

- Best: smallest variance, in the positive semi-definite sense, within the comparison class.
- Linear: the estimator is a linear function of $\mathbf{y}$, i.e., $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$.
- Unbiased: $\mathbb{E}[\tilde{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}$.
- Estimator: a rule that maps the observed data to a value of $\boldsymbol{\beta}$.
More precisely: among all estimators of the form $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ that satisfy $\mathbb{E}[\tilde{\boldsymbol{\beta}}|\mathbf{X}] = \boldsymbol{\beta}$, the OLS estimator has the smallest variance (in the matrix sense).
For any linear unbiased estimator β̃, we have: Var(β̃|X) - Var(β̂_OLS|X) is positive semi-definite. This means OLS has the smallest variance for every coefficient and every linear combination of coefficients. No other linear unbiased estimator can do better.
Formal Mathematical Statement:
Let $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ be any linear estimator of $\boldsymbol{\beta}$ satisfying:

- Unbiasedness: $\mathbb{E}[\tilde{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}$ for every value of $\boldsymbol{\beta}$.

Then:
$$\text{Var}(\tilde{\boldsymbol{\beta}} | \mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X}) \succeq 0$$
Where $\succeq 0$ denotes positive semi-definiteness.
Equivalently, for any non-zero vector $\mathbf{c} \in \mathbb{R}^p$:
$$\text{Var}(\mathbf{c}^\top \tilde{\boldsymbol{\beta}} | \mathbf{X}) \geq \text{Var}(\mathbf{c}^\top \hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X})$$
The proof is elegant and reveals deep geometric structure. We present it in full detail.
Setup:
Let $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ be any linear unbiased estimator. Define:
$$\mathbf{D} = \mathbf{A} - (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$$
So $\mathbf{A} = \mathbf{D} + (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$, and:
$$\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y} = \mathbf{D}\mathbf{y} + (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{D}\mathbf{y} + \hat{\boldsymbol{\beta}}_{OLS}$$
Step 1: Constraint from Unbiasedness
For $\tilde{\boldsymbol{\beta}}$ to be unbiased:
$$\mathbb{E}[\mathbf{A}\mathbf{y}|\mathbf{X}] = \mathbf{A}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \quad \text{for all } \boldsymbol{\beta}$$
This requires $\mathbf{A}\mathbf{X} = \mathbf{I}_p$. Since $(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X} = \mathbf{I}_p$, we need:
$$(\mathbf{D} + (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top)\mathbf{X} = \mathbf{I}_p \quad\Longrightarrow\quad \mathbf{D}\mathbf{X} + \mathbf{I}_p = \mathbf{I}_p \quad\Longrightarrow\quad \mathbf{D}\mathbf{X} = \mathbf{0}$$
Crucial constraint: Any linear unbiased estimator different from OLS must have $\mathbf{D} \neq \mathbf{0}$ but $\mathbf{D}\mathbf{X} = \mathbf{0}$.
Step 2: Compute the Variance of $\tilde{\boldsymbol{\beta}}$
Using $\tilde{\boldsymbol{\beta}} = \mathbf{D}\mathbf{y} + \hat{\boldsymbol{\beta}}_{OLS}$ and $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$:
$$\tilde{\boldsymbol{\beta}} = \mathbf{D}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) + \hat{\boldsymbol{\beta}}_{OLS} = \mathbf{D}\mathbf{X}\boldsymbol{\beta} + \mathbf{D}\boldsymbol{\varepsilon} + \hat{\boldsymbol{\beta}}_{OLS}$$
Since $\mathbf{D}\mathbf{X} = \mathbf{0}$:
$$\tilde{\boldsymbol{\beta}} = \mathbf{D}\boldsymbol{\varepsilon} + \hat{\boldsymbol{\beta}}_{OLS}$$
Now compute the variance:
$$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \text{Var}(\mathbf{D}\boldsymbol{\varepsilon} + \hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X})$$
Recall $\hat{\boldsymbol{\beta}}_{OLS} = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$, so:
$$\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta} = \mathbf{D}\boldsymbol{\varepsilon} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} = \mathbf{A}\boldsymbol{\varepsilon}$$
Thus: $$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \mathbf{A} \cdot \text{Var}(\boldsymbol{\varepsilon}|\mathbf{X}) \cdot \mathbf{A}^\top = \sigma^2 \mathbf{A}\mathbf{A}^\top$$
Step 3: Expand and Simplify
Expanding $\mathbf{A}\mathbf{A}^\top$:
$$\mathbf{A}\mathbf{A}^\top = (\mathbf{D} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top)(\mathbf{D} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top)^\top$$
$$= \mathbf{D}\mathbf{D}^\top + \mathbf{D}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{D}^\top + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}$$
Since $\mathbf{D}\mathbf{X} = \mathbf{0}$, the middle terms vanish:
$$\mathbf{A}\mathbf{A}^\top = \mathbf{D}\mathbf{D}^\top + (\mathbf{X}^\top\mathbf{X})^{-1}$$
Therefore:
$$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \sigma^2(\mathbf{D}\mathbf{D}^\top + (\mathbf{X}^\top\mathbf{X})^{-1}) = \sigma^2\mathbf{D}\mathbf{D}^\top + \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X})$$
Step 4: Conclude
Since $\mathbf{D}\mathbf{D}^\top$ is positive semi-definite (it's the product of a matrix and its transpose):
$$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X}) = \sigma^2\mathbf{D}\mathbf{D}^\top \succeq 0$$
Equality holds if and only if $\mathbf{D} = \mathbf{0}$, i.e., $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{OLS}$. ∎
Any alternative linear unbiased estimator has variance Var(β̂_OLS) + σ²DD', where DD' ≥ 0. This proves OLS has minimum variance. The extra term σ²DD' represents the 'efficiency loss' from using a different linear unbiased estimator.
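The decomposition from Step 3 can be verified numerically. The sketch below (arbitrary illustrative dimensions) constructs a matrix $\mathbf{D}$ with $\mathbf{D}\mathbf{X} = \mathbf{0}$ by projecting a random matrix onto the orthogonal complement of the column space of $\mathbf{X}$, forms the alternative estimator $\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top + \mathbf{D}$, and confirms that its variance exceeds the OLS variance by exactly $\sigma^2\mathbf{D}\mathbf{D}^\top$.

```python
import numpy as np

# Sketch: build an alternative linear unbiased estimator β̃ = A y with
# A = (X'X)^(-1) X' + D and D X = 0, then check the variance decomposition
# Var(β̃) = Var(β̂_OLS) + σ² D D' from the proof.
rng = np.random.default_rng(3)
n, p, sigma_sq = 60, 3, 0.25
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])

XtX_inv = np.linalg.inv(X.T @ X)
C = XtX_inv @ X.T                      # OLS: β̂ = C y
M = np.eye(n) - X @ C                  # annihilator matrix: M X = 0
D = rng.standard_normal((p, n)) @ M    # any such D satisfies D X = 0
A = C + D                              # alternative linear unbiased estimator (A X = I)

var_ols = sigma_sq * XtX_inv
var_alt = sigma_sq * A @ A.T

# The gap is exactly σ² D D', which is positive semi-definite.
gap = var_alt - var_ols
print(np.allclose(gap, sigma_sq * D @ D.T))        # True
print(np.all(np.linalg.eigvalsh(gap) >= -1e-10))   # True: PSD
```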
From the proof, we derived a fundamental result:
$$\text{Var}(\hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$$
This formula is central to understanding the precision of our estimates and constructing confidence intervals.
Interpreting the Variance:
The matrix $(\mathbf{X}^\top \mathbf{X})^{-1}$ captures how the design of the experiment affects estimation precision:

- Its diagonal entries, scaled by $\sigma^2$, give the variances of the individual coefficient estimates.
- Its off-diagonal entries, scaled by $\sigma^2$, give the covariances between coefficient estimates.
- Well-spread, nearly orthogonal predictors keep its entries small; collinear predictors make them large.
```python
import numpy as np
from numpy.linalg import inv

def ols_variance(X: np.ndarray, sigma_sq: float) -> np.ndarray:
    """
    Compute the variance-covariance matrix of the OLS estimator.

    Var(β̂|X) = σ² (X'X)^(-1)

    Parameters
    ----------
    X : np.ndarray, shape (n, p)
        Design matrix
    sigma_sq : float
        Error variance σ²

    Returns
    -------
    var_beta : np.ndarray, shape (p, p)
        Variance-covariance matrix of β̂
    """
    XtX_inv = inv(X.T @ X)
    return sigma_sq * XtX_inv

def ols_standard_errors(X: np.ndarray, sigma_sq: float) -> np.ndarray:
    """
    Compute standard errors of OLS coefficients.

    SE(β̂_j) = sqrt(σ² * [(X'X)^(-1)]_jj)
    """
    var_beta = ols_variance(X, sigma_sq)
    return np.sqrt(np.diag(var_beta))

# Example
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
true_sigma_sq = 0.25  # True error variance

# Compute variance-covariance matrix and standard errors
var_cov = ols_variance(X, true_sigma_sq)
se = ols_standard_errors(X, true_sigma_sq)

print("Variance-Covariance Matrix of β̂:")
print(var_cov)
print()
print("Standard Errors:")
print(f"  SE(β̂₀) = {se[0]:.4f}")
print(f"  SE(β̂₁) = {se[1]:.4f}")
print(f"  SE(β̂₂) = {se[2]:.4f}")
```

Factors Affecting Coefficient Variance:
From $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$, we can identify what makes estimates more precise:
| Factor | Effect on Var(β̂) | Interpretation |
|---|---|---|
| Larger σ² | Increases variance | More noise → less precise estimates |
| Larger n | Decreases variance | More data → more precise estimates |
| Higher variance in X | Decreases variance | More spread in predictors → better estimation |
| Multicollinearity | Increases variance | Correlated predictors → inflated standard errors |
The last point is particularly important: when predictors are highly correlated, $(\mathbf{X}^\top\mathbf{X})^{-1}$ has large elements, inflating variance.
When predictors are nearly collinear, (X'X) is nearly singular, so (X'X)^(-1) has very large elements. This means coefficient estimates are highly unstable—small changes in data cause large changes in β̂. The coefficients remain unbiased, but their variance is so large that estimates are practically useless.
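A quick numerical sketch of this effect (illustrative correlation levels, σ² = 1): as two standardized predictors become more correlated, the diagonal entries of $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ grow sharply.

```python
import numpy as np

# Sketch: how predictor correlation inflates Var(β̂) = σ² (X'X)^(-1).
# Two standardized predictors with correlation rho; watch the diagonal grow.
rng = np.random.default_rng(4)
n, sigma_sq = 500, 1.0
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)

for rho in [0.0, 0.9, 0.99, 0.999]:
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2   # corr(x1, x2) ≈ rho
    X = np.column_stack([np.ones(n), x1, x2])
    var_beta = sigma_sq * np.linalg.inv(X.T @ X)
    print(f"rho = {rho:6.3f}   Var(β̂₁) ≈ {var_beta[1, 1]:.4f}   Var(β̂₂) ≈ {var_beta[2, 2]:.4f}")
```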
The Gauss-Markov theorem is powerful but not without boundaries. Understanding both its implications and limitations is crucial for applying it correctly.
Key Implications:

- Within its class, OLS cannot be beaten: no other linear unbiased estimator achieves lower variance for any coefficient or linear combination of coefficients.
- The result is distribution-free: it needs only zero-mean, homoscedastic, uncorrelated errors, not normality.
- The variance formula $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ shows exactly how noise level and experimental design determine precision.

Critical Limitations:

- 'Best' is relative to linear and unbiased estimators only; nonlinear or biased estimators lie outside the comparison class.
- The guarantee disappears when any of GM1–GM5 fails.
- Minimum variance can still be large variance, as the multicollinearity example above shows.
OLS minimizes variance among unbiased estimators. But in high-dimensional settings (p ≈ n or p > n), the variance of OLS explodes. Ridge regression introduces bias but reduces variance, often achieving lower overall MSE. The Gauss-Markov theorem doesn't contradict this—it simply doesn't address biased estimators.
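The sketch below illustrates this trade-off under assumed, untuned settings (near-duplicate predictors, λ = 1, and a simplified ridge that also penalizes the intercept); it demonstrates the bias-variance point, not a recipe for choosing λ.

```python
import numpy as np

# Sketch: under severe collinearity, a biased ridge estimator
# β̂_ridge = (X'X + λI)^(-1) X'y can beat OLS in mean squared error,
# exactly because Gauss-Markov says nothing about biased estimators.
rng = np.random.default_rng(5)
n, n_sims, lam = 50, 2_000, 1.0       # lam is an arbitrary, untuned choice
beta_true = np.array([1.0, 2.0, -1.0])

z = rng.standard_normal(n)
X = np.column_stack([np.ones(n),
                     z + 0.01 * rng.standard_normal(n),
                     z + 0.01 * rng.standard_normal(n)])   # nearly collinear

mse_ols = mse_ridge = 0.0
I = np.eye(3)
for _ in range(n_sims):
    y = X @ beta_true + rng.standard_normal(n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
    mse_ols += np.sum((b_ols - beta_true) ** 2) / n_sims
    mse_ridge += np.sum((b_ridge - beta_true) ** 2) / n_sims

print(f"OLS   MSE: {mse_ols:.3f}")    # large: OLS variance explodes under collinearity
print(f"Ridge MSE: {mse_ridge:.3f}")  # smaller despite the bias
```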
When to Trust OLS (Gauss-Markov Applies):

- Diagnostics show roughly constant residual spread and no residual autocorrelation.
- Predictors are not severely collinear and $n$ is comfortably larger than $p$.
- The goal is interpretable, unbiased coefficient estimates.

When to Consider Alternatives:

- Heteroscedastic or correlated errors: weighted or generalized least squares.
- Severe multicollinearity or $p \approx n$: ridge, lasso, or other regularized estimators.
- Heavy-tailed errors or outliers: robust regression methods.
The Gauss-Markov theorem's guarantees depend entirely on assumptions GM1–GM5 holding. In practice, we must check these assumptions using data diagnostics.
Diagnostic Strategies:
| Assumption | Diagnostic Method | What to Look For |
|---|---|---|
| GM1: Linearity | Residuals vs. fitted plot | Random scatter; no curves or patterns |
| GM2: Full Rank | Variance Inflation Factor (VIF) | VIF < 10 for all predictors |
| GM3: Exogeneity | Residuals vs. predictors plots | No systematic relationship |
| GM4: Homoscedasticity | Scale-location plot, Breusch-Pagan test | Constant spread in residuals |
| GM5: No Autocorrelation | Durbin-Watson test, ACF of residuals | DW ≈ 2; no significant autocorrelation |
```python
import numpy as np
import scipy.stats as stats
from typing import Dict, Any

def check_gauss_markov_assumptions(X: np.ndarray,
                                   y: np.ndarray,
                                   residuals: np.ndarray) -> Dict[str, Any]:
    """
    Perform diagnostic checks for the Gauss-Markov assumptions.

    Returns a dictionary with test results for each assumption.
    """
    n = len(y)
    p = X.shape[1]
    diagnostics = {}

    # GM2: Check for multicollinearity via VIF.
    # VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing X_j on the
    # other predictors. For centered predictors this equals the diagonal of
    # the inverse correlation matrix, so it is computed once for all j.
    X_centered = X[:, 1:] - X[:, 1:].mean(axis=0)  # drop the intercept column
    if X_centered.shape[1] > 1:
        corr_matrix = np.corrcoef(X_centered.T)
        vifs = np.diag(np.linalg.inv(corr_matrix)).tolist()
    else:
        vifs = [1.0]
    diagnostics['vif'] = {
        'values': vifs,
        'max_vif': max(vifs),
        'concern': max(vifs) > 10
    }

    # GM4: Breusch-Pagan test for heteroscedasticity.
    # H0: constant variance. LM statistic = n * R² from regressing the
    # squared residuals on X; asymptotically chi-squared with p - 1 df.
    residuals_sq = residuals ** 2
    XtX_inv = np.linalg.inv(X.T @ X)
    bp_beta = XtX_inv @ X.T @ residuals_sq
    bp_fitted = X @ bp_beta
    bp_ss = np.sum((bp_fitted - residuals_sq.mean()) ** 2)
    bp_ss_total = np.sum((residuals_sq - residuals_sq.mean()) ** 2)
    bp_r_sq = bp_ss / bp_ss_total if bp_ss_total > 0 else 0.0
    bp_stat = n * bp_r_sq
    bp_pvalue = 1 - stats.chi2.cdf(bp_stat, p - 1)
    diagnostics['breusch_pagan'] = {
        'statistic': bp_stat,
        'pvalue': bp_pvalue,
        'heteroscedastic': bp_pvalue < 0.05
    }

    # GM5: Durbin-Watson statistic for autocorrelation (DW ≈ 2 means none).
    dw_num = np.sum(np.diff(residuals) ** 2)
    dw_denom = np.sum(residuals ** 2)
    dw_stat = dw_num / dw_denom if dw_denom > 0 else 2.0
    diagnostics['durbin_watson'] = {
        'statistic': dw_stat,
        'interpretation': ('No autocorrelation' if 1.5 < dw_stat < 2.5 else
                           'Positive autocorrelation' if dw_stat <= 1.5 else
                           'Negative autocorrelation')
    }

    # GM3: Exogeneity - correlation between residuals and each non-constant
    # predictor (the intercept column is skipped because a constant has no
    # defined correlation with the residuals).
    exog_corrs = [np.corrcoef(residuals, X[:, j])[0, 1] for j in range(1, p)]
    diagnostics['exogeneity'] = {
        'residual_predictor_correlations': exog_corrs,
        'max_abs_correlation': max(abs(c) for c in exog_corrs),
        'concern': max(abs(c) for c in exog_corrs) > 0.1
    }

    return diagnostics

# Example usage
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Fit OLS
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
residuals = y - X @ beta_hat

# Run diagnostics
results = check_gauss_markov_assumptions(X, y, residuals)
for test, result in results.items():
    print(f"{test}: {result}")
```

The Gauss-Markov theorem is a cornerstone of statistical estimation theory. Let's consolidate what we've learned:

- The classical linear model together with assumptions GM1–GM5 defines the setting in which the theorem holds.
- Under those assumptions, OLS is BLUE: it has the smallest variance among all linear unbiased estimators of $\boldsymbol{\beta}$.
- The proof rests on the decomposition $\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X}) + \sigma^2\mathbf{D}\mathbf{D}^\top$.
- The variance formula $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ connects design, noise, and precision.
- The guarantee is conditional: it says nothing about biased or nonlinear estimators, and it evaporates when the assumptions fail, so diagnostics matter.
What's Next:
Now that we understand why OLS is optimal under classical assumptions, we'll explore the BLUE property in greater depth. The next page examines what 'Best Linear Unbiased Estimator' means operationally, how it compares to other estimation criteria, and why unbiasedness is sometimes worth sacrificing for efficiency.
You now understand the Gauss-Markov theorem—one of the most important results in statistical estimation theory. This theorem justifies the default use of OLS in classical regression settings and provides a framework for understanding when alternatives might be preferable.