In the universe of statistical estimation, a remarkable result stands as a beacon of mathematical elegance: the Gauss-Markov theorem. Named after Carl Friedrich Gauss and Andrey Markov, this theorem answers a question that every data scientist should ask but few deeply understand:
Among all possible ways to estimate regression coefficients using linear combinations of the data, why is Ordinary Least Squares (OLS) the best choice?
The Gauss-Markov theorem doesn't merely suggest that OLS is good—it proves that OLS is optimal among a large class of estimators, under certain conditions. This optimality isn't approximate or heuristic; it's an exact mathematical statement that holds whenever the theorem's assumptions are met.
By the end of this page, you will understand the precise mathematical statement of the Gauss-Markov theorem, the five classical assumptions that guarantee OLS optimality, why 'Best Linear Unbiased Estimator' (BLUE) is the correct characterization, and the profound implications this has for practical regression analysis. You will be able to verify whether real datasets satisfy these assumptions and understand what happens when they're violated.
The story of the Gauss-Markov theorem spans two centuries of statistical development and represents one of the earliest rigorous results in estimation theory.
Carl Friedrich Gauss (1777–1855) first developed the method of least squares in 1795, at the age of 18, while working on astronomical calculations. He used it to predict the orbit of the dwarf planet Ceres when it disappeared behind the Sun. His method proved spectacularly successful—when Ceres re-emerged, it was almost exactly where Gauss predicted.
Gauss published his least squares method in 1809 in Theoria Motus Corporum Coelestium (Theory of the Motion of Heavenly Bodies), where he showed that least squares is optimal under normally distributed errors—a result we now call maximum likelihood estimation under Gaussian errors.
Andrey Markov (1856–1922), working a century later, extended Gauss's result in a profound way. Markov showed that OLS is optimal without assuming normality—requiring only finite variance and uncorrelatedness of errors. This made the result far more general and applicable.
Gauss showed OLS is optimal if errors are Gaussian. Markov weakened this to: OLS is optimal if errors are uncorrelated with constant variance—no distributional assumption needed. This generalization is why the theorem bears both names: Gauss contributed the method and initial optimality proof; Markov provided the minimally restrictive conditions.
Why This Matters for Machine Learning:
The Gauss-Markov theorem is not merely a historical curiosity. It provides a provable, distribution-free justification for OLS and a benchmark against which every other linear estimator can be judged. Understanding Gauss-Markov transforms OLS from a 'recipe' into a principled choice.
Before stating the Gauss-Markov theorem, we must precisely define the classical linear regression model and its assumptions. The theorem is only valid when these assumptions hold.
The Model:
We assume the true data-generating process follows:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
Where:

- $\mathbf{y}$ is the $n \times 1$ vector of observed responses,
- $\mathbf{X}$ is the $n \times p$ design matrix (one row per observation, one column per predictor, typically including an intercept column),
- $\boldsymbol{\beta}$ is the $p \times 1$ vector of unknown coefficients,
- $\boldsymbol{\varepsilon}$ is the $n \times 1$ vector of unobserved random errors.
The goal of regression is to estimate $\boldsymbol{\beta}$ from observed data $(\mathbf{X}, \mathbf{y})$.
In the classical setup, X is treated as fixed (non-random)—either by design (controlled experiments) or by conditioning on observed values. All randomness comes from ε. In the random design case, results hold conditional on X. This distinction matters for understanding what 'unbiased' means: E[β̂|X] = β.
The Five Gauss-Markov Assumptions:
The Gauss-Markov theorem requires the following five conditions. We denote these as GM1–GM5 for reference throughout this module.
| Assumption | Mathematical Statement | Intuitive Meaning |
|---|---|---|
| GM1: Linearity | $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ | The true relationship is linear in parameters (not necessarily in variables) |
| GM2: Full Rank | $\text{rank}(\mathbf{X}) = p$ (full column rank) | No perfect multicollinearity; each predictor provides unique information |
| GM3: Exogeneity | $\mathbb{E}[\varepsilon_i \mid \mathbf{X}] = 0$ for all $i$ | Errors have zero mean and are uncorrelated with predictors |
| GM4: Homoscedasticity | $\text{Var}(\varepsilon_i \mid \mathbf{X}) = \sigma^2$ for all $i$ | Error variance is constant across all observations |
| GM5: No Autocorrelation | $\text{Cov}(\varepsilon_i, \varepsilon_j \mid \mathbf{X}) = 0$ for $i \neq j$ | Errors are uncorrelated with each other |
Combined Covariance Structure (GM4 + GM5):
Assumptions GM4 and GM5 together imply a spherical error covariance structure:
$$\text{Var}(\boldsymbol{\varepsilon} | \mathbf{X}) = \sigma^2 \mathbf{I}_n$$
Where $\mathbf{I}_n$ is the $n \times n$ identity matrix. This says:

- Every error has the same variance $\sigma^2$ (the diagonal entries), per GM4.
- Every pair of distinct errors has zero covariance (the off-diagonal entries), per GM5.

This structure is called 'spherical' because, when the errors are Gaussian, the constant-probability contours of $\boldsymbol{\varepsilon}$ are hyperspheres in $\mathbb{R}^n$.
Critically, the Gauss-Markov theorem does not require normality of errors. The theorem holds for any error distribution with zero mean and finite variance σ². Normality is only needed for exact t-tests and F-tests; asymptotic inference works without it.
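To make this concrete, here is a minimal simulation sketch (all variable names and parameter values are illustrative, not from the original text): with skewed, non-Gaussian errors that have zero mean and finite variance, repeated OLS fits still average out to the true coefficients.

```python
import numpy as np

# Sketch: OLS remains unbiased under non-Gaussian errors.
# Errors are drawn from a centered exponential distribution (skewed,
# zero mean, finite variance) -- no normality anywhere.
rng = np.random.default_rng(0)
n, n_sims = 200, 5_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # fixed design
beta_true = np.array([1.0, 2.0])

estimates = np.empty((n_sims, 2))
for s in range(n_sims):
    eps = rng.exponential(scale=1.0, size=n) - 1.0  # mean 0, variance 1, skewed
    y = X @ beta_true + eps
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)

print("True β:            ", beta_true)
print("Mean of estimates: ", estimates.mean(axis=0))  # ≈ beta_true
```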
The Ordinary Least Squares (OLS) estimator is derived by minimizing the sum of squared residuals:
$$\hat{\boldsymbol{\beta}}_{OLS} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^\top \boldsymbol{\beta})^2$$
The solution, under the full rank assumption (GM2), is:
$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Key Properties of the OLS Estimator:

- It has a closed-form solution whenever $\mathbf{X}$ has full column rank (GM2).
- It is a linear function of $\mathbf{y}$ (shown below).
- Under GM1–GM3 it is unbiased, and under GM1–GM5 its variance is $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ (both derived later on this page).
```python
import numpy as np
from numpy.linalg import inv

def ols_estimator(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Compute the OLS estimator: β̂ = (X'X)^(-1) X'y

    Parameters
    ----------
    X : np.ndarray, shape (n, p)
        Design matrix with n observations and p features
    y : np.ndarray, shape (n,) or (n, 1)
        Response vector

    Returns
    -------
    beta_hat : np.ndarray, shape (p,)
        OLS coefficient estimates

    Mathematical Form
    -----------------
    β̂_OLS = (X'X)^(-1) X'y
    This is the unique solution to the normal equations: X'X β = X'y
    """
    XtX = X.T @ X
    Xty = X.T @ y
    beta_hat = inv(XtX) @ Xty
    return beta_hat

# Example: Simple demonstration
np.random.seed(42)
n, p = 100, 3

# Generate design matrix with intercept
X = np.column_stack([
    np.ones(n),           # Intercept column
    np.random.randn(n),   # Feature 1
    np.random.randn(n)    # Feature 2
])

# True coefficients
beta_true = np.array([2.0, -1.5, 0.5])

# Generate response with Gaussian noise
y = X @ beta_true + 0.5 * np.random.randn(n)

# Estimate
beta_hat = ols_estimator(X, y)

print("True β:      ", beta_true)
print("Estimated β̂:", beta_hat)
print("Difference:  ", beta_true - beta_hat)
```

Linearity of the OLS Estimator:
A crucial observation is that $\hat{\boldsymbol{\beta}}_{OLS}$ is a linear function of $\mathbf{y}$:
$$\hat{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}$$
Where $\mathbf{C} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is a fixed $p \times n$ matrix that depends only on $\mathbf{X}$.
This linearity is essential because the Gauss-Markov theorem compares OLS to all linear estimators—those that can be expressed as $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ for some matrix $\mathbf{A}$.
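As a quick illustration (toy data and assumed values), the sketch below forms $\mathbf{C}$ once from $\mathbf{X}$, applies it to different response vectors, and confirms that it acts linearly.

```python
import numpy as np

# Sketch: the OLS estimate is a linear map of y.
# C depends only on the design matrix X, not on y.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
C = np.linalg.inv(X.T @ X) @ X.T          # fixed p x n matrix

y1 = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)
y2 = X @ np.array([3.0, 0.0, 1.0]) + rng.standard_normal(n)

# The same C maps any response vector to its OLS coefficients, and
# linearity means C @ (a*y1 + b*y2) equals a*(C @ y1) + b*(C @ y2).
lhs = C @ (2.0 * y1 + 3.0 * y2)
rhs = 2.0 * (C @ y1) + 3.0 * (C @ y2)
print(np.allclose(lhs, rhs))  # True
```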
The first key property of OLS is that it is unbiased—on average, it hits the true parameter values exactly.
Definition: Unbiased Estimator
An estimator $\hat{\boldsymbol{\beta}}$ is unbiased if:
$$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$$
This means: if we could repeat the sampling process infinitely many times (with fixed $\mathbf{X}$), the average of our estimates would equal the true parameters.
Proof of OLS Unbiasedness:
Starting from the OLS formula and substituting the true model:
$$\begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) \quad \text{(substituting } \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\text{)} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \\ &= \boldsymbol{\beta} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \boldsymbol{\varepsilon} \end{aligned}$$
Taking expectations conditional on $\mathbf{X}$:
$$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \boldsymbol{\beta} + \mathbf{0} = \boldsymbol{\beta}$$
The last step uses GM3: $\mathbb{E}[\boldsymbol{\varepsilon} | \mathbf{X}] = \mathbf{0}$. ∎
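The proof can also be checked empirically. Below is a small Monte Carlo sketch (illustrative dimensions and coefficient values): holding $\mathbf{X}$ fixed and redrawing the errors many times, the average of the OLS estimates matches $\boldsymbol{\beta}$ up to simulation noise.

```python
import numpy as np

# Sketch: verify E[β̂ | X] = β by Monte Carlo with a fixed design matrix.
rng = np.random.default_rng(2)
n, n_sims = 100, 10_000
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
beta_true = np.array([2.0, -1.5, 0.5])
XtX_inv_Xt = np.linalg.inv(X.T @ X) @ X.T   # held fixed across replications

estimates = np.empty((n_sims, 3))
for s in range(n_sims):
    eps = 0.5 * rng.standard_normal(n)       # new noise each replication, same X
    estimates[s] = XtX_inv_Xt @ (X @ beta_true + eps)

bias = estimates.mean(axis=0) - beta_true
print("Empirical bias per coefficient:", bias)  # all ≈ 0
```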
The expression β̂ = β + (X'X)⁻¹X'ε is profoundly informative. It shows that our estimate equals the truth plus a 'noise term.' This noise term has mean zero (giving unbiasedness) but non-zero variance (giving estimation error). The Gauss-Markov theorem says this variance is minimal among linear unbiased estimators.
What Unbiasedness Means (and Doesn't Mean):
✅ On average, OLS is correct: No systematic over- or under-estimation
✅ The method is 'fair': If you used OLS repeatedly, your errors would average out
❌ Individual estimates might be far from truth: Unbiasedness says nothing about any single estimate
❌ Low variance isn't guaranteed yet: An unbiased estimator can still have high variance
Unbiasedness alone isn't sufficient—we need the estimator to also have low variance. That's where the 'Best' part of BLUE comes in.
We now arrive at the central result of this page.
Theorem (Gauss-Markov):
Under assumptions GM1–GM5, the OLS estimator $\hat{\boldsymbol{\beta}}_{OLS}$ is the Best Linear Unbiased Estimator (BLUE) of $\boldsymbol{\beta}$.
What 'BLUE' Means:
Let's unpack each word:

- Best: smallest variance, in the positive semi-definite sense, within the comparison class.
- Linear: the estimator is a linear function of $\mathbf{y}$, i.e., $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$.
- Unbiased: $\mathbb{E}[\tilde{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}$.
- Estimator: a rule that maps the observed data to a value of $\boldsymbol{\beta}$.
More precisely: among all estimators of the form $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ that satisfy $\mathbb{E}[\tilde{\boldsymbol{\beta}}|\mathbf{X}] = \boldsymbol{\beta}$, the OLS estimator has the smallest variance (in the matrix sense).
For any linear unbiased estimator β̃, we have: Var(β̃|X) - Var(β̂_OLS|X) is positive semi-definite. This means OLS has the smallest variance for every coefficient and every linear combination of coefficients. No other linear unbiased estimator can do better.
Formal Mathematical Statement:
Let $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ be any linear estimator of $\boldsymbol{\beta}$ satisfying:

- Unbiasedness: $\mathbb{E}[\tilde{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}$ for every value of $\boldsymbol{\beta}$.

Then:
$$\text{Var}(\tilde{\boldsymbol{\beta}} | \mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X}) \succeq 0$$
Where $\succeq 0$ denotes positive semi-definiteness.
Equivalently, for any non-zero vector $\mathbf{c} \in \mathbb{R}^p$:
$$\text{Var}(\mathbf{c}^\top \tilde{\boldsymbol{\beta}} | \mathbf{X}) \geq \text{Var}(\mathbf{c}^\top \hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X})$$
The proof is elegant and reveals deep geometric structure. We present it in full detail.
Setup:
Let $\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ be any linear unbiased estimator. Define:
$$\mathbf{D} = \mathbf{A} - (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$$
So $\mathbf{A} = \mathbf{D} + (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$, and:
$$\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y} = \mathbf{D}\mathbf{y} + (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} = \mathbf{D}\mathbf{y} + \hat{\boldsymbol{\beta}}_{OLS}$$
Step 1: Constraint from Unbiasedness
For $\tilde{\boldsymbol{\beta}}$ to be unbiased:
$$\mathbb{E}[\mathbf{A}\mathbf{y}|\mathbf{X}] = \mathbf{A}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \quad \text{for all } \boldsymbol{\beta}$$
This requires $\mathbf{A}\mathbf{X} = \mathbf{I}_p$. Since $(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X} = \mathbf{I}_p$, we need:
$$(\mathbf{D} + (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top)\mathbf{X} = \mathbf{I}_p \quad\Longrightarrow\quad \mathbf{D}\mathbf{X} + \mathbf{I}_p = \mathbf{I}_p \quad\Longrightarrow\quad \mathbf{D}\mathbf{X} = \mathbf{0}$$
Crucial constraint: Any linear unbiased estimator different from OLS must have $\mathbf{D} \neq \mathbf{0}$ but $\mathbf{D}\mathbf{X} = \mathbf{0}$.
Step 2: Compute the Variance of $\tilde{\boldsymbol{\beta}}$
Using $\tilde{\boldsymbol{\beta}} = \mathbf{D}\mathbf{y} + \hat{\boldsymbol{\beta}}_{OLS}$ and $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$:
$$\tilde{\boldsymbol{\beta}} = \mathbf{D}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) + \hat{\boldsymbol{\beta}}_{OLS} = \mathbf{D}\mathbf{X}\boldsymbol{\beta} + \mathbf{D}\boldsymbol{\varepsilon} + \hat{\boldsymbol{\beta}}_{OLS}$$
Since $\mathbf{D}\mathbf{X} = \mathbf{0}$:
$$\tilde{\boldsymbol{\beta}} = \mathbf{D}\boldsymbol{\varepsilon} + \hat{\boldsymbol{\beta}}_{OLS}$$
Now compute the variance:
$$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \text{Var}(\mathbf{D}\boldsymbol{\varepsilon} + \hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X})$$
Recall $\hat{\boldsymbol{\beta}}_{OLS} = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$, so:
$$\tilde{\boldsymbol{\beta}} - \boldsymbol{\beta} = \mathbf{D}\boldsymbol{\varepsilon} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} = \mathbf{A}\boldsymbol{\varepsilon}$$
Thus: $$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \mathbf{A} \cdot \text{Var}(\boldsymbol{\varepsilon}|\mathbf{X}) \cdot \mathbf{A}^\top = \sigma^2 \mathbf{A}\mathbf{A}^\top$$
Step 3: Expand and Simplify
Expanding $\mathbf{A}\mathbf{A}^\top$:
$$\mathbf{A}\mathbf{A}^\top = (\mathbf{D} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top)(\mathbf{D} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top)^\top$$
$$= \mathbf{D}\mathbf{D}^\top + \mathbf{D}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{D}^\top + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}$$
Since $\mathbf{D}\mathbf{X} = \mathbf{0}$, the middle terms vanish:
$$\mathbf{A}\mathbf{A}^\top = \mathbf{D}\mathbf{D}^\top + (\mathbf{X}^\top\mathbf{X})^{-1}$$
Therefore:
$$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \sigma^2(\mathbf{D}\mathbf{D}^\top + (\mathbf{X}^\top\mathbf{X})^{-1}) = \sigma^2\mathbf{D}\mathbf{D}^\top + \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X})$$
Step 4: Conclude
Since $\mathbf{D}\mathbf{D}^\top$ is positive semi-definite (it's the product of a matrix and its transpose):
$$\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X}) = \sigma^2\mathbf{D}\mathbf{D}^\top \succeq 0$$
Equality holds if and only if $\mathbf{D} = \mathbf{0}$, i.e., $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{OLS}$. ∎
Any alternative linear unbiased estimator has variance Var(β̂_OLS) + σ²DD', where DD' ≥ 0. This proves OLS has minimum variance. The extra term σ²DD' represents the 'efficiency loss' from using a different linear unbiased estimator.
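The decomposition from Step 3 can be verified numerically. The sketch below (arbitrary illustrative dimensions) constructs a matrix $\mathbf{D}$ with $\mathbf{D}\mathbf{X} = \mathbf{0}$ by projecting a random matrix onto the orthogonal complement of the column space of $\mathbf{X}$, forms the alternative estimator $\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top + \mathbf{D}$, and confirms that its variance exceeds the OLS variance by exactly $\sigma^2\mathbf{D}\mathbf{D}^\top$.

```python
import numpy as np

# Sketch: build an alternative linear unbiased estimator β̃ = A y with
# A = (X'X)^(-1) X' + D and D X = 0, then check the variance decomposition
# Var(β̃) = Var(β̂_OLS) + σ² D D' from the proof.
rng = np.random.default_rng(3)
n, p, sigma_sq = 60, 3, 0.25
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])

XtX_inv = np.linalg.inv(X.T @ X)
C = XtX_inv @ X.T                      # OLS: β̂ = C y
M = np.eye(n) - X @ C                  # annihilator matrix: M X = 0
D = rng.standard_normal((p, n)) @ M    # any such D satisfies D X = 0
A = C + D                              # alternative linear unbiased estimator (A X = I)

var_ols = sigma_sq * XtX_inv
var_alt = sigma_sq * A @ A.T

# The gap is exactly σ² D D', which is positive semi-definite.
gap = var_alt - var_ols
print(np.allclose(gap, sigma_sq * D @ D.T))        # True
print(np.all(np.linalg.eigvalsh(gap) >= -1e-10))   # True: PSD
```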
From the proof, we derived a fundamental result:
$$\text{Var}(\hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$$
This formula is central to understanding the precision of our estimates and constructing confidence intervals.
Interpreting the Variance:
The matrix $(\mathbf{X}^\top \mathbf{X})^{-1}$ captures how the design of the experiment affects estimation precision:

- Its diagonal entries, scaled by $\sigma^2$, give the variances of the individual coefficient estimates.
- Its off-diagonal entries, scaled by $\sigma^2$, give the covariances between coefficient estimates.
- Well-spread, nearly orthogonal predictors keep its entries small; collinear predictors make them large.
```python
import numpy as np
from numpy.linalg import inv

def ols_variance(X: np.ndarray, sigma_sq: float) -> np.ndarray:
    """
    Compute the variance-covariance matrix of the OLS estimator.

    Var(β̂|X) = σ² (X'X)^(-1)

    Parameters
    ----------
    X : np.ndarray, shape (n, p)
        Design matrix
    sigma_sq : float
        Error variance σ²

    Returns
    -------
    var_beta : np.ndarray, shape (p, p)
        Variance-covariance matrix of β̂
    """
    XtX_inv = inv(X.T @ X)
    return sigma_sq * XtX_inv

def ols_standard_errors(X: np.ndarray, sigma_sq: float) -> np.ndarray:
    """
    Compute standard errors of OLS coefficients.

    SE(β̂_j) = sqrt(σ² * [(X'X)^(-1)]_jj)
    """
    var_beta = ols_variance(X, sigma_sq)
    return np.sqrt(np.diag(var_beta))

# Example
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
true_sigma_sq = 0.25  # True error variance

# Compute variance-covariance matrix and standard errors
var_cov = ols_variance(X, true_sigma_sq)
se = ols_standard_errors(X, true_sigma_sq)

print("Variance-Covariance Matrix of β̂:")
print(var_cov)
print()
print("Standard Errors:")
print(f"  SE(β̂₀) = {se[0]:.4f}")
print(f"  SE(β̂₁) = {se[1]:.4f}")
print(f"  SE(β̂₂) = {se[2]:.4f}")
```

Factors Affecting Coefficient Variance:
From $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$, we can identify what makes estimates more precise:
| Factor | Effect on Var(β̂) | Interpretation |
|---|---|---|
| Larger σ² | Increases variance | More noise → less precise estimates |
| Larger n | Decreases variance | More data → more precise estimates |
| Higher variance in X | Decreases variance | More spread in predictors → better estimation |
| Multicollinearity | Increases variance | Correlated predictors → inflated standard errors |
The last point is particularly important: when predictors are highly correlated, $(\mathbf{X}^\top\mathbf{X})^{-1}$ has large elements, inflating variance.
When predictors are nearly collinear, (X'X) is nearly singular, so (X'X)^(-1) has very large elements. This means coefficient estimates are highly unstable—small changes in data cause large changes in β̂. The coefficients remain unbiased, but their variance is so large that estimates are practically useless.
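A quick numerical sketch of this effect (illustrative correlation levels, σ² = 1): as two standardized predictors become more correlated, the diagonal entries of $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ grow sharply.

```python
import numpy as np

# Sketch: how predictor correlation inflates Var(β̂) = σ² (X'X)^(-1).
# Two standardized predictors with correlation rho; watch the diagonal grow.
rng = np.random.default_rng(4)
n, sigma_sq = 500, 1.0
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)

for rho in [0.0, 0.9, 0.99, 0.999]:
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2   # corr(x1, x2) ≈ rho
    X = np.column_stack([np.ones(n), x1, x2])
    var_beta = sigma_sq * np.linalg.inv(X.T @ X)
    print(f"rho = {rho:6.3f}   Var(β̂₁) ≈ {var_beta[1, 1]:.4f}   Var(β̂₂) ≈ {var_beta[2, 2]:.4f}")
```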
The Gauss-Markov theorem is powerful but not without boundaries. Understanding both its implications and limitations is crucial for applying it correctly.
Key Implications:

- Within its class, OLS cannot be beaten: no other linear unbiased estimator achieves lower variance for any coefficient or linear combination of coefficients.
- The result is distribution-free: it needs only zero-mean, homoscedastic, uncorrelated errors, not normality.
- The variance formula $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ shows exactly how noise level and experimental design determine precision.

Critical Limitations:

- 'Best' is relative to linear and unbiased estimators only; nonlinear or biased estimators lie outside the comparison class.
- The guarantee disappears when any of GM1–GM5 fails.
- Minimum variance can still be large variance, as the multicollinearity example above shows.
OLS minimizes variance among unbiased estimators. But in high-dimensional settings (p ≈ n or p > n), the variance of OLS explodes. Ridge regression introduces bias but reduces variance, often achieving lower overall MSE. The Gauss-Markov theorem doesn't contradict this—it simply doesn't address biased estimators.
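The sketch below illustrates this trade-off under assumed, untuned settings (near-duplicate predictors, λ = 1, and a simplified ridge that also penalizes the intercept); it demonstrates the bias-variance point, not a recipe for choosing λ.

```python
import numpy as np

# Sketch: under severe collinearity, a biased ridge estimator
# β̂_ridge = (X'X + λI)^(-1) X'y can beat OLS in mean squared error,
# exactly because Gauss-Markov says nothing about biased estimators.
rng = np.random.default_rng(5)
n, n_sims, lam = 50, 2_000, 1.0       # lam is an arbitrary, untuned choice
beta_true = np.array([1.0, 2.0, -1.0])

z = rng.standard_normal(n)
X = np.column_stack([np.ones(n),
                     z + 0.01 * rng.standard_normal(n),
                     z + 0.01 * rng.standard_normal(n)])   # nearly collinear

mse_ols = mse_ridge = 0.0
I = np.eye(3)
for _ in range(n_sims):
    y = X @ beta_true + rng.standard_normal(n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
    mse_ols += np.sum((b_ols - beta_true) ** 2) / n_sims
    mse_ridge += np.sum((b_ridge - beta_true) ** 2) / n_sims

print(f"OLS   MSE: {mse_ols:.3f}")    # large: OLS variance explodes under collinearity
print(f"Ridge MSE: {mse_ridge:.3f}")  # smaller despite the bias
```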
When to Trust OLS (Gauss-Markov Applies):

- Diagnostics show roughly constant residual spread and no residual autocorrelation.
- Predictors are not severely collinear and $n$ is comfortably larger than $p$.
- The goal is interpretable, unbiased coefficient estimates.

When to Consider Alternatives:

- Heteroscedastic or correlated errors: weighted or generalized least squares.
- Severe multicollinearity or $p \approx n$: ridge, lasso, or other regularized estimators.
- Heavy-tailed errors or outliers: robust regression methods.
The Gauss-Markov theorem's guarantees depend entirely on assumptions GM1–GM5 holding. In practice, we must check these assumptions using data diagnostics.
Diagnostic Strategies:
| Assumption | Diagnostic Method | What to Look For |
|---|---|---|
| GM1: Linearity | Residuals vs. fitted plot | Random scatter; no curves or patterns |
| GM2: Full Rank | Variance Inflation Factor (VIF) | VIF < 10 for all predictors |
| GM3: Exogeneity | Residuals vs. predictors plots | No systematic relationship |
| GM4: Homoscedasticity | Scale-location plot, Breusch-Pagan test | Constant spread in residuals |
| GM5: No Autocorrelation | Durbin-Watson test, ACF of residuals | DW ≈ 2; no significant autocorrelation |
```python
import numpy as np
import scipy.stats as stats
from typing import Dict, Any

def check_gauss_markov_assumptions(X: np.ndarray,
                                   y: np.ndarray,
                                   residuals: np.ndarray) -> Dict[str, Any]:
    """
    Perform diagnostic checks for the Gauss-Markov assumptions.

    Returns a dictionary with test results for each assumption.
    """
    n = len(y)
    p = X.shape[1]
    diagnostics = {}

    # GM2: Check for multicollinearity via VIF.
    # VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing X_j on the
    # other predictors. For centered predictors this equals the diagonal of
    # the inverse correlation matrix, so it is computed once for all j.
    X_centered = X[:, 1:] - X[:, 1:].mean(axis=0)  # drop the intercept column
    if X_centered.shape[1] > 1:
        corr_matrix = np.corrcoef(X_centered.T)
        vifs = np.diag(np.linalg.inv(corr_matrix)).tolist()
    else:
        vifs = [1.0]
    diagnostics['vif'] = {
        'values': vifs,
        'max_vif': max(vifs),
        'concern': max(vifs) > 10
    }

    # GM4: Breusch-Pagan test for heteroscedasticity.
    # H0: constant variance. LM statistic = n * R² from regressing the
    # squared residuals on X; asymptotically chi-squared with p - 1 df.
    residuals_sq = residuals ** 2
    XtX_inv = np.linalg.inv(X.T @ X)
    bp_beta = XtX_inv @ X.T @ residuals_sq
    bp_fitted = X @ bp_beta
    bp_ss = np.sum((bp_fitted - residuals_sq.mean()) ** 2)
    bp_ss_total = np.sum((residuals_sq - residuals_sq.mean()) ** 2)
    bp_r_sq = bp_ss / bp_ss_total if bp_ss_total > 0 else 0.0
    bp_stat = n * bp_r_sq
    bp_pvalue = 1 - stats.chi2.cdf(bp_stat, p - 1)
    diagnostics['breusch_pagan'] = {
        'statistic': bp_stat,
        'pvalue': bp_pvalue,
        'heteroscedastic': bp_pvalue < 0.05
    }

    # GM5: Durbin-Watson statistic for autocorrelation (DW ≈ 2 means none).
    dw_num = np.sum(np.diff(residuals) ** 2)
    dw_denom = np.sum(residuals ** 2)
    dw_stat = dw_num / dw_denom if dw_denom > 0 else 2.0
    diagnostics['durbin_watson'] = {
        'statistic': dw_stat,
        'interpretation': ('No autocorrelation' if 1.5 < dw_stat < 2.5 else
                           'Positive autocorrelation' if dw_stat <= 1.5 else
                           'Negative autocorrelation')
    }

    # GM3: Exogeneity - correlation between residuals and each non-constant
    # predictor (the intercept column is skipped because a constant has no
    # defined correlation with the residuals).
    exog_corrs = [np.corrcoef(residuals, X[:, j])[0, 1] for j in range(1, p)]
    diagnostics['exogeneity'] = {
        'residual_predictor_correlations': exog_corrs,
        'max_abs_correlation': max(abs(c) for c in exog_corrs),
        'concern': max(abs(c) for c in exog_corrs) > 0.1
    }

    return diagnostics

# Example usage
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Fit OLS
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
residuals = y - X @ beta_hat

# Run diagnostics
results = check_gauss_markov_assumptions(X, y, residuals)
for test, result in results.items():
    print(f"{test}: {result}")
```

The Gauss-Markov theorem is a cornerstone of statistical estimation theory. Let's consolidate what we've learned:

- The classical linear model together with assumptions GM1–GM5 defines the setting in which the theorem holds.
- Under those assumptions, OLS is BLUE: it has the smallest variance among all linear unbiased estimators of $\boldsymbol{\beta}$.
- The proof rests on the decomposition $\text{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}|\mathbf{X}) + \sigma^2\mathbf{D}\mathbf{D}^\top$.
- The variance formula $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ connects design, noise, and precision.
- The guarantee is conditional: it says nothing about biased or nonlinear estimators, and it evaporates when the assumptions fail, so diagnostics matter.
What's Next:
Now that we understand why OLS is optimal under classical assumptions, we'll explore the BLUE property in greater depth. The next page examines what 'Best Linear Unbiased Estimator' means operationally, how it compares to other estimation criteria, and why unbiasedness is sometimes worth sacrificing for efficiency.
You now understand the Gauss-Markov theorem—one of the most important results in statistical estimation theory. This theorem justifies the default use of OLS in classical regression settings and provides a framework for understanding when alternatives might be preferable.