The Gauss-Markov theorem declares OLS the Best Linear Unbiased Estimator (BLUE). But what exactly makes an estimator 'best'? Why do we care about linearity and unbiasedness? And are there situations where being 'best' in the BLUE sense isn't actually what we want?
This page takes a deep dive into the BLUE property, examining each component of the acronym, why minimum variance is the right optimality criterion, the geometric reason OLS achieves it, and the tradeoffs between bias, variance, and mean squared error.
By the end of this page, you will understand the complete meaning of the BLUE acronym, grasp why minimum variance is the correct optimality criterion for unbiased estimators, see how OLS achieves minimum variance geometrically, and appreciate the fundamental tradeoffs between bias, variance, and mean squared error.
Let's examine each word in 'Best Linear Unbiased Estimator' with mathematical precision.
Estimator:
An estimator $\hat{\boldsymbol{\theta}}$ is a function of the observed data that produces an estimate of an unknown population parameter $\boldsymbol{\theta}$. In regression:
$$\hat{\boldsymbol{\beta}} = T(\mathbf{X}, \mathbf{y})$$
Where $T$ is some function (rule) mapping data to parameter estimates. Different choices of $T$ give different estimators—OLS is one particular choice among infinitely many.
Linear:
An estimator is linear if it can be expressed as:
$$\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$$
Where $\mathbf{A}$ is a $p \times n$ matrix that may depend on $\mathbf{X}$ but not on $\mathbf{y}$. This is a crucial restriction:
Linear estimators are computationally simple, have nice analytical properties, and are easy to analyze theoretically. The Gauss-Markov theorem says: within this tractable class, you cannot do better than OLS. This doesn't mean nonlinear estimators can't be better overall—they sometimes are!
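As a quick check (a minimal sketch; all variable names here are illustrative), the OLS weighting matrix $\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ depends only on $\mathbf{X}$, so $\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ is indeed linear in $\mathbf{y}$:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = rng.standard_normal(n)

# OLS as a linear map: A depends on X only, not on y
A = inv(X.T @ X) @ X.T          # p x n weighting matrix
beta_hat = A @ y                # linear in y

# Linearity check: the estimate of (y1 + y2) equals the sum of the estimates
y1, y2 = rng.standard_normal(n), rng.standard_normal(n)
print(np.allclose(A @ (y1 + y2), A @ y1 + A @ y2))  # True
```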
Unbiased:
An estimator is unbiased if its expected value equals the true parameter:
$$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$$
This must hold for all possible values of $\boldsymbol{\beta}$. Intuitively, if we could repeat the sampling process infinitely many times, the average of our estimates would hit the true value exactly.
Bias is defined as:
$$\text{Bias}(\hat{\boldsymbol{\beta}}) = \mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta}$$
An unbiased estimator has zero bias.
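A small Monte Carlo sketch of this property (seed and sizes are arbitrary choices): averaging OLS estimates across repeated samples, with the design held fixed, recovers the true coefficients up to simulation noise.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(1)
n, n_sims = 100, 20_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # fixed design
beta_true = np.array([1.0, 2.0])
A = inv(X.T @ X) @ X.T

# Average the OLS estimate over many independent error draws
estimates = np.array([A @ (X @ beta_true + 0.5 * rng.standard_normal(n))
                      for _ in range(n_sims)])
print(estimates.mean(axis=0))   # approximately [1.0, 2.0]: zero bias
```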
Best (Minimum Variance):
Among all estimators in the class being considered (linear and unbiased), 'best' means minimum variance. Specifically:
$$\text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) \preceq \text{Var}(\tilde{\boldsymbol{\beta}}) \quad \text{for all LUE } \tilde{\boldsymbol{\beta}}$$
The $\preceq$ notation means the difference is positive semi-definite: $\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) \succeq 0$.
Given that we've restricted to unbiased estimators (mean = truth), what's the natural way to rank them? The answer comes from considering Mean Squared Error (MSE).
MSE Decomposition:
For any estimator $\hat{\theta}$ of a scalar parameter $\theta$:
$$\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2$$
This fundamental identity shows that estimation error comes from two sources: the variance of the estimator around its own mean, and the squared bias of that mean relative to the true parameter.
For unbiased estimators, bias = 0, so:
$$\text{MSE}(\hat{\theta}_{unbiased}) = \text{Var}(\hat{\theta})$$
Thus, among unbiased estimators, minimum variance = minimum MSE. The 'best' unbiased estimator is simply the one with lowest variance.
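The decomposition is easy to verify numerically. A minimal sketch (the shrinkage estimator below is artificial, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0
samples = theta + rng.standard_normal((50_000, 25))   # 25 observations per replication

theta_hat = samples.mean(axis=1)        # unbiased sample mean
theta_shrunk = 0.8 * theta_hat          # biased shrinkage estimator

for name, est in [("mean", theta_hat), ("shrunk", theta_shrunk)]:
    mse = np.mean((est - theta) ** 2)
    var = est.var()
    bias2 = (est.mean() - theta) ** 2
    # The two columns agree: MSE = Var + Bias^2
    print(f"{name:6s}: MSE={mse:.4f}  Var+Bias^2={var + bias2:.4f}")
```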
For vector parameters $\boldsymbol{\beta}$, we compare variance-covariance matrices using the matrix ordering: $\mathbf{A} \preceq \mathbf{B}$ means $\mathbf{B} - \mathbf{A}$ is positive semi-definite. This ensures every scalar linear combination $\mathbf{c}^\top\hat{\boldsymbol{\beta}}$ also has minimum variance: $\text{Var}(\mathbf{c}^\top\hat{\boldsymbol{\beta}}_{OLS}) \leq \text{Var}(\mathbf{c}^\top\tilde{\boldsymbol{\beta}})$ for all $\mathbf{c}$ and all LUE $\tilde{\boldsymbol{\beta}}$.
Alternative Criterion: Mean Absolute Error
One might ask: why not minimize $\mathbb{E}[|\hat{\theta} - \theta|]$ instead? This leads to the median rather than the mean, and corresponding estimators.
| Criterion | Optimal Point Estimator | Regression Method |
|---|---|---|
| Minimize $\mathbb{E}[(\hat{\theta} - \theta)^2]$ | Mean | OLS |
| Minimize $\mathbb{E}[\lvert\hat{\theta} - \theta\rvert]$ | Median | Least absolute deviations (median regression) |
MSE (and thus variance for unbiased estimators) is preferred because it is analytically tractable, it decomposes cleanly into bias and variance, and it corresponds to the quadratic loss under which least squares is derived.
The Gauss-Markov theorem specifically addresses the MSE criterion for unbiased estimators.
Let's examine exactly what 'minimum variance' means in the matrix sense and prove a key result about linear combinations.
Individual Coefficients:
For each coefficient $\beta_j$, the OLS estimator $\hat{\beta}_j$ has variance:
$$\text{Var}(\hat{\beta}_j | \mathbf{X}) = \sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}$$
For any other linear unbiased estimator $\tilde{\beta}_j$:
$$\text{Var}(\tilde{\beta}_j | \mathbf{X}) \geq \text{Var}(\hat{\beta}_j | \mathbf{X})$$
Linear Combinations:
More generally, for any linear combination $\mathbf{c}^\top\boldsymbol{\beta}$ (e.g., a contrast, a prediction):
$$\text{Var}(\mathbf{c}^\top\hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X}) \leq \text{Var}(\mathbf{c}^\top\tilde{\boldsymbol{\beta}} | \mathbf{X})$$
for any linear unbiased estimator $\tilde{\boldsymbol{\beta}}$.
```python
import numpy as np
from numpy.linalg import inv

def demonstrate_minimum_variance():
    """
    Demonstrate that OLS has minimum variance among linear unbiased estimators
    by comparing to alternative (inefficient) linear unbiased estimators.
    """
    np.random.seed(42)
    n, p = 50, 2
    n_simulations = 5000

    # Fixed design matrix
    X = np.column_stack([np.ones(n), np.random.randn(n)])
    beta_true = np.array([1.0, 2.0])
    sigma = 0.5

    # OLS weighting matrix
    A_ols = inv(X.T @ X) @ X.T  # (X'X)^(-1) X'

    # Alternative linear unbiased estimator:
    # Add a perturbation D such that DX = 0 (maintains unbiasedness).
    # D adds weight to residual-like components.
    null_space = np.eye(n) - X @ inv(X.T @ X) @ X.T      # Projection onto col(X)-orthogonal space
    D = 0.2 * np.random.randn(p, n) @ null_space          # Random perturbation in that space

    A_alt = A_ols + D  # Alternative estimator: OLS weights + perturbation

    # Verify unbiasedness: A_alt @ X should equal I_p
    print("Checking unbiasedness:")
    print(f"  A_ols @ X = \n{A_ols @ X}")
    print(f"  A_alt @ X = \n{A_alt @ X}")

    # Simulate many datasets
    ols_estimates = []
    alt_estimates = []

    for _ in range(n_simulations):
        eps = sigma * np.random.randn(n)
        y = X @ beta_true + eps

        ols_estimates.append(A_ols @ y)
        alt_estimates.append(A_alt @ y)

    ols_estimates = np.array(ols_estimates)
    alt_estimates = np.array(alt_estimates)

    # Compare variances
    print("\nEmpirical Results (β₁ = slope coefficient):")
    print(f"  True β₁ = {beta_true[1]:.4f}")
    print(f"  OLS mean: {ols_estimates[:, 1].mean():.4f}, "
          f"variance: {ols_estimates[:, 1].var():.6f}")
    print(f"  Alt mean: {alt_estimates[:, 1].mean():.4f}, "
          f"variance: {alt_estimates[:, 1].var():.6f}")
    print(f"  Variance ratio (Alt/OLS): "
          f"{alt_estimates[:, 1].var() / ols_estimates[:, 1].var():.4f}")

    # Theoretical variance comparison
    var_ols_theory = sigma**2 * inv(X.T @ X)[1, 1]
    var_alt_theory = sigma**2 * (A_alt @ A_alt.T)[1, 1]
    print("\nTheoretical Variances:")
    print(f"  OLS: {var_ols_theory:.6f}")
    print(f"  Alt: {var_alt_theory:.6f}")

    return ols_estimates, alt_estimates

ols_est, alt_est = demonstrate_minimum_variance()
```

Matrix Positive Semi-Definiteness:
The statement $\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) \succeq 0$ is a matrix inequality. It means that every diagonal entry of the difference is nonnegative (each coefficient's variance is at least as large under the alternative), and more generally that $\mathbf{c}^\top[\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS})]\mathbf{c} \geq 0$ for every vector $\mathbf{c}$.
This is the strongest possible optimality statement—OLS is best not just for each coefficient, but for every possible linear function of coefficients.
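One way to see this concretely (a sketch reusing the perturbation idea from the simulation above; names are illustrative): compute both theoretical covariance matrices $\sigma^2\mathbf{A}\mathbf{A}^\top$ and check that the eigenvalues of their difference are all nonnegative.

```python
import numpy as np
from numpy.linalg import inv, eigvalsh

rng = np.random.default_rng(3)
n, p, sigma = 50, 2, 0.5
X = np.column_stack([np.ones(n), rng.standard_normal(n)])

A_ols = inv(X.T @ X) @ X.T
M = np.eye(n) - X @ A_ols                                # residual-maker: M X = 0
A_alt = A_ols + 0.2 * rng.standard_normal((p, n)) @ M    # still unbiased: A_alt X = I

V_ols = sigma**2 * (A_ols @ A_ols.T)                     # = sigma^2 (X'X)^{-1}
V_alt = sigma**2 * (A_alt @ A_alt.T)

# Positive semi-definiteness of the difference: all eigenvalues >= 0 (up to rounding)
print(eigvalsh(V_alt - V_ols))
```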
The minimum variance property has a beautiful geometric interpretation that illuminates why OLS is optimal.
The Space of Linear Unbiased Estimators:
Recall that any linear unbiased estimator has the form:
$$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{OLS} + \mathbf{D}\boldsymbol{\varepsilon}$$
Where $\mathbf{D}\mathbf{X} = \mathbf{0}$ (the unbiasedness constraint).
The rows of $\mathbf{D}$ must lie in the null space of $\mathbf{X}^\top$. This is the orthogonal complement of the column space of $\mathbf{X}$:
$$\text{col}(\mathbf{X})^\perp = \{\mathbf{v} \in \mathbb{R}^n : \mathbf{X}^\top\mathbf{v} = \mathbf{0}\}$$
Why Adding D Increases Variance:
The variance of the alternative estimator is:
$$\text{Var}(\tilde{\boldsymbol{\beta}}) = \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) + \sigma^2\mathbf{D}\mathbf{D}^\top$$
Geometrically, $\mathbf{D}\mathbf{D}^\top$ is a positive semi-definite matrix representing the 'extra' variance from the perturbation $\mathbf{D}\boldsymbol{\varepsilon}$.
Since $\mathbf{D}$ has rows in the null space of $\mathbf{X}^\top$ (orthogonal to the signal), $\mathbf{D}\boldsymbol{\varepsilon}$ captures only noise—it cannot contain information about $\boldsymbol{\beta}$!
OLS projects y onto the column space of X in a way that extracts ALL information about β. Any alternative estimator that adds a component from the null space is just adding noise—it cannot reduce variance, only inflate it. This is why OLS is optimal: it uses exactly the informative part of the data and nothing else.
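The algebra behind this decomposition is short. Writing the alternative estimator as $\tilde{\boldsymbol{\beta}} = (\mathbf{A}_{OLS} + \mathbf{D})\mathbf{y}$ with $\mathbf{A}_{OLS} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ and $\mathbf{D}\mathbf{X} = \mathbf{0}$:

$$
\begin{aligned}
\tilde{\boldsymbol{\beta}} &= (\mathbf{A}_{OLS} + \mathbf{D})(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \boldsymbol{\beta} + (\mathbf{A}_{OLS} + \mathbf{D})\boldsymbol{\varepsilon}, \\
\text{Var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) &= \sigma^2 (\mathbf{A}_{OLS} + \mathbf{D})(\mathbf{A}_{OLS} + \mathbf{D})^\top = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1} + \sigma^2 \mathbf{D}\mathbf{D}^\top,
\end{aligned}
$$

since the cross terms vanish: $\mathbf{A}_{OLS}\mathbf{D}^\top = (\mathbf{X}^\top\mathbf{X})^{-1}(\mathbf{D}\mathbf{X})^\top = \mathbf{0}$.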
Visualization in Simple Regression:
Consider simple linear regression: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
The column space of $\mathbf{X} = [\mathbf{1}, \mathbf{x}]$ is a 2-dimensional plane in $\mathbb{R}^n$. The OLS fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ are the orthogonal projection of $\mathbf{y}$ onto this plane.
The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to the plane—it contains no information about $\boldsymbol{\beta}$.
Alternative estimators that use components from the residual space are essentially using random noise to estimate $\boldsymbol{\beta}$—hence their increased variance.
The Fundamental Insight:
OLS is optimal because orthogonal projection extracts all signal and discards all noise. Any deviation from orthogonal projection necessarily introduces noise into the estimate.
The concept of efficiency formalizes how estimators compare in terms of variance.
Definition: Relative Efficiency
For two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of the same parameter:
$$\text{RE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}$$
If $\text{RE} > 1$, then $\hat{\theta}_1$ is more efficient (lower variance) than $\hat{\theta}_2$.
OLS Efficiency:
By Gauss-Markov, for any linear unbiased estimator $\tilde{\boldsymbol{\beta}}$:
$$\text{RE}(\hat{\boldsymbol{\beta}}_{OLS}, \tilde{\boldsymbol{\beta}}) \geq 1$$
with equality only when $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{OLS}$.
Efficiency Bound (Scalar Case):
For estimating a single coefficient $\beta_j$:
$$\text{Var}(\tilde{\beta}_j) \geq \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}$$
This is the efficiency bound for linear unbiased estimation. OLS achieves this bound exactly.
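As a concrete illustration of relative efficiency (a sketch; the 50% subsample is an arbitrary choice), compare the theoretical slope variance of full-sample OLS with OLS fit on only half the observations. Both estimators are linear and unbiased, but the subsample estimator discards information:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(4)
n, sigma = 100, 1.0
X = np.column_stack([np.ones(n), rng.standard_normal(n)])

# Theoretical slope variances: sigma^2 [(X'X)^{-1}]_{11}
var_full = sigma**2 * inv(X.T @ X)[1, 1]
X_half = X[: n // 2]                      # same model, half the data
var_half = sigma**2 * inv(X_half.T @ X_half)[1, 1]

# RE > 1: full-sample OLS is more efficient than the subsample estimator
print(f"RE(full, half) = {var_half / var_full:.2f}")
```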
| Estimator | Form | Unbiased? | Efficiency vs OLS |
|---|---|---|---|
| OLS | $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ | Yes | 100% (benchmark) |
| Equally-weighted | $(\mathbf{1}'\mathbf{X})^{-1}\mathbf{1}'\mathbf{y}$ (intercept only) | Yes if balanced | < 100% in general |
| Ridge (biased) | $(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$ | No | Not comparable (biased) |
| Random subsampling | OLS on subset | Yes | < 100% (throws away data) |
| GLS (correct weights) | $(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}$ | Yes | ≥ 100% (more efficient than OLS under heteroscedasticity; equal when $\boldsymbol{\Omega} = \mathbf{I}$) |
OLS is only the most efficient linear unbiased estimator when Gauss-Markov assumptions hold. Under heteroscedasticity, GLS can be more efficient. Under heavy-tailed errors, robust methods may have better practical performance. The theorem applies conditional on its assumptions.
The Gauss-Markov theorem provides a variance lower bound within the class of linear unbiased estimators. A related but more general result is the Cramér-Rao Lower Bound (CRLB), which provides a variance lower bound for all unbiased estimators (not just linear ones).
The Cramér-Rao Lower Bound:
Under regularity conditions, for any unbiased estimator $\hat{\theta}$ of $\theta$:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$
Where $I(\theta)$ is the Fisher information:
$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log f(\mathbf{y}; \theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log f(\mathbf{y}; \theta)}{\partial \theta^2}\right]$$
OLS and CRLB Under Gaussian Errors:
When $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, the Fisher information for $\boldsymbol{\beta}$ is:
$$I(\boldsymbol{\beta}) = \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{X}$$
The CRLB gives:
$$\text{Var}(\hat{\boldsymbol{\beta}}) \geq [I(\boldsymbol{\beta})]^{-1} = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$$
This is exactly the variance of OLS! Thus, under Gaussian errors:
OLS achieves the Cramér-Rao bound—it is the minimum variance unbiased estimator among ALL unbiased estimators, not just linear ones.
Under Gaussian errors, OLS is not just BLUE (best linear unbiased) but MVUE (Minimum Variance Unbiased Estimator) overall. No unbiased estimator—linear or nonlinear—can have lower variance than OLS when errors are normal.
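For completeness, here is a sketch of where the Fisher information above comes from, under $\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$ with $\sigma^2$ treated as known:

$$
\begin{aligned}
\log f(\mathbf{y}; \boldsymbol{\beta}) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \\
\frac{\partial \log f}{\partial \boldsymbol{\beta}} &= \frac{1}{\sigma^2}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \\
I(\boldsymbol{\beta}) = -\mathbb{E}\left[\frac{\partial^2 \log f}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}^\top}\right] &= \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{X}.
\end{aligned}
$$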
The Relationship Between Gauss-Markov and Cramér-Rao:
| Property | Gauss-Markov | Cramér-Rao |
|---|---|---|
| Class of estimators | Linear unbiased | All unbiased |
| Distributional assumption | None (beyond finite variance) | Regularity conditions |
| Bound achieved by OLS | Always (under GM assumptions) | Only under Gaussian errors |
| Bound formula | $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ | $[I(\boldsymbol{\beta})]^{-1}$ |
The Gauss-Markov theorem is more general (no distributional assumption) but more limited (only linear estimators). The Cramér-Rao bound is less general (needs smooth density) but more comprehensive (all estimators).
The Gauss-Markov theorem proves OLS is optimal among unbiased estimators. But should we always demand unbiasedness?
The Bias-Variance Tradeoff:
Recall the MSE decomposition:
$$\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2$$
An estimator with small bias but much smaller variance can have lower MSE than an unbiased estimator with high variance.
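A quick numerical illustration with made-up numbers: an unbiased estimator with variance $0.25$ has $\text{MSE} = 0.25$, while a competitor with bias $0.1$ and variance $0.04$ has $\text{MSE} = 0.04 + 0.1^2 = 0.05$, five times smaller despite the bias.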
Example: Ridge Regression
The Ridge estimator:
$$\hat{\boldsymbol{\beta}}_{Ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
is biased:
$$\mathbb{E}[\hat{\boldsymbol{\beta}}_{Ridge}] = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \neq \boldsymbol{\beta}$$
But its variance is lower:
$$\text{Var}(\hat{\boldsymbol{\beta}}_{Ridge}) = \sigma^2(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}$$
For appropriate $\lambda$, the MSE of Ridge can be lower than that of OLS, especially when the predictors are highly correlated, $p$ is large relative to $n$, or the true coefficients are small, as the simulation below illustrates.
```python
import numpy as np
from numpy.linalg import inv

def compare_ols_ridge_mse(n=50, p=40, true_beta_scale=0.1, sigma=1.0,
                          lambda_ridge=1.0, n_sims=1000):
    """
    Compare MSE of OLS vs Ridge regression.

    In high-dimensional settings with small true coefficients,
    Ridge often has lower MSE despite being biased.
    """
    mse_ols_list = []
    mse_ridge_list = []

    np.random.seed(42)

    # True coefficients (small)
    beta_true = np.random.randn(p) * true_beta_scale

    for _ in range(n_sims):
        # Generate data
        X = np.random.randn(n, p)
        eps = np.random.randn(n) * sigma
        y = X @ beta_true + eps

        # OLS estimator
        XtX = X.T @ X
        try:
            beta_ols = inv(XtX) @ X.T @ y
        except np.linalg.LinAlgError:
            continue  # Skip if singular

        # Ridge estimator
        beta_ridge = inv(XtX + lambda_ridge * np.eye(p)) @ X.T @ y

        # Compute MSE (parameter estimation error)
        mse_ols_list.append(np.mean((beta_ols - beta_true)**2))
        mse_ridge_list.append(np.mean((beta_ridge - beta_true)**2))

    avg_mse_ols = np.mean(mse_ols_list)
    avg_mse_ridge = np.mean(mse_ridge_list)

    print(f"Settings: n={n}, p={p}, λ={lambda_ridge}")
    print(f"Average MSE (OLS):   {avg_mse_ols:.4f}")
    print(f"Average MSE (Ridge): {avg_mse_ridge:.4f}")
    print(f"Ratio (OLS/Ridge):   {avg_mse_ols/avg_mse_ridge:.2f}x")
    print()

    return avg_mse_ols, avg_mse_ridge

# Low-dimensional: OLS wins
print("=== Low-dimensional setting ===")
compare_ols_ridge_mse(n=200, p=10)

# High-dimensional: Ridge wins
print("=== High-dimensional setting ===")
compare_ols_ridge_mse(n=50, p=40)
```

The BLUE concept extends to more general settings beyond standard regression.
Generalized Least Squares (GLS):
When errors have non-spherical covariance $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\boldsymbol{\Omega}$ (where $\boldsymbol{\Omega} \neq \mathbf{I}$), the Gauss-Markov theorem no longer applies to OLS.
The GLS estimator:
$$\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^\top\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\Omega}^{-1}\mathbf{y}$$
is BLUE when $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\boldsymbol{\Omega}$. GLS 'whitens' the error structure before applying least squares.
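A minimal sketch of the whitening idea under heteroscedasticity (variances treated as known; all names illustrative): transforming the data by $\boldsymbol{\Omega}^{-1/2}$ and running OLS on the transformed data gives the same answer as GLS/WLS.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 2.0])

omega = np.exp(X[:, 1])                      # known per-observation variances (diagonal of Omega)
y = X @ beta_true + np.sqrt(omega) * rng.standard_normal(n)

# GLS / WLS directly
W = np.diag(1.0 / omega)                     # Omega^{-1}
beta_gls = inv(X.T @ W @ X) @ X.T @ W @ y

# Equivalent: whiten by Omega^{-1/2}, then ordinary least squares
Xw = X / np.sqrt(omega)[:, None]
yw = y / np.sqrt(omega)
beta_whitened = inv(Xw.T @ Xw) @ Xw.T @ yw

print(np.allclose(beta_gls, beta_whitened))  # True: whitening + OLS equals GLS
```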
Special Cases of GLS:
| Error Structure | $\boldsymbol{\Omega}$ Form | Name | Application |
|---|---|---|---|
| Heteroscedasticity | $\text{diag}(\omega_1, ..., \omega_n)$ | WLS (Weighted LS) | Different variance per obs |
| AR(1) errors | Toeplitz with decay | FGLS | Time series |
| Grouped data | Block diagonal | Clustered | Panel data |
Best Linear Unbiased Prediction (BLUP):
In mixed effects models with random effects $\mathbf{u}$:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$$
We want to estimate the fixed effects $\boldsymbol{\beta}$ and predict the random effects $\mathbf{u}$.
The BLUP of $\mathbf{u}$ minimizes prediction error variance among linear unbiased predictors—an extension of BLUE principles to prediction.
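Concretely (stated here as a sketch of the standard result, with $\mathbf{G} = \text{Var}(\mathbf{u})$, $\mathbf{R} = \text{Var}(\boldsymbol{\varepsilon})$, and $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}$ assumed known), the BLUP takes the form

$$\hat{\mathbf{u}} = \mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{GLS}),$$

where $\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{y}$ is itself the BLUE of the fixed effects.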
Aitken Theorem:
The result that GLS is BLUE under correlated errors is called the Aitken theorem (or generalized Gauss-Markov theorem):
Under the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\boldsymbol{\Omega}$, the GLS estimator is BLUE.
In practice, Ω is unknown and must be estimated. The resulting estimator, FGLS (Feasible GLS), uses an estimated Ω̂. FGLS is not exactly BLUE but has the same asymptotic properties as GLS under mild conditions.
Understanding the BLUE property has direct practical consequences for regression analysis.
When to Rely on OLS Optimality:
OLS optimality is a safe default when the Gauss-Markov conditions are plausible: the linear model is correctly specified, the errors have zero conditional mean, are homoscedastic and uncorrelated, and the sample size comfortably exceeds the number of predictors.
Common Mistakes Related to BLUE:
❌ Assuming OLS is always best — It's only BLUE under specific conditions
❌ Ignoring the 'linear' restriction — Nonlinear estimators can beat OLS in MSE
❌ Confusing unbiasedness with accuracy — An unbiased estimator with huge variance can be practically useless
❌ Over-relying on OLS in high dimensions — When p ≈ n, regularization typically dominates
❌ Forgetting that BLUE is conditional on X — The comparison is among estimators for the same design matrix
We've thoroughly explored what it means for OLS to be the Best Linear Unbiased Estimator. Here are the key insights:
- 'Best' means minimum variance, which for unbiased estimators is the same as minimum MSE.
- The comparison is in the matrix (positive semi-definite) sense, so OLS is optimal for every linear combination of coefficients.
- Geometrically, OLS wins because orthogonal projection uses all of the signal in $\mathbf{y}$ and none of the noise.
- Under Gaussian errors, OLS attains the Cramér-Rao bound and is MVUE among all unbiased estimators, not just linear ones.
- Biased estimators such as Ridge can achieve lower MSE, and GLS is the BLUE when errors are non-spherical.
What's Next:
With the theoretical foundation of OLS optimality established, we now turn to practical inference. The next page examines the variance of OLS estimates in detail—how to estimate σ², construct standard errors, and understand what affects estimation precision.
You now have a complete understanding of the BLUE property—what it means, why it matters, when it applies, and what alternatives exist when its conditions don't hold. This knowledge is fundamental for correctly applying and interpreting regression analysis.