The Gauss-Markov theorem declares OLS the Best Linear Unbiased Estimator (BLUE). But what exactly makes an estimator 'best'? Why do we care about linearity and unbiasedness? And are there situations where being 'best' in the BLUE sense isn't actually what we want?
This page takes a deep dive into the BLUE property, examining each component of the acronym, why minimum variance is the right optimality criterion, the geometric reason OLS achieves it, and the tradeoffs between bias, variance, and mean squared error.
By the end of this page, you will understand the complete meaning of the BLUE acronym, grasp why minimum variance is the correct optimality criterion for unbiased estimators, see how OLS achieves minimum variance geometrically, and appreciate the fundamental tradeoffs between bias, variance, and mean squared error.
Let's examine each word in 'Best Linear Unbiased Estimator' with mathematical precision.
Estimator:
An estimator $\hat{\boldsymbol{\theta}}$ is a function of the observed data that produces an estimate of an unknown population parameter $\boldsymbol{\theta}$. In regression:
$$\hat{\boldsymbol{\beta}} = T(\mathbf{X}, \mathbf{y})$$
Where $T$ is some function (rule) mapping data to parameter estimates. Different choices of $T$ give different estimators—OLS is one particular choice among infinitely many.
Linear:
An estimator is linear if it can be expressed as:
$$\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$$
Where $\mathbf{A}$ is a $p \times n$ matrix that may depend on $\mathbf{X}$ but not on $\mathbf{y}$. This is a crucial restriction:
Linear estimators are computationally simple, have nice analytical properties, and are easy to analyze theoretically. The Gauss-Markov theorem says: within this tractable class, you cannot do better than OLS. This doesn't mean nonlinear estimators can't be better overall—they sometimes are!
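As a quick check (a minimal sketch; all variable names here are illustrative), the OLS weighting matrix $\mathbf{A} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ depends only on $\mathbf{X}$, so $\hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}$ is indeed linear in $\mathbf{y}$:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = rng.standard_normal(n)

# OLS as a linear map: A depends on X only, not on y
A = inv(X.T @ X) @ X.T          # p x n weighting matrix
beta_hat = A @ y                # linear in y

# Linearity check: the estimate of (y1 + y2) equals the sum of the estimates
y1, y2 = rng.standard_normal(n), rng.standard_normal(n)
print(np.allclose(A @ (y1 + y2), A @ y1 + A @ y2))  # True
```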
Unbiased:
An estimator is unbiased if its expected value equals the true parameter:
$$\mathbb{E}[\hat{\boldsymbol{\beta}} | \mathbf{X}] = \boldsymbol{\beta}$$
This must hold for all possible values of $\boldsymbol{\beta}$. Intuitively, if we could repeat the sampling process infinitely many times, the average of our estimates would hit the true value exactly.
Bias is defined as:
$$\text{Bias}(\hat{\boldsymbol{\beta}}) = \mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta}$$
An unbiased estimator has zero bias.
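A small Monte Carlo sketch of this property (seed and sizes are arbitrary choices): averaging OLS estimates across repeated samples, with the design held fixed, recovers the true coefficients up to simulation noise.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(1)
n, n_sims = 100, 20_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # fixed design
beta_true = np.array([1.0, 2.0])
A = inv(X.T @ X) @ X.T

# Average the OLS estimate over many independent error draws
estimates = np.array([A @ (X @ beta_true + 0.5 * rng.standard_normal(n))
                      for _ in range(n_sims)])
print(estimates.mean(axis=0))   # approximately [1.0, 2.0]: zero bias
```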
Best (Minimum Variance):
Among all estimators in the class being considered (linear and unbiased), 'best' means minimum variance. Specifically:
$$\text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) \preceq \text{Var}(\tilde{\boldsymbol{\beta}}) \quad \text{for all LUE } \tilde{\boldsymbol{\beta}}$$
The $\preceq$ notation means the difference is positive semi-definite: $\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) \succeq 0$.
Given that we've restricted to unbiased estimators (mean = truth), what's the natural way to rank them? The answer comes from considering Mean Squared Error (MSE).
MSE Decomposition:
For any estimator $\hat{\theta}$ of a scalar parameter $\theta$:
$$\text{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2$$
This fundamental identity shows that estimation error comes from two sources: the variance of the estimator around its own mean, and the squared bias of that mean relative to the true parameter.
For unbiased estimators, bias = 0, so:
$$\text{MSE}(\hat{\theta}_{unbiased}) = \text{Var}(\hat{\theta})$$
Thus, among unbiased estimators, minimum variance = minimum MSE. The 'best' unbiased estimator is simply the one with lowest variance.
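The decomposition is easy to verify numerically. A minimal sketch (the shrinkage estimator below is artificial, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0
samples = theta + rng.standard_normal((50_000, 25))   # 25 observations per replication

theta_hat = samples.mean(axis=1)        # unbiased sample mean
theta_shrunk = 0.8 * theta_hat          # biased shrinkage estimator

for name, est in [("mean", theta_hat), ("shrunk", theta_shrunk)]:
    mse = np.mean((est - theta) ** 2)
    var = est.var()
    bias2 = (est.mean() - theta) ** 2
    # The two columns agree: MSE = Var + Bias^2
    print(f"{name:6s}: MSE={mse:.4f}  Var+Bias^2={var + bias2:.4f}")
```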
For vector parameters $\boldsymbol{\beta}$, we compare variance-covariance matrices using the matrix ordering: $\mathbf{A} \preceq \mathbf{B}$ means $\mathbf{B} - \mathbf{A}$ is positive semi-definite. This ensures every scalar linear combination $\mathbf{c}^\top\hat{\boldsymbol{\beta}}$ also has minimum variance: $\text{Var}(\mathbf{c}^\top\hat{\boldsymbol{\beta}}_{OLS}) \leq \text{Var}(\mathbf{c}^\top\tilde{\boldsymbol{\beta}})$ for all $\mathbf{c}$ and all LUE $\tilde{\boldsymbol{\beta}}$.
Alternative Criterion: Mean Absolute Error
One might ask: why not minimize $\mathbb{E}[|\hat{\theta} - \theta|]$ instead? This leads to the median rather than the mean, and corresponding estimators.
| Criterion | Optimal Point Estimator | Regression Method |
|---|---|---|
| Minimize $\mathbb{E}[(\hat{\theta} - \theta)^2]$ | Mean | OLS |
| Minimize $\mathbb{E}[\lvert\hat{\theta} - \theta\rvert]$ | Median | Least absolute deviations (median regression) |
MSE (and thus variance for unbiased estimators) is preferred because it is analytically tractable, it decomposes cleanly into bias and variance, and it corresponds to the quadratic loss under which least squares is derived.
The Gauss-Markov theorem specifically addresses the MSE criterion for unbiased estimators.
Let's examine exactly what 'minimum variance' means in the matrix sense and prove a key result about linear combinations.
Individual Coefficients:
For each coefficient $\beta_j$, the OLS estimator $\hat{\beta}_j$ has variance:
$$\text{Var}(\hat{\beta}_j | \mathbf{X}) = \sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}$$
For any other linear unbiased estimator $\tilde{\beta}_j$:
$$\text{Var}(\tilde{\beta}_j | \mathbf{X}) \geq \text{Var}(\hat{\beta}_j | \mathbf{X})$$
Linear Combinations:
More generally, for any linear combination $\mathbf{c}^\top\boldsymbol{\beta}$ (e.g., a contrast, a prediction):
$$\text{Var}(\mathbf{c}^\top\hat{\boldsymbol{\beta}}_{OLS} | \mathbf{X}) \leq \text{Var}(\mathbf{c}^\top\tilde{\boldsymbol{\beta}} | \mathbf{X})$$
for any linear unbiased estimator $\tilde{\boldsymbol{\beta}}$.
```python
import numpy as np
from numpy.linalg import inv

def demonstrate_minimum_variance():
    """
    Demonstrate that OLS has minimum variance among linear unbiased estimators
    by comparing to alternative (inefficient) linear unbiased estimators.
    """
    np.random.seed(42)
    n, p = 50, 2
    n_simulations = 5000

    # Fixed design matrix
    X = np.column_stack([np.ones(n), np.random.randn(n)])
    beta_true = np.array([1.0, 2.0])
    sigma = 0.5

    # OLS weighting matrix
    A_ols = inv(X.T @ X) @ X.T  # (X'X)^(-1) X'

    # Alternative linear unbiased estimator:
    # Add a perturbation D such that DX = 0 (maintains unbiasedness).
    # D adds weight to residual-like components.
    null_space = np.eye(n) - X @ inv(X.T @ X) @ X.T      # Projection onto col(X)-orthogonal space
    D = 0.2 * np.random.randn(p, n) @ null_space          # Random perturbation in that space

    A_alt = A_ols + D  # Alternative estimator: OLS weights + perturbation

    # Verify unbiasedness: A_alt @ X should equal I_p
    print("Checking unbiasedness:")
    print(f"  A_ols @ X = \n{A_ols @ X}")
    print(f"  A_alt @ X = \n{A_alt @ X}")

    # Simulate many datasets
    ols_estimates = []
    alt_estimates = []

    for _ in range(n_simulations):
        eps = sigma * np.random.randn(n)
        y = X @ beta_true + eps

        ols_estimates.append(A_ols @ y)
        alt_estimates.append(A_alt @ y)

    ols_estimates = np.array(ols_estimates)
    alt_estimates = np.array(alt_estimates)

    # Compare variances
    print("\nEmpirical Results (β₁ = slope coefficient):")
    print(f"  True β₁ = {beta_true[1]:.4f}")
    print(f"  OLS mean: {ols_estimates[:, 1].mean():.4f}, "
          f"variance: {ols_estimates[:, 1].var():.6f}")
    print(f"  Alt mean: {alt_estimates[:, 1].mean():.4f}, "
          f"variance: {alt_estimates[:, 1].var():.6f}")
    print(f"  Variance ratio (Alt/OLS): "
          f"{alt_estimates[:, 1].var() / ols_estimates[:, 1].var():.4f}")

    # Theoretical variance comparison
    var_ols_theory = sigma**2 * inv(X.T @ X)[1, 1]
    var_alt_theory = sigma**2 * (A_alt @ A_alt.T)[1, 1]
    print("\nTheoretical Variances:")
    print(f"  OLS: {var_ols_theory:.6f}")
    print(f"  Alt: {var_alt_theory:.6f}")

    return ols_estimates, alt_estimates

ols_est, alt_est = demonstrate_minimum_variance()
```

Matrix Positive Semi-Definiteness:
The statement $\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) \succeq 0$ is a matrix inequality. It means that every diagonal entry of the difference is nonnegative (each coefficient's variance is at least as large under the alternative), and more generally that $\mathbf{c}^\top[\text{Var}(\tilde{\boldsymbol{\beta}}) - \text{Var}(\hat{\boldsymbol{\beta}}_{OLS})]\mathbf{c} \geq 0$ for every vector $\mathbf{c}$.
This is the strongest possible optimality statement—OLS is best not just for each coefficient, but for every possible linear function of coefficients.
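One way to see this concretely (a sketch reusing the perturbation idea from the simulation above; names are illustrative): compute both theoretical covariance matrices $\sigma^2\mathbf{A}\mathbf{A}^\top$ and check that the eigenvalues of their difference are all nonnegative.

```python
import numpy as np
from numpy.linalg import inv, eigvalsh

rng = np.random.default_rng(3)
n, p, sigma = 50, 2, 0.5
X = np.column_stack([np.ones(n), rng.standard_normal(n)])

A_ols = inv(X.T @ X) @ X.T
M = np.eye(n) - X @ A_ols                                # residual-maker: M X = 0
A_alt = A_ols + 0.2 * rng.standard_normal((p, n)) @ M    # still unbiased: A_alt X = I

V_ols = sigma**2 * (A_ols @ A_ols.T)                     # = sigma^2 (X'X)^{-1}
V_alt = sigma**2 * (A_alt @ A_alt.T)

# Positive semi-definiteness of the difference: all eigenvalues >= 0 (up to rounding)
print(eigvalsh(V_alt - V_ols))
```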
The minimum variance property has a beautiful geometric interpretation that illuminates why OLS is optimal.
The Space of Linear Unbiased Estimators:
Recall that any linear unbiased estimator has the form:
$$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{OLS} + \mathbf{D}\boldsymbol{\varepsilon}$$
Where $\mathbf{D}\mathbf{X} = \mathbf{0}$ (the unbiasedness constraint).
The rows of $\mathbf{D}$ must lie in the null space of $\mathbf{X}^\top$. This is the orthogonal complement of the column space of $\mathbf{X}$:
$$\text{col}(\mathbf{X})^\perp = \{\mathbf{v} \in \mathbb{R}^n : \mathbf{X}^\top\mathbf{v} = \mathbf{0}\}$$
Why Adding D Increases Variance:
The variance of the alternative estimator is:
$$\text{Var}(\tilde{\boldsymbol{\beta}}) = \text{Var}(\hat{\boldsymbol{\beta}}_{OLS}) + \sigma^2\mathbf{D}\mathbf{D}^\top$$
Geometrically, $\mathbf{D}\mathbf{D}^\top$ is a positive semi-definite matrix representing the 'extra' variance from the perturbation $\mathbf{D}\boldsymbol{\varepsilon}$.
Since $\mathbf{D}$ has rows in the null space of $\mathbf{X}^\top$ (orthogonal to the signal), $\mathbf{D}\boldsymbol{\varepsilon}$ captures only noise—it cannot contain information about $\boldsymbol{\beta}$!
OLS projects y onto the column space of X in a way that extracts ALL information about β. Any alternative estimator that adds a component from the null space is just adding noise—it cannot reduce variance, only inflate it. This is why OLS is optimal: it uses exactly the informative part of the data and nothing else.
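The algebra behind this decomposition is short. Writing the alternative estimator as $\tilde{\boldsymbol{\beta}} = (\mathbf{A}_{OLS} + \mathbf{D})\mathbf{y}$ with $\mathbf{A}_{OLS} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ and $\mathbf{D}\mathbf{X} = \mathbf{0}$:

$$
\begin{aligned}
\tilde{\boldsymbol{\beta}} &= (\mathbf{A}_{OLS} + \mathbf{D})(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \boldsymbol{\beta} + (\mathbf{A}_{OLS} + \mathbf{D})\boldsymbol{\varepsilon}, \\
\text{Var}(\tilde{\boldsymbol{\beta}} \mid \mathbf{X}) &= \sigma^2 (\mathbf{A}_{OLS} + \mathbf{D})(\mathbf{A}_{OLS} + \mathbf{D})^\top = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1} + \sigma^2 \mathbf{D}\mathbf{D}^\top,
\end{aligned}
$$

since the cross terms vanish: $\mathbf{A}_{OLS}\mathbf{D}^\top = (\mathbf{X}^\top\mathbf{X})^{-1}(\mathbf{D}\mathbf{X})^\top = \mathbf{0}$.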
Visualization in Simple Regression:
Consider simple linear regression: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
The column space of $\mathbf{X} = [\mathbf{1}, \mathbf{x}]$ is a 2-dimensional plane in $\mathbb{R}^n$. The OLS fitted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ are the orthogonal projection of $\mathbf{y}$ onto this plane.
The residual vector $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to the plane—it contains no information about $\boldsymbol{\beta}$.
Alternative estimators that use components from the residual space are essentially using random noise to estimate $\boldsymbol{\beta}$—hence their increased variance.
The Fundamental Insight:
OLS is optimal because orthogonal projection extracts all signal and discards all noise. Any deviation from orthogonal projection necessarily introduces noise into the estimate.
The concept of efficiency formalizes how estimators compare in terms of variance.
Definition: Relative Efficiency
For two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of the same parameter:
$$\text{RE}(\hat{\theta}_1, \hat{\theta}_2) = \frac{\text{Var}(\hat{\theta}_2)}{\text{Var}(\hat{\theta}_1)}$$
If $\text{RE} > 1$, then $\hat{\theta}_1$ is more efficient (lower variance) than $\hat{\theta}_2$.
OLS Efficiency:
By Gauss-Markov, for any linear unbiased estimator $\tilde{\boldsymbol{\beta}}$:
$$\text{RE}(\hat{\boldsymbol{\beta}}_{OLS}, \tilde{\boldsymbol{\beta}}) \geq 1$$
with equality only when $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}_{OLS}$.
Efficiency Bound (Scalar Case):
For estimating a single coefficient $\beta_j$:
$$\text{Var}(\tilde{\beta}_j) \geq \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}$$
This is the efficiency bound for linear unbiased estimation. OLS achieves this bound exactly.
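As a concrete illustration of relative efficiency (a sketch; the 50% subsample is an arbitrary choice), compare the theoretical slope variance of full-sample OLS with OLS fit on only half the observations. Both estimators are linear and unbiased, but the subsample estimator discards information:

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(4)
n, sigma = 100, 1.0
X = np.column_stack([np.ones(n), rng.standard_normal(n)])

# Theoretical slope variances: sigma^2 [(X'X)^{-1}]_{11}
var_full = sigma**2 * inv(X.T @ X)[1, 1]
X_half = X[: n // 2]                      # same model, half the data
var_half = sigma**2 * inv(X_half.T @ X_half)[1, 1]

# RE > 1: full-sample OLS is more efficient than the subsample estimator
print(f"RE(full, half) = {var_half / var_full:.2f}")
```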
| Estimator | Form | Unbiased? | Efficiency vs OLS |
|---|---|---|---|
| OLS | $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ | Yes | 100% (benchmark) |
| Equally-weighted | $(\mathbf{1}'\mathbf{X})^{-1}\mathbf{1}'\mathbf{y}$ (intercept only) | Yes if balanced | < 100% in general |
| Ridge (biased) | $(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$ | No | Not comparable (biased) |
| Random subsampling | OLS on subset | Yes | < 100% (throws away data) |
| GLS (correct weights) | $(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}$ | Yes | ≥ 100% (more efficient than OLS under heteroscedasticity; equal when $\boldsymbol{\Omega} = \mathbf{I}$) |
OLS is only the most efficient linear unbiased estimator when Gauss-Markov assumptions hold. Under heteroscedasticity, GLS can be more efficient. Under heavy-tailed errors, robust methods may have better practical performance. The theorem applies conditional on its assumptions.
The Gauss-Markov theorem provides a variance lower bound within the class of linear unbiased estimators. A related but more general result is the Cramér-Rao Lower Bound (CRLB), which provides a variance lower bound for all unbiased estimators (not just linear ones).
The Cramér-Rao Lower Bound:
Under regularity conditions, for any unbiased estimator $\hat{\theta}$ of $\theta$:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{I(\theta)}$$
Where $I(\theta)$ is the Fisher information:
$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial \log f(\mathbf{y}; \theta)}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \log f(\mathbf{y}; \theta)}{\partial \theta^2}\right]$$
OLS and CRLB Under Gaussian Errors:
When $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, the Fisher information for $\boldsymbol{\beta}$ is:
$$I(\boldsymbol{\beta}) = \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{X}$$
The CRLB gives:
$$\text{Var}(\hat{\boldsymbol{\beta}}) \geq [I(\boldsymbol{\beta})]^{-1} = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$$
This is exactly the variance of OLS! Thus, under Gaussian errors:
OLS achieves the Cramér-Rao bound—it is the minimum variance unbiased estimator among ALL unbiased estimators, not just linear ones.
Under Gaussian errors, OLS is not just BLUE (best linear unbiased) but MVUE (Minimum Variance Unbiased Estimator) overall. No unbiased estimator—linear or nonlinear—can have lower variance than OLS when errors are normal.
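For completeness, here is a sketch of where the Fisher information above comes from, under $\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$ with $\sigma^2$ treated as known:

$$
\begin{aligned}
\log f(\mathbf{y}; \boldsymbol{\beta}) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \\
\frac{\partial \log f}{\partial \boldsymbol{\beta}} &= \frac{1}{\sigma^2}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \\
I(\boldsymbol{\beta}) = -\mathbb{E}\left[\frac{\partial^2 \log f}{\partial \boldsymbol{\beta}\,\partial \boldsymbol{\beta}^\top}\right] &= \frac{1}{\sigma^2}\mathbf{X}^\top\mathbf{X}.
\end{aligned}
$$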
The Relationship Between Gauss-Markov and Cramér-Rao:
| Property | Gauss-Markov | Cramér-Rao |
|---|---|---|
| Class of estimators | Linear unbiased | All unbiased |
| Distributional assumption | None (beyond finite variance) | Regularity conditions |
| Bound achieved by OLS | Always (under GM assumptions) | Only under Gaussian errors |
| Bound formula | $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ | $[I(\boldsymbol{\beta})]^{-1}$ |
The Gauss-Markov theorem is more general (no distributional assumption) but more limited (only linear estimators). The Cramér-Rao bound is less general (needs smooth density) but more comprehensive (all estimators).
The Gauss-Markov theorem proves OLS is optimal among unbiased estimators. But should we always demand unbiasedness?
The Bias-Variance Tradeoff:
Recall the MSE decomposition:
$$\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2$$
An estimator with small bias but much smaller variance can have lower MSE than an unbiased estimator with high variance.
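A quick numerical illustration with made-up numbers: an unbiased estimator with variance $0.25$ has $\text{MSE} = 0.25$, while a competitor with bias $0.1$ and variance $0.04$ has $\text{MSE} = 0.04 + 0.1^2 = 0.05$, five times smaller despite the bias.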
Example: Ridge Regression
The Ridge estimator:
$$\hat{\boldsymbol{\beta}}_{Ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
is biased:
$$\mathbb{E}[\hat{\boldsymbol{\beta}}_{Ridge}] = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \neq \boldsymbol{\beta}$$
But its variance is lower:
$$\text{Var}(\hat{\boldsymbol{\beta}}_{Ridge}) = \sigma^2(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}$$
For appropriate $\lambda$, the MSE of Ridge can be lower than that of OLS, especially when the predictors are highly correlated, $p$ is large relative to $n$, or the true coefficients are small, as the simulation below illustrates.
```python
import numpy as np
from numpy.linalg import inv

def compare_ols_ridge_mse(n=50, p=40, true_beta_scale=0.1, sigma=1.0,
                          lambda_ridge=1.0, n_sims=1000):
    """
    Compare MSE of OLS vs Ridge regression.

    In high-dimensional settings with small true coefficients,
    Ridge often has lower MSE despite being biased.
    """
    mse_ols_list = []
    mse_ridge_list = []

    np.random.seed(42)

    # True coefficients (small)
    beta_true = np.random.randn(p) * true_beta_scale

    for _ in range(n_sims):
        # Generate data
        X = np.random.randn(n, p)
        eps = np.random.randn(n) * sigma
        y = X @ beta_true + eps

        # OLS estimator
        XtX = X.T @ X
        try:
            beta_ols = inv(XtX) @ X.T @ y
        except np.linalg.LinAlgError:
            continue  # Skip if singular

        # Ridge estimator
        beta_ridge = inv(XtX + lambda_ridge * np.eye(p)) @ X.T @ y

        # Compute MSE (parameter estimation error)
        mse_ols_list.append(np.mean((beta_ols - beta_true)**2))
        mse_ridge_list.append(np.mean((beta_ridge - beta_true)**2))

    avg_mse_ols = np.mean(mse_ols_list)
    avg_mse_ridge = np.mean(mse_ridge_list)

    print(f"Settings: n={n}, p={p}, λ={lambda_ridge}")
    print(f"Average MSE (OLS):   {avg_mse_ols:.4f}")
    print(f"Average MSE (Ridge): {avg_mse_ridge:.4f}")
    print(f"Ratio (OLS/Ridge):   {avg_mse_ols/avg_mse_ridge:.2f}x")
    print()

    return avg_mse_ols, avg_mse_ridge

# Low-dimensional: OLS wins
print("=== Low-dimensional setting ===")
compare_ols_ridge_mse(n=200, p=10)

# High-dimensional: Ridge wins
print("=== High-dimensional setting ===")
compare_ols_ridge_mse(n=50, p=40)
```

The BLUE concept extends to more general settings beyond standard regression.
Generalized Least Squares (GLS):
When errors have non-spherical covariance $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\boldsymbol{\Omega}$ (where $\boldsymbol{\Omega} \neq \mathbf{I}$), the Gauss-Markov theorem no longer applies to OLS.
The GLS estimator:
$$\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^\top\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\Omega}^{-1}\mathbf{y}$$
is BLUE when $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\boldsymbol{\Omega}$. GLS 'whitens' the error structure before applying least squares.
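A minimal sketch of the whitening idea under heteroscedasticity (variances treated as known; all names illustrative): transforming the data by $\boldsymbol{\Omega}^{-1/2}$ and running OLS on the transformed data gives the same answer as GLS/WLS.

```python
import numpy as np
from numpy.linalg import inv

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 2.0])

omega = np.exp(X[:, 1])                      # known per-observation variances (diagonal of Omega)
y = X @ beta_true + np.sqrt(omega) * rng.standard_normal(n)

# GLS / WLS directly
W = np.diag(1.0 / omega)                     # Omega^{-1}
beta_gls = inv(X.T @ W @ X) @ X.T @ W @ y

# Equivalent: whiten by Omega^{-1/2}, then ordinary least squares
Xw = X / np.sqrt(omega)[:, None]
yw = y / np.sqrt(omega)
beta_whitened = inv(Xw.T @ Xw) @ Xw.T @ yw

print(np.allclose(beta_gls, beta_whitened))  # True: whitening + OLS equals GLS
```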
Special Cases of GLS:
| Error Structure | $\boldsymbol{\Omega}$ Form | Name | Application |
|---|---|---|---|
| Heteroscedasticity | $\text{diag}(\omega_1, ..., \omega_n)$ | WLS (Weighted LS) | Different variance per obs |
| AR(1) errors | Toeplitz with decay | FGLS | Time series |
| Grouped data | Block diagonal | Clustered | Panel data |
Best Linear Unbiased Prediction (BLUP):
In mixed effects models with random effects $\mathbf{u}$:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\varepsilon}$$
We want to estimate the fixed effects $\boldsymbol{\beta}$ and predict the random effects $\mathbf{u}$.
The BLUP of $\mathbf{u}$ minimizes prediction error variance among linear unbiased predictors—an extension of BLUE principles to prediction.
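Concretely (stated here as a sketch of the standard result, with $\mathbf{G} = \text{Var}(\mathbf{u})$, $\mathbf{R} = \text{Var}(\boldsymbol{\varepsilon})$, and $\mathbf{V} = \mathbf{Z}\mathbf{G}\mathbf{Z}^\top + \mathbf{R}$ assumed known), the BLUP takes the form

$$\hat{\mathbf{u}} = \mathbf{G}\mathbf{Z}^\top\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{GLS}),$$

where $\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{y}$ is itself the BLUE of the fixed effects.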
Aitken Theorem:
The result that GLS is BLUE under correlated errors is called the Aitken theorem (or generalized Gauss-Markov theorem):
Under the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\mathbb{E}[\boldsymbol{\varepsilon}] = \mathbf{0}$ and $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\boldsymbol{\Omega}$, the GLS estimator is BLUE.
In practice, Ω is unknown and must be estimated. The resulting estimator, FGLS (Feasible GLS), uses an estimated Ω̂. FGLS is not exactly BLUE but has the same asymptotic properties as GLS under mild conditions.
Understanding the BLUE property has direct practical consequences for regression analysis.
When to Rely on OLS Optimality:
OLS optimality is a safe default when the Gauss-Markov conditions are plausible: the linear model is correctly specified, the errors have zero conditional mean, are homoscedastic and uncorrelated, and the sample size comfortably exceeds the number of predictors.
Common Mistakes Related to BLUE:
❌ Assuming OLS is always best — It's only BLUE under specific conditions
❌ Ignoring the 'linear' restriction — Nonlinear estimators can beat OLS in MSE
❌ Confusing unbiasedness with accuracy — An unbiased estimator with huge variance can be practically useless
❌ Over-relying on OLS in high dimensions — When p ≈ n, regularization typically dominates
❌ Forgetting that BLUE is conditional on X — The comparison is among estimators for the same design matrix
We've thoroughly explored what it means for OLS to be the Best Linear Unbiased Estimator. Here are the key insights:
- 'Best' means minimum variance, which for unbiased estimators is the same as minimum MSE.
- The comparison is in the matrix (positive semi-definite) sense, so OLS is optimal for every linear combination of coefficients.
- Geometrically, OLS wins because orthogonal projection uses all of the signal in $\mathbf{y}$ and none of the noise.
- Under Gaussian errors, OLS attains the Cramér-Rao bound and is MVUE among all unbiased estimators, not just linear ones.
- Biased estimators such as Ridge can achieve lower MSE, and GLS is the BLUE when errors are non-spherical.
What's Next:
With the theoretical foundation of OLS optimality established, we now turn to practical inference. The next page examines the variance of OLS estimates in detail—how to estimate σ², construct standard errors, and understand what affects estimation precision.
You now have a complete understanding of the BLUE property—what it means, why it matters, when it applies, and what alternatives exist when its conditions don't hold. This knowledge is fundamental for correctly applying and interpreting regression analysis.