Point estimates alone tell an incomplete story. When we report $\hat{\beta}_1 = 2.5$, the natural question is: how confident are we in this value? Could the true coefficient plausibly be 1.0? Or 4.0? Or even negative?
The variance of OLS estimates and the resulting standard errors quantify this uncertainty. They form the foundation for confidence intervals, hypothesis tests, and prediction intervals.
By the end of this page, you will be able to derive the variance-covariance matrix of OLS coefficients, compute and interpret standard errors, understand what factors affect estimation precision, estimate the error variance σ² from data, and recognize when standard error formulas might be invalid.
The central result for OLS inference is the variance-covariance matrix of the coefficient estimates.
Derivation:
Recall from the Gauss-Markov discussion that:
$$\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$$
Since $\boldsymbol{\beta}$ is a constant, the variability in $\hat{\boldsymbol{\beta}}$ comes entirely from $\boldsymbol{\varepsilon}$:
$$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$$
Taking the variance (conditional on $\mathbf{X}$):
$$\begin{align} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \text{Var}((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} | \mathbf{X}) \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \text{Var}(\boldsymbol{\varepsilon}) \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \end{align}$$
Under the Gauss-Markov assumption of spherical errors: $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}_n$
$$\begin{align} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \sigma^2\mathbf{I}_n \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1} \end{align}$$
$$\boxed{\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}}$$ This elegant formula relates coefficient variance to two quantities: the error variance σ² and the design matrix structure X'X. All inference in linear regression flows from this result.
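As a sanity check on this formula, here is a short simulation sketch (the design, σ, and number of replications are arbitrary illustrative choices, not part of the original text): holding $\mathbf{X}$ fixed and redrawing $\boldsymbol{\varepsilon}$ many times, the empirical covariance of $\hat{\boldsymbol{\beta}}$ should match $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(0)
n, sigma = 200, 1.5
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta = np.array([1.0, 2.0, -0.5])

XtX_inv = inv(X.T @ X)
theoretical = sigma**2 * XtX_inv          # σ² (X'X)⁻¹

# Redraw the errors many times with X held fixed
A = XtX_inv @ X.T                         # maps y to beta_hat
draws = []
for _ in range(20_000):
    y = X @ beta + sigma * np.random.randn(n)
    draws.append(A @ y)
empirical = np.cov(np.array(draws), rowvar=False)

print("Theoretical Var(beta_hat):\n", theoretical.round(5))
print("Empirical   Var(beta_hat):\n", empirical.round(5))
# The two matrices should agree up to simulation noise.
```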
Interpreting the Matrix:
Let $\boldsymbol{\Sigma}_{\hat{\beta}} = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$. This $p \times p$ matrix contains the variances $\text{Var}(\hat{\beta}_j)$ on its diagonal and the covariances $\text{Cov}(\hat{\beta}_j, \hat{\beta}_k)$ in its off-diagonal entries.
The covariances matter because strongly correlated estimates can be jointly implausible even when each individual estimate looks reasonable.
Standard Errors (SE) are the square roots of variances—they express uncertainty in the same units as the coefficients themselves.
Definition:
$$\text{SE}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)} = \sqrt{\sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}} = \sigma \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}$$
Interpretation:
The standard error tells us the typical size of our estimation error. Under repeated sampling, $\hat{\beta}_j$ falls within one standard error of the true $\beta_j$ roughly 68% of the time and within two standard errors roughly 95% of the time (when the sampling distribution is approximately normal). The simulation below illustrates this.
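A minimal simulation sketch of that claim, using arbitrary illustrative values for $n$, $\sigma$, and the coefficients:

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(1)
n, sigma, beta1 = 100, 1.0, 2.0
X = np.column_stack([np.ones(n), np.random.randn(n)])
true_se = sigma * np.sqrt(inv(X.T @ X)[1, 1])   # true SD of beta_hat_1
A = inv(X.T @ X) @ X.T                           # maps y to beta_hat

hits_1se, hits_2se, reps = 0, 0, 10_000
for _ in range(reps):
    y = X @ np.array([1.0, beta1]) + sigma * np.random.randn(n)
    b1 = (A @ y)[1]
    hits_1se += abs(b1 - beta1) <= true_se
    hits_2se += abs(b1 - beta1) <= 2 * true_se

print(f"Within ±1 SE: {hits_1se / reps:.3f}  (expect ≈ 0.683)")
print(f"Within ±2 SE: {hits_2se / reps:.3f}  (expect ≈ 0.954)")
```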
Example Calculation:
For simple linear regression $y = \beta_0 + \beta_1 x + \varepsilon$ with centered $x$ (mean-zero), $\mathbf{X}^\top\mathbf{X}$ is diagonal, so $\text{Var}(\hat{\beta}_0) = \sigma^2/n$ and $\text{Var}(\hat{\beta}_1) = \sigma^2/\sum_i x_i^2$. The code below handles the general multiple-regression case:
```python
import numpy as np
from numpy.linalg import inv

def compute_ols_with_se(X: np.ndarray, y: np.ndarray):
    """
    Compute OLS estimates with standard errors.

    Returns:
    --------
    beta_hat : coefficient estimates
    se : standard errors
    residuals : residuals (y - X @ beta_hat)
    sigma_sq_hat : estimated error variance
    """
    n, p = X.shape

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residuals
    y_hat = X @ beta_hat
    residuals = y - y_hat

    # Estimate σ² (unbiased: divide by n-p)
    RSS = np.sum(residuals**2)
    sigma_sq_hat = RSS / (n - p)

    # Variance-covariance matrix
    var_cov = sigma_sq_hat * XtX_inv

    # Standard errors (sqrt of diagonal)
    se = np.sqrt(np.diag(var_cov))

    return {
        'beta_hat': beta_hat,
        'se': se,
        'residuals': residuals,
        'sigma_sq_hat': sigma_sq_hat,
        'var_cov': var_cov,
        'RSS': RSS,
        'R_squared': 1 - RSS / np.sum((y - y.mean())**2)
    }

# Example
np.random.seed(42)
n = 100

# Design matrix with intercept
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),  # x1
    np.random.randn(n)   # x2
])

# True parameters
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.5

# Generate data
y = X @ beta_true + sigma_true * np.random.randn(n)

# Compute OLS with SE
results = compute_ols_with_se(X, y)

print("OLS Results:")
print("=" * 50)
for j, name in enumerate(['Intercept', 'x1', 'x2']):
    beta = results['beta_hat'][j]
    se = results['se'][j]
    t_stat = beta / se
    print(f"{name:12s}: β̂ = {beta:7.4f}, SE = {se:.4f}, t = {t_stat:7.3f}")
print()
print(f"Estimated σ²: {results['sigma_sq_hat']:.4f} (True: {sigma_true**2:.4f})")
print(f"R²: {results['R_squared']:.4f}")
```

Formulas for Simple Linear Regression:
In simple regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, the standard errors have explicit forms:
$$\text{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} \approx \frac{\sigma}{\sqrt{n} \cdot s_x}$$
$$\text{SE}(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
Where $s_x$ is the sample standard deviation of $x$.
These formulas reveal key insights: the slope is estimated more precisely when the spread of $x$ is large; both standard errors shrink at rate $1/\sqrt{n}$ as the sample size grows; and the intercept is estimated more precisely when $\bar{x}$ is close to zero (i.e., when the predictor is centered). A quick numerical check of these formulas follows.
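A short sketch (with arbitrary simulated data) verifying that the explicit simple-regression formulas agree with the diagonal of $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$:

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(2)
n, sigma = 50, 1.2
x = 3.0 + 2.0 * np.random.randn(n)            # deliberately non-centered x
X = np.column_stack([np.ones(n), x])

# Matrix formula: sqrt of the diagonal of sigma^2 (X'X)^{-1}
se_matrix = sigma * np.sqrt(np.diag(inv(X.T @ X)))

# Explicit formulas for simple regression
Sxx = np.sum((x - x.mean())**2)
se_slope = sigma / np.sqrt(Sxx)
se_intercept = sigma * np.sqrt(1/n + x.mean()**2 / Sxx)

print("SE(beta_0):", se_matrix[0], "vs", se_intercept)
print("SE(beta_1):", se_matrix[1], "vs", se_slope)
# Both pairs agree to floating-point precision.
```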
Understanding what makes coefficient estimates more or less precise is crucial for experimental design and data analysis.
From $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, precision depends on two ingredients: the error variance $\sigma^2$ and the structure of $\mathbf{X}^\top\mathbf{X}$, which in turn reflects the sample size, the spread of each predictor, and the correlation among predictors.
Let's examine each factor systematically.
| Factor | Effect on SE | Mathematical Reason | Practical Implication |
|---|---|---|---|
| ↑ Error variance (σ²) | ↑ SE | SE ∝ σ | Noisy data → less precise estimates |
| ↑ Sample size (n) | ↓ SE | SE ∝ 1/√n | More data → more precision (slowly) |
| ↑ Predictor variance | ↓ SE for that predictor | (X'X)⁻¹ shrinks | Spread in x → better estimation of slope |
| ↑ Multicollinearity | ↑ SE for correlated predictors | (X'X)⁻¹ explodes | Correlated predictors → unstable estimates |
| Predictors centered (x̄ ≈ 0) | ↓ SE for intercept | The x̄² term in SE(β̂₀) vanishes | Centering predictors helps the intercept |
Deep Dive: The Multicollinearity Effect
When predictors are correlated, the matrix $\mathbf{X}^\top\mathbf{X}$ becomes nearly singular. Its inverse $(\mathbf{X}^\top\mathbf{X})^{-1}$ has very large elements, inflating variances.
Consider two predictors $x_1$ and $x_2$ with correlation $\rho$. For standardized predictors:
$$(\mathbf{X}^\top\mathbf{X})^{-1} \approx \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}$$
As $\rho \to 1$, the factor $1/(1-\rho^2)$ diverges: both coefficient variances explode, and the two estimates become strongly negatively correlated, so the individual coefficients are poorly identified even when their combined effect is estimated well.
The VIF for predictor j is: VIF_j = 1/(1 - R²_j), where R²_j comes from regressing x_j on all other predictors. VIF > 10 indicates problematic multicollinearity. The SE increases by a factor of √VIF compared to the uncorrelated case.
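As a sketch of computing VIF directly from this definition (the helper `vif` below is illustrative, not a standard library function), each predictor is regressed on the remaining columns of the design matrix:

```python
import numpy as np
from numpy.linalg import lstsq

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factors for each non-intercept column of X.

    Assumes column 0 of X is the intercept; VIF_j = 1 / (1 - R²_j),
    where R²_j comes from regressing column j on all other columns.
    """
    n, p = X.shape
    vifs = []
    for j in range(1, p):                       # skip the intercept
        xj = X[:, j]
        others = np.delete(X, j, axis=1)        # includes the intercept
        coef, *_ = lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2 = 1 - resid @ resid / np.sum((xj - xj.mean())**2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Example: two predictors with correlation ~0.9
np.random.seed(3)
n = 500
x1 = np.random.randn(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * np.random.randn(n)
X = np.column_stack([np.ones(n), x1, x2])
print("VIFs:", vif(X).round(2))                 # both ≈ 1 / (1 - 0.9²) ≈ 5.3
```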
```python
import numpy as np
from numpy.linalg import inv

def demonstrate_precision_factors():
    """
    Demonstrate how different factors affect coefficient precision.
    """
    np.random.seed(42)
    sigma = 1.0
    beta_true = np.array([1.0, 2.0])

    print("=" * 60)
    print("FACTOR 1: Sample Size Effect")
    print("=" * 60)
    for n in [25, 100, 400, 1600]:
        X = np.column_stack([np.ones(n), np.random.randn(n)])
        se = sigma * np.sqrt(inv(X.T @ X)[1, 1])
        print(f"  n = {n:4d}: SE(β̂₁) = {se:.4f}")
    print("  Note: SE decreases by factor of 2 when n increases by 4")

    print("\n" + "=" * 60)
    print("FACTOR 2: Predictor Variance Effect")
    print("=" * 60)
    n = 100
    for sd_x in [0.5, 1.0, 2.0, 4.0]:
        x = np.random.randn(n) * sd_x
        X = np.column_stack([np.ones(n), x])
        se = sigma * np.sqrt(inv(X.T @ X)[1, 1])
        print(f"  SD(x) = {sd_x:.1f}: SE(β̂₁) = {se:.4f}")
    print("  Note: SE inversely proportional to SD(x)")

    print("\n" + "=" * 60)
    print("FACTOR 3: Multicollinearity Effect")
    print("=" * 60)
    n = 100
    for rho in [0.0, 0.5, 0.9, 0.95, 0.99]:
        x1 = np.random.randn(n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * np.random.randn(n)
        X = np.column_stack([np.ones(n), x1, x2])
        XtX_inv = inv(X.T @ X)
        se1 = sigma * np.sqrt(XtX_inv[1, 1])
        se2 = sigma * np.sqrt(XtX_inv[2, 2])
        vif = 1 / (1 - rho**2) if rho < 1 else np.inf
        print(f"  ρ = {rho:.2f}: SE(β̂₁) = {se1:.4f}, SE(β̂₂) = {se2:.4f}, VIF ≈ {vif:.2f}")
    print("  Note: SE explodes as correlation approaches 1")

demonstrate_precision_factors()
```

The variance formula $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ requires knowledge of $\sigma^2$, which is typically unknown. We must estimate it from the data.
The Residual Sum of Squares (RSS):
The residuals $e_i = y_i - \hat{y}_i$ are our proxy for the unobservable errors $\varepsilon_i$:
$$\text{RSS} = \sum_{i=1}^n e_i^2 = |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2 = \mathbf{e}^\top\mathbf{e}$$
Unbiased Estimator of σ²:
A natural guess might be $\hat{\sigma}^2 = \frac{\text{RSS}}{n}$. But this is biased downward!
The unbiased estimator is:
$$\hat{\sigma}^2 = s^2 = \frac{\text{RSS}}{n - p} = \frac{\sum_{i=1}^n e_i^2}{n - p}$$
Where $p$ is the number of parameters (including intercept).
We lose p degrees of freedom estimating β. The residuals are constrained: X'e = 0 (p linear constraints). Only n - p residuals are 'free'—this is the effective sample size for estimating σ². The quantity n - p is called the degrees of freedom for error.
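A tiny sketch verifying the constraint numerically (the data here are arbitrary simulated values): the $p$ components of $\mathbf{X}^\top\mathbf{e}$ are zero up to floating-point error.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(4)
n, p = 30, 3
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
y = X @ np.array([1.0, 2.0, -0.5]) + np.random.randn(n)

beta_hat = inv(X.T @ X) @ X.T @ y
e = y - X @ beta_hat

# X'e = 0: p linear constraints on the n residuals
print("X.T @ e =", (X.T @ e).round(10))   # ≈ [0, 0, 0]
print("Free residual dimensions:", n - p)
```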
Proof of Unbiasedness:
We can show $\mathbb{E}[\text{RSS}] = (n-p)\sigma^2$.
The key insight is that the residual vector is:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y} = \mathbf{M}\mathbf{y}$$
Where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is the hat matrix and $\mathbf{M} = \mathbf{I} - \mathbf{H}$.
Then: $$\mathbf{e} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \mathbf{M}\boldsymbol{\varepsilon}$$
(Since $\mathbf{M}\mathbf{X} = \mathbf{0}$)
The expected sum of squares: $$\mathbb{E}[\mathbf{e}^\top\mathbf{e}] = \mathbb{E}[\boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}] = \sigma^2 \text{tr}(\mathbf{M}) = \sigma^2(n - p)$$
Since $\text{tr}(\mathbf{M}) = \text{tr}(\mathbf{I}_n) - \text{tr}(\mathbf{H}) = n - \text{tr}\big((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\big) = n - p$ (equivalently, $\text{rank}(\mathbf{M}) = n - p$ because $\mathbf{M}$ is idempotent).
Therefore: $$\mathbb{E}\left[\frac{\text{RSS}}{n-p}\right] = \sigma^2 \quad \checkmark$$
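A brief Monte Carlo sketch (the small sample size is chosen deliberately to make the bias visible) showing that RSS/n underestimates σ² while RSS/(n − p) is unbiased:

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(5)
n, p, sigma = 20, 4, 1.0
X = np.column_stack([np.ones(n)] + [np.random.randn(n) for _ in range(p - 1)])
beta = np.arange(1, p + 1, dtype=float)
H = X @ inv(X.T @ X) @ X.T                 # hat matrix

naive, unbiased = [], []
for _ in range(20_000):
    y = X @ beta + sigma * np.random.randn(n)
    rss = np.sum((y - H @ y)**2)
    naive.append(rss / n)                  # biased downward
    unbiased.append(rss / (n - p))         # unbiased

print(f"E[RSS/n]     ≈ {np.mean(naive):.3f}   (theory: {(n - p) / n * sigma**2:.3f})")
print(f"E[RSS/(n-p)] ≈ {np.mean(unbiased):.3f}   (theory: {sigma**2:.3f})")
```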
Properties of s²:
| Property | Statement |
|---|---|
| Unbiasedness | E[s²] = σ² |
| Consistency | s² → σ² as n → ∞ |
| Distribution (under normality) | (n-p)s²/σ² ~ χ²(n-p) |
| Independence from β̂ | s² and β̂ are independent (under normality) |
The independence property is remarkable and crucial for constructing t-statistics.
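A quick simulation sketch of the distributional claim under normal errors (the parameter values are arbitrary): the first two moments of $(n-p)s^2/\sigma^2$ should match those of $\chi^2(n-p)$, namely mean $n-p$ and variance $2(n-p)$.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(6)
n, p, sigma = 15, 3, 2.0
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
M = np.eye(n) - X @ inv(X.T @ X) @ X.T     # annihilator matrix, e = M ε

q_vals = []
for _ in range(50_000):
    eps = sigma * np.random.randn(n)       # normal errors
    rss = eps @ M @ eps                    # RSS = e'e = ε'Mε
    q_vals.append(rss / sigma**2)          # (n-p)s²/σ² since s² = RSS/(n-p)

q_vals = np.array(q_vals)
print(f"Mean:     {q_vals.mean():.2f}   (χ²({n - p}) mean = {n - p})")
print(f"Variance: {q_vals.var():.2f}   (χ²({n - p}) variance = {2 * (n - p)})")
```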
In practice, we replace $\sigma^2$ with its estimate $s^2$:
$$\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = s^2 (\mathbf{X}^\top\mathbf{X})^{-1}$$
$$\widehat{\text{SE}}(\hat{\beta}_j) = s \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}$$
These estimated standard errors (also called standard errors of the estimates) are what statistical software reports.
Important Distinction:
The estimated SE is a random variable (because $s$ is random), while the true SD is a parameter.
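To make the distinction concrete, here is a small simulation sketch (illustrative values): across repeated samples the estimated SE fluctuates from dataset to dataset around the fixed true standard deviation.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(7)
n, sigma = 40, 1.0
X = np.column_stack([np.ones(n), np.random.randn(n)])
XtX_inv = inv(X.T @ X)
true_sd = sigma * np.sqrt(XtX_inv[1, 1])        # fixed parameter

est_ses = []
for _ in range(10_000):
    y = X @ np.array([1.0, 2.0]) + sigma * np.random.randn(n)
    e = y - X @ (XtX_inv @ X.T @ y)
    s = np.sqrt(e @ e / (n - 2))                # s is random
    est_ses.append(s * np.sqrt(XtX_inv[1, 1]))  # estimated SE: random

est_ses = np.array(est_ses)
print(f"True SD(β̂₁):            {true_sd:.4f}")
print(f"Estimated SEs: mean = {est_ses.mean():.4f}, SD = {est_ses.std():.4f}")
# The estimated SE varies from sample to sample around the fixed true SD.
```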
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def full_ols_inference(X: np.ndarray, y: np.ndarray, alpha: float = 0.05):
    """
    Complete OLS inference including estimated standard errors,
    t-statistics, and confidence intervals.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response vector (n,)
    alpha : significance level for confidence intervals

    Returns:
    --------
    Dictionary with coefficients, SEs, t-stats, p-values, CIs
    """
    n, p = X.shape
    df = n - p  # degrees of freedom

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residuals and s²
    residuals = y - X @ beta_hat
    RSS = np.sum(residuals**2)
    s_sq = RSS / df
    s = np.sqrt(s_sq)

    # Estimated variance-covariance matrix
    var_cov = s_sq * XtX_inv

    # Standard errors
    se = np.sqrt(np.diag(var_cov))

    # t-statistics
    t_stats = beta_hat / se

    # p-values (two-tailed)
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df))

    # Confidence intervals
    t_critical = stats.t.ppf(1 - alpha/2, df)
    ci_lower = beta_hat - t_critical * se
    ci_upper = beta_hat + t_critical * se

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_stats': t_stats,
        'p_values': p_values,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        's': s,
        's_sq': s_sq,
        'df': df,
        'RSS': RSS,
        'var_cov': var_cov
    }

# Example
np.random.seed(42)
n = 50
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),
    np.random.randn(n)
])
beta_true = np.array([1.0, 2.5, -1.0])
y = X @ beta_true + 0.8 * np.random.randn(n)

results = full_ols_inference(X, y)

print("OLS Regression Results")
print("=" * 70)
print(f"{'Coef':12s} {'Estimate':>10s} {'Std.Err':>10s} {'t-value':>10s} {'p-value':>10s} {'95% CI':>15s}")
print("-" * 70)

names = ['Intercept', 'x1', 'x2']
for j in range(len(names)):
    ci = f"[{results['ci_lower'][j]:.3f}, {results['ci_upper'][j]:.3f}]"
    print(f"{names[j]:12s} {results['beta_hat'][j]:10.4f} {results['se'][j]:10.4f} "
          f"{results['t_stats'][j]:10.3f} {results['p_values'][j]:10.4f} {ci:>15s}")

print("-" * 70)
print(f"Residual std error: {results['s']:.4f} on {results['df']} degrees of freedom")
print(f"True values: β = {beta_true}")
```

Statistical software (R, Stata, Python statsmodels) reports estimated standard errors. These incorporate the uncertainty from estimating σ². This is why we use the t-distribution rather than the normal distribution for inference—the extra randomness from estimating σ² is accounted for.
Beyond coefficient uncertainty, we often want to quantify uncertainty in predictions. There are two types of prediction uncertainty:
1. Confidence Interval for Mean Response (Fitted Value):
For a new observation with predictors $\mathbf{x}_0$, the estimated mean response is:
$$\hat{\mu}_0 = \mathbf{x}_0^\top \hat{\boldsymbol{\beta}}$$
Its variance:
$$\text{Var}(\hat{\mu}_0) = \mathbf{x}_0^\top \text{Var}(\hat{\boldsymbol{\beta}}) \mathbf{x}_0 = \sigma^2 \mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0$$
This measures uncertainty about the average y for given x.
2. Prediction Interval for New Observation:
For predicting an actual new observation $y_0 = \mathbf{x}_0^\top\boldsymbol{\beta} + \varepsilon_0$:
$$\text{Var}(y_0 - \hat{y}_0) = \sigma^2 + \sigma^2 \mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0 = \sigma^2(1 + \mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0)$$
The extra $\sigma^2$ accounts for the inherent variability in the new observation.
Confidence interval (for mean): Where we expect the average response to lie. Narrows as n → ∞. Prediction interval (for individual): Where we expect a single new observation to lie. Does NOT shrink to zero as n → ∞ because irreducible error remains.
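A short sketch of this contrast (the values are illustrative): at a fixed point $\mathbf{x}_0$, the confidence interval half-width shrinks toward zero as $n$ grows, while the prediction interval half-width levels off near $1.96\,\sigma$.

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

np.random.seed(8)
sigma, x0 = 0.5, np.array([1.0, 1.0])          # predict at x = 1

print(f"{'n':>6s} {'CI half-width':>14s} {'PI half-width':>14s}")
for n in [20, 100, 1000, 10_000]:
    X = np.column_stack([np.ones(n), np.random.randn(n)])
    y = X @ np.array([2.0, 1.5]) + sigma * np.random.randn(n)
    XtX_inv = inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)
    s_sq = e @ e / (n - 2)
    t = stats.t.ppf(0.975, n - 2)
    h = x0 @ XtX_inv @ x0                      # x0' (X'X)^{-1} x0
    print(f"{n:6d} {t * np.sqrt(s_sq * h):14.4f} {t * np.sqrt(s_sq * (1 + h)):14.4f}")
# CI half-width → 0, PI half-width → 1.96·σ ≈ 0.98 as n grows
```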
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def prediction_with_intervals(X: np.ndarray, y: np.ndarray,
                              x_new: np.ndarray, alpha: float = 0.05):
    """
    Make predictions with confidence and prediction intervals.

    Parameters:
    -----------
    X : training design matrix (n x p)
    y : training response (n,)
    x_new : new observation features (p,) or (m x p)
    alpha : significance level

    Returns:
    --------
    Dictionary with predictions, CI, PI
    """
    n, p = X.shape
    df = n - p

    # Ensure x_new is 2D
    if x_new.ndim == 1:
        x_new = x_new.reshape(1, -1)
    m = x_new.shape[0]

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance estimate
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df
    s = np.sqrt(s_sq)

    # Predictions
    y_pred = x_new @ beta_hat

    # Variance of fitted values: s^2 x' (X'X)^{-1} x
    var_fitted = np.array([s_sq * x @ XtX_inv @ x for x in x_new])
    se_fitted = np.sqrt(var_fitted)

    # Variance of prediction error: s^2 (1 + x' (X'X)^{-1} x)
    var_pred = s_sq + var_fitted
    se_pred = np.sqrt(var_pred)

    # Critical value
    t_crit = stats.t.ppf(1 - alpha/2, df)

    # Confidence intervals (for mean)
    ci_lower = y_pred - t_crit * se_fitted
    ci_upper = y_pred + t_crit * se_fitted

    # Prediction intervals (for new observation)
    pi_lower = y_pred - t_crit * se_pred
    pi_upper = y_pred + t_crit * se_pred

    return {
        'y_pred': y_pred,
        'se_fitted': se_fitted,
        'se_pred': se_pred,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'pi_lower': pi_lower,
        'pi_upper': pi_upper,
        's': s
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n)])
beta_true = np.array([2.0, 1.5])
y = X @ beta_true + 0.5 * np.random.randn(n)

# New observations at different x values
x_new_values = np.array([-2, 0, 2])
X_new = np.column_stack([np.ones(3), x_new_values])

results = prediction_with_intervals(X, y, X_new)

print("Prediction Results")
print("=" * 70)
print(f"{'x':>8s} {'ŷ':>10s} {'SE(fit)':>10s} {'SE(pred)':>10s} {'95% CI':>18s} {'95% PI':>18s}")
print("-" * 70)

for i, x in enumerate(x_new_values):
    ci = f"[{results['ci_lower'][i]:.2f}, {results['ci_upper'][i]:.2f}]"
    pi = f"[{results['pi_lower'][i]:.2f}, {results['pi_upper'][i]:.2f}]"
    print(f"{x:8.1f} {results['y_pred'][i]:10.3f} {results['se_fitted'][i]:10.3f} "
          f"{results['se_pred'][i]:10.3f} {ci:>18s} {pi:>18s}")

print("\nNote: Prediction intervals are always wider than confidence intervals")
```

When the homoscedasticity assumption fails (i.e., $\text{Var}(\varepsilon_i)$ is not constant across observations), the standard OLS variance formula $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is incorrect.
OLS estimates remain unbiased and consistent, but the classical standard errors are biased, so t-statistics, p-values, and confidence intervals built from them can be seriously misleading.
Solution: Heteroscedasticity-Consistent (HC) Standard Errors
Also called 'robust' or 'sandwich' standard errors, these don't assume constant variance.
The General Variance:
With heteroscedasticity, $\text{Var}(\boldsymbol{\varepsilon}) = \boldsymbol{\Omega}$ (diagonal with varying entries):
$$\text{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}$$
This is the 'sandwich' form: bread-meat-bread.
The robust variance estimator replaces Ω with a diagonal matrix of squared residuals: Ω̂ = diag(e₁², e₂², ..., eₙ²). This gives: V̂_robust = (X'X)⁻¹ X' diag(e²) X (X'X)⁻¹. Named 'sandwich' because (X'X)⁻¹ is the 'bread' and X' diag(e²) X is the 'meat'.
```python
import numpy as np
from numpy.linalg import inv

def compare_se_methods(X: np.ndarray, y: np.ndarray):
    """
    Compare classical vs robust (HC0, HC1, HC3) standard errors.
    """
    n, p = X.shape

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat

    # Classical SE (assumes homoscedasticity)
    s_sq = np.sum(residuals**2) / (n - p)
    var_classical = s_sq * XtX_inv
    se_classical = np.sqrt(np.diag(var_classical))

    # HC0: Basic White (1980) estimator
    # Omega_hat = diag(e^2)
    meat_hc0 = X.T @ np.diag(residuals**2) @ X
    var_hc0 = XtX_inv @ meat_hc0 @ XtX_inv
    se_hc0 = np.sqrt(np.diag(var_hc0))

    # HC1: Degree of freedom correction
    # Scale by n/(n-p)
    var_hc1 = (n / (n - p)) * var_hc0
    se_hc1 = np.sqrt(np.diag(var_hc1))

    # HC3: MacKinnon-White (1985) - Most conservative
    # Weight by 1/(1-h_ii)^2 where h_ii = diag(H)
    H = X @ XtX_inv @ X.T
    leverage = np.diag(H)
    e_adj = residuals / (1 - leverage)
    meat_hc3 = X.T @ np.diag(e_adj**2) @ X
    var_hc3 = XtX_inv @ meat_hc3 @ XtX_inv
    se_hc3 = np.sqrt(np.diag(var_hc3))

    return {
        'beta_hat': beta_hat,
        'se_classical': se_classical,
        'se_hc0': se_hc0,
        'se_hc1': se_hc1,
        'se_hc3': se_hc3
    }

# Example with heteroscedasticity
np.random.seed(42)
n = 100
x = np.random.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

# Heteroscedastic errors: variance increases with x
beta_true = np.array([1.0, 2.0])
errors = np.random.randn(n) * (0.5 * x)  # SD proportional to x
y = X @ beta_true + errors

results = compare_se_methods(X, y)

print("Standard Error Comparison (with heteroscedasticity)")
print("=" * 60)
print(f"{'Method':15s} {'SE(β̂₀)':>12s} {'SE(β̂₁)':>12s}")
print("-" * 60)
print(f"{'Classical':15s} {results['se_classical'][0]:12.4f} {results['se_classical'][1]:12.4f}")
print(f"{'HC0 (White)':15s} {results['se_hc0'][0]:12.4f} {results['se_hc0'][1]:12.4f}")
print(f"{'HC1 (adjusted)':15s} {results['se_hc1'][0]:12.4f} {results['se_hc1'][1]:12.4f}")
print(f"{'HC3 (conservative)':15s} {results['se_hc3'][0]:12.4f} {results['se_hc3'][1]:12.4f}")
print()
print("Note: Robust SEs differ from classical when heteroscedasticity present")
```

Common HC Variants:
| Variant | Formula | Use Case |
|---|---|---|
| HC0 | Ω̂ = diag(eᵢ²) (White, 1980) | Original; biased in small samples |
| HC1 | HC0 × n/(n-p) | Small-sample correction |
| HC2 | Weight by 1/(1-hᵢᵢ) | Better with influential points |
| HC3 | Weight by 1/(1-hᵢᵢ)² | Most robust, default recommendation |
When to Use Robust Standard Errors: whenever residual diagnostics suggest non-constant variance, and more generally in observational data where homoscedasticity is doubtful. If classical and robust standard errors differ materially, report the robust ones; if they agree closely, the homoscedasticity assumption was probably adequate.
When observations are grouped (clusters), errors within groups may be correlated even if errors across groups are independent. Examples: students within schools, patients within hospitals, repeated measurements on the same individuals, firms within industries.
The Problem:
Both classical and HC standard errors assume independent observations. With clustering, the effective sample size is closer to the number of clusters than to $n$, so these standard errors are too small and the resulting inference is overconfident.
Cluster-Robust Standard Errors:
Let G be the number of clusters and $\mathbf{X}_g$, $\mathbf{e}_g$ be the design matrix and residuals for cluster g:
$$\widehat{\text{Var}}_{\text{cluster}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1} \left(\sum_{g=1}^G \mathbf{X}_g^\top \mathbf{e}_g \mathbf{e}_g^\top \mathbf{X}_g \right) (\mathbf{X}^\top\mathbf{X})^{-1}$$
This allows arbitrary within-cluster correlation.
Ignoring clustering can make standard errors 2-10x too small, leading to wildly overstated significance. If you have 1000 students in 50 schools, your 'effective n' is closer to 50 than 1000 for school-level variation. Always cluster at the appropriate level.
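As a minimal sketch of the cluster-robust estimator above (a CR0-style version with no small-sample correction; the data-generating choices below are illustrative, with a shared cluster-level component in both the predictor and the error so that classical SEs understate uncertainty):

```python
import numpy as np
from numpy.linalg import inv

def cluster_robust_se(X: np.ndarray, y: np.ndarray, cluster_ids: np.ndarray):
    """CR0-style cluster-robust standard errors (no small-sample correction)."""
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat

    # 'Meat': sum over clusters of X_g' e_g e_g' X_g
    p = X.shape[1]
    meat = np.zeros((p, p))
    for g in np.unique(cluster_ids):
        idx = cluster_ids == g
        score_g = X[idx].T @ e[idx]            # X_g' e_g
        meat += np.outer(score_g, score_g)

    var_cluster = XtX_inv @ meat @ XtX_inv     # sandwich form
    return beta_hat, np.sqrt(np.diag(var_cluster))

# Simulated grouped data: 50 clusters of 20 observations
np.random.seed(9)
G, m = 50, 20
cluster_ids = np.repeat(np.arange(G), m)
x = np.repeat(np.random.randn(G), m) + 0.5 * np.random.randn(G * m)
u = np.repeat(np.random.randn(G), m) + 0.5 * np.random.randn(G * m)
X = np.column_stack([np.ones(G * m), x])
y = 1.0 + 2.0 * x + u

beta_hat, se_cluster = cluster_robust_se(X, y, cluster_ids)
s_sq = np.sum((y - X @ beta_hat)**2) / (G * m - 2)
se_classical = np.sqrt(np.diag(s_sq * inv(X.T @ X)))
print("Classical SE:      ", se_classical.round(4))
print("Cluster-robust SE: ", se_cluster.round(4))
```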
Guidelines for Clustering: cluster at the level at which treatment is assigned or sampling occurs; when in doubt, cluster at the coarser (more aggregated) level; and be cautious when the number of clusters is small, since cluster-robust inference relies on having many clusters.
Comparison of SE Types:
| Data Structure | Recommended SE Type |
|---|---|
| iid observations | Classical or HC |
| Heteroscedastic, independent | HC (HC1 or HC3) |
| Clustered, homoscedastic within | Cluster-robust |
| Clustered, heteroscedastic | Cluster-robust (handles both) |
| Time series (autocorrelation) | HAC (Newey-West) |
This page has covered the essential concepts of variance estimation in OLS regression. Let's consolidate the key insights:

- The variance-covariance matrix $\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is the basis for all inference; standard errors are the square roots of its diagonal.
- The error variance is estimated by $s^2 = \text{RSS}/(n-p)$, which is unbiased and (under normality) independent of $\hat{\boldsymbol{\beta}}$.
- Precision improves with more data, lower noise, greater predictor spread, and less multicollinearity.
- When errors are heteroscedastic or clustered, replace the classical formula with robust (HC) or cluster-robust standard errors.
What's Next:
With variance and standard errors understood, we can now construct confidence intervals and perform hypothesis tests. The next page develops the theory of confidence intervals for regression coefficients—how to construct them, interpret them correctly, and understand their properties.
You now understand how to quantify uncertainty in OLS estimates through variance and standard errors. This knowledge is essential for all statistical inference in regression, from simple significance tests to complex model comparisons.