Point estimates alone tell an incomplete story. When we report $\hat{\beta}_1 = 2.5$, the natural question is: how confident are we in this value? Could the true coefficient plausibly be 1.0? Or 4.0? Or even negative?
The variance of OLS estimates and the resulting standard errors quantify this uncertainty. They form the foundation for confidence intervals, hypothesis tests, and prediction intervals.
By the end of this page, you will be able to derive the variance-covariance matrix of OLS coefficients, compute and interpret standard errors, understand what factors affect estimation precision, estimate the error variance σ² from data, and recognize when standard error formulas might be invalid.
The central result for OLS inference is the variance-covariance matrix of the coefficient estimates.
Derivation:
Recall from the Gauss-Markov discussion that:
$$\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$$
Since $\boldsymbol{\beta}$ is a constant, the variability in $\hat{\boldsymbol{\beta}}$ comes entirely from $\boldsymbol{\varepsilon}$:
$$\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$$
Taking the variance (conditional on $\mathbf{X}$):
$$\begin{align} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= \text{Var}((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} | \mathbf{X}) \\ &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \text{Var}(\boldsymbol{\varepsilon}) \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \end{align}$$
Under the Gauss-Markov assumption of spherical errors: $\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}_n$
$$\begin{align} \text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) &= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top \cdot \sigma^2\mathbf{I}_n \cdot \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1} \end{align}$$
$$\boxed{\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}}$$ This elegant formula relates coefficient variance to two quantities: the error variance σ² and the design matrix structure X'X. All inference in linear regression flows from this result.
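As a sanity check on this formula, here is a short simulation sketch (the design, σ, and number of replications are arbitrary illustrative choices, not part of the original text): holding $\mathbf{X}$ fixed and redrawing $\boldsymbol{\varepsilon}$ many times, the empirical covariance of $\hat{\boldsymbol{\beta}}$ should match $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(0)
n, sigma = 200, 1.5
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta = np.array([1.0, 2.0, -0.5])

XtX_inv = inv(X.T @ X)
theoretical = sigma**2 * XtX_inv          # σ² (X'X)⁻¹

# Redraw the errors many times with X held fixed
A = XtX_inv @ X.T                         # maps y to beta_hat
draws = []
for _ in range(20_000):
    y = X @ beta + sigma * np.random.randn(n)
    draws.append(A @ y)
empirical = np.cov(np.array(draws), rowvar=False)

print("Theoretical Var(beta_hat):\n", theoretical.round(5))
print("Empirical   Var(beta_hat):\n", empirical.round(5))
# The two matrices should agree up to simulation noise.
```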
Interpreting the Matrix:
Let $\boldsymbol{\Sigma}_{\hat{\beta}} = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$. This $p \times p$ matrix contains the variances $\text{Var}(\hat{\beta}_j)$ on its diagonal and the covariances $\text{Cov}(\hat{\beta}_j, \hat{\beta}_k)$ in its off-diagonal entries.
The covariances matter because strongly correlated estimates can be jointly implausible even when each individual estimate looks reasonable.
Standard Errors (SE) are the square roots of variances—they express uncertainty in the same units as the coefficients themselves.
Definition:
$$\text{SE}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)} = \sqrt{\sigma^2 [(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}} = \sigma \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}$$
Interpretation:
The standard error tells us the typical size of our estimation error. Under repeated sampling, $\hat{\beta}_j$ falls within one standard error of the true $\beta_j$ roughly 68% of the time and within two standard errors roughly 95% of the time (when the sampling distribution is approximately normal). The simulation below illustrates this.
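A minimal simulation sketch of that claim, using arbitrary illustrative values for $n$, $\sigma$, and the coefficients:

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(1)
n, sigma, beta1 = 100, 1.0, 2.0
X = np.column_stack([np.ones(n), np.random.randn(n)])
true_se = sigma * np.sqrt(inv(X.T @ X)[1, 1])   # true SD of beta_hat_1
A = inv(X.T @ X) @ X.T                           # maps y to beta_hat

hits_1se, hits_2se, reps = 0, 0, 10_000
for _ in range(reps):
    y = X @ np.array([1.0, beta1]) + sigma * np.random.randn(n)
    b1 = (A @ y)[1]
    hits_1se += abs(b1 - beta1) <= true_se
    hits_2se += abs(b1 - beta1) <= 2 * true_se

print(f"Within ±1 SE: {hits_1se / reps:.3f}  (expect ≈ 0.683)")
print(f"Within ±2 SE: {hits_2se / reps:.3f}  (expect ≈ 0.954)")
```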
Example Calculation:
For simple linear regression $y = \beta_0 + \beta_1 x + \varepsilon$ with centered $x$ (mean-zero), $\mathbf{X}^\top\mathbf{X}$ is diagonal, so $\text{Var}(\hat{\beta}_0) = \sigma^2/n$ and $\text{Var}(\hat{\beta}_1) = \sigma^2/\sum_i x_i^2$. The code below handles the general multiple-regression case:
```python
import numpy as np
from numpy.linalg import inv

def compute_ols_with_se(X: np.ndarray, y: np.ndarray):
    """
    Compute OLS estimates with standard errors.

    Returns:
    --------
    beta_hat : coefficient estimates
    se : standard errors
    residuals : residuals (y - X @ beta_hat)
    sigma_sq_hat : estimated error variance
    """
    n, p = X.shape

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residuals
    y_hat = X @ beta_hat
    residuals = y - y_hat

    # Estimate σ² (unbiased: divide by n-p)
    RSS = np.sum(residuals**2)
    sigma_sq_hat = RSS / (n - p)

    # Variance-covariance matrix
    var_cov = sigma_sq_hat * XtX_inv

    # Standard errors (sqrt of diagonal)
    se = np.sqrt(np.diag(var_cov))

    return {
        'beta_hat': beta_hat,
        'se': se,
        'residuals': residuals,
        'sigma_sq_hat': sigma_sq_hat,
        'var_cov': var_cov,
        'RSS': RSS,
        'R_squared': 1 - RSS / np.sum((y - y.mean())**2)
    }

# Example
np.random.seed(42)
n = 100

# Design matrix with intercept
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),  # x1
    np.random.randn(n)   # x2
])

# True parameters
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.5

# Generate data
y = X @ beta_true + sigma_true * np.random.randn(n)

# Compute OLS with SE
results = compute_ols_with_se(X, y)

print("OLS Results:")
print("=" * 50)
for j, name in enumerate(['Intercept', 'x1', 'x2']):
    beta = results['beta_hat'][j]
    se = results['se'][j]
    t_stat = beta / se
    print(f"{name:12s}: β̂ = {beta:7.4f}, SE = {se:.4f}, t = {t_stat:7.3f}")
print()
print(f"Estimated σ²: {results['sigma_sq_hat']:.4f} (True: {sigma_true**2:.4f})")
print(f"R²: {results['R_squared']:.4f}")
```

Formulas for Simple Linear Regression:
In simple regression $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, the standard errors have explicit forms:
$$\text{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} \approx \frac{\sigma}{\sqrt{n} \cdot s_x}$$
$$\text{SE}(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
Where $s_x$ is the sample standard deviation of $x$.
These formulas reveal key insights: the slope is estimated more precisely when the spread of $x$ is large; both standard errors shrink at rate $1/\sqrt{n}$ as the sample size grows; and the intercept is estimated more precisely when $\bar{x}$ is close to zero (i.e., when the predictor is centered). A quick numerical check of these formulas follows.
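A short sketch (with arbitrary simulated data) verifying that the explicit simple-regression formulas agree with the diagonal of $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$:

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(2)
n, sigma = 50, 1.2
x = 3.0 + 2.0 * np.random.randn(n)            # deliberately non-centered x
X = np.column_stack([np.ones(n), x])

# Matrix formula: sqrt of the diagonal of sigma^2 (X'X)^{-1}
se_matrix = sigma * np.sqrt(np.diag(inv(X.T @ X)))

# Explicit formulas for simple regression
Sxx = np.sum((x - x.mean())**2)
se_slope = sigma / np.sqrt(Sxx)
se_intercept = sigma * np.sqrt(1/n + x.mean()**2 / Sxx)

print("SE(beta_0):", se_matrix[0], "vs", se_intercept)
print("SE(beta_1):", se_matrix[1], "vs", se_slope)
# Both pairs agree to floating-point precision.
```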
Understanding what makes coefficient estimates more or less precise is crucial for experimental design and data analysis.
From $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$, precision depends on two ingredients: the error variance $\sigma^2$ and the structure of $\mathbf{X}^\top\mathbf{X}$, which in turn reflects the sample size, the spread of each predictor, and the correlation among predictors.
Let's examine each factor systematically.
| Factor | Effect on SE | Mathematical Reason | Practical Implication |
|---|---|---|---|
| ↑ Error variance (σ²) | ↑ SE | SE ∝ σ | Noisy data → less precise estimates |
| ↑ Sample size (n) | ↓ SE | SE ∝ 1/√n | More data → more precision (slowly) |
| ↑ Predictor variance | ↓ SE for that predictor | (X'X)⁻¹ shrinks | Spread in x → better estimation of slope |
| ↑ Multicollinearity | ↑ SE for correlated predictors | (X'X)⁻¹ explodes | Correlated predictors → unstable estimates |
| Predictors centered (x̄ ≈ 0) | ↓ SE for intercept | The x̄² term in SE(β̂₀) vanishes | Centering predictors helps the intercept |
Deep Dive: The Multicollinearity Effect
When predictors are correlated, the matrix $\mathbf{X}^\top\mathbf{X}$ becomes nearly singular. Its inverse $(\mathbf{X}^\top\mathbf{X})^{-1}$ has very large elements, inflating variances.
Consider two predictors $x_1$ and $x_2$ with correlation $\rho$. For standardized predictors:
$$(\mathbf{X}^\top\mathbf{X})^{-1} \approx \frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}$$
As $\rho \to 1$, the factor $1/(1-\rho^2)$ diverges: both coefficient variances explode, and the two estimates become strongly negatively correlated, so the individual coefficients are poorly identified even when their combined effect is estimated well.
The VIF for predictor j is: VIF_j = 1/(1 - R²_j), where R²_j comes from regressing x_j on all other predictors. VIF > 10 indicates problematic multicollinearity. The SE increases by a factor of √VIF compared to the uncorrelated case.
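As a sketch of computing VIF directly from this definition (the helper `vif` below is illustrative, not a standard library function), each predictor is regressed on the remaining columns of the design matrix:

```python
import numpy as np
from numpy.linalg import lstsq

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factors for each non-intercept column of X.

    Assumes column 0 of X is the intercept; VIF_j = 1 / (1 - R²_j),
    where R²_j comes from regressing column j on all other columns.
    """
    n, p = X.shape
    vifs = []
    for j in range(1, p):                       # skip the intercept
        xj = X[:, j]
        others = np.delete(X, j, axis=1)        # includes the intercept
        coef, *_ = lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2 = 1 - resid @ resid / np.sum((xj - xj.mean())**2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Example: two predictors with correlation ~0.9
np.random.seed(3)
n = 500
x1 = np.random.randn(n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * np.random.randn(n)
X = np.column_stack([np.ones(n), x1, x2])
print("VIFs:", vif(X).round(2))                 # both ≈ 1 / (1 - 0.9²) ≈ 5.3
```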
```python
import numpy as np
from numpy.linalg import inv

def demonstrate_precision_factors():
    """
    Demonstrate how different factors affect coefficient precision.
    """
    np.random.seed(42)
    sigma = 1.0
    beta_true = np.array([1.0, 2.0])

    print("=" * 60)
    print("FACTOR 1: Sample Size Effect")
    print("=" * 60)
    for n in [25, 100, 400, 1600]:
        X = np.column_stack([np.ones(n), np.random.randn(n)])
        se = sigma * np.sqrt(inv(X.T @ X)[1, 1])
        print(f"  n = {n:4d}: SE(β̂₁) = {se:.4f}")
    print("  Note: SE decreases by factor of 2 when n increases by 4")

    print("\n" + "=" * 60)
    print("FACTOR 2: Predictor Variance Effect")
    print("=" * 60)
    n = 100
    for sd_x in [0.5, 1.0, 2.0, 4.0]:
        x = np.random.randn(n) * sd_x
        X = np.column_stack([np.ones(n), x])
        se = sigma * np.sqrt(inv(X.T @ X)[1, 1])
        print(f"  SD(x) = {sd_x:.1f}: SE(β̂₁) = {se:.4f}")
    print("  Note: SE inversely proportional to SD(x)")

    print("\n" + "=" * 60)
    print("FACTOR 3: Multicollinearity Effect")
    print("=" * 60)
    n = 100
    for rho in [0.0, 0.5, 0.9, 0.95, 0.99]:
        x1 = np.random.randn(n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * np.random.randn(n)
        X = np.column_stack([np.ones(n), x1, x2])
        XtX_inv = inv(X.T @ X)
        se1 = sigma * np.sqrt(XtX_inv[1, 1])
        se2 = sigma * np.sqrt(XtX_inv[2, 2])
        vif = 1 / (1 - rho**2) if rho < 1 else np.inf
        print(f"  ρ = {rho:.2f}: SE(β̂₁) = {se1:.4f}, SE(β̂₂) = {se2:.4f}, VIF ≈ {vif:.2f}")
    print("  Note: SE explodes as correlation approaches 1")

demonstrate_precision_factors()
```

The variance formula $\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ requires knowledge of $\sigma^2$, which is typically unknown. We must estimate it from the data.
The Residual Sum of Squares (RSS):
The residuals $e_i = y_i - \hat{y}_i$ are our proxy for the unobservable errors $\varepsilon_i$:
$$\text{RSS} = \sum_{i=1}^n e_i^2 = |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2 = \mathbf{e}^\top\mathbf{e}$$
Unbiased Estimator of σ²:
A natural guess might be $\hat{\sigma}^2 = \frac{\text{RSS}}{n}$. But this is biased downward!
The unbiased estimator is:
$$\hat{\sigma}^2 = s^2 = \frac{\text{RSS}}{n - p} = \frac{\sum_{i=1}^n e_i^2}{n - p}$$
Where $p$ is the number of parameters (including intercept).
We lose p degrees of freedom estimating β. The residuals are constrained: X'e = 0 (p linear constraints). Only n - p residuals are 'free'—this is the effective sample size for estimating σ². The quantity n - p is called the degrees of freedom for error.
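A tiny sketch verifying the constraint numerically (the data here are arbitrary simulated values): the $p$ components of $\mathbf{X}^\top\mathbf{e}$ are zero up to floating-point error.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(4)
n, p = 30, 3
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
y = X @ np.array([1.0, 2.0, -0.5]) + np.random.randn(n)

beta_hat = inv(X.T @ X) @ X.T @ y
e = y - X @ beta_hat

# X'e = 0: p linear constraints on the n residuals
print("X.T @ e =", (X.T @ e).round(10))   # ≈ [0, 0, 0]
print("Free residual dimensions:", n - p)
```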
Proof of Unbiasedness:
We can show $\mathbb{E}[\text{RSS}] = (n-p)\sigma^2$.
The key insight is that the residual vector is:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y} = \mathbf{M}\mathbf{y}$$
Where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ is the hat matrix and $\mathbf{M} = \mathbf{I} - \mathbf{H}$.
Then: $$\mathbf{e} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \mathbf{M}\boldsymbol{\varepsilon}$$
(Since $\mathbf{M}\mathbf{X} = \mathbf{0}$)
The expected sum of squares: $$\mathbb{E}[\mathbf{e}^\top\mathbf{e}] = \mathbb{E}[\boldsymbol{\varepsilon}^\top\mathbf{M}\boldsymbol{\varepsilon}] = \sigma^2 \text{tr}(\mathbf{M}) = \sigma^2(n - p)$$
Since $\text{tr}(\mathbf{M}) = \text{tr}(\mathbf{I}_n) - \text{tr}(\mathbf{H}) = n - \text{tr}\big((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}\big) = n - p$ (equivalently, $\text{rank}(\mathbf{M}) = n - p$ because $\mathbf{M}$ is idempotent).
Therefore: $$\mathbb{E}\left[\frac{\text{RSS}}{n-p}\right] = \sigma^2 \quad \checkmark$$
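A brief Monte Carlo sketch (the small sample size is chosen deliberately to make the bias visible) showing that RSS/n underestimates σ² while RSS/(n − p) is unbiased:

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(5)
n, p, sigma = 20, 4, 1.0
X = np.column_stack([np.ones(n)] + [np.random.randn(n) for _ in range(p - 1)])
beta = np.arange(1, p + 1, dtype=float)
H = X @ inv(X.T @ X) @ X.T                 # hat matrix

naive, unbiased = [], []
for _ in range(20_000):
    y = X @ beta + sigma * np.random.randn(n)
    rss = np.sum((y - H @ y)**2)
    naive.append(rss / n)                  # biased downward
    unbiased.append(rss / (n - p))         # unbiased

print(f"E[RSS/n]     ≈ {np.mean(naive):.3f}   (theory: {(n - p) / n * sigma**2:.3f})")
print(f"E[RSS/(n-p)] ≈ {np.mean(unbiased):.3f}   (theory: {sigma**2:.3f})")
```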
Properties of s²:
| Property | Statement |
|---|---|
| Unbiasedness | E[s²] = σ² |
| Consistency | s² → σ² as n → ∞ |
| Distribution (under normality) | (n-p)s²/σ² ~ χ²(n-p) |
| Independence from β̂ | s² and β̂ are independent (under normality) |
The independence property is remarkable and crucial for constructing t-statistics.
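A quick simulation sketch of the distributional claim under normal errors (the parameter values are arbitrary): the first two moments of $(n-p)s^2/\sigma^2$ should match those of $\chi^2(n-p)$, namely mean $n-p$ and variance $2(n-p)$.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(6)
n, p, sigma = 15, 3, 2.0
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
M = np.eye(n) - X @ inv(X.T @ X) @ X.T     # annihilator matrix, e = M ε

q_vals = []
for _ in range(50_000):
    eps = sigma * np.random.randn(n)       # normal errors
    rss = eps @ M @ eps                    # RSS = e'e = ε'Mε
    q_vals.append(rss / sigma**2)          # (n-p)s²/σ² since s² = RSS/(n-p)

q_vals = np.array(q_vals)
print(f"Mean:     {q_vals.mean():.2f}   (χ²({n - p}) mean = {n - p})")
print(f"Variance: {q_vals.var():.2f}   (χ²({n - p}) variance = {2 * (n - p)})")
```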
In practice, we replace $\sigma^2$ with its estimate $s^2$:
$$\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = s^2 (\mathbf{X}^\top\mathbf{X})^{-1}$$
$$\widehat{\text{SE}}(\hat{\beta}_j) = s \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}$$
These estimated standard errors (also called standard errors of the estimates) are what statistical software reports.
Important Distinction:
The estimated SE is a random variable (because $s$ is random), while the true SD is a parameter.
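To make the distinction concrete, here is a small simulation sketch (illustrative values): across repeated samples the estimated SE fluctuates from dataset to dataset around the fixed true standard deviation.

```python
import numpy as np
from numpy.linalg import inv

np.random.seed(7)
n, sigma = 40, 1.0
X = np.column_stack([np.ones(n), np.random.randn(n)])
XtX_inv = inv(X.T @ X)
true_sd = sigma * np.sqrt(XtX_inv[1, 1])        # fixed parameter

est_ses = []
for _ in range(10_000):
    y = X @ np.array([1.0, 2.0]) + sigma * np.random.randn(n)
    e = y - X @ (XtX_inv @ X.T @ y)
    s = np.sqrt(e @ e / (n - 2))                # s is random
    est_ses.append(s * np.sqrt(XtX_inv[1, 1]))  # estimated SE: random

est_ses = np.array(est_ses)
print(f"True SD(β̂₁):            {true_sd:.4f}")
print(f"Estimated SEs: mean = {est_ses.mean():.4f}, SD = {est_ses.std():.4f}")
# The estimated SE varies from sample to sample around the fixed true SD.
```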
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def full_ols_inference(X: np.ndarray, y: np.ndarray, alpha: float = 0.05):
    """
    Complete OLS inference including estimated standard errors,
    t-statistics, and confidence intervals.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response vector (n,)
    alpha : significance level for confidence intervals

    Returns:
    --------
    Dictionary with coefficients, SEs, t-stats, p-values, CIs
    """
    n, p = X.shape
    df = n - p  # degrees of freedom

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residuals and s²
    residuals = y - X @ beta_hat
    RSS = np.sum(residuals**2)
    s_sq = RSS / df
    s = np.sqrt(s_sq)

    # Estimated variance-covariance matrix
    var_cov = s_sq * XtX_inv

    # Standard errors
    se = np.sqrt(np.diag(var_cov))

    # t-statistics
    t_stats = beta_hat / se

    # p-values (two-tailed)
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), df))

    # Confidence intervals
    t_critical = stats.t.ppf(1 - alpha/2, df)
    ci_lower = beta_hat - t_critical * se
    ci_upper = beta_hat + t_critical * se

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_stats': t_stats,
        'p_values': p_values,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        's': s,
        's_sq': s_sq,
        'df': df,
        'RSS': RSS,
        'var_cov': var_cov
    }

# Example
np.random.seed(42)
n = 50
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),
    np.random.randn(n)
])
beta_true = np.array([1.0, 2.5, -1.0])
y = X @ beta_true + 0.8 * np.random.randn(n)

results = full_ols_inference(X, y)

print("OLS Regression Results")
print("=" * 70)
print(f"{'Coef':12s} {'Estimate':>10s} {'Std.Err':>10s} {'t-value':>10s} {'p-value':>10s} {'95% CI':>15s}")
print("-" * 70)

names = ['Intercept', 'x1', 'x2']
for j in range(len(names)):
    ci = f"[{results['ci_lower'][j]:.3f}, {results['ci_upper'][j]:.3f}]"
    print(f"{names[j]:12s} {results['beta_hat'][j]:10.4f} {results['se'][j]:10.4f} "
          f"{results['t_stats'][j]:10.3f} {results['p_values'][j]:10.4f} {ci:>15s}")

print("-" * 70)
print(f"Residual std error: {results['s']:.4f} on {results['df']} degrees of freedom")
print(f"True values: β = {beta_true}")
```

Statistical software (R, Stata, Python statsmodels) reports estimated standard errors. These incorporate the uncertainty from estimating σ². This is why we use the t-distribution rather than the normal distribution for inference—the extra randomness from estimating σ² is accounted for.
Beyond coefficient uncertainty, we often want to quantify uncertainty in predictions. There are two types of prediction uncertainty:
1. Confidence Interval for Mean Response (Fitted Value):
For a new observation with predictors $\mathbf{x}_0$, the estimated mean response is:
$$\hat{\mu}_0 = \mathbf{x}_0^\top \hat{\boldsymbol{\beta}}$$
Its variance:
$$\text{Var}(\hat{\mu}_0) = \mathbf{x}_0^\top \text{Var}(\hat{\boldsymbol{\beta}}) \mathbf{x}_0 = \sigma^2 \mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0$$
This measures uncertainty about the average y for given x.
2. Prediction Interval for New Observation:
For predicting an actual new observation $y_0 = \mathbf{x}_0^\top\boldsymbol{\beta} + \varepsilon_0$:
$$\text{Var}(y_0 - \hat{y}_0) = \sigma^2 + \sigma^2 \mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0 = \sigma^2(1 + \mathbf{x}_0^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{x}_0)$$
The extra $\sigma^2$ accounts for the inherent variability in the new observation.
Confidence interval (for mean): Where we expect the average response to lie. Narrows as n → ∞. Prediction interval (for individual): Where we expect a single new observation to lie. Does NOT shrink to zero as n → ∞ because irreducible error remains.
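A short sketch of this contrast (the values are illustrative): at a fixed point $\mathbf{x}_0$, the confidence interval half-width shrinks toward zero as $n$ grows, while the prediction interval half-width levels off near $1.96\,\sigma$.

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

np.random.seed(8)
sigma, x0 = 0.5, np.array([1.0, 1.0])          # predict at x = 1

print(f"{'n':>6s} {'CI half-width':>14s} {'PI half-width':>14s}")
for n in [20, 100, 1000, 10_000]:
    X = np.column_stack([np.ones(n), np.random.randn(n)])
    y = X @ np.array([2.0, 1.5]) + sigma * np.random.randn(n)
    XtX_inv = inv(X.T @ X)
    e = y - X @ (XtX_inv @ X.T @ y)
    s_sq = e @ e / (n - 2)
    t = stats.t.ppf(0.975, n - 2)
    h = x0 @ XtX_inv @ x0                      # x0' (X'X)^{-1} x0
    print(f"{n:6d} {t * np.sqrt(s_sq * h):14.4f} {t * np.sqrt(s_sq * (1 + h)):14.4f}")
# CI half-width → 0, PI half-width → 1.96·σ ≈ 0.98 as n grows
```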
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def prediction_with_intervals(X: np.ndarray, y: np.ndarray,
                              x_new: np.ndarray, alpha: float = 0.05):
    """
    Make predictions with confidence and prediction intervals.

    Parameters:
    -----------
    X : training design matrix (n x p)
    y : training response (n,)
    x_new : new observation features (p,) or (m x p)
    alpha : significance level

    Returns:
    --------
    Dictionary with predictions, CI, PI
    """
    n, p = X.shape
    df = n - p

    # Ensure x_new is 2D
    if x_new.ndim == 1:
        x_new = x_new.reshape(1, -1)
    m = x_new.shape[0]

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance estimate
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df
    s = np.sqrt(s_sq)

    # Predictions
    y_pred = x_new @ beta_hat

    # Variance of fitted values: s^2 x' (X'X)^{-1} x
    var_fitted = np.array([s_sq * x @ XtX_inv @ x for x in x_new])
    se_fitted = np.sqrt(var_fitted)

    # Variance of prediction error: s^2 (1 + x' (X'X)^{-1} x)
    var_pred = s_sq + var_fitted
    se_pred = np.sqrt(var_pred)

    # Critical value
    t_crit = stats.t.ppf(1 - alpha/2, df)

    # Confidence intervals (for mean)
    ci_lower = y_pred - t_crit * se_fitted
    ci_upper = y_pred + t_crit * se_fitted

    # Prediction intervals (for new observation)
    pi_lower = y_pred - t_crit * se_pred
    pi_upper = y_pred + t_crit * se_pred

    return {
        'y_pred': y_pred,
        'se_fitted': se_fitted,
        'se_pred': se_pred,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'pi_lower': pi_lower,
        'pi_upper': pi_upper,
        's': s
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n)])
beta_true = np.array([2.0, 1.5])
y = X @ beta_true + 0.5 * np.random.randn(n)

# New observations at different x values
x_new_values = np.array([-2, 0, 2])
X_new = np.column_stack([np.ones(3), x_new_values])

results = prediction_with_intervals(X, y, X_new)

print("Prediction Results")
print("=" * 70)
print(f"{'x':>8s} {'ŷ':>10s} {'SE(fit)':>10s} {'SE(pred)':>10s} {'95% CI':>18s} {'95% PI':>18s}")
print("-" * 70)

for i, x in enumerate(x_new_values):
    ci = f"[{results['ci_lower'][i]:.2f}, {results['ci_upper'][i]:.2f}]"
    pi = f"[{results['pi_lower'][i]:.2f}, {results['pi_upper'][i]:.2f}]"
    print(f"{x:8.1f} {results['y_pred'][i]:10.3f} {results['se_fitted'][i]:10.3f} "
          f"{results['se_pred'][i]:10.3f} {ci:>18s} {pi:>18s}")

print("\nNote: Prediction intervals are always wider than confidence intervals")
```

When the homoscedasticity assumption fails (i.e., $\text{Var}(\varepsilon_i)$ is not constant across observations), the standard OLS variance formula $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is incorrect.
OLS estimates remain unbiased and consistent, but the classical standard errors are biased, so t-statistics, p-values, and confidence intervals built from them can be seriously misleading.
Solution: Heteroscedasticity-Consistent (HC) Standard Errors
Also called 'robust' or 'sandwich' standard errors, these don't assume constant variance.
The General Variance:
With heteroscedasticity, $\text{Var}(\boldsymbol{\varepsilon}) = \boldsymbol{\Omega}$ (diagonal with varying entries):
$$\text{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}$$
This is the 'sandwich' form: bread-meat-bread.
The robust variance estimator replaces Ω with a diagonal matrix of squared residuals: Ω̂ = diag(e₁², e₂², ..., eₙ²). This gives: V̂_robust = (X'X)⁻¹ X' diag(e²) X (X'X)⁻¹. Named 'sandwich' because (X'X)⁻¹ is the 'bread' and X' diag(e²) X is the 'meat'.
```python
import numpy as np
from numpy.linalg import inv

def compare_se_methods(X: np.ndarray, y: np.ndarray):
    """
    Compare classical vs robust (HC0, HC1, HC3) standard errors.
    """
    n, p = X.shape

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat

    # Classical SE (assumes homoscedasticity)
    s_sq = np.sum(residuals**2) / (n - p)
    var_classical = s_sq * XtX_inv
    se_classical = np.sqrt(np.diag(var_classical))

    # HC0: Basic White (1980) estimator
    # Omega_hat = diag(e^2)
    meat_hc0 = X.T @ np.diag(residuals**2) @ X
    var_hc0 = XtX_inv @ meat_hc0 @ XtX_inv
    se_hc0 = np.sqrt(np.diag(var_hc0))

    # HC1: Degree of freedom correction
    # Scale by n/(n-p)
    var_hc1 = (n / (n - p)) * var_hc0
    se_hc1 = np.sqrt(np.diag(var_hc1))

    # HC3: MacKinnon-White (1985) - Most conservative
    # Weight by 1/(1-h_ii)^2 where h_ii = diag(H)
    H = X @ XtX_inv @ X.T
    leverage = np.diag(H)
    e_adj = residuals / (1 - leverage)
    meat_hc3 = X.T @ np.diag(e_adj**2) @ X
    var_hc3 = XtX_inv @ meat_hc3 @ XtX_inv
    se_hc3 = np.sqrt(np.diag(var_hc3))

    return {
        'beta_hat': beta_hat,
        'se_classical': se_classical,
        'se_hc0': se_hc0,
        'se_hc1': se_hc1,
        'se_hc3': se_hc3
    }

# Example with heteroscedasticity
np.random.seed(42)
n = 100
x = np.random.uniform(1, 10, n)
X = np.column_stack([np.ones(n), x])

# Heteroscedastic errors: variance increases with x
beta_true = np.array([1.0, 2.0])
errors = np.random.randn(n) * (0.5 * x)  # SD proportional to x
y = X @ beta_true + errors

results = compare_se_methods(X, y)

print("Standard Error Comparison (with heteroscedasticity)")
print("=" * 60)
print(f"{'Method':15s} {'SE(β̂₀)':>12s} {'SE(β̂₁)':>12s}")
print("-" * 60)
print(f"{'Classical':15s} {results['se_classical'][0]:12.4f} {results['se_classical'][1]:12.4f}")
print(f"{'HC0 (White)':15s} {results['se_hc0'][0]:12.4f} {results['se_hc0'][1]:12.4f}")
print(f"{'HC1 (adjusted)':15s} {results['se_hc1'][0]:12.4f} {results['se_hc1'][1]:12.4f}")
print(f"{'HC3 (conservative)':15s} {results['se_hc3'][0]:12.4f} {results['se_hc3'][1]:12.4f}")
print()
print("Note: Robust SEs differ from classical when heteroscedasticity present")
```

Common HC Variants:
| Variant | Formula | Use Case |
|---|---|---|
| HC0 | Ω̂ = diag(eᵢ²) (White, 1980) | Original; biased in small samples |
| HC1 | HC0 × n/(n-p) | Small-sample correction |
| HC2 | Weight by 1/(1-hᵢᵢ) | Better with influential points |
| HC3 | Weight by 1/(1-hᵢᵢ)² | Most robust, default recommendation |
When to Use Robust Standard Errors: whenever residual diagnostics suggest non-constant variance, and more generally in observational data where homoscedasticity is doubtful. If classical and robust standard errors differ materially, report the robust ones; if they agree closely, the homoscedasticity assumption was probably adequate.
When observations are grouped (clusters), errors within groups may be correlated even if errors across groups are independent. Examples: students within schools, patients within hospitals, repeated measurements on the same individuals, firms within industries.
The Problem:
Both classical and HC standard errors assume independent observations. With clustering, the effective sample size is closer to the number of clusters than to $n$, so these standard errors are too small and the resulting inference is overconfident.
Cluster-Robust Standard Errors:
Let G be the number of clusters and $\mathbf{X}_g$, $\mathbf{e}_g$ be the design matrix and residuals for cluster g:
$$\widehat{\text{Var}}_{\text{cluster}}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top\mathbf{X})^{-1} \left(\sum_{g=1}^G \mathbf{X}_g^\top \mathbf{e}_g \mathbf{e}_g^\top \mathbf{X}_g \right) (\mathbf{X}^\top\mathbf{X})^{-1}$$
This allows arbitrary within-cluster correlation.
Ignoring clustering can make standard errors 2-10x too small, leading to wildly overstated significance. If you have 1000 students in 50 schools, your 'effective n' is closer to 50 than 1000 for school-level variation. Always cluster at the appropriate level.
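As a minimal sketch of the cluster-robust estimator above (a CR0-style version with no small-sample correction; the data-generating choices below are illustrative, with a shared cluster-level component in both the predictor and the error so that classical SEs understate uncertainty):

```python
import numpy as np
from numpy.linalg import inv

def cluster_robust_se(X: np.ndarray, y: np.ndarray, cluster_ids: np.ndarray):
    """CR0-style cluster-robust standard errors (no small-sample correction)."""
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat

    # 'Meat': sum over clusters of X_g' e_g e_g' X_g
    p = X.shape[1]
    meat = np.zeros((p, p))
    for g in np.unique(cluster_ids):
        idx = cluster_ids == g
        score_g = X[idx].T @ e[idx]            # X_g' e_g
        meat += np.outer(score_g, score_g)

    var_cluster = XtX_inv @ meat @ XtX_inv     # sandwich form
    return beta_hat, np.sqrt(np.diag(var_cluster))

# Simulated grouped data: 50 clusters of 20 observations
np.random.seed(9)
G, m = 50, 20
cluster_ids = np.repeat(np.arange(G), m)
x = np.repeat(np.random.randn(G), m) + 0.5 * np.random.randn(G * m)
u = np.repeat(np.random.randn(G), m) + 0.5 * np.random.randn(G * m)
X = np.column_stack([np.ones(G * m), x])
y = 1.0 + 2.0 * x + u

beta_hat, se_cluster = cluster_robust_se(X, y, cluster_ids)
s_sq = np.sum((y - X @ beta_hat)**2) / (G * m - 2)
se_classical = np.sqrt(np.diag(s_sq * inv(X.T @ X)))
print("Classical SE:      ", se_classical.round(4))
print("Cluster-robust SE: ", se_cluster.round(4))
```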
Guidelines for Clustering: cluster at the level at which treatment is assigned or sampling occurs; when in doubt, cluster at the coarser (more aggregated) level; and be cautious when the number of clusters is small, since cluster-robust inference relies on having many clusters.
Comparison of SE Types:
| Data Structure | Recommended SE Type |
|---|---|
| iid observations | Classical or HC |
| Heteroscedastic, independent | HC (HC1 or HC3) |
| Clustered, homoscedastic within | Cluster-robust |
| Clustered, heteroscedastic | Cluster-robust (handles both) |
| Time series (autocorrelation) | HAC (Newey-West) |
This page has covered the essential concepts of variance estimation in OLS regression. Let's consolidate the key insights:

- The variance-covariance matrix $\text{Var}(\hat{\boldsymbol{\beta}} | \mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is the basis for all inference; standard errors are the square roots of its diagonal.
- The error variance is estimated by $s^2 = \text{RSS}/(n-p)$, which is unbiased and (under normality) independent of $\hat{\boldsymbol{\beta}}$.
- Precision improves with more data, lower noise, greater predictor spread, and less multicollinearity.
- When errors are heteroscedastic or clustered, replace the classical formula with robust (HC) or cluster-robust standard errors.
What's Next:
With variance and standard errors understood, we can now construct confidence intervals and perform hypothesis tests. The next page develops the theory of confidence intervals for regression coefficients—how to construct them, interpret them correctly, and understand their properties.
You now understand how to quantify uncertainty in OLS estimates through variance and standard errors. This knowledge is essential for all statistical inference in regression, from simple significance tests to complex model comparisons.