In discussions of linear regression, the spotlight typically falls on $\boldsymbol{\beta}$—the regression coefficients. But lurking in the shadows is an equally important parameter: the noise variance $\sigma^2$.
The variance parameter plays critical roles: it quantifies the noise level in the data, sets the scale of standard errors and test statistics, and determines the width of confidence and prediction intervals.
This page provides a rigorous treatment of variance estimation—its MLE, its bias, its distribution, and its practical importance.
By the end of this page, you will understand both estimators of $\sigma^2$, why the scaled RSS follows a chi-squared distribution, and how the variance estimate drives standard errors, tests, and intervals.
We've encountered two estimators for $\sigma^2$. Let's state them clearly and understand their differences.
The MLE estimator: $$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{\text{RSS}}{n}$$
The unbiased estimator: $$s^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{\text{RSS}}{n-p}$$
where $\text{RSS} = \sum_{i=1}^n \hat{r}_i^2 = |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2$ is the residual sum of squares, $n$ is the sample size, and $p$ is the number of parameters in $\boldsymbol{\beta}$.
The only difference is the denominator: $n$ for MLE, $n-p$ for the unbiased version.
| Property | MLE: $\frac{\text{RSS}}{n}$ | Unbiased: $\frac{\text{RSS}}{n-p}$ |
|---|---|---|
| Bias | $\mathbb{E}[\hat{\sigma}^2] = \frac{n-p}{n}\sigma^2$ (biased low) | $\mathbb{E}[s^2] = \sigma^2$ (unbiased) |
| Consistency | Yes: $\hat{\sigma}^2 \xrightarrow{p} \sigma^2$ | Yes: $s^2 \xrightarrow{p} \sigma^2$ |
| Distribution | $\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$ | $\frac{(n-p)s^2}{\sigma^2} \sim \chi^2_{n-p}$ |
| Variance | $\text{Var}(\hat{\sigma}^2) = \frac{2(n-p)\sigma^4}{n^2}$ | $\text{Var}(s^2) = \frac{2\sigma^4}{n-p}$ |
| Typical use | Likelihood, AIC/BIC, large samples | Inference, confidence intervals, F-tests |
The MLE is theoretically optimal in a likelihood sense but biased. The unbiased estimator sacrifices a tiny bit of efficiency (slightly higher variance) to eliminate bias.
For practical inference (confidence intervals, hypothesis tests), the unbiased estimator $s^2$ is standard. For model selection with AIC/BIC, the MLE is often used.
As $n$ grows large, both converge to $\sigma^2$, and the distinction vanishes.
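Here is a minimal sketch (on simulated data, purely for illustration) of how both estimators come out of a single least-squares fit; the fuller simulation later on this page verifies their sampling properties.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 60, 4, 1.5

# Simulated design and response (illustrative assumption, not from the text)
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta + sigma * rng.standard_normal(n)

# Fit by least squares and form the residual sum of squares
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)

sigma2_mle = rss / n       # MLE: divides by n, biased low
s2 = rss / (n - p)         # unbiased: divides by n - p

print(f"RSS = {rss:.3f}")
print(f"MLE estimate      RSS/n     = {sigma2_mle:.3f}")
print(f"Unbiased estimate RSS/(n-p) = {s2:.3f}  (true sigma^2 = {sigma**2})")
```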
Understanding the distribution of RSS is key to deriving the properties of variance estimators.
Setup:
Under the Gaussian model $\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$, the residual vector is: $$\hat{\mathbf{r}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = (\mathbf{I}_n - \mathbf{H})\mathbf{y} = \mathbf{M}\mathbf{y}$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix and $\mathbf{M} = \mathbf{I}_n - \mathbf{H}$ is the residual-maker (annihilator) matrix.
Key properties of $\mathbf{M}$:

- Symmetric: $\mathbf{M}^T = \mathbf{M}$
- Idempotent: $\mathbf{M}^2 = \mathbf{M}$
- Annihilates the column space of $\mathbf{X}$: $\mathbf{M}\mathbf{X} = \mathbf{0}$
- Rank: $\text{rank}(\mathbf{M}) = n - p$
Deriving the distribution of RSS:
The true errors are $\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$, with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$.
The residuals are: $$\hat{\mathbf{r}} = \mathbf{M}\mathbf{y} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) = \mathbf{M}\mathbf{X}\boldsymbol{\beta} + \mathbf{M}\boldsymbol{\epsilon} = \mathbf{M}\boldsymbol{\epsilon}$$
(Using $\mathbf{M}\mathbf{X} = \mathbf{0}$.)
So residuals are a linear transformation of true errors, and the RSS is: $$\text{RSS} = \hat{\mathbf{r}}^T\hat{\mathbf{r}} = \boldsymbol{\epsilon}^T\mathbf{M}^T\mathbf{M}\boldsymbol{\epsilon} = \boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}$$
(Using $\mathbf{M}^T = \mathbf{M}$ and $\mathbf{M}^2 = \mathbf{M}$.)
Now, $\boldsymbol{\epsilon}/\sigma \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$, so: $$\frac{\text{RSS}}{\sigma^2} = \frac{\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}}{\sigma^2} = \mathbf{z}^T\mathbf{M}\mathbf{z}$$
where $\mathbf{z} = \boldsymbol{\epsilon}/\sigma \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$.
For standard normal $\mathbf{z}$ and symmetric idempotent matrix $\mathbf{M}$ of rank $k$:
$$\mathbf{z}^T\mathbf{M}\mathbf{z} \sim \chi^2_k$$
Since $\text{rank}(\mathbf{M}) = n - p$:
$$\boxed{\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}}$$
This is a fundamental result. The scaled RSS follows a chi-squared distribution with $n-p$ degrees of freedom.
Why $n-p$ degrees of freedom?
Intuitively, the $n$ residuals must satisfy the $p$ linear constraints $\mathbf{X}^T\hat{\mathbf{r}} = \mathbf{0}$ imposed by the normal equations, so only $n - p$ of them are free to vary.
Mathematically, $\mathbf{M}$ projects $\mathbb{R}^n$ onto an $(n-p)$-dimensional subspace (the orthogonal complement of the column space of $\mathbf{X}$).
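These facts about $\mathbf{M}$—symmetry, idempotence, $\mathbf{M}\mathbf{X} = \mathbf{0}$, and rank $n-p$—are easy to confirm numerically, as is the chi-squared law for the quadratic form. The sketch below uses an arbitrary simulated design matrix (an illustrative assumption, not data from the text).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
M = np.eye(n) - H                      # residual-maker matrix

print("M symmetric: ", np.allclose(M, M.T))
print("M idempotent:", np.allclose(M @ M, M))
print("M X = 0:     ", np.allclose(M @ X, 0))
print("rank(M) =", np.linalg.matrix_rank(M), "(expected n - p =", n - p, ")")

# Distributional check: z' M z for standard normal z should be chi^2_{n-p}
z = rng.standard_normal((20000, n))
quad = np.einsum('ij,jk,ik->i', z, M, z)   # quadratic form for each simulated z
ks = stats.kstest(quad, 'chi2', args=(n - p,))
print(f"KS test vs chi^2_{n-p}: statistic = {ks.statistic:.4f}, p-value = {ks.pvalue:.3f}")
```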
Let's derive the bias of the MLE variance estimator rigorously.
Expected value of RSS:
Since $\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$ and $\mathbb{E}[\chi^2_k] = k$: $$\mathbb{E}\left[\frac{\text{RSS}}{\sigma^2}\right] = n - p$$ $$\mathbb{E}[\text{RSS}] = (n-p)\sigma^2$$
Expectation of MLE: $$\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \mathbb{E}\left[\frac{\text{RSS}}{n}\right] = \frac{(n-p)\sigma^2}{n} = \frac{n-p}{n}\sigma^2$$
Bias: $$\text{Bias}(\hat{\sigma}^2_{\text{MLE}}) = \mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] - \sigma^2 = \frac{n-p}{n}\sigma^2 - \sigma^2 = -\frac{p}{n}\sigma^2$$
In other words, the MLE underestimates $\sigma^2$ on average, shrinking it by the factor $\frac{n-p}{n}$.
The bias can be severe when $p$ is large relative to $n$:
| $n$ | $p$ | $n-p$ | Ratio $\frac{n-p}{n}$ | % Underestimation |
|---|---|---|---|---|
| 100 | 5 | 95 | 0.95 | 5% |
| 50 | 10 | 40 | 0.80 | 20% |
| 20 | 10 | 10 | 0.50 | 50% |
| 15 | 10 | 5 | 0.33 | 67% |
In high-dimensional settings where $p \approx n$, the MLE is severely biased!
Expectation of unbiased estimator: $$\mathbb{E}[s^2] = \mathbb{E}\left[\frac{\text{RSS}}{n-p}\right] = \frac{(n-p)\sigma^2}{n-p} = \sigma^2$$
The division by $n-p$ exactly corrects the bias.
Alternative derivation via trace:
We can also compute $\mathbb{E}[\text{RSS}]$ directly: $$\mathbb{E}[\text{RSS}] = \mathbb{E}[\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}] = \mathbb{E}[\text{trace}(\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon})] = \mathbb{E}[\text{trace}(\mathbf{M}\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T)]$$ $$= \text{trace}(\mathbf{M}\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T]) = \text{trace}(\mathbf{M} \cdot \sigma^2\mathbf{I}_n) = \sigma^2 \text{trace}(\mathbf{M}) = \sigma^2(n-p)$$
This uses $\text{trace}(\mathbf{M}) = \text{trace}(\mathbf{I}_n) - \text{trace}(\mathbf{H}) = n - p$.
(The trace of the hat matrix equals $p$ because $\text{trace}(\mathbf{H}) = \text{trace}(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T) = \text{trace}(\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}) = \text{trace}(\mathbf{I}_p) = p$.)
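Both routes to $\mathbb{E}[\text{RSS}] = (n-p)\sigma^2$ rest on these trace identities, which can be checked numerically. The sketch below uses an arbitrary simulated design (an assumption for illustration) to verify $\text{trace}(\mathbf{H}) = p$, $\text{trace}(\mathbf{M}) = n - p$, and the Monte Carlo average of $\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 6, 1.3

X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(f"trace(H) = {np.trace(H):.6f}  (expected p = {p})")
print(f"trace(M) = {np.trace(M):.6f}  (expected n - p = {n - p})")

# Monte Carlo check: E[eps' M eps] should equal sigma^2 * (n - p)
eps = sigma * rng.standard_normal((20000, n))
quad_forms = np.einsum('ij,jk,ik->i', eps, M, eps)
print(f"mean of eps' M eps = {quad_forms.mean():.3f}  "
      f"(expected sigma^2 (n-p) = {sigma**2 * (n - p):.3f})")
```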
Beyond bias, we should understand how variable our variance estimates are.
Variance of chi-squared:
For $X \sim \chi^2_k$: $\text{Var}(X) = 2k$.
Variance of RSS:
Since $\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$: $$\text{Var}\left(\frac{\text{RSS}}{\sigma^2}\right) = 2(n-p)$$ $$\text{Var}(\text{RSS}) = 2(n-p)\sigma^4$$
Variance of unbiased estimator: $$\text{Var}(s^2) = \text{Var}\left(\frac{\text{RSS}}{n-p}\right) = \frac{\text{Var}(\text{RSS})}{(n-p)^2} = \frac{2(n-p)\sigma^4}{(n-p)^2} = \frac{2\sigma^4}{n-p}$$
Variance of MLE: $$\text{Var}(\hat{\sigma}^2_{\text{MLE}}) = \text{Var}\left(\frac{\text{RSS}}{n}\right) = \frac{\text{Var}(\text{RSS})}{n^2} = \frac{2(n-p)\sigma^4}{n^2}$$
The Mean Squared Error combines bias and variance: $$\text{MSE}(\hat{\theta}) = \text{Bias}^2 + \text{Variance}$$
For the MLE: $$\text{MSE}(\hat{\sigma}^2_{\text{MLE}}) = \left(-\frac{p}{n}\sigma^2\right)^2 + \frac{2(n-p)\sigma^4}{n^2} = \frac{p^2 + 2(n-p)}{n^2}\sigma^4$$
For the unbiased estimator: $$\text{MSE}(s^2) = 0 + \frac{2\sigma^4}{n-p} = \frac{2\sigma^4}{n-p}$$
Interestingly, the MLE can have lower MSE than the unbiased estimator for certain values of $n$ and $p$! This illustrates the bias-variance tradeoff: the MLE trades bias for lower variance.
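To see when that happens, the sketch below evaluates the two theoretical MSE formulas above for several illustrative $(n, p)$ pairs (the specific values are arbitrary choices, not from the text); the fuller simulation that follows checks the same quantities empirically.

```python
import numpy as np

def mse_mle(n, p, sigma2=1.0):
    """Theoretical MSE of RSS/n: bias^2 + variance."""
    return (p**2 + 2 * (n - p)) / n**2 * sigma2**2

def mse_unbiased(n, p, sigma2=1.0):
    """Theoretical MSE of RSS/(n-p): variance only (no bias)."""
    return 2 * sigma2**2 / (n - p)

print(f"{'n':>5} {'p':>4} {'MSE(MLE)':>12} {'MSE(s^2)':>12}  lower MSE")
for n, p in [(100, 4), (12, 5), (50, 10), (20, 10)]:
    m1, m2 = mse_mle(n, p), mse_unbiased(n, p)
    print(f"{n:>5} {p:>4} {m1:>12.5f} {m2:>12.5f}  {'MLE' if m1 < m2 else 'unbiased'}")
```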
```python
import numpy as np
from scipy import stats

def analyze_variance_estimators(n, p, sigma_true, n_simulations=50000):
    """
    Empirically verify properties of variance estimators.
    """
    sigma_sq_true = sigma_true**2
    df = n - p
    np.random.seed(42)

    # Store estimates
    mle_estimates = []
    unbiased_estimates = []

    for _ in range(n_simulations):
        # Generate data
        X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
        beta = np.zeros(p)
        y = X @ beta + sigma_true * np.random.randn(n)

        # Fit model
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_hat
        rss = np.sum(residuals**2)

        mle_estimates.append(rss / n)
        unbiased_estimates.append(rss / df)

    mle_estimates = np.array(mle_estimates)
    unbiased_estimates = np.array(unbiased_estimates)

    print(f"Variance Estimator Analysis (n={n}, p={p})")
    print("="*60)
    print(f"True σ² = {sigma_sq_true}")
    print(f"Degrees of freedom = {df}")

    print(f"\n--- MLE Estimator (RSS/n) ---")
    print(f"Empirical mean: {np.mean(mle_estimates):.4f}")
    print(f"Theoretical mean: {(n-p)/n * sigma_sq_true:.4f}")
    print(f"Empirical variance: {np.var(mle_estimates):.6f}")
    print(f"Theoretical var: {2*(n-p)*sigma_sq_true**2/n**2:.6f}")
    print(f"Empirical bias: {np.mean(mle_estimates) - sigma_sq_true:.4f}")
    print(f"Theoretical bias: {-p/n * sigma_sq_true:.4f}")

    print(f"\n--- Unbiased Estimator (RSS/(n-p)) ---")
    print(f"Empirical mean: {np.mean(unbiased_estimates):.4f}")
    print(f"Theoretical mean: {sigma_sq_true:.4f}")
    print(f"Empirical variance: {np.var(unbiased_estimates):.6f}")
    print(f"Theoretical var: {2*sigma_sq_true**2/(n-p):.6f}")
    print(f"Empirical bias: {np.mean(unbiased_estimates) - sigma_sq_true:.4f}")

    print(f"\n--- Chi-squared Distribution Check ---")
    scaled_rss = df * unbiased_estimates / sigma_sq_true  # Should be chi^2_{n-p}
    print(f"Theoretical chi^2_{df} mean: {df}")
    print(f"Empirical scaled RSS mean: {np.mean(scaled_rss):.4f}")
    print(f"Theoretical chi^2_{df} var: {2*df}")
    print(f"Empirical scaled RSS var: {np.var(scaled_rss):.4f}")

    # MSE comparison
    mse_mle = np.mean((mle_estimates - sigma_sq_true)**2)
    mse_unbiased = np.mean((unbiased_estimates - sigma_sq_true)**2)
    print(f"\n--- Mean Squared Error ---")
    print(f"MSE of MLE: {mse_mle:.6f}")
    print(f"MSE of unbiased: {mse_unbiased:.6f}")
    if mse_mle < mse_unbiased:
        print(f"MLE has lower MSE (bias-variance tradeoff favors MLE here)")
    else:
        print(f"Unbiased has lower MSE")

# Analyze for different scenarios
print("CASE 1: Large sample, few parameters")
analyze_variance_estimators(n=100, p=5, sigma_true=2.0)

print("\n" + "="*60 + "\n")
print("CASE 2: Moderate sample, more parameters")
analyze_variance_estimators(n=50, p=10, sigma_true=2.0)
```

The chi-squared distribution of RSS enables exact confidence intervals for $\sigma^2$.
Deriving the confidence interval:
Since $\frac{(n-p)s^2}{\sigma^2} \sim \chi^2_{n-p}$, we can construct a pivotal quantity:
$$P\left(\chi^2_{n-p, \alpha/2} \leq \frac{(n-p)s^2}{\sigma^2} \leq \chi^2_{n-p, 1-\alpha/2}\right) = 1 - \alpha$$
where $\chi^2_{k, q}$ denotes the $q$-th quantile of the $\chi^2_k$ distribution.
Solving for $\sigma^2$:
$$P\left(\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}} \leq \sigma^2 \leq \frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}\right) = 1 - \alpha$$
$$\left[\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}}, \quad \frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}\right]$$
Or equivalently: $$\left[\frac{\text{RSS}}{\chi^2_{n-p, 1-\alpha/2}}, \quad \frac{\text{RSS}}{\chi^2_{n-p, \alpha/2}}\right]$$
This interval is exact under the Gaussian assumption (not asymptotic).
Confidence interval for $\sigma$ (standard deviation):
Simply take square roots: $$\left[\sqrt{\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}}}, \quad \sqrt{\frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}}\right]$$
Properties of the interval: it is asymmetric around $s^2$ (because the chi-squared distribution is skewed), and it narrows as the degrees of freedom $n - p$ increase.
```python
import numpy as np
from scipy import stats

def variance_confidence_interval(rss, n, p, alpha=0.05):
    """
    Compute exact confidence interval for sigma^2.

    Parameters:
    -----------
    rss : float
        Residual sum of squares
    n : int
        Sample size
    p : int
        Number of parameters
    alpha : float
        Significance level (default 0.05 for 95% CI)

    Returns:
    --------
    tuple : (lower, upper, point_estimate) for sigma^2
    """
    df = n - p
    s2 = rss / df

    # Chi-squared quantiles
    chi2_lower = stats.chi2.ppf(alpha/2, df)
    chi2_upper = stats.chi2.ppf(1 - alpha/2, df)

    # Confidence interval for sigma^2
    ci_lower = df * s2 / chi2_upper
    ci_upper = df * s2 / chi2_lower

    return ci_lower, ci_upper, s2

def demonstrate_variance_ci():
    """Demonstrate variance CI with example and coverage check."""
    np.random.seed(42)

    # Example
    n, p = 50, 5
    sigma_true = 2.0
    sigma_sq_true = sigma_true**2

    # Generate sample data
    X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
    beta = np.random.randn(p)
    y = X @ beta + sigma_true * np.random.randn(n)

    # Fit regression
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    residuals = y - X @ beta_hat
    rss = np.sum(residuals**2)

    ci_lower, ci_upper, s2 = variance_confidence_interval(rss, n, p)

    print("Variance Confidence Interval Example")
    print("="*50)
    print(f"Sample size: n = {n}")
    print(f"Parameters: p = {p}")
    print(f"Degrees of freedom: {n-p}")
    print(f"True σ² = {sigma_sq_true}")
    print(f"\nPoint estimate s² = {s2:.4f}")
    print(f"95% CI for σ²: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"CI width: {ci_upper - ci_lower:.4f}")

    if ci_lower <= sigma_sq_true <= ci_upper:
        print(f"✓ True σ² is captured in the interval")
    else:
        print(f"✗ True σ² is NOT in the interval (happens 5% of the time)")

    # Coverage simulation
    print("\n" + "="*50)
    print("Coverage Probability Check (10,000 simulations)")

    n_sim = 10000
    captured = 0

    for _ in range(n_sim):
        y = X @ beta + sigma_true * np.random.randn(n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_hat
        rss = np.sum(residuals**2)
        ci_lower, ci_upper, _ = variance_confidence_interval(rss, n, p)
        if ci_lower <= sigma_sq_true <= ci_upper:
            captured += 1

    coverage = captured / n_sim
    print(f"Nominal coverage: 95%")
    print(f"Empirical coverage: {coverage*100:.2f}%")
    print(f"Standard error: ±{np.sqrt(0.95*0.05/n_sim)*100:.2f}%")

demonstrate_variance_ci()
```

The variance estimate $s^2$ is the foundation for all inference in regression. Let's trace its role through common tasks.
1. Standard errors of coefficients:
Recall that $\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$.
Since $\sigma^2$ is unknown, we estimate: $$\widehat{\text{Cov}}(\hat{\boldsymbol{\beta}}) = s^2(\mathbf{X}^T\mathbf{X})^{-1}$$
Standard error of $\hat{\beta}_j$: $$\text{SE}(\hat{\beta}_j) = \sqrt{s^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}$$
2. t-statistics and hypothesis tests:
To test $H_0: \beta_j = 0$: $$t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{\sqrt{s^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}}$$
Under $H_0$: $t_j \sim t_{n-p}$ (Student's t with $n-p$ degrees of freedom).
The $t$-distribution arises precisely because we're dividing a normal variable (numerator) by an estimate of its standard deviation based on a chi-squared variable (denominator).
If we knew $\sigma$ exactly, the test statistic would be normal: $$\frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}} \sim \mathcal{N}(0,1)$$
But we estimate $\sigma$ with $s$. This introduces additional variability. The result is a t-distribution:
$$\frac{\hat{\beta}_j - \beta_j}{s\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}} \sim t_{n-p}$$
As $n-p \to \infty$, $t_{n-p} \to \mathcal{N}(0,1)$. For large samples, the distinction vanishes.
3. Confidence intervals for coefficients:
$$\hat{\beta}_j \pm t_{n-p, 1-\alpha/2} \cdot \text{SE}(\hat{\beta}_j)$$
4. Prediction intervals:
For a new observation with features $\mathbf{x}_*$: $$\hat{y}_* \pm t_{n-p, 1-\alpha/2} \cdot \sqrt{s^2\left(1 + \mathbf{x}_*^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*\right)}$$
The $s^2$ factor scales both sources of uncertainty: the irreducible noise in the new observation (the $1$ inside the parentheses) and the estimation error in $\hat{\boldsymbol{\beta}}$ (the $\mathbf{x}_*^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*$ term).
5. ANOVA and F-tests:
The F-statistic for overall significance: $$F = \frac{(\text{TSS} - \text{RSS})/(p-1)}{\text{RSS}/(n-p)} = \frac{\text{MSR}}{\text{MSE}}$$
where MSE = $s^2$ is our variance estimate. Under $H_0$: all $\beta_j = 0$ (except intercept), $F \sim F_{p-1, n-p}$.
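The sketch below ties items 1–5 together on simulated data: it computes $s^2$, then the coefficient standard errors, t-statistics with p-values, 95% confidence intervals, a prediction interval for one new feature vector, and the overall F-statistic. The data, coefficients, and the new point $\mathbf{x}_*$ are illustrative assumptions, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 80, 4

# Simulated data (illustrative only)
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0, -0.5])
y = X @ beta_true + 1.5 * rng.standard_normal(n)

# Fit and estimate sigma^2
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
rss = np.sum(resid**2)
df = n - p
s2 = rss / df

# 1. Standard errors from s^2 (X'X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(s2 * np.diag(XtX_inv))

# 2. t-statistics and two-sided p-values for H0: beta_j = 0
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df)

# 3. 95% confidence intervals for each coefficient
t_crit = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:7.3f}  SE={se[j]:.3f}  t={t_stats[j]:6.2f}  "
          f"p={p_values[j]:.4f}  CI=[{ci[j, 0]:.3f}, {ci[j, 1]:.3f}]")

# 4. Prediction interval for a new observation x_*
x_new = np.array([1.0, 0.5, -1.0, 0.2])
y_hat = x_new @ beta_hat
pred_se = np.sqrt(s2 * (1 + x_new @ XtX_inv @ x_new))
print(f"Prediction: {y_hat:.3f} +/- {t_crit * pred_se:.3f}")

# 5. Overall F-test (all slopes zero)
tss = np.sum((y - y.mean())**2)
F = ((tss - rss) / (p - 1)) / (rss / df)
print(f"F = {F:.2f}, p-value = {stats.f.sf(F, p - 1, df):.2e}")
```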
Variance estimation is intimately connected to residual analysis—examining residuals helps assess whether our variance model is appropriate.
Standardized residuals:
Raw residuals $\hat{r}_i = y_i - \hat{y}_i$ have different variances! Specifically: $$\text{Var}(\hat{r}_i) = \sigma^2(1 - h_{ii})$$
where $h_{ii}$ is the $i$-th diagonal element of the hat matrix $\mathbf{H}$.
Internally studentized residuals: $$e_i = \frac{\hat{r}_i}{s\sqrt{1 - h_{ii}}}$$
These have approximately unit variance under the model.
Externally studentized residuals (Studentized deleted residuals): $$t_i = \frac{\hat{r}_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$
where $s_{(i)}$ is the standard deviation estimate computed without observation $i$. Under $H_0$ that observation $i$ follows the model: $t_i \sim t_{n-p-1}$.
```python
import numpy as np
from scipy import stats

class RegressionDiagnostics:
    """
    Comprehensive residual diagnostics for linear regression.
    """

    def __init__(self, X, y):
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.n, self.p = X.shape

        # Fit model
        self.beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        self.fitted = X @ self.beta_hat
        self.residuals = y - self.fitted
        self.rss = np.sum(self.residuals**2)
        self.s2 = self.rss / (self.n - self.p)
        self.s = np.sqrt(self.s2)

        # Hat matrix diagonals
        XtX_inv = np.linalg.inv(X.T @ X)
        self.hat_diag = np.sum(X @ XtX_inv * X, axis=1)  # h_ii

    def standardized_residuals(self):
        """Residuals divided by estimated std deviation."""
        return self.residuals / self.s

    def internally_studentized(self):
        """Internally studentized residuals."""
        return self.residuals / (self.s * np.sqrt(1 - self.hat_diag))

    def externally_studentized(self):
        """Externally studentized (deleted) residuals."""
        e = self.internally_studentized()
        # Delete-one formula: t_i = e_i * sqrt((n-p-1)/(n-p-e_i^2))
        t = e * np.sqrt((self.n - self.p - 1) / (self.n - self.p - e**2))
        return t

    def cooks_distance(self):
        """Cook's distance for influence."""
        e_int = self.internally_studentized()
        return (e_int**2 / self.p) * (self.hat_diag / (1 - self.hat_diag))

    def identify_outliers(self, alpha=0.05):
        """Identify potential outliers using externally studentized residuals."""
        t = self.externally_studentized()
        df = self.n - self.p - 1
        # Bonferroni correction
        threshold = stats.t.ppf(1 - alpha/(2*self.n), df)
        outliers = np.where(np.abs(t) > threshold)[0]
        return outliers, t, threshold

    def summary(self):
        """Print diagnostic summary."""
        print("Residual Diagnostics Summary")
        print("="*60)

        std_res = self.standardized_residuals()
        int_stud = self.internally_studentized()
        ext_stud = self.externally_studentized()
        cooks = self.cooks_distance()

        print(f"\n--- Residual Statistics ---")
        print(f"RSS: {self.rss:.4f}")
        print(f"s² (unbiased): {self.s2:.4f}")
        print(f"s: {self.s:.4f}")

        print(f"\n--- Residual Distribution ---")
        print(f"Mean of residuals: {np.mean(self.residuals):.6f} (should be ≈0)")
        print(f"Std of standardized residuals: {np.std(std_res):.4f} (should be ≈1)")

        print(f"\n--- Potential Outliers (|internally studentized| > 2) ---")
        potential_outliers = np.where(np.abs(int_stud) > 2)[0]
        if len(potential_outliers) > 0:
            for i in potential_outliers:
                print(f"  Observation {i}: t = {int_stud[i]:.3f}")
        else:
            print("  None identified")

        print(f"\n--- High Leverage Points (h_ii > 2p/n = {2*self.p/self.n:.4f}) ---")
        high_leverage = np.where(self.hat_diag > 2*self.p/self.n)[0]
        if len(high_leverage) > 0:
            for i in high_leverage:
                print(f"  Observation {i}: h_ii = {self.hat_diag[i]:.4f}")
        else:
            print("  None identified")

        print(f"\n--- Influential Points (Cook's D > 4/n = {4/self.n:.4f}) ---")
        influential = np.where(cooks > 4/self.n)[0]
        if len(influential) > 0:
            for i in influential:
                print(f"  Observation {i}: Cook's D = {cooks[i]:.4f}")
        else:
            print("  None identified")

# Example with some outliers
np.random.seed(42)
n, p = 100, 4
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + np.random.randn(n)

# Add an outlier
y[50] += 10

diag = RegressionDiagnostics(X, y)
diag.summary()
```

Key diagnostic checks involving $s^2$:
- Normality: if residuals are truly $\mathcal{N}(0, \sigma^2)$, studentized residuals should follow $t_{n-p-1}$.
- Homoscedasticity: plot $|\hat{r}_i|$ or $\hat{r}_i^2$ against fitted values; if the variance is constant, no pattern should emerge.
- Outliers: studentized residuals exceeding 2–3 in absolute value suggest unusual observations.
- Model fit: if $s^2$ is large relative to the variance of $y$, the model explains little variation (low $R^2$).
We've completed our deep dive into variance estimation. The complete picture: the MLE $\text{RSS}/n$ is biased low by the factor $\frac{n-p}{n}$; dividing by $n-p$ instead gives the unbiased estimator $s^2$; the scaled RSS follows a $\chi^2_{n-p}$ distribution; and $s^2$ underpins every standard error, t-test, F-test, and interval in the regression toolkit.
You've now mastered the probabilistic perspective on linear regression. Starting from the Gaussian noise assumption, you've seen how least squares emerges as maximum likelihood, how the estimators inherit exact sampling distributions, and how the noise variance is estimated and put to work in inference.
This foundation prepares you for regularized regression, generalized linear models, and Bayesian approaches—all of which build on the maximum likelihood framework.
Looking forward:
With the maximum likelihood view complete, you have a powerful framework for understanding linear regression. The next module will address regression diagnostics in depth—checking assumptions, detecting violations, and remedying problems. The variance estimation concepts from this page will be essential there.
The journey from geometry (OLS) through probability (MLE) gives you multiple lenses for understanding the same fundamental technique. Expert practitioners use all these perspectives, choosing the one most helpful for each situation.