In discussions of linear regression, the spotlight typically falls on $\boldsymbol{\beta}$—the regression coefficients. But lurking in the shadows is an equally important parameter: the noise variance $\sigma^2$.
The variance parameter plays critical roles: it quantifies the noise level in the data, sets the scale of standard errors and test statistics, and determines the width of confidence and prediction intervals.
This page provides a rigorous treatment of variance estimation—its MLE, its bias, its distribution, and its practical importance.
By the end of this page, you will understand both estimators of $\sigma^2$, why the scaled RSS follows a chi-squared distribution, and how the variance estimate drives standard errors, tests, and intervals.
We've encountered two estimators for $\sigma^2$. Let's state them clearly and understand their differences.
The MLE estimator: $$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{\text{RSS}}{n}$$
The unbiased estimator: $$s^2 = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{\text{RSS}}{n-p}$$
where $\text{RSS} = \sum_{i=1}^n \hat{r}_i^2 = |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2$ is the residual sum of squares, $n$ is the sample size, and $p$ is the number of parameters in $\boldsymbol{\beta}$.
The only difference is the denominator: $n$ for MLE, $n-p$ for the unbiased version.
| Property | MLE: $\frac{\text{RSS}}{n}$ | Unbiased: $\frac{\text{RSS}}{n-p}$ |
|---|---|---|
| Bias | $\mathbb{E}[\hat{\sigma}^2] = \frac{n-p}{n}\sigma^2$ (biased low) | $\mathbb{E}[s^2] = \sigma^2$ (unbiased) |
| Consistency | Yes: $\hat{\sigma}^2 \xrightarrow{p} \sigma^2$ | Yes: $s^2 \xrightarrow{p} \sigma^2$ |
| Distribution | $\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$ | $\frac{(n-p)s^2}{\sigma^2} \sim \chi^2_{n-p}$ |
| Variance | $\text{Var}(\hat{\sigma}^2) = \frac{2(n-p)\sigma^4}{n^2}$ | $\text{Var}(s^2) = \frac{2\sigma^4}{n-p}$ |
| Typical use | Likelihood, AIC/BIC, large samples | Inference, confidence intervals, F-tests |
The MLE is theoretically optimal in a likelihood sense but biased. The unbiased estimator sacrifices a tiny bit of efficiency (slightly higher variance) to eliminate bias.
For practical inference (confidence intervals, hypothesis tests), the unbiased estimator $s^2$ is standard. For model selection with AIC/BIC, the MLE is often used.
As $n$ grows large, both converge to $\sigma^2$, and the distinction vanishes.
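Here is a minimal sketch (on simulated data, purely for illustration) of how both estimators come out of a single least-squares fit; the fuller simulation later on this page verifies their sampling properties.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 60, 4, 1.5

# Simulated design and response (illustrative assumption, not from the text)
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta + sigma * rng.standard_normal(n)

# Fit by least squares and form the residual sum of squares
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
rss = np.sum((y - X @ beta_hat) ** 2)

sigma2_mle = rss / n       # MLE: divides by n, biased low
s2 = rss / (n - p)         # unbiased: divides by n - p

print(f"RSS = {rss:.3f}")
print(f"MLE estimate      RSS/n     = {sigma2_mle:.3f}")
print(f"Unbiased estimate RSS/(n-p) = {s2:.3f}  (true sigma^2 = {sigma**2})")
```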
Understanding the distribution of RSS is key to deriving the properties of variance estimators.
Setup:
Under the Gaussian model $\mathbf{y} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$, the residual vector is: $$\hat{\mathbf{r}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = (\mathbf{I}_n - \mathbf{H})\mathbf{y} = \mathbf{M}\mathbf{y}$$
where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix and $\mathbf{M} = \mathbf{I}_n - \mathbf{H}$ is the residual-maker (annihilator) matrix.
Key properties of $\mathbf{M}$:

- Symmetric: $\mathbf{M}^T = \mathbf{M}$
- Idempotent: $\mathbf{M}^2 = \mathbf{M}$
- Annihilates the column space of $\mathbf{X}$: $\mathbf{M}\mathbf{X} = \mathbf{0}$
- Rank: $\text{rank}(\mathbf{M}) = n - p$
Deriving the distribution of RSS:
The true errors are $\boldsymbol{\epsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$, with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$.
The residuals are: $$\hat{\mathbf{r}} = \mathbf{M}\mathbf{y} = \mathbf{M}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) = \mathbf{M}\mathbf{X}\boldsymbol{\beta} + \mathbf{M}\boldsymbol{\epsilon} = \mathbf{M}\boldsymbol{\epsilon}$$
(Using $\mathbf{M}\mathbf{X} = \mathbf{0}$.)
So residuals are a linear transformation of true errors, and the RSS is: $$\text{RSS} = \hat{\mathbf{r}}^T\hat{\mathbf{r}} = \boldsymbol{\epsilon}^T\mathbf{M}^T\mathbf{M}\boldsymbol{\epsilon} = \boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}$$
(Using $\mathbf{M}^T = \mathbf{M}$ and $\mathbf{M}^2 = \mathbf{M}$.)
Now, $\boldsymbol{\epsilon}/\sigma \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$, so: $$\frac{\text{RSS}}{\sigma^2} = \frac{\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}}{\sigma^2} = \mathbf{z}^T\mathbf{M}\mathbf{z}$$
where $\mathbf{z} = \boldsymbol{\epsilon}/\sigma \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_n)$.
For standard normal $\mathbf{z}$ and symmetric idempotent matrix $\mathbf{M}$ of rank $k$:
$$\mathbf{z}^T\mathbf{M}\mathbf{z} \sim \chi^2_k$$
Since $\text{rank}(\mathbf{M}) = n - p$:
$$\boxed{\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}}$$
This is a fundamental result. The scaled RSS follows a chi-squared distribution with $n-p$ degrees of freedom.
Why $n-p$ degrees of freedom?
Intuitively, the $n$ residuals must satisfy the $p$ linear constraints $\mathbf{X}^T\hat{\mathbf{r}} = \mathbf{0}$ imposed by the normal equations, so only $n - p$ of them are free to vary.
Mathematically, $\mathbf{M}$ projects $\mathbb{R}^n$ onto an $(n-p)$-dimensional subspace (the orthogonal complement of the column space of $\mathbf{X}$).
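These facts about $\mathbf{M}$—symmetry, idempotence, $\mathbf{M}\mathbf{X} = \mathbf{0}$, and rank $n-p$—are easy to confirm numerically, as is the chi-squared law for the quadratic form. The sketch below uses an arbitrary simulated design matrix (an illustrative assumption, not data from the text).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
M = np.eye(n) - H                      # residual-maker matrix

print("M symmetric: ", np.allclose(M, M.T))
print("M idempotent:", np.allclose(M @ M, M))
print("M X = 0:     ", np.allclose(M @ X, 0))
print("rank(M) =", np.linalg.matrix_rank(M), "(expected n - p =", n - p, ")")

# Distributional check: z' M z for standard normal z should be chi^2_{n-p}
z = rng.standard_normal((20000, n))
quad = np.einsum('ij,jk,ik->i', z, M, z)   # quadratic form for each simulated z
ks = stats.kstest(quad, 'chi2', args=(n - p,))
print(f"KS test vs chi^2_{n-p}: statistic = {ks.statistic:.4f}, p-value = {ks.pvalue:.3f}")
```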
Let's derive the bias of the MLE variance estimator rigorously.
Expected value of RSS:
Since $\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$ and $\mathbb{E}[\chi^2_k] = k$: $$\mathbb{E}\left[\frac{\text{RSS}}{\sigma^2}\right] = n - p$$ $$\mathbb{E}[\text{RSS}] = (n-p)\sigma^2$$
Expectation of MLE: $$\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \mathbb{E}\left[\frac{\text{RSS}}{n}\right] = \frac{(n-p)\sigma^2}{n} = \frac{n-p}{n}\sigma^2$$
Bias: $$\text{Bias}(\hat{\sigma}^2_{\text{MLE}}) = \mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] - \sigma^2 = \frac{n-p}{n}\sigma^2 - \sigma^2 = -\frac{p}{n}\sigma^2$$
In other words, the MLE underestimates $\sigma^2$ on average, shrinking it by the factor $\frac{n-p}{n}$.
The bias can be severe when $p$ is large relative to $n$:
| $n$ | $p$ | $n-p$ | Ratio $\frac{n-p}{n}$ | % Underestimation |
|---|---|---|---|---|
| 100 | 5 | 95 | 0.95 | 5% |
| 50 | 10 | 40 | 0.80 | 20% |
| 20 | 10 | 10 | 0.50 | 50% |
| 15 | 10 | 5 | 0.33 | 67% |
In high-dimensional settings where $p \approx n$, the MLE is severely biased!
Expectation of unbiased estimator: $$\mathbb{E}[s^2] = \mathbb{E}\left[\frac{\text{RSS}}{n-p}\right] = \frac{(n-p)\sigma^2}{n-p} = \sigma^2$$
The division by $n-p$ exactly corrects the bias.
Alternative derivation via trace:
We can also compute $\mathbb{E}[\text{RSS}]$ directly: $$\mathbb{E}[\text{RSS}] = \mathbb{E}[\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}] = \mathbb{E}[\text{trace}(\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon})] = \mathbb{E}[\text{trace}(\mathbf{M}\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T)]$$ $$= \text{trace}(\mathbf{M}\mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T]) = \text{trace}(\mathbf{M} \cdot \sigma^2\mathbf{I}_n) = \sigma^2 \text{trace}(\mathbf{M}) = \sigma^2(n-p)$$
This uses $\text{trace}(\mathbf{M}) = \text{trace}(\mathbf{I}_n) - \text{trace}(\mathbf{H}) = n - p$.
(The trace of the hat matrix equals $p$ because $\text{trace}(\mathbf{H}) = \text{trace}(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T) = \text{trace}(\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}) = \text{trace}(\mathbf{I}_p) = p$.)
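Both routes to $\mathbb{E}[\text{RSS}] = (n-p)\sigma^2$ rest on these trace identities, which can be checked numerically. The sketch below uses an arbitrary simulated design (an assumption for illustration) to verify $\text{trace}(\mathbf{H}) = p$, $\text{trace}(\mathbf{M}) = n - p$, and the Monte Carlo average of $\boldsymbol{\epsilon}^T\mathbf{M}\boldsymbol{\epsilon}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 6, 1.3

X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - H

print(f"trace(H) = {np.trace(H):.6f}  (expected p = {p})")
print(f"trace(M) = {np.trace(M):.6f}  (expected n - p = {n - p})")

# Monte Carlo check: E[eps' M eps] should equal sigma^2 * (n - p)
eps = sigma * rng.standard_normal((20000, n))
quad_forms = np.einsum('ij,jk,ik->i', eps, M, eps)
print(f"mean of eps' M eps = {quad_forms.mean():.3f}  "
      f"(expected sigma^2 (n-p) = {sigma**2 * (n - p):.3f})")
```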
Beyond bias, we should understand how variable our variance estimates are.
Variance of chi-squared:
For $X \sim \chi^2_k$: $\text{Var}(X) = 2k$.
Variance of RSS:
Since $\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$: $$\text{Var}\left(\frac{\text{RSS}}{\sigma^2}\right) = 2(n-p)$$ $$\text{Var}(\text{RSS}) = 2(n-p)\sigma^4$$
Variance of unbiased estimator: $$\text{Var}(s^2) = \text{Var}\left(\frac{\text{RSS}}{n-p}\right) = \frac{\text{Var}(\text{RSS})}{(n-p)^2} = \frac{2(n-p)\sigma^4}{(n-p)^2} = \frac{2\sigma^4}{n-p}$$
Variance of MLE: $$\text{Var}(\hat{\sigma}^2_{\text{MLE}}) = \text{Var}\left(\frac{\text{RSS}}{n}\right) = \frac{\text{Var}(\text{RSS})}{n^2} = \frac{2(n-p)\sigma^4}{n^2}$$
The Mean Squared Error combines bias and variance: $$\text{MSE}(\hat{\theta}) = \text{Bias}^2 + \text{Variance}$$
For the MLE: $$\text{MSE}(\hat{\sigma}^2_{\text{MLE}}) = \left(-\frac{p}{n}\sigma^2\right)^2 + \frac{2(n-p)\sigma^4}{n^2} = \frac{p^2 + 2(n-p)}{n^2}\sigma^4$$
For the unbiased estimator: $$\text{MSE}(s^2) = 0 + \frac{2\sigma^4}{n-p} = \frac{2\sigma^4}{n-p}$$
Interestingly, the MLE can have lower MSE than the unbiased estimator for certain values of $n$ and $p$! This illustrates the bias-variance tradeoff: the MLE trades bias for lower variance.
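To see when that happens, the sketch below evaluates the two theoretical MSE formulas above for several illustrative $(n, p)$ pairs (the specific values are arbitrary choices, not from the text); the fuller simulation that follows checks the same quantities empirically.

```python
import numpy as np

def mse_mle(n, p, sigma2=1.0):
    """Theoretical MSE of RSS/n: bias^2 + variance."""
    return (p**2 + 2 * (n - p)) / n**2 * sigma2**2

def mse_unbiased(n, p, sigma2=1.0):
    """Theoretical MSE of RSS/(n-p): variance only (no bias)."""
    return 2 * sigma2**2 / (n - p)

print(f"{'n':>5} {'p':>4} {'MSE(MLE)':>12} {'MSE(s^2)':>12}  lower MSE")
for n, p in [(100, 4), (12, 5), (50, 10), (20, 10)]:
    m1, m2 = mse_mle(n, p), mse_unbiased(n, p)
    print(f"{n:>5} {p:>4} {m1:>12.5f} {m2:>12.5f}  {'MLE' if m1 < m2 else 'unbiased'}")
```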
```python
import numpy as np
from scipy import stats

def analyze_variance_estimators(n, p, sigma_true, n_simulations=50000):
    """
    Empirically verify properties of variance estimators.
    """
    sigma_sq_true = sigma_true**2
    df = n - p
    np.random.seed(42)

    # Store estimates
    mle_estimates = []
    unbiased_estimates = []

    for _ in range(n_simulations):
        # Generate data
        X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
        beta = np.zeros(p)
        y = X @ beta + sigma_true * np.random.randn(n)

        # Fit model
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_hat
        rss = np.sum(residuals**2)

        mle_estimates.append(rss / n)
        unbiased_estimates.append(rss / df)

    mle_estimates = np.array(mle_estimates)
    unbiased_estimates = np.array(unbiased_estimates)

    print(f"Variance Estimator Analysis (n={n}, p={p})")
    print("="*60)
    print(f"True σ² = {sigma_sq_true}")
    print(f"Degrees of freedom = {df}")

    print(f"\n--- MLE Estimator (RSS/n) ---")
    print(f"Empirical mean: {np.mean(mle_estimates):.4f}")
    print(f"Theoretical mean: {(n-p)/n * sigma_sq_true:.4f}")
    print(f"Empirical variance: {np.var(mle_estimates):.6f}")
    print(f"Theoretical var: {2*(n-p)*sigma_sq_true**2/n**2:.6f}")
    print(f"Empirical bias: {np.mean(mle_estimates) - sigma_sq_true:.4f}")
    print(f"Theoretical bias: {-p/n * sigma_sq_true:.4f}")

    print(f"\n--- Unbiased Estimator (RSS/(n-p)) ---")
    print(f"Empirical mean: {np.mean(unbiased_estimates):.4f}")
    print(f"Theoretical mean: {sigma_sq_true:.4f}")
    print(f"Empirical variance: {np.var(unbiased_estimates):.6f}")
    print(f"Theoretical var: {2*sigma_sq_true**2/(n-p):.6f}")
    print(f"Empirical bias: {np.mean(unbiased_estimates) - sigma_sq_true:.4f}")

    print(f"\n--- Chi-squared Distribution Check ---")
    scaled_rss = df * unbiased_estimates / sigma_sq_true  # Should be chi^2_{n-p}
    print(f"Theoretical chi^2_{df} mean: {df}")
    print(f"Empirical scaled RSS mean: {np.mean(scaled_rss):.4f}")
    print(f"Theoretical chi^2_{df} var: {2*df}")
    print(f"Empirical scaled RSS var: {np.var(scaled_rss):.4f}")

    # MSE comparison
    mse_mle = np.mean((mle_estimates - sigma_sq_true)**2)
    mse_unbiased = np.mean((unbiased_estimates - sigma_sq_true)**2)
    print(f"\n--- Mean Squared Error ---")
    print(f"MSE of MLE: {mse_mle:.6f}")
    print(f"MSE of unbiased: {mse_unbiased:.6f}")
    if mse_mle < mse_unbiased:
        print(f"MLE has lower MSE (bias-variance tradeoff favors MLE here)")
    else:
        print(f"Unbiased has lower MSE")

# Analyze for different scenarios
print("CASE 1: Large sample, few parameters")
analyze_variance_estimators(n=100, p=5, sigma_true=2.0)

print("\n" + "="*60 + "\n")
print("CASE 2: Moderate sample, more parameters")
analyze_variance_estimators(n=50, p=10, sigma_true=2.0)
```

The chi-squared distribution of RSS enables exact confidence intervals for $\sigma^2$.
Deriving the confidence interval:
Since $\frac{(n-p)s^2}{\sigma^2} \sim \chi^2_{n-p}$, we can construct a pivotal quantity:
$$P\left(\chi^2_{n-p, \alpha/2} \leq \frac{(n-p)s^2}{\sigma^2} \leq \chi^2_{n-p, 1-\alpha/2}\right) = 1 - \alpha$$
where $\chi^2_{k, q}$ denotes the $q$-th quantile of the $\chi^2_k$ distribution.
Solving for $\sigma^2$:
$$P\left(\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}} \leq \sigma^2 \leq \frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}\right) = 1 - \alpha$$
$$\left[\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}}, \quad \frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}\right]$$
Or equivalently: $$\left[\frac{\text{RSS}}{\chi^2_{n-p, 1-\alpha/2}}, \quad \frac{\text{RSS}}{\chi^2_{n-p, \alpha/2}}\right]$$
This interval is exact under the Gaussian assumption (not asymptotic).
Confidence interval for $\sigma$ (standard deviation):
Simply take square roots: $$\left[\sqrt{\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}}}, \quad \sqrt{\frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}}\right]$$
Properties of the interval: it is asymmetric around $s^2$ (because the chi-squared distribution is skewed), and it narrows as the degrees of freedom $n - p$ increase.
```python
import numpy as np
from scipy import stats

def variance_confidence_interval(rss, n, p, alpha=0.05):
    """
    Compute exact confidence interval for sigma^2.

    Parameters:
    -----------
    rss : float
        Residual sum of squares
    n : int
        Sample size
    p : int
        Number of parameters
    alpha : float
        Significance level (default 0.05 for 95% CI)

    Returns:
    --------
    tuple : (lower, upper, point_estimate) for sigma^2
    """
    df = n - p
    s2 = rss / df

    # Chi-squared quantiles
    chi2_lower = stats.chi2.ppf(alpha/2, df)
    chi2_upper = stats.chi2.ppf(1 - alpha/2, df)

    # Confidence interval for sigma^2
    ci_lower = df * s2 / chi2_upper
    ci_upper = df * s2 / chi2_lower

    return ci_lower, ci_upper, s2

def demonstrate_variance_ci():
    """Demonstrate variance CI with example and coverage check."""
    np.random.seed(42)

    # Example
    n, p = 50, 5
    sigma_true = 2.0
    sigma_sq_true = sigma_true**2

    # Generate sample data
    X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
    beta = np.random.randn(p)
    y = X @ beta + sigma_true * np.random.randn(n)

    # Fit regression
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    residuals = y - X @ beta_hat
    rss = np.sum(residuals**2)

    ci_lower, ci_upper, s2 = variance_confidence_interval(rss, n, p)

    print("Variance Confidence Interval Example")
    print("="*50)
    print(f"Sample size: n = {n}")
    print(f"Parameters: p = {p}")
    print(f"Degrees of freedom: {n-p}")
    print(f"True σ² = {sigma_sq_true}")
    print(f"\nPoint estimate s² = {s2:.4f}")
    print(f"95% CI for σ²: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"CI width: {ci_upper - ci_lower:.4f}")

    if ci_lower <= sigma_sq_true <= ci_upper:
        print(f"✓ True σ² is captured in the interval")
    else:
        print(f"✗ True σ² is NOT in the interval (happens 5% of the time)")

    # Coverage simulation
    print("\n" + "="*50)
    print("Coverage Probability Check (10,000 simulations)")

    n_sim = 10000
    captured = 0

    for _ in range(n_sim):
        y = X @ beta + sigma_true * np.random.randn(n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_hat
        rss = np.sum(residuals**2)
        ci_lower, ci_upper, _ = variance_confidence_interval(rss, n, p)
        if ci_lower <= sigma_sq_true <= ci_upper:
            captured += 1

    coverage = captured / n_sim
    print(f"Nominal coverage: 95%")
    print(f"Empirical coverage: {coverage*100:.2f}%")
    print(f"Standard error: ±{np.sqrt(0.95*0.05/n_sim)*100:.2f}%")

demonstrate_variance_ci()
```

The variance estimate $s^2$ is the foundation for all inference in regression. Let's trace its role through common tasks.
1. Standard errors of coefficients:
Recall that $\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$.
Since $\sigma^2$ is unknown, we estimate: $$\widehat{\text{Cov}}(\hat{\boldsymbol{\beta}}) = s^2(\mathbf{X}^T\mathbf{X})^{-1}$$
Standard error of $\hat{\beta}_j$: $$\text{SE}(\hat{\beta}_j) = \sqrt{s^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}$$
2. t-statistics and hypothesis tests:
To test $H_0: \beta_j = 0$: $$t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j}{\sqrt{s^2 [(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}}$$
Under $H_0$: $t_j \sim t_{n-p}$ (Student's t with $n-p$ degrees of freedom).
The $t$-distribution arises precisely because we're dividing a normal variable (numerator) by an estimate of its standard deviation based on a chi-squared variable (denominator).
If we knew $\sigma$ exactly, the test statistic would be normal: $$\frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}} \sim \mathcal{N}(0,1)$$
But we estimate $\sigma$ with $s$. This introduces additional variability. The result is a t-distribution:
$$\frac{\hat{\beta}_j - \beta_j}{s\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}} \sim t_{n-p}$$
As $n-p \to \infty$, $t_{n-p} \to \mathcal{N}(0,1)$. For large samples, the distinction vanishes.
3. Confidence intervals for coefficients:
$$\hat{\beta}_j \pm t_{n-p, 1-\alpha/2} \cdot \text{SE}(\hat{\beta}_j)$$
4. Prediction intervals:
For a new observation with features $\mathbf{x}_*$: $$\hat{y}_* \pm t_{n-p, 1-\alpha/2} \cdot \sqrt{s^2\left(1 + \mathbf{x}_*^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*\right)}$$
The $s^2$ factor scales both sources of uncertainty: the irreducible noise in the new observation (the $1$ inside the parentheses) and the estimation error in $\hat{\boldsymbol{\beta}}$ (the $\mathbf{x}_*^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_*$ term).
5. ANOVA and F-tests:
The F-statistic for overall significance: $$F = \frac{(\text{TSS} - \text{RSS})/(p-1)}{\text{RSS}/(n-p)} = \frac{\text{MSR}}{\text{MSE}}$$
where MSE = $s^2$ is our variance estimate. Under $H_0$: all $\beta_j = 0$ (except intercept), $F \sim F_{p-1, n-p}$.
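The sketch below ties items 1–5 together on simulated data: it computes $s^2$, then the coefficient standard errors, t-statistics with p-values, 95% confidence intervals, a prediction interval for one new feature vector, and the overall F-statistic. The data, coefficients, and the new point $\mathbf{x}_*$ are illustrative assumptions, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 80, 4

# Simulated data (illustrative only)
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
beta_true = np.array([1.0, 2.0, 0.0, -0.5])
y = X @ beta_true + 1.5 * rng.standard_normal(n)

# Fit and estimate sigma^2
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
rss = np.sum(resid**2)
df = n - p
s2 = rss / df

# 1. Standard errors from s^2 (X'X)^{-1}
XtX_inv = np.linalg.inv(X.T @ X)
se = np.sqrt(s2 * np.diag(XtX_inv))

# 2. t-statistics and two-sided p-values for H0: beta_j = 0
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df)

# 3. 95% confidence intervals for each coefficient
t_crit = stats.t.ppf(0.975, df)
ci = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])

for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:7.3f}  SE={se[j]:.3f}  t={t_stats[j]:6.2f}  "
          f"p={p_values[j]:.4f}  CI=[{ci[j, 0]:.3f}, {ci[j, 1]:.3f}]")

# 4. Prediction interval for a new observation x_*
x_new = np.array([1.0, 0.5, -1.0, 0.2])
y_hat = x_new @ beta_hat
pred_se = np.sqrt(s2 * (1 + x_new @ XtX_inv @ x_new))
print(f"Prediction: {y_hat:.3f} +/- {t_crit * pred_se:.3f}")

# 5. Overall F-test (all slopes zero)
tss = np.sum((y - y.mean())**2)
F = ((tss - rss) / (p - 1)) / (rss / df)
print(f"F = {F:.2f}, p-value = {stats.f.sf(F, p - 1, df):.2e}")
```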
Variance estimation is intimately connected to residual analysis—examining residuals helps assess whether our variance model is appropriate.
Standardized residuals:
Raw residuals $\hat{r}_i = y_i - \hat{y}_i$ have different variances! Specifically: $$\text{Var}(\hat{r}_i) = \sigma^2(1 - h_{ii})$$
where $h_{ii}$ is the $i$-th diagonal element of the hat matrix $\mathbf{H}$.
Internally studentized residuals: $$e_i = \frac{\hat{r}_i}{s\sqrt{1 - h_{ii}}}$$
These have approximately unit variance under the model.
Externally studentized residuals (Studentized deleted residuals): $$t_i = \frac{\hat{r}_i}{s_{(i)}\sqrt{1 - h_{ii}}}$$
where $s_{(i)}$ is the standard deviation estimate computed without observation $i$. Under $H_0$ that observation $i$ follows the model: $t_i \sim t_{n-p-1}$.
```python
import numpy as np
from scipy import stats

class RegressionDiagnostics:
    """
    Comprehensive residual diagnostics for linear regression.
    """

    def __init__(self, X, y):
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.n, self.p = X.shape

        # Fit model
        self.beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        self.fitted = X @ self.beta_hat
        self.residuals = y - self.fitted
        self.rss = np.sum(self.residuals**2)
        self.s2 = self.rss / (self.n - self.p)
        self.s = np.sqrt(self.s2)

        # Hat matrix diagonals
        XtX_inv = np.linalg.inv(X.T @ X)
        self.hat_diag = np.sum(X @ XtX_inv * X, axis=1)  # h_ii

    def standardized_residuals(self):
        """Residuals divided by estimated std deviation."""
        return self.residuals / self.s

    def internally_studentized(self):
        """Internally studentized residuals."""
        return self.residuals / (self.s * np.sqrt(1 - self.hat_diag))

    def externally_studentized(self):
        """Externally studentized (deleted) residuals."""
        e = self.internally_studentized()
        # Delete-one formula: t_i = e_i * sqrt((n-p-1)/(n-p-e_i^2))
        t = e * np.sqrt((self.n - self.p - 1) / (self.n - self.p - e**2))
        return t

    def cooks_distance(self):
        """Cook's distance for influence."""
        e_int = self.internally_studentized()
        return (e_int**2 / self.p) * (self.hat_diag / (1 - self.hat_diag))

    def identify_outliers(self, alpha=0.05):
        """Identify potential outliers using externally studentized residuals."""
        t = self.externally_studentized()
        df = self.n - self.p - 1
        # Bonferroni correction
        threshold = stats.t.ppf(1 - alpha/(2*self.n), df)
        outliers = np.where(np.abs(t) > threshold)[0]
        return outliers, t, threshold

    def summary(self):
        """Print diagnostic summary."""
        print("Residual Diagnostics Summary")
        print("="*60)

        std_res = self.standardized_residuals()
        int_stud = self.internally_studentized()
        ext_stud = self.externally_studentized()
        cooks = self.cooks_distance()

        print(f"\n--- Residual Statistics ---")
        print(f"RSS: {self.rss:.4f}")
        print(f"s² (unbiased): {self.s2:.4f}")
        print(f"s: {self.s:.4f}")

        print(f"\n--- Residual Distribution ---")
        print(f"Mean of residuals: {np.mean(self.residuals):.6f} (should be ≈0)")
        print(f"Std of standardized residuals: {np.std(std_res):.4f} (should be ≈1)")

        print(f"\n--- Potential Outliers (|internally studentized| > 2) ---")
        potential_outliers = np.where(np.abs(int_stud) > 2)[0]
        if len(potential_outliers) > 0:
            for i in potential_outliers:
                print(f"  Observation {i}: t = {int_stud[i]:.3f}")
        else:
            print("  None identified")

        print(f"\n--- High Leverage Points (h_ii > 2p/n = {2*self.p/self.n:.4f}) ---")
        high_leverage = np.where(self.hat_diag > 2*self.p/self.n)[0]
        if len(high_leverage) > 0:
            for i in high_leverage:
                print(f"  Observation {i}: h_ii = {self.hat_diag[i]:.4f}")
        else:
            print("  None identified")

        print(f"\n--- Influential Points (Cook's D > 4/n = {4/self.n:.4f}) ---")
        influential = np.where(cooks > 4/self.n)[0]
        if len(influential) > 0:
            for i in influential:
                print(f"  Observation {i}: Cook's D = {cooks[i]:.4f}")
        else:
            print("  None identified")

# Example with some outliers
np.random.seed(42)
n, p = 100, 4
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + np.random.randn(n)

# Add an outlier
y[50] += 10

diag = RegressionDiagnostics(X, y)
diag.summary()
```

Key diagnostic checks involving $s^2$:
- Normality: if residuals are truly $\mathcal{N}(0, \sigma^2)$, studentized residuals should follow $t_{n-p-1}$.
- Homoscedasticity: plot $|\hat{r}_i|$ or $\hat{r}_i^2$ against fitted values; if the variance is constant, no pattern should emerge.
- Outliers: studentized residuals exceeding 2–3 in absolute value suggest unusual observations.
- Model fit: if $s^2$ is large relative to the variance of $y$, the model explains little variation (low $R^2$).
We've completed our deep dive into variance estimation. The complete picture: the MLE $\text{RSS}/n$ is biased low by the factor $\frac{n-p}{n}$; dividing by $n-p$ instead gives the unbiased estimator $s^2$; the scaled RSS follows a $\chi^2_{n-p}$ distribution; and $s^2$ underpins every standard error, t-test, F-test, and interval in the regression toolkit.
You've now mastered the probabilistic perspective on linear regression. Starting from the Gaussian noise assumption, you've seen how least squares emerges as maximum likelihood, how the estimators inherit exact sampling distributions, and how the noise variance is estimated and put to work in inference.
This foundation prepares you for regularized regression, generalized linear models, and Bayesian approaches—all of which build on the maximum likelihood framework.
Looking forward:
With the maximum likelihood view complete, you have a powerful framework for understanding linear regression. The next module will address regression diagnostics in depth—checking assumptions, detecting violations, and remedying problems. The variance estimation concepts from this page will be essential there.
The journey from geometry (OLS) through probability (MLE) gives you multiple lenses for understanding the same fundamental technique. Expert practitioners use all these perspectives, choosing the one most helpful for each situation.