We've now seen that the maximum likelihood estimator for regression coefficients is identical to the ordinary least squares estimator:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = \hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
This is remarkable. Two completely different frameworks—one geometric (OLS), one probabilistic (MLE)—arrive at the same answer. But this agreement is not coincidental; it reveals deep structure in the linear regression problem.
In this page, we'll explore this connection thoroughly: understanding why they agree, when they might differ, and what the probabilistic perspective adds beyond the geometric view.
By the end of this page, you will:
- understand why Gaussian noise and squared loss are two views of the same assumption,
- know what the likelihood framework adds beyond OLS point estimates (variance estimation, confidence intervals, hypothesis tests, model selection),
- recognize when the OLS and MLE perspectives diverge (non-Gaussian, heteroscedastic, or correlated errors), and
- see how MLE connects to Bayesian estimation and regularization.
The connection between Gaussian noise and squared loss isn't arbitrary—it's a deep mathematical relationship worth understanding carefully.
Deriving squared loss from Gaussian likelihood:
Starting from the Gaussian density: $$p(y | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$
The negative log-likelihood for a single observation is: $$-\log p(y | \mu, \sigma^2) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{(y-\mu)^2}{2\sigma^2}$$
Dropping constants (terms not involving $\mu$): $$-\log p(y | \mu, \sigma^2) \propto (y - \mu)^2$$
The squared error emerges naturally from the Gaussian exponent.
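As a quick numerical sanity check (a minimal sketch with simulated data; all variable names are illustrative), minimizing the Gaussian negative log-likelihood over $\boldsymbol{\beta}$ recovers the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(n)

def neg_log_likelihood(beta, sigma=1.0):
    # Gaussian NLL up to constants: sum of squared residuals / (2 sigma^2)
    resid = y - X @ beta
    return np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x   # numerical MLE
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]             # closed-form OLS
print(np.allclose(beta_mle, beta_ols, atol=1e-4))           # True, up to optimizer tolerance
```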
Every loss function corresponds to an implicit noise distribution: squared error corresponds to Gaussian noise, absolute error to Laplace noise, and so on.
Choosing a loss function is implicitly choosing a noise model!
Why squared, specifically?
The squared term in the exponent is precisely what makes the Gaussian "Gaussian." It's a defining characteristic of the distribution. This connection runs deep:
- **Maximum entropy for fixed variance** — Among all distributions with mean $0$ and variance $\sigma^2$, the Gaussian has maximum entropy. Squared loss therefore emerges from the least-presumptuous noise model.
- **Moment generating function structure** — The moment generating function of the Gaussian has a quadratic exponent, which is intimately connected to squared loss.
- **Closure under addition** — Gaussian distributions are closed under convolution (a sum of independent Gaussians is Gaussian). The RSS, as a sum of squared terms, inherits this well-behaved structure.
- **Exponential family characterization** — The distributions that yield tractable maximum likelihood estimation in linear regression belong to the exponential family, and the Gaussian is the most natural continuous choice among them.
| Gaussian Property | Corresponding Optimization Property |
|---|---|
| Bell-shaped, symmetric | Penalty increases symmetrically for over/under-prediction |
| Thin tails (subgaussian) | Large errors are heavily penalized (quadratic growth) |
| Maximum entropy for fixed variance | Least-presumptuous loss for known variance |
| Sum of Gaussians is Gaussian | RSS as sum of squared terms is well-behaved |
| Characterized by mean and variance | OLS optimally estimates mean; MLE estimates variance |
While OLS and MLE give the same point estimates for $\boldsymbol{\beta}$, the maximum likelihood framework provides much more. Understanding these additional capabilities reveals why the probabilistic perspective is so powerful.
1. Principled variance estimation:
OLS alone doesn't tell us how to estimate $\sigma^2$. MLE provides $\hat{\sigma}^2_{\text{MLE}} = \text{RSS}/n$, and the degrees-of-freedom correction to $\text{RSS}/(n-p)$ gives the unbiased estimate used for inference.
2. Sampling distributions of estimators:
Under the Gaussian model: $$\hat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$$
This tells us the exact sampling distribution of the coefficient estimates, from which standard errors, confidence intervals, and hypothesis tests follow directly (as the code below demonstrates).
```python
import numpy as np
from scipy import stats


class MLEInference:
    """
    Demonstrates inference capabilities provided by MLE framework
    that go beyond OLS point estimation.
    """

    def __init__(self, X, y):
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.n, self.p = X.shape

        # Fit model
        self.beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        self.fitted = X @ self.beta_hat
        self.residuals = y - self.fitted
        self.rss = np.sum(self.residuals**2)
        self.sigma_sq = self.rss / (self.n - self.p)  # Unbiased

        # Covariance matrix of beta_hat
        self.XtX_inv = np.linalg.inv(X.T @ X)
        self.cov_beta = self.sigma_sq * self.XtX_inv
        self.se_beta = np.sqrt(np.diag(self.cov_beta))

    def confidence_interval(self, alpha=0.05):
        """
        Compute confidence intervals for all coefficients.
        Uses t-distribution with n-p degrees of freedom.
        """
        df = self.n - self.p
        t_crit = stats.t.ppf(1 - alpha/2, df)
        lower = self.beta_hat - t_crit * self.se_beta
        upper = self.beta_hat + t_crit * self.se_beta
        return lower, upper

    def hypothesis_test(self, j, beta_0=0):
        """
        Test H0: beta_j = beta_0 vs H1: beta_j != beta_0
        Returns t-statistic and p-value.
        """
        t_stat = (self.beta_hat[j] - beta_0) / self.se_beta[j]
        df = self.n - self.p
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
        return t_stat, p_value

    def prediction_interval(self, x_new, alpha=0.05):
        """
        Compute prediction interval for new observation.
        Accounts for both parameter uncertainty and irreducible noise.
        """
        x_new = np.asarray(x_new).reshape(-1)

        # Point prediction
        y_pred = x_new @ self.beta_hat

        # Variance of prediction
        # Var(y_new - y_hat) = sigma^2 + sigma^2 * x'(X'X)^{-1}x
        var_pred = self.sigma_sq * (1 + x_new @ self.XtX_inv @ x_new)
        se_pred = np.sqrt(var_pred)

        df = self.n - self.p
        t_crit = stats.t.ppf(1 - alpha/2, df)
        lower = y_pred - t_crit * se_pred
        upper = y_pred + t_crit * se_pred
        return y_pred, lower, upper

    def log_likelihood(self):
        """Compute maximized log-likelihood."""
        ll = -self.n/2 * np.log(2 * np.pi)
        ll -= self.n/2 * np.log(self.rss / self.n)  # Using MLE sigma
        ll -= self.n/2  # RSS/(2*sigma^2) = n/2 at MLE
        return ll

    def aic(self):
        """Akaike Information Criterion."""
        k = self.p + 1  # p coefficients + 1 for variance
        return 2 * k - 2 * self.log_likelihood()

    def bic(self):
        """Bayesian Information Criterion."""
        k = self.p + 1
        return k * np.log(self.n) - 2 * self.log_likelihood()

    def summary(self):
        """Print comprehensive summary."""
        print("="*70)
        print("MLE Inference Summary")
        print("="*70)

        print("\nCoefficient Estimates with 95% Confidence Intervals:")
        lower, upper = self.confidence_interval()
        print(f"{'Coef':>8} {'Estimate':>12} {'Std.Err':>10} {'t-stat':>10} {'p-value':>10} {'95% CI':>20}")
        print("-"*70)
        for j in range(self.p):
            t, p = self.hypothesis_test(j)
            ci = f"[{lower[j]:.3f}, {upper[j]:.3f}]"
            sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
            print(f"β_{j:>6} {self.beta_hat[j]:>12.4f} {self.se_beta[j]:>10.4f} {t:>10.3f} {p:>10.4f} {ci:>20} {sig}")

        print(f"\nResidual Standard Error: {np.sqrt(self.sigma_sq):.4f} on {self.n - self.p} df")

        y_mean = np.mean(self.y)
        ss_tot = np.sum((self.y - y_mean)**2)
        r_sq = 1 - self.rss / ss_tot
        adj_r_sq = 1 - (1 - r_sq) * (self.n - 1) / (self.n - self.p)
        print(f"R-squared: {r_sq:.4f}, Adjusted R-squared: {adj_r_sq:.4f}")

        print(f"\nModel Selection Criteria:")
        print(f"  Log-Likelihood: {self.log_likelihood():.2f}")
        print(f"  AIC: {self.aic():.2f}")
        print(f"  BIC: {self.bic():.2f}")


# Example usage
np.random.seed(42)
n, p = 100, 4
X = np.column_stack([np.ones(n), np.random.randn(n, 3)])
beta_true = np.array([5.0, 2.0, 0.0, -1.5])  # Note: beta_2 = 0
y = X @ beta_true + 1.5 * np.random.randn(n)

inference = MLEInference(X, y)
inference.summary()

print("\n" + "="*70)
print("Note: β₂ is correctly identified as non-significant (p > 0.05)")
print("This demonstrates the hypothesis testing capability of MLE framework.")
```

3. Model selection criteria:
The likelihood framework enables principled model comparison:
$$\text{AIC} = 2k - 2\log \mathcal{L}_{\max}$$ $$\text{BIC} = k\log n - 2\log \mathcal{L}_{\max}$$
where $k$ is the number of parameters and $\mathcal{L}_{\max}$ is the maximized likelihood.
4. Hypothesis testing:
Likelihood-ratio tests for nested models: $$\Lambda = -2\log\frac{\mathcal{L}_{\text{restricted}}}{\mathcal{L}_{\text{full}}} \sim \chi^2_q$$
where $q$ is the difference in the number of parameters between the full and restricted models; the $\chi^2_q$ distribution holds asymptotically under the null hypothesis. A small sketch of this test appears below.
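Here is a minimal sketch of the likelihood-ratio test for Gaussian linear models, with $\sigma^2$ profiled out of the likelihood. The helper names (`gaussian_max_loglik`, `likelihood_ratio_test`) and the simulated data are illustrative, not part of any particular library:

```python
import numpy as np
from scipy import stats

def gaussian_max_loglik(X, y):
    """Maximized Gaussian log-likelihood with sigma^2 profiled out."""
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    return -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def likelihood_ratio_test(X_full, X_restricted, y):
    """LR test of a restricted model nested inside the full model."""
    lam = 2 * (gaussian_max_loglik(X_full, y) - gaussian_max_loglik(X_restricted, y))
    q = X_full.shape[1] - X_restricted.shape[1]   # difference in parameter count
    return lam, stats.chi2.sf(lam, df=q)          # asymptotic chi-square p-value

# Example: do two extra (truly irrelevant) predictors improve the fit?
rng = np.random.default_rng(1)
n = 200
X_full = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X_full @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.standard_normal(n)
lam, p_value = likelihood_ratio_test(X_full, X_full[:, :2], y)
print(f"LR statistic = {lam:.3f}, p-value = {p_value:.3f}")  # typically non-significant
```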
5. Foundation for regularization:
MLE naturally extends to MAP (Maximum A Posteriori) estimation: $$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\beta}} [\log p(\mathbf{y}|\boldsymbol{\beta}) + \log p(\boldsymbol{\beta})]$$
Ridge regression ($L_2$ penalty) corresponds to a Gaussian prior; Lasso ($L_1$ penalty) to a Laplace prior.
While OLS and Gaussian MLE give the same coefficient estimates, there are scenarios where the two perspectives diverge:
1. Variance estimation: the pure MLE $\hat{\sigma}^2_{\text{MLE}} = \text{RSS}/n$ is biased downward; standard practice uses the degrees-of-freedom-corrected $\text{RSS}/(n-p)$, a correction that comes from outside the likelihood principle itself.
2. Non-Gaussian errors:
When the true noise isn't Gaussian, maximizing the likelihood no longer amounts to minimizing squared error, so the MLE differs from OLS.
For example, with Laplace noise: $$p(\epsilon) = \frac{1}{2b}\exp\left(-\frac{|\epsilon|}{b}\right)$$
The MLE minimizes the sum of absolute residuals (L1 loss), not squared residuals. This gives the Least Absolute Deviations (LAD) estimator, which differs from OLS.
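As a sketch (simulated data; variable names are illustrative), the Laplace MLE can be obtained by directly minimizing the sum of absolute residuals and compared with OLS:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 2.0])
# Laplace noise: the MLE minimizes absolute, not squared, residuals
y = X @ beta_true + rng.laplace(scale=1.0, size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]           # squared-error fit
beta_lad = minimize(lambda b: np.sum(np.abs(y - X @ b)),  # L1 (LAD) fit
                    x0=beta_ols, method="Nelder-Mead").x

print("OLS:", beta_ols)
print("LAD:", beta_lad)  # the Laplace MLE; close to, but generally not equal to, OLS
```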
3. Heteroscedastic errors:
If $\text{Var}(\epsilon_i) = \sigma_i^2$ varies across observations, the MLE becomes weighted least squares (WLS):
$$\hat{\boldsymbol{\beta}}_{\text{WLS}} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}$$
where $\mathbf{W} = \text{diag}(1/\sigma_1^2, ..., 1/\sigma_n^2)$.
4. Correlated errors:
If $\text{Cov}(\boldsymbol{\epsilon}) = \boldsymbol{\Sigma} \neq \sigma^2\mathbf{I}$, the MLE becomes generalized least squares (GLS):
$$\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{y}$$
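A minimal sketch of the GLS estimator (with WLS as the diagonal special case), assuming $\boldsymbol{\Sigma}$ is known; in practice it usually has to be estimated:

```python
import numpy as np

def gls_estimator(X, y, Sigma):
    """GLS: beta = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
    A diagonal Sigma reduces this to weighted least squares (WLS)."""
    XtSi = X.T @ np.linalg.inv(Sigma)
    return np.linalg.solve(XtSi @ X, XtSi @ y)

# Heteroscedastic example: per-observation noise variances on the diagonal
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
sigmas = rng.uniform(0.5, 3.0, size=n)   # known noise SDs (an assumption for this sketch)
y = X @ np.array([1.0, -2.0]) + sigmas * rng.standard_normal(n)

beta_wls = gls_estimator(X, y, np.diag(sigmas**2))
print(beta_wls)  # typically closer to the true (1.0, -2.0) than plain OLS
```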
Use OLS thinking when: you want a distribution-free, geometric derivation, you only need point estimates, or you prefer to rely on Gauss-Markov guarantees rather than a specific noise model.
Use MLE thinking when: you need standard errors, confidence intervals, hypothesis tests, or model comparison, or when the noise is explicitly non-Gaussian, heteroscedastic, or correlated and must be modeled.
In practice, use both: compute OLS estimates for robustness, interpret through MLE framework for inference.
The MLE has powerful optimality properties that extend beyond what OLS provides alone.
The Cramér-Rao Lower Bound:
For any unbiased estimator $\tilde{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$: $$\text{Cov}(\tilde{\boldsymbol{\beta}}) \geq \mathcal{I}(\boldsymbol{\beta})^{-1}$$
where $\mathcal{I}(\boldsymbol{\beta})$ is the Fisher information matrix—the expected curvature of the log-likelihood.
This inequality holds in the positive semidefinite sense: the difference $\text{Cov}(\tilde{\boldsymbol{\beta}}) - \mathcal{I}(\boldsymbol{\beta})^{-1}$ is positive semidefinite, so no unbiased estimator can have smaller variance (for any coefficient, or any linear combination of coefficients) than the inverse Fisher information allows.
For Gaussian linear regression:
The Fisher information is: $$\mathcal{I}(\boldsymbol{\beta}) = \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X}$$
Thus the Cramér-Rao lower bound is: $$\text{Cov}(\tilde{\boldsymbol{\beta}}) \geq \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
The MLE achieves this bound exactly: $$\text{Cov}(\hat{\boldsymbol{\beta}}_{\text{MLE}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
This makes the MLE an efficient estimator—it achieves the smallest possible variance among all unbiased estimators.
Under mild regularity conditions, MLE satisfies:
- **Consistency:** $\hat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}$ as $n \to \infty$
- **Asymptotic normality:** $\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathcal{I}(\boldsymbol{\theta})^{-1})$
- **Asymptotic efficiency:** achieves the Cramér-Rao lower bound asymptotically
- **Invariance:** if $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$
For the coefficient estimates in Gaussian linear regression, normality and efficiency hold exactly in finite samples, not just asymptotically.
Connection to Gauss-Markov:
The Gauss-Markov theorem says OLS is BLUE—Best Linear Unbiased Estimator. It's optimal among linear estimators only.
The MLE efficiency result is stronger: under Gaussian noise, OLS/MLE is optimal among all estimators, not just linear ones.
Put differently:
If the Gaussian assumption holds, we gain additional optimality. If it fails, we still have Gauss-Markov guarantees for linear estimation.
```python
import numpy as np


def fisher_information_analysis(X, sigma_sq):
    """
    Analyze Fisher information and Cramér-Rao bound for linear regression.
    """
    n, p = X.shape

    # Fisher information matrix
    I = (1/sigma_sq) * (X.T @ X)

    # Cramér-Rao lower bound for covariance
    CRLB = np.linalg.inv(I)  # = sigma^2 * (X'X)^{-1}

    # Variance lower bounds for individual parameters
    var_bounds = np.diag(CRLB)

    print("Fisher Information Analysis")
    print("="*50)
    print(f"\nSample size: n = {n}")
    print(f"Parameters: p = {p}")
    print(f"Noise variance: σ² = {sigma_sq}")
    print(f"\nFisher Information Matrix I(β):")
    print(I)
    print(f"\nCramér-Rao Lower Bound (Cov lower bound):")
    print(CRLB)
    print(f"\nMinimum achievable standard errors:")
    for j in range(p):
        print(f"  SE(β_{j}) ≥ {np.sqrt(var_bounds[j]):.4f}")

    # Verify MLE achieves the bound
    print(f"\nThe MLE/OLS estimator achieves these bounds exactly!")
    print(f"This is what makes it 'efficient'.")

    return I, CRLB


def demonstrate_efficiency():
    """
    Show that MLE achieves Cramér-Rao bound through simulation.
    """
    np.random.seed(42)
    n, p = 100, 3
    sigma_true = 1.5

    # Fixed design matrix
    X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
    beta_true = np.array([1.0, 2.0, -0.5])

    # Theoretical variance of beta_hat
    theoretical_cov = sigma_true**2 * np.linalg.inv(X.T @ X)
    theoretical_se = np.sqrt(np.diag(theoretical_cov))

    # Simulate many datasets and compute empirical variance
    n_simulations = 10000
    beta_estimates = np.zeros((n_simulations, p))

    for i in range(n_simulations):
        y = X @ beta_true + sigma_true * np.random.randn(n)
        beta_estimates[i] = np.linalg.lstsq(X, y, rcond=None)[0]

    empirical_cov = np.cov(beta_estimates.T)
    empirical_se = np.sqrt(np.diag(empirical_cov))

    print("\n" + "="*50)
    print("Efficiency Verification by Simulation")
    print("="*50)
    print(f"\nComparing theoretical (CRLB) vs empirical SE:")
    print(f"{'Param':>8} {'Theoretical':>12} {'Empirical':>12} {'Ratio':>10}")
    print("-"*50)
    for j in range(p):
        ratio = empirical_se[j] / theoretical_se[j]
        print(f"β_{j:>6} {theoretical_se[j]:>12.4f} {empirical_se[j]:>12.4f} {ratio:>10.4f}")

    print(f"\n✓ MLE achieves the Cramér-Rao bound (ratios ≈ 1.0)")


# Run analysis
X = np.column_stack([np.ones(100), np.random.randn(100, 2)])
fisher_information_analysis(X, sigma_sq=2.25)
demonstrate_efficiency()
```

The MLE framework naturally connects to Bayesian inference, providing a bridge to regularization and uncertainty quantification.
From MLE to MAP:
In Bayesian regression, we place a prior distribution $p(\boldsymbol{\beta})$ on the parameters and compute the posterior using Bayes' theorem:
$$p(\boldsymbol{\beta} | \mathbf{y}) = \frac{p(\mathbf{y} | \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})}{p(\mathbf{y})} \propto p(\mathbf{y} | \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
The Maximum A Posteriori (MAP) estimator maximizes the posterior: $$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\beta}} [\log p(\mathbf{y} | \boldsymbol{\beta}) + \log p(\boldsymbol{\beta})]$$
MLE is the special case with a uniform (improper) prior: $p(\boldsymbol{\beta}) \propto 1$.
Ridge regression as Gaussian prior:
With a Gaussian prior $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$:
$$\log p(\boldsymbol{\beta}) = -\frac{1}{2\tau^2}\|\boldsymbol{\beta}\|^2 + \text{const}$$
MAP objective: $$\max_{\boldsymbol{\beta}} \left[-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 - \frac{1}{2\tau^2}\|\boldsymbol{\beta}\|^2\right]$$
Equivalently, minimize: $$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$$
where $\lambda = \sigma^2/\tau^2$. This is Ridge regression!
| Prior on β | Regularization | Penalty | Effect |
|---|---|---|---|
| Uniform (improper) | None (MLE) | $0$ | No shrinkage |
| Gaussian $\mathcal{N}(0, \tau^2 I)$ | Ridge (L2) | $\lambda\|\boldsymbol{\beta}\|_2^2$ | Shrink coefficients toward zero |
| Laplace | Lasso (L1) | $\lambda\|\boldsymbol{\beta}\|_1$ | Shrink + sparsity |
| Spike-and-slab | Best subset | Combinatorial | Exact sparsity |
| Horseshoe | Adaptive shrinkage | Complex | Heavy tails, sparsity |
Full Bayesian inference:
Beyond point estimates, Bayesian regression provides the full posterior distribution:
$$p(\boldsymbol{\beta} | \mathbf{y}) = \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
where (with Gaussian likelihood and prior): $$\boldsymbol{\Sigma}_n = \left(\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X} + \frac{1}{\tau^2}\mathbf{I}\right)^{-1}$$ $$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_n \cdot \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{y}$$
This posterior quantifies uncertainty about $\boldsymbol{\beta}$ (credible intervals come from $\boldsymbol{\Sigma}_n$), shrinks estimates toward the prior mean, and has its mode (the MAP estimate) at exactly the Ridge solution with $\lambda = \sigma^2/\tau^2$.
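As a quick numerical check (a sketch with illustrative values for $\sigma^2$ and $\tau^2$), the posterior mean above coincides with the Ridge solution for $\lambda = \sigma^2/\tau^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.5 * rng.standard_normal(n)

sigma_sq, tau_sq = 0.25, 1.0   # illustrative noise and prior variances
lam = sigma_sq / tau_sq        # implied ridge penalty

# Posterior covariance and mean (Gaussian likelihood + Gaussian prior)
Sigma_n = np.linalg.inv(X.T @ X / sigma_sq + np.eye(p) / tau_sq)
mu_n = Sigma_n @ (X.T @ y) / sigma_sq

# Ridge closed form with lambda = sigma^2 / tau^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(mu_n, beta_ridge))  # True: posterior mean (= MAP here) equals Ridge
```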
MLE is the starting point; Bayesian methods extend it by incorporating prior information and providing uncertainty in a different mathematical framework.
As the sample size $n \to \infty$ or prior variance $\tau^2 \to \infty$:
$$\hat{\boldsymbol{\beta}}_{\text{MAP}} \to \hat{\boldsymbol{\beta}}_{\text{MLE}}$$
With lots of data, the prior becomes irrelevant and Bayesian inference converges to frequentist MLE. The prior matters most when data is scarce—exactly when we'd want to incorporate external knowledge.
The connection between OLS and MLE has deep historical roots, involving some of the greatest mathematicians in history.
Adrien-Marie Legendre (1805): First published the method of least squares as a computational technique for fitting curves to astronomical observations. He presented it as a practical (some would say ad hoc) method.
Carl Friedrich Gauss (1809): Independently developed least squares and provided its probabilistic justification. Gauss showed that if errors follow what we now call the Gaussian distribution, then minimizing squared errors gives the most probable parameter values. He also derived the distribution named after him in this context.
Gauss famously claimed priority, stating he had used the method since 1795—leading to a bitter dispute with Legendre.
R.A. Fisher (1912-1925): Formalized maximum likelihood as a general principle of statistical estimation. Fisher unified the probabilistic approach and established key concepts: likelihood, sufficiency, efficiency, and the information matrix. His work placed Gauss's insight into a general framework applicable beyond normally distributed errors.
The history illustrates a common pattern in science:
Least squares started as a computational convenience and became a principled statistical method. Understanding this evolution helps us appreciate what assumptions justify our techniques.
The Modern Synthesis:
Today we understand that least squares can be viewed in four ways: as a geometric projection (OLS), as Gaussian maximum likelihood, as MAP estimation under a flat prior, and as minimization of a loss function with an implicit noise model.
All four perspectives are valid and useful. Expert practitioners move fluently between them, choosing the perspective most helpful for each problem.
The equivalence $\hat{\boldsymbol{\beta}}_{\text{OLS}} = \hat{\boldsymbol{\beta}}_{\text{MLE}}$ under Gaussian noise is the bridge connecting these worlds—a mathematical truth discovered over 200 years ago that remains central to statistics and machine learning today.
We've thoroughly explored the relationship between OLS and MLE. Here's what we've established:
- Under Gaussian noise, the MLE for the coefficients is identical to the OLS estimator.
- Squared loss is the Gaussian negative log-likelihood up to constants; every loss function implies a noise model.
- The likelihood framework adds variance estimation, sampling distributions, confidence intervals, hypothesis tests, and model selection criteria (AIC, BIC).
- The two perspectives diverge for non-Gaussian, heteroscedastic, or correlated errors, leading to LAD, WLS, and GLS estimators.
- Under the Gaussian model, OLS/MLE attains the Cramér-Rao bound, strengthening the Gauss-Markov result from "best linear unbiased" to "best unbiased."
- MLE extends naturally to MAP and full Bayesian inference, connecting regularization (Ridge, Lasso) to priors.
What's next:
In the final page of this module, we'll address variance estimation in detail—beyond the brief treatment we've given so far. We'll explore the distributional properties of the MLE variance estimator, derive its unbiased correction rigorously, and connect variance estimation to residual diagnostics and model checking.
You now understand the deep relationship between geometric and probabilistic views of linear regression. The equivalence is not coincidental—it reflects fundamental mathematical structure. With this understanding, you can move between perspectives as needed: using OLS for robustness, MLE for inference, and Bayesian methods when prior information is available.