We've now seen that the maximum likelihood estimator for regression coefficients is identical to the ordinary least squares estimator:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = \hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
This is remarkable. Two completely different frameworks—one geometric (OLS), one probabilistic (MLE)—arrive at the same answer. But this agreement is not coincidental; it reveals deep structure in the linear regression problem.
In this page, we'll explore this connection thoroughly: understanding why they agree, when they might differ, and what the probabilistic perspective adds beyond the geometric view.
By the end of this page, you will:
- understand why Gaussian noise and squared loss are two views of the same assumption,
- know what the likelihood framework adds beyond OLS point estimates (variance estimation, confidence intervals, hypothesis tests, model selection),
- recognize when the OLS and MLE perspectives diverge (non-Gaussian, heteroscedastic, or correlated errors), and
- see how MLE connects to Bayesian estimation and regularization.
The connection between Gaussian noise and squared loss isn't arbitrary—it's a deep mathematical relationship worth understanding carefully.
Deriving squared loss from Gaussian likelihood:
Starting from the Gaussian density: $$p(y | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$
The negative log-likelihood for a single observation is: $$-\log p(y | \mu, \sigma^2) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{(y-\mu)^2}{2\sigma^2}$$
Dropping constants (terms not involving $\mu$): $$-\log p(y | \mu, \sigma^2) \propto (y - \mu)^2$$
The squared error emerges naturally from the Gaussian exponent.
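As a quick numerical sanity check (a minimal sketch with simulated data; all variable names are illustrative), minimizing the Gaussian negative log-likelihood over $\boldsymbol{\beta}$ recovers the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(n)

def neg_log_likelihood(beta, sigma=1.0):
    # Gaussian NLL up to constants: sum of squared residuals / (2 sigma^2)
    resid = y - X @ beta
    return np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x   # numerical MLE
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]             # closed-form OLS
print(np.allclose(beta_mle, beta_ols, atol=1e-4))           # True, up to optimizer tolerance
```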
Every loss function corresponds to an implicit noise distribution: squared error corresponds to Gaussian noise, absolute error to Laplace noise, and so on.
Choosing a loss function is implicitly choosing a noise model!
Why squared, specifically?
The squared term in the exponent is precisely what makes the Gaussian "Gaussian." It's a defining characteristic of the distribution. This connection runs deep:
- **Maximum entropy for fixed variance** — Among all distributions with mean $0$ and variance $\sigma^2$, the Gaussian has maximum entropy. Squared loss therefore emerges from the least-presumptuous noise model.
- **Moment generating function structure** — The moment generating function of the Gaussian has a quadratic exponent, which is intimately connected to squared loss.
- **Closure under addition** — Gaussian distributions are closed under convolution (a sum of independent Gaussians is Gaussian). The RSS, as a sum of squared terms, inherits this well-behaved structure.
- **Exponential family characterization** — The distributions that yield tractable maximum likelihood estimation in linear regression belong to the exponential family, and the Gaussian is the most natural continuous choice among them.
| Gaussian Property | Corresponding Optimization Property |
|---|---|
| Bell-shaped, symmetric | Penalty increases symmetrically for over/under-prediction |
| Thin tails (subgaussian) | Large errors are heavily penalized (quadratic growth) |
| Maximum entropy for fixed variance | Least-presumptuous loss for known variance |
| Sum of Gaussians is Gaussian | RSS as sum of squared terms is well-behaved |
| Characterized by mean and variance | OLS optimally estimates mean; MLE estimates variance |
While OLS and MLE give the same point estimates for $\boldsymbol{\beta}$, the maximum likelihood framework provides much more. Understanding these additional capabilities reveals why the probabilistic perspective is so powerful.
1. Principled variance estimation:
OLS alone doesn't tell us how to estimate $\sigma^2$. MLE provides $\hat{\sigma}^2_{\text{MLE}} = \text{RSS}/n$, and the degrees-of-freedom correction to $\text{RSS}/(n-p)$ gives the unbiased estimate used for inference.
2. Sampling distributions of estimators:
Under the Gaussian model: $$\hat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})$$
This tells us the exact sampling distribution of the coefficient estimates, from which standard errors, confidence intervals, and hypothesis tests follow directly (as the code below demonstrates).
```python
import numpy as np
from scipy import stats


class MLEInference:
    """
    Demonstrates inference capabilities provided by MLE framework
    that go beyond OLS point estimation.
    """

    def __init__(self, X, y):
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.n, self.p = X.shape

        # Fit model
        self.beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        self.fitted = X @ self.beta_hat
        self.residuals = y - self.fitted
        self.rss = np.sum(self.residuals**2)
        self.sigma_sq = self.rss / (self.n - self.p)  # Unbiased

        # Covariance matrix of beta_hat
        self.XtX_inv = np.linalg.inv(X.T @ X)
        self.cov_beta = self.sigma_sq * self.XtX_inv
        self.se_beta = np.sqrt(np.diag(self.cov_beta))

    def confidence_interval(self, alpha=0.05):
        """
        Compute confidence intervals for all coefficients.
        Uses t-distribution with n-p degrees of freedom.
        """
        df = self.n - self.p
        t_crit = stats.t.ppf(1 - alpha/2, df)
        lower = self.beta_hat - t_crit * self.se_beta
        upper = self.beta_hat + t_crit * self.se_beta
        return lower, upper

    def hypothesis_test(self, j, beta_0=0):
        """
        Test H0: beta_j = beta_0 vs H1: beta_j != beta_0
        Returns t-statistic and p-value.
        """
        t_stat = (self.beta_hat[j] - beta_0) / self.se_beta[j]
        df = self.n - self.p
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
        return t_stat, p_value

    def prediction_interval(self, x_new, alpha=0.05):
        """
        Compute prediction interval for new observation.
        Accounts for both parameter uncertainty and irreducible noise.
        """
        x_new = np.asarray(x_new).reshape(-1)

        # Point prediction
        y_pred = x_new @ self.beta_hat

        # Variance of prediction
        # Var(y_new - y_hat) = sigma^2 + sigma^2 * x'(X'X)^{-1}x
        var_pred = self.sigma_sq * (1 + x_new @ self.XtX_inv @ x_new)
        se_pred = np.sqrt(var_pred)

        df = self.n - self.p
        t_crit = stats.t.ppf(1 - alpha/2, df)
        lower = y_pred - t_crit * se_pred
        upper = y_pred + t_crit * se_pred
        return y_pred, lower, upper

    def log_likelihood(self):
        """Compute maximized log-likelihood."""
        ll = -self.n/2 * np.log(2 * np.pi)
        ll -= self.n/2 * np.log(self.rss / self.n)  # Using MLE sigma
        ll -= self.n/2  # RSS/(2*sigma^2) = n/2 at MLE
        return ll

    def aic(self):
        """Akaike Information Criterion."""
        k = self.p + 1  # p coefficients + 1 for variance
        return 2 * k - 2 * self.log_likelihood()

    def bic(self):
        """Bayesian Information Criterion."""
        k = self.p + 1
        return k * np.log(self.n) - 2 * self.log_likelihood()

    def summary(self):
        """Print comprehensive summary."""
        print("="*70)
        print("MLE Inference Summary")
        print("="*70)

        print("\nCoefficient Estimates with 95% Confidence Intervals:")
        lower, upper = self.confidence_interval()
        print(f"{'Coef':>8} {'Estimate':>12} {'Std.Err':>10} {'t-stat':>10} {'p-value':>10} {'95% CI':>20}")
        print("-"*70)
        for j in range(self.p):
            t, p = self.hypothesis_test(j)
            ci = f"[{lower[j]:.3f}, {upper[j]:.3f}]"
            sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
            print(f"β_{j:>6} {self.beta_hat[j]:>12.4f} {self.se_beta[j]:>10.4f} {t:>10.3f} {p:>10.4f} {ci:>20} {sig}")

        print(f"\nResidual Standard Error: {np.sqrt(self.sigma_sq):.4f} on {self.n - self.p} df")

        y_mean = np.mean(self.y)
        ss_tot = np.sum((self.y - y_mean)**2)
        r_sq = 1 - self.rss / ss_tot
        adj_r_sq = 1 - (1 - r_sq) * (self.n - 1) / (self.n - self.p)
        print(f"R-squared: {r_sq:.4f}, Adjusted R-squared: {adj_r_sq:.4f}")

        print(f"\nModel Selection Criteria:")
        print(f"  Log-Likelihood: {self.log_likelihood():.2f}")
        print(f"  AIC: {self.aic():.2f}")
        print(f"  BIC: {self.bic():.2f}")


# Example usage
np.random.seed(42)
n, p = 100, 4
X = np.column_stack([np.ones(n), np.random.randn(n, 3)])
beta_true = np.array([5.0, 2.0, 0.0, -1.5])  # Note: beta_2 = 0
y = X @ beta_true + 1.5 * np.random.randn(n)

inference = MLEInference(X, y)
inference.summary()

print("\n" + "="*70)
print("Note: β₂ is correctly identified as non-significant (p > 0.05)")
print("This demonstrates the hypothesis testing capability of MLE framework.")
```

3. Model selection criteria:
The likelihood framework enables principled model comparison:
$$\text{AIC} = 2k - 2\log \mathcal{L}_{\max}$$ $$\text{BIC} = k\log n - 2\log \mathcal{L}_{\max}$$
where $k$ is the number of parameters and $\mathcal{L}_{\max}$ is the maximized likelihood.
4. Hypothesis testing:
Likelihood-ratio tests for nested models: $$\Lambda = -2\log\frac{\mathcal{L}_{\text{restricted}}}{\mathcal{L}_{\text{full}}} \sim \chi^2_q$$
where $q$ is the difference in the number of parameters between the full and restricted models; the $\chi^2_q$ distribution holds asymptotically under the null hypothesis. A small sketch of this test appears below.
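Here is a minimal sketch of the likelihood-ratio test for Gaussian linear models, with $\sigma^2$ profiled out of the likelihood. The helper names (`gaussian_max_loglik`, `likelihood_ratio_test`) and the simulated data are illustrative, not part of any particular library:

```python
import numpy as np
from scipy import stats

def gaussian_max_loglik(X, y):
    """Maximized Gaussian log-likelihood with sigma^2 profiled out."""
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta_hat) ** 2)
    return -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def likelihood_ratio_test(X_full, X_restricted, y):
    """LR test of a restricted model nested inside the full model."""
    lam = 2 * (gaussian_max_loglik(X_full, y) - gaussian_max_loglik(X_restricted, y))
    q = X_full.shape[1] - X_restricted.shape[1]   # difference in parameter count
    return lam, stats.chi2.sf(lam, df=q)          # asymptotic chi-square p-value

# Example: do two extra (truly irrelevant) predictors improve the fit?
rng = np.random.default_rng(1)
n = 200
X_full = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X_full @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.standard_normal(n)
lam, p_value = likelihood_ratio_test(X_full, X_full[:, :2], y)
print(f"LR statistic = {lam:.3f}, p-value = {p_value:.3f}")  # typically non-significant
```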
5. Foundation for regularization:
MLE naturally extends to MAP (Maximum A Posteriori) estimation: $$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\beta}} [\log p(\mathbf{y}|\boldsymbol{\beta}) + \log p(\boldsymbol{\beta})]$$
Ridge regression ($L_2$ penalty) corresponds to a Gaussian prior; Lasso ($L_1$ penalty) to a Laplace prior.
While OLS and Gaussian MLE give the same coefficient estimates, there are scenarios where the two perspectives diverge:
1. Variance estimation: the pure MLE $\hat{\sigma}^2_{\text{MLE}} = \text{RSS}/n$ is biased downward; standard practice uses the degrees-of-freedom-corrected $\text{RSS}/(n-p)$, a correction that comes from outside the likelihood principle itself.
2. Non-Gaussian errors:
When the true noise isn't Gaussian, maximizing the likelihood no longer amounts to minimizing squared error, so the MLE differs from OLS.
For example, with Laplace noise: $$p(\epsilon) = \frac{1}{2b}\exp\left(-\frac{|\epsilon|}{b}\right)$$
The MLE minimizes the sum of absolute residuals (L1 loss), not squared residuals. This gives the Least Absolute Deviations (LAD) estimator, which differs from OLS.
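As a sketch (simulated data; variable names are illustrative), the Laplace MLE can be obtained by directly minimizing the sum of absolute residuals and compared with OLS:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 2.0])
# Laplace noise: the MLE minimizes absolute, not squared, residuals
y = X @ beta_true + rng.laplace(scale=1.0, size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]           # squared-error fit
beta_lad = minimize(lambda b: np.sum(np.abs(y - X @ b)),  # L1 (LAD) fit
                    x0=beta_ols, method="Nelder-Mead").x

print("OLS:", beta_ols)
print("LAD:", beta_lad)  # the Laplace MLE; close to, but generally not equal to, OLS
```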
3. Heteroscedastic errors:
If $\text{Var}(\epsilon_i) = \sigma_i^2$ varies across observations, the MLE becomes weighted least squares (WLS):
$$\hat{\boldsymbol{\beta}}_{\text{WLS}} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}$$
where $\mathbf{W} = \text{diag}(1/\sigma_1^2, ..., 1/\sigma_n^2)$.
4. Correlated errors:
If $\text{Cov}(\boldsymbol{\epsilon}) = \boldsymbol{\Sigma} \neq \sigma^2\mathbf{I}$, the MLE becomes generalized least squares (GLS):
$$\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{y}$$
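A minimal sketch of the GLS estimator (with WLS as the diagonal special case), assuming $\boldsymbol{\Sigma}$ is known; in practice it usually has to be estimated:

```python
import numpy as np

def gls_estimator(X, y, Sigma):
    """GLS: beta = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
    A diagonal Sigma reduces this to weighted least squares (WLS)."""
    XtSi = X.T @ np.linalg.inv(Sigma)
    return np.linalg.solve(XtSi @ X, XtSi @ y)

# Heteroscedastic example: per-observation noise variances on the diagonal
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
sigmas = rng.uniform(0.5, 3.0, size=n)   # known noise SDs (an assumption for this sketch)
y = X @ np.array([1.0, -2.0]) + sigmas * rng.standard_normal(n)

beta_wls = gls_estimator(X, y, np.diag(sigmas**2))
print(beta_wls)  # typically closer to the true (1.0, -2.0) than plain OLS
```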
Use OLS thinking when: you want a distribution-free, geometric derivation, you only need point estimates, or you prefer to rely on Gauss-Markov guarantees rather than a specific noise model.
Use MLE thinking when: you need standard errors, confidence intervals, hypothesis tests, or model comparison, or when the noise is explicitly non-Gaussian, heteroscedastic, or correlated and must be modeled.
In practice, use both: compute OLS estimates for robustness, interpret through MLE framework for inference.
The MLE has powerful optimality properties that extend beyond what OLS provides alone.
The Cramér-Rao Lower Bound:
For any unbiased estimator $\tilde{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$: $$\text{Cov}(\tilde{\boldsymbol{\beta}}) \geq \mathcal{I}(\boldsymbol{\beta})^{-1}$$
where $\mathcal{I}(\boldsymbol{\beta})$ is the Fisher information matrix—the expected curvature of the log-likelihood.
This inequality holds in the positive semidefinite sense: the difference $\text{Cov}(\tilde{\boldsymbol{\beta}}) - \mathcal{I}(\boldsymbol{\beta})^{-1}$ is positive semidefinite, so no unbiased estimator can have smaller variance (for any coefficient, or any linear combination of coefficients) than the inverse Fisher information allows.
For Gaussian linear regression:
The Fisher information is: $$\mathcal{I}(\boldsymbol{\beta}) = \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X}$$
Thus the Cramér-Rao lower bound is: $$\text{Cov}(\tilde{\boldsymbol{\beta}}) \geq \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
The MLE achieves this bound exactly: $$\text{Cov}(\hat{\boldsymbol{\beta}}_{\text{MLE}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
This makes the MLE an efficient estimator—it achieves the smallest possible variance among all unbiased estimators.
Under mild regularity conditions, MLE satisfies:
- **Consistency:** $\hat{\boldsymbol{\theta}} \xrightarrow{p} \boldsymbol{\theta}$ as $n \to \infty$
- **Asymptotic normality:** $\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathcal{I}(\boldsymbol{\theta})^{-1})$
- **Asymptotic efficiency:** achieves the Cramér-Rao lower bound asymptotically
- **Invariance:** if $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$
For the coefficient estimates in Gaussian linear regression, normality and efficiency hold exactly in finite samples, not just asymptotically.
Connection to Gauss-Markov:
The Gauss-Markov theorem says OLS is BLUE—Best Linear Unbiased Estimator. It's optimal among linear estimators only.
The MLE efficiency result is stronger: under Gaussian noise, OLS/MLE is optimal among all estimators, not just linear ones.
Put differently:
If the Gaussian assumption holds, we gain additional optimality. If it fails, we still have Gauss-Markov guarantees for linear estimation.
```python
import numpy as np


def fisher_information_analysis(X, sigma_sq):
    """
    Analyze Fisher information and Cramér-Rao bound for linear regression.
    """
    n, p = X.shape

    # Fisher information matrix
    I = (1/sigma_sq) * (X.T @ X)

    # Cramér-Rao lower bound for covariance
    CRLB = np.linalg.inv(I)  # = sigma^2 * (X'X)^{-1}

    # Variance lower bounds for individual parameters
    var_bounds = np.diag(CRLB)

    print("Fisher Information Analysis")
    print("="*50)
    print(f"\nSample size: n = {n}")
    print(f"Parameters: p = {p}")
    print(f"Noise variance: σ² = {sigma_sq}")
    print(f"\nFisher Information Matrix I(β):")
    print(I)
    print(f"\nCramér-Rao Lower Bound (Cov lower bound):")
    print(CRLB)
    print(f"\nMinimum achievable standard errors:")
    for j in range(p):
        print(f"  SE(β_{j}) ≥ {np.sqrt(var_bounds[j]):.4f}")

    # Verify MLE achieves the bound
    print(f"\nThe MLE/OLS estimator achieves these bounds exactly!")
    print(f"This is what makes it 'efficient'.")

    return I, CRLB


def demonstrate_efficiency():
    """
    Show that MLE achieves Cramér-Rao bound through simulation.
    """
    np.random.seed(42)
    n, p = 100, 3
    sigma_true = 1.5

    # Fixed design matrix
    X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
    beta_true = np.array([1.0, 2.0, -0.5])

    # Theoretical variance of beta_hat
    theoretical_cov = sigma_true**2 * np.linalg.inv(X.T @ X)
    theoretical_se = np.sqrt(np.diag(theoretical_cov))

    # Simulate many datasets and compute empirical variance
    n_simulations = 10000
    beta_estimates = np.zeros((n_simulations, p))

    for i in range(n_simulations):
        y = X @ beta_true + sigma_true * np.random.randn(n)
        beta_estimates[i] = np.linalg.lstsq(X, y, rcond=None)[0]

    empirical_cov = np.cov(beta_estimates.T)
    empirical_se = np.sqrt(np.diag(empirical_cov))

    print("\n" + "="*50)
    print("Efficiency Verification by Simulation")
    print("="*50)
    print(f"\nComparing theoretical (CRLB) vs empirical SE:")
    print(f"{'Param':>8} {'Theoretical':>12} {'Empirical':>12} {'Ratio':>10}")
    print("-"*50)
    for j in range(p):
        ratio = empirical_se[j] / theoretical_se[j]
        print(f"β_{j:>6} {theoretical_se[j]:>12.4f} {empirical_se[j]:>12.4f} {ratio:>10.4f}")

    print(f"\n✓ MLE achieves the Cramér-Rao bound (ratios ≈ 1.0)")


# Run analysis
X = np.column_stack([np.ones(100), np.random.randn(100, 2)])
fisher_information_analysis(X, sigma_sq=2.25)
demonstrate_efficiency()
```

The MLE framework naturally connects to Bayesian inference, providing a bridge to regularization and uncertainty quantification.
From MLE to MAP:
In Bayesian regression, we place a prior distribution $p(\boldsymbol{\beta})$ on the parameters and compute the posterior using Bayes' theorem:
$$p(\boldsymbol{\beta} | \mathbf{y}) = \frac{p(\mathbf{y} | \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})}{p(\mathbf{y})} \propto p(\mathbf{y} | \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
The Maximum A Posteriori (MAP) estimator maximizes the posterior: $$\hat{\boldsymbol{\beta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\beta}} [\log p(\mathbf{y} | \boldsymbol{\beta}) + \log p(\boldsymbol{\beta})]$$
MLE is the special case with a uniform (improper) prior: $p(\boldsymbol{\beta}) \propto 1$.
Ridge regression as Gaussian prior:
With a Gaussian prior $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$:
$$\log p(\boldsymbol{\beta}) = -\frac{1}{2\tau^2}\|\boldsymbol{\beta}\|^2 + \text{const}$$
MAP objective: $$\max_{\boldsymbol{\beta}} \left[-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 - \frac{1}{2\tau^2}\|\boldsymbol{\beta}\|^2\right]$$
Equivalently, minimize: $$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2$$
where $\lambda = \sigma^2/\tau^2$. This is Ridge regression!
| Prior on β | Regularization | Penalty | Effect |
|---|---|---|---|
| Uniform (improper) | None (MLE) | $0$ | No shrinkage |
| Gaussian $\mathcal{N}(0, \tau^2 I)$ | Ridge (L2) | $\lambda\|\boldsymbol{\beta}\|_2^2$ | Shrink coefficients toward zero |
| Laplace | Lasso (L1) | $\lambda\|\boldsymbol{\beta}\|_1$ | Shrink + sparsity |
| Spike-and-slab | Best subset | Combinatorial | Exact sparsity |
| Horseshoe | Adaptive shrinkage | Complex | Heavy tails, sparsity |
Full Bayesian inference:
Beyond point estimates, Bayesian regression provides the full posterior distribution:
$$p(\boldsymbol{\beta} | \mathbf{y}) = \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
where (with Gaussian likelihood and prior): $$\boldsymbol{\Sigma}_n = \left(\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X} + \frac{1}{\tau^2}\mathbf{I}\right)^{-1}$$ $$\boldsymbol{\mu}_n = \boldsymbol{\Sigma}_n \cdot \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{y}$$
This posterior quantifies uncertainty about $\boldsymbol{\beta}$ (credible intervals come from $\boldsymbol{\Sigma}_n$), shrinks estimates toward the prior mean, and has its mode (the MAP estimate) at exactly the Ridge solution with $\lambda = \sigma^2/\tau^2$.
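As a quick numerical check (a sketch with illustrative values for $\sigma^2$ and $\tau^2$), the posterior mean above coincides with the Ridge solution for $\lambda = \sigma^2/\tau^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.5 * rng.standard_normal(n)

sigma_sq, tau_sq = 0.25, 1.0   # illustrative noise and prior variances
lam = sigma_sq / tau_sq        # implied ridge penalty

# Posterior covariance and mean (Gaussian likelihood + Gaussian prior)
Sigma_n = np.linalg.inv(X.T @ X / sigma_sq + np.eye(p) / tau_sq)
mu_n = Sigma_n @ (X.T @ y) / sigma_sq

# Ridge closed form with lambda = sigma^2 / tau^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(mu_n, beta_ridge))  # True: posterior mean (= MAP here) equals Ridge
```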
MLE is the starting point; Bayesian methods extend it by incorporating prior information and providing uncertainty in a different mathematical framework.
As the sample size $n \to \infty$ or prior variance $\tau^2 \to \infty$:
$$\hat{\boldsymbol{\beta}}_{\text{MAP}} \to \hat{\boldsymbol{\beta}}_{\text{MLE}}$$
With lots of data, the prior becomes irrelevant and Bayesian inference converges to frequentist MLE. The prior matters most when data is scarce—exactly when we'd want to incorporate external knowledge.
The connection between OLS and MLE has deep historical roots, involving some of the greatest mathematicians in history.
Adrien-Marie Legendre (1805): First published the method of least squares as a computational technique for fitting curves to astronomical observations. He presented it as a practical (some would say ad hoc) method.
Carl Friedrich Gauss (1809): Independently developed least squares and provided its probabilistic justification. Gauss showed that if errors follow what we now call the Gaussian distribution, then minimizing squared errors gives the most probable parameter values. He also derived the distribution named after him in this context.
Gauss famously claimed priority, stating he had used the method since 1795—leading to a bitter dispute with Legendre.
R.A. Fisher (1912-1925): Formalized maximum likelihood as a general principle of statistical estimation. Fisher unified the probabilistic approach and established key concepts: likelihood, sufficiency, efficiency, and the information matrix. His work placed Gauss's insight into a general framework applicable beyond normally distributed errors.
The history illustrates a common pattern in science:
Least squares started as a computational convenience and became a principled statistical method. Understanding this evolution helps us appreciate what assumptions justify our techniques.
The Modern Synthesis:
Today we understand that least squares can be viewed in four ways: as a geometric projection (OLS), as Gaussian maximum likelihood, as MAP estimation under a flat prior, and as minimization of a loss function with an implicit noise model.
All four perspectives are valid and useful. Expert practitioners move fluently between them, choosing the perspective most helpful for each problem.
The equivalence $\hat{\boldsymbol{\beta}}_{\text{OLS}} = \hat{\boldsymbol{\beta}}_{\text{MLE}}$ under Gaussian noise is the bridge connecting these worlds—a mathematical truth discovered over 200 years ago that remains central to statistics and machine learning today.
We've thoroughly explored the relationship between OLS and MLE. Here's what we've established:
- Under Gaussian noise, the MLE for the coefficients is identical to the OLS estimator.
- Squared loss is the Gaussian negative log-likelihood up to constants; every loss function implies a noise model.
- The likelihood framework adds variance estimation, sampling distributions, confidence intervals, hypothesis tests, and model selection criteria (AIC, BIC).
- The two perspectives diverge for non-Gaussian, heteroscedastic, or correlated errors, leading to LAD, WLS, and GLS estimators.
- Under the Gaussian model, OLS/MLE attains the Cramér-Rao bound, strengthening the Gauss-Markov result from "best linear unbiased" to "best unbiased."
- MLE extends naturally to MAP and full Bayesian inference, connecting regularization (Ridge, Lasso) to priors.
What's next:
In the final page of this module, we'll address variance estimation in detail—beyond the brief treatment we've given so far. We'll explore the distributional properties of the MLE variance estimator, derive its unbiased correction rigorously, and connect variance estimation to residual diagnostics and model checking.
You now understand the deep relationship between geometric and probabilistic views of linear regression. The equivalence is not coincidental—it reflects fundamental mathematical structure. With this understanding, you can move between perspectives as needed: using OLS for robustness, MLE for inference, and Bayesian methods when prior information is available.