We've established that maximum likelihood estimation for Gaussian linear regression requires maximizing the log-likelihood function. Now we complete the journey by solving this optimization problem analytically.
This page derives the closed-form expressions for both the coefficient estimator $\hat{\boldsymbol{\beta}}$ and the variance estimator $\hat{\sigma}^2$.
The beauty of Gaussian linear regression is that these solutions emerge from elementary calculus and linear algebra—no iterative optimization required. Understanding this derivation provides deep insight into why certain estimators have the forms they do.
By the end of this page, you will be able to derive $\hat{\boldsymbol{\beta}}$ by minimizing the residual sum of squares, derive $\hat{\sigma}^2$ from the profile log-likelihood, and explain why the MLE of the variance is biased and how the unbiased estimator $s^2 = \text{RSS}/(n-p)$ corrects it.
The objective function:
Recalling from the previous page, our log-likelihood is:
$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2$$
The optimization problem:
We seek: $$(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) = \arg\max_{\boldsymbol{\beta}, \sigma^2 > 0} \ell(\boldsymbol{\beta}, \sigma^2)$$
This is a joint optimization over both parameters. A key insight is that we can decouple this problem—solve for $\boldsymbol{\beta}$ first, then solve for $\sigma^2$.
When optimizing over multiple parameters, a useful technique is profiling: optimize over one parameter with the others held fixed, substitute that solution back into the objective, and then optimize the resulting profile function over the remaining parameters.
For our problem, the structure is even nicer: the optimal $\boldsymbol{\beta}$ doesn't depend on $\sigma^2$ at all!
Why decoupling works:
Look at how $\boldsymbol{\beta}$ and $\sigma^2$ appear in the log-likelihood:
$$\ell(\boldsymbol{\beta}, \sigma^2) = \underbrace{-\frac{n}{2}\log(2\pi)}_{\text{constant}} - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\underbrace{|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2}_{\text{RSS}(\boldsymbol{\beta})}$$
For any fixed $\sigma^2 > 0$, the first two terms do not involve $\boldsymbol{\beta}$, and the coefficient $-\frac{1}{2\sigma^2}$ multiplying $\text{RSS}(\boldsymbol{\beta})$ is negative. Maximizing $\ell$ over $\boldsymbol{\beta}$ is therefore equivalent to minimizing $\text{RSS}(\boldsymbol{\beta})$, whatever the value of $\sigma^2$.
Therefore: Solve for $\hat{\boldsymbol{\beta}}$ by minimizing RSS, then plug in to find $\hat{\sigma}^2$.
The minimization problem:
We need to solve: $$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 = \arg\min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Let's expand the objective function. Define $\mathbf{r}(\boldsymbol{\beta}) = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$, the residual vector.
$$\text{RSS}(\boldsymbol{\beta}) = \mathbf{r}^T\mathbf{r} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Expanding: $$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
Since $\mathbf{y}^T\mathbf{X}\boldsymbol{\beta}$ is a scalar, it equals its transpose: $\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}$. Thus:
$$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
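As a quick sanity check, the expanded form can be verified numerically; the sketch below uses arbitrary random data and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

# Direct computation: RSS(beta) = ||y - X beta||^2
rss_direct = np.sum((y - X @ beta)**2)

# Expanded form: y'y - 2 beta'X'y + beta'X'X beta
rss_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(np.isclose(rss_direct, rss_expanded))  # True
```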
Step 1: Compute the gradient
Using the matrix calculus identities $\nabla_{\boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{a}) = \mathbf{a}$ and $\nabla_{\boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta}) = 2\mathbf{A}\boldsymbol{\beta}$ for symmetric $\mathbf{A}$, and noting that $\mathbf{y}^T\mathbf{y}$ does not depend on $\boldsymbol{\beta}$:
Applying these: $$\nabla_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = 0 - 2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
$$\nabla_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}^T\mathbf{y}$$
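The gradient formula can be checked against a central finite-difference approximation; this is a minimal sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def rss(b):
    return np.sum((y - X @ b)**2)

# Analytic gradient: 2 X'X beta - 2 X'y
grad_analytic = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4))  # True
```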
Step 2: Set gradient to zero
At a minimum (first-order necessary condition): $$\nabla_{\boldsymbol{\beta}} \text{RSS}(\hat{\boldsymbol{\beta}}) = \mathbf{0}$$
$$2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} - 2\mathbf{X}^T\mathbf{y} = \mathbf{0}$$
$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$$
These are the famous normal equations of linear regression. The name comes from the geometric interpretation: at the solution, the residual vector $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is normal (perpendicular) to the column space of $\mathbf{X}$.
Equivalently: $\mathbf{X}^T\mathbf{r} = \mathbf{0}$, meaning the residual is orthogonal to every column of $\mathbf{X}$.
Step 3: Solve for $\hat{\boldsymbol{\beta}}$
If $\mathbf{X}^T\mathbf{X}$ is invertible (which requires $\mathbf{X}$ to have full column rank):
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
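In practice, $\hat{\boldsymbol{\beta}}$ is usually computed by solving the normal equations (or with a least-squares routine) rather than by forming the inverse explicitly. A minimal sketch with simulated data, also checking the orthogonality condition $\mathbf{X}^T\hat{\mathbf{r}} = \mathbf{0}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# Solve X'X beta = X'y (preferred over computing the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with a dedicated least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))           # True

# The normal equations imply the residual is orthogonal to every column of X
residuals = y - X @ beta_hat
print(np.allclose(X.T @ residuals, 0, atol=1e-8))  # True
```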
Step 4: Verify this is a minimum (not maximum)
The Hessian of RSS is: $$\nabla^2_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = 2\mathbf{X}^T\mathbf{X}$$
This is positive semi-definite (since $\mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} = |\mathbf{X}\mathbf{v}|^2 \geq 0$ for any $\mathbf{v}$).
When $\mathbf{X}$ has full column rank, the Hessian is strictly positive definite, so RSS is strictly convex and the stationary point is its unique global minimizer. This confirms:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
This is identical to the OLS estimator! The maximum likelihood approach under Gaussian noise leads to exactly the same formula we derived geometrically.
For deeper insight, let's derive the solution using the "completing the square" technique, which reveals the structure more elegantly.
Starting point: $$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
The key idea:
We want to rewrite this as $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{A}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) + \text{constant}$ for some matrix $\mathbf{A}$ and optimal $\hat{\boldsymbol{\beta}}$.
Let $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ and assume $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.
Expanding $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$: $$= \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$$
Substituting $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$ (the normal equations): $$= \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y}$$
Comparing with RSS: $$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
Result: $$\text{RSS}(\boldsymbol{\beta}) = (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) + \mathbf{y}^T\mathbf{y} - \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y}$$
Simplifying the constant term (using $\hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y} = \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y}^T\mathbf{X}\hat{\boldsymbol{\beta}}$): $$\text{RSS}(\boldsymbol{\beta}) = (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) + |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2$$
Interpretation:
This beautiful decomposition shows that RSS has two parts: a quadratic term $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$, which measures how far $\boldsymbol{\beta}$ is from the optimum, and a constant term $|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2$, the minimum achievable RSS.
Since the first term is non-negative and equals zero only when $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}$, the minimum occurs exactly at $\hat{\boldsymbol{\beta}}$.
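The decomposition is easy to confirm numerically: for any $\boldsymbol{\beta}$, the quadratic term plus the minimum RSS reproduces $\text{RSS}(\boldsymbol{\beta})$ exactly. A sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss_min = np.sum((y - X @ beta_hat)**2)

# Check RSS(beta) = (beta - beta_hat)' X'X (beta - beta_hat) + RSS_min
for _ in range(5):
    beta = rng.normal(size=p)
    lhs = np.sum((y - X @ beta)**2)
    d = beta - beta_hat
    rhs = d @ (X.T @ X) @ d + rss_min
    print(np.isclose(lhs, rhs))  # True each time
```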
This technique avoids matrix calculus entirely, exhibits the minimizer directly, and reappears later when deriving posterior distributions in Bayesian linear regression.
With $\hat{\boldsymbol{\beta}}$ in hand, we now derive the MLE for the variance parameter $\sigma^2$.
Profile log-likelihood:
Substitute $\hat{\boldsymbol{\beta}}$ into the log-likelihood: $$\ell(\hat{\boldsymbol{\beta}}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{\text{RSS}_{\min}}{2\sigma^2}$$
where $\text{RSS}_{\min} = |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2 = \sum_{i=1}^n \hat{r}_i^2$ and $\hat{r}_i = y_i - \hat{y}_i$ are the fitted residuals.
This is now a function of $\sigma^2$ only—the profile log-likelihood.
Differentiate with respect to $\sigma^2$:
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2} \cdot \frac{1}{\sigma^2} - \frac{\text{RSS}_{\min}}{2} \cdot \left(-\frac{1}{(\sigma^2)^2}\right)$$
$$= -\frac{n}{2\sigma^2} + \frac{\text{RSS}_{\min}}{2(\sigma^2)^2}$$
Set derivative to zero:
$$-\frac{n}{2\sigma^2} + \frac{\text{RSS}_{\min}}{2(\sigma^2)^2} = 0$$
Multiply through by $2(\sigma^2)^2$: $$-n\sigma^2 + \text{RSS}_{\min} = 0$$
$$\sigma^2 = \frac{\text{RSS}_{\min}}{n}$$
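A quick numerical check: evaluating the profile log-likelihood on a grid of candidate $\sigma^2$ values shows the maximum sits at $\text{RSS}_{\min}/n$. A sketch with simulated data (the grid range and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + 1.3 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss_min = np.sum((y - X @ beta_hat)**2)

def profile_loglik(sigma_sq):
    return (-n / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(sigma_sq)
            - rss_min / (2 * sigma_sq))

grid = np.linspace(0.5, 5.0, 10001)
sigma_sq_grid_max = grid[np.argmax(profile_loglik(grid))]

print(f"Closed form RSS/n : {rss_min / n:.4f}")
print(f"Grid maximizer    : {sigma_sq_grid_max:.4f}")  # agrees up to grid resolution
```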
Verify it's a maximum:
Second derivative: $$\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{n}{2(\sigma^2)^2} - \frac{\text{RSS}_{\min}}{(\sigma^2)^3}$$
At $\hat{\sigma}^2 = \frac{\text{RSS}_{\min}}{n}$: $$= \frac{n}{2} \cdot \frac{n^2}{\text{RSS}_{\min}^2} - \frac{\text{RSS}_{\min} \cdot n^3}{\text{RSS}_{\min}^3} = \frac{n^3}{2\,\text{RSS}_{\min}^2} - \frac{n^3}{\text{RSS}_{\min}^2} = -\frac{n^3}{2\,\text{RSS}_{\min}^2} < 0$$
Since the second derivative is negative, this is indeed a maximum.
$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n \hat{r}_i^2 = \frac{\text{RSS}}{n}$$
The MLE for variance is the average squared residual. This has an intuitive interpretation: the noise variance is estimated by how much the data deviates from our fitted values.
Here's an important subtlety: while $\hat{\sigma}^2_{\text{MLE}} = \frac{\text{RSS}}{n}$ is the maximum likelihood estimator, it is biased.
The problem: $$\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \mathbb{E}\left[\frac{\text{RSS}}{n}\right] = \frac{n-p}{n} \sigma^2 \neq \sigma^2$$
where $p$ is the number of parameters (columns in $\mathbf{X}$).
The MLE systematically underestimates the true variance. Why does this happen?
Intuitive explanation:
When we fit $\hat{\boldsymbol{\beta}}$, we're choosing it to minimize RSS—to make residuals as small as possible. This "overfits" the noise slightly. The fitted residuals are smaller than the true errors because $\hat{\boldsymbol{\beta}}$ has adapted to the particular noise realization in our sample.
Each parameter we estimate "uses up" one degree of freedom, reducing the effective sample size for variance estimation.
With $n$ observations and $p$ estimated parameters, we have $n - p$ degrees of freedom for variance estimation.
Intuition: If we have $n = p$ (as many parameters as observations), we can fit the data exactly (RSS = 0), giving the absurd estimate $\hat{\sigma}^2 = 0$. The data has "used up" all its information for estimating $\boldsymbol{\beta}$, leaving nothing for estimating $\sigma^2$.
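This extreme case is easy to demonstrate: with a square, full-rank design matrix the fit interpolates the data exactly and the RSS collapses to zero. A minimal sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = p = 6  # as many parameters as observations
X = rng.normal(size=(n, p))     # square design matrix (full rank with probability 1)
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta_hat = np.linalg.solve(X, y)  # exact interpolation of the data
rss = np.sum((y - X @ beta_hat)**2)

print(f"RSS = {rss:.2e}")                         # numerically ~0
print(f"MLE variance estimate = {rss / n:.2e}")   # absurdly reports ~0 noise
```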
The unbiased estimator:
To correct the bias, we use the unbiased variance estimator:
$$s^2 = \frac{\text{RSS}}{n - p} = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$
One can prove that $\mathbb{E}[s^2] = \sigma^2$.
Key quantity: the distribution of the residual sum of squares (RSS)
More precisely, under the Gaussian model: $$\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$$
The RSS, scaled by $\sigma^2$, follows a chi-squared distribution with $n-p$ degrees of freedom. Since $\mathbb{E}[\chi^2_k] = k$: $$\mathbb{E}\left[\frac{\text{RSS}}{\sigma^2}\right] = n - p \implies \mathbb{E}[\text{RSS}] = (n-p)\sigma^2$$
Thus: $$\mathbb{E}\left[\frac{\text{RSS}}{n-p}\right] = \sigma^2$$
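Beyond the mean, the full chi-squared claim can also be probed by simulation: the first two moments of $\text{RSS}/\sigma^2$ should match those of $\chi^2_{n-p}$ (mean $n-p$, variance $2(n-p)$). A small sketch with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, sigma = 40, 5, 1.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

# Simulate RSS/sigma^2 repeatedly and compare with chi^2_{n-p} moments
scaled_rss = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    scaled_rss.append(np.sum((y - X @ beta_hat)**2) / sigma**2)
scaled_rss = np.array(scaled_rss)

print(f"Empirical mean:     {scaled_rss.mean():.2f}   (theory: n - p = {n - p})")
print(f"Empirical variance: {scaled_rss.var():.2f}   (theory: 2(n - p) = {2 * (n - p)})")
```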
```python
import numpy as np

def compare_variance_estimators(n_simulations=10000):
    """
    Demonstrate bias of MLE variance estimator through simulation.
    """
    n = 50   # Sample size
    p = 5    # Number of parameters (including intercept)
    sigma_true = 2.0
    sigma_sq_true = sigma_true**2

    np.random.seed(42)
    mle_estimates = []
    unbiased_estimates = []

    for _ in range(n_simulations):
        # Generate data
        X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
        beta_true = np.random.randn(p)
        y = X @ beta_true + sigma_true * np.random.randn(n)

        # Fit OLS
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_hat
        rss = np.sum(residuals**2)

        # MLE estimate (biased)
        sigma_sq_mle = rss / n
        mle_estimates.append(sigma_sq_mle)

        # Unbiased estimate
        sigma_sq_unbiased = rss / (n - p)
        unbiased_estimates.append(sigma_sq_unbiased)

    mle_mean = np.mean(mle_estimates)
    unbiased_mean = np.mean(unbiased_estimates)

    print("Variance Estimator Comparison")
    print("=" * 50)
    print(f"True σ² = {sigma_sq_true}")
    print(f"Sample size n = {n}, parameters p = {p}")
    print(f"Degrees of freedom = n - p = {n - p}")
    print(f"\nOver {n_simulations} simulations:")
    print(f"\n  MLE estimator (RSS/n):")
    print(f"    Mean: {mle_mean:.4f}")
    print(f"    Bias: {mle_mean - sigma_sq_true:.4f}")
    print(f"    Relative bias: {(mle_mean - sigma_sq_true) / sigma_sq_true * 100:.2f}%")
    print(f"    Theoretical bias factor: (n-p)/n = {(n-p)/n:.4f}")
    print(f"    Theoretical mean: {sigma_sq_true * (n-p)/n:.4f}")
    print(f"\n  Unbiased estimator (RSS/(n-p)):")
    print(f"    Mean: {unbiased_mean:.4f}")
    print(f"    Bias: {unbiased_mean - sigma_sq_true:.4f}")
    print(f"    Relative bias: {(unbiased_mean - sigma_sq_true) / sigma_sq_true * 100:.2f}%")
    print(f"\n✓ Unbiased estimator correctly estimates σ² on average")
    print(f"✓ MLE underestimates by factor (n-p)/n = {(n-p)/n:.3f}")

compare_variance_estimators()
```

| Estimator | Formula | Bias | When to Use |
|---|---|---|---|
| MLE | $\hat{\sigma}^2 = \frac{\text{RSS}}{n}$ | Biased low by factor $\frac{n-p}{n}$ | Large samples where $n \gg p$ |
| Unbiased | $s^2 = \frac{\text{RSS}}{n-p}$ | Unbiased | Standard practice; used for inference |
Let's consolidate everything into the complete maximum likelihood solution for Gaussian linear regression.
Model: $\mathbf{y} | \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$
MLE for coefficients: $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Fitted values: $$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$$ where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix.
Residuals: $$\hat{\mathbf{r}} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
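The hat matrix has the standard projection-matrix properties: it is symmetric, idempotent ($\mathbf{H}^2 = \mathbf{H}$), and has trace equal to $p$; the residuals it produces are orthogonal to the column space of $\mathbf{X}$. A sketch verifying these facts on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
r = (np.eye(n) - H) @ y                 # residuals

print(np.allclose(H, H.T))              # symmetric
print(np.allclose(H @ H, H))            # idempotent: projecting twice changes nothing
print(np.isclose(np.trace(H), p))       # trace equals number of parameters
print(np.allclose(X.T @ r, 0))          # residuals orthogonal to the column space
```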
MLE for variance: $$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}|\hat{\mathbf{r}}|^2 = \frac{1}{n}\sum_{i=1}^n \hat{r}_i^2$$
Unbiased variance estimator: $$s^2 = \frac{1}{n-p}|\hat{\mathbf{r}}|^2 = \frac{1}{n-p}\sum_{i=1}^n \hat{r}_i^2$$
```python
import numpy as np

class GaussianLinearRegressionMLE:
    """
    Complete Maximum Likelihood Estimation for Gaussian Linear Regression.

    This implementation follows the exact derivations from theory.
    """

    def __init__(self):
        self.beta_hat = None
        self.sigma_sq_mle = None
        self.sigma_sq_unbiased = None
        self.residuals = None
        self.fitted_values = None
        self.hat_matrix = None
        self.n = None
        self.p = None

    def fit(self, X, y):
        """
        Fit the model using maximum likelihood estimation.

        Parameters
        ----------
        X : array-like, shape (n, p)
            Design matrix (should include intercept column if desired)
        y : array-like, shape (n,)
            Response vector
        """
        X = np.asarray(X)
        y = np.asarray(y)
        self.n, self.p = X.shape

        # Step 1: Compute beta_hat = (X'X)^{-1} X'y
        XtX = X.T @ X
        Xty = X.T @ y
        # Use solve instead of explicit inverse for numerical stability
        self.beta_hat = np.linalg.solve(XtX, Xty)

        # Step 2: Compute fitted values and residuals
        self.fitted_values = X @ self.beta_hat
        self.residuals = y - self.fitted_values

        # Step 3: Compute RSS
        rss = np.sum(self.residuals**2)

        # Step 4: Variance estimates
        self.sigma_sq_mle = rss / self.n
        self.sigma_sq_unbiased = rss / (self.n - self.p)

        # Step 5: Hat matrix (optional, for diagnostics)
        self.hat_matrix = X @ np.linalg.solve(XtX, X.T)

        return self

    def predict(self, X_new):
        """Predict for new observations."""
        return X_new @ self.beta_hat

    def log_likelihood(self, X, y, use_mle_sigma=True):
        """Compute log-likelihood at fitted parameters."""
        sigma_sq = self.sigma_sq_mle if use_mle_sigma else self.sigma_sq_unbiased
        residuals = y - X @ self.beta_hat
        rss = np.sum(residuals**2)
        n = len(y)
        ll = -n/2 * np.log(2 * np.pi) - n/2 * np.log(sigma_sq) - rss / (2 * sigma_sq)
        return ll

    def covariance_matrix(self, X):
        """Return covariance matrix of beta_hat."""
        # Using unbiased variance estimator for proper inference
        return self.sigma_sq_unbiased * np.linalg.inv(X.T @ X)

    def standard_errors(self, X):
        """Compute standard errors for each coefficient."""
        return np.sqrt(np.diag(self.covariance_matrix(X)))

    def summary(self, X):
        """Print a summary of the fitted model."""
        print("=" * 60)
        print("Gaussian Linear Regression - Maximum Likelihood Estimation")
        print("=" * 60)
        print(f"\nObservations: n = {self.n}")
        print(f"Parameters: p = {self.p}")
        print(f"Degrees of freedom: {self.n - self.p}")

        print(f"\n--- Coefficient Estimates ---")
        se = self.standard_errors(X)
        for i, (b, s) in enumerate(zip(self.beta_hat, se)):
            t_stat = b / s
            print(f"  β_{i}: {b:9.4f}  (SE: {s:.4f}, t: {t_stat:7.3f})")

        print(f"\n--- Variance Estimates ---")
        print(f"  σ²_MLE (biased):  {self.sigma_sq_mle:.4f}")
        print(f"  s² (unbiased):    {self.sigma_sq_unbiased:.4f}")
        print(f"  σ estimate (√s²): {np.sqrt(self.sigma_sq_unbiased):.4f}")

        print(f"\n--- Model Fit ---")
        y = self.fitted_values + self.residuals
        ss_tot = np.sum((y - np.mean(y))**2)
        ss_res = np.sum(self.residuals**2)
        r_squared = 1 - ss_res / ss_tot
        adj_r_squared = 1 - (1 - r_squared) * (self.n - 1) / (self.n - self.p)
        print(f"  R²:     {r_squared:.4f}")
        print(f"  Adj R²: {adj_r_squared:.4f}")
        print(f"  RSS:    {ss_res:.4f}")

        print(f"\n--- Log-Likelihood ---")
        print(f"  Log-likelihood: {self.log_likelihood(X, y):.4f}")

# Example usage
np.random.seed(42)
n, p = 100, 4
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
beta_true = np.array([5.0, 2.0, -1.0, 0.5])
sigma_true = 1.5
y = X @ beta_true + sigma_true * np.random.randn(n)

model = GaussianLinearRegressionMLE()
model.fit(X, y)
model.summary(X)

print(f"\n--- Comparison with True Values ---")
print(f"True β: {beta_true}")
print(f"Est. β: {model.beta_hat}")
print(f"True σ²: {sigma_true**2}")
```

Let's examine important special cases to develop intuition for the MLE solution.
Case 1: Simple mean estimation ($\mathbf{X} = \mathbf{1}$)
If we only estimate a constant (intercept-only model), $\mathbf{X}$ is an $n \times 1$ vector of ones: $$\hat{\beta} = (\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{y} = \frac{\sum_{i=1}^n y_i}{n} = \bar{y}$$
The MLE is the sample mean—exactly what we'd expect for estimating the mean of a Gaussian!
$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2$$
This is the biased sample variance (divide by $n$), while $s^2$ divides by $n-1$.
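A sketch confirming that the general matrix formula reduces to the sample mean and the divide-by-$n$ variance in the intercept-only case (the data here are arbitrary simulated values):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 25
y = 3.0 + 1.5 * rng.normal(size=n)
X = np.ones((n, 1))                                # intercept-only design

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_hat)**2)

print(np.isclose(beta_hat[0], y.mean()))           # MLE of beta is the sample mean
print(np.isclose(rss / n, y.var()))                # MLE variance = biased sample variance (ddof=0)
print(np.isclose(rss / (n - 1), y.var(ddof=1)))    # unbiased s^2 divides by n-1
```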
Case 2: Orthogonal design ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$)
If columns of $\mathbf{X}$ are orthonormal: $$\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$$
Each coefficient is simply the inner product of the corresponding column with $\mathbf{y}$. Coefficient estimates are independent of each other.
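This can be seen by constructing an orthonormal design with a QR factorization; the sketch below uses an arbitrary random matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 80, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))       # Q has orthonormal columns: Q'Q = I
y = Q @ rng.normal(size=p) + 0.3 * rng.normal(size=n)

print(np.allclose(Q.T @ Q, np.eye(p)))             # orthonormal design

beta_general = np.linalg.solve(Q.T @ Q, Q.T @ y)   # general formula
beta_simple = Q.T @ y                              # simplifies to inner products
print(np.allclose(beta_general, beta_simple))      # True
```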
Case 3: Simple linear regression ($p = 2$)
For $y = \beta_0 + \beta_1 x + \epsilon$ with design matrix $\mathbf{X} = [\mathbf{1} \mid \mathbf{x}]$:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
These familiar formulas are special cases of the general MLE.
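A sketch checking that the textbook slope and intercept formulas agree with the matrix solution, using arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
x = rng.normal(size=n)
y = 2.0 + 0.7 * x + 0.5 * rng.normal(size=n)

# Matrix solution with design [1 | x]
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Textbook formulas
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
intercept = y.mean() - slope * x.mean()

print(np.allclose(beta_hat, [intercept, slope]))  # True
```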
Case 4: Large sample behavior ($n \to \infty$)
As $n \to \infty$ (with $p$ fixed), the bias factor $\frac{n-p}{n}$ tends to 1, so $\hat{\sigma}^2_{\text{MLE}}$ and $s^2$ become practically indistinguishable; moreover, under standard regularity conditions $\hat{\boldsymbol{\beta}}$ is consistent and asymptotically normal.
If $\mathbf{X}$ doesn't have full column rank (e.g., due to multicollinearity), $\mathbf{X}^T\mathbf{X}$ is singular and not invertible. In this case the normal equations still have solutions, but infinitely many of them: the fitted values $\mathbf{X}\hat{\boldsymbol{\beta}}$ are unique, yet the coefficients themselves are not identifiable. Common remedies are to drop redundant columns, use the Moore–Penrose pseudoinverse (which selects the minimum-norm solution), or add regularization such as ridge regression; a sketch of the pseudoinverse behavior follows below.
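A minimal sketch of the rank-deficient case: with a duplicated column, the normal equations cannot be solved by inverting $\mathbf{X}^T\mathbf{X}$, while the pseudoinverse and `np.linalg.lstsq` both return the minimum-norm solution and the fitted values remain well defined:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, 2 * x])      # third column duplicates the second
y = 1.0 + 3.0 * x + 0.2 * rng.normal(size=n)

print(np.linalg.matrix_rank(X))                  # 2 < 3 columns: X'X is singular

# Minimum-norm least-squares solution via the pseudoinverse
beta_pinv = np.linalg.pinv(X) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_pinv, beta_lstsq))        # both give the minimum-norm solution

# Fitted values are unique even though the coefficients are not identifiable
print(np.allclose(X @ beta_pinv, X @ beta_lstsq))
```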
We've now completed the derivation of the maximum likelihood solution for Gaussian linear regression. Here's what we've established:
- The log-likelihood decouples: $\hat{\boldsymbol{\beta}}$ is found by minimizing RSS, independently of $\sigma^2$.
- Setting the gradient of RSS to zero yields the normal equations $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$, whose solution $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ coincides with the OLS estimator.
- Profiling out $\boldsymbol{\beta}$ gives $\hat{\sigma}^2_{\text{MLE}} = \text{RSS}/n$, which is biased low by the factor $(n-p)/n$.
- The unbiased estimator $s^2 = \text{RSS}/(n-p)$ corrects this bias and is the standard choice for inference.
What's next:
In the next page, we'll deepen the connection between OLS and MLE—showing not just that they give the same point estimates, but understanding when and why they differ in their broader statistical properties. We'll also explore what MLE provides beyond OLS: principled variance estimation, likelihood ratio tests, and model selection criteria.
You now understand how to derive the maximum likelihood estimators from first principles. The closed-form solutions aren't magic—they emerge from careful application of calculus and linear algebra to a clearly specified probabilistic model. This foundation enables everything from hypothesis testing to Bayesian extensions.