We've established that maximum likelihood estimation for Gaussian linear regression requires maximizing the log-likelihood function. Now we complete the journey by solving this optimization problem analytically.
This page derives the closed-form expressions for both the coefficient estimator $\hat{\boldsymbol{\beta}}$ and the variance estimator $\hat{\sigma}^2$.
The beauty of Gaussian linear regression is that these solutions emerge from elementary calculus and linear algebra—no iterative optimization required. Understanding this derivation provides deep insight into why certain estimators have the forms they do.
By the end of this page, you will be able to derive $\hat{\boldsymbol{\beta}}$ by minimizing the residual sum of squares, derive $\hat{\sigma}^2$ from the profile log-likelihood, and explain why the MLE of the variance is biased and how the unbiased estimator $s^2 = \text{RSS}/(n-p)$ corrects it.
The objective function:
Recalling from the previous page, our log-likelihood is:
$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2$$
The optimization problem:
We seek: $$(\hat{\boldsymbol{\beta}}, \hat{\sigma}^2) = \arg\max_{\boldsymbol{\beta}, \sigma^2 > 0} \ell(\boldsymbol{\beta}, \sigma^2)$$
This is a joint optimization over both parameters. A key insight is that we can decouple this problem—solve for $\boldsymbol{\beta}$ first, then solve for $\sigma^2$.
When optimizing over multiple parameters, a useful technique is profiling: optimize over one parameter with the others held fixed, substitute that solution back into the objective, and then optimize the resulting profile function over the remaining parameters.
For our problem, the structure is even nicer: the optimal $\boldsymbol{\beta}$ doesn't depend on $\sigma^2$ at all!
Why decoupling works:
Look at how $\boldsymbol{\beta}$ and $\sigma^2$ appear in the log-likelihood:
$$\ell(\boldsymbol{\beta}, \sigma^2) = \underbrace{-\frac{n}{2}\log(2\pi)}_{\text{constant}} - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\underbrace{|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2}_{\text{RSS}(\boldsymbol{\beta})}$$
For any fixed $\sigma^2 > 0$, the first two terms do not involve $\boldsymbol{\beta}$, and the coefficient $-\frac{1}{2\sigma^2}$ multiplying $\text{RSS}(\boldsymbol{\beta})$ is negative. Maximizing $\ell$ over $\boldsymbol{\beta}$ is therefore equivalent to minimizing $\text{RSS}(\boldsymbol{\beta})$, whatever the value of $\sigma^2$.
Therefore: Solve for $\hat{\boldsymbol{\beta}}$ by minimizing RSS, then plug in to find $\hat{\sigma}^2$.
The minimization problem:
We need to solve: $$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|^2 = \arg\min_{\boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Let's expand the objective function. Define $\mathbf{r}(\boldsymbol{\beta}) = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$, the residual vector.
$$\text{RSS}(\boldsymbol{\beta}) = \mathbf{r}^T\mathbf{r} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Expanding: $$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
Since $\mathbf{y}^T\mathbf{X}\boldsymbol{\beta}$ is a scalar, it equals its transpose: $\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y}$. Thus:
$$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
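As a quick sanity check, the expanded form can be verified numerically; the sketch below uses arbitrary random data and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

# Direct computation: RSS(beta) = ||y - X beta||^2
rss_direct = np.sum((y - X @ beta)**2)

# Expanded form: y'y - 2 beta'X'y + beta'X'X beta
rss_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(np.isclose(rss_direct, rss_expanded))  # True
```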
Step 1: Compute the gradient
Using the matrix calculus identities $\nabla_{\boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{a}) = \mathbf{a}$ and $\nabla_{\boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta}) = 2\mathbf{A}\boldsymbol{\beta}$ for symmetric $\mathbf{A}$, and noting that $\mathbf{y}^T\mathbf{y}$ does not depend on $\boldsymbol{\beta}$:
Applying these: $$\nabla_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = 0 - 2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
$$\nabla_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - 2\mathbf{X}^T\mathbf{y}$$
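The gradient formula can be checked against a central finite-difference approximation; this is a minimal sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def rss(b):
    return np.sum((y - X @ b)**2)

# Analytic gradient: 2 X'X beta - 2 X'y
grad_analytic = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(grad_analytic, grad_numeric, rtol=1e-4))  # True
```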
Step 2: Set gradient to zero
At a minimum (first-order necessary condition): $$\nabla_{\boldsymbol{\beta}} \text{RSS}(\hat{\boldsymbol{\beta}}) = \mathbf{0}$$
$$2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} - 2\mathbf{X}^T\mathbf{y} = \mathbf{0}$$
$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$$
These are the famous normal equations of linear regression. The name comes from the geometric interpretation: at the solution, the residual vector $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is normal (perpendicular) to the column space of $\mathbf{X}$.
Equivalently: $\mathbf{X}^T\mathbf{r} = \mathbf{0}$, meaning the residual is orthogonal to every column of $\mathbf{X}$.
Step 3: Solve for $\hat{\boldsymbol{\beta}}$
If $\mathbf{X}^T\mathbf{X}$ is invertible (which requires $\mathbf{X}$ to have full column rank):
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
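In practice, $\hat{\boldsymbol{\beta}}$ is usually computed by solving the normal equations (or with a least-squares routine) rather than by forming the inverse explicitly. A minimal sketch with simulated data, also checking the orthogonality condition $\mathbf{X}^T\hat{\mathbf{r}} = \mathbf{0}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# Solve X'X beta = X'y (preferred over computing the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with a dedicated least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))           # True

# The normal equations imply the residual is orthogonal to every column of X
residuals = y - X @ beta_hat
print(np.allclose(X.T @ residuals, 0, atol=1e-8))  # True
```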
Step 4: Verify this is a minimum (not maximum)
The Hessian of RSS is: $$\nabla^2_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = 2\mathbf{X}^T\mathbf{X}$$
This is positive semi-definite (since $\mathbf{v}^T\mathbf{X}^T\mathbf{X}\mathbf{v} = |\mathbf{X}\mathbf{v}|^2 \geq 0$ for any $\mathbf{v}$).
When $\mathbf{X}$ has full column rank, the Hessian is strictly positive definite, so RSS is strictly convex and the stationary point is its unique global minimizer. This confirms:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
This is identical to the OLS estimator! The maximum likelihood approach under Gaussian noise leads to exactly the same formula we derived geometrically.
For deeper insight, let's derive the solution using the "completing the square" technique, which reveals the structure more elegantly.
Starting point: $$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
The key idea:
We want to rewrite this as $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{A}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) + \text{constant}$ for some matrix $\mathbf{A}$ and optimal $\hat{\boldsymbol{\beta}}$.
Let $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ and assume $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.
Expanding $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$: $$= \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$$
Substituting $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$ (the normal equations): $$= \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y}$$
Comparing with RSS: $$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
Result: $$\text{RSS}(\boldsymbol{\beta}) = (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) + \mathbf{y}^T\mathbf{y} - \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y}$$
Simplifying the constant term (using $\hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{y} = \hat{\boldsymbol{\beta}}^T\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y}^T\mathbf{X}\hat{\boldsymbol{\beta}}$): $$\text{RSS}(\boldsymbol{\beta}) = (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) + |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2$$
Interpretation:
This beautiful decomposition shows that RSS has two parts: a quadratic term $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^T\mathbf{X}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$, which measures how far $\boldsymbol{\beta}$ is from the optimum, and a constant term $|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2$, the minimum achievable RSS.
Since the first term is non-negative and equals zero only when $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}$, the minimum occurs exactly at $\hat{\boldsymbol{\beta}}$.
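The decomposition is easy to confirm numerically: for any $\boldsymbol{\beta}$, the quadratic term plus the minimum RSS reproduces $\text{RSS}(\boldsymbol{\beta})$ exactly. A sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss_min = np.sum((y - X @ beta_hat)**2)

# Check RSS(beta) = (beta - beta_hat)' X'X (beta - beta_hat) + RSS_min
for _ in range(5):
    beta = rng.normal(size=p)
    lhs = np.sum((y - X @ beta)**2)
    d = beta - beta_hat
    rhs = d @ (X.T @ X) @ d + rss_min
    print(np.isclose(lhs, rhs))  # True each time
```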
This technique avoids matrix calculus entirely, exhibits the minimizer directly, and reappears later when deriving posterior distributions in Bayesian linear regression.
With $\hat{\boldsymbol{\beta}}$ in hand, we now derive the MLE for the variance parameter $\sigma^2$.
Profile log-likelihood:
Substitute $\hat{\boldsymbol{\beta}}$ into the log-likelihood: $$\ell(\hat{\boldsymbol{\beta}}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{\text{RSS}_{\min}}{2\sigma^2}$$
where $\text{RSS}_{\min} = |\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}|^2 = \sum_{i=1}^n \hat{r}_i^2$ and $\hat{r}_i = y_i - \hat{y}_i$ are the fitted residuals.
This is now a function of $\sigma^2$ only—the profile log-likelihood.
Differentiate with respect to $\sigma^2$:
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2} \cdot \frac{1}{\sigma^2} - \frac{\text{RSS}_{\min}}{2} \cdot \left(-\frac{1}{(\sigma^2)^2}\right)$$
$$= -\frac{n}{2\sigma^2} + \frac{\text{RSS}_{\min}}{2(\sigma^2)^2}$$
Set derivative to zero:
$$-\frac{n}{2\sigma^2} + \frac{\text{RSS}_{\min}}{2(\sigma^2)^2} = 0$$
Multiply through by $2(\sigma^2)^2$: $$-n\sigma^2 + \text{RSS}_{\min} = 0$$
$$\sigma^2 = \frac{\text{RSS}_{\min}}{n}$$
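A quick numerical check: evaluating the profile log-likelihood on a grid of candidate $\sigma^2$ values shows the maximum sits at $\text{RSS}_{\min}/n$. A sketch with simulated data (the grid range and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + 1.3 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss_min = np.sum((y - X @ beta_hat)**2)

def profile_loglik(sigma_sq):
    return (-n / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(sigma_sq)
            - rss_min / (2 * sigma_sq))

grid = np.linspace(0.5, 5.0, 10001)
sigma_sq_grid_max = grid[np.argmax(profile_loglik(grid))]

print(f"Closed form RSS/n : {rss_min / n:.4f}")
print(f"Grid maximizer    : {sigma_sq_grid_max:.4f}")  # agrees up to grid resolution
```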
Verify it's a maximum:
Second derivative: $$\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{n}{2(\sigma^2)^2} - \frac{\text{RSS}_{\min}}{(\sigma^2)^3}$$
At $\hat{\sigma}^2 = \frac{\text{RSS}_{\min}}{n}$: $$= \frac{n}{2} \cdot \frac{n^2}{\text{RSS}_{\min}^2} - \frac{\text{RSS}_{\min} \cdot n^3}{\text{RSS}_{\min}^3} = \frac{n^3}{2\,\text{RSS}_{\min}^2} - \frac{n^3}{\text{RSS}_{\min}^2} = -\frac{n^3}{2\,\text{RSS}_{\min}^2} < 0$$
Since the second derivative is negative, this is indeed a maximum.
$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^n \hat{r}_i^2 = \frac{\text{RSS}}{n}$$
The MLE for variance is the average squared residual. This has an intuitive interpretation: the noise variance is estimated by how much the data deviates from our fitted values.
Here's an important subtlety: while $\hat{\sigma}^2_{\text{MLE}} = \frac{\text{RSS}}{n}$ is the maximum likelihood estimator, it is biased.
The problem: $$\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \mathbb{E}\left[\frac{\text{RSS}}{n}\right] = \frac{n-p}{n} \sigma^2 \neq \sigma^2$$
where $p$ is the number of parameters (columns in $\mathbf{X}$).
The MLE systematically underestimates the true variance. Why does this happen?
Intuitive explanation:
When we fit $\hat{\boldsymbol{\beta}}$, we're choosing it to minimize RSS—to make residuals as small as possible. This "overfits" the noise slightly. The fitted residuals are smaller than the true errors because $\hat{\boldsymbol{\beta}}$ has adapted to the particular noise realization in our sample.
Each parameter we estimate "uses up" one degree of freedom, reducing the effective sample size for variance estimation.
With $n$ observations and $p$ estimated parameters, we have $n - p$ degrees of freedom for variance estimation.
Intuition: If we have $n = p$ (as many parameters as observations), we can fit the data exactly (RSS = 0), giving the absurd estimate $\hat{\sigma}^2 = 0$. The data has "used up" all its information for estimating $\boldsymbol{\beta}$, leaving nothing for estimating $\sigma^2$.
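This extreme case is easy to demonstrate: with a square, full-rank design matrix the fit interpolates the data exactly and the RSS collapses to zero. A minimal sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = p = 6  # as many parameters as observations
X = rng.normal(size=(n, p))     # square design matrix (full rank with probability 1)
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta_hat = np.linalg.solve(X, y)  # exact interpolation of the data
rss = np.sum((y - X @ beta_hat)**2)

print(f"RSS = {rss:.2e}")                         # numerically ~0
print(f"MLE variance estimate = {rss / n:.2e}")   # absurdly reports ~0 noise
```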
The unbiased estimator:
To correct the bias, we use the unbiased variance estimator:
$$s^2 = \frac{\text{RSS}}{n - p} = \frac{1}{n-p}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$
One can prove that $\mathbb{E}[s^2] = \sigma^2$.
Key quantity: the distribution of the residual sum of squares (RSS)
More precisely, under the Gaussian model: $$\frac{\text{RSS}}{\sigma^2} \sim \chi^2_{n-p}$$
The RSS, scaled by $\sigma^2$, follows a chi-squared distribution with $n-p$ degrees of freedom. Since $\mathbb{E}[\chi^2_k] = k$: $$\mathbb{E}\left[\frac{\text{RSS}}{\sigma^2}\right] = n - p \implies \mathbb{E}[\text{RSS}] = (n-p)\sigma^2$$
Thus: $$\mathbb{E}\left[\frac{\text{RSS}}{n-p}\right] = \sigma^2$$
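Beyond the mean, the full chi-squared claim can also be probed by simulation: the first two moments of $\text{RSS}/\sigma^2$ should match those of $\chi^2_{n-p}$ (mean $n-p$, variance $2(n-p)$). A small sketch with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, sigma = 40, 5, 1.7
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)

# Simulate RSS/sigma^2 repeatedly and compare with chi^2_{n-p} moments
scaled_rss = []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    scaled_rss.append(np.sum((y - X @ beta_hat)**2) / sigma**2)
scaled_rss = np.array(scaled_rss)

print(f"Empirical mean:     {scaled_rss.mean():.2f}   (theory: n - p = {n - p})")
print(f"Empirical variance: {scaled_rss.var():.2f}   (theory: 2(n - p) = {2 * (n - p)})")
```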
```python
import numpy as np

def compare_variance_estimators(n_simulations=10000):
    """
    Demonstrate bias of MLE variance estimator through simulation.
    """
    n = 50   # Sample size
    p = 5    # Number of parameters (including intercept)
    sigma_true = 2.0
    sigma_sq_true = sigma_true**2

    np.random.seed(42)
    mle_estimates = []
    unbiased_estimates = []

    for _ in range(n_simulations):
        # Generate data
        X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
        beta_true = np.random.randn(p)
        y = X @ beta_true + sigma_true * np.random.randn(n)

        # Fit OLS
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        residuals = y - X @ beta_hat
        rss = np.sum(residuals**2)

        # MLE estimate (biased)
        sigma_sq_mle = rss / n
        mle_estimates.append(sigma_sq_mle)

        # Unbiased estimate
        sigma_sq_unbiased = rss / (n - p)
        unbiased_estimates.append(sigma_sq_unbiased)

    mle_mean = np.mean(mle_estimates)
    unbiased_mean = np.mean(unbiased_estimates)

    print("Variance Estimator Comparison")
    print("=" * 50)
    print(f"True σ² = {sigma_sq_true}")
    print(f"Sample size n = {n}, parameters p = {p}")
    print(f"Degrees of freedom = n - p = {n - p}")
    print(f"\nOver {n_simulations} simulations:")
    print(f"\n  MLE estimator (RSS/n):")
    print(f"    Mean: {mle_mean:.4f}")
    print(f"    Bias: {mle_mean - sigma_sq_true:.4f}")
    print(f"    Relative bias: {(mle_mean - sigma_sq_true) / sigma_sq_true * 100:.2f}%")
    print(f"    Theoretical bias factor: (n-p)/n = {(n-p)/n:.4f}")
    print(f"    Theoretical mean: {sigma_sq_true * (n-p)/n:.4f}")
    print(f"\n  Unbiased estimator (RSS/(n-p)):")
    print(f"    Mean: {unbiased_mean:.4f}")
    print(f"    Bias: {unbiased_mean - sigma_sq_true:.4f}")
    print(f"    Relative bias: {(unbiased_mean - sigma_sq_true) / sigma_sq_true * 100:.2f}%")
    print(f"\n✓ Unbiased estimator correctly estimates σ² on average")
    print(f"✓ MLE underestimates by factor (n-p)/n = {(n-p)/n:.3f}")

compare_variance_estimators()
```

| Estimator | Formula | Bias | When to Use |
|---|---|---|---|
| MLE | $\hat{\sigma}^2 = \frac{\text{RSS}}{n}$ | Biased low by factor $\frac{n-p}{n}$ | Large samples where $n \gg p$ |
| Unbiased | $s^2 = \frac{\text{RSS}}{n-p}$ | Unbiased | Standard practice; used for inference |
Let's consolidate everything into the complete maximum likelihood solution for Gaussian linear regression.
Model: $\mathbf{y} | \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$
MLE for coefficients: $$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Fitted values: $$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$$ where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix.
Residuals: $$\hat{\mathbf{r}} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y}$$
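The hat matrix has the standard projection-matrix properties: it is symmetric, idempotent ($\mathbf{H}^2 = \mathbf{H}$), and has trace equal to $p$; the residuals it produces are orthogonal to the column space of $\mathbf{X}$. A sketch verifying these facts on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
r = (np.eye(n) - H) @ y                 # residuals

print(np.allclose(H, H.T))              # symmetric
print(np.allclose(H @ H, H))            # idempotent: projecting twice changes nothing
print(np.isclose(np.trace(H), p))       # trace equals number of parameters
print(np.allclose(X.T @ r, 0))          # residuals orthogonal to the column space
```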
MLE for variance: $$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}|\hat{\mathbf{r}}|^2 = \frac{1}{n}\sum_{i=1}^n \hat{r}_i^2$$
Unbiased variance estimator: $$s^2 = \frac{1}{n-p}|\hat{\mathbf{r}}|^2 = \frac{1}{n-p}\sum_{i=1}^n \hat{r}_i^2$$
```python
import numpy as np

class GaussianLinearRegressionMLE:
    """
    Complete Maximum Likelihood Estimation for Gaussian Linear Regression.

    This implementation follows the exact derivations from theory.
    """

    def __init__(self):
        self.beta_hat = None
        self.sigma_sq_mle = None
        self.sigma_sq_unbiased = None
        self.residuals = None
        self.fitted_values = None
        self.hat_matrix = None
        self.n = None
        self.p = None

    def fit(self, X, y):
        """
        Fit the model using maximum likelihood estimation.

        Parameters
        ----------
        X : array-like, shape (n, p)
            Design matrix (should include intercept column if desired)
        y : array-like, shape (n,)
            Response vector
        """
        X = np.asarray(X)
        y = np.asarray(y)
        self.n, self.p = X.shape

        # Step 1: Compute beta_hat = (X'X)^{-1} X'y
        XtX = X.T @ X
        Xty = X.T @ y
        # Use solve instead of explicit inverse for numerical stability
        self.beta_hat = np.linalg.solve(XtX, Xty)

        # Step 2: Compute fitted values and residuals
        self.fitted_values = X @ self.beta_hat
        self.residuals = y - self.fitted_values

        # Step 3: Compute RSS
        rss = np.sum(self.residuals**2)

        # Step 4: Variance estimates
        self.sigma_sq_mle = rss / self.n
        self.sigma_sq_unbiased = rss / (self.n - self.p)

        # Step 5: Hat matrix (optional, for diagnostics)
        self.hat_matrix = X @ np.linalg.solve(XtX, X.T)

        return self

    def predict(self, X_new):
        """Predict for new observations."""
        return X_new @ self.beta_hat

    def log_likelihood(self, X, y, use_mle_sigma=True):
        """Compute log-likelihood at fitted parameters."""
        sigma_sq = self.sigma_sq_mle if use_mle_sigma else self.sigma_sq_unbiased
        residuals = y - X @ self.beta_hat
        rss = np.sum(residuals**2)
        n = len(y)
        ll = -n/2 * np.log(2 * np.pi) - n/2 * np.log(sigma_sq) - rss / (2 * sigma_sq)
        return ll

    def covariance_matrix(self, X):
        """Return covariance matrix of beta_hat."""
        # Using unbiased variance estimator for proper inference
        return self.sigma_sq_unbiased * np.linalg.inv(X.T @ X)

    def standard_errors(self, X):
        """Compute standard errors for each coefficient."""
        return np.sqrt(np.diag(self.covariance_matrix(X)))

    def summary(self, X):
        """Print a summary of the fitted model."""
        print("=" * 60)
        print("Gaussian Linear Regression - Maximum Likelihood Estimation")
        print("=" * 60)
        print(f"\nObservations: n = {self.n}")
        print(f"Parameters: p = {self.p}")
        print(f"Degrees of freedom: {self.n - self.p}")

        print(f"\n--- Coefficient Estimates ---")
        se = self.standard_errors(X)
        for i, (b, s) in enumerate(zip(self.beta_hat, se)):
            t_stat = b / s
            print(f"  β_{i}: {b:9.4f}  (SE: {s:.4f}, t: {t_stat:7.3f})")

        print(f"\n--- Variance Estimates ---")
        print(f"  σ²_MLE (biased):  {self.sigma_sq_mle:.4f}")
        print(f"  s² (unbiased):    {self.sigma_sq_unbiased:.4f}")
        print(f"  σ estimate (√s²): {np.sqrt(self.sigma_sq_unbiased):.4f}")

        print(f"\n--- Model Fit ---")
        y = self.fitted_values + self.residuals
        ss_tot = np.sum((y - np.mean(y))**2)
        ss_res = np.sum(self.residuals**2)
        r_squared = 1 - ss_res / ss_tot
        adj_r_squared = 1 - (1 - r_squared) * (self.n - 1) / (self.n - self.p)
        print(f"  R²:     {r_squared:.4f}")
        print(f"  Adj R²: {adj_r_squared:.4f}")
        print(f"  RSS:    {ss_res:.4f}")

        print(f"\n--- Log-Likelihood ---")
        print(f"  Log-likelihood: {self.log_likelihood(X, y):.4f}")

# Example usage
np.random.seed(42)
n, p = 100, 4
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
beta_true = np.array([5.0, 2.0, -1.0, 0.5])
sigma_true = 1.5
y = X @ beta_true + sigma_true * np.random.randn(n)

model = GaussianLinearRegressionMLE()
model.fit(X, y)
model.summary(X)

print(f"\n--- Comparison with True Values ---")
print(f"True β: {beta_true}")
print(f"Est. β: {model.beta_hat}")
print(f"True σ²: {sigma_true**2}")
```

Let's examine important special cases to develop intuition for the MLE solution.
Case 1: Simple mean estimation ($\mathbf{X} = \mathbf{1}$)
If we only estimate a constant (intercept-only model), $\mathbf{X}$ is an $n \times 1$ vector of ones: $$\hat{\beta} = (\mathbf{1}^T\mathbf{1})^{-1}\mathbf{1}^T\mathbf{y} = \frac{\sum_{i=1}^n y_i}{n} = \bar{y}$$
The MLE is the sample mean—exactly what we'd expect for estimating the mean of a Gaussian!
$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2$$
This is the biased sample variance (divide by $n$), while $s^2$ divides by $n-1$.
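A sketch confirming that the general matrix formula reduces to the sample mean and the divide-by-$n$ variance in the intercept-only case (the data here are arbitrary simulated values):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 25
y = 3.0 + 1.5 * rng.normal(size=n)
X = np.ones((n, 1))                                # intercept-only design

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta_hat)**2)

print(np.isclose(beta_hat[0], y.mean()))           # MLE of beta is the sample mean
print(np.isclose(rss / n, y.var()))                # MLE variance = biased sample variance (ddof=0)
print(np.isclose(rss / (n - 1), y.var(ddof=1)))    # unbiased s^2 divides by n-1
```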
Case 2: Orthogonal design ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$)
If columns of $\mathbf{X}$ are orthonormal: $$\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$$
Each coefficient is simply the inner product of the corresponding column with $\mathbf{y}$. Coefficient estimates are independent of each other.
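This can be seen by constructing an orthonormal design with a QR factorization; the sketch below uses an arbitrary random matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 80, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))       # Q has orthonormal columns: Q'Q = I
y = Q @ rng.normal(size=p) + 0.3 * rng.normal(size=n)

print(np.allclose(Q.T @ Q, np.eye(p)))             # orthonormal design

beta_general = np.linalg.solve(Q.T @ Q, Q.T @ y)   # general formula
beta_simple = Q.T @ y                              # simplifies to inner products
print(np.allclose(beta_general, beta_simple))      # True
```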
Case 3: Simple linear regression ($p = 2$)
For $y = \beta_0 + \beta_1 x + \epsilon$ with design matrix $\mathbf{X} = [\mathbf{1} \mid \mathbf{x}]$:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
These familiar formulas are special cases of the general MLE.
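A sketch checking that the textbook slope and intercept formulas agree with the matrix solution, using arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 40
x = rng.normal(size=n)
y = 2.0 + 0.7 * x + 0.5 * rng.normal(size=n)

# Matrix solution with design [1 | x]
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Textbook formulas
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
intercept = y.mean() - slope * x.mean()

print(np.allclose(beta_hat, [intercept, slope]))  # True
```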
Case 4: Large sample behavior ($n \to \infty$)
As $n \to \infty$ (with $p$ fixed), the bias factor $\frac{n-p}{n}$ tends to 1, so $\hat{\sigma}^2_{\text{MLE}}$ and $s^2$ become practically indistinguishable; moreover, under standard regularity conditions $\hat{\boldsymbol{\beta}}$ is consistent and asymptotically normal.
If $\mathbf{X}$ doesn't have full column rank (e.g., due to multicollinearity), $\mathbf{X}^T\mathbf{X}$ is singular and not invertible. In this case the normal equations still have solutions, but infinitely many of them: the fitted values $\mathbf{X}\hat{\boldsymbol{\beta}}$ are unique, yet the coefficients themselves are not identifiable. Common remedies are to drop redundant columns, use the Moore–Penrose pseudoinverse (which selects the minimum-norm solution), or add regularization such as ridge regression; a sketch of the pseudoinverse behavior follows below.
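A minimal sketch of the rank-deficient case: with a duplicated column, the normal equations cannot be solved by inverting $\mathbf{X}^T\mathbf{X}$, while the pseudoinverse and `np.linalg.lstsq` both return the minimum-norm solution and the fitted values remain well defined:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, 2 * x])      # third column duplicates the second
y = 1.0 + 3.0 * x + 0.2 * rng.normal(size=n)

print(np.linalg.matrix_rank(X))                  # 2 < 3 columns: X'X is singular

# Minimum-norm least-squares solution via the pseudoinverse
beta_pinv = np.linalg.pinv(X) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_pinv, beta_lstsq))        # both give the minimum-norm solution

# Fitted values are unique even though the coefficients are not identifiable
print(np.allclose(X @ beta_pinv, X @ beta_lstsq))
```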
We've now completed the derivation of the maximum likelihood solution for Gaussian linear regression. Here's what we've established:
- The log-likelihood decouples: $\hat{\boldsymbol{\beta}}$ is found by minimizing RSS, independently of $\sigma^2$.
- Setting the gradient of RSS to zero yields the normal equations $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$, whose solution $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ coincides with the OLS estimator.
- Profiling out $\boldsymbol{\beta}$ gives $\hat{\sigma}^2_{\text{MLE}} = \text{RSS}/n$, which is biased low by the factor $(n-p)/n$.
- The unbiased estimator $s^2 = \text{RSS}/(n-p)$ corrects this bias and is the standard choice for inference.
What's next:
In the next page, we'll deepen the connection between OLS and MLE—showing not just that they give the same point estimates, but understanding when and why they differ in their broader statistical properties. We'll also explore what MLE provides beyond OLS: principled variance estimation, likelihood ratio tests, and model selection criteria.
You now understand how to derive the maximum likelihood estimators from first principles. The closed-form solutions aren't magic—they emerge from careful application of calculus and linear algebra to a clearly specified probabilistic model. This foundation enables everything from hypothesis testing to Bayesian extensions.