With the Gaussian noise assumption established, we now embark on the mathematical derivation that connects probability theory to optimization. The log-likelihood function is the central object of study—it quantifies how probable our observed data is under our assumed model for different parameter values.
Maximum likelihood estimation (MLE) follows a simple principle: choose the parameters that make the observed data most probable. This intuitive idea translates into a precise optimization problem. For Gaussian linear regression, this optimization problem is not only well-defined but has a beautiful closed-form solution.
In this page, we'll derive every step in complete detail, leaving no mathematical gaps.
By the end of this page, you will understand what the likelihood function is and how it differs from a probability, why we work with the log-likelihood, how to derive the Gaussian log-likelihood term by term, and why maximizing it over $\boldsymbol{\beta}$ recovers the OLS solution.
Probability vs. Likelihood:
Before diving into derivations, let's clarify terminology that often confuses newcomers.
The probability density $p(\mathbf{y} | \boldsymbol{\theta})$ can be viewed in two ways:
As a probability (function of data): With parameters $\boldsymbol{\theta}$ fixed, this specifies a probability distribution over possible observations $\mathbf{y}$. We integrate/sum over $\mathbf{y}$ to get probabilities.
As a likelihood (function of parameters): With data $\mathbf{y}$ fixed (we've observed it!), this quantifies how "likely" different parameter values $\boldsymbol{\theta}$ are. We do not integrate over $\boldsymbol{\theta}$.
The likelihood function is precisely this second view:
$$\mathcal{L}(\boldsymbol{\theta}) \equiv \mathcal{L}(\boldsymbol{\theta}; \mathbf{y}) = p(\mathbf{y} | \boldsymbol{\theta})$$
The notation emphasizes that we treat $\mathcal{L}$ as a function of $\boldsymbol{\theta}$, with $\mathbf{y}$ held constant.
A critical conceptual point: the likelihood function is not a probability distribution over parameters. It does not integrate to 1 over $\boldsymbol{\theta}$. It's simply a function that tells us the relative plausibility of different parameter values given fixed data.
This distinction becomes crucial in Bayesian statistics, where we multiply likelihood by a prior distribution to obtain a posterior distribution. But in frequentist MLE, we work with likelihood directly without normalization.
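To make the two views concrete, here is a minimal sketch (the three observations and the $\mathcal{N}(\mu, 1)$ model are illustrative, not the regression model of this page): viewed as a function of the data, the density integrates to 1; viewed as a function of the parameter with the data held fixed, it generally does not.

```python
import numpy as np
from scipy.stats import norm

# Three observed data points under a N(mu, 1) model (values are illustrative).
y_obs = np.array([1.2, 0.7, 1.9])

# View 1: with mu fixed at 0, the density is a function of the data
# and integrates to 1 over y.
y_grid = np.linspace(-6.0, 6.0, 2001)
dy = y_grid[1] - y_grid[0]
print(f"Integral over y : {np.sum(norm.pdf(y_grid, loc=0.0) * dy):.4f}")   # ~1.0

# View 2: with the data fixed, the likelihood L(mu) = prod_i p(y_i | mu)
# is a function of mu. It does NOT have to integrate to 1 over mu.
mu_grid = np.linspace(-6.0, 6.0, 2001)
dmu = mu_grid[1] - mu_grid[0]
lik = np.array([np.prod(norm.pdf(y_obs, loc=mu)) for mu in mu_grid])
print(f"Integral over mu: {np.sum(lik * dmu):.4f}")          # not 1 in general
print(f"Most likely mu  : {mu_grid[np.argmax(lik)]:.2f}")    # ~ mean(y_obs)
```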
For our model:
In Gaussian linear regression, the parameters are $\boldsymbol{\theta} = (\boldsymbol{\beta}, \sigma^2)$. Given the model:
$$\mathbf{y} | \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$$
The likelihood function is:
$$\mathcal{L}(\boldsymbol{\beta}, \sigma^2) = p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}, \sigma^2)$$
The MLE principle:
$$\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$
We seek the parameter values that maximize the probability of observing our actual data.
In practice, we always work with the log-likelihood $\ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta})$ rather than the likelihood itself. This isn't just convention—there are compelling reasons.
1. Numerical stability:
Likelihood values for $n$ observations are products of $n$ probability densities. Each density value is typically less than 1 (often much less). Multiplying many such numbers leads to catastrophic underflow—the result becomes smaller than floating-point representation allows.
For example, with $n=1000$ observations each having density $p_i = 0.1$: $$\mathcal{L} = \prod_{i=1}^{1000} 0.1 = 10^{-1000}$$
This is smaller than the smallest positive float64 number (~$10^{-308}$). The log-likelihood, however: $$\ell = \sum_{i=1}^{1000} \log(0.1) = -1000 \log(10) \approx -2303$$
This is perfectly representable.
```python
import numpy as np

# Demonstrate why log-likelihoods are necessary
def demonstrate_underflow():
    """Show how direct likelihood computation fails."""
    n = 1000

    # Simulate probability densities (typical Gaussian values)
    np.random.seed(42)
    densities = 0.3 + 0.1 * np.random.randn(n)  # Around 0.2-0.4
    densities = np.clip(densities, 0.01, 0.99)

    # Attempt to compute the likelihood directly: the product underflows to 0.0
    likelihood = np.prod(densities)
    print(f"Direct likelihood: {likelihood}")
    print("⚠️ This is 0.0 due to underflow!")

    # Use the log-likelihood instead
    log_likelihood = np.sum(np.log(densities))
    print(f"\nLog-likelihood: {log_likelihood:.4f}")
    print("✓ Log-likelihood computes without issue")

    # exp(log_likelihood) would still underflow, but for comparing models
    # the log-likelihood alone is sufficient:
    print("\nFor comparison purposes, log-likelihood is sufficient.")
    print("log L₁ > log L₂ ⟺ L₁ > L₂")

demonstrate_underflow()
```

2. Products become sums:
The logarithm transforms products into sums. For independent observations:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^n p(y_i | \mathbf{x}_i, \boldsymbol{\theta})$$ $$\ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^n \log p(y_i | \mathbf{x}_i, \boldsymbol{\theta})$$
Sums are much easier to differentiate, manipulate, and interpret than products.
3. Exponential families simplify:
For distributions in the exponential family (including Gaussian, Poisson, Bernoulli, etc.), the log-likelihood has a particularly nice form that makes derivatives tractable.
4. Monotonicity preserves optima:
Since $\log$ is strictly monotonically increasing: $$\arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \log \mathcal{L}(\boldsymbol{\theta})$$
The maximizers are identical. We lose nothing by working with log-likelihood.
Often you'll see negative log-likelihood (NLL) used as a loss function:
$$\text{NLL}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) = -\log \mathcal{L}(\boldsymbol{\theta})$$
This converts maximization to minimization, fitting the standard loss function paradigm used in machine learning. Minimizing NLL is equivalent to maximizing log-likelihood.
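A quick numerical illustration of this equivalence on a toy one-parameter model (the data and the grid of candidate means below are arbitrary): the likelihood, the log-likelihood, and the NLL all single out the same parameter value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=20)   # toy data; n is small enough to avoid underflow

mu_grid = np.linspace(-2.0, 4.0, 601)
# Likelihood, log-likelihood, and NLL evaluated on the same grid of candidate means.
lik = np.array([np.prod(norm.pdf(y, loc=mu)) for mu in mu_grid])
loglik = np.array([np.sum(norm.logpdf(y, loc=mu)) for mu in mu_grid])
nll = -loglik

# All three criteria pick out the same parameter value.
print(mu_grid[np.argmax(lik)], mu_grid[np.argmax(loglik)], mu_grid[np.argmin(nll)])
```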
Let's build up the log-likelihood step by step, starting with a single observation.
The density for one observation:
For a single observation $(\mathbf{x}_i, y_i)$, the conditional density of $y_i$ given $\mathbf{x}_i$ under our Gaussian model is:
$$p(y_i | \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)$$
Taking the logarithm:
Applying $\log$ and using $\log(ab) = \log a + \log b$ and $\log(e^x) = x$:
$$\log p(y_i | \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2) = \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(\exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)\right)$$
Simplifying the first term: $$\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) = -\frac{1}{2}\log(2\pi\sigma^2) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2)$$
Simplifying the second term: $$\log\left(\exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)\right) = -\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}$$
The log-likelihood contribution from observation $i$ is:
$$\ell_i(\boldsymbol{\beta}, \sigma^2) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}$$
Notice the structure: a constant term, a term that depends only on the variance $\sigma^2$, and a term that penalizes the squared prediction error, scaled by the variance.
The residual connection:
Define the residual $r_i = y_i - \mathbf{x}_i^T \boldsymbol{\beta}$. Then:
$$\ell_i(\boldsymbol{\beta}, \sigma^2) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{r_i^2}{2\sigma^2}$$
Key observation: the log-likelihood decreases as the squared residual $r_i^2$ increases. Larger residuals mean lower likelihood. This is exactly what we'd expect—if our model predicts poorly (large residuals), the data is less probable under that model.
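As a quick sanity check on this algebra, the hand-derived expression for $\ell_i$ can be compared against `scipy.stats.norm.logpdf`; the observation, feature vector, and parameters below are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import norm

# Arbitrary single observation and parameters (illustrative values).
x_i = np.array([1.0, 0.8, -1.3])      # feature vector, includes the intercept
beta = np.array([0.5, 2.0, -0.7])
y_i, sigma_sq = 3.1, 0.6

mean_i = x_i @ beta
# Hand-derived expression for the log-density contribution l_i:
ell_i = -0.5 * np.log(2 * np.pi) - 0.5 * np.log(sigma_sq) \
        - (y_i - mean_i) ** 2 / (2 * sigma_sq)

# Reference value from scipy (note: scale is the standard deviation).
ell_ref = norm.logpdf(y_i, loc=mean_i, scale=np.sqrt(sigma_sq))
print(ell_i, ell_ref)    # the two agree to floating-point precision
```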
Summing over all observations:
With $n$ independent observations, the total log-likelihood is the sum of individual contributions:
$$\ell(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^n \ell_i(\boldsymbol{\beta}, \sigma^2)$$
Expanding: $$\ell(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^n \left[-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right]$$
Since the first two terms don't depend on $i$, we can factor them out: $$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2$$
$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\text{RSS}(\boldsymbol{\beta})$$
where $\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ is the Residual Sum of Squares.
Matrix form:
Recalling that $\text{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$:
$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
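The elementwise, norm, and matrix expressions for RSS are the same number; here is a two-line check with arbitrary synthetic $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=50)

r = y - X @ beta
# Elementwise sum of squares, squared Euclidean norm, and quadratic form agree.
print(np.sum(r**2), np.linalg.norm(r)**2, r @ r)
```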
Understanding the terms:
| Term | Expression | Role |
|---|---|---|
| Normalization constant | $-\frac{n}{2}\log(2\pi)$ | Doesn't affect optimization |
| Variance penalty | $-\frac{n}{2}\log(\sigma^2)$ | Larger $\sigma^2$ → lower likelihood |
| Fit penalty | $-\frac{\text{RSS}}{2\sigma^2}$ | Larger residuals → lower likelihood |
The log-likelihood balances two forces: the variance penalty favors a small $\sigma^2$, while the fit penalty $-\text{RSS}/(2\sigma^2)$ is softened when $\sigma^2$ is large. The code below evaluates the function and shows how moving either $\boldsymbol{\beta}$ or $\sigma^2$ away from its best value lowers the likelihood.
```python
import numpy as np

def log_likelihood(y, X, beta, sigma_sq):
    """
    Compute the log-likelihood for Gaussian linear regression.

    Parameters
    ----------
    y : array-like, shape (n,)
        Response vector
    X : array-like, shape (n, p)
        Design matrix
    beta : array-like, shape (p,)
        Coefficient vector
    sigma_sq : float
        Variance of the noise (sigma^2)

    Returns
    -------
    float : Log-likelihood value
    """
    n = len(y)

    # Compute residuals
    residuals = y - X @ beta

    # Residual sum of squares
    rss = np.sum(residuals**2)

    # Log-likelihood components
    const_term = -n/2 * np.log(2 * np.pi)
    var_term = -n/2 * np.log(sigma_sq)
    fit_term = -rss / (2 * sigma_sq)

    log_lik = const_term + var_term + fit_term
    return log_lik

def negative_log_likelihood(y, X, beta, sigma_sq):
    """NLL for use as a loss function (to minimize)."""
    return -log_likelihood(y, X, beta, sigma_sq)

# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.5
y = X @ beta_true + sigma_true * np.random.randn(n)

# Compute log-likelihood at true parameters
ll_true = log_likelihood(y, X, beta_true, sigma_true**2)
print(f"Log-likelihood at true β, σ²: {ll_true:.4f}")

# Compute at wrong parameters (should be lower)
beta_wrong = np.array([0.5, 1.5, 0.0])
ll_wrong = log_likelihood(y, X, beta_wrong, sigma_true**2)
print(f"Log-likelihood at wrong β: {ll_wrong:.4f}")

# With wrong sigma^2 (too small - penalizes RSS more)
ll_small_var = log_likelihood(y, X, beta_true, 0.1)
print(f"Log-likelihood with σ²=0.1: {ll_small_var:.4f}")

# With wrong sigma^2 (too large - variance penalty)
ll_large_var = log_likelihood(y, X, beta_true, 2.0)
print(f"Log-likelihood with σ²=2.0: {ll_large_var:.4f}")
```

We now arrive at the key result that bridges the probabilistic and geometric views of linear regression.
Observation 1: $\boldsymbol{\beta}$ appears only in RSS
Examining the log-likelihood: $$\ell(\boldsymbol{\beta}, \sigma^2) = \underbrace{-\frac{n}{2}\log(2\pi)}_{\text{constant}} - \underbrace{\frac{n}{2}\log(\sigma^2)}_{\text{involves only } \sigma^2} - \underbrace{\frac{\text{RSS}(\boldsymbol{\beta})}{2\sigma^2}}_{\text{involves both}}$$
When maximizing with respect to $\boldsymbol{\beta}$ (holding $\sigma^2$ fixed), the first term is a constant and the second involves only $\sigma^2$; only the RSS term depends on $\boldsymbol{\beta}$.
Observation 2: RSS is negated
The term involving $\boldsymbol{\beta}$ is $-\frac{\text{RSS}(\boldsymbol{\beta})}{2\sigma^2}$. This has a negative sign in front of RSS.
Observation 3: Maximizing the negative of a quantity is the same as minimizing that quantity
$$\arg\max_{\boldsymbol{\beta}} \left(-\frac{\text{RSS}(\boldsymbol{\beta})}{2\sigma^2}\right) = \arg\min_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta})$$
The positive factor $\frac{1}{2\sigma^2}$ doesn't change the argmax.
For any fixed $\sigma^2 > 0$:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}, \sigma^2) = \arg\min_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = \hat{\boldsymbol{\beta}}_{\text{OLS}}$$
The maximum likelihood estimator for $\boldsymbol{\beta}$ is exactly the ordinary least squares estimator!
This is profound: the OLS solution we derived geometrically is also the statistically optimal solution under Gaussian noise. We're not just fitting a line—we're finding the most probable parameters.
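To see this equivalence numerically, one can hand the negative log-likelihood to a generic optimizer and compare the result with the least-squares solution. A minimal sketch, with synthetic data and a fixed $\sigma^2$ chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.5 * rng.normal(size=n)

def nll(beta, sigma_sq=0.25):
    """Negative log-likelihood in beta, for a fixed sigma^2 (illustrative value)."""
    r = y - X @ beta
    return n / 2 * np.log(2 * np.pi * sigma_sq) + r @ r / (2 * sigma_sq)

beta_mle = minimize(nll, x0=np.zeros(p)).x          # generic optimizer on the NLL
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]     # least-squares solution

print("MLE:", np.round(beta_mle, 4))
print("OLS:", np.round(beta_ols, 4))                # the two agree
```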
Why is this important?
Justification for squared loss: We're not minimizing squared errors merely because it's computationally convenient; it is exactly what maximum likelihood prescribes under Gaussian noise.
Unification: The geometric view (projection onto column space) and probabilistic view (maximizing likelihood) lead to the same answer.
Beyond point estimates: The MLE framework gives us much more than OLS alone: standard errors and confidence regions derived from the curvature of the likelihood, and a principled basis for hypothesis tests.
Foundation for extensions: Regularization (Ridge, Lasso) can be understood as adding priors—connecting MLE to Bayesian inference.
| Approach | Starting Point | Objective | Solution |
|---|---|---|---|
| Geometric (OLS) | Find best-fit line | Minimize $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ |
| Probabilistic (MLE) | Most probable parameters | Maximize $\ell(\boldsymbol{\beta}, \sigma^2)$ | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ |
To develop intuition, let's visualize how the log-likelihood varies with parameter values.
Simple linear regression case:
Consider $y = \beta_0 + \beta_1 x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The log-likelihood is a function of three parameters: $\beta_0$, $\beta_1$, and $\sigma^2$.
For fixed $\sigma^2$, the log-likelihood as a function of $(\beta_0, \beta_1)$ forms a concave paraboloid—a bowl shape flipped upside down. This concavity guarantees a unique global maximum with no local optima.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def plot_log_likelihood_surface():
    """
    Visualize the log-likelihood surface for simple linear regression.
    """
    # Generate synthetic data
    np.random.seed(42)
    n = 50
    x = np.linspace(0, 10, n)
    beta0_true, beta1_true = 2.0, 0.5
    sigma_true = 1.0
    y = beta0_true + beta1_true * x + sigma_true * np.random.randn(n)

    # Design matrix
    X = np.column_stack([np.ones(n), x])

    # Create parameter grid
    beta0_range = np.linspace(0, 4, 50)
    beta1_range = np.linspace(-0.5, 1.5, 50)
    B0, B1 = np.meshgrid(beta0_range, beta1_range)

    # Compute log-likelihood on the grid
    log_lik = np.zeros_like(B0)
    sigma_sq = sigma_true**2  # Fix sigma at its true value
    for i in range(B0.shape[0]):
        for j in range(B0.shape[1]):
            beta = np.array([B0[i, j], B1[i, j]])
            residuals = y - X @ beta
            rss = np.sum(residuals**2)
            log_lik[i, j] = -n/2*np.log(2*np.pi) - n/2*np.log(sigma_sq) - rss/(2*sigma_sq)

    # Find the maximum
    max_idx = np.unravel_index(np.argmax(log_lik), log_lik.shape)
    beta0_mle = B0[max_idx]
    beta1_mle = B1[max_idx]

    print("Log-Likelihood Surface Analysis")
    print("=" * 40)
    print(f"True parameters: β₀={beta0_true}, β₁={beta1_true}")
    print(f"MLE (grid search): β₀≈{beta0_mle:.3f}, β₁≈{beta1_mle:.3f}")
    print("\nThe surface is a concave paraboloid (inverted bowl)")
    print("Unique global maximum - no local optima!")

    # OLS solution for comparison
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"OLS solution: β₀={beta_ols[0]:.3f}, β₁={beta_ols[1]:.3f}")
    print("MLE and OLS agree (as theory predicts)")

plot_log_likelihood_surface()
```

Contour interpretation:
Contours of the log-likelihood surface are ellipses centered at the MLE. These ellipses have a direct connection to confidence regions—a contour at level $\ell_{\max} - c$ for appropriate $c$ corresponds to a joint confidence region for the parameters.
The orientation and eccentricity of these ellipses depend on the correlation between parameter estimates. When features are correlated in $\mathbf{X}$, the ellipses become elongated and tilted, indicating that the parameters are hard to estimate independently.
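The shape of those ellipses is governed by $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. The sketch below (with an assumed noise variance of 1 and illustrative designs) compares its eigenvalues for an uncorrelated design and a nearly collinear one; the contour semi-axes scale with the square roots of these eigenvalues, so a large ratio means elongated, tilted ellipses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def ellipse_shape(X, sigma_sq=1.0):
    """Eigenvalues of sigma^2 (X'X)^{-1}; contour semi-axes scale with their square roots."""
    cov = sigma_sq * np.linalg.inv(X.T @ X)
    eigvals = np.linalg.eigvalsh(cov)
    return eigvals, eigvals.max() / eigvals.min()   # larger ratio = more elongated

x1 = rng.normal(size=n)
X_indep = np.column_stack([x1, rng.normal(size=n)])                # uncorrelated features
X_corr  = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])    # nearly collinear features

for name, X in [("independent", X_indep), ("correlated", X_corr)]:
    eigvals, ratio = ellipse_shape(X)
    print(f"{name:>11}: eigenvalues {np.round(eigvals, 5)}, elongation ratio {ratio:.1f}")
```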
The log-likelihood is concave in $\boldsymbol{\beta}$ for any fixed $\sigma^2$ (and jointly concave after reparameterizing to the Gaussian's natural parameters). This means any stationary point in $\boldsymbol{\beta}$ is the global maximum, and gradient-based methods converge to it without getting trapped in local optima.
This tractability is a special property of the Gaussian distribution. Most other noise distributions lead to non-concave likelihoods with potential local optima.
For completeness, let's compute the derivatives of the log-likelihood. These are essential for verifying the first- and second-order optimality conditions, implementing gradient-based solvers, and quantifying parameter uncertainty through the curvature at the maximum.
Gradient with respect to $\boldsymbol{\beta}$:
Starting from: $$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Let's expand the quadratic term: $$q(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
Taking the gradient with respect to $\boldsymbol{\beta}$: $$\nabla_{\boldsymbol{\beta}} q(\boldsymbol{\beta}) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = 2\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})$$
Since $\ell = \text{const} - \frac{q}{2\sigma^2}$: $$\nabla_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{1}{2\sigma^2} \cdot 2\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) = \frac{1}{\sigma^2}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
The gradient of the log-likelihood with respect to parameters is called the score function in statistics:
$$\mathbf{s}(\boldsymbol{\beta}) = \nabla_{\boldsymbol{\beta}} \ell = \frac{1}{\sigma^2}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{r}$$
where $\mathbf{r} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$ is the residual vector.
At the MLE, the score equals zero: $\mathbf{s}(\hat{\boldsymbol{\beta}}) = \mathbf{0}$, which gives the normal equations.
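As a sanity check, the closed-form score can be compared against a central finite-difference approximation at an arbitrary $\boldsymbol{\beta}$, not just at the MLE; the data, the evaluation point, and $\sigma^2$ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.5 * rng.normal(size=n)

sigma_sq = 0.25   # assumed noise variance for this check

def loglik(beta):
    r = y - X @ beta
    return -n / 2 * np.log(2 * np.pi * sigma_sq) - r @ r / (2 * sigma_sq)

beta0 = np.array([0.3, 1.0, 0.2])   # arbitrary point, not the MLE

# Closed-form score: (1/sigma^2) X^T (y - X beta)
score = X.T @ (y - X @ beta0) / sigma_sq

# Central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([
    (loglik(beta0 + eps * e) - loglik(beta0 - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.max(np.abs(score - fd)))   # small (~1e-6 or below): the formula checks out
```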
Hessian with respect to $\boldsymbol{\beta}$:
The Hessian (matrix of second derivatives) is: $$\mathbf{H}_{\boldsymbol{\beta}} = \nabla^2_{\boldsymbol{\beta}} \ell = \nabla_{\boldsymbol{\beta}} \left(\frac{1}{\sigma^2}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right) = -\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X}$$
Key observations:
Negative definite: Since $\mathbf{X}^T\mathbf{X}$ is positive semi-definite (and positive definite when $\mathbf{X}$ has full column rank), the Hessian is negative definite. This confirms concavity.
Independent of $\boldsymbol{\beta}$: The Hessian doesn't depend on where we evaluate it. This means the curvature is constant everywhere—a property unique to quadratic functions.
Connected to variance: The inverse of the negative Hessian gives the covariance matrix of the MLE: $$\text{Cov}(\hat{\boldsymbol{\beta}}) = (-\mathbf{H})^{-1} = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
This last relationship is crucial for inference—the curvature of the likelihood at its peak tells us how certain we are about the parameter estimates.
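A small Monte Carlo sketch of that relationship: simulate many response vectors from the same fixed design, refit $\hat{\boldsymbol{\beta}}$ on each, and compare the empirical covariance of the estimates with $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ (the design, sample size, and number of replications are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 100, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # fixed design
beta_true = np.array([1.0, 2.0, -0.5])

# Refit the MLE (= OLS) on many noise realizations of the same design.
estimates = np.array([
    np.linalg.lstsq(X, X @ beta_true + sigma * rng.normal(size=n), rcond=None)[0]
    for _ in range(5000)
])

emp_cov = np.cov(estimates, rowvar=False)           # empirical covariance of beta-hat
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)      # sigma^2 (X'X)^{-1}

print(np.round(emp_cov, 4))
print(np.round(theory_cov, 4))    # the two should be close
```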
```python
import numpy as np

def gradient_log_likelihood(y, X, beta, sigma_sq):
    """
    Compute the gradient (score) of the log-likelihood w.r.t. beta.

    Returns: gradient vector of shape (p,)
    """
    residuals = y - X @ beta
    gradient = (1 / sigma_sq) * (X.T @ residuals)
    return gradient

def hessian_log_likelihood(X, sigma_sq):
    """
    Compute the Hessian of the log-likelihood w.r.t. beta.
    Note: the Hessian is constant (doesn't depend on beta).

    Returns: Hessian matrix of shape (p, p)
    """
    H = -(1 / sigma_sq) * (X.T @ X)
    return H

def verify_mle_conditions(y, X, beta_mle, sigma_sq):
    """Verify that the MLE satisfies the first- and second-order conditions."""
    grad = gradient_log_likelihood(y, X, beta_mle, sigma_sq)
    H = hessian_log_likelihood(X, sigma_sq)

    print("Verification of MLE Conditions")
    print("=" * 40)

    print("\n1. First-order condition (gradient = 0):")
    print(f"   ||gradient||₂ = {np.linalg.norm(grad):.2e}")
    print(f"   {'✓ Satisfied' if np.linalg.norm(grad) < 1e-10 else '✗ Not satisfied'}")

    print("\n2. Second-order condition (Hessian negative definite):")
    eigenvalues = np.linalg.eigvalsh(H)
    all_negative = np.all(eigenvalues < 0)
    print(f"   Eigenvalues of Hessian: {eigenvalues}")
    print(f"   {'✓ All negative (concave)' if all_negative else '✗ Not all negative'}")

    print("\n3. Parameter covariance matrix:")
    cov_beta = sigma_sq * np.linalg.inv(X.T @ X)
    print("   Cov(β̂) = σ²(X'X)⁻¹ = ")
    print(cov_beta)
    print(f"   Standard errors: {np.sqrt(np.diag(cov_beta))}")

# Example
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.5
y = X @ beta_true + sigma_true * np.random.randn(n)

# Compute MLE (= OLS)
beta_mle = np.linalg.lstsq(X, y, rcond=None)[0]

verify_mle_conditions(y, X, beta_mle, sigma_true**2)
```

We've now completed the derivation of the log-likelihood function—the mathematical heart of probabilistic linear regression. Here's what we've established: the log-likelihood separates into a constant, a variance penalty, and a scaled RSS term; maximizing it over $\boldsymbol{\beta}$ is exactly the least-squares problem; and its gradient (the score) and Hessian give us the normal equations, confirm concavity, and quantify the covariance of the estimates.
What's next:
In the next page, we'll solve the MLE problem explicitly—showing the complete derivation that produces the closed-form solution $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ we know from OLS. We'll also derive the MLE for the noise variance $\sigma^2$, completing our parameter estimation picture.
You now understand the log-likelihood function in complete detail. This is the objective function that defines maximum likelihood estimation. Every MLE result flows from maximizing this function. With this foundation, you're ready to see the optimization carried out explicitly.