With the Gaussian noise assumption established, we now embark on the mathematical derivation that connects probability theory to optimization. The log-likelihood function is the central object of study—it quantifies how probable our observed data is under our assumed model for different parameter values.
Maximum likelihood estimation (MLE) follows a simple principle: choose the parameters that make the observed data most probable. This intuitive idea translates into a precise optimization problem. For Gaussian linear regression, this optimization problem is not only well-defined but has a beautiful closed-form solution.
In this page, we'll derive every step in complete detail, leaving no mathematical gaps.
By the end of this page, you will understand what the likelihood function is and how it differs from a probability, why we work with the log-likelihood, how to derive the Gaussian log-likelihood term by term, and why maximizing it over $\boldsymbol{\beta}$ recovers the OLS solution.
Probability vs. Likelihood:
Before diving into derivations, let's clarify terminology that often confuses newcomers.
The probability density $p(\mathbf{y} | \boldsymbol{\theta})$ can be viewed in two ways:
As a probability (function of data): With parameters $\boldsymbol{\theta}$ fixed, this specifies a probability distribution over possible observations $\mathbf{y}$. We integrate/sum over $\mathbf{y}$ to get probabilities.
As a likelihood (function of parameters): With data $\mathbf{y}$ fixed (we've observed it!), this quantifies how "likely" different parameter values $\boldsymbol{\theta}$ are. We do not integrate over $\boldsymbol{\theta}$.
The likelihood function is precisely this second view:
$$\mathcal{L}(\boldsymbol{\theta}) \equiv \mathcal{L}(\boldsymbol{\theta}; \mathbf{y}) = p(\mathbf{y} | \boldsymbol{\theta})$$
The notation emphasizes that we treat $\mathcal{L}$ as a function of $\boldsymbol{\theta}$, with $\mathbf{y}$ held constant.
A critical conceptual point: the likelihood function is not a probability distribution over parameters. It does not integrate to 1 over $\boldsymbol{\theta}$. It's simply a function that tells us the relative plausibility of different parameter values given fixed data.
This distinction becomes crucial in Bayesian statistics, where we multiply likelihood by a prior distribution to obtain a posterior distribution. But in frequentist MLE, we work with likelihood directly without normalization.
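To make the two views concrete, here is a minimal sketch (the three observations and the $\mathcal{N}(\mu, 1)$ model are illustrative, not the regression model of this page): viewed as a function of the data, the density integrates to 1; viewed as a function of the parameter with the data held fixed, it generally does not.

```python
import numpy as np
from scipy.stats import norm

# Three observed data points under a N(mu, 1) model (values are illustrative).
y_obs = np.array([1.2, 0.7, 1.9])

# View 1: with mu fixed at 0, the density is a function of the data
# and integrates to 1 over y.
y_grid = np.linspace(-6.0, 6.0, 2001)
dy = y_grid[1] - y_grid[0]
print(f"Integral over y : {np.sum(norm.pdf(y_grid, loc=0.0) * dy):.4f}")   # ~1.0

# View 2: with the data fixed, the likelihood L(mu) = prod_i p(y_i | mu)
# is a function of mu. It does NOT have to integrate to 1 over mu.
mu_grid = np.linspace(-6.0, 6.0, 2001)
dmu = mu_grid[1] - mu_grid[0]
lik = np.array([np.prod(norm.pdf(y_obs, loc=mu)) for mu in mu_grid])
print(f"Integral over mu: {np.sum(lik * dmu):.4f}")          # not 1 in general
print(f"Most likely mu  : {mu_grid[np.argmax(lik)]:.2f}")    # ~ mean(y_obs)
```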
For our model:
In Gaussian linear regression, the parameters are $\boldsymbol{\theta} = (\boldsymbol{\beta}, \sigma^2)$. Given the model:
$$\mathbf{y} | \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$$
The likelihood function is:
$$\mathcal{L}(\boldsymbol{\beta}, \sigma^2) = p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}, \sigma^2)$$
The MLE principle:
$$\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$$
We seek the parameter values that maximize the probability of observing our actual data.
In practice, we always work with the log-likelihood $\ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta})$ rather than the likelihood itself. This isn't just convention—there are compelling reasons.
1. Numerical stability:
Likelihood values for $n$ observations are products of $n$ probability densities. Each density value is typically less than 1 (often much less). Multiplying many such numbers leads to catastrophic underflow—the result becomes smaller than floating-point representation allows.
For example, with $n=1000$ observations each having density $p_i = 0.1$: $$\mathcal{L} = \prod_{i=1}^{1000} 0.1 = 10^{-1000}$$
This is smaller than the smallest positive float64 number (~$10^{-308}$). The log-likelihood, however: $$\ell = \sum_{i=1}^{1000} \log(0.1) = -1000 \log(10) \approx -2303$$
This is perfectly representable.
```python
import numpy as np

# Demonstrate why log-likelihoods are necessary
def demonstrate_underflow():
    """Show how direct likelihood computation fails."""
    n = 1000

    # Simulate probability densities (typical Gaussian values)
    np.random.seed(42)
    densities = 0.3 + 0.1 * np.random.randn(n)  # Around 0.2-0.4
    densities = np.clip(densities, 0.01, 0.99)

    # Attempt to compute the likelihood directly: the product underflows to 0.0
    likelihood = np.prod(densities)
    print(f"Direct likelihood: {likelihood}")
    print("⚠️ This is 0.0 due to underflow!")

    # Use the log-likelihood instead
    log_likelihood = np.sum(np.log(densities))
    print(f"\nLog-likelihood: {log_likelihood:.4f}")
    print("✓ Log-likelihood computes without issue")

    # exp(log_likelihood) would still underflow, but for comparing models
    # the log-likelihood alone is sufficient:
    print("\nFor comparison purposes, log-likelihood is sufficient.")
    print("log L₁ > log L₂ ⟺ L₁ > L₂")

demonstrate_underflow()
```

2. Products become sums:
The logarithm transforms products into sums. For independent observations:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^n p(y_i | \mathbf{x}_i, \boldsymbol{\theta})$$ $$\ell(\boldsymbol{\theta}) = \log \mathcal{L}(\boldsymbol{\theta}) = \sum_{i=1}^n \log p(y_i | \mathbf{x}_i, \boldsymbol{\theta})$$
Sums are much easier to differentiate, manipulate, and interpret than products.
3. Exponential families simplify:
For distributions in the exponential family (including Gaussian, Poisson, Bernoulli, etc.), the log-likelihood has a particularly nice form that makes derivatives tractable.
4. Monotonicity preserves optima:
Since $\log$ is strictly monotonically increasing: $$\arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \log \mathcal{L}(\boldsymbol{\theta})$$
The maximizers are identical. We lose nothing by working with log-likelihood.
Often you'll see negative log-likelihood (NLL) used as a loss function:
$$\text{NLL}(\boldsymbol{\theta}) = -\ell(\boldsymbol{\theta}) = -\log \mathcal{L}(\boldsymbol{\theta})$$
This converts maximization to minimization, fitting the standard loss function paradigm used in machine learning. Minimizing NLL is equivalent to maximizing log-likelihood.
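A quick numerical illustration of this equivalence on a toy one-parameter model (the data and the grid of candidate means below are arbitrary): the likelihood, the log-likelihood, and the NLL all single out the same parameter value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.5, scale=1.0, size=20)   # toy data; n is small enough to avoid underflow

mu_grid = np.linspace(-2.0, 4.0, 601)
# Likelihood, log-likelihood, and NLL evaluated on the same grid of candidate means.
lik = np.array([np.prod(norm.pdf(y, loc=mu)) for mu in mu_grid])
loglik = np.array([np.sum(norm.logpdf(y, loc=mu)) for mu in mu_grid])
nll = -loglik

# All three criteria pick out the same parameter value.
print(mu_grid[np.argmax(lik)], mu_grid[np.argmax(loglik)], mu_grid[np.argmin(nll)])
```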
Let's build up the log-likelihood step by step, starting with a single observation.
The density for one observation:
For a single observation $(\mathbf{x}_i, y_i)$, the conditional density of $y_i$ given $\mathbf{x}_i$ under our Gaussian model is:
$$p(y_i | \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)$$
Taking the logarithm:
Applying $\log$ and using $\log(ab) = \log a + \log b$ and $\log(e^x) = x$:
$$\log p(y_i | \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2) = \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(\exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)\right)$$
Simplifying the first term: $$\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) = -\frac{1}{2}\log(2\pi\sigma^2) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2)$$
Simplifying the second term: $$\log\left(\exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)\right) = -\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}$$
The log-likelihood contribution from observation $i$ is:
$$\ell_i(\boldsymbol{\beta}, \sigma^2) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}$$
Notice the structure: a constant term, a term that depends only on the variance $\sigma^2$, and a term that penalizes the squared prediction error, scaled by the variance.
The residual connection:
Define the residual $r_i = y_i - \mathbf{x}_i^T \boldsymbol{\beta}$. Then:
$$\ell_i(\boldsymbol{\beta}, \sigma^2) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{r_i^2}{2\sigma^2}$$
Key observation: the log-likelihood decreases as the squared residual $r_i^2$ increases. Larger residuals mean lower likelihood. This is exactly what we'd expect—if our model predicts poorly (large residuals), the data is less probable under that model.
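As a quick sanity check on this algebra, the hand-derived expression for $\ell_i$ can be compared against `scipy.stats.norm.logpdf`; the observation, feature vector, and parameters below are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import norm

# Arbitrary single observation and parameters (illustrative values).
x_i = np.array([1.0, 0.8, -1.3])      # feature vector, includes the intercept
beta = np.array([0.5, 2.0, -0.7])
y_i, sigma_sq = 3.1, 0.6

mean_i = x_i @ beta
# Hand-derived expression for the log-density contribution l_i:
ell_i = -0.5 * np.log(2 * np.pi) - 0.5 * np.log(sigma_sq) \
        - (y_i - mean_i) ** 2 / (2 * sigma_sq)

# Reference value from scipy (note: scale is the standard deviation).
ell_ref = norm.logpdf(y_i, loc=mean_i, scale=np.sqrt(sigma_sq))
print(ell_i, ell_ref)    # the two agree to floating-point precision
```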
Summing over all observations:
With $n$ independent observations, the total log-likelihood is the sum of individual contributions:
$$\ell(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^n \ell_i(\boldsymbol{\beta}, \sigma^2)$$
Expanding: $$\ell(\boldsymbol{\beta}, \sigma^2) = \sum_{i=1}^n \left[-\frac{1}{2}\log(2\pi) - \frac{1}{2}\log(\sigma^2) - \frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right]$$
Since the first two terms don't depend on $i$, we can factor them out: $$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2$$
$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\text{RSS}(\boldsymbol{\beta})$$
where $\text{RSS}(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ is the Residual Sum of Squares.
Matrix form:
Recalling that $\text{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$:
$$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
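The elementwise, norm, and matrix expressions for RSS are the same number; here is a two-line check with arbitrary synthetic $\mathbf{y}$, $\mathbf{X}$, $\boldsymbol{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=50)

r = y - X @ beta
# Elementwise sum of squares, squared Euclidean norm, and quadratic form agree.
print(np.sum(r**2), np.linalg.norm(r)**2, r @ r)
```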
Understanding the terms:
| Term | Expression | Role |
|---|---|---|
| Normalization constant | $-\frac{n}{2}\log(2\pi)$ | Doesn't affect optimization |
| Variance penalty | $-\frac{n}{2}\log(\sigma^2)$ | Larger $\sigma^2$ → lower likelihood |
| Fit penalty | $-\frac{\text{RSS}}{2\sigma^2}$ | Larger residuals → lower likelihood |
The log-likelihood balances two forces: the variance penalty favors a small $\sigma^2$, while the fit penalty $-\text{RSS}/(2\sigma^2)$ is softened when $\sigma^2$ is large. The code below evaluates the function and shows how moving either $\boldsymbol{\beta}$ or $\sigma^2$ away from its best value lowers the likelihood.
```python
import numpy as np

def log_likelihood(y, X, beta, sigma_sq):
    """
    Compute the log-likelihood for Gaussian linear regression.

    Parameters
    ----------
    y : array-like, shape (n,)
        Response vector
    X : array-like, shape (n, p)
        Design matrix
    beta : array-like, shape (p,)
        Coefficient vector
    sigma_sq : float
        Variance of the noise (sigma^2)

    Returns
    -------
    float : Log-likelihood value
    """
    n = len(y)

    # Compute residuals
    residuals = y - X @ beta

    # Residual sum of squares
    rss = np.sum(residuals**2)

    # Log-likelihood components
    const_term = -n/2 * np.log(2 * np.pi)
    var_term = -n/2 * np.log(sigma_sq)
    fit_term = -rss / (2 * sigma_sq)

    log_lik = const_term + var_term + fit_term
    return log_lik

def negative_log_likelihood(y, X, beta, sigma_sq):
    """NLL for use as a loss function (to minimize)."""
    return -log_likelihood(y, X, beta, sigma_sq)

# Example usage
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.5
y = X @ beta_true + sigma_true * np.random.randn(n)

# Compute log-likelihood at true parameters
ll_true = log_likelihood(y, X, beta_true, sigma_true**2)
print(f"Log-likelihood at true β, σ²: {ll_true:.4f}")

# Compute at wrong parameters (should be lower)
beta_wrong = np.array([0.5, 1.5, 0.0])
ll_wrong = log_likelihood(y, X, beta_wrong, sigma_true**2)
print(f"Log-likelihood at wrong β: {ll_wrong:.4f}")

# With wrong sigma^2 (too small - penalizes RSS more)
ll_small_var = log_likelihood(y, X, beta_true, 0.1)
print(f"Log-likelihood with σ²=0.1: {ll_small_var:.4f}")

# With wrong sigma^2 (too large - variance penalty)
ll_large_var = log_likelihood(y, X, beta_true, 2.0)
print(f"Log-likelihood with σ²=2.0: {ll_large_var:.4f}")
```

We now arrive at the key result that bridges the probabilistic and geometric views of linear regression.
Observation 1: $\boldsymbol{\beta}$ appears only in RSS
Examining the log-likelihood: $$\ell(\boldsymbol{\beta}, \sigma^2) = \underbrace{-\frac{n}{2}\log(2\pi)}_{\text{constant}} - \underbrace{\frac{n}{2}\log(\sigma^2)}_{\text{involves only } \sigma^2} - \underbrace{\frac{\text{RSS}(\boldsymbol{\beta})}{2\sigma^2}}_{\text{involves both}}$$
When maximizing with respect to $\boldsymbol{\beta}$ (holding $\sigma^2$ fixed), the first term is a constant and the second involves only $\sigma^2$; only the RSS term depends on $\boldsymbol{\beta}$.
Observation 2: RSS is negated
The term involving $\boldsymbol{\beta}$ is $-\frac{\text{RSS}(\boldsymbol{\beta})}{2\sigma^2}$. This has a negative sign in front of RSS.
Observation 3: Maximizing the negative of a quantity is the same as minimizing that quantity
$$\arg\max_{\boldsymbol{\beta}} \left(-\frac{\text{RSS}(\boldsymbol{\beta})}{2\sigma^2}\right) = \arg\min_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta})$$
The positive factor $\frac{1}{2\sigma^2}$ doesn't change the argmax.
For any fixed $\sigma^2 > 0$:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}, \sigma^2) = \arg\min_{\boldsymbol{\beta}} \text{RSS}(\boldsymbol{\beta}) = \hat{\boldsymbol{\beta}}_{\text{OLS}}$$
The maximum likelihood estimator for $\boldsymbol{\beta}$ is exactly the ordinary least squares estimator!
This is profound: the OLS solution we derived geometrically is also the statistically optimal solution under Gaussian noise. We're not just fitting a line—we're finding the most probable parameters.
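To see this equivalence numerically, one can hand the negative log-likelihood to a generic optimizer and compare the result with the least-squares solution. A minimal sketch, with synthetic data and a fixed $\sigma^2$ chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.5 * rng.normal(size=n)

def nll(beta, sigma_sq=0.25):
    """Negative log-likelihood in beta, for a fixed sigma^2 (illustrative value)."""
    r = y - X @ beta
    return n / 2 * np.log(2 * np.pi * sigma_sq) + r @ r / (2 * sigma_sq)

beta_mle = minimize(nll, x0=np.zeros(p)).x          # generic optimizer on the NLL
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]     # least-squares solution

print("MLE:", np.round(beta_mle, 4))
print("OLS:", np.round(beta_ols, 4))                # the two agree
```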
Why is this important?
Justification for squared loss: We're not minimizing squared errors merely because it's computationally convenient; it is exactly what maximum likelihood prescribes under Gaussian noise.
Unification: The geometric view (projection onto column space) and probabilistic view (maximizing likelihood) lead to the same answer.
Beyond point estimates: The MLE framework gives us much more than OLS alone: standard errors and confidence regions derived from the curvature of the likelihood, and a principled basis for hypothesis tests.
Foundation for extensions: Regularization (Ridge, Lasso) can be understood as adding priors—connecting MLE to Bayesian inference.
| Approach | Starting Point | Objective | Solution |
|---|---|---|---|
| Geometric (OLS) | Find best-fit line | Minimize $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ |
| Probabilistic (MLE) | Most probable parameters | Maximize $\ell(\boldsymbol{\beta}, \sigma^2)$ | $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ |
To develop intuition, let's visualize how the log-likelihood varies with parameter values.
Simple linear regression case:
Consider $y = \beta_0 + \beta_1 x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The log-likelihood is a function of three parameters: $\beta_0$, $\beta_1$, and $\sigma^2$.
For fixed $\sigma^2$, the log-likelihood as a function of $(\beta_0, \beta_1)$ forms a concave paraboloid—a bowl shape flipped upside down. This concavity guarantees a unique global maximum with no local optima.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def plot_log_likelihood_surface():
    """
    Visualize the log-likelihood surface for simple linear regression.
    """
    # Generate synthetic data
    np.random.seed(42)
    n = 50
    x = np.linspace(0, 10, n)
    beta0_true, beta1_true = 2.0, 0.5
    sigma_true = 1.0
    y = beta0_true + beta1_true * x + sigma_true * np.random.randn(n)

    # Design matrix
    X = np.column_stack([np.ones(n), x])

    # Create parameter grid
    beta0_range = np.linspace(0, 4, 50)
    beta1_range = np.linspace(-0.5, 1.5, 50)
    B0, B1 = np.meshgrid(beta0_range, beta1_range)

    # Compute log-likelihood on the grid
    log_lik = np.zeros_like(B0)
    sigma_sq = sigma_true**2  # Fix sigma at its true value
    for i in range(B0.shape[0]):
        for j in range(B0.shape[1]):
            beta = np.array([B0[i, j], B1[i, j]])
            residuals = y - X @ beta
            rss = np.sum(residuals**2)
            log_lik[i, j] = -n/2*np.log(2*np.pi) - n/2*np.log(sigma_sq) - rss/(2*sigma_sq)

    # Find the maximum
    max_idx = np.unravel_index(np.argmax(log_lik), log_lik.shape)
    beta0_mle = B0[max_idx]
    beta1_mle = B1[max_idx]

    print("Log-Likelihood Surface Analysis")
    print("=" * 40)
    print(f"True parameters: β₀={beta0_true}, β₁={beta1_true}")
    print(f"MLE (grid search): β₀≈{beta0_mle:.3f}, β₁≈{beta1_mle:.3f}")
    print("\nThe surface is a concave paraboloid (inverted bowl)")
    print("Unique global maximum - no local optima!")

    # OLS solution for comparison
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    print(f"OLS solution: β₀={beta_ols[0]:.3f}, β₁={beta_ols[1]:.3f}")
    print("MLE and OLS agree (as theory predicts)")

plot_log_likelihood_surface()
```

Contour interpretation:
Contours of the log-likelihood surface are ellipses centered at the MLE. These ellipses have a direct connection to confidence regions—a contour at level $\ell_{\max} - c$ for appropriate $c$ corresponds to a joint confidence region for the parameters.
The orientation and eccentricity of these ellipses depend on the correlation between parameter estimates. When features are correlated in $\mathbf{X}$, the ellipses become elongated and tilted, indicating that the parameters are hard to estimate independently.
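The shape of those ellipses is governed by $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. The sketch below (with an assumed noise variance of 1 and illustrative designs) compares its eigenvalues for an uncorrelated design and a nearly collinear one; the contour semi-axes scale with the square roots of these eigenvalues, so a large ratio means elongated, tilted ellipses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

def ellipse_shape(X, sigma_sq=1.0):
    """Eigenvalues of sigma^2 (X'X)^{-1}; contour semi-axes scale with their square roots."""
    cov = sigma_sq * np.linalg.inv(X.T @ X)
    eigvals = np.linalg.eigvalsh(cov)
    return eigvals, eigvals.max() / eigvals.min()   # larger ratio = more elongated

x1 = rng.normal(size=n)
X_indep = np.column_stack([x1, rng.normal(size=n)])                # uncorrelated features
X_corr  = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])    # nearly collinear features

for name, X in [("independent", X_indep), ("correlated", X_corr)]:
    eigvals, ratio = ellipse_shape(X)
    print(f"{name:>11}: eigenvalues {np.round(eigvals, 5)}, elongation ratio {ratio:.1f}")
```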
The log-likelihood is concave in $\boldsymbol{\beta}$ for any fixed $\sigma^2$ (and jointly concave after reparameterizing to the Gaussian's natural parameters). This means any stationary point in $\boldsymbol{\beta}$ is the global maximum, and gradient-based methods converge to it without getting trapped in local optima.
This tractability is a special property of the Gaussian distribution. Most other noise distributions lead to non-concave likelihoods with potential local optima.
For completeness, let's compute the derivatives of the log-likelihood. These are essential for verifying the first- and second-order optimality conditions, implementing gradient-based solvers, and quantifying parameter uncertainty through the curvature at the maximum.
Gradient with respect to $\boldsymbol{\beta}$:
Starting from: $$\ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
Let's expand the quadratic term: $$q(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
Taking the gradient with respect to $\boldsymbol{\beta}$: $$\nabla_{\boldsymbol{\beta}} q(\boldsymbol{\beta}) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = 2\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})$$
Since $\ell = \text{const} - \frac{q}{2\sigma^2}$: $$\nabla_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{1}{2\sigma^2} \cdot 2\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) = \frac{1}{\sigma^2}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
The gradient of the log-likelihood with respect to parameters is called the score function in statistics:
$$\mathbf{s}(\boldsymbol{\beta}) = \nabla_{\boldsymbol{\beta}} \ell = \frac{1}{\sigma^2}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \frac{1}{\sigma^2}\mathbf{X}^T\mathbf{r}$$
where $\mathbf{r} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$ is the residual vector.
At the MLE, the score equals zero: $\mathbf{s}(\hat{\boldsymbol{\beta}}) = \mathbf{0}$, which gives the normal equations.
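As a sanity check, the closed-form score can be compared against a central finite-difference approximation at an arbitrary $\boldsymbol{\beta}$, not just at the MLE; the data, the evaluation point, and $\sigma^2$ below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.5 * rng.normal(size=n)

sigma_sq = 0.25   # assumed noise variance for this check

def loglik(beta):
    r = y - X @ beta
    return -n / 2 * np.log(2 * np.pi * sigma_sq) - r @ r / (2 * sigma_sq)

beta0 = np.array([0.3, 1.0, 0.2])   # arbitrary point, not the MLE

# Closed-form score: (1/sigma^2) X^T (y - X beta)
score = X.T @ (y - X @ beta0) / sigma_sq

# Central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([
    (loglik(beta0 + eps * e) - loglik(beta0 - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.max(np.abs(score - fd)))   # small (~1e-6 or below): the formula checks out
```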
Hessian with respect to $\boldsymbol{\beta}$:
The Hessian (matrix of second derivatives) is: $$\mathbf{H}_{\boldsymbol{\beta}} = \nabla^2_{\boldsymbol{\beta}} \ell = \nabla_{\boldsymbol{\beta}} \left(\frac{1}{\sigma^2}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right) = -\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X}$$
Key observations:
Negative definite: Since $\mathbf{X}^T\mathbf{X}$ is positive semi-definite (and positive definite when $\mathbf{X}$ has full column rank), the Hessian is negative definite. This confirms concavity.
Independent of $\boldsymbol{\beta}$: The Hessian doesn't depend on where we evaluate it. This means the curvature is constant everywhere—a property unique to quadratic functions.
Connected to variance: The inverse of the negative Hessian gives the covariance matrix of the MLE: $$\text{Cov}(\hat{\boldsymbol{\beta}}) = (-\mathbf{H})^{-1} = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$
This last relationship is crucial for inference—the curvature of the likelihood at its peak tells us how certain we are about the parameter estimates.
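A small Monte Carlo sketch of that relationship: simulate many response vectors from the same fixed design, refit $\hat{\boldsymbol{\beta}}$ on each, and compare the empirical covariance of the estimates with $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ (the design, sample size, and number of replications are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 100, 3, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # fixed design
beta_true = np.array([1.0, 2.0, -0.5])

# Refit the MLE (= OLS) on many noise realizations of the same design.
estimates = np.array([
    np.linalg.lstsq(X, X @ beta_true + sigma * rng.normal(size=n), rcond=None)[0]
    for _ in range(5000)
])

emp_cov = np.cov(estimates, rowvar=False)           # empirical covariance of beta-hat
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)      # sigma^2 (X'X)^{-1}

print(np.round(emp_cov, 4))
print(np.round(theory_cov, 4))    # the two should be close
```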
```python
import numpy as np

def gradient_log_likelihood(y, X, beta, sigma_sq):
    """
    Compute the gradient (score) of the log-likelihood w.r.t. beta.

    Returns: gradient vector of shape (p,)
    """
    residuals = y - X @ beta
    gradient = (1 / sigma_sq) * (X.T @ residuals)
    return gradient

def hessian_log_likelihood(X, sigma_sq):
    """
    Compute the Hessian of the log-likelihood w.r.t. beta.
    Note: the Hessian is constant (doesn't depend on beta).

    Returns: Hessian matrix of shape (p, p)
    """
    H = -(1 / sigma_sq) * (X.T @ X)
    return H

def verify_mle_conditions(y, X, beta_mle, sigma_sq):
    """Verify that the MLE satisfies the first- and second-order conditions."""
    grad = gradient_log_likelihood(y, X, beta_mle, sigma_sq)
    H = hessian_log_likelihood(X, sigma_sq)

    print("Verification of MLE Conditions")
    print("=" * 40)

    print("\n1. First-order condition (gradient = 0):")
    print(f"   ||gradient||₂ = {np.linalg.norm(grad):.2e}")
    print(f"   {'✓ Satisfied' if np.linalg.norm(grad) < 1e-10 else '✗ Not satisfied'}")

    print("\n2. Second-order condition (Hessian negative definite):")
    eigenvalues = np.linalg.eigvalsh(H)
    all_negative = np.all(eigenvalues < 0)
    print(f"   Eigenvalues of Hessian: {eigenvalues}")
    print(f"   {'✓ All negative (concave)' if all_negative else '✗ Not all negative'}")

    print("\n3. Parameter covariance matrix:")
    cov_beta = sigma_sq * np.linalg.inv(X.T @ X)
    print("   Cov(β̂) = σ²(X'X)⁻¹ = ")
    print(cov_beta)
    print(f"   Standard errors: {np.sqrt(np.diag(cov_beta))}")

# Example
np.random.seed(42)
n, p = 100, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.5
y = X @ beta_true + sigma_true * np.random.randn(n)

# Compute MLE (= OLS)
beta_mle = np.linalg.lstsq(X, y, rcond=None)[0]

verify_mle_conditions(y, X, beta_mle, sigma_true**2)
```

We've now completed the derivation of the log-likelihood function—the mathematical heart of probabilistic linear regression. Here's what we've established: the log-likelihood separates into a constant, a variance penalty, and a scaled RSS term; maximizing it over $\boldsymbol{\beta}$ is exactly the least-squares problem; and its gradient (the score) and Hessian give us the normal equations, confirm concavity, and quantify the covariance of the estimates.
What's next:
In the next page, we'll solve the MLE problem explicitly—showing the complete derivation that produces the closed-form solution $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ we know from OLS. We'll also derive the MLE for the noise variance $\sigma^2$, completing our parameter estimation picture.
You now understand the log-likelihood function in complete detail. This is the objective function that defines maximum likelihood estimation. Every MLE result flows from maximizing this function. With this foundation, you're ready to see the optimization carried out explicitly.