In the preceding modules, we developed linear regression from a purely geometric perspective—finding the hyperplane that minimizes the sum of squared residuals. While this geometric view is elegant and computationally convenient, it leaves fundamental questions unanswered: How confident should we be in the estimated coefficients? And why is minimizing squared error the right objective in the first place?
To answer these questions, we must transition from geometry to probability. The maximum likelihood perspective recasts linear regression as a problem of statistical inference—estimating the parameters of a probabilistic model that describes how the data was generated. At the heart of this probabilistic view lies a critical assumption: the noise in our observations follows a Gaussian (Normal) distribution.
By the end of this page, you will understand the Gaussian noise assumption in depth—why it's made, what it implies mathematically, when it's reasonable, and what happens when it's violated. You'll see how this single assumption connects deterministic least squares optimization to principled probabilistic inference.
Let's formalize the probabilistic perspective. Instead of viewing regression as curve fitting, we model the data-generating process—how the observed responses $y$ are produced from the inputs $\mathbf{x}$.
The generative model:
We assume that each observation $y_i$ is generated according to:
$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i$$
where $\mathbf{x}_i \in \mathbb{R}^p$ is the feature vector for observation $i$, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the unknown coefficient vector, and $\epsilon_i$ is a random noise term.
The critical assumption is about the distribution of the noise term $\epsilon_i$. In the Gaussian noise model, we assume:
$$\epsilon_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$$
The notation $\epsilon_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$ packs several crucial assumptions:
1. Zero Mean: $\mathbb{E}[\epsilon_i] = 0$ — The noise has no systematic bias. On average, errors are zero.
2. Constant Variance (Homoscedasticity): $\text{Var}(\epsilon_i) = \sigma^2$ — All observations have the same noise level.
3. Independent: $\epsilon_i \perp \epsilon_j$ for $i \neq j$ — The error for one observation tells us nothing about errors for others.
4. Identically Distributed: All $\epsilon_i$ come from the same distribution.
5. Gaussian: The specific distributional form is Normal/Gaussian.
Conditional distribution of $y$:
Since $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i$ and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, we can derive the conditional distribution of $y_i$ given $\mathbf{x}_i$:
$$y_i | \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)$$
Or equivalently:
$$p(y_i | \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)$$
This formulation tells us that the conditional mean of $y_i$ is exactly the linear predictor $\mathbf{x}_i^T \boldsymbol{\beta}$, and that all remaining variability around that mean is Gaussian with constant variance $\sigma^2$.
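To make the generative view concrete, here is a minimal simulation sketch (the sample size, coefficients, and noise level are illustrative choices, not values from the text): we draw data from $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i$ with Gaussian noise and check that the residuals around the linear predictor behave as $\mathcal{N}(0, \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) parameter choices
n, p = 10_000, 2
beta = np.array([1.5, -0.7])       # true coefficients
sigma = 0.8                        # true noise standard deviation

X = rng.normal(size=(n, p))                       # feature vectors x_i
eps = rng.normal(loc=0.0, scale=sigma, size=n)    # epsilon_i ~ N(0, sigma^2), i.i.d.
y = X @ beta + eps                                # generative model: y_i = x_i^T beta + eps_i

# Conditional on x_i, y_i should be N(x_i^T beta, sigma^2):
# the residuals y_i - x_i^T beta recover the noise distribution.
resid = y - X @ beta
print(f"mean of residuals:    {resid.mean():+.4f} (should be ≈ 0)")
print(f"std dev of residuals: {resid.std():.4f} (should be ≈ {sigma})")
```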
The Gaussian assumption might seem arbitrary at first. Why not use a uniform distribution? Or a Laplace distribution? Or something else entirely?
The strongest justification comes from the Central Limit Theorem (CLT). In most regression scenarios, the noise term $\epsilon_i$ represents the cumulative effect of many small, independent factors that we haven't modeled (or can't model) explicitly—measurement error, omitted variables, and countless other minor influences.
When the total error is the sum of many small, independent contributions, the CLT guarantees that this sum will be approximately Gaussian—regardless of the distributions of the individual components.
If $Z_1, Z_2, \ldots, Z_k$ are independent random variables with finite means and variances, then (under mild regularity conditions, such as Lindeberg's condition) as $k \to \infty$:
$$\frac{\sum_{j=1}^k Z_j - \sum_{j=1}^k \mathbb{E}[Z_j]}{\sqrt{\sum_{j=1}^k \text{Var}(Z_j)}} \xrightarrow{d} \mathcal{N}(0, 1)$$
In words: the standardized sum of many independent random variables converges in distribution to a standard Gaussian, regardless of the original distributions.
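A quick numerical sketch of this statement (the choice of uniform error sources and the sample sizes are illustrative): summing many independent, clearly non-Gaussian terms yields a total whose standardized distribution is very close to $\mathcal{N}(0, 1)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Each "error source" is uniform on [-0.5, 0.5] -- clearly non-Gaussian on its own.
k = 100       # number of independent error sources (illustrative)
n = 20_000    # number of simulated observations

Z = rng.uniform(-0.5, 0.5, size=(n, k))
total_error = Z.sum(axis=1)

# Standardize the sum as in the CLT statement; Var(U[-0.5, 0.5]) = 1/12
standardized = total_error / np.sqrt(k / 12.0)

print(f"skewness:        {stats.skew(standardized):+.4f}  (≈ 0 for Gaussian)")
print(f"excess kurtosis: {stats.kurtosis(standardized):+.4f}  (≈ 0 for Gaussian)")
print(f"P(|Z| > 2) empirical: {np.mean(np.abs(standardized) > 2):.4f}  "
      f"vs Gaussian: {2 * stats.norm.sf(2):.4f}")
```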
Practical implications:
The CLT justification is powerful because it doesn't require us to know the true distribution of individual error sources. If our noise is genuinely composed of many small effects, Gaussian is a reasonable assumption even if we don't know the exact mechanism.
However, this justification has limits:
Finite samples: The CLT is an asymptotic result. With few error sources, the distribution may not be Gaussian.
Independence assumption: If error components are correlated, the CLT may not apply.
Heavy tails: If some error sources have infinite variance (e.g., Cauchy-distributed outliers), the sum won't be Gaussian.
Dominant sources: If one error source dominates, the distribution will reflect that source, not a Gaussian sum.
Beyond the CLT, there are several other compelling reasons to adopt the Gaussian noise model:
1. Maximum Entropy Principle:
Among all probability distributions with a specified mean and variance, the Gaussian distribution has maximum entropy—it assumes the least additional structure beyond those two constraints.
If we know only that the errors have mean zero and variance $\sigma^2$, and we want to choose a distribution that doesn't assume anything else unnecessarily, the Gaussian is the unique answer. It's the "maximally uninformative" distribution subject to those two constraints.
2. Analytical Tractability:
The Gaussian distribution leads to closed-form solutions for maximum likelihood estimation. This wasn't historically accidental—Gauss developed much of least-squares theory precisely because Gaussian errors yield tractable mathematics.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Demonstrate key Gaussian properties relevant to regression

# 1. Maximum entropy property visualization
def entropy_comparison():
    """Compare entropy of distributions with same variance."""
    variance = 1.0
    x = np.linspace(-4, 4, 1000)

    # Gaussian
    gaussian_pdf = stats.norm.pdf(x, 0, np.sqrt(variance))
    gaussian_entropy = stats.norm.entropy(0, np.sqrt(variance))

    # Uniform with same variance: Var = (b-a)^2/12, so b-a = sqrt(12*variance)
    width = np.sqrt(12 * variance)
    uniform_pdf = stats.uniform.pdf(x, -width/2, width)
    uniform_entropy = stats.uniform.entropy(-width/2, width)

    # Laplace with same variance: Var = 2*b^2, so b = sqrt(variance/2)
    scale = np.sqrt(variance / 2)
    laplace_pdf = stats.laplace.pdf(x, 0, scale)
    laplace_entropy = stats.laplace.entropy(0, scale)

    print(f"Entropy comparison (same variance = {variance}):")
    print(f"  Gaussian: {gaussian_entropy:.4f} nats (MAXIMUM)")
    print(f"  Laplace:  {laplace_entropy:.4f} nats")
    print(f"  Uniform:  {uniform_entropy:.4f} nats")

    return gaussian_entropy, laplace_entropy, uniform_entropy


# 2. Demonstrate why squared loss is natural for Gaussian
def negative_log_likelihood():
    """Show connection between Gaussian NLL and squared error."""
    y_true = 5.0                        # Observed value
    y_preds = np.linspace(0, 10, 100)   # Predicted values
    sigma = 1.0

    # Gaussian negative log-likelihood
    # -log p(y|mu, sigma) = 0.5*log(2*pi*sigma^2) + (y-mu)^2 / (2*sigma^2)
    nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y_true - y_preds)**2 / (2 * sigma**2)

    # Squared error
    se = (y_true - y_preds)**2

    # These differ only by constants and scaling:
    # minimizing NLL is equivalent to minimizing SE
    print("\nKey insight: Minimizing Gaussian NLL = Minimizing squared error")
    print(f"  NLL at y_pred=5.0: {0.5 * np.log(2 * np.pi) + 0:.4f}")
    print(f"  NLL at y_pred=4.0: {0.5 * np.log(2 * np.pi) + 0.5:.4f}")
    print(f"  NLL at y_pred=3.0: {0.5 * np.log(2 * np.pi) + 2:.4f}")


entropy_comparison()
negative_log_likelihood()
```

3. Stability Under Linear Transformations:
The Gaussian distribution is closed under linear transformations. If $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and $y = \mathbf{x}^T \boldsymbol{\beta} + \epsilon$, then:
$$y | \mathbf{x} \sim \mathcal{N}(\mathbf{x}^T \boldsymbol{\beta}, \sigma^2)$$
This closure property means the entire distribution of $y$ is easily characterized—a property not shared by most other distributions.
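A small sanity-check sketch of this closure property (numbers are illustrative): adding Gaussian noise to a fixed linear predictor produces a Gaussian response whose mean shifts to $\mathbf{x}^T \boldsymbol{\beta}$ while the variance stays at $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([1.0, 2.0, -1.0])      # a fixed feature vector (illustrative)
beta = np.array([0.5, 1.0, 2.0])    # illustrative coefficients
sigma = 1.5
mu = x @ beta                       # conditional mean x^T beta

# Shifting Gaussian noise by the constant x^T beta gives y | x ~ N(x^T beta, sigma^2)
y = mu + rng.normal(0.0, sigma, size=100_000)

print(f"sample mean of y: {y.mean():.3f}  (theory: {mu:.3f})")
print(f"sample std of y:  {y.std():.3f}   (theory: {sigma:.3f})")
```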
4. Connection to Least Squares:
As we'll see in detail, the Gaussian assumption leads directly to the least squares criterion. This provides a probabilistic foundation for a technique that can otherwise seem ad hoc. We're not minimizing squared errors because it's convenient; we're doing it because it's the correct thing to do if the noise is Gaussian.
Let's consolidate our probabilistic model in matrix notation. Given a dataset with $n$ observations and $p$ features:
Data: a response vector $\mathbf{y} \in \mathbb{R}^n$ and a design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ whose $i$-th row is $\mathbf{x}_i^T$.
Model: $$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)$—a multivariate Gaussian with zero mean and covariance matrix $\sigma^2 \mathbf{I}_n$.
Equivalently: $$\mathbf{y} | \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$$
The probability density of the entire response vector is:
$$p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right)$$
The covariance matrix $\sigma^2 \mathbf{I}_n$ encodes our assumptions:
Diagonal structure: Off-diagonal elements are zero, meaning $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$ (independence)
Constant diagonal: All diagonal elements equal $\sigma^2$, meaning $\text{Var}(\epsilon_i) = \sigma^2$ for all $i$ (homoscedasticity)
When these assumptions are violated, we need generalized least squares (GLS), which replaces $\sigma^2 \mathbf{I}_n$ with a general covariance matrix $\boldsymbol{\Sigma}$.
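For reference, a minimal sketch of GLS under the (strong) assumption that $\boldsymbol{\Sigma}$ is known, using the standard estimator $\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{y}$; the heteroscedastic setup below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# Heteroscedastic noise: variance grows with |x| (an illustrative violation of sigma^2 * I)
noise_sd = 0.5 + 1.5 * np.abs(X[:, 1])
y = X @ beta_true + rng.normal(0.0, noise_sd)

# OLS ignores the covariance structure
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# GLS with known Sigma = diag(noise_sd^2): beta = (X' Sigma^-1 X)^-1 X' Sigma^-1 y
Sigma_inv = np.diag(1.0 / noise_sd**2)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

print("true:", beta_true)
print("OLS: ", np.round(beta_ols, 3))
print("GLS: ", np.round(beta_gls, 3))
```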
Parameters to estimate:
Our probabilistic model has two parameters to estimate: the coefficient vector $\boldsymbol{\beta} \in \mathbb{R}^p$ and the noise variance $\sigma^2 > 0$.
Both are unknown and must be estimated from data. In the maximum likelihood approach (next pages), we find values that maximize the probability of observing our actual data.
Key insight:
Notice that $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ is exactly the residual sum of squares (RSS):
$$\text{RSS}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2$$
This RSS appears in the exponent of the Gaussian density. This is not a coincidence—it's the deep connection between Gaussian noise and least squares that we'll fully exploit.
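A short numerical check of this connection (the data and parameter values are illustrative): for fixed $\sigma^2$, the joint Gaussian log-density equals $-\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\text{RSS}(\boldsymbol{\beta})$, so it depends on $\boldsymbol{\beta}$ only through the RSS.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

n, p = 50, 3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
sigma = 1.3
y = X @ beta + rng.normal(0.0, sigma, size=n)

# Log-density of y under y ~ N(X beta, sigma^2 I), computed two equivalent ways
rss = np.sum((y - X @ beta) ** 2)
loglik_from_rss = -0.5 * n * np.log(2 * np.pi * sigma**2) - rss / (2 * sigma**2)

loglik_direct = stats.multivariate_normal.logpdf(y, mean=X @ beta, cov=sigma**2 * np.eye(n))

print(f"log-likelihood via RSS:     {loglik_from_rss:.6f}")
print(f"log-likelihood via MVN pdf: {loglik_direct:.6f}")   # identical up to floating point
```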
The Gaussian linear regression model makes several assumptions. Understanding these precisely is crucial for knowing when to trust your results and when to seek alternative methods.
Let's examine each assumption with precision:
| Assumption | Formal Statement | Violation Symptoms | Remedies |
|---|---|---|---|
| Linearity | $\mathbb{E}[y \mid \mathbf{x}] = \mathbf{x}^T \boldsymbol{\beta}$ | Curved patterns in residual plots | Polynomial features, transforms, nonlinear models |
| Independence | $\epsilon_i \perp \epsilon_j$ for $i \neq j$ | Autocorrelation in residuals (time series), spatial clustering | GLS, time-series models, spatial models |
| Homoscedasticity | $\text{Var}(\epsilon_i) = \sigma^2$ constant | Fan-shaped residual plots | WLS, robust SEs, variance-stabilizing transforms |
| Normality | $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ | Heavy tails, skewness in residuals | Robust regression, GLMs, non-parametric methods |
| No perfect multicollinearity | $\text{rank}(\mathbf{X}) = p$ | Singular $\mathbf{X}^T\mathbf{X}$, unstable coefficients | Remove/combine correlated features, regularization |
Hierarchy of importance:
Not all assumptions are equally critical. In practice, linearity and independence matter most for obtaining meaningful estimates, homoscedasticity mainly affects the reliability of standard errors, and exact normality of the errors is the least critical.
For large samples, the CLT often makes inference approximately valid even without perfect normality. But linearity and independence remain essential.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')


def comprehensive_regression_diagnostics(X, y, beta_hat, residuals):
    """
    Comprehensive diagnostic suite for linear regression assumptions.

    Parameters
    ----------
    X : array-like, shape (n, p)
        Design matrix
    y : array-like, shape (n,)
        Response vector
    beta_hat : array-like, shape (p,)
        Estimated coefficients
    residuals : array-like, shape (n,)
        Fitted residuals (y - X @ beta_hat)
    """
    n = len(y)
    y_hat = X @ beta_hat

    # Standardized residuals
    sigma_hat = np.sqrt(np.sum(residuals**2) / (n - len(beta_hat)))
    std_residuals = residuals / sigma_hat

    print("=" * 60)
    print("REGRESSION DIAGNOSTICS: Testing Gaussian Noise Assumptions")
    print("=" * 60)

    # 1. Normality test (Shapiro-Wilk)
    print("\n1. NORMALITY OF RESIDUALS")
    if n <= 5000:  # Shapiro-Wilk limited to n=5000
        stat, p_value = stats.shapiro(residuals)
        print(f"   Shapiro-Wilk test: W={stat:.4f}, p-value={p_value:.4f}")
    else:
        stat, p_value = stats.normaltest(residuals)
        print(f"   D'Agostino-Pearson test: stat={stat:.4f}, p-value={p_value:.4f}")
    if p_value < 0.05:
        print("   ⚠️ Evidence AGAINST normality (p < 0.05)")
    else:
        print("   ✓ No significant evidence against normality")

    # Skewness and kurtosis
    skew = stats.skew(residuals)
    kurt = stats.kurtosis(residuals)  # Excess kurtosis (0 for Gaussian)
    print(f"   Skewness: {skew:.4f} (should be ≈ 0)")
    print(f"   Excess Kurtosis: {kurt:.4f} (should be ≈ 0)")

    # 2. Homoscedasticity test (Breusch-Pagan simplified)
    print("\n2. HOMOSCEDASTICITY (Constant Variance)")
    # Regression of squared residuals on fitted values
    residuals_sq = residuals**2
    correlation = np.corrcoef(y_hat, residuals_sq)[0, 1]
    print(f"   Correlation(fitted, residuals²): {correlation:.4f}")
    if abs(correlation) > 0.1:
        print("   ⚠️ Possible heteroscedasticity detected")
    else:
        print("   ✓ No strong evidence of heteroscedasticity")

    # 3. Independence (Durbin-Watson for autocorrelation)
    print("\n3. INDEPENDENCE OF RESIDUALS")
    # Durbin-Watson statistic
    dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)
    print(f"   Durbin-Watson statistic: {dw:.4f}")
    print("   (≈ 2 implies no autocorrelation; < 1.5 or > 2.5 suggests issue)")
    if dw < 1.5:
        print("   ⚠️ Possible positive autocorrelation")
    elif dw > 2.5:
        print("   ⚠️ Possible negative autocorrelation")
    else:
        print("   ✓ No strong evidence of autocorrelation")

    # 4. Outlier detection
    print("\n4. OUTLIER DETECTION")
    outliers = np.abs(std_residuals) > 2
    n_outliers = np.sum(outliers)
    expected = n * 0.05  # ~5% expected beyond 2 std devs
    print(f"   Observations with |standardized residual| > 2: {n_outliers}")
    print(f"   Expected under normality: ~{expected:.1f}")
    extreme_outliers = np.abs(std_residuals) > 3
    if np.any(extreme_outliers):
        print(f"   ⚠️ {np.sum(extreme_outliers)} extreme outliers (> 3 std devs)")

    print("\n" + "=" * 60)
    print("SUMMARY: Review any ⚠️ warnings before trusting inference")
    print("=" * 60)

    return {
        'normality_pvalue': p_value,
        'skewness': skew,
        'kurtosis': kurt,
        'heteroscedasticity_corr': correlation,
        'durbin_watson': dw,
        'n_outliers': n_outliers,
    }


# Example usage with synthetic data
np.random.seed(42)
n, p = 200, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
beta_true = np.array([2.0, 0.5, -0.3])
y = X @ beta_true + np.random.randn(n) * 0.5  # Gaussian noise

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat

diagnostics = comprehensive_regression_diagnostics(X, y, beta_hat, residuals)
```

What happens when the Gaussian noise assumption is violated?
The consequences depend on which aspect fails:
1. Non-normality with correct variance:
If the noise is non-Gaussian but still has zero mean and constant variance, the OLS estimator remains unbiased and consistent, and by the Gauss–Markov theorem it is still the best linear unbiased estimator (BLUE).
The impact is primarily on inference: exact t-tests, F-tests, and confidence intervals are derived under normality, so in small samples their stated significance levels and coverage may be inaccurate.
For large samples, the CLT rescues us—test statistics become approximately normal regardless of error distribution.
2. Heavy tails and outliers:
Gaussian distributions have thin tails—extreme values are exponentially rare. Real data often contains outliers that violate this property.
Quantitatively, for a standard Gaussian, about 4.6% of observations fall more than 2 standard deviations from the mean, about 0.27% fall beyond 3, and fewer than 1 in 15,000 fall beyond 4.
If your data shows more extreme values than these frequencies suggest, the Gaussian assumption is questionable.
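These reference frequencies can be computed directly from the standard normal survival function; a quick sketch:

```python
from scipy import stats

# Two-sided tail probability P(|Z| > k) for a standard Gaussian
for k in (1, 2, 3, 4):
    tail = 2 * stats.norm.sf(k)      # sf(k) = 1 - CDF(k)
    print(f"P(|Z| > {k}) = {tail:.6f}  (about 1 in {1 / tail:,.0f})")
```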
Squared loss (from Gaussian MLE) is especially sensitive to outliers because the loss grows quadratically with the residual. A single observation 10 standard deviations from the mean contributes 100 times as much to the objective as an observation 1 standard deviation away. This can catastrophically distort the fit.
Different loss functions correspond to different noise distributions: squared loss corresponds to Gaussian noise, absolute loss to Laplace noise, and the Huber loss to a model that is Gaussian near the center but heavier-tailed further out.
Choosing a loss function is implicitly choosing a noise model. Select based on your knowledge of the error-generating process.
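As a concrete sketch of this correspondence (the setup is purely illustrative): when estimating a single location parameter, the Gaussian MLE is the sample mean (squared loss) and the Laplace MLE is the sample median (absolute loss), and a single gross outlier drags the mean far more than the median.

```python
import numpy as np

rng = np.random.default_rng(5)

# Clean Gaussian sample around a true location of 0, plus one gross outlier
clean = rng.normal(loc=0.0, scale=1.0, size=99)
data = np.append(clean, 50.0)        # a single extreme outlier (illustrative)

# Squared loss (Gaussian noise model) -> MLE of the location is the sample mean
# Absolute loss (Laplace noise model) -> MLE of the location is the sample median
print(f"mean   (squared-loss estimate):  {data.mean():+.3f}")
print(f"median (absolute-loss estimate): {np.median(data):+.3f}")
# The single outlier shifts the mean by roughly 0.5 but barely moves the median.
```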
We now arrive at the crucial insight that bridges this entire module. The Gaussian noise assumption doesn't just provide a probabilistic interpretation—it explains why least squares works.
The key result (preview):
Under the model $y_i | \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)$, the maximum likelihood estimator for $\boldsymbol{\beta}$ is exactly the OLS estimator:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = \hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Why does this happen?
Recall the Gaussian density: $$p(\mathbf{y} | \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2}{2\sigma^2}\right)$$
Maximizing this likelihood is equivalent to maximizing the exponent (after taking the log), which means minimizing $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$—the residual sum of squares.
This is profound: Least squares isn't arbitrary; it's the principled choice when noise is Gaussian.
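A brief numerical sketch of this claim (the synthetic data are illustrative): minimizing the Gaussian negative log-likelihood over $\boldsymbol{\beta}$ with a generic optimizer recovers the same coefficients as the closed-form OLS solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)

n, p = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.7
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

def neg_log_likelihood(beta, X, y, sigma):
    """Gaussian NLL as a function of beta (sigma held fixed); constants kept for completeness."""
    resid = y - X @ beta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p), args=(X, y, sigma)).x
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("MLE:", np.round(beta_mle, 4))
print("OLS:", np.round(beta_ols, 4))   # the two coincide up to optimizer tolerance
```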
The method of least squares was independently developed by Gauss and Legendre in the early 1800s. Gauss justified it probabilistically using what we now call the Gaussian distribution (named after him). Legendre presented it as a convenient computational technique. Both perspectives—computational convenience and probabilistic optimality—remain valid today.
What MLE adds beyond OLS:
The maximum likelihood framework provides additional benefits: an explicit estimate of the noise variance $\sigma^2$, standard errors and confidence intervals for the coefficients, likelihood-based hypothesis tests, and model-selection criteria such as AIC and BIC.
These benefits make MLE the foundation for modern statistical inference in regression.
We've established the probabilistic foundation for linear regression. The key insights to carry forward: the data are modeled as generated by $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with i.i.d. Gaussian noise; this implies $\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$; the residual sum of squares appears directly in the exponent of the Gaussian density; and each part of the assumption (independence, constant variance, normality) can and should be checked with diagnostics.
What's next:
In the next page, we'll derive the log-likelihood function explicitly and show step-by-step how maximizing likelihood leads to the OLS solution. We'll see the beautiful mathematics that connects probability theory to linear algebra, and understand why the Gaussian assumption makes the math tractable while other distributions would not.
You now understand the Gaussian noise assumption—the cornerstone of probabilistic linear regression. This isn't just mathematical elegance; it's the foundation for everything from simple t-tests on coefficients to sophisticated model selection criteria. With this understanding, you're ready to see how maximum likelihood estimation completes the picture.