In the preceding modules, we developed linear regression from a purely geometric perspective—finding the hyperplane that minimizes the sum of squared residuals. While this geometric view is elegant and computationally convenient, it leaves fundamental questions unanswered: How confident should we be in the estimated coefficients? And why is minimizing squared error the right objective in the first place?
To answer these questions, we must transition from geometry to probability. The maximum likelihood perspective recasts linear regression as a problem of statistical inference—estimating the parameters of a probabilistic model that describes how the data was generated. At the heart of this probabilistic view lies a critical assumption: the noise in our observations follows a Gaussian (Normal) distribution.
By the end of this page, you will understand the Gaussian noise assumption in depth—why it's made, what it implies mathematically, when it's reasonable, and what happens when it's violated. You'll see how this single assumption connects deterministic least squares optimization to principled probabilistic inference.
Let's formalize the probabilistic perspective. Instead of viewing regression as curve fitting, we model the data-generating process—how the observed responses $y$ are produced from the inputs $\mathbf{x}$.
The generative model:
We assume that each observation $y_i$ is generated according to:
$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i$$
where $\mathbf{x}_i \in \mathbb{R}^p$ is the feature vector for observation $i$, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the unknown coefficient vector, and $\epsilon_i$ is a random noise term.
The critical assumption is about the distribution of the noise term $\epsilon_i$. In the Gaussian noise model, we assume:
$$\epsilon_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$$
The notation $\epsilon_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$ packs several crucial assumptions:
1. Zero Mean: $\mathbb{E}[\epsilon_i] = 0$ — The noise has no systematic bias. On average, errors are zero.
2. Constant Variance (Homoscedasticity): $\text{Var}(\epsilon_i) = \sigma^2$ — All observations have the same noise level.
3. Independent: $\epsilon_i \perp \epsilon_j$ for $i \neq j$ — The error for one observation tells us nothing about errors for others.
4. Identically Distributed: All $\epsilon_i$ come from the same distribution.
5. Gaussian: The specific distributional form is Normal/Gaussian.
Conditional distribution of $y$:
Since $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i$ and $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, we can derive the conditional distribution of $y_i$ given $\mathbf{x}_i$:
$$y_i | \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)$$
Or equivalently:
$$p(y_i | \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right)$$
This formulation tells us that the conditional mean of $y_i$ is exactly the linear predictor $\mathbf{x}_i^T \boldsymbol{\beta}$, and that all remaining variability around that mean is Gaussian with constant variance $\sigma^2$.
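To make the generative view concrete, here is a minimal simulation sketch (the sample size, coefficients, and noise level are illustrative choices, not values from the text): we draw data from $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i$ with Gaussian noise and check that the residuals around the linear predictor behave as $\mathcal{N}(0, \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) parameter choices
n, p = 10_000, 2
beta = np.array([1.5, -0.7])       # true coefficients
sigma = 0.8                        # true noise standard deviation

X = rng.normal(size=(n, p))                       # feature vectors x_i
eps = rng.normal(loc=0.0, scale=sigma, size=n)    # epsilon_i ~ N(0, sigma^2), i.i.d.
y = X @ beta + eps                                # generative model: y_i = x_i^T beta + eps_i

# Conditional on x_i, y_i should be N(x_i^T beta, sigma^2):
# the residuals y_i - x_i^T beta recover the noise distribution.
resid = y - X @ beta
print(f"mean of residuals:    {resid.mean():+.4f} (should be ≈ 0)")
print(f"std dev of residuals: {resid.std():.4f} (should be ≈ {sigma})")
```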
The Gaussian assumption might seem arbitrary at first. Why not use a uniform distribution? Or a Laplace distribution? Or something else entirely?
The strongest justification comes from the Central Limit Theorem (CLT). In most regression scenarios, the noise term $\epsilon_i$ represents the cumulative effect of many small, independent factors that we haven't modeled (or can't model) explicitly—measurement error, omitted variables, and countless other minor influences.
When the total error is the sum of many small, independent contributions, the CLT guarantees that this sum will be approximately Gaussian—regardless of the distributions of the individual components.
If $Z_1, Z_2, \ldots, Z_k$ are independent random variables with finite means and variances, then (under mild regularity conditions, such as Lindeberg's condition) as $k \to \infty$:
$$\frac{\sum_{j=1}^k Z_j - \sum_{j=1}^k \mathbb{E}[Z_j]}{\sqrt{\sum_{j=1}^k \text{Var}(Z_j)}} \xrightarrow{d} \mathcal{N}(0, 1)$$
In words: the standardized sum of many independent random variables converges in distribution to a standard Gaussian, regardless of the original distributions.
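A quick numerical sketch of this statement (the choice of uniform error sources and the sample sizes are illustrative): summing many independent, clearly non-Gaussian terms yields a total whose standardized distribution is very close to $\mathcal{N}(0, 1)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Each "error source" is uniform on [-0.5, 0.5] -- clearly non-Gaussian on its own.
k = 100       # number of independent error sources (illustrative)
n = 20_000    # number of simulated observations

Z = rng.uniform(-0.5, 0.5, size=(n, k))
total_error = Z.sum(axis=1)

# Standardize the sum as in the CLT statement; Var(U[-0.5, 0.5]) = 1/12
standardized = total_error / np.sqrt(k / 12.0)

print(f"skewness:        {stats.skew(standardized):+.4f}  (≈ 0 for Gaussian)")
print(f"excess kurtosis: {stats.kurtosis(standardized):+.4f}  (≈ 0 for Gaussian)")
print(f"P(|Z| > 2) empirical: {np.mean(np.abs(standardized) > 2):.4f}  "
      f"vs Gaussian: {2 * stats.norm.sf(2):.4f}")
```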
Practical implications:
The CLT justification is powerful because it doesn't require us to know the true distribution of individual error sources. If our noise is genuinely composed of many small effects, Gaussian is a reasonable assumption even if we don't know the exact mechanism.
However, this justification has limits:
Finite samples: The CLT is an asymptotic result. With few error sources, the distribution may not be Gaussian.
Independence assumption: If error components are correlated, the CLT may not apply.
Heavy tails: If some error sources have infinite variance (e.g., Cauchy-distributed outliers), the sum won't be Gaussian.
Dominant sources: If one error source dominates, the distribution will reflect that source, not a Gaussian sum.
Beyond the CLT, there are several other compelling reasons to adopt the Gaussian noise model:
1. Maximum Entropy Principle:
Among all probability distributions with a specified mean and variance, the Gaussian distribution has maximum entropy—it assumes the least additional structure beyond those two constraints.
If we know only that the errors have mean zero and variance $\sigma^2$, and we want to choose a distribution that doesn't assume anything else unnecessarily, the Gaussian is the unique answer. It's the "maximally uninformative" distribution subject to those two constraints.
2. Analytical Tractability:
The Gaussian distribution leads to closed-form solutions for maximum likelihood estimation. This wasn't historically accidental—Gauss developed much of least-squares theory precisely because Gaussian errors yield tractable mathematics.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Demonstrate key Gaussian properties relevant to regression

# 1. Maximum entropy property visualization
def entropy_comparison():
    """Compare entropy of distributions with same variance."""
    variance = 1.0
    x = np.linspace(-4, 4, 1000)

    # Gaussian
    gaussian_pdf = stats.norm.pdf(x, 0, np.sqrt(variance))
    gaussian_entropy = stats.norm.entropy(0, np.sqrt(variance))

    # Uniform with same variance: Var = (b-a)^2/12, so b-a = sqrt(12*variance)
    width = np.sqrt(12 * variance)
    uniform_pdf = stats.uniform.pdf(x, -width/2, width)
    uniform_entropy = stats.uniform.entropy(-width/2, width)

    # Laplace with same variance: Var = 2*b^2, so b = sqrt(variance/2)
    scale = np.sqrt(variance / 2)
    laplace_pdf = stats.laplace.pdf(x, 0, scale)
    laplace_entropy = stats.laplace.entropy(0, scale)

    print(f"Entropy comparison (same variance = {variance}):")
    print(f"  Gaussian: {gaussian_entropy:.4f} nats (MAXIMUM)")
    print(f"  Laplace:  {laplace_entropy:.4f} nats")
    print(f"  Uniform:  {uniform_entropy:.4f} nats")

    return gaussian_entropy, laplace_entropy, uniform_entropy


# 2. Demonstrate why squared loss is natural for Gaussian
def negative_log_likelihood():
    """Show connection between Gaussian NLL and squared error."""
    y_true = 5.0                        # Observed value
    y_preds = np.linspace(0, 10, 100)   # Predicted values
    sigma = 1.0

    # Gaussian negative log-likelihood
    # -log p(y|mu, sigma) = 0.5*log(2*pi*sigma^2) + (y-mu)^2 / (2*sigma^2)
    nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y_true - y_preds)**2 / (2 * sigma**2)

    # Squared error
    se = (y_true - y_preds)**2

    # These differ only by constants and scaling:
    # minimizing NLL is equivalent to minimizing SE
    print("\nKey insight: Minimizing Gaussian NLL = Minimizing squared error")
    print(f"  NLL at y_pred=5.0: {0.5 * np.log(2 * np.pi) + 0:.4f}")
    print(f"  NLL at y_pred=4.0: {0.5 * np.log(2 * np.pi) + 0.5:.4f}")
    print(f"  NLL at y_pred=3.0: {0.5 * np.log(2 * np.pi) + 2:.4f}")


entropy_comparison()
negative_log_likelihood()
```

3. Stability Under Linear Transformations:
The Gaussian distribution is closed under linear transformations. If $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and $y = \mathbf{x}^T \boldsymbol{\beta} + \epsilon$, then:
$$y | \mathbf{x} \sim \mathcal{N}(\mathbf{x}^T \boldsymbol{\beta}, \sigma^2)$$
This closure property means the entire distribution of $y$ is easily characterized—a property not shared by most other distributions.
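A small sanity-check sketch of this closure property (numbers are illustrative): adding Gaussian noise to a fixed linear predictor produces a Gaussian response whose mean shifts to $\mathbf{x}^T \boldsymbol{\beta}$ while the variance stays at $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([1.0, 2.0, -1.0])      # a fixed feature vector (illustrative)
beta = np.array([0.5, 1.0, 2.0])    # illustrative coefficients
sigma = 1.5
mu = x @ beta                       # conditional mean x^T beta

# Shifting Gaussian noise by the constant x^T beta gives y | x ~ N(x^T beta, sigma^2)
y = mu + rng.normal(0.0, sigma, size=100_000)

print(f"sample mean of y: {y.mean():.3f}  (theory: {mu:.3f})")
print(f"sample std of y:  {y.std():.3f}   (theory: {sigma:.3f})")
```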
4. Connection to Least Squares:
As we'll see in detail, the Gaussian assumption leads directly to the least squares criterion. This provides a probabilistic foundation for a technique that can otherwise seem ad hoc. We're not minimizing squared errors because it's convenient; we're doing it because it's the correct thing to do if the noise is Gaussian.
Let's consolidate our probabilistic model in matrix notation. Given a dataset with $n$ observations and $p$ features:
Data: a response vector $\mathbf{y} \in \mathbb{R}^n$ and a design matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ whose $i$-th row is $\mathbf{x}_i^T$.
Model: $$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n)$—a multivariate Gaussian with zero mean and covariance matrix $\sigma^2 \mathbf{I}_n$.
Equivalently: $$\mathbf{y} | \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$$
The probability density of the entire response vector is:
$$p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right)$$
The covariance matrix $\sigma^2 \mathbf{I}_n$ encodes our assumptions:
Diagonal structure: Off-diagonal elements are zero, meaning $\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$ (independence)
Constant diagonal: All diagonal elements equal $\sigma^2$, meaning $\text{Var}(\epsilon_i) = \sigma^2$ for all $i$ (homoscedasticity)
When these assumptions are violated, we need generalized least squares (GLS), which replaces $\sigma^2 \mathbf{I}_n$ with a general covariance matrix $\boldsymbol{\Sigma}$.
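For reference, a minimal sketch of GLS under the (strong) assumption that $\boldsymbol{\Sigma}$ is known, using the standard estimator $\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Sigma}^{-1}\mathbf{y}$; the heteroscedastic setup below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

# Heteroscedastic noise: variance grows with |x| (an illustrative violation of sigma^2 * I)
noise_sd = 0.5 + 1.5 * np.abs(X[:, 1])
y = X @ beta_true + rng.normal(0.0, noise_sd)

# OLS ignores the covariance structure
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# GLS with known Sigma = diag(noise_sd^2): beta = (X' Sigma^-1 X)^-1 X' Sigma^-1 y
Sigma_inv = np.diag(1.0 / noise_sd**2)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

print("true:", beta_true)
print("OLS: ", np.round(beta_ols, 3))
print("GLS: ", np.round(beta_gls, 3))
```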
Parameters to estimate:
Our probabilistic model has two parameters to estimate: the coefficient vector $\boldsymbol{\beta} \in \mathbb{R}^p$ and the noise variance $\sigma^2 > 0$.
Both are unknown and must be estimated from data. In the maximum likelihood approach (next pages), we find values that maximize the probability of observing our actual data.
Key insight:
Notice that $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ is exactly the residual sum of squares (RSS):
$$\text{RSS}(\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2$$
This RSS appears in the exponent of the Gaussian density. This is not a coincidence—it's the deep connection between Gaussian noise and least squares that we'll fully exploit.
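A short numerical check of this connection (the data and parameter values are illustrative): for fixed $\sigma^2$, the joint Gaussian log-density equals $-\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\text{RSS}(\boldsymbol{\beta})$, so it depends on $\boldsymbol{\beta}$ only through the RSS.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

n, p = 50, 3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
sigma = 1.3
y = X @ beta + rng.normal(0.0, sigma, size=n)

# Log-density of y under y ~ N(X beta, sigma^2 I), computed two equivalent ways
rss = np.sum((y - X @ beta) ** 2)
loglik_from_rss = -0.5 * n * np.log(2 * np.pi * sigma**2) - rss / (2 * sigma**2)

loglik_direct = stats.multivariate_normal.logpdf(y, mean=X @ beta, cov=sigma**2 * np.eye(n))

print(f"log-likelihood via RSS:     {loglik_from_rss:.6f}")
print(f"log-likelihood via MVN pdf: {loglik_direct:.6f}")   # identical up to floating point
```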
The Gaussian linear regression model makes several assumptions. Understanding these precisely is crucial for knowing when to trust your results and when to seek alternative methods.
Let's examine each assumption with precision:
| Assumption | Formal Statement | Violation Symptoms | Remedies |
|---|---|---|---|
| Linearity | $\mathbb{E}[y \mid \mathbf{x}] = \mathbf{x}^T \boldsymbol{\beta}$ | Curved patterns in residual plots | Polynomial features, transforms, nonlinear models |
| Independence | $\epsilon_i \perp \epsilon_j$ for $i \neq j$ | Autocorrelation in residuals (time series), spatial clustering | GLS, time-series models, spatial models |
| Homoscedasticity | $\text{Var}(\epsilon_i) = \sigma^2$ constant | Fan-shaped residual plots | WLS, robust SEs, variance-stabilizing transforms |
| Normality | $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ | Heavy tails, skewness in residuals | Robust regression, GLMs, non-parametric methods |
| No perfect multicollinearity | $\text{rank}(\mathbf{X}) = p$ | Singular $\mathbf{X}^T\mathbf{X}$, unstable coefficients | Remove/combine correlated features, regularization |
Hierarchy of importance:
Not all assumptions are equally critical. In practice, linearity and independence matter most for obtaining meaningful estimates, homoscedasticity mainly affects the reliability of standard errors, and exact normality of the errors is the least critical.
For large samples, the CLT often makes inference approximately valid even without perfect normality. But linearity and independence remain essential.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')


def comprehensive_regression_diagnostics(X, y, beta_hat, residuals):
    """
    Comprehensive diagnostic suite for linear regression assumptions.

    Parameters
    ----------
    X : array-like, shape (n, p)
        Design matrix
    y : array-like, shape (n,)
        Response vector
    beta_hat : array-like, shape (p,)
        Estimated coefficients
    residuals : array-like, shape (n,)
        Fitted residuals (y - X @ beta_hat)
    """
    n = len(y)
    y_hat = X @ beta_hat

    # Standardized residuals
    sigma_hat = np.sqrt(np.sum(residuals**2) / (n - len(beta_hat)))
    std_residuals = residuals / sigma_hat

    print("=" * 60)
    print("REGRESSION DIAGNOSTICS: Testing Gaussian Noise Assumptions")
    print("=" * 60)

    # 1. Normality test (Shapiro-Wilk)
    print("\n1. NORMALITY OF RESIDUALS")
    if n <= 5000:  # Shapiro-Wilk limited to n=5000
        stat, p_value = stats.shapiro(residuals)
        print(f"   Shapiro-Wilk test: W={stat:.4f}, p-value={p_value:.4f}")
    else:
        stat, p_value = stats.normaltest(residuals)
        print(f"   D'Agostino-Pearson test: stat={stat:.4f}, p-value={p_value:.4f}")
    if p_value < 0.05:
        print("   ⚠️ Evidence AGAINST normality (p < 0.05)")
    else:
        print("   ✓ No significant evidence against normality")

    # Skewness and kurtosis
    skew = stats.skew(residuals)
    kurt = stats.kurtosis(residuals)  # Excess kurtosis (0 for Gaussian)
    print(f"   Skewness: {skew:.4f} (should be ≈ 0)")
    print(f"   Excess Kurtosis: {kurt:.4f} (should be ≈ 0)")

    # 2. Homoscedasticity test (Breusch-Pagan simplified)
    print("\n2. HOMOSCEDASTICITY (Constant Variance)")
    # Regression of squared residuals on fitted values
    residuals_sq = residuals**2
    correlation = np.corrcoef(y_hat, residuals_sq)[0, 1]
    print(f"   Correlation(fitted, residuals²): {correlation:.4f}")
    if abs(correlation) > 0.1:
        print("   ⚠️ Possible heteroscedasticity detected")
    else:
        print("   ✓ No strong evidence of heteroscedasticity")

    # 3. Independence (Durbin-Watson for autocorrelation)
    print("\n3. INDEPENDENCE OF RESIDUALS")
    # Durbin-Watson statistic
    dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)
    print(f"   Durbin-Watson statistic: {dw:.4f}")
    print("   (≈ 2 implies no autocorrelation; < 1.5 or > 2.5 suggests issue)")
    if dw < 1.5:
        print("   ⚠️ Possible positive autocorrelation")
    elif dw > 2.5:
        print("   ⚠️ Possible negative autocorrelation")
    else:
        print("   ✓ No strong evidence of autocorrelation")

    # 4. Outlier detection
    print("\n4. OUTLIER DETECTION")
    outliers = np.abs(std_residuals) > 2
    n_outliers = np.sum(outliers)
    expected = n * 0.05  # ~5% expected beyond 2 std devs
    print(f"   Observations with |standardized residual| > 2: {n_outliers}")
    print(f"   Expected under normality: ~{expected:.1f}")
    extreme_outliers = np.abs(std_residuals) > 3
    if np.any(extreme_outliers):
        print(f"   ⚠️ {np.sum(extreme_outliers)} extreme outliers (> 3 std devs)")

    print("\n" + "=" * 60)
    print("SUMMARY: Review any ⚠️ warnings before trusting inference")
    print("=" * 60)

    return {
        'normality_pvalue': p_value,
        'skewness': skew,
        'kurtosis': kurt,
        'heteroscedasticity_corr': correlation,
        'durbin_watson': dw,
        'n_outliers': n_outliers,
    }


# Example usage with synthetic data
np.random.seed(42)
n, p = 200, 3
X = np.column_stack([np.ones(n), np.random.randn(n, p - 1)])
beta_true = np.array([2.0, 0.5, -0.3])
y = X @ beta_true + np.random.randn(n) * 0.5  # Gaussian noise

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta_hat

diagnostics = comprehensive_regression_diagnostics(X, y, beta_hat, residuals)
```

What happens when the Gaussian noise assumption is violated?
The consequences depend on which aspect fails:
1. Non-normality with correct variance:
If the noise is non-Gaussian but still has zero mean and constant variance, the OLS estimator remains unbiased and consistent, and by the Gauss–Markov theorem it is still the best linear unbiased estimator (BLUE).
The impact is primarily on inference: exact t-tests, F-tests, and confidence intervals are derived under normality, so in small samples their stated significance levels and coverage may be inaccurate.
For large samples, the CLT rescues us—test statistics become approximately normal regardless of error distribution.
2. Heavy tails and outliers:
Gaussian distributions have thin tails—extreme values are exponentially rare. Real data often contains outliers that violate this property.
Quantitatively, for a standard Gaussian, about 4.6% of observations fall more than 2 standard deviations from the mean, about 0.27% fall beyond 3, and fewer than 1 in 15,000 fall beyond 4.
If your data shows more extreme values than these frequencies suggest, the Gaussian assumption is questionable.
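These reference frequencies can be computed directly from the standard normal survival function; a quick sketch:

```python
from scipy import stats

# Two-sided tail probability P(|Z| > k) for a standard Gaussian
for k in (1, 2, 3, 4):
    tail = 2 * stats.norm.sf(k)      # sf(k) = 1 - CDF(k)
    print(f"P(|Z| > {k}) = {tail:.6f}  (about 1 in {1 / tail:,.0f})")
```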
Squared loss (from Gaussian MLE) is especially sensitive to outliers because the loss grows quadratically with the residual. A single observation 10 standard deviations from the mean contributes 100 times as much to the objective as an observation 1 standard deviation away. This can catastrophically distort the fit.
Different loss functions correspond to different noise distributions: squared loss corresponds to Gaussian noise, absolute loss to Laplace noise, and the Huber loss to a model that is Gaussian near the center but heavier-tailed further out.
Choosing a loss function is implicitly choosing a noise model. Select based on your knowledge of the error-generating process.
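As a concrete sketch of this correspondence (the setup is purely illustrative): when estimating a single location parameter, the Gaussian MLE is the sample mean (squared loss) and the Laplace MLE is the sample median (absolute loss), and a single gross outlier drags the mean far more than the median.

```python
import numpy as np

rng = np.random.default_rng(5)

# Clean Gaussian sample around a true location of 0, plus one gross outlier
clean = rng.normal(loc=0.0, scale=1.0, size=99)
data = np.append(clean, 50.0)        # a single extreme outlier (illustrative)

# Squared loss (Gaussian noise model) -> MLE of the location is the sample mean
# Absolute loss (Laplace noise model) -> MLE of the location is the sample median
print(f"mean   (squared-loss estimate):  {data.mean():+.3f}")
print(f"median (absolute-loss estimate): {np.median(data):+.3f}")
# The single outlier shifts the mean by roughly 0.5 but barely moves the median.
```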
We now arrive at the crucial insight that bridges this entire module. The Gaussian noise assumption doesn't just provide a probabilistic interpretation—it explains why least squares works.
The key result (preview):
Under the model $y_i | \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)$, the maximum likelihood estimator for $\boldsymbol{\beta}$ is exactly the OLS estimator:
$$\hat{\boldsymbol{\beta}}_{\text{MLE}} = \hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Why does this happen?
Recall the Gaussian density: $$p(\mathbf{y} | \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2}{2\sigma^2}\right)$$
Maximizing this likelihood is equivalent to maximizing the exponent (after taking the log), which means minimizing $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$—the residual sum of squares.
This is profound: Least squares isn't arbitrary; it's the principled choice when noise is Gaussian.
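A brief numerical sketch of this claim (the synthetic data are illustrative): minimizing the Gaussian negative log-likelihood over $\boldsymbol{\beta}$ with a generic optimizer recovers the same coefficients as the closed-form OLS solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)

n, p = 300, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.7
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

def neg_log_likelihood(beta, X, y, sigma):
    """Gaussian NLL as a function of beta (sigma held fixed); constants kept for completeness."""
    resid = y - X @ beta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p), args=(X, y, sigma)).x
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("MLE:", np.round(beta_mle, 4))
print("OLS:", np.round(beta_ols, 4))   # the two coincide up to optimizer tolerance
```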
The method of least squares was independently developed by Gauss and Legendre in the early 1800s. Gauss justified it probabilistically using what we now call the Gaussian distribution (named after him). Legendre presented it as a convenient computational technique. Both perspectives—computational convenience and probabilistic optimality—remain valid today.
What MLE adds beyond OLS:
The maximum likelihood framework provides additional benefits: an explicit estimate of the noise variance $\sigma^2$, standard errors and confidence intervals for the coefficients, likelihood-based hypothesis tests, and model-selection criteria such as AIC and BIC.
These benefits make MLE the foundation for modern statistical inference in regression.
We've established the probabilistic foundation for linear regression. The key insights to carry forward: the data are modeled as generated by $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ with i.i.d. Gaussian noise; this implies $\mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)$; the residual sum of squares appears directly in the exponent of the Gaussian density; and each part of the assumption (independence, constant variance, normality) can and should be checked with diagnostics.
What's next:
In the next page, we'll derive the log-likelihood function explicitly and show step-by-step how maximizing likelihood leads to the OLS solution. We'll see the beautiful mathematics that connects probability theory to linear algebra, and understand why the Gaussian assumption makes the math tractable while other distributions would not.
You now understand the Gaussian noise assumption—the cornerstone of probabilistic linear regression. This isn't just mathematical elegance; it's the foundation for everything from simple t-tests on coefficients to sophisticated model selection criteria. With this understanding, you're ready to see how maximum likelihood estimation completes the picture.