In the previous page, we established that prior distributions encode our beliefs about model parameters and that the negative log of a prior becomes a regularization penalty. We claimed that Gaussian priors lead to L2 (Ridge) regularization.
Now we'll prove this claim rigorously. By the end of this page, you'll understand that Ridge regression isn't just a convenient technique for handling multicollinearity or overfitting—it's the exact solution to a Bayesian inference problem where we assume coefficients are drawn from a Gaussian distribution centered at zero.
This perspective transforms Ridge regression from an algorithm to be applied into a statement of belief to be understood. When you use Ridge, you're asserting: 'I believe my coefficients are normally distributed with mean zero and some variance that balances signal against overfitting.'
By completing this page, you will: (1) Derive the Ridge regression estimator from first principles using Bayesian inference; (2) Understand the precise relationship between prior variance and regularization strength; (3) See how the posterior distribution provides not just point estimates but full uncertainty quantification; (4) Appreciate why Ridge never produces exactly zero coefficients.
Let's establish our statistical model precisely. We have:
The Likelihood Model:
We assume the standard linear regression model with Gaussian noise:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$$
Equivalently, the response is conditionally Gaussian given the predictors:
$$\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$$
The likelihood function for the parameters given the data is:
$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2\right)$$
Throughout this derivation, we treat $\sigma^2$ as known. In practice, it can be estimated from residuals or given its own prior. Treating it as known keeps the derivation focused on the key insight: the prior-regularization correspondence.
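To make the generative model concrete, here is a minimal NumPy sketch that simulates data from this likelihood. The dimensions, coefficient values, and noise level are arbitrary choices for illustration, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5          # sample size and number of features (illustrative values)
sigma2 = 0.25          # noise variance sigma^2, treated as known in this derivation

X = rng.normal(size=(n, p))                        # design matrix
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 3.0])   # arbitrary "true" coefficients
eps = rng.normal(scale=np.sqrt(sigma2), size=n)    # epsilon ~ N(0, sigma^2 I_n)
y = X @ beta_true + eps                            # y = X beta + epsilon
```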
The Gaussian Prior on Coefficients:
We now place a Gaussian prior on the coefficient vector. The key assumption is that coefficients are independent, each with mean zero and variance $\tau^2$:
$$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I}_p)$$
Expanded: $$p(\boldsymbol{\beta}) = \frac{1}{(2\pi\tau^2)^{p/2}} \exp\left(-\frac{1}{2\tau^2}|\boldsymbol{\beta}|_2^2\right)$$
This prior encodes several beliefs: the coefficients are independent a priori, each is as likely to be positive as negative (the prior is symmetric about zero), typical magnitudes are on the order of $\tau$, and very large coefficients are increasingly improbable.
By Bayes' theorem, the posterior distribution is proportional to the product of likelihood and prior:
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
Substituting our expressions:
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2\right) \cdot \exp\left(-\frac{1}{2\tau^2}|\boldsymbol{\beta}|_2^2\right)$$
Combining the Exponents:
Since $\exp(a) \cdot \exp(b) = \exp(a + b)$:
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2}\left[\frac{1}{\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \frac{1}{\tau^2}|\boldsymbol{\beta}|_2^2\right]\right)$$
Let's define $\lambda = \sigma^2 / \tau^2$. Factoring $\frac{1}{\sigma^2}$ out of the bracket (equivalently, multiplying and dividing the second term by $\sigma^2$):
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2\sigma^2}\left[|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_2^2\right]\right)$$
The expression inside the exponential is exactly the Ridge regression objective: $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_2^2$. The regularization parameter $\lambda = \sigma^2/\tau^2$ is the ratio of noise variance to prior variance!
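A quick numerical check of this identity, on synthetic data with made-up values of $\sigma^2$ and $\tau^2$: the unnormalized log posterior and the Ridge objective differ only by a $\boldsymbol{\beta}$-independent constant and the factor $-1/(2\sigma^2)$, so their differences across any two coefficient vectors agree exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
sigma2, tau2 = 0.5, 2.0          # noise and prior variances (illustrative)
lam = sigma2 / tau2              # lambda = sigma^2 / tau^2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

def log_post_unnorm(beta):
    """Log likelihood + log prior, dropping beta-independent constants."""
    return (-np.sum((y - X @ beta) ** 2) / (2 * sigma2)
            - np.sum(beta ** 2) / (2 * tau2))

def ridge_objective(beta):
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

b1, b2 = rng.normal(size=p), rng.normal(size=p)
lhs = log_post_unnorm(b1) - log_post_unnorm(b2)
rhs = -(ridge_objective(b1) - ridge_objective(b2)) / (2 * sigma2)
print(np.isclose(lhs, rhs))      # True: same function up to scale and shift
```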
Identifying the Posterior Distribution:
The posterior is proportional to $\exp(-\text{quadratic form in }\boldsymbol{\beta})$. This is the kernel of a multivariate Gaussian distribution. By completing the square (detailed below), we can identify the posterior mean and covariance.
Completing the Square:
We need to rewrite the exponent as $(\boldsymbol{\beta} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu})$ plus constants.
Expanding the quadratic forms: $$|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
$$\lambda|\boldsymbol{\beta}|_2^2 = \lambda\boldsymbol{\beta}^T\boldsymbol{\beta}$$
Combining: $$\mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\boldsymbol{\beta}$$
Writing $\mathbf{A} = \mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$, this equals $(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{X}^T\mathbf{y})^T\mathbf{A}(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{X}^T\mathbf{y})$ plus a term that does not depend on $\boldsymbol{\beta}$, which completes the square.
Extracting the Posterior Parameters:
From the completed square form, the posterior is:
$$\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
where the posterior mean is: $$\boldsymbol{\mu}_n = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
and the posterior covariance is: $$\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
The posterior mean $\boldsymbol{\mu}_n = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ is exactly the Ridge regression estimator! When we solve Ridge regression, we're finding the center of the Bayesian posterior distribution under Gaussian assumptions.
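As a sanity check, the sketch below (synthetic data, arbitrary $\sigma^2$ and $\tau^2$) computes the posterior mean from the closed form and confirms it matches the Ridge solution obtained from the standard augmented least-squares trick, in which the penalty is absorbed into extra rows of the design.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 6
sigma2, tau2 = 1.0, 4.0
lam = sigma2 / tau2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior mean (= Ridge estimator) from the closed form
mu_n = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The same estimate from the augmented least-squares trick:
# minimizing ||y - X b||^2 + lam ||b||^2 is ordinary least squares
# on the stacked system [X; sqrt(lam) I] b ~ [y; 0].
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_ridge, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(mu_n, beta_ridge))   # True
```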
For reference, here is the full model specification with every density written out explicitly, suitable for rigorous study.
Model Specification:
Likelihood: $$\mathbf{y} \mid \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$$
$$p(\mathbf{y} \mid \boldsymbol{\beta}) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right)$$
Prior: $$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I}_p)$$
$$p(\boldsymbol{\beta}) = (2\pi\tau^2)^{-p/2} \exp\left(-\frac{1}{2\tau^2}\boldsymbol{\beta}^T\boldsymbol{\beta}\right)$$
The derivation reveals the precise meaning of $\lambda$:
$$\lambda = \frac{\sigma^2}{\tau^2} = \frac{\text{Noise variance}}{\text{Prior variance}}$$
This ratio has profound interpretive value:
| Scenario | $\lambda$ Value | Interpretation |
|---|---|---|
| High noise, tight prior | Large | Data is unreliable; trust the prior heavily → strong shrinkage |
| Low noise, diffuse prior | Small | Data is reliable; let the data speak → weak shrinkage |
| High noise, diffuse prior | Moderate | Uncertain data meeting uncertain prior → balanced shrinkage |
| Low noise, tight prior | Moderate | Reliable data meeting confident prior → both contribute |
Limiting Cases:
As $\tau^2 \to \infty$ (flat prior): $$\lambda \to 0, \quad \boldsymbol{\hat\beta}_{\text{Ridge}} \to (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \boldsymbol{\hat\beta}_{\text{OLS}}$$
With an infinitely diffuse prior, we recover ordinary least squares—the prior contributes nothing.
As $\tau^2 \to 0$ (spike prior at zero): $$\lambda \to \infty, \quad \boldsymbol{\hat\beta}_{\text{Ridge}} \to \mathbf{0}$$
With a prior concentrated entirely at zero, the posterior is forced to zero regardless of data—the prior dominates completely.
Think of $\tau^2/\sigma^2 = 1/\lambda$ as a prior signal-to-noise ratio. Large $\tau^2/\sigma^2$ means we believe coefficients can be large relative to noise → little regularization. Small $\tau^2/\sigma^2$ means we expect coefficients to be small relative to noise → heavy regularization.
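Both limits are easy to verify numerically. The following sketch uses arbitrary synthetic data and extreme values of $\lambda$ to stand in for $\tau^2 \to \infty$ and $\tau^2 \to 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.3, size=n)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(ridge(1e-10), beta_ols))   # flat prior: Ridge -> OLS
print(np.linalg.norm(ridge(1e10)) < 1e-6)    # spike prior: Ridge -> 0
```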
The Bayesian framework provides more than a point estimate—we get a full distribution over coefficients:
$$\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
What This Tells Us:
Uncertainty about each coefficient: The diagonal elements $[\boldsymbol{\Sigma}_n]_{jj}$ give the marginal variance of $\beta_j$
Correlations between coefficients: Off-diagonal elements $[\boldsymbol{\Sigma}_n]_{ij}$ indicate how uncertainty in $\beta_i$ relates to uncertainty in $\beta_j$
Credible intervals: For coefficient $\beta_j$, a 95% credible interval is $\mu_{n,j} \pm 1.96\sqrt{[\boldsymbol{\Sigma}_n]_{jj}}$
The frequentist OLS covariance is $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. The Ridge posterior covariance $\sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$ is smaller—adding $\lambda\mathbf{I}$ shrinks the uncertainty. This reflects the additional information we've incorporated via the prior.
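Here is a small sketch (synthetic data, made-up $\sigma^2$ and $\lambda$) that computes the posterior mean and covariance, reports 95% credible intervals, and confirms that the posterior variances never exceed the corresponding OLS sampling variances.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
sigma2, lam = 0.5, 2.0

X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

A = X.T @ X + lam * np.eye(p)
mu_n = np.linalg.solve(A, X.T @ y)          # posterior mean
Sigma_n = sigma2 * np.linalg.inv(A)         # posterior covariance

# 95% credible intervals from the marginal posterior of each coefficient
half_width = 1.96 * np.sqrt(np.diag(Sigma_n))
for j in range(p):
    print(f"beta_{j}: {mu_n[j]:.3f} +/- {half_width[j]:.3f}")

# Posterior variances are no larger than the OLS sampling variances
ols_var = sigma2 * np.diag(np.linalg.inv(X.T @ X))
print(np.all(np.diag(Sigma_n) <= ols_var))  # True
```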
Effect of Regularization on Uncertainty:
As $\lambda$ increases, the posterior covariance $\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$ shrinks: posterior variances decrease and credible intervals narrow.
This might seem paradoxical—stronger regularization gives more certainty? The resolution is that we're becoming more certain the coefficients are near zero, not more certain about their true values. The prior is dominating the inference.
In Matrix Terms:
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ where $d_1 \geq d_2 \geq \ldots \geq d_p$ are singular values:
$$\boldsymbol{\Sigma}_n = \sigma^2 \mathbf{V} \text{diag}\left(\frac{1}{d_1^2 + \lambda}, \ldots, \frac{1}{d_p^2 + \lambda}\right) \mathbf{V}^T$$
Directions with small singular values (poorly determined by data) get the most uncertainty reduction from the prior.
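The SVD identity is easy to confirm numerically. The sketch below (arbitrary synthetic design) also checks that no direction's posterior variance exceeds the prior-imposed ceiling of $\sigma^2/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 4
sigma2, lam = 1.0, 3.0
X = rng.normal(size=(n, p))

# Thin SVD: X = U D V^T with d holding the p singular values
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

Sigma_direct = sigma2 * np.linalg.inv(X.T @ X + lam * np.eye(p))
Sigma_svd = sigma2 * (V * (1.0 / (d**2 + lam))) @ V.T   # V diag(1/(d_j^2+lam)) V^T

print(np.allclose(Sigma_direct, Sigma_svd))             # True

# The prior caps the posterior variance in every direction at sigma^2 / lam
print(np.max(np.linalg.eigvalsh(Sigma_svd)) <= sigma2 / lam + 1e-12)  # True
```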
A fundamental property of Ridge regression is that it shrinks all coefficients toward zero but never sets them exactly to zero. The Bayesian perspective explains why.
The Gaussian Prior Argument:
The Gaussian distribution is continuous with full support on $\mathbb{R}$: $$p(\beta_j = 0) = 0$$
The prior assigns zero probability to any single point, including zero. Since the likelihood (also Gaussian) is continuous with full support, the posterior remains continuous with full support: $$p(\beta_j = 0 \mid \mathbf{y}) = 0$$
For a continuous distribution, the probability of any exact value is zero. This is why Gaussian priors cannot induce exact zeros—they don't place positive probability on $\beta_j = 0$ as a discrete outcome. For exact sparsity, we need priors with point mass at zero (like spike-and-slab) or priors that induce sparsity through MAP estimation (Laplace).
The Optimization Perspective:
From an optimization viewpoint, the Ridge objective: $$f(\boldsymbol{\beta}) = |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_2^2$$
has gradient: $$\nabla f = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 2\lambda\boldsymbol{\beta}$$
Setting to zero: $$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
For $\beta_j = 0$, we'd need the $j$-th equation: $$\sum_{i \neq j} [\mathbf{X}^T\mathbf{X}]_{ji}\beta_i = [\mathbf{X}^T\mathbf{y}]_j$$
This is a measure-zero event in the space of possible data configurations—it almost never happens naturally.
Implications for Feature Selection:
Since Ridge never produces exact zeros, it does not perform feature selection: every predictor keeps a nonzero (if tiny) coefficient, the fitted model remains fully dense, and you cannot read variable importance off a sparsity pattern.
However, this is not always a disadvantage. If you believe all features contribute (even if weakly), Ridge's full-coefficient approach may be more appropriate.
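The absence of exact zeros is easy to see empirically. In the sketch below (synthetic data in which most features are truly irrelevant), even very large $\lambda$ shrinks coefficients toward zero without ever producing an exact zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first two features actually matter in this synthetic example
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

for lam in [0.1, 10.0, 1000.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    n_zero = np.sum(beta == 0.0)
    print(f"lambda={lam:>7}: exact zeros = {n_zero}, "
          f"smallest |coef| = {np.min(np.abs(beta)):.2e}")
# Coefficients shrink toward zero as lambda grows, but none hit exactly zero.
```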
The SVD representation of Ridge provides deep insight into its shrinkage behavior. Let $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ with singular values $d_1, \ldots, d_p$.
The OLS estimator in the SVD basis: $$\boldsymbol{\hat\beta}_{OLS} = \mathbf{V}\mathbf{D}^{-1}\mathbf{U}^T\mathbf{y}$$
The Ridge estimator: $$\boldsymbol{\hat\beta}_{Ridge} = \mathbf{V}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}$$
We can write this as: $$\boldsymbol{\hat\beta}_{Ridge} = \mathbf{V}\mathbf{S}_{\lambda}\mathbf{D}^{-1}\mathbf{U}^T\mathbf{y}$$
where $\mathbf{S}_{\lambda}$ is a diagonal matrix with entries: $$s_j = \frac{d_j^2}{d_j^2 + \lambda}$$
The shrinkage factors $s_j = d_j^2/(d_j^2 + \lambda)$ are always in $(0, 1)$. They multiply the OLS solution in the SVD basis, shrinking each component toward zero. Components with small singular values (poorly determined by data) are shrunk more heavily.
| Singular Value $d_j$ | Shrinkage Factor $s_j$ | Effect |
|---|---|---|
| $d_j \gg \sqrt{\lambda}$ | ≈ 1 | Minimal shrinkage—data strongly determines this direction |
| $d_j \approx \sqrt{\lambda}$ | ≈ 0.5 | Moderate shrinkage—prior and data contribute equally |
| $d_j \ll \sqrt{\lambda}$ | ≈ 0 | Heavy shrinkage—prior dominates this poorly-determined direction |
The Bayesian Interpretation:
The shrinkage factor $s_j = d_j^2/(d_j^2 + \lambda)$ can be rewritten: $$s_j = \frac{1}{1 + \lambda/d_j^2}$$
This is a precision weight: in direction $j$, the data contribute precision $d_j^2/\sigma^2$ and the prior contributes precision $1/\tau^2 = \lambda/\sigma^2$, so $s_j$ is the fraction of the total precision supplied by the data. Directions well-estimated by the data (large $d_j^2$) get weights close to 1; poorly-estimated directions (small $d_j^2$) get weights close to 0.
This is Bayesian learning: we trust data more where it's informative and fall back on the prior where data is uninformative.
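The following sketch (arbitrary synthetic data) computes the shrinkage factors from the SVD, checks the precision-weight form $s_j = 1/(1 + \lambda/d_j^2)$, and verifies that applying them to the OLS solution in the SVD basis reproduces the closed-form Ridge estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 5
lam = 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

s = d**2 / (d**2 + lam)                            # shrinkage factors s_j
print(np.allclose(s, 1.0 / (1.0 + lam / d**2)))    # precision-weight form

beta_ols = V @ ((U.T @ y) / d)                     # V D^{-1} U^T y
beta_ridge_svd = V @ (s * (U.T @ y) / d)           # V S_lambda D^{-1} U^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_ridge_svd, beta_ridge))     # True
# Each SVD component of the OLS solution is multiplied by s_j in (0, 1).
```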
One of Ridge regression's most celebrated properties is its ability to handle multicollinearity—highly correlated features. The Bayesian perspective explains why.
The OLS Problem with Multicollinearity:
When features are nearly collinear, $\mathbf{X}^T\mathbf{X}$ is nearly singular:
If $\mathbf{X}^T\mathbf{X}$ has a small eigenvalue $\mu_{\min}$, the corresponding eigenvalue of $(\mathbf{X}^T\mathbf{X})^{-1}$ is $1/\mu_{\min}$, which can be enormous. This amplifies noise in the data into massive estimation variance.
How Ridge Solves This:
Ridge adds $\lambda\mathbf{I}$ to $\mathbf{X}^T\mathbf{X}$ before inverting: $$(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
Even if $\mathbf{X}^T\mathbf{X}$ has eigenvalue 0, $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ has eigenvalue $\lambda$.
The minimum eigenvalue is now at least $\lambda$, bounding the maximum eigenvalue of the inverse at $1/\lambda$.
The Bayesian Interpretation:
Multicollinearity means the data doesn't distinguish between certain linear combinations of coefficients. In Bayesian terms, the likelihood is flat in those directions—many coefficient configurations explain the data equally well.
The Gaussian prior isn't flat. It prefers smaller coefficients. So even where the data provides no information, the prior guides us toward the origin. The "arbitrary" choice among equivalent OLS solutions is resolved by the prior's preference for parsimony.
Quantitative Effect:
For an eigenvalue $\mu$ of $\mathbf{X}^T\mathbf{X}$:
OLS variance contribution: proportional to $1/\mu$ (explodes as $\mu \to 0$)
Ridge variance contribution: proportional to $1/(\mu + \lambda)$ (bounded by $1/\lambda$)
The regularization parameter $\lambda$ sets a floor on how much any direction can be amplified. This is precisely the regularizing effect of the prior: it contributes information that stabilizes inference in data-poor directions.
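A small demonstration with two nearly collinear features (the collinearity level and $\lambda$ are made-up values) shows the eigenvalue floor in action.

```python
import numpy as np

rng = np.random.default_rng(8)
n, lam = 200, 1.0
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2])

evals_ols = np.linalg.eigvalsh(X.T @ X)
evals_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(2))
print("eigenvalues of X^T X:        ", evals_ols)
print("eigenvalues of X^T X + lam I:", evals_ridge)

# Worst-case variance amplification is 1/mu_min; Ridge caps it at 1/(mu_min + lam)
print("OLS amplification:  ", 1.0 / evals_ols.min())     # huge
print("Ridge amplification:", 1.0 / evals_ridge.min())   # at most 1/lam
```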
The Bayesian derivation yields the same computational formula as the optimization approach. Let's examine the practical aspects.
The Closed-Form Solution:
$$\boldsymbol{\hat\beta}_{Ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
Computational Approaches:
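In practice the closed form is evaluated by solving the linear system $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$ rather than forming an explicit inverse. A minimal sketch, assuming SciPy is available for the Cholesky solve (the system matrix is symmetric positive definite whenever $\lambda > 0$):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve   # assumes SciPy is installed

rng = np.random.default_rng(9)
n, p, lam = 500, 20, 0.5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

A = X.T @ X + lam * np.eye(p)     # symmetric positive definite for lam > 0
b = X.T @ y

# Prefer a linear solve (here via Cholesky) over forming the explicit inverse
beta_chol = cho_solve(cho_factor(A), b)
beta_naive = np.linalg.inv(A) @ b

print(np.allclose(beta_chol, beta_naive))   # True, but the solve is cheaper and more stable
```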
The SVD Advantage:
Computing the SVD $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ upfront allows the Ridge solution to be recomputed cheaply for an entire grid of $\lambda$ values: only the scalar shrinkage factors $d_j^2/(d_j^2 + \lambda)$ change, while $\mathbf{U}$, $\mathbf{D}$, and $\mathbf{V}$ are reused.
For Full Bayesian Inference:
If we need the full posterior (not just the mean), we need the covariance $\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$.
Sampling from this Gaussian is straightforward—much simpler than MCMC approaches needed for non-Gaussian priors.
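Because the posterior is exactly Gaussian, drawing samples only requires a Cholesky factor of $\boldsymbol{\Sigma}_n$. A minimal sketch on synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 100, 4
sigma2, tau2 = 0.5, 2.0
lam = sigma2 / tau2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

A = X.T @ X + lam * np.eye(p)
mu_n = np.linalg.solve(A, X.T @ y)
Sigma_n = sigma2 * np.linalg.inv(A)

# Exact posterior draws: mu_n + L z, where L is the Cholesky factor of Sigma_n
L = np.linalg.cholesky(Sigma_n)
samples = mu_n + rng.standard_normal((5000, p)) @ L.T

print(samples.mean(axis=0))           # close to mu_n
print(np.cov(samples, rowvar=False))  # close to Sigma_n
```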
The Gaussian-Gaussian conjugate pair means we can compute the exact posterior analytically. No approximations, no MCMC, no variational inference. This computational tractability is a major advantage of the Gaussian prior / L2 regularization combination.
We've established the deep connection between Ridge regression and Bayesian inference with Gaussian priors.
What's Next:
In Page 3, we'll develop the parallel analysis for Lasso regression, showing how Laplace priors lead to L1 regularization. The mathematics will be more subtle—Laplace priors don't yield closed-form posteriors—but the conceptual connection remains equally profound. We'll see why Laplace priors produce sparsity where Gaussian priors cannot.
You now understand Ridge regression as the natural consequence of Bayesian inference with Gaussian priors. This perspective transforms regularization from a computational trick into a principled statement of prior belief. When you apply Ridge regression, you're asserting that you believe coefficients are normally distributed around zero—and letting the data update that belief.