In the previous page, we established that prior distributions encode our beliefs about model parameters and that the negative log of a prior becomes a regularization penalty. We claimed that Gaussian priors lead to L2 (Ridge) regularization.
Now we'll prove this claim rigorously. By the end of this page, you'll understand that Ridge regression isn't just a convenient technique for handling multicollinearity or overfitting—it's the exact solution to a Bayesian inference problem where we assume coefficients are drawn from a Gaussian distribution centered at zero.
This perspective transforms Ridge regression from an algorithm to be applied into a statement of belief to be understood. When you use Ridge, you're asserting: 'I believe my coefficients are normally distributed with mean zero and some variance that balances signal against overfitting.'
By completing this page, you will: (1) Derive the Ridge regression estimator from first principles using Bayesian inference; (2) Understand the precise relationship between prior variance and regularization strength; (3) See how the posterior distribution provides not just point estimates but full uncertainty quantification; (4) Appreciate why Ridge never produces exactly zero coefficients.
Let's establish our statistical model precisely. We have:
The Likelihood Model:
We assume the standard linear regression model with Gaussian noise:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n)$$
Equivalently, the response is conditionally Gaussian given the predictors:
$$\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$$
The likelihood function for the parameters given the data is:
$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2\right)$$
Throughout this derivation, we treat $\sigma^2$ as known. In practice, it can be estimated from residuals or given its own prior. Treating it as known keeps the derivation focused on the key insight: the prior-regularization correspondence.
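To make the generative model concrete, here is a minimal NumPy sketch that simulates data from this likelihood. The dimensions, coefficient values, and noise level are arbitrary choices for illustration, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5          # sample size and number of features (illustrative values)
sigma2 = 0.25          # noise variance sigma^2, treated as known in this derivation

X = rng.normal(size=(n, p))                        # design matrix
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 3.0])   # arbitrary "true" coefficients
eps = rng.normal(scale=np.sqrt(sigma2), size=n)    # epsilon ~ N(0, sigma^2 I_n)
y = X @ beta_true + eps                            # y = X beta + epsilon
```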
The Gaussian Prior on Coefficients:
We now place a Gaussian prior on the coefficient vector. The key assumption is that coefficients are independent, each with mean zero and variance $\tau^2$:
$$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I}_p)$$
Expanded: $$p(\boldsymbol{\beta}) = \frac{1}{(2\pi\tau^2)^{p/2}} \exp\left(-\frac{1}{2\tau^2}|\boldsymbol{\beta}|_2^2\right)$$
This prior encodes several beliefs: the coefficients are independent a priori, each is as likely to be positive as negative (the prior is symmetric about zero), typical magnitudes are on the order of $\tau$, and very large coefficients are increasingly improbable.
By Bayes' theorem, the posterior distribution is proportional to the product of likelihood and prior:
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
Substituting our expressions:
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2\right) \cdot \exp\left(-\frac{1}{2\tau^2}|\boldsymbol{\beta}|_2^2\right)$$
Combining the Exponents:
Since $\exp(a) \cdot \exp(b) = \exp(a + b)$:
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2}\left[\frac{1}{\sigma^2}|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \frac{1}{\tau^2}|\boldsymbol{\beta}|_2^2\right]\right)$$
Let's define $\lambda = \sigma^2 / \tau^2$. Factoring $\frac{1}{\sigma^2}$ out of the bracket (equivalently, multiplying and dividing the second term by $\sigma^2$):
$$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto \exp\left(-\frac{1}{2\sigma^2}\left[|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_2^2\right]\right)$$
The expression inside the exponential is exactly the Ridge regression objective: $|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_2^2$. The regularization parameter $\lambda = \sigma^2/\tau^2$ is the ratio of noise variance to prior variance!
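A quick numerical check of this identity, on synthetic data with made-up values of $\sigma^2$ and $\tau^2$: the unnormalized log posterior and the Ridge objective differ only by a $\boldsymbol{\beta}$-independent constant and the factor $-1/(2\sigma^2)$, so their differences across any two coefficient vectors agree exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 4
sigma2, tau2 = 0.5, 2.0          # noise and prior variances (illustrative)
lam = sigma2 / tau2              # lambda = sigma^2 / tau^2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

def log_post_unnorm(beta):
    """Log likelihood + log prior, dropping beta-independent constants."""
    return (-np.sum((y - X @ beta) ** 2) / (2 * sigma2)
            - np.sum(beta ** 2) / (2 * tau2))

def ridge_objective(beta):
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(beta ** 2)

b1, b2 = rng.normal(size=p), rng.normal(size=p)
lhs = log_post_unnorm(b1) - log_post_unnorm(b2)
rhs = -(ridge_objective(b1) - ridge_objective(b2)) / (2 * sigma2)
print(np.isclose(lhs, rhs))      # True: same function up to scale and shift
```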
Identifying the Posterior Distribution:
The posterior is proportional to $\exp(-\text{quadratic form in }\boldsymbol{\beta})$. This is the kernel of a multivariate Gaussian distribution. By completing the square (detailed below), we can identify the posterior mean and covariance.
Completing the Square:
We need to rewrite the exponent as $(\boldsymbol{\beta} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu})$ plus constants.
Expanding the quadratic forms: $$|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$
$$\lambda|\boldsymbol{\beta}|_2^2 = \lambda\boldsymbol{\beta}^T\boldsymbol{\beta}$$
Combining: $$\mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\boldsymbol{\beta}$$
Writing $\mathbf{A} = \mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$, this equals $(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{X}^T\mathbf{y})^T\mathbf{A}(\boldsymbol{\beta} - \mathbf{A}^{-1}\mathbf{X}^T\mathbf{y})$ plus a term that does not depend on $\boldsymbol{\beta}$, which completes the square.
Extracting the Posterior Parameters:
From the completed square form, the posterior is:
$$\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
where the posterior mean is: $$\boldsymbol{\mu}_n = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
and the posterior covariance is: $$\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
The posterior mean $\boldsymbol{\mu}_n = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$ is exactly the Ridge regression estimator! When we solve Ridge regression, we're finding the center of the Bayesian posterior distribution under Gaussian assumptions.
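As a sanity check, the sketch below (synthetic data, arbitrary $\sigma^2$ and $\tau^2$) computes the posterior mean from the closed form and confirms it matches the Ridge solution obtained from the standard augmented least-squares trick, in which the penalty is absorbed into extra rows of the design.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 6
sigma2, tau2 = 1.0, 4.0
lam = sigma2 / tau2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Posterior mean (= Ridge estimator) from the closed form
mu_n = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The same estimate from the augmented least-squares trick:
# minimizing ||y - X b||^2 + lam ||b||^2 is ordinary least squares
# on the stacked system [X; sqrt(lam) I] b ~ [y; 0].
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])
beta_ridge, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(mu_n, beta_ridge))   # True
```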
For reference, here is the full model specification with every density written out explicitly, suitable for rigorous study.
Model Specification:
Likelihood: $$\mathbf{y} \mid \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$$
$$p(\mathbf{y} \mid \boldsymbol{\beta}) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\right)$$
Prior: $$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I}_p)$$
$$p(\boldsymbol{\beta}) = (2\pi\tau^2)^{-p/2} \exp\left(-\frac{1}{2\tau^2}\boldsymbol{\beta}^T\boldsymbol{\beta}\right)$$
The derivation reveals the precise meaning of $\lambda$:
$$\lambda = \frac{\sigma^2}{\tau^2} = \frac{\text{Noise variance}}{\text{Prior variance}}$$
This ratio has profound interpretive value:
| Scenario | $\lambda$ Value | Interpretation |
|---|---|---|
| High noise, tight prior | Large | Data is unreliable; trust the prior heavily → strong shrinkage |
| Low noise, diffuse prior | Small | Data is reliable; let the data speak → weak shrinkage |
| High noise, diffuse prior | Moderate | Uncertain data meeting uncertain prior → balanced shrinkage |
| Low noise, tight prior | Moderate | Reliable data meeting confident prior → both contribute |
Limiting Cases:
As $\tau^2 \to \infty$ (flat prior): $$\lambda \to 0, \quad \boldsymbol{\hat\beta}_{\text{Ridge}} \to (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \boldsymbol{\hat\beta}_{\text{OLS}}$$
With an infinitely diffuse prior, we recover ordinary least squares—the prior contributes nothing.
As $\tau^2 \to 0$ (spike prior at zero): $$\lambda \to \infty, \quad \boldsymbol{\hat\beta}_{\text{Ridge}} \to \mathbf{0}$$
With a prior concentrated entirely at zero, the posterior is forced to zero regardless of data—the prior dominates completely.
Think of $\tau^2/\sigma^2 = 1/\lambda$ as a prior signal-to-noise ratio. Large $\tau^2/\sigma^2$ means we believe coefficients can be large relative to noise → little regularization. Small $\tau^2/\sigma^2$ means we expect coefficients to be small relative to noise → heavy regularization.
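Both limits are easy to verify numerically. The following sketch uses arbitrary synthetic data and extreme values of $\lambda$ to stand in for $\tau^2 \to \infty$ and $\tau^2 \to 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.3, size=n)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(ridge(1e-10), beta_ols))   # flat prior: Ridge -> OLS
print(np.linalg.norm(ridge(1e10)) < 1e-6)    # spike prior: Ridge -> 0
```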
The Bayesian framework provides more than a point estimate—we get a full distribution over coefficients:
$$\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
What This Tells Us:
Uncertainty about each coefficient: The diagonal elements $[\boldsymbol{\Sigma}_n]_{jj}$ give the marginal variance of $\beta_j$
Correlations between coefficients: Off-diagonal elements $[\boldsymbol{\Sigma}_n]_{ij}$ indicate how uncertainty in $\beta_i$ relates to uncertainty in $\beta_j$
Credible intervals: For coefficient $\beta_j$, a 95% credible interval is $\mu_{n,j} \pm 1.96\sqrt{[\boldsymbol{\Sigma}_n]_{jj}}$
The frequentist OLS covariance is $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$. The Ridge posterior covariance $\sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$ is smaller—adding $\lambda\mathbf{I}$ shrinks the uncertainty. This reflects the additional information we've incorporated via the prior.
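Here is a small sketch (synthetic data, made-up $\sigma^2$ and $\lambda$) that computes the posterior mean and covariance, reports 95% credible intervals, and confirms that the posterior variances never exceed the corresponding OLS sampling variances.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
sigma2, lam = 0.5, 2.0

X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

A = X.T @ X + lam * np.eye(p)
mu_n = np.linalg.solve(A, X.T @ y)          # posterior mean
Sigma_n = sigma2 * np.linalg.inv(A)         # posterior covariance

# 95% credible intervals from the marginal posterior of each coefficient
half_width = 1.96 * np.sqrt(np.diag(Sigma_n))
for j in range(p):
    print(f"beta_{j}: {mu_n[j]:.3f} +/- {half_width[j]:.3f}")

# Posterior variances are no larger than the OLS sampling variances
ols_var = sigma2 * np.diag(np.linalg.inv(X.T @ X))
print(np.all(np.diag(Sigma_n) <= ols_var))  # True
```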
Effect of Regularization on Uncertainty:
As $\lambda$ increases, the posterior covariance $\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$ shrinks: posterior variances decrease and credible intervals narrow.
This might seem paradoxical—stronger regularization gives more certainty? The resolution is that we're becoming more certain the coefficients are near zero, not more certain about their true values. The prior is dominating the inference.
In Matrix Terms:
Using the SVD $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ where $d_1 \geq d_2 \geq \ldots \geq d_p$ are singular values:
$$\boldsymbol{\Sigma}_n = \sigma^2 \mathbf{V} \text{diag}\left(\frac{1}{d_1^2 + \lambda}, \ldots, \frac{1}{d_p^2 + \lambda}\right) \mathbf{V}^T$$
Directions with small singular values (poorly determined by data) get the most uncertainty reduction from the prior.
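The SVD identity is easy to confirm numerically. The sketch below (arbitrary synthetic design) also checks that no direction's posterior variance exceeds the prior-imposed ceiling of $\sigma^2/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 4
sigma2, lam = 1.0, 3.0
X = rng.normal(size=(n, p))

# Thin SVD: X = U D V^T with d holding the p singular values
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

Sigma_direct = sigma2 * np.linalg.inv(X.T @ X + lam * np.eye(p))
Sigma_svd = sigma2 * (V * (1.0 / (d**2 + lam))) @ V.T   # V diag(1/(d_j^2+lam)) V^T

print(np.allclose(Sigma_direct, Sigma_svd))             # True

# The prior caps the posterior variance in every direction at sigma^2 / lam
print(np.max(np.linalg.eigvalsh(Sigma_svd)) <= sigma2 / lam + 1e-12)  # True
```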
A fundamental property of Ridge regression is that it shrinks all coefficients toward zero but never sets them exactly to zero. The Bayesian perspective explains why.
The Gaussian Prior Argument:
The Gaussian distribution is continuous with full support on $\mathbb{R}$: $$p(\beta_j = 0) = 0$$
The prior assigns zero probability to any single point, including zero. Since the likelihood (also Gaussian) is continuous with full support, the posterior remains continuous with full support: $$p(\beta_j = 0 \mid \mathbf{y}) = 0$$
For a continuous distribution, the probability of any exact value is zero. This is why Gaussian priors cannot induce exact zeros—they don't place positive probability on $\beta_j = 0$ as a discrete outcome. For exact sparsity, we need priors with point mass at zero (like spike-and-slab) or priors that induce sparsity through MAP estimation (Laplace).
The Optimization Perspective:
From an optimization viewpoint, the Ridge objective: $$f(\boldsymbol{\beta}) = |\mathbf{y} - \mathbf{X}\boldsymbol{\beta}|_2^2 + \lambda|\boldsymbol{\beta}|_2^2$$
has gradient: $$\nabla f = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + 2\lambda\boldsymbol{\beta}$$
Setting to zero: $$\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$$
For $\beta_j = 0$, we'd need the $j$-th equation: $$\sum_{i \neq j} [\mathbf{X}^T\mathbf{X}]_{ji}\beta_i = [\mathbf{X}^T\mathbf{y}]_j$$
This is a measure-zero event in the space of possible data configurations—it almost never happens naturally.
Implications for Feature Selection:
Since Ridge never produces exact zeros, it does not perform feature selection: every predictor keeps a nonzero (if tiny) coefficient, the fitted model remains fully dense, and you cannot read variable importance off a sparsity pattern.
However, this is not always a disadvantage. If you believe all features contribute (even if weakly), Ridge's full-coefficient approach may be more appropriate.
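The absence of exact zeros is easy to see empirically. In the sketch below (synthetic data in which most features are truly irrelevant), even very large $\lambda$ shrinks coefficients toward zero without ever producing an exact zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first two features actually matter in this synthetic example
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

for lam in [0.1, 10.0, 1000.0]:
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    n_zero = np.sum(beta == 0.0)
    print(f"lambda={lam:>7}: exact zeros = {n_zero}, "
          f"smallest |coef| = {np.min(np.abs(beta)):.2e}")
# Coefficients shrink toward zero as lambda grows, but none hit exactly zero.
```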
The SVD representation of Ridge provides deep insight into its shrinkage behavior. Let $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ with singular values $d_1, \ldots, d_p$.
The OLS estimator in the SVD basis: $$\boldsymbol{\hat\beta}_{OLS} = \mathbf{V}\mathbf{D}^{-1}\mathbf{U}^T\mathbf{y}$$
The Ridge estimator: $$\boldsymbol{\hat\beta}_{Ridge} = \mathbf{V}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}$$
We can write this as: $$\boldsymbol{\hat\beta}_{Ridge} = \mathbf{V}\mathbf{S}_{\lambda}\mathbf{D}^{-1}\mathbf{U}^T\mathbf{y}$$
where $\mathbf{S}_{\lambda}$ is a diagonal matrix with entries: $$s_j = \frac{d_j^2}{d_j^2 + \lambda}$$
The shrinkage factors $s_j = d_j^2/(d_j^2 + \lambda)$ are always in $(0, 1)$. They multiply the OLS solution in the SVD basis, shrinking each component toward zero. Components with small singular values (poorly determined by data) are shrunk more heavily.
| Singular Value $d_j$ | Shrinkage Factor $s_j$ | Effect |
|---|---|---|
| $d_j \gg \sqrt{\lambda}$ | ≈ 1 | Minimal shrinkage—data strongly determines this direction |
| $d_j \approx \sqrt{\lambda}$ | ≈ 0.5 | Moderate shrinkage—prior and data contribute equally |
| $d_j \ll \sqrt{\lambda}$ | ≈ 0 | Heavy shrinkage—prior dominates this poorly-determined direction |
The Bayesian Interpretation:
The shrinkage factor $s_j = d_j^2/(d_j^2 + \lambda)$ can be rewritten: $$s_j = \frac{1}{1 + \lambda/d_j^2}$$
This is a precision weight: in direction $j$, the data contribute precision $d_j^2/\sigma^2$ and the prior contributes precision $1/\tau^2 = \lambda/\sigma^2$, so $s_j$ is the fraction of the total precision supplied by the data. Directions well-estimated by the data (large $d_j^2$) get weights close to 1; poorly-estimated directions (small $d_j^2$) get weights close to 0.
This is Bayesian learning: we trust data more where it's informative and fall back on the prior where data is uninformative.
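The following sketch (arbitrary synthetic data) computes the shrinkage factors from the SVD, checks the precision-weight form $s_j = 1/(1 + \lambda/d_j^2)$, and verifies that applying them to the OLS solution in the SVD basis reproduces the closed-form Ridge estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 5
lam = 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

s = d**2 / (d**2 + lam)                            # shrinkage factors s_j
print(np.allclose(s, 1.0 / (1.0 + lam / d**2)))    # precision-weight form

beta_ols = V @ ((U.T @ y) / d)                     # V D^{-1} U^T y
beta_ridge_svd = V @ (s * (U.T @ y) / d)           # V S_lambda D^{-1} U^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_ridge_svd, beta_ridge))     # True
# Each SVD component of the OLS solution is multiplied by s_j in (0, 1).
```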
One of Ridge regression's most celebrated properties is its ability to handle multicollinearity—highly correlated features. The Bayesian perspective explains why.
The OLS Problem with Multicollinearity:
When features are nearly collinear, $\mathbf{X}^T\mathbf{X}$ is nearly singular:
If $\mathbf{X}^T\mathbf{X}$ has a small eigenvalue $\mu_{\min}$, the corresponding eigenvalue of $(\mathbf{X}^T\mathbf{X})^{-1}$ is $1/\mu_{\min}$, which can be enormous. This amplifies noise in the data into massive estimation variance.
How Ridge Solves This:
Ridge adds $\lambda\mathbf{I}$ to $\mathbf{X}^T\mathbf{X}$ before inverting: $$(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
Even if $\mathbf{X}^T\mathbf{X}$ has eigenvalue 0, $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ has eigenvalue $\lambda$.
The minimum eigenvalue is now at least $\lambda$, bounding the maximum eigenvalue of the inverse at $1/\lambda$.
The Bayesian Interpretation:
Multicollinearity means the data doesn't distinguish between certain linear combinations of coefficients. In Bayesian terms, the likelihood is flat in those directions—many coefficient configurations explain the data equally well.
The Gaussian prior isn't flat. It prefers smaller coefficients. So even where the data provides no information, the prior guides us toward the origin. The "arbitrary" choice among equivalent OLS solutions is resolved by the prior's preference for parsimony.
Quantitative Effect:
For an eigenvalue $\mu$ of $\mathbf{X}^T\mathbf{X}$:
OLS variance contribution: proportional to $1/\mu$ (explodes as $\mu \to 0$)
Ridge variance contribution: proportional to $1/(\mu + \lambda)$ (bounded by $1/\lambda$)
The regularization parameter $\lambda$ sets a floor on how much any direction can be amplified. This is precisely the regularizing effect of the prior: it contributes information that stabilizes inference in data-poor directions.
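A small demonstration with two nearly collinear features (the collinearity level and $\lambda$ are made-up values) shows the eigenvalue floor in action.

```python
import numpy as np

rng = np.random.default_rng(8)
n, lam = 200, 1.0
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)      # nearly collinear with x1
X = np.column_stack([x1, x2])

evals_ols = np.linalg.eigvalsh(X.T @ X)
evals_ridge = np.linalg.eigvalsh(X.T @ X + lam * np.eye(2))
print("eigenvalues of X^T X:        ", evals_ols)
print("eigenvalues of X^T X + lam I:", evals_ridge)

# Worst-case variance amplification is 1/mu_min; Ridge caps it at 1/(mu_min + lam)
print("OLS amplification:  ", 1.0 / evals_ols.min())     # huge
print("Ridge amplification:", 1.0 / evals_ridge.min())   # at most 1/lam
```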
The Bayesian derivation yields the same computational formula as the optimization approach. Let's examine the practical aspects.
The Closed-Form Solution:
$$\boldsymbol{\hat\beta}_{Ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$
Computational Approaches:
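In practice the closed form is evaluated by solving the linear system $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}$ rather than forming an explicit inverse. A minimal sketch, assuming SciPy is available for the Cholesky solve (the system matrix is symmetric positive definite whenever $\lambda > 0$):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve   # assumes SciPy is installed

rng = np.random.default_rng(9)
n, p, lam = 500, 20, 0.5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

A = X.T @ X + lam * np.eye(p)     # symmetric positive definite for lam > 0
b = X.T @ y

# Prefer a linear solve (here via Cholesky) over forming the explicit inverse
beta_chol = cho_solve(cho_factor(A), b)
beta_naive = np.linalg.inv(A) @ b

print(np.allclose(beta_chol, beta_naive))   # True, but the solve is cheaper and more stable
```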
The SVD Advantage:
Computing the SVD $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ upfront allows the Ridge solution to be recomputed cheaply for an entire grid of $\lambda$ values: only the scalar shrinkage factors $d_j^2/(d_j^2 + \lambda)$ change, while $\mathbf{U}$, $\mathbf{D}$, and $\mathbf{V}$ are reused.
For Full Bayesian Inference:
If we need the full posterior (not just the mean), we need the covariance $\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$.
Sampling from this Gaussian is straightforward—much simpler than MCMC approaches needed for non-Gaussian priors.
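Because the posterior is exactly Gaussian, drawing samples only requires a Cholesky factor of $\boldsymbol{\Sigma}_n$. A minimal sketch on synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 100, 4
sigma2, tau2 = 0.5, 2.0
lam = sigma2 / tau2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

A = X.T @ X + lam * np.eye(p)
mu_n = np.linalg.solve(A, X.T @ y)
Sigma_n = sigma2 * np.linalg.inv(A)

# Exact posterior draws: mu_n + L z, where L is the Cholesky factor of Sigma_n
L = np.linalg.cholesky(Sigma_n)
samples = mu_n + rng.standard_normal((5000, p)) @ L.T

print(samples.mean(axis=0))           # close to mu_n
print(np.cov(samples, rowvar=False))  # close to Sigma_n
```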
The Gaussian-Gaussian conjugate pair means we can compute the exact posterior analytically. No approximations, no MCMC, no variational inference. This computational tractability is a major advantage of the Gaussian prior / L2 regularization combination.
We've established the deep connection between Ridge regression and Bayesian inference with Gaussian priors.
What's Next:
In Page 3, we'll develop the parallel analysis for Lasso regression, showing how Laplace priors lead to L1 regularization. The mathematics will be more subtle—Laplace priors don't yield closed-form posteriors—but the conceptual connection remains equally profound. We'll see why Laplace priors produce sparsity where Gaussian priors cannot.
You now understand Ridge regression as the natural consequence of Bayesian inference with Gaussian priors. This perspective transforms regularization from a computational trick into a principled statement of prior belief. When you apply Ridge regression, you're asserting that you believe coefficients are normally distributed around zero—and letting the data update that belief.