Why should we prefer simpler hypotheses? From an optimization perspective, we add penalties because they reduce overfitting. But there is a deeper, more principled answer: simplicity reflects our prior beliefs about which models are more likely to be correct before seeing data.
This Bayesian perspective transforms regularization from a computational trick into a fundamental principle of probabilistic inference. The regularization term $\lambda\Omega(w)$ is not an arbitrary addition—it emerges naturally from prior probability distributions over the parameters. Understanding this connection unifies regularization with the broader framework of Bayesian inference and opens doors to powerful extensions.
By the end of this page, you will understand: (1) the Bayesian framework for learning, (2) how MAP estimation corresponds to regularized optimization, (3) which prior distributions correspond to which regularizers, (4) the interpretation of λ as a prior-to-likelihood ratio, (5) extensions to full Bayesian inference beyond MAP, and (6) practical implications for regularization design.
The connection between regularization and Bayesian inference was recognized early in statistics and machine learning. The Bayesian interpretation of ridge regression was known by the 1960s. This perspective became central to machine learning through work on Gaussian processes, Bayesian neural networks, and the Minimum Description Length principle.
Before connecting regularization to priors, let us establish the Bayesian framework for learning from data.
Bayes' theorem provides the fundamental equation for updating beliefs given evidence:
$$P(w | S) = \frac{P(S | w) \cdot P(w)}{P(S)}$$
where $P(w | S)$ is the posterior, $P(S | w)$ the likelihood, $P(w)$ the prior, and $P(S)$ the evidence. The Bayesian learning procedure then proceeds in four steps:
Step 1: Specify the prior $P(w)$
Before seeing any data, what do we believe about likely parameter values?
Step 2: Define the likelihood $P(S | w)$
Given parameters $w$, how likely is the observed data?
Step 3: Compute the posterior $P(w | S)$
Combine prior and likelihood via Bayes' theorem.
Step 4: Use posterior for prediction
For a new input $x^*$, the predictive distribution integrates over parameter uncertainty: $$P(y^* | x^*, S) = \int P(y^* | x^*, w) P(w | S) \, dw$$
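To make the four steps concrete, here is a minimal sketch for the simplest conjugate case: estimating a scalar mean under a Gaussian prior and Gaussian noise, where the posterior and the predictive distribution have closed forms. All numerical values below are illustrative assumptions.

```python
import numpy as np

# Steps 1-4 for a scalar mean w with prior N(0, tau^2) and noise N(0, sigma^2).
rng = np.random.default_rng(0)
tau2, sigma2 = 1.0, 0.25                               # prior variance, noise variance
y = 0.8 + np.sqrt(sigma2) * rng.standard_normal(20)    # observed data S

# Steps 1-3: Gaussian prior x Gaussian likelihood -> Gaussian posterior
post_var = 1.0 / (1.0 / tau2 + len(y) / sigma2)
post_mean = post_var * y.sum() / sigma2

# Step 4: the predictive distribution for a new observation integrates over
# parameter uncertainty; here it is Gaussian with inflated variance.
pred_mean, pred_var = post_mean, post_var + sigma2

print(f"posterior:  N({post_mean:.3f}, {post_var:.4f})")
print(f"predictive: N({pred_mean:.3f}, {pred_var:.4f})")
```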
In the frequentist view, parameters are fixed (unknown) constants. In the Bayesian view, parameters are random variables with probability distributions reflecting our uncertainty. The Bayesian framework naturally handles uncertainty quantification and regularization through the prior.
| Component | Symbol | Meaning | Role in Learning |
|---|---|---|---|
| Prior | $P(w)$ | Belief before data | Encodes inductive bias |
| Likelihood | $P(S|w)$ | Data probability given $w$ | Model of data generation |
| Posterior | $P(w|S)$ | Belief after data | Updated knowledge |
| Evidence | $P(S)$ | Total data probability | Model comparison |
| Predictive | $P(y^*|x^*,S)$ | Future prediction | Integrates uncertainty |
Full Bayesian inference maintains the entire posterior distribution over parameters. However, a simpler approach—Maximum A Posteriori (MAP) estimation—selects the single most probable parameter value.
MAP estimate: $$\hat{w}_{\text{MAP}} = \arg\max_w P(w | S) = \arg\max_w \frac{P(S | w) P(w)}{P(S)}$$
Since $P(S)$ doesn't depend on $w$, we can equivalently maximize: $$\hat{w}_{\text{MAP}} = \arg\max_w \left[ P(S | w) \cdot P(w) \right]$$
This can also be written: $$\hat{w}_{\text{MAP}} = \arg\max_w \left[ \log P(S | w) + \log P(w) \right]$$
Compare to Maximum Likelihood Estimation (MLE): $$\hat{w}_{\text{MLE}} = \arg\max_w P(S | w) = \arg\max_w \log P(S | w)$$
MLE ignores the prior entirely—it finds parameters that maximize data probability.
The relationship: $$\underbrace{\log P(w | S)}_{\text{Log posterior}} = \underbrace{\log P(S | w)}_{\text{Log likelihood}} + \underbrace{\log P(w)}_{\text{Log prior}} - \underbrace{\log P(S)}_{\text{constant}}$$
Key insight: MAP = MLE + prior regularization
The log prior acts as a penalty on parameters, exactly like regularization!
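A small numerical sketch of this identity, assuming a scalar mean with a zero-mean Gaussian prior and illustrative values: maximizing log-likelihood plus log-prior on a grid yields a MAP estimate that is the MLE shrunk toward the prior mean.

```python
import numpy as np

# MLE vs MAP for a scalar mean with a N(0, tau2) prior (illustrative values).
rng = np.random.default_rng(1)
sigma2, tau2 = 1.0, 0.5
y = 2.0 + np.sqrt(sigma2) * rng.standard_normal(10)

grid = np.linspace(-1.0, 4.0, 2001)
log_lik = -0.5 / sigma2 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)
log_prior = -0.5 / tau2 * grid ** 2

w_mle = grid[np.argmax(log_lik)]               # ignores the prior
w_map = grid[np.argmax(log_lik + log_prior)]   # log prior acts as a penalty

print(f"MLE ≈ {w_mle:.3f} (sample mean {y.mean():.3f})")
print(f"MAP ≈ {w_map:.3f} (shrunk toward the prior mean 0)")
```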
MAP estimation selects a single point estimate, discarding all posterior uncertainty. It does not provide error bars, does not naturally handle model uncertainty, and can give overconfident predictions. Full Bayesian inference maintains the entire posterior, but at higher computational cost.
Let's make the connection explicit. Define the loss as the negative log-likelihood, $\mathcal{L}(w; S) = -\log P(S | w)$, and the regularizer as the negative log-prior, $\Omega(w) = -\log P(w)$.
Then: $$\hat{w}_{\text{MAP}} = \arg\max_w [\log P(S|w) + \log P(w)]$$ $$= \arg\min_w [-\log P(S|w) - \log P(w)]$$ $$= \arg\min_w [\mathcal{L}(w; S) + \Omega(w)]$$
This is exactly regularized optimization!
The regularizer $\Omega(w)$ is the negative log of the prior distribution. Different priors correspond to different regularizers.
The most common regularizer, L2 (Ridge), corresponds to a Gaussian (Normal) prior on parameters.
Assume each parameter is drawn from a zero-mean Gaussian: $$w_j \sim \mathcal{N}(0, \tau^2)$$
For independent parameters: $$P(w) = \prod_{j=1}^d \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{w_j^2}{2\tau^2}\right) = \frac{1}{(2\pi\tau^2)^{d/2}} \exp\left(-\frac{\Vert w\Vert_2^2}{2\tau^2}\right)$$
The negative log-prior: $$-\log P(w) = \frac{\Vert w\Vert_2^2}{2\tau^2} + \text{const}$$
For regression with Gaussian noise: $$P(y | x, w) = \mathcal{N}(w^\top x, \sigma^2)$$
The negative log-likelihood for training set $S$: $$-\log P(S | w) = \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \text{const}$$
Combining with the Gaussian prior: $$-\log P(w | S) = \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \frac{1}{2\tau^2} \Vert w\Vert_2^2 + \text{const}$$
$$= \frac{1}{2\sigma^2} \left[ \sum_{i=1}^n (y_i - w^\top x_i)^2 + \frac{\sigma^2}{\tau^2} \Vert w\Vert_2^2 \right] + \text{const}$$
This is exactly Ridge regression with $\lambda = \sigma^2 / \tau^2$!
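A quick numerical check of this equivalence on synthetic data, with illustrative values for $\sigma^2$ and $\tau^2$: the closed-form Ridge solution with $\lambda = \sigma^2/\tau^2$ coincides with the Gaussian posterior mean, which is also the MAP estimate for this model.

```python
import numpy as np

# Ridge with lambda = sigma^2 / tau^2 vs the Bayesian posterior mean under a
# N(0, tau^2 I) prior and N(0, sigma^2) noise. Sizes and values are illustrative.
rng = np.random.default_rng(2)
n, d = 50, 5
sigma2, tau2 = 4.0, 0.5
X = rng.standard_normal((n, d))
w_true = rng.normal(0, np.sqrt(tau2), d)
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), n)

lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
w_map = post_cov @ (X.T @ y) / sigma2

print(np.allclose(w_ridge, w_map))   # True: the two estimators agree
```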
The regularization strength λ = σ²/τ² has a beautiful interpretation: it's the ratio of noise variance (σ²) to prior variance (τ²). High noise → believe prior more → more regularization. Tight prior (small τ²) → strong regularization. This explains why λ should be larger for noisier data or when we have strong prior beliefs about small weights.
| Parameter | Symbol | Effect | Regularization Impact |
|---|---|---|---|
| Prior mean | $\mu = 0$ | Expected parameter value | Shrinkage target |
| Prior variance | $\tau^2$ | Expected parameter magnitude | $\lambda \propto 1/\tau^2$ |
| Noise variance | $\sigma^2$ | Data uncertainty | $\lambda \propto \sigma^2$ |
| Dimension | $d$ | Number of parameters | Affects posterior shape |
1. Symmetry: Zero-mean implies no preference for positive or negative weights.
2. Shrinkage toward zero: All weights are pulled toward the prior mean (zero).
3. No exact zeros: the Gaussian density places no special emphasis on zero, so weights become very small under L2 shrinkage but are not driven exactly to zero.
4. Scale sensitivity: The prior variance $\tau^2$ controls expected weight magnitude. Small $\tau^2$ strongly favors small weights.
5. Conjugacy: Gaussian prior + Gaussian likelihood = Gaussian posterior. This enables closed-form Bayesian inference for linear models.
L1 regularization (Lasso) corresponds to a Laplace (double-exponential) prior.
The Laplace distribution centered at zero: $$P(w_j) = \frac{1}{2b} \exp\left(-\frac{|w_j|}{b}\right)$$
For independent parameters: $$P(w) = \prod_{j=1}^d \frac{1}{2b} \exp\left(-\frac{|w_j|}{b}\right) = \frac{1}{(2b)^d} \exp\left(-\frac{\Vert w\Vert_1}{b}\right)$$
The negative log-prior: $$-\log P(w) = \frac{\Vert w\Vert_1}{b} + \text{const}$$
This gives L1 regularization with $\lambda \propto 1/b$.
The key difference is the shape of the distributions:
Gaussian (L2): a smooth, rounded peak at zero with light tails; the negative log-density is quadratic, so its gradient vanishes as a weight approaches zero.
Laplace (L1): a sharp peak (cusp) at zero with heavier tails; the negative log-density is linear in $|w_j|$, so its gradient keeps a constant magnitude all the way down to zero.
The cusp at zero is the key: the Laplace density has a sharp peak exactly at zero, making zero a natural value for parameters.
The negative log-density of the Laplace prior has a cusp at zero: $-\log P(w_j) \propto |w_j|$, whose gradient has constant magnitude $1/b$ no matter how small $w_j$ gets. This non-vanishing pull toward zero means that unless the likelihood strongly supports a non-zero value, the MAP estimate will be exactly zero. The Gaussian penalty's gradient shrinks to zero as $w_j \to 0$, so there is no special pull toward sparsity.
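The difference already shows up in the simplest one-parameter problem. For the quadratic loss $\tfrac{1}{2}(z - w)^2$, the L1 MAP estimate is the soft-thresholding of $z$ and the L2 MAP estimate is a proportional shrinkage of $z$; the sketch below, with an illustrative $\lambda$, makes the exact zeros visible.

```python
import numpy as np

# One-parameter MAP estimates under loss 0.5*(z - w)^2 (closed forms):
#   L1 penalty lam*|w|       -> w = sign(z) * max(|z| - lam, 0)   (soft threshold)
#   L2 penalty 0.5*lam*w^2   -> w = z / (1 + lam)                 (pure shrinkage)
lam = 1.0
z = np.array([-2.0, -0.5, 0.3, 1.5, 3.0])   # unregularized estimates

w_l1 = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # exact zeros where |z| <= lam
w_l2 = z / (1.0 + lam)                                  # small but never exactly zero

print("L1 (Laplace prior) MAP:", w_l1)
print("L2 (Gaussian prior) MAP:", w_l2)
```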
The Gaussian-L2 and Laplace-L1 connections are just two examples. Many regularizers have natural Bayesian interpretations.
Elastic Net combines L1 and L2: $$\Omega(w) = \alpha \Vert w\Vert_1 + (1-\alpha) \Vert w\Vert_2^2$$
This corresponds to a prior whose negative log-density is a weighted sum of a Laplace term and a Gaussian term (the prior density is the product of the two kernels, not a probabilistic mixture). Bayesian elastic net formulations also give it hierarchical, scale-mixture representations.
Properties: it produces sparse solutions like L1, while the L2 component stabilizes the estimates when features are strongly correlated, tending to keep correlated features together (the grouping effect) rather than arbitrarily selecting one. A usage sketch follows below.
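A hedged usage sketch with scikit-learn's `ElasticNet`, assuming it is available; note that scikit-learn parameterizes the penalty with an overall strength `alpha` and a mixing weight `l1_ratio`, which differs slightly from the $\alpha$/$(1-\alpha)$ weighting above, but the qualitative behavior is the same.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Sparse ground truth; the combined L1+L2 penalty recovers a sparse estimate.
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                  # only the first 3 features matter
y = X @ w_true + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```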
| Prior Distribution | Regularizer Ω(w) | Key Property |
|---|---|---|
| Gaussian $\mathcal{N}(0, \tau^2 I)$ | L2: $\Vert w\Vert_2^2$ | Uniform shrinkage |
| Laplace | L1: $\Vert w\Vert_1$ | Sparsity |
| Student-t | Log-L2 variants | Heavy-tailed, robust |
| Horseshoe | Adaptive shrinkage | Strong sparsity + large signals |
| Spike-and-Slab | Best subset (NP-hard) | Exact sparsity pattern |
| Uniform (bounded) | Box constraints | Hard parameter limits |
| Wishart (covariance) | Nuclear norm | Low-rank matrices |
More sophisticated priors use hierarchical (multi-level) structures:
Example: Automatic Relevance Determination (ARD) $$w_j \sim \mathcal{N}(0, \alpha_j^{-1})$$ $$\alpha_j \sim \text{Gamma}(a, b)$$
Each feature has its own precision $\alpha_j$, which is itself random. If $\alpha_j \to \infty$, the corresponding $w_j$ is driven to zero (feature eliminated).
Effect: Automatic feature selection through type-II maximum likelihood: the precisions $\{\alpha_j\}$ are chosen to maximize the marginal likelihood (with $w$ integrated out), and features whose $\alpha_j$ diverges are pruned.
Used in: Sparse Bayesian Learning (Relevance Vector Machines).
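A hedged sketch using scikit-learn's `ARDRegression`, assuming it is available: each weight is assigned its own precision, and features whose precision grows very large end up with coefficients near zero.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Synthetic data where only 2 of 10 features are relevant (illustrative).
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[[0, 3]] = [1.5, -2.0]
y = X @ w_true + 0.1 * rng.standard_normal(200)

ard = ARDRegression().fit(X, y)
print("learned weights:", np.round(ard.coef_, 2))
print("per-weight precisions (large => pruned):", np.round(ard.lambda_, 1))
```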
A modern sparsity-inducing prior with excellent theoretical and practical properties:
$$w_j | \lambda_j, \tau \sim \mathcal{N}(0, \lambda_j^2 \tau^2)$$ $$\lambda_j \sim \text{Cauchy}^+(0, 1)$$
where $\text{Cauchy}^+$ is the half-Cauchy distribution.
Key property: The marginal prior on $w_j$ has a sharp peak at zero AND heavy tails.
The Horseshoe is increasingly used in high-dimensional Bayesian regression.
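A small Monte Carlo sketch of the horseshoe's two signatures, a heavy concentration near zero together with heavy tails, obtained by sampling from the hierarchy above with an illustrative global scale $\tau = 1$.

```python
import numpy as np

# Sample from the horseshoe prior: w_j | lam_j ~ N(0, lam_j^2 * tau^2),
# lam_j ~ half-Cauchy(0, 1). tau is an illustrative choice.
rng = np.random.default_rng(5)
tau = 1.0
lam = np.abs(rng.standard_cauchy(100_000))     # half-Cauchy local scales
w = rng.normal(0.0, lam * tau)                 # marginal draws of w_j

print("fraction with |w| < 0.1:", np.mean(np.abs(w) < 0.1))        # spike at zero
print("99.9th percentile of |w|:", np.quantile(np.abs(w), 0.999))  # heavy tail
```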
Prior choice should reflect domain knowledge: Use Gaussian (L2) when you expect all features to contribute somewhat. Use Laplace (L1) when you expect sparsity. Use hierarchical priors when effect sizes vary widely. Use structured priors (Group Lasso) when features have known structure.
MAP estimation gives a point estimate — the mode of the posterior. Full Bayesian inference goes further, using the entire posterior distribution.
Instead of finding $\hat{w}_{\text{MAP}} = \arg\max P(w|S)$, maintain $P(w|S)$ itself.
Advantages: honest uncertainty estimates (error bars on parameters and predictions), predictions that average over many plausible parameter settings, principled model comparison via the evidence $P(S)$, and better behavior when data is scarce.
Disadvantages: the posterior is usually intractable and must be approximated, computation is far more expensive than a single optimization, and results can be sensitive to the choice of prior.
Given posterior $P(w|S)$, predictions integrate over parameter uncertainty:
$$P(y^* | x^*, S) = \int P(y^* | x^*, w) P(w | S) \, dw$$
This Bayesian Model Averaging automatically accounts for uncertainty.
Contrast with MAP/MLE: $$P(y^* | x^*, \hat{w}) = \text{point prediction, no uncertainty}$$
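A minimal sketch of this contrast for Bayesian linear regression with a Gaussian prior, where the posterior is exactly Gaussian: sampling parameters from the posterior and averaging predictions yields both a predictive mean and a predictive spread, whereas the MAP estimate gives only a point prediction. All sizes and values are illustrative.

```python
import numpy as np

# Monte Carlo posterior predictive for Bayesian linear regression.
rng = np.random.default_rng(6)
n, d = 30, 3
sigma2, tau2 = 1.0, 1.0
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(0, np.sqrt(sigma2), n)

post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
post_mean = post_cov @ (X.T @ y) / sigma2        # equals the MAP estimate here

x_new = np.array([0.5, -1.0, 2.0])
w_samples = rng.multivariate_normal(post_mean, post_cov, size=5000)
pred_samples = w_samples @ x_new + rng.normal(0, np.sqrt(sigma2), 5000)

print("MAP point prediction:        ", x_new @ post_mean)
print("Bayesian predictive mean/std:", pred_samples.mean(), pred_samples.std())
```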
Practical computation: when the exact posterior is intractable, common approximation methods include:
1. Markov Chain Monte Carlo (MCMC): draw samples from the posterior and average predictions over them.
2. Variational Inference: fit a tractable family of distributions (for example, a factorized Gaussian) to the posterior by optimization.
3. Laplace Approximation: approximate the posterior with a Gaussian centered at the MAP estimate, with covariance given by the local curvature there (see the sketch below).
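As an illustration of the third option, here is a sketch of the Laplace approximation for a one-parameter logistic regression with a $\mathcal{N}(0, \tau^2)$ prior, on synthetic data: Newton's method finds the MAP, and the inverse curvature of the negative log-posterior at that point gives the approximate posterior variance.

```python
import numpy as np

# Laplace approximation for 1-parameter logistic regression with a N(0, tau2) prior.
rng = np.random.default_rng(7)
tau2 = 4.0
x = rng.standard_normal(100)
ylab = np.where(x + 0.5 * rng.standard_normal(100) > 0, 1.0, -1.0)   # labels in {-1,+1}

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = 0.0
for _ in range(25):                                   # Newton iterations to the MAP
    p = sigmoid(ylab * x * w)                         # P(correct label | w)
    grad = -(ylab * x * (1.0 - p)).sum() + w / tau2   # d/dw of -log posterior
    hess = (x**2 * p * (1.0 - p)).sum() + 1.0 / tau2  # curvature of -log posterior
    w -= grad / hess

print(f"MAP estimate: {w:.3f}")
print(f"Laplace approximation: N({w:.3f}, {1.0 / hess:.4f})")
```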
MAP estimation (regularized optimization) can be viewed as the cheapest form of Bayesian inference — it gives one summary of the posterior (the mode) at the cost of a single optimization. Full Bayesian inference is more principled but more expensive. The choice depends on computational budget and need for uncertainty quantification.
The Bayesian perspective on regularization offers several practical insights and design principles.
Instead of tuning $\lambda$ arbitrarily, design it from prior beliefs:
Signal-to-noise reasoning: $$\lambda = \frac{\sigma^2}{\tau^2}$$
Example: If you expect coefficients around ±1 and noise standard deviation around 10: $$\tau^2 \approx 1, \quad \sigma^2 \approx 100, \quad \lambda \approx 100$$
Encode domain knowledge through prior design:
Non-zero prior mean: $$w \sim \mathcal{N}(\mu, \tau^2 I)$$
If you expect coefficient $w_j$ to be around 2, use $\mu_j = 2$ rather than 0. Regularization then shrinks toward $\mu$, not toward zero.
Heterogeneous variances: $$w_j \sim \mathcal{N}(0, \tau_j^2)$$
If some features are more important a priori, give them larger $\tau_j^2$ (less shrinkage).
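A sketch combining the two ideas above: shrink toward a non-zero prior mean $\mu$ and allow a different strength $\lambda_j$ per coefficient. Minimizing $\Vert y - Xw\Vert_2^2 + (w - \mu)^\top D (w - \mu)$ with $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ has the closed form $w = (X^\top X + D)^{-1}(X^\top y + D\mu)$; all numbers below are illustrative.

```python
import numpy as np

# Generalized ridge: shrink toward mu with per-coefficient strengths in D.
rng = np.random.default_rng(8)
X = rng.standard_normal((40, 2))
y = X @ np.array([2.2, 0.3]) + 0.5 * rng.standard_normal(40)

mu = np.array([2.0, 0.0])          # prior belief: w_1 near 2, w_2 near 0
D = np.diag([5.0, 50.0])           # shrink w_2 much harder (smaller tau_2^2)
w = np.linalg.solve(X.T @ X + D, X.T @ y + D @ mu)
print("shrunk estimate:", np.round(w, 3))
```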
Cross-validation for selecting $\lambda$ can be viewed as empirical Bayes:
Full Bayes: place a prior on $\lambda$ (equivalently, on $\tau^2$) and integrate it out.
Type-II Maximum Likelihood (Empirical Bayes): $$\hat{\lambda} = \arg\max_\lambda P(S | \lambda) = \arg\max_\lambda \int P(S | w) P(w | \lambda) dw$$
Cross-validation approximation: $$\hat{\lambda} \approx \arg\min_\lambda \text{CV-Error}(\lambda)$$
Cross-validation is often more robust than the marginal likelihood when the model is misspecified.
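A hedged sketch contrasting the two routes in scikit-learn, assuming `BayesianRidge` (evidence maximization) and `RidgeCV` (cross-validation) are available; in scikit-learn's notation `alpha_` is the noise precision and `lambda_` the weight precision, so the implied ridge strength is `lambda_/alpha_`, matching $\lambda = \sigma^2/\tau^2$. The data and grid below are illustrative.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, RidgeCV

rng = np.random.default_rng(9)
X = rng.standard_normal((200, 8))
y = X @ rng.normal(0, 0.5, 8) + rng.normal(0, 2.0, 200)

# Type-II maximum likelihood (empirical Bayes): hyperparameters fit to the evidence.
eb = BayesianRidge().fit(X, y)
print("empirical-Bayes lambda:", eb.lambda_ / eb.alpha_)

# Cross-validation: pick lambda by held-out error over a grid.
cv = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print("cross-validated lambda:", cv.alpha_)
```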
Use full Bayesian inference when: (1) uncertainty quantification is critical (medical diagnosis, safety systems), (2) data is scarce and prior information is valuable, (3) model comparison is needed, or (4) hierarchical/pooling structure benefits from sharing. Use MAP/regularization when computational efficiency dominates and point predictions suffice.
We have established the deep connection between regularization and Bayesian inference. Let us consolidate the key insights: MAP estimation is exactly regularized optimization, with the regularizer equal to the negative log-prior; a Gaussian prior yields L2 (Ridge) and a Laplace prior yields L1 (Lasso); the regularization strength $\lambda = \sigma^2/\tau^2$ is the ratio of noise variance to prior variance; and full Bayesian inference goes beyond MAP by retaining the entire posterior for uncertainty-aware prediction.
We've now seen regularization from both optimization (constraint) and probabilistic (prior) perspectives. The next page examines Effects on Generalization — how regularization provably improves generalization bounds, tightening the gap between training and test performance. This connects the intuitive ideas of bias-variance and prior-likelihood to rigorous learning-theoretic guarantees.
You now understand regularization from the Bayesian perspective: how regularizers correspond to prior distributions, the equivalence of MAP and regularized optimization, and the rich landscape of prior choices. This probabilistic view complements the optimization perspective and connects regularization to the broader Bayesian framework.