Why should we prefer simpler hypotheses? From an optimization perspective, we add penalties because they reduce overfitting. But there is a deeper, more principled answer: simplicity reflects our prior beliefs about which models are more likely to be correct before seeing data.
This Bayesian perspective transforms regularization from a computational trick into a fundamental principle of probabilistic inference. The regularization term $\lambda\Omega(w)$ is not an arbitrary addition—it emerges naturally from prior probability distributions over the parameters. Understanding this connection unifies regularization with the broader framework of Bayesian inference and opens doors to powerful extensions.
By the end of this page, you will understand: (1) the Bayesian framework for learning, (2) how MAP estimation corresponds to regularized optimization, (3) which prior distributions correspond to which regularizers, (4) the interpretation of λ as a prior-to-likelihood ratio, (5) extensions to full Bayesian inference beyond MAP, and (6) practical implications for regularization design.
The connection between regularization and Bayesian inference was recognized early in statistics and machine learning. The Bayesian interpretation of ridge regression was known by the 1960s. This perspective became central to machine learning through work on Gaussian processes, Bayesian neural networks, and the Minimum Description Length principle.
Before connecting regularization to priors, let us establish the Bayesian framework for learning from data.
Bayes' theorem provides the fundamental equation for updating beliefs given evidence:
$$P(w | S) = \frac{P(S | w) \cdot P(w)}{P(S)}$$
where $P(w | S)$ is the posterior, $P(S | w)$ the likelihood, $P(w)$ the prior, and $P(S)$ the evidence. The Bayesian learning procedure then proceeds in four steps:
Step 1: Specify the prior $P(w)$
Before seeing any data, what do we believe about likely parameter values?
Step 2: Define the likelihood $P(S | w)$
Given parameters $w$, how likely is the observed data?
Step 3: Compute the posterior $P(w | S)$
Combine prior and likelihood via Bayes' theorem.
Step 4: Use posterior for prediction
For a new input $x^*$, the predictive distribution integrates over parameter uncertainty: $$P(y^* | x^*, S) = \int P(y^* | x^*, w) P(w | S) \, dw$$
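To make the four steps concrete, here is a minimal sketch for the simplest conjugate case: estimating a scalar mean under a Gaussian prior and Gaussian noise, where the posterior and the predictive distribution have closed forms. All numerical values below are illustrative assumptions.

```python
import numpy as np

# Steps 1-4 for a scalar mean w with prior N(0, tau^2) and noise N(0, sigma^2).
rng = np.random.default_rng(0)
tau2, sigma2 = 1.0, 0.25                               # prior variance, noise variance
y = 0.8 + np.sqrt(sigma2) * rng.standard_normal(20)    # observed data S

# Steps 1-3: Gaussian prior x Gaussian likelihood -> Gaussian posterior
post_var = 1.0 / (1.0 / tau2 + len(y) / sigma2)
post_mean = post_var * y.sum() / sigma2

# Step 4: the predictive distribution for a new observation integrates over
# parameter uncertainty; here it is Gaussian with inflated variance.
pred_mean, pred_var = post_mean, post_var + sigma2

print(f"posterior:  N({post_mean:.3f}, {post_var:.4f})")
print(f"predictive: N({pred_mean:.3f}, {pred_var:.4f})")
```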
In the frequentist view, parameters are fixed (unknown) constants. In the Bayesian view, parameters are random variables with probability distributions reflecting our uncertainty. The Bayesian framework naturally handles uncertainty quantification and regularization through the prior.
| Component | Symbol | Meaning | Role in Learning |
|---|---|---|---|
| Prior | $P(w)$ | Belief before data | Encodes inductive bias |
| Likelihood | $P(S|w)$ | Data probability given $w$ | Model of data generation |
| Posterior | $P(w|S)$ | Belief after data | Updated knowledge |
| Evidence | $P(S)$ | Total data probability | Model comparison |
| Predictive | $P(y^*|x^*,S)$ | Future prediction | Integrates uncertainty |
Full Bayesian inference maintains the entire posterior distribution over parameters. However, a simpler approach—Maximum A Posteriori (MAP) estimation—selects the single most probable parameter value.
MAP estimate: $$\hat{w}_{\text{MAP}} = \arg\max_w P(w | S) = \arg\max_w \frac{P(S | w) P(w)}{P(S)}$$
Since $P(S)$ doesn't depend on $w$, we can equivalently maximize: $$\hat{w}_{\text{MAP}} = \arg\max_w \left[ P(S | w) \cdot P(w) \right]$$
This can also be written: $$\hat{w}_{\text{MAP}} = \arg\max_w \left[ \log P(S | w) + \log P(w) \right]$$
Compare to Maximum Likelihood Estimation (MLE): $$\hat{w}_{\text{MLE}} = \arg\max_w P(S | w) = \arg\max_w \log P(S | w)$$
MLE ignores the prior entirely—it finds parameters that maximize data probability.
The relationship: $$\underbrace{\log P(w | S)}_{\text{Log posterior}} = \underbrace{\log P(S | w)}_{\text{Log likelihood}} + \underbrace{\log P(w)}_{\text{Log prior}} - \underbrace{\log P(S)}_{\text{constant}}$$
Key insight: MAP = MLE + prior regularization
The log prior acts as a penalty on parameters, exactly like regularization!
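A small numerical sketch of this identity, assuming a scalar mean with a zero-mean Gaussian prior and illustrative values: maximizing log-likelihood plus log-prior on a grid yields a MAP estimate that is the MLE shrunk toward the prior mean.

```python
import numpy as np

# MLE vs MAP for a scalar mean with a N(0, tau2) prior (illustrative values).
rng = np.random.default_rng(1)
sigma2, tau2 = 1.0, 0.5
y = 2.0 + np.sqrt(sigma2) * rng.standard_normal(10)

grid = np.linspace(-1.0, 4.0, 2001)
log_lik = -0.5 / sigma2 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)
log_prior = -0.5 / tau2 * grid ** 2

w_mle = grid[np.argmax(log_lik)]               # ignores the prior
w_map = grid[np.argmax(log_lik + log_prior)]   # log prior acts as a penalty

print(f"MLE ≈ {w_mle:.3f} (sample mean {y.mean():.3f})")
print(f"MAP ≈ {w_map:.3f} (shrunk toward the prior mean 0)")
```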
MAP estimation selects a single point estimate, discarding all posterior uncertainty. It does not provide error bars, does not naturally handle model uncertainty, and can give overconfident predictions. Full Bayesian inference maintains the entire posterior, but at higher computational cost.
Let's make the connection explicit. Define the loss as the negative log-likelihood, $\mathcal{L}(w; S) = -\log P(S | w)$, and the regularizer as the negative log-prior, $\Omega(w) = -\log P(w)$.
Then: $$\hat{w}_{\text{MAP}} = \arg\max_w [\log P(S|w) + \log P(w)]$$ $$= \arg\min_w [-\log P(S|w) - \log P(w)]$$ $$= \arg\min_w [\mathcal{L}(w; S) + \Omega(w)]$$
This is exactly regularized optimization!
The regularizer $\Omega(w)$ is the negative log of the prior distribution. Different priors correspond to different regularizers.
The most common regularizer, L2 (Ridge), corresponds to a Gaussian (Normal) prior on parameters.
Assume each parameter is drawn from a zero-mean Gaussian: $$w_j \sim \mathcal{N}(0, \tau^2)$$
For independent parameters: $$P(w) = \prod_{j=1}^d \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{w_j^2}{2\tau^2}\right) = \frac{1}{(2\pi\tau^2)^{d/2}} \exp\left(-\frac{\Vert w\Vert_2^2}{2\tau^2}\right)$$
The negative log-prior: $$-\log P(w) = \frac{\Vert w\Vert_2^2}{2\tau^2} + \text{const}$$
For regression with Gaussian noise: $$P(y | x, w) = \mathcal{N}(w^\top x, \sigma^2)$$
The negative log-likelihood for training set $S$: $$-\log P(S | w) = \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \text{const}$$
Combining with the Gaussian prior: $$-\log P(w | S) = \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - w^\top x_i)^2 + \frac{1}{2\tau^2} \Vert w\Vert_2^2 + \text{const}$$
$$= \frac{1}{2\sigma^2} \left[ \sum_{i=1}^n (y_i - w^\top x_i)^2 + \frac{\sigma^2}{\tau^2} \Vert w\Vert_2^2 \right] + \text{const}$$
This is exactly Ridge regression with $\lambda = \sigma^2 / \tau^2$!
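A quick numerical check of this equivalence on synthetic data, with illustrative values for $\sigma^2$ and $\tau^2$: the closed-form Ridge solution with $\lambda = \sigma^2/\tau^2$ coincides with the Gaussian posterior mean, which is also the MAP estimate for this model.

```python
import numpy as np

# Ridge with lambda = sigma^2 / tau^2 vs the Bayesian posterior mean under a
# N(0, tau^2 I) prior and N(0, sigma^2) noise. Sizes and values are illustrative.
rng = np.random.default_rng(2)
n, d = 50, 5
sigma2, tau2 = 4.0, 0.5
X = rng.standard_normal((n, d))
w_true = rng.normal(0, np.sqrt(tau2), d)
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), n)

lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
w_map = post_cov @ (X.T @ y) / sigma2

print(np.allclose(w_ridge, w_map))   # True: the two estimators agree
```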
The regularization strength λ = σ²/τ² has a beautiful interpretation: it's the ratio of noise variance (σ²) to prior variance (τ²). High noise → believe prior more → more regularization. Tight prior (small τ²) → strong regularization. This explains why λ should be larger for noisier data or when we have strong prior beliefs about small weights.
| Parameter | Symbol | Effect | Regularization Impact |
|---|---|---|---|
| Prior mean | $\mu = 0$ | Expected parameter value | Shrinkage target |
| Prior variance | $\tau^2$ | Expected parameter magnitude | $\lambda \propto 1/\tau^2$ |
| Noise variance | $\sigma^2$ | Data uncertainty | $\lambda \propto \sigma^2$ |
| Dimension | $d$ | Number of parameters | Affects posterior shape |
1. Symmetry: Zero-mean implies no preference for positive or negative weights.
2. Shrinkage toward zero: All weights are pulled toward the prior mean (zero).
3. No exact zeros: the Gaussian density places no special emphasis on zero, so weights become very small under L2 shrinkage but are not driven exactly to zero.
4. Scale sensitivity: The prior variance $\tau^2$ controls expected weight magnitude. Small $\tau^2$ strongly favors small weights.
5. Conjugacy: Gaussian prior + Gaussian likelihood = Gaussian posterior. This enables closed-form Bayesian inference for linear models.
L1 regularization (Lasso) corresponds to a Laplace (double-exponential) prior.
The Laplace distribution centered at zero: $$P(w_j) = \frac{1}{2b} \exp\left(-\frac{|w_j|}{b}\right)$$
For independent parameters: $$P(w) = \prod_{j=1}^d \frac{1}{2b} \exp\left(-\frac{|w_j|}{b}\right) = \frac{1}{(2b)^d} \exp\left(-\frac{\Vert w\Vert_1}{b}\right)$$
The negative log-prior: $$-\log P(w) = \frac{\Vert w\Vert_1}{b} + \text{const}$$
This gives L1 regularization with $\lambda \propto 1/b$.
The key difference is the shape of the distributions:
Gaussian (L2): a smooth, rounded peak at zero with light tails; the negative log-density is quadratic, so its gradient vanishes as a weight approaches zero.
Laplace (L1): a sharp peak (cusp) at zero with heavier tails; the negative log-density is linear in $|w_j|$, so its gradient keeps a constant magnitude all the way down to zero.
The cusp at zero is the key: the Laplace density has a sharp peak exactly at zero, making zero a natural value for parameters.
The negative log-density of the Laplace prior has a cusp at zero: $-\log P(w_j) \propto |w_j|$, whose gradient has constant magnitude $1/b$ no matter how small $w_j$ gets. This non-vanishing pull toward zero means that unless the likelihood strongly supports a non-zero value, the MAP estimate will be exactly zero. The Gaussian penalty's gradient shrinks to zero as $w_j \to 0$, so there is no special pull toward sparsity.
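The difference already shows up in the simplest one-parameter problem. For the quadratic loss $\tfrac{1}{2}(z - w)^2$, the L1 MAP estimate is the soft-thresholding of $z$ and the L2 MAP estimate is a proportional shrinkage of $z$; the sketch below, with an illustrative $\lambda$, makes the exact zeros visible.

```python
import numpy as np

# One-parameter MAP estimates under loss 0.5*(z - w)^2 (closed forms):
#   L1 penalty lam*|w|       -> w = sign(z) * max(|z| - lam, 0)   (soft threshold)
#   L2 penalty 0.5*lam*w^2   -> w = z / (1 + lam)                 (pure shrinkage)
lam = 1.0
z = np.array([-2.0, -0.5, 0.3, 1.5, 3.0])   # unregularized estimates

w_l1 = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # exact zeros where |z| <= lam
w_l2 = z / (1.0 + lam)                                  # small but never exactly zero

print("L1 (Laplace prior) MAP:", w_l1)
print("L2 (Gaussian prior) MAP:", w_l2)
```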
The Gaussian-L2 and Laplace-L1 connections are just two examples. Many regularizers have natural Bayesian interpretations.
Elastic Net combines L1 and L2: $$\Omega(w) = \alpha \Vert w\Vert_1 + (1-\alpha) \Vert w\Vert_2^2$$
This corresponds to a prior whose negative log-density is a weighted sum of a Laplace term and a Gaussian term (the prior density is the product of the two kernels, not a probabilistic mixture). Bayesian elastic net formulations also give it hierarchical, scale-mixture representations.
Properties: it produces sparse solutions like L1, while the L2 component stabilizes the estimates when features are strongly correlated, tending to keep correlated features together (the grouping effect) rather than arbitrarily selecting one. A usage sketch follows below.
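A hedged usage sketch with scikit-learn's `ElasticNet`, assuming it is available; note that scikit-learn parameterizes the penalty with an overall strength `alpha` and a mixing weight `l1_ratio`, which differs slightly from the $\alpha$/$(1-\alpha)$ weighting above, but the qualitative behavior is the same.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Sparse ground truth; the combined L1+L2 penalty recovers a sparse estimate.
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                  # only the first 3 features matter
y = X @ w_true + 0.1 * rng.standard_normal(100)

model = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
```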
| Prior Distribution | Regularizer Ω(w) | Key Property |
|---|---|---|
| Gaussian $\mathcal{N}(0, \tau^2 I)$ | L2: $\Vert w\Vert_2^2$ | Uniform shrinkage |
| Laplace | L1: $\Vert w\Vert_1$ | Sparsity |
| Student-t | Log-L2 variants | Heavy-tailed, robust |
| Horseshoe | Adaptive shrinkage | Strong sparsity + large signals |
| Spike-and-Slab | Best subset (NP-hard) | Exact sparsity pattern |
| Uniform (bounded) | Box constraints | Hard parameter limits |
| Wishart (covariance) | Nuclear norm | Low-rank matrices |
More sophisticated priors use hierarchical (multi-level) structures:
Example: Automatic Relevance Determination (ARD) $$w_j \sim \mathcal{N}(0, \alpha_j^{-1})$$ $$\alpha_j \sim \text{Gamma}(a, b)$$
Each feature has its own precision $\alpha_j$, which is itself random. If $\alpha_j \to \infty$, the corresponding $w_j$ is driven to zero (feature eliminated).
Effect: Automatic feature selection through type-II maximum likelihood: the precisions $\{\alpha_j\}$ are chosen to maximize the marginal likelihood (with $w$ integrated out), and features whose $\alpha_j$ diverges are pruned.
Used in: Sparse Bayesian Learning (Relevance Vector Machines).
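A hedged sketch using scikit-learn's `ARDRegression`, assuming it is available: each weight is assigned its own precision, and features whose precision grows very large end up with coefficients near zero.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Synthetic data where only 2 of 10 features are relevant (illustrative).
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[[0, 3]] = [1.5, -2.0]
y = X @ w_true + 0.1 * rng.standard_normal(200)

ard = ARDRegression().fit(X, y)
print("learned weights:", np.round(ard.coef_, 2))
print("per-weight precisions (large => pruned):", np.round(ard.lambda_, 1))
```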
A modern sparsity-inducing prior with excellent theoretical and practical properties:
$$w_j | \lambda_j, \tau \sim \mathcal{N}(0, \lambda_j^2 \tau^2)$$ $$\lambda_j \sim \text{Cauchy}^+(0, 1)$$
where $\text{Cauchy}^+$ is the half-Cauchy distribution.
Key property: The marginal prior on $w_j$ has a sharp peak at zero AND heavy tails.
The Horseshoe is increasingly used in high-dimensional Bayesian regression.
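A small Monte Carlo sketch of the horseshoe's two signatures, a heavy concentration near zero together with heavy tails, obtained by sampling from the hierarchy above with an illustrative global scale $\tau = 1$.

```python
import numpy as np

# Sample from the horseshoe prior: w_j | lam_j ~ N(0, lam_j^2 * tau^2),
# lam_j ~ half-Cauchy(0, 1). tau is an illustrative choice.
rng = np.random.default_rng(5)
tau = 1.0
lam = np.abs(rng.standard_cauchy(100_000))     # half-Cauchy local scales
w = rng.normal(0.0, lam * tau)                 # marginal draws of w_j

print("fraction with |w| < 0.1:", np.mean(np.abs(w) < 0.1))        # spike at zero
print("99.9th percentile of |w|:", np.quantile(np.abs(w), 0.999))  # heavy tail
```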
Prior choice should reflect domain knowledge: Use Gaussian (L2) when you expect all features to contribute somewhat. Use Laplace (L1) when you expect sparsity. Use hierarchical priors when effect sizes vary widely. Use structured priors (Group Lasso) when features have known structure.
MAP estimation gives a point estimate — the mode of the posterior. Full Bayesian inference goes further, using the entire posterior distribution.
Instead of finding $\hat{w}_{\text{MAP}} = \arg\max P(w|S)$, maintain $P(w|S)$ itself.
Advantages: honest uncertainty estimates (error bars on parameters and predictions), predictions that average over many plausible parameter settings, principled model comparison via the evidence $P(S)$, and better behavior when data is scarce.
Disadvantages: the posterior is usually intractable and must be approximated, computation is far more expensive than a single optimization, and results can be sensitive to the choice of prior.
Given posterior $P(w|S)$, predictions integrate over parameter uncertainty:
$$P(y^* | x^*, S) = \int P(y^* | x^*, w) P(w | S) \, dw$$
This Bayesian Model Averaging automatically accounts for uncertainty.
Contrast with MAP/MLE: $$P(y^* | x^*, \hat{w}) = \text{point prediction, no uncertainty}$$
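A minimal sketch of this contrast for Bayesian linear regression with a Gaussian prior, where the posterior is exactly Gaussian: sampling parameters from the posterior and averaging predictions yields both a predictive mean and a predictive spread, whereas the MAP estimate gives only a point prediction. All sizes and values are illustrative.

```python
import numpy as np

# Monte Carlo posterior predictive for Bayesian linear regression.
rng = np.random.default_rng(6)
n, d = 30, 3
sigma2, tau2 = 1.0, 1.0
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(0, np.sqrt(sigma2), n)

post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / tau2)
post_mean = post_cov @ (X.T @ y) / sigma2        # equals the MAP estimate here

x_new = np.array([0.5, -1.0, 2.0])
w_samples = rng.multivariate_normal(post_mean, post_cov, size=5000)
pred_samples = w_samples @ x_new + rng.normal(0, np.sqrt(sigma2), 5000)

print("MAP point prediction:        ", x_new @ post_mean)
print("Bayesian predictive mean/std:", pred_samples.mean(), pred_samples.std())
```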
Practical computation: when the exact posterior is intractable, common approximation methods include:
1. Markov Chain Monte Carlo (MCMC): draw samples from the posterior and average predictions over them.
2. Variational Inference: fit a tractable family of distributions (for example, a factorized Gaussian) to the posterior by optimization.
3. Laplace Approximation: approximate the posterior with a Gaussian centered at the MAP estimate, with covariance given by the local curvature there (see the sketch below).
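As an illustration of the third option, here is a sketch of the Laplace approximation for a one-parameter logistic regression with a $\mathcal{N}(0, \tau^2)$ prior, on synthetic data: Newton's method finds the MAP, and the inverse curvature of the negative log-posterior at that point gives the approximate posterior variance.

```python
import numpy as np

# Laplace approximation for 1-parameter logistic regression with a N(0, tau2) prior.
rng = np.random.default_rng(7)
tau2 = 4.0
x = rng.standard_normal(100)
ylab = np.where(x + 0.5 * rng.standard_normal(100) > 0, 1.0, -1.0)   # labels in {-1,+1}

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = 0.0
for _ in range(25):                                   # Newton iterations to the MAP
    p = sigmoid(ylab * x * w)                         # P(correct label | w)
    grad = -(ylab * x * (1.0 - p)).sum() + w / tau2   # d/dw of -log posterior
    hess = (x**2 * p * (1.0 - p)).sum() + 1.0 / tau2  # curvature of -log posterior
    w -= grad / hess

print(f"MAP estimate: {w:.3f}")
print(f"Laplace approximation: N({w:.3f}, {1.0 / hess:.4f})")
```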
MAP estimation (regularized optimization) can be viewed as the cheapest form of Bayesian inference — it gives one summary of the posterior (the mode) at the cost of a single optimization. Full Bayesian inference is more principled but more expensive. The choice depends on computational budget and need for uncertainty quantification.
The Bayesian perspective on regularization offers several practical insights and design principles.
Instead of tuning $\lambda$ arbitrarily, design it from prior beliefs:
Signal-to-noise reasoning: $$\lambda = \frac{\sigma^2}{\tau^2}$$
Example: If you expect coefficients around ±1 and noise standard deviation around 10: $$\tau^2 \approx 1, \quad \sigma^2 \approx 100, \quad \lambda \approx 100$$
Encode domain knowledge through prior design:
Non-zero prior mean: $$w \sim \mathcal{N}(\mu, \tau^2 I)$$
If you expect coefficient $w_j$ to be around 2, use $\mu_j = 2$ rather than 0. Regularization then shrinks toward $\mu$, not toward zero.
Heterogeneous variances: $$w_j \sim \mathcal{N}(0, \tau_j^2)$$
If some features are more important a priori, give them larger $\tau_j^2$ (less shrinkage).
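A sketch combining the two ideas above: shrink toward a non-zero prior mean $\mu$ and allow a different strength $\lambda_j$ per coefficient. Minimizing $\Vert y - Xw\Vert_2^2 + (w - \mu)^\top D (w - \mu)$ with $D = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ has the closed form $w = (X^\top X + D)^{-1}(X^\top y + D\mu)$; all numbers below are illustrative.

```python
import numpy as np

# Generalized ridge: shrink toward mu with per-coefficient strengths in D.
rng = np.random.default_rng(8)
X = rng.standard_normal((40, 2))
y = X @ np.array([2.2, 0.3]) + 0.5 * rng.standard_normal(40)

mu = np.array([2.0, 0.0])          # prior belief: w_1 near 2, w_2 near 0
D = np.diag([5.0, 50.0])           # shrink w_2 much harder (smaller tau_2^2)
w = np.linalg.solve(X.T @ X + D, X.T @ y + D @ mu)
print("shrunk estimate:", np.round(w, 3))
```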
Cross-validation for selecting $\lambda$ can be viewed as empirical Bayes:
Full Bayes: place a prior on $\lambda$ (equivalently, on $\tau^2$) and integrate it out.
Type-II Maximum Likelihood (Empirical Bayes): $$\hat{\lambda} = \arg\max_\lambda P(S | \lambda) = \arg\max_\lambda \int P(S | w) P(w | \lambda) dw$$
Cross-validation approximation: $$\hat{\lambda} \approx \arg\min_\lambda \text{CV-Error}(\lambda)$$
Cross-validation is often more robust than the marginal likelihood when the model is misspecified.
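A hedged sketch contrasting the two routes in scikit-learn, assuming `BayesianRidge` (evidence maximization) and `RidgeCV` (cross-validation) are available; in scikit-learn's notation `alpha_` is the noise precision and `lambda_` the weight precision, so the implied ridge strength is `lambda_/alpha_`, matching $\lambda = \sigma^2/\tau^2$. The data and grid below are illustrative.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, RidgeCV

rng = np.random.default_rng(9)
X = rng.standard_normal((200, 8))
y = X @ rng.normal(0, 0.5, 8) + rng.normal(0, 2.0, 200)

# Type-II maximum likelihood (empirical Bayes): hyperparameters fit to the evidence.
eb = BayesianRidge().fit(X, y)
print("empirical-Bayes lambda:", eb.lambda_ / eb.alpha_)

# Cross-validation: pick lambda by held-out error over a grid.
cv = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print("cross-validated lambda:", cv.alpha_)
```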
Use full Bayesian inference when: (1) uncertainty quantification is critical (medical diagnosis, safety systems), (2) data is scarce and prior information is valuable, (3) model comparison is needed, or (4) hierarchical/pooling structure benefits from sharing. Use MAP/regularization when computational efficiency dominates and point predictions suffice.
We have established the deep connection between regularization and Bayesian inference. Let us consolidate the key insights: MAP estimation is exactly regularized optimization, with the regularizer equal to the negative log-prior; a Gaussian prior yields L2 (Ridge) and a Laplace prior yields L1 (Lasso); the regularization strength $\lambda = \sigma^2/\tau^2$ is the ratio of noise variance to prior variance; and full Bayesian inference goes beyond MAP by retaining the entire posterior for uncertainty-aware prediction.
We've now seen regularization from both optimization (constraint) and probabilistic (prior) perspectives. The next page examines Effects on Generalization — how regularization provably improves generalization bounds, tightening the gap between training and test performance. This connects the intuitive ideas of bias-variance and prior-likelihood to rigorous learning-theoretic guarantees.
You now understand regularization from the Bayesian perspective: how regularizers correspond to prior distributions, the equivalence of MAP and regularized optimization, and the rich landscape of prior choices. This probabilistic view complements the optimization perspective and connects regularization to the broader Bayesian framework.