Throughout our exploration of regularization, we've treated the regularization term as a mathematical constraint—a penalty added to our objective function to prevent overfitting. We've derived closed-form solutions, analyzed shrinkage behavior, and understood the bias-variance tradeoff from an optimization perspective.
But there's a deeper question lurking beneath the surface: Why does adding a penalty term to our loss function work at all? Where do Ridge, Lasso, and Elastic Net penalties come from? Are they arbitrary choices, or is there a principled foundation underlying these techniques?
The answer lies in Bayesian statistics, where regularization emerges naturally from a simple, elegant idea: we have beliefs about our parameters before seeing any data, and we can encode those beliefs mathematically as prior distributions.
By the end of this page, you will understand the fundamental concept of prior distributions, their role in Bayesian inference, and how they provide a probabilistic interpretation of regularization. You'll see how our choice of prior encodes assumptions about the nature of model parameters—assumptions that directly translate into familiar regularization penalties.
Before diving into priors specifically, we need to understand the Bayesian framework for statistical inference. This perspective differs fundamentally from the frequentist approach we've been using.
The Frequentist View:
In frequentist statistics, model parameters $\boldsymbol{\theta}$ are fixed but unknown constants. We estimate them from data, and all probability statements concern the data or estimation procedures, never the parameters themselves. Questions like "What is the probability that $\theta_1 > 0$?" are meaningless—$\theta_1$ is either greater than zero or it isn't.
The Bayesian View:
In Bayesian statistics, we treat parameters $\boldsymbol{\theta}$ as random variables with their own probability distributions. This allows us to make probability statements about the parameters themselves (for example, "the probability that $\theta_1 > 0$"), to quantify our uncertainty about them, and to update that uncertainty as data arrive.
The Bayesian and frequentist views aren't mathematically contradictory—they're different philosophical interpretations of probability. Frequentists interpret probability as long-run frequency; Bayesians interpret it as degree of belief. Both approaches are valid tools with different strengths.
The Core Bayesian Equation:
The entire Bayesian framework rests on Bayes' theorem, applied to parameters and data:
$$p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})}{p(\mathbf{y} \mid \mathbf{X})}$$
Let's dissect each component:
| Term | Name | Interpretation |
|---|---|---|
| $p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{y})$ | Posterior | Our updated beliefs about $\boldsymbol{\theta}$ after seeing the data |
| $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})$ | Likelihood | Probability of observing data given parameters |
| $p(\boldsymbol{\theta})$ | Prior | Our beliefs about $\boldsymbol{\theta}$ before seeing data |
| $p(\mathbf{y} \mid \mathbf{X})$ | Marginal likelihood | Normalizing constant (evidence) |
Bayesian inference is a learning process: Prior × Likelihood → Posterior. We start with beliefs (prior), observe evidence (likelihood), and arrive at updated beliefs (posterior). The posterior from one experiment can become the prior for the next, enabling sequential learning.
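As a concrete (hypothetical) illustration of this update cycle, here is a minimal Beta-Binomial sketch in Python: a Beta prior on a coin's success probability is updated by observed flips, and updating in two batches gives exactly the same posterior as updating once with all the data.

```python
# Minimal sketch of Bayesian updating with a Beta-Binomial model (hypothetical
# coin-flip example): the posterior from one batch becomes the prior for the next.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_p = 0.7
flips = rng.random(50) < true_p          # 50 Bernoulli observations

# Beta(2, 2) prior: a weak belief that p is near 0.5
a, b = 2.0, 2.0

# Sequential updating: the posterior after the first batch is the prior for the second
for batch in (flips[:20], flips[20:]):
    a += batch.sum()                     # observed successes update the first parameter
    b += len(batch) - batch.sum()        # observed failures update the second

# Updating once with all the data gives the identical posterior
a_all, b_all = 2.0 + flips.sum(), 2.0 + (len(flips) - flips.sum())
assert (a, b) == (a_all, b_all)

posterior = stats.beta(a, b)
print(f"posterior mean = {posterior.mean():.3f}, "
      f"95% credible interval = {np.round(posterior.interval(0.95), 3)}")
```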
The prior distribution $p(\boldsymbol{\theta})$ is arguably the most distinctive element of Bayesian analysis. It encapsulates everything we believe about parameters before observing any data from the current dataset.
What Priors Represent:
Priors can encode various types of information:
Domain knowledge: A medical researcher studying drug effects might know that most drugs have small or no effect, encoding this as a prior concentrated near zero.
Previous experiments: If prior studies estimated a parameter to be around 0.5 with some uncertainty, this forms a natural prior for new experiments.
Physical constraints: If a parameter represents a probability, we know it must be in $[0, 1]$. If it represents a count, it must be non-negative.
Regularization preferences: We might believe that simpler models (smaller coefficients) are more likely—this is precisely what regularization priors encode.
Visualizing the Prior's Role:
Imagine estimating the mean height of a new population. Before measuring anyone, you might hold a flat prior (any mean equally plausible), a weakly informative prior (centered on a typical human height with generous uncertainty), or an informative prior built from measurements of a similar population.
Each prior leads to a different posterior, especially with small samples. As the sample size grows, all three converge toward the true population mean: the data eventually overwhelms any reasonable prior. The sketch below shows this numerically.
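This is a small numerical sketch with made-up numbers (not from the text): three priors on the mean height, all updated with the same simulated data using the standard conjugate Normal-Normal formulas with known noise variance.

```python
# Hypothetical sketch of the height example: three priors on the mean height (cm),
# updated with the same simulated data via conjugate Normal-Normal formulas.
import numpy as np

rng = np.random.default_rng(1)
true_mean, noise_sd = 172.0, 8.0

# (prior mean, prior sd): near-flat, weakly informative, strongly informative
priors = {"diffuse": (0.0, 1e3), "weak": (170.0, 20.0), "strong": (160.0, 2.0)}

def posterior_mean(y, mu0, sd0, sigma=noise_sd):
    """Posterior mean of a Gaussian mean with known noise sd and N(mu0, sd0^2) prior."""
    prec = 1.0 / sd0**2 + len(y) / sigma**2          # posterior precision
    return (mu0 / sd0**2 + y.sum() / sigma**2) / prec

for n in (5, 50, 5000):
    y = rng.normal(true_mean, noise_sd, size=n)
    means = {k: round(posterior_mean(y, mu0, sd0), 2) for k, (mu0, sd0) in priors.items()}
    print(n, means)
```

With only 5 observations the three posterior means differ noticeably; with 5000 they are nearly identical.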
Prior distributions span a spectrum from completely uninformative to highly specific. Understanding this spectrum is crucial for appreciating how different priors lead to different regularization behaviors.
| Prior Type | Definition | Example | Use Case |
|---|---|---|---|
| Flat/Uniform | Equal probability across parameter space | $p(\theta) \propto 1$ | No preference; let data decide entirely |
| Weakly Informative | Gentle regularization toward reasonable values | $\mathcal{N}(0, 10^2)$ | Soft constraint without strong opinions |
| Informative | Encodes specific prior knowledge | $\mathcal{N}(\mu_0, \sigma_0^2)$ from prior studies | Leveraging previous research |
| Conjugate | Prior and posterior are same family | Beta prior for binomial likelihood | Mathematical convenience, closed forms |
| Sparsity-Inducing | Mass concentrated at zero | Laplace, Spike-and-slab | Variable selection, regularization |
| Hierarchical | Prior parameters have their own priors | $\theta \sim \mathcal{N}(\mu, \sigma^2)$, $\mu \sim \mathcal{N}(0, 1)$ | Sharing information across groups |
Flat Priors and Their Limitations:
A flat (or uniform) prior seems appealingly "objective"—we're not biasing the analysis with any assumptions. However, flat priors have serious issues:
They're not uninformative: A flat prior on $\theta$ is not a flat prior on $\theta^2$ or $\log(\theta)$. The notion of "no information" depends on parameterization.
Improper priors: Flat priors over infinite domains don't integrate to 1. They're "improper" and can sometimes lead to improper posteriors.
Poor performance in high dimensions: With many parameters, flat priors become overwhelmingly diffuse, leading to unstable estimates.
No regularization effect: Flat priors provide no shrinkage, equivalent to unregularized maximum likelihood—exactly what causes overfitting.
Every prior is a choice, and every choice encodes assumptions. Even refusing to choose (flat prior) is a choice—one that often performs poorly. The question isn't whether to make assumptions but which assumptions are most reasonable for your problem.
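The first limitation above, that "flat" depends on parameterization, is easy to check numerically. A quick sketch: draw $\theta$ uniformly on $(0, 1)$ and look at the implied distribution of $\theta^2$, which is far from flat.

```python
# A flat prior on theta does not imply a flat prior on theta**2:
# the density of theta**2 piles up near zero (true density is 1 / (2*sqrt(u))).
import numpy as np

rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 1.0, size=1_000_000)
u = theta**2

hist, edges = np.histogram(u, bins=[0.0, 0.1, 0.5, 1.0], density=True)
print(dict(zip(zip(edges[:-1], edges[1:]), hist.round(2))))
```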
Conjugate Priors:
A conjugate prior is one where the posterior distribution belongs to the same family as the prior. This isn't just mathematically elegant—it enables closed-form solutions.
Common conjugate pairs:
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Gaussian (known variance) | Gaussian | Gaussian |
| Binomial | Beta | Beta |
| Poisson | Gamma | Gamma |
| Multinomial | Dirichlet | Dirichlet |
| Gaussian (unknown variance) | Inverse-Gamma | Inverse-Gamma |
For regression with Gaussian noise and a Gaussian prior on coefficients, the posterior is also Gaussian—this is the foundation of Bayesian linear regression and, as we'll see, Ridge regression.
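Here is a sketch of that Gaussian-Gaussian conjugacy on synthetic data (numbers made up for illustration): with a $\mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$ prior on the coefficients and Gaussian noise with known variance, the posterior mean and covariance have closed forms.

```python
# Closed-form posterior for Bayesian linear regression with a Gaussian prior
# on the coefficients and Gaussian noise of known variance. Synthetic data.
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
sigma, tau = 1.0, 2.0                           # noise sd, prior sd
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Posterior covariance and mean
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / tau**2)
post_mean = post_cov @ X.T @ y / sigma**2

print("posterior mean:", post_mean.round(3))
print("posterior sd:  ", np.sqrt(np.diag(post_cov)).round(3))
```

Under these assumptions the posterior mean coincides with the Ridge estimate for $\lambda = \sigma^2/\tau^2$, which anticipates the derivation on the next page.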
Our focus is regression, where we estimate coefficients $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$. What beliefs might we have about these coefficients before seeing data?
The Case for Shrinkage Priors:
In most regression problems, we have good reasons to believe:
Most coefficients are probably small — Not all features matter equally; many have weak or no effect
Extremely large coefficients are unlikely — Coefficients exploding to ±100 usually indicates overfitting, not true signal
Some coefficients may be exactly zero — In high-dimensional settings, true sparsity is common; many features are irrelevant
These beliefs translate directly into shrinkage priors—distributions that concentrate probability mass near zero while allowing occasional larger values.
Two Fundamental Choices:
The two priors most directly connected to regularization are:
1. Gaussian (Normal) Prior: $$\beta_j \sim \mathcal{N}(0, \tau^2)$$
Places probability symmetrically around zero with decreasing probability for larger values. The density: $$p(\beta_j) = \frac{1}{\sqrt{2\pi\tau^2}} \exp\left(-\frac{\beta_j^2}{2\tau^2}\right)$$
2. Laplace (Double Exponential) Prior: $$\beta_j \sim \text{Laplace}(0, b)$$
Also centered at zero but with a sharp peak and heavier tails. The density: $$p(\beta_j) = \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right)$$
This is the key insight: the shape of the prior distribution directly determines the type of regularization. Gaussian priors yield L2 regularization; Laplace priors yield L1 regularization. The prior variance/scale determines the regularization strength. This connection is not a coincidence—it's a mathematical identity.
The Gaussian and Laplace priors, despite both being centered at zero, encode fundamentally different beliefs about coefficient structure. Understanding these differences illuminates why Ridge and Lasso behave so differently.
| Property | Gaussian Prior | Laplace Prior |
|---|---|---|
| Density shape | Smooth, bell-shaped | Sharp peak, exponential decay |
| Mass at zero | Zero (continuous) | Zero (continuous), but more concentrated near zero |
| Tail behavior | Light tails (sub-Gaussian) | Heavier tails (exponential) |
| Log-density | $-\beta^2 / (2\tau^2) + \text{const}$ | $-|\beta| / b + \text{const}$ |
| Corresponding penalty | $\lambda \Vert\boldsymbol{\beta}\Vert_2^2$ (L2) | $\lambda \Vert\boldsymbol{\beta}\Vert_1$ (L1) |
| Shrinkage behavior | Proportional shrinkage | Soft thresholding |
| Sparsity inducing? | No | Yes |
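To make the rows on peakedness and tail behavior concrete, the sketch below evaluates both densities at matched variance (a scaling chosen here purely for illustration): the Laplace prior puts more density near zero yet assigns far higher probability to very large coefficients.

```python
# Compare Gaussian and Laplace priors scaled to the same variance.
import numpy as np
from scipy import stats

tau = 1.0
b = 1.0 / np.sqrt(2)            # Laplace variance is 2*b**2, so this gives variance 1
gauss = stats.norm(0.0, tau)
lap = stats.laplace(0.0, b)

for x in (0.0, 1.0, 3.0, 5.0):
    print(f"beta = {x}: Gaussian density {gauss.pdf(x):.5f}, Laplace density {lap.pdf(x):.5f}")

# Heavier tails: probability of |beta| > 4 under each prior
print("P(|beta| > 4):", 2 * gauss.sf(4.0), 2 * lap.sf(4.0))
```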
Geometric Intuition:
To understand the difference geometrically, consider the level sets of each prior (contours of equal probability):
Gaussian prior: Level sets are circles (in 2D) or hyperspheres. All coefficients shrink proportionally.
Laplace prior: Level sets are diamonds (in 2D) or cross-polytopes. The corners of the diamond lie on the coordinate axes.
When we combine these priors with the likelihood (which forms elliptical contours for quadratic loss), something remarkable happens:
The sharp corners of the Laplace prior's level sets create 'attractors' on the coordinate axes. When optimizing, the solution naturally gets pulled toward these corners, setting coefficients exactly to zero. This is not a quirk—it's a fundamental geometric property of the L1 norm.
Tail Behavior and Robustness:
The heavier tails of the Laplace prior have an important practical consequence: Laplace priors are more tolerant of occasional large coefficients.
With a Gaussian prior, the penalty grows quadratically, so large coefficients are punished severely and every coefficient is shrunk by a proportional amount; the prior prefers spreading a given amount of signal across many moderate coefficients.
With a Laplace prior, the penalty grows only linearly, so a genuinely large coefficient can survive mostly intact while weak coefficients are pushed all the way to zero.
This explains why Ridge tends to distribute effect sizes across correlated predictors, while Lasso often selects one and zeros out the others, as the sketch below illustrates.
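A quick synthetic illustration of this behavioral difference using scikit-learn (penalty strengths chosen arbitrarily for illustration):

```python
# With two highly correlated predictors, Ridge splits the effect between them
# while Lasso tends to keep one and zero out the other. Synthetic data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(4)
n = 200
z = rng.normal(size=n)
X = np.column_stack([z + 0.05 * rng.normal(size=n),     # two near-copies of the same signal
                     z + 0.05 * rng.normal(size=n)])
y = 3.0 * z + rng.normal(scale=0.5, size=n)

print("Ridge:", Ridge(alpha=1.0).fit(X, y).coef_.round(2))
print("Lasso:", Lasso(alpha=0.1).fit(X, y).coef_.round(2))
```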
We've described Gaussian and Laplace priors and their properties. Now we'll establish the mathematical connection between priors and regularization penalties. This connection is the theoretical foundation for understanding regularization from a Bayesian perspective.
The Key Observation:
The regularization penalty in optimization objectives is the negative log-prior.
For a Gaussian prior $\beta_j \sim \mathcal{N}(0, \tau^2)$: $$\log p(\beta_j) = -\frac{\beta_j^2}{2\tau^2} - \frac{1}{2}\log(2\pi\tau^2)$$
The negative log-prior (ignoring constants) is: $$-\log p(\beta_j) \propto \frac{\beta_j^2}{2\tau^2}$$
Summing over all coefficients: $$-\log p(\boldsymbol{\beta}) \propto \frac{1}{2\tau^2} \sum_{j=1}^p \beta_j^2 = \frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2$$
This is exactly the L2 penalty, with $\lambda \propto 1/\tau^2$! (When the loss is the residual sum of squares with noise variance $\sigma^2$, the exact correspondence is $\lambda = \sigma^2/\tau^2$.)
Regularization strength is inversely proportional to prior variance. Small prior variance (tight prior) → strong regularization. Large prior variance (diffuse prior) → weak regularization. Setting $\lambda = 0$ corresponds to an infinitely diffuse prior—effectively no prior information.
The Same for Laplace:
For a Laplace prior $\beta_j \sim \text{Laplace}(0, b)$: $$\log p(\beta_j) = -\frac{|\beta_j|}{b} - \log(2b)$$
The negative log-prior is: $$-\log p(\beta_j) \propto \frac{|\beta_j|}{b}$$
Summing over coefficients: $$-\log p(\boldsymbol{\beta}) \propto \frac{1}{b} \sum_{j=1}^p |\beta_j| = \frac{1}{b} \|\boldsymbol{\beta}\|_1$$
This is exactly the L1 penalty, with $\lambda \propto 1/b$! (The exact constant again depends on the noise variance and how the loss is scaled.)
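Both identities are easy to verify numerically: subtracting the claimed penalty from the negative log-density should leave a constant that does not depend on $\beta$.

```python
# Check that the negative log-prior equals the penalty plus a beta-independent constant.
import numpy as np
from scipy import stats

tau, b = 2.0, 1.5
beta = np.linspace(-3.0, 3.0, 7)

neg_log_gauss = -stats.norm(0.0, tau).logpdf(beta)
neg_log_lap = -stats.laplace(0.0, b).logpdf(beta)

print(neg_log_gauss - beta**2 / (2 * tau**2))   # constant: 0.5 * log(2*pi*tau^2)
print(neg_log_lap - np.abs(beta) / b)           # constant: log(2*b)
```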
The Unified View:
We can now see the regularized regression objective in a new light:
$$\text{Loss}(\boldsymbol{\beta}) + \lambda \cdot R(\boldsymbol{\beta}) = -\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) - \log p(\boldsymbol{\beta})$$
$$= -\log\left[p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})\right]$$
$$= -\log\left[p(\boldsymbol{\beta} \mid \mathbf{X}, \mathbf{y}) \cdot p(\mathbf{y} \mid \mathbf{X})\right]$$
Minimizing the regularized loss is equivalent to maximizing the posterior probability of parameters given data—Maximum A Posteriori (MAP) estimation!
Every time you add a regularization term, you're implicitly asserting a prior belief about your parameters. L2 regularization says you believe coefficients come from a Gaussian distribution. L1 regularization says you believe coefficients come from a Laplace distribution. Understanding this transforms regularization from a trick into a principled choice.
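A minimal numerical sketch of the MAP equivalence, assuming Gaussian noise with known variance $\sigma^2$ and a Gaussian prior with variance $\tau^2$: minimizing the negative log-posterior directly and solving the Ridge normal equations with $\lambda = \sigma^2/\tau^2$ give the same coefficients.

```python
# MAP estimation under a Gaussian prior equals Ridge regression. Synthetic data.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(5)
n, p = 80, 4
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, 0.0, -2.0, 0.5])
sigma, tau = 1.0, 1.0                      # noise sd and prior sd (assumed known)
y = X @ beta_true + rng.normal(scale=sigma, size=n)

def neg_log_posterior(beta):
    # negative log-likelihood + negative log-prior, dropping beta-independent constants
    return (np.sum((y - X @ beta) ** 2) / (2 * sigma**2)
            + np.sum(beta**2) / (2 * tau**2))

map_est = optimize.minimize(neg_log_posterior, np.zeros(p)).x
ridge_est = np.linalg.solve(X.T @ X + (sigma**2 / tau**2) * np.eye(p), X.T @ y)

print("MAP estimate:  ", map_est.round(4))
print("Ridge estimate:", ridge_est.round(4))
print("agree:", np.allclose(map_est, ridge_est, atol=1e-3))
```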
The prior distributions we've discussed have hyperparameters: variance $\tau^2$ for Gaussian, scale $b$ for Laplace. These directly control regularization strength. How should we choose them?
Approaches to Hyperparameter Selection:
Empirical Bayes:
The marginal likelihood approach finds hyperparameters that maximize the probability of the observed data, averaged over all possible parameter values:
$$p(\mathbf{y} \mid \mathbf{X}, \tau^2) = \int p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta} \mid \tau^2) \, d\boldsymbol{\beta}$$
For Gaussian priors and Gaussian likelihoods, this integral has a closed form: $$\mathbf{y} \mid \mathbf{X}, \tau^2 \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{X}\mathbf{X}^T + \sigma^2 \mathbf{I})$$
Maximizing this over $\tau^2$ provides a principled, data-driven regularization strength—one that balances model complexity against data fit automatically.
Unlike cross-validation which requires data splitting, empirical Bayes uses all the data. It also provides uncertainty quantification for the hyperparameters themselves. This is particularly valuable in small-sample settings where cross-validation estimates are highly variable.
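Here is a sketch of empirical Bayes for the Gaussian case, assuming the noise variance is known: evaluate the closed-form log marginal likelihood $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{X}\mathbf{X}^T + \sigma^2 \mathbf{I})$ on a grid of $\tau^2$ values and pick the maximizer.

```python
# Empirical Bayes by grid-maximizing the log marginal likelihood over tau^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 60, 8
X = rng.normal(size=(n, p))
sigma = 1.0
beta_true = rng.normal(scale=0.5, size=p)       # true prior sd is 0.5
y = X @ beta_true + rng.normal(scale=sigma, size=n)

def log_marginal(tau2):
    cov = tau2 * X @ X.T + sigma**2 * np.eye(n)
    return stats.multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)

grid = np.linspace(0.01, 2.0, 200)
tau2_hat = grid[np.argmax([log_marginal(t) for t in grid])]
print(f"estimated tau^2 = {tau2_hat:.3f}  "
      f"(implied lambda = sigma^2 / tau^2 = {sigma**2 / tau2_hat:.2f})")
```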
Hierarchical Priors (Full Bayes):
The most thorough approach places priors on hyperparameters:
$$\boldsymbol{\beta} \mid \tau^2 \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$$ $$\tau^2 \sim \text{Inverse-Gamma}(a, b)$$
This fully Bayesian treatment marginalizes over hyperparameter uncertainty, providing more calibrated uncertainty quantification. The cost is increased computational complexity—typically requiring MCMC methods.
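Because both full conditionals of this particular hierarchy are available in closed form, a toy Gibbs sampler can be written in a few lines of NumPy, no MCMC library required. This is only a sketch of the idea, with made-up data and hyperparameters.

```python
# Toy Gibbs sampler for beta | tau^2 ~ N(0, tau^2 I), tau^2 ~ Inverse-Gamma(a0, b0),
# with Gaussian noise of known sd. Both full conditionals are conjugate.
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
sigma = 1.0
y = X @ beta_true + rng.normal(scale=sigma, size=n)

a0, b0 = 2.0, 2.0                    # Inverse-Gamma hyperparameters (arbitrary choice)
tau2 = 1.0                           # initial value
samples = []
for it in range(2000):
    # beta | tau^2, y  ~  N(post_mean, post_cov)
    post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(p) / tau2)
    post_mean = post_cov @ X.T @ y / sigma**2
    beta = rng.multivariate_normal(post_mean, post_cov)
    # tau^2 | beta  ~  Inverse-Gamma(a0 + p/2, b0 + ||beta||^2 / 2)
    tau2 = 1.0 / rng.gamma(a0 + p / 2, 1.0 / (b0 + beta @ beta / 2))
    samples.append((beta, tau2))

betas = np.array([s[0] for s in samples[500:]])   # discard burn-in
print("posterior mean of beta: ", betas.mean(axis=0).round(2))
print("posterior mean of tau^2:", round(np.mean([s[1] for s in samples[500:]]), 2))
```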
So far, we've assumed independent priors across coefficients: $$p(\boldsymbol{\beta}) = \prod_{j=1}^p p(\beta_j)$$
This independence assumption is convenient but not necessary. In many applications, we have prior knowledge about relationships between coefficients.
Correlated Priors:
We can specify a joint prior with non-diagonal covariance: $$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_0)$$
where $\boldsymbol{\Sigma}_0$ encodes prior correlations between coefficients.
The Corresponding Regularization:
A Gaussian prior with covariance $\boldsymbol{\Sigma}_0$ yields the penalty: $$R(\boldsymbol{\beta}) = \boldsymbol{\beta}^T \boldsymbol{\Sigma}_0^{-1} \boldsymbol{\beta}$$
This is a generalized Ridge penalty. When $\boldsymbol{\Sigma}_0 = \tau^2 \mathbf{I}$, we recover standard Ridge. Other choices yield different regularization behaviors: for example, a covariance that correlates neighboring coefficients encourages smooth coefficient profiles, while a block-structured covariance shares strength within groups of related features (a sketch appears after the note below).
The Bayesian framework makes it clear that choosing a regularization matrix is equivalent to choosing a prior covariance structure.
Understanding the prior-penalty correspondence gives you a powerful design tool. Instead of asking 'What penalty matrix should I use?', ask 'What prior correlation structure makes sense for my problem?' The answer often becomes clearer when framed probabilistically.
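For instance, here is a sketch with an AR(1)-style prior covariance (a hypothetical choice that says neighboring coefficients are expected to be similar), compared against standard Ridge on the same synthetic data:

```python
# Generalized Ridge from a correlated Gaussian prior with penalty matrix Sigma0^{-1}.
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta_true = np.sin(np.linspace(0, np.pi, p))    # smoothly varying true coefficients
sigma = 1.0
y = X @ beta_true + rng.normal(scale=sigma, size=n)

rho, tau = 0.9, 1.0
idx = np.arange(p)
Sigma0 = tau**2 * rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) prior covariance

beta_gen_ridge = np.linalg.solve(X.T @ X / sigma**2 + np.linalg.inv(Sigma0),
                                 X.T @ y / sigma**2)
beta_ridge = np.linalg.solve(X.T @ X / sigma**2 + np.eye(p) / tau**2,
                             X.T @ y / sigma**2)
print("generalized Ridge:", beta_gen_ridge.round(2))
print("standard Ridge:   ", beta_ridge.round(2))
```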
Understanding priors theoretically is valuable; applying them correctly requires attention to practical details.
Prior Predictive Checks:
A powerful diagnostic is the prior predictive distribution: the distribution of data implied by your prior before seeing actual observations, obtained by simulating parameters from the prior and then simulating data from the model.
If your prior implies outcomes like negative heights, probabilities > 1, or incomes of $10 million for entry-level positions, the prior needs adjustment. Good priors should generate plausible (if imprecise) predictions before seeing any data.
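This is a sketch of a prior predictive check with hypothetical scales: simulate coefficients from the prior, push them through the model, and compare the range of implied outcomes against what is plausible for the real response.

```python
# Prior predictive check: what outcome ranges does each prior imply before seeing data?
import numpy as np

rng = np.random.default_rng(9)
n, p = 200, 6
X = rng.normal(size=(n, p))                 # standardized features (hypothetical)

def prior_predictive(prior_sd, noise_sd=50.0, draws=200):
    sims = []
    for _ in range(draws):
        beta = rng.normal(scale=prior_sd, size=p)              # draw from the prior
        sims.append(X @ beta + rng.normal(scale=noise_sd, size=n))
    return np.concatenate(sims)

for prior_sd in (1.0, 100.0):
    y_sim = prior_predictive(prior_sd)
    print(f"prior sd {prior_sd:>5}: simulated outcome range "
          f"[{y_sim.min():.0f}, {y_sim.max():.0f}]")
```

The overly diffuse prior implies outcomes far outside any plausible range for the response, which is exactly the kind of problem a prior predictive check is meant to catch.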
1. Overly strong priors: Priors that are too narrow relative to the data will dominate inference, ignoring your observations.
2. Overly diffuse priors: Priors that are too wide provide no regularization and can lead to numerical issues.
3. Ignoring scale: A $\mathcal{N}(0, 1)$ prior might be perfect for z-scored data but nonsensical for raw temperature in Kelvin.
We've established the foundational concept of prior distributions and their connection to regularization. This provides the theoretical grounding for the pages that follow.
Key Takeaways:
- Bayesian inference treats parameters as random variables and combines a prior with the likelihood, via Bayes' theorem, to obtain a posterior.
- The regularization penalty is the negative log-prior: a Gaussian prior yields the L2 (Ridge) penalty, a Laplace prior yields the L1 (Lasso) penalty.
- The prior scale controls regularization strength: tighter priors mean stronger shrinkage, and minimizing a regularized loss is MAP estimation.
- Correlated priors correspond to generalized (matrix-weighted) Ridge penalties, and prior hyperparameters can be chosen by empirical Bayes or a fully hierarchical treatment.
What's Next:
In the following pages, we'll develop these ideas in depth.
You now understand the concept of prior distributions and their fundamental connection to regularization. This Bayesian perspective transforms regularization from an ad-hoc trick into a principled framework for encoding prior knowledge about model parameters. The next page will make this concrete by deriving Ridge regression as a consequence of assuming Gaussian priors.