Ridge regression emerges from Gaussian priors, producing proportional shrinkage but never exact zeros. Yet in many applications, we genuinely believe that most coefficients should be exactly zero—that only a subset of features truly matter.
Lasso regression famously produces sparse solutions. Variables are either in the model with non-zero coefficients or completely excluded with exactly zero coefficients. This isn't just computationally convenient—it's often scientifically meaningful. In genomics, we might believe only a handful of genes among thousands affect a phenotype. In finance, only certain factors might truly drive returns.
The Bayesian perspective reveals why Lasso produces sparsity: it corresponds to a Laplace (double exponential) prior on coefficients. This page develops this connection rigorously, explaining the geometric and algebraic reasons Laplace priors induce exact zeros where Gaussian priors cannot.
By completing this page, you will: (1) Understand the Laplace distribution and its properties; (2) Derive Lasso as MAP estimation under a Laplace prior; (3) Explain geometrically why Laplace priors induce sparsity; (4) Appreciate the computational challenges of full Bayesian inference with Laplace priors; (5) Connect the scale parameter to regularization strength.
Before connecting Laplace priors to Lasso, we need to understand the Laplace distribution itself—a distribution that predates the Gaussian in the history of statistics.
Definition:
The Laplace distribution with location $\mu$ and scale $b > 0$ has density:
$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$
We write $X \sim \text{Laplace}(\mu, b)$.
For our regularization priors, we use the centered Laplace with $\mu = 0$:
$$f(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$$
| Property | Value | Comparison to Gaussian |
|---|---|---|
| Mean | $\mu$ (location) | Same as Gaussian mean |
| Variance | $2b^2$ | Gaussian: $\sigma^2$ |
| Mode | $\mu$ | Same (both symmetric) |
| Shape | Peaked at center, exponential tails | Bell-shaped, Gaussian tails |
| Tail decay | $\exp(-|x|/b)$ (slower) | $\exp(-x^2/(2\sigma^2))$ (faster) |
| Kurtosis | 6 (heavy tails) | 3 (mesokurtic) |
| Support | $(-\infty, \infty)$ | Same |
The Key Difference: Shape and Tails
The Laplace and Gaussian distributions differ fundamentally in their shape:
Peak sharpness: The Laplace has a sharp peak at the mode (cusp), while the Gaussian has a smooth, rounded peak.
Tail behavior: Laplace tails decay exponentially ($e^{-|x|/b}$), while Gaussian tails decay as a Gaussian ($e^{-x^2/(2\sigma^2)}$). For large $|x|$, Laplace tails are heavier—they assign more probability to extreme values.
Concentration near zero: Despite heavier tails, the Laplace puts more mass near zero due to its sharp peak. This combination (more mass at zero AND heavier tails) is what drives sparsity.
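This combination is easy to check numerically. The sketch below (illustrative values, using scipy) matches a Gaussian's variance to a Laplace's ($\sigma^2 = 2b^2$) and compares the probability of a small interval around zero with the probability of the far tails; the Laplace assigns more mass to both.

```python
# Compare a Laplace(0, b) with a Gaussian of equal variance (sigma^2 = 2*b^2):
# the Laplace puts more mass very near zero AND more mass far out in the tails.
from scipy import stats

b = 1.0
laplace = stats.laplace(loc=0.0, scale=b)
gauss = stats.norm(loc=0.0, scale=(2 * b**2) ** 0.5)   # variance matched to the Laplace

for name, dist in [("Laplace", laplace), ("Gaussian", gauss)]:
    mass_near_zero = dist.cdf(0.1) - dist.cdf(-0.1)    # P(|X| < 0.1)
    mass_in_tails = 2 * dist.sf(4.0)                   # P(|X| > 4)
    print(f"{name:8s}  P(|X|<0.1) = {mass_near_zero:.4f}   P(|X|>4) = {mass_in_tails:.6f}")
```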
The Laplace distribution can be understood as two exponential distributions glued at the origin—one for positive values, one for negative values. Hence the name 'double exponential.' This construction explains the sharp peak: two exponentials meeting creates a cusp, not a smooth joining.
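This construction can be verified by simulation: a random sign times an Exponential draw should reproduce the Laplace's variance $2b^2$ and kurtosis 6 from the table above. A minimal check (illustrative values) with numpy and scipy:

```python
# Check the 'double exponential' construction: a random sign times an Exponential(scale=b)
# draw follows the Laplace(0, b) distribution, with variance 2*b^2 and kurtosis 6.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
b, n = 1.5, 200_000

samples = rng.choice([-1.0, 1.0], size=n) * rng.exponential(scale=b, size=n)

print("sample variance:", samples.var(), " theory:", 2 * b**2)
print("sample kurtosis:", stats.kurtosis(samples, fisher=False), " theory: 6")
print("KS distance to Laplace(0, b):", stats.kstest(samples, stats.laplace(scale=b).cdf).statistic)
```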
Let's now derive Lasso regression as MAP estimation under a Laplace prior, paralleling our Ridge derivation.
The Setup:
Likelihood (same as before): $$\mathbf{y} \mid \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$$
Prior: Independent Laplace priors on each coefficient: $$\beta_j \sim \text{Laplace}(0, b), \quad j = 1, \ldots, p$$
$$p(\boldsymbol{\beta}) = \prod_{j=1}^p \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right) = \frac{1}{(2b)^p} \exp\left(-\frac{1}{b}\sum_{j=1}^p |\beta_j|\right)$$
Applying Bayes' Theorem:
$$p(\boldsymbol{\beta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
$$\propto \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2\right) \cdot \exp\left(-\frac{1}{b}\|\boldsymbol{\beta}\|_1\right)$$
$$= \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 - \frac{1}{b}\|\boldsymbol{\beta}\|_1\right)$$
Taking the Log:
$$\log p(\boldsymbol{\beta} \mid \mathbf{y}) = -\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 - \frac{1}{b}\|\boldsymbol{\beta}\|_1 + \text{const}$$
Maximizing $\log p(\boldsymbol{\beta} \mid \mathbf{y})$ is equivalent to minimizing $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ with $\lambda = 2\sigma^2/b$. The Lasso solution is the Maximum A Posteriori (MAP) estimate under a Gaussian likelihood and a Laplace prior!
The Full Relationship:
$$\boldsymbol{\hat\beta}_{\text{Lasso}} = \arg\max_{\boldsymbol{\beta}} \, p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X})$$
$$= \arg\min_{\boldsymbol{\beta}} \left[\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1\right]$$
where: $$\lambda = \frac{2\sigma^2}{b}$$
This connects the prior scale parameter $b$ to the regularization parameter $\lambda$. Smaller $b$ (tighter prior around zero) means larger $\lambda$ (stronger regularization).
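As a quick sanity check on this mapping (simulated data; the values of $\sigma$, $b$, and the design below are purely illustrative), the negative log-posterior and the Lasso objective with $\lambda = 2\sigma^2/b$ should differ only by the constant factor $2\sigma^2$, so they share the same minimizer:

```python
# Verify that the negative log-posterior (Gaussian likelihood + Laplace prior) and the
# Lasso objective with lambda = 2*sigma^2/b differ only by the constant factor 2*sigma^2,
# hence are minimized by the same beta.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
sigma, b = 0.7, 0.4                     # noise sd and Laplace prior scale (illustrative)
y = X @ beta_true + sigma * rng.standard_normal(n)
lam = 2 * sigma**2 / b                  # the mapping derived above

def neg_log_posterior(beta):            # up to an additive constant
    return np.sum((y - X @ beta) ** 2) / (2 * sigma**2) + np.sum(np.abs(beta)) / b

def lasso_objective(beta):
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

for _ in range(3):                      # the ratio equals 2*sigma^2 for any beta
    beta = rng.standard_normal(p)
    print(lasso_objective(beta) / neg_log_posterior(beta), "vs", 2 * sigma**2)
```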
Both Gaussian and Laplace priors are continuous with $p(\beta_j = 0) = 0$. So why does Lasso produce exact zeros while Ridge doesn't? The key is that we're doing MAP estimation (finding the mode), not full posterior inference.
The Subgradient Condition:
At a maximum of the log-posterior, we need the derivative to be zero. But the L1 norm $|\beta_j|$ is not differentiable at $\beta_j = 0$—it has different left and right derivatives.
For $\beta_j > 0$: $\frac{d}{d\beta_j}|\beta_j| = 1$
For $\beta_j < 0$: $\frac{d}{d\beta_j}|\beta_j| = -1$
At $\beta_j = 0$: any value in $[-1, 1]$ is a valid subgradient
For non-differentiable convex functions, the subgradient generalizes the derivative. At a point where the function has a 'corner' (like $|x|$ at $x=0$), the subgradient is the set of all slopes of lines that touch the function from below. For $|x|$ at $x=0$, this set is $[-1, 1]$.
The Optimality Condition:
For the Lasso objective: $$L(\boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$$
The (sub)gradient condition for optimality is: $$-\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta}) + \lambda \cdot \text{sign}(\boldsymbol{\hat\beta}) = \mathbf{0}$$
where $\text{sign}(\beta_j) \in [-1, 1]$ if $\beta_j = 0$.
For the $j$-th coefficient, this gives: $$[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j = \lambda \cdot s_j$$
where $s_j = \text{sign}(\hat\beta_j)$ if $\hat\beta_j \neq 0$, or $s_j \in [-1, 1]$ if $\hat\beta_j = 0$.
When Can $\hat\beta_j = 0$?
The condition for $\hat\beta_j = 0$ to be optimal is: $$\left|[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j\right| \leq \lambda$$
If the absolute correlation between residuals and feature $j$ is less than $\lambda$, the coefficient is set to exactly zero.
This is the soft thresholding condition: the data's 'vote' for including feature $j$ (measured by correlation with residuals) must exceed a threshold $\lambda$ for the feature to be included.
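To make the soft-thresholding condition concrete, here is a minimal coordinate-descent Lasso solver (a sketch on simulated data, not a production implementation) for the objective $\frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$. After fitting, it checks the subgradient conditions above: zero coefficients satisfy $|[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j| \leq \lambda$, and nonzero ones satisfy equality with $\lambda \cdot \text{sign}(\hat\beta_j)$.

```python
# Coordinate-descent Lasso via soft thresholding, plus a check of the subgradient conditions.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]           # residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 1.5, 0, 0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 20.0
beta_hat = lasso_cd(X, y, lam)
corr = X.T @ (y - X @ beta_hat)                              # the data's 'vote' for each feature

print("beta_hat:", np.round(beta_hat, 3))                    # several entries are exactly 0.0
print("zeros satisfy |corr| <= lam:", (np.abs(corr[beta_hat == 0]) <= lam + 1e-6).all())
print("nonzeros satisfy corr = lam*sign:",
      np.allclose(corr[beta_hat != 0], lam * np.sign(beta_hat[beta_hat != 0]), atol=1e-6))
```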
Contrast with Ridge:
For Ridge, the L2 penalty $\beta_j^2$ is differentiable everywhere with derivative $2\beta_j$. The optimality condition: $$[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j = \lambda \hat\beta_j$$
This equation has $\hat\beta_j = 0$ as a solution only if feature $j$ is exactly uncorrelated with the residuals, $[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j = 0$—a measure-zero event. Ridge shrinks coefficients toward zero but almost never lands exactly on it.
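The contrast is easy to see empirically with scikit-learn (the library's `alpha` parameter is scaled differently from the $\lambda$ above, so the specific values here are illustrative): on data where only a few features matter, Lasso returns many coefficients that are exactly zero, while Ridge essentially never does.

```python
# Fit Ridge and Lasso on the same sparse-signal data and count exactly-zero coefficients.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [4, -3, 2, -2, 1.5]                  # only 5 of 30 features matter
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0.0)), "of", p)
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)), "of", p)   # essentially always 0
```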
The L1 ball has corners on the coordinate axes. The L2 ball is smooth everywhere. When the likelihood contours intersect the constraint set, they typically hit corners of the L1 ball (yielding zeros) but smooth points of the L2 ball (yielding non-zeros). The 'corners' of the Laplace prior translate into corners on the constraint set, creating natural sparsity.
The geometric picture provides powerful intuition for why Laplace priors produce sparsity.
Contours of the Prior:
In two dimensions ($\beta_1, \beta_2$), the contours of equal prior probability are circles for the Gaussian ($\beta_1^2 + \beta_2^2 = c$) and diamonds for the Laplace ($|\beta_1| + |\beta_2| = c$).
The key difference is that Laplace contours have corners at $(c, 0)$, $(-c, 0)$, $(0, c)$, $(0, -c)$, while Gaussian contours are smooth everywhere.
Contours of the Likelihood:
For Gaussian likelihood with quadratic loss, the likelihood contours are ellipses: $$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = c$$
These ellipses are typically not axis-aligned (unless features are orthogonal).
MAP Estimation as Constrained Optimization:
Finding the MAP estimate is equivalent to: $$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_1 \leq t$$
We expand the loss contours outward from the unconstrained (OLS) solution until they first touch the constraint region.
Where Does First Contact Occur?
For a randomly positioned and oriented ellipse, the probability that first contact with a circle happens exactly on a coordinate axis is zero. For a diamond, by contrast, the corners are highly probable contact points.
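This claim can be checked by simulation. The sketch below (a purely illustrative 2D experiment with arbitrary parameters) draws random quadratic losses, finds the loss minimizer on the boundary of the unit L1 ball and the unit L2 ball by a dense grid search, and counts how often that minimizer has an exactly-zero coordinate.

```python
# Monte Carlo in 2D: minimize random quadratic losses over the boundary of the unit L1 ball
# (diamond) and unit L2 ball (circle); count how often the minimizer lies on a coordinate axis.
import numpy as np

rng = np.random.default_rng(4)
tol = 1e-3
t = np.linspace(0.0, 1.0, 4001)[:-1]

corners = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]], dtype=float)
diamond = np.concatenate([np.outer(1 - t, corners[k]) + np.outer(t, corners[k + 1])
                          for k in range(4)])                 # dense points on the diamond
theta = np.linspace(0.0, 2 * np.pi, 16000, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])      # dense points on the circle

hits = {"L1 (diamond)": 0, "L2 (circle)": 0}
trials = 0
while trials < 2000:
    M = rng.standard_normal((2, 2))
    A = M.T @ M + 0.1 * np.eye(2)                             # random ellipse shape
    center = rng.uniform(-3, 3, size=2)                       # unconstrained optimum
    if np.abs(center).sum() <= 1.0 or np.linalg.norm(center) <= 1.0:
        continue                                              # keep the optimum outside both balls
    trials += 1
    for name, boundary in [("L1 (diamond)", diamond), ("L2 (circle)", circle)]:
        d = boundary - center
        loss = np.einsum("ij,jk,ik->i", d, A, d)              # quadratic loss along the boundary
        best = boundary[np.argmin(loss)]
        hits[name] += int(np.any(np.abs(best) < tol))

print(hits)   # corner contacts dominate for the diamond; axis contacts are rare for the circle
```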
In $p$ dimensions, the L1 ball has $2p$ corners (one at $\pm c$ on each axis) and $2^p$ facets (hyperplanes connecting corners). The corners occupy a significant fraction of the boundary's 'probability' for first contact with an ellipse. This is why Lasso produces sparse solutions even without explicit variable selection.
Visualizing in Higher Dimensions:
In $p$ dimensions, the effect only strengthens: as $p$ grows, the L1 ball becomes increasingly 'spiky,' with most of its volume concentrated near the corners. This high-dimensional geometry explains why Lasso's sparsity-inducing property becomes stronger in high dimensions—exactly where sparsity is most useful.
Let's examine the probability densities more carefully to understand what each prior 'believes' about coefficients.
| Aspect | Gaussian $\mathcal{N}(0, \tau^2)$ | Laplace$(0, b)$ |
|---|---|---|
| Density at 0 | $1/\sqrt{2\pi\tau^2}$ | $1/(2b)$ |
| Log-density at 0 | $-\frac{1}{2}\log(2\pi\tau^2)$ | $-\log(2b)$ |
| Log-density decay | Quadratic: $-\beta^2/(2\tau^2)$ | Linear: $-|\beta|/b$ |
| Penalty on $|\beta|=1$ | $(2\tau^2)^{-1}$ | $b^{-1}$ |
| Penalty on $|\beta|=10$ | $(2\tau^2)^{-1} \times 100$ | $b^{-1} \times 10$ |
| Relative cost of 10 vs 1 | 100× worse | 10× worse |
The Penalty Function Perspective:
Define the log-prior penalty (negative log-prior, ignoring constants): $\frac{\beta^2}{2\tau^2}$ for the Gaussian and $\frac{|\beta|}{b}$ for the Laplace.
Key Difference:
For small $|\beta|$: the Gaussian penalty is nearly flat near zero (its slope vanishes at $\beta = 0$), so there is little pressure to push an already-small coefficient all the way to zero. The Laplace penalty keeps a constant slope $1/b$ right up to zero, so small coefficients get pushed exactly to zero.
For large $|\beta|$: the Gaussian penalty grows quadratically and punishes large coefficients severely, while the Laplace penalty grows only linearly and tolerates a few large coefficients.
The Laplace prior penalizes deviations from zero at a constant rate per unit, regardless of starting point. The Gaussian prior penalizes marginally—the cost of moving from 9 to 10 is much greater than from 0 to 1.
Gaussian: 'Every additional unit of $|\beta|$ is more expensive than the last.' Laplace: 'Every unit of $|\beta|$ costs the same, regardless of how much you already have.'
This constant marginal cost means Laplace priors don't mind if you concentrate effect sizes in a few large coefficients (sparsity) rather than spreading them across many small ones.
Implications for Coefficient Estimation:
Consider a fixed total signal magnitude $M$, either concentrated in one coefficient or spread evenly across $k$ coefficients of size $M/k$ each (see the sketch after this comparison):
Gaussian penalty: $k \cdot \frac{(M/k)^2}{2\tau^2} = \frac{M^2}{2k\tau^2}$ — spreading the signal lowers the penalty, so the Gaussian prior favors many small coefficients.
Laplace penalty: $k \cdot \frac{M/k}{b} = \frac{M}{b}$ — the penalty is the same however the signal is distributed, so the Laplace prior has no objection to concentrating it in a few coefficients when the data support that.
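A tiny calculation makes this concrete (the total magnitude $M$ and the scales $\tau^2$, $b$ below are arbitrary):

```python
# Total signal of magnitude M, concentrated in one coefficient (k=1) or spread over k of them.
# The Gaussian (L2) penalty rewards spreading; the Laplace (L1) penalty is indifferent.
import numpy as np

M, tau2, b = 10.0, 1.0, 1.0
for k in [1, 2, 5, 10]:
    beta = np.full(k, M / k)
    gaussian_penalty = np.sum(beta ** 2) / (2 * tau2)   # = M^2 / (2*k*tau2): shrinks as k grows
    laplace_penalty = np.sum(np.abs(beta)) / b          # = M / b: identical for every k
    print(f"k={k:2d}  Gaussian penalty = {gaussian_penalty:6.2f}   Laplace penalty = {laplace_penalty:5.2f}")
```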
A crucial difference from the Gaussian case: the posterior under a Laplace prior is not Gaussian and does not have a closed form.
Why No Closed Form?
For Gaussian prior and Gaussian likelihood, the product of two Gaussians is Gaussian—this conjugacy gives closed-form posteriors.
For Laplace prior and Gaussian likelihood: $$p(\boldsymbol{\beta} \mid \mathbf{y}) \propto \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2\right) \cdot \exp\left(-\frac{1}{b}\|\boldsymbol{\beta}\|_1\right)$$
This combines a Gaussian (in the squared distance) with a Laplace (in the L1 norm). The product is neither Gaussian nor Laplace—it's a complicated distribution with no standard name.
Laplace priors are not conjugate to Gaussian likelihoods. The posterior is intractable to compute exactly in closed form. This is why Lasso (MAP estimation) is practical, but full Bayesian inference with Laplace priors requires approximate methods.
Properties of the Posterior:
Although we can't write down the posterior in closed form, we know several things:
Unimodal: The log-posterior is a sum of concave functions (Gaussian log-likelihood plus Laplace log-prior), hence concave, so the posterior has a single mode
Non-Gaussian: The posterior has heavier tails and sharper peak than Gaussian
Probability that $\beta_j = 0$: Still zero! The posterior remains continuous
Mode at Lasso solution: The MAP estimate (posterior mode) is the Lasso solution
Posterior mean ≠ posterior mode: Unlike Gaussians, the mean and mode differ
Wait—If the Posterior is Continuous, How Can Lasso Give Exactly Zero?
This is a subtle but crucial point. The posterior $p(\beta_j \mid \mathbf{y})$ is continuous—it assigns zero probability to the event $\{\beta_j = 0\}$.
But the mode of this posterior (the Lasso estimate) can be exactly zero!
This is analogous to how the Laplace distribution itself is continuous with $p(x=0) = 0$, yet its mode is at $x = 0$.
The distinction: sparsity is a property of the point estimate (the mode sits exactly at the prior's cusp at zero), not of the posterior distribution, which assigns zero probability to any single value.
Lasso produces exact zeros because it finds the MAP estimate (mode), not the posterior mean. If you perform full Bayesian inference (MCMC sampling, posterior mean estimation), you will NOT get exactly zero coefficients. The sparsity is a property of MAP estimation with Laplace priors, not a property of the posterior distribution itself.
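A one-dimensional illustration of this point (the numbers are hand-picked so that the residual correlation stays below the threshold $\sigma^2/b$): computing the posterior for a single coefficient on a grid shows a mode exactly at zero and a mean that is small but nonzero.

```python
# With a Laplace prior, the posterior over a single coefficient is continuous, yet its mode
# (the MAP / Lasso estimate) sits exactly at zero while its mean does not.
import numpy as np

x = np.ones(10)
y = np.full(10, 0.3)                       # weak signal: x^T y = 3
sigma, b = 1.0, 0.25                       # noise sd and prior scale; threshold sigma^2/b = 4 > 3

beta_grid = np.arange(-10000, 10001) * 2e-4             # grid from -2 to 2 containing 0.0 exactly
log_post = np.array([-np.sum((y - x * m) ** 2) / (2 * sigma**2) - abs(m) / b
                     for m in beta_grid])
post = np.exp(log_post - log_post.max())
post /= post.sum()                                       # normalize as discrete grid weights

print("posterior mode (MAP):", beta_grid[np.argmax(post)])                 # exactly 0.0
print("posterior mean      :", round(float((beta_grid * post).sum()), 4))  # small but nonzero
```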
The lack of closed-form posteriors affects both MAP estimation (Lasso) and full Bayesian inference differently.
The Scale-Mixture Representation:
A clever computational trick: the Laplace distribution can be written as a scale mixture of Gaussians:
$$\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2)$$ $$\tau_j^2 \sim \text{Exponential}(\lambda^2/2)$$
Marginalizing over $\tau_j^2$ gives a Laplace distribution on $\beta_j$.
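The representation is straightforward to check by simulation, assuming the rate parameterization in which $\tau_j^2 \sim \text{Exponential}(\text{rate} = \lambda^2/2)$ yields a marginal Laplace with scale $b = 1/\lambda$; the sketch below treats that as an assumption and verifies it numerically.

```python
# Monte Carlo check of the scale mixture: tau2 ~ Exponential(rate = lam^2/2),
# beta | tau2 ~ N(0, tau2) should give a marginal Laplace(0, b) with b = 1/lam.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam, n = 2.0, 200_000

tau2 = rng.exponential(scale=2.0 / lam**2, size=n)       # numpy's scale = 1/rate
beta = rng.normal(0.0, np.sqrt(tau2))

reference = stats.laplace(loc=0.0, scale=1.0 / lam)
print("sample variance:", beta.var(), " Laplace variance 2*b^2:", 2 / lam**2)
print("KS distance to Laplace(0, 1/lam):", stats.kstest(beta, reference.cdf).statistic)
```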
This representation enables Gibbs sampling: conditional on the $\tau_j^2$, the prior on $\boldsymbol{\beta}$ is Gaussian and the usual conjugate Gaussian update applies; conditional on $\boldsymbol{\beta}$, each $\tau_j^2$ has a standard conditional distribution that can be sampled directly.
The augmented model allows efficient MCMC despite the non-conjugacy of the original Laplace prior.
The 'Bayesian Lasso' uses full MCMC to explore the posterior under Laplace priors. It provides uncertainty quantification (credible intervals) but does NOT produce exact zeros—samples are from a continuous distribution. Frequentist Lasso (MAP) produces exact zeros but doesn't naturally provide uncertainty estimates. Choose based on whether you need sparsity or uncertainty quantification more.
We derived that $\lambda = 2\sigma^2/b$, where $b$ is the Laplace scale and $\sigma^2$ is the noise variance. Let's understand this relationship more deeply.
The Signal-to-Noise Interpretation:
Rewriting: $b = 2\sigma^2/\lambda$
Larger $b$ means: a more diffuse prior that tolerates large coefficients, hence a smaller $\lambda$, weaker shrinkage, and fewer exact zeros.
Smaller $b$ means: a prior tightly concentrated around zero, hence a larger $\lambda$, stronger shrinkage, and more coefficients set exactly to zero.
| Scenario | Prior Scale $b$ | $\lambda$ | Regularization Effect |
|---|---|---|---|
| Very sparse belief | Small | Large | Strong shrinkage, many zeros |
| Moderate sparsity | Moderate | Moderate | Some shrinkage, some zeros |
| Dense belief | Large | Small | Weak shrinkage, few zeros |
| Flat prior limit ($b \to \infty$) | $\infty$ | 0 | No regularization (OLS) |
Comparison with Ridge:
| Regularization | Prior | Scale Parameter | Relation to $\lambda$ |
|---|---|---|---|
| Ridge (L2) | Gaussian $\mathcal{N}(0, \tau^2)$ | Prior variance $\tau^2$ | $\lambda = \sigma^2/\tau^2$ |
| Lasso (L1) | Laplace$(0, b)$ | Prior scale $b$ | $\lambda = 2\sigma^2/b$ |
Both have the same structure: $\lambda \propto \sigma^2 / \text{prior scale}$.
Strong prior (small scale) → strong regularization.
Weak prior (large scale) → weak regularization.
When you select $\lambda$ by cross-validation, you're implicitly selecting a prior scale parameter. Cross-validation finds the prior belief about coefficient sizes that best predicts new data. This is a form of empirical Bayes—letting the data choose the prior hyperparameters.
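A hedged sketch of this idea using scikit-learn's `LassoCV` (the conversion $\lambda \approx 2n\alpha$ follows from scikit-learn's $\frac{1}{2n}$-scaled objective, and $\sigma^2$ is replaced by a crude residual-based estimate, so the recovered scale is only approximate): cross-validation picks `alpha_`, which we translate into the implied Laplace prior scale $b = 2\hat\sigma^2/\lambda$.

```python
# Cross-validation as implicit prior choice: choose alpha by CV, convert to this page's
# lambda, and back out the Laplace prior scale b = 2*sigma^2/lambda that CV has implied.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])
y = X @ beta_true + rng.standard_normal(n)

model = LassoCV(cv=5).fit(X, y)
lam = 2 * n * model.alpha_                            # sklearn alpha -> lambda of ||.||^2 + lambda*||.||_1
sigma2_hat = np.mean((y - model.predict(X)) ** 2)     # crude plug-in noise estimate
b_implied = 2 * sigma2_hat / lam                      # implied Laplace prior scale

print(f"CV alpha = {model.alpha_:.4f}   implied lambda = {lam:.2f}   implied prior scale b = {b_implied:.4f}")
```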
The Laplace prior is just one of many sparsity-inducing priors. Understanding its Bayesian interpretation opens the door to principled extensions.
The Elastic Net Prior:
Elastic Net combines L1 and L2 penalties. Its Bayesian interpretation involves mixing Gaussian and Laplace:
$$p(\beta_j) \propto \exp\left(-\frac{\lambda_1}{2}|\beta_j| - \frac{\lambda_2}{2}\beta_j^2\right)$$
This isn't a standard distribution but can be explored via MCMC.
Hierarchical Extensions:
We can place priors on the prior scale parameters: $$\beta_j \mid b \sim \text{Laplace}(0, b)$$ $$b \sim \text{Inverse-Gamma}(a, c)$$
This creates a global shrinkage hierarchy: $b$ sets the overall shrinkage level and is adapted from the data; adding coefficient-specific (local) scales on top of this is what leads to the global-local priors, such as the horseshoe, mentioned below.
Different priors trade off exact sparsity against proper uncertainty quantification differently. Laplace with MAP gives exact zeros but no uncertainty. Horseshoe with MCMC gives proper uncertainty but approximate sparsity. Spike-and-slab gives both but at higher computational cost. Choose based on your priorities.
We've established the profound connection between Lasso regression and Bayesian inference with Laplace priors.
What's Next:
In Page 4, we'll explore MAP estimation in depth—the optimization framework that connects Bayesian priors to regularization penalties. We'll see how choosing the posterior mode (MAP) rather than the posterior mean fundamentally changes inference, and why MAP estimation can produce point estimates that no single sample from the posterior would give.
You now understand why Lasso produces sparse solutions: the Laplace prior's geometry creates corners where the posterior mode naturally sits. This Bayesian perspective transforms Lasso from an algorithmic trick for variable selection into a principled statement of prior belief—that most coefficients should be exactly or nearly zero, with signal concentrated in a few important features.