Ridge regression emerges from Gaussian priors, producing proportional shrinkage but never exact zeros. Yet in many applications, we genuinely believe that most coefficients should be exactly zero—that only a subset of features truly matter.
Lasso regression famously produces sparse solutions. Variables are either in the model with non-zero coefficients or completely excluded with exactly zero coefficients. This isn't just computationally convenient—it's often scientifically meaningful. In genomics, we might believe only a handful of genes among thousands affect a phenotype. In finance, only certain factors might truly drive returns.
The Bayesian perspective reveals why Lasso produces sparsity: it corresponds to a Laplace (double exponential) prior on coefficients. This page develops this connection rigorously, explaining the geometric and algebraic reasons Laplace priors induce exact zeros where Gaussian priors cannot.
By completing this page, you will: (1) Understand the Laplace distribution and its properties; (2) Derive Lasso as MAP estimation under a Laplace prior; (3) Explain geometrically why Laplace priors induce sparsity; (4) Appreciate the computational challenges of full Bayesian inference with Laplace priors; (5) Connect the scale parameter to regularization strength.
Before connecting Laplace priors to Lasso, we need to understand the Laplace distribution itself—a distribution that predates the Gaussian in the history of statistics.
Definition:
The Laplace distribution with location $\mu$ and scale $b > 0$ has density:
$$f(x \mid \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right)$$
We write $X \sim \text{Laplace}(\mu, b)$.
For our regularization priors, we use the centered Laplace with $\mu = 0$:
$$f(x \mid b) = \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right)$$
| Property | Value | Comparison to Gaussian |
|---|---|---|
| Mean | $\mu$ (location) | Same as Gaussian mean |
| Variance | $2b^2$ | Gaussian: $\sigma^2$ |
| Mode | $\mu$ | Same (both symmetric) |
| Shape | Peaked at center, exponential tails | Bell-shaped, Gaussian tails |
| Tail decay | $\exp(-|x|/b)$ (slower) | $\exp(-x^2/(2\sigma^2))$ (faster) |
| Kurtosis | 6 (heavy tails) | 3 (mesokurtic) |
| Support | $(-\infty, \infty)$ | Same |
The Key Difference: Shape and Tails
The Laplace and Gaussian distributions differ fundamentally in their shape:
Peak sharpness: The Laplace has a sharp peak at the mode (cusp), while the Gaussian has a smooth, rounded peak.
Tail behavior: Laplace tails decay exponentially ($e^{-|x|/b}$), while Gaussian tails decay as a Gaussian ($e^{-x^2/(2\sigma^2)}$). For large $|x|$, Laplace tails are heavier—they assign more probability to extreme values.
Concentration near zero: Despite heavier tails, the Laplace puts more mass near zero due to its sharp peak. This combination (more mass at zero AND heavier tails) is what drives sparsity.
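This combination is easy to check numerically. The sketch below (illustrative values, using scipy) matches a Gaussian's variance to a Laplace's ($\sigma^2 = 2b^2$) and compares the probability of a small interval around zero with the probability of the far tails; the Laplace assigns more mass to both.

```python
# Compare a Laplace(0, b) with a Gaussian of equal variance (sigma^2 = 2*b^2):
# the Laplace puts more mass very near zero AND more mass far out in the tails.
from scipy import stats

b = 1.0
laplace = stats.laplace(loc=0.0, scale=b)
gauss = stats.norm(loc=0.0, scale=(2 * b**2) ** 0.5)   # variance matched to the Laplace

for name, dist in [("Laplace", laplace), ("Gaussian", gauss)]:
    mass_near_zero = dist.cdf(0.1) - dist.cdf(-0.1)    # P(|X| < 0.1)
    mass_in_tails = 2 * dist.sf(4.0)                   # P(|X| > 4)
    print(f"{name:8s}  P(|X|<0.1) = {mass_near_zero:.4f}   P(|X|>4) = {mass_in_tails:.6f}")
```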
The Laplace distribution can be understood as two exponential distributions glued at the origin—one for positive values, one for negative values. Hence the name 'double exponential.' This construction explains the sharp peak: two exponentials meeting creates a cusp, not a smooth joining.
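This construction can be verified by simulation: a random sign times an Exponential draw should reproduce the Laplace's variance $2b^2$ and kurtosis 6 from the table above. A minimal check (illustrative values) with numpy and scipy:

```python
# Check the 'double exponential' construction: a random sign times an Exponential(scale=b)
# draw follows the Laplace(0, b) distribution, with variance 2*b^2 and kurtosis 6.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
b, n = 1.5, 200_000

samples = rng.choice([-1.0, 1.0], size=n) * rng.exponential(scale=b, size=n)

print("sample variance:", samples.var(), " theory:", 2 * b**2)
print("sample kurtosis:", stats.kurtosis(samples, fisher=False), " theory: 6")
print("KS distance to Laplace(0, b):", stats.kstest(samples, stats.laplace(scale=b).cdf).statistic)
```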
Let's now derive Lasso regression as MAP estimation under a Laplace prior, paralleling our Ridge derivation.
The Setup:
Likelihood (same as before): $$\mathbf{y} \mid \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$$
Prior: Independent Laplace priors on each coefficient: $$\beta_j \sim \text{Laplace}(0, b), \quad j = 1, \ldots, p$$
$$p(\boldsymbol{\beta}) = \prod_{j=1}^p \frac{1}{2b} \exp\left(-\frac{|\beta_j|}{b}\right) = \frac{1}{(2b)^p} \exp\left(-\frac{1}{b}\sum_{j=1}^p |\beta_j|\right)$$
Applying Bayes' Theorem:
$$p(\boldsymbol{\beta} \mid \mathbf{y}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}) \cdot p(\boldsymbol{\beta})$$
$$\propto \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2\right) \cdot \exp\left(-\frac{1}{b}\|\boldsymbol{\beta}\|_1\right)$$
$$= \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 - \frac{1}{b}\|\boldsymbol{\beta}\|_1\right)$$
Taking the Log:
$$\log p(\boldsymbol{\beta} \mid \mathbf{y}) = -\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 - \frac{1}{b}\|\boldsymbol{\beta}\|_1 + \text{const}$$
Maximizing $\log p(\boldsymbol{\beta} \mid \mathbf{y})$ is equivalent to minimizing $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ with $\lambda = 2\sigma^2/b$. The Lasso solution is the Maximum A Posteriori (MAP) estimate under a Gaussian likelihood and a Laplace prior!
The Full Relationship:
$$\boldsymbol{\hat\beta}_{\text{Lasso}} = \arg\max_{\boldsymbol{\beta}} \, p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X})$$
$$= \arg\min_{\boldsymbol{\beta}} \left[\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1\right]$$
where: $$\lambda = \frac{2\sigma^2}{b}$$
This connects the prior scale parameter $b$ to the regularization parameter $\lambda$. Smaller $b$ (tighter prior around zero) means larger $\lambda$ (stronger regularization).
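As a quick sanity check on this mapping (simulated data; the values of $\sigma$, $b$, and the design below are purely illustrative), the negative log-posterior and the Lasso objective with $\lambda = 2\sigma^2/b$ should differ only by the constant factor $2\sigma^2$, so they share the same minimizer:

```python
# Verify that the negative log-posterior (Gaussian likelihood + Laplace prior) and the
# Lasso objective with lambda = 2*sigma^2/b differ only by the constant factor 2*sigma^2,
# hence are minimized by the same beta.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
sigma, b = 0.7, 0.4                     # noise sd and Laplace prior scale (illustrative)
y = X @ beta_true + sigma * rng.standard_normal(n)
lam = 2 * sigma**2 / b                  # the mapping derived above

def neg_log_posterior(beta):            # up to an additive constant
    return np.sum((y - X @ beta) ** 2) / (2 * sigma**2) + np.sum(np.abs(beta)) / b

def lasso_objective(beta):
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

for _ in range(3):                      # the ratio equals 2*sigma^2 for any beta
    beta = rng.standard_normal(p)
    print(lasso_objective(beta) / neg_log_posterior(beta), "vs", 2 * sigma**2)
```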
Both Gaussian and Laplace priors are continuous with $p(\beta_j = 0) = 0$. So why does Lasso produce exact zeros while Ridge doesn't? The key is that we're doing MAP estimation (finding the mode), not full posterior inference.
The Subgradient Condition:
At a maximum of the log-posterior, we need the derivative to be zero. But the L1 norm $|\beta_j|$ is not differentiable at $\beta_j = 0$—it has different left and right derivatives.
For $\beta_j > 0$: $\frac{d}{d\beta_j}|\beta_j| = 1$
For $\beta_j < 0$: $\frac{d}{d\beta_j}|\beta_j| = -1$
At $\beta_j = 0$: any value in $[-1, 1]$ is a valid subgradient
For non-differentiable convex functions, the subgradient generalizes the derivative. At a point where the function has a 'corner' (like $|x|$ at $x=0$), the subgradient is the set of all slopes of lines that touch the function from below. For $|x|$ at $x=0$, this set is $[-1, 1]$.
The Optimality Condition:
For the Lasso objective: $$L(\boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$$
The (sub)gradient condition for optimality is: $$-\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta}) + \lambda \cdot \text{sign}(\boldsymbol{\hat\beta}) = \mathbf{0}$$
where $\text{sign}(\beta_j) \in [-1, 1]$ if $\beta_j = 0$.
For the $j$-th coefficient, this gives: $$[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j = \lambda \cdot s_j$$
where $s_j = \text{sign}(\hat\beta_j)$ if $\hat\beta_j \neq 0$, or $s_j \in [-1, 1]$ if $\hat\beta_j = 0$.
When Can $\hat\beta_j = 0$?
The condition for $\hat\beta_j = 0$ to be optimal is: $$\left|[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j\right| \leq \lambda$$
If the absolute correlation between residuals and feature $j$ is less than $\lambda$, the coefficient is set to exactly zero.
This is the soft thresholding condition: the data's 'vote' for including feature $j$ (measured by correlation with residuals) must exceed a threshold $\lambda$ for the feature to be included.
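To make the soft-thresholding condition concrete, here is a minimal coordinate-descent Lasso solver (a sketch on simulated data, not a production implementation) for the objective $\frac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$. After fitting, it checks the subgradient conditions above: zero coefficients satisfy $|[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j| \leq \lambda$, and nonzero ones satisfy equality with $\lambda \cdot \text{sign}(\hat\beta_j)$.

```python
# Coordinate-descent Lasso via soft thresholding, plus a check of the subgradient conditions.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]           # residual excluding feature j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 1.5, 0, 0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)

lam = 20.0
beta_hat = lasso_cd(X, y, lam)
corr = X.T @ (y - X @ beta_hat)                              # the data's 'vote' for each feature

print("beta_hat:", np.round(beta_hat, 3))                    # several entries are exactly 0.0
print("zeros satisfy |corr| <= lam:", (np.abs(corr[beta_hat == 0]) <= lam + 1e-6).all())
print("nonzeros satisfy corr = lam*sign:",
      np.allclose(corr[beta_hat != 0], lam * np.sign(beta_hat[beta_hat != 0]), atol=1e-6))
```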
Contrast with Ridge:
For Ridge, the L2 penalty $\beta_j^2$ is differentiable everywhere with derivative $2\beta_j$. The optimality condition: $$[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j = \lambda \hat\beta_j$$
This equation has $\hat\beta_j = 0$ as a solution only if feature $j$ is exactly uncorrelated with the residuals, $[\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat\beta})]_j = 0$—a measure-zero event. Ridge shrinks coefficients toward zero but almost never lands exactly on it.
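The contrast is easy to see empirically with scikit-learn (the library's `alpha` parameter is scaled differently from the $\lambda$ above, so the specific values here are illustrative): on data where only a few features matter, Lasso returns many coefficients that are exactly zero, while Ridge essentially never does.

```python
# Fit Ridge and Lasso on the same sparse-signal data and count exactly-zero coefficients.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [4, -3, 2, -2, 1.5]                  # only 5 of 30 features matter
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso exact zeros:", int(np.sum(lasso.coef_ == 0.0)), "of", p)
print("Ridge exact zeros:", int(np.sum(ridge.coef_ == 0.0)), "of", p)   # essentially always 0
```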
The L1 ball has corners on the coordinate axes. The L2 ball is smooth everywhere. When the likelihood contours intersect the constraint set, they typically hit corners of the L1 ball (yielding zeros) but smooth points of the L2 ball (yielding non-zeros). The 'corners' of the Laplace prior translate into corners on the constraint set, creating natural sparsity.
The geometric picture provides powerful intuition for why Laplace priors produce sparsity.
Contours of the Prior:
In two dimensions ($\beta_1, \beta_2$), the contours of equal prior probability are circles for the Gaussian ($\beta_1^2 + \beta_2^2 = c$) and diamonds for the Laplace ($|\beta_1| + |\beta_2| = c$).
The key difference is that Laplace contours have corners at $(c, 0)$, $(-c, 0)$, $(0, c)$, $(0, -c)$, while Gaussian contours are smooth everywhere.
Contours of the Likelihood:
For Gaussian likelihood with quadratic loss, the likelihood contours are ellipses: $$\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = c$$
These ellipses are typically not axis-aligned (unless features are orthogonal).
MAP Estimation as Constrained Optimization:
Finding the MAP estimate is equivalent to: $$\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_1 \leq t$$
We expand the loss contours outward from the unconstrained (OLS) solution until they first touch the constraint region.
Where Does First Contact Occur?
For a randomly positioned and oriented ellipse, the probability that first contact with a circle happens exactly on a coordinate axis is zero. For a diamond, by contrast, the corners are highly probable contact points.
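This claim can be checked by simulation. The sketch below (a purely illustrative 2D experiment with arbitrary parameters) draws random quadratic losses, finds the loss minimizer on the boundary of the unit L1 ball and the unit L2 ball by a dense grid search, and counts how often that minimizer has an exactly-zero coordinate.

```python
# Monte Carlo in 2D: minimize random quadratic losses over the boundary of the unit L1 ball
# (diamond) and unit L2 ball (circle); count how often the minimizer lies on a coordinate axis.
import numpy as np

rng = np.random.default_rng(4)
tol = 1e-3
t = np.linspace(0.0, 1.0, 4001)[:-1]

corners = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]], dtype=float)
diamond = np.concatenate([np.outer(1 - t, corners[k]) + np.outer(t, corners[k + 1])
                          for k in range(4)])                 # dense points on the diamond
theta = np.linspace(0.0, 2 * np.pi, 16000, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])      # dense points on the circle

hits = {"L1 (diamond)": 0, "L2 (circle)": 0}
trials = 0
while trials < 2000:
    M = rng.standard_normal((2, 2))
    A = M.T @ M + 0.1 * np.eye(2)                             # random ellipse shape
    center = rng.uniform(-3, 3, size=2)                       # unconstrained optimum
    if np.abs(center).sum() <= 1.0 or np.linalg.norm(center) <= 1.0:
        continue                                              # keep the optimum outside both balls
    trials += 1
    for name, boundary in [("L1 (diamond)", diamond), ("L2 (circle)", circle)]:
        d = boundary - center
        loss = np.einsum("ij,jk,ik->i", d, A, d)              # quadratic loss along the boundary
        best = boundary[np.argmin(loss)]
        hits[name] += int(np.any(np.abs(best) < tol))

print(hits)   # corner contacts dominate for the diamond; axis contacts are rare for the circle
```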
In $p$ dimensions, the L1 ball has $2p$ corners (one at $\pm c$ on each axis) and $2^p$ facets (hyperplanes connecting corners). The corners occupy a significant fraction of the boundary's 'probability' for first contact with an ellipse. This is why Lasso produces sparse solutions even without explicit variable selection.
Visualizing in Higher Dimensions:
In $p$ dimensions, the effect only strengthens: as $p$ grows, the L1 ball becomes increasingly 'spiky,' with most of its volume concentrated near the corners. This high-dimensional geometry explains why Lasso's sparsity-inducing property becomes stronger in high dimensions—exactly where sparsity is most useful.
Let's examine the probability densities more carefully to understand what each prior 'believes' about coefficients.
| Aspect | Gaussian $\mathcal{N}(0, \tau^2)$ | Laplace$(0, b)$ |
|---|---|---|
| Density at 0 | $1/\sqrt{2\pi\tau^2}$ | $1/(2b)$ |
| Log-density at 0 | $-\frac{1}{2}\log(2\pi\tau^2)$ | $-\log(2b)$ |
| Log-density decay | Quadratic: $-\beta^2/(2\tau^2)$ | Linear: $-|\beta|/b$ |
| Penalty on $|\beta|=1$ | $(2\tau^2)^{-1}$ | $b^{-1}$ |
| Penalty on $|\beta|=10$ | $(2\tau^2)^{-1} \times 100$ | $b^{-1} \times 10$ |
| Relative cost of 10 vs 1 | 100× worse | 10× worse |
The Penalty Function Perspective:
Define the log-prior penalty (negative log-prior, ignoring constants): $\frac{\beta^2}{2\tau^2}$ for the Gaussian and $\frac{|\beta|}{b}$ for the Laplace.
Key Difference:
For small $|\beta|$: the Gaussian penalty is nearly flat near zero (its slope vanishes at $\beta = 0$), so there is little pressure to push an already-small coefficient all the way to zero. The Laplace penalty keeps a constant slope $1/b$ right up to zero, so small coefficients get pushed exactly to zero.
For large $|\beta|$: the Gaussian penalty grows quadratically and punishes large coefficients severely, while the Laplace penalty grows only linearly and tolerates a few large coefficients.
The Laplace prior penalizes deviations from zero at a constant rate per unit, regardless of starting point. The Gaussian prior penalizes marginally—the cost of moving from 9 to 10 is much greater than from 0 to 1.
Gaussian: 'Every additional unit of $|\beta|$ is more expensive than the last.' Laplace: 'Every unit of $|\beta|$ costs the same, regardless of how much you already have.'
This constant marginal cost means Laplace priors don't mind if you concentrate effect sizes in a few large coefficients (sparsity) rather than spreading them across many small ones.
Implications for Coefficient Estimation:
Consider a fixed total signal magnitude $M$, either concentrated in one coefficient or spread evenly across $k$ coefficients of size $M/k$ each (see the sketch after this comparison):
Gaussian penalty: $k \cdot \frac{(M/k)^2}{2\tau^2} = \frac{M^2}{2k\tau^2}$ — spreading the signal lowers the penalty, so the Gaussian prior favors many small coefficients.
Laplace penalty: $k \cdot \frac{M/k}{b} = \frac{M}{b}$ — the penalty is the same however the signal is distributed, so the Laplace prior has no objection to concentrating it in a few coefficients when the data support that.
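A tiny calculation makes this concrete (the total magnitude $M$ and the scales $\tau^2$, $b$ below are arbitrary):

```python
# Total signal of magnitude M, concentrated in one coefficient (k=1) or spread over k of them.
# The Gaussian (L2) penalty rewards spreading; the Laplace (L1) penalty is indifferent.
import numpy as np

M, tau2, b = 10.0, 1.0, 1.0
for k in [1, 2, 5, 10]:
    beta = np.full(k, M / k)
    gaussian_penalty = np.sum(beta ** 2) / (2 * tau2)   # = M^2 / (2*k*tau2): shrinks as k grows
    laplace_penalty = np.sum(np.abs(beta)) / b          # = M / b: identical for every k
    print(f"k={k:2d}  Gaussian penalty = {gaussian_penalty:6.2f}   Laplace penalty = {laplace_penalty:5.2f}")
```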
A crucial difference from the Gaussian case: the posterior under a Laplace prior is not Gaussian and does not have a closed form.
Why No Closed Form?
For Gaussian prior and Gaussian likelihood, the product of two Gaussians is Gaussian—this conjugacy gives closed-form posteriors.
For Laplace prior and Gaussian likelihood: $$p(\boldsymbol{\beta} \mid \mathbf{y}) \propto \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2\right) \cdot \exp\left(-\frac{1}{b}\|\boldsymbol{\beta}\|_1\right)$$
This combines a Gaussian (in the squared distance) with a Laplace (in the L1 norm). The product is neither Gaussian nor Laplace—it's a complicated distribution with no standard name.
Laplace priors are not conjugate to Gaussian likelihoods. The posterior is intractable to compute exactly in closed form. This is why Lasso (MAP estimation) is practical, but full Bayesian inference with Laplace priors requires approximate methods.
Properties of the Posterior:
Although we can't write down the posterior in closed form, we know several things:
Unimodal: The log-posterior is a sum of concave functions (Gaussian log-likelihood plus Laplace log-prior), hence concave, so the posterior has a single mode
Non-Gaussian: The posterior has heavier tails and sharper peak than Gaussian
Probability that $\beta_j = 0$: Still zero! The posterior remains continuous
Mode at Lasso solution: The MAP estimate (posterior mode) is the Lasso solution
Posterior mean ≠ posterior mode: Unlike Gaussians, the mean and mode differ
Wait—If the Posterior is Continuous, How Can Lasso Give Exactly Zero?
This is a subtle but crucial point. The posterior $p(\beta_j \mid \mathbf{y})$ is continuous—it assigns zero probability to the event $\{\beta_j = 0\}$.
But the mode of this posterior (the Lasso estimate) can be exactly zero!
This is analogous to how the Laplace distribution itself is continuous with $p(x=0) = 0$, yet its mode is at $x = 0$.
The distinction: sparsity is a property of the point estimate (the mode sits exactly at the prior's cusp at zero), not of the posterior distribution, which assigns zero probability to any single value.
Lasso produces exact zeros because it finds the MAP estimate (mode), not the posterior mean. If you perform full Bayesian inference (MCMC sampling, posterior mean estimation), you will NOT get exactly zero coefficients. The sparsity is a property of MAP estimation with Laplace priors, not a property of the posterior distribution itself.
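A one-dimensional illustration of this point (the numbers are hand-picked so that the residual correlation stays below the threshold $\sigma^2/b$): computing the posterior for a single coefficient on a grid shows a mode exactly at zero and a mean that is small but nonzero.

```python
# With a Laplace prior, the posterior over a single coefficient is continuous, yet its mode
# (the MAP / Lasso estimate) sits exactly at zero while its mean does not.
import numpy as np

x = np.ones(10)
y = np.full(10, 0.3)                       # weak signal: x^T y = 3
sigma, b = 1.0, 0.25                       # noise sd and prior scale; threshold sigma^2/b = 4 > 3

beta_grid = np.arange(-10000, 10001) * 2e-4             # grid from -2 to 2 containing 0.0 exactly
log_post = np.array([-np.sum((y - x * m) ** 2) / (2 * sigma**2) - abs(m) / b
                     for m in beta_grid])
post = np.exp(log_post - log_post.max())
post /= post.sum()                                       # normalize as discrete grid weights

print("posterior mode (MAP):", beta_grid[np.argmax(post)])                 # exactly 0.0
print("posterior mean      :", round(float((beta_grid * post).sum()), 4))  # small but nonzero
```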
The lack of closed-form posteriors affects both MAP estimation (Lasso) and full Bayesian inference differently.
The Scale-Mixture Representation:
A clever computational trick: the Laplace distribution can be written as a scale mixture of Gaussians:
$$\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2)$$ $$\tau_j^2 \sim \text{Exponential}(\lambda^2/2)$$
Marginalizing over $\tau_j^2$ gives a Laplace distribution on $\beta_j$.
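The representation is straightforward to check by simulation, assuming the rate parameterization in which $\tau_j^2 \sim \text{Exponential}(\text{rate} = \lambda^2/2)$ yields a marginal Laplace with scale $b = 1/\lambda$; the sketch below treats that as an assumption and verifies it numerically.

```python
# Monte Carlo check of the scale mixture: tau2 ~ Exponential(rate = lam^2/2),
# beta | tau2 ~ N(0, tau2) should give a marginal Laplace(0, b) with b = 1/lam.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam, n = 2.0, 200_000

tau2 = rng.exponential(scale=2.0 / lam**2, size=n)       # numpy's scale = 1/rate
beta = rng.normal(0.0, np.sqrt(tau2))

reference = stats.laplace(loc=0.0, scale=1.0 / lam)
print("sample variance:", beta.var(), " Laplace variance 2*b^2:", 2 / lam**2)
print("KS distance to Laplace(0, 1/lam):", stats.kstest(beta, reference.cdf).statistic)
```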
This representation enables Gibbs sampling: conditional on the $\tau_j^2$, the prior on $\boldsymbol{\beta}$ is Gaussian and the usual conjugate Gaussian update applies; conditional on $\boldsymbol{\beta}$, each $\tau_j^2$ has a standard conditional distribution that can be sampled directly.
The augmented model allows efficient MCMC despite the non-conjugacy of the original Laplace prior.
The 'Bayesian Lasso' uses full MCMC to explore the posterior under Laplace priors. It provides uncertainty quantification (credible intervals) but does NOT produce exact zeros—samples are from a continuous distribution. Frequentist Lasso (MAP) produces exact zeros but doesn't naturally provide uncertainty estimates. Choose based on whether you need sparsity or uncertainty quantification more.
We derived that $\lambda = 2\sigma^2/b$, where $b$ is the Laplace scale and $\sigma^2$ is the noise variance. Let's understand this relationship more deeply.
The Signal-to-Noise Interpretation:
Rewriting: $b = 2\sigma^2/\lambda$
Larger $b$ means: a more diffuse prior that tolerates large coefficients, hence a smaller $\lambda$, weaker shrinkage, and fewer exact zeros.
Smaller $b$ means: a prior tightly concentrated around zero, hence a larger $\lambda$, stronger shrinkage, and more coefficients set exactly to zero.
| Scenario | Prior Scale $b$ | $\lambda$ | Regularization Effect |
|---|---|---|---|
| Very sparse belief | Small | Large | Strong shrinkage, many zeros |
| Moderate sparsity | Moderate | Moderate | Some shrinkage, some zeros |
| Dense belief | Large | Small | Weak shrinkage, few zeros |
| Flat prior limit ($b \to \infty$) | $\infty$ | 0 | No regularization (OLS) |
Comparison with Ridge:
| Regularization | Prior | Scale Parameter | Relation to $\lambda$ |
|---|---|---|---|
| Ridge (L2) | Gaussian $\mathcal{N}(0, \tau^2)$ | Prior variance $\tau^2$ | $\lambda = \sigma^2/\tau^2$ |
| Lasso (L1) | Laplace$(0, b)$ | Prior scale $b$ | $\lambda = 2\sigma^2/b$ |
Both have the same structure: $\lambda \propto \sigma^2 / \text{prior scale}$.
Strong prior (small scale) → strong regularization.
Weak prior (large scale) → weak regularization.
When you select $\lambda$ by cross-validation, you're implicitly selecting a prior scale parameter. Cross-validation finds the prior belief about coefficient sizes that best predicts new data. This is a form of empirical Bayes—letting the data choose the prior hyperparameters.
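A hedged sketch of this idea using scikit-learn's `LassoCV` (the conversion $\lambda \approx 2n\alpha$ follows from scikit-learn's $\frac{1}{2n}$-scaled objective, and $\sigma^2$ is replaced by a crude residual-based estimate, so the recovered scale is only approximate): cross-validation picks `alpha_`, which we translate into the implied Laplace prior scale $b = 2\hat\sigma^2/\lambda$.

```python
# Cross-validation as implicit prior choice: choose alpha by CV, convert to this page's
# lambda, and back out the Laplace prior scale b = 2*sigma^2/lambda that CV has implied.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(6)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])
y = X @ beta_true + rng.standard_normal(n)

model = LassoCV(cv=5).fit(X, y)
lam = 2 * n * model.alpha_                            # sklearn alpha -> lambda of ||.||^2 + lambda*||.||_1
sigma2_hat = np.mean((y - model.predict(X)) ** 2)     # crude plug-in noise estimate
b_implied = 2 * sigma2_hat / lam                      # implied Laplace prior scale

print(f"CV alpha = {model.alpha_:.4f}   implied lambda = {lam:.2f}   implied prior scale b = {b_implied:.4f}")
```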
The Laplace prior is just one of many sparsity-inducing priors. Understanding its Bayesian interpretation opens the door to principled extensions.
The Elastic Net Prior:
Elastic Net combines L1 and L2 penalties. Its Bayesian interpretation involves mixing Gaussian and Laplace:
$$p(\beta_j) \propto \exp\left(-\frac{\lambda_1}{2}|\beta_j| - \frac{\lambda_2}{2}\beta_j^2\right)$$
This isn't a standard distribution but can be explored via MCMC.
Hierarchical Extensions:
We can place priors on the prior scale parameters: $$\beta_j \mid b \sim \text{Laplace}(0, b)$$ $$b \sim \text{Inverse-Gamma}(a, c)$$
This creates a global shrinkage hierarchy: $b$ sets the overall shrinkage level and is adapted from the data; adding coefficient-specific (local) scales on top of this is what leads to the global-local priors, such as the horseshoe, mentioned below.
Different priors trade off exact sparsity against proper uncertainty quantification differently. Laplace with MAP gives exact zeros but no uncertainty. Horseshoe with MCMC gives proper uncertainty but approximate sparsity. Spike-and-slab gives both but at higher computational cost. Choose based on your priorities.
We've established the profound connection between Lasso regression and Bayesian inference with Laplace priors.
What's Next:
In Page 4, we'll explore MAP estimation in depth—the optimization framework that connects Bayesian priors to regularization penalties. We'll see how choosing the posterior mode (MAP) rather than the posterior mean fundamentally changes inference, and why MAP estimation can produce point estimates that no single sample from the posterior would give.
You now understand why Lasso produces sparse solutions: the Laplace prior's geometry creates corners where the posterior mode naturally sits. This Bayesian perspective transforms Lasso from an algorithmic trick for variable selection into a principled statement of prior belief—that most coefficients should be exactly or nearly zero, with signal concentrated in a few important features.