Throughout this module, we've developed Bayesian linear regression from first principles—priors on weights, posterior derivation, predictive distributions with uncertainty. Meanwhile, in classical machine learning, practitioners routinely use regularization to prevent overfitting: Ridge regression, LASSO, elastic net.
These seem like different worlds. One speaks of probability distributions and Bayesian updating. The other speaks of penalty terms and constrained optimization. Yet they lead to the same solutions.
This is not a coincidence. There is a deep, mathematically precise connection between Bayesian inference and regularization. Every regularized method corresponds to a specific prior. Every prior implies a particular form of regularization. Understanding this connection provides both practical guidance for choosing a regularizer and a probabilistic interpretation of its strength.
By the end of this page, you will understand how every regularized estimator corresponds to a specific Bayesian prior, derive the exact correspondence between Ridge/LASSO and Gaussian/Laplace priors, see how regularization strength relates to prior precision, and gain intuition for choosing regularization methods based on their probabilistic interpretation.
The connection between Bayesian inference and regularization runs through the Maximum A Posteriori (MAP) estimator.
MAP Definition:
The MAP estimate is the mode (peak) of the posterior distribution:
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} p(\mathbf{w} | \mathbf{y}, \mathbf{X})$$
Applying Bayes' theorem:
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \frac{p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w})}{p(\mathbf{y} | \mathbf{X})}$$
The denominator doesn't depend on w, so:
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w}) \right]$$
Taking the logarithm (monotonic transformation):
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) \right]$$
Or equivalently, as a minimization:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ -\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) - \log p(\mathbf{w}) \right]$$
The MAP objective decomposes into two terms:
• Negative log-likelihood: -log p(y|X,w) → This is the data fit term (like squared error)
• Negative log-prior: -log p(w) → This is the regularization term (like ‖w‖²)
Regularization IS the negative log-prior. Different priors = different regularizers.
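As a quick numerical illustration of this decomposition (the synthetic data and hyperparameter values below are assumptions made only for the sketch), minimizing the negative log-likelihood plus the negative log-prior directly recovers the penalized solution derived in the next section:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, D = 40, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

sigma_sq = 0.3 ** 2   # noise variance (assumed known for this sketch)
alpha = 2.0           # prior precision (illustrative value)

def neg_log_likelihood(w):
    # -log N(y | Xw, sigma^2 I), dropping constants: the data-fit term
    return 0.5 / sigma_sq * np.sum((y - X @ w) ** 2)

def neg_log_prior(w):
    # -log N(w | 0, alpha^{-1} I), dropping constants: the regularization term
    return 0.5 * alpha * np.sum(w ** 2)

def neg_log_posterior(w):
    # MAP objective = data fit + regularization
    return neg_log_likelihood(w) + neg_log_prior(w)

w_map = minimize(neg_log_posterior, x0=np.zeros(D)).x

# Same answer from the closed-form Ridge solution with lambda = alpha * sigma_sq
lam = alpha * sigma_sq
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-4))  # should agree closely
```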
Let's derive the correspondence for the most common case: Gaussian prior with Ridge regularization.
The Gaussian Prior:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \alpha^{-1}\mathbf{I})$$
Negative log-prior:
$$-\log p(\mathbf{w}) = \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const} = \frac{\alpha}{2}\|\mathbf{w}\|_2^2 + \text{const}$$
The Gaussian Likelihood:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \sigma^2\mathbf{I})$$
Negative log-likelihood:
$$-\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \text{const}$$
Combining (MAP Objective):
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \frac{\alpha}{2}\|\mathbf{w}\|_2^2 \right]$$
Multiplying by 2σ²:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\sigma^2\|\mathbf{w}\|_2^2 \right]$$
Ridge Regression:
w_Ridge = argmin ||y - Xw||² + λ||w||²
Bayesian Correspondence:
λ = α σ² = (prior precision) × (noise variance)
Gaussian prior with precision α ↔ L2 regularization with strength λ = ασ²
Closed-Form Solution:
Both formulations yield the same closed-form:
$$\mathbf{w}_{\text{Ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
With λ = ασ², this is exactly the posterior mean mₙ from Bayesian linear regression!
What This Means:
Ridge is Bayesian: Every Ridge regression is secretly computing the MAP under a Gaussian prior.
λ has meaning: The regularization strength isn't arbitrary—it's the ratio of prior precision to noise precision.
λ = 1 interpretation: With standardized features, λ = 1 means α = 1/σ², so the prior contributes about as much precision as a single data point.
Large λ means strong prior: Heavy regularization = confident prior belief that weights should be small.
```python
import numpy as np
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt

def bayesian_map(X, y, alpha, sigma_sq):
    """Compute MAP estimate (posterior mean for Gaussian prior)."""
    D = X.shape[1]
    lambda_reg = alpha * sigma_sq
    w_map = np.linalg.solve(X.T @ X + lambda_reg * np.eye(D), X.T @ y)
    return w_map

def sklearn_ridge(X, y, lambda_reg):
    """Compute Ridge regression solution."""
    model = Ridge(alpha=lambda_reg, fit_intercept=False)
    model.fit(X, y)
    return model.coef_

# Generate synthetic data
np.random.seed(42)
N, D = 50, 5
X = np.random.randn(N, D)
w_true = np.array([1.0, -0.5, 0.0, 0.3, -0.2])
sigma_true = 0.5
y = X @ w_true + sigma_true * np.random.randn(N)

# Bayesian parameters
alpha = 2.0  # Prior precision
sigma_sq = sigma_true ** 2

# Compute both ways
w_bayesian = bayesian_map(X, y, alpha, sigma_sq)
w_ridge = sklearn_ridge(X, y, lambda_reg=alpha * sigma_sq)

print("Bayesian MAP estimate:", w_bayesian)
print("sklearn Ridge estimate:", w_ridge)
print("Difference (should be ~0):", np.linalg.norm(w_bayesian - w_ridge))

# Visualize equivalence across different lambda values
lambdas = np.logspace(-3, 3, 50)
bayesian_norms = []
ridge_norms = []

for lam in lambdas:
    # For Bayesian: lambda = alpha * sigma_sq, so alpha = lambda / sigma_sq
    alpha_equiv = lam / sigma_sq
    w_b = bayesian_map(X, y, alpha_equiv, sigma_sq)
    w_r = sklearn_ridge(X, y, lam)
    bayesian_norms.append(np.linalg.norm(w_b))
    ridge_norms.append(np.linalg.norm(w_r))

plt.figure(figsize=(10, 6))
plt.semilogx(lambdas, bayesian_norms, 'b-', linewidth=2, label='Bayesian MAP')
plt.semilogx(lambdas, ridge_norms, 'r--', linewidth=2, label='sklearn Ridge')
plt.xlabel('λ (Regularization Strength)')
plt.ylabel('||w||₂')
plt.title('Ridge Regression = Bayesian MAP with Gaussian Prior')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

The LASSO (Least Absolute Shrinkage and Selection Operator) uses L1 regularization, which promotes sparse solutions. This corresponds to a Laplace prior.
The Laplace Prior:
$$p(w_j) = \frac{\lambda}{2}\exp(-\lambda|w_j|)$$
For all weights (assuming independence):
$$p(\mathbf{w}) = \prod_{j=1}^{D} \frac{\lambda}{2}\exp(-\lambda|w_j|) = \left(\frac{\lambda}{2}\right)^D \exp\left(-\lambda\sum_{j=1}^{D}|w_j|\right)$$
Negative log-prior:
$$-\log p(\mathbf{w}) = \lambda\sum_{j=1}^{D}|w_j| + \text{const} = \lambda\|\mathbf{w}\|_1 + \text{const}$$
MAP with Laplace Prior:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1 \right]$$
Rescaling:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + 2\lambda\sigma^2\|\mathbf{w}\|_1 \right]$$
LASSO Regression:
w_LASSO = argmin ||y - Xw||² + λ||w||₁
Bayesian Correspondence:
Laplace prior with parameter λ/(2σ²) ↔ L1 regularization with strength λ
Why Laplace → Sparsity:
The Laplace distribution has a sharp peak at zero—much sharper than the Gaussian. This peak creates strong "pull" toward exactly zero.
Mathematically, the L1 penalty has non-differentiable kinks at zero. The optimization can land exactly at a kink, setting weights to exact zeros. L2 has smooth minima—weights shrink but rarely reach exactly zero.
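A minimal sketch of this difference on synthetic data (the hyperparameters below are chosen only for illustration): most coefficients of the LASSO fit land exactly at zero, while the Ridge fit merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 of 20 features are relevant
y = X @ w_true + 0.5 * rng.normal(size=N)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)

print("Ridge exact zeros:", np.sum(ridge.coef_ == 0.0))  # typically 0: weights shrink but stay nonzero
print("Lasso exact zeros:", np.sum(lasso.coef_ == 0.0))  # typically many: irrelevant features set to 0
```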
Geometric Interpretation:
Picture the constrained form of each problem: minimize the squared error subject to the weights lying inside a regularization ball.
The L1 ball is a diamond (a rhombus in 2D) with corners on the axes. When the likelihood's elliptical contours first touch a corner, the corresponding weights are exactly zero. The L2 ball is a sphere with no corners, so the point of tangency typically has all coordinates nonzero; weights shrink but rarely reach exactly zero.
Practical Difference:
| Property | Ridge (L2/Gaussian) | LASSO (L1/Laplace) |
|---|---|---|
| Sparsity | No exact zeros | Many exact zeros |
| Feature selection | No | Yes (implicit) |
| Correlated features | Shares weight | Picks one arbitrarily |
| Unique solution | Always (if λ > 0) | Not always |
| Closed form | Yes | No (requires iterative optimization) |
Unlike Ridge, LASSO doesn't have a closed-form posterior. The Laplace prior is not conjugate to the Gaussian likelihood. Full Bayesian LASSO requires MCMC or approximations. The MAP (LASSO solution) is available via optimization, but the full posterior distribution is not analytically tractable.
The Bayesian-regularization correspondence extends to many methods.
Elastic Net:
Combines L1 and L2: $$\mathbf{w}_{\text{EN}} = \arg\min_{\mathbf{w}} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2 \right]$$
Corresponds to a prior that is a product of Gaussian and Laplace: $$p(\mathbf{w}) \propto \exp\left( -\frac{\alpha}{2}\|\mathbf{w}\|_2^2 - \beta\|\mathbf{w}\|_1 \right)$$
This combines the best of both: sparsity from L1 plus stability from L2.
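A minimal sketch of this behavior on synthetic data with two nearly duplicated features (the data and hyperparameters are illustrative, and the exact coefficients depend on the solver):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
N = 200
z = rng.normal(size=N)
# Two nearly identical (highly correlated) features, plus three noise features
X = np.column_stack([z + 0.01 * rng.normal(size=N),
                     z + 0.01 * rng.normal(size=N),
                     rng.normal(size=(N, 3))])
y = 2.0 * z + 0.5 * rng.normal(size=N)

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_, 2))
print("Elastic Net coefficients:", np.round(enet.coef_, 2))
# Lasso tends to concentrate the weight on one of the duplicated features;
# Elastic Net's L2 component tends to split it more evenly between them.
```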
Group LASSO:
Penalizes groups of features together: $$\text{Penalty} = \lambda\sum_{g=1}^{G} \sqrt{\sum_{j \in g} w_j^2}$$
Corresponds to a group-sparse prior that sets entire groups to zero.
Horseshoe Prior:
$$w_j | \tau_j \sim \mathcal{N}(0, \tau_j^2), \quad \tau_j \sim \text{Half-Cauchy}(0, \lambda)$$
Provides adaptive shrinkage—heavy shrinkage for noise features, light shrinkage for signal features. No simple frequentist analog.
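One way to see why the horseshoe shrinks adaptively is to sample from the priors themselves. The sketch below (unit scale parameters, chosen only for illustration) shows that horseshoe draws place more mass very close to zero than Gaussian or Laplace draws while also producing far heavier tails.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Gaussian and Laplace prior samples (unit scale)
w_gauss = rng.normal(0.0, 1.0, n)
w_laplace = rng.laplace(0.0, 1.0, n)
# Horseshoe prior samples: w | tau ~ N(0, tau^2), tau ~ Half-Cauchy(0, 1)
tau = np.abs(rng.standard_cauchy(n))
w_horseshoe = rng.normal(0.0, 1.0, n) * tau

for name, w in [("Gaussian", w_gauss), ("Laplace", w_laplace), ("Horseshoe", w_horseshoe)]:
    frac_tiny = np.mean(np.abs(w) < 0.01)   # mass very close to zero
    q999 = np.quantile(np.abs(w), 0.999)    # heaviness of the tails
    print(f"{name:10s}  P(|w| < 0.01) = {frac_tiny:.3f}   99.9% quantile = {q999:.1f}")
```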
| Prior Distribution | MAP Regularizer | Sparsity | Closed-Form MAP |
|---|---|---|---|
| Gaussian 𝒩(0, α⁻¹I) | L2 (Ridge): λ‖w‖₂² | No | Yes |
| Laplace (double exponential) | L1 (LASSO): λ‖w‖₁ | Yes | No |
| Gaussian + Laplace | Elastic Net: λ₁‖w‖₁ + λ₂‖w‖₂² | Yes | No |
| Uniform (improper) | None (OLS) | No | Yes |
| Student-t | Log penalty | Soft sparsity | No |
| Spike-and-Slab | Best subset selection | Hard sparsity | No |
| Horseshoe | Adaptive shrinkage | Yes (adaptive) | No |
| Gaussian (feature-specific) | Weighted L2 | No | Yes |
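To visualize the table's middle column, the sketch below plots the negative log-density of a few of these priors as one-dimensional penalty curves (unit scale parameters and ν = 3 for the Student-t are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(-3, 3, 601)

# Negative log-densities (up to additive constants) for several priors
gaussian_pen = 0.5 * w ** 2                               # Gaussian -> L2 (quadratic)
laplace_pen = np.abs(w)                                   # Laplace -> L1 (absolute value)
nu = 3.0
student_t_pen = 0.5 * (nu + 1) * np.log1p(w ** 2 / nu)    # Student-t -> log penalty

plt.figure(figsize=(8, 5))
plt.plot(w, gaussian_pen, label="Gaussian prior → L2 penalty")
plt.plot(w, laplace_pen, label="Laplace prior → L1 penalty")
plt.plot(w, student_t_pen, label="Student-t prior → log penalty")
plt.xlabel("w")
plt.ylabel("-log p(w) + const")
plt.title("Priors as regularizers")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```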
Instead of asking 'Should I use Ridge or LASSO?', ask: 'Do I believe most features are irrelevant (→ Laplace/LASSO) or that all features contribute small amounts (→ Gaussian/Ridge)?' The probabilistic interpretation guides the choice.
The Bayesian perspective gives principled interpretation to regularization hyperparameters.
Ridge λ = ασ²:
λ is the product of two quantities: the prior precision α (how strongly we believe the weights are near zero) and the noise variance σ² (how unreliable each observation is).
This explains why optimal λ often scales with noise. In high-noise settings, we should trust the prior more (larger λ). In low-noise settings, data is reliable (smaller λ).
Signal-to-Noise Ratio Interpretation:
Define SNR = σ²_signal / σ²_noise. With standardized features, the appropriate λ shrinks as the SNR grows: a strong signal calls for light regularization, a weak or noisy signal for heavy regularization.
The Bayesian view: λ controls how much we trust data vs. prior.
Equivalent Sample Size:
The prior can be interpreted as "pseudo-observations." With noise precision β = 1/σ², a prior 𝒩(0, α⁻¹I) contributes roughly α/β = ασ² = λ "equivalent observations" worth of information pulling the weights toward zero.
For example, with standardized features and λ = 10, the prior supplies about as much information as 10 additional observations, all pulling the weights toward zero.
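One way to make the pseudo-observation view concrete: Ridge regression is exactly ordinary least squares on a dataset augmented with √λ-scaled pseudo-rows whose targets are zero. A minimal sketch (synthetic data, λ = 10 chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 5
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -0.5, 0.0, 0.3, -0.2]) + 0.5 * rng.normal(size=N)

lam = 10.0

# Ridge solution: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Same solution via ordinary least squares on augmented data:
# append D pseudo-rows sqrt(lam) * I with targets 0
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_ols_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_ols_aug))  # True: Ridge = OLS with pseudo-observations
```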
Setting Hyperparameters:
Cross-Validation: Choose λ to minimize held-out error. Works but doesn't use probabilistic interpretation.
Empirical Bayes (Marginal Likelihood): Choose α to maximize p(y|X, α). This automatically balances fit and complexity (see the sketch after this list).
Prior Predictive Matching: Choose α so that prior samples give plausible predictions. If prior predictions are reasonable (e.g., "prices between $10K and $1M"), the prior is sensible.
Domain Knowledge: If you know typical weight magnitudes from previous studies, set α accordingly.
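The Empirical Bayes option can be made concrete for the Gaussian case, where the log marginal likelihood (evidence) has a closed form. The following is a minimal sketch on synthetic data, assuming the noise precision β is known; the function name `log_evidence` and the grid search are illustrative choices, not a prescribed recipe.

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """Log marginal likelihood log p(y | X, alpha, beta) for Bayesian linear regression."""
    N, D = X.shape
    A = alpha * np.eye(D) + beta * X.T @ X          # posterior precision
    m_N = beta * np.linalg.solve(A, X.T @ y)        # posterior mean
    E_mN = 0.5 * beta * np.sum((y - X @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(0)
N, D = 60, 5
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -0.5, 0.0, 0.3, -0.2]) + 0.5 * rng.normal(size=N)

beta = 1.0 / 0.5 ** 2                  # noise precision, assumed known here
alphas = np.logspace(-3, 3, 200)
evidences = [log_evidence(X, y, a, beta) for a in alphas]
alpha_best = alphas[int(np.argmax(evidences))]
print(f"Evidence-maximizing alpha: {alpha_best:.2f}  (implied lambda = {alpha_best / beta:.3f})")
```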
The α-λ Relationship:
Many practitioners tune λ by cross-validation without knowing σ². If you estimate σ² from residuals: $$\hat{\sigma}^2 = \frac{1}{N-D}\|\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}\|^2$$
Then α = λ/σ̂² gives the implied prior precision.
Example: if cross-validation selects λ = 2.5 and the residual-based estimate is σ̂² = 0.5, the implied prior precision is α = λ/σ̂² = 5, i.e., a prior standard deviation of about 0.45 per weight.
The "right" λ depends on your beliefs about the problem. There's no universally correct value. Cross-validation finds λ that works best on this data, but the Bayesian interpretation helps you understand whether that λ makes sense for your domain.
While MAP provides the regularization connection, full Bayesian inference offers more.
What MAP Gives: a single point estimate that incorporates the prior (equivalently, the regularized solution), obtained by fast optimization, and nothing more.
What Full Bayesian Gives (Beyond MAP):
Posterior Distribution: Not just the best guess, but the entire distribution of plausible weights.
Uncertainty Quantification: Predictive intervals, not just point predictions.
Principled Model Comparison: Marginal likelihood for comparing models of different complexity.
Robustness to Hyperparameters: Integrate over hyperparameters rather than picking a single value.
Posterior Mean: Sometimes better than MAP—posterior mean minimizes squared error loss, while MAP minimizes 0-1 loss.
| Aspect | MAP / Regularized Regression | Full Bayesian |
|---|---|---|
| Output | Point estimate w_MAP | Distribution p(w \| y) |
| Prediction | Point: ŷ = w_MAPᵀx | Distribution p(y* \| x*, data) |
| Uncertainty | None (just one value) | Full predictive uncertainty |
| Hyperparameters | Single λ (CV/manual) | Can integrate out or use evidence |
| Model comparison | Information criteria (AIC, BIC) | Marginal likelihood / Bayes factors |
| Computation | Optimization (fast) | Often requires MCMC or VI |
| Concept | Regularization penalty | Prior probability distribution |
Use full Bayesian inference when: (1) uncertainty quantification is important, (2) you want to compare models using marginal likelihood, (3) data is limited and you want to integrate over hyperparameter uncertainty, or (4) downstream decisions depend on understanding the range of plausible parameters.
The practical choice between MAP and full Bayesian often comes down to computation.
Gaussian Prior (Ridge): both the MAP estimate and the full posterior are available in closed form, at essentially the same cost.
Laplace Prior (LASSO): the MAP estimate is a convex optimization problem with fast, reliable solvers, but the full posterior has no closed form and requires MCMC or approximations.
Horseshoe/Spike-and-Slab: neither the MAP estimate nor the posterior has a closed form; spike-and-slab MAP amounts to combinatorial subset selection, and full inference typically requires MCMC or variational methods.
```python
import numpy as np
import time
from sklearn.linear_model import Ridge, Lasso
from scipy.linalg import cho_factor, cho_solve

def ridge_closed_form(X, y, lambda_reg):
    """Ridge via closed-form solution."""
    D = X.shape[1]
    A = X.T @ X + lambda_reg * np.eye(D)
    b = X.T @ y
    return np.linalg.solve(A, b)

def bayesian_full(X, y, alpha, beta):
    """Full Bayesian: posterior mean and covariance."""
    D = X.shape[1]
    S_N_inv = alpha * np.eye(D) + beta * X.T @ X
    L, lower = cho_factor(S_N_inv, lower=True)
    S_N = cho_solve((L, lower), np.eye(D))
    m_N = beta * S_N @ X.T @ y
    return m_N, S_N

# Benchmark
sizes = [(100, 10), (1000, 50), (10000, 100)]

print("Computational Comparison\n" + "="*50)

for N, D in sizes:
    print(f"\nData size: N={N}, D={D}")
    X = np.random.randn(N, D)
    y = np.random.randn(N)

    # Ridge MAP (closed-form)
    start = time.time()
    for _ in range(10):
        w_ridge = ridge_closed_form(X, y, 1.0)
    t_ridge = (time.time() - start) / 10

    # Full Bayesian (also closed-form for Gaussian)
    start = time.time()
    for _ in range(10):
        m_N, S_N = bayesian_full(X, y, alpha=1.0, beta=1.0)
    t_bayesian = (time.time() - start) / 10

    # LASSO (requires optimization)
    start = time.time()
    for _ in range(10):
        model = Lasso(alpha=0.01, max_iter=1000, tol=1e-4)
        model.fit(X, y)
    t_lasso = (time.time() - start) / 10

    print(f"  Ridge (closed-form):  {t_ridge*1000:.2f} ms")
    print(f"  Full Bayesian:        {t_bayesian*1000:.2f} ms")
    print(f"  LASSO (optimization): {t_lasso*1000:.2f} ms")

    # Check that Ridge and Bayesian mean match
    diff = np.linalg.norm(w_ridge - m_N)
    print(f"  Ridge-Bayesian diff:  {diff:.2e} (should be ~0)")
```

For Gaussian priors (Ridge), full Bayesian inference costs the same as MAP. You get uncertainty quantification for free! This is why there's essentially no reason to use Ridge without also computing the posterior covariance for Gaussian linear regression.
Let's step back and appreciate the unified view that emerges.
Classical Machine Learning Says:
"Regularization adds a penalty term to prevent overfitting. The penalty strength λ is a hyperparameter tuned by cross-validation."
Bayesian Statistics Says:
"We specify prior beliefs about parameters. The prior combines with the likelihood to form the posterior. The prior strength (precision α) encodes how confident we are in our prior beliefs."
The Unified View:
These are the same thing!
Why This Matters:
Interpretability: Regularization isn't ad hoc—it's encoding beliefs. You can ask: "What beliefs does this λ imply?"
Guidance: Want sparsity? Use LASSO/Laplace. Want shrinkage? Use Ridge/Gaussian. Want adaptive? Use horseshoe.
Extension: Full Bayesian inference extends regularization to provide uncertainty, model comparison, and principled hyperparameter selection.
Conceptual Economy: One framework (Bayesian inference) encompasses OLS, Ridge, LASSO, elastic net, and more.
When using regularization, think probabilistically. Your regularization choice implies a prior. Does that prior match your domain knowledge? If Ridge seems too weak, you might believe weights are sparse—try LASSO. If LASSO is unstable, you might want Ridge's stability. The Bayesian lens helps you reason about these tradeoffs.
We've revealed the deep connection between Bayesian inference and regularization—two approaches that appear different but are fundamentally the same.
Module Complete:
Over these five pages, we've built a complete understanding of Bayesian Linear Regression: priors on weights, posterior derivation, predictive distributions with uncertainty, practical implementation and calibration, and the connection between Bayesian inference and regularization.
You now possess a comprehensive understanding of Bayesian linear regression—from philosophical foundations through practical implementation to deep connections with classical machine learning.
Congratulations! You've mastered Bayesian Linear Regression. You can now place priors on weights, derive posteriors, compute predictive distributions with uncertainty, validate calibration, and understand the profound connection between Bayesian inference and regularization. This foundation extends to Gaussian Processes, Bayesian neural networks, and beyond.