Throughout this module, we've developed Bayesian linear regression from first principles—priors on weights, posterior derivation, predictive distributions with uncertainty. Meanwhile, in classical machine learning, practitioners routinely use regularization to prevent overfitting: Ridge regression, LASSO, elastic net.
These seem like different worlds. One speaks of probability distributions and Bayesian updating. The other speaks of penalty terms and constrained optimization. Yet they lead to the same solutions.
This is not a coincidence. There is a deep, mathematically precise connection between Bayesian inference and regularization. Every regularized method corresponds to a specific prior. Every prior implies a particular form of regularization. Understanding this connection provides both practical guidance for choosing a regularizer and a probabilistic interpretation of its strength.
By the end of this page, you will understand how every regularized estimator corresponds to a specific Bayesian prior, derive the exact correspondence between Ridge/LASSO and Gaussian/Laplace priors, see how regularization strength relates to prior precision, and gain intuition for choosing regularization methods based on their probabilistic interpretation.
The connection between Bayesian inference and regularization runs through the Maximum A Posteriori (MAP) estimator.
MAP Definition:
The MAP estimate is the mode (peak) of the posterior distribution:
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} p(\mathbf{w} | \mathbf{y}, \mathbf{X})$$
Applying Bayes' theorem:
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \frac{p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w})}{p(\mathbf{y} | \mathbf{X})}$$
The denominator doesn't depend on w, so:
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ p(\mathbf{y} | \mathbf{X}, \mathbf{w}) \cdot p(\mathbf{w}) \right]$$
Taking the logarithm (monotonic transformation):
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) \right]$$
Or equivalently, as a minimization:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ -\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) - \log p(\mathbf{w}) \right]$$
The MAP objective decomposes into two terms:
• Negative log-likelihood: -log p(y|X,w) → This is the data fit term (like squared error)
• Negative log-prior: -log p(w) → This is the regularization term (like ‖w‖²)
Regularization IS the negative log-prior. Different priors = different regularizers.
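As a quick numerical illustration of this decomposition (the synthetic data and hyperparameter values below are assumptions made only for the sketch), minimizing the negative log-likelihood plus the negative log-prior directly recovers the penalized solution derived in the next section:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, D = 40, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

sigma_sq = 0.3 ** 2   # noise variance (assumed known for this sketch)
alpha = 2.0           # prior precision (illustrative value)

def neg_log_likelihood(w):
    # -log N(y | Xw, sigma^2 I), dropping constants: the data-fit term
    return 0.5 / sigma_sq * np.sum((y - X @ w) ** 2)

def neg_log_prior(w):
    # -log N(w | 0, alpha^{-1} I), dropping constants: the regularization term
    return 0.5 * alpha * np.sum(w ** 2)

def neg_log_posterior(w):
    # MAP objective = data fit + regularization
    return neg_log_likelihood(w) + neg_log_prior(w)

w_map = minimize(neg_log_posterior, x0=np.zeros(D)).x

# Same answer from the closed-form Ridge solution with lambda = alpha * sigma_sq
lam = alpha * sigma_sq
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-4))  # should agree closely
```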
Let's derive the correspondence for the most common case: Gaussian prior with Ridge regularization.
The Gaussian Prior:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \alpha^{-1}\mathbf{I})$$
Negative log-prior:
$$-\log p(\mathbf{w}) = \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const} = \frac{\alpha}{2}\|\mathbf{w}\|_2^2 + \text{const}$$
The Gaussian Likelihood:
$$p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} | \mathbf{X}\mathbf{w}, \sigma^2\mathbf{I})$$
Negative log-likelihood:
$$-\log p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \text{const}$$
Combining (MAP Objective):
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \frac{\alpha}{2}\|\mathbf{w}\|_2^2 \right]$$
Multiplying by 2σ²:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\sigma^2\|\mathbf{w}\|_2^2 \right]$$
Ridge Regression:
w_Ridge = argmin ||y - Xw||² + λ||w||²
Bayesian Correspondence:
λ = α σ² = (prior precision) × (noise variance)
Gaussian prior with precision α ↔ L2 regularization with strength λ = ασ²
Closed-Form Solution:
Both formulations yield the same closed-form:
$$\mathbf{w}_{\text{Ridge}} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$$
With λ = ασ², this is exactly the posterior mean mₙ from Bayesian linear regression!
What This Means:
Ridge is Bayesian: Every Ridge regression is secretly computing the MAP under a Gaussian prior.
λ has meaning: The regularization strength isn't arbitrary—it's the ratio of prior precision to noise precision.
λ = 1 interpretation: With standardized features, λ = 1 means α = 1/σ², so the prior contributes about as much precision as a single data point.
Large λ means strong prior: Heavy regularization = confident prior belief that weights should be small.
```python
import numpy as np
from sklearn.linear_model import Ridge
import matplotlib.pyplot as plt

def bayesian_map(X, y, alpha, sigma_sq):
    """Compute MAP estimate (posterior mean for Gaussian prior)."""
    D = X.shape[1]
    lambda_reg = alpha * sigma_sq
    w_map = np.linalg.solve(X.T @ X + lambda_reg * np.eye(D), X.T @ y)
    return w_map

def sklearn_ridge(X, y, lambda_reg):
    """Compute Ridge regression solution."""
    model = Ridge(alpha=lambda_reg, fit_intercept=False)
    model.fit(X, y)
    return model.coef_

# Generate synthetic data
np.random.seed(42)
N, D = 50, 5
X = np.random.randn(N, D)
w_true = np.array([1.0, -0.5, 0.0, 0.3, -0.2])
sigma_true = 0.5
y = X @ w_true + sigma_true * np.random.randn(N)

# Bayesian parameters
alpha = 2.0  # Prior precision
sigma_sq = sigma_true ** 2

# Compute both ways
w_bayesian = bayesian_map(X, y, alpha, sigma_sq)
w_ridge = sklearn_ridge(X, y, lambda_reg=alpha * sigma_sq)

print("Bayesian MAP estimate:", w_bayesian)
print("sklearn Ridge estimate:", w_ridge)
print("Difference (should be ~0):", np.linalg.norm(w_bayesian - w_ridge))

# Visualize equivalence across different lambda values
lambdas = np.logspace(-3, 3, 50)
bayesian_norms = []
ridge_norms = []

for lam in lambdas:
    # For Bayesian: lambda = alpha * sigma_sq, so alpha = lambda / sigma_sq
    alpha_equiv = lam / sigma_sq
    w_b = bayesian_map(X, y, alpha_equiv, sigma_sq)
    w_r = sklearn_ridge(X, y, lam)
    bayesian_norms.append(np.linalg.norm(w_b))
    ridge_norms.append(np.linalg.norm(w_r))

plt.figure(figsize=(10, 6))
plt.semilogx(lambdas, bayesian_norms, 'b-', linewidth=2, label='Bayesian MAP')
plt.semilogx(lambdas, ridge_norms, 'r--', linewidth=2, label='sklearn Ridge')
plt.xlabel('λ (Regularization Strength)')
plt.ylabel('||w||₂')
plt.title('Ridge Regression = Bayesian MAP with Gaussian Prior')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

The LASSO (Least Absolute Shrinkage and Selection Operator) uses L1 regularization, which promotes sparse solutions. This corresponds to a Laplace prior.
The Laplace Prior:
$$p(w_j) = \frac{\lambda}{2}\exp(-\lambda|w_j|)$$
For all weights (assuming independence):
$$p(\mathbf{w}) = \prod_{j=1}^{D} \frac{\lambda}{2}\exp(-\lambda|w_j|) = \left(\frac{\lambda}{2}\right)^D \exp\left(-\lambda\sum_{j=1}^{D}|w_j|\right)$$
Negative log-prior:
$$-\log p(\mathbf{w}) = \lambda\sum_{j=1}^{D}|w_j| + \text{const} = \lambda\|\mathbf{w}\|_1 + \text{const}$$
MAP with Laplace Prior:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1 \right]$$
Rescaling:
$$\mathbf{w}_{\text{MAP}} = \arg\min_{\mathbf{w}} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + 2\lambda\sigma^2\|\mathbf{w}\|_1 \right]$$
LASSO Regression:
w_LASSO = argmin ||y - Xw||² + λ||w||₁
Bayesian Correspondence:
Laplace prior with parameter λ/(2σ²) ↔ L1 regularization with strength λ
Why Laplace → Sparsity:
The Laplace distribution has a sharp peak at zero—much sharper than the Gaussian. This peak creates strong "pull" toward exactly zero.
Mathematically, the L1 penalty has non-differentiable kinks at zero. The optimization can land exactly at a kink, setting weights to exact zeros. L2 has smooth minima—weights shrink but rarely reach exactly zero.
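A minimal sketch of this difference on synthetic data (the hyperparameters below are chosen only for illustration): most coefficients of the LASSO fit land exactly at zero, while the Ridge fit merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 of 20 features are relevant
y = X @ w_true + 0.5 * rng.normal(size=N)

ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)

print("Ridge exact zeros:", np.sum(ridge.coef_ == 0.0))  # typically 0: weights shrink but stay nonzero
print("Lasso exact zeros:", np.sum(lasso.coef_ == 0.0))  # typically many: irrelevant features set to 0
```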
Geometric Interpretation:
Picture the constrained form of each problem: minimize the squared error subject to the weights lying inside a regularization ball.
The L1 ball is a diamond (a rhombus in 2D) with corners on the axes. When the likelihood's elliptical contours first touch a corner, the corresponding weights are exactly zero. The L2 ball is a sphere with no corners, so the point of tangency typically has all coordinates nonzero; weights shrink but rarely reach exactly zero.
Practical Difference:
| Property | Ridge (L2/Gaussian) | LASSO (L1/Laplace) |
|---|---|---|
| Sparsity | No exact zeros | Many exact zeros |
| Feature selection | No | Yes (implicit) |
| Correlated features | Shares weight | Picks one arbitrarily |
| Unique solution | Always (if λ > 0) | Not always |
| Closed form | Yes | No (requires iterative optimization) |
Unlike Ridge, LASSO doesn't have a closed-form posterior. The Laplace prior is not conjugate to the Gaussian likelihood. Full Bayesian LASSO requires MCMC or approximations. The MAP (LASSO solution) is available via optimization, but the full posterior distribution is not analytically tractable.
The Bayesian-regularization correspondence extends to many methods.
Elastic Net:
Combines L1 and L2: $$\mathbf{w}_{\text{EN}} = \arg\min_{\mathbf{w}} \left[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda_1\|\mathbf{w}\|_1 + \lambda_2\|\mathbf{w}\|_2^2 \right]$$
Corresponds to a prior that is a product of Gaussian and Laplace: $$p(\mathbf{w}) \propto \exp\left( -\frac{\alpha}{2}\|\mathbf{w}\|_2^2 - \beta\|\mathbf{w}\|_1 \right)$$
This combines the best of both: sparsity from L1 plus stability from L2.
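A minimal sketch of this behavior on synthetic data with two nearly duplicated features (the data and hyperparameters are illustrative, and the exact coefficients depend on the solver):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
N = 200
z = rng.normal(size=N)
# Two nearly identical (highly correlated) features, plus three noise features
X = np.column_stack([z + 0.01 * rng.normal(size=N),
                     z + 0.01 * rng.normal(size=N),
                     rng.normal(size=(N, 3))])
y = 2.0 * z + 0.5 * rng.normal(size=N)

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_, 2))
print("Elastic Net coefficients:", np.round(enet.coef_, 2))
# Lasso tends to concentrate the weight on one of the duplicated features;
# Elastic Net's L2 component tends to split it more evenly between them.
```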
Group LASSO:
Penalizes groups of features together: $$\text{Penalty} = \lambda\sum_{g=1}^{G} \sqrt{\sum_{j \in g} w_j^2}$$
Corresponds to a group-sparse prior that sets entire groups to zero.
Horseshoe Prior:
$$w_j | \tau_j \sim \mathcal{N}(0, \tau_j^2), \quad \tau_j \sim \text{Half-Cauchy}(0, \lambda)$$
Provides adaptive shrinkage—heavy shrinkage for noise features, light shrinkage for signal features. No simple frequentist analog.
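One way to see why the horseshoe shrinks adaptively is to sample from the priors themselves. The sketch below (unit scale parameters, chosen only for illustration) shows that horseshoe draws place more mass very close to zero than Gaussian or Laplace draws while also producing far heavier tails.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Gaussian and Laplace prior samples (unit scale)
w_gauss = rng.normal(0.0, 1.0, n)
w_laplace = rng.laplace(0.0, 1.0, n)
# Horseshoe prior samples: w | tau ~ N(0, tau^2), tau ~ Half-Cauchy(0, 1)
tau = np.abs(rng.standard_cauchy(n))
w_horseshoe = rng.normal(0.0, 1.0, n) * tau

for name, w in [("Gaussian", w_gauss), ("Laplace", w_laplace), ("Horseshoe", w_horseshoe)]:
    frac_tiny = np.mean(np.abs(w) < 0.01)   # mass very close to zero
    q999 = np.quantile(np.abs(w), 0.999)    # heaviness of the tails
    print(f"{name:10s}  P(|w| < 0.01) = {frac_tiny:.3f}   99.9% quantile = {q999:.1f}")
```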
| Prior Distribution | MAP Regularizer | Sparsity | Closed-Form MAP |
|---|---|---|---|
| Gaussian 𝒩(0, α⁻¹I) | L2 (Ridge): λ‖w‖₂² | No | Yes |
| Laplace (double exponential) | L1 (LASSO): λ‖w‖₁ | Yes | No |
| Gaussian + Laplace | Elastic Net: λ₁‖w‖₁ + λ₂‖w‖₂² | Yes | No |
| Uniform (improper) | None (OLS) | No | Yes |
| Student-t | Log penalty | Soft sparsity | No |
| Spike-and-Slab | Best subset selection | Hard sparsity | No |
| Horseshoe | Adaptive shrinkage | Yes (adaptive) | No |
| Gaussian (feature-specific) | Weighted L2 | No | Yes |
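To visualize the table's middle column, the sketch below plots the negative log-density of a few of these priors as one-dimensional penalty curves (unit scale parameters and ν = 3 for the Student-t are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(-3, 3, 601)

# Negative log-densities (up to additive constants) for several priors
gaussian_pen = 0.5 * w ** 2                               # Gaussian -> L2 (quadratic)
laplace_pen = np.abs(w)                                   # Laplace -> L1 (absolute value)
nu = 3.0
student_t_pen = 0.5 * (nu + 1) * np.log1p(w ** 2 / nu)    # Student-t -> log penalty

plt.figure(figsize=(8, 5))
plt.plot(w, gaussian_pen, label="Gaussian prior → L2 penalty")
plt.plot(w, laplace_pen, label="Laplace prior → L1 penalty")
plt.plot(w, student_t_pen, label="Student-t prior → log penalty")
plt.xlabel("w")
plt.ylabel("-log p(w) + const")
plt.title("Priors as regularizers")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```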
Instead of asking 'Should I use Ridge or LASSO?', ask: 'Do I believe most features are irrelevant (→ Laplace/LASSO) or that all features contribute small amounts (→ Gaussian/Ridge)?' The probabilistic interpretation guides the choice.
The Bayesian perspective gives principled interpretation to regularization hyperparameters.
Ridge λ = ασ²:
λ is the product of two quantities: the prior precision α (how strongly we believe the weights are near zero) and the noise variance σ² (how unreliable each observation is).
This explains why optimal λ often scales with noise. In high-noise settings, we should trust the prior more (larger λ). In low-noise settings, data is reliable (smaller λ).
Signal-to-Noise Ratio Interpretation:
Define SNR = σ²_signal / σ²_noise. With standardized features, the appropriate λ shrinks as the SNR grows: a strong signal calls for light regularization, a weak or noisy signal for heavy regularization.
The Bayesian view: λ controls how much we trust data vs. prior.
Equivalent Sample Size:
The prior can be interpreted as "pseudo-observations." With noise precision β = 1/σ², a prior 𝒩(0, α⁻¹I) contributes roughly α/β = ασ² = λ "equivalent observations" worth of information pulling the weights toward zero.
For example, with standardized features and λ = 10, the prior supplies about as much information as 10 additional observations, all pulling the weights toward zero.
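One way to make the pseudo-observation view concrete: Ridge regression is exactly ordinary least squares on a dataset augmented with √λ-scaled pseudo-rows whose targets are zero. A minimal sketch (synthetic data, λ = 10 chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 5
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -0.5, 0.0, 0.3, -0.2]) + 0.5 * rng.normal(size=N)

lam = 10.0

# Ridge solution: (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Same solution via ordinary least squares on augmented data:
# append D pseudo-rows sqrt(lam) * I with targets 0
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_ols_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_ols_aug))  # True: Ridge = OLS with pseudo-observations
```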
Setting Hyperparameters:
Cross-Validation: Choose λ to minimize held-out error. Works but doesn't use probabilistic interpretation.
Empirical Bayes (Marginal Likelihood): Choose α to maximize p(y|X, α). This automatically balances fit and complexity (see the sketch after this list).
Prior Predictive Matching: Choose α so that prior samples give plausible predictions. If prior predictions are reasonable (e.g., "prices between $10K and $1M"), the prior is sensible.
Domain Knowledge: If you know typical weight magnitudes from previous studies, set α accordingly.
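The Empirical Bayes option can be made concrete for the Gaussian case, where the log marginal likelihood (evidence) has a closed form. The following is a minimal sketch on synthetic data, assuming the noise precision β is known; the function name `log_evidence` and the grid search are illustrative choices, not a prescribed recipe.

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """Log marginal likelihood log p(y | X, alpha, beta) for Bayesian linear regression."""
    N, D = X.shape
    A = alpha * np.eye(D) + beta * X.T @ X          # posterior precision
    m_N = beta * np.linalg.solve(A, X.T @ y)        # posterior mean
    E_mN = 0.5 * beta * np.sum((y - X @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(0)
N, D = 60, 5
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -0.5, 0.0, 0.3, -0.2]) + 0.5 * rng.normal(size=N)

beta = 1.0 / 0.5 ** 2                  # noise precision, assumed known here
alphas = np.logspace(-3, 3, 200)
evidences = [log_evidence(X, y, a, beta) for a in alphas]
alpha_best = alphas[int(np.argmax(evidences))]
print(f"Evidence-maximizing alpha: {alpha_best:.2f}  (implied lambda = {alpha_best / beta:.3f})")
```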
The α-λ Relationship:
Many practitioners tune λ by cross-validation without knowing σ². If you estimate σ² from residuals: $$\hat{\sigma}^2 = \frac{1}{N-D}\|\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}\|^2$$
Then α = λ/σ̂² gives the implied prior precision.
Example: if cross-validation selects λ = 2.5 and the residual-based estimate is σ̂² = 0.5, the implied prior precision is α = λ/σ̂² = 5, i.e., a prior standard deviation of about 0.45 per weight.
The "right" λ depends on your beliefs about the problem. There's no universally correct value. Cross-validation finds λ that works best on this data, but the Bayesian interpretation helps you understand whether that λ makes sense for your domain.
While MAP provides the regularization connection, full Bayesian inference offers more.
What MAP Gives: a single point estimate that incorporates the prior (equivalently, the regularized solution), obtained by fast optimization, and nothing more.
What Full Bayesian Gives (Beyond MAP):
Posterior Distribution: Not just the best guess, but the entire distribution of plausible weights.
Uncertainty Quantification: Predictive intervals, not just point predictions.
Principled Model Comparison: Marginal likelihood for comparing models of different complexity.
Robustness to Hyperparameters: Integrate over hyperparameters rather than picking a single value.
Posterior Mean: Sometimes better than MAP—posterior mean minimizes squared error loss, while MAP minimizes 0-1 loss.
| Aspect | MAP / Regularized Regression | Full Bayesian |
|---|---|---|
| Output | Point estimate w_MAP | Distribution p(w \| y) |
| Prediction | Point: ŷ = w_MAPᵀx | Distribution p(y* \| x*, data) |
| Uncertainty | None (just one value) | Full predictive uncertainty |
| Hyperparameters | Single λ (CV/manual) | Can integrate out or use evidence |
| Model comparison | Information criteria (AIC, BIC) | Marginal likelihood / Bayes factors |
| Computation | Optimization (fast) | Often requires MCMC or VI |
| Concept | Regularization penalty | Prior probability distribution |
Use full Bayesian inference when: (1) uncertainty quantification is important, (2) you want to compare models using marginal likelihood, (3) data is limited and you want to integrate over hyperparameter uncertainty, or (4) downstream decisions depend on understanding the range of plausible parameters.
The practical choice between MAP and full Bayesian often comes down to computation.
Gaussian Prior (Ridge): both the MAP estimate and the full posterior are available in closed form, at essentially the same cost.
Laplace Prior (LASSO): the MAP estimate is a convex optimization problem with fast, reliable solvers, but the full posterior has no closed form and requires MCMC or approximations.
Horseshoe/Spike-and-Slab: neither the MAP estimate nor the posterior has a closed form; spike-and-slab MAP amounts to combinatorial subset selection, and full inference typically requires MCMC or variational methods.
```python
import numpy as np
import time
from sklearn.linear_model import Ridge, Lasso
from scipy.linalg import cho_factor, cho_solve

def ridge_closed_form(X, y, lambda_reg):
    """Ridge via closed-form solution."""
    D = X.shape[1]
    A = X.T @ X + lambda_reg * np.eye(D)
    b = X.T @ y
    return np.linalg.solve(A, b)

def bayesian_full(X, y, alpha, beta):
    """Full Bayesian: posterior mean and covariance."""
    D = X.shape[1]
    S_N_inv = alpha * np.eye(D) + beta * X.T @ X
    L, lower = cho_factor(S_N_inv, lower=True)
    S_N = cho_solve((L, lower), np.eye(D))
    m_N = beta * S_N @ X.T @ y
    return m_N, S_N

# Benchmark
sizes = [(100, 10), (1000, 50), (10000, 100)]

print("Computational Comparison\n" + "="*50)

for N, D in sizes:
    print(f"\nData size: N={N}, D={D}")
    X = np.random.randn(N, D)
    y = np.random.randn(N)

    # Ridge MAP (closed-form)
    start = time.time()
    for _ in range(10):
        w_ridge = ridge_closed_form(X, y, 1.0)
    t_ridge = (time.time() - start) / 10

    # Full Bayesian (also closed-form for Gaussian)
    start = time.time()
    for _ in range(10):
        m_N, S_N = bayesian_full(X, y, alpha=1.0, beta=1.0)
    t_bayesian = (time.time() - start) / 10

    # LASSO (requires optimization)
    start = time.time()
    for _ in range(10):
        model = Lasso(alpha=0.01, max_iter=1000, tol=1e-4)
        model.fit(X, y)
    t_lasso = (time.time() - start) / 10

    print(f"  Ridge (closed-form):  {t_ridge*1000:.2f} ms")
    print(f"  Full Bayesian:        {t_bayesian*1000:.2f} ms")
    print(f"  LASSO (optimization): {t_lasso*1000:.2f} ms")

    # Check that Ridge and Bayesian mean match
    diff = np.linalg.norm(w_ridge - m_N)
    print(f"  Ridge-Bayesian diff:  {diff:.2e} (should be ~0)")
```

For Gaussian priors (Ridge), full Bayesian inference costs the same as MAP. You get uncertainty quantification for free! This is why there's essentially no reason to use Ridge without also computing the posterior covariance for Gaussian linear regression.
Let's step back and appreciate the unified view that emerges.
Classical Machine Learning Says:
"Regularization adds a penalty term to prevent overfitting. The penalty strength λ is a hyperparameter tuned by cross-validation."
Bayesian Statistics Says:
"We specify prior beliefs about parameters. The prior combines with the likelihood to form the posterior. The prior strength (precision α) encodes how confident we are in our prior beliefs."
The Unified View:
These are the same thing!
Why This Matters:
Interpretability: Regularization isn't ad hoc—it's encoding beliefs. You can ask: "What beliefs does this λ imply?"
Guidance: Want sparsity? Use LASSO/Laplace. Want shrinkage? Use Ridge/Gaussian. Want adaptive? Use horseshoe.
Extension: Full Bayesian inference extends regularization to provide uncertainty, model comparison, and principled hyperparameter selection.
Conceptual Economy: One framework (Bayesian inference) encompasses OLS, Ridge, LASSO, elastic net, and more.
When using regularization, think probabilistically. Your regularization choice implies a prior. Does that prior match your domain knowledge? If Ridge seems too weak, you might believe weights are sparse—try LASSO. If LASSO is unstable, you might want Ridge's stability. The Bayesian lens helps you reason about these tradeoffs.
We've revealed the deep connection between Bayesian inference and regularization—two approaches that appear different but are fundamentally the same.
Module Complete:
Over these five pages, we've built a complete understanding of Bayesian Linear Regression: priors on weights, posterior derivation, predictive distributions with uncertainty, practical implementation and calibration, and the connection between Bayesian inference and regularization.
You now possess a comprehensive understanding of Bayesian linear regression—from philosophical foundations through practical implementation to deep connections with classical machine learning.
Congratulations! You've mastered Bayesian Linear Regression. You can now place priors on weights, derive posteriors, compute predictive distributions with uncertainty, validate calibration, and understand the profound connection between Bayesian inference and regularization. This foundation extends to Gaussian Processes, Bayesian neural networks, and beyond.