MAP estimation gives us the most probable parameter values given our data and prior beliefs. But probability is a richer language than single point estimates. What if the posterior is flat and we're highly uncertain? What if it's bimodal with two plausible explanations? What if we need to propagate parameter uncertainty into predictions?
Full Bayesian posterior inference addresses these questions. Instead of finding just the mode, we characterize the entire posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$. This distribution encapsulates everything we know about parameters after observing data—not just the best guess, but our complete uncertainty landscape.
This page develops the theory and practice of posterior inference, connecting it back to regularization and showing when full Bayesian methods matter most.
By completing this page, you will: (1) Understand what the posterior distribution represents and how to interpret it; (2) Learn methods for computing posteriors (exact, MCMC, variational); (3) See how posterior inference enables uncertainty quantification; (4) Understand Bayesian prediction and decision-making; (5) Connect posterior inference to regularization through the Bayesian linear regression example.
Definition and Meaning:
The posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$ is the probability distribution over parameters after observing data $\mathcal{D}$. It combines the likelihood $p(\mathcal{D} \mid \boldsymbol{\theta})$, what the data say, with the prior $p(\boldsymbol{\theta})$, what we believed beforehand.
By Bayes' theorem: $$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})}{p(\mathcal{D})}$$
Interpreting the Posterior:
The posterior is not just a computational object—it has direct probabilistic meaning. Statements such as $P(\theta > 0 \mid \mathcal{D}) = 0.97$ or $P(\theta \in [0.2, 0.5] \mid \mathcal{D}) = 0.90$ are genuine probability statements about parameters—something frequentist methods cannot provide directly.
The posterior distribution is the complete Bayesian answer to the inference problem. It contains all information about parameters that the data and prior can provide. Any summary (mean, mode, intervals) is just a projection of this full distribution.
| Summary | Definition | Use Case |
|---|---|---|
| Posterior Mean | $\mathbb{E}[\boldsymbol{\theta} \mid \mathcal{D}]$ | Optimal under squared loss; often used for prediction |
| Posterior Mode (MAP) | $\arg\max p(\boldsymbol{\theta} \mid \mathcal{D})$ | Point estimate; may be sparse |
| Posterior Median | 50th percentile | Robust to skewness; optimal under absolute loss |
| Credible Interval | $[\theta_L, \theta_U]$ with $P(\theta \in [\theta_L, \theta_U] \mid \mathcal{D}) = 1-\alpha$ | Uncertainty quantification |
| Highest Density Region | Smallest region containing $(1-\alpha)$ probability | Most informative uncertainty region |
For certain combinations of likelihood and prior, the posterior has a known closed-form. These conjugate pairs are computationally precious.
Bayesian Linear Regression (Gaussian Prior):
The most important example for regularization:
Likelihood: $\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$
Prior: $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$
Posterior (derived in Page 2): $$\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
where: $$\boldsymbol{\mu}_n = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$ $$\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
with $\lambda = \sigma^2/\tau^2$.
For Ridge regression, we have the complete posterior analytically. The posterior mean equals the Ridge estimate. The posterior covariance gives us immediate uncertainty quantification—no MCMC needed. This is one of the few cases where full Bayesian inference is as easy as point estimation.
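To make this concrete, here is a minimal NumPy sketch of the closed-form posterior; the synthetic data and the values of $\sigma^2$ and $\tau^2$ are illustrative assumptions rather than anything from this page.

```python
import numpy as np

# Minimal sketch: closed-form posterior for Bayesian linear regression.
# Synthetic data; sigma2 and tau2 are illustrative hyperparameters.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, 0.0, -2.0])
sigma2, tau2 = 0.25, 1.0              # noise variance, prior variance
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

lam = sigma2 / tau2                   # lambda = sigma^2 / tau^2
A = X.T @ X + lam * np.eye(d)
mu_n = np.linalg.solve(A, X.T @ y)    # posterior mean = Ridge estimate
Sigma_n = sigma2 * np.linalg.inv(A)   # posterior covariance

# Marginal 95% credible interval for each coefficient
sd = np.sqrt(np.diag(Sigma_n))
for j in range(d):
    print(f"beta_{j}: {mu_n[j]:+.3f}  "
          f"95% CrI ({mu_n[j] - 1.96 * sd[j]:+.3f}, {mu_n[j] + 1.96 * sd[j]:+.3f})")
```

Note that `np.linalg.solve` is preferred over an explicit inverse for the mean; the explicit inverse is kept only because the covariance matrix itself is the object of interest here.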
Other Conjugate Pairs:
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Gaussian (known $\sigma^2$) | Gaussian | Gaussian |
| Binomial | Beta | Beta |
| Poisson | Gamma | Gamma |
| Multinomial | Dirichlet | Dirichlet |
| Exponential | Gamma | Gamma |
| Gaussian (unknown $\sigma^2$) | Normal-Inverse-Gamma | Normal-Inverse-Gamma |
Limitations of Exact Inference:
Conjugacy is the exception rather than the rule. A Laplace prior with a Gaussian likelihood (the Bayesian Lasso), logistic regression under any standard prior, and hierarchical models all lack closed-form posteriors. Beyond a short catalog of conjugate pairs, exact computation is out of reach.
When exact posterior computation is intractable, we can approximate it by drawing samples. MCMC constructs a Markov chain whose stationary distribution is the posterior.
The Key Idea:
If we can draw samples $\boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(2)}, \ldots, \boldsymbol{\theta}^{(S)}$ from $p(\boldsymbol{\theta} \mid \mathcal{D})$, we can approximate any posterior quantity:
$$\mathbb{E}[g(\boldsymbol{\theta}) \mid \mathcal{D}] \approx \frac{1}{S}\sum_{s=1}^S g(\boldsymbol{\theta}^{(s)})$$
MCMC generates correlated samples that, after enough iterations, behave like independent samples from the posterior.
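To see the recipe end to end, here is a minimal random-walk Metropolis sketch for a deliberately simple non-conjugate model: a Gaussian likelihood with a Laplace prior on its mean. The data, proposal step size, and burn-in length are illustrative assumptions.

```python
import numpy as np

# Minimal random-walk Metropolis sketch for a 1D non-conjugate model:
# Gaussian likelihood with a Laplace prior on the mean theta.
rng = np.random.default_rng(1)
data = rng.normal(loc=0.8, scale=1.0, size=20)

def log_post(theta, b=1.0):
    # Unnormalized log posterior: Gaussian log-likelihood + Laplace log-prior
    return -0.5 * np.sum((data - theta) ** 2) - abs(theta) / b

S, step, theta = 5000, 0.5, 0.0
samples = np.empty(S)
for s in range(S):
    prop = theta + step * rng.normal()
    # Accept with probability min(1, p(prop | D) / p(theta | D))
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples[s] = theta

kept = samples[1000:]                            # discard burn-in
print("posterior mean ~", kept.mean().round(3))
print("P(theta > 0 | D) ~", (kept > 0).mean())   # a genuine probability statement
```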
MCMC for Bayesian Lasso (Laplace Prior):
As noted in Page 3, the Laplace prior can be written as a scale-mixture of Gaussians:
$$\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2)$$ $$\tau_j^2 \sim \text{Exponential}(\lambda^2/2)$$
This enables Gibbs sampling, alternating three conditional updates: (1) sample $\boldsymbol{\beta} \mid \boldsymbol{\tau}^2, \sigma^2, \mathbf{y}$ from a multivariate Gaussian; (2) sample $1/\tau_j^2 \mid \beta_j$ from an Inverse-Gaussian distribution; (3) sample $\sigma^2 \mid \boldsymbol{\beta}, \mathbf{y}$ from an Inverse-Gamma distribution.
Each step has a known distribution, making sampling efficient despite the non-conjugacy of the original model.
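Below is a minimal sketch of this sampler, holding $\sigma^2$ fixed for brevity (so the third update is omitted); the data, the rate $\lambda$, and the chain length are illustrative assumptions.

```python
import numpy as np

# Minimal Gibbs sketch for the Bayesian Lasso via the scale-mixture
# representation. sigma^2 is held fixed for brevity.
rng = np.random.default_rng(2)
n, d = 60, 4
X = rng.normal(size=(n, d))
sigma2 = 1.0
y = X @ np.array([2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=n)

lam = 1.0                             # Laplace rate: tau_j^2 ~ Exp(lam^2 / 2)
S, burn = 3000, 500
tau2 = np.ones(d)
draws = np.empty((S, d))
XtX, Xty = X.T @ X, X.T @ y

for s in range(S):
    # (1) beta | tau^2, y  ~  multivariate Gaussian
    cov = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / tau2))
    beta = rng.multivariate_normal(cov @ Xty / sigma2, cov)
    # (2) 1/tau_j^2 | beta_j  ~  Inverse-Gaussian(lam / |beta_j|, lam^2)
    tau2 = 1.0 / rng.wald(lam / np.abs(beta), lam ** 2)
    draws[s] = beta

post = draws[burn:]
print("posterior means:", post.mean(axis=0).round(3))
print("P(|beta_j| < 0.1 | D):", (np.abs(post) < 0.1).mean(axis=0).round(3))
```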
MCMC requires careful diagnostics: (1) Convergence: Has the chain reached its stationary distribution? Use trace plots, R-hat statistic. (2) Mixing: Is the chain exploring efficiently? Check effective sample size. (3) Autocorrelation: Are samples too correlated? Thin if necessary. Poor diagnostics mean your posterior approximation is unreliable.
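As an illustration of the convergence check, here is a hand-rolled split-R-hat computation; in practice a library such as ArviZ supplies this diagnostic, and the synthetic "chains" below merely stand in for real MCMC output.

```python
import numpy as np

# Minimal sketch of the split-R-hat diagnostic for M chains of a
# scalar parameter, stored as an array of shape (M, T).
def split_rhat(chains):
    m, t = chains.shape
    # Split each chain in half to also detect within-chain trends
    halves = chains[:, : t // 2 * 2].reshape(2 * m, t // 2)
    n = halves.shape[1]
    w = halves.var(axis=1, ddof=1).mean()          # within-chain variance
    b = n * halves.mean(axis=1).var(ddof=1)        # between-chain variance
    var_hat = (n - 1) / n * w + b / n
    return np.sqrt(var_hat / w)                    # ~1.0 indicates convergence

rng = np.random.default_rng(3)
good = rng.normal(size=(4, 1000))                    # 4 well-mixed chains
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])  # 2 chains stuck elsewhere
print("R-hat (mixed):", round(split_rhat(good), 3))  # close to 1.00
print("R-hat (stuck):", round(split_rhat(bad), 3))   # far above 1.01
```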
Variational Inference (VI) approximates the posterior with a simpler distribution from a tractable family, making inference an optimization problem rather than a sampling problem.
The Core Idea:
Find the distribution $q(\boldsymbol{\theta})$ in family $\mathcal{Q}$ that is closest to the true posterior:
$$q^*(\boldsymbol{\theta}) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}))$$
The KL divergence measures how different $q$ is from the posterior.
The Evidence Lower Bound (ELBO):
Direct KL minimization is intractable (requires $p(\mathcal{D})$). Instead, maximize the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathcal{D}, \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\boldsymbol{\theta})]$$
$$= \mathbb{E}_q[\log p(\mathcal{D} \mid \boldsymbol{\theta})] - \text{KL}(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta}))$$
Because $\log p(\mathcal{D}) = \mathcal{L}(q) + \text{KL}(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}))$ and the evidence does not depend on $q$, maximizing the ELBO is equivalent to minimizing $\text{KL}(q \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}))$.
Mean-Field Approximation:
The most common assumption: parameters are independent in the variational family: $$q(\boldsymbol{\theta}) = \prod_j q_j(\theta_j)$$
This factorization makes optimization tractable but ignores posterior correlations.
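The sketch below runs reparameterization-gradient VI with $q(\theta) = \mathcal{N}(m, s^2)$ on a deliberately conjugate model so that the answer can be checked against the exact posterior; the learning rate, sample count, and data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: reparameterization-gradient VI with q(theta) = N(m, s^2)
# for a conjugate model (Gaussian likelihood and Gaussian prior).
rng = np.random.default_rng(4)
sigma2, tau2 = 1.0, 4.0
data = rng.normal(loc=1.2, scale=1.0, size=30)
n = len(data)

def grad_log_joint(theta):
    # d/dtheta [log p(D | theta) + log p(theta)], vectorized over theta
    return (data.sum() - n * theta) / sigma2 - theta / tau2

m, log_s, lr = 0.0, 0.0, 0.01
for step in range(4000):
    eps = rng.normal(size=64)
    theta = m + np.exp(log_s) * eps                   # reparameterized draws
    g = grad_log_joint(theta)
    m += lr * g.mean()                                # ELBO gradient w.r.t. m
    log_s += lr * ((g * np.exp(log_s) * eps).mean() + 1.0)  # + entropy term

v_exact = 1.0 / (n / sigma2 + 1.0 / tau2)             # exact posterior
m_exact = v_exact * data.sum() / sigma2
print(f"VI:    m = {m:.3f}, s^2 = {np.exp(2 * log_s):.4f}")
print(f"Exact: m = {m_exact:.3f}, v = {v_exact:.4f}")
```

Because the variational family contains the true posterior here, VI recovers it almost exactly; under mean-field factorization in higher dimensions it would not.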
| Aspect | MCMC | Variational Inference |
|---|---|---|
| Accuracy | Asymptotically exact | Approximation, may be biased |
| Speed | Slow (many iterations) | Fast (optimization-based) |
| Scalability | Challenging for big data | Better for large datasets |
| Posterior correlations | Captured naturally | Often ignored (mean-field) |
| Convergence guarantee | Ergodic theorem (eventually) | Local optima possible |
| Uncertainty quality | Proper coverage | Often underestimates uncertainty |
Mean-field VI minimizes $\text{KL}(q \,\|\, p)$, which penalizes $q$ for having mass where $p$ doesn't but not vice versa. This causes VI to underestimate posterior variance—credible intervals will be too narrow. For uncertainty-critical applications, prefer MCMC or use careful VI variants.
A key advantage of full Bayesian inference: we can propagate parameter uncertainty into predictions.
Definition:
For a new input $\mathbf{x}_*$, the posterior predictive distribution is:
$$p(y_* \mid \mathbf{x}_*, \mathcal{D}) = \int p(y_* \mid \mathbf{x}_*, \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta}$$
This integrates predictions over all plausible parameter values, weighted by their posterior probability.
Contrast with Point Estimate Prediction:
A point estimate plugs a single value into the likelihood, $p(y_* \mid \mathbf{x}_*, \hat{\boldsymbol{\theta}})$, ignoring parameter uncertainty entirely; predictive intervals built this way are systematically too narrow.
For Bayesian Linear Regression:
With posterior $\boldsymbol{\beta} \mid \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$ and $y_* = \mathbf{x}_*^T\boldsymbol{\beta} + \epsilon$:
$$y_* \mid \mathbf{x}_*, \mathcal{D} \sim \mathcal{N}(\mathbf{x}_*^T\boldsymbol{\mu}_n,\; \mathbf{x}_*^T\boldsymbol{\Sigma}_n\mathbf{x}_* + \sigma^2)$$
The predictive variance has two components: $\mathbf{x}_*^T\boldsymbol{\Sigma}_n\mathbf{x}_*$, the epistemic uncertainty coming from the parameters, and $\sigma^2$, the aleatoric uncertainty coming from observation noise.
With more data, epistemic uncertainty shrinks; aleatoric uncertainty remains constant.
Bayesian predictive intervals naturally include both parameter uncertainty and noise. A 95% predictive interval contains the next observation with 95% probability. This is wider than a confidence interval for the mean, which only captures parameter uncertainty—a crucial distinction for practical forecasting.
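A short sketch of this closed-form predictive computation, with the variance split into its two components; the data and hyperparameters are synthetic placeholders.

```python
import numpy as np

# Minimal sketch: posterior predictive for Bayesian linear regression,
# splitting the variance into epistemic and aleatoric parts.
rng = np.random.default_rng(5)
n, d = 40, 2
X = rng.normal(size=(n, d))
sigma2, tau2 = 0.5, 2.0
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

lam = sigma2 / tau2
A = X.T @ X + lam * np.eye(d)
mu_n = np.linalg.solve(A, X.T @ y)
Sigma_n = sigma2 * np.linalg.inv(A)

x_star = np.array([0.3, -1.2])               # new input
mean = x_star @ mu_n
epistemic = x_star @ Sigma_n @ x_star        # shrinks as n grows
aleatoric = sigma2                           # irreducible noise floor
sd = np.sqrt(epistemic + aleatoric)
print(f"predictive mean {mean:.3f}")
print(f"variance: epistemic {epistemic:.4f} + aleatoric {aleatoric:.4f}")
print(f"95% predictive interval ({mean - 1.96 * sd:.3f}, {mean + 1.96 * sd:.3f})")
```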
Computing Posterior Predictive with MCMC:
If we have posterior samples $\{\boldsymbol{\theta}^{(s)}\}_{s=1}^S$, approximate the predictive density as a mixture:
$$p(y_* \mid \mathbf{x}_*, \mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^S p(y_* \mid \mathbf{x}_*, \boldsymbol{\theta}^{(s)})$$
Or draw predictive samples directly: for each $s$, draw $y_*^{(s)} \sim p(y_* \mid \mathbf{x}_*, \boldsymbol{\theta}^{(s)})$. The resulting draws come from the posterior predictive, and their empirical quantiles give predictive intervals.
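A minimal sketch of the sampling route, where `samples` stands in for real MCMC output and a Gaussian observation model with known noise variance is an illustrative assumption:

```python
import numpy as np

# Minimal sketch: predictive draws from generic posterior samples.
# `samples` stands in for MCMC output of shape (S, d).
rng = np.random.default_rng(6)
S, d, sigma2 = 4000, 3, 0.25
samples = rng.multivariate_normal(np.array([1.0, 0.0, -0.5]),
                                  0.01 * np.eye(d), size=S)
x_star = np.array([0.5, 1.0, -1.0])

# One predictive draw per posterior draw: y* ~ N(x*' beta^(s), sigma^2)
y_star = samples @ x_star + rng.normal(scale=np.sqrt(sigma2), size=S)
lo, hi = np.percentile(y_star, [2.5, 97.5])
print(f"predictive mean {y_star.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```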
Full posterior inference enables principled model comparison through the marginal likelihood (evidence).
The Marginal Likelihood:
$$p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M}) \, d\boldsymbol{\theta}$$
This is the probability of the data under model $\mathcal{M}$, averaged over all parameter values weighted by the prior.
Bayes Factors:
To compare models $\mathcal{M}_1$ and $\mathcal{M}_2$:
$$\text{BF}_{12} = \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)}$$
Bayes factor > 1 favors $\mathcal{M}_1$; < 1 favors $\mathcal{M}_2$.
| Bayes Factor | Evidence Strength |
|---|---|
| 1 – 3 | Anecdotal |
| 3 – 10 | Moderate |
| 10 – 30 | Strong |
| 30 – 100 | Very strong |
| > 100 | Extreme/Decisive |
Automatic Occam's Razor:
The marginal likelihood automatically penalizes complexity. Complex models spread prior probability over a larger parameter space, reducing the average likelihood: a flexible model can explain many possible datasets, so it assigns relatively little probability to any particular one, including the dataset actually observed.
This is why Bayesian model comparison favors simpler models that fit well—complexity is penalized without explicit regularization terms.
Different regularization strengths define different models. Maximizing the marginal likelihood over $\lambda$ (Empirical Bayes) selects the regularization that best balances fit and complexity in a principled Bayesian sense—essentially choosing the prior hyperparameters that make the data most probable.
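Here is a sketch of this procedure for the Gaussian linear model, where the evidence is available in closed form because integrating out $\boldsymbol{\beta}$ gives $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I} + \tau^2\mathbf{X}\mathbf{X}^T)$; the data and the assumption of known $\sigma^2$ are illustrative. The same evidence function also yields a Bayes factor between two candidate priors.

```python
import numpy as np

# Minimal sketch: Empirical Bayes for Ridge via the closed-form
# marginal likelihood of the Gaussian linear model.
rng = np.random.default_rng(7)
n, d, sigma2 = 80, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

def log_evidence(tau2):
    # y ~ N(0, sigma^2 I + tau^2 X X^T) after integrating out beta
    K = sigma2 * np.eye(n) + tau2 * X @ X.T
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

tau2_grid = np.logspace(-3, 2, 60)
log_evs = np.array([log_evidence(t) for t in tau2_grid])
best = tau2_grid[log_evs.argmax()]
print(f"evidence-optimal tau^2 = {best:.3f} -> lambda = {sigma2 / best:.3f}")
# A log Bayes factor between two candidate priors is just a difference:
print("log BF(tau^2 = 1 vs tau^2 = 0.01):",
      round(log_evidence(1.0) - log_evidence(0.01), 2))
```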
The posterior distribution is not the end goal—it's input to decisions. Bayesian decision theory provides a coherent framework for acting under uncertainty.
The Framework:
Three ingredients: a set of available actions $\mathcal{A}$; a loss function $L(\theta, a)$ giving the cost of action $a$ when the true parameter is $\theta$; and the posterior $p(\theta \mid \mathcal{D})$ summarizing what we know about $\theta$.
Optimal Bayesian Decision:
Choose action minimizing expected posterior loss:
$$a^* = \arg\min_{a \in \mathcal{A}} \mathbb{E}_{\theta \mid \mathcal{D}}[L(\theta, a)]$$
$$= \arg\min_{a \in \mathcal{A}} \int L(\theta, a) \cdot p(\theta \mid \mathcal{D}) \, d\theta$$
| Loss Function | Optimal Estimate | Name |
|---|---|---|
| $(\theta - a)^2$ | Posterior mean $\mathbb{E}[\theta \mid \mathcal{D}]$ | Squared error / L2 |
| $\lvert\theta - a\rvert$ | Posterior median | Absolute error / L1 |
| $\mathbf{1}_{\theta \neq a}$ | Posterior mode (MAP) | 0-1 loss |
| Asymmetric | Depends on asymmetry | Custom loss |
Example: Variable Selection Decision:
Suppose we must decide whether feature $j$ is relevant (include in model) or not. Let $p_j = P(\text{feature } j \text{ relevant} \mid \mathcal{D})$, let $c_I$ be the cost of including an irrelevant feature, and let $c_E$ be the cost of excluding a relevant one.
Optimal decision: include feature $j$ when inclusion has lower expected posterior loss, i.e., when $c_I(1 - p_j) < c_E \, p_j$, which rearranges to $p_j > c_I / (c_I + c_E)$.
With equal costs: include if posterior probability of relevance > 0.5.
This is principled variable selection based on the posterior, not on arbitrary significance thresholds.
Point estimates like MAP or posterior mean are only optimal under specific loss functions. If your actual loss is different (e.g., asymmetric costs of over- vs. under-prediction), the optimal action may be neither the mean nor the mode. The posterior enables computing the optimal action for any loss function.
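To illustrate, the sketch below finds the optimal action under an asymmetric pinball loss by brute-force minimization of expected posterior loss, then confirms it matches the corresponding posterior quantile; the skewed "posterior" draws and the 4:1 cost ratio are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: optimal action under an asymmetric loss, computed
# from posterior samples. Under-prediction is 4x as costly here.
rng = np.random.default_rng(8)
theta = rng.lognormal(mean=0.0, sigma=0.7, size=20000)  # skewed draws

c_over, c_under = 1.0, 4.0   # cost per unit of over-/under-prediction

def expected_loss(a):
    return np.mean(c_over * np.maximum(a - theta, 0)
                   + c_under * np.maximum(theta - a, 0))

grid = np.linspace(theta.min(), theta.max(), 2000)
a_star = grid[np.argmin([expected_loss(a) for a in grid])]

# For pinball loss the optimum is the c_under / (c_over + c_under) quantile
print(f"grid-optimal action    {a_star:.3f}")
print(f"0.8 posterior quantile {np.quantile(theta, 0.8):.3f}")
print(f"posterior mean         {theta.mean():.3f}  (optimal only for squared loss)")
```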
Credible Intervals:
A $(1-\alpha)$ credible interval contains the parameter with posterior probability $(1-\alpha)$:
$$P(\theta \in [\theta_L, \theta_U] \mid \mathcal{D}) = 1 - \alpha$$
Types of Credible Intervals:
Equal-tailed: $P(\theta < \theta_L \mid \mathcal{D}) = P(\theta > \theta_U \mid \mathcal{D}) = \alpha/2$
Highest Posterior Density (HPD): Smallest interval containing $(1-\alpha)$ probability. Every point inside has higher density than any point outside.
For symmetric posteriors, these coincide. For skewed posteriors, HPD is more informative.
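The sketch below computes both interval types from draws of a right-skewed distribution standing in for a posterior; the HPD search simply finds the shortest window covering 95% of the sorted samples.

```python
import numpy as np

# Minimal sketch: equal-tailed vs. highest-posterior-density intervals
# from samples of a right-skewed stand-in posterior.
rng = np.random.default_rng(9)
theta = rng.gamma(shape=2.0, scale=1.0, size=50000)
alpha = 0.05

# Equal-tailed: cut alpha/2 of the mass from each tail
et = np.quantile(theta, [alpha / 2, 1 - alpha / 2])

# HPD: shortest window covering (1 - alpha) of the sorted draws
srt = np.sort(theta)
k = int(np.floor((1 - alpha) * len(srt)))
widths = srt[k:] - srt[: len(srt) - k]
i = widths.argmin()
hpd = (srt[i], srt[i + k])

print(f"equal-tailed: ({et[0]:.3f}, {et[1]:.3f}), width {et[1] - et[0]:.3f}")
print(f"HPD:          ({hpd[0]:.3f}, {hpd[1]:.3f}), width {hpd[1] - hpd[0]:.3f}")
```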
Credible vs. Confidence Intervals:
| Aspect | Frequentist Confidence Interval | Bayesian Credible Interval |
|---|---|---|
| Interpretation | 95% of similarly constructed intervals contain $\theta$ | 95% probability $\theta$ is in interval |
| Randomness | The interval is random; $\theta$ is fixed | $\theta$ is random; the interval is conditional on data |
| Direct probability | Cannot say $P(\theta \in \text{CI}) = 0.95$ | Can say $P(\theta \in \text{CrI} \mid \mathcal{D}) = 0.95$ |
| Requires prior | No | Yes |
The Bayesian interpretation is often what practitioners actually want.
For hypothesis $H_0: \theta = 0$, compute $P(\theta > 0 \mid \mathcal{D})$ or check if the credible interval contains zero. This directly answers 'What's the probability the effect is positive?' rather than the convoluted frequentist statement 'If $\theta = 0$, data this extreme would be rare.'
ROPE (Region of Practical Equivalence):
Often, we care not whether $\theta = 0$ exactly, but whether $\theta$ is negligibly small. Define a ROPE, e.g., $[-0.1, 0.1]$, and compute $P(\theta \in \text{ROPE} \mid \mathcal{D})$: if most posterior mass lies inside the ROPE, declare the effect practically null; if the credible interval falls entirely outside it, declare a practically meaningful effect; otherwise, remain undecided.
This avoids the problem of rejecting $H_0: \theta = 0$ when the true $\theta = 0.0001$—technically non-zero but practically negligible.
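A minimal sketch of the ROPE decision rule; the posterior draws and the ROPE bounds are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: ROPE decision from posterior samples.
rng = np.random.default_rng(10)
theta = rng.normal(loc=0.03, scale=0.03, size=20000)  # stand-in posterior

rope = (-0.1, 0.1)
p_in_rope = np.mean((theta > rope[0]) & (theta < rope[1]))
ci = np.quantile(theta, [0.025, 0.975])

print(f"P(theta in ROPE | D) = {p_in_rope:.3f}")
if ci[0] > rope[1] or ci[1] < rope[0]:
    print("95% CrI entirely outside ROPE: practically meaningful effect")
elif ci[0] > rope[0] and ci[1] < rope[1]:
    print("95% CrI entirely inside ROPE: practically equivalent to zero")
else:
    print("Inconclusive: CrI overlaps the ROPE boundary")
```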
Modern machine learning often involves thousands or millions of parameters. Full posterior inference at this scale is challenging but possible with the right techniques.
Challenges in High Dimensions:
Random-walk samplers mix extremely slowly as dimension grows; storing a full posterior covariance requires $O(d^2)$ memory; posteriors over neural network weights are highly multimodal; and each likelihood evaluation may touch the entire dataset.
Scalable Approaches:
Stochastic Gradient MCMC: Use mini-batches to approximate gradients. Stochastic Gradient Langevin Dynamics (SGLD), Stochastic Gradient HMC. (A minimal SGLD sketch appears after this list.)
Structured Variational Inference: Use factored or low-rank covariance in variational family. Trade accuracy for tractability.
Sparse Posterior Approximations: Approximate posterior with few support points (sparse VI, ensemble methods).
Neural Network Posteriors: For deep learning, use dropout as approximate Bayesian inference, or variational BNNs.
Laplace Approximation at Scale: Compute Hessian via automatic differentiation; use low-rank or diagonal approximations.
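As a concrete instance of the first approach, here is a minimal SGLD sketch for Bayesian linear regression; the step size, batch size, and data are illustrative assumptions, and a practical implementation would decay the step size as in Welling & Teh (2011).

```python
import numpy as np

# Minimal SGLD sketch for Bayesian linear regression with a Gaussian prior.
rng = np.random.default_rng(11)
N, d, sigma2, tau2 = 10_000, 3, 1.0, 10.0
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

beta = np.zeros(d)
eps, batch = 1e-5, 500
draws = []
for t in range(4000):
    idx = rng.choice(N, size=batch, replace=False)
    # Mini-batch estimate of the full-data log-likelihood gradient
    grad_lik = (N / batch) * X[idx].T @ (y[idx] - X[idx] @ beta) / sigma2
    grad_prior = -beta / tau2
    noise = rng.normal(scale=np.sqrt(eps), size=d)   # injected Langevin noise
    beta = beta + 0.5 * eps * (grad_lik + grad_prior) + noise
    if t >= 1000:                                    # discard burn-in
        draws.append(beta)

draws = np.array(draws)
print("posterior mean estimate:", draws.mean(axis=0).round(3))
# With a fixed step size, mini-batch gradient noise inflates the sampled
# variance; the decaying schedule of Welling & Teh (2011) corrects this.
```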
Most theoretical Bayesian work assumes we can compute posteriors exactly. In practice above a few thousand parameters, we rely on approximations whose quality is hard to verify. This is an active research area—bridging the gap between principled Bayesian inference and the scale of modern ML.
We've completed our exploration of the Bayesian interpretation of regularization. Let's consolidate the key insights from this page and the entire module.
Module Synthesis: The Bayesian Interpretation of Regularization
Across this module, we've established: (1) regularized objectives are negative log-posteriors, so Ridge is MAP estimation under a Gaussian prior and Lasso is MAP estimation under a Laplace prior; (2) the regularization strength encodes the ratio of noise variance to prior variance, $\lambda = \sigma^2/\tau^2$; (3) full posterior inference extends MAP estimation with uncertainty quantification, predictive distributions, and principled decisions.
This perspective transforms regularization from an algorithmic convenience into a principled framework for incorporating prior knowledge. When you choose $\lambda$, you're choosing a prior. When you choose L1 vs L2, you're choosing Laplace vs Gaussian. Every regularization decision is a statement about what you believe.
You've mastered the Bayesian interpretation of regularization. You can now see Ridge and Lasso as MAP estimation under specific priors, understand when full Bayesian inference matters, and appreciate the deep connection between prior beliefs and regularization choices. This probabilistic lens will inform your modeling decisions throughout your machine learning practice—every penalty term is a prior, every regularization strength is a belief about parameter magnitudes.
What's Next in the Curriculum:
With the Bayesian foundation established, you're prepared for advanced topics that build on these ideas.
The Bayesian perspective you've developed here will serve as a unifying thread through all of advanced machine learning.