MAP estimation gives us the most probable parameter values given our data and prior beliefs. But probability is a richer language than single point estimates. What if the posterior is flat and we're highly uncertain? What if it's bimodal with two plausible explanations? What if we need to propagate parameter uncertainty into predictions?
Full Bayesian posterior inference addresses these questions. Instead of finding just the mode, we characterize the entire posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$. This distribution encapsulates everything we know about parameters after observing data—not just the best guess, but our complete uncertainty landscape.
This page develops the theory and practice of posterior inference, connecting it back to regularization and showing when full Bayesian methods matter most.
By completing this page, you will: (1) Understand what the posterior distribution represents and how to interpret it; (2) Learn methods for computing posteriors (exact, MCMC, variational); (3) See how posterior inference enables uncertainty quantification; (4) Understand Bayesian prediction and decision-making; (5) Connect posterior inference to regularization through the Bayesian linear regression example.
Definition and Meaning:
The posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$ is the probability distribution over parameters after observing data $\mathcal{D}$. It combines the likelihood $p(\mathcal{D} \mid \boldsymbol{\theta})$, what the data say, with the prior $p(\boldsymbol{\theta})$, what we believed beforehand.
By Bayes' theorem: $$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})}{p(\mathcal{D})}$$
Interpreting the Posterior:
The posterior is not just a computational object—it has direct probabilistic meaning. Statements such as $P(\theta > 0 \mid \mathcal{D}) = 0.97$ or $P(\theta \in [0.2, 0.5] \mid \mathcal{D}) = 0.90$ are genuine probability statements about parameters—something frequentist methods cannot provide directly.
The posterior distribution is the complete Bayesian answer to the inference problem. It contains all information about parameters that the data and prior can provide. Any summary (mean, mode, intervals) is just a projection of this full distribution.
| Summary | Definition | Use Case |
|---|---|---|
| Posterior Mean | $\mathbb{E}[\boldsymbol{\theta} \mid \mathcal{D}]$ | Optimal under squared loss; often used for prediction |
| Posterior Mode (MAP) | $\arg\max p(\boldsymbol{\theta} \mid \mathcal{D})$ | Point estimate; may be sparse |
| Posterior Median | 50th percentile | Robust to skewness; optimal under absolute loss |
| Credible Interval | $[\theta_L, \theta_U]$ with $P(\theta \in [\theta_L, \theta_U] \mid \mathcal{D}) = 1-\alpha$ | Uncertainty quantification |
| Highest Density Region | Smallest region containing $(1-\alpha)$ probability | Most informative uncertainty region |
For certain combinations of likelihood and prior, the posterior has a known closed-form. These conjugate pairs are computationally precious.
Bayesian Linear Regression (Gaussian Prior):
The most important example for regularization:
Likelihood: $\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim \mathcal{N}(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$
Prior: $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$
Posterior (derived in Page 2): $$\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$$
where: $$\boldsymbol{\mu}_n = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$ $$\boldsymbol{\Sigma}_n = \sigma^2(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}$$
with $\lambda = \sigma^2/\tau^2$.
For Ridge regression, we have the complete posterior analytically. The posterior mean equals the Ridge estimate. The posterior covariance gives us immediate uncertainty quantification—no MCMC needed. This is one of the few cases where full Bayesian inference is as easy as point estimation.
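To make this concrete, here is a minimal NumPy sketch of the closed-form posterior; the synthetic data and the values of $\sigma^2$ and $\tau^2$ are illustrative assumptions rather than anything from this page.

```python
import numpy as np

# Minimal sketch: closed-form posterior for Bayesian linear regression.
# Synthetic data; sigma2 and tau2 are illustrative hyperparameters.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, 0.0, -2.0])
sigma2, tau2 = 0.25, 1.0              # noise variance, prior variance
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

lam = sigma2 / tau2                   # lambda = sigma^2 / tau^2
A = X.T @ X + lam * np.eye(d)
mu_n = np.linalg.solve(A, X.T @ y)    # posterior mean = Ridge estimate
Sigma_n = sigma2 * np.linalg.inv(A)   # posterior covariance

# Marginal 95% credible interval for each coefficient
sd = np.sqrt(np.diag(Sigma_n))
for j in range(d):
    print(f"beta_{j}: {mu_n[j]:+.3f}  "
          f"95% CrI ({mu_n[j] - 1.96 * sd[j]:+.3f}, {mu_n[j] + 1.96 * sd[j]:+.3f})")
```

Note that `np.linalg.solve` is preferred over an explicit inverse for the mean; the explicit inverse is kept only because the covariance matrix itself is the object of interest here.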
Other Conjugate Pairs:
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Gaussian (known $\sigma^2$) | Gaussian | Gaussian |
| Binomial | Beta | Beta |
| Poisson | Gamma | Gamma |
| Multinomial | Dirichlet | Dirichlet |
| Exponential | Gamma | Gamma |
| Gaussian (unknown $\sigma^2$) | Normal-Inverse-Gamma | Normal-Inverse-Gamma |
Limitations of Exact Inference:
Conjugacy is the exception rather than the rule. A Laplace prior with a Gaussian likelihood (the Bayesian Lasso), logistic regression under any standard prior, and hierarchical models all lack closed-form posteriors. Beyond a short catalog of conjugate pairs, exact computation is out of reach.
When exact posterior computation is intractable, we can approximate it by drawing samples. MCMC constructs a Markov chain whose stationary distribution is the posterior.
The Key Idea:
If we can draw samples $\boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(2)}, \ldots, \boldsymbol{\theta}^{(S)}$ from $p(\boldsymbol{\theta} \mid \mathcal{D})$, we can approximate any posterior quantity:
$$\mathbb{E}[g(\boldsymbol{\theta}) \mid \mathcal{D}] \approx \frac{1}{S}\sum_{s=1}^S g(\boldsymbol{\theta}^{(s)})$$
MCMC generates correlated samples that, after enough iterations, behave like independent samples from the posterior.
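To see the recipe end to end, here is a minimal random-walk Metropolis sketch for a deliberately simple non-conjugate model: a Gaussian likelihood with a Laplace prior on its mean. The data, proposal step size, and burn-in length are illustrative assumptions.

```python
import numpy as np

# Minimal random-walk Metropolis sketch for a 1D non-conjugate model:
# Gaussian likelihood with a Laplace prior on the mean theta.
rng = np.random.default_rng(1)
data = rng.normal(loc=0.8, scale=1.0, size=20)

def log_post(theta, b=1.0):
    # Unnormalized log posterior: Gaussian log-likelihood + Laplace log-prior
    return -0.5 * np.sum((data - theta) ** 2) - abs(theta) / b

S, step, theta = 5000, 0.5, 0.0
samples = np.empty(S)
for s in range(S):
    prop = theta + step * rng.normal()
    # Accept with probability min(1, p(prop | D) / p(theta | D))
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples[s] = theta

kept = samples[1000:]                            # discard burn-in
print("posterior mean ~", kept.mean().round(3))
print("P(theta > 0 | D) ~", (kept > 0).mean())   # a genuine probability statement
```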
MCMC for Bayesian Lasso (Laplace Prior):
As noted in Page 3, the Laplace prior can be written as a scale-mixture of Gaussians:
$$\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2)$$ $$\tau_j^2 \sim \text{Exponential}(\lambda^2/2)$$
This enables Gibbs sampling, alternating three conditional updates: (1) sample $\boldsymbol{\beta} \mid \boldsymbol{\tau}^2, \sigma^2, \mathbf{y}$ from a multivariate Gaussian; (2) sample $1/\tau_j^2 \mid \beta_j$ from an Inverse-Gaussian distribution; (3) sample $\sigma^2 \mid \boldsymbol{\beta}, \mathbf{y}$ from an Inverse-Gamma distribution.
Each step has a known distribution, making sampling efficient despite the non-conjugacy of the original model.
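Below is a minimal sketch of this sampler, holding $\sigma^2$ fixed for brevity (so the third update is omitted); the data, the rate $\lambda$, and the chain length are illustrative assumptions.

```python
import numpy as np

# Minimal Gibbs sketch for the Bayesian Lasso via the scale-mixture
# representation. sigma^2 is held fixed for brevity.
rng = np.random.default_rng(2)
n, d = 60, 4
X = rng.normal(size=(n, d))
sigma2 = 1.0
y = X @ np.array([2.0, 0.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=n)

lam = 1.0                             # Laplace rate: tau_j^2 ~ Exp(lam^2 / 2)
S, burn = 3000, 500
tau2 = np.ones(d)
draws = np.empty((S, d))
XtX, Xty = X.T @ X, X.T @ y

for s in range(S):
    # (1) beta | tau^2, y  ~  multivariate Gaussian
    cov = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / tau2))
    beta = rng.multivariate_normal(cov @ Xty / sigma2, cov)
    # (2) 1/tau_j^2 | beta_j  ~  Inverse-Gaussian(lam / |beta_j|, lam^2)
    tau2 = 1.0 / rng.wald(lam / np.abs(beta), lam ** 2)
    draws[s] = beta

post = draws[burn:]
print("posterior means:", post.mean(axis=0).round(3))
print("P(|beta_j| < 0.1 | D):", (np.abs(post) < 0.1).mean(axis=0).round(3))
```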
MCMC requires careful diagnostics: (1) Convergence: Has the chain reached its stationary distribution? Use trace plots, R-hat statistic. (2) Mixing: Is the chain exploring efficiently? Check effective sample size. (3) Autocorrelation: Are samples too correlated? Thin if necessary. Poor diagnostics mean your posterior approximation is unreliable.
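As an illustration of the convergence check, here is a hand-rolled split-R-hat computation; in practice a library such as ArviZ supplies this diagnostic, and the synthetic "chains" below merely stand in for real MCMC output.

```python
import numpy as np

# Minimal sketch of the split-R-hat diagnostic for M chains of a
# scalar parameter, stored as an array of shape (M, T).
def split_rhat(chains):
    m, t = chains.shape
    # Split each chain in half to also detect within-chain trends
    halves = chains[:, : t // 2 * 2].reshape(2 * m, t // 2)
    n = halves.shape[1]
    w = halves.var(axis=1, ddof=1).mean()          # within-chain variance
    b = n * halves.mean(axis=1).var(ddof=1)        # between-chain variance
    var_hat = (n - 1) / n * w + b / n
    return np.sqrt(var_hat / w)                    # ~1.0 indicates convergence

rng = np.random.default_rng(3)
good = rng.normal(size=(4, 1000))                    # 4 well-mixed chains
bad = good + np.array([[0.0], [0.0], [3.0], [3.0]])  # 2 chains stuck elsewhere
print("R-hat (mixed):", round(split_rhat(good), 3))  # close to 1.00
print("R-hat (stuck):", round(split_rhat(bad), 3))   # far above 1.01
```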
Variational Inference (VI) approximates the posterior with a simpler distribution from a tractable family, making inference an optimization problem rather than a sampling problem.
The Core Idea:
Find the distribution $q(\boldsymbol{\theta})$ in family $\mathcal{Q}$ that is closest to the true posterior:
$$q^*(\boldsymbol{\theta}) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}))$$
The KL divergence measures how different $q$ is from the posterior.
The Evidence Lower Bound (ELBO):
Direct KL minimization is intractable (requires $p(\mathcal{D})$). Instead, maximize the ELBO:
$$\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathcal{D}, \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\boldsymbol{\theta})]$$
$$= \mathbb{E}_q[\log p(\mathcal{D} \mid \boldsymbol{\theta})] - \text{KL}(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta}))$$
Because $\log p(\mathcal{D}) = \mathcal{L}(q) + \text{KL}(q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}))$ and the evidence does not depend on $q$, maximizing the ELBO is equivalent to minimizing $\text{KL}(q \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D}))$.
Mean-Field Approximation:
The most common assumption: parameters are independent in the variational family: $$q(\boldsymbol{\theta}) = \prod_j q_j(\theta_j)$$
This factorization makes optimization tractable but ignores posterior correlations.
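The sketch below runs reparameterization-gradient VI with $q(\theta) = \mathcal{N}(m, s^2)$ on a deliberately conjugate model so that the answer can be checked against the exact posterior; the learning rate, sample count, and data are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: reparameterization-gradient VI with q(theta) = N(m, s^2)
# for a conjugate model (Gaussian likelihood and Gaussian prior).
rng = np.random.default_rng(4)
sigma2, tau2 = 1.0, 4.0
data = rng.normal(loc=1.2, scale=1.0, size=30)
n = len(data)

def grad_log_joint(theta):
    # d/dtheta [log p(D | theta) + log p(theta)], vectorized over theta
    return (data.sum() - n * theta) / sigma2 - theta / tau2

m, log_s, lr = 0.0, 0.0, 0.01
for step in range(4000):
    eps = rng.normal(size=64)
    theta = m + np.exp(log_s) * eps                   # reparameterized draws
    g = grad_log_joint(theta)
    m += lr * g.mean()                                # ELBO gradient w.r.t. m
    log_s += lr * ((g * np.exp(log_s) * eps).mean() + 1.0)  # + entropy term

v_exact = 1.0 / (n / sigma2 + 1.0 / tau2)             # exact posterior
m_exact = v_exact * data.sum() / sigma2
print(f"VI:    m = {m:.3f}, s^2 = {np.exp(2 * log_s):.4f}")
print(f"Exact: m = {m_exact:.3f}, v = {v_exact:.4f}")
```

Because the variational family contains the true posterior here, VI recovers it almost exactly; under mean-field factorization in higher dimensions it would not.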
| Aspect | MCMC | Variational Inference |
|---|---|---|
| Accuracy | Asymptotically exact | Approximation, may be biased |
| Speed | Slow (many iterations) | Fast (optimization-based) |
| Scalability | Challenging for big data | Better for large datasets |
| Posterior correlations | Captured naturally | Often ignored (mean-field) |
| Convergence guarantee | Ergodic theorem (eventually) | Local optima possible |
| Uncertainty quality | Proper coverage | Often underestimates uncertainty |
Mean-field VI minimizes $\text{KL}(q \,\|\, p)$, which penalizes $q$ for having mass where $p$ doesn't but not vice versa. This causes VI to underestimate posterior variance—credible intervals will be too narrow. For uncertainty-critical applications, prefer MCMC or use careful VI variants.
A key advantage of full Bayesian inference: we can propagate parameter uncertainty into predictions.
Definition:
For a new input $\mathbf{x}_*$, the posterior predictive distribution is:
$$p(y_* \mid \mathbf{x}_*, \mathcal{D}) = \int p(y_* \mid \mathbf{x}_*, \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta}$$
This integrates predictions over all plausible parameter values, weighted by their posterior probability.
Contrast with Point Estimate Prediction:
A point estimate plugs a single value into the likelihood, $p(y_* \mid \mathbf{x}_*, \hat{\boldsymbol{\theta}})$, ignoring parameter uncertainty entirely; predictive intervals built this way are systematically too narrow.
For Bayesian Linear Regression:
With posterior $\boldsymbol{\beta} \mid \mathcal{D} \sim \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$ and $y_* = \mathbf{x}_*^T\boldsymbol{\beta} + \epsilon$:
$$y_* \mid \mathbf{x}_*, \mathcal{D} \sim \mathcal{N}(\mathbf{x}_*^T\boldsymbol{\mu}_n,\; \mathbf{x}_*^T\boldsymbol{\Sigma}_n\mathbf{x}_* + \sigma^2)$$
The predictive variance has two components: $\mathbf{x}_*^T\boldsymbol{\Sigma}_n\mathbf{x}_*$, the epistemic uncertainty coming from the parameters, and $\sigma^2$, the aleatoric uncertainty coming from observation noise.
With more data, epistemic uncertainty shrinks; aleatoric uncertainty remains constant.
Bayesian predictive intervals naturally include both parameter uncertainty and noise. A 95% predictive interval contains the next observation with 95% probability. This is wider than a confidence interval for the mean, which only captures parameter uncertainty—a crucial distinction for practical forecasting.
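A short sketch of this closed-form predictive computation, with the variance split into its two components; the data and hyperparameters are synthetic placeholders.

```python
import numpy as np

# Minimal sketch: posterior predictive for Bayesian linear regression,
# splitting the variance into epistemic and aleatoric parts.
rng = np.random.default_rng(5)
n, d = 40, 2
X = rng.normal(size=(n, d))
sigma2, tau2 = 0.5, 2.0
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

lam = sigma2 / tau2
A = X.T @ X + lam * np.eye(d)
mu_n = np.linalg.solve(A, X.T @ y)
Sigma_n = sigma2 * np.linalg.inv(A)

x_star = np.array([0.3, -1.2])               # new input
mean = x_star @ mu_n
epistemic = x_star @ Sigma_n @ x_star        # shrinks as n grows
aleatoric = sigma2                           # irreducible noise floor
sd = np.sqrt(epistemic + aleatoric)
print(f"predictive mean {mean:.3f}")
print(f"variance: epistemic {epistemic:.4f} + aleatoric {aleatoric:.4f}")
print(f"95% predictive interval ({mean - 1.96 * sd:.3f}, {mean + 1.96 * sd:.3f})")
```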
Computing Posterior Predictive with MCMC:
If we have posterior samples $\{\boldsymbol{\theta}^{(s)}\}_{s=1}^S$, approximate the predictive density as a mixture:
$$p(y_* \mid \mathbf{x}_*, \mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^S p(y_* \mid \mathbf{x}_*, \boldsymbol{\theta}^{(s)})$$
Or draw predictive samples directly: for each $s$, draw $y_*^{(s)} \sim p(y_* \mid \mathbf{x}_*, \boldsymbol{\theta}^{(s)})$. The resulting draws come from the posterior predictive, and their empirical quantiles give predictive intervals.
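A minimal sketch of the sampling route, where `samples` stands in for real MCMC output and a Gaussian observation model with known noise variance is an illustrative assumption:

```python
import numpy as np

# Minimal sketch: predictive draws from generic posterior samples.
# `samples` stands in for MCMC output of shape (S, d).
rng = np.random.default_rng(6)
S, d, sigma2 = 4000, 3, 0.25
samples = rng.multivariate_normal(np.array([1.0, 0.0, -0.5]),
                                  0.01 * np.eye(d), size=S)
x_star = np.array([0.5, 1.0, -1.0])

# One predictive draw per posterior draw: y* ~ N(x*' beta^(s), sigma^2)
y_star = samples @ x_star + rng.normal(scale=np.sqrt(sigma2), size=S)
lo, hi = np.percentile(y_star, [2.5, 97.5])
print(f"predictive mean {y_star.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```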
Full posterior inference enables principled model comparison through the marginal likelihood (evidence).
The Marginal Likelihood:
$$p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} \mid \mathcal{M}) \, d\boldsymbol{\theta}$$
This is the probability of the data under model $\mathcal{M}$, averaged over all parameter values weighted by the prior.
Bayes Factors:
To compare models $\mathcal{M}_1$ and $\mathcal{M}_2$:
$$\text{BF}_{12} = \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)}$$
Bayes factor > 1 favors $\mathcal{M}_1$; < 1 favors $\mathcal{M}_2$.
| Bayes Factor | Evidence Strength |
|---|---|
| 1 – 3 | Anecdotal |
| 3 – 10 | Moderate |
| 10 – 30 | Strong |
| 30 – 100 | Very strong |
| > 100 | Extreme/Decisive |
Automatic Occam's Razor:
The marginal likelihood automatically penalizes complexity. Complex models spread prior probability over a larger parameter space, reducing the average likelihood: a flexible model can explain many possible datasets, so it assigns relatively little probability to any particular one, including the dataset actually observed.
This is why Bayesian model comparison favors simpler models that fit well—complexity is penalized without explicit regularization terms.
Different regularization strengths define different models. Maximizing the marginal likelihood over $\lambda$ (Empirical Bayes) selects the regularization that best balances fit and complexity in a principled Bayesian sense—essentially choosing the prior hyperparameters that make the data most probable.
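Here is a sketch of this procedure for the Gaussian linear model, where the evidence is available in closed form because integrating out $\boldsymbol{\beta}$ gives $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I} + \tau^2\mathbf{X}\mathbf{X}^T)$; the data and the assumption of known $\sigma^2$ are illustrative. The same evidence function also yields a Bayes factor between two candidate priors.

```python
import numpy as np

# Minimal sketch: Empirical Bayes for Ridge via the closed-form
# marginal likelihood of the Gaussian linear model.
rng = np.random.default_rng(7)
n, d, sigma2 = 80, 5, 1.0
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

def log_evidence(tau2):
    # y ~ N(0, sigma^2 I + tau^2 X X^T) after integrating out beta
    K = sigma2 * np.eye(n) + tau2 * X @ X.T
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(K, y))

tau2_grid = np.logspace(-3, 2, 60)
log_evs = np.array([log_evidence(t) for t in tau2_grid])
best = tau2_grid[log_evs.argmax()]
print(f"evidence-optimal tau^2 = {best:.3f} -> lambda = {sigma2 / best:.3f}")
# A log Bayes factor between two candidate priors is just a difference:
print("log BF(tau^2 = 1 vs tau^2 = 0.01):",
      round(log_evidence(1.0) - log_evidence(0.01), 2))
```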
The posterior distribution is not the end goal—it's input to decisions. Bayesian decision theory provides a coherent framework for acting under uncertainty.
The Framework:
Three ingredients: a set of available actions $\mathcal{A}$; a loss function $L(\theta, a)$ giving the cost of action $a$ when the true parameter is $\theta$; and the posterior $p(\theta \mid \mathcal{D})$ summarizing what we know about $\theta$.
Optimal Bayesian Decision:
Choose action minimizing expected posterior loss:
$$a^* = \arg\min_{a \in \mathcal{A}} \mathbb{E}_{\theta \mid \mathcal{D}}[L(\theta, a)]$$
$$= \arg\min_{a \in \mathcal{A}} \int L(\theta, a) \cdot p(\theta \mid \mathcal{D}) \, d\theta$$
| Loss Function | Optimal Estimate | Name |
|---|---|---|
| $(\theta - a)^2$ | Posterior mean $\mathbb{E}[\theta \mid \mathcal{D}]$ | Squared error / L2 |
| $\lvert\theta - a\rvert$ | Posterior median | Absolute error / L1 |
| $\mathbf{1}_{\theta \neq a}$ | Posterior mode (MAP) | 0-1 loss |
| Asymmetric | Depends on asymmetry | Custom loss |
Example: Variable Selection Decision:
Suppose we must decide whether feature $j$ is relevant (include in model) or not. Let $p_j = P(\text{feature } j \text{ relevant} \mid \mathcal{D})$, let $c_I$ be the cost of including an irrelevant feature, and let $c_E$ be the cost of excluding a relevant one.
Optimal decision: include feature $j$ when inclusion has lower expected posterior loss, i.e., when $c_I(1 - p_j) < c_E \, p_j$, which rearranges to $p_j > c_I / (c_I + c_E)$.
With equal costs: include if posterior probability of relevance > 0.5.
This is principled variable selection based on the posterior, not on arbitrary significance thresholds.
Point estimates like MAP or posterior mean are only optimal under specific loss functions. If your actual loss is different (e.g., asymmetric costs of over- vs. under-prediction), the optimal action may be neither the mean nor the mode. The posterior enables computing the optimal action for any loss function.
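To illustrate, the sketch below finds the optimal action under an asymmetric pinball loss by brute-force minimization of expected posterior loss, then confirms it matches the corresponding posterior quantile; the skewed "posterior" draws and the 4:1 cost ratio are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: optimal action under an asymmetric loss, computed
# from posterior samples. Under-prediction is 4x as costly here.
rng = np.random.default_rng(8)
theta = rng.lognormal(mean=0.0, sigma=0.7, size=20000)  # skewed draws

c_over, c_under = 1.0, 4.0   # cost per unit of over-/under-prediction

def expected_loss(a):
    return np.mean(c_over * np.maximum(a - theta, 0)
                   + c_under * np.maximum(theta - a, 0))

grid = np.linspace(theta.min(), theta.max(), 2000)
a_star = grid[np.argmin([expected_loss(a) for a in grid])]

# For pinball loss the optimum is the c_under / (c_over + c_under) quantile
print(f"grid-optimal action    {a_star:.3f}")
print(f"0.8 posterior quantile {np.quantile(theta, 0.8):.3f}")
print(f"posterior mean         {theta.mean():.3f}  (optimal only for squared loss)")
```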
Credible Intervals:
A $(1-\alpha)$ credible interval contains the parameter with posterior probability $(1-\alpha)$:
$$P(\theta \in [\theta_L, \theta_U] \mid \mathcal{D}) = 1 - \alpha$$
Types of Credible Intervals:
Equal-tailed: $P(\theta < \theta_L \mid \mathcal{D}) = P(\theta > \theta_U \mid \mathcal{D}) = \alpha/2$
Highest Posterior Density (HPD): Smallest interval containing $(1-\alpha)$ probability. Every point inside has higher density than any point outside.
For symmetric posteriors, these coincide. For skewed posteriors, HPD is more informative.
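The sketch below computes both interval types from draws of a right-skewed distribution standing in for a posterior; the HPD search simply finds the shortest window covering 95% of the sorted samples.

```python
import numpy as np

# Minimal sketch: equal-tailed vs. highest-posterior-density intervals
# from samples of a right-skewed stand-in posterior.
rng = np.random.default_rng(9)
theta = rng.gamma(shape=2.0, scale=1.0, size=50000)
alpha = 0.05

# Equal-tailed: cut alpha/2 of the mass from each tail
et = np.quantile(theta, [alpha / 2, 1 - alpha / 2])

# HPD: shortest window covering (1 - alpha) of the sorted draws
srt = np.sort(theta)
k = int(np.floor((1 - alpha) * len(srt)))
widths = srt[k:] - srt[: len(srt) - k]
i = widths.argmin()
hpd = (srt[i], srt[i + k])

print(f"equal-tailed: ({et[0]:.3f}, {et[1]:.3f}), width {et[1] - et[0]:.3f}")
print(f"HPD:          ({hpd[0]:.3f}, {hpd[1]:.3f}), width {hpd[1] - hpd[0]:.3f}")
```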
Credible vs. Confidence Intervals:
| Aspect | Frequentist Confidence Interval | Bayesian Credible Interval |
|---|---|---|
| Interpretation | 95% of similarly constructed intervals contain $\theta$ | 95% probability $\theta$ is in interval |
| Randomness | The interval is random; $\theta$ is fixed | $\theta$ is random; the interval is conditional on data |
| Direct probability | Cannot say $P(\theta \in \text{CI}) = 0.95$ | Can say $P(\theta \in \text{CrI} \mid \mathcal{D}) = 0.95$ |
| Requires prior | No | Yes |
The Bayesian interpretation is often what practitioners actually want.
For hypothesis $H_0: \theta = 0$, compute $P(\theta > 0 \mid \mathcal{D})$ or check if the credible interval contains zero. This directly answers 'What's the probability the effect is positive?' rather than the convoluted frequentist statement 'If $\theta = 0$, data this extreme would be rare.'
ROPE (Region of Practical Equivalence):
Often, we care not whether $\theta = 0$ exactly, but whether $\theta$ is negligibly small. Define a ROPE, e.g., $[-0.1, 0.1]$, and compute $P(\theta \in \text{ROPE} \mid \mathcal{D})$: if most posterior mass lies inside the ROPE, declare the effect practically null; if the credible interval falls entirely outside it, declare a practically meaningful effect; otherwise, remain undecided.
This avoids the problem of rejecting $H_0: \theta = 0$ when the true $\theta = 0.0001$—technically non-zero but practically negligible.
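A minimal sketch of the ROPE decision rule; the posterior draws and the ROPE bounds are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: ROPE decision from posterior samples.
rng = np.random.default_rng(10)
theta = rng.normal(loc=0.03, scale=0.03, size=20000)  # stand-in posterior

rope = (-0.1, 0.1)
p_in_rope = np.mean((theta > rope[0]) & (theta < rope[1]))
ci = np.quantile(theta, [0.025, 0.975])

print(f"P(theta in ROPE | D) = {p_in_rope:.3f}")
if ci[0] > rope[1] or ci[1] < rope[0]:
    print("95% CrI entirely outside ROPE: practically meaningful effect")
elif ci[0] > rope[0] and ci[1] < rope[1]:
    print("95% CrI entirely inside ROPE: practically equivalent to zero")
else:
    print("Inconclusive: CrI overlaps the ROPE boundary")
```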
Modern machine learning often involves thousands or millions of parameters. Full posterior inference at this scale is challenging but possible with the right techniques.
Challenges in High Dimensions:
Random-walk samplers mix extremely slowly as dimension grows; storing a full posterior covariance requires $O(d^2)$ memory; posteriors over neural network weights are highly multimodal; and each likelihood evaluation may touch the entire dataset.
Scalable Approaches:
Stochastic Gradient MCMC: Use mini-batches to approximate gradients. Stochastic Gradient Langevin Dynamics (SGLD), Stochastic Gradient HMC. (A minimal SGLD sketch appears after this list.)
Structured Variational Inference: Use factored or low-rank covariance in variational family. Trade accuracy for tractability.
Sparse Posterior Approximations: Approximate posterior with few support points (sparse VI, ensemble methods).
Neural Network Posteriors: For deep learning, use dropout as approximate Bayesian inference, or variational BNNs.
Laplace Approximation at Scale: Compute Hessian via automatic differentiation; use low-rank or diagonal approximations.
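As a concrete instance of the first approach, here is a minimal SGLD sketch for Bayesian linear regression; the step size, batch size, and data are illustrative assumptions, and a practical implementation would decay the step size as in Welling & Teh (2011).

```python
import numpy as np

# Minimal SGLD sketch for Bayesian linear regression with a Gaussian prior.
rng = np.random.default_rng(11)
N, d, sigma2, tau2 = 10_000, 3, 1.0, 10.0
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

beta = np.zeros(d)
eps, batch = 1e-5, 500
draws = []
for t in range(4000):
    idx = rng.choice(N, size=batch, replace=False)
    # Mini-batch estimate of the full-data log-likelihood gradient
    grad_lik = (N / batch) * X[idx].T @ (y[idx] - X[idx] @ beta) / sigma2
    grad_prior = -beta / tau2
    noise = rng.normal(scale=np.sqrt(eps), size=d)   # injected Langevin noise
    beta = beta + 0.5 * eps * (grad_lik + grad_prior) + noise
    if t >= 1000:                                    # discard burn-in
        draws.append(beta)

draws = np.array(draws)
print("posterior mean estimate:", draws.mean(axis=0).round(3))
# With a fixed step size, mini-batch gradient noise inflates the sampled
# variance; the decaying schedule of Welling & Teh (2011) corrects this.
```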
Most theoretical Bayesian work assumes we can compute posteriors exactly. In practice above a few thousand parameters, we rely on approximations whose quality is hard to verify. This is an active research area—bridging the gap between principled Bayesian inference and the scale of modern ML.
We've completed our exploration of the Bayesian interpretation of regularization. Let's consolidate the key insights from this page and the entire module.
Module Synthesis: The Bayesian Interpretation of Regularization
Across this module, we've established: (1) regularized objectives are negative log-posteriors, so Ridge is MAP estimation under a Gaussian prior and Lasso is MAP estimation under a Laplace prior; (2) the regularization strength encodes the ratio of noise variance to prior variance, $\lambda = \sigma^2/\tau^2$; (3) full posterior inference extends MAP estimation with uncertainty quantification, predictive distributions, and principled decisions.
This perspective transforms regularization from an algorithmic convenience into a principled framework for incorporating prior knowledge. When you choose $\lambda$, you're choosing a prior. When you choose L1 vs L2, you're choosing Laplace vs Gaussian. Every regularization decision is a statement about what you believe.
You've mastered the Bayesian interpretation of regularization. You can now see Ridge and Lasso as MAP estimation under specific priors, understand when full Bayesian inference matters, and appreciate the deep connection between prior beliefs and regularization choices. This probabilistic lens will inform your modeling decisions throughout your machine learning practice—every penalty term is a prior, every regularization strength is a belief about parameter magnitudes.
What's Next in the Curriculum:
With the Bayesian foundation established, you're prepared for advanced topics that build on these ideas.
The Bayesian perspective you've developed here will serve as a unifying thread through all of advanced machine learning.