We've established our prior beliefs about parameter values. We've defined the likelihood function that quantifies how probable our observations are under different parameters. Now we unite these two components to obtain what is arguably the most important quantity in all of Bayesian statistics: the posterior distribution.
The posterior distribution represents our complete state of knowledge about the parameter after observing data. It is not merely a point estimate or an interval—it is a full probability distribution that captures every aspect of our remaining uncertainty. From this single distribution, we can derive point estimates, credible intervals, hypothesis tests, predictions, and decision-theoretic optimal actions.
Understanding the posterior is understanding Bayesian inference itself.
By the end of this page, you will understand the mathematical derivation of the posterior via Bayes' theorem, appreciate the posterior as a compromise between prior and likelihood, master methods for summarizing posterior distributions, understand the role of the normalizing constant (evidence), and recognize when posteriors are analytically tractable versus requiring approximation.
At the heart of Bayesian inference lies a simple identity from probability theory: Bayes' theorem. Despite its elementary derivation, its implications for statistical reasoning are profound.
Derivation:
Start with the definition of conditional probability:
$$p(\theta | D) = \frac{p(\theta, D)}{p(D)}$$
The joint probability can be factored two ways:
$$p(\theta, D) = p(D | \theta) \cdot p(\theta) = p(\theta | D) \cdot p(D)$$
Rearranging gives Bayes' theorem:
$$p(\theta | D) = \frac{p(D | \theta) \cdot p(\theta)}{p(D)}$$
Naming the Components:
| Symbol | Name | Meaning |
|---|---|---|
| p(θ \| D) | Posterior | Updated belief about θ after seeing D |
| p(D \| θ) | Likelihood | Probability of data D given parameter θ |
| p(θ) | Prior | Belief about θ before seeing D |
| p(D) | Evidence / Marginal Likelihood | Total probability of observing D |
Posterior ∝ Likelihood × Prior. The normalizing constant p(D) ensures the posterior integrates to 1, but for many purposes (finding the mode, comparing parameter values), we only need the unnormalized posterior.
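To see why the normalizer is often dispensable, here is a minimal sketch, using the same Beta(2, 2) prior and 8-heads-in-12-flips data as the full demonstration below, that finds the MAP estimate and compares the plausibility of two parameter values using only likelihood × prior:

```python
import numpy as np
from scipy import stats

# Unnormalized posterior on a grid: likelihood × prior (no evidence needed)
theta = np.linspace(0.001, 0.999, 999)
heads, tails = 8, 4                       # same data as the demo below
unnorm = theta**heads * (1 - theta)**tails * stats.beta(2, 2).pdf(theta)

theta_map = theta[np.argmax(unnorm)]      # mode of the posterior
print(f"MAP estimate ≈ {theta_map:.3f}")  # analytic mode: 9/14 ≈ 0.643

# Relative plausibility of two values also needs no normalizer
ratio = (0.5**heads * 0.5**tails * stats.beta(2, 2).pdf(0.5)) / \
        (0.7**heads * 0.3**tails * stats.beta(2, 2).pdf(0.7))
print(f"p(0.5|D) / p(0.7|D) = {ratio:.3f}")
```

The unknown constant p(D) cancels in the ratio and does not move the location of the maximum, which is exactly why MAP estimation and MCMC can both avoid computing it.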
The Normalizing Constant (Evidence):
The denominator p(D) is crucial for making the posterior a valid probability distribution:
$$p(D) = \int p(D | \theta) \cdot p(\theta) \, d\theta = \int \mathcal{L}(\theta) \cdot \pi(\theta) \, d\theta$$
This integral marginalizes over all possible parameter values, giving the total probability of observing the data under our model.
Two Roles of p(D):

- Normalization: it converts likelihood × prior into a proper distribution that integrates to 1.
- Model comparison: as the marginal likelihood, it measures how well the model as a whole predicts the data, which makes it the key quantity in Bayes factors.
Why Computing p(D) is Hard:
For most non-trivial models, the integral $\int \mathcal{L}(\theta) \cdot \pi(\theta) \, d\theta$ has no closed form. This is the fundamental computational challenge of Bayesian inference, motivating the approximation methods surveyed later on this page: conjugate priors, grid and Laplace approximations, MCMC, and variational inference.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.integrate import quad

# Complete Bayes' theorem demonstration with explicit normalization

def bayes_theorem_example():
    """
    Example: Estimating a coin's bias
    Prior: Beta(2, 2) - weakly informative
    Likelihood: Binomial
    Data: 8 heads in 12 flips
    """
    # Prior parameters
    prior_a, prior_b = 2, 2

    # Observed data
    heads, total = 8, 12
    tails = total - heads

    theta = np.linspace(0.001, 0.999, 1000)

    # Prior: p(θ)
    prior = stats.beta(prior_a, prior_b).pdf(theta)

    # Likelihood: p(D|θ) ∝ θ^k (1-θ)^(n-k)
    likelihood = (theta ** heads) * ((1 - theta) ** tails)

    # Unnormalized posterior: p(D|θ) × p(θ)
    unnorm_posterior = likelihood * prior

    # Evidence: p(D) = ∫ p(D|θ) p(θ) dθ
    # For Beta-Binomial, this has a closed form involving Beta functions
    from scipy.special import betaln
    log_evidence = betaln(prior_a + heads, prior_b + tails) - betaln(prior_a, prior_b)
    evidence = np.exp(log_evidence)
    print(f"Evidence p(D) = {evidence:.6f}")

    # Compute evidence numerically to verify
    def unnorm_posterior_func(t):
        return (t ** heads) * ((1 - t) ** tails) * stats.beta(prior_a, prior_b).pdf(t)

    evidence_numeric, _ = quad(unnorm_posterior_func, 0, 1)
    print(f"Evidence (numerical): {evidence_numeric:.6f}")

    # Normalized posterior: p(θ|D) = p(D|θ) p(θ) / p(D)
    # For conjugate Beta-Binomial: posterior is Beta(a+k, b+n-k)
    post_a, post_b = prior_a + heads, prior_b + tails
    posterior = stats.beta(post_a, post_b).pdf(theta)

    # Verify normalization (simpson replaces scipy's removed simps)
    from scipy.integrate import simpson
    posterior_integral = simpson(posterior, x=theta)
    print(f"Posterior integrates to: {posterior_integral:.6f}")

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Prior
    ax = axes[0, 0]
    ax.fill_between(theta, prior, alpha=0.3, color='blue')
    ax.plot(theta, prior, 'b-', linewidth=2)
    ax.set_title(f'Prior: Beta({prior_a}, {prior_b})')
    ax.set_xlabel('θ')
    ax.set_ylabel('p(θ)')
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

    # Plot 2: Likelihood
    ax = axes[0, 1]
    ax.fill_between(theta, likelihood / np.max(likelihood), alpha=0.3, color='green')
    ax.plot(theta, likelihood / np.max(likelihood), 'g-', linewidth=2)
    ax.axvline(heads / total, color='red', linestyle='--', label=f'MLE = {heads/total:.2f}')
    ax.set_title(f'Likelihood (normalized max): {heads} heads, {tails} tails')
    ax.set_xlabel('θ')
    ax.set_ylabel('L(θ) / max L(θ)')
    ax.set_xlim(0, 1)
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Plot 3: Unnormalized posterior
    ax = axes[1, 0]
    ax.fill_between(theta, unnorm_posterior, alpha=0.3, color='orange')
    ax.plot(theta, unnorm_posterior, color='orange', linewidth=2)
    ax.set_title('Unnormalized Posterior: p(D|θ) × p(θ)')
    ax.set_xlabel('θ')
    ax.set_ylabel('Unnormalized p(θ|D)')
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

    # Plot 4: Normalized posterior
    ax = axes[1, 1]
    ax.fill_between(theta, posterior, alpha=0.3, color='red')
    ax.plot(theta, posterior, 'r-', linewidth=2)
    ax.axvline(post_a / (post_a + post_b), color='darkred', linestyle='-',
               label=f'Posterior mean = {post_a/(post_a + post_b):.3f}')

    # Add credible interval
    ci_lower = stats.beta(post_a, post_b).ppf(0.025)
    ci_upper = stats.beta(post_a, post_b).ppf(0.975)
    ax.axvspan(ci_lower, ci_upper, alpha=0.2, color='red',
               label=f'95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
    ax.set_title(f'Posterior: Beta({post_a}, {post_b})')
    ax.set_xlabel('θ')
    ax.set_ylabel('p(θ|D)')
    ax.set_xlim(0, 1)
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return evidence

evidence = bayes_theorem_example()
```

A profound insight from Bayesian inference is that the posterior represents a principled compromise between prior beliefs and data evidence. Understanding this balancing act is essential for interpreting Bayesian results.
The Tension:

The prior pulls the posterior toward our initial beliefs; the likelihood pulls it toward the data (the MLE). The posterior settles where these forces balance.

The relative influence of each depends on:

- The strength (concentration) of the prior: a sharply peaked prior resists movement.
- The amount of data: more observations give the likelihood more pull.
- The degree of agreement between prior and data: when they conflict, the posterior sits between them and is often wider.
Mathematical Insight for Conjugate Models:
For the Beta-Binomial model:
$$\text{Prior}: \theta \sim \text{Beta}(\alpha, \beta)$$ $$\text{Likelihood}: X \sim \text{Binomial}(n, \theta)$$ $$\text{Posterior}: \theta | X \sim \text{Beta}(\alpha + k, \beta + n - k)$$
The posterior mean is:
$$E[\theta | X] = \frac{\alpha + k}{\alpha + \beta + n}$$
This can be rewritten as a weighted average:
$$E[\theta | X] = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{prior weight}} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}} + \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{data weight}} \cdot \underbrace{\frac{k}{n}}_{\text{MLE}}$$
In the Beta-Binomial model, α + β can be interpreted as an 'effective sample size' of the prior. A Beta(2, 2) prior acts like having seen 4 prior observations (2 successes, 2 failures). A Beta(50, 50) prior acts like 100 prior observations—much harder for new data to shift.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_compromise(prior_strength, n_data, true_theta=0.7):
    """
    Show how posterior balances prior and data at different strengths.
    prior_strength: controls Beta(a, a) where a = prior_strength
    n_data: sample size
    """
    # Prior centered at 0.5
    prior_a = prior_b = prior_strength

    # Generate data from true_theta
    np.random.seed(42)
    k = np.random.binomial(n_data, true_theta)
    mle = k / n_data

    # Posterior
    post_a = prior_a + k
    post_b = prior_b + (n_data - k)

    # Means
    prior_mean = prior_a / (prior_a + prior_b)
    post_mean = post_a / (post_a + post_b)

    # Weights (as derived)
    prior_weight = (prior_a + prior_b) / (prior_a + prior_b + n_data)
    data_weight = n_data / (prior_a + prior_b + n_data)

    return {
        'prior_a': prior_a, 'prior_b': prior_b,
        'k': k, 'n': n_data, 'mle': mle,
        'post_a': post_a, 'post_b': post_b,
        'prior_mean': prior_mean, 'post_mean': post_mean,
        'prior_weight': prior_weight, 'data_weight': data_weight
    }

# Experiment: vary prior strength and data amount
scenarios = [
    (1, 10, "Weak prior, moderate data"),
    (10, 10, "Moderate prior, moderate data"),
    (50, 10, "Strong prior, moderate data"),
    (1, 100, "Weak prior, lots of data"),
    (10, 100, "Moderate prior, lots of data"),
    (50, 100, "Strong prior, lots of data"),
]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
theta = np.linspace(0, 1, 500)

for ax, (prior_str, n_data, title) in zip(axes, scenarios):
    result = demonstrate_compromise(prior_str, n_data)

    # Plot distributions
    prior = stats.beta(result['prior_a'], result['prior_b']).pdf(theta)
    posterior = stats.beta(result['post_a'], result['post_b']).pdf(theta)

    ax.plot(theta, prior, 'b--', linewidth=2, label='Prior', alpha=0.7)
    ax.fill_between(theta, posterior, alpha=0.3, color='red')
    ax.plot(theta, posterior, 'r-', linewidth=2, label='Posterior')
    ax.axvline(result['mle'], color='green', linestyle=':', linewidth=2,
               label=f'MLE={result["mle"]:.2f}')
    ax.axvline(result['post_mean'], color='darkred', linestyle='-', linewidth=1.5,
               label=f'Post mean={result["post_mean"]:.2f}')
    ax.axvline(0.7, color='black', linestyle='--', alpha=0.5, label='True θ=0.7')
    ax.set_title(f'{title}\nPrior weight: {result["prior_weight"]:.0%}, '
                 f'Data weight: {result["data_weight"]:.0%}')
    ax.set_xlabel('θ')
    ax.set_ylabel('Density')
    ax.legend(fontsize=8)
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.suptitle('Posterior as Weighted Compromise\n(True θ = 0.7, Prior centered at 0.5)',
             fontsize=14)
plt.tight_layout()

# Print summary table
print("=" * 80)
print("POSTERIOR COMPROMISE SUMMARY")
print("=" * 80)
print(f"{'Scenario':<35} {'Prior Wt':<10} {'Data Wt':<10} {'Post Mean':<12}")
print("-" * 80)
for prior_str, n_data, title in scenarios:
    result = demonstrate_compromise(prior_str, n_data)
    print(f"{title:<35} {result['prior_weight']:<10.1%} "
          f"{result['data_weight']:<10.1%} {result['post_mean']:<12.4f}")
```

Asymptotic Behavior:
As sample size n → ∞, the posterior concentrates around the true parameter value (assuming the model is correctly specified):
$$p(\theta | D_n) \xrightarrow{n \to \infty} \delta_{\theta_0}$$
where δ is a point mass at the true value θ₀. This is called posterior consistency.
Furthermore, for large n, the posterior is approximately Normal:
$$p(\theta | D_n) \approx \mathcal{N}\left(\hat{\theta}_{MLE}, \frac{1}{n \cdot I(\hat{\theta})}\right)$$
where I(θ) is the Fisher information. This is the Bernstein-von Mises theorem, the Bayesian analogue of the frequentist Central Limit Theorem. It means that for large samples:

- The influence of the prior washes out.
- Bayesian credible intervals and frequentist confidence intervals approximately coincide.
- The posterior mean, median, and mode all converge to the MLE.

The quick numerical check below illustrates how good this approximation can be.
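A small sanity check of the Bernstein-von Mises approximation, assuming a Bernoulli model with a Beta(2, 2) prior and n = 2000 draws (hypothetical numbers chosen for illustration); for a Bernoulli, I(θ) = 1/(θ(1−θ)):

```python
import numpy as np
from scipy import stats

# Compare the exact Beta posterior to its Normal (BvM) approximation
np.random.seed(0)
true_theta, n = 0.3, 2000
k = np.random.binomial(n, true_theta)
theta_hat = k / n                                       # MLE

exact = stats.beta(2 + k, 2 + n - k)                    # posterior under Beta(2, 2) prior
approx_sd = np.sqrt(theta_hat * (1 - theta_hat) / n)    # 1/sqrt(n·I(θ̂))

print(f"Posterior mean {exact.mean():.4f} vs MLE {theta_hat:.4f}")
print(f"Posterior sd   {exact.std():.4f} vs BvM sd {approx_sd:.4f}")
print(f"Exact 95% CI   [{exact.ppf(0.025):.4f}, {exact.ppf(0.975):.4f}]")
print(f"Normal 95% CI  [{theta_hat - 1.96*approx_sd:.4f}, {theta_hat + 1.96*approx_sd:.4f}]")
```

At this sample size the two intervals agree to roughly three decimal places; shrinking n makes the discrepancy (and the prior's influence) visible again.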
While the full posterior distribution is the complete Bayesian answer, we often need concise summaries for communication and decision-making. Several approaches exist, each with trade-offs.
Credible Intervals:
Credible intervals are the Bayesian answer to 'how uncertain are we?' Unlike frequentist confidence intervals, credible intervals have direct probability interpretations.
Central Credible Interval:
A central credible interval (CI) at level α (e.g., α = 0.95) contains the central fraction α of posterior mass:
$$[q_{(1-\alpha)/2}, q_{(1+\alpha)/2}]$$
where q_p is the p-th percentile of the posterior. For example, the 95% central CI spans from the 2.5th to the 97.5th percentile.
Highest Density Interval (HDI):
The HDI is the narrowest interval containing a given share (say 95%) of posterior probability. Unlike central CIs, HDIs:

- always include the posterior mode(s);
- are the narrowest possible interval for a given probability mass;
- can differ noticeably from central intervals for skewed posteriors, and may even split into disjoint pieces for multimodal posteriors.
Interpretation:
A 95% credible interval means: 'Given our prior and the data, there is a 95% probability that θ lies in this interval.'
This is what non-statisticians think confidence intervals mean (but they don't).
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def compute_posterior_summaries(post_a, post_b):
    """Compute various posterior summaries for a Beta distribution."""
    dist = stats.beta(post_a, post_b)

    # Point estimates
    mean = dist.mean()
    median = dist.median()
    mode = (post_a - 1) / (post_a + post_b - 2) if post_a > 1 and post_b > 1 else None

    # Central 95% credible interval
    ci_95_central = (dist.ppf(0.025), dist.ppf(0.975))

    # For Beta, the HDI can be computed numerically; for unimodal,
    # near-symmetric posteriors the central CI is close to the HDI
    return {
        'mean': mean, 'median': median, 'mode': mode,
        'ci_95_lower': ci_95_central[0], 'ci_95_upper': ci_95_central[1],
        'std': dist.std()
    }

def compute_hdi(samples, cred_mass=0.95):
    """Compute Highest Density Interval from samples."""
    sorted_samples = np.sort(samples)
    n = len(samples)
    interval_size = int(np.ceil(cred_mass * n))
    # Width of every interval containing cred_mass of the samples
    intervals = sorted_samples[interval_size:] - sorted_samples[:-interval_size]
    min_idx = np.argmin(intervals)  # narrowest such interval is the HDI
    return sorted_samples[min_idx], sorted_samples[min_idx + interval_size]

# Example: Beta(10, 4) - a left-skewed posterior (mode to the right of the mean)
post_a, post_b = 10, 4
theta = np.linspace(0, 1, 1000)
posterior = stats.beta(post_a, post_b).pdf(theta)
summaries = compute_posterior_summaries(post_a, post_b)

# HDI from samples
np.random.seed(42)
samples = stats.beta(post_a, post_b).rvs(100000)
hdi_lower, hdi_upper = compute_hdi(samples, 0.95)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Point estimates
ax = axes[0]
ax.fill_between(theta, posterior, alpha=0.3, color='blue')
ax.plot(theta, posterior, 'b-', linewidth=2)
ax.axvline(summaries['mean'], color='red', linestyle='-', linewidth=2,
           label=f"Mean = {summaries['mean']:.3f}")
ax.axvline(summaries['median'], color='green', linestyle='--', linewidth=2,
           label=f"Median = {summaries['median']:.3f}")
if summaries['mode']:
    ax.axvline(summaries['mode'], color='purple', linestyle=':', linewidth=2,
               label=f"Mode (MAP) = {summaries['mode']:.3f}")
ax.set_title(f'Point Estimates: Beta({post_a}, {post_b})')
ax.set_xlabel('θ')
ax.set_ylabel('p(θ|D)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 2: Credible intervals
ax = axes[1]
ax.fill_between(theta, posterior, alpha=0.3, color='blue')
ax.plot(theta, posterior, 'b-', linewidth=2)

# Central 95% CI
ax.axvspan(summaries['ci_95_lower'], summaries['ci_95_upper'], alpha=0.2, color='green',
           label=f"Central 95% CI: [{summaries['ci_95_lower']:.3f}, {summaries['ci_95_upper']:.3f}]")

# HDI
ax.axvspan(hdi_lower, hdi_upper, alpha=0.2, color='red',
           label=f"95% HDI: [{hdi_lower:.3f}, {hdi_upper:.3f}]")

# Show the difference
ax.axvline(summaries['ci_95_lower'], color='green', linestyle='--', alpha=0.7)
ax.axvline(summaries['ci_95_upper'], color='green', linestyle='--', alpha=0.7)
ax.axvline(hdi_lower, color='red', linestyle='-', alpha=0.7)
ax.axvline(hdi_upper, color='red', linestyle='-', alpha=0.7)

ax.set_title('Central CI vs HDI (differ for skewed posteriors)')
ax.set_xlabel('θ')
ax.set_ylabel('p(θ|D)')
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

plt.tight_layout()

# Print summary table
print("=" * 60)
print(f"POSTERIOR SUMMARIES: Beta({post_a}, {post_b})")
print("=" * 60)
print(f"Mean:           {summaries['mean']:.4f}")
print(f"Median:         {summaries['median']:.4f}")
print(f"Mode (MAP):     {summaries['mode']:.4f}")
print(f"Std Dev:        {summaries['std']:.4f}")
print(f"Central 95% CI: [{summaries['ci_95_lower']:.4f}, {summaries['ci_95_upper']:.4f}]")
print(f"95% HDI:        [{hdi_lower:.4f}, {hdi_upper:.4f}]")
print()
print("Note: For this left-skewed posterior, the HDI is shifted right")
print("      (toward the mode) relative to the central CI, and is slightly narrower.")
```

Use the mean for general reporting and when squared error matters. Use the median when robustness to skewness is important. Use the mode (MAP) when you need a single 'best' value for downstream processing. Use the HDI over the central CI for skewed posteriors or when communicating with non-experts.
Often our goal isn't to estimate parameters but to predict new observations. The posterior predictive distribution answers: 'What do we expect future data to look like, given what we've observed?'
Definition:
The posterior predictive distribution for a new observation x* is:
$$p(x^* | D) = \int p(x^* | \theta) \cdot p(\theta | D) \, d\theta$$
This marginalizes over parameter uncertainty: we average predictions across all possible parameter values, weighted by how probable each value is given our data.
Key Properties:

- It is a distribution over future observables, not parameters, so it can be compared directly with data.
- It is more dispersed than any plug-in prediction p(x* | θ̂) because it propagates posterior uncertainty about θ.
- As data accumulate and the posterior concentrates, it approaches the plug-in predictive at the true parameter.
Example: Beta-Binomial Posterior Predictive
Setup: a Beta(α, β) prior on θ, updated on k successes in n trials, gives the posterior Beta(α + k, β + n − k). For a single future trial x* ∈ {0, 1}:

$$P(x^* = 1 | D) = \int_0^1 \theta \cdot p(\theta | D) \, d\theta = E[\theta | D] = \frac{\alpha + k}{\alpha + \beta + n}$$
The predictive probability is simply the posterior mean—a beautiful closed form.
For multiple future trials m with y successes: $$p(y | D) = \text{BetaBinomial}(y | m, \alpha+k, \beta+n-k)$$
The Beta-Binomial is more dispersed than Binomial(m, θ̂) because it accounts for uncertainty in θ.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.special import comb, beta as beta_func

def beta_binomial_pmf(y, n, alpha, beta_param):
    """
    Beta-Binomial probability mass function.
    p(y | n, α, β) = C(n,y) * B(y+α, n-y+β) / B(α, β)
    """
    log_pmf = (np.log(comb(n, y, exact=True))
               + np.log(beta_func(y + alpha, n - y + beta_param))
               - np.log(beta_func(alpha, beta_param)))
    return np.exp(log_pmf)

# Prior and observed data
prior_a, prior_b = 2, 2
observed_successes, observed_trials = 7, 10

# Posterior parameters
post_a = prior_a + observed_successes
post_b = prior_b + (observed_trials - observed_successes)

# Future trials
future_trials = 15
future_successes = np.arange(0, future_trials + 1)

# Posterior predictive: Beta-Binomial
posterior_predictive = [beta_binomial_pmf(y, future_trials, post_a, post_b)
                        for y in future_successes]

# Plug-in prediction: Binomial with posterior mean
posterior_mean = post_a / (post_a + post_b)
plugin_predictive = stats.binom(future_trials, posterior_mean).pmf(future_successes)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Comparison of distributions
ax = axes[0]
width = 0.35
x = future_successes
ax.bar(x - width/2, posterior_predictive, width,
       label='Posterior Predictive\n(Beta-Binomial)', alpha=0.7, color='blue')
ax.bar(x + width/2, plugin_predictive, width,
       label='Plug-in\n(Binomial)', alpha=0.7, color='red')
ax.set_xlabel('Number of Successes in 15 Future Trials')
ax.set_ylabel('Probability')
ax.set_title('Posterior Predictive vs Plug-in Prediction')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Plot 2: Highlight the difference
ax = axes[1]
difference = np.array(posterior_predictive) - np.array(plugin_predictive)
colors = ['green' if d >= 0 else 'red' for d in difference]
ax.bar(x, difference, color=colors, alpha=0.7)
ax.axhline(0, color='black', linewidth=0.5)
ax.set_xlabel('Number of Successes')
ax.set_ylabel('Posterior Predictive - Plug-in')
ax.set_title('Difference: Posterior Predictive is More Dispersed')
ax.grid(True, alpha=0.3)

plt.tight_layout()

# Statistics
print("=" * 60)
print("POSTERIOR PREDICTIVE vs PLUG-IN PREDICTION")
print("=" * 60)
print(f"Observed: {observed_successes}/{observed_trials} successes")
print(f"Posterior: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")
print(f"Predicting: {future_trials} future trials")
print()
print(f"{'Statistic':<25} {'Post. Pred.':<15} {'Plug-in':<15}")
print("-" * 55)

# Expected value (same for both)
pp_mean = future_trials * posterior_mean
print(f"{'Mean':<25} {pp_mean:<15.3f} {pp_mean:<15.3f}")

# Variance differs
pp_variance = sum(y**2 * p for y, p in zip(future_successes, posterior_predictive)) - pp_mean**2
plugin_variance = future_trials * posterior_mean * (1 - posterior_mean)
print(f"{'Variance':<25} {pp_variance:<15.3f} {plugin_variance:<15.3f}")
print(f"{'Std Dev':<25} {np.sqrt(pp_variance):<15.3f} {np.sqrt(plugin_variance):<15.3f}")

print()
print("Note: Posterior predictive has LARGER variance because it")
print("      accounts for uncertainty in θ, not just randomness in trials.")
```

Using p(x*|θ̂) where θ̂ is any point estimate (MLE, posterior mean, etc.) ignores parameter uncertainty. This leads to overconfident predictions and poorly calibrated prediction intervals. Always use the full posterior predictive when uncertainty quantification matters.
For most models, the posterior has no closed-form expression and requires numerical approximation. However, certain prior-likelihood pairs yield posteriors in the same family as the prior—these are called conjugate pairs.
Definition:
A prior p(θ) is conjugate to a likelihood p(D|θ) if the posterior p(θ|D) is in the same distribution family as the prior.
Why Conjugacy Matters:

- The posterior has a closed form: updating reduces to arithmetic on hyperparameters.
- Sequential updating is trivial: the posterior after one batch becomes the prior for the next, in the same family.
- Hyperparameters typically admit a pseudo-observation interpretation, as the table below shows.
Limitation: Conjugate priors may not encode our actual beliefs. Don't choose a prior just because it's conjugate—computational convenience shouldn't trump model fidelity.
| Likelihood | Conjugate Prior | Posterior | Prior Interpretation |
|---|---|---|---|
| Bernoulli/Binomial(θ) | Beta(α, β) | Beta(α+k, β+n-k) | α-1 prior successes, β-1 prior failures |
| Poisson(λ) | Gamma(α, β) | Gamma(α+Σx, β+n) | α prior events in β time units |
| Normal(μ, σ² known) | Normal(μ₀, τ²) | Normal(weighted mean) | Prior observation with precision 1/τ² |
| Normal(μ known, σ²) | Inverse-Gamma(α, β) | Inverse-Gamma(α+n/2, β+½Σ(xᵢ−μ)²) | α prior observations, β prior sum of squares |
| Exponential(λ) | Gamma(α, β) | Gamma(α+n, β+Σx) | α prior events in β time |
| Multinomial(θ) | Dirichlet(α) | Dirichlet(α+counts) | α-1 prior counts per category |
| Geometric(θ) | Beta(α, β) | Beta(α+n, β+Σx) | Similar to Bernoulli |
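To make a non-Beta row of the table concrete, here is a minimal sketch of the Poisson-Gamma update, with a hypothetical Gamma(3, 1) prior on the rate λ (3 prior events in 1 time unit) and simulated counts:

```python
import numpy as np
from scipy import stats

# Conjugate Poisson–Gamma update, per the table row above
np.random.seed(1)
alpha, beta = 3.0, 1.0                        # hypothetical Gamma(α, β) prior on λ
data = np.random.poisson(lam=4.2, size=25)    # hypothetical observed counts

post_alpha = alpha + data.sum()               # α + Σx
post_beta = beta + len(data)                  # β + n

posterior = stats.gamma(a=post_alpha, scale=1 / post_beta)  # scale = 1/rate in scipy
print(f"Posterior: Gamma({post_alpha:.0f}, {post_beta:.0f})")
print(f"Posterior mean = {posterior.mean():.3f} (MLE = {data.mean():.3f})")
print(f"95% CI: [{posterior.ppf(0.025):.3f}, {posterior.ppf(0.975):.3f}]")
```

As with the Beta-Binomial case, the update is pure bookkeeping: add the total event count to α and the observation time (here, the number of unit intervals) to β.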
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def sequential_bayesian_update():
    """
    Demonstrate sequential conjugate updating.
    Each data batch updates the posterior, which
    becomes the prior for the next batch.
    """
    # True parameter
    true_theta = 0.65

    # Initial prior: Beta(1, 1) = Uniform
    prior_a, prior_b = 1, 1

    # Generate batches of data
    np.random.seed(42)
    batch_sizes = [5, 10, 20, 50, 100]
    batches = [np.random.binomial(1, true_theta, size=n) for n in batch_sizes]

    # Track evolution
    history = [{
        'batch': 0, 'cumulative_n': 0,
        'a': prior_a, 'b': prior_b,
        'mean': prior_a / (prior_a + prior_b)
    }]

    theta = np.linspace(0, 1, 500)
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()

    current_a, current_b = prior_a, prior_b
    cumulative_n = 0

    # Plot initial prior
    ax = axes[0]
    initial_dist = stats.beta(prior_a, prior_b).pdf(theta)
    ax.fill_between(theta, initial_dist, alpha=0.3, color='blue')
    ax.plot(theta, initial_dist, 'b-', linewidth=2)
    ax.axvline(true_theta, color='green', linestyle='--', label=f'True θ={true_theta}')
    ax.set_title(f'Initial Prior: Beta({prior_a}, {prior_b})')
    ax.set_xlabel('θ')
    ax.set_ylabel('Density')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1)

    # Sequential updates
    for i, batch in enumerate(batches):
        # Update
        k = batch.sum()
        n = len(batch)
        current_a += k
        current_b += (n - k)
        cumulative_n += n

        history.append({
            'batch': i + 1, 'cumulative_n': cumulative_n,
            'a': current_a, 'b': current_b,
            'mean': current_a / (current_a + current_b)
        })

        # Plot
        ax = axes[i + 1]
        post_dist = stats.beta(current_a, current_b).pdf(theta)
        ax.fill_between(theta, post_dist, alpha=0.3, color='red')
        ax.plot(theta, post_dist, 'r-', linewidth=2)
        ax.axvline(true_theta, color='green', linestyle='--', label=f'True θ={true_theta}')
        ax.axvline(current_a / (current_a + current_b), color='darkred', linestyle='-',
                   label=f'Post mean={current_a/(current_a+current_b):.3f}')
        ax.set_title(f'After {cumulative_n} obs: Beta({current_a}, {current_b})')
        ax.set_xlabel('θ')
        ax.set_ylabel('Density')
        ax.legend(fontsize=8)
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)

    plt.suptitle('Sequential Bayesian Updating with Conjugate Priors', fontsize=14)
    plt.tight_layout()

    # Print history
    print("=" * 70)
    print("SEQUENTIAL UPDATE HISTORY")
    print("=" * 70)
    print(f"True θ = {true_theta}")
    print()
    print(f"{'Batch':<8} {'Total n':<10} {'Posterior':<20} {'Mean':<12} {'95% CI':<25}")
    print("-" * 70)
    for h in history:
        dist = stats.beta(h['a'], h['b'])
        ci = f"[{dist.ppf(0.025):.3f}, {dist.ppf(0.975):.3f}]"
        posterior_name = f"Beta({h['a']}, {h['b']})"
        print(f"{h['batch']:<8} {h['cumulative_n']:<10} {posterior_name:<20} "
              f"{h['mean']:<12.4f} {ci:<25}")

    return history

history = sequential_bayesian_update()
```

Conjugate models enable true online learning: the posterior from observing data up to time t becomes the prior for data at time t+1. This is computationally efficient and theoretically elegant, but requires the conjugate family to be appropriate for your problem.
Real models typically have multiple parameters, creating multivariate posterior distributions. Understanding the structure of these distributions is crucial for both computation and interpretation.
Joint Posterior:
For parameters θ = (θ₁, θ₂, ..., θₖ), the joint posterior is:
$$p(\theta_1, \theta_2, ..., \theta_k | D) = \frac{p(D | \theta_1, ..., \theta_k) \cdot p(\theta_1, ..., \theta_k)}{p(D)}$$
Marginal Posteriors:
Often we're interested in individual parameters. The marginal posterior for θ₁ is:
$$p(\theta_1 | D) = \int p(\theta_1, \theta_2, ..., \theta_k | D) \, d\theta_2 \cdots d\theta_k$$
This integration 'marginalizes out' the other parameters. With MCMC samples, marginals are trivial—just ignore the other columns.
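A short illustration of that last point, using a synthetic correlated-normal array as a stand-in for MCMC output (hypothetical numbers): the marginal for θ₁ is obtained by simply selecting its column.

```python
import numpy as np

# Stand-in for MCMC draws: shape (n_samples, k), one column per parameter
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(
    mean=[0.0, 2.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=10_000)

theta1 = samples[:, 0]  # marginal draws of θ1: just ignore the other column
print(f"Marginal mean of θ1:   {theta1.mean():.3f}")
print(f"Marginal 95% interval: "
      f"[{np.percentile(theta1, 2.5):.3f}, {np.percentile(theta1, 97.5):.3f}]")
```

The integral over θ₂ happens implicitly: each draw of θ₁ was generated jointly with some θ₂, so the collection of θ₁ values is already distributed according to the marginal.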
Conditional Posteriors:
These are the distributions of one parameter given the others:
$$p(\theta_i | \theta_{-i}, D)$$
where θ₋ᵢ denotes all parameters except θᵢ. Conditional posteriors are the building blocks of Gibbs sampling.
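To make this concrete, here is a minimal Gibbs sampler sketch for a Normal model with unknown mean μ and variance σ², assuming hypothetical semi-conjugate priors (μ ~ N(0, 10²), σ² ~ Inverse-Gamma(2, 2)); each step draws from one full conditional posterior:

```python
import numpy as np

np.random.seed(3)
y = np.random.normal(5.0, 2.0, size=50)   # simulated data
n, ybar = len(y), y.mean()
mu0, tau0_sq = 0.0, 100.0                 # prior on μ: N(mu0, tau0_sq)
a0, b0 = 2.0, 2.0                         # prior on σ²: Inv-Gamma(a0, b0)

mu, sigma2 = ybar, y.var()                # initialize at data estimates
draws = []
for _ in range(5000):
    # p(μ | σ², y): Normal, a precision-weighted blend of prior and data
    prec = n / sigma2 + 1 / tau0_sq
    mean = (ybar * n / sigma2 + mu0 / tau0_sq) / prec
    mu = np.random.normal(mean, np.sqrt(1 / prec))

    # p(σ² | μ, y): Inverse-Gamma(a0 + n/2, b0 + ½Σ(yᵢ-μ)²)
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * np.sum((y - mu) ** 2)
    sigma2 = 1.0 / np.random.gamma(a_n, 1.0 / b_n)  # 1/Gamma(a, rate b) ~ Inv-Gamma(a, b)

    draws.append((mu, sigma2))

draws = np.array(draws)[1000:]            # discard burn-in
print(f"Posterior mean of μ:  {draws[:, 0].mean():.3f}")
print(f"Posterior mean of σ²: {draws[:, 1].mean():.3f}")
```

Neither conditional requires the joint normalizing constant; conjugacy within each conditional is what makes Gibbs sampling practical here.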
Posterior Correlation:
Parameters may be correlated in the posterior even if their priors are independent. This occurs when the likelihood 'ties them together'—e.g., estimating both intercept and slope in regression creates posterior correlation.
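A quick sketch of this effect: under a flat prior with a known noise scale, the posterior covariance of (intercept, slope) is proportional to (XᵀX)⁻¹, so predictors far from zero induce strong posterior correlation that centering removes (illustrative example with hypothetical data):

```python
import numpy as np

np.random.seed(7)
x = np.random.uniform(5, 10, size=40)     # x far from zero → strong correlation
X = np.column_stack([np.ones_like(x), x])
cov = np.linalg.inv(X.T @ X)              # ∝ posterior covariance of (intercept, slope)
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(f"Posterior correlation (raw x):      {corr:.3f}")

Xc = np.column_stack([np.ones_like(x), x - x.mean()])  # centering decorrelates
cov_c = np.linalg.inv(Xc.T @ Xc)
corr_c = cov_c[0, 1] / np.sqrt(cov_c[0, 0] * cov_c[1, 1])
print(f"Posterior correlation (centered x): {corr_c:.3f}")
```

Intuitively, with uncentered x, raising the slope can be compensated by lowering the intercept while fitting the data equally well, so the likelihood ties the two parameters together.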
Hierarchical Posteriors:
Hierarchical (multi-level) models introduce parameters at multiple levels:
$$y_{ij} \sim p(y | \theta_j)$$ (observation level)
$$\theta_j \sim p(\theta | \phi)$$ (group level)
$$\phi \sim p(\phi)$$ (population level)
The full posterior is:
$$p(\theta_1, ..., \theta_J, \phi | D) \propto \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j) \cdot \prod_{j=1}^{J} p(\theta_j | \phi) \cdot p(\phi)$$
Shrinkage and Partial Pooling:
Hierarchical posteriors exhibit shrinkage: individual group estimates θⱼ are pulled toward the population mean. Groups with less data are shrunk more. This borrows strength across groups and often improves predictions compared to either:

- complete pooling: a single shared estimate that ignores group differences, or
- no pooling: independent per-group estimates that overfit sparse groups.

A small numeric sketch of shrinkage follows this list.
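The sketch below illustrates shrinkage with the population mean and variances held fixed at hypothetical values (a full hierarchical model would infer these as well); each group's estimate is a precision-weighted average of its raw mean and the population mean:

```python
import numpy as np

mu, tau, sigma = 0.5, 0.1, 0.3                  # hypothetical population mean/sd, obs sd
group_n = np.array([2, 10, 50, 200])            # groups with increasing data
group_means = np.array([0.9, 0.8, 0.75, 0.72])  # hypothetical observed group means

prec_data = group_n / sigma**2                  # data precision per group
prec_prior = 1 / tau**2                         # group-level (population) precision
shrunk = (prec_data * group_means + prec_prior * mu) / (prec_data + prec_prior)

for n, raw, s in zip(group_n, group_means, shrunk):
    print(f"n={n:>3}: raw mean {raw:.3f} -> shrunk {s:.3f}")
```

The n = 2 group is pulled strongly toward the population mean of 0.5, while the n = 200 group barely moves: exactly the "shrink sparse groups more" behavior described above.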
Hierarchical posteriors often have challenging 'funnel' geometries where the group-level variance and individual parameters become tightly correlated. When the variance is small, individual parameters must be close together; when large, they can spread. This creates sampling difficulties that require techniques like non-centered parameterization.
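As a sketch of the remedy, non-centered parameterization replaces the direct draw θⱼ ~ N(μ, τ²) with a standard normal draw plus a deterministic transform, which decouples θⱼ from τ in the sampler's geometry:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = 0.0, 0.05                            # small τ → narrow funnel neck

# Centered: sample θⱼ directly from N(μ, τ²); θⱼ and τ are tightly coupled
theta_centered = rng.normal(mu, tau, size=5)

# Non-centered: sample θ_raw ~ N(0, 1), then transform; the sampler explores
# a fixed standard normal regardless of τ
theta_raw = rng.normal(0.0, 1.0, size=5)
theta_noncentered = mu + tau * theta_raw

print("centered:    ", np.round(theta_centered, 4))
print("non-centered:", np.round(theta_noncentered, 4))
```

The two parameterizations define the same model, but in the non-centered form the geometry the sampler sees (standard normal θ_raw) no longer pinches as τ shrinks.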
For most interesting models, the posterior cannot be computed in closed form. The normalizing constant p(D) requires integrating over the entire parameter space—an integral that is typically intractable.
Why Integration is Hard:

- Dimensionality: grid evaluation scales exponentially with the number of parameters.
- Concentration: posterior mass typically occupies a tiny, oddly shaped region of parameter space that generic quadrature easily misses.
- No closed form: the integrand, likelihood × prior, rarely matches a known family of integrals.
Modern Computational Approaches:
| Method | Key Idea | Pros | Cons |
|---|---|---|---|
| Grid Approximation | Evaluate posterior on fine grid | Simple, exact in limit | Fails in high dimensions |
| Conjugate Priors | Choose prior for closed-form posterior | Analytical, fast | Limited flexibility |
| Laplace Approximation | Gaussian approximation at mode | Fast, deterministic | Poor for non-Gaussian posteriors |
| MCMC | Generate samples from posterior | Flexible, asymptotically exact | Slow convergence, diagnostics needed |
| Variational Inference | Optimize over approximating family | Fast, scalable | Approximate, underestimates variance |
| Importance Sampling | Weighted samples from proposal | Simple concept | Variance can be huge |
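As an example of the Laplace row in the table, here is a sketch that approximates the Beta(2, 2)-Binomial posterior from earlier (8 heads in 12 flips) by fitting a Gaussian at the posterior mode using the curvature of the log posterior:

```python
import numpy as np
from scipy import stats

a, b, k, n = 2, 2, 8, 12
A, B = a + k - 1, b + (n - k) - 1       # exponents of θ and (1-θ) in log p̃(θ)

theta_mode = A / (A + B)                # mode of Beta(a+k, b+n-k)
hessian = -A / theta_mode**2 - B / (1 - theta_mode)**2  # d²/dθ² log p̃ at the mode
laplace_sd = np.sqrt(-1 / hessian)      # Gaussian sd from the curvature

exact = stats.beta(a + k, b + n - k)
print(f"Mode: {theta_mode:.4f}")
print(f"Laplace sd {laplace_sd:.4f} vs exact sd {exact.std():.4f}")
print(f"Laplace 95%: [{theta_mode - 1.96*laplace_sd:.4f}, {theta_mode + 1.96*laplace_sd:.4f}]")
print(f"Exact 95%:   [{exact.ppf(0.025):.4f}, {exact.ppf(0.975):.4f}]")
```

The approximation is decent here because the posterior is unimodal and roughly bell-shaped; it degrades for skewed, bounded, or multimodal posteriors, which is the "cons" entry in the table.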
MCMC: The Workhorse of Modern Bayesian Inference:
Markov Chain Monte Carlo (MCMC) methods generate sequences of samples that, in the limit, are draws from the posterior. The key insight is that we can sample from p(θ|D) without computing p(D)—we only need the unnormalized posterior, which is just likelihood × prior.
Common MCMC algorithms:

- Metropolis-Hastings: propose a random move and accept or reject it based on the unnormalized posterior ratio (see the sketch after this list).
- Gibbs sampling: cycle through the full conditional posteriors, one parameter block at a time.
- Hamiltonian Monte Carlo (HMC) and NUTS: use gradient information to make distant proposals with high acceptance rates.
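Here is a minimal random-walk Metropolis sketch for the coin-bias posterior used throughout this page (Beta(2, 2) prior, 8 heads in 12 flips); note that the accept/reject ratio cancels p(D), so only the unnormalized posterior is ever evaluated:

```python
import numpy as np
from scipy import stats

def log_unnorm_posterior(theta, k=8, n=12, a=2, b=2):
    if not 0 < theta < 1:
        return -np.inf                          # zero density outside (0, 1)
    return (a + k - 1) * np.log(theta) + (b + n - k - 1) * np.log(1 - theta)

np.random.seed(0)
theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + np.random.normal(0, 0.1)  # random-walk proposal
    # Accept with prob min(1, p̃(proposal)/p̃(current)); p(D) cancels in the ratio
    if np.log(np.random.rand()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples)[5000:]               # discard burn-in
print(f"MCMC mean {samples.mean():.4f} vs exact {stats.beta(10, 6).mean():.4f}")
print(f"MCMC sd   {samples.std():.4f} vs exact {stats.beta(10, 6).std():.4f}")
```

The proposal scale (0.1 here) is a tuning choice: too small and the chain crawls, too large and most proposals are rejected.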
Variational Inference: When Speed Matters:
VI approximates the posterior with a tractable distribution (e.g., factorized Gaussian) by minimizing KL divergence. It's much faster than MCMC but provides an approximation, not exact samples.
When to use VI:

- Large datasets or models where MCMC is prohibitively slow.
- When the posterior feeds a latency-sensitive production system.
- For rapid model iteration, or to initialize MCMC.

A toy sketch follows this list.
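This is a toy VI sketch for the same coin-bias posterior: it fits a Gaussian on the logit scale by maximizing a Monte Carlo ELBO via the reparameterization trick (all settings hypothetical and chosen for illustration):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

a, b, k, n = 2, 2, 8, 12
eps = np.random.default_rng(0).normal(size=2000)   # fixed noise draws (common random numbers)

def log_unnorm_post_z(z):
    # θ = sigmoid(z); the Jacobian log θ(1-θ) folds the exponents up to (a+k), (b+n-k)
    theta = 1 / (1 + np.exp(-z))
    return (a + k) * np.log(theta) + (b + n - k) * np.log(1 - theta)

def neg_elbo(params):
    m, log_s = params
    z = m + np.exp(log_s) * eps                    # reparameterized samples from q = N(m, s²)
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_s
    return -(log_unnorm_post_z(z).mean() + entropy)

res = minimize(neg_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
m, s = res.x[0], np.exp(res.x[1])

# Push q back through the sigmoid and compare to the exact Beta(10, 6) posterior
theta_q = 1 / (1 + np.exp(-(m + s * np.random.default_rng(1).normal(size=50_000))))
exact = stats.beta(a + k, b + n - k)
print(f"VI mean {theta_q.mean():.4f} vs exact {exact.mean():.4f}")
print(f"VI sd   {theta_q.std():.4f} vs exact {exact.std():.4f}")
```

Even in this tiny example, inference is recast as optimization of (m, s), which is the property that lets VI scale to models where sampling is too slow.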
MCMC output is only valid if the chain has converged to the posterior. Always check: (1) Trace plots for mixing, (2) R-hat < 1.01 for convergence, (3) Effective sample size > 100 per parameter, (4) Compare multiple chains started from different points.
The posterior distribution is the culmination of Bayesian inference—our complete state of knowledge after observing data. Let's consolidate the key concepts:

- Bayes' theorem: posterior ∝ likelihood × prior, with the evidence p(D) as the normalizer.
- The posterior is a precision-weighted compromise between prior and data; data dominate as n grows.
- Summaries: posterior mean, median, and mode; central credible intervals and HDIs.
- The posterior predictive integrates out parameter uncertainty, yielding honest predictions.
- Conjugate pairs give closed-form posteriors; most other models require grid, Laplace, MCMC, or variational approximation.
What's Next:
With prior, likelihood, and posterior established, we now turn to Bayesian updating—the dynamic process of incorporating new evidence as it arrives. This perspective reveals Bayesian inference as a continuous learning process, not a one-time calculation.
You now understand the posterior distribution—how it's derived, interpreted, summarized, and computed. This is the central object of Bayesian inference, from which all downstream inferences flow. Next, we'll explore how posteriors evolve as new data arrives.