We've established our prior beliefs about parameter values. We've defined the likelihood function that quantifies how probable our observations are under different parameters. Now we unite these two components to obtain what is arguably the most important quantity in all of Bayesian statistics: the posterior distribution.
The posterior distribution represents our complete state of knowledge about the parameter after observing data. It is not merely a point estimate or an interval—it is a full probability distribution that captures every aspect of our remaining uncertainty. From this single distribution, we can derive point estimates, credible intervals, hypothesis tests, predictions, and decision-theoretic optimal actions.
Understanding the posterior is understanding Bayesian inference itself.
By the end of this page, you will understand the mathematical derivation of the posterior via Bayes' theorem, appreciate the posterior as a compromise between prior and likelihood, master methods for summarizing posterior distributions, understand the role of the normalizing constant (evidence), and recognize when posteriors are analytically tractable versus requiring approximation.
At the heart of Bayesian inference lies a simple identity from probability theory: Bayes' theorem. Despite its elementary derivation, its implications for statistical reasoning are profound.
Derivation:
Start with the definition of conditional probability:
$$p(\theta | D) = \frac{p(\theta, D)}{p(D)}$$
The joint probability can be factored two ways:
$$p(\theta, D) = p(D | \theta) \cdot p(\theta) = p(\theta | D) \cdot p(D)$$
Rearranging gives Bayes' theorem:
$$p(\theta | D) = \frac{p(D | \theta) \cdot p(\theta)}{p(D)}$$
Naming the Components:
| Symbol | Name | Meaning |
|---|---|---|
| p(θ \| D) | Posterior | Updated belief about θ after seeing D |
| p(D \| θ) | Likelihood | Probability of data D given parameter θ |
| p(θ) | Prior | Belief about θ before seeing D |
| p(D) | Evidence / Marginal Likelihood | Total probability of observing D |
Posterior ∝ Likelihood × Prior. The normalizing constant p(D) ensures the posterior integrates to 1, but for many purposes (finding the mode, comparing parameter values), we only need the unnormalized posterior.
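To see why the normalizer is often dispensable, here is a minimal sketch, using the same Beta(2, 2) prior and 8-heads-in-12-flips data as the full demonstration below, that finds the MAP estimate and compares the plausibility of two parameter values using only likelihood × prior:

```python
import numpy as np
from scipy import stats

# Unnormalized posterior on a grid: likelihood × prior (no evidence needed)
theta = np.linspace(0.001, 0.999, 999)
heads, tails = 8, 4                       # same data as the demo below
unnorm = theta**heads * (1 - theta)**tails * stats.beta(2, 2).pdf(theta)

theta_map = theta[np.argmax(unnorm)]      # mode of the posterior
print(f"MAP estimate ≈ {theta_map:.3f}")  # analytic mode: 9/14 ≈ 0.643

# Relative plausibility of two values also needs no normalizer
ratio = (0.5**heads * 0.5**tails * stats.beta(2, 2).pdf(0.5)) / \
        (0.7**heads * 0.3**tails * stats.beta(2, 2).pdf(0.7))
print(f"p(0.5|D) / p(0.7|D) = {ratio:.3f}")
```

The unknown constant p(D) cancels in the ratio and does not move the location of the maximum, which is exactly why MAP estimation and MCMC can both avoid computing it.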
The Normalizing Constant (Evidence):
The denominator p(D) is crucial for making the posterior a valid probability distribution:
$$p(D) = \int p(D | \theta) \cdot p(\theta) \, d\theta = \int \mathcal{L}(\theta) \cdot \pi(\theta) \, d\theta$$
This integral marginalizes over all possible parameter values, giving the total probability of observing the data under our model.
Two Roles of p(D):

- Normalization: it converts likelihood × prior into a proper distribution that integrates to 1.
- Model comparison: as the marginal likelihood, it measures how well the model as a whole predicts the data, which makes it the key quantity in Bayes factors.
Why Computing p(D) is Hard:
For most non-trivial models, the integral $\int \mathcal{L}(\theta) \cdot \pi(\theta) \, d\theta$ has no closed form. This is the fundamental computational challenge of Bayesian inference, motivating the approximation methods surveyed later on this page: conjugate priors, grid and Laplace approximations, MCMC, and variational inference.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.integrate import quad

# Complete Bayes' theorem demonstration with explicit normalization

def bayes_theorem_example():
    """
    Example: Estimating a coin's bias
    Prior: Beta(2, 2) - weakly informative
    Likelihood: Binomial
    Data: 8 heads in 12 flips
    """
    # Prior parameters
    prior_a, prior_b = 2, 2

    # Observed data
    heads, total = 8, 12
    tails = total - heads

    theta = np.linspace(0.001, 0.999, 1000)

    # Prior: p(θ)
    prior = stats.beta(prior_a, prior_b).pdf(theta)

    # Likelihood: p(D|θ) ∝ θ^k (1-θ)^(n-k)
    likelihood = (theta ** heads) * ((1 - theta) ** tails)

    # Unnormalized posterior: p(D|θ) × p(θ)
    unnorm_posterior = likelihood * prior

    # Evidence: p(D) = ∫ p(D|θ) p(θ) dθ
    # For Beta-Binomial, this has a closed form involving Beta functions
    from scipy.special import betaln
    log_evidence = betaln(prior_a + heads, prior_b + tails) - betaln(prior_a, prior_b)
    evidence = np.exp(log_evidence)
    print(f"Evidence p(D) = {evidence:.6f}")

    # Compute evidence numerically to verify
    def unnorm_posterior_func(t):
        return (t ** heads) * ((1 - t) ** tails) * stats.beta(prior_a, prior_b).pdf(t)

    evidence_numeric, _ = quad(unnorm_posterior_func, 0, 1)
    print(f"Evidence (numerical): {evidence_numeric:.6f}")

    # Normalized posterior: p(θ|D) = p(D|θ) p(θ) / p(D)
    # For conjugate Beta-Binomial: posterior is Beta(a+k, b+n-k)
    post_a, post_b = prior_a + heads, prior_b + tails
    posterior = stats.beta(post_a, post_b).pdf(theta)

    # Verify normalization (simpson replaces scipy's removed simps)
    from scipy.integrate import simpson
    posterior_integral = simpson(posterior, x=theta)
    print(f"Posterior integrates to: {posterior_integral:.6f}")

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Prior
    ax = axes[0, 0]
    ax.fill_between(theta, prior, alpha=0.3, color='blue')
    ax.plot(theta, prior, 'b-', linewidth=2)
    ax.set_title(f'Prior: Beta({prior_a}, {prior_b})')
    ax.set_xlabel('θ')
    ax.set_ylabel('p(θ)')
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

    # Plot 2: Likelihood
    ax = axes[0, 1]
    ax.fill_between(theta, likelihood / np.max(likelihood), alpha=0.3, color='green')
    ax.plot(theta, likelihood / np.max(likelihood), 'g-', linewidth=2)
    ax.axvline(heads / total, color='red', linestyle='--', label=f'MLE = {heads/total:.2f}')
    ax.set_title(f'Likelihood (normalized max): {heads} heads, {tails} tails')
    ax.set_xlabel('θ')
    ax.set_ylabel('L(θ) / max L(θ)')
    ax.set_xlim(0, 1)
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Plot 3: Unnormalized posterior
    ax = axes[1, 0]
    ax.fill_between(theta, unnorm_posterior, alpha=0.3, color='orange')
    ax.plot(theta, unnorm_posterior, color='orange', linewidth=2)
    ax.set_title('Unnormalized Posterior: p(D|θ) × p(θ)')
    ax.set_xlabel('θ')
    ax.set_ylabel('Unnormalized p(θ|D)')
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

    # Plot 4: Normalized posterior
    ax = axes[1, 1]
    ax.fill_between(theta, posterior, alpha=0.3, color='red')
    ax.plot(theta, posterior, 'r-', linewidth=2)
    ax.axvline(post_a / (post_a + post_b), color='darkred', linestyle='-',
               label=f'Posterior mean = {post_a/(post_a + post_b):.3f}')

    # Add credible interval
    ci_lower = stats.beta(post_a, post_b).ppf(0.025)
    ci_upper = stats.beta(post_a, post_b).ppf(0.975)
    ax.axvspan(ci_lower, ci_upper, alpha=0.2, color='red',
               label=f'95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')
    ax.set_title(f'Posterior: Beta({post_a}, {post_b})')
    ax.set_xlabel('θ')
    ax.set_ylabel('p(θ|D)')
    ax.set_xlim(0, 1)
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    return evidence

evidence = bayes_theorem_example()
```

A profound insight from Bayesian inference is that the posterior represents a principled compromise between prior beliefs and data evidence. Understanding this balancing act is essential for interpreting Bayesian results.
The Tension:

The prior pulls the posterior toward our initial beliefs; the likelihood pulls it toward the data (the MLE). The posterior settles where these forces balance.

The relative influence of each depends on:

- The strength (concentration) of the prior: a sharply peaked prior resists movement.
- The amount of data: more observations give the likelihood more pull.
- The degree of agreement between prior and data: when they conflict, the posterior sits between them and is often wider.
Mathematical Insight for Conjugate Models:
For the Beta-Binomial model:
$$\text{Prior}: \theta \sim \text{Beta}(\alpha, \beta)$$ $$\text{Likelihood}: X \sim \text{Binomial}(n, \theta)$$ $$\text{Posterior}: \theta | X \sim \text{Beta}(\alpha + k, \beta + n - k)$$
The posterior mean is:
$$E[\theta | X] = \frac{\alpha + k}{\alpha + \beta + n}$$
This can be rewritten as a weighted average:
$$E[\theta | X] = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{prior weight}} \cdot \underbrace{\frac{\alpha}{\alpha + \beta}}_{\text{prior mean}} + \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{data weight}} \cdot \underbrace{\frac{k}{n}}_{\text{MLE}}$$
In the Beta-Binomial model, α + β can be interpreted as an 'effective sample size' of the prior. A Beta(2, 2) prior acts like having seen 4 prior observations (2 successes, 2 failures). A Beta(50, 50) prior acts like 100 prior observations—much harder for new data to shift.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_compromise(prior_strength, n_data, true_theta=0.7):
    """
    Show how posterior balances prior and data at different strengths.
    prior_strength: controls Beta(a, a) where a = prior_strength
    n_data: sample size
    """
    # Prior centered at 0.5
    prior_a = prior_b = prior_strength

    # Generate data from true_theta
    np.random.seed(42)
    k = np.random.binomial(n_data, true_theta)
    mle = k / n_data

    # Posterior
    post_a = prior_a + k
    post_b = prior_b + (n_data - k)

    # Means
    prior_mean = prior_a / (prior_a + prior_b)
    post_mean = post_a / (post_a + post_b)

    # Weights (as derived)
    prior_weight = (prior_a + prior_b) / (prior_a + prior_b + n_data)
    data_weight = n_data / (prior_a + prior_b + n_data)

    return {
        'prior_a': prior_a, 'prior_b': prior_b,
        'k': k, 'n': n_data, 'mle': mle,
        'post_a': post_a, 'post_b': post_b,
        'prior_mean': prior_mean, 'post_mean': post_mean,
        'prior_weight': prior_weight, 'data_weight': data_weight
    }

# Experiment: vary prior strength and data amount
scenarios = [
    (1, 10, "Weak prior, moderate data"),
    (10, 10, "Moderate prior, moderate data"),
    (50, 10, "Strong prior, moderate data"),
    (1, 100, "Weak prior, lots of data"),
    (10, 100, "Moderate prior, lots of data"),
    (50, 100, "Strong prior, lots of data"),
]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
theta = np.linspace(0, 1, 500)

for ax, (prior_str, n_data, title) in zip(axes, scenarios):
    result = demonstrate_compromise(prior_str, n_data)

    # Plot distributions
    prior = stats.beta(result['prior_a'], result['prior_b']).pdf(theta)
    posterior = stats.beta(result['post_a'], result['post_b']).pdf(theta)

    ax.plot(theta, prior, 'b--', linewidth=2, label='Prior', alpha=0.7)
    ax.fill_between(theta, posterior, alpha=0.3, color='red')
    ax.plot(theta, posterior, 'r-', linewidth=2, label='Posterior')
    ax.axvline(result['mle'], color='green', linestyle=':', linewidth=2,
               label=f'MLE={result["mle"]:.2f}')
    ax.axvline(result['post_mean'], color='darkred', linestyle='-', linewidth=1.5,
               label=f'Post mean={result["post_mean"]:.2f}')
    ax.axvline(0.7, color='black', linestyle='--', alpha=0.5, label='True θ=0.7')
    ax.set_title(f'{title}\nPrior weight: {result["prior_weight"]:.0%}, '
                 f'Data weight: {result["data_weight"]:.0%}')
    ax.set_xlabel('θ')
    ax.set_ylabel('Density')
    ax.legend(fontsize=8)
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.suptitle('Posterior as Weighted Compromise\n(True θ = 0.7, Prior centered at 0.5)',
             fontsize=14)
plt.tight_layout()

# Print summary table
print("=" * 80)
print("POSTERIOR COMPROMISE SUMMARY")
print("=" * 80)
print(f"{'Scenario':<35} {'Prior Wt':<10} {'Data Wt':<10} {'Post Mean':<12}")
print("-" * 80)
for prior_str, n_data, title in scenarios:
    result = demonstrate_compromise(prior_str, n_data)
    print(f"{title:<35} {result['prior_weight']:<10.1%} "
          f"{result['data_weight']:<10.1%} {result['post_mean']:<12.4f}")
```

Asymptotic Behavior:
As sample size n → ∞, the posterior concentrates around the true parameter value (assuming the model is correctly specified):
$$p(\theta | D_n) \xrightarrow{n \to \infty} \delta_{\theta_0}$$
where δ is a point mass at the true value θ₀. This is called posterior consistency.
Furthermore, for large n, the posterior is approximately Normal:
$$p(\theta | D_n) \approx \mathcal{N}\left(\hat{\theta}_{MLE}, \frac{1}{n \cdot I(\hat{\theta})}\right)$$
where I(θ) is the Fisher information. This is the Bernstein-von Mises theorem, the Bayesian analogue of the frequentist Central Limit Theorem. It means that for large samples:

- The influence of the prior washes out.
- Bayesian credible intervals and frequentist confidence intervals approximately coincide.
- The posterior mean, median, and mode all converge to the MLE.

The quick numerical check below illustrates how good this approximation can be.
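A small sanity check of the Bernstein-von Mises approximation, assuming a Bernoulli model with a Beta(2, 2) prior and n = 2000 draws (hypothetical numbers chosen for illustration); for a Bernoulli, I(θ) = 1/(θ(1−θ)):

```python
import numpy as np
from scipy import stats

# Compare the exact Beta posterior to its Normal (BvM) approximation
np.random.seed(0)
true_theta, n = 0.3, 2000
k = np.random.binomial(n, true_theta)
theta_hat = k / n                                       # MLE

exact = stats.beta(2 + k, 2 + n - k)                    # posterior under Beta(2, 2) prior
approx_sd = np.sqrt(theta_hat * (1 - theta_hat) / n)    # 1/sqrt(n·I(θ̂))

print(f"Posterior mean {exact.mean():.4f} vs MLE {theta_hat:.4f}")
print(f"Posterior sd   {exact.std():.4f} vs BvM sd {approx_sd:.4f}")
print(f"Exact 95% CI   [{exact.ppf(0.025):.4f}, {exact.ppf(0.975):.4f}]")
print(f"Normal 95% CI  [{theta_hat - 1.96*approx_sd:.4f}, {theta_hat + 1.96*approx_sd:.4f}]")
```

At this sample size the two intervals agree to roughly three decimal places; shrinking n makes the discrepancy (and the prior's influence) visible again.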
While the full posterior distribution is the complete Bayesian answer, we often need concise summaries for communication and decision-making. Several approaches exist, each with trade-offs.
Credible Intervals:
Credible intervals are the Bayesian answer to 'how uncertain are we?' Unlike frequentist confidence intervals, credible intervals have direct probability interpretations.
Central Credible Interval:
A central credible interval (CI) at level α (e.g., α = 0.95) contains the central fraction α of posterior mass:
$$[q_{(1-\alpha)/2}, q_{(1+\alpha)/2}]$$
where q_p is the p-th percentile of the posterior. For example, the 95% central CI spans from the 2.5th to the 97.5th percentile.
Highest Density Interval (HDI):
The HDI is the narrowest interval containing a given share (say 95%) of posterior probability. Unlike central CIs, HDIs:

- always include the posterior mode(s);
- are the narrowest possible interval for a given probability mass;
- can differ noticeably from central intervals for skewed posteriors, and may even split into disjoint pieces for multimodal posteriors.
Interpretation:
A 95% credible interval means: 'Given our prior and the data, there is a 95% probability that θ lies in this interval.'
This is what non-statisticians think confidence intervals mean (but they don't).
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def compute_posterior_summaries(post_a, post_b):
    """Compute various posterior summaries for a Beta distribution."""
    dist = stats.beta(post_a, post_b)

    # Point estimates
    mean = dist.mean()
    median = dist.median()
    mode = (post_a - 1) / (post_a + post_b - 2) if post_a > 1 and post_b > 1 else None

    # Central 95% credible interval
    ci_95_central = (dist.ppf(0.025), dist.ppf(0.975))

    # For Beta, the HDI can be computed numerically; for unimodal,
    # near-symmetric posteriors the central CI is close to the HDI
    return {
        'mean': mean, 'median': median, 'mode': mode,
        'ci_95_lower': ci_95_central[0], 'ci_95_upper': ci_95_central[1],
        'std': dist.std()
    }

def compute_hdi(samples, cred_mass=0.95):
    """Compute Highest Density Interval from samples."""
    sorted_samples = np.sort(samples)
    n = len(samples)
    interval_size = int(np.ceil(cred_mass * n))
    # Width of every interval containing cred_mass of the samples
    intervals = sorted_samples[interval_size:] - sorted_samples[:-interval_size]
    min_idx = np.argmin(intervals)  # narrowest such interval is the HDI
    return sorted_samples[min_idx], sorted_samples[min_idx + interval_size]

# Example: Beta(10, 4) - a left-skewed posterior (mode to the right of the mean)
post_a, post_b = 10, 4
theta = np.linspace(0, 1, 1000)
posterior = stats.beta(post_a, post_b).pdf(theta)
summaries = compute_posterior_summaries(post_a, post_b)

# HDI from samples
np.random.seed(42)
samples = stats.beta(post_a, post_b).rvs(100000)
hdi_lower, hdi_upper = compute_hdi(samples, 0.95)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Point estimates
ax = axes[0]
ax.fill_between(theta, posterior, alpha=0.3, color='blue')
ax.plot(theta, posterior, 'b-', linewidth=2)
ax.axvline(summaries['mean'], color='red', linestyle='-', linewidth=2,
           label=f"Mean = {summaries['mean']:.3f}")
ax.axvline(summaries['median'], color='green', linestyle='--', linewidth=2,
           label=f"Median = {summaries['median']:.3f}")
if summaries['mode']:
    ax.axvline(summaries['mode'], color='purple', linestyle=':', linewidth=2,
               label=f"Mode (MAP) = {summaries['mode']:.3f}")
ax.set_title(f'Point Estimates: Beta({post_a}, {post_b})')
ax.set_xlabel('θ')
ax.set_ylabel('p(θ|D)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 2: Credible intervals
ax = axes[1]
ax.fill_between(theta, posterior, alpha=0.3, color='blue')
ax.plot(theta, posterior, 'b-', linewidth=2)

# Central 95% CI
ax.axvspan(summaries['ci_95_lower'], summaries['ci_95_upper'], alpha=0.2, color='green',
           label=f"Central 95% CI: [{summaries['ci_95_lower']:.3f}, {summaries['ci_95_upper']:.3f}]")

# HDI
ax.axvspan(hdi_lower, hdi_upper, alpha=0.2, color='red',
           label=f"95% HDI: [{hdi_lower:.3f}, {hdi_upper:.3f}]")

# Show the difference
ax.axvline(summaries['ci_95_lower'], color='green', linestyle='--', alpha=0.7)
ax.axvline(summaries['ci_95_upper'], color='green', linestyle='--', alpha=0.7)
ax.axvline(hdi_lower, color='red', linestyle='-', alpha=0.7)
ax.axvline(hdi_upper, color='red', linestyle='-', alpha=0.7)

ax.set_title('Central CI vs HDI (differ for skewed posteriors)')
ax.set_xlabel('θ')
ax.set_ylabel('p(θ|D)')
ax.legend(loc='upper left')
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

plt.tight_layout()

# Print summary table
print("=" * 60)
print(f"POSTERIOR SUMMARIES: Beta({post_a}, {post_b})")
print("=" * 60)
print(f"Mean:           {summaries['mean']:.4f}")
print(f"Median:         {summaries['median']:.4f}")
print(f"Mode (MAP):     {summaries['mode']:.4f}")
print(f"Std Dev:        {summaries['std']:.4f}")
print(f"Central 95% CI: [{summaries['ci_95_lower']:.4f}, {summaries['ci_95_upper']:.4f}]")
print(f"95% HDI:        [{hdi_lower:.4f}, {hdi_upper:.4f}]")
print()
print("Note: For this left-skewed posterior, the HDI is shifted right")
print("      (toward the mode) relative to the central CI, and is slightly narrower.")
```

Use the mean for general reporting and when squared error matters. Use the median when robustness to skewness is important. Use the mode (MAP) when you need a single 'best' value for downstream processing. Use the HDI over the central CI for skewed posteriors or when communicating with non-experts.
Often our goal isn't to estimate parameters but to predict new observations. The posterior predictive distribution answers: 'What do we expect future data to look like, given what we've observed?'
Definition:
The posterior predictive distribution for a new observation x* is:
$$p(x^* | D) = \int p(x^* | \theta) \cdot p(\theta | D) \, d\theta$$
This marginalizes over parameter uncertainty: we average predictions across all possible parameter values, weighted by how probable each value is given our data.
Key Properties:

- It is a distribution over future observables, not parameters, so it can be compared directly with data.
- It is more dispersed than any plug-in prediction p(x* | θ̂) because it propagates posterior uncertainty about θ.
- As data accumulate and the posterior concentrates, it approaches the plug-in predictive at the true parameter.
Example: Beta-Binomial Posterior Predictive
Setup: a Beta(α, β) prior on θ, updated on k successes in n trials, gives the posterior Beta(α + k, β + n − k). For a single future trial x* ∈ {0, 1}:

$$P(x^* = 1 | D) = \int_0^1 \theta \cdot p(\theta | D) \, d\theta = E[\theta | D] = \frac{\alpha + k}{\alpha + \beta + n}$$
The predictive probability is simply the posterior mean—a beautiful closed form.
For multiple future trials m with y successes: $$p(y | D) = \text{BetaBinomial}(y | m, \alpha+k, \beta+n-k)$$
The Beta-Binomial is more dispersed than Binomial(m, θ̂) because it accounts for uncertainty in θ.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.special import comb, beta as beta_func

def beta_binomial_pmf(y, n, alpha, beta_param):
    """
    Beta-Binomial probability mass function.
    p(y | n, α, β) = C(n,y) * B(y+α, n-y+β) / B(α, β)
    """
    log_pmf = (np.log(comb(n, y, exact=True))
               + np.log(beta_func(y + alpha, n - y + beta_param))
               - np.log(beta_func(alpha, beta_param)))
    return np.exp(log_pmf)

# Prior and observed data
prior_a, prior_b = 2, 2
observed_successes, observed_trials = 7, 10

# Posterior parameters
post_a = prior_a + observed_successes
post_b = prior_b + (observed_trials - observed_successes)

# Future trials
future_trials = 15
future_successes = np.arange(0, future_trials + 1)

# Posterior predictive: Beta-Binomial
posterior_predictive = [beta_binomial_pmf(y, future_trials, post_a, post_b)
                        for y in future_successes]

# Plug-in prediction: Binomial with posterior mean
posterior_mean = post_a / (post_a + post_b)
plugin_predictive = stats.binom(future_trials, posterior_mean).pmf(future_successes)

# Compare
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Comparison of distributions
ax = axes[0]
width = 0.35
x = future_successes
ax.bar(x - width/2, posterior_predictive, width,
       label='Posterior Predictive\n(Beta-Binomial)', alpha=0.7, color='blue')
ax.bar(x + width/2, plugin_predictive, width,
       label='Plug-in\n(Binomial)', alpha=0.7, color='red')
ax.set_xlabel('Number of Successes in 15 Future Trials')
ax.set_ylabel('Probability')
ax.set_title('Posterior Predictive vs Plug-in Prediction')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Plot 2: Highlight the difference
ax = axes[1]
difference = np.array(posterior_predictive) - np.array(plugin_predictive)
colors = ['green' if d >= 0 else 'red' for d in difference]
ax.bar(x, difference, color=colors, alpha=0.7)
ax.axhline(0, color='black', linewidth=0.5)
ax.set_xlabel('Number of Successes')
ax.set_ylabel('Posterior Predictive - Plug-in')
ax.set_title('Difference: Posterior Predictive is More Dispersed')
ax.grid(True, alpha=0.3)

plt.tight_layout()

# Statistics
print("=" * 60)
print("POSTERIOR PREDICTIVE vs PLUG-IN PREDICTION")
print("=" * 60)
print(f"Observed: {observed_successes}/{observed_trials} successes")
print(f"Posterior: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")
print(f"Predicting: {future_trials} future trials")
print()
print(f"{'Statistic':<25} {'Post. Pred.':<15} {'Plug-in':<15}")
print("-" * 55)

# Expected value (same for both)
pp_mean = future_trials * posterior_mean
print(f"{'Mean':<25} {pp_mean:<15.3f} {pp_mean:<15.3f}")

# Variance differs
pp_variance = sum(y**2 * p for y, p in zip(future_successes, posterior_predictive)) - pp_mean**2
plugin_variance = future_trials * posterior_mean * (1 - posterior_mean)
print(f"{'Variance':<25} {pp_variance:<15.3f} {plugin_variance:<15.3f}")
print(f"{'Std Dev':<25} {np.sqrt(pp_variance):<15.3f} {np.sqrt(plugin_variance):<15.3f}")

print()
print("Note: Posterior predictive has LARGER variance because it")
print("      accounts for uncertainty in θ, not just randomness in trials.")
```

Using p(x*|θ̂) where θ̂ is any point estimate (MLE, posterior mean, etc.) ignores parameter uncertainty. This leads to overconfident predictions and poorly calibrated prediction intervals. Always use the full posterior predictive when uncertainty quantification matters.
For most models, the posterior has no closed-form expression and requires numerical approximation. However, certain prior-likelihood pairs yield posteriors in the same family as the prior—these are called conjugate pairs.
Definition:
A prior p(θ) is conjugate to a likelihood p(D|θ) if the posterior p(θ|D) is in the same distribution family as the prior.
Why Conjugacy Matters:

- The posterior has a closed form: updating reduces to arithmetic on hyperparameters.
- Sequential updating is trivial: the posterior after one batch becomes the prior for the next, in the same family.
- Hyperparameters typically admit a pseudo-observation interpretation, as the table below shows.
Limitation: Conjugate priors may not encode our actual beliefs. Don't choose a prior just because it's conjugate—computational convenience shouldn't trump model fidelity.
| Likelihood | Conjugate Prior | Posterior | Prior Interpretation |
|---|---|---|---|
| Bernoulli/Binomial(θ) | Beta(α, β) | Beta(α+k, β+n-k) | α-1 prior successes, β-1 prior failures |
| Poisson(λ) | Gamma(α, β) | Gamma(α+Σx, β+n) | α prior events in β time units |
| Normal(μ, σ² known) | Normal(μ₀, τ²) | Normal(weighted mean) | Prior observation with precision 1/τ² |
| Normal(μ known, σ²) | Inverse-Gamma(α, β) | Inverse-Gamma(α+n/2, β+½Σ(xᵢ−μ)²) | α prior observations, β prior sum of squares |
| Exponential(λ) | Gamma(α, β) | Gamma(α+n, β+Σx) | α prior events in β time |
| Multinomial(θ) | Dirichlet(α) | Dirichlet(α+counts) | α-1 prior counts per category |
| Geometric(θ) | Beta(α, β) | Beta(α+n, β+Σx) | Similar to Bernoulli |
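To make a non-Beta row of the table concrete, here is a minimal sketch of the Poisson-Gamma update, with a hypothetical Gamma(3, 1) prior on the rate λ (3 prior events in 1 time unit) and simulated counts:

```python
import numpy as np
from scipy import stats

# Conjugate Poisson–Gamma update, per the table row above
np.random.seed(1)
alpha, beta = 3.0, 1.0                        # hypothetical Gamma(α, β) prior on λ
data = np.random.poisson(lam=4.2, size=25)    # hypothetical observed counts

post_alpha = alpha + data.sum()               # α + Σx
post_beta = beta + len(data)                  # β + n

posterior = stats.gamma(a=post_alpha, scale=1 / post_beta)  # scale = 1/rate in scipy
print(f"Posterior: Gamma({post_alpha:.0f}, {post_beta:.0f})")
print(f"Posterior mean = {posterior.mean():.3f} (MLE = {data.mean():.3f})")
print(f"95% CI: [{posterior.ppf(0.025):.3f}, {posterior.ppf(0.975):.3f}]")
```

As with the Beta-Binomial case, the update is pure bookkeeping: add the total event count to α and the observation time (here, the number of unit intervals) to β.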
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def sequential_bayesian_update():
    """
    Demonstrate sequential conjugate updating.
    Each data batch updates the posterior, which
    becomes the prior for the next batch.
    """
    # True parameter
    true_theta = 0.65

    # Initial prior: Beta(1, 1) = Uniform
    prior_a, prior_b = 1, 1

    # Generate batches of data
    np.random.seed(42)
    batch_sizes = [5, 10, 20, 50, 100]
    batches = [np.random.binomial(1, true_theta, size=n) for n in batch_sizes]

    # Track evolution
    history = [{
        'batch': 0, 'cumulative_n': 0,
        'a': prior_a, 'b': prior_b,
        'mean': prior_a / (prior_a + prior_b)
    }]

    theta = np.linspace(0, 1, 500)
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()

    current_a, current_b = prior_a, prior_b
    cumulative_n = 0

    # Plot initial prior
    ax = axes[0]
    initial_dist = stats.beta(prior_a, prior_b).pdf(theta)
    ax.fill_between(theta, initial_dist, alpha=0.3, color='blue')
    ax.plot(theta, initial_dist, 'b-', linewidth=2)
    ax.axvline(true_theta, color='green', linestyle='--', label=f'True θ={true_theta}')
    ax.set_title(f'Initial Prior: Beta({prior_a}, {prior_b})')
    ax.set_xlabel('θ')
    ax.set_ylabel('Density')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1)

    # Sequential updates
    for i, batch in enumerate(batches):
        # Update
        k = batch.sum()
        n = len(batch)
        current_a += k
        current_b += (n - k)
        cumulative_n += n

        history.append({
            'batch': i + 1, 'cumulative_n': cumulative_n,
            'a': current_a, 'b': current_b,
            'mean': current_a / (current_a + current_b)
        })

        # Plot
        ax = axes[i + 1]
        post_dist = stats.beta(current_a, current_b).pdf(theta)
        ax.fill_between(theta, post_dist, alpha=0.3, color='red')
        ax.plot(theta, post_dist, 'r-', linewidth=2)
        ax.axvline(true_theta, color='green', linestyle='--', label=f'True θ={true_theta}')
        ax.axvline(current_a / (current_a + current_b), color='darkred', linestyle='-',
                   label=f'Post mean={current_a/(current_a+current_b):.3f}')
        ax.set_title(f'After {cumulative_n} obs: Beta({current_a}, {current_b})')
        ax.set_xlabel('θ')
        ax.set_ylabel('Density')
        ax.legend(fontsize=8)
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)

    plt.suptitle('Sequential Bayesian Updating with Conjugate Priors', fontsize=14)
    plt.tight_layout()

    # Print history
    print("=" * 70)
    print("SEQUENTIAL UPDATE HISTORY")
    print("=" * 70)
    print(f"True θ = {true_theta}")
    print()
    print(f"{'Batch':<8} {'Total n':<10} {'Posterior':<20} {'Mean':<12} {'95% CI':<25}")
    print("-" * 70)
    for h in history:
        dist = stats.beta(h['a'], h['b'])
        ci = f"[{dist.ppf(0.025):.3f}, {dist.ppf(0.975):.3f}]"
        posterior_name = f"Beta({h['a']}, {h['b']})"
        print(f"{h['batch']:<8} {h['cumulative_n']:<10} {posterior_name:<20} "
              f"{h['mean']:<12.4f} {ci:<25}")

    return history

history = sequential_bayesian_update()
```

Conjugate models enable true online learning: the posterior from observing data up to time t becomes the prior for data at time t+1. This is computationally efficient and theoretically elegant, but requires the conjugate family to be appropriate for your problem.
Real models typically have multiple parameters, creating multivariate posterior distributions. Understanding the structure of these distributions is crucial for both computation and interpretation.
Joint Posterior:
For parameters θ = (θ₁, θ₂, ..., θₖ), the joint posterior is:
$$p(\theta_1, \theta_2, ..., \theta_k | D) = \frac{p(D | \theta_1, ..., \theta_k) \cdot p(\theta_1, ..., \theta_k)}{p(D)}$$
Marginal Posteriors:
Often we're interested in individual parameters. The marginal posterior for θ₁ is:
$$p(\theta_1 | D) = \int p(\theta_1, \theta_2, ..., \theta_k | D) \, d\theta_2 \cdots d\theta_k$$
This integration 'marginalizes out' the other parameters. With MCMC samples, marginals are trivial—just ignore the other columns.
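A short illustration of that last point, using a synthetic correlated-normal array as a stand-in for MCMC output (hypothetical numbers): the marginal for θ₁ is obtained by simply selecting its column.

```python
import numpy as np

# Stand-in for MCMC draws: shape (n_samples, k), one column per parameter
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(
    mean=[0.0, 2.0], cov=[[1.0, 0.8], [0.8, 1.0]], size=10_000)

theta1 = samples[:, 0]  # marginal draws of θ1: just ignore the other column
print(f"Marginal mean of θ1:   {theta1.mean():.3f}")
print(f"Marginal 95% interval: "
      f"[{np.percentile(theta1, 2.5):.3f}, {np.percentile(theta1, 97.5):.3f}]")
```

The integral over θ₂ happens implicitly: each draw of θ₁ was generated jointly with some θ₂, so the collection of θ₁ values is already distributed according to the marginal.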
Conditional Posteriors:
These are the distributions of one parameter given the others:
$$p(\theta_i | \theta_{-i}, D)$$
where θ₋ᵢ denotes all parameters except θᵢ. Conditional posteriors are the building blocks of Gibbs sampling.
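To make this concrete, here is a minimal Gibbs sampler sketch for a Normal model with unknown mean μ and variance σ², assuming hypothetical semi-conjugate priors (μ ~ N(0, 10²), σ² ~ Inverse-Gamma(2, 2)); each step draws from one full conditional posterior:

```python
import numpy as np

np.random.seed(3)
y = np.random.normal(5.0, 2.0, size=50)   # simulated data
n, ybar = len(y), y.mean()
mu0, tau0_sq = 0.0, 100.0                 # prior on μ: N(mu0, tau0_sq)
a0, b0 = 2.0, 2.0                         # prior on σ²: Inv-Gamma(a0, b0)

mu, sigma2 = ybar, y.var()                # initialize at data estimates
draws = []
for _ in range(5000):
    # p(μ | σ², y): Normal, a precision-weighted blend of prior and data
    prec = n / sigma2 + 1 / tau0_sq
    mean = (ybar * n / sigma2 + mu0 / tau0_sq) / prec
    mu = np.random.normal(mean, np.sqrt(1 / prec))

    # p(σ² | μ, y): Inverse-Gamma(a0 + n/2, b0 + ½Σ(yᵢ-μ)²)
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * np.sum((y - mu) ** 2)
    sigma2 = 1.0 / np.random.gamma(a_n, 1.0 / b_n)  # 1/Gamma(a, rate b) ~ Inv-Gamma(a, b)

    draws.append((mu, sigma2))

draws = np.array(draws)[1000:]            # discard burn-in
print(f"Posterior mean of μ:  {draws[:, 0].mean():.3f}")
print(f"Posterior mean of σ²: {draws[:, 1].mean():.3f}")
```

Neither conditional requires the joint normalizing constant; conjugacy within each conditional is what makes Gibbs sampling practical here.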
Posterior Correlation:
Parameters may be correlated in the posterior even if their priors are independent. This occurs when the likelihood 'ties them together'—e.g., estimating both intercept and slope in regression creates posterior correlation.
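A quick sketch of this effect: under a flat prior with a known noise scale, the posterior covariance of (intercept, slope) is proportional to (XᵀX)⁻¹, so predictors far from zero induce strong posterior correlation that centering removes (illustrative example with hypothetical data):

```python
import numpy as np

np.random.seed(7)
x = np.random.uniform(5, 10, size=40)     # x far from zero → strong correlation
X = np.column_stack([np.ones_like(x), x])
cov = np.linalg.inv(X.T @ X)              # ∝ posterior covariance of (intercept, slope)
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(f"Posterior correlation (raw x):      {corr:.3f}")

Xc = np.column_stack([np.ones_like(x), x - x.mean()])  # centering decorrelates
cov_c = np.linalg.inv(Xc.T @ Xc)
corr_c = cov_c[0, 1] / np.sqrt(cov_c[0, 0] * cov_c[1, 1])
print(f"Posterior correlation (centered x): {corr_c:.3f}")
```

Intuitively, with uncentered x, raising the slope can be compensated by lowering the intercept while fitting the data equally well, so the likelihood ties the two parameters together.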
Hierarchical Posteriors:
Hierarchical (multi-level) models introduce parameters at multiple levels:
$$y_{ij} \sim p(y | \theta_j)$$ (observation level)
$$\theta_j \sim p(\theta | \phi)$$ (group level)
$$\phi \sim p(\phi)$$ (population level)
The full posterior is:
$$p(\theta_1, ..., \theta_J, \phi | D) \propto \prod_{j=1}^{J} \prod_{i=1}^{n_j} p(y_{ij} | \theta_j) \cdot \prod_{j=1}^{J} p(\theta_j | \phi) \cdot p(\phi)$$
Shrinkage and Partial Pooling:
Hierarchical posteriors exhibit shrinkage: individual group estimates θⱼ are pulled toward the population mean. Groups with less data are shrunk more. This borrows strength across groups and often improves predictions compared to either:

- complete pooling: a single shared estimate that ignores group differences, or
- no pooling: independent per-group estimates that overfit sparse groups.

A small numeric sketch of shrinkage follows this list.
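The sketch below illustrates shrinkage with the population mean and variances held fixed at hypothetical values (a full hierarchical model would infer these as well); each group's estimate is a precision-weighted average of its raw mean and the population mean:

```python
import numpy as np

mu, tau, sigma = 0.5, 0.1, 0.3                  # hypothetical population mean/sd, obs sd
group_n = np.array([2, 10, 50, 200])            # groups with increasing data
group_means = np.array([0.9, 0.8, 0.75, 0.72])  # hypothetical observed group means

prec_data = group_n / sigma**2                  # data precision per group
prec_prior = 1 / tau**2                         # group-level (population) precision
shrunk = (prec_data * group_means + prec_prior * mu) / (prec_data + prec_prior)

for n, raw, s in zip(group_n, group_means, shrunk):
    print(f"n={n:>3}: raw mean {raw:.3f} -> shrunk {s:.3f}")
```

The n = 2 group is pulled strongly toward the population mean of 0.5, while the n = 200 group barely moves: exactly the "shrink sparse groups more" behavior described above.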
Hierarchical posteriors often have challenging 'funnel' geometries where the group-level variance and individual parameters become tightly correlated. When the variance is small, individual parameters must be close together; when large, they can spread. This creates sampling difficulties that require techniques like non-centered parameterization.
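As a sketch of the remedy, non-centered parameterization replaces the direct draw θⱼ ~ N(μ, τ²) with a standard normal draw plus a deterministic transform, which decouples θⱼ from τ in the sampler's geometry:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = 0.0, 0.05                            # small τ → narrow funnel neck

# Centered: sample θⱼ directly from N(μ, τ²); θⱼ and τ are tightly coupled
theta_centered = rng.normal(mu, tau, size=5)

# Non-centered: sample θ_raw ~ N(0, 1), then transform; the sampler explores
# a fixed standard normal regardless of τ
theta_raw = rng.normal(0.0, 1.0, size=5)
theta_noncentered = mu + tau * theta_raw

print("centered:    ", np.round(theta_centered, 4))
print("non-centered:", np.round(theta_noncentered, 4))
```

The two parameterizations define the same model, but in the non-centered form the geometry the sampler sees (standard normal θ_raw) no longer pinches as τ shrinks.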
For most interesting models, the posterior cannot be computed in closed form. The normalizing constant p(D) requires integrating over the entire parameter space—an integral that is typically intractable.
Why Integration is Hard:

- Dimensionality: grid evaluation scales exponentially with the number of parameters.
- Concentration: posterior mass typically occupies a tiny, oddly shaped region of parameter space that generic quadrature easily misses.
- No closed form: the integrand, likelihood × prior, rarely matches a known family of integrals.
Modern Computational Approaches:
| Method | Key Idea | Pros | Cons |
|---|---|---|---|
| Grid Approximation | Evaluate posterior on fine grid | Simple, exact in limit | Fails in high dimensions |
| Conjugate Priors | Choose prior for closed-form posterior | Analytical, fast | Limited flexibility |
| Laplace Approximation | Gaussian approximation at mode | Fast, deterministic | Poor for non-Gaussian posteriors |
| MCMC | Generate samples from posterior | Flexible, asymptotically exact | Slow convergence, diagnostics needed |
| Variational Inference | Optimize over approximating family | Fast, scalable | Approximate, underestimates variance |
| Importance Sampling | Weighted samples from proposal | Simple concept | Variance can be huge |
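As an example of the Laplace row in the table, here is a sketch that approximates the Beta(2, 2)-Binomial posterior from earlier (8 heads in 12 flips) by fitting a Gaussian at the posterior mode using the curvature of the log posterior:

```python
import numpy as np
from scipy import stats

a, b, k, n = 2, 2, 8, 12
A, B = a + k - 1, b + (n - k) - 1       # exponents of θ and (1-θ) in log p̃(θ)

theta_mode = A / (A + B)                # mode of Beta(a+k, b+n-k)
hessian = -A / theta_mode**2 - B / (1 - theta_mode)**2  # d²/dθ² log p̃ at the mode
laplace_sd = np.sqrt(-1 / hessian)      # Gaussian sd from the curvature

exact = stats.beta(a + k, b + n - k)
print(f"Mode: {theta_mode:.4f}")
print(f"Laplace sd {laplace_sd:.4f} vs exact sd {exact.std():.4f}")
print(f"Laplace 95%: [{theta_mode - 1.96*laplace_sd:.4f}, {theta_mode + 1.96*laplace_sd:.4f}]")
print(f"Exact 95%:   [{exact.ppf(0.025):.4f}, {exact.ppf(0.975):.4f}]")
```

The approximation is decent here because the posterior is unimodal and roughly bell-shaped; it degrades for skewed, bounded, or multimodal posteriors, which is the "cons" entry in the table.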
MCMC: The Workhorse of Modern Bayesian Inference:
Markov Chain Monte Carlo (MCMC) methods generate sequences of samples that, in the limit, are draws from the posterior. The key insight is that we can sample from p(θ|D) without computing p(D)—we only need the unnormalized posterior, which is just likelihood × prior.
Common MCMC algorithms:

- Metropolis-Hastings: propose a random move and accept or reject it based on the unnormalized posterior ratio (see the sketch after this list).
- Gibbs sampling: cycle through the full conditional posteriors, one parameter block at a time.
- Hamiltonian Monte Carlo (HMC) and NUTS: use gradient information to make distant proposals with high acceptance rates.
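Here is a minimal random-walk Metropolis sketch for the coin-bias posterior used throughout this page (Beta(2, 2) prior, 8 heads in 12 flips); note that the accept/reject ratio cancels p(D), so only the unnormalized posterior is ever evaluated:

```python
import numpy as np
from scipy import stats

def log_unnorm_posterior(theta, k=8, n=12, a=2, b=2):
    if not 0 < theta < 1:
        return -np.inf                          # zero density outside (0, 1)
    return (a + k - 1) * np.log(theta) + (b + n - k - 1) * np.log(1 - theta)

np.random.seed(0)
theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + np.random.normal(0, 0.1)  # random-walk proposal
    # Accept with prob min(1, p̃(proposal)/p̃(current)); p(D) cancels in the ratio
    if np.log(np.random.rand()) < log_unnorm_posterior(proposal) - log_unnorm_posterior(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples)[5000:]               # discard burn-in
print(f"MCMC mean {samples.mean():.4f} vs exact {stats.beta(10, 6).mean():.4f}")
print(f"MCMC sd   {samples.std():.4f} vs exact {stats.beta(10, 6).std():.4f}")
```

The proposal scale (0.1 here) is a tuning choice: too small and the chain crawls, too large and most proposals are rejected.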
Variational Inference: When Speed Matters:
VI approximates the posterior with a tractable distribution (e.g., factorized Gaussian) by minimizing KL divergence. It's much faster than MCMC but provides an approximation, not exact samples.
When to use VI:

- Large datasets or models where MCMC is prohibitively slow.
- When the posterior feeds a latency-sensitive production system.
- For rapid model iteration, or to initialize MCMC.

A toy sketch follows this list.
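This is a toy VI sketch for the same coin-bias posterior: it fits a Gaussian on the logit scale by maximizing a Monte Carlo ELBO via the reparameterization trick (all settings hypothetical and chosen for illustration):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

a, b, k, n = 2, 2, 8, 12
eps = np.random.default_rng(0).normal(size=2000)   # fixed noise draws (common random numbers)

def log_unnorm_post_z(z):
    # θ = sigmoid(z); the Jacobian log θ(1-θ) folds the exponents up to (a+k), (b+n-k)
    theta = 1 / (1 + np.exp(-z))
    return (a + k) * np.log(theta) + (b + n - k) * np.log(1 - theta)

def neg_elbo(params):
    m, log_s = params
    z = m + np.exp(log_s) * eps                    # reparameterized samples from q = N(m, s²)
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_s
    return -(log_unnorm_post_z(z).mean() + entropy)

res = minimize(neg_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
m, s = res.x[0], np.exp(res.x[1])

# Push q back through the sigmoid and compare to the exact Beta(10, 6) posterior
theta_q = 1 / (1 + np.exp(-(m + s * np.random.default_rng(1).normal(size=50_000))))
exact = stats.beta(a + k, b + n - k)
print(f"VI mean {theta_q.mean():.4f} vs exact {exact.mean():.4f}")
print(f"VI sd   {theta_q.std():.4f} vs exact {exact.std():.4f}")
```

Even in this tiny example, inference is recast as optimization of (m, s), which is the property that lets VI scale to models where sampling is too slow.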
MCMC output is only valid if the chain has converged to the posterior. Always check: (1) Trace plots for mixing, (2) R-hat < 1.01 for convergence, (3) Effective sample size > 100 per parameter, (4) Compare multiple chains started from different points.
The posterior distribution is the culmination of Bayesian inference—our complete state of knowledge after observing data. Let's consolidate the key concepts:

- Bayes' theorem: posterior ∝ likelihood × prior, with the evidence p(D) as the normalizer.
- The posterior is a precision-weighted compromise between prior and data; data dominate as n grows.
- Summaries: posterior mean, median, and mode; central credible intervals and HDIs.
- The posterior predictive integrates out parameter uncertainty, yielding honest predictions.
- Conjugate pairs give closed-form posteriors; most other models require grid, Laplace, MCMC, or variational approximation.
What's Next:
With prior, likelihood, and posterior established, we now turn to Bayesian updating—the dynamic process of incorporating new evidence as it arrives. This perspective reveals Bayesian inference as a continuous learning process, not a one-time calculation.
You now understand the posterior distribution—how it's derived, interpreted, summarized, and computed. This is the central object of Bayesian inference, from which all downstream inferences flow. Next, we'll explore how posteriors evolve as new data arrives.