The posterior distribution encapsulates everything we know about a parameter after observing data. But in practice, we often need a single number—a point estimate—for communication, downstream processing, or decision-making. How do we responsibly compress an entire distribution into one value?
This seemingly simple question reveals deep connections between Bayesian inference, decision theory, regularization, and frequentist estimation. Different point estimates answer different implicit questions and optimize different loss functions. Understanding these distinctions ensures you choose the right summary for your purpose.
In this page, we explore the three major Bayesian point estimates—the Maximum A Posteriori (MAP), posterior mean, and posterior median—along with their properties, computation methods, and connections to familiar non-Bayesian techniques.
By the end of this page, you will understand when to use MAP vs mean vs median estimates, derive the decision-theoretic justification for each, recognize MAP as regularized maximum likelihood, compute point estimates from MCMC samples, and appreciate what is lost when summarizing posteriors with single numbers.
Before examining specific point estimates, let's establish the framework that unifies them all: decision theory.
The Setup:
We have observed data D and formed a posterior p(θ|D), but must report a single estimate θ̂. A loss function L(θ, θ̂) quantifies how costly it is to report θ̂ when the true parameter value is θ.
The Bayes Estimator:
The optimal estimate under a given loss function is the one minimizing expected posterior loss:
$$\hat{\theta}_{\text{Bayes}} = \arg\min_{\hat{\theta}} E_{\theta|D}[L(\theta, \hat{\theta})] = \arg\min_{\hat{\theta}} \int L(\theta, \hat{\theta}) \cdot p(\theta|D) \, d\theta$$
Key Insight:
Different loss functions yield different optimal estimates:
| Loss Function | Bayes Estimator | Intuition |
|---|---|---|
| Squared: L = (θ - â)² | Posterior Mean | Penalizes large errors heavily |
| Absolute: L = \|θ - â\| | Posterior Median | Robust to outliers |
| 0-1: L = 𝟙(θ ≠ â) | Posterior Mode (MAP) | 'All-or-nothing' accuracy |
There's no universally optimal point estimate because different situations have different loss structures. The mean is optimal for squared error; the median for absolute error; the mode for 0-1 loss. Choosing an estimate IS choosing a loss function, implicitly or explicitly.
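The correspondence between losses and estimators can be checked numerically. The sketch below uses a Beta(2, 8) posterior chosen purely for illustration: it minimizes the Monte Carlo estimate of each expected loss over a grid of candidate estimates and recovers the posterior mean and median (up to grid and sampling error).

```python
import numpy as np
from scipy import stats

# Draw samples from a skewed "posterior" (Beta(2, 8) is an arbitrary choice)
# and minimize each expected loss over a grid of candidate estimates.
posterior_samples = stats.beta(2, 8).rvs(size=20_000, random_state=42)
candidates = np.linspace(0, 1, 1001)

# Expected loss for each candidate, approximated by a sample average.
squared_loss = [np.mean((posterior_samples - c) ** 2) for c in candidates]
absolute_loss = [np.mean(np.abs(posterior_samples - c)) for c in candidates]

print(f"argmin squared loss : {candidates[np.argmin(squared_loss)]:.3f} "
      f"(posterior mean ≈ {posterior_samples.mean():.3f})")
print(f"argmin absolute loss: {candidates[np.argmin(absolute_loss)]:.3f} "
      f"(posterior median ≈ {np.median(posterior_samples):.3f})")
```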
Proving the Results:
Squared Error → Mean:
$$\frac{d}{d\hat{\theta}} E[(\theta - \hat{\theta})^2] = \frac{d}{d\hat{\theta}} E[\theta^2 - 2\theta\hat{\theta} + \hat{\theta}^2] = -2E[\theta] + 2\hat{\theta} = 0$$
$$\implies \hat{\theta}^* = E[\theta|D]$$
Absolute Error → Median:
The expected absolute loss is minimized when â splits the posterior into equal halves—i.e., at the median. This follows because the derivative of E[|θ - â|] changes sign exactly at the median.
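Concretely, splitting the integral at θ̂ makes the sign change explicit:

$$\frac{d}{d\hat{\theta}} E_{\theta|D}[|\theta - \hat{\theta}|] = \frac{d}{d\hat{\theta}} \left[ \int_{-\infty}^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta|D)\, d\theta + \int_{\hat{\theta}}^{\infty} (\theta - \hat{\theta})\, p(\theta|D)\, d\theta \right] = P(\theta \leq \hat{\theta}|D) - P(\theta > \hat{\theta}|D)$$

Setting this to zero gives P(θ ≤ θ̂|D) = 1/2, which is exactly the definition of the median.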
0-1 Loss → Mode:
With 0-1 loss, we only care about getting it exactly right. The probability of 'guessing correctly' is maximized at the posterior mode—the most probable value.
Note: For continuous posteriors, P(θ = â) = 0 for any specific value, so 0-1 loss is technically problematic. We interpret it as equivalent to maximizing posterior density.
The MAP estimate is the posterior mode—the parameter value with highest posterior density:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta | D) = \arg\max_{\theta} \left[ p(D|\theta) \cdot p(\theta) \right]$$
Since the normalizing constant p(D) doesn't depend on θ, it doesn't affect the optimization.
Log-Form:
In practice, we optimize the log-posterior:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \log p(D|\theta) + \log p(\theta) \right] = \arg\max_{\theta} \left[ \ell(\theta) + \log \pi(\theta) \right]$$
where ℓ(θ) is the log-likelihood and π(θ) is the prior.
MAP = Penalized Maximum Likelihood:
This form reveals a crucial connection: MAP estimation is equivalent to regularized or penalized maximum likelihood estimation!
The log-prior acts as a penalty on parameter values, preventing solutions that are a priori implausible.
| Prior Distribution | Regularization Type | Penalty Term | Effect |
|---|---|---|---|
| N(0, σ²) on weights | L2 / Ridge | λ‖θ‖² where λ = 1/(2σ²) | Shrinks weights toward zero |
| Laplace(0, b) on weights | L1 / Lasso | λ‖θ‖₁ where λ = 1/b | Encourages sparsity |
| Uniform (flat) | None (pure MLE) | 0 | No regularization |
| Horseshoe prior | Adaptive sparsity | Complex | Strong shrinkage + heavy tails |
| Spike-and-slab | Subset selection | Implicit | Exact zeros or full weights |
```python
import numpy as np
from scipy import optimize, stats
import matplotlib.pyplot as plt

def demonstrate_map_regularization():
    """
    Show that MAP = regularized MLE through a ridge regression example.
    """
    np.random.seed(42)

    # Generate data: y = Xβ + ε, with collinear features
    n, p = 50, 10
    X = np.random.randn(n, p)
    X[:, 1] = X[:, 0] + 0.1 * np.random.randn(n)  # Create collinearity
    true_beta = np.array([1, -1, 0.5, 0, 0, 0, 0, 0, 0, 0])
    y = X @ true_beta + 0.5 * np.random.randn(n)

    # Method 1: OLS (MLE with flat prior)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # Method 2: Ridge regression (L2 penalty on the sum of squared errors)
    lambda_ridge = 0.5
    beta_ridge = np.linalg.solve(X.T @ X + lambda_ridge * np.eye(p), X.T @ y)

    # Method 3: MAP with Normal(0, σ_p²) prior (equivalent to ridge)
    # Matching ridge applied to the SSE requires σ_p² = σ_noise² / λ
    noise_variance = 0.5**2  # Assume known
    prior_variance = noise_variance / lambda_ridge

    def neg_log_posterior(beta, X, y, noise_var, prior_var):
        """Negative log-posterior (for minimization)."""
        residuals = y - X @ beta
        log_likelihood = -0.5 * np.sum(residuals**2) / noise_var
        log_prior = -0.5 * np.sum(beta**2) / prior_var
        return -(log_likelihood + log_prior)

    result = optimize.minimize(
        neg_log_posterior,
        np.zeros(p),
        args=(X, y, noise_variance, prior_variance),
        method='L-BFGS-B'
    )
    beta_map = result.x

    # Verify equivalence
    print("=" * 60)
    print("MAP = REGULARIZED MLE DEMONSTRATION")
    print("=" * 60)
    print()
    print(f"{'β':<8} {'True':<10} {'OLS':<10} {'Ridge':<10} {'MAP':<10}")
    print("-" * 60)
    for i in range(p):
        print(f"β_{i:<5} {true_beta[i]:<10.4f} {beta_ols[i]:<10.4f} "
              f"{beta_ridge[i]:<10.4f} {beta_map[i]:<10.4f}")
    print()
    print(f"Ridge λ = {lambda_ridge}, Prior variance σ_p² = {prior_variance:.4f}")
    print(f"Max |Ridge - MAP| = {np.max(np.abs(beta_ridge - beta_map)):.2e}")
    print("Ridge and MAP are IDENTICAL (within numerical precision)!")

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Coefficient comparison
    ax = axes[0]
    x_pos = np.arange(p)
    width = 0.2
    ax.bar(x_pos - 1.5*width, true_beta, width, label='True', alpha=0.8)
    ax.bar(x_pos - 0.5*width, beta_ols, width, label='OLS (flat prior)', alpha=0.8)
    ax.bar(x_pos + 0.5*width, beta_ridge, width, label='Ridge', alpha=0.8)
    ax.bar(x_pos + 1.5*width, beta_map, width, label='MAP', alpha=0.8)
    ax.set_xlabel('Coefficient index')
    ax.set_ylabel('Coefficient value')
    ax.set_title('Coefficient Estimates: OLS vs Ridge vs MAP')
    ax.legend()
    ax.set_xticks(x_pos)
    ax.set_xticklabels([f'β_{i}' for i in range(p)])
    ax.grid(True, alpha=0.3, axis='y')

    # Plot 2: Effect of regularization strength
    ax = axes[1]
    lambdas = np.logspace(-3, 2, 50)
    beta_paths = []
    for lam in lambdas:
        beta_l = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        beta_paths.append(beta_l)
    beta_paths = np.array(beta_paths)
    for i in range(p):
        ax.plot(lambdas, beta_paths[:, i], linewidth=2,
                label=f'β_{i}' if i < 5 else None)
    ax.axhline(0, color='black', linewidth=0.5)
    ax.axvline(lambda_ridge, color='red', linestyle='--', label='Selected λ')
    ax.set_xscale('log')
    ax.set_xlabel('λ (regularization strength)')
    ax.set_ylabel('Coefficient value')
    ax.set_title('Ridge Regularization Path\n(Stronger prior → more shrinkage)')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()

demonstrate_map_regularization()
```

Since MAP reduces to optimization, we can use all the machinery of numerical optimization: gradient descent, Newton's method, L-BFGS, etc. This makes MAP computationally tractable even when full Bayesian inference (computing the entire posterior) is not.
Properties of MAP Estimation:
Advantages:
- Requires only optimization rather than integration, so it remains feasible when computing the full posterior is not (as noted after the code example above).
- Makes the link between priors and regularization penalties explicit.
- Reduces to the MLE under a flat prior, which eases comparison with non-Bayesian fits.

Disadvantages:
- Ignores posterior spread and shape: two very different posteriors can share the same mode.
- Not invariant under reparameterization; the mode of a transformed parameter is generally not the transform of the mode.
- Can sit in an unrepresentative region of a skewed or multimodal posterior, and may be poorly determined when the posterior is flat near its peak.
The Bayesian Critique:
From a fully Bayesian perspective, MAP is often considered suboptimal because it discards information about posterior uncertainty. The mode tells you where the peak is, but not how spread out the distribution is around it.
The posterior mean is perhaps the most commonly used Bayesian point estimate:
$$\hat{\theta}_{\text{mean}} = E[\theta | D] = \int \theta \cdot p(\theta | D) \, d\theta$$
Decision-Theoretic Justification:
As shown earlier, the posterior mean minimizes expected squared error loss:
$$\hat{\theta}_{\text{mean}} = \arg\min_{\hat{\theta}} E_{\theta|D}[(\theta - \hat{\theta})^2]$$
Squared error penalizes large mistakes heavily, making the mean optimal when big errors are catastrophic.
Properties:
- Uses the entire posterior: every value of θ contributes, weighted by its posterior density.
- Sensitive to skew and heavy tails, and may not even exist for extremely heavy-tailed posteriors (e.g., Cauchy-like).
- Linear: the posterior mean of a sum of parameters is the sum of their posterior means.
- Trivial to estimate from MCMC output by averaging the samples.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def posterior_mean_analysis():
    """
    Analyze posterior mean vs MAP for different posterior shapes.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    theta = np.linspace(0, 1, 1000)

    scenarios = [
        ("Symmetric: Beta(8, 8)", stats.beta(8, 8)),
        ("Right-skewed: Beta(2, 8)", stats.beta(2, 8)),
        ("Left-skewed: Beta(8, 2)", stats.beta(8, 2)),
        ("Bimodal: Mixture", None),  # Custom
    ]

    for ax, (title, dist) in zip(axes.flatten(), scenarios):
        if dist is not None:
            pdf = dist.pdf(theta)
            mean = dist.mean()
            median = dist.median()
            mode = ((dist.args[0] - 1) / (dist.args[0] + dist.args[1] - 2)
                    if dist.args[0] > 1 and dist.args[1] > 1 else None)
        else:
            # Create bimodal mixture of two Betas
            dist1 = stats.beta(5, 20)
            dist2 = stats.beta(20, 5)
            pdf = 0.5 * dist1.pdf(theta) + 0.5 * dist2.pdf(theta)
            # Mean of the mixture
            mean = 0.5 * dist1.mean() + 0.5 * dist2.mean()
            # Mode is at one of the component modes
            mode = theta[np.argmax(pdf)]
            # Median (approximate, from the mixture CDF on the grid)
            cdf = 0.5 * dist1.cdf(theta) + 0.5 * dist2.cdf(theta)
            median_idx = np.argmin(np.abs(cdf - 0.5))
            median = theta[median_idx]

        # Plot
        ax.fill_between(theta, pdf, alpha=0.3, color='blue')
        ax.plot(theta, pdf, 'b-', linewidth=2)
        if mode is not None:
            ax.axvline(mode, color='red', linestyle=':', linewidth=2,
                       label=f'Mode (MAP) = {mode:.3f}')
        ax.axvline(mean, color='green', linestyle='-', linewidth=2,
                   label=f'Mean = {mean:.3f}')
        ax.axvline(median, color='orange', linestyle='--', linewidth=2,
                   label=f'Median = {median:.3f}')
        ax.set_xlabel('θ')
        ax.set_ylabel('p(θ|D)')
        ax.set_title(title)
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)

    plt.suptitle('Posterior Mean vs Mode vs Median\nfor Different Posterior Shapes',
                 fontsize=14)
    plt.tight_layout()

    # Show MCMC estimation of the posterior mean
    print("=" * 60)
    print("POSTERIOR MEAN FROM MCMC SAMPLES")
    print("=" * 60)

    # Simulate MCMC samples from Beta(3, 7)
    true_dist = stats.beta(3, 7)
    sample_sizes = [100, 1000, 10000, 100000]

    print(f"True posterior mean: {true_dist.mean():.6f}")
    print()
    print(f"{'N samples':<15} {'Estimated Mean':<20} {'Error':<20} {'MC Std Error':<15}")
    print("-" * 60)
    for n in sample_sizes:
        np.random.seed(42)
        samples = true_dist.rvs(n)
        estimated_mean = np.mean(samples)
        error = abs(estimated_mean - true_dist.mean())
        mc_std_error = np.std(samples) / np.sqrt(n)
        print(f"{n:<15} {estimated_mean:<20.6f} {error:<20.6f} {mc_std_error:<15.6f}")
    print()
    print("Monte Carlo standard error decreases as 1/√n")

posterior_mean_analysis()
```

Posterior Mean as Shrinkage:
For conjugate models, the posterior mean often has a beautiful shrinkage interpretation:
$$\hat{\theta}_{\text{mean}} = \lambda \cdot \hat{\theta}_{\text{prior}} + (1 - \lambda) \cdot \hat{\theta}_{\text{MLE}}$$
where λ depends on relative prior strength vs. data strength.
For example, in the Beta-Binomial model with prior Beta(α, β) and k successes in n trials:
$$E[\theta | D] = \frac{\alpha + k}{\alpha + \beta + n} = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{k}{n}$$
The posterior mean is literally a weighted average of prior mean and MLE, with weights proportional to 'sample sizes' (prior pseudocounts vs. actual observations).
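A quick numerical check of this identity; the Beta(4, 4) prior and the 7-successes-in-10-trials data below are illustrative numbers, not taken from the text above.

```python
import numpy as np

# Posterior mean as a weighted average of prior mean and MLE (Beta-Binomial).
alpha, beta_ = 4, 4   # illustrative prior pseudocounts
k, n = 7, 10          # illustrative data: 7 successes in 10 trials

prior_mean = alpha / (alpha + beta_)
mle = k / n
weight_prior = (alpha + beta_) / (alpha + beta_ + n)

posterior_mean_direct = (alpha + k) / (alpha + beta_ + n)
posterior_mean_shrunk = weight_prior * prior_mean + (1 - weight_prior) * mle

print(f"direct:    {posterior_mean_direct:.4f}")   # (4 + 7) / (8 + 10) ≈ 0.6111
print(f"shrinkage: {posterior_mean_shrunk:.4f}")   # identical, by the algebra above
```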
The posterior median is the value that splits the posterior probability in half:
$$P(\theta \leq \hat{\theta}_{\text{median}} | D) = P(\theta \geq \hat{\theta}_{\text{median}} | D) = 0.5$$
Decision-Theoretic Justification:
The posterior median minimizes expected absolute error loss:
$$\hat{\theta}_{\text{median}} = \arg\min_{\hat{\theta}} E_{\theta|D}[|\theta - \hat{\theta}|]$$
Absolute error penalizes all errors linearly, regardless of magnitude, making the median more robust to extreme values.
Properties:
- Robust to heavy tails and outlying posterior mass, and it always exists.
- Equivariant under monotonic transformations: for strictly increasing g, the median of g(θ) is g applied to the median of θ.
- Coincides with the mean and mode for symmetric unimodal posteriors.
- Read directly off MCMC output as the 50th percentile of the samples.
When to Use Median:
For positive parameters like variances, the posterior is often right-skewed. The posterior mean can be pulled upward by heavy tails, while the median provides a more 'typical' value. This is one reason the median is sometimes preferred for scale parameters.
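As a sketch of that effect, consider a right-skewed posterior for a variance-like parameter; the Inverse-Gamma(3, scale=10) below is purely illustrative.

```python
from scipy import stats

# A right-skewed posterior for a variance-like parameter (illustrative choice).
posterior = stats.invgamma(a=3, scale=10)

print(f"posterior mean  : {posterior.mean():.3f}")    # scale / (a - 1) = 5.0
print(f"posterior median: {posterior.median():.3f}")  # between mode and mean
print(f"posterior mode  : {10 / (3 + 1):.3f}")        # scale / (a + 1) = 2.5
# The heavy right tail pulls the mean well above the median and mode,
# which is why the median is often preferred for scale parameters.
```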
Let's systematically compare the three main point estimates across different posterior shapes and see when they diverge significantly.
| Posterior Shape | Mean | Median | Mode | Which Differs Most? |
|---|---|---|---|---|
| Symmetric, unimodal | All equal | All equal | All equal | None |
| Right-skewed | Highest | Middle | Lowest | Mean (pulled by tail) |
| Left-skewed | Lowest | Middle | Highest | Mean (pulled by tail) |
| Heavy-tailed | May not exist | Exists | Exists | Mean (undefined for Cauchy) |
| Bimodal | Between modes | Between modes | One of the modes | Mode (chooses one) |
| Uniform | Center | Center | Entire interval | Mode (not unique) |
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def comprehensive_comparison():
    """
    Compare all three point estimates across scenarios.
    """
    # Scenario: different amounts of data, same true parameter
    true_theta = 0.6
    prior_a, prior_b = 2, 2  # Weakly informative prior

    data_scenarios = [
        (0, 0, "Prior only"),
        (3, 2, "5 trials: 3 heads"),
        (12, 8, "20 trials: 12 heads"),
        (60, 40, "100 trials: 60 heads"),
        (600, 400, "1000 trials: 600 heads"),
    ]

    theta = np.linspace(0, 1, 1000)
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    results = []

    for i, (k, n_minus_k, label) in enumerate(data_scenarios):
        n = k + n_minus_k
        post_a = prior_a + k
        post_b = prior_b + n_minus_k
        dist = stats.beta(post_a, post_b)
        pdf = dist.pdf(theta)

        # Compute estimates
        mean = dist.mean()
        median = dist.median()
        mode = ((post_a - 1) / (post_a + post_b - 2)
                if post_a > 1 and post_b > 1 else 0.5)
        mle = k / n if n > 0 else 0.5

        results.append({'label': label, 'n': n, 'mean': mean,
                        'median': median, 'mode': mode, 'mle': mle})

        # Plot
        ax = axes[i]
        ax.fill_between(theta, pdf, alpha=0.3, color='blue')
        ax.plot(theta, pdf, 'b-', linewidth=2)
        ax.axvline(mean, color='green', linestyle='-', linewidth=2,
                   label=f'Mean={mean:.3f}')
        ax.axvline(median, color='orange', linestyle='--', linewidth=2,
                   label=f'Median={median:.3f}')
        ax.axvline(mode, color='red', linestyle=':', linewidth=2,
                   label=f'Mode={mode:.3f}')
        if n > 0:
            ax.axvline(mle, color='purple', linestyle='-.', linewidth=1.5,
                       alpha=0.7, label=f'MLE={mle:.3f}')
        ax.axvline(true_theta, color='black', linestyle='-', alpha=0.5,
                   linewidth=1, label=f'True θ={true_theta}')
        ax.set_xlabel('θ')
        ax.set_ylabel('Density')
        ax.set_title(f'{label}\nBeta({post_a}, {post_b})')
        ax.legend(fontsize=8, loc='upper right')
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)

    # Summary in last panel
    ax = axes[5]
    ax.axis('off')
    summary_text = "SUMMARY: Point Estimate Convergence\n" + "=" * 40 + "\n\n"
    summary_text += f"{'Scenario':<25} {'Mean':<8} {'Median':<8} {'Mode':<8} {'MLE':<8}\n"
    summary_text += "-" * 55 + "\n"
    for r in results:
        summary_text += (f"{r['label']:<25} {r['mean']:.4f}   {r['median']:.4f}   "
                         f"{r['mode']:.4f}   {r['mle']:.4f}\n")
    summary_text += "\n" + "-" * 55 + "\n"
    summary_text += "As n → ∞, all estimates → MLE → True θ"
    ax.text(0.1, 0.5, summary_text, transform=ax.transAxes, fontsize=10,
            verticalalignment='center', fontfamily='monospace',
            bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.5))

    plt.suptitle('Point Estimate Convergence with Increasing Data', fontsize=14)
    plt.tight_layout()

comprehensive_comparison()
```

Asymptotic Equivalence:
As sample size n → ∞:
- The likelihood overwhelms the prior, and the MAP, mean, and median all converge to the MLE (and, under standard regularity conditions, to the true parameter).
- The posterior becomes approximately Gaussian around its peak (the Bernstein-von Mises phenomenon), so the three estimates coincide asymptotically.

Differences between point estimates are most pronounced when:
- the sample is small or the prior is strong relative to the data;
- the posterior is skewed, heavy-tailed, or multimodal.
For complex models, we typically have MCMC samples rather than closed-form posteriors. Here's how to extract point estimates from samples:
From MCMC Samples:
Given samples θ⁽¹⁾, θ⁽²⁾, ..., θ⁽ᴺ⁾ from the posterior:
- Posterior mean: the sample average (1/N) Σₛ θ⁽ˢ⁾.
- Posterior median: the 50th percentile of the samples.
- Quantile-based credible intervals: empirical quantiles of the samples (e.g., the 2.5th and 97.5th percentiles for a central 95% interval).
Estimating the Mode from Samples:
The mode (MAP) is tricky because it requires estimating the density peak, not just a sample average. Approaches include:
- Kernel density estimation (KDE): fit a KDE to the samples and take its argmax on a grid.
- The half-sample mode (Robertson-Cryer): repeatedly restrict attention to the densest half of the sorted samples.
- Bypassing the samples entirely and optimizing the log-posterior directly.
Note: For high-dimensional parameters, KDE-based mode estimation becomes unreliable.
```python
import numpy as np
from scipy import stats
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def extract_point_estimates_from_samples():
    """
    Demonstrate extracting point estimates from MCMC samples.
    """
    np.random.seed(42)

    # Simulate MCMC samples from a skewed posterior (Gamma as example)
    true_shape, true_scale = 3, 2  # Gamma(3, 2)
    true_dist = stats.gamma(true_shape, scale=true_scale)
    n_samples = 10000
    samples = true_dist.rvs(n_samples)

    # True values
    true_mean = true_dist.mean()
    true_median = true_dist.median()
    true_mode = (true_shape - 1) * true_scale  # For Gamma with shape > 1

    # Estimated from samples
    est_mean = np.mean(samples)
    est_median = np.median(samples)

    # Estimate mode via KDE
    kde = gaussian_kde(samples)
    x_grid = np.linspace(0, np.percentile(samples, 99), 500)
    density = kde(x_grid)
    est_mode_kde = x_grid[np.argmax(density)]

    # Half-sample mode (robust estimator)
    def half_sample_mode(data, max_iter=100):
        """Robertson-Cryer half-sample mode."""
        data = np.sort(data)
        n = len(data)
        for _ in range(max_iter):
            if n <= 3:
                break
            half_n = n // 2 + 1
            widths = data[half_n-1:] - data[:n-half_n+1]
            min_idx = np.argmin(widths)
            data = data[min_idx:min_idx+half_n]
            n = len(data)
        return np.mean(data)

    est_mode_hsm = half_sample_mode(samples)

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Histogram with estimates
    ax = axes[0]
    ax.hist(samples, bins=50, density=True, alpha=0.6, color='gray',
            label='MCMC samples')
    ax.plot(x_grid, density, 'b-', linewidth=2, label='KDE')
    ax.axvline(est_mean, color='green', linestyle='-', linewidth=2,
               label=f'Est. mean={est_mean:.3f}')
    ax.axvline(est_median, color='orange', linestyle='--', linewidth=2,
               label=f'Est. median={est_median:.3f}')
    ax.axvline(est_mode_kde, color='red', linestyle=':', linewidth=2,
               label=f'Est. mode (KDE)={est_mode_kde:.3f}')
    ax.axvline(est_mode_hsm, color='purple', linestyle='-.', linewidth=2,
               label=f'Est. mode (HSM)={est_mode_hsm:.3f}')
    ax.set_xlabel('θ')
    ax.set_ylabel('Density')
    ax.set_title('Point Estimates from MCMC Samples')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

    # Plot 2: Comparison with true values
    ax = axes[1]
    categories = ['Mean', 'Median', 'Mode (KDE)', 'Mode (HSM)']
    true_vals = [true_mean, true_median, true_mode, true_mode]
    est_vals = [est_mean, est_median, est_mode_kde, est_mode_hsm]
    errors = [abs(e - t) for e, t in zip(est_vals, true_vals)]
    x_pos = np.arange(len(categories))
    width = 0.35
    ax.bar(x_pos - width/2, true_vals, width, label='True', alpha=0.8, color='blue')
    ax.bar(x_pos + width/2, est_vals, width, label='Estimated', alpha=0.8, color='green')

    # Add error annotations
    for i, (x, err) in enumerate(zip(x_pos, errors)):
        ax.annotate(f'Err: {err:.3f}', (x + width/2, est_vals[i] + 0.2),
                    ha='center', fontsize=9)
    ax.set_xlabel('Estimate Type')
    ax.set_ylabel('Value')
    ax.set_title('Comparison: True vs Estimated')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(categories)
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()

    # Print summary
    print("=" * 60)
    print("POINT ESTIMATES FROM MCMC SAMPLES")
    print("=" * 60)
    print(f"True distribution: Gamma({true_shape}, scale={true_scale})")
    print(f"Number of MCMC samples: {n_samples}")
    print()
    print(f"{'Estimate':<20} {'True Value':<15} {'Estimated':<15} {'Error':<15}")
    print("-" * 60)
    print(f"{'Mean':<20} {true_mean:<15.4f} {est_mean:<15.4f} {abs(est_mean - true_mean):<15.4f}")
    print(f"{'Median':<20} {true_median:<15.4f} {est_median:<15.4f} {abs(est_median - true_median):<15.4f}")
    print(f"{'Mode (KDE)':<20} {true_mode:<15.4f} {est_mode_kde:<15.4f} {abs(est_mode_kde - true_mode):<15.4f}")
    print(f"{'Mode (Half-sample)':<20} {true_mode:<15.4f} {est_mode_hsm:<15.4f} {abs(est_mode_hsm - true_mode):<15.4f}")

extract_point_estimates_from_samples()
```

The Monte Carlo standard error of the posterior mean is σ/√N, where σ is the posterior standard deviation and N is the effective sample size (accounting for MCMC autocorrelation). Always report this uncertainty when presenting MCMC-based estimates.
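As a rough sketch of that calculation: the AR(1) chain below stands in for autocorrelated MCMC output, and the truncated autocorrelation sum is just one simple approximation of the effective sample size, not the only (or most robust) choice.

```python
import numpy as np

# Simulate an autocorrelated "MCMC" chain (AR(1) with correlation 0.9).
rng = np.random.default_rng(1)
n, rho = 20_000, 0.9
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

def effective_sample_size(x, max_lag=200):
    """ESS from summed autocorrelations, truncated at the first negative lag."""
    x = x - x.mean()
    acf = np.array([np.dot(x[:-k], x[k:]) / np.dot(x, x) for k in range(1, max_lag)])
    if np.any(acf < 0):
        acf = acf[:np.argmax(acf < 0)]
    return len(x) / (1 + 2 * acf.sum())

ess = effective_sample_size(chain)
mcse = chain.std(ddof=1) / np.sqrt(ess)   # σ / √N with N = effective sample size
print(f"posterior mean ≈ {chain.mean():.4f} ± {mcse:.4f} (MCSE), ESS ≈ {ess:.0f}")
```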
Point estimates are useful summaries, but they sacrifice information. Understanding these losses helps you decide when a point estimate suffices and when the full posterior is needed.
What's Lost:
1. Uncertainty Quantification
A point estimate θ̂ = 0.5 tells you nothing about whether this is a confident conclusion or a wild guess. Was the posterior extremely peaked at 0.5, or broad and flat? Only credible intervals or the full distribution convey this.
2. Posterior Shape
Bimodality, skewness, heavy tails—all are invisible in a single number. For multimodal posteriors, the mean might lie in a region of LOW posterior probability!
3. Correlations (Multivariate)
For multiple parameters, point estimates lose all information about posterior correlations. Knowing β₁ ≈ 2 and β₂ ≈ 3 tells you nothing about whether they're positively or negatively correlated.
4. Transformation Properties
The mean of θ² is NOT (mean of θ)². For nonlinear functions of parameters, point estimates can be misleading. The full posterior allows computing E[g(θ)|D] correctly.
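A small illustration of the gap; Beta(2, 5) is an arbitrary stand-in posterior.

```python
import numpy as np
from scipy import stats

# E[g(θ)|D] versus g(E[θ|D]) for the nonlinear g(θ) = θ².
samples = stats.beta(2, 5).rvs(size=200_000, random_state=0)

mean_of_square = np.mean(samples**2)   # E[θ²|D], computed from the posterior
square_of_mean = np.mean(samples)**2   # (E[θ|D])², the plug-in shortcut

print(f"E[θ²|D]   = {mean_of_square:.4f}")
print(f"(E[θ|D])² = {square_of_mean:.4f}")
# They differ by Var(θ|D); only the full posterior gives the first quantity.
```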
Using a point estimate θ̂ as if it were the true parameter (e.g., for prediction) is the 'plug-in' approach. This underestimates uncertainty because it ignores that θ̂ is itself uncertain. For predictions with proper uncertainty, integrate over the posterior: p(y*|D) = ∫p(y*|θ)p(θ|D)dθ.
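A minimal sketch of the difference, assuming a Beta(5, 5) posterior over a success probability and 10 future Bernoulli trials (illustrative numbers only): the plug-in approach understates the predictive spread because it ignores uncertainty in θ.

```python
import numpy as np

rng = np.random.default_rng(0)
post_a, post_b, m = 5, 5, 10
theta_hat = post_a / (post_a + post_b)          # posterior mean as plug-in value

# Plug-in: treat theta_hat as if it were the true parameter.
plugin_sd = np.sqrt(m * theta_hat * (1 - theta_hat))

# Posterior predictive: average the binomial over posterior draws of theta.
theta_draws = rng.beta(post_a, post_b, size=100_000)
y_draws = rng.binomial(m, theta_draws)

print(f"plug-in predictive sd   : {plugin_sd:.3f}")
print(f"posterior predictive sd : {y_draws.std():.3f}")   # larger: θ is uncertain too
```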
When Point Estimates Suffice:
- The posterior is approximately symmetric and unimodal, so mean, median, and mode nearly coincide.
- Uncertainty is reported separately (e.g., a credible interval) and the point estimate is only a headline summary.
- Downstream use is roughly linear in θ, so plugging in the mean introduces little distortion.

When Full Posteriors Are Essential:
- Decision problems with asymmetric or otherwise nonstandard loss functions.
- Predictions that must propagate parameter uncertainty (posterior predictive distributions).
- Skewed, heavy-tailed, or multimodal posteriors, where any single number misleads.
- Multivariate problems where correlations between parameters drive the conclusions.
Best Practice:
Report point estimates alongside credible intervals. A statement like 'the posterior mean is 0.65 with 95% CI [0.52, 0.77]' is far more informative than 'the estimate is 0.65'.
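In code, that report is a one-liner once you have posterior samples; the Beta(13, 8) posterior below is illustrative.

```python
import numpy as np
from scipy import stats

# Posterior samples (here drawn directly from a known Beta posterior).
samples = stats.beta(13, 8).rvs(size=50_000, random_state=42)

mean = samples.mean()
lo, hi = np.percentile(samples, [2.5, 97.5])   # central 95% credible interval
print(f"posterior mean {mean:.2f} with 95% CI [{lo:.2f}, {hi:.2f}]")
```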
Point estimates distill posterior distributions into single numbers, each optimizing a different loss function. Let's consolidate the key concepts:
- The posterior mean minimizes expected squared error, uses the entire posterior, and has a shrinkage interpretation in conjugate models.
- The posterior median minimizes expected absolute error and is robust for skewed or heavy-tailed posteriors.
- The MAP estimate (posterior mode) corresponds to 0-1 loss and equals penalized maximum likelihood, with the log-prior acting as the regularizer.
- All three converge to the MLE as data accumulate; they differ most for small samples, strong priors, and asymmetric or multimodal posteriors.
- Every point estimate discards uncertainty, shape, and correlation information, so pair it with a credible interval or use the full posterior when those matter.
Module Complete:
You have now completed the foundational Bayesian Framework module. You understand:
- how priors encode existing knowledge and how their strength affects inference;
- how likelihoods link parameters to observed data;
- how Bayes' theorem combines prior and likelihood into a posterior, and how posteriors update sequentially as evidence arrives;
- how to summarize posteriors with point estimates and credible intervals, and what each summary trades away.
With these foundations, you're prepared to explore conjugate priors, inference algorithms (MCMC, variational inference), and applications of Bayesian methods to regression, classification, and beyond.
Congratulations! You've mastered the Bayesian Framework—the philosophical and mathematical foundations of principled probabilistic reasoning. You can now specify priors, construct likelihoods, derive posteriors, update beliefs with new evidence, and extract meaningful point estimates. This foundation supports all advanced Bayesian methods covered in subsequent modules.