The posterior distribution encapsulates everything we know about a parameter after observing data. But in practice, we often need a single number—a point estimate—for communication, downstream processing, or decision-making. How do we responsibly compress an entire distribution into one value?
This seemingly simple question reveals deep connections between Bayesian inference, decision theory, regularization, and frequentist estimation. Different point estimates answer different implicit questions and optimize different loss functions. Understanding these distinctions ensures you choose the right summary for your purpose.
In this page, we explore the three major Bayesian point estimates—the Maximum A Posteriori (MAP), posterior mean, and posterior median—along with their properties, computation methods, and connections to familiar non-Bayesian techniques.
By the end of this page, you will understand when to use MAP vs mean vs median estimates, derive the decision-theoretic justification for each, recognize MAP as regularized maximum likelihood, compute point estimates from MCMC samples, and appreciate what is lost when summarizing posteriors with single numbers.
Before examining specific point estimates, let's establish the framework that unifies them all: decision theory.
The Setup:
We have observed data D and formed a posterior p(θ|D), but must report a single estimate θ̂. A loss function L(θ, θ̂) quantifies how costly it is to report θ̂ when the true parameter value is θ.
The Bayes Estimator:
The optimal estimate under a given loss function is the one minimizing expected posterior loss:
$$\hat{\theta}_{\text{Bayes}} = \arg\min_{\hat{\theta}} E_{\theta|D}[L(\theta, \hat{\theta})] = \arg\min_{\hat{\theta}} \int L(\theta, \hat{\theta}) \cdot p(\theta|D) \, d\theta$$
Key Insight:
Different loss functions yield different optimal estimates:
| Loss Function | Bayes Estimator | Intuition |
|---|---|---|
| Squared: L = (θ - â)² | Posterior Mean | Penalizes large errors heavily |
| Absolute: L = \|θ - â\| | Posterior Median | Robust to outliers |
| 0-1: L = 𝟙(θ ≠ â) | Posterior Mode (MAP) | 'All-or-nothing' accuracy |
There's no universally optimal point estimate because different situations have different loss structures. The mean is optimal for squared error; the median for absolute error; the mode for 0-1 loss. Choosing an estimate IS choosing a loss function, implicitly or explicitly.
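The correspondence between losses and estimators can be checked numerically. The sketch below uses a Beta(2, 8) posterior chosen purely for illustration: it minimizes the Monte Carlo estimate of each expected loss over a grid of candidate estimates and recovers the posterior mean and median (up to grid and sampling error).

```python
import numpy as np
from scipy import stats

# Draw samples from a skewed "posterior" (Beta(2, 8) is an arbitrary choice)
# and minimize each expected loss over a grid of candidate estimates.
posterior_samples = stats.beta(2, 8).rvs(size=20_000, random_state=42)
candidates = np.linspace(0, 1, 1001)

# Expected loss for each candidate, approximated by a sample average.
squared_loss = [np.mean((posterior_samples - c) ** 2) for c in candidates]
absolute_loss = [np.mean(np.abs(posterior_samples - c)) for c in candidates]

print(f"argmin squared loss : {candidates[np.argmin(squared_loss)]:.3f} "
      f"(posterior mean ≈ {posterior_samples.mean():.3f})")
print(f"argmin absolute loss: {candidates[np.argmin(absolute_loss)]:.3f} "
      f"(posterior median ≈ {np.median(posterior_samples):.3f})")
```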
Proving the Results:
Squared Error → Mean:
$$\frac{d}{d\hat{\theta}} E[(\theta - \hat{\theta})^2] = \frac{d}{d\hat{\theta}} E[\theta^2 - 2\theta\hat{\theta} + \hat{\theta}^2] = -2E[\theta] + 2\hat{\theta} = 0$$
$$\implies \hat{\theta}^* = E[\theta|D]$$
Absolute Error → Median:
The expected absolute loss is minimized when â splits the posterior into equal halves—i.e., at the median. This follows because the derivative of E[|θ - â|] changes sign exactly at the median.
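Concretely, splitting the integral at θ̂ makes the sign change explicit:

$$\frac{d}{d\hat{\theta}} E_{\theta|D}[|\theta - \hat{\theta}|] = \frac{d}{d\hat{\theta}} \left[ \int_{-\infty}^{\hat{\theta}} (\hat{\theta} - \theta)\, p(\theta|D)\, d\theta + \int_{\hat{\theta}}^{\infty} (\theta - \hat{\theta})\, p(\theta|D)\, d\theta \right] = P(\theta \leq \hat{\theta}|D) - P(\theta > \hat{\theta}|D)$$

Setting this to zero gives P(θ ≤ θ̂|D) = 1/2, which is exactly the definition of the median.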
0-1 Loss → Mode:
With 0-1 loss, we only care about getting it exactly right. The probability of 'guessing correctly' is maximized at the posterior mode—the most probable value.
Note: For continuous posteriors, P(θ = â) = 0 for any specific value, so 0-1 loss is technically problematic. We interpret it as equivalent to maximizing posterior density.
The MAP estimate is the posterior mode—the parameter value with highest posterior density:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta | D) = \arg\max_{\theta} \left[ p(D|\theta) \cdot p(\theta) \right]$$
Since the normalizing constant p(D) doesn't depend on θ, it doesn't affect the optimization.
Log-Form:
In practice, we optimize the log-posterior:
$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ \log p(D|\theta) + \log p(\theta) \right] = \arg\max_{\theta} \left[ \ell(\theta) + \log \pi(\theta) \right]$$
where ℓ(θ) is the log-likelihood and π(θ) is the prior.
MAP = Penalized Maximum Likelihood:
This form reveals a crucial connection: MAP estimation is equivalent to regularized or penalized maximum likelihood estimation!
The log-prior acts as a penalty on parameter values, preventing solutions that are a priori implausible.
| Prior Distribution | Regularization Type | Penalty Term | Effect |
|---|---|---|---|
| N(0, σ²) on weights | L2 / Ridge | λ‖θ‖² where λ = 1/(2σ²) | Shrinks weights toward zero |
| Laplace(0, b) on weights | L1 / Lasso | λ‖θ‖₁ where λ = 1/b | Encourages sparsity |
| Uniform (flat) | None (pure MLE) | 0 | No regularization |
| Horseshoe prior | Adaptive sparsity | Complex | Strong shrinkage + heavy tails |
| Spike-and-slab | Subset selection | Implicit | Exact zeros or full weights |
```python
import numpy as np
from scipy import optimize, stats
import matplotlib.pyplot as plt

def demonstrate_map_regularization():
    """
    Show that MAP = regularized MLE through a ridge regression example.
    """
    np.random.seed(42)

    # Generate data: y = Xβ + ε, with collinear features
    n, p = 50, 10
    X = np.random.randn(n, p)
    X[:, 1] = X[:, 0] + 0.1 * np.random.randn(n)  # Create collinearity
    true_beta = np.array([1, -1, 0.5, 0, 0, 0, 0, 0, 0, 0])
    y = X @ true_beta + 0.5 * np.random.randn(n)

    # Method 1: OLS (MLE with flat prior)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # Method 2: Ridge regression (L2 penalty on the sum of squared errors)
    lambda_ridge = 0.5
    beta_ridge = np.linalg.solve(X.T @ X + lambda_ridge * np.eye(p), X.T @ y)

    # Method 3: MAP with Normal(0, σ_p²) prior (equivalent to ridge)
    # Matching ridge applied to the SSE requires σ_p² = σ_noise² / λ
    noise_variance = 0.5**2  # Assume known
    prior_variance = noise_variance / lambda_ridge

    def neg_log_posterior(beta, X, y, noise_var, prior_var):
        """Negative log-posterior (for minimization)."""
        residuals = y - X @ beta
        log_likelihood = -0.5 * np.sum(residuals**2) / noise_var
        log_prior = -0.5 * np.sum(beta**2) / prior_var
        return -(log_likelihood + log_prior)

    result = optimize.minimize(
        neg_log_posterior,
        np.zeros(p),
        args=(X, y, noise_variance, prior_variance),
        method='L-BFGS-B'
    )
    beta_map = result.x

    # Verify equivalence
    print("=" * 60)
    print("MAP = REGULARIZED MLE DEMONSTRATION")
    print("=" * 60)
    print()
    print(f"{'β':<8} {'True':<10} {'OLS':<10} {'Ridge':<10} {'MAP':<10}")
    print("-" * 60)
    for i in range(p):
        print(f"β_{i:<5} {true_beta[i]:<10.4f} {beta_ols[i]:<10.4f} "
              f"{beta_ridge[i]:<10.4f} {beta_map[i]:<10.4f}")
    print()
    print(f"Ridge λ = {lambda_ridge}, Prior variance σ_p² = {prior_variance:.4f}")
    print(f"Max |Ridge - MAP| = {np.max(np.abs(beta_ridge - beta_map)):.2e}")
    print("Ridge and MAP are IDENTICAL (within numerical precision)!")

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Coefficient comparison
    ax = axes[0]
    x_pos = np.arange(p)
    width = 0.2
    ax.bar(x_pos - 1.5*width, true_beta, width, label='True', alpha=0.8)
    ax.bar(x_pos - 0.5*width, beta_ols, width, label='OLS (flat prior)', alpha=0.8)
    ax.bar(x_pos + 0.5*width, beta_ridge, width, label='Ridge', alpha=0.8)
    ax.bar(x_pos + 1.5*width, beta_map, width, label='MAP', alpha=0.8)
    ax.set_xlabel('Coefficient index')
    ax.set_ylabel('Coefficient value')
    ax.set_title('Coefficient Estimates: OLS vs Ridge vs MAP')
    ax.legend()
    ax.set_xticks(x_pos)
    ax.set_xticklabels([f'β_{i}' for i in range(p)])
    ax.grid(True, alpha=0.3, axis='y')

    # Plot 2: Effect of regularization strength
    ax = axes[1]
    lambdas = np.logspace(-3, 2, 50)
    beta_paths = []
    for lam in lambdas:
        beta_l = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        beta_paths.append(beta_l)
    beta_paths = np.array(beta_paths)
    for i in range(p):
        ax.plot(lambdas, beta_paths[:, i], linewidth=2,
                label=f'β_{i}' if i < 5 else None)
    ax.axhline(0, color='black', linewidth=0.5)
    ax.axvline(lambda_ridge, color='red', linestyle='--', label='Selected λ')
    ax.set_xscale('log')
    ax.set_xlabel('λ (regularization strength)')
    ax.set_ylabel('Coefficient value')
    ax.set_title('Ridge Regularization Path\n(Stronger prior → more shrinkage)')
    ax.legend(fontsize=8)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()

demonstrate_map_regularization()
```

Since MAP reduces to optimization, we can use all the machinery of numerical optimization: gradient descent, Newton's method, L-BFGS, etc. This makes MAP computationally tractable even when full Bayesian inference (computing the entire posterior) is not.
Properties of MAP Estimation:
Advantages:
- Requires only optimization rather than integration, so it remains feasible when computing the full posterior is not (as noted after the code example above).
- Makes the link between priors and regularization penalties explicit.
- Reduces to the MLE under a flat prior, which eases comparison with non-Bayesian fits.

Disadvantages:
- Ignores posterior spread and shape: two very different posteriors can share the same mode.
- Not invariant under reparameterization; the mode of a transformed parameter is generally not the transform of the mode.
- Can sit in an unrepresentative region of a skewed or multimodal posterior, and may be poorly determined when the posterior is flat near its peak.
The Bayesian Critique:
From a fully Bayesian perspective, MAP is often considered suboptimal because it discards information about posterior uncertainty. The mode tells you where the peak is, but not how spread out the distribution is around it.
The posterior mean is perhaps the most commonly used Bayesian point estimate:
$$\hat{\theta}_{\text{mean}} = E[\theta | D] = \int \theta \cdot p(\theta | D) \, d\theta$$
Decision-Theoretic Justification:
As shown earlier, the posterior mean minimizes expected squared error loss:
$$\hat{\theta}_{\text{mean}} = \arg\min_{\hat{\theta}} E_{\theta|D}[(\theta - \hat{\theta})^2]$$
Squared error penalizes large mistakes heavily, making the mean optimal when big errors are catastrophic.
Properties:
- Uses the entire posterior: every value of θ contributes, weighted by its posterior density.
- Sensitive to skew and heavy tails, and may not even exist for extremely heavy-tailed posteriors (e.g., Cauchy-like).
- Linear: the posterior mean of a sum of parameters is the sum of their posterior means.
- Trivial to estimate from MCMC output by averaging the samples.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def posterior_mean_analysis():
    """
    Analyze posterior mean vs MAP for different posterior shapes.
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    theta = np.linspace(0, 1, 1000)

    scenarios = [
        ("Symmetric: Beta(8, 8)", stats.beta(8, 8)),
        ("Right-skewed: Beta(2, 8)", stats.beta(2, 8)),
        ("Left-skewed: Beta(8, 2)", stats.beta(8, 2)),
        ("Bimodal: Mixture", None),  # Custom
    ]

    for ax, (title, dist) in zip(axes.flatten(), scenarios):
        if dist is not None:
            pdf = dist.pdf(theta)
            mean = dist.mean()
            median = dist.median()
            mode = ((dist.args[0] - 1) / (dist.args[0] + dist.args[1] - 2)
                    if dist.args[0] > 1 and dist.args[1] > 1 else None)
        else:
            # Create bimodal mixture of two Betas
            dist1 = stats.beta(5, 20)
            dist2 = stats.beta(20, 5)
            pdf = 0.5 * dist1.pdf(theta) + 0.5 * dist2.pdf(theta)
            # Mean of the mixture
            mean = 0.5 * dist1.mean() + 0.5 * dist2.mean()
            # Mode is at one of the component modes
            mode = theta[np.argmax(pdf)]
            # Median (approximate, from the mixture CDF on the grid)
            cdf = 0.5 * dist1.cdf(theta) + 0.5 * dist2.cdf(theta)
            median_idx = np.argmin(np.abs(cdf - 0.5))
            median = theta[median_idx]

        # Plot
        ax.fill_between(theta, pdf, alpha=0.3, color='blue')
        ax.plot(theta, pdf, 'b-', linewidth=2)
        if mode is not None:
            ax.axvline(mode, color='red', linestyle=':', linewidth=2,
                       label=f'Mode (MAP) = {mode:.3f}')
        ax.axvline(mean, color='green', linestyle='-', linewidth=2,
                   label=f'Mean = {mean:.3f}')
        ax.axvline(median, color='orange', linestyle='--', linewidth=2,
                   label=f'Median = {median:.3f}')
        ax.set_xlabel('θ')
        ax.set_ylabel('p(θ|D)')
        ax.set_title(title)
        ax.legend(fontsize=9)
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)

    plt.suptitle('Posterior Mean vs Mode vs Median\nfor Different Posterior Shapes',
                 fontsize=14)
    plt.tight_layout()

    # Show MCMC estimation of the posterior mean
    print("=" * 60)
    print("POSTERIOR MEAN FROM MCMC SAMPLES")
    print("=" * 60)

    # Simulate MCMC samples from Beta(3, 7)
    true_dist = stats.beta(3, 7)
    sample_sizes = [100, 1000, 10000, 100000]

    print(f"True posterior mean: {true_dist.mean():.6f}")
    print()
    print(f"{'N samples':<15} {'Estimated Mean':<20} {'Error':<20} {'MC Std Error':<15}")
    print("-" * 60)
    for n in sample_sizes:
        np.random.seed(42)
        samples = true_dist.rvs(n)
        estimated_mean = np.mean(samples)
        error = abs(estimated_mean - true_dist.mean())
        mc_std_error = np.std(samples) / np.sqrt(n)
        print(f"{n:<15} {estimated_mean:<20.6f} {error:<20.6f} {mc_std_error:<15.6f}")
    print()
    print("Monte Carlo standard error decreases as 1/√n")

posterior_mean_analysis()
```

Posterior Mean as Shrinkage:
For conjugate models, the posterior mean often has a beautiful shrinkage interpretation:
$$\hat{\theta}_{\text{mean}} = \lambda \cdot \hat{\theta}_{\text{prior}} + (1 - \lambda) \cdot \hat{\theta}_{\text{MLE}}$$
where λ depends on relative prior strength vs. data strength.
For example, in the Beta-Binomial model with prior Beta(α, β) and k successes in n trials:
$$E[\theta | D] = \frac{\alpha + k}{\alpha + \beta + n} = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} + \frac{n}{\alpha + \beta + n} \cdot \frac{k}{n}$$
The posterior mean is literally a weighted average of prior mean and MLE, with weights proportional to 'sample sizes' (prior pseudocounts vs. actual observations).
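A quick numerical check of this identity; the Beta(4, 4) prior and the 7-successes-in-10-trials data below are illustrative numbers, not taken from the text above.

```python
import numpy as np

# Posterior mean as a weighted average of prior mean and MLE (Beta-Binomial).
alpha, beta_ = 4, 4   # illustrative prior pseudocounts
k, n = 7, 10          # illustrative data: 7 successes in 10 trials

prior_mean = alpha / (alpha + beta_)
mle = k / n
weight_prior = (alpha + beta_) / (alpha + beta_ + n)

posterior_mean_direct = (alpha + k) / (alpha + beta_ + n)
posterior_mean_shrunk = weight_prior * prior_mean + (1 - weight_prior) * mle

print(f"direct:    {posterior_mean_direct:.4f}")   # (4 + 7) / (8 + 10) ≈ 0.6111
print(f"shrinkage: {posterior_mean_shrunk:.4f}")   # identical, by the algebra above
```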
The posterior median is the value that splits the posterior probability in half:
$$P(\theta \leq \hat{\theta}_{\text{median}} | D) = P(\theta \geq \hat{\theta}_{\text{median}} | D) = 0.5$$
Decision-Theoretic Justification:
The posterior median minimizes expected absolute error loss:
$$\hat{\theta}_{\text{median}} = \arg\min_{\hat{\theta}} E_{\theta|D}[|\theta - \hat{\theta}|]$$
Absolute error penalizes all errors linearly, regardless of magnitude, making the median more robust to extreme values.
Properties:
- Robust to heavy tails and outlying posterior mass, and it always exists.
- Equivariant under monotonic transformations: for strictly increasing g, the median of g(θ) is g applied to the median of θ.
- Coincides with the mean and mode for symmetric unimodal posteriors.
- Read directly off MCMC output as the 50th percentile of the samples.
When to Use Median:
For positive parameters like variances, the posterior is often right-skewed. The posterior mean can be pulled upward by heavy tails, while the median provides a more 'typical' value. This is one reason the median is sometimes preferred for scale parameters.
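As a sketch of that effect, consider a right-skewed posterior for a variance-like parameter; the Inverse-Gamma(3, scale=10) below is purely illustrative.

```python
from scipy import stats

# A right-skewed posterior for a variance-like parameter (illustrative choice).
posterior = stats.invgamma(a=3, scale=10)

print(f"posterior mean  : {posterior.mean():.3f}")    # scale / (a - 1) = 5.0
print(f"posterior median: {posterior.median():.3f}")  # between mode and mean
print(f"posterior mode  : {10 / (3 + 1):.3f}")        # scale / (a + 1) = 2.5
# The heavy right tail pulls the mean well above the median and mode,
# which is why the median is often preferred for scale parameters.
```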
Let's systematically compare the three main point estimates across different posterior shapes and see when they diverge significantly.
| Posterior Shape | Mean | Median | Mode | Which Differs Most? |
|---|---|---|---|---|
| Symmetric, unimodal | All equal | All equal | All equal | None |
| Right-skewed | Highest | Middle | Lowest | Mean (pulled by tail) |
| Left-skewed | Lowest | Middle | Highest | Mean (pulled by tail) |
| Heavy-tailed | May not exist | Exists | Exists | Mean (undefined for Cauchy) |
| Bimodal | Between modes | Between modes | One of the modes | Mode (chooses one) |
| Uniform | Center | Center | Entire interval | Mode (not unique) |
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def comprehensive_comparison():
    """
    Compare all three point estimates across scenarios.
    """
    # Scenario: different amounts of data, same true parameter
    true_theta = 0.6
    prior_a, prior_b = 2, 2  # Weakly informative prior

    data_scenarios = [
        (0, 0, "Prior only"),
        (3, 2, "5 trials: 3 heads"),
        (12, 8, "20 trials: 12 heads"),
        (60, 40, "100 trials: 60 heads"),
        (600, 400, "1000 trials: 600 heads"),
    ]

    theta = np.linspace(0, 1, 1000)
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    results = []

    for i, (k, n_minus_k, label) in enumerate(data_scenarios):
        n = k + n_minus_k
        post_a = prior_a + k
        post_b = prior_b + n_minus_k
        dist = stats.beta(post_a, post_b)
        pdf = dist.pdf(theta)

        # Compute estimates
        mean = dist.mean()
        median = dist.median()
        mode = ((post_a - 1) / (post_a + post_b - 2)
                if post_a > 1 and post_b > 1 else 0.5)
        mle = k / n if n > 0 else 0.5

        results.append({'label': label, 'n': n, 'mean': mean,
                        'median': median, 'mode': mode, 'mle': mle})

        # Plot
        ax = axes[i]
        ax.fill_between(theta, pdf, alpha=0.3, color='blue')
        ax.plot(theta, pdf, 'b-', linewidth=2)
        ax.axvline(mean, color='green', linestyle='-', linewidth=2,
                   label=f'Mean={mean:.3f}')
        ax.axvline(median, color='orange', linestyle='--', linewidth=2,
                   label=f'Median={median:.3f}')
        ax.axvline(mode, color='red', linestyle=':', linewidth=2,
                   label=f'Mode={mode:.3f}')
        if n > 0:
            ax.axvline(mle, color='purple', linestyle='-.', linewidth=1.5,
                       alpha=0.7, label=f'MLE={mle:.3f}')
        ax.axvline(true_theta, color='black', linestyle='-', alpha=0.5,
                   linewidth=1, label=f'True θ={true_theta}')
        ax.set_xlabel('θ')
        ax.set_ylabel('Density')
        ax.set_title(f'{label}\nBeta({post_a}, {post_b})')
        ax.legend(fontsize=8, loc='upper right')
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)

    # Summary in last panel
    ax = axes[5]
    ax.axis('off')
    summary_text = "SUMMARY: Point Estimate Convergence\n" + "=" * 40 + "\n\n"
    summary_text += f"{'Scenario':<25} {'Mean':<8} {'Median':<8} {'Mode':<8} {'MLE':<8}\n"
    summary_text += "-" * 55 + "\n"
    for r in results:
        summary_text += (f"{r['label']:<25} {r['mean']:.4f}   {r['median']:.4f}   "
                         f"{r['mode']:.4f}   {r['mle']:.4f}\n")
    summary_text += "\n" + "-" * 55 + "\n"
    summary_text += "As n → ∞, all estimates → MLE → True θ"
    ax.text(0.1, 0.5, summary_text, transform=ax.transAxes, fontsize=10,
            verticalalignment='center', fontfamily='monospace',
            bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.5))

    plt.suptitle('Point Estimate Convergence with Increasing Data', fontsize=14)
    plt.tight_layout()

comprehensive_comparison()
```

Asymptotic Equivalence:
As sample size n → ∞:
- The likelihood overwhelms the prior, and the MAP, mean, and median all converge to the MLE (and, under standard regularity conditions, to the true parameter).
- The posterior becomes approximately Gaussian around its peak (the Bernstein-von Mises phenomenon), so the three estimates coincide asymptotically.

Differences between point estimates are most pronounced when:
- the sample is small or the prior is strong relative to the data;
- the posterior is skewed, heavy-tailed, or multimodal.
For complex models, we typically have MCMC samples rather than closed-form posteriors. Here's how to extract point estimates from samples:
From MCMC Samples:
Given samples θ⁽¹⁾, θ⁽²⁾, ..., θ⁽ᴺ⁾ from the posterior:
- Posterior mean: the sample average (1/N) Σₛ θ⁽ˢ⁾.
- Posterior median: the 50th percentile of the samples.
- Quantile-based credible intervals: empirical quantiles of the samples (e.g., the 2.5th and 97.5th percentiles for a central 95% interval).
Estimating the Mode from Samples:
The mode (MAP) is tricky because it requires estimating the density peak, not just a sample average. Approaches include:
- Kernel density estimation (KDE): fit a KDE to the samples and take its argmax on a grid.
- The half-sample mode (Robertson-Cryer): repeatedly restrict attention to the densest half of the sorted samples.
- Bypassing the samples entirely and optimizing the log-posterior directly.
Note: For high-dimensional parameters, KDE-based mode estimation becomes unreliable.
```python
import numpy as np
from scipy import stats
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def extract_point_estimates_from_samples():
    """
    Demonstrate extracting point estimates from MCMC samples.
    """
    np.random.seed(42)

    # Simulate MCMC samples from a skewed posterior (Gamma as example)
    true_shape, true_scale = 3, 2  # Gamma(3, 2)
    true_dist = stats.gamma(true_shape, scale=true_scale)
    n_samples = 10000
    samples = true_dist.rvs(n_samples)

    # True values
    true_mean = true_dist.mean()
    true_median = true_dist.median()
    true_mode = (true_shape - 1) * true_scale  # For Gamma with shape > 1

    # Estimated from samples
    est_mean = np.mean(samples)
    est_median = np.median(samples)

    # Estimate mode via KDE
    kde = gaussian_kde(samples)
    x_grid = np.linspace(0, np.percentile(samples, 99), 500)
    density = kde(x_grid)
    est_mode_kde = x_grid[np.argmax(density)]

    # Half-sample mode (robust estimator)
    def half_sample_mode(data, max_iter=100):
        """Robertson-Cryer half-sample mode."""
        data = np.sort(data)
        n = len(data)
        for _ in range(max_iter):
            if n <= 3:
                break
            half_n = n // 2 + 1
            widths = data[half_n-1:] - data[:n-half_n+1]
            min_idx = np.argmin(widths)
            data = data[min_idx:min_idx+half_n]
            n = len(data)
        return np.mean(data)

    est_mode_hsm = half_sample_mode(samples)

    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1: Histogram with estimates
    ax = axes[0]
    ax.hist(samples, bins=50, density=True, alpha=0.6, color='gray',
            label='MCMC samples')
    ax.plot(x_grid, density, 'b-', linewidth=2, label='KDE')
    ax.axvline(est_mean, color='green', linestyle='-', linewidth=2,
               label=f'Est. mean={est_mean:.3f}')
    ax.axvline(est_median, color='orange', linestyle='--', linewidth=2,
               label=f'Est. median={est_median:.3f}')
    ax.axvline(est_mode_kde, color='red', linestyle=':', linewidth=2,
               label=f'Est. mode (KDE)={est_mode_kde:.3f}')
    ax.axvline(est_mode_hsm, color='purple', linestyle='-.', linewidth=2,
               label=f'Est. mode (HSM)={est_mode_hsm:.3f}')
    ax.set_xlabel('θ')
    ax.set_ylabel('Density')
    ax.set_title('Point Estimates from MCMC Samples')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

    # Plot 2: Comparison with true values
    ax = axes[1]
    categories = ['Mean', 'Median', 'Mode (KDE)', 'Mode (HSM)']
    true_vals = [true_mean, true_median, true_mode, true_mode]
    est_vals = [est_mean, est_median, est_mode_kde, est_mode_hsm]
    errors = [abs(e - t) for e, t in zip(est_vals, true_vals)]
    x_pos = np.arange(len(categories))
    width = 0.35
    ax.bar(x_pos - width/2, true_vals, width, label='True', alpha=0.8, color='blue')
    ax.bar(x_pos + width/2, est_vals, width, label='Estimated', alpha=0.8, color='green')

    # Add error annotations
    for i, (x, err) in enumerate(zip(x_pos, errors)):
        ax.annotate(f'Err: {err:.3f}', (x + width/2, est_vals[i] + 0.2),
                    ha='center', fontsize=9)
    ax.set_xlabel('Estimate Type')
    ax.set_ylabel('Value')
    ax.set_title('Comparison: True vs Estimated')
    ax.set_xticks(x_pos)
    ax.set_xticklabels(categories)
    ax.legend()
    ax.grid(True, alpha=0.3, axis='y')

    plt.tight_layout()

    # Print summary
    print("=" * 60)
    print("POINT ESTIMATES FROM MCMC SAMPLES")
    print("=" * 60)
    print(f"True distribution: Gamma({true_shape}, scale={true_scale})")
    print(f"Number of MCMC samples: {n_samples}")
    print()
    print(f"{'Estimate':<20} {'True Value':<15} {'Estimated':<15} {'Error':<15}")
    print("-" * 60)
    print(f"{'Mean':<20} {true_mean:<15.4f} {est_mean:<15.4f} {abs(est_mean - true_mean):<15.4f}")
    print(f"{'Median':<20} {true_median:<15.4f} {est_median:<15.4f} {abs(est_median - true_median):<15.4f}")
    print(f"{'Mode (KDE)':<20} {true_mode:<15.4f} {est_mode_kde:<15.4f} {abs(est_mode_kde - true_mode):<15.4f}")
    print(f"{'Mode (Half-sample)':<20} {true_mode:<15.4f} {est_mode_hsm:<15.4f} {abs(est_mode_hsm - true_mode):<15.4f}")

extract_point_estimates_from_samples()
```

The Monte Carlo standard error of the posterior mean is σ/√N, where σ is the posterior standard deviation and N is the effective sample size (accounting for MCMC autocorrelation). Always report this uncertainty when presenting MCMC-based estimates.
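As a rough sketch of that calculation: the AR(1) chain below stands in for autocorrelated MCMC output, and the truncated autocorrelation sum is just one simple approximation of the effective sample size, not the only (or most robust) choice.

```python
import numpy as np

# Simulate an autocorrelated "MCMC" chain (AR(1) with correlation 0.9).
rng = np.random.default_rng(1)
n, rho = 20_000, 0.9
chain = np.empty(n)
chain[0] = rng.normal()
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

def effective_sample_size(x, max_lag=200):
    """ESS from summed autocorrelations, truncated at the first negative lag."""
    x = x - x.mean()
    acf = np.array([np.dot(x[:-k], x[k:]) / np.dot(x, x) for k in range(1, max_lag)])
    if np.any(acf < 0):
        acf = acf[:np.argmax(acf < 0)]
    return len(x) / (1 + 2 * acf.sum())

ess = effective_sample_size(chain)
mcse = chain.std(ddof=1) / np.sqrt(ess)   # σ / √N with N = effective sample size
print(f"posterior mean ≈ {chain.mean():.4f} ± {mcse:.4f} (MCSE), ESS ≈ {ess:.0f}")
```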
Point estimates are useful summaries, but they sacrifice information. Understanding these losses helps you decide when a point estimate suffices and when the full posterior is needed.
What's Lost:
1. Uncertainty Quantification
A point estimate θ̂ = 0.5 tells you nothing about whether this is a confident conclusion or a wild guess. Was the posterior extremely peaked at 0.5, or broad and flat? Only credible intervals or the full distribution convey this.
2. Posterior Shape
Bimodality, skewness, heavy tails—all are invisible in a single number. For multimodal posteriors, the mean might lie in a region of LOW posterior probability!
3. Correlations (Multivariate)
For multiple parameters, point estimates lose all information about posterior correlations. Knowing β₁ ≈ 2 and β₂ ≈ 3 tells you nothing about whether they're positively or negatively correlated.
4. Transformation Properties
The mean of θ² is NOT (mean of θ)². For nonlinear functions of parameters, point estimates can be misleading. The full posterior allows computing E[g(θ)|D] correctly.
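A small illustration of the gap; Beta(2, 5) is an arbitrary stand-in posterior.

```python
import numpy as np
from scipy import stats

# E[g(θ)|D] versus g(E[θ|D]) for the nonlinear g(θ) = θ².
samples = stats.beta(2, 5).rvs(size=200_000, random_state=0)

mean_of_square = np.mean(samples**2)   # E[θ²|D], computed from the posterior
square_of_mean = np.mean(samples)**2   # (E[θ|D])², the plug-in shortcut

print(f"E[θ²|D]   = {mean_of_square:.4f}")
print(f"(E[θ|D])² = {square_of_mean:.4f}")
# They differ by Var(θ|D); only the full posterior gives the first quantity.
```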
Using a point estimate θ̂ as if it were the true parameter (e.g., for prediction) is the 'plug-in' approach. This underestimates uncertainty because it ignores that θ̂ is itself uncertain. For predictions with proper uncertainty, integrate over the posterior: p(y*|D) = ∫p(y*|θ)p(θ|D)dθ.
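A minimal sketch of the difference, assuming a Beta(5, 5) posterior over a success probability and 10 future Bernoulli trials (illustrative numbers only): the plug-in approach understates the predictive spread because it ignores uncertainty in θ.

```python
import numpy as np

rng = np.random.default_rng(0)
post_a, post_b, m = 5, 5, 10
theta_hat = post_a / (post_a + post_b)          # posterior mean as plug-in value

# Plug-in: treat theta_hat as if it were the true parameter.
plugin_sd = np.sqrt(m * theta_hat * (1 - theta_hat))

# Posterior predictive: average the binomial over posterior draws of theta.
theta_draws = rng.beta(post_a, post_b, size=100_000)
y_draws = rng.binomial(m, theta_draws)

print(f"plug-in predictive sd   : {plugin_sd:.3f}")
print(f"posterior predictive sd : {y_draws.std():.3f}")   # larger: θ is uncertain too
```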
When Point Estimates Suffice:
- The posterior is approximately symmetric and unimodal, so mean, median, and mode nearly coincide.
- Uncertainty is reported separately (e.g., a credible interval) and the point estimate is only a headline summary.
- Downstream use is roughly linear in θ, so plugging in the mean introduces little distortion.

When Full Posteriors Are Essential:
- Decision problems with asymmetric or otherwise nonstandard loss functions.
- Predictions that must propagate parameter uncertainty (posterior predictive distributions).
- Skewed, heavy-tailed, or multimodal posteriors, where any single number misleads.
- Multivariate problems where correlations between parameters drive the conclusions.
Best Practice:
Report point estimates alongside credible intervals. A statement like 'the posterior mean is 0.65 with 95% CI [0.52, 0.77]' is far more informative than 'the estimate is 0.65'.
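In code, that report is a one-liner once you have posterior samples; the Beta(13, 8) posterior below is illustrative.

```python
import numpy as np
from scipy import stats

# Posterior samples (here drawn directly from a known Beta posterior).
samples = stats.beta(13, 8).rvs(size=50_000, random_state=42)

mean = samples.mean()
lo, hi = np.percentile(samples, [2.5, 97.5])   # central 95% credible interval
print(f"posterior mean {mean:.2f} with 95% CI [{lo:.2f}, {hi:.2f}]")
```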
Point estimates distill posterior distributions into single numbers, each optimizing a different loss function. Let's consolidate the key concepts:
- The posterior mean minimizes expected squared error, uses the entire posterior, and has a shrinkage interpretation in conjugate models.
- The posterior median minimizes expected absolute error and is robust for skewed or heavy-tailed posteriors.
- The MAP estimate (posterior mode) corresponds to 0-1 loss and equals penalized maximum likelihood, with the log-prior acting as the regularizer.
- All three converge to the MLE as data accumulate; they differ most for small samples, strong priors, and asymmetric or multimodal posteriors.
- Every point estimate discards uncertainty, shape, and correlation information, so pair it with a credible interval or use the full posterior when those matter.
Module Complete:
You have now completed the foundational Bayesian Framework module. You understand:
- how priors encode existing knowledge and how their strength affects inference;
- how likelihoods link parameters to observed data;
- how Bayes' theorem combines prior and likelihood into a posterior, and how posteriors update sequentially as evidence arrives;
- how to summarize posteriors with point estimates and credible intervals, and what each summary trades away.
With these foundations, you're prepared to explore conjugate priors, inference algorithms (MCMC, variational inference), and applications of Bayesian methods to regression, classification, and beyond.
Congratulations! You've mastered the Bayesian Framework—the philosophical and mathematical foundations of principled probabilistic reasoning. You can now specify priors, construct likelihoods, derive posteriors, update beliefs with new evidence, and extract meaningful point estimates. This foundation supports all advanced Bayesian methods covered in subsequent modules.