Throughout this module, we've focused on model selection—choosing the single best model according to marginal likelihood or Bayes factors. But this approach has a fundamental limitation: it ignores model uncertainty.
In reality, we're rarely certain which model is correct. Even when one model has the highest evidence, others may have non-negligible posterior probabilities. By committing to a single model, we discard the information carried by the plausible alternatives and understate our predictive uncertainty.
Bayesian Model Averaging (BMA) solves these problems by combining predictions from all models, weighted by their posterior probabilities. It's the fully coherent Bayesian approach to prediction under model uncertainty.
By the end of this page, you will understand: (1) The theoretical foundation for BMA as proper Bayesian inference, (2) How to compute BMA predictions and uncertainty intervals, (3) Practical algorithms for approximate BMA, (4) When BMA helps and when it doesn't, (5) Connections to ensemble methods and stacking.
Consider a set of candidate models $\{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_K\}$. Each model makes predictions differently, and we're uncertain which is correct. BMA addresses this by averaging over all $K$ models, weighting each one by its posterior probability given the data.
The BMA predictive distribution for new data $\tilde{y}$ given observed data $\mathcal{D}$ is:
$$p(\tilde{y} | \mathcal{D}) = \sum_{k=1}^{K} p(\tilde{y} | \mathcal{D}, \mathcal{M}_k) \cdot p(\mathcal{M}_k | \mathcal{D})$$
This is a mixture of the predictive distributions from each model, weighted by posterior model probabilities.
BMA isn't a heuristic—it's the exact Bayesian answer to 'What should I predict when I'm uncertain about which model is true?' It follows directly from the law of total probability: marginalize over the unknown quantity (the true model) using its posterior distribution.
Using Bayes' theorem at the model level:
$$p(\mathcal{M}_k | \mathcal{D}) = \frac{p(\mathcal{D} | \mathcal{M}_k) \cdot p(\mathcal{M}_k)}{\sum_{j=1}^{K} p(\mathcal{D} | \mathcal{M}_j) \cdot p(\mathcal{M}_j)}$$
where $p(\mathcal{D} | \mathcal{M}_k)$ is the marginal likelihood (evidence) of model $\mathcal{M}_k$ and $p(\mathcal{M}_k)$ is its prior probability.
With equal prior model probabilities $p(\mathcal{M}_k) = 1/K$:
$$p(\mathcal{M}_k | \mathcal{D}) = \frac{p(\mathcal{D} | \mathcal{M}_k)}{\sum_{j=1}^{K} p(\mathcal{D} | \mathcal{M}_j)}$$
The posterior model probability is simply the normalized marginal likelihood.
| Model | Log Evidence | Evidence (Normalized) | P(M\|D) | Weight |
|---|---|---|---|---|
| Linear (d=1) | -165.2 | e⁻¹⁵·² | < 0.01% | ~0 |
| Quadratic (d=2) | -150.0 | e⁰·⁰ | 74.8% | 0.748 |
| Cubic (d=3) | -151.2 | e⁻¹·² | 22.5% | 0.225 |
| Quartic (d=4) | -153.4 | e⁻³·⁴ | 2.5% | 0.025 |
| Quintic (d=5) | -156.1 | e⁻⁶·¹ | 0.2% | 0.002 |
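These weights are just a softmax over the log evidences. A minimal sketch of that computation (the variable names are ours):

```python
import numpy as np

# Log evidences from the table above (degrees 1 through 5)
log_ev = np.array([-165.2, -150.0, -151.2, -153.4, -156.1])

# Subtract the maximum before exponentiating for numerical stability,
# then normalize to obtain posterior model probabilities.
weights = np.exp(log_ev - log_ev.max())
weights /= weights.sum()

for degree, w in enumerate(weights, start=1):
    print(f"Degree {degree}: P(M|D) = {w:.3f}")
# -> approximately 0.000, 0.748, 0.225, 0.025, 0.002
```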
For point predictions (e.g., predicting the mean), BMA combines model predictions:
$$\mathbb{E}[\tilde{y} | \mathcal{D}] = \sum_{k=1}^{K} \mathbb{E}[\tilde{y} | \mathcal{D}, \mathcal{M}_k] \cdot p(\mathcal{M}_k | \mathcal{D})$$
This is simply a weighted average of each model's prediction.
The variance of the BMA predictive distribution captures two sources of uncertainty:
$$\text{Var}[\tilde{y} | \mathcal{D}] = \underbrace{\sum_k p(\mathcal{M}_k | \mathcal{D}) \cdot \text{Var}[\tilde{y} | \mathcal{D}, \mathcal{M}_k]}_{\text{Within-model uncertainty}} + \underbrace{\sum_k p(\mathcal{M}_k | \mathcal{D}) \cdot (\mu_k - \bar{\mu})^2}_{\text{Between-model uncertainty}}$$
where $\mu_k = \mathbb{E}[\tilde{y} | \mathcal{D}, \mathcal{M}_k]$ and $\bar{\mu} = \sum_k p(\mathcal{M}_k | \mathcal{D}) \mu_k$.
The law of total variance decomposes uncertainty into a within-model component (parameter and noise uncertainty, averaged over models) and a between-model component (the spread of the individual models' mean predictions around the BMA mean).
Model selection ignores between-model uncertainty, leading to underestimated prediction intervals.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def bayesian_linear_regression_predict(X_train, y_train, X_test, prior_var=10.0, noise_var=0.25):
    """
    Bayesian linear regression: returns predictive mean, variance, and log evidence.
    """
    n_train, d = X_train.shape

    # Prior and posterior
    prior_precision = np.eye(d) / prior_var
    post_precision = X_train.T @ X_train / noise_var + prior_precision
    L, lower = cho_factor(post_precision, lower=True)
    post_cov = cho_solve((L, lower), np.eye(d))
    post_mean = post_cov @ X_train.T @ y_train / noise_var

    # Predictive distribution
    pred_mean = X_test @ post_mean
    pred_var = noise_var + np.einsum('ij,jk,ik->i', X_test, post_cov, X_test)

    # Log marginal likelihood
    K = prior_var * X_train @ X_train.T + noise_var * np.eye(n_train)
    L_K, _ = cho_factor(K, lower=True)
    log_det_K = 2 * np.sum(np.log(np.diag(L_K)))
    alpha = cho_solve((L_K, True), y_train)
    log_evidence = -0.5 * (y_train @ alpha + log_det_K + n_train * np.log(2 * np.pi))

    return pred_mean, pred_var, log_evidence


def bayesian_model_averaging(X_train, y_train, X_test, max_degree=5, prior_var=10.0, noise_var=0.25):
    """
    Perform Bayesian Model Averaging over polynomial regression models.

    Returns:
    --------
    bma_mean : array
        BMA predictive mean
    bma_var : array
        BMA predictive variance (includes model uncertainty)
    model_probs : array
        Posterior model probabilities
    model_results : list
        Individual model predictions and evidence
    """
    x_train_raw = X_train.flatten()
    x_test_raw = X_test.flatten()

    model_results = []
    log_evidences = []

    # Fit each polynomial model
    for degree in range(1, max_degree + 1):
        # Design matrices
        X_tr = np.column_stack([x_train_raw**i for i in range(degree + 1)])
        X_te = np.column_stack([x_test_raw**i for i in range(degree + 1)])

        pred_mean, pred_var, log_ev = bayesian_linear_regression_predict(
            X_tr, y_train, X_te, prior_var, noise_var
        )

        model_results.append({
            'degree': degree,
            'pred_mean': pred_mean,
            'pred_var': pred_var,
            'log_evidence': log_ev
        })
        log_evidences.append(log_ev)

    # Compute posterior model probabilities (softmax over log evidences)
    log_evidences = np.array(log_evidences)
    max_log_ev = np.max(log_evidences)
    model_probs = np.exp(log_evidences - max_log_ev)
    model_probs /= model_probs.sum()

    # BMA mean and within-model variance
    n_test = len(x_test_raw)
    bma_mean = np.zeros(n_test)
    within_var = np.zeros(n_test)
    between_var = np.zeros(n_test)

    for k, result in enumerate(model_results):
        w_k = model_probs[k]
        bma_mean += w_k * result['pred_mean']
        within_var += w_k * result['pred_var']

    # Between-model variance: spread of model means around the BMA mean
    for k, result in enumerate(model_results):
        w_k = model_probs[k]
        between_var += w_k * (result['pred_mean'] - bma_mean)**2

    bma_var = within_var + between_var

    return bma_mean, bma_var, model_probs, model_results


# Demonstration
np.random.seed(42)
n_train = 30
x_train = np.linspace(-1, 1, n_train)
y_true = 0.5 + 0.3*x_train - 0.8*x_train**2  # Quadratic
y_train = y_true + 0.3 * np.random.randn(n_train)

x_test = np.linspace(-1.2, 1.2, 50)

bma_mean, bma_var, model_probs, results = bayesian_model_averaging(
    x_train.reshape(-1, 1), y_train, x_test.reshape(-1, 1), max_degree=6)

print("Bayesian Model Averaging: Polynomial Regression\n")
print("True model: Quadratic (degree 2)\n")
print("Posterior Model Probabilities:")
for k, result in enumerate(results):
    prob = model_probs[k]
    print(f"  Degree {result['degree']}: {prob:.1%}")

# Compare uncertainty: BMA vs best single model
best_idx = np.argmax(model_probs)
best_var = results[best_idx]['pred_var']

print(f"\nPrediction near x=0:")
print(f"  BMA mean: {bma_mean[25]:.3f}")
print(f"  BMA std: {np.sqrt(bma_var[25]):.3f} (includes model uncertainty)")
print(f"  Best model (d={results[best_idx]['degree']}) std: {np.sqrt(best_var[25]):.3f}")
print(f"  Extra spread from model uncertainty: {np.sqrt(bma_var[25] - best_var[25]):.3f}")
```

The BMA variance is never smaller than the weighted average of the individual models' variances. The extra 'between-model' term captures uncertainty about which model is correct. When models disagree, this term is large; when they agree, it's small. This honest accounting of uncertainty is one of BMA's greatest advantages.
BMA isn't always beneficial. Understanding when it helps guides appropriate application.
If one model dominates (e.g., 99% posterior probability), BMA is essentially equivalent to using that single model. BMA shines when several models have appreciable posterior probability.
For the between-model variance to matter, the plausible models must also disagree: if they all make similar predictions, BMA reduces to any single model. The value comes from combining genuinely different predictions and from reporting the disagreement as extra uncertainty. Conversely, BMA adds little in the following situations:
Dominant model: When one model is clearly best (e.g., >95% probability), averaging adds little.
Agreeing models: When all plausible models make nearly identical predictions, the between-model variance vanishes.
All models wrong: If the true data-generating process is outside all candidate models (M-open), BMA averages over various types of wrongness—which may or may not be better than a single wrong model.
Computational expense: BMA requires fitting and evaluating multiple models. When resources are limited, a single well-tuned model may be preferable.
Use BMA when: (1) you have a reasonable set of candidate models, (2) you're uncertain which is best, (3) honest uncertainty quantification matters. Don't use BMA when: (1) you're confident in a single model, (2) computational resources are very limited, (3) you only need point predictions and don't care about uncertainty.
Exact BMA requires computing marginal likelihoods for all models—often intractable. Practical algorithms address this challenge.
MC³ (Markov chain Monte Carlo model composition) samples from the posterior over models using MCMC: propose a move to a neighboring model (for example, adding or dropping a single variable), accept or reject it based on the ratio of posterior model probabilities, and estimate $p(\mathcal{M}_k | \mathcal{D})$ from how often each model is visited.
Advantages: Explores model space efficiently; doesn't require enumerating all models.
Challenges: Requires computing marginal likelihoods; mixing can be slow in large model spaces.
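Below is a minimal MC³ sketch for variable selection under the same conjugate Gaussian linear model used in the other examples on this page. The function names (`log_evidence`, `mc3_sample`), the uniform model prior, and the demonstration data are our own illustration, not a library API.

```python
import numpy as np

def log_evidence(X, y, subset, prior_var=10.0, noise_var=0.5):
    """Closed-form log marginal likelihood of a Gaussian linear model
    restricted to the given variable subset (intercept always included)."""
    n = len(y)
    X_sub = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    K = prior_var * X_sub @ X_sub.T + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    log_det = 2 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))

def mc3_sample(X, y, n_iter=5000, seed=0):
    """MC^3 over inclusion indicators. With a uniform prior over models and a
    symmetric single-flip proposal, the Metropolis ratio is the evidence ratio."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = rng.integers(0, 2, size=p).astype(bool)    # current inclusion vector
    current_log_ev = log_evidence(X, y, np.where(gamma)[0])
    pip_counts = np.zeros(p)

    for _ in range(n_iter):                            # no burn-in: a sketch only
        j = rng.integers(p)                            # propose flipping one indicator
        proposal = gamma.copy()
        proposal[j] = ~proposal[j]
        prop_log_ev = log_evidence(X, y, np.where(proposal)[0])
        if np.log(rng.random()) < prop_log_ev - current_log_ev:
            gamma, current_log_ev = proposal, prop_log_ev
        pip_counts += gamma                            # tally visits for PIP estimates

    return pip_counts / n_iter

# Demonstration: 3 relevant predictors out of 10
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.5 * rng.standard_normal(100)

for j, pip in enumerate(mc3_sample(X, y)):
    print(f"x{j}: estimated PIP = {pip:.2f}")
```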
When marginal likelihoods are intractable, BIC provides a rough approximation:
$$\log p(\mathcal{D} | \mathcal{M}_k) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}_k) - \frac{d_k}{2}\log n$$
Posterior model probabilities become:
$$p(\mathcal{M}_k | \mathcal{D}) \propto \exp\left(-\frac{1}{2}\text{BIC}_k\right) \cdot p(\mathcal{M}_k)$$
Advantages: Uses only maximum likelihood fits; fast to compute.
Limitations: BIC is a rough approximation; may not reflect true model probabilities.
Stacking (stacked generalization) is a predictive approach that doesn't require marginal likelihoods: each candidate model produces out-of-sample (cross-validated) predictions, and the combination weights are chosen to optimize the ensemble's predictive performance on that held-out data.
Stacking vs. BMA: BMA uses posterior model probabilities (based on marginal likelihood); stacking uses predictive performance weights. When models are all wrong (M-open), stacking often outperforms BMA for prediction.
```python
import numpy as np
from scipy.optimize import minimize


def bic_model_weights(log_likelihoods, n_params, n_samples):
    """
    Compute approximate BMA weights using BIC.

    Parameters:
    -----------
    log_likelihoods : array
        Maximum log-likelihood for each model
    n_params : array
        Number of parameters for each model
    n_samples : int
        Number of data points

    Returns:
    --------
    weights : array
        Approximate posterior model probabilities
    """
    bics = -2 * log_likelihoods + n_params * np.log(n_samples)

    # Convert BIC to weights: weight ∝ exp(-BIC/2)
    log_weights = -bics / 2
    max_log_w = np.max(log_weights)
    weights = np.exp(log_weights - max_log_w)
    weights /= weights.sum()

    return weights


def stacking_weights(Y_true, Y_pred_cv, constraint='simplex'):
    """
    Compute stacking weights via cross-validated prediction optimization.

    Parameters:
    -----------
    Y_true : array of shape (n_samples,)
        True target values
    Y_pred_cv : array of shape (n_samples, n_models)
        Out-of-sample predictions from each model (from CV)
    constraint : str
        'simplex' for non-negative weights summing to 1
        'unconstrained' for any weights

    Returns:
    --------
    weights : array
        Optimal stacking weights
    """
    n_models = Y_pred_cv.shape[1]

    # Objective: minimize squared error
    def mse(w):
        pred = Y_pred_cv @ w
        return np.mean((Y_true - pred)**2)

    # Constraint: weights sum to 1
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}

    # Bounds: non-negative weights
    bounds = [(0, 1) for _ in range(n_models)]

    # Initial guess: equal weights
    w0 = np.ones(n_models) / n_models

    if constraint == 'simplex':
        result = minimize(mse, w0, method='SLSQP', bounds=bounds, constraints=constraints)
    else:
        result = minimize(mse, w0, method='BFGS')

    return result.x


# Demonstrate BIC-based BMA
def fit_polynomial_mle(x, y, degree):
    """Fit polynomial and return log-likelihood and params."""
    X = np.column_stack([x**i for i in range(degree + 1)])
    coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
    pred = X @ coeffs
    resid = y - pred
    sigma_sq = np.var(resid, ddof=degree+1)
    log_like = -len(y)/2 * np.log(2*np.pi*sigma_sq) - np.sum(resid**2)/(2*sigma_sq)
    return log_like, degree + 1, coeffs


# Example
np.random.seed(42)
n = 50
x = np.linspace(-1, 1, n)
y = 0.5 + 0.3*x - 0.8*x**2 + 0.2*x**3 + 0.4*np.random.randn(n)

print("BIC-Based BMA Weights\n")
log_likes = []
n_params = []
for d in range(1, 8):
    ll, k, _ = fit_polynomial_mle(x, y, d)
    log_likes.append(ll)
    n_params.append(k)

weights = bic_model_weights(np.array(log_likes), np.array(n_params), n)

for d, w in enumerate(weights, 1):
    print(f"  Degree {d}: weight = {w:.3f}")

print(f"\nMost probable model: Degree {np.argmax(weights) + 1}")
```

One of the most important applications of BMA is variable selection—choosing which predictors to include in a regression model.
With $p$ potential predictors, there are $2^p$ possible subsets, and each subset defines a different model. Challenges include the combinatorial size of this model space, the instability of choosing a single subset, and the fact that standard inference after selection ignores the uncertainty about which subset is right.
Rather than selecting a single 'best' subset, BMA averages over subsets and summarizes each variable's importance by its posterior inclusion probability (PIP):
$$\text{PIP}_j = \sum_{k:\, x_j \in \mathcal{M}_k} p(\mathcal{M}_k | \mathcal{D})$$
The posterior inclusion probability for variable $j$ is the total probability of models that include variable $j$.
Traditional approaches (stepwise selection, LASSO) select a single model and then report its coefficients as if that choice were certain.
The BMA approach instead averages each coefficient over all models, counting it as zero in the models that exclude the variable.
Example: A variable with coefficient 2.5 in models that include it, but only 40% posterior inclusion probability, has BMA coefficient: $$\beta_{\text{BMA}} = 0.40 \times 2.5 + 0.60 \times 0 = 1.0$$
The effect is 'shrunk' toward zero to reflect selection uncertainty.
An alternative to full BMA is the median probability model: include exactly those variables with PIP > 0.5. This gives a single sparse model that often has good properties (consistency under certain conditions). Use it when you need a single interpretable model but want Bayesian guidance on variable selection.
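To make the bookkeeping concrete, here is a small self-contained sketch, with made-up model weights and coefficients, of how PIPs, model-averaged coefficients, and the median probability model are assembled:

```python
# Three candidate subsets of {x0, x1} with hypothetical posterior
# probabilities and within-model coefficient estimates.
models = [
    ({"x0": 2.6},            0.45),   # model containing x0 only
    ({"x0": 2.4, "x1": 0.7}, 0.35),   # model containing x0 and x1
    ({"x1": 0.9},            0.20),   # model containing x1 only
]
variables = ["x0", "x1"]

# Posterior inclusion probability: total weight of models containing x_j
pip = {v: sum(w for coefs, w in models if v in coefs) for v in variables}

# BMA coefficient: average over models, counting 0 when x_j is excluded
beta_bma = {v: sum(w * coefs.get(v, 0.0) for coefs, w in models) for v in variables}

# Median probability model: keep exactly the variables with PIP > 0.5
mpm = [v for v in variables if pip[v] > 0.5]

print("PIP:      ", pip)       # x0: 0.80, x1: 0.55
print("BMA beta: ", beta_bma)  # x0: 0.45*2.6 + 0.35*2.4 = 2.01, x1: 0.35*0.7 + 0.20*0.9 = 0.425
print("Median probability model:", mpm)   # ['x0', 'x1']
```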
BMA faces significant computational challenges, especially with large model spaces.
With $p$ predictors, there are $2^p$ variable subsets. For $p = 30$, that's over 1 billion models—impossible to enumerate.
Solutions: sample the model space rather than enumerate it (for example, with the MC³ sampler sketched earlier), or restrict attention to a manageable set of high-probability models (Occam's window).
Each model requires marginal likelihood computation, which may be intractable.
Solutions: use conjugate priors that yield closed-form evidence (as in the examples on this page), Laplace approximations, or the BIC approximation described above.
BMA requires priors at two levels:
Model space priors: a uniform prior over models is the common default, but it implicitly favors mid-sized subsets; a prior on model size can be used to encourage sparsity.
Parameter priors: posterior model probabilities are sensitive to the scale of the coefficient priors; overly diffuse priors systematically shift weight toward smaller models, as the sketch below illustrates.
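A quick sensitivity check, using the same conjugate Gaussian evidence formula as the earlier examples (the data, polynomial degrees, and prior scales here are arbitrary choices of ours): compare the weights of a quadratic versus a cubic model as the coefficient prior is made more diffuse.

```python
import numpy as np

def log_evidence(X, y, prior_var, noise_var=0.25):
    """Log marginal likelihood of a Gaussian linear model with a zero-mean
    Gaussian coefficient prior of variance prior_var."""
    n = len(y)
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    log_det = 2 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))

# Quadratic data; compare a quadratic vs. a cubic fit as the prior widens.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 0.5 + 0.3 * x - 0.8 * x**2 + 0.5 * rng.standard_normal(40)

for prior_var in [1.0, 10.0, 100.0, 1000.0]:
    log_evs = np.array([
        log_evidence(np.column_stack([x**i for i in range(d + 1)]), y, prior_var)
        for d in (2, 3)
    ])
    w = np.exp(log_evs - log_evs.max())
    w /= w.sum()
    print(f"prior_var = {prior_var:6.1f}   P(quadratic) = {w[0]:.3f}   P(cubic) = {w[1]:.3f}")
```

Typically the simpler model's share grows as the prior becomes more diffuse, which is why the prior scale should be chosen deliberately rather than set 'as vague as possible'.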
```python
import numpy as np
from itertools import combinations


def exhaustive_variable_selection_bma(X, y, prior_var=10.0, noise_var=0.5):
    """
    Exhaustive BMA over all variable subsets (for small p only).
    Returns posterior inclusion probabilities and BMA predictions.
    """
    n, p = X.shape

    # Enumerate all 2^p - 1 non-empty subsets
    all_subsets = []
    log_evidences = []

    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            subset = list(subset)
            X_sub = np.column_stack([np.ones(n), X[:, subset]])

            # Compute log marginal likelihood
            K = prior_var * X_sub @ X_sub.T + noise_var * np.eye(n)
            L = np.linalg.cholesky(K)
            log_det = 2 * np.sum(np.log(np.diag(L)))
            alpha = np.linalg.solve(K, y)
            log_ev = -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))

            all_subsets.append(subset)
            log_evidences.append(log_ev)

    # Compute posterior model probabilities
    log_evidences = np.array(log_evidences)
    max_log_ev = np.max(log_evidences)
    model_probs = np.exp(log_evidences - max_log_ev)
    model_probs /= model_probs.sum()

    # Posterior inclusion probabilities
    pip = np.zeros(p)
    for subset, prob in zip(all_subsets, model_probs):
        for j in subset:
            pip[j] += prob

    return pip, model_probs, all_subsets


# Example: Variable selection with true and spurious variables
np.random.seed(42)
n = 100
p = 8

# Create predictors (first 3 are truly relevant)
X = np.random.randn(n, p)

# True coefficients: only x0, x1, x2 matter
true_beta = np.array([2.0, -1.0, 0.5, 0, 0, 0, 0, 0])
y = X @ true_beta + 0.5 * np.random.randn(n)

# Run BMA
pip, model_probs, all_subsets = exhaustive_variable_selection_bma(X, y)

print("BMA Variable Selection\n")
print("True model uses: x0, x1, x2\n")
print("Posterior Inclusion Probabilities:")
for j in range(p):
    status = "TRUE" if j < 3 else "spurious"
    print(f"  x{j} ({status:>8}): PIP = {pip[j]:.3f}")

# Find top models
top_5_idx = np.argsort(model_probs)[-5:][::-1]
print(f"\nTop 5 Models:")
for idx in top_5_idx:
    subset = all_subsets[idx]
    vars_str = ', '.join([f'x{j}' for j in subset])
    print(f"  P = {model_probs[idx]:.3f}: {{{vars_str}}}")
```

BMA is one of several ensemble approaches. Understanding how it compares to alternatives helps choose the right method.
Bagging trains the same model on bootstrap samples and averages predictions.
BMA averages over different model structures, weighted by evidence.
Boosting sequentially fits models to residuals, combining into a single strong learner.
BMA gives a coherent probability distribution over predictions.
| Method | Source of Diversity | Weighting | Goal | Uncertainty? |
|---|---|---|---|---|
| BMA | Different model structures | Posterior probability | Coherent averaging | Yes (principled) |
| Stacking | Different model structures | CV-optimized | Best prediction | Partial |
| Bagging | Bootstrap samples | Equal | Variance reduction | Via percentiles |
| Boosting | Sequential residuals | Contribution to loss | Bias reduction | Limited |
| Dropout (NN) | Random subnetworks | Equal (implicit) | Regularization | Approximate |
The comparison between BMA and stacking deserves particular attention:
BMA: weights models by posterior probability, driven by the marginal likelihood; as data accumulate, the weights concentrate on the single candidate closest to the truth; it is the coherent choice when the true model is among the candidates (M-closed).
Stacking: weights models by cross-validated predictive performance; it can keep several models in the mixture indefinitely; it is often more robust when every candidate is misspecified (M-open).
Rule of thumb: Use BMA when uncertainty quantification matters and you trust your model set. Use stacking when pure predictive accuracy matters and you suspect all models are approximations.
In M-closed problems, the true model is among the candidates—BMA is optimal. In M-open problems, all models are approximations—BMA picks the best average of approximations, but stacking may find better predictive combinations. Most real problems are M-open, so stacking is often preferred for pure prediction, while BMA remains preferred for principled inference.
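To see the two weighting schemes side by side, here is a self-contained sketch (the data, candidate set, and all names are our own) in which the true function lies outside every candidate: BIC-based weights stand in for BMA, and stacking weights minimize leave-one-out squared error.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 80
x = np.linspace(-2, 2, n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)   # truth outside every candidate

def design(x, degree):
    return np.column_stack([x**i for i in range(degree + 1)])

degrees = [1, 3, 5]   # three polynomial approximations, all misspecified

# BIC-based (approximate BMA) weights
log_likes, n_params = [], []
for d in degrees:
    X = design(x, d)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n                       # MLE noise variance
    log_likes.append(-0.5 * n * (np.log(2 * np.pi * sigma2) + 1))
    n_params.append(d + 2)                           # coefficients + noise variance
bic = -2 * np.array(log_likes) + np.array(n_params) * np.log(n)
w_bma = np.exp(-(bic - bic.min()) / 2)
w_bma /= w_bma.sum()

# Stacking weights from leave-one-out predictions
P = np.zeros((n, len(degrees)))
for j, d in enumerate(degrees):
    X = design(x, d)
    for i in range(n):
        mask = np.arange(n) != i
        beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        P[i, j] = X[i] @ beta

def loo_mse(w):
    return np.mean((y - P @ w)**2)

res = minimize(loo_mse, np.ones(len(degrees)) / len(degrees),
               method='SLSQP', bounds=[(0, 1)] * len(degrees),
               constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1})

print("degrees:         ", degrees)
print("BIC/BMA weights: ", np.round(w_bma, 3))
print("stacking weights:", np.round(res.x, 3))
```

The point is not which set of weights 'wins' on this toy problem, but that the two criteria answer different questions: the posterior probability of a model versus the predictive usefulness of a combination.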
Let's consolidate what we've learned about Bayesian Model Averaging: the BMA predictive distribution is a mixture of each model's predictions weighted by posterior model probabilities; its variance adds a between-model term that model selection ignores; practical approximations include MC³, BIC-based weights, and stacking; and in regression, BMA replaces a single chosen subset with posterior inclusion probabilities and model-averaged coefficients.
Module Complete!
This module has taken you through the complete framework for Bayesian model comparison:
Marginal Likelihood: The foundation—how models are scored by their predictive accuracy averaged over priors.
Bayes Factors: The ratio comparing two models' evidences, with interpretation scales and computation methods.
Model Evidence: Deep understanding of what marginal likelihood measures—fit vs. complexity, compression, and predictive probability.
Occam's Razor: Why Bayesian inference automatically prefers simpler models when complexity isn't justified.
Bayesian Model Averaging: How to combine models rather than choose, accounting for model uncertainty.
You now have the tools to approach model selection and model uncertainty from a principled Bayesian perspective—understanding not just the mechanics but the deep principles that make these methods coherent and powerful.
Congratulations! You've mastered Bayesian Model Comparison—from marginal likelihood foundations through Bayes factors, Occam's razor, and model averaging. You understand not just the formulas but the deep principles: why evidence penalizes complexity, how BMA accounts for model uncertainty, and when these methods succeed or fail. You're now equipped to apply principled probabilistic reasoning to model selection in any domain.