Throughout this module, we've focused on model selection—choosing the single best model according to marginal likelihood or Bayes factors. But this approach has a fundamental limitation: it ignores model uncertainty.
In reality, we're rarely certain which model is correct. Even when one model has the highest evidence, others may have non-negligible posterior probabilities. By committing to a single model, we discard the information carried by the plausible alternatives and understate our predictive uncertainty.
Bayesian Model Averaging (BMA) solves these problems by combining predictions from all models, weighted by their posterior probabilities. It's the fully coherent Bayesian approach to prediction under model uncertainty.
By the end of this page, you will understand: (1) The theoretical foundation for BMA as proper Bayesian inference, (2) How to compute BMA predictions and uncertainty intervals, (3) Practical algorithms for approximate BMA, (4) When BMA helps and when it doesn't, (5) Connections to ensemble methods and stacking.
Consider a set of candidate models $\{\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_K\}$. Each model makes predictions differently, and we're uncertain which is correct. BMA addresses this by averaging over all $K$ models, weighting each one by its posterior probability given the data.
The BMA predictive distribution for new data $\tilde{y}$ given observed data $\mathcal{D}$ is:
$$p(\tilde{y} | \mathcal{D}) = \sum_{k=1}^{K} p(\tilde{y} | \mathcal{D}, \mathcal{M}_k) \cdot p(\mathcal{M}_k | \mathcal{D})$$
This is a mixture of the predictive distributions from each model, weighted by posterior model probabilities.
BMA isn't a heuristic—it's the exact Bayesian answer to 'What should I predict when I'm uncertain about which model is true?' It follows directly from the law of total probability: marginalize over the unknown quantity (the true model) using its posterior distribution.
Using Bayes' theorem at the model level:
$$p(\mathcal{M}_k | \mathcal{D}) = \frac{p(\mathcal{D} | \mathcal{M}_k) \cdot p(\mathcal{M}_k)}{\sum_{j=1}^{K} p(\mathcal{D} | \mathcal{M}_j) \cdot p(\mathcal{M}_j)}$$
where $p(\mathcal{D} | \mathcal{M}_k)$ is the marginal likelihood (evidence) of model $\mathcal{M}_k$ and $p(\mathcal{M}_k)$ is its prior probability.
With equal prior model probabilities $p(\mathcal{M}_k) = 1/K$:
$$p(\mathcal{M}_k | \mathcal{D}) = \frac{p(\mathcal{D} | \mathcal{M}_k)}{\sum_{j=1}^{K} p(\mathcal{D} | \mathcal{M}_j)}$$
The posterior model probability is simply the normalized marginal likelihood.
| Model | Log Evidence | Evidence (Normalized) | P(M\|D) | Weight |
|---|---|---|---|---|
| Linear (d=1) | -165.2 | e⁻¹⁵·² | < 0.01% | ~0 |
| Quadratic (d=2) | -150.0 | e⁰·⁰ | 74.8% | 0.748 |
| Cubic (d=3) | -151.2 | e⁻¹·² | 22.5% | 0.225 |
| Quartic (d=4) | -153.4 | e⁻³·⁴ | 2.5% | 0.025 |
| Quintic (d=5) | -156.1 | e⁻⁶·¹ | 0.2% | 0.002 |
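These weights are just a softmax over the log evidences. A minimal sketch of that computation (the variable names are ours):

```python
import numpy as np

# Log evidences from the table above (degrees 1 through 5)
log_ev = np.array([-165.2, -150.0, -151.2, -153.4, -156.1])

# Subtract the maximum before exponentiating for numerical stability,
# then normalize to obtain posterior model probabilities.
weights = np.exp(log_ev - log_ev.max())
weights /= weights.sum()

for degree, w in enumerate(weights, start=1):
    print(f"Degree {degree}: P(M|D) = {w:.3f}")
# -> approximately 0.000, 0.748, 0.225, 0.025, 0.002
```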
For point predictions (e.g., predicting the mean), BMA combines model predictions:
$$\mathbb{E}[\tilde{y} | \mathcal{D}] = \sum_{k=1}^{K} \mathbb{E}[\tilde{y} | \mathcal{D}, \mathcal{M}_k] \cdot p(\mathcal{M}_k | \mathcal{D})$$
This is simply a weighted average of each model's prediction.
The variance of the BMA predictive distribution captures two sources of uncertainty:
$$\text{Var}[\tilde{y} | \mathcal{D}] = \underbrace{\sum_k p(\mathcal{M}_k | \mathcal{D}) \cdot \text{Var}[\tilde{y} | \mathcal{D}, \mathcal{M}_k]}_{\text{Within-model uncertainty}} + \underbrace{\sum_k p(\mathcal{M}_k | \mathcal{D}) \cdot (\mu_k - \bar{\mu})^2}_{\text{Between-model uncertainty}}$$
where $\mu_k = \mathbb{E}[\tilde{y} | \mathcal{D}, \mathcal{M}_k]$ and $\bar{\mu} = \sum_k p(\mathcal{M}_k | \mathcal{D}) \mu_k$.
The law of total variance decomposes uncertainty into a within-model component (parameter and noise uncertainty, averaged over models) and a between-model component (the spread of the individual models' mean predictions around the BMA mean).
Model selection ignores between-model uncertainty, leading to underestimated prediction intervals.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def bayesian_linear_regression_predict(X_train, y_train, X_test, prior_var=10.0, noise_var=0.25):
    """
    Bayesian linear regression: returns predictive mean, variance, and log evidence.
    """
    n_train, d = X_train.shape

    # Prior and posterior
    prior_precision = np.eye(d) / prior_var
    post_precision = X_train.T @ X_train / noise_var + prior_precision
    L, lower = cho_factor(post_precision, lower=True)
    post_cov = cho_solve((L, lower), np.eye(d))
    post_mean = post_cov @ X_train.T @ y_train / noise_var

    # Predictive distribution
    pred_mean = X_test @ post_mean
    pred_var = noise_var + np.einsum('ij,jk,ik->i', X_test, post_cov, X_test)

    # Log marginal likelihood
    K = prior_var * X_train @ X_train.T + noise_var * np.eye(n_train)
    L_K, _ = cho_factor(K, lower=True)
    log_det_K = 2 * np.sum(np.log(np.diag(L_K)))
    alpha = cho_solve((L_K, True), y_train)
    log_evidence = -0.5 * (y_train @ alpha + log_det_K + n_train * np.log(2 * np.pi))

    return pred_mean, pred_var, log_evidence


def bayesian_model_averaging(X_train, y_train, X_test, max_degree=5, prior_var=10.0, noise_var=0.25):
    """
    Perform Bayesian Model Averaging over polynomial regression models.

    Returns:
    --------
    bma_mean : array
        BMA predictive mean
    bma_var : array
        BMA predictive variance (includes model uncertainty)
    model_probs : array
        Posterior model probabilities
    model_results : list
        Individual model predictions and evidence
    """
    x_train_raw = X_train.flatten()
    x_test_raw = X_test.flatten()

    model_results = []
    log_evidences = []

    # Fit each polynomial model
    for degree in range(1, max_degree + 1):
        # Design matrices
        X_tr = np.column_stack([x_train_raw**i for i in range(degree + 1)])
        X_te = np.column_stack([x_test_raw**i for i in range(degree + 1)])

        pred_mean, pred_var, log_ev = bayesian_linear_regression_predict(
            X_tr, y_train, X_te, prior_var, noise_var
        )

        model_results.append({
            'degree': degree,
            'pred_mean': pred_mean,
            'pred_var': pred_var,
            'log_evidence': log_ev
        })
        log_evidences.append(log_ev)

    # Compute posterior model probabilities (softmax over log evidences)
    log_evidences = np.array(log_evidences)
    max_log_ev = np.max(log_evidences)
    model_probs = np.exp(log_evidences - max_log_ev)
    model_probs /= model_probs.sum()

    # BMA mean and within-model variance
    n_test = len(x_test_raw)
    bma_mean = np.zeros(n_test)
    within_var = np.zeros(n_test)
    between_var = np.zeros(n_test)

    for k, result in enumerate(model_results):
        w_k = model_probs[k]
        bma_mean += w_k * result['pred_mean']
        within_var += w_k * result['pred_var']

    # Between-model variance: spread of model means around the BMA mean
    for k, result in enumerate(model_results):
        w_k = model_probs[k]
        between_var += w_k * (result['pred_mean'] - bma_mean)**2

    bma_var = within_var + between_var

    return bma_mean, bma_var, model_probs, model_results


# Demonstration
np.random.seed(42)
n_train = 30
x_train = np.linspace(-1, 1, n_train)
y_true = 0.5 + 0.3*x_train - 0.8*x_train**2  # Quadratic
y_train = y_true + 0.3 * np.random.randn(n_train)

x_test = np.linspace(-1.2, 1.2, 50)

bma_mean, bma_var, model_probs, results = bayesian_model_averaging(
    x_train.reshape(-1, 1), y_train, x_test.reshape(-1, 1), max_degree=6)

print("Bayesian Model Averaging: Polynomial Regression\n")
print("True model: Quadratic (degree 2)\n")
print("Posterior Model Probabilities:")
for k, result in enumerate(results):
    prob = model_probs[k]
    print(f"  Degree {result['degree']}: {prob:.1%}")

# Compare uncertainty: BMA vs best single model
best_idx = np.argmax(model_probs)
best_var = results[best_idx]['pred_var']

print(f"\nPrediction near x=0:")
print(f"  BMA mean: {bma_mean[25]:.3f}")
print(f"  BMA std: {np.sqrt(bma_var[25]):.3f} (includes model uncertainty)")
print(f"  Best model (d={results[best_idx]['degree']}) std: {np.sqrt(best_var[25]):.3f}")
print(f"  Extra spread from model uncertainty: {np.sqrt(bma_var[25] - best_var[25]):.3f}")
```

The BMA variance is never smaller than the weighted average of the individual models' variances. The extra 'between-model' term captures uncertainty about which model is correct. When models disagree, this term is large; when they agree, it's small. This honest accounting of uncertainty is one of BMA's greatest advantages.
BMA isn't always beneficial. Understanding when it helps guides appropriate application.
If one model dominates (e.g., 99% posterior probability), BMA is essentially equivalent to using that single model. BMA shines when several models have appreciable posterior probability.
For the between-model variance to matter, the plausible models must also disagree: if they all make similar predictions, BMA reduces to any single model. The value comes from combining genuinely different predictions and from reporting the disagreement as extra uncertainty. Conversely, BMA adds little in the following situations:
Dominant model: When one model is clearly best (e.g., >95% probability), averaging adds little.
Agreeing models: When all plausible models make nearly identical predictions, the between-model variance vanishes.
All models wrong: If the true data-generating process is outside all candidate models (M-open), BMA averages over various types of wrongness—which may or may not be better than a single wrong model.
Computational expense: BMA requires fitting and evaluating multiple models. When resources are limited, a single well-tuned model may be preferable.
Use BMA when: (1) you have a reasonable set of candidate models, (2) you're uncertain which is best, (3) honest uncertainty quantification matters. Don't use BMA when: (1) you're confident in a single model, (2) computational resources are very limited, (3) you only need point predictions and don't care about uncertainty.
Exact BMA requires computing marginal likelihoods for all models—often intractable. Practical algorithms address this challenge.
MC³ (Markov chain Monte Carlo model composition) samples from the posterior over models using MCMC: propose a move to a neighboring model (for example, adding or dropping a single variable), accept or reject it based on the ratio of posterior model probabilities, and estimate $p(\mathcal{M}_k | \mathcal{D})$ from how often each model is visited.
Advantages: Explores model space efficiently; doesn't require enumerating all models.
Challenges: Requires computing marginal likelihoods; mixing can be slow in large model spaces.
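Below is a minimal MC³ sketch for variable selection under the same conjugate Gaussian linear model used in the other examples on this page. The function names (`log_evidence`, `mc3_sample`), the uniform model prior, and the demonstration data are our own illustration, not a library API.

```python
import numpy as np

def log_evidence(X, y, subset, prior_var=10.0, noise_var=0.5):
    """Closed-form log marginal likelihood of a Gaussian linear model
    restricted to the given variable subset (intercept always included)."""
    n = len(y)
    X_sub = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    K = prior_var * X_sub @ X_sub.T + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    log_det = 2 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))

def mc3_sample(X, y, n_iter=5000, seed=0):
    """MC^3 over inclusion indicators. With a uniform prior over models and a
    symmetric single-flip proposal, the Metropolis ratio is the evidence ratio."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    gamma = rng.integers(0, 2, size=p).astype(bool)    # current inclusion vector
    current_log_ev = log_evidence(X, y, np.where(gamma)[0])
    pip_counts = np.zeros(p)

    for _ in range(n_iter):                            # no burn-in: a sketch only
        j = rng.integers(p)                            # propose flipping one indicator
        proposal = gamma.copy()
        proposal[j] = ~proposal[j]
        prop_log_ev = log_evidence(X, y, np.where(proposal)[0])
        if np.log(rng.random()) < prop_log_ev - current_log_ev:
            gamma, current_log_ev = proposal, prop_log_ev
        pip_counts += gamma                            # tally visits for PIP estimates

    return pip_counts / n_iter

# Demonstration: 3 relevant predictors out of 10
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.5 * rng.standard_normal(100)

for j, pip in enumerate(mc3_sample(X, y)):
    print(f"x{j}: estimated PIP = {pip:.2f}")
```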
When marginal likelihoods are intractable, BIC provides a rough approximation:
$$\log p(\mathcal{D} | \mathcal{M}_k) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}_k) - \frac{d_k}{2}\log n$$
Posterior model probabilities become:
$$p(\mathcal{M}_k | \mathcal{D}) \propto \exp\left(-\frac{1}{2}\text{BIC}_k\right) \cdot p(\mathcal{M}_k)$$
Advantages: Uses only maximum likelihood fits; fast to compute.
Limitations: BIC is a rough approximation; may not reflect true model probabilities.
Stacking (stacked generalization) is a predictive approach that doesn't require marginal likelihoods: each candidate model produces out-of-sample (cross-validated) predictions, and the combination weights are chosen to optimize the ensemble's predictive performance on that held-out data.
Stacking vs. BMA: BMA uses posterior model probabilities (based on marginal likelihood); stacking uses predictive performance weights. When models are all wrong (M-open), stacking often outperforms BMA for prediction.
```python
import numpy as np
from scipy.optimize import minimize


def bic_model_weights(log_likelihoods, n_params, n_samples):
    """
    Compute approximate BMA weights using BIC.

    Parameters:
    -----------
    log_likelihoods : array
        Maximum log-likelihood for each model
    n_params : array
        Number of parameters for each model
    n_samples : int
        Number of data points

    Returns:
    --------
    weights : array
        Approximate posterior model probabilities
    """
    bics = -2 * log_likelihoods + n_params * np.log(n_samples)

    # Convert BIC to weights: weight ∝ exp(-BIC/2)
    log_weights = -bics / 2
    max_log_w = np.max(log_weights)
    weights = np.exp(log_weights - max_log_w)
    weights /= weights.sum()

    return weights


def stacking_weights(Y_true, Y_pred_cv, constraint='simplex'):
    """
    Compute stacking weights via cross-validated prediction optimization.

    Parameters:
    -----------
    Y_true : array of shape (n_samples,)
        True target values
    Y_pred_cv : array of shape (n_samples, n_models)
        Out-of-sample predictions from each model (from CV)
    constraint : str
        'simplex' for non-negative weights summing to 1
        'unconstrained' for any weights

    Returns:
    --------
    weights : array
        Optimal stacking weights
    """
    n_models = Y_pred_cv.shape[1]

    # Objective: minimize squared error
    def mse(w):
        pred = Y_pred_cv @ w
        return np.mean((Y_true - pred)**2)

    # Constraint: weights sum to 1
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}

    # Bounds: non-negative weights
    bounds = [(0, 1) for _ in range(n_models)]

    # Initial guess: equal weights
    w0 = np.ones(n_models) / n_models

    if constraint == 'simplex':
        result = minimize(mse, w0, method='SLSQP', bounds=bounds, constraints=constraints)
    else:
        result = minimize(mse, w0, method='BFGS')

    return result.x


# Demonstrate BIC-based BMA
def fit_polynomial_mle(x, y, degree):
    """Fit polynomial and return log-likelihood and params."""
    X = np.column_stack([x**i for i in range(degree + 1)])
    coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
    pred = X @ coeffs
    resid = y - pred
    sigma_sq = np.var(resid, ddof=degree+1)
    log_like = -len(y)/2 * np.log(2*np.pi*sigma_sq) - np.sum(resid**2)/(2*sigma_sq)
    return log_like, degree + 1, coeffs


# Example
np.random.seed(42)
n = 50
x = np.linspace(-1, 1, n)
y = 0.5 + 0.3*x - 0.8*x**2 + 0.2*x**3 + 0.4*np.random.randn(n)

print("BIC-Based BMA Weights\n")
log_likes = []
n_params = []
for d in range(1, 8):
    ll, k, _ = fit_polynomial_mle(x, y, d)
    log_likes.append(ll)
    n_params.append(k)

weights = bic_model_weights(np.array(log_likes), np.array(n_params), n)

for d, w in enumerate(weights, 1):
    print(f"  Degree {d}: weight = {w:.3f}")

print(f"\nMost probable model: Degree {np.argmax(weights) + 1}")
```

One of the most important applications of BMA is variable selection—choosing which predictors to include in a regression model.
With $p$ potential predictors, there are $2^p$ possible subsets, and each subset defines a different model. Challenges include the combinatorial size of this model space, the instability of choosing a single subset, and the fact that standard inference after selection ignores the uncertainty about which subset is right.
Rather than selecting a single 'best' subset, BMA averages over subsets and summarizes each variable's importance by its posterior inclusion probability (PIP):
$$\text{PIP}_j = \sum_{k:\, x_j \in \mathcal{M}_k} p(\mathcal{M}_k | \mathcal{D})$$
The posterior inclusion probability for variable $j$ is the total probability of models that include variable $j$.
Traditional approaches (stepwise selection, LASSO) select a single model and then report its coefficients as if that choice were certain.
The BMA approach instead averages each coefficient over all models, counting it as zero in the models that exclude the variable.
Example: A variable with coefficient 2.5 in models that include it, but only 40% posterior inclusion probability, has BMA coefficient: $$\beta_{\text{BMA}} = 0.40 \times 2.5 + 0.60 \times 0 = 1.0$$
The effect is 'shrunk' toward zero to reflect selection uncertainty.
An alternative to full BMA is the median probability model: include exactly those variables with PIP > 0.5. This gives a single sparse model that often has good properties (consistency under certain conditions). Use it when you need a single interpretable model but want Bayesian guidance on variable selection.
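To make the bookkeeping concrete, here is a small self-contained sketch, with made-up model weights and coefficients, of how PIPs, model-averaged coefficients, and the median probability model are assembled:

```python
# Three candidate subsets of {x0, x1} with hypothetical posterior
# probabilities and within-model coefficient estimates.
models = [
    ({"x0": 2.6},            0.45),   # model containing x0 only
    ({"x0": 2.4, "x1": 0.7}, 0.35),   # model containing x0 and x1
    ({"x1": 0.9},            0.20),   # model containing x1 only
]
variables = ["x0", "x1"]

# Posterior inclusion probability: total weight of models containing x_j
pip = {v: sum(w for coefs, w in models if v in coefs) for v in variables}

# BMA coefficient: average over models, counting 0 when x_j is excluded
beta_bma = {v: sum(w * coefs.get(v, 0.0) for coefs, w in models) for v in variables}

# Median probability model: keep exactly the variables with PIP > 0.5
mpm = [v for v in variables if pip[v] > 0.5]

print("PIP:      ", pip)       # x0: 0.80, x1: 0.55
print("BMA beta: ", beta_bma)  # x0: 0.45*2.6 + 0.35*2.4 = 2.01, x1: 0.35*0.7 + 0.20*0.9 = 0.425
print("Median probability model:", mpm)   # ['x0', 'x1']
```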
BMA faces significant computational challenges, especially with large model spaces.
With $p$ predictors, there are $2^p$ variable subsets. For $p = 30$, that's over 1 billion models—impossible to enumerate.
Solutions: sample the model space rather than enumerate it (for example, with the MC³ sampler sketched earlier), or restrict attention to a manageable set of high-probability models (Occam's window).
Each model requires marginal likelihood computation, which may be intractable.
Solutions: use conjugate priors that yield closed-form evidence (as in the examples on this page), Laplace approximations, or the BIC approximation described above.
BMA requires priors at two levels:
Model space priors: a uniform prior over models is the common default, but it implicitly favors mid-sized subsets; a prior on model size can be used to encourage sparsity.
Parameter priors: posterior model probabilities are sensitive to the scale of the coefficient priors; overly diffuse priors systematically shift weight toward smaller models, as the sketch below illustrates.
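A quick sensitivity check, using the same conjugate Gaussian evidence formula as the earlier examples (the data, polynomial degrees, and prior scales here are arbitrary choices of ours): compare the weights of a quadratic versus a cubic model as the coefficient prior is made more diffuse.

```python
import numpy as np

def log_evidence(X, y, prior_var, noise_var=0.25):
    """Log marginal likelihood of a Gaussian linear model with a zero-mean
    Gaussian coefficient prior of variance prior_var."""
    n = len(y)
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    log_det = 2 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))

# Quadratic data; compare a quadratic vs. a cubic fit as the prior widens.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 0.5 + 0.3 * x - 0.8 * x**2 + 0.5 * rng.standard_normal(40)

for prior_var in [1.0, 10.0, 100.0, 1000.0]:
    log_evs = np.array([
        log_evidence(np.column_stack([x**i for i in range(d + 1)]), y, prior_var)
        for d in (2, 3)
    ])
    w = np.exp(log_evs - log_evs.max())
    w /= w.sum()
    print(f"prior_var = {prior_var:6.1f}   P(quadratic) = {w[0]:.3f}   P(cubic) = {w[1]:.3f}")
```

Typically the simpler model's share grows as the prior becomes more diffuse, which is why the prior scale should be chosen deliberately rather than set 'as vague as possible'.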
```python
import numpy as np
from itertools import combinations


def exhaustive_variable_selection_bma(X, y, prior_var=10.0, noise_var=0.5):
    """
    Exhaustive BMA over all variable subsets (for small p only).
    Returns posterior inclusion probabilities and BMA predictions.
    """
    n, p = X.shape

    # Enumerate all 2^p - 1 non-empty subsets
    all_subsets = []
    log_evidences = []

    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            subset = list(subset)
            X_sub = np.column_stack([np.ones(n), X[:, subset]])

            # Compute log marginal likelihood
            K = prior_var * X_sub @ X_sub.T + noise_var * np.eye(n)
            L = np.linalg.cholesky(K)
            log_det = 2 * np.sum(np.log(np.diag(L)))
            alpha = np.linalg.solve(K, y)
            log_ev = -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))

            all_subsets.append(subset)
            log_evidences.append(log_ev)

    # Compute posterior model probabilities
    log_evidences = np.array(log_evidences)
    max_log_ev = np.max(log_evidences)
    model_probs = np.exp(log_evidences - max_log_ev)
    model_probs /= model_probs.sum()

    # Posterior inclusion probabilities
    pip = np.zeros(p)
    for subset, prob in zip(all_subsets, model_probs):
        for j in subset:
            pip[j] += prob

    return pip, model_probs, all_subsets


# Example: Variable selection with true and spurious variables
np.random.seed(42)
n = 100
p = 8

# Create predictors (first 3 are truly relevant)
X = np.random.randn(n, p)

# True coefficients: only x0, x1, x2 matter
true_beta = np.array([2.0, -1.0, 0.5, 0, 0, 0, 0, 0])
y = X @ true_beta + 0.5 * np.random.randn(n)

# Run BMA
pip, model_probs, all_subsets = exhaustive_variable_selection_bma(X, y)

print("BMA Variable Selection\n")
print("True model uses: x0, x1, x2\n")
print("Posterior Inclusion Probabilities:")
for j in range(p):
    status = "TRUE" if j < 3 else "spurious"
    print(f"  x{j} ({status:>8}): PIP = {pip[j]:.3f}")

# Find top models
top_5_idx = np.argsort(model_probs)[-5:][::-1]
print(f"\nTop 5 Models:")
for idx in top_5_idx:
    subset = all_subsets[idx]
    vars_str = ', '.join([f'x{j}' for j in subset])
    print(f"  P = {model_probs[idx]:.3f}: {{{vars_str}}}")
```

BMA is one of several ensemble approaches. Understanding how it compares to alternatives helps choose the right method.
Bagging trains the same model on bootstrap samples and averages predictions.
BMA averages over different model structures, weighted by evidence.
Boosting sequentially fits models to residuals, combining into a single strong learner.
BMA gives a coherent probability distribution over predictions.
| Method | Source of Diversity | Weighting | Goal | Uncertainty? |
|---|---|---|---|---|
| BMA | Different model structures | Posterior probability | Coherent averaging | Yes (principled) |
| Stacking | Different model structures | CV-optimized | Best prediction | Partial |
| Bagging | Bootstrap samples | Equal | Variance reduction | Via percentiles |
| Boosting | Sequential residuals | Contribution to loss | Bias reduction | Limited |
| Dropout (NN) | Random subnetworks | Equal (implicit) | Regularization | Approximate |
The comparison between BMA and stacking deserves particular attention:
BMA: weights models by posterior probability, driven by the marginal likelihood; as data accumulate, the weights concentrate on the single candidate closest to the truth; it is the coherent choice when the true model is among the candidates (M-closed).
Stacking: weights models by cross-validated predictive performance; it can keep several models in the mixture indefinitely; it is often more robust when every candidate is misspecified (M-open).
Rule of thumb: Use BMA when uncertainty quantification matters and you trust your model set. Use stacking when pure predictive accuracy matters and you suspect all models are approximations.
In M-closed problems, the true model is among the candidates—BMA is optimal. In M-open problems, all models are approximations—BMA picks the best average of approximations, but stacking may find better predictive combinations. Most real problems are M-open, so stacking is often preferred for pure prediction, while BMA remains preferred for principled inference.
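To see the two weighting schemes side by side, here is a self-contained sketch (the data, candidate set, and all names are our own) in which the true function lies outside every candidate: BIC-based weights stand in for BMA, and stacking weights minimize leave-one-out squared error.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 80
x = np.linspace(-2, 2, n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)   # truth outside every candidate

def design(x, degree):
    return np.column_stack([x**i for i in range(degree + 1)])

degrees = [1, 3, 5]   # three polynomial approximations, all misspecified

# BIC-based (approximate BMA) weights
log_likes, n_params = [], []
for d in degrees:
    X = design(x, d)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / n                       # MLE noise variance
    log_likes.append(-0.5 * n * (np.log(2 * np.pi * sigma2) + 1))
    n_params.append(d + 2)                           # coefficients + noise variance
bic = -2 * np.array(log_likes) + np.array(n_params) * np.log(n)
w_bma = np.exp(-(bic - bic.min()) / 2)
w_bma /= w_bma.sum()

# Stacking weights from leave-one-out predictions
P = np.zeros((n, len(degrees)))
for j, d in enumerate(degrees):
    X = design(x, d)
    for i in range(n):
        mask = np.arange(n) != i
        beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
        P[i, j] = X[i] @ beta

def loo_mse(w):
    return np.mean((y - P @ w)**2)

res = minimize(loo_mse, np.ones(len(degrees)) / len(degrees),
               method='SLSQP', bounds=[(0, 1)] * len(degrees),
               constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1})

print("degrees:         ", degrees)
print("BIC/BMA weights: ", np.round(w_bma, 3))
print("stacking weights:", np.round(res.x, 3))
```

The point is not which set of weights 'wins' on this toy problem, but that the two criteria answer different questions: the posterior probability of a model versus the predictive usefulness of a combination.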
Let's consolidate what we've learned about Bayesian Model Averaging: the BMA predictive distribution is a mixture of each model's predictions weighted by posterior model probabilities; its variance adds a between-model term that model selection ignores; practical approximations include MC³, BIC-based weights, and stacking; and in regression, BMA replaces a single chosen subset with posterior inclusion probabilities and model-averaged coefficients.
Module Complete!
This module has taken you through the complete framework for Bayesian model comparison:
Marginal Likelihood: The foundation—how models are scored by their predictive accuracy averaged over priors.
Bayes Factors: The ratio comparing two models' evidences, with interpretation scales and computation methods.
Model Evidence: Deep understanding of what marginal likelihood measures—fit vs. complexity, compression, and predictive probability.
Occam's Razor: Why Bayesian inference automatically prefers simpler models when complexity isn't justified.
Bayesian Model Averaging: How to combine models rather than choose, accounting for model uncertainty.
You now have the tools to approach model selection and model uncertainty from a principled Bayesian perspective—understanding not just the mechanics but the deep principles that make these methods coherent and powerful.
Congratulations! You've mastered Bayesian Model Comparison—from marginal likelihood foundations through Bayes factors, Occam's razor, and model averaging. You understand not just the formulas but the deep principles: why evidence penalizes complexity, how BMA accounts for model uncertainty, and when these methods succeed or fail. You're now equipped to apply principled probabilistic reasoning to model selection in any domain.