We've established that the marginal likelihood $p(\mathcal{D}|\mathcal{M})$—also called model evidence—forms the basis of Bayesian model comparison. But what exactly does this quantity measure? Why should the integral of likelihood over the prior distribution be the 'right' way to evaluate a model?
This page explores the deep conceptual foundations of model evidence. We'll see that it naturally encapsulates the bias-variance trade-off, embodies predictive accuracy, and implements an automatic Occam's razor that prefers simpler models when complexity isn't justified by the data.
Understanding model evidence at this level transforms it from a mathematical convenience into a principled philosophy of model quality.
By the end of this page, you will understand: (1) Model evidence as a measure of predictive accuracy on the observed data, (2) The decomposition of evidence into data fit and complexity penalty, (3) How model evidence relates to compression and minimum description length, (4) The connection between evidence maximization and type-II maximum likelihood, (5) When model evidence succeeds and fails as a model selection criterion.
Before observing any data, a Bayesian model makes probabilistic predictions. If we sample parameters from the prior and then sample data from the likelihood, we get the prior predictive distribution:
$$p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} | \mathcal{M}) \, d\boldsymbol{\theta}$$
This is precisely the marginal likelihood! The model evidence is the probability that the model assigns to the observed data before looking at the data to learn parameters.
Interpretation: Model evidence asks: 'If I hadn't seen the data yet and sampled parameters randomly from my prior, how likely would I be to generate exactly the data I observed?'
A model that assigns high probability to the observed data before learning is a model that 'predicted' the data well. A model that assigns low probability made poor predictions—even if it can fit the data well after parameter tuning.
Think of model evidence as grading each model on its predictions made before seeing the exam answers. A flexible model that could 'predict' anything (uniform predictions) scores poorly because it didn't commit. A model that correctly anticipated the actual data scores highly. This is why evidence penalizes unnecessary flexibility.
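This "prediction before seeing the answers" reading can be made concrete with a small Monte Carlo sketch. The model below is an illustrative conjugate Gaussian (unknown mean, known noise variance) with made-up data: sample parameters from the prior, average the likelihood of the observed data over those samples, and compare against the closed-form marginal likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): x_i ~ N(mu, 1), prior mu ~ N(0, 4)
prior_mean, prior_var, noise_var = 0.0, 4.0, 1.0
x = np.array([1.2, 0.8, 1.5])          # "observed" data
n = len(x)

# Monte Carlo estimate of the prior predictive probability of the data:
# draw mu from the prior, evaluate the likelihood, average.
mu_samples = rng.normal(prior_mean, np.sqrt(prior_var), size=200_000)
log_lik = (-n / 2 * np.log(2 * np.pi * noise_var)
           - ((x[None, :] - mu_samples[:, None]) ** 2).sum(axis=1) / (2 * noise_var))
# log-mean-exp for numerical stability
m = log_lik.max()
log_ev_mc = m + np.log(np.mean(np.exp(log_lik - m)))

# Closed-form log marginal likelihood for this conjugate Gaussian model
x_bar = x.mean()
log_ev_exact = (-n / 2 * np.log(2 * np.pi * noise_var)
                - 0.5 * np.log(n * prior_var / noise_var + 1)
                - ((x - x_bar) ** 2).sum() / (2 * noise_var)
                - n * (x_bar - prior_mean) ** 2 / (2 * (noise_var + n * prior_var)))

print(f"Monte Carlo: {log_ev_mc:.3f}   exact: {log_ev_exact:.3f}")
```

The two numbers agree closely; the Monte Carlo route is exactly the "sample parameters randomly from my prior" experiment described above.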
Model evidence has another interpretation via sequential prediction. Consider observing data points $\mathcal{D} = \{y_1, y_2, \ldots, y_n\}$ one at a time. By the chain rule:
$$p(\mathcal{D} | \mathcal{M}) = p(y_1 | \mathcal{M}) \cdot p(y_2 | y_1, \mathcal{M}) \cdot p(y_3 | y_1, y_2, \mathcal{M}) \cdots p(y_n | y_1, \ldots, y_{n-1}, \mathcal{M})$$
Each term is a one-step-ahead predictive probability:
$$p(y_t | y_1, \ldots, y_{t-1}, \mathcal{M}) = \int p(y_t | \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta} | y_1, \ldots, y_{t-1}, \mathcal{M}) \, d\boldsymbol{\theta}$$
This is the posterior predictive given all previous data.
Interpretation: Model evidence is the product of sequential predictions, where each prediction is made using only the data seen so far. A model with high evidence consistently made accurate predictions throughout the data stream.
The prequential (predictive-sequential) view: This shows that model evidence is essentially a measure of out-of-sample prediction, but evaluated on the actual sample using the Bayesian updating mechanism. It's like leave-one-out cross-validation, but with a specific update scheme.
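The chain-rule identity can be verified numerically. The sketch below uses an assumed conjugate Gaussian-mean model (unknown mean, known noise variance, invented data), where both the one-step-ahead predictives and the batch marginal likelihood have closed forms; the sequential product matches the batch evidence to machine precision.

```python
import numpy as np
from scipy import stats

# Toy conjugate model (an assumption for illustration):
# y_i ~ N(mu, noise_var), prior mu ~ N(prior_mean, prior_var)
prior_mean, prior_var, noise_var = 0.0, 4.0, 1.0
y = np.array([1.2, 0.8, 1.5, 0.3, 1.1])
n = len(y)

# Prequential evaluation: predict each point from the posterior
# given only the earlier points, then update.
m, v = prior_mean, prior_var          # current posterior over mu
log_ev_seq = 0.0
for y_t in y:
    # one-step-ahead predictive: y_t ~ N(m, v + noise_var)
    log_ev_seq += stats.norm.logpdf(y_t, loc=m, scale=np.sqrt(v + noise_var))
    # conjugate posterior update after observing y_t
    v_new = 1 / (1 / v + 1 / noise_var)
    m = v_new * (m / v + y_t / noise_var)
    v = v_new

# Batch closed-form log marginal likelihood
y_bar = y.mean()
log_ev_batch = (-n / 2 * np.log(2 * np.pi * noise_var)
                - 0.5 * np.log(n * prior_var / noise_var + 1)
                - ((y - y_bar) ** 2).sum() / (2 * noise_var)
                - n * (y_bar - prior_mean) ** 2 / (2 * (noise_var + n * prior_var)))

print(f"sequential: {log_ev_seq:.6f}   batch: {log_ev_batch:.6f}")
```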
| Measure | What It Evaluates | Complexity Handling | Bayesian? |
|---|---|---|---|
| Model Evidence | Prior predictive probability of data | Automatic via integration | Yes |
| Maximum Likelihood | Best-case fit | None (always prefers complex) | No |
| AIC | Estimated prediction error | Linear penalty (2k) | No |
| BIC | Approximation to log evidence | Log-linear penalty (k·log n) | Approx. |
| Cross-Validation | Average held-out prediction | Implicit via data splitting | No |
We can formally decompose model evidence to see how it balances data fit against model complexity.
For any distribution $q(\boldsymbol{\theta})$ over parameters, we have:
$$\log p(\mathcal{D} | \mathcal{M}) = \underbrace{\mathbb{E}_q[\log p(\mathcal{D} | \boldsymbol{\theta})]}_{\text{Expected Log-Likelihood}} - \underbrace{\text{KL}(q \| p_{\text{prior}})}_{\text{Complexity Penalty}} + \underbrace{\text{KL}(q \| p_{\text{posterior}})}_{\geq 0}$$
When $q = p_{\text{posterior}}$, the last term vanishes, giving:
$$\log p(\mathcal{D} | \mathcal{M}) = \mathbb{E}_{\text{posterior}}[\log p(\mathcal{D} | \boldsymbol{\theta})] - \text{KL}(p_{\text{posterior}} \| p_{\text{prior}})$$
Term 1 (Expected Log-Likelihood): How well does the model fit the data on average across posterior parameter values? Higher is better. Term 2 (KL Divergence): How much did the posterior 'move' from the prior? This measures complexity—models that need to learn a lot from data (prior → posterior shift) are penalized. Lower is better.
This decomposition makes the trade-off explicit:
Simple model with moderate fit: a decent expected log-likelihood and a small KL penalty, since the posterior stays close to the prior—evidence is respectable.
Complex model that overfits: a high expected log-likelihood, but a large KL penalty because the posterior must travel far from the prior to chase the data—evidence drops.
Appropriately complex model: a good expected log-likelihood with only a moderate KL penalty—evidence is highest.
The key insight: Model evidence doesn't just reward good fit—it rewards efficient fit. A model that achieves good fit while staying close to its prior beliefs is preferred.
```python
import numpy as np

def compute_evidence_decomposition(x, prior_mean, prior_var, noise_var):
    """
    Decompose the log marginal likelihood for Bayesian estimation of a Gaussian mean.

    Model: x_i ~ N(mu, noise_var), mu ~ N(prior_mean, prior_var)

    Returns
    -------
    log_evidence : float
        Log marginal likelihood
    expected_ll : float
        Expected log-likelihood under the posterior
    kl_divergence : float
        KL(posterior || prior)
    """
    n = len(x)
    x_mean = np.mean(x)

    # Posterior parameters (conjugate Gaussian update)
    post_var = 1 / (n / noise_var + 1 / prior_var)
    post_mean = post_var * (n * x_mean / noise_var + prior_mean / prior_var)

    # Expected log-likelihood under the posterior:
    # E[(x_i - mu)^2] = (x_i - post_mean)^2 + post_var
    expected_sq_diff = np.sum((x - post_mean) ** 2) + n * post_var
    expected_ll = -n / 2 * np.log(2 * np.pi * noise_var) - expected_sq_diff / (2 * noise_var)

    # KL divergence: KL(N(post_mean, post_var) || N(prior_mean, prior_var))
    kl_divergence = 0.5 * (
        np.log(prior_var / post_var)
        + (post_var + (post_mean - prior_mean) ** 2) / prior_var
        - 1
    )

    # Log marginal likelihood (closed form for the conjugate Gaussian model)
    log_evidence = (
        -n / 2 * np.log(2 * np.pi * noise_var)
        - 0.5 * np.log(n * prior_var / noise_var + 1)
        - np.sum((x - x_mean) ** 2) / (2 * noise_var)
        - n * (x_mean - prior_mean) ** 2 / (2 * (noise_var + n * prior_var))
    )

    return log_evidence, expected_ll, kl_divergence

# Example: compare prior specifications
np.random.seed(42)
true_mu = 5.0
noise_var = 1.0
x = np.random.normal(true_mu, np.sqrt(noise_var), 20)

print("Evidence Decomposition for Gaussian Mean Estimation")
print("True μ = 5.0, observed x̄ = {:.2f}".format(np.mean(x)))
print(f"{'Prior':<32} {'E[log L]':<12} {'KL':<12} {'Evidence':<12}")
print("-" * 68)

priors = [
    ("Well-calibrated: N(5, 4)", 5.0, 4.0),
    ("Slightly misspecified: N(3, 4)", 3.0, 4.0),
    ("Diffuse: N(0, 100)", 0.0, 100.0),
    ("Too confident, wrong: N(0, 0.1)", 0.0, 0.1),
    ("Too confident, right: N(5, 0.5)", 5.0, 0.5),
]

for name, prior_mean, prior_var in priors:
    log_ev, exp_ll, kl = compute_evidence_decomposition(x, prior_mean, prior_var, noise_var)
    print(f"{name:<32} {exp_ll:>10.2f} {kl:>10.2f} {log_ev:>10.2f}")
```

A beautiful perspective on model evidence comes from information theory: maximizing evidence is equivalent to finding the model that achieves the best data compression.
Minimum Description Length (MDL) is a principle from information theory: the best model is the one that provides the shortest description of the data. Description length is measured in bits (or nats, using natural logarithm).
The total description length for transmitting data using a model has two parts:
$$L_{\text{total}} = L_{\text{model}} + L_{\text{data|model}}$$
Simple models are cheap to describe but may fit data poorly (high $L_{\text{data|model}}$). Complex models describe data well but are expensive to transmit (high $L_{\text{model}}$).
The connection to evidence: Under certain coding schemes, the optimal description length is:
$$L_{\text{total}} = -\log p(\mathcal{D} | \mathcal{M})$$
Maximizing marginal likelihood = Minimizing description length
A good model 'compresses' data by discovering patterns. Random noise can't be compressed—it's incompressible. A model that fits noise wastes description bits on random fluctuations. A model that captures true structure compresses real patterns while not wasting bits on noise. Evidence automatically finds this balance.
The connection becomes explicit with prequential coding:
The total code length is:
$$L = -\sum_{t=1}^{n} \log p(y_t | y_1, \ldots, y_{t-1}, \mathcal{M}) = -\log p(\mathcal{D} | \mathcal{M})$$
This is exactly the negative log marginal likelihood! The Bayesian updates naturally implement an optimal sequential compression scheme.
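A minimal illustration of the coding view, using an assumed Beta-Bernoulli coin model with invented flip counts: a fair-coin code spends exactly one bit per flip, while a model that learns the bias pays a small parameter cost up front but compresses biased data far better. Code length in bits is just $-\log_2 p(\mathcal{D} | \mathcal{M})$.

```python
import numpy as np
from scipy.special import betaln

# Toy coin-flip data with a clear bias: 18 heads out of 20
h, t = 18, 2
n = h + t

# Model A: fair coin, no free parameters.
# Code length = -log2 p(D|A) = n bits, one bit per flip.
bits_fair = n * 1.0

# Model B: Bernoulli(theta) with a uniform Beta(1,1) prior on theta.
# log p(D|B) = log B(1+h, 1+t) - log B(1,1)  (Beta-Bernoulli marginal likelihood)
log_ev_b = betaln(1 + h, 1 + t) - betaln(1, 1)
bits_learned = -log_ev_b / np.log(2)

print(f"fair coin:    {bits_fair:.2f} bits")
print(f"learned coin: {bits_learned:.2f} bits")
# The model that learns the bias compresses the biased data better,
# even though it implicitly "pays" for the unknown parameter.
```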
Implications: a model with high evidence is, equivalently, a good compressor of the data. The table below summarizes how this plays out across the complexity spectrum.
| Model Type | L(model) | L(data \| model) | Total Bits | Evidence |
|---|---|---|---|---|
| Underfit (too simple) | Low | High (poor predictions) | Often high | Low |
| Just right (balanced) | Moderate | Low (good predictions) | Minimal | High |
| Overfit (too complex) | High | Low (fits everything) | Often high | Low |
An important application of model evidence is evidence maximization (also called Type-II Maximum Likelihood or empirical Bayes): learning hyperparameters by maximizing marginal likelihood.
Consider a model with parameters $\boldsymbol{\theta}$, a prior $p(\boldsymbol{\theta} | \boldsymbol{\alpha})$ governed by hyperparameters $\boldsymbol{\alpha}$, and a likelihood $p(\mathcal{D} | \boldsymbol{\theta})$.
The marginal likelihood is a function of hyperparameters:
$$p(\mathcal{D} | \boldsymbol{\alpha}) = \int p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta} | \boldsymbol{\alpha}) d\boldsymbol{\theta}$$
Evidence maximization finds:
$$\boldsymbol{\alpha}^* = \arg\max_{\boldsymbol{\alpha}} p(\mathcal{D} | \boldsymbol{\alpha})$$
This learns 'how much regularization' directly from the data, without cross-validation.
In Automatic Relevance Determination (ARD), each feature has its own prior precision:
$$\beta_j \sim \mathcal{N}(0, \alpha_j^{-1})$$
Maximizing evidence over $\{\alpha_j\}$ automatically drives $\alpha_j \to \infty$ for irrelevant features—shrinking their weights to zero—while keeping $\alpha_j$ finite for features the data supports.
This achieves automatic feature selection purely through evidence maximization, without cross-validation or L1 regularization.
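For a quick experiment, scikit-learn ships an ARD implementation. The sketch below (assuming scikit-learn is installed; the synthetic data are an illustrative choice) shows weights of irrelevant features being pruned toward zero by evidence maximization.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
n, d = 200, 10

# Synthetic data: only the first 3 of 10 features are relevant
beta = np.zeros(d)
beta[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(n, d))
y = X @ beta + 0.1 * rng.normal(size=n)

# ARD: one precision hyperparameter per weight, learned by
# (approximate) evidence maximization.
ard = ARDRegression()
ard.fit(X, y)

print("learned coefficients:", np.round(ard.coef_, 2))
# Irrelevant features get large precisions (lambda_), pinning their weights near 0.
print("weight precisions:   ", np.round(ard.lambda_, 1))
```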
In Gaussian process regression, the kernel function has hyperparameters (length scales, signal variance). Evidence maximization finds kernel hyperparameters that best explain the data:
$$\{\ell^*, \sigma_f^*, \sigma_n^*\} = \arg\max_{\ell, \sigma_f, \sigma_n} \log p(\mathbf{y} | \mathbf{X}, \ell, \sigma_f, \sigma_n)$$
This is the standard approach for GP hyperparameter learning.
```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.linalg import cho_factor, cho_solve

def log_evidence_ridge(X, y, alpha, noise_var=1.0):
    """
    Compute the log marginal likelihood for Bayesian ridge regression.

    Prior: beta ~ N(0, (1/alpha) * I)
    Likelihood: y ~ N(X @ beta, noise_var * I)
    """
    n, d = X.shape

    # Posterior precision: A = X.T @ X / noise_var + alpha * I
    A = X.T @ X / noise_var + alpha * np.eye(d)
    L, lower = cho_factor(A, lower=True)

    # Posterior mean: A^{-1} @ X.T @ y / noise_var
    post_mean = cho_solve((L, lower), X.T @ y / noise_var)

    # log|A| from the Cholesky factor
    log_det_A = 2 * np.sum(np.log(np.diag(L)))

    # Log evidence for the Gaussian linear model
    residual = y - X @ post_mean
    log_ev = -0.5 * (
        n * np.log(2 * np.pi * noise_var)
        + np.dot(residual, residual) / noise_var
        + alpha * np.dot(post_mean, post_mean)
        + log_det_A
        - d * np.log(alpha)
    )
    return log_ev

def optimize_ridge_alpha(X, y, noise_var=1.0):
    """Find the optimal ridge regularization by maximizing evidence."""
    def neg_log_evidence(log_alpha):
        return -log_evidence_ridge(X, y, np.exp(log_alpha), noise_var)

    # Optimize over log(alpha) for numerical stability
    result = minimize_scalar(neg_log_evidence, bounds=(-10, 10), method='bounded')
    return np.exp(result.x)

# Example: evidence-based choice of regularization strength
np.random.seed(42)
n, d = 100, 20

# True model: only the first 5 features are relevant
true_beta = np.zeros(d)
true_beta[:5] = np.array([2, -1, 0.5, 1.5, -0.8])

X = np.random.randn(n, d)
y = X @ true_beta + 0.5 * np.random.randn(n)

# Maximize evidence over alpha
optimal_alpha = optimize_ridge_alpha(X, y, noise_var=0.25)

print("Evidence Maximization for Ridge Regression")
print(f"Optimal α (from evidence): {optimal_alpha:.4f}")

# Show the evidence landscape around the optimum
print("Log evidence at different α values:")
for a in [0.001, 0.01, 0.1, 1, 10, 100]:
    ev = log_evidence_ridge(X, y, a, noise_var=0.25)
    star = " *" if abs(np.log10(a) - np.log10(optimal_alpha)) < 0.3 else ""
    print(f"  α = {a:<6.3f}: log p(D|α) = {ev:.2f}{star}")
```

Evidence maximization treats hyperparameters as point estimates rather than integrating over them. This can underestimate uncertainty, especially with limited data. For a full Bayesian treatment, place hyperpriors on the hyperparameters and integrate over them. Evidence maximization can also overfit the hyperparameters when the hyperparameter space is large.
Model evidence is a powerful principle, but it has failure modes. Understanding when it fails is as important as knowing when it succeeds.
Model evidence rewards models that 'predicted' the data well from their prior. If the prior is badly misspecified, evidence unfairly penalizes the model: a prior concentrated in the wrong region makes even a correct likelihood family look poor, while an absurdly diffuse prior spreads predictive mass so thin that no dataset looks likely.
Solution: Use principled priors based on domain knowledge or previous data. Conduct prior predictive checks to ensure priors allow reasonable data patterns.
The M-open scenario occurs when the true data-generating process is not among the candidate models. In this case:
Example: The true relationship is $y = \sin(x) + \epsilon$, and we compare a linear model against a quadratic model—neither of which contains the truth.
Evidence will favor quadratic (it's closer to the truth), but neither is 'right,' and the evidence values don't indicate how far either is from reality.
Solution: Supplement evidence with posterior predictive checks. Look for systematic patterns in residuals that indicate model misspecification.
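As a sketch of such a residual check (with illustrative synthetic data and an ordinary least-squares fit standing in for the full Bayesian fit): fit the misspecified quadratic to sinusoidal data and inspect the lag-1 autocorrelation of the residuals. White-noise residuals would give a value near zero; leftover structure shows up as strong correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + 0.1 * rng.normal(size=100)

# Fit the (misspecified) quadratic model and examine its residuals.
coeffs = np.polyfit(x, y, deg=2)
resid = y - np.polyval(coeffs, x)

# Simple diagnostic: lag-1 autocorrelation of the residuals.
# Near 0 => residuals look like noise; near 1 => systematic leftover pattern.
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"lag-1 residual autocorrelation (quadratic fit): {lag1:.2f}")
```

The strong positive autocorrelation flags the misspecification that a raw evidence comparison between two wrong models would never reveal.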
Use evidence as one input among several. Combine with: (1) Prior predictive checks—do priors generate plausible data? (2) Posterior predictive checks—does the fitted model reproduce observed patterns? (3) Cross-validation—does the model generalize to held-out data? (4) Domain knowledge—does the 'best' model make scientific sense?
Let's work through a complete example showing how model evidence guides model selection in a realistic scenario.
We observe noisy data from an unknown function and must select the polynomial degree. This is a classic bias-variance trade-off:
Model evidence should automatically find the sweet spot.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
import warnings

def bayesian_polynomial_regression(x, y, degree, prior_var=10.0, noise_var=0.1):
    """
    Fit Bayesian polynomial regression and compute the model evidence.

    Returns
    -------
    log_evidence : float
    posterior_mean : array
    posterior_cov : array
    """
    # Design matrix
    X = np.column_stack([x**i for i in range(degree + 1)])
    n, d = X.shape

    # Prior: beta ~ N(0, prior_var * I)
    prior_precision = np.eye(d) / prior_var

    # Posterior
    post_precision = X.T @ X / noise_var + prior_precision
    L, lower = cho_factor(post_precision, lower=True)
    post_cov = cho_solve((L, lower), np.eye(d))
    post_mean = post_cov @ X.T @ y / noise_var

    # Log evidence: marginally, y ~ N(0, prior_var * X X^T + noise_var * I)
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L_K, _ = cho_factor(K, lower=True)
    log_det_K = 2 * np.sum(np.log(np.diag(L_K)))
    alpha = cho_solve((L_K, True), y)
    log_evidence = -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))

    return log_evidence, post_mean, post_cov

def compare_polynomial_models(x, y, max_degree=10, prior_var=10.0, noise_var=0.1):
    """Compare polynomial models of different degrees using evidence."""
    results = []
    for d in range(max_degree + 1):
        try:
            log_ev, post_mean, post_cov = bayesian_polynomial_regression(
                x, y, d, prior_var, noise_var
            )
            results.append({
                'degree': d,
                'log_evidence': log_ev,
                'n_params': d + 1,
                'posterior_mean': post_mean,
            })
        except Exception as e:
            warnings.warn(f"Degree {d} failed: {e}")

    # Posterior model probabilities (assuming equal prior model probabilities)
    log_evs = np.array([r['log_evidence'] for r in results])
    probs = np.exp(log_evs - np.max(log_evs))
    probs /= probs.sum()
    for i, r in enumerate(results):
        r['posterior_prob'] = probs[i]

    return results

# Generate synthetic data from a cubic function
np.random.seed(42)
n = 50
x = np.linspace(-1, 1, n)

# True function: cubic
y_true = 0.5 - 0.3*x + 0.8*x**2 - 0.4*x**3
noise_var_true = 0.04
y = y_true + np.sqrt(noise_var_true) * np.random.randn(n)

# Compare models
results = compare_polynomial_models(x, y, max_degree=10, prior_var=10.0, noise_var=0.04)

print("Model Comparison via Evidence")
print("True model: cubic (degree 3)")
print(f"{'Degree':<8} {'# Params':<10} {'Log Evidence':<15} {'P(M|D)':<12} {'Bayes Factor'}")
print("-" * 70)

best_evidence = max(r['log_evidence'] for r in results)
for r in results:
    bf = np.exp(r['log_evidence'] - best_evidence)
    star = " ← best" if r['log_evidence'] == best_evidence else ""
    print(f"{r['degree']:<8} {r['n_params']:<10} {r['log_evidence']:>12.2f} "
          f"{r['posterior_prob']:>10.1%} {bf:>8.3f}{star}")

# Report the evidence-selected model
best = max(results, key=lambda r: r['log_evidence'])
print(f"Evidence-selected model: degree {best['degree']}")
print(f"Posterior probability: {best['posterior_prob']:.1%}")
```

Evidence peaks at the true complexity: for data from a cubic, evidence is highest around degree 3.
Simpler models have lower evidence: They can't capture the true pattern, hurting the likelihood term.
More complex models have lower evidence: Extra parameters spread prior mass thin without improving fit.
Posterior probabilities are decisive: With enough data, evidence strongly concentrates on the correct model.
BIC approximates evidence: For this smooth case, BIC rankings match evidence rankings closely.
This example demonstrates that model evidence operationalizes the intuition that we want 'as simple as possible, but no simpler.'
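The BIC claim above is easy to probe in the conjugate Gaussian-mean model used earlier (chosen because its exact evidence is available in closed form): $-\tfrac{1}{2}\mathrm{BIC} = \log \hat{L} - \tfrac{k}{2}\log n$ tracks the exact log evidence up to a bounded gap as $n$ grows, even as both quantities diverge.

```python
import numpy as np

rng = np.random.default_rng(3)
prior_mean, prior_var, noise_var = 0.0, 4.0, 1.0

gaps = []
print(f"{'n':>6} {'log evidence':>14} {'-BIC/2':>10} {'gap':>8}")
for n in [10, 100, 1000, 10000]:
    x = rng.normal(1.0, np.sqrt(noise_var), n)
    x_bar = x.mean()

    # Exact log evidence for the conjugate Gaussian-mean model
    log_ev = (-n / 2 * np.log(2 * np.pi * noise_var)
              - 0.5 * np.log(n * prior_var / noise_var + 1)
              - ((x - x_bar) ** 2).sum() / (2 * noise_var)
              - n * (x_bar - prior_mean) ** 2 / (2 * (noise_var + n * prior_var)))

    # -BIC/2 with k = 1 free parameter (the mean), likelihood at the MLE mu = x_bar
    max_ll = (-n / 2 * np.log(2 * np.pi * noise_var)
              - ((x - x_bar) ** 2).sum() / (2 * noise_var))
    half_neg_bic = max_ll - 0.5 * np.log(n)

    gaps.append(log_ev - half_neg_bic)
    print(f"{n:>6} {log_ev:>14.2f} {half_neg_bic:>10.2f} {gaps[-1]:>8.2f}")
```

The gap stays O(1) while both columns grow in magnitude, which is exactly why BIC-based rankings tend to agree with evidence-based rankings once $n$ is moderately large.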
Let's consolidate what we've learned about model evidence: it is the prior predictive probability of the observed data; it decomposes into expected fit minus a KL complexity penalty; it equals the product of sequential one-step-ahead predictions; maximizing it is equivalent to minimizing description length; maximizing it over hyperparameters gives type-II maximum likelihood; and it can mislead under badly misspecified priors or when all candidate models are wrong.
Looking ahead: We've seen that model evidence embodies an automatic preference for simpler models—Occam's razor. In the next page, we'll explore this connection in depth, understanding why Bayesian inference inherently favors parsimony and what 'simplicity' really means in a probabilistic context.
The Bayesian Occam's razor isn't a heuristic or add-on—it emerges directly from the mathematics of probability. Understanding this provides deep insight into why Bayesian model comparison works.
You now understand model evidence at a deep level—as prior predictive probability, as a fit-complexity trade-off, and as optimal compression. You've seen how to maximize evidence for hyperparameter learning and where evidence can fail. Next, we'll explore the Bayesian Occam's razor that emerges from these principles.