We've established that the marginal likelihood $p(\mathcal{D}|\mathcal{M})$—also called model evidence—forms the basis of Bayesian model comparison. But what exactly does this quantity measure? Why should the integral of likelihood over the prior distribution be the 'right' way to evaluate a model?
This page explores the deep conceptual foundations of model evidence. We'll see that it naturally encapsulates the bias-variance trade-off, embodies predictive accuracy, and implements an automatic Occam's razor that prefers simpler models when complexity isn't justified by the data.
Understanding model evidence at this level transforms it from a mathematical convenience into a principled philosophy of model quality.
By the end of this page, you will understand: (1) Model evidence as a measure of predictive accuracy on the observed data, (2) The decomposition of evidence into data fit and complexity penalty, (3) How model evidence relates to compression and minimum description length, (4) The connection between evidence maximization and type-II maximum likelihood, (5) When model evidence succeeds and fails as a model selection criterion.
Before observing any data, a Bayesian model makes probabilistic predictions. If we sample parameters from the prior and then sample data from the likelihood, we get the prior predictive distribution:
$$p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} | \mathcal{M}) \, d\boldsymbol{\theta}$$
This is precisely the marginal likelihood! The model evidence is the probability that the model assigns to the observed data before looking at the data to learn parameters.
Interpretation: Model evidence asks: 'If I hadn't seen the data yet and sampled parameters randomly from my prior, how likely would I be to generate exactly the data I observed?'
A model that assigns high probability to the observed data before learning is a model that 'predicted' the data well. A model that assigns low probability made poor predictions—even if it can fit the data well after parameter tuning.
Think of model evidence as grading each model on its predictions made before seeing the exam answers. A flexible model that could 'predict' anything (uniform predictions) scores poorly because it didn't commit. A model that correctly anticipated the actual data scores highly. This is why evidence penalizes unnecessary flexibility.
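This "prediction before seeing the answers" reading can be made concrete with a small Monte Carlo sketch. The model below is an illustrative conjugate Gaussian (unknown mean, known noise variance) with made-up data: sample parameters from the prior, average the likelihood of the observed data over those samples, and compare against the closed-form marginal likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): x_i ~ N(mu, 1), prior mu ~ N(0, 4)
prior_mean, prior_var, noise_var = 0.0, 4.0, 1.0
x = np.array([1.2, 0.8, 1.5])          # "observed" data
n = len(x)

# Monte Carlo estimate of the prior predictive probability of the data:
# draw mu from the prior, evaluate the likelihood, average.
mu_samples = rng.normal(prior_mean, np.sqrt(prior_var), size=200_000)
log_lik = (-n / 2 * np.log(2 * np.pi * noise_var)
           - ((x[None, :] - mu_samples[:, None]) ** 2).sum(axis=1) / (2 * noise_var))
# log-mean-exp for numerical stability
m = log_lik.max()
log_ev_mc = m + np.log(np.mean(np.exp(log_lik - m)))

# Closed-form log marginal likelihood for this conjugate Gaussian model
x_bar = x.mean()
log_ev_exact = (-n / 2 * np.log(2 * np.pi * noise_var)
                - 0.5 * np.log(n * prior_var / noise_var + 1)
                - ((x - x_bar) ** 2).sum() / (2 * noise_var)
                - n * (x_bar - prior_mean) ** 2 / (2 * (noise_var + n * prior_var)))

print(f"Monte Carlo: {log_ev_mc:.3f}   exact: {log_ev_exact:.3f}")
```

The two numbers agree closely; the Monte Carlo route is exactly the "sample parameters randomly from my prior" experiment described above.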
Model evidence has another interpretation via sequential prediction. Consider observing data points $\mathcal{D} = \{y_1, y_2, \ldots, y_n\}$ one at a time. By the chain rule:
$$p(\mathcal{D} | \mathcal{M}) = p(y_1 | \mathcal{M}) \cdot p(y_2 | y_1, \mathcal{M}) \cdot p(y_3 | y_1, y_2, \mathcal{M}) \cdots p(y_n | y_1, \ldots, y_{n-1}, \mathcal{M})$$
Each term is a one-step-ahead predictive probability:
$$p(y_t | y_1, \ldots, y_{t-1}, \mathcal{M}) = \int p(y_t | \boldsymbol{\theta}) \cdot p(\boldsymbol{\theta} | y_1, \ldots, y_{t-1}, \mathcal{M}) \, d\boldsymbol{\theta}$$
This is the posterior predictive given all previous data.
Interpretation: Model evidence is the product of sequential predictions, where each prediction is made using only the data seen so far. A model with high evidence consistently made accurate predictions throughout the data stream.
The prequential (predictive-sequential) view: This shows that model evidence is essentially a measure of out-of-sample prediction, but evaluated on the actual sample using the Bayesian updating mechanism. It's like leave-one-out cross-validation, but with a specific update scheme.
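The chain-rule identity can be verified numerically. The sketch below uses an assumed conjugate Gaussian-mean model (unknown mean, known noise variance, invented data), where both the one-step-ahead predictives and the batch marginal likelihood have closed forms; the sequential product matches the batch evidence to machine precision.

```python
import numpy as np
from scipy import stats

# Toy conjugate model (an assumption for illustration):
# y_i ~ N(mu, noise_var), prior mu ~ N(prior_mean, prior_var)
prior_mean, prior_var, noise_var = 0.0, 4.0, 1.0
y = np.array([1.2, 0.8, 1.5, 0.3, 1.1])
n = len(y)

# Prequential evaluation: predict each point from the posterior
# given only the earlier points, then update.
m, v = prior_mean, prior_var          # current posterior over mu
log_ev_seq = 0.0
for y_t in y:
    # one-step-ahead predictive: y_t ~ N(m, v + noise_var)
    log_ev_seq += stats.norm.logpdf(y_t, loc=m, scale=np.sqrt(v + noise_var))
    # conjugate posterior update after observing y_t
    v_new = 1 / (1 / v + 1 / noise_var)
    m = v_new * (m / v + y_t / noise_var)
    v = v_new

# Batch closed-form log marginal likelihood
y_bar = y.mean()
log_ev_batch = (-n / 2 * np.log(2 * np.pi * noise_var)
                - 0.5 * np.log(n * prior_var / noise_var + 1)
                - ((y - y_bar) ** 2).sum() / (2 * noise_var)
                - n * (y_bar - prior_mean) ** 2 / (2 * (noise_var + n * prior_var)))

print(f"sequential: {log_ev_seq:.6f}   batch: {log_ev_batch:.6f}")
```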
| Measure | What It Evaluates | Complexity Handling | Bayesian? |
|---|---|---|---|
| Model Evidence | Prior predictive probability of data | Automatic via integration | Yes |
| Maximum Likelihood | Best-case fit | None (always prefers complex) | No |
| AIC | Estimated prediction error | Linear penalty (2k) | No |
| BIC | Approximation to log evidence | Log-linear penalty (k·log n) | Approx. |
| Cross-Validation | Average held-out prediction | Implicit via data splitting | No |
We can formally decompose model evidence to see how it balances data fit against model complexity.
For any distribution $q(\boldsymbol{\theta})$ over parameters, we have:
$$\log p(\mathcal{D} | \mathcal{M}) = \underbrace{\mathbb{E}_q[\log p(\mathcal{D} | \boldsymbol{\theta})]}_{\text{Expected Log-Likelihood}} - \underbrace{\text{KL}(q \| p_{\text{prior}})}_{\text{Complexity Penalty}} + \underbrace{\text{KL}(q \| p_{\text{posterior}})}_{\geq 0}$$
When $q = p_{\text{posterior}}$, the last term vanishes, giving:
$$\log p(\mathcal{D} | \mathcal{M}) = \mathbb{E}_{\text{posterior}}[\log p(\mathcal{D} | \boldsymbol{\theta})] - \text{KL}(p_{\text{posterior}} \| p_{\text{prior}})$$
Term 1 (Expected Log-Likelihood): How well does the model fit the data on average across posterior parameter values? Higher is better. Term 2 (KL Divergence): How much did the posterior 'move' from the prior? This measures complexity—models that need to learn a lot from data (prior → posterior shift) are penalized. Lower is better.
This decomposition makes the trade-off explicit:
Simple model with moderate fit: a decent expected log-likelihood and a small KL penalty, since the posterior stays close to the prior—evidence is respectable.
Complex model that overfits: a high expected log-likelihood, but a large KL penalty because the posterior must travel far from the prior to chase the data—evidence drops.
Appropriately complex model: a good expected log-likelihood with only a moderate KL penalty—evidence is highest.
The key insight: Model evidence doesn't just reward good fit—it rewards efficient fit. A model that achieves good fit while staying close to its prior beliefs is preferred.
```python
import numpy as np

def compute_evidence_decomposition(x, prior_mean, prior_var, noise_var):
    """
    Decompose the log marginal likelihood for Bayesian estimation of a Gaussian mean.

    Model: x_i ~ N(mu, noise_var), mu ~ N(prior_mean, prior_var)

    Returns
    -------
    log_evidence : float
        Log marginal likelihood
    expected_ll : float
        Expected log-likelihood under the posterior
    kl_divergence : float
        KL(posterior || prior)
    """
    n = len(x)
    x_mean = np.mean(x)

    # Posterior parameters (conjugate Gaussian update)
    post_var = 1 / (n / noise_var + 1 / prior_var)
    post_mean = post_var * (n * x_mean / noise_var + prior_mean / prior_var)

    # Expected log-likelihood under the posterior:
    # E[(x_i - mu)^2] = (x_i - post_mean)^2 + post_var
    expected_sq_diff = np.sum((x - post_mean) ** 2) + n * post_var
    expected_ll = -n / 2 * np.log(2 * np.pi * noise_var) - expected_sq_diff / (2 * noise_var)

    # KL divergence: KL(N(post_mean, post_var) || N(prior_mean, prior_var))
    kl_divergence = 0.5 * (
        np.log(prior_var / post_var)
        + (post_var + (post_mean - prior_mean) ** 2) / prior_var
        - 1
    )

    # Log marginal likelihood (closed form for the conjugate Gaussian model)
    log_evidence = (
        -n / 2 * np.log(2 * np.pi * noise_var)
        - 0.5 * np.log(n * prior_var / noise_var + 1)
        - np.sum((x - x_mean) ** 2) / (2 * noise_var)
        - n * (x_mean - prior_mean) ** 2 / (2 * (noise_var + n * prior_var))
    )

    return log_evidence, expected_ll, kl_divergence

# Example: compare prior specifications
np.random.seed(42)
true_mu = 5.0
noise_var = 1.0
x = np.random.normal(true_mu, np.sqrt(noise_var), 20)

print("Evidence Decomposition for Gaussian Mean Estimation")
print("True μ = 5.0, observed x̄ = {:.2f}".format(np.mean(x)))
print(f"{'Prior':<32} {'E[log L]':<12} {'KL':<12} {'Evidence':<12}")
print("-" * 68)

priors = [
    ("Well-calibrated: N(5, 4)", 5.0, 4.0),
    ("Slightly misspecified: N(3, 4)", 3.0, 4.0),
    ("Diffuse: N(0, 100)", 0.0, 100.0),
    ("Too confident, wrong: N(0, 0.1)", 0.0, 0.1),
    ("Too confident, right: N(5, 0.5)", 5.0, 0.5),
]

for name, prior_mean, prior_var in priors:
    log_ev, exp_ll, kl = compute_evidence_decomposition(x, prior_mean, prior_var, noise_var)
    print(f"{name:<32} {exp_ll:>10.2f} {kl:>10.2f} {log_ev:>10.2f}")
```

A beautiful perspective on model evidence comes from information theory: maximizing evidence is equivalent to finding the model that achieves the best data compression.
Minimum Description Length (MDL) is a principle from information theory: the best model is the one that provides the shortest description of the data. Description length is measured in bits (or nats, using natural logarithm).
The total description length for transmitting data using a model has two parts:
$$L_{\text{total}} = L_{\text{model}} + L_{\text{data|model}}$$
Simple models are cheap to describe but may fit data poorly (high $L_{\text{data|model}}$). Complex models describe data well but are expensive to transmit (high $L_{\text{model}}$).
The connection to evidence: Under certain coding schemes, the optimal description length is:
$$L_{\text{total}} = -\log p(\mathcal{D} | \mathcal{M})$$
Maximizing marginal likelihood = Minimizing description length
A good model 'compresses' data by discovering patterns. Random noise can't be compressed—it's incompressible. A model that fits noise wastes description bits on random fluctuations. A model that captures true structure compresses real patterns while not wasting bits on noise. Evidence automatically finds this balance.
The connection becomes explicit with prequential coding:
The total code length is:
$$L = -\sum_{t=1}^{n} \log p(y_t | y_1, \ldots, y_{t-1}, \mathcal{M}) = -\log p(\mathcal{D} | \mathcal{M})$$
This is exactly the negative log marginal likelihood! The Bayesian updates naturally implement an optimal sequential compression scheme.
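A minimal illustration of the coding view, using an assumed Beta-Bernoulli coin model with invented flip counts: a fair-coin code spends exactly one bit per flip, while a model that learns the bias pays a small parameter cost up front but compresses biased data far better. Code length in bits is just $-\log_2 p(\mathcal{D} | \mathcal{M})$.

```python
import numpy as np
from scipy.special import betaln

# Toy coin-flip data with a clear bias: 18 heads out of 20
h, t = 18, 2
n = h + t

# Model A: fair coin, no free parameters.
# Code length = -log2 p(D|A) = n bits, one bit per flip.
bits_fair = n * 1.0

# Model B: Bernoulli(theta) with a uniform Beta(1,1) prior on theta.
# log p(D|B) = log B(1+h, 1+t) - log B(1,1)  (Beta-Bernoulli marginal likelihood)
log_ev_b = betaln(1 + h, 1 + t) - betaln(1, 1)
bits_learned = -log_ev_b / np.log(2)

print(f"fair coin:    {bits_fair:.2f} bits")
print(f"learned coin: {bits_learned:.2f} bits")
# The model that learns the bias compresses the biased data better,
# even though it implicitly "pays" for the unknown parameter.
```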
Implications: a model with high evidence is, equivalently, a good compressor of the data. The table below summarizes how this plays out across the complexity spectrum.
| Model Type | L(model) | L(data \| model) | Total Bits | Evidence |
|---|---|---|---|---|
| Underfit (too simple) | Low | High (poor predictions) | Often high | Low |
| Just right (balanced) | Moderate | Low (good predictions) | Minimal | High |
| Overfit (too complex) | High | Low (fits everything) | Often high | Low |
An important application of model evidence is evidence maximization (also called Type-II Maximum Likelihood or empirical Bayes): learning hyperparameters by maximizing marginal likelihood.
Consider a model with parameters $\boldsymbol{\theta}$, a prior $p(\boldsymbol{\theta} | \boldsymbol{\alpha})$ governed by hyperparameters $\boldsymbol{\alpha}$, and a likelihood $p(\mathcal{D} | \boldsymbol{\theta})$.
The marginal likelihood is a function of hyperparameters:
$$p(\mathcal{D} | \boldsymbol{\alpha}) = \int p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta} | \boldsymbol{\alpha}) d\boldsymbol{\theta}$$
Evidence maximization finds:
$$\boldsymbol{\alpha}^* = \arg\max_{\boldsymbol{\alpha}} p(\mathcal{D} | \boldsymbol{\alpha})$$
This learns 'how much regularization' directly from the data, without cross-validation.
In Automatic Relevance Determination (ARD), each feature has its own prior precision:
$$\beta_j \sim \mathcal{N}(0, \alpha_j^{-1})$$
Maximizing evidence over $\{\alpha_j\}$ automatically drives $\alpha_j \to \infty$ for irrelevant features—shrinking their weights to zero—while keeping $\alpha_j$ finite for features the data supports.
This achieves automatic feature selection purely through evidence maximization, without cross-validation or L1 regularization.
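For a quick experiment, scikit-learn ships an ARD implementation. The sketch below (assuming scikit-learn is installed; the synthetic data are an illustrative choice) shows weights of irrelevant features being pruned toward zero by evidence maximization.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
n, d = 200, 10

# Synthetic data: only the first 3 of 10 features are relevant
beta = np.zeros(d)
beta[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(n, d))
y = X @ beta + 0.1 * rng.normal(size=n)

# ARD: one precision hyperparameter per weight, learned by
# (approximate) evidence maximization.
ard = ARDRegression()
ard.fit(X, y)

print("learned coefficients:", np.round(ard.coef_, 2))
# Irrelevant features get large precisions (lambda_), pinning their weights near 0.
print("weight precisions:   ", np.round(ard.lambda_, 1))
```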
In Gaussian process regression, the kernel function has hyperparameters (length scales, signal variance). Evidence maximization finds kernel hyperparameters that best explain the data:
$$\{\ell^*, \sigma_f^*, \sigma_n^*\} = \arg\max_{\ell, \sigma_f, \sigma_n} \log p(\mathbf{y} | \mathbf{X}, \ell, \sigma_f, \sigma_n)$$
This is the standard approach for GP hyperparameter learning.
```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.linalg import cho_factor, cho_solve

def log_evidence_ridge(X, y, alpha, noise_var=1.0):
    """
    Compute the log marginal likelihood for Bayesian ridge regression.

    Prior: beta ~ N(0, (1/alpha) * I)
    Likelihood: y ~ N(X @ beta, noise_var * I)
    """
    n, d = X.shape

    # Posterior precision: A = X.T @ X / noise_var + alpha * I
    A = X.T @ X / noise_var + alpha * np.eye(d)
    L, lower = cho_factor(A, lower=True)

    # Posterior mean: A^{-1} @ X.T @ y / noise_var
    post_mean = cho_solve((L, lower), X.T @ y / noise_var)

    # log|A| from the Cholesky factor
    log_det_A = 2 * np.sum(np.log(np.diag(L)))

    # Log evidence for the Gaussian linear model
    residual = y - X @ post_mean
    log_ev = -0.5 * (
        n * np.log(2 * np.pi * noise_var)
        + np.dot(residual, residual) / noise_var
        + alpha * np.dot(post_mean, post_mean)
        + log_det_A
        - d * np.log(alpha)
    )
    return log_ev

def optimize_ridge_alpha(X, y, noise_var=1.0):
    """Find the optimal ridge regularization by maximizing evidence."""
    def neg_log_evidence(log_alpha):
        return -log_evidence_ridge(X, y, np.exp(log_alpha), noise_var)

    # Optimize over log(alpha) for numerical stability
    result = minimize_scalar(neg_log_evidence, bounds=(-10, 10), method='bounded')
    return np.exp(result.x)

# Example: evidence-based choice of regularization strength
np.random.seed(42)
n, d = 100, 20

# True model: only the first 5 features are relevant
true_beta = np.zeros(d)
true_beta[:5] = np.array([2, -1, 0.5, 1.5, -0.8])

X = np.random.randn(n, d)
y = X @ true_beta + 0.5 * np.random.randn(n)

# Maximize evidence over alpha
optimal_alpha = optimize_ridge_alpha(X, y, noise_var=0.25)

print("Evidence Maximization for Ridge Regression")
print(f"Optimal α (from evidence): {optimal_alpha:.4f}")

# Show the evidence landscape around the optimum
print("Log evidence at different α values:")
for a in [0.001, 0.01, 0.1, 1, 10, 100]:
    ev = log_evidence_ridge(X, y, a, noise_var=0.25)
    star = " *" if abs(np.log10(a) - np.log10(optimal_alpha)) < 0.3 else ""
    print(f"  α = {a:<6.3f}: log p(D|α) = {ev:.2f}{star}")
```

Evidence maximization treats hyperparameters as point estimates rather than integrating over them. This can underestimate uncertainty, especially with limited data. For a full Bayesian treatment, place hyperpriors on the hyperparameters and integrate over them. Evidence maximization can also overfit the hyperparameters when the hyperparameter space is large.
Model evidence is a powerful principle, but it has failure modes. Understanding when it fails is as important as knowing when it succeeds.
Model evidence rewards models that 'predicted' the data well from their prior. If the prior is badly misspecified, evidence unfairly penalizes the model: a prior concentrated in the wrong region makes even a correct likelihood family look poor, while an absurdly diffuse prior spreads predictive mass so thin that no dataset looks likely.
Solution: Use principled priors based on domain knowledge or previous data. Conduct prior predictive checks to ensure priors allow reasonable data patterns.
The M-open scenario occurs when the true data-generating process is not among the candidate models. In this case:
Example: The true relationship is $y = \sin(x) + \epsilon$, and we compare a linear model against a quadratic model—neither of which contains the truth.
Evidence will favor quadratic (it's closer to the truth), but neither is 'right,' and the evidence values don't indicate how far either is from reality.
Solution: Supplement evidence with posterior predictive checks. Look for systematic patterns in residuals that indicate model misspecification.
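As a sketch of such a residual check (with illustrative synthetic data and an ordinary least-squares fit standing in for the full Bayesian fit): fit the misspecified quadratic to sinusoidal data and inspect the lag-1 autocorrelation of the residuals. White-noise residuals would give a value near zero; leftover structure shows up as strong correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + 0.1 * rng.normal(size=100)

# Fit the (misspecified) quadratic model and examine its residuals.
coeffs = np.polyfit(x, y, deg=2)
resid = y - np.polyval(coeffs, x)

# Simple diagnostic: lag-1 autocorrelation of the residuals.
# Near 0 => residuals look like noise; near 1 => systematic leftover pattern.
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"lag-1 residual autocorrelation (quadratic fit): {lag1:.2f}")
```

The strong positive autocorrelation flags the misspecification that a raw evidence comparison between two wrong models would never reveal.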
Use evidence as one input among several. Combine with: (1) Prior predictive checks—do priors generate plausible data? (2) Posterior predictive checks—does the fitted model reproduce observed patterns? (3) Cross-validation—does the model generalize to held-out data? (4) Domain knowledge—does the 'best' model make scientific sense?
Let's work through a complete example showing how model evidence guides model selection in a realistic scenario.
We observe noisy data from an unknown function and must select the polynomial degree. This is a classic bias-variance trade-off:
Model evidence should automatically find the sweet spot.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
import warnings

def bayesian_polynomial_regression(x, y, degree, prior_var=10.0, noise_var=0.1):
    """
    Fit Bayesian polynomial regression and compute the model evidence.

    Returns
    -------
    log_evidence : float
    posterior_mean : array
    posterior_cov : array
    """
    # Design matrix
    X = np.column_stack([x**i for i in range(degree + 1)])
    n, d = X.shape

    # Prior: beta ~ N(0, prior_var * I)
    prior_precision = np.eye(d) / prior_var

    # Posterior
    post_precision = X.T @ X / noise_var + prior_precision
    L, lower = cho_factor(post_precision, lower=True)
    post_cov = cho_solve((L, lower), np.eye(d))
    post_mean = post_cov @ X.T @ y / noise_var

    # Log evidence: marginally, y ~ N(0, prior_var * X X^T + noise_var * I)
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L_K, _ = cho_factor(K, lower=True)
    log_det_K = 2 * np.sum(np.log(np.diag(L_K)))
    alpha = cho_solve((L_K, True), y)
    log_evidence = -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))

    return log_evidence, post_mean, post_cov

def compare_polynomial_models(x, y, max_degree=10, prior_var=10.0, noise_var=0.1):
    """Compare polynomial models of different degrees using evidence."""
    results = []
    for d in range(max_degree + 1):
        try:
            log_ev, post_mean, post_cov = bayesian_polynomial_regression(
                x, y, d, prior_var, noise_var
            )
            results.append({
                'degree': d,
                'log_evidence': log_ev,
                'n_params': d + 1,
                'posterior_mean': post_mean,
            })
        except Exception as e:
            warnings.warn(f"Degree {d} failed: {e}")

    # Posterior model probabilities (assuming equal prior model probabilities)
    log_evs = np.array([r['log_evidence'] for r in results])
    probs = np.exp(log_evs - np.max(log_evs))
    probs /= probs.sum()
    for i, r in enumerate(results):
        r['posterior_prob'] = probs[i]

    return results

# Generate synthetic data from a cubic function
np.random.seed(42)
n = 50
x = np.linspace(-1, 1, n)

# True function: cubic
y_true = 0.5 - 0.3*x + 0.8*x**2 - 0.4*x**3
noise_var_true = 0.04
y = y_true + np.sqrt(noise_var_true) * np.random.randn(n)

# Compare models
results = compare_polynomial_models(x, y, max_degree=10, prior_var=10.0, noise_var=0.04)

print("Model Comparison via Evidence")
print("True model: cubic (degree 3)")
print(f"{'Degree':<8} {'# Params':<10} {'Log Evidence':<15} {'P(M|D)':<12} {'Bayes Factor'}")
print("-" * 70)

best_evidence = max(r['log_evidence'] for r in results)
for r in results:
    bf = np.exp(r['log_evidence'] - best_evidence)
    star = " ← best" if r['log_evidence'] == best_evidence else ""
    print(f"{r['degree']:<8} {r['n_params']:<10} {r['log_evidence']:>12.2f} "
          f"{r['posterior_prob']:>10.1%} {bf:>8.3f}{star}")

# Report the evidence-selected model
best = max(results, key=lambda r: r['log_evidence'])
print(f"Evidence-selected model: degree {best['degree']}")
print(f"Posterior probability: {best['posterior_prob']:.1%}")
```

Evidence peaks at the true complexity: for data from a cubic, evidence is highest around degree 3.
Simpler models have lower evidence: They can't capture the true pattern, hurting the likelihood term.
More complex models have lower evidence: Extra parameters spread prior mass thin without improving fit.
Posterior probabilities are decisive: With enough data, evidence strongly concentrates on the correct model.
BIC approximates evidence: For this smooth case, BIC rankings match evidence rankings closely.
This example demonstrates that model evidence operationalizes the intuition that we want 'as simple as possible, but no simpler.'
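The BIC claim above is easy to probe in the conjugate Gaussian-mean model used earlier (chosen because its exact evidence is available in closed form): $-\tfrac{1}{2}\mathrm{BIC} = \log \hat{L} - \tfrac{k}{2}\log n$ tracks the exact log evidence up to a bounded gap as $n$ grows, even as both quantities diverge.

```python
import numpy as np

rng = np.random.default_rng(3)
prior_mean, prior_var, noise_var = 0.0, 4.0, 1.0

gaps = []
print(f"{'n':>6} {'log evidence':>14} {'-BIC/2':>10} {'gap':>8}")
for n in [10, 100, 1000, 10000]:
    x = rng.normal(1.0, np.sqrt(noise_var), n)
    x_bar = x.mean()

    # Exact log evidence for the conjugate Gaussian-mean model
    log_ev = (-n / 2 * np.log(2 * np.pi * noise_var)
              - 0.5 * np.log(n * prior_var / noise_var + 1)
              - ((x - x_bar) ** 2).sum() / (2 * noise_var)
              - n * (x_bar - prior_mean) ** 2 / (2 * (noise_var + n * prior_var)))

    # -BIC/2 with k = 1 free parameter (the mean), likelihood at the MLE mu = x_bar
    max_ll = (-n / 2 * np.log(2 * np.pi * noise_var)
              - ((x - x_bar) ** 2).sum() / (2 * noise_var))
    half_neg_bic = max_ll - 0.5 * np.log(n)

    gaps.append(log_ev - half_neg_bic)
    print(f"{n:>6} {log_ev:>14.2f} {half_neg_bic:>10.2f} {gaps[-1]:>8.2f}")
```

The gap stays O(1) while both columns grow in magnitude, which is exactly why BIC-based rankings tend to agree with evidence-based rankings once $n$ is moderately large.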
Let's consolidate what we've learned about model evidence: it is the prior predictive probability of the observed data; it decomposes into expected fit minus a KL complexity penalty; it equals the product of sequential one-step-ahead predictions; maximizing it is equivalent to minimizing description length; maximizing it over hyperparameters gives type-II maximum likelihood; and it can mislead under badly misspecified priors or when all candidate models are wrong.
Looking ahead: We've seen that model evidence embodies an automatic preference for simpler models—Occam's razor. In the next page, we'll explore this connection in depth, understanding why Bayesian inference inherently favors parsimony and what 'simplicity' really means in a probabilistic context.
The Bayesian Occam's razor isn't a heuristic or add-on—it emerges directly from the mathematics of probability. Understanding this provides deep insight into why Bayesian model comparison works.
You now understand model evidence at a deep level—as prior predictive probability, as a fit-complexity trade-off, and as optimal compression. You've seen how to maximize evidence for hyperparameter learning and where evidence can fail. Next, we'll explore the Bayesian Occam's razor that emerges from these principles.