Occam's razor—the principle that simpler explanations should be preferred over complex ones—has guided scientific thinking for centuries. But in machine learning, we often struggle to formalize what 'simple' means and why simpler should be better.
Bayesian inference provides a principled answer. As we've seen, marginal likelihood automatically penalizes model complexity. But this isn't an ad hoc penalty or a regularization term we manually add—it emerges directly from the axioms of probability theory.
This page explores the deep connection between Bayesian inference and Occam's razor. We'll understand exactly why complex models are penalized, what 'complexity' means in this context, and when the Bayesian Occam's razor succeeds or fails.
By the end of this page, you will understand: (1) The geometric intuition for why Bayesian inference penalizes complexity, (2) The precise mathematical mechanisms implementing Occam's razor, (3) The difference between parameter counting and 'effective' complexity, (4) How the Occam factor quantifies the complexity penalty, (5) Real-world examples demonstrating automatic parsimony.
Every model defines a prior predictive distribution—the set of data patterns it considers possible, weighted by probability. This is the marginal distribution over data:
$$p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\boldsymbol{\theta}$$
A crucial constraint applies: this distribution must integrate to 1 over all possible datasets. The model has a fixed 'probability budget' to allocate across the space of possible data.
A simple model has limited flexibility. It can only generate certain data patterns:
The intuition: Simple models 'commit' to specific predictions. When right, they're rewarded for their specificity.
A complex model has many parameters and can generate diverse data patterns:
The analogy: Imagine betting on a horse race. Spreading your stake across every horse guarantees at best a small payout; if you know which horse will win, you should commit. The 'reward' for correct commitment is what drives Occam's razor.
The total probability mass over all possible datasets is exactly 1. A complex model spreads this mass more thinly across more patterns. A simple model concentrates mass on fewer patterns. When you observe specific data, the model that assigned highest probability to that specific pattern wins. Complexity is penalized because probability is conserved.
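This conservation argument can be made concrete with a toy discrete example (all numbers hypothetical): two models allocate a unit probability budget over four possible datasets, one concentrating its mass and one spreading it evenly.

```python
# Toy illustration (hypothetical numbers): two models allocate a unit
# probability budget over four possible datasets.

# The simple model commits most of its mass to one pattern.
p_simple = {"D1": 0.70, "D2": 0.20, "D3": 0.05, "D4": 0.05}

# The complex model can explain everything, so its mass is spread thin.
p_complex = {"D1": 0.25, "D2": 0.25, "D3": 0.25, "D4": 0.25}

# Both budgets sum to exactly 1 -- probability is conserved.
assert abs(sum(p_simple.values()) - 1) < 1e-12
assert abs(sum(p_complex.values()) - 1) < 1e-12

# If the observed data is the pattern the simple model committed to,
# it wins the evidence comparison despite being less flexible.
observed = "D1"
print(f"Bayes factor (simple vs complex): {p_simple[observed] / p_complex[observed]:.1f}")  # 2.8

# If the data instead shows a pattern the simple model nearly ruled out,
# complexity pays off.
observed = "D3"
print(f"Bayes factor (simple vs complex): {p_simple[observed] / p_complex[observed]:.1f}")  # 0.2
```

The simple model wins by a factor of 2.8 when its committed prediction comes true, and loses by a factor of 5 when the data falls in a pattern it nearly ruled out — exactly the trade described above.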
Consider the space of all possible datasets as a high-dimensional region:
Simple Model (e.g., straight line):
Complex Model (e.g., high-degree polynomial):
When does complexity pay off?
Only when the complex model can achieve so much higher likelihood that it overcomes the dilution penalty. This requires the true data to exhibit complexity that the simple model genuinely cannot capture.
Let's formalize the intuition using the geometry of parameter spaces.
For a model with $d$ parameters, the prior occupies a volume in $d$-dimensional space. If parameters are roughly independent with characteristic scale $\Delta\theta_{\text{prior}}$, the prior 'volume' scales as:
$$V_{\text{prior}} \sim (\Delta\theta_{\text{prior}})^d$$
With more parameters:
After seeing data, the posterior concentrates on parameters consistent with the observations. The posterior occupies a much smaller volume:
$$V_{\text{posterior}} \sim (\Delta\theta_{\text{posterior}})^d$$
For well-identified parameters, $\Delta\theta_{\text{posterior}} \ll \Delta\theta_{\text{prior}}$:
The Occam factor is approximately the ratio of posterior to prior volume:
$$\text{Occam Factor} \approx \frac{V_{\text{posterior}}}{V_{\text{prior}}} = \left(\frac{\Delta\theta_{\text{posterior}}}{\Delta\theta_{\text{prior}}}\right)^d$$
Key observations:
Always less than 1: Since $\Delta\theta_{\text{posterior}} < \Delta\theta_{\text{prior}}$, the Occam factor is $< 1$.
Exponential in dimension: Adding parameters multiplies the penalty by the per-parameter factor.
Data informativeness matters: If data strongly constrains parameters (small $\Delta\theta_{\text{posterior}}$), the penalty is larger per parameter.
Unused parameters cost little but never help: Parameters not constrained by data ($\Delta\theta_{\text{posterior}} \approx \Delta\theta_{\text{prior}}$) contribute a per-parameter factor near 1, so their Occam penalty is mild—but they provide no fit benefit either, so carrying them can never raise the evidence.
| Scenario | Prior Width | Posterior Width | Params (d) | Log Occam Factor (natural log) |
|---|---|---|---|---|
| Strong constraint, few params | 10 | 0.1 | 3 | 3 × log(0.01) = -13.8 |
| Weak constraint, few params | 10 | 1 | 3 | 3 × log(0.1) = -6.9 |
| Strong constraint, many params | 10 | 0.1 | 10 | 10 × log(0.01) = -46.1 |
| Unused parameters | 10 | 9.5 | 5 | 5 × log(0.95) ≈ -0.26 |
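The table's values follow directly from the volume-ratio formula; a minimal sketch (natural logarithms, as in the table):

```python
import math

def log_occam_factor(prior_width, posterior_width, d):
    """Log of the Occam factor (posterior/prior width ratio)^d, natural log."""
    return d * math.log(posterior_width / prior_width)

scenarios = [
    ("Strong constraint, few params", 10, 0.1, 3),
    ("Weak constraint, few params",   10, 1.0, 3),
    ("Strong constraint, many params", 10, 0.1, 10),
    ("Unused parameters",             10, 9.5, 5),
]

for name, w_prior, w_post, d in scenarios:
    print(f"{name:<32} log Occam factor = {log_occam_factor(w_prior, w_post, d):7.2f}")
```

Running this reproduces the table: roughly -13.8, -6.9, -46.1, and -0.26.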
The Occam factor measures how much the model 'learned' from the data. Large posterior shrinkage means the data was informative—the model had to extract substantial information to pin down its parameters. That information carries a 'cost' in the Occam factor: a model flexible enough to require heavy updating could have fit many other datasets equally well, so the evidence discounts it accordingly.
A naive view of model complexity equates it with parameter count. The Bayesian perspective reveals a more sophisticated picture.
Example 1: Unused parameters A model with 100 parameters where 90 are unconstrained by data is effectively a 10-parameter model. The Occam factor penalizes the 90 unused parameters only mildly (their posterior equals prior, so the ratio is ~1).
Example 2: Equivalent parameterizations Any model can be reparameterized with more or fewer named parameters. The number of parameters is coordinate-dependent, but model evidence is coordinate-independent.
Example 3: Regularization effects A neural network with strong weight decay behaves like a simpler model despite having many nominal parameters. The effective complexity is reduced by regularization.
The effective number of parameters (or effective degrees of freedom) is a better complexity measure:
$$d_{\text{eff}} = \sum_j \left(1 - \frac{\sigma^2_{\text{posterior},j}}{\sigma^2_{\text{prior},j}}\right)$$
For each parameter:
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def compute_effective_parameters(X, y, prior_var, noise_var):
    """
    Compute effective number of parameters for Bayesian linear regression.

    The effective parameters measure how many degrees of freedom the data
    actually constrains (vs. parameters that remain at prior).
    """
    n, d = X.shape

    # Prior precision
    prior_precision = 1 / prior_var

    # Posterior precision matrix: (1/noise_var) * X.T @ X + prior_precision * I
    post_precision = X.T @ X / noise_var + prior_precision * np.eye(d)

    # Posterior covariance
    L, lower = cho_factor(post_precision, lower=True)
    post_cov = cho_solve((L, lower), np.eye(d))
    post_var = np.diag(post_cov)

    # Effective parameters: sum of (1 - posterior_var / prior_var)
    d_eff = np.sum(1 - post_var / prior_var)

    # Alternative: via eigenvalues of X.T @ X (agrees with d_eff)
    eigenvals = np.linalg.eigvalsh(X.T @ X / noise_var)
    gamma = eigenvals / (eigenvals + prior_precision)  # Shrinkage factors
    d_eff_alt = np.sum(gamma)

    return d_eff, post_var, gamma


def demonstrate_effective_parameters():
    """Show how effective parameters differ from raw parameter count."""
    np.random.seed(42)

    print("Effective Parameters vs. Raw Parameter Count")
    print("-" * 70)

    # Case 1: All parameters identifiable
    n, d = 100, 5
    X = np.random.randn(n, d)  # Random design, all features useful
    y = X @ np.array([1, -0.5, 0.3, 0.8, -0.2]) + 0.5 * np.random.randn(n)
    d_eff, _, _ = compute_effective_parameters(X, y, prior_var=10.0, noise_var=0.25)
    print("Case 1: Random design, all features informative")
    print(f"  Raw parameters: {d}, Effective: {d_eff:.2f}")

    # Case 2: Collinear features
    X_collinear = np.column_stack([X, X[:, 0] + 0.01 * np.random.randn(n)])  # Near-duplicate
    d_eff2, _, _ = compute_effective_parameters(X_collinear, y, prior_var=10.0, noise_var=0.25)
    print("Case 2: Added collinear feature")
    print(f"  Raw parameters: {X_collinear.shape[1]}, Effective: {d_eff2:.2f}")

    # Case 3: Strong prior (regularization)
    d_eff3, _, _ = compute_effective_parameters(X, y, prior_var=0.1, noise_var=0.25)
    print("Case 3: Strong prior (σ²_prior = 0.1)")
    print(f"  Raw parameters: {d}, Effective: {d_eff3:.2f}")

    # Case 4: Many redundant features
    X_redundant = np.column_stack([X] + [np.random.randn(n) for _ in range(20)])
    y_sparse = X[:, :3] @ np.array([1, -0.5, 0.3]) + 0.5 * np.random.randn(n)
    d_eff4, post_var4, gamma4 = compute_effective_parameters(
        X_redundant, y_sparse, prior_var=10.0, noise_var=0.25
    )
    print("Case 4: Many irrelevant features")
    print(f"  Raw parameters: {X_redundant.shape[1]}, Effective: {d_eff4:.2f}")
    print(f"  Top 5 shrinkage factors: {sorted(gamma4, reverse=True)[:5]}")


demonstrate_effective_parameters()
```

The distinction between raw and effective parameters explains why regularized models can have thousands of parameters yet not overfit, while unregularized models with fewer parameters do overfit. The Bayesian Occam factor naturally accounts for effective complexity—it penalizes based on actual constraint, not nominal parameterization.
The Laplace approximation provides an explicit formula showing how model evidence decomposes into fit and complexity.
Approximating the posterior as Gaussian around its mode $\hat{\boldsymbol{\theta}}$:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$
where $\mathbf{H}$ is the Hessian of the negative log-posterior at its mode.
Rearranging:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \underbrace{\log p(\mathcal{D} | \hat{\boldsymbol{\theta}})}_{\text{Goodness of fit}} + \underbrace{\log p(\hat{\boldsymbol{\theta}})}_{\text{Prior plausibility}} + \underbrace{\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|}_{\text{Occam factor}}$$
Term 1: Log-likelihood at mode
Term 2: Log prior at mode
Term 3: Occam factor
The balance: Complex models gain in Term 1 (better fit) but lose in the Occam factor (more parameters to constrain). Evidence is maximized when these balance optimally.
| Model | log p(D\|θ̂) | Occam Factor | log p(D\|M) | Interpretation |
|---|---|---|---|---|
| Linear (d=2) | -150 | -8 | -158 | Moderate fit, small penalty |
| Quadratic (d=3) | -140 | -12 | -152 | Better fit, pays off |
| Degree 5 (d=6) | -138 | -25 | -163 | Slight fit gain, heavy penalty |
| Degree 10 (d=11) | -136 | -48 | -184 | Tiny fit gain, huge penalty |
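The decomposition can be checked numerically in a conjugate setting where the Laplace approximation is exact: a one-parameter Gaussian-mean model (all numbers below illustrative). Each of the three terms is computed explicitly, and their sum matches the exact evidence.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

np.random.seed(0)
tau2, sigma2 = 4.0, 1.0          # prior variance, noise variance (assumed values)
y = 1.5 + np.sqrt(sigma2) * np.random.randn(20)
n = len(y)

# Posterior mode (= posterior mean, since the model is conjugate Gaussian)
post_prec = n / sigma2 + 1 / tau2        # Hessian of the negative log-posterior
theta_hat = (y.sum() / sigma2) / post_prec

# Laplace decomposition: goodness of fit + prior plausibility + Occam factor
log_fit = norm.logpdf(y, theta_hat, np.sqrt(sigma2)).sum()
log_prior = norm.logpdf(theta_hat, 0.0, np.sqrt(tau2))
occam = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(post_prec)   # d = 1, |H| = post_prec
laplace_evidence = log_fit + log_prior + occam

# Exact evidence: marginally, y ~ N(0, sigma2*I + tau2 * 1 1^T)
cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
exact_evidence = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

print(f"Laplace: {laplace_evidence:.4f}, exact: {exact_evidence:.4f}")
# For a Gaussian posterior the two agree up to round-off error.
```

Note that the Occam term here is negative: the posterior precision exceeds $2\pi$, so concentrating the posterior costs evidence, as the volume argument predicts.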
The BIC approximation BIC = -2 log p(D|θ̂) + d log(n) emerges from Laplace when we assume: (1) the prior is flat, so log p(θ̂) contributes only a constant, and (2) the data provides O(n) information, so the Hessian eigenvalues scale as O(n) and -½ log|H| ≈ -(d/2) log(n) up to constants. Under these assumptions the Occam factor is approximately -(d/2) log(n), giving log p(D|M) ≈ log p(D|θ̂) - (d/2) log(n). BIC is thus a rough Occam penalty.
Let's see the Bayesian Occam's razor in action across different model classes.
As shown in previous pages, evidence naturally selects the correct polynomial degree. For data from a quadratic:
With many potential predictors, evidence automatically identifies relevant features:
```python
import numpy as np
from itertools import combinations
from scipy.linalg import cho_factor, cho_solve


def log_evidence_linear_model(X, y, prior_var=10.0, noise_var=0.25):
    """Compute log marginal likelihood for Bayesian linear regression."""
    n, d = X.shape
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L, lower = cho_factor(K, lower=True)
    log_det_K = 2 * np.sum(np.log(np.diag(L)))
    alpha = cho_solve((L, lower), y)
    return -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))


def exhaustive_feature_selection(X, y, feature_names, prior_var=10.0, noise_var=0.25):
    """
    Evaluate all subsets of features using marginal likelihood.

    Returns sorted list of (feature_set, log_evidence) tuples.
    """
    n_features = X.shape[1]
    results = []

    # Evaluate all non-empty subsets
    for r in range(1, n_features + 1):
        for idx in combinations(range(n_features), r):
            X_subset = X[:, list(idx)]
            # Add intercept
            X_design = np.column_stack([np.ones(len(y)), X_subset])
            log_ev = log_evidence_linear_model(X_design, y, prior_var, noise_var)
            features = [feature_names[i] for i in idx]
            results.append((features, log_ev))

    return sorted(results, key=lambda x: -x[1])


# Example: Feature selection with true and spurious features
np.random.seed(42)
n = 100

# True features
x1 = np.random.randn(n)
x2 = np.random.randn(n)
x3 = np.random.randn(n)

# Spurious features (uncorrelated with y)
x4 = np.random.randn(n)
x5 = np.random.randn(n)

# True model: y = 2*x1 - 1*x2 + 0.5*x3 + noise
y = 2*x1 - 1*x2 + 0.5*x3 + 0.5*np.random.randn(n)

X = np.column_stack([x1, x2, x3, x4, x5])
feature_names = ['x1 (true)', 'x2 (true)', 'x3 (true)', 'x4 (spurious)', 'x5 (spurious)']

# Evaluate all subsets
results = exhaustive_feature_selection(X, y, feature_names)

print("Bayesian Feature Selection: Occam's Razor in Action")
print("True model uses: x1, x2, x3")
print("Spurious features: x4, x5")
print(f"{'Rank':<6} {'Features':<50} {'Log Evidence'}")
print("-" * 75)

for rank, (features, log_ev) in enumerate(results[:10], 1):
    features_str = ', '.join(features)
    print(f"{rank:<6} {features_str:<50} {log_ev:.2f}")

print("...")
print("Worst models (including spurious, excluding true):")
for features, log_ev in results[-3:]:
    features_str = ', '.join(features)
    print(f"  {features_str:<50} {log_ev:.2f}")
```

When clustering data with Gaussian mixture models, evidence helps select the number of clusters:
Bayesian neural networks can use evidence to prune unnecessary weights:
In each example, the 'right' model complexity emerges without manual tuning. No cross-validation folds to choose, no regularization hyperparameter to set. The mathematics of probability directly implements the intuition that unnecessary complexity should be penalized. This is why Bayesian methods are particularly attractive in scientific applications where interpretability matters.
The Bayesian Occam's razor has failure modes. Recognizing these is crucial for appropriate application.
Occam's razor works by rewarding models that 'predicted' the data well from their prior. If priors are poorly calibrated:
Example: The true parameter is $\theta = 50$. Suppose Model A is structurally correct but its prior concentrates far from 50, while Model B is structurally wrong but happens to place prior mass near 50.
Model B might have higher evidence despite being structurally wrong, simply because its prior was accidentally better calibrated.
Solution: Use principled priors based on domain knowledge and conduct sensitivity analysis.
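A minimal sketch of this sensitivity (models and numbers are hypothetical): two Gaussian-mean models with identical structure and likelihood, differing only in where their priors sit. The one whose prior happens to land near the truth wins the evidence comparison.

```python
import numpy as np
from scipy.stats import multivariate_normal

np.random.seed(1)
sigma2 = 1.0
theta_true = 50.0                     # hypothetical true parameter
y = theta_true + np.sqrt(sigma2) * np.random.randn(30)
n = len(y)

def log_evidence(prior_mean, prior_var):
    """Exact evidence for y_i ~ N(theta, sigma2) with theta ~ N(prior_mean, prior_var)."""
    cov = sigma2 * np.eye(n) + prior_var * np.ones((n, n))
    return multivariate_normal.logpdf(y, mean=prior_mean * np.ones(n), cov=cov)

# Model A: prior badly placed (centred at 0, far from 50)
ev_A = log_evidence(prior_mean=0.0, prior_var=100.0)
# Model B: same structure, prior accidentally centred near the truth
ev_B = log_evidence(prior_mean=45.0, prior_var=100.0)

print(f"log evidence A: {ev_A:.2f}, B: {ev_B:.2f}")
# B wins purely because of where its prior mass happened to sit.
```

Rerunning with different prior centres shows how strongly the comparison can swing on prior placement alone, which is exactly why sensitivity analysis matters.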
Occam's razor prefers models that are 'simple' in the sense of making specific predictions—not necessarily simple in terms of interpretability or parameter count.
Example: A model with one parameter but a very diffuse prior is 'complex' in the Bayesian sense because it predicts many data patterns. A model with many parameters but strong priors might be 'simpler' if it predicts specific patterns.
Bayesian 'simplicity' is about predictive commitment, not about parsimony of representation.
Use Occam's razor as a guide, not a law. When evidence strongly favors a simple model, that's informative. When evidence slightly favors a complex model, consider whether the complexity is scientifically meaningful. Always triangulate with cross-validation, predictive checks, and domain expertise.
How does the Bayesian Occam's razor compare to other methods of controlling complexity?
$$\text{AIC} = -2\log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + 2d$$
$$\text{BIC} = -2\log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + d\log n$$
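These two criteria are cheap to compute from the maximized log-likelihood. The sketch below applies them to the illustrative log-likelihood values from the polynomial table above (numbers hypothetical, n = 100 assumed):

```python
import numpy as np

def aic_bic(log_lik_at_mle, d, n):
    """AIC and BIC from the maximized log-likelihood (lower is better)."""
    aic = -2 * log_lik_at_mle + 2 * d
    bic = -2 * log_lik_at_mle + d * np.log(n)
    return aic, bic

# Illustrative values echoing the polynomial comparison above (hypothetical)
n = 100
for name, log_lik, d in [("Linear", -150, 2), ("Quadratic", -140, 3),
                         ("Degree 5", -138, 6), ("Degree 10", -136, 11)]:
    aic, bic = aic_bic(log_lik, d, n)
    print(f"{name:<10} AIC = {aic:6.1f}  BIC = {bic:6.1f}")
```

With these numbers both criteria, like the evidence, select the quadratic — though BIC's log(n) penalty punishes the high-degree models considerably harder than AIC's constant 2d.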
| Method | Complexity Handling | Requires Prior? | Computational Cost | Properties |
|---|---|---|---|---|
| Marginal Likelihood | Automatic via integration | Yes (strong dependence) | High (intractable integrals) | Principled, consistent, prior-sensitive |
| BIC | d·log(n) penalty | No | Low (just MLE + counting) | Consistent, rough approximation |
| AIC | 2d penalty | No | Low (just MLE + counting) | Prediction-focused, liberal |
| Cross-Validation | Implicit via overfitting | No | Medium (multiple fits) | Robust, variance issues |
| WAIC/LOO-IC | Estimated effective params | Somewhat (uses posterior) | Medium-High | Prediction-focused, Bayesian |
Use marginal likelihood when priors are meaningful and you want to compare model structures. Use BIC as a quick approximation when priors are vague. Use AIC when focused on prediction and you expect the true model is complex. Use cross-validation when you distrust parametric assumptions or want to evaluate prediction directly.
Let's consolidate what we've learned about the Bayesian Occam's razor:
Looking ahead: We've now thoroughly understood how to compare models using marginal likelihood, Bayes factors, and Occam's razor. But model comparison isn't the end of the story. Often, no single model is clearly best, and we want to combine the strengths of multiple models. In the next page, we'll explore Bayesian Model Averaging—a principled approach to making predictions by weighting models according to their posterior probabilities.
Model averaging addresses the uncomfortable reality that model uncertainty is real. Rather than committing to a single 'best' model, we can coherently combine all models, weighted by their evidence.
You now understand the Bayesian Occam's razor—why simpler models are preferred when complexity isn't justified, what 'simplicity' means in a probabilistic context, and when this automatic parsimony succeeds or fails. Next, we'll see how to move beyond model selection to model averaging, combining multiple models for robust predictions.