Occam's razor—the principle that simpler explanations should be preferred over complex ones—has guided scientific thinking for centuries. But in machine learning, we often struggle to formalize what 'simple' means and why simpler should be better.
Bayesian inference provides a principled answer. As we've seen, marginal likelihood automatically penalizes model complexity. But this isn't an ad hoc penalty or a regularization term we manually add—it emerges directly from the axioms of probability theory.
This page explores the deep connection between Bayesian inference and Occam's razor. We'll understand exactly why complex models are penalized, what 'complexity' means in this context, and when the Bayesian Occam's razor succeeds or fails.
By the end of this page, you will understand: (1) The geometric intuition for why Bayesian inference penalizes complexity, (2) The precise mathematical mechanisms implementing Occam's razor, (3) The difference between parameter counting and 'effective' complexity, (4) How the Occam factor quantifies the complexity penalty, (5) Real-world examples demonstrating automatic parsimony.
Every model defines a prior predictive distribution—the set of data patterns it considers possible, weighted by probability. This is the marginal distribution over data:
$$p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \boldsymbol{\theta}) p(\boldsymbol{\theta}) d\boldsymbol{\theta}$$
A crucial constraint applies: this distribution must integrate to 1 over all possible datasets. The model has a fixed 'probability budget' to allocate across the space of possible data.
A simple model has limited flexibility. It can only generate certain data patterns:
The intuition: Simple models 'commit' to specific predictions. When right, they're rewarded for their specificity.
A complex model has many parameters and can generate diverse data patterns:
The analogy: Imagine betting on a horse race. Spreading your stake across every horse guarantees at best a small payout; if you know which horse will win, you should commit. The 'reward' for correct commitment is what drives Occam's razor.
The total probability mass over all possible datasets is exactly 1. A complex model spreads this mass more thinly across more patterns. A simple model concentrates mass on fewer patterns. When you observe specific data, the model that assigned highest probability to that specific pattern wins. Complexity is penalized because probability is conserved.
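This conservation argument can be made concrete with a toy discrete example (all numbers hypothetical): two models allocate a unit probability budget over four possible datasets, one concentrating its mass and one spreading it evenly.

```python
# Toy illustration (hypothetical numbers): two models allocate a unit
# probability budget over four possible datasets.

# The simple model commits most of its mass to one pattern.
p_simple = {"D1": 0.70, "D2": 0.20, "D3": 0.05, "D4": 0.05}

# The complex model can explain everything, so its mass is spread thin.
p_complex = {"D1": 0.25, "D2": 0.25, "D3": 0.25, "D4": 0.25}

# Both budgets sum to exactly 1 -- probability is conserved.
assert abs(sum(p_simple.values()) - 1) < 1e-12
assert abs(sum(p_complex.values()) - 1) < 1e-12

# If the observed data is the pattern the simple model committed to,
# it wins the evidence comparison despite being less flexible.
observed = "D1"
print(f"Bayes factor (simple vs complex): {p_simple[observed] / p_complex[observed]:.1f}")  # 2.8

# If the data instead shows a pattern the simple model nearly ruled out,
# complexity pays off.
observed = "D3"
print(f"Bayes factor (simple vs complex): {p_simple[observed] / p_complex[observed]:.1f}")  # 0.2
```

The simple model wins by a factor of 2.8 when its committed prediction comes true, and loses by a factor of 5 when the data falls in a pattern it nearly ruled out — exactly the trade described above.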
Consider the space of all possible datasets as a high-dimensional region:
Simple Model (e.g., straight line):
Complex Model (e.g., high-degree polynomial):
When does complexity pay off?
Only when the complex model can achieve so much higher likelihood that it overcomes the dilution penalty. This requires the true data to exhibit complexity that the simple model genuinely cannot capture.
Let's formalize the intuition using the geometry of parameter spaces.
For a model with $d$ parameters, the prior occupies a volume in $d$-dimensional space. If parameters are roughly independent with characteristic scale $\Delta\theta_{\text{prior}}$, the prior 'volume' scales as:
$$V_{\text{prior}} \sim (\Delta\theta_{\text{prior}})^d$$
With more parameters:
After seeing data, the posterior concentrates on parameters consistent with the observations. The posterior occupies a much smaller volume:
$$V_{\text{posterior}} \sim (\Delta\theta_{\text{posterior}})^d$$
For well-identified parameters, $\Delta\theta_{\text{posterior}} \ll \Delta\theta_{\text{prior}}$:
The Occam factor is approximately the ratio of posterior to prior volume:
$$\text{Occam Factor} \approx \frac{V_{\text{posterior}}}{V_{\text{prior}}} = \left(\frac{\Delta\theta_{\text{posterior}}}{\Delta\theta_{\text{prior}}}\right)^d$$
Key observations:
Always less than 1: Since $\Delta\theta_{\text{posterior}} < \Delta\theta_{\text{prior}}$, the Occam factor is $< 1$.
Exponential in dimension: Adding parameters multiplies the penalty by the per-parameter factor.
Data informativeness matters: If data strongly constrains parameters (small $\Delta\theta_{\text{posterior}}$), the penalty is larger per parameter.
Unused parameters cost little but never help: Parameters not constrained by data ($\Delta\theta_{\text{posterior}} \approx \Delta\theta_{\text{prior}}$) contribute a per-parameter factor near 1, so their Occam penalty is mild—but they provide no fit benefit either, so carrying them can never raise the evidence.
| Scenario | Prior Width | Posterior Width | Params (d) | Log Occam Factor (natural log) |
|---|---|---|---|---|
| Strong constraint, few params | 10 | 0.1 | 3 | 3 × log(0.01) = -13.8 |
| Weak constraint, few params | 10 | 1 | 3 | 3 × log(0.1) = -6.9 |
| Strong constraint, many params | 10 | 0.1 | 10 | 10 × log(0.01) = -46.1 |
| Unused parameters | 10 | 9.5 | 5 | 5 × log(0.95) ≈ -0.26 |
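The table's values follow directly from the volume-ratio formula; a minimal sketch (natural logarithms, as in the table):

```python
import math

def log_occam_factor(prior_width, posterior_width, d):
    """Log of the Occam factor (posterior/prior width ratio)^d, natural log."""
    return d * math.log(posterior_width / prior_width)

scenarios = [
    ("Strong constraint, few params", 10, 0.1, 3),
    ("Weak constraint, few params",   10, 1.0, 3),
    ("Strong constraint, many params", 10, 0.1, 10),
    ("Unused parameters",             10, 9.5, 5),
]

for name, w_prior, w_post, d in scenarios:
    print(f"{name:<32} log Occam factor = {log_occam_factor(w_prior, w_post, d):7.2f}")
```

Running this reproduces the table: roughly -13.8, -6.9, -46.1, and -0.26.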
The Occam factor measures how much the model 'learned' from the data. Large posterior shrinkage means the data was informative—the model had to extract substantial information to pin down its parameters. That information carries a 'cost' in the Occam factor: a model flexible enough to require heavy updating could have fit many other datasets equally well, so the evidence discounts it accordingly.
A naive view of model complexity equates it with parameter count. The Bayesian perspective reveals a more sophisticated picture.
Example 1: Unused parameters A model with 100 parameters where 90 are unconstrained by data is effectively a 10-parameter model. The Occam factor penalizes the 90 unused parameters only mildly (their posterior equals prior, so the ratio is ~1).
Example 2: Equivalent parameterizations Any model can be reparameterized with more or fewer named parameters. The number of parameters is coordinate-dependent, but model evidence is coordinate-independent.
Example 3: Regularization effects A neural network with strong weight decay behaves like a simpler model despite having many nominal parameters. The effective complexity is reduced by regularization.
The effective number of parameters (or effective degrees of freedom) is a better complexity measure:
$$d_{\text{eff}} = \sum_j \left(1 - \frac{\sigma^2_{\text{posterior},j}}{\sigma^2_{\text{prior},j}}\right)$$
For each parameter:
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve


def compute_effective_parameters(X, y, prior_var, noise_var):
    """
    Compute effective number of parameters for Bayesian linear regression.

    The effective parameters measure how many degrees of freedom the data
    actually constrains (vs. parameters that remain at prior).
    """
    n, d = X.shape

    # Prior precision
    prior_precision = 1 / prior_var

    # Posterior precision matrix: (1/noise_var) * X.T @ X + prior_precision * I
    post_precision = X.T @ X / noise_var + prior_precision * np.eye(d)

    # Posterior covariance
    L, lower = cho_factor(post_precision, lower=True)
    post_cov = cho_solve((L, lower), np.eye(d))
    post_var = np.diag(post_cov)

    # Effective parameters: sum of (1 - posterior_var / prior_var)
    d_eff = np.sum(1 - post_var / prior_var)

    # Alternative: via eigenvalues of X.T @ X (agrees with d_eff)
    eigenvals = np.linalg.eigvalsh(X.T @ X / noise_var)
    gamma = eigenvals / (eigenvals + prior_precision)  # Shrinkage factors
    d_eff_alt = np.sum(gamma)

    return d_eff, post_var, gamma


def demonstrate_effective_parameters():
    """Show how effective parameters differ from raw parameter count."""
    np.random.seed(42)

    print("Effective Parameters vs. Raw Parameter Count")
    print("-" * 70)

    # Case 1: All parameters identifiable
    n, d = 100, 5
    X = np.random.randn(n, d)  # Random design, all features useful
    y = X @ np.array([1, -0.5, 0.3, 0.8, -0.2]) + 0.5 * np.random.randn(n)
    d_eff, _, _ = compute_effective_parameters(X, y, prior_var=10.0, noise_var=0.25)
    print("Case 1: Random design, all features informative")
    print(f"  Raw parameters: {d}, Effective: {d_eff:.2f}")

    # Case 2: Collinear features
    X_collinear = np.column_stack([X, X[:, 0] + 0.01 * np.random.randn(n)])  # Near-duplicate
    d_eff2, _, _ = compute_effective_parameters(X_collinear, y, prior_var=10.0, noise_var=0.25)
    print("Case 2: Added collinear feature")
    print(f"  Raw parameters: {X_collinear.shape[1]}, Effective: {d_eff2:.2f}")

    # Case 3: Strong prior (regularization)
    d_eff3, _, _ = compute_effective_parameters(X, y, prior_var=0.1, noise_var=0.25)
    print("Case 3: Strong prior (σ²_prior = 0.1)")
    print(f"  Raw parameters: {d}, Effective: {d_eff3:.2f}")

    # Case 4: Many redundant features
    X_redundant = np.column_stack([X] + [np.random.randn(n) for _ in range(20)])
    y_sparse = X[:, :3] @ np.array([1, -0.5, 0.3]) + 0.5 * np.random.randn(n)
    d_eff4, post_var4, gamma4 = compute_effective_parameters(
        X_redundant, y_sparse, prior_var=10.0, noise_var=0.25
    )
    print("Case 4: Many irrelevant features")
    print(f"  Raw parameters: {X_redundant.shape[1]}, Effective: {d_eff4:.2f}")
    print(f"  Top 5 shrinkage factors: {sorted(gamma4, reverse=True)[:5]}")


demonstrate_effective_parameters()
```

The distinction between raw and effective parameters explains why regularized models can have thousands of parameters yet not overfit, while unregularized models with fewer parameters do overfit. The Bayesian Occam factor naturally accounts for effective complexity—it penalizes based on actual constraint, not nominal parameterization.
The Laplace approximation provides an explicit formula showing how model evidence decomposes into fit and complexity.
Approximating the posterior as Gaussian around its mode $\hat{\boldsymbol{\theta}}$:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$
where $\mathbf{H}$ is the Hessian of the negative log-posterior at its mode.
Rearranging:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \underbrace{\log p(\mathcal{D} | \hat{\boldsymbol{\theta}})}_{\text{Goodness of fit}} + \underbrace{\log p(\hat{\boldsymbol{\theta}})}_{\text{Prior plausibility}} + \underbrace{\frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|}_{\text{Occam factor}}$$
Term 1: Log-likelihood at mode
Term 2: Log prior at mode
Term 3: Occam factor
The balance: Complex models gain in Term 1 (better fit) but lose in the Occam factor (more parameters to constrain). Evidence is maximized when these balance optimally.
| Model | log p(D\|θ̂) | Occam Factor | log p(D\|M) | Interpretation |
|---|---|---|---|---|
| Linear (d=2) | -150 | -8 | -158 | Moderate fit, small penalty |
| Quadratic (d=3) | -140 | -12 | -152 | Better fit, pays off |
| Degree 5 (d=6) | -138 | -25 | -163 | Slight fit gain, heavy penalty |
| Degree 10 (d=11) | -136 | -48 | -184 | Tiny fit gain, huge penalty |
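The decomposition can be checked numerically in a conjugate setting where the Laplace approximation is exact: a one-parameter Gaussian-mean model (all numbers below illustrative). Each of the three terms is computed explicitly, and their sum matches the exact evidence.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

np.random.seed(0)
tau2, sigma2 = 4.0, 1.0          # prior variance, noise variance (assumed values)
y = 1.5 + np.sqrt(sigma2) * np.random.randn(20)
n = len(y)

# Posterior mode (= posterior mean, since the model is conjugate Gaussian)
post_prec = n / sigma2 + 1 / tau2        # Hessian of the negative log-posterior
theta_hat = (y.sum() / sigma2) / post_prec

# Laplace decomposition: goodness of fit + prior plausibility + Occam factor
log_fit = norm.logpdf(y, theta_hat, np.sqrt(sigma2)).sum()
log_prior = norm.logpdf(theta_hat, 0.0, np.sqrt(tau2))
occam = 0.5 * np.log(2 * np.pi) - 0.5 * np.log(post_prec)   # d = 1, |H| = post_prec
laplace_evidence = log_fit + log_prior + occam

# Exact evidence: marginally, y ~ N(0, sigma2*I + tau2 * 1 1^T)
cov = sigma2 * np.eye(n) + tau2 * np.ones((n, n))
exact_evidence = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

print(f"Laplace: {laplace_evidence:.4f}, exact: {exact_evidence:.4f}")
# For a Gaussian posterior the two agree up to round-off error.
```

Note that the Occam term here is negative: the posterior precision exceeds $2\pi$, so concentrating the posterior costs evidence, as the volume argument predicts.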
The BIC approximation BIC = -2 log p(D|θ̂) + d log(n) emerges from Laplace when we assume: (1) the prior is flat, so log p(θ̂) contributes only a constant, and (2) the data provides O(n) information, so the Hessian eigenvalues scale as O(n) and -½ log|H| ≈ -(d/2) log(n) up to constants. Under these assumptions the Occam factor is approximately -(d/2) log(n), giving log p(D|M) ≈ log p(D|θ̂) - (d/2) log(n). BIC is thus a rough Occam penalty.
Let's see the Bayesian Occam's razor in action across different model classes.
As shown in previous pages, evidence naturally selects the correct polynomial degree. For data from a quadratic:
With many potential predictors, evidence automatically identifies relevant features:
```python
import numpy as np
from itertools import combinations
from scipy.linalg import cho_factor, cho_solve


def log_evidence_linear_model(X, y, prior_var=10.0, noise_var=0.25):
    """Compute log marginal likelihood for Bayesian linear regression."""
    n, d = X.shape
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L, lower = cho_factor(K, lower=True)
    log_det_K = 2 * np.sum(np.log(np.diag(L)))
    alpha = cho_solve((L, lower), y)
    return -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))


def exhaustive_feature_selection(X, y, feature_names, prior_var=10.0, noise_var=0.25):
    """
    Evaluate all subsets of features using marginal likelihood.

    Returns sorted list of (feature_set, log_evidence) tuples.
    """
    n_features = X.shape[1]
    results = []

    # Evaluate all non-empty subsets
    for r in range(1, n_features + 1):
        for idx in combinations(range(n_features), r):
            X_subset = X[:, list(idx)]
            # Add intercept
            X_design = np.column_stack([np.ones(len(y)), X_subset])
            log_ev = log_evidence_linear_model(X_design, y, prior_var, noise_var)
            features = [feature_names[i] for i in idx]
            results.append((features, log_ev))

    return sorted(results, key=lambda x: -x[1])


# Example: Feature selection with true and spurious features
np.random.seed(42)
n = 100

# True features
x1 = np.random.randn(n)
x2 = np.random.randn(n)
x3 = np.random.randn(n)

# Spurious features (uncorrelated with y)
x4 = np.random.randn(n)
x5 = np.random.randn(n)

# True model: y = 2*x1 - 1*x2 + 0.5*x3 + noise
y = 2*x1 - 1*x2 + 0.5*x3 + 0.5*np.random.randn(n)

X = np.column_stack([x1, x2, x3, x4, x5])
feature_names = ['x1 (true)', 'x2 (true)', 'x3 (true)', 'x4 (spurious)', 'x5 (spurious)']

# Evaluate all subsets
results = exhaustive_feature_selection(X, y, feature_names)

print("Bayesian Feature Selection: Occam's Razor in Action")
print("True model uses: x1, x2, x3")
print("Spurious features: x4, x5")
print(f"{'Rank':<6} {'Features':<50} {'Log Evidence'}")
print("-" * 75)

for rank, (features, log_ev) in enumerate(results[:10], 1):
    features_str = ', '.join(features)
    print(f"{rank:<6} {features_str:<50} {log_ev:.2f}")

print("...")
print("Worst models (including spurious, excluding true):")
for features, log_ev in results[-3:]:
    features_str = ', '.join(features)
    print(f"  {features_str:<50} {log_ev:.2f}")
```

When clustering data with Gaussian mixture models, evidence helps select the number of clusters:
Bayesian neural networks can use evidence to prune unnecessary weights:
In each example, the 'right' model complexity emerges without manual tuning. No cross-validation folds to choose, no regularization hyperparameter to set. The mathematics of probability directly implements the intuition that unnecessary complexity should be penalized. This is why Bayesian methods are particularly attractive in scientific applications where interpretability matters.
The Bayesian Occam's razor has failure modes. Recognizing these is crucial for appropriate application.
Occam's razor works by rewarding models that 'predicted' the data well from their prior. If priors are poorly calibrated:
Example: The true parameter is $\theta = 50$. Suppose Model A is structurally correct but its prior concentrates far from 50, while Model B is structurally wrong but happens to place prior mass near 50.
Model B might have higher evidence despite being structurally wrong, simply because its prior was accidentally better calibrated.
Solution: Use principled priors based on domain knowledge and conduct sensitivity analysis.
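A minimal sketch of this sensitivity (models and numbers are hypothetical): two Gaussian-mean models with identical structure and likelihood, differing only in where their priors sit. The one whose prior happens to land near the truth wins the evidence comparison.

```python
import numpy as np
from scipy.stats import multivariate_normal

np.random.seed(1)
sigma2 = 1.0
theta_true = 50.0                     # hypothetical true parameter
y = theta_true + np.sqrt(sigma2) * np.random.randn(30)
n = len(y)

def log_evidence(prior_mean, prior_var):
    """Exact evidence for y_i ~ N(theta, sigma2) with theta ~ N(prior_mean, prior_var)."""
    cov = sigma2 * np.eye(n) + prior_var * np.ones((n, n))
    return multivariate_normal.logpdf(y, mean=prior_mean * np.ones(n), cov=cov)

# Model A: prior badly placed (centred at 0, far from 50)
ev_A = log_evidence(prior_mean=0.0, prior_var=100.0)
# Model B: same structure, prior accidentally centred near the truth
ev_B = log_evidence(prior_mean=45.0, prior_var=100.0)

print(f"log evidence A: {ev_A:.2f}, B: {ev_B:.2f}")
# B wins purely because of where its prior mass happened to sit.
```

Rerunning with different prior centres shows how strongly the comparison can swing on prior placement alone, which is exactly why sensitivity analysis matters.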
Occam's razor prefers models that are 'simple' in the sense of making specific predictions—not necessarily simple in terms of interpretability or parameter count.
Example: A model with one parameter but a very diffuse prior is 'complex' in the Bayesian sense because it predicts many data patterns. A model with many parameters but strong priors might be 'simpler' if it predicts specific patterns.
Bayesian 'simplicity' is about predictive commitment, not about parsimony of representation.
Use Occam's razor as a guide, not a law. When evidence strongly favors a simple model, that's informative. When evidence slightly favors a complex model, consider whether the complexity is scientifically meaningful. Always triangulate with cross-validation, predictive checks, and domain expertise.
How does the Bayesian Occam's razor compare to other methods of controlling complexity?
$$\text{AIC} = -2\log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + 2d$$
$$\text{BIC} = -2\log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + d\log n$$
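These two criteria are cheap to compute from the maximized log-likelihood. The sketch below applies them to the illustrative log-likelihood values from the polynomial table above (numbers hypothetical, n = 100 assumed):

```python
import numpy as np

def aic_bic(log_lik_at_mle, d, n):
    """AIC and BIC from the maximized log-likelihood (lower is better)."""
    aic = -2 * log_lik_at_mle + 2 * d
    bic = -2 * log_lik_at_mle + d * np.log(n)
    return aic, bic

# Illustrative values echoing the polynomial comparison above (hypothetical)
n = 100
for name, log_lik, d in [("Linear", -150, 2), ("Quadratic", -140, 3),
                         ("Degree 5", -138, 6), ("Degree 10", -136, 11)]:
    aic, bic = aic_bic(log_lik, d, n)
    print(f"{name:<10} AIC = {aic:6.1f}  BIC = {bic:6.1f}")
```

With these numbers both criteria, like the evidence, select the quadratic — though BIC's log(n) penalty punishes the high-degree models considerably harder than AIC's constant 2d.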
| Method | Complexity Handling | Requires Prior? | Computational Cost | Properties |
|---|---|---|---|---|
| Marginal Likelihood | Automatic via integration | Yes (strong dependence) | High (intractable integrals) | Principled, consistent, prior-sensitive |
| BIC | d·log(n) penalty | No | Low (just MLE + counting) | Consistent, rough approximation |
| AIC | 2d penalty | No | Low (just MLE + counting) | Prediction-focused, liberal |
| Cross-Validation | Implicit via overfitting | No | Medium (multiple fits) | Robust, variance issues |
| WAIC/LOO-IC | Estimated effective params | Somewhat (uses posterior) | Medium-High | Prediction-focused, Bayesian |
Use marginal likelihood when priors are meaningful and you want to compare model structures. Use BIC as a quick approximation when priors are vague. Use AIC when focused on prediction and you expect the true model is complex. Use cross-validation when you distrust parametric assumptions or want to evaluate prediction directly.
Let's consolidate what we've learned about the Bayesian Occam's razor:
Looking ahead: We've now thoroughly understood how to compare models using marginal likelihood, Bayes factors, and Occam's razor. But model comparison isn't the end of the story. Often, no single model is clearly best, and we want to combine the strengths of multiple models. In the next page, we'll explore Bayesian Model Averaging—a principled approach to making predictions by weighting models according to their posterior probabilities.
Model averaging addresses the uncomfortable reality that model uncertainty is real. Rather than committing to a single 'best' model, we can coherently combine all models, weighted by their evidence.
You now understand the Bayesian Occam's razor—why simpler models are preferred when complexity isn't justified, what 'simplicity' means in a probabilistic context, and when this automatic parsimony succeeds or fails. Next, we'll see how to move beyond model selection to model averaging, combining multiple models for robust predictions.