In machine learning, we constantly face a fundamental question: Which model should we use? Given a dataset, we might consider a linear model, a polynomial model, a neural network, or countless other possibilities. Each model makes different assumptions about the underlying data-generating process, and each has different capacity to fit complex patterns.
The classical approach to model selection—choosing based on training error—is deeply flawed. A model with more parameters can always achieve lower training error, but this often leads to overfitting. Cross-validation offers some remedy, but it requires holding out data and doesn't provide a principled probabilistic framework.
Bayesian inference offers an elegant solution: the marginal likelihood. Also called the model evidence, the marginal likelihood quantifies how well a model explains the observed data while automatically penalizing model complexity. It's the foundation upon which all Bayesian model comparison rests.
By the end of this page, you will understand: (1) The precise mathematical definition of marginal likelihood and how it differs from the ordinary likelihood, (2) Why marginal likelihood automatically balances data fit against model complexity, (3) The computational challenges in calculating marginal likelihoods, (4) How to compute marginal likelihoods analytically when possible, and (5) Approximation methods for intractable cases.
To understand marginal likelihood, we must first revisit the structure of Bayesian inference. Given a model $\mathcal{M}$ with parameters $\boldsymbol{\theta}$ and observed data $\mathcal{D}$, Bayes' theorem gives us the posterior distribution over parameters:
$$p(\boldsymbol{\theta} | \mathcal{D}, \mathcal{M}) = \frac{p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} | \mathcal{M})}{p(\mathcal{D} | \mathcal{M})}$$
The denominator $p(\mathcal{D} | \mathcal{M})$ is the marginal likelihood. It's called 'marginal' because it's obtained by marginalizing (integrating) over all possible parameter values:
$$p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}) \cdot p(\boldsymbol{\theta} | \mathcal{M}) \, d\boldsymbol{\theta}$$
The marginal likelihood goes by many names: model evidence, integrated likelihood, or simply evidence. In the context of model comparison, 'evidence' is particularly apt—it represents the total evidential support the data provides for the model as a whole, not for any particular parameter setting.
Let's parse this integral carefully:
$p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M})$: The likelihood function. Given specific parameter values $\boldsymbol{\theta}$, how probable is the observed data?
$p(\boldsymbol{\theta} | \mathcal{M})$: The prior distribution. Before seeing data, how plausible are different parameter values under our model assumptions?
The integral: We average the likelihood over all possible parameter values, weighted by their prior probability.
The key insight: The marginal likelihood doesn't evaluate the model at any single point in parameter space. Instead, it asks: 'Considering all possible parameter values this model could have, weighted by their prior probability, how well does this model explain the data on average?'
| Aspect | Maximum Likelihood | Marginal Likelihood |
|---|---|---|
| Formula | $\max_{\boldsymbol{\theta}} p(\mathcal{D} \mid \boldsymbol{\theta})$ | $\int p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}$ |
| What it measures | Best-case fit (optimized parameters) | Average fit (all parameters weighted by prior) |
| Dependency on parameter count | Always improves with more parameters | May decrease with unnecessary parameters |
| Overfitting tendency | Highly prone to overfitting | Inherently regularized |
| Computational cost | Optimization (relatively fast) | Integration (often intractable) |
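To make the contrast concrete, here is a minimal sketch on made-up data (11 heads in 20 coin tosses), comparing a fixed fair-coin model against a biased-coin model with a free parameter $\theta$ and a uniform Beta(1, 1) prior. The maximized likelihood always prefers the model with the extra parameter; the marginal likelihood need not.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, comb

# Hypothetical coin-flip data: 11 heads in 20 tosses (roughly fair-looking)
n, k = 20, 11

# Model A: fair coin, no free parameters -> its likelihood is also its evidence
log_lik_fair = stats.binom.logpmf(k, n, 0.5)

# Model B: biased coin with theta free and a uniform Beta(1, 1) prior
# Maximum likelihood: plug in the MLE theta_hat = k / n
log_lik_biased_max = stats.binom.logpmf(k, n, k / n)

# Marginal likelihood: integrate over theta analytically (Beta-Binomial)
log_ml_biased = np.log(comb(n, k)) + betaln(1 + k, 1 + n - k) - betaln(1, 1)

print(f"Fair coin    log-likelihood / evidence: {log_lik_fair:.3f}")
print(f"Biased coin  max log-likelihood:        {log_lik_biased_max:.3f}")
print(f"Biased coin  log marginal likelihood:   {log_ml_biased:.3f}")
# Maximizing over theta always prefers the extra parameter; the marginal
# likelihood can still favor the simpler fair-coin model for fair-looking data.
```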
The marginal likelihood has a beautiful geometric interpretation that illuminates why it naturally penalizes complexity.
Imagine the parameter space as a landscape. The height at each point $\boldsymbol{\theta}$ represents the likelihood $p(\mathcal{D} | \boldsymbol{\theta})$—how well those particular parameters explain the data. The prior $p(\boldsymbol{\theta})$ acts as a 'sampling weight' over this landscape.
The marginal likelihood is the volume under this weighted likelihood surface. It's the integral of the likelihood, weighted by the prior, over the entire parameter space.
Now consider two models:
- Model A: Simple model (few parameters)
- Model B: Complex model (many parameters)
In high-dimensional parameter spaces, most of the volume is in the 'corners' far from any good-fitting region. A complex model must spread its prior probability mass across this vast space, leaving little mass near the parameters that actually explain the data well. This geometric fact is why marginal likelihood penalizes unnecessary complexity.
An analogy from probability: Consider a Gaussian distribution. A narrow Gaussian (small variance) concentrates its probability mass in a small region, achieving high density there. A wide Gaussian (large variance) spreads its mass thin, achieving low density everywhere. The marginal likelihood follows similar logic—a simpler model that 'commits' to a smaller region of data space can achieve higher evidence if its predictions are correct.
The key principle: Marginal likelihood rewards models that make specific, accurate predictions. A complex model that could explain almost any data pattern gets penalized because it hasn't 'committed' to this particular pattern—its prior probability is spread across all possibilities.
The complexity-penalizing behavior of marginal likelihood can be made explicit through the concept of the Occam factor.
For models where the posterior is approximately Gaussian (which is common, by the Bernstein–von Mises theorem), we can decompose the log marginal likelihood as:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + \log \text{Occam Factor}$$
where $\hat{\boldsymbol{\theta}}$ is the maximum a posteriori (MAP) estimate.
The Occam factor captures the ratio of posterior volume to prior volume in parameter space:
$$\text{Occam Factor} = \frac{\text{Posterior Volume}}{\text{Prior Volume}} = \frac{\Delta\boldsymbol{\theta}_{\text{posterior}}}{\Delta\boldsymbol{\theta}_{\text{prior}}}$$
Since data constrains parameters, the posterior is always more concentrated than the prior. Thus the Occam factor is always less than 1, contributing a negative term to the log marginal likelihood.
For a d-dimensional Gaussian posterior approximation, the Occam factor can be computed explicitly:
$$\log \text{Occam Factor} \approx \log p(\hat{\boldsymbol{\theta}} | \mathcal{M}) + \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\boldsymbol{\Sigma}_{\text{post}}|$$
where $\boldsymbol{\Sigma}_{\text{post}}$ is the posterior covariance matrix and $p(\hat{\boldsymbol{\theta}} | \mathcal{M})$ is the prior density at the MAP estimate. The Gaussian term $(2\pi)^{d/2}|\boldsymbol{\Sigma}_{\text{post}}|^{1/2}$ plays the role of the posterior volume, while the prior density at the mode is roughly one over the prior volume.
For a model with isotropic priors of width $\sigma_{\text{prior}}$ and isotropic posterior of width $\sigma_{\text{post}}$:
$$\log \text{Occam Factor} \approx d \log \frac{\sigma_{\text{post}}}{\sigma_{\text{prior}}}$$
This is negative (a penalty) because $\sigma_{\text{post}} < \sigma_{\text{prior}}$. The penalty grows linearly with the number of parameters $d$, but is modulated by how much each parameter is constrained by the data.
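As a quick numerical illustration of this formula (with hypothetical widths chosen purely for the example), the penalty per parameter is simply $\log(\sigma_{\text{post}}/\sigma_{\text{prior}})$:

```python
import numpy as np

# Hypothetical widths: the prior believes each parameter lies roughly in +/- 2,
# while the data pin it down to roughly +/- 0.1
sigma_prior = 2.0
sigma_post = 0.1

for d in [1, 5, 20]:  # number of parameters
    log_occam = d * np.log(sigma_post / sigma_prior)
    print(f"d = {d:2d}: log Occam factor ~ {log_occam:7.2f}")
# Each well-constrained parameter costs about log(sigma_post / sigma_prior)
# ~ -3 nats here; the total penalty scales linearly with d.
```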
The Occam factor provides automatic regularization without any hand-tuned hyperparameters. Unlike L1/L2 regularization, where you choose $\lambda$, or cross-validation, where you choose the number of folds $k$, the Bayesian framework derives the complexity penalty from first principles. The 'right' amount of regularization emerges from the prior and the data.
For certain model classes, the marginal likelihood can be computed exactly in closed form. These cases typically involve conjugate prior relationships, where the prior and likelihood combine to produce a posterior in the same family as the prior.
Consider linear regression with Gaussian noise: $$y = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$$
With a Gaussian prior on weights: $$\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I})$$
The marginal likelihood (assuming known $\sigma^2, \tau^2$) is:
$$p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2) = \mathcal{N}(\mathbf{y} | \mathbf{0}, \tau^2\mathbf{X}\mathbf{X}^\top + \sigma^2\mathbf{I})$$
This is a multivariate Gaussian with mean zero and covariance $\mathbf{K} = \tau^2\mathbf{X}\mathbf{X}^\top + \sigma^2\mathbf{I}$.
```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood_linear_regression(X, y, tau_sq, sigma_sq):
    """
    Compute log marginal likelihood for Bayesian linear regression.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Design matrix
    y : ndarray of shape (n_samples,)
        Target values
    tau_sq : float
        Prior variance on weights
    sigma_sq : float
        Noise variance

    Returns:
    --------
    log_ml : float
        Log marginal likelihood
    """
    n = len(y)

    # Covariance matrix: K = tau^2 * X @ X.T + sigma^2 * I
    K = tau_sq * X @ X.T + sigma_sq * np.eye(n)

    # Cholesky decomposition for numerical stability
    L, lower = cho_factor(K, lower=True)

    # Log determinant via Cholesky: log|K| = 2 * sum(log(diag(L)))
    log_det_K = 2 * np.sum(np.log(np.diag(L)))

    # Solve K^{-1} y via Cholesky
    alpha = cho_solve((L, lower), y)

    # Log marginal likelihood
    log_ml = -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))

    return log_ml

# Example: Compare models with different numbers of polynomial features
np.random.seed(42)
n = 50
x_raw = np.linspace(-1, 1, n)
y_true = 2 * x_raw + 0.5 * x_raw**2
y = y_true + 0.3 * np.random.randn(n)

# Model 1: Linear (degree 1)
X1 = np.column_stack([np.ones(n), x_raw])

# Model 2: Quadratic (degree 2)
X2 = np.column_stack([np.ones(n), x_raw, x_raw**2])

# Model 3: High-degree polynomial (degree 10) - overly complex
X3 = np.column_stack([x_raw**i for i in range(11)])

tau_sq = 1.0
sigma_sq = 0.1

for name, X in [("Linear (d=1)", X1), ("Quadratic (d=2)", X2), ("Degree-10", X3)]:
    log_ml = log_marginal_likelihood_linear_regression(X, y, tau_sq, sigma_sq)
    print(f"{name:20s}: log p(y|X,M) = {log_ml:.2f}")
```

For coin flipping with $k$ heads observed in $n$ tosses and a Beta$(\alpha, \beta)$ prior on the heads probability $\theta$:
The marginal likelihood is:
$$p(k | \alpha, \beta) = \binom{n}{k} \cdot \frac{B(\alpha + k, \beta + n - k)}{B(\alpha, \beta)}$$
where $B(\cdot, \cdot)$ is the Beta function.
This closed-form result arises from the conjugacy of the Beta prior with the Binomial likelihood.
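As a sanity check, the sketch below (assuming 22 heads in 30 tosses and a Beta(2, 2) prior, values chosen arbitrarily for illustration) evaluates the closed form and compares it against direct numerical integration of likelihood times prior:

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import betaln, comb

# Hypothetical data and prior
n, k = 30, 22            # 22 heads in 30 tosses
alpha, beta_ = 2.0, 2.0  # Beta(2, 2) prior on theta

# Closed form: p(k) = C(n, k) * B(alpha + k, beta + n - k) / B(alpha, beta)
log_ml_closed = (np.log(comb(n, k))
                 + betaln(alpha + k, beta_ + n - k)
                 - betaln(alpha, beta_))

# Numerical check: integrate likelihood * prior over theta in [0, 1]
integrand = lambda t: stats.binom.pmf(k, n, t) * stats.beta.pdf(t, alpha, beta_)
ml_numeric, _ = integrate.quad(integrand, 0.0, 1.0)

print(f"Closed form : {np.exp(log_ml_closed):.6e}")
print(f"Quadrature  : {ml_numeric:.6e}")
```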
Most practical models—neural networks, Gaussian processes with learned hyperparameters, complex graphical models—don't have closed-form marginal likelihoods. The integral is intractable, and we must resort to approximation methods covered later in this page.
When closed-form computation is impossible, we need approximation methods. Each has distinct strengths and is appropriate for different situations.
The Laplace approximation treats the posterior as approximately Gaussian, centered at its mode (MAP estimate). This gives:
$$\log p(\mathcal{D} | \mathcal{M}) \approx \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|$$
where $\hat{\boldsymbol{\theta}}$ is the MAP estimate and $\mathbf{H}$ is the Hessian of the negative log posterior at the mode.
Intuition: We approximate the integrand with a Gaussian bump centered at the MAP, then compute the volume under this Gaussian exactly.
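Here is a minimal sketch of that recipe on the coin-flip model (the same illustrative 22-heads-in-30-tosses data and Beta(2, 2) prior as above), using a bounded optimizer for the MAP and a finite-difference second derivative for the Hessian, then comparing against the exact Beta-Binomial evidence:

```python
import numpy as np
from scipy import optimize, stats
from scipy.special import betaln, comb

# Laplace approximation to the evidence for a Beta-Bernoulli coin model.
# Hypothetical data: k heads in n tosses, Beta(2, 2) prior on theta.
n, k = 30, 22
alpha, beta_ = 2.0, 2.0

def neg_log_joint(theta):
    """-log[ p(D | theta) * p(theta) ], the unnormalized negative log posterior."""
    return -(stats.binom.logpmf(k, n, theta) + stats.beta.logpdf(theta, alpha, beta_))

# 1. Find the MAP estimate
res = optimize.minimize_scalar(neg_log_joint, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map = res.x

# 2. Second derivative of the negative log posterior at the mode (finite differences)
eps = 1e-5
hessian = (neg_log_joint(theta_map + eps) - 2 * neg_log_joint(theta_map)
           + neg_log_joint(theta_map - eps)) / eps**2

# 3. Laplace: log p(D) ~ log p(D|MAP) + log p(MAP) + (d/2) log(2 pi) - (1/2) log|H|
d = 1
log_ml_laplace = -neg_log_joint(theta_map) + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(hessian)

# Exact Beta-Binomial evidence for comparison
log_ml_exact = np.log(comb(n, k)) + betaln(alpha + k, beta_ + n - k) - betaln(alpha, beta_)

print(f"Laplace approximation: {log_ml_laplace:.4f}")
print(f"Exact log evidence   : {log_ml_exact:.4f}")
```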
The Bayesian Information Criterion (BIC) is a rough approximation to the log marginal likelihood:
$$\text{BIC} = \log p(\mathcal{D} | \hat{\boldsymbol{\theta}}) - \frac{d}{2}\log n$$
where $d$ is the number of parameters and $n$ is the number of data points. (This is the log-evidence convention; the conventional BIC equals $-2$ times this quantity, so a smaller conventional BIC corresponds to a larger approximate evidence.)
Derivation sketch: BIC emerges from the Laplace approximation when we assume that $n$ is large, so the Hessian of the negative log posterior scales as $\mathbf{H} \approx n\mathbf{H}_1$ for a fixed per-observation information matrix $\mathbf{H}_1$, and we drop all terms that do not grow with $n$ (the prior density at the mode, $\frac{d}{2}\log 2\pi$, and $\frac{1}{2}\log|\mathbf{H}_1|$).
Interpretation: The $\frac{d}{2}\log n$ term is the complexity penalty. It grows with the number of parameters but is modulated by sample size—more data justifies more parameters.
AIC (Akaike Information Criterion) penalizes by $d$ rather than $\frac{d}{2}\log n$. AIC aims to minimize prediction error on new data, while BIC approximates Bayesian model evidence. For large $n$, BIC penalizes complexity more heavily, favoring simpler models. Neither is 'correct'—they optimize different objectives.
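The sketch below (assuming a known noise variance and a toy polynomial-regression setup like the one used earlier, with arbitrary example values) computes these BIC- and AIC-style scores for least-squares fits of increasing degree:

```python
import numpy as np

# BIC- and AIC-style scores (log-likelihood minus a penalty, in the convention
# above) for ordinary least-squares polynomial fits of increasing degree.
rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1, 1, n)
sigma = 0.3  # assumed known noise standard deviation
y = 2 * x + 0.5 * x**2 + sigma * rng.normal(size=n)

for degree in [1, 2, 10]:
    X = np.column_stack([x**i for i in range(degree + 1)])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # Gaussian log-likelihood at the MLE of the coefficients
    log_lik = -0.5 * (n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / sigma**2)
    d = degree + 1
    bic_score = log_lik - 0.5 * d * np.log(n)  # higher is better in this convention
    aic_score = log_lik - d
    print(f"degree {degree:2d}: log-lik = {log_lik:7.2f}, "
          f"BIC-style = {bic_score:7.2f}, AIC-style = {aic_score:7.2f}")
```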
For arbitrary posteriors, we can estimate the marginal likelihood using Monte Carlo sampling:
Simple Monte Carlo (often poor): $$p(\mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^{S} p(\mathcal{D} | \boldsymbol{\theta}^{(s)}), \quad \boldsymbol{\theta}^{(s)} \sim p(\boldsymbol{\theta})$$
This samples from the prior and averages likelihoods. It's unbiased but has extremely high variance—prior samples rarely land near high-likelihood regions.
Importance Sampling (better): $$p(\mathcal{D}) \approx \frac{1}{S}\sum_{s=1}^{S} \frac{p(\mathcal{D} | \boldsymbol{\theta}^{(s)}) p(\boldsymbol{\theta}^{(s)})}{q(\boldsymbol{\theta}^{(s)})}, \quad \boldsymbol{\theta}^{(s)} \sim q(\boldsymbol{\theta})$$
We sample from a proposal distribution $q$ concentrated near high-posterior regions.
Bridge Sampling, Path Sampling, Thermodynamic Integration: Advanced methods that construct a 'bridge' between prior and posterior, often via a sequence of intermediate distributions.
```python
import numpy as np
from scipy import stats

def importance_sampling_marginal_likelihood(log_likelihood_fn, prior, proposal,
                                             n_samples=10000, seed=42):
    """
    Estimate log marginal likelihood using importance sampling.

    Parameters:
    -----------
    log_likelihood_fn : callable
        Function that takes theta and returns log p(D|theta)
    prior : scipy.stats distribution
        Prior distribution p(theta)
    proposal : scipy.stats distribution
        Proposal distribution q(theta), should approximate posterior
    n_samples : int
        Number of importance samples

    Returns:
    --------
    log_ml : float
        Estimated log marginal likelihood
    log_ml_se : float
        Standard error of the estimate
    effective_sample_size : float
        Effective number of importance samples (Kish's formula)
    """
    np.random.seed(seed)

    # Sample from proposal
    theta_samples = proposal.rvs(n_samples)

    # Compute importance weights in log space
    log_likes = np.array([log_likelihood_fn(theta) for theta in theta_samples])
    log_priors = prior.logpdf(theta_samples)
    log_proposals = proposal.logpdf(theta_samples)

    # Log importance weights: log(p(D|theta) * p(theta) / q(theta))
    log_weights = log_likes + log_priors - log_proposals

    # Log-sum-exp trick for numerical stability
    max_log_w = np.max(log_weights)
    log_ml = max_log_w + np.log(np.mean(np.exp(log_weights - max_log_w)))

    # Effective sample size (Kish) and a rough standard-error estimate
    normalized_weights = np.exp(log_weights - log_ml)
    effective_sample_size = 1 / np.sum(normalized_weights**2 / n_samples**2)
    log_ml_se = np.std(log_weights) / np.sqrt(effective_sample_size)

    return log_ml, log_ml_se, effective_sample_size

# Example: Estimating marginal likelihood for Normal mean estimation
np.random.seed(0)  # reproducible synthetic data
true_mu = 2.0
sigma = 1.0
n = 20
data = np.random.normal(true_mu, sigma, n)

# Log likelihood function
def log_likelihood(mu):
    return np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# Prior: Normal(0, 10)
prior = stats.norm(loc=0, scale=10)

# Proposal: Approximate posterior (Normal centered at MLE, slightly widened)
mle = np.mean(data)
posterior_std = sigma / np.sqrt(n)
proposal = stats.norm(loc=mle, scale=posterior_std * 1.5)  # slightly wider

log_ml, se, ess = importance_sampling_marginal_likelihood(
    log_likelihood, prior, proposal, n_samples=50000)
print(f"Estimated log marginal likelihood: {log_ml:.4f} ± {se:.4f}")
print(f"Effective sample size: {ess:.1f}")
```

A critical but often overlooked aspect of marginal likelihood is its sensitivity to prior specification. Unlike posterior inference, where reasonable priors often lead to similar conclusions when data is abundant, the marginal likelihood can be highly sensitive to prior choices.
Consider two priors for a parameter $\theta$:
- Prior A: a narrow prior concentrated over a small, plausible range of values
- Prior B: a very diffuse prior spread over a huge range
If the true parameter is $\theta = 0.5$ and data strongly supports this:
For posterior inference: Both priors lead to posteriors concentrated near 0.5. The prior gets 'washed out' by data.
For marginal likelihood: Prior B spreads probability mass over a huge range. Since most of that range has low likelihood, the marginal likelihood is lower. Prior A 'predicted' the true parameter region better, so it gets higher evidence.
The implication: When comparing models via marginal likelihood, you're also comparing how well each prior 'predicted' what the data would show. Vague, diffuse priors get penalized.
With improper or very diffuse priors, Bayesian model comparison breaks down. Consider testing H₀: θ = 0 vs. H₁: θ ≠ 0. As the prior under H₁ is made ever more diffuse (say, uniform over a wider and wider interval), the marginal likelihood under H₁ tends to zero, so H₀ is favored regardless of the data—the Jeffreys–Lindley paradox. This shows that priors must be proper and carefully considered for model comparison.
1. Use informative priors when possible: If you have domain knowledge about reasonable parameter ranges, encode it. This isn't 'cheating'—it's using available information.
2. Conduct sensitivity analysis: Compute marginal likelihoods under several reasonable prior specifications. If conclusions are robust, you can be more confident.
3. Use proper priors: Improper priors (e.g., flat priors over infinite ranges) lead to undefined or zero marginal likelihoods. Always use proper probability distributions.
4. Consider objective Bayes approaches: Methods like reference priors or Jeffreys priors attempt to be 'least informative' while remaining proper, but have their own limitations.
5. Match prior scales across models: When comparing models, ensure that priors on shared parameters have compatible scales. A model isn't 'worse' just because its prior was poorly calibrated.
| Prior Variance | Prior Predictive Range | Marginal Likelihood (relative) | Interpretation |
|---|---|---|---|
| σ² = 0.1 | θ ∈ [-0.6, 0.6] (95%) | 1.00 | Prior matches data scale well |
| σ² = 1.0 | θ ∈ [-2, 2] (95%) | 0.42 | Prior somewhat diffuse |
| σ² = 10 | θ ∈ [-6.3, 6.3] (95%) | 0.13 | Prior too vague—penalized |
| σ² = 100 | θ ∈ [-20, 20] (95%) | 0.04 | Prior extremely diffuse—heavy penalty |
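The pattern in this table can be reproduced with a conjugate Normal model, where the evidence is available in closed form as a multivariate Gaussian density. The sketch below uses made-up data whose sample mean is fixed at 0.5 for illustration:

```python
import numpy as np
from scipy import stats

# Evidence as a function of prior variance for a conjugate Normal model:
# y_i ~ N(theta, sigma^2) with sigma known, prior theta ~ N(0, tau^2).
# Marginally, y ~ N(0, sigma^2 * I + tau^2 * 11^T), a density we can evaluate exactly.
rng = np.random.default_rng(1)
n, sigma, true_theta = 25, 1.0, 0.5
noise = rng.normal(size=n)
noise -= noise.mean()               # center so the sample mean is exactly true_theta
y = true_theta + sigma * noise

ones = np.ones((n, n))
for tau_sq in [0.1, 1.0, 10.0, 100.0]:
    cov = sigma**2 * np.eye(n) + tau_sq * ones
    log_ml = stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)
    print(f"prior variance tau^2 = {tau_sq:6.1f}: log evidence = {log_ml:.3f}")
# The fit to the data barely changes, but the evidence drops steadily as the
# prior becomes more diffuse -- the pattern illustrated in the table above.
```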
Having understood the theory, let's address practical challenges in using marginal likelihood for model selection.
Marginal likelihoods span enormous ranges. For typical datasets, log marginal likelihoods can be -1000 or lower. Always work in log space:
✗ BAD: ml_ratio = ml_model1 / ml_model2 # Underflow/overflow
✓ GOOD: log_ml_diff = log_ml_model1 - log_ml_model2
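For example, posterior model probabilities can be computed from log evidences entirely in log space with the log-sum-exp trick (the numbers below are hypothetical):

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical log marginal likelihoods for three candidate models.
# exp(-1042.7) underflows to 0.0 in double precision, so direct ratios fail.
log_ml = np.array([-1042.7, -1038.2, -1045.9])

# Posterior model probabilities under equal prior model weights,
# computed entirely in log space to avoid underflow
log_post = log_ml - logsumexp(log_ml)
post_probs = np.exp(log_post)

for i, (lm, p) in enumerate(zip(log_ml, post_probs), start=1):
    print(f"Model {i}: log evidence = {lm:8.1f}, posterior probability = {p:.4f}")
```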
When comparing nested models (e.g., linear vs. polynomial regression), marginal likelihood automatically handles the 'nesting' through the Occam factor. You don't need to account for it separately—it emerges from the integral.
For comparing non-nested models, marginal likelihood excels. Unlike likelihood ratio tests (which require nested models), marginal likelihood compares any two models on equal footing. A Gaussian mixture vs. a kernel density estimator? Both get evaluated on how well they explain the data, weighted by their prior commitments.
If your goal is pure prediction accuracy on new data, cross-validation may be more appropriate. Marginal likelihood optimizes for 'posterior model probability'—the probability that this model generated the data—which isn't always aligned with predictive performance, especially when all models are misspecified.
Let's consolidate what we've learned about marginal likelihood:
1. It is the likelihood averaged over all parameter values, weighted by the prior—not the likelihood at any single best-fit point.
2. It automatically balances data fit against model complexity through the Occam factor.
3. It can be computed in closed form for conjugate models; otherwise it must be approximated (Laplace approximation, BIC, Monte Carlo methods).
4. It is sensitive to the prior in a way that posterior inference is not, so priors must be proper and chosen thoughtfully for model comparison.
Looking ahead: Marginal likelihood is rarely used in isolation. In the next page, we'll learn how to combine marginal likelihoods from different models to compute Bayes factors—the ratio of evidences that quantifies how much more one model is supported by the data compared to another.
Bayes factors provide a principled, calibrated scale for interpreting the strength of evidence, moving beyond mere p-values to a fully probabilistic comparison framework.
You now understand marginal likelihood—the foundation of Bayesian model comparison. You know what it measures, why it penalizes complexity, how to compute or approximate it, and its limitations. Next, we'll see how to interpret marginal likelihood ratios as Bayes factors for principled model selection.