In the previous page, we developed marginal likelihood—the probability of observing our data under a model, integrated over all possible parameter values. While marginal likelihood quantifies the absolute evidence for a single model, scientific inference rarely asks 'How probable is this model?' in isolation. Instead, we ask: 'Which model better explains the data?' or 'How much more evidence supports Model A over Model B?'
The Bayes factor answers precisely this question. It is the ratio of marginal likelihoods between two competing models, providing a principled, calibrated measure of relative evidential support. Unlike p-values (which measure how surprising data would be under a null hypothesis), Bayes factors directly quantify how much one model is favored over another given the observed data.
By the end of this page, you will understand: (1) The formal definition of Bayes factors and their relationship to posterior model probabilities, (2) Standard interpretation scales for Bayes factors (Jeffreys, Kass-Raftery), (3) How Bayes factors relate to hypothesis testing, (4) Computational strategies for Bayes factor estimation, (5) Common pitfalls and limitations in applying Bayes factors.
Given two models $\mathcal{M}_1$ and $\mathcal{M}_2$, the Bayes factor in favor of $\mathcal{M}_1$ over $\mathcal{M}_2$ is defined as the ratio of their marginal likelihoods:
$$\text{BF}_{12} = \frac{p(\mathcal{D} | \mathcal{M}_1)}{p(\mathcal{D} | \mathcal{M}_2)}$$
Expanding the marginal likelihoods:
$$\text{BF}_{12} = \frac{\int p(\mathcal{D} | \boldsymbol{\theta}_1, \mathcal{M}_1) p(\boldsymbol{\theta}_1 | \mathcal{M}_1) d\boldsymbol{\theta}_1}{\int p(\mathcal{D} | \boldsymbol{\theta}_2, \mathcal{M}_2) p(\boldsymbol{\theta}_2 | \mathcal{M}_2) d\boldsymbol{\theta}_2}$$
Interpretation: $\text{BF}_{12} = 10$ means the data is 10 times more probable under Model 1 than under Model 2; the data shift the odds in favor of Model 1 by a factor of 10.
The subscript order matters: BF₁₂ is evidence for Model 1 over Model 2. Values > 1 favor Model 1; values < 1 favor Model 2. Some authors use BF₁₀ for 'alternative over null' and BF₀₁ for 'null over alternative.' Always check the convention in any paper you read.
Bayes factors are intimately connected to posterior probabilities over models via Bayes' theorem at the model level:
$$\frac{p(\mathcal{M}_1 | \mathcal{D})}{p(\mathcal{M}_2 | \mathcal{D})} = \frac{p(\mathcal{D} | \mathcal{M}_1)}{p(\mathcal{D} | \mathcal{M}_2)} \cdot \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}$$
$$\underbrace{\text{Posterior Odds}}_{\text{after seeing data}} = \underbrace{\text{Bayes Factor}}_{\text{evidence from data}} \times \underbrace{\text{Prior Odds}}_{\text{before seeing data}}$$
The Bayes factor is the multiplicative update from prior odds to posterior odds. It represents purely the evidence contributed by the data, independent of prior model preferences.
If we start with equal prior probabilities ($p(\mathcal{M}_1) = p(\mathcal{M}_2) = 0.5$):
$$\text{Posterior Odds} = \text{BF}_{12}$$
This allows conversion to posterior probabilities:
$$p(\mathcal{M}_1 | \mathcal{D}) = \frac{\text{BF}_{12}}{1 + \text{BF}_{12}}$$
| Bayes Factor (BF₁₂) | Posterior Odds | P(M₁|D) | P(M₂|D) |
|---|---|---|---|
| 1 | 1 : 1 | 50% | 50% |
| 3 | 3 : 1 | 75% | 25% |
| 10 | 10 : 1 | 91% | 9% |
| 30 | 30 : 1 | 97% | 3% |
| 100 | 100 : 1 | 99% | 1% |
| 1000 | 1000 : 1 | 99.9% | 0.1% |
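As a quick sketch in Python (the helper name `posterior_model_prob` is just illustrative), the conversion from Bayes factor to posterior model probability is a one-liner; the loop reproduces the table above under equal prior odds and also shows what a sceptical 10-to-1 prior against Model 1 would do.

```python
def posterior_model_prob(bf12, prior_odds=1.0):
    """Convert a Bayes factor into P(M1 | D): posterior odds = BF12 * prior odds."""
    posterior_odds = bf12 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# Reproduce the table above (equal prior odds), then add a sceptical prior for contrast.
for bf in [1, 3, 10, 30, 100, 1000]:
    p_equal = posterior_model_prob(bf)                    # P(M1) = P(M2) = 0.5
    p_sceptic = posterior_model_prob(bf, prior_odds=0.1)  # prior 10-to-1 against M1
    print(f"BF12 = {bf:>5}:  P(M1|D) = {p_equal:.3f} (equal odds), {p_sceptic:.3f} (sceptical prior)")
```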
A Bayes factor of 5 means the data is 5 times more likely under one model than under the other. But is that 'strong' evidence? Researchers have proposed calibration scales to interpret Bayes factors consistently.
Harold Jeffreys, one of the founders of Bayesian statistics, proposed the first widely-used interpretation scale. He worked with $\log_{10}(\text{BF})$ for convenience:
| log₁₀(BF₁₂) | BF₁₂ | Interpretation |
|---|---|---|
| 0 to 0.5 | 1 to 3.2 | Not worth more than a bare mention |
| 0.5 to 1 | 3.2 to 10 | Substantial evidence |
| 1 to 1.5 | 10 to 32 | Strong evidence |
| 1.5 to 2 | 32 to 100 | Very strong evidence |
| > 2 | > 100 | Decisive evidence |
Robert Kass and Adrian Raftery proposed a modified scale using $2 \ln(\text{BF})$, which has the same scale as likelihood ratio test statistics and the deviance:
| 2 ln(BF₁₂) | BF₁₂ | Interpretation |
|---|---|---|
| 0 to 2 | 1 to 3 | Not worth more than a bare mention |
| 2 to 6 | 3 to 20 | Positive evidence |
| 6 to 10 | 20 to 150 | Strong evidence |
| > 10 | > 150 | Very strong evidence |
These scales are conventions, not theorems. A Bayes factor of 15 doesn't magically become 'strong' at an arbitrary threshold. Consider the scientific context: in a new field with weak prior information, BF = 10 might warrant caution; in a mature field with robust theory, BF = 10 might be decisive. Context matters.
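With that caveat in mind, a small helper (purely illustrative, not from any package) can report a Bayes factor on both numeric scales, log₁₀ and 2 ln, together with a Jeffreys-style verbal label; the thresholds below simply echo the two tables above.

```python
import numpy as np

def interpret_bf(bf12):
    """Report a Bayes factor on the Jeffreys (log10) and Kass-Raftery (2 ln) scales."""
    log10_bf = np.log10(bf12)
    two_ln_bf = 2 * np.log(bf12)

    # Jeffreys-style label, applied to whichever model the evidence favors
    magnitude = abs(log10_bf)
    if magnitude < 0.5:
        label = "Not worth more than a bare mention"
    elif magnitude < 1:
        label = "Substantial evidence"
    elif magnitude < 1.5:
        label = "Strong evidence"
    elif magnitude < 2:
        label = "Very strong evidence"
    else:
        label = "Decisive evidence"

    favored = "Model 1" if bf12 > 1 else "Model 2"
    return log10_bf, two_ln_bf, f"{label} for {favored}"

for bf in [1.5, 5, 15, 50, 200]:
    l10, tln, label = interpret_bf(bf)
    print(f"BF12 = {bf:>6.1f}  log10 = {l10:5.2f}  2ln = {tln:6.2f}  {label}")
```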
Bayes factors are symmetric under inversion:
$$\text{BF}_{21} = \frac{1}{\text{BF}_{12}}$$
This means the same scale applies in both directions: a Bayes factor of 1/10 carries the same evidential weight for Model 2 that a Bayes factor of 10 carries for Model 1.
This symmetry is philosophically important. Unlike p-values, which only measure evidence against a null hypothesis, Bayes factors measure evidence in both directions. A small Bayes factor IS evidence—evidence for the alternative model.
Understanding how Bayes factors differ from frequentist hypothesis testing illuminates their strengths and appropriate use cases.
A p-value answers: 'If the null hypothesis were true, how often would I see data at least as extreme as what I observed?'
A Bayes factor answers: 'How many times more probable is my observed data under Model 1 compared to Model 2?'
A famous example illustrating the difference: Consider testing whether a coin is fair ($\theta = 0.5$) vs. biased ($\theta \neq 0.5$).
With n = 100,000 flips and 50,500 heads, a frequentist test rejects fairness decisively (two-sided p ≈ 0.002), yet the Bayes factor, computed with a uniform prior on $\theta$ under the biased model, comes out slightly in favor of the fair coin. How can this be?
The frequentist test asks: 'Is 50,500/100,000 significantly different from 0.5?' With huge sample size, even tiny differences are 'significant.'
The Bayesian test compares two models: Model 1 asserts $\theta = 0.5$ exactly, while Model 2 spreads its prior belief across all values of $\theta$ in $(0, 1)$.
The observed proportion (0.505) is very close to 0.5. Model 1, having made a precise prediction that was nearly correct, gets rewarded. Model 2, having hedged its bets across all possibilities, gets penalized for not committing.
The philosophical lesson: Testing for 'any difference' is different from testing for 'meaningful difference.' Bayes factors respect this distinction; p-values do not.
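A minimal calculation makes the paradox concrete. It assumes a uniform Beta(1, 1) prior on $\theta$ under the biased model, so the exact numbers depend on that prior choice; conveniently, the binomial coefficient cancels in the ratio of marginal likelihoods.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln

n, k = 100_000, 50_500

# Frequentist: exact two-sided binomial test of theta = 0.5
p_value = stats.binomtest(k, n, 0.5).pvalue

# Bayesian: BF01 = p(D | theta = 0.5) / p(D | theta ~ Uniform(0, 1))
# The binomial coefficient cancels, leaving
#   log BF01 = n * log(0.5) - log Beta(k + 1, n - k + 1)
log_bf01 = n * np.log(0.5) - betaln(k + 1, n - k + 1)

print(f"p-value: {p_value:.4f}")           # ≈ 0.0016: 'highly significant'
print(f"BF01:    {np.exp(log_bf01):.2f}")  # ≈ 1.7: the data slightly favor the fair coin
```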
P-values are useful for quality control (detecting any departure from a standard) and as a first-pass filter. Bayes factors are better for scientific inference (weighing competing theories) and formal model comparison. Neither replaces thinking about the science—both are tools, not oracles.
Since Bayes factors are ratios of marginal likelihoods, all methods for computing marginal likelihoods apply here. But there are also specialized methods for Bayes factor computation.
If both marginal likelihoods can be computed (analytically or by approximation), the Bayes factor is simply their ratio:
$$\log \text{BF}_{12} = \log p(\mathcal{D} | \mathcal{M}_1) - \log p(\mathcal{D} | \mathcal{M}_2)$$
This straightforward approach works well when both models have tractable marginal likelihoods.
```python
import numpy as np
from scipy import stats
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood_bayesian_regression(X, y, prior_var, noise_var):
    """
    Compute log marginal likelihood for Bayesian linear regression.

    Assumes:
        y = X @ beta + epsilon
        beta ~ N(0, prior_var * I)
        epsilon ~ N(0, noise_var * I)
    """
    n = len(y)
    K = prior_var * X @ X.T + noise_var * np.eye(n)
    L, lower = cho_factor(K, lower=True)
    log_det = 2 * np.sum(np.log(np.diag(L)))
    alpha = cho_solve((L, lower), y)
    log_ml = -0.5 * (y @ alpha + log_det + n * np.log(2 * np.pi))
    return log_ml

def compute_bayes_factor_polynomial_comparison(x, y, degree1, degree2,
                                                prior_var=1.0, noise_var=0.1):
    """
    Compare two polynomial regression models via Bayes factor.

    Returns:
    --------
    log_bf : float
        Log Bayes factor in favor of degree1 over degree2
    interpretation : str
        Verbal interpretation of evidence strength
    """
    # Build design matrices
    X1 = np.column_stack([x**i for i in range(degree1 + 1)])
    X2 = np.column_stack([x**i for i in range(degree2 + 1)])

    # Compute log marginal likelihoods
    log_ml1 = log_marginal_likelihood_bayesian_regression(X1, y, prior_var, noise_var)
    log_ml2 = log_marginal_likelihood_bayesian_regression(X2, y, prior_var, noise_var)

    log_bf = log_ml1 - log_ml2
    bf = np.exp(log_bf)

    # Interpret according to Kass-Raftery scale
    two_ln_bf = 2 * log_bf
    if abs(two_ln_bf) < 2:
        strength = "Not worth more than a bare mention"
    elif abs(two_ln_bf) < 6:
        strength = "Positive evidence"
    elif abs(two_ln_bf) < 10:
        strength = "Strong evidence"
    else:
        strength = "Very strong evidence"

    favored = f"degree {degree1}" if log_bf > 0 else f"degree {degree2}"
    interpretation = f"{strength} for {favored}"

    return log_bf, bf, interpretation

# Example: Compare polynomial models on synthetic data
np.random.seed(42)
n = 100
x = np.linspace(-2, 2, n)

# True function: quadratic
y_true = 1 + 0.5*x + 0.8*x**2
y = y_true + 0.5 * np.random.randn(n)

print("Comparing polynomial regression models:\n")
print(f"{'Comparison':<25} {'log(BF)':<10} {'BF':<12} {'Interpretation'}")
print("-" * 80)

for d1, d2 in [(1, 2), (2, 3), (2, 5), (3, 10)]:
    log_bf, bf, interp = compute_bayes_factor_polynomial_comparison(
        x, y, d1, d2, prior_var=10.0, noise_var=0.25
    )
    print(f"Degree {d1} vs Degree {d2:<15} {log_bf:>8.2f} {bf:>10.2f} {interp}")
```

For nested models where one model is a special case of the other (e.g., testing $\theta = \theta_0$ vs. $\theta \neq \theta_0$), the Savage-Dickey density ratio provides an elegant shortcut:
$$\text{BF}_{01} = \frac{p(\theta = \theta_0 | \mathcal{D}, \mathcal{M}_1)}{p(\theta = \theta_0 | \mathcal{M}_1)}$$
Interpretation: The Bayes factor equals the posterior density at the null value divided by the prior density at the null value.
Advantage: You only need to fit the more complex model and evaluate densities at a point—no need to integrate.
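Here is a minimal sketch that revisits the coin example above, again assuming a uniform Beta(1, 1) prior under the unrestricted model. Because the prior is conjugate, the posterior is available in closed form, and the Savage-Dickey ratio reproduces the directly computed Bayes factor.

```python
import numpy as np
from scipy import stats

# Savage-Dickey for the coin example: H0: theta = 0.5 nested in
# M1: theta ~ Beta(1, 1) (uniform prior), binomial likelihood.
n, k = 100_000, 50_500
theta0 = 0.5

prior = stats.beta(1, 1)                  # prior under the full model
posterior = stats.beta(1 + k, 1 + n - k)  # conjugate posterior under the full model

bf01 = posterior.pdf(theta0) / prior.pdf(theta0)
print(f"Savage-Dickey BF01 ≈ {bf01:.2f}")  # matches the direct marginal-likelihood ratio (≈ 1.7)
```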
Bridge sampling is a powerful method for estimating Bayes factors, especially when models share parameters. The idea is to construct an 'optimal bridge' between two distributions.
Given samples from two posteriors, bridge sampling estimates:
$$\text{BF}_{12} \approx \frac{\frac{1}{n_2}\sum_{j} q_1(\boldsymbol{\theta}^{(2)}_j)\, h(\boldsymbol{\theta}^{(2)}_j)}{\frac{1}{n_1}\sum_{i} q_2(\boldsymbol{\theta}^{(1)}_i)\, h(\boldsymbol{\theta}^{(1)}_i)}$$
where $q_k(\boldsymbol{\theta}) = p(\mathcal{D} | \boldsymbol{\theta}, \mathcal{M}_k)\, p(\boldsymbol{\theta} | \mathcal{M}_k)$ is the unnormalized posterior of model $k$, the $\boldsymbol{\theta}^{(k)}_1, \dots, \boldsymbol{\theta}^{(k)}_{n_k}$ are draws from model $k$'s posterior, and $h(\cdot)$ is an optimally chosen 'bridge function.'
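The sketch below illustrates the identity on a toy conjugate model where the exact answer is known. It uses the simple geometric bridge $h = 1/\sqrt{q_1 q_2}$ rather than the iteratively optimized bridge used by production implementations, and the single data point, priors, and sample sizes are all invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = 1.5  # single observation

def log_q(theta, prior_sd):
    """Unnormalized log posterior: likelihood N(d; theta, 1) times prior N(theta; 0, prior_sd^2)."""
    return stats.norm.logpdf(d, theta, 1) + stats.norm.logpdf(theta, 0, prior_sd)

def posterior_samples(prior_sd, size):
    """Exact Gaussian posterior for this conjugate toy model."""
    var = prior_sd**2 / (prior_sd**2 + 1)
    return rng.normal(d * var, np.sqrt(var), size)

sd1, sd2 = 1.0, 10.0
th1 = posterior_samples(sd1, 50_000)  # draws from the posterior under M1
th2 = posterior_samples(sd2, 50_000)  # draws from the posterior under M2

# Geometric bridge h = 1/sqrt(q1*q2):
#   BF12 = E_{post2}[sqrt(q1/q2)] / E_{post1}[sqrt(q2/q1)]
num = np.mean(np.exp(0.5 * (log_q(th2, sd1) - log_q(th2, sd2))))
den = np.mean(np.exp(0.5 * (log_q(th1, sd2) - log_q(th1, sd1))))
bf12_bridge = num / den

# Exact answer for comparison: each marginal likelihood is N(d; 0, sqrt(1 + prior_var))
bf12_exact = stats.norm.pdf(d, 0, np.sqrt(1 + sd1**2)) / stats.norm.pdf(d, 0, np.sqrt(1 + sd2**2))

print(f"Bridge estimate: {bf12_bridge:.3f}")
print(f"Exact BF12:      {bf12_exact:.3f}")
```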
Key properties: bridge sampling works directly from posterior draws that MCMC already provides, it is generally far more stable than naive importance-sampling or harmonic-mean estimators, and a well-tested implementation is available in the `bridgesampling` package in R.

For most applications, start with BIC as a rough approximation. If BIC suggests models are close (within 2-6 units), invest in more accurate methods like Laplace approximation or bridge sampling. Reserve computationally expensive methods like thermodynamic integration for high-stakes decisions.
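As a sketch of the BIC shortcut, recall that $\log \text{BF}_{12} \approx -(\text{BIC}_1 - \text{BIC}_2)/2$; the BIC values below are hypothetical stand-ins for two fitted models.

```python
import numpy as np

def approx_log_bf_from_bic(bic1, bic2):
    """Rough Bayes factor approximation: log BF12 ≈ -(BIC1 - BIC2) / 2."""
    return -(bic1 - bic2) / 2.0

# Hypothetical BIC values for two fitted models
log_bf12 = approx_log_bf_from_bic(bic1=1012.4, bic2=1019.8)
print(f"Approximate BF12 ≈ {np.exp(log_bf12):.1f}")  # exp(3.7) ≈ 40: strong evidence for model 1
```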
A particularly important application of Bayes factors is testing point null hypotheses: $H_0: \theta = \theta_0$ vs. $H_1: \theta \neq \theta_0$.
In frequentist testing, the null hypothesis $\theta = 0$ has probability zero in a continuous parameter space, yet we can still compute p-values (probability of data more extreme than observed, given null).
In Bayesian testing, we must assign actual probability to the null hypothesis. This is done using a mixture prior:
$$p(\theta) = \pi_0 \cdot \delta(\theta - \theta_0) + (1 - \pi_0) \cdot p_1(\theta)$$
where $\pi_0$ is the prior probability assigned to the null hypothesis, $\delta(\theta - \theta_0)$ is a point mass (Dirac delta) at $\theta_0$, and $p_1(\theta)$ is the prior distribution for $\theta$ under the alternative.
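A small sketch (the helper name is illustrative) shows how the pieces combine: the Bayes factor itself does not depend on $\pi_0$, but the posterior probability of the null does.

```python
def posterior_null_probability(bf01, pi0=0.5):
    """P(H0 | D) from the Bayes factor BF01 and the prior mass pi0 placed on the point null."""
    posterior_odds = bf01 * pi0 / (1 - pi0)  # posterior odds of H0 vs H1
    return posterior_odds / (1 + posterior_odds)

# Same Bayes factor (e.g., the coin example's BF01 ≈ 1.7), different prior mass on the null
for pi0 in [0.2, 0.5, 0.8]:
    print(f"pi0 = {pi0}:  P(H0 | D) = {posterior_null_probability(1.7, pi0):.3f}")
```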
For testing whether a mean differs from zero, Rouder et al. (2009) developed the JZS (Jeffreys-Zellner-Siow) prior, now widely used for Bayesian t-tests:
$$\text{Effect size } \delta \sim \text{Cauchy}(0, r)$$
where $r$ (often 0.707) controls the expected effect size.
Why Cauchy? Its heavy tails avoid ruling out large effects a priori, and Jeffreys argued that this choice keeps the test well behaved: as the evidence in the data grows without bound, so does the Bayes factor against the null.
The resulting Bayes factor compares $H_0: \delta = 0$ against $H_1: \delta \sim \text{Cauchy}(0, r)$.
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayesian_t_test(x, y=None, mu=0, r=0.707):
    """
    Compute Bayes factor for one-sample or two-sample t-test.
    Uses JZS prior (Cauchy on effect size).

    Parameters:
    -----------
    x : array-like
        First sample
    y : array-like or None
        Second sample (None for one-sample test)
    mu : float
        Null hypothesis value (for one-sample test)
    r : float
        Scale of Cauchy prior on effect size

    Returns:
    --------
    bf10 : float
        Bayes factor in favor of H1 (effect exists) over H0
    """
    x = np.asarray(x)

    if y is None:
        # One-sample test
        n = len(x)
        t_stat = (np.mean(x) - mu) / (np.std(x, ddof=1) / np.sqrt(n))
        df = n - 1
    else:
        # Two-sample test
        y = np.asarray(y)
        n1, n2 = len(x), len(y)
        n = n1 + n2
        # Pooled standard deviation
        var1, var2 = np.var(x, ddof=1), np.var(y, ddof=1)
        pooled_var = ((n1-1)*var1 + (n2-1)*var2) / (n1 + n2 - 2)
        se = np.sqrt(pooled_var * (1/n1 + 1/n2))
        t_stat = (np.mean(x) - np.mean(y)) / se
        df = n1 + n2 - 2

    # Compute Bayes factor using numerical integration
    # BF10 = integral over delta of: p(t|delta) * p(delta) / p(t|H0)
    def integrand(delta):
        # Non-central t distribution for effect size delta
        ncp = delta * np.sqrt(n) if y is None else delta * np.sqrt(n1*n2/(n1+n2))
        likelihood = stats.nct.pdf(t_stat, df, ncp)
        prior = stats.cauchy.pdf(delta, 0, r)
        return likelihood * prior

    # Likelihood under null
    null_likelihood = stats.t.pdf(t_stat, df)

    # Integrate over effect sizes
    integral, _ = quad(integrand, -np.inf, np.inf)

    bf10 = integral / null_likelihood
    return bf10

# Example: Testing treatment effect
np.random.seed(123)

# Control group
control = np.random.normal(100, 15, 30)

# Treatment groups with different effect sizes
print("Bayesian t-tests with JZS prior (r=0.707):\n")

for effect, name in [(0, "No effect"), (5, "Small effect"),
                     (10, "Medium effect"), (20, "Large effect")]:
    treatment = np.random.normal(100 + effect, 15, 30)
    bf10 = bayesian_t_test(treatment, control)

    # Interpret
    if bf10 > 100:
        interp = "Extreme evidence for H1"
    elif bf10 > 30:
        interp = "Very strong evidence for H1"
    elif bf10 > 10:
        interp = "Strong evidence for H1"
    elif bf10 > 3:
        interp = "Moderate evidence for H1"
    elif bf10 > 1:
        interp = "Anecdotal evidence for H1"
    elif bf10 > 1/3:
        interp = "Anecdotal evidence for H0"
    elif bf10 > 1/10:
        interp = "Moderate evidence for H0"
    else:
        interp = "Strong evidence for H0"

    print(f"{name:20s}: BF₁₀ = {bf10:>8.2f} ({interp})")
```

The choice of prior under H₁ (e.g., the scale r in JZS) affects the Bayes factor. A narrow prior expects small effects; a wide prior hedges across all effect sizes. There's no universally 'correct' choice—use domain knowledge or conduct sensitivity analysis.
Despite their advantages, Bayes factors have important limitations. Understanding these is essential for appropriate application.
We emphasized this for marginal likelihoods, and it's doubly important for Bayes factors. Since BF = ML₁/ML₂, prior misspecification in either model affects the comparison.
Example: Two researchers test the same hypothesis with the same data. Researcher A places a tightly concentrated prior on the effect size under the alternative and obtains a Bayes factor favoring the alternative; Researcher B uses a very diffuse prior, which dilutes the alternative's marginal likelihood, and obtains a Bayes factor favoring the null.
Who's right? Both—given their priors. The solution is transparency about prior choices and sensitivity analysis.
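A minimal sensitivity check might look like the sketch below; it reuses the `bayesian_t_test` function from the earlier code block and synthetic data invented here, recomputing BF₁₀ across a range of Cauchy scales $r$.

```python
import numpy as np

# Prior sensitivity analysis: same data, different Cauchy scales r for the
# JZS-style Bayes factor (assumes bayesian_t_test from the code block above).
np.random.seed(7)
control = np.random.normal(100, 15, 30)
treatment = np.random.normal(106, 15, 30)

for r in [0.2, 0.5, 0.707, 1.0, 1.5]:
    bf10 = bayesian_t_test(treatment, control, r=r)
    print(f"r = {r:>5.3f}  ->  BF10 = {bf10:6.2f}")
```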
A model can be handed an arbitrarily high Bayes factor simply by pitting it against an 'empty' alternative that makes no real predictions. Consider a model that says 'anything is possible,' with a uniform prior over vast ranges—its marginal likelihood is essentially the prior density, which can be made arbitrarily small, so the Bayes factor against it can be made arbitrarily large.
Lesson: Bayes factors compare specific, well-defined models. Comparing a precise model to a vague 'catch-all' alternative isn't informative.
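The sketch below makes this concrete on a toy setup (all numbers are invented for illustration): a fixed $\mathcal{N}(0, 1)$ model is compared with an alternative whose uniform prior on the mean is made progressively wider, and the Bayes factor in favor of the precise model grows roughly in proportion to the prior width.

```python
import numpy as np
from scipy import stats

# M0: x ~ N(0, 1) with no free parameters.
# MA: x ~ N(theta, 1) with theta ~ Uniform(-A, A), an ever-vaguer alternative.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=50)  # data generated under the precise model M0

log_ml0 = np.sum(stats.norm.logpdf(x, 0, 1))

def log_ml_vague(A):
    """Exact log marginal likelihood of MA for Gaussian data with known unit variance."""
    n, xbar = len(x), np.mean(x)
    # prod_i N(x_i; theta, 1) = exp(log_C) * exp(-n * (theta - xbar)^2 / 2)
    log_C = np.sum(stats.norm.logpdf(x, xbar, 1))
    # integral of exp(-n * (theta - xbar)^2 / 2) over [-A, A]
    log_integral = 0.5 * np.log(2 * np.pi / n) + np.log(
        stats.norm.cdf((A - xbar) * np.sqrt(n)) - stats.norm.cdf((-A - xbar) * np.sqrt(n))
    )
    return log_C + log_integral - np.log(2 * A)

for A in [1, 10, 100, 1000]:
    log_bf = log_ml0 - log_ml_vague(A)
    print(f"A = {A:>4}:  BF(M0 over MA) ≈ {np.exp(log_bf):10.1f}")
```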
When Bayes factors are problematic, consider alternatives such as cross-validation, information criteria, and posterior predictive checks.
Use Bayes factors as one tool among many. If Bayes factors, cross-validation, and posterior predictive checks all point to the same model, you can be confident. If they disagree, investigate why—the disagreement often reveals important aspects of your models or data.
Let's synthesize everything into a practical workflow for using Bayes factors in research.
When publishing results using Bayes factors:
Essential to report: the Bayes factor itself (with its direction and subscript convention), the priors placed on parameters within each model, and the prior model odds assumed.
Recommended to report: a sensitivity analysis over reasonable prior choices, the range of Bayes factors it produces, and how the marginal likelihoods were computed.
Example reporting: 'The data provided strong evidence for the interaction model over the main-effects model, BF₁₀ = 27.3 (assuming equal prior odds). Under the interaction model, we used a Normal(0, 10²) prior on coefficients. Sensitivity analysis with prior scales [5, 10, 20] yielded BF₁₀ ∈ [18.1, 35.6], consistently supporting the interaction model.'
Let's consolidate what we've learned about Bayes factors: they are ratios of marginal likelihoods that convert prior odds into posterior odds; conventional scales (Jeffreys, Kass-Raftery) help calibrate their interpretation; they can be computed directly, via the Savage-Dickey ratio for nested models, or by methods such as bridge sampling; and they are sensitive to the priors placed on parameters within each model, so sensitivity analysis and transparent reporting are essential.
Looking ahead: Bayes factors quantify relative evidence between two models, but the story doesn't end there. In the next page, we'll explore model evidence—understanding marginal likelihood as a measure of model quality that embodies a principled balance between fit and complexity, leading to the Bayesian interpretation of Occam's razor.
You now understand Bayes factors—the ratio of marginal likelihoods that quantifies relative model support. You know how to interpret them, compute them, and recognize their limitations. Next, we'll deepen our understanding of model evidence and the automatic Occam's razor embedded in Bayesian model comparison.