Imagine you're a medical researcher investigating a new diagnostic test for a rare disease. Before collecting any data, you already possess valuable knowledge: the disease affects approximately 1 in 10,000 people in the general population. This knowledge—prior to observing any test results—fundamentally shapes how you should interpret any evidence you subsequently gather.
This is the essence of prior beliefs in Bayesian inference: the formal mathematical encoding of what we know (or believe) about a phenomenon before observing new data. Unlike frequentist statistics, which treats parameters as fixed but unknown constants, Bayesian inference treats parameters as random variables with probability distributions that capture our uncertainty about their true values.
By the end of this page, you will understand the philosophical and mathematical foundations of prior distributions, be able to categorize priors by their informativeness, select appropriate priors for different problem contexts, and appreciate how prior choice influences posterior inference. This knowledge forms the bedrock upon which all subsequent Bayesian methods are built.
Before diving into mathematical formalism, we must understand why priors exist and what they represent. The concept of prior beliefs emerges from a fundamental philosophical position about the nature of probability itself.
Two Interpretations of Probability:
In the frequentist interpretation, probability represents the long-run frequency of events in repeated experiments. Under this view, parameters are fixed constants—there's nothing probabilistic about them. The only randomness comes from sampling variability.
In the Bayesian (subjectivist) interpretation, probability represents a degree of belief about uncertain propositions. Under this view, probability quantifies our state of knowledge, and parameters are random variables whose distributions encode our uncertainty.
This distinction is profound: Bayesians can assign probabilities to any uncertain proposition, including:

- "The probability that this coin's bias exceeds 0.5 is 0.7"
- "There is a 95% probability that the true treatment effect lies between 0.1 and 0.4"
- "The probability that this hypothesis is true is 0.9"
These statements are meaningless in frequentist probability because they concern fixed quantities, not random events. But in Bayesian probability, they represent coherent expressions of belief.
The Bayesian interpretation is often called 'subjectivist' because prior beliefs may differ between individuals. Two rational scientists with different background knowledge may specify different priors—and this is not a flaw but a feature. Bayesian inference provides a principled mechanism for updating these beliefs as evidence accumulates, with posteriors converging as data overwhelms prior differences.
Historical Context:
The Reverend Thomas Bayes (1701–1761) first formulated the theorem bearing his name in a posthumously published essay. However, the modern conception of subjective probability was developed extensively by Bruno de Finetti, Leonard Savage, and others in the 20th century.
De Finetti's famous theorem demonstrates that exchangeability (the assumption that the order of observations doesn't matter) implies the existence of a prior distribution over parameters. This provides a rigorous foundation for Bayesian inference that doesn't require accepting subjective probability as a philosophical primitive—it emerges naturally from more basic modeling assumptions.
The Coherence Argument:
Another powerful justification for Bayesian probability comes from Dutch book arguments. If your beliefs don't satisfy the axioms of probability (including Bayes' theorem for updating), then there exist sequences of bets that guarantee you lose money regardless of outcomes. Only probability-coherent beliefs avoid such exploitability.
This means Bayesian reasoning isn't merely one approach among many—it's the unique coherent approach to reasoning under uncertainty.
Having established the philosophical foundations, let's formalize prior beliefs mathematically. A prior distribution is a probability distribution over the parameter space that encodes our beliefs before observing data.
Notation and Basic Setup:
Let θ (theta) denote the unknown parameter(s) of our model. The prior distribution is denoted:
$$p(\theta)$$
This is a probability density function (for continuous parameters) or probability mass function (for discrete parameters) that satisfies the standard requirements:

- Non-negativity: $p(\theta) \geq 0$ for all $\theta$ in the parameter space
- Normalization: $\int p(\theta)\, d\theta = 1$ (or $\sum_{\theta} p(\theta) = 1$ in the discrete case)
The choice of $p(\theta)$ reflects our prior knowledge. More probability mass in a region means we believe values in that region are more plausible a priori.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Example: Prior distributions for a coin bias parameter θ ∈ [0, 1]
theta = np.linspace(0, 1, 1000)

# Uniform prior: Complete ignorance
uniform_prior = stats.uniform(0, 1).pdf(theta)

# Informative prior: Belief that coin is roughly fair
# Beta(10, 10) centered at 0.5 with moderate concentration
informative_prior = stats.beta(10, 10).pdf(theta)

# Expert prior: Strong belief coin is biased toward heads
# Beta(30, 10) centered at 0.75
expert_prior = stats.beta(30, 10).pdf(theta)

# Weakly informative prior: Slight preference for fairness
# Beta(2, 2) - gentle peak at 0.5
weakly_informative = stats.beta(2, 2).pdf(theta)

# Visualize different priors
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
priors = [
    (uniform_prior, "Uniform (Non-informative)", "Beta(1,1)"),
    (informative_prior, "Informative", "Beta(10,10)"),
    (expert_prior, "Expert Knowledge", "Beta(30,10)"),
    (weakly_informative, "Weakly Informative", "Beta(2,2)")
]

for ax, (prior, title, params) in zip(axes.flatten(), priors):
    ax.fill_between(theta, prior, alpha=0.3)
    ax.plot(theta, prior, linewidth=2)
    ax.set_xlabel("θ (coin bias)")
    ax.set_ylabel("p(θ)")
    ax.set_title(f"{title}\n{params}")
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("prior_distributions.png", dpi=150)
```

The Parameter Space:
The form of the prior depends critically on the parameter space—the set of all possible values the parameter can take:
| Parameter Type | Space | Common Prior Families |
|---|---|---|
| Probability | [0, 1] | Beta |
| Positive real | (0, ∞) | Gamma, Inverse-Gamma, Log-Normal |
| Real line | (-∞, ∞) | Normal, Cauchy, Student-t |
| Non-negative integer | {0, 1, 2, ...} | Poisson, Negative Binomial |
| Simplex (probabilities summing to 1) | Δ<sub>K-1</sub> | Dirichlet |
| Covariance matrix | Positive definite | Inverse-Wishart, LKJ |
The prior must have support matching the parameter space. Using a Normal prior for a probability parameter would be invalid because Normal distributions place mass on negative values and values greater than 1.
A common error is specifying priors whose support doesn't match the parameter space. If θ represents a variance (strictly positive), using a Normal prior centered at a small positive value will place some probability mass on negative values, creating invalid posterior samples. Always verify your prior's support matches your parameter's domain.
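As a quick check, you can compute how much mass a candidate prior assigns outside the parameter's domain. Below is a minimal SciPy sketch; the Normal(0.2, 0.3²) prior for a variance parameter is a deliberately misspecified, hypothetical choice used only for illustration.

```python
from scipy import stats

# Hypothetical misspecified prior: Normal(0.2, 0.3^2) for a variance parameter
bad_prior = stats.norm(loc=0.2, scale=0.3)

# Probability mass the prior assigns to impossible (negative) variance values
print(f"P(variance < 0) under Normal(0.2, 0.3^2): {bad_prior.cdf(0):.3f}")

# A prior whose support actually matches (0, ∞), e.g. a Half-Normal
good_prior = stats.halfnorm(scale=0.3)
print(f"P(variance < 0) under Half-Normal(0.3):   {good_prior.cdf(0):.3f}")
```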
Multivariate Priors:
When models have multiple parameters θ = (θ₁, θ₂, ..., θₖ), we specify a joint prior distribution:
$$p(\theta) = p(\theta_1, \theta_2, \ldots, \theta_k)$$
A common simplification is to assume prior independence:
$$p(\theta) = \prod_{i=1}^{k} p(\theta_i)$$
This factorization makes prior specification tractable but may not always be appropriate. Sometimes parameters have known relationships that should be encoded in the prior structure.
For example, in a hierarchical model where individual effects are drawn from a population distribution:

$$\theta_i \mid \mu, \sigma^2 \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, k, \qquad \mu \sim p(\mu), \quad \sigma^2 \sim p(\sigma^2)$$

Here, the θᵢ are conditionally independent given μ and σ², but marginally they are correlated through their shared hyperparameters.
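A short simulation makes this concrete. This is a minimal sketch with hypothetical hyperpriors—Normal(0, 1) for μ and half-Normal(1) for σ—chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws, k = 100_000, 2  # simulate pairs (theta_1, theta_2)

# Hypothetical hyperpriors (illustrative choices)
mu = rng.normal(0.0, 1.0, size=n_draws)              # mu ~ Normal(0, 1)
sigma = np.abs(rng.normal(0.0, 1.0, size=n_draws))   # sigma ~ Half-Normal(1)

# Conditionally independent individual effects given (mu, sigma)
theta = mu[:, None] + sigma[:, None] * rng.normal(size=(n_draws, k))

# Marginally, theta_1 and theta_2 are correlated through the shared hyperparameters
corr = np.corrcoef(theta[:, 0], theta[:, 1])[0, 1]
print(f"Marginal correlation of theta_1 and theta_2: {corr:.3f}")
```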
Priors can be classified along several dimensions. The most important classification concerns their informativeness—how much constraint they place on parameter values. Understanding this taxonomy is essential for appropriate prior selection.
Let's examine each category in depth.
Non-informative Priors:
The quest for 'objective' priors that represent complete ignorance has a long history. The most naive approach is the flat (uniform) prior:
$$p(\theta) \propto 1$$
This seems intuitive—all values are equally likely. However, flat priors have serious problems:
- Non-invariance under reparameterization: If $p(\theta) \propto 1$, then for $\phi = g(\theta)$, the prior on $\phi$ is NOT flat: $p(\phi) \propto |d\theta/d\phi|$
- Impropriety: On unbounded spaces, flat priors have infinite integral—they're not proper probability distributions
- Information paradox: A flat prior on [0, 1] for probability θ is actually informative—it implies specific beliefs about log-odds or other transformations
Jeffreys Prior:
Harold Jeffreys proposed a principled solution: the prior should be proportional to the square root of the Fisher Information:
$$p(\theta) \propto \sqrt{I(\theta)}$$
where $I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2}\log p(x|\theta)\right]$ is the Fisher Information.
Jeffreys priors are transformation-invariant: the same prior is obtained regardless of how the parameter is expressed. For a Bernoulli likelihood, Jeffreys prior is Beta(1/2, 1/2)—not uniform!
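For reference, the Bernoulli calculation behind this result is short:

$$\log p(x \mid \theta) = x\log\theta + (1-x)\log(1-\theta)
\quad\Rightarrow\quad
I(\theta) = \frac{E[x]}{\theta^2} + \frac{E[1-x]}{(1-\theta)^2} = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}$$

so $p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, which is exactly the kernel of Beta(1/2, 1/2).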
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Jeffreys prior for Bernoulli parameter
# I(θ) = 1/(θ(1-θ)), so p(θ) ∝ θ^(-1/2) * (1-θ)^(-1/2) = Beta(1/2, 1/2)

theta = np.linspace(0.001, 0.999, 1000)

# Compare flat vs Jeffreys prior
flat_prior = np.ones_like(theta)                   # Uniform
jeffreys_prior = stats.beta(0.5, 0.5).pdf(theta)   # Beta(1/2, 1/2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Prior comparison
ax1.plot(theta, flat_prior, 'b-', linewidth=2, label='Flat Prior (Beta(1,1))')
ax1.plot(theta, jeffreys_prior, 'r-', linewidth=2, label="Jeffreys Prior (Beta(1/2,1/2))")
ax1.fill_between(theta, jeffreys_prior, alpha=0.2, color='red')
ax1.set_xlabel("θ", fontsize=12)
ax1.set_ylabel("p(θ)", fontsize=12)
ax1.set_title("Non-informative Priors for Bernoulli Parameter")
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)

# Show transformation non-invariance of flat prior
# If θ ~ Uniform(0,1), what is the distribution of log-odds φ = log(θ/(1-θ))?
theta_samples = np.random.uniform(0, 1, 100000)
log_odds = np.log(theta_samples / (1 - theta_samples))

ax2.hist(log_odds, bins=100, density=True, alpha=0.7, color='blue')
ax2.set_xlabel("φ = log(θ/(1-θ))", fontsize=12)
ax2.set_ylabel("Implied p(φ)", fontsize=12)
ax2.set_title("Implied Prior on Log-Odds from Uniform θ")
ax2.grid(True, alpha=0.3)
ax2.set_xlim(-6, 6)
# The implied prior on φ is the standard logistic density, whose maximum is 1/4 at φ = 0
ax2.axhline(0.25, color='red', linestyle='--', label='Standard logistic density (max = 1/4)')
ax2.legend()

plt.tight_layout()
```

Weakly Informative Priors:
Modern Bayesian practice, particularly as advocated by Andrew Gelman and collaborators, emphasizes weakly informative priors as a practical default. These priors:

- Rule out clearly implausible parameter values (e.g., effect sizes of absurd magnitude)
- Spread substantial mass across the whole range of plausible values, so the data can dominate
- Provide mild regularization that stabilizes estimation, especially with small samples or weakly identified parameters
- Do not attempt to encode precise domain knowledge
For example, when modeling a regression coefficient β representing the effect of a standardized predictor on a standardized outcome:
$$\beta \sim \mathcal{N}(0, 1)$$
asserts that effects larger than about ±2 standard deviations are unlikely (roughly 95% prior probability between -2 and 2). This is weak enough to let data dominate with reasonable sample sizes, but provides regularization against overfitting.
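The regularizing effect is easy to see in a conjugate Normal–Normal update. The following is a small sketch with simulated data (the sample size, true effect, and known noise level are hypothetical choices), comparing the least-squares estimate with the posterior mean under a $\mathcal{N}(0, 1)$ prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: standardized predictor, true effect 0.3, known noise SD
n, true_beta, noise_sd = 20, 0.3, 1.0
x = rng.normal(size=n)
y = true_beta * x + rng.normal(scale=noise_sd, size=n)

# Maximum-likelihood (least-squares) estimate
beta_mle = (x @ y) / (x @ x)

# Conjugate posterior under beta ~ Normal(0, prior_var) with known noise_sd:
# posterior precision = x'x / sigma^2 + 1 / prior_var
# posterior mean      = (x'y / sigma^2) / posterior precision
prior_var = 1.0
precision_post = (x @ x) / noise_sd**2 + 1.0 / prior_var
beta_post_mean = ((x @ y) / noise_sd**2) / precision_post

print(f"MLE:            {beta_mle:.3f}")
print(f"Posterior mean: {beta_post_mean:.3f}  (shrunk toward 0 by the N(0,1) prior)")
```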
Informative Priors from Domain Knowledge:
When genuine prior knowledge exists, it should be used. Examples include:

- Results from previous studies or meta-analyses of similar interventions
- Physical or logical constraints (e.g., rates must be positive, proportions lie in [0, 1])
- Expert elicitation, in which subject-matter specialists quantify their beliefs about plausible ranges
- Known measurement properties of instruments (e.g., documented assay error distributions)
Encoding this knowledge as an informative prior is a strength of Bayesian inference, not a weakness. It formalizes the scientific process of building on prior work.
Before fitting a model, always examine the prior predictive distribution—data simulated from the prior × likelihood without conditioning on observed data. If the prior predicts outcomes that are nonsensical (e.g., negative heights, heights of 50 meters), your prior is poorly calibrated and should be revised.
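Here is a minimal sketch of such a prior predictive check, assuming a simple Normal model for adult heights in centimeters with hypothetical priors μ ~ Normal(170, 20) and σ ~ half-Normal(25); the specific numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_obs = 1000, 50

# Hypothetical priors for a model of adult heights (cm)
mu = rng.normal(170, 20, size=n_sims)              # mu ~ Normal(170, 20)
sigma = np.abs(rng.normal(0, 25, size=n_sims))     # sigma ~ Half-Normal(25)

# Prior predictive draws: simulate datasets from prior × likelihood
y_sim = rng.normal(mu[:, None], sigma[:, None], size=(n_sims, n_obs))

# Check whether the prior predicts absurd observations
print(f"Share of simulated heights below 0 cm:   {np.mean(y_sim < 0):.3%}")
print(f"Share of simulated heights above 250 cm: {np.mean(y_sim > 250):.3%}")
print(f"Prior predictive 1st-99th percentiles:   "
      f"[{np.percentile(y_sim, 1):.0f}, {np.percentile(y_sim, 99):.0f}] cm")
```

If a nontrivial share of simulated heights is negative or implausibly large, the priors should be tightened before any data are analyzed.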
Certain distribution families appear repeatedly in Bayesian modeling due to their mathematical convenience and flexibility. Mastering these families is essential for effective prior specification.
| Family | Support | Parameters | Use Case | Key Properties |
|---|---|---|---|---|
| Beta(α, β) | [0, 1] | α > 0, β > 0 (shape) | Probabilities, proportions | Conjugate to Binomial; mode at (α-1)/(α+β-2) for α, β > 1 |
| Normal(μ, σ²) | (-∞, ∞) | μ (mean), σ² (variance) | Unbounded continuous | Conjugate to itself; maximum entropy for given mean/variance |
| Gamma(α, β) | (0, ∞) | α (shape), β (rate) | Positive quantities, precisions | Conjugate to Poisson, Exponential |
| Inverse-Gamma(α, β) | (0, ∞) | α (shape), β (scale) | Variances | Conjugate to Normal variance |
| Dirichlet(α) | Simplex | α ∈ ℝ₊ᴷ (concentration) | Probability vectors | Conjugate to Multinomial |
| Half-Cauchy(s) | (0, ∞) | s (scale) | Scale parameters, standard deviations | Heavy-tailed; recommended for hierarchical SDs |
| Student-t(ν, μ, σ) | (-∞, ∞) | ν (df), μ (loc), σ (scale) | Robust alternatives to Normal | Heavy-tailed; ν=1 is Cauchy |
| LKJ(η) | Correlation matrices | η > 0 (shape) | Correlation structures | η=1 is uniform over correlations |
The Beta Distribution in Detail:
The Beta distribution is perhaps the most important prior family for beginners. It's the natural prior for parameters representing probabilities or proportions.
$$p(\theta | \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}$$
Interpreting α and β:

- α − 1 can be read as a count of "prior successes" and β − 1 as "prior failures" relative to a uniform starting point
- The prior mean is α/(α + β)
- The sum α + β acts as a prior "sample size": larger values give a more concentrated, more informative prior

Special cases:

- Beta(1, 1) is the Uniform distribution on [0, 1]
- Beta(1/2, 1/2) is the Jeffreys prior for a Bernoulli/Binomial likelihood
- As α = β grows large, the distribution concentrates tightly around 0.5
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

theta = np.linspace(0.001, 0.999, 500)

# Various Beta distributions demonstrating flexibility
beta_params = [
    (1, 1, "Uniform", "#3498db"),
    (0.5, 0.5, "Jeffreys", "#e74c3c"),
    (2, 2, "Weakly informative (symmetric)", "#2ecc71"),
    (5, 1, "Concentrated near 1", "#9b59b6"),
    (1, 5, "Concentrated near 0", "#f39c12"),
    (10, 10, "Informative (centered)", "#1abc9c"),
    (50, 50, "Highly informative", "#34495e"),
    (2, 5, "Asymmetric", "#e91e63"),
]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for ax, (a, b, name, color) in zip(axes, beta_params):
    pdf = stats.beta(a, b).pdf(theta)
    ax.fill_between(theta, pdf, alpha=0.3, color=color)
    ax.plot(theta, pdf, color=color, linewidth=2)

    # Add mean and mode lines
    mean = a / (a + b)
    ax.axvline(mean, color='black', linestyle='--', alpha=0.7, label=f'Mean={mean:.2f}')
    if a > 1 and b > 1:
        mode = (a - 1) / (a + b - 2)
        ax.axvline(mode, color='red', linestyle=':', alpha=0.7, label=f'Mode={mode:.2f}')

    ax.set_title(f"Beta({a}, {b})\n{name}", fontsize=10)
    ax.set_xlabel("θ")
    ax.set_ylabel("p(θ)")
    ax.legend(fontsize=8)
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.suptitle("Beta Distribution Family: Flexibility for Prior Specification", fontsize=14)
plt.tight_layout()
```

The Normal Distribution for Location Parameters:
For unbounded real-valued parameters, the Normal (Gaussian) distribution is a natural choice:
$$p(\theta | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\theta - \mu)^2}{2\sigma^2}\right)$$
The Normal distribution has maximum entropy among all distributions with a given mean and variance, making it a principled default when only these moments are known.
Guidelines for Normal priors:

- Center the prior at the most plausible value a priori (often 0 for standardized effects or contrasts)
- Choose the scale so that the central ~95% interval (μ ± 2σ) spans the range of scientifically plausible values, as sketched below
- Prefer a slightly-too-wide scale over a too-narrow one; an overly tight prior can effectively rule out the truth
- For extra robustness to surprises, consider a heavy-tailed Student-t or Cauchy alternative with the same center and scale
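The following small helper is a hypothetical convenience function (not part of any library) that converts a plausible range into Normal prior parameters along the lines above.

```python
from scipy import stats


def normal_prior_from_range(lower, upper, coverage=0.95):
    """Hypothetical helper: choose Normal(mu, sd) so that the central
    `coverage` interval spans [lower, upper]."""
    mu = (lower + upper) / 2
    z = stats.norm.ppf(0.5 + coverage / 2)   # ~1.96 for 95% coverage
    sd = (upper - lower) / (2 * z)
    return mu, sd


# Example: a treatment effect believed to lie between -5 and 15 (arbitrary units)
mu, sd = normal_prior_from_range(-5, 15)
print(f"Suggested prior: Normal({mu:.1f}, {sd:.2f}^2)")
print(f"Implied 95% prior interval: "
      f"[{stats.norm(mu, sd).ppf(0.025):.1f}, {stats.norm(mu, sd).ppf(0.975):.1f}]")
```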
Half-Cauchy for Scale Parameters:
For parameters that must be positive and represent scales or standard deviations, the half-Cauchy distribution is increasingly recommended:
$$p(\sigma | s) = \frac{2}{\pi s \left(1 + (\sigma/s)^2\right)}, \quad \sigma > 0$$
The half-Cauchy has heavier tails than Gamma or Inverse-Gamma alternatives, providing robustness when the true scale is unexpectedly large. This is particularly valuable in hierarchical models where shrinkage might otherwise be excessive.
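The difference in tail behavior is easy to see numerically. The sketch below uses illustrative parameter choices (all three priors have scale comparable to 1) and simply tabulates right-tail probabilities.

```python
from scipy import stats

# Illustrative comparison of right-tail mass for scale-parameter priors
half_cauchy = stats.halfcauchy(scale=1.0)       # heavy (polynomial) tail
half_normal = stats.halfnorm(scale=1.0)         # light (Gaussian) tail
gamma_prior = stats.gamma(a=2.0, scale=0.5)     # Gamma(shape=2, rate=2); exponential tail

for t in (3, 5, 10):
    print(f"P(sigma > {t:2d}):  "
          f"Half-Cauchy(1) = {half_cauchy.sf(t):.4f}   "
          f"Half-Normal(1) = {half_normal.sf(t):.2e}   "
          f"Gamma(2, rate=2) = {gamma_prior.sf(t):.2e}")
```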
Selecting appropriate priors is both an art and a science. Here we present a systematic framework for making principled prior choices.
Encoding Domain Knowledge:
Suppose you're modeling the probability θ that a patient responds to a new cancer treatment. From the literature, similar drugs in this class have response rates between 15% and 35%, with a best estimate around 25%.
Step 1: What prior encodes this?
We want a Beta(α, β) with:

- Prior mean near 0.25
- Most of its probability mass between roughly 0.15 and 0.35
Using numerical methods or trial-and-error, Beta(10, 30) roughly satisfies these constraints:

- Mean = 10/(10 + 30) = 0.25
- The bulk of its mass falls between about 0.15 and 0.35, with modest tails beyond that range
Alternatively, Beta(5, 15) gives similar mean but wider spread—appropriate if literature is less certain.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import minimize


def find_beta_params(target_mean, target_lower, target_upper, coverage=0.95):
    """
    Find Beta parameters that match desired mean and credible interval.

    Args:
        target_mean: Desired prior mean
        target_lower: Lower bound of desired credible interval
        target_upper: Upper bound of desired credible interval
        coverage: Probability coverage of interval (default 95%)

    Returns:
        (alpha, beta) parameters
    """
    alpha_lower = (1 - coverage) / 2
    alpha_upper = 1 - alpha_lower

    def objective(params):
        a, b = params
        if a <= 0 or b <= 0:
            return np.inf
        dist = stats.beta(a, b)
        mean_error = (dist.mean() - target_mean)**2
        lower_error = (dist.ppf(alpha_lower) - target_lower)**2
        upper_error = (dist.ppf(alpha_upper) - target_upper)**2
        return mean_error + lower_error + upper_error

    # Initial guess based on mean
    a_init = target_mean * 10
    b_init = (1 - target_mean) * 10

    result = minimize(objective, [a_init, b_init],
                      bounds=[(0.1, 100), (0.1, 100)])
    return result.x


# Example: Drug response rate
# Prior belief: 15-35% range, mean around 25%
alpha, beta = find_beta_params(
    target_mean=0.25,
    target_lower=0.15,
    target_upper=0.35
)

print(f"Optimal Beta parameters: α = {alpha:.2f}, β = {beta:.2f}")

# Verify
dist = stats.beta(alpha, beta)
print(f"Prior mean: {dist.mean():.3f}")
print(f"Prior 2.5th percentile: {dist.ppf(0.025):.3f}")
print(f"Prior 97.5th percentile: {dist.ppf(0.975):.3f}")

# Visualize
theta = np.linspace(0, 1, 500)
plt.figure(figsize=(10, 6))
plt.fill_between(theta, dist.pdf(theta), alpha=0.3, color='blue')
plt.plot(theta, dist.pdf(theta), 'b-', linewidth=2, label=f'Beta({alpha:.1f}, {beta:.1f})')
plt.axvline(0.15, color='red', linestyle='--', label='Target lower (0.15)')
plt.axvline(0.35, color='red', linestyle='--', label='Target upper (0.35)')
plt.axvline(0.25, color='green', linestyle='-', label='Target mean (0.25)')
plt.xlabel("θ (response rate)")
plt.ylabel("p(θ)")
plt.title("Prior Constructed from Domain Knowledge")
plt.legend()
plt.grid(True, alpha=0.3)
```

It's invalid to 'peek' at the data when selecting priors. Priors must be specified before observing the data, or the posterior will be mis-calibrated. If you must use the data for prior specification (e.g., empirical Bayes), use formal methods that account for this double-use of data.
An improper prior is a function that does not integrate to a finite value—it's not a valid probability distribution. The most common example is the flat prior on an unbounded space:
$$p(\theta) \propto 1, \quad \theta \in (-\infty, \infty)$$
Since $\int_{-\infty}^{\infty} 1 \, d\theta = \infty$, this is not a proper probability distribution.
Why use improper priors?
Despite their theoretical issues, improper priors are sometimes used because:

- They can still yield proper posteriors when the likelihood is sufficiently informative
- They often reproduce familiar frequentist answers (e.g., a flat prior on a Normal mean gives the sample mean as the posterior mean)
- Standard "objective" choices such as Jeffreys priors are frequently improper
- They can be viewed as convenient limits of increasingly diffuse proper priors
The Danger:
Improper priors can lead to improper posteriors—posteriors that also don't integrate to finite values. This makes inference impossible: you can't compute means, credible intervals, or any meaningful summaries.
Example of disaster:
Consider a hierarchical model in which group effects are drawn from a population distribution, $\theta_i \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \ldots, n$, with improper hyperpriors $p(\mu) \propto 1$ and $p(\sigma^2) \propto 1/\sigma^2$.
If n = 1 (single observation), the posterior on σ² is improper—there's no finite integral. The model is unidentified and inference is meaningless.
Safe practices:
Before using an improper prior, verify that the posterior is proper. The condition is that the posterior's normalizing constant is finite: $\int p(\theta)\, p(y \mid \theta)\, d\theta < \infty$ for (almost) all possible datasets $y$. When in doubt, use proper (even very diffuse) priors instead.
A critical aspect of responsible Bayesian analysis is understanding how much conclusions depend on prior assumptions. Sensitivity analysis systematically varies the prior and observes how the posterior changes.
When to worry:

- The sample size is small relative to the number of parameters, so the prior carries real weight
- Reasonable alternative priors lead to substantively different conclusions
- The quantity of interest depends on the tails of the posterior (e.g., extreme quantiles or rare-event probabilities)
- The prior was chosen for convenience rather than from domain knowledge
Formal sensitivity analysis:
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def compute_beta_posterior(prior_alpha, prior_beta, successes, failures):
    """
    Compute Beta posterior analytically (conjugacy).
    """
    post_alpha = prior_alpha + successes
    post_beta = prior_beta + failures
    return post_alpha, post_beta


# Observed data: 3 successes in 10 trials
successes, failures = 3, 7

# Define a class of reasonable priors
priors = [
    (1, 1, "Flat (Beta(1,1))"),
    (0.5, 0.5, "Jeffreys (Beta(0.5,0.5))"),
    (2, 2, "Weakly informative (Beta(2,2))"),
    (1, 9, "Skeptical (Beta(1,9)) - favors low θ"),
    (5, 5, "Moderately informative (Beta(5,5))"),
    (10, 10, "Informative (Beta(10,10))"),
]

theta = np.linspace(0, 1, 500)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

posterior_means = []
posterior_95ci = []

for ax, (a, b, name) in zip(axes, priors):
    # Prior
    prior_dist = stats.beta(a, b)

    # Posterior
    post_a, post_b = compute_beta_posterior(a, b, successes, failures)
    post_dist = stats.beta(post_a, post_b)

    # Store summary statistics
    posterior_means.append(post_dist.mean())
    posterior_95ci.append((post_dist.ppf(0.025), post_dist.ppf(0.975)))

    # Plot
    ax.plot(theta, prior_dist.pdf(theta), 'b--', linewidth=2,
            label=f'Prior: Beta({a},{b})', alpha=0.7)
    ax.fill_between(theta, post_dist.pdf(theta), alpha=0.3, color='red')
    ax.plot(theta, post_dist.pdf(theta), 'r-', linewidth=2,
            label=f'Posterior: Beta({post_a},{post_b})')
    ax.axvline(post_dist.mean(), color='red', linestyle=':',
               label=f'Post. mean={post_dist.mean():.3f}')

    ax.set_title(f"{name}\nPosterior mean: {post_dist.mean():.3f}")
    ax.set_xlabel("θ")
    ax.set_ylabel("Density")
    ax.legend(fontsize=8)
    ax.set_xlim(0, 1)
    ax.grid(True, alpha=0.3)

plt.suptitle(f"Prior Sensitivity Analysis\nData: {successes} successes in {successes+failures} trials",
             fontsize=14)
plt.tight_layout()

# Summary report
print("=" * 60)
print("PRIOR SENSITIVITY ANALYSIS SUMMARY")
print("=" * 60)
print(f"{'Prior':<35} {'Post. Mean':<12} {'95% CI':<20}")
print("-" * 60)
for (a, b, name), mean, ci in zip(priors, posterior_means, posterior_95ci):
    print(f"{name:<35} {mean:<12.4f} [{ci[0]:.3f}, {ci[1]:.3f}]")
print("-" * 60)
print(f"Range of posterior means: [{min(posterior_means):.4f}, {max(posterior_means):.4f}]")
print(f"Spread: {max(posterior_means) - min(posterior_means):.4f}")
```

Interpreting Sensitivity:
In the example above, with only 10 observations (3 successes), the prior still matters:

- The flat and Jeffreys priors give posterior means near 0.32–0.33, close to the sample proportion of 0.30
- The skeptical Beta(1,9) prior pulls the posterior mean down to 0.20
- The informative Beta(10,10) prior pulls it up to about 0.43
- The spread of posterior means across these reasonable priors exceeds 0.2, so conclusions at this sample size depend materially on the prior
The Asymptotics of Prior Influence:
As sample size n → ∞, the posterior converges to the true parameter value regardless of the prior (assuming the model is well-specified and the prior places non-zero mass on the true value). This is called posterior consistency.
The posterior typically contracts around the true value at rate $O(1/\sqrt{n})$. In conjugate models the prior acts like a fixed number of pseudo-observations, so its relative weight in the posterior falls off roughly as $1/n$: after 100 observations, a Beta(2, 2) prior contributes only about 4 pseudo-counts against 100 real data points.
This provides theoretical comfort: with enough data, prior misspecification is automatically corrected. But in finite samples, prior choice matters.
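A quick sketch of this effect reuses the Beta–Binomial setup from the sensitivity analysis above; the true success probability of 0.3 and the particular priors are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
true_theta = 0.3
priors = {"Beta(1,1)": (1, 1), "Beta(10,10)": (10, 10), "Beta(1,9)": (1, 9)}

# Simulate a growing Bernoulli(0.3) dataset and track posterior means under each prior
data = rng.random(10_000) < true_theta
for n in (10, 100, 1000, 10_000):
    successes = int(data[:n].sum())
    # Conjugate posterior mean: (alpha + successes) / (alpha + beta + n)
    means = {name: (a + successes) / (a + b + n) for name, (a, b) in priors.items()}
    summary = "  ".join(f"{name}: {m:.3f}" for name, m in means.items())
    print(f"n = {n:>5}  posterior means ->  {summary}")
```

As n grows, the posterior means under all three priors converge toward the same value, while at n = 10 they still disagree noticeably.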
Prior beliefs form the foundation of Bayesian inference—they encode what we know before observing data. Let's consolidate the key concepts:

- Priors formalize pre-data knowledge as probability distributions over parameters; in the Bayesian view, probability measures degree of belief
- Priors range from non-informative (flat, Jeffreys) through weakly informative defaults to strongly informative priors built from domain knowledge
- A prior's support must match the parameter space; standard families (Beta, Normal, Gamma, Dirichlet, half-Cauchy, LKJ) cover the common cases
- Improper priors can be convenient, but the resulting posterior must be checked for propriety
- Prior predictive checks and sensitivity analyses are essential tools for validating prior choices
- With enough data the prior washes out, but in small samples it materially shapes the posterior
What's Next:
With prior beliefs established, we now turn to the likelihood function—the mathematical bridge connecting our model's parameters to observed data. The likelihood quantifies how probable our observations are under different parameter values, and its combination with the prior yields the posterior distribution.
You now understand the philosophical and mathematical foundations of prior distributions in Bayesian inference. You can categorize priors by informativeness, select appropriate distribution families, and conduct sensitivity analyses. Next, we'll explore how the likelihood function quantifies data evidence.