Having established our prior beliefs about parameter values, we now confront a fundamental question: How do we incorporate observed data into our reasoning? The answer lies in the likelihood function—one of the most important concepts in all of statistical inference.
The likelihood function quantifies how probable our observed data would be under different possible parameter values. It serves as the mathematical bridge connecting our theoretical model (parameterized by θ) to the empirical reality we observe (the data). Without this bridge, prior beliefs would remain forever unupdated, and data would have no voice in our conclusions.
Crucially, the likelihood is not a probability distribution over parameters—it's a function that, given fixed data, tells us how 'compatible' different parameter values are with what we've observed.
By the end of this page, you will understand the precise mathematical definition of likelihood, appreciate the fundamental distinction between likelihood and probability, work with common likelihood families, understand log-likelihood and its computational advantages, and grasp how the likelihood interacts with the prior to shape the posterior.
Let's build a rigorous understanding of the likelihood function from first principles.
The Setup:
Suppose we have:
- A probabilistic model p(x | θ) describing how observations are generated, indexed by an unknown parameter θ (possibly a vector)
- Observed data D = {x₁, x₂, ..., xₙ} produced by this process
The Definition:
The likelihood function L(θ) is defined as the probability (or probability density) of observing the data D, viewed as a function of the parameter θ:
$$\mathcal{L}(\theta) = \mathcal{L}(\theta | D) = p(D | \theta)$$
Critically, while the mathematical form $p(D \mid \theta)$ looks like a conditional probability, the interpretation reverses when we treat it as a function of θ: the data D are now held fixed at their observed values, and it is θ that varies across the parameter space.
The Crucial Distinction:
This distinction cannot be overemphasized:
| Aspect | Probability p(D|θ) | Likelihood L(θ|D) |
|---|---|---|
| What varies? | Data D | Parameter θ |
| What's fixed? | Parameter θ | Data D |
| Integrates to 1? | Yes (over D) | No (over θ) |
| Interpretation | Probability of seeing D | Compatibility of θ with D |
The likelihood function is NOT a probability distribution over θ. It does not integrate (or sum) to 1 over the parameter space. Statements like 'the probability that θ = 0.5 given the data is 0.3' are MEANINGLESS when computed from likelihood alone. Such statements require a prior and the full Bayesian posterior.
An Intuitive Example:
Consider flipping a coin 10 times and observing 7 heads. Let θ = P(Heads) be the coin's bias.
The probability of this specific outcome, given θ, is:
$$p(\text{7H, 3T} | \theta) = \binom{10}{7} \theta^7 (1-\theta)^3$$
Now, let's evaluate this for different θ values:
| θ | L(θ) = p(7H, 3T | θ) | Relative Likelihood |
|---|---|---|
| 0.1 | 0.000008748 | Very low |
| 0.3 | 0.0090017 | Low |
| 0.5 | 0.1171875 | Moderate |
| 0.7 | 0.2668279 | Highest |
| 0.9 | 0.0573956 | Moderate |
The likelihood function peaks at θ = 0.7 (the sample proportion), indicating this value is most compatible with the observed data. But L(0.7) = 0.267 is NOT the probability that θ = 0.7—it's the probability of observing exactly 7/10 heads IF the true bias were 0.7.
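The table values can be reproduced directly with scipy's binomial pmf; this is a minimal check that uses only the data above (7 heads in 10 flips) and standard scipy functions:

```python
from scipy.stats import binom

# Evaluate L(θ) = P(7 heads in 10 flips | θ) at the table's θ values
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    L = binom.pmf(7, 10, theta)  # C(10,7) * θ^7 * (1-θ)^3
    print(f"θ = {theta:.1f}:  L(θ) = {L:.7f}")
```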
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Observed data: 7 heads in 10 flips
heads, total = 7, 10
tails = total - heads

# Define likelihood function for Bernoulli/Binomial model
def likelihood(theta, heads, tails):
    """Binomial likelihood (ignoring normalizing constant)."""
    return (theta ** heads) * ((1 - theta) ** tails)

def log_likelihood(theta, heads, tails):
    """Log-likelihood for numerical stability."""
    return heads * np.log(theta) + tails * np.log(1 - theta)

# Evaluate over parameter space
theta = np.linspace(0.001, 0.999, 1000)
L = likelihood(theta, heads, tails)
LL = log_likelihood(theta, heads, tails)

# Find MLE
mle_theta = heads / total
mle_likelihood = likelihood(mle_theta, heads, tails)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Likelihood function
ax1 = axes[0]
ax1.fill_between(theta, L, alpha=0.3, color='blue')
ax1.plot(theta, L, 'b-', linewidth=2, label='Likelihood L(θ)')
ax1.axvline(mle_theta, color='red', linestyle='--', linewidth=2,
            label=f'MLE: θ = {mle_theta:.1f}')
ax1.scatter([mle_theta], [mle_likelihood], color='red', s=100, zorder=5)
ax1.set_xlabel('θ (coin bias)', fontsize=12)
ax1.set_ylabel('L(θ)', fontsize=12)
ax1.set_title(f'Likelihood Function\nData: {heads} heads in {total} flips', fontsize=12)
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)

# Log-likelihood function
ax2 = axes[1]
ax2.plot(theta, LL, 'g-', linewidth=2, label='Log-likelihood ℓ(θ)')
ax2.axvline(mle_theta, color='red', linestyle='--', linewidth=2,
            label=f'MLE: θ = {mle_theta:.1f}')
ax2.scatter([mle_theta], [log_likelihood(mle_theta, heads, tails)],
            color='red', s=100, zorder=5)
ax2.set_xlabel('θ (coin bias)', fontsize=12)
ax2.set_ylabel('ℓ(θ) = log L(θ)', fontsize=12)
ax2.set_title('Log-Likelihood Function\n(Same peak, better numerical properties)', fontsize=12)
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, 1)

plt.tight_layout()

# Demonstrate that likelihood doesn't integrate to 1
from scipy.integrate import quad
integral, _ = quad(lambda t: likelihood(t, heads, tails), 0, 1)
print(f"Integral of likelihood from 0 to 1: {integral:.6f}")
print("(This is NOT 1 - likelihood is not a probability distribution!)")
```

In most statistical models, we assume observations are independent and identically distributed (i.i.d.) given the parameters. This assumption has profound implications for the structure of the likelihood function.
The i.i.d. Assumption:
Observations x₁, x₂, ..., xₙ are i.i.d. given θ if:
- each xᵢ follows the same distribution p(x | θ) (identically distributed), and
- the observations are mutually independent conditional on θ (independent).
Likelihood Factorization:
Under the i.i.d. assumption, the likelihood function factorizes:
$$\mathcal{L}(\theta | D) = \prod_{i=1}^{n} p(x_i | \theta) = \prod_{i=1}^{n} \mathcal{L}_i(\theta)$$
where $\mathcal{L}_i(\theta) = p(x_i | \theta)$ is the contribution of the i-th observation to the total likelihood.
This factorization is one of the most important properties in statistics: it reduces the joint likelihood to a product of simple per-observation terms, and (after taking logarithms) the log-likelihood to a sum, which makes both analytical derivations and numerical computation tractable.
Many real-world scenarios violate i.i.d.: time series (temporal dependence), spatial data (spatial correlation), clustered data (within-group correlation), network data (graph dependencies). In such cases, the likelihood takes more complex forms that account for dependencies, such as autoregressive likelihoods for time series or mixed-effects likelihoods for clustered data.
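For example, a first-order autoregressive (Markov) likelihood replaces the i.i.d. product with a product of conditional terms, so dependence on the previous time point enters explicitly:

$$\mathcal{L}(\theta) = p(x_1 \mid \theta) \prod_{t=2}^{n} p(x_t \mid x_{t-1}, \theta)$$

Each observation contributes a factor conditioned on its predecessor rather than an unconditional density.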
Example: Normal Likelihood with Known Variance
Suppose we observe x₁, ..., xₙ from N(μ, σ²) where σ² is known and μ is unknown.
The individual contribution: $$p(x_i | \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
The joint likelihood: $$\mathcal{L}(\mu) = \prod_{i=1}^{n} p(x_i | \mu) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$$
This product form is mathematically elegant but computationally problematic for large n—the likelihood becomes astronomically small due to multiplying many probabilities less than 1.
For any non-trivial dataset, computing the likelihood directly leads to numerical underflow—products of many small numbers become indistinguishable from zero in floating-point arithmetic. The solution is to work with the log-likelihood instead.
Definition:
The log-likelihood is simply the natural logarithm of the likelihood:
$$\ell(\theta) = \log \mathcal{L}(\theta) = \log p(D | \theta)$$
Why Logarithm?
- The logarithm is strictly increasing, so ℓ(θ) and L(θ) have the same maximizer.
- Products of densities become sums of log-densities, which are easier to differentiate and optimize.
- Sums of moderate-sized numbers avoid the floating-point underflow that plagues products of many values less than 1.
```python
import numpy as np
from scipy import stats

# Demonstrate numerical underflow with raw likelihood

np.random.seed(42)

# Generate data from Normal(5, 2)
true_mu, true_sigma = 5.0, 2.0
n_samples = 1000
data = np.random.normal(true_mu, true_sigma, n_samples)

def normal_pdf(x, mu, sigma):
    """Single observation probability density."""
    return (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def raw_likelihood(data, mu, sigma):
    """Compute likelihood as product of individual densities."""
    L = 1.0
    for x in data:
        L *= normal_pdf(x, mu, sigma)
    return L

def log_likelihood(data, mu, sigma):
    """Compute log-likelihood as sum of log-densities."""
    ll = 0.0
    for x in data:
        ll += np.log(normal_pdf(x, mu, sigma))
    return ll

# Vectorized log-likelihood (more efficient)
def log_likelihood_vectorized(data, mu, sigma):
    return np.sum(stats.norm(mu, sigma).logpdf(data))

# Compare approaches
print("=" * 60)
print("NUMERICAL COMPARISON: RAW LIKELIHOOD vs LOG-LIKELIHOOD")
print("=" * 60)
print(f"Data: {n_samples} observations from N({true_mu}, {true_sigma}²)")
print()

# Try computing raw likelihood
print("RAW LIKELIHOOD:")
for n in [10, 50, 100, 500, 1000]:
    subset = data[:n]
    L = raw_likelihood(subset, true_mu, true_sigma)
    print(f"  n = {n:4d}: L = {L}")

print()
print("LOG-LIKELIHOOD:")
for n in [10, 50, 100, 500, 1000]:
    subset = data[:n]
    ll = log_likelihood_vectorized(subset, true_mu, true_sigma)
    print(f"  n = {n:4d}: ℓ = {ll:.4f}")

print()
print("OBSERVATION:")
print("Raw likelihood underflows to 0.0 around n=500+")
print("Log-likelihood remains well-behaved at any n")

# Demonstrate that MLE is same regardless of which we optimize
from scipy.optimize import minimize_scalar

def neg_log_likelihood(mu, data, sigma):
    return -log_likelihood_vectorized(data, mu, sigma)

result = minimize_scalar(neg_log_likelihood, args=(data, true_sigma),
                         bounds=(0, 10), method='bounded')

print()
print(f"MLE for μ (by optimizing log-likelihood): {result.x:.4f}")
print(f"Sample mean (analytical MLE): {np.mean(data):.4f}")
print(f"True μ: {true_mu:.4f}")
```

Log-Likelihood for Common Models:
Bernoulli/Binomial (k successes in n trials): $$\ell(\theta) = k \log\theta + (n-k)\log(1-\theta) + \text{const}$$
Normal with known variance: $$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
Normal with unknown mean and variance: $$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
Poisson: $$\ell(\lambda) = \left(\sum_{i=1}^{n} x_i\right) \log\lambda - n\lambda - \sum_{i=1}^{n}\log(x_i!)$$
Exponential: $$\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^{n} x_i$$
Notice how each log-likelihood is a simple function of the parameter(s) and sufficient statistics of the data (sample sums, sample means, sample variances).
A sufficient statistic T(D) captures all information in the data relevant to θ. If the likelihood depends on data only through T(D), then T(D) is sufficient. For example, for the Normal with known variance, the sample mean is sufficient for μ—individual data points are irrelevant beyond their mean.
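To make the sufficiency claim concrete, here is a minimal sketch under the known-variance Normal model above: two made-up datasets with the same size and the same sample mean produce log-likelihood curves for μ that differ only by an additive constant, and therefore rank every value of μ identically.

```python
import numpy as np
from scipy import stats

sigma = 1.0  # known standard deviation
# Two hypothetical datasets with the same size (n=3) and the same sample mean (2.0)
data_a = np.array([1.0, 2.0, 3.0])
data_b = np.array([0.0, 2.0, 4.0])

mu_grid = np.linspace(0.0, 4.0, 9)
ll_a = np.array([stats.norm(mu, sigma).logpdf(data_a).sum() for mu in mu_grid])
ll_b = np.array([stats.norm(mu, sigma).logpdf(data_b).sum() for mu in mu_grid])

# After subtracting each curve's maximum, the two log-likelihoods coincide:
# the data enter ℓ(μ) only through n and the sample mean (the sufficient statistic).
print(np.allclose(ll_a - ll_a.max(), ll_b - ll_b.max()))  # True
```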
Different data types call for different likelihood families. Mastering these families is essential for effective statistical modeling.
| Data Type | Distribution | Parameters | Example Applications |
|---|---|---|---|
| Binary (0/1) | Bernoulli | θ ∈ [0,1] | Click-through rates, disease status |
| Count of successes | Binomial(n, θ) | θ ∈ [0,1], n known | Number of conversions, defects |
| Unbounded counts | Poisson(λ) | λ > 0 | Website visits, earthquakes per year |
| Continuous (symmetric) | Normal(μ, σ²) | μ ∈ ℝ, σ² > 0 | Heights, measurement errors |
| Positive continuous | Gamma(α, β) | α, β > 0 | Waiting times, rainfall amounts |
| Heavy-tailed continuous | Student-t(ν, μ, σ) | ν, σ > 0, μ ∈ ℝ | Financial returns, robust regression |
| Proportions (0,1) | Beta(α, β) | α, β > 0 | Conversion rates, proportions |
| Categorical | Multinomial | θ on simplex | Survey responses, word counts |
| Time-to-event | Exponential, Weibull | Various | Survival analysis, reliability |
The Gaussian (Normal) Likelihood:
The Normal distribution is by far the most commonly used likelihood, justified by the Central Limit Theorem (aggregates of many small independent effects tend toward normality), by mathematical convenience (closed-form results and conjugacy with Normal priors), and by its empirical adequacy for many measurement processes.
But beware: real data often deviate from normality through heavy tails (more extreme values than the Normal allows), skewness, multimodality, or bounded and discrete support.
Robust alternatives like Student-t likelihoods accommodate heavy tails by down-weighting outliers automatically.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate example data and visualize different likelihood functions

np.random.seed(42)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# 1. Bernoulli likelihood
ax = axes[0]
n_trials, successes = 20, 14
theta_range = np.linspace(0.01, 0.99, 200)
bernoulli_ll = successes * np.log(theta_range) + (n_trials - successes) * np.log(1 - theta_range)
bernoulli_ll -= np.max(bernoulli_ll)  # Normalize for visualization
ax.plot(theta_range, np.exp(bernoulli_ll), 'b-', linewidth=2)
ax.fill_between(theta_range, np.exp(bernoulli_ll), alpha=0.3)
ax.axvline(successes/n_trials, color='red', linestyle='--', label=f'MLE={successes/n_trials:.2f}')
ax.set_title(f'Bernoulli Likelihood\n{successes} successes in {n_trials} trials')
ax.set_xlabel('θ')
ax.set_ylabel('L(θ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Poisson likelihood
ax = axes[1]
count_data = np.array([3, 5, 2, 8, 4, 6, 3, 7, 5, 4])  # observed counts
lambda_range = np.linspace(0.1, 12, 200)
poisson_ll = np.sum(count_data) * np.log(lambda_range) - len(count_data) * lambda_range
poisson_ll -= np.max(poisson_ll)
ax.plot(lambda_range, np.exp(poisson_ll), 'g-', linewidth=2)
ax.fill_between(lambda_range, np.exp(poisson_ll), alpha=0.3, color='green')
ax.axvline(np.mean(count_data), color='red', linestyle='--', label=f'MLE={np.mean(count_data):.2f}')
ax.set_title(f'Poisson Likelihood\nCounts: {list(count_data)}')
ax.set_xlabel('λ')
ax.set_ylabel('L(λ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Normal likelihood (unknown mean, known variance)
ax = axes[2]
normal_data = np.random.normal(5, 2, 25)
sigma = 2  # known
mu_range = np.linspace(2, 8, 200)
normal_ll = -0.5 * np.sum((normal_data[:, None] - mu_range)**2, axis=0) / sigma**2
normal_ll -= np.max(normal_ll)
ax.plot(mu_range, np.exp(normal_ll), 'purple', linewidth=2)
ax.fill_between(mu_range, np.exp(normal_ll), alpha=0.3, color='purple')
ax.axvline(np.mean(normal_data), color='red', linestyle='--', label=f'MLE={np.mean(normal_data):.2f}')
ax.set_title(f'Normal Likelihood (σ=2 known)\nn={len(normal_data)} observations')
ax.set_xlabel('μ')
ax.set_ylabel('L(μ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Exponential likelihood
ax = axes[3]
exp_data = np.random.exponential(scale=3, size=30)  # scale = 1/rate
rate_range = np.linspace(0.1, 1, 200)
exp_ll = len(exp_data) * np.log(rate_range) - rate_range * np.sum(exp_data)
exp_ll -= np.max(exp_ll)
ax.plot(rate_range, np.exp(exp_ll), 'orange', linewidth=2)
ax.fill_between(rate_range, np.exp(exp_ll), alpha=0.3, color='orange')
mle_rate = len(exp_data) / np.sum(exp_data)
ax.axvline(mle_rate, color='red', linestyle='--', label=f'MLE={mle_rate:.3f}')
ax.set_title(f'Exponential Likelihood\nn={len(exp_data)}, sample mean={np.mean(exp_data):.2f}')
ax.set_xlabel('λ (rate)')
ax.set_ylabel('L(λ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 5. Gamma likelihood (shape α, fixed rate β=1)
ax = axes[4]
gamma_data = np.random.gamma(shape=3, scale=1, size=50)
alpha_range = np.linspace(1, 6, 200)
from scipy.special import gammaln
gamma_ll = (alpha_range - 1) * np.sum(np.log(gamma_data)) - len(gamma_data) * gammaln(alpha_range) - np.sum(gamma_data)
gamma_ll -= np.max(gamma_ll)
ax.plot(alpha_range, np.exp(gamma_ll), 'teal', linewidth=2)
ax.fill_between(alpha_range, np.exp(gamma_ll), alpha=0.3, color='teal')
ax.set_title(f'Gamma Likelihood (β=1 fixed)\nn={len(gamma_data)}')
ax.set_xlabel('α (shape)')
ax.set_ylabel('L(α) [normalized]')
ax.grid(True, alpha=0.3)

# 6. Normal vs Student-t (robustness demonstration)
ax = axes[5]
# Data with outliers
robust_data = np.concatenate([np.random.normal(0, 1, 47), [10, -12, 15]])  # 3 outliers
mu_range = np.linspace(-3, 5, 200)

# Normal likelihood
normal_ll_robust = -0.5 * np.sum((robust_data[:, None] - mu_range)**2, axis=0)
normal_ll_robust -= np.max(normal_ll_robust)

# Student-t likelihood (df=3)
from scipy.stats import t as student_t
t_ll = np.sum(student_t.logpdf(robust_data[:, None] - mu_range, df=3), axis=0)
t_ll -= np.max(t_ll)

ax.plot(mu_range, np.exp(normal_ll_robust), 'b-', linewidth=2, label='Normal', alpha=0.7)
ax.plot(mu_range, np.exp(t_ll), 'r-', linewidth=2, label='Student-t (ν=3)')
ax.axvline(np.mean(robust_data), color='blue', linestyle=':', alpha=0.7)
ax.axvline(np.median(robust_data), color='green', linestyle='--', label='Median')
ax.set_title('Robustness: Normal vs Student-t\n(data with outliers)')
ax.set_xlabel('μ')
ax.set_ylabel('L(μ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()

print("Note: Student-t likelihood is less affected by outliers, yielding")
print("      MLE closer to the majority of data (near median).")
```

Since likelihood is not a probability, we cannot interpret L(θ) = 0.3 as 'θ has 30% probability'. However, we CAN make meaningful comparisons between parameter values using likelihood ratios.
The Likelihood Ratio:
$$LR(\theta_1, \theta_2) = \frac{\mathcal{L}(\theta_1)}{\mathcal{L}(\theta_2)}$$
This ratio tells us how many times more likely the data are under θ₁ compared to θ₂. If LR = 3, the data are 3 times more probable under θ₁ than θ₂—though neither parameter is 'correct' or 'probable' in an absolute sense.
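For the coin example from earlier on this page, the ratio comparing θ₁ = 0.7 with θ₂ = 0.5 can be computed in a couple of lines (a quick sketch using scipy):

```python
from scipy.stats import binom

# Likelihood ratio for 7 heads in 10 flips: θ₁ = 0.7 vs θ₂ = 0.5
L1 = binom.pmf(7, 10, 0.7)   # ≈ 0.2668
L2 = binom.pmf(7, 10, 0.5)   # ≈ 0.1172
print(f"LR(0.7, 0.5) = {L1 / L2:.2f}")  # ≈ 2.28
```

An LR of about 2.3 is weak evidence: the data favor θ = 0.7 over a fair coin, but not decisively, which matches the scale below.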
Interpretation:
| Likelihood Ratio | Interpretation |
|---|---|
| LR ≈ 1 | Data equally compatible with both values |
| LR = 10 | Moderate evidence favoring θ₁ |
| LR = 100 | Strong evidence favoring θ₁ |
| LR = 1000+ | Very strong evidence favoring θ₁ |
These interpretations are heuristic but widely used, particularly in genetics (LOD scores = log₁₀ of likelihood ratio) and clinical diagnostics.
The Maximum Likelihood Estimator (MLE):
The MLE is the parameter value that maximizes the likelihood:
$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \ell(\theta)$$
The MLE is the value most compatible with observed data in the likelihood sense. It plays a central role in frequentist inference but is also important in Bayesian analysis: for diffuse priors or large samples, the posterior is approximately centered at the MLE.
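For the binomial coin example, this maximization can be carried out in closed form by differentiating the log-likelihood and setting the derivative to zero:

$$\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \quad\Longrightarrow\quad \hat{\theta}_{\text{MLE}} = \frac{k}{n} = \frac{7}{10} = 0.7$$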
The likelihood ratio forms the basis of likelihood ratio tests (LRT) for hypothesis testing. Under H₀, the test statistic -2 log(LR) is asymptotically χ² distributed. This connects likelihood to classical hypothesis testing and model comparison.
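A brief sketch of the idea, using the same coin data and a χ² approximation that is admittedly crude at n = 10, tests H₀: θ = 0.5 against the unrestricted alternative:

```python
import numpy as np
from scipy.stats import binom, chi2

# Likelihood ratio test of H0: θ = 0.5 (fair coin) for 7 heads in 10 flips
k, n = 7, 10
theta_0, theta_hat = 0.5, k / n

lr_stat = -2 * (binom.logpmf(k, n, theta_0) - binom.logpmf(k, n, theta_hat))
p_value = chi2.sf(lr_stat, df=1)  # asymptotic χ² approximation, 1 degree of freedom
print(f"-2 log LR = {lr_stat:.3f}, approximate p-value = {p_value:.3f}")  # ≈ 1.646, p ≈ 0.20
```

The resulting p-value of roughly 0.2 indicates that 7 heads in 10 flips is quite compatible with a fair coin.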
Profile Likelihood:
When models have multiple parameters θ = (ψ, λ) where ψ is of interest and λ are nuisance parameters, we can construct the profile likelihood for ψ:
$$\mathcal{L}_p(\psi) = \max_{\lambda} \mathcal{L}(\psi, \lambda)$$
This concentrates on the parameter of interest by maximizing over nuisance parameters at each value of ψ. Profile likelihood enables inference about individual parameters in complex models.
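As a minimal sketch, consider a Normal model in which μ is of interest and σ is the nuisance parameter; for each candidate μ the maximizing σ is available in closed form, so the profile can be computed on a grid (the data here are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.5, size=40)  # illustrative data

def profile_loglik(mu, x):
    sigma_hat = np.sqrt(np.mean((x - mu) ** 2))  # σ maximized out analytically for this μ
    return np.sum(stats.norm(mu, sigma_hat).logpdf(x))

mu_grid = np.linspace(x.mean() - 1.0, x.mean() + 1.0, 201)
profile = np.array([profile_loglik(mu, x) for mu in mu_grid])

print(f"Profile-likelihood maximizer: {mu_grid[np.argmax(profile)]:.3f}")
print(f"Sample mean:                  {x.mean():.3f}")  # agrees up to grid resolution
```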
Support and Likelihood Intervals:
Before Bayesian credible intervals became widespread, statisticians used likelihood intervals (also called support intervals):
A 1/k likelihood interval contains all θ where: $$\mathcal{L}(\theta) \geq \frac{1}{k} \cdot \mathcal{L}(\hat{\theta}_{MLE})$$
For example, a 1/8 likelihood interval contains all values with likelihood at least 1/8 of the maximum—values reasonably compatible with the data.
Under standard regularity conditions, a 1/6.83 likelihood interval for a single parameter approximates a 95% confidence interval, connecting likelihood-based and frequentist approaches.
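A likelihood interval is straightforward to compute on a grid; this sketch finds the 1/8 interval for the coin example (the same approach with a 1/6.83 cutoff gives the approximate 95% interval mentioned above):

```python
import numpy as np
from scipy.stats import binom

# 1/8 likelihood interval for 7 heads in 10 flips, via a dense grid of θ values
k, n = 7, 10
theta = np.linspace(0.001, 0.999, 9999)
L = binom.pmf(k, n, theta)

inside = theta[L >= L.max() / 8]  # all θ with likelihood at least 1/8 of the maximum
print(f"1/8 likelihood interval: ({inside.min():.3f}, {inside.max():.3f})")
```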
Now we connect the likelihood to the Bayesian framework, revealing its role as the vehicle through which data updates our beliefs.
Bayes' Theorem:
$$p(\theta | D) = \frac{p(D | \theta) \cdot p(\theta)}{p(D)} = \frac{\mathcal{L}(\theta) \cdot p(\theta)}{p(D)}$$
Or, more memorably:
$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$
The likelihood is the multiplicative update factor that transforms prior beliefs into posterior beliefs. Where the likelihood is high, the posterior is pulled upward relative to the prior; where the likelihood is low, the posterior is suppressed.
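This multiplicative update can be made concrete with a short grid approximation (a sketch using the coin data from this page and an illustrative Beta(2, 2) prior): multiply prior and likelihood pointwise, normalize, and compare with the exact conjugate answer.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)
prior = stats.beta(2, 2).pdf(theta)            # prior density on a grid
likelihood = theta**7 * (1 - theta)**3         # 7 heads, 3 tails (constant dropped)

unnormalized = prior * likelihood              # pointwise multiplicative update
dtheta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dtheta)  # normalize to integrate to 1

# The grid posterior matches the analytical conjugate result, Beta(9, 5)
print(np.allclose(posterior, stats.beta(9, 5).pdf(theta), atol=1e-3))  # True
```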
The Likelihood Principle:
A foundational principle in Bayesian (and some frequentist) inference:
All evidence from the data about θ is contained in the likelihood function.
This means that if two experiments (possibly with different designs or stopping rules) yield proportional likelihood functions for θ, they carry exactly the same evidence about θ; inference should not depend on data that could have been observed but were not.
The likelihood principle has profound implications for experimental design and sequential data collection—implications that differ substantially from frequentist approaches.
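The classic illustration is the stopping-rule example sketched below: 3 successes in 12 trials gives a binomial likelihood if the number of trials was fixed in advance, and a negative binomial likelihood if sampling stopped at the third success, yet the two likelihoods are proportional and so carry the same evidence about θ (frequentist p-values, by contrast, differ between the two designs).

```python
import numpy as np
from scipy.stats import binom, nbinom

# Design A: n = 12 trials fixed in advance          -> binomial sampling
# Design B: sample until the 3rd success (12 trials) -> negative binomial sampling
theta = np.linspace(0.01, 0.99, 99)

L_binomial = binom.pmf(3, 12, theta)   # C(12,3) θ^3 (1-θ)^9
L_negbinom = nbinom.pmf(9, 3, theta)   # 9 failures before the 3rd success: C(11,2) θ^3 (1-θ)^9

# The two likelihood functions are proportional: their ratio is constant in θ
ratio = L_binomial / L_negbinom
print(np.allclose(ratio, ratio[0]))  # True (the constant is C(12,3)/C(11,2) = 4)
```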
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Demonstrate how likelihood updates prior to posterior

# Scenario: Estimating coin bias
# Prior: Beta(2, 2) - weakly informative, slight preference for fairness
# Data: 7 heads in 10 flips
# Posterior: Beta(2+7, 2+3) = Beta(9, 5)

theta = np.linspace(0, 1, 500)

# Prior
prior_a, prior_b = 2, 2
prior = stats.beta(prior_a, prior_b).pdf(theta)

# Data
heads, tails = 7, 3

# Likelihood (normalized for visualization)
likelihood = theta**heads * (1-theta)**tails
likelihood /= np.max(likelihood)  # Scale for visibility

# Posterior (analytical due to conjugacy)
post_a, post_b = prior_a + heads, prior_b + tails
posterior = stats.beta(post_a, post_b).pdf(theta)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Prior only
ax = axes[0]
ax.fill_between(theta, prior, alpha=0.3, color='blue')
ax.plot(theta, prior, 'b-', linewidth=2, label=f'Prior: Beta({prior_a},{prior_b})')
ax.axvline(prior_a/(prior_a+prior_b), color='blue', linestyle='--',
           label=f'Prior mean = {prior_a/(prior_a+prior_b):.2f}')
ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title('Step 1: Prior Belief\n(Before seeing data)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 2: Prior + Likelihood
ax = axes[1]
ax.fill_between(theta, prior, alpha=0.3, color='blue', label='Prior')
ax.plot(theta, likelihood * np.max(prior) * 0.8, 'g-', linewidth=2,
        label='Likelihood (scaled)', alpha=0.8)
ax.axvline(heads/(heads+tails), color='green', linestyle='--',
           label=f'MLE = {heads/(heads+tails):.1f}')
ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title(f'Step 2: Prior × Likelihood\nData: {heads} heads in {heads+tails} flips')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 3: All three + Posterior
ax = axes[2]
ax.plot(theta, prior, 'b--', linewidth=1.5, label='Prior', alpha=0.6)
ax.plot(theta, likelihood * np.max(posterior) * 0.5, 'g--', linewidth=1.5,
        label='Likelihood (scaled)', alpha=0.6)
ax.fill_between(theta, posterior, alpha=0.4, color='red')
ax.plot(theta, posterior, 'r-', linewidth=2.5, label=f'Posterior: Beta({post_a},{post_b})')
ax.axvline(post_a/(post_a+post_b), color='red', linestyle='-',
           label=f'Post. mean = {post_a/(post_a+post_b):.3f}')
ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title('Step 3: Posterior Distribution\nPrior × Likelihood ∝ Posterior')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

plt.tight_layout()

# Print summary
print("=" * 60)
print("BAYESIAN UPDATE SUMMARY")
print("=" * 60)
print(f"Prior: Beta({prior_a}, {prior_b}) → Mean = {prior_a/(prior_a+prior_b):.3f}")
print(f"Data: {heads} heads, {tails} tails")
print("Likelihood: Binomial(θ | n=10, k=7)")
print(f"Posterior: Beta({post_a}, {post_b}) → Mean = {post_a/(post_a+post_b):.3f}")
print()
print("Observation: Posterior mean lies between prior mean (0.5)")
print("             and MLE (0.7), pulled toward the data.")
```

All statistical models are approximations. What happens when the likelihood model doesn't match the true data-generating process?
Types of Misspecification:
- Wrong distributional family (e.g., a Normal likelihood for heavy-tailed data)
- Wrong dependence structure (e.g., assuming independence when observations are correlated)
- Wrong functional form (e.g., a linear mean structure where the true relationship is nonlinear)
Consequences:
- Parameter estimates may be biased or lose their intended interpretation
- Posterior uncertainty tends to be miscalibrated, often overly confident
- Predictions can be systematically off, especially outside the range of the observed data
Diagnosis:
- Posterior predictive checks: simulate data from the fitted model and compare with the observed data
- Residual analysis and comparison against more flexible alternative models
Bayesian theory traditionally assumes the true model is in the model class (M-closed). In reality, all models are wrong (M-open). Modern Bayesian practice increasingly adopts predictive-focused model evaluation rather than treating posterior distributions as representing 'truth'. The posterior is understood as a useful approximation, not a description of reality.
Robust Likelihoods:
When concerned about distributional misspecification, consider robust alternatives:
- Student-t likelihoods in place of Normal for heavy-tailed continuous data
- Negative binomial in place of Poisson for overdispersed counts
- Mixture likelihoods for heterogeneous populations
These alternatives build in flexibility that makes inference more robust to model misspecification, at the cost of additional parameters or computational complexity.
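As one concrete sketch (with simulated data and illustrative parameter values), compare the maximized Poisson and negative binomial log-likelihoods on overdispersed counts; the extra dispersion parameter pays for itself:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)
counts = rng.negative_binomial(n=2, p=0.25, size=200)  # variance far exceeds the mean

# Poisson: the MLE is the sample mean
lam_hat = counts.mean()
ll_poisson = stats.poisson(lam_hat).logpmf(counts).sum()

# Negative binomial: maximize the log-likelihood numerically over (r, p)
def neg_ll_nb(params):
    r, p = params
    if r <= 0 or not (0 < p < 1):
        return np.inf
    return -stats.nbinom(r, p).logpmf(counts).sum()

res = minimize(neg_ll_nb, x0=[1.0, 0.5], method='Nelder-Mead')
ll_nbinom = -res.fun

print(f"Sample mean = {counts.mean():.2f}, sample variance = {counts.var():.2f}")
print(f"Max log-likelihood, Poisson:           {ll_poisson:.1f}")
print(f"Max log-likelihood, negative binomial: {ll_nbinom:.1f}")
```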
The likelihood function is the mathematical heart of statistical inference, the translator between parameters and observations. Let's consolidate the key concepts:
- The likelihood L(θ) = p(D | θ) treats the data as fixed and the parameter as varying; it is not a probability distribution over θ.
- Under the i.i.d. assumption the likelihood factorizes into per-observation terms, and the log-likelihood becomes a sum.
- Working on the log scale avoids numerical underflow and leaves the maximizer unchanged.
- Likelihood ratios, the MLE, profile likelihoods, and likelihood intervals let us compare and summarize parameter values using the likelihood alone.
- In Bayes' theorem, the likelihood is the multiplicative factor that updates the prior into the posterior.
- All of this rests on the model being adequate; misspecified likelihoods call for robust alternatives and predictive checks.
What's Next:
With prior beliefs and likelihood in hand, we're ready to combine them into the posterior distribution—the complete Bayesian solution that updates our uncertainty based on observed evidence. The posterior integrates everything we know: prior knowledge, data, and model structure.
You now understand the likelihood function—what it is, what it isn't, how to compute it, and how it connects prior to posterior in Bayesian inference. Next, we'll explore the posterior distribution: the culmination of Bayesian reasoning.