Having established our prior beliefs about parameter values, we now confront a fundamental question: How do we incorporate observed data into our reasoning? The answer lies in the likelihood function—one of the most important concepts in all of statistical inference.
The likelihood function quantifies how probable our observed data would be under different possible parameter values. It serves as the mathematical bridge connecting our theoretical model (parameterized by θ) to the empirical reality we observe (the data). Without this bridge, prior beliefs would remain forever unupdated, and data would have no voice in our conclusions.
Crucially, the likelihood is not a probability distribution over parameters—it's a function that, given fixed data, tells us how 'compatible' different parameter values are with what we've observed.
By the end of this page, you will understand the precise mathematical definition of likelihood, appreciate the fundamental distinction between likelihood and probability, work with common likelihood families, understand log-likelihood and its computational advantages, and grasp how the likelihood interacts with the prior to shape the posterior.
Let's build a rigorous understanding of the likelihood function from first principles.
The Setup:
Suppose we have:
- A probabilistic model p(x | θ) describing how observations are generated, indexed by an unknown parameter θ (possibly a vector)
- Observed data D = {x₁, x₂, ..., xₙ} produced by this process
The Definition:
The likelihood function L(θ) is defined as the probability (or probability density) of observing the data D, viewed as a function of the parameter θ:
$$\mathcal{L}(\theta) = \mathcal{L}(\theta | D) = p(D | \theta)$$
Critically, while the mathematical form $p(D \mid \theta)$ looks like a conditional probability, the interpretation reverses when we treat it as a function of θ: the data D are now held fixed at their observed values, and it is θ that varies across the parameter space.
The Crucial Distinction:
This distinction cannot be overemphasized:
| Aspect | Probability p(D|θ) | Likelihood L(θ|D) |
|---|---|---|
| What varies? | Data D | Parameter θ |
| What's fixed? | Parameter θ | Data D |
| Integrates to 1? | Yes (over D) | No (over θ) |
| Interpretation | Probability of seeing D | Compatibility of θ with D |
The likelihood function is NOT a probability distribution over θ. It does not integrate (or sum) to 1 over the parameter space. Statements like 'the probability that θ = 0.5 given the data is 0.3' are MEANINGLESS when computed from likelihood alone. Such statements require a prior and the full Bayesian posterior.
An Intuitive Example:
Consider flipping a coin 10 times and observing 7 heads. Let θ = P(Heads) be the coin's bias.
The probability of this specific outcome, given θ, is:
$$p(\text{7H, 3T} | \theta) = \binom{10}{7} \theta^7 (1-\theta)^3$$
Now, let's evaluate this for different θ values:
| θ | L(θ) = p(7H, 3T | θ) | Relative Likelihood |
|---|---|---|
| 0.1 | 0.000008748 | Very low |
| 0.3 | 0.0090017 | Low |
| 0.5 | 0.1171875 | Moderate |
| 0.7 | 0.2668279 | Highest |
| 0.9 | 0.0573956 | Moderate |
The likelihood function peaks at θ = 0.7 (the sample proportion), indicating this value is most compatible with the observed data. But L(0.7) = 0.267 is NOT the probability that θ = 0.7—it's the probability of observing exactly 7/10 heads IF the true bias were 0.7.
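The table values can be reproduced directly with scipy's binomial pmf; this is a minimal check that uses only the data above (7 heads in 10 flips) and standard scipy functions:

```python
from scipy.stats import binom

# Evaluate L(θ) = P(7 heads in 10 flips | θ) at the table's θ values
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    L = binom.pmf(7, 10, theta)  # C(10,7) * θ^7 * (1-θ)^3
    print(f"θ = {theta:.1f}:  L(θ) = {L:.7f}")
```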
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Observed data: 7 heads in 10 flips
heads, total = 7, 10
tails = total - heads

# Define likelihood function for Bernoulli/Binomial model
def likelihood(theta, heads, tails):
    """Binomial likelihood (ignoring normalizing constant)."""
    return (theta ** heads) * ((1 - theta) ** tails)

def log_likelihood(theta, heads, tails):
    """Log-likelihood for numerical stability."""
    return heads * np.log(theta) + tails * np.log(1 - theta)

# Evaluate over parameter space
theta = np.linspace(0.001, 0.999, 1000)
L = likelihood(theta, heads, tails)
LL = log_likelihood(theta, heads, tails)

# Find MLE
mle_theta = heads / total
mle_likelihood = likelihood(mle_theta, heads, tails)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Likelihood function
ax1 = axes[0]
ax1.fill_between(theta, L, alpha=0.3, color='blue')
ax1.plot(theta, L, 'b-', linewidth=2, label='Likelihood L(θ)')
ax1.axvline(mle_theta, color='red', linestyle='--', linewidth=2,
            label=f'MLE: θ = {mle_theta:.1f}')
ax1.scatter([mle_theta], [mle_likelihood], color='red', s=100, zorder=5)
ax1.set_xlabel('θ (coin bias)', fontsize=12)
ax1.set_ylabel('L(θ)', fontsize=12)
ax1.set_title(f'Likelihood Function\nData: {heads} heads in {total} flips', fontsize=12)
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)

# Log-likelihood function
ax2 = axes[1]
ax2.plot(theta, LL, 'g-', linewidth=2, label='Log-likelihood ℓ(θ)')
ax2.axvline(mle_theta, color='red', linestyle='--', linewidth=2,
            label=f'MLE: θ = {mle_theta:.1f}')
ax2.scatter([mle_theta], [log_likelihood(mle_theta, heads, tails)],
            color='red', s=100, zorder=5)
ax2.set_xlabel('θ (coin bias)', fontsize=12)
ax2.set_ylabel('ℓ(θ) = log L(θ)', fontsize=12)
ax2.set_title('Log-Likelihood Function\n(Same peak, better numerical properties)', fontsize=12)
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, 1)

plt.tight_layout()

# Demonstrate that likelihood doesn't integrate to 1
from scipy.integrate import quad
integral, _ = quad(lambda t: likelihood(t, heads, tails), 0, 1)
print(f"Integral of likelihood from 0 to 1: {integral:.6f}")
print("(This is NOT 1 - likelihood is not a probability distribution!)")
```

In most statistical models, we assume observations are independent and identically distributed (i.i.d.) given the parameters. This assumption has profound implications for the structure of the likelihood function.
The i.i.d. Assumption:
Observations x₁, x₂, ..., xₙ are i.i.d. given θ if:
- each xᵢ follows the same distribution p(x | θ) (identically distributed), and
- the observations are mutually independent conditional on θ (independent).
Likelihood Factorization:
Under the i.i.d. assumption, the likelihood function factorizes:
$$\mathcal{L}(\theta | D) = \prod_{i=1}^{n} p(x_i | \theta) = \prod_{i=1}^{n} \mathcal{L}_i(\theta)$$
where $\mathcal{L}_i(\theta) = p(x_i | \theta)$ is the contribution of the i-th observation to the total likelihood.
This factorization is one of the most important properties in statistics: it reduces the joint likelihood to a product of simple per-observation terms, and (after taking logarithms) the log-likelihood to a sum, which makes both analytical derivations and numerical computation tractable.
Many real-world scenarios violate i.i.d.: time series (temporal dependence), spatial data (spatial correlation), clustered data (within-group correlation), network data (graph dependencies). In such cases, the likelihood takes more complex forms that account for dependencies, such as autoregressive likelihoods for time series or mixed-effects likelihoods for clustered data.
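For example, a first-order autoregressive (Markov) likelihood replaces the i.i.d. product with a product of conditional terms, so dependence on the previous time point enters explicitly:

$$\mathcal{L}(\theta) = p(x_1 \mid \theta) \prod_{t=2}^{n} p(x_t \mid x_{t-1}, \theta)$$

Each observation contributes a factor conditioned on its predecessor rather than an unconditional density.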
Example: Normal Likelihood with Known Variance
Suppose we observe x₁, ..., xₙ from N(μ, σ²) where σ² is known and μ is unknown.
The individual contribution: $$p(x_i | \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
The joint likelihood: $$\mathcal{L}(\mu) = \prod_{i=1}^{n} p(x_i | \mu) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right)$$
This product form is mathematically elegant but computationally problematic for large n—the likelihood becomes astronomically small due to multiplying many probabilities less than 1.
For any non-trivial dataset, computing the likelihood directly leads to numerical underflow—products of many small numbers become indistinguishable from zero in floating-point arithmetic. The solution is to work with the log-likelihood instead.
Definition:
The log-likelihood is simply the natural logarithm of the likelihood:
$$\ell(\theta) = \log \mathcal{L}(\theta) = \log p(D | \theta)$$
Why Logarithm?
- The logarithm is strictly increasing, so ℓ(θ) and L(θ) have the same maximizer.
- Products of densities become sums of log-densities, which are easier to differentiate and optimize.
- Sums of moderate-sized numbers avoid the floating-point underflow that plagues products of many values less than 1.
```python
import numpy as np
from scipy import stats

# Demonstrate numerical underflow with raw likelihood

np.random.seed(42)

# Generate data from Normal(5, 2)
true_mu, true_sigma = 5.0, 2.0
n_samples = 1000
data = np.random.normal(true_mu, true_sigma, n_samples)

def normal_pdf(x, mu, sigma):
    """Single observation probability density."""
    return (1 / (np.sqrt(2 * np.pi) * sigma)) * np.exp(-(x - mu)**2 / (2 * sigma**2))

def raw_likelihood(data, mu, sigma):
    """Compute likelihood as product of individual densities."""
    L = 1.0
    for x in data:
        L *= normal_pdf(x, mu, sigma)
    return L

def log_likelihood(data, mu, sigma):
    """Compute log-likelihood as sum of log-densities."""
    ll = 0.0
    for x in data:
        ll += np.log(normal_pdf(x, mu, sigma))
    return ll

# Vectorized log-likelihood (more efficient)
def log_likelihood_vectorized(data, mu, sigma):
    return np.sum(stats.norm(mu, sigma).logpdf(data))

# Compare approaches
print("=" * 60)
print("NUMERICAL COMPARISON: RAW LIKELIHOOD vs LOG-LIKELIHOOD")
print("=" * 60)
print(f"Data: {n_samples} observations from N({true_mu}, {true_sigma}²)")
print()

# Try computing raw likelihood
print("RAW LIKELIHOOD:")
for n in [10, 50, 100, 500, 1000]:
    subset = data[:n]
    L = raw_likelihood(subset, true_mu, true_sigma)
    print(f"  n = {n:4d}: L = {L}")

print()
print("LOG-LIKELIHOOD:")
for n in [10, 50, 100, 500, 1000]:
    subset = data[:n]
    ll = log_likelihood_vectorized(subset, true_mu, true_sigma)
    print(f"  n = {n:4d}: ℓ = {ll:.4f}")

print()
print("OBSERVATION:")
print("Raw likelihood underflows to 0.0 around n=500+")
print("Log-likelihood remains well-behaved at any n")

# Demonstrate that MLE is same regardless of which we optimize
from scipy.optimize import minimize_scalar

def neg_log_likelihood(mu, data, sigma):
    return -log_likelihood_vectorized(data, mu, sigma)

result = minimize_scalar(neg_log_likelihood, args=(data, true_sigma),
                         bounds=(0, 10), method='bounded')

print()
print(f"MLE for μ (by optimizing log-likelihood): {result.x:.4f}")
print(f"Sample mean (analytical MLE): {np.mean(data):.4f}")
print(f"True μ: {true_mu:.4f}")
```

Log-Likelihood for Common Models:
Bernoulli/Binomial (k successes in n trials): $$\ell(\theta) = k \log\theta + (n-k)\log(1-\theta) + \text{const}$$
Normal with known variance: $$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
Normal with unknown mean and variance: $$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
Poisson: $$\ell(\lambda) = \left(\sum_{i=1}^{n} x_i\right) \log\lambda - n\lambda - \sum_{i=1}^{n}\log(x_i!)$$
Exponential: $$\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^{n} x_i$$
Notice how each log-likelihood is a simple function of the parameter(s) and sufficient statistics of the data (sample sums, sample means, sample variances).
A sufficient statistic T(D) captures all information in the data relevant to θ. If the likelihood depends on data only through T(D), then T(D) is sufficient. For example, for the Normal with known variance, the sample mean is sufficient for μ—individual data points are irrelevant beyond their mean.
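To make the sufficiency claim concrete, here is a minimal sketch under the known-variance Normal model above: two made-up datasets with the same size and the same sample mean produce log-likelihood curves for μ that differ only by an additive constant, and therefore rank every value of μ identically.

```python
import numpy as np
from scipy import stats

sigma = 1.0  # known standard deviation
# Two hypothetical datasets with the same size (n=3) and the same sample mean (2.0)
data_a = np.array([1.0, 2.0, 3.0])
data_b = np.array([0.0, 2.0, 4.0])

mu_grid = np.linspace(0.0, 4.0, 9)
ll_a = np.array([stats.norm(mu, sigma).logpdf(data_a).sum() for mu in mu_grid])
ll_b = np.array([stats.norm(mu, sigma).logpdf(data_b).sum() for mu in mu_grid])

# After subtracting each curve's maximum, the two log-likelihoods coincide:
# the data enter ℓ(μ) only through n and the sample mean (the sufficient statistic).
print(np.allclose(ll_a - ll_a.max(), ll_b - ll_b.max()))  # True
```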
Different data types call for different likelihood families. Mastering these families is essential for effective statistical modeling.
| Data Type | Distribution | Parameters | Example Applications |
|---|---|---|---|
| Binary (0/1) | Bernoulli | θ ∈ [0,1] | Click-through rates, disease status |
| Count of successes | Binomial(n, θ) | θ ∈ [0,1], n known | Number of conversions, defects |
| Unbounded counts | Poisson(λ) | λ > 0 | Website visits, earthquakes per year |
| Continuous (symmetric) | Normal(μ, σ²) | μ ∈ ℝ, σ² > 0 | Heights, measurement errors |
| Positive continuous | Gamma(α, β) | α, β > 0 | Waiting times, rainfall amounts |
| Heavy-tailed continuous | Student-t(ν, μ, σ) | ν, σ > 0, μ ∈ ℝ | Financial returns, robust regression |
| Proportions (0,1) | Beta(α, β) | α, β > 0 | Conversion rates, proportions |
| Categorical | Multinomial | θ on simplex | Survey responses, word counts |
| Time-to-event | Exponential, Weibull | Various | Survival analysis, reliability |
The Gaussian (Normal) Likelihood:
The Normal distribution is by far the most commonly used likelihood, justified by the Central Limit Theorem (aggregates of many small independent effects tend toward normality), by mathematical convenience (closed-form results and conjugacy with Normal priors), and by its empirical adequacy for many measurement processes.
But beware: real data often deviate from normality through heavy tails (more extreme values than the Normal allows), skewness, multimodality, or bounded and discrete support.
Robust alternatives like Student-t likelihoods accommodate heavy tails by down-weighting outliers automatically.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate example data and visualize different likelihood functions

np.random.seed(42)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

# 1. Bernoulli likelihood
ax = axes[0]
n_trials, successes = 20, 14
theta_range = np.linspace(0.01, 0.99, 200)
bernoulli_ll = successes * np.log(theta_range) + (n_trials - successes) * np.log(1 - theta_range)
bernoulli_ll -= np.max(bernoulli_ll)  # Normalize for visualization
ax.plot(theta_range, np.exp(bernoulli_ll), 'b-', linewidth=2)
ax.fill_between(theta_range, np.exp(bernoulli_ll), alpha=0.3)
ax.axvline(successes/n_trials, color='red', linestyle='--', label=f'MLE={successes/n_trials:.2f}')
ax.set_title(f'Bernoulli Likelihood\n{successes} successes in {n_trials} trials')
ax.set_xlabel('θ')
ax.set_ylabel('L(θ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Poisson likelihood
ax = axes[1]
count_data = np.array([3, 5, 2, 8, 4, 6, 3, 7, 5, 4])  # observed counts
lambda_range = np.linspace(0.1, 12, 200)
poisson_ll = np.sum(count_data) * np.log(lambda_range) - len(count_data) * lambda_range
poisson_ll -= np.max(poisson_ll)
ax.plot(lambda_range, np.exp(poisson_ll), 'g-', linewidth=2)
ax.fill_between(lambda_range, np.exp(poisson_ll), alpha=0.3, color='green')
ax.axvline(np.mean(count_data), color='red', linestyle='--', label=f'MLE={np.mean(count_data):.2f}')
ax.set_title(f'Poisson Likelihood\nCounts: {list(count_data)}')
ax.set_xlabel('λ')
ax.set_ylabel('L(λ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 3. Normal likelihood (unknown mean, known variance)
ax = axes[2]
normal_data = np.random.normal(5, 2, 25)
sigma = 2  # known
mu_range = np.linspace(2, 8, 200)
normal_ll = -0.5 * np.sum((normal_data[:, None] - mu_range)**2, axis=0) / sigma**2
normal_ll -= np.max(normal_ll)
ax.plot(mu_range, np.exp(normal_ll), 'purple', linewidth=2)
ax.fill_between(mu_range, np.exp(normal_ll), alpha=0.3, color='purple')
ax.axvline(np.mean(normal_data), color='red', linestyle='--', label=f'MLE={np.mean(normal_data):.2f}')
ax.set_title(f'Normal Likelihood (σ=2 known)\nn={len(normal_data)} observations')
ax.set_xlabel('μ')
ax.set_ylabel('L(μ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Exponential likelihood
ax = axes[3]
exp_data = np.random.exponential(scale=3, size=30)  # scale = 1/rate
rate_range = np.linspace(0.1, 1, 200)
exp_ll = len(exp_data) * np.log(rate_range) - rate_range * np.sum(exp_data)
exp_ll -= np.max(exp_ll)
ax.plot(rate_range, np.exp(exp_ll), 'orange', linewidth=2)
ax.fill_between(rate_range, np.exp(exp_ll), alpha=0.3, color='orange')
mle_rate = len(exp_data) / np.sum(exp_data)
ax.axvline(mle_rate, color='red', linestyle='--', label=f'MLE={mle_rate:.3f}')
ax.set_title(f'Exponential Likelihood\nn={len(exp_data)}, sample mean={np.mean(exp_data):.2f}')
ax.set_xlabel('λ (rate)')
ax.set_ylabel('L(λ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

# 5. Gamma likelihood (shape α, fixed rate β=1)
ax = axes[4]
gamma_data = np.random.gamma(shape=3, scale=1, size=50)
alpha_range = np.linspace(1, 6, 200)
from scipy.special import gammaln
gamma_ll = (alpha_range - 1) * np.sum(np.log(gamma_data)) - len(gamma_data) * gammaln(alpha_range) - np.sum(gamma_data)
gamma_ll -= np.max(gamma_ll)
ax.plot(alpha_range, np.exp(gamma_ll), 'teal', linewidth=2)
ax.fill_between(alpha_range, np.exp(gamma_ll), alpha=0.3, color='teal')
ax.set_title(f'Gamma Likelihood (β=1 fixed)\nn={len(gamma_data)}')
ax.set_xlabel('α (shape)')
ax.set_ylabel('L(α) [normalized]')
ax.grid(True, alpha=0.3)

# 6. Normal vs Student-t (robustness demonstration)
ax = axes[5]
# Data with outliers
robust_data = np.concatenate([np.random.normal(0, 1, 47), [10, -12, 15]])  # 3 outliers
mu_range = np.linspace(-3, 5, 200)

# Normal likelihood
normal_ll_robust = -0.5 * np.sum((robust_data[:, None] - mu_range)**2, axis=0)
normal_ll_robust -= np.max(normal_ll_robust)

# Student-t likelihood (df=3)
from scipy.stats import t as student_t
t_ll = np.sum(student_t.logpdf(robust_data[:, None] - mu_range, df=3), axis=0)
t_ll -= np.max(t_ll)

ax.plot(mu_range, np.exp(normal_ll_robust), 'b-', linewidth=2, label='Normal', alpha=0.7)
ax.plot(mu_range, np.exp(t_ll), 'r-', linewidth=2, label='Student-t (ν=3)')
ax.axvline(np.mean(robust_data), color='blue', linestyle=':', alpha=0.7)
ax.axvline(np.median(robust_data), color='green', linestyle='--', label='Median')
ax.set_title('Robustness: Normal vs Student-t\n(data with outliers)')
ax.set_xlabel('μ')
ax.set_ylabel('L(μ) [normalized]')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()

print("Note: Student-t likelihood is less affected by outliers, yielding")
print("      MLE closer to the majority of data (near median).")
```

Since likelihood is not a probability, we cannot interpret L(θ) = 0.3 as 'θ has 30% probability'. However, we CAN make meaningful comparisons between parameter values using likelihood ratios.
The Likelihood Ratio:
$$LR(\theta_1, \theta_2) = \frac{\mathcal{L}(\theta_1)}{\mathcal{L}(\theta_2)}$$
This ratio tells us how many times more likely the data are under θ₁ compared to θ₂. If LR = 3, the data are 3 times more probable under θ₁ than θ₂—though neither parameter is 'correct' or 'probable' in an absolute sense.
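For the coin example from earlier on this page, the ratio comparing θ₁ = 0.7 with θ₂ = 0.5 can be computed in a couple of lines (a quick sketch using scipy):

```python
from scipy.stats import binom

# Likelihood ratio for 7 heads in 10 flips: θ₁ = 0.7 vs θ₂ = 0.5
L1 = binom.pmf(7, 10, 0.7)   # ≈ 0.2668
L2 = binom.pmf(7, 10, 0.5)   # ≈ 0.1172
print(f"LR(0.7, 0.5) = {L1 / L2:.2f}")  # ≈ 2.28
```

An LR of about 2.3 is weak evidence: the data favor θ = 0.7 over a fair coin, but not decisively, which matches the scale below.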
Interpretation:
| Likelihood Ratio | Interpretation |
|---|---|
| LR ≈ 1 | Data equally compatible with both values |
| LR = 10 | Moderate evidence favoring θ₁ |
| LR = 100 | Strong evidence favoring θ₁ |
| LR = 1000+ | Very strong evidence favoring θ₁ |
These interpretations are heuristic but widely used, particularly in genetics (LOD scores = log₁₀ of likelihood ratio) and clinical diagnostics.
The Maximum Likelihood Estimator (MLE):
The MLE is the parameter value that maximizes the likelihood:
$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \ell(\theta)$$
The MLE is the value most compatible with observed data in the likelihood sense. It plays a central role in frequentist inference but is also important in Bayesian analysis: for diffuse priors or large samples, the posterior is approximately centered at the MLE.
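For the binomial coin example, this maximization can be carried out in closed form by differentiating the log-likelihood and setting the derivative to zero:

$$\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \quad\Longrightarrow\quad \hat{\theta}_{\text{MLE}} = \frac{k}{n} = \frac{7}{10} = 0.7$$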
The likelihood ratio forms the basis of likelihood ratio tests (LRT) for hypothesis testing. Under H₀, the test statistic -2 log(LR) is asymptotically χ² distributed. This connects likelihood to classical hypothesis testing and model comparison.
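A brief sketch of the idea, using the same coin data and a χ² approximation that is admittedly crude at n = 10, tests H₀: θ = 0.5 against the unrestricted alternative:

```python
import numpy as np
from scipy.stats import binom, chi2

# Likelihood ratio test of H0: θ = 0.5 (fair coin) for 7 heads in 10 flips
k, n = 7, 10
theta_0, theta_hat = 0.5, k / n

lr_stat = -2 * (binom.logpmf(k, n, theta_0) - binom.logpmf(k, n, theta_hat))
p_value = chi2.sf(lr_stat, df=1)  # asymptotic χ² approximation, 1 degree of freedom
print(f"-2 log LR = {lr_stat:.3f}, approximate p-value = {p_value:.3f}")  # ≈ 1.646, p ≈ 0.20
```

The resulting p-value of roughly 0.2 indicates that 7 heads in 10 flips is quite compatible with a fair coin.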
Profile Likelihood:
When models have multiple parameters θ = (ψ, λ) where ψ is of interest and λ are nuisance parameters, we can construct the profile likelihood for ψ:
$$\mathcal{L}_p(\psi) = \max_{\lambda} \mathcal{L}(\psi, \lambda)$$
This concentrates on the parameter of interest by maximizing over nuisance parameters at each value of ψ. Profile likelihood enables inference about individual parameters in complex models.
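As a minimal sketch, consider a Normal model in which μ is of interest and σ is the nuisance parameter; for each candidate μ the maximizing σ is available in closed form, so the profile can be computed on a grid (the data here are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.5, size=40)  # illustrative data

def profile_loglik(mu, x):
    sigma_hat = np.sqrt(np.mean((x - mu) ** 2))  # σ maximized out analytically for this μ
    return np.sum(stats.norm(mu, sigma_hat).logpdf(x))

mu_grid = np.linspace(x.mean() - 1.0, x.mean() + 1.0, 201)
profile = np.array([profile_loglik(mu, x) for mu in mu_grid])

print(f"Profile-likelihood maximizer: {mu_grid[np.argmax(profile)]:.3f}")
print(f"Sample mean:                  {x.mean():.3f}")  # agrees up to grid resolution
```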
Support and Likelihood Intervals:
Before Bayesian credible intervals became widespread, statisticians used likelihood intervals (also called support intervals):
A 1/k likelihood interval contains all θ where: $$\mathcal{L}(\theta) \geq \frac{1}{k} \cdot \mathcal{L}(\hat{\theta}_{MLE})$$
For example, a 1/8 likelihood interval contains all values with likelihood at least 1/8 of the maximum—values reasonably compatible with the data.
Under standard regularity conditions, a 1/6.83 likelihood interval for a single parameter approximates a 95% confidence interval, connecting likelihood-based and frequentist approaches.
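A likelihood interval is straightforward to compute on a grid; this sketch finds the 1/8 interval for the coin example (the same approach with a 1/6.83 cutoff gives the approximate 95% interval mentioned above):

```python
import numpy as np
from scipy.stats import binom

# 1/8 likelihood interval for 7 heads in 10 flips, via a dense grid of θ values
k, n = 7, 10
theta = np.linspace(0.001, 0.999, 9999)
L = binom.pmf(k, n, theta)

inside = theta[L >= L.max() / 8]  # all θ with likelihood at least 1/8 of the maximum
print(f"1/8 likelihood interval: ({inside.min():.3f}, {inside.max():.3f})")
```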
Now we connect the likelihood to the Bayesian framework, revealing its role as the vehicle through which data updates our beliefs.
Bayes' Theorem:
$$p(\theta | D) = \frac{p(D | \theta) \cdot p(\theta)}{p(D)} = \frac{\mathcal{L}(\theta) \cdot p(\theta)}{p(D)}$$
Or, more memorably:
$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$
The likelihood is the multiplicative update factor that transforms prior beliefs into posterior beliefs. Where the likelihood is high, the posterior is pulled upward relative to the prior; where the likelihood is low, the posterior is suppressed.
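This multiplicative update can be made concrete with a short grid approximation (a sketch using the coin data from this page and an illustrative Beta(2, 2) prior): multiply prior and likelihood pointwise, normalize, and compare with the exact conjugate answer.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)
prior = stats.beta(2, 2).pdf(theta)            # prior density on a grid
likelihood = theta**7 * (1 - theta)**3         # 7 heads, 3 tails (constant dropped)

unnormalized = prior * likelihood              # pointwise multiplicative update
dtheta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dtheta)  # normalize to integrate to 1

# The grid posterior matches the analytical conjugate result, Beta(9, 5)
print(np.allclose(posterior, stats.beta(9, 5).pdf(theta), atol=1e-3))  # True
```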
The Likelihood Principle:
A foundational principle in Bayesian (and some frequentist) inference:
All evidence from the data about θ is contained in the likelihood function.
This means that if two experiments (possibly with different designs or stopping rules) yield proportional likelihood functions for θ, they carry exactly the same evidence about θ; inference should not depend on data that could have been observed but were not.
The likelihood principle has profound implications for experimental design and sequential data collection—implications that differ substantially from frequentist approaches.
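The classic illustration is the stopping-rule example sketched below: 3 successes in 12 trials gives a binomial likelihood if the number of trials was fixed in advance, and a negative binomial likelihood if sampling stopped at the third success, yet the two likelihoods are proportional and so carry the same evidence about θ (frequentist p-values, by contrast, differ between the two designs).

```python
import numpy as np
from scipy.stats import binom, nbinom

# Design A: n = 12 trials fixed in advance          -> binomial sampling
# Design B: sample until the 3rd success (12 trials) -> negative binomial sampling
theta = np.linspace(0.01, 0.99, 99)

L_binomial = binom.pmf(3, 12, theta)   # C(12,3) θ^3 (1-θ)^9
L_negbinom = nbinom.pmf(9, 3, theta)   # 9 failures before the 3rd success: C(11,2) θ^3 (1-θ)^9

# The two likelihood functions are proportional: their ratio is constant in θ
ratio = L_binomial / L_negbinom
print(np.allclose(ratio, ratio[0]))  # True (the constant is C(12,3)/C(11,2) = 4)
```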
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Demonstrate how likelihood updates prior to posterior

# Scenario: Estimating coin bias
# Prior: Beta(2, 2) - weakly informative, slight preference for fairness
# Data: 7 heads in 10 flips
# Posterior: Beta(2+7, 2+3) = Beta(9, 5)

theta = np.linspace(0, 1, 500)

# Prior
prior_a, prior_b = 2, 2
prior = stats.beta(prior_a, prior_b).pdf(theta)

# Data
heads, tails = 7, 3

# Likelihood (normalized for visualization)
likelihood = theta**heads * (1-theta)**tails
likelihood /= np.max(likelihood)  # Scale for visibility

# Posterior (analytical due to conjugacy)
post_a, post_b = prior_a + heads, prior_b + tails
posterior = stats.beta(post_a, post_b).pdf(theta)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Prior only
ax = axes[0]
ax.fill_between(theta, prior, alpha=0.3, color='blue')
ax.plot(theta, prior, 'b-', linewidth=2, label=f'Prior: Beta({prior_a},{prior_b})')
ax.axvline(prior_a/(prior_a+prior_b), color='blue', linestyle='--',
           label=f'Prior mean = {prior_a/(prior_a+prior_b):.2f}')
ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title('Step 1: Prior Belief\n(Before seeing data)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 2: Prior + Likelihood
ax = axes[1]
ax.fill_between(theta, prior, alpha=0.3, color='blue', label='Prior')
ax.plot(theta, likelihood * np.max(prior) * 0.8, 'g-', linewidth=2,
        label='Likelihood (scaled)', alpha=0.8)
ax.axvline(heads/(heads+tails), color='green', linestyle='--',
           label=f'MLE = {heads/(heads+tails):.1f}')
ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title(f'Step 2: Prior × Likelihood\nData: {heads} heads in {heads+tails} flips')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 3: All three + Posterior
ax = axes[2]
ax.plot(theta, prior, 'b--', linewidth=1.5, label='Prior', alpha=0.6)
ax.plot(theta, likelihood * np.max(posterior) * 0.5, 'g--', linewidth=1.5,
        label='Likelihood (scaled)', alpha=0.6)
ax.fill_between(theta, posterior, alpha=0.4, color='red')
ax.plot(theta, posterior, 'r-', linewidth=2.5, label=f'Posterior: Beta({post_a},{post_b})')
ax.axvline(post_a/(post_a+post_b), color='red', linestyle='-',
           label=f'Post. mean = {post_a/(post_a+post_b):.3f}')
ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title('Step 3: Posterior Distribution\nPrior × Likelihood ∝ Posterior')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

plt.tight_layout()

# Print summary
print("=" * 60)
print("BAYESIAN UPDATE SUMMARY")
print("=" * 60)
print(f"Prior: Beta({prior_a}, {prior_b}) → Mean = {prior_a/(prior_a+prior_b):.3f}")
print(f"Data: {heads} heads, {tails} tails")
print("Likelihood: Binomial(θ | n=10, k=7)")
print(f"Posterior: Beta({post_a}, {post_b}) → Mean = {post_a/(post_a+post_b):.3f}")
print()
print("Observation: Posterior mean lies between prior mean (0.5)")
print("             and MLE (0.7), pulled toward the data.")
```

All statistical models are approximations. What happens when the likelihood model doesn't match the true data-generating process?
Types of Misspecification:
- Wrong distributional family (e.g., a Normal likelihood for heavy-tailed data)
- Wrong dependence structure (e.g., assuming independence when observations are correlated)
- Wrong functional form (e.g., a linear mean structure where the true relationship is nonlinear)
Consequences:
- Parameter estimates may be biased or lose their intended interpretation
- Posterior uncertainty tends to be miscalibrated, often overly confident
- Predictions can be systematically off, especially outside the range of the observed data
Diagnosis:
- Posterior predictive checks: simulate data from the fitted model and compare with the observed data
- Residual analysis and comparison against more flexible alternative models
Bayesian theory traditionally assumes the true model is in the model class (M-closed). In reality, all models are wrong (M-open). Modern Bayesian practice increasingly adopts predictive-focused model evaluation rather than treating posterior distributions as representing 'truth'. The posterior is understood as a useful approximation, not a description of reality.
Robust Likelihoods:
When concerned about distributional misspecification, consider robust alternatives:
- Student-t likelihoods in place of Normal for heavy-tailed continuous data
- Negative binomial in place of Poisson for overdispersed counts
- Mixture likelihoods for heterogeneous populations
These alternatives build in flexibility that makes inference more robust to model misspecification, at the cost of additional parameters or computational complexity.
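As one concrete sketch (with simulated data and illustrative parameter values), compare the maximized Poisson and negative binomial log-likelihoods on overdispersed counts; the extra dispersion parameter pays for itself:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(1)
counts = rng.negative_binomial(n=2, p=0.25, size=200)  # variance far exceeds the mean

# Poisson: the MLE is the sample mean
lam_hat = counts.mean()
ll_poisson = stats.poisson(lam_hat).logpmf(counts).sum()

# Negative binomial: maximize the log-likelihood numerically over (r, p)
def neg_ll_nb(params):
    r, p = params
    if r <= 0 or not (0 < p < 1):
        return np.inf
    return -stats.nbinom(r, p).logpmf(counts).sum()

res = minimize(neg_ll_nb, x0=[1.0, 0.5], method='Nelder-Mead')
ll_nbinom = -res.fun

print(f"Sample mean = {counts.mean():.2f}, sample variance = {counts.var():.2f}")
print(f"Max log-likelihood, Poisson:           {ll_poisson:.1f}")
print(f"Max log-likelihood, negative binomial: {ll_nbinom:.1f}")
```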
The likelihood function is the mathematical heart of statistical inference, the translator between parameters and observations. Let's consolidate the key concepts:
- The likelihood L(θ) = p(D | θ) treats the data as fixed and the parameter as varying; it is not a probability distribution over θ.
- Under the i.i.d. assumption the likelihood factorizes into per-observation terms, and the log-likelihood becomes a sum.
- Working on the log scale avoids numerical underflow and leaves the maximizer unchanged.
- Likelihood ratios, the MLE, profile likelihoods, and likelihood intervals let us compare and summarize parameter values using the likelihood alone.
- In Bayes' theorem, the likelihood is the multiplicative factor that updates the prior into the posterior.
- All of this rests on the model being adequate; misspecified likelihoods call for robust alternatives and predictive checks.
What's Next:
With prior beliefs and likelihood in hand, we're ready to combine them into the posterior distribution—the complete Bayesian solution that updates our uncertainty based on observed evidence. The posterior integrates everything we know: prior knowledge, data, and model structure.
You now understand the likelihood function—what it is, what it isn't, how to compute it, and how it connects prior to posterior in Bayesian inference. Next, we'll explore the posterior distribution: the culmination of Bayesian reasoning.