The true power of Bayesian inference emerges when we view it not as a static calculation, but as a dynamic process of learning. Each new piece of evidence refines our beliefs, the posterior from yesterday becomes the prior for today, and knowledge accumulates coherently over time.
This perspective—Bayesian updating—transforms statistical inference from a one-shot procedure into a framework for continuous learning. It explains how rational agents should update beliefs, how machine learning systems can incorporate streaming data, and how scientific knowledge progresses as experiments accumulate.
In this page, we explore the mechanics and philosophy of Bayesian updating, from single sequential observations to batch updates, from the mathematics of belief change to practical algorithms for online learning.
By the end of this page, you will understand the recursive nature of Bayesian updating, implement sequential learning with conjugate priors, appreciate the order-independence of Bayesian updates, recognize connections to online learning algorithms, and understand how Bayesian updating embodies rational learning.
At its core, Bayesian updating is elegantly simple: today's posterior is tomorrow's prior.
The Sequential Update:
Suppose we observe data in a sequence: D₁, then D₂, then D₃, and so on. The updating process is:

1. Start with a prior p(θ).
2. Observe D₁ and update: p(θ|D₁) ∝ p(D₁|θ) p(θ).
3. Treat p(θ|D₁) as the new prior; observe D₂ and update: p(θ|D₁, D₂) ∝ p(D₂|θ) p(θ|D₁).
4. Repeat for each new observation.
Mathematically:
$$p(\theta | D_{1:n}) \propto p(D_n | \theta) \cdot p(\theta | D_{1:n-1})$$
The Key Insight:
Each update has the same form: likelihood times prior. The only difference is that the 'prior' at step n is the posterior from step n−1. This recursive structure is powerful because:

- Memory efficiency: we never need to revisit old data; the current posterior summarizes everything seen so far.
- Streaming compatibility: observations can be processed as they arrive.
- Coherence: the state of belief at any point is a proper posterior, ready for inference or decision-making.
Bayesian posteriors are martingales with respect to the prior: E[p(θ|D₁,...,Dₙ)] = p(θ), where expectation is over the joint distribution of future data. In words: before seeing data, we expect our future beliefs to equal our current beliefs. Learning moves beliefs around but doesn't systematically bias them in any direction.
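To see the martingale property in action, here is a minimal simulation (a sketch, assuming the Beta-Bernoulli coin model used throughout this page): draw θ from the prior, generate flips, and average the resulting posterior means across many simulated worlds.

```python
import numpy as np

rng = np.random.default_rng(0)
prior_a, prior_b = 3, 2            # prior mean = 3/5 = 0.6
n_sims, n_flips = 100_000, 20

# For each simulated world: draw θ from the prior, flip the coin n_flips times,
# then compute the posterior mean E[θ|data] = (a + heads) / (a + b + n_flips)
thetas = rng.beta(prior_a, prior_b, n_sims)
heads = rng.binomial(n_flips, thetas)
posterior_means = (prior_a + heads) / (prior_a + prior_b + n_flips)

print(f"Prior mean:             {prior_a / (prior_a + prior_b):.4f}")
print(f"Average posterior mean: {posterior_means.mean():.4f}")
# The two agree: before seeing data, expected future belief = current belief.
```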
Equivalence of Batch and Sequential Updates:
A fundamental property of Bayesian updating is order independence (for i.i.d. data): the same posterior is obtained whether we:

- process all the data at once in a single batch update,
- process observations one at a time, in any order, or
- process the data in mini-batches of any size.
Proof sketch:
For i.i.d. data: $$p(\theta | D_1, D_2) = \frac{p(D_1, D_2 | \theta) p(\theta)}{p(D_1, D_2)} = \frac{p(D_1|\theta) p(D_2|\theta) p(\theta)}{p(D_1, D_2)}$$
Sequentially: $$p(\theta | D_1, D_2) \propto p(D_2|\theta) p(\theta|D_1) \propto p(D_2|\theta) p(D_1|\theta) p(\theta)$$
Same result up to normalization. This equivalence means we can choose whichever formulation is most convenient.
Let's implement sequential Bayesian updating for several common scenarios to build intuition.
Example 1: Learning a Coin's Bias
Suppose you're unsure about a coin's fairness and flip it repeatedly, updating your beliefs after each flip.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def sequential_coin_learning(true_theta=0.7, n_flips=100, prior_a=1, prior_b=1):
    """
    Demonstrate sequential Bayesian learning of a coin's bias.

    Each flip updates our belief about θ (probability of heads).
    Prior: Beta(prior_a, prior_b)
    Likelihood: Bernoulli(θ)
    """
    np.random.seed(42)

    # Generate flip sequence
    flips = np.random.binomial(1, true_theta, n_flips)

    # Track posterior evolution
    theta_grid = np.linspace(0, 1, 500)

    # Storage for animation frames
    history = []

    # Initial state
    current_a, current_b = prior_a, prior_b
    history.append({
        'n': 0, 'heads': 0, 'tails': 0,
        'a': current_a, 'b': current_b,
        'mean': current_a / (current_a + current_b),
        'std': np.sqrt(stats.beta(current_a, current_b).var())
    })

    # Sequential updates
    cumulative_heads = 0
    cumulative_tails = 0

    for i, flip in enumerate(flips):
        # Update sufficient statistics
        if flip == 1:
            cumulative_heads += 1
            current_a += 1
        else:
            cumulative_tails += 1
            current_b += 1

        # Record state
        posterior_mean = current_a / (current_a + current_b)
        posterior_std = np.sqrt(stats.beta(current_a, current_b).var())
        history.append({
            'n': i + 1,
            'heads': cumulative_heads,
            'tails': cumulative_tails,
            'a': current_a,
            'b': current_b,
            'mean': posterior_mean,
            'std': posterior_std
        })

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Posterior at key points
    ax = axes[0, 0]
    milestones = [0, 5, 20, 50, 100]
    colors = plt.cm.viridis(np.linspace(0, 1, len(milestones)))
    for milestone, color in zip(milestones, colors):
        if milestone <= len(history) - 1:
            h = history[milestone]
            pdf = stats.beta(h['a'], h['b']).pdf(theta_grid)
            ax.plot(theta_grid, pdf, color=color, linewidth=2,
                    label=f"n={h['n']}: Beta({h['a']}, {h['b']})")
    ax.axvline(true_theta, color='red', linestyle='--', label=f'True θ={true_theta}')
    ax.set_xlabel('θ')
    ax.set_ylabel('p(θ|data)')
    ax.set_title('Posterior Evolution')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1)

    # Plot 2: Posterior mean over time
    ax = axes[0, 1]
    ns = [h['n'] for h in history]
    means = [h['mean'] for h in history]
    stds = [h['std'] for h in history]
    ax.fill_between(ns, [m - 2*s for m, s in zip(means, stds)],
                    [m + 2*s for m, s in zip(means, stds)],
                    alpha=0.3, color='blue', label='±2 std')
    ax.plot(ns, means, 'b-', linewidth=2, label='Posterior mean')
    ax.axhline(true_theta, color='red', linestyle='--', label=f'True θ={true_theta}')
    ax.set_xlabel('Number of flips')
    ax.set_ylabel('E[θ|data]')
    ax.set_title('Posterior Mean Convergence')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, n_flips)
    ax.set_ylim(0, 1)

    # Plot 3: Posterior standard deviation
    ax = axes[1, 0]
    ax.plot(ns, stds, 'purple', linewidth=2)
    ax.set_xlabel('Number of flips')
    ax.set_ylabel('Std[θ|data]')
    ax.set_title('Uncertainty Reduction Over Time')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, n_flips)

    # Annotate rate of decay
    theoretical_std = lambda n: 0.5 / np.sqrt(n + 4)  # Approximate for Beta
    ns_theory = np.linspace(1, n_flips, 100)
    ax.plot(ns_theory, [theoretical_std(n) for n in ns_theory],
            'r--', alpha=0.5, label='~1/√n decay')
    ax.legend()

    # Plot 4: 95% credible interval width
    ax = axes[1, 1]
    ci_widths = []
    for h in history:
        dist = stats.beta(h['a'], h['b'])
        width = dist.ppf(0.975) - dist.ppf(0.025)
        ci_widths.append(width)
    ax.plot(ns, ci_widths, 'green', linewidth=2)
    ax.set_xlabel('Number of flips')
    ax.set_ylabel('95% CI Width')
    ax.set_title('Credible Interval Shrinkage')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, n_flips)

    plt.tight_layout()
    return history

history = sequential_coin_learning()

print("\n" + "="*60)
print("SEQUENTIAL LEARNING SUMMARY")
print("="*60)
print(f"{'n':<6} {'Heads':<8} {'Tails':<8} {'Post Mean':<12} {'Post Std':<12}")
print("-"*60)
for h in history[::20]:  # Every 20 observations
    print(f"{h['n']:<6} {h['heads']:<8} {h['tails']:<8} {h['mean']:<12.4f} {h['std']:<12.4f}")
```

Key Observations:

- The posterior concentrates around the true θ = 0.7 as flips accumulate.
- Uncertainty (posterior standard deviation and credible-interval width) shrinks at roughly a 1/√n rate.
- Early flips move the posterior the most; once beliefs are sharp, each additional observation changes little.
Not all observations are equally informative. Bayesian updating naturally weighs evidence by its information content—surprising observations update beliefs more than expected ones.
Measuring Belief Change:
Several measures quantify how much an observation changes our beliefs:
1. KL Divergence (Relative Entropy): $$D_{KL}(p_{\text{post}} || p_{\text{prior}}) = \int p(\theta|D) \log \frac{p(\theta|D)}{p(\theta)} d\theta$$
Measures the 'information gained' from the prior to the posterior. Always non-negative; zero only if no update occurred.
2. Bayes Factor: $$BF = \frac{p(D|H_1)}{p(D|H_2)}$$
For comparing hypotheses, the Bayes factor measures relative evidence. A BF of 10 means the data are 10× more probable under H₁ than under H₂ (a worked example follows this list).
3. Log Posterior Ratio: $$\log \frac{p(\theta_1|D)}{p(\theta_2|D)} - \log \frac{p(\theta_1)}{p(\theta_2)}$$
How much the observation changed the odds between two parameter values.
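To make the Bayes factor concrete, here is a small worked example (the counts and hypotheses are illustrative, not taken from this page's experiments): comparing H₁: θ = 0.5 exactly against H₂: θ ~ Beta(1, 1), where the marginal likelihood under H₂ has the Beta-Binomial form.

```python
import numpy as np
from scipy.special import betaln, comb

# Illustrative data: 8 heads in 10 flips
n, k = 10, 8

# H1: fair coin, θ = 0.5 exactly
p_data_h1 = comb(n, k) * 0.5**n

# H2: unknown bias with θ ~ Beta(a, b); marginal likelihood is Beta-Binomial:
# p(D|H2) = C(n,k) · B(a+k, b+n-k) / B(a,b)
a, b = 1, 1
p_data_h2 = comb(n, k) * np.exp(betaln(a + k, b + n - k) - betaln(a, b))

bf_21 = p_data_h2 / p_data_h1
print(f"p(D|H1) = {p_data_h1:.4f},  p(D|H2) = {p_data_h2:.4f}")
print(f"Bayes factor BF_21 = {bf_21:.2f}")  # > 1: data favor the unknown-bias model
```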
In information theory, the 'surprisal' of an event with probability p is -log(p). Rare events carry more information. Bayesian updating incorporates this: observations that are improbable under your current beliefs (surprising) cause larger updates than those you expected.
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
import matplotlib.pyplot as plt

def kl_divergence_beta(post_a, post_b, prior_a, prior_b):
    """
    KL divergence from prior to posterior for Beta distributions.
    D_KL(posterior || prior)
    """
    from scipy.special import betaln, digamma

    # Analytical formula for KL divergence between Beta distributions
    kl = (betaln(prior_a, prior_b) - betaln(post_a, post_b)
          + (post_a - prior_a) * digamma(post_a)
          + (post_b - prior_b) * digamma(post_b)
          + (prior_a - post_a + prior_b - post_b) * digamma(post_a + post_b))
    return kl

def analyze_update_information(prior_a, prior_b, observation):
    """
    Analyze the information content of a single observation.

    observation: 1 for heads, 0 for tails
    """
    # Posterior parameters
    post_a = prior_a + observation
    post_b = prior_b + (1 - observation)

    # Prior and posterior means
    prior_mean = prior_a / (prior_a + prior_b)
    post_mean = post_a / (post_a + post_b)

    # KL divergence
    kl = kl_divergence_beta(post_a, post_b, prior_a, prior_b)

    # Probability of this observation under current belief
    prob_obs = prior_mean if observation == 1 else (1 - prior_mean)

    # Surprisal
    surprisal = -np.log2(prob_obs)

    return {
        'post_a': post_a, 'post_b': post_b,
        'prior_mean': prior_mean, 'post_mean': post_mean,
        'kl_divergence': kl,
        'prob_obs': prob_obs,
        'surprisal': surprisal,
        'mean_shift': abs(post_mean - prior_mean)
    }

# Example: Different priors, same observation
print("=" * 70)
print("INFORMATION CONTENT OF OBSERVATIONS")
print("=" * 70)

scenarios = [
    ("Uniform prior (uncertain)", 1, 1, 1),            # Heads with no prior knowledge
    ("Uniform prior (uncertain)", 1, 1, 0),            # Tails with no prior knowledge
    ("Prior believes fair", 10, 10, 1),                # Heads when expecting 50-50
    ("Prior believes fair", 10, 10, 0),                # Tails when expecting 50-50
    ("Prior believes biased toward heads", 20, 5, 1),  # Expected heads
    ("Prior believes biased toward heads", 20, 5, 0),  # Surprising tails
    ("Strong prior, θ≈0.9", 90, 10, 1),                # Expected heads
    ("Strong prior, θ≈0.9", 90, 10, 0),                # Very surprising tails
]

print(f"{'Scenario':<35} {'Obs':<5} {'P(obs)':<8} {'Surpr.':<8} {'KL Div':<10} {'|Δmean|':<10}")
print("-" * 70)

for name, a, b, obs in scenarios:
    result = analyze_update_information(a, b, obs)
    obs_name = "H" if obs == 1 else "T"
    print(f"{name:<35} {obs_name:<5} {result['prob_obs']:<8.3f} "
          f"{result['surprisal']:<8.3f} {result['kl_divergence']:<10.5f} "
          f"{result['mean_shift']:<10.5f}")

print()
print("Key insight: Surprising observations (low P(obs)) cause larger updates")
print("             (higher KL divergence and mean shift)")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Case study: updating from different priors with the same observation (tails)
priors = [(1, 1), (5, 5), (10, 10), (20, 5), (50, 50)]
obs = 0  # Tails

theta = np.linspace(0, 1, 500)

# Plot 1: Prior → Posterior for different starting points
ax = axes[0]
for a, b in priors:
    prior = stats.beta(a, b).pdf(theta)
    post_a = a + obs
    post_b = b + (1 - obs)
    posterior = stats.beta(post_a, post_b).pdf(theta)
    ax.plot(theta, prior, '--', alpha=0.5, linewidth=1)
    ax.plot(theta, posterior, '-', linewidth=2,
            label=f'Beta({a},{b}) → Beta({post_a},{post_b})')

ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title('Same Observation (Tails), Different Priors')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 2: KL divergence vs prior strength
ax = axes[1]
prior_strengths = np.arange(2, 102, 2)  # α + β from 2 to 100
kl_heads = []
kl_tails = []

for strength in prior_strengths:
    # Symmetric prior
    a = strength / 2
    b = strength / 2
    # KL for heads vs tails
    result_h = analyze_update_information(a, b, 1)
    result_t = analyze_update_information(a, b, 0)
    kl_heads.append(result_h['kl_divergence'])
    kl_tails.append(result_t['kl_divergence'])

ax.plot(prior_strengths, kl_heads, 'b-', linewidth=2, label='Heads')
ax.plot(prior_strengths, kl_tails, 'r-', linewidth=2, label='Tails')
ax.set_xlabel('Prior Strength (α + β)')
ax.set_ylabel('KL Divergence (nats)')
ax.set_title('Information Gain vs Prior Concentration\n(Symmetric priors, 50-50 expected)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 100)

plt.tight_layout()
```

In practice, we often have choices about how to structure updates. Two common modes are:
Batch Updating: collect all available data, then compute the posterior in a single pass from the original prior.

Online Updating: process observations one at a time (or in small mini-batches), using the current posterior as the prior for the next update.
Mathematical Equivalence:
For conjugate models with i.i.d. data, batch and online updating give identical results:
$$p(\theta | D_1, D_2, ..., D_n) = p(\theta | S(D_1, ..., D_n))$$
where S is the sufficient statistic. The updating path doesn't matter—only the final data summary.
```python
import numpy as np
from scipy import stats

def demonstrate_equivalence():
    """
    Show that batch and online updates give identical final posteriors.
    """
    np.random.seed(42)

    # Data: 100 coin flips
    true_theta = 0.65
    data = np.random.binomial(1, true_theta, 100)
    total_heads = data.sum()
    total_tails = len(data) - total_heads

    # Initial prior
    prior_a, prior_b = 2, 2

    # Method 1: Batch update (all at once)
    batch_post_a = prior_a + total_heads
    batch_post_b = prior_b + total_tails

    # Method 2: Online update (one at a time)
    online_a, online_b = prior_a, prior_b
    for flip in data:
        online_a += flip
        online_b += (1 - flip)
    online_post_a, online_post_b = online_a, online_b

    # Method 3: Mini-batch updates (groups of 10)
    minibatch_a, minibatch_b = prior_a, prior_b
    for i in range(0, 100, 10):
        batch = data[i:i+10]
        minibatch_a += batch.sum()
        minibatch_b += len(batch) - batch.sum()
    minibatch_post_a, minibatch_post_b = minibatch_a, minibatch_b

    # Method 4: Random order (shuffled data)
    shuffled_data = np.random.permutation(data)
    shuffled_a, shuffled_b = prior_a, prior_b
    for flip in shuffled_data:
        shuffled_a += flip
        shuffled_b += (1 - flip)

    print("=" * 60)
    print("BATCH vs ONLINE UPDATE EQUIVALENCE")
    print("=" * 60)
    print(f"Data: {total_heads} heads, {total_tails} tails (n=100)")
    print(f"Prior: Beta({prior_a}, {prior_b})")
    print()
    print(f"{'Method':<25} {'Posterior α':<15} {'Posterior β':<15}")
    print("-" * 60)
    print(f"{'Batch (all at once)':<25} {batch_post_a:<15} {batch_post_b:<15}")
    print(f"{'Online (one by one)':<25} {online_post_a:<15} {online_post_b:<15}")
    print(f"{'Mini-batch (10s)':<25} {minibatch_post_a:<15} {minibatch_post_b:<15}")
    print(f"{'Random order':<25} {shuffled_a:<15} {shuffled_b:<15}")
    print()

    # Verify all are identical
    all_match = (batch_post_a == online_post_a == minibatch_post_a == shuffled_a
                 and batch_post_b == online_post_b == minibatch_post_b == shuffled_b)
    print(f"All methods give identical results: {all_match}")
    print()
    print("This demonstrates ORDER INDEPENDENCE: the final posterior")
    print("depends only on the sufficient statistic (total heads/tails),")
    print("not on how we organized the updates.")

demonstrate_equivalence()
```

When conjugacy doesn't hold, sequential updating becomes more challenging. The posterior after each observation doesn't have a simple closed form, and approximations are necessary.
The Challenge:
$$p(\theta | D_{1:n}) = \frac{p(D_n | \theta) \cdot p(\theta | D_{1:n-1})}{p(D_n | D_{1:n-1})}$$
Even if we could represent p(θ | D₁:n-1) exactly, multiplying by the likelihood generally produces a distribution we can't represent in closed form.
Approximation Strategies:
1. Particle Filtering (Sequential Monte Carlo): represent the posterior by a cloud of weighted samples; reweight the samples by the likelihood of each new observation and resample when the weights degenerate.

2. Assumed Density Filtering: after each update, project the intractable posterior back onto a tractable family (e.g., a Gaussian) by moment matching.

3. Online Variational Inference: maintain a parametric variational approximation and update its parameters as each observation (or mini-batch) arrives.

4. Periodic Batch Resampling: run cheap approximate updates online, but occasionally rerun full batch inference on stored data to correct accumulated error.
When using approximations for sequential updates, errors compound over time. A small error at step 1 alters the 'prior' for step 2, which alters the posterior at step 2, which becomes an inexact prior for step 3, etc. This error accumulation is the main challenge of online Bayesian inference.
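Before the particle-filter example below, here is a minimal sketch of strategy 2, assumed density filtering, for the same non-conjugate setup used in that example (the scale σ of a Cauchy with a log-normal prior): after each observation, the posterior over log σ is projected back onto a Gaussian by grid-based moment matching. The grid size and projection scheme are illustrative choices, not a library routine.

```python
import numpy as np
from scipy import stats

def adf_update(mu, tau, x, grid_size=2001):
    """One assumed-density-filtering step for φ = log(σ), x ~ Cauchy(0, exp(φ)).

    The running belief is constrained to Normal(mu, tau²): multiply by the
    new likelihood on a grid, then moment-match back onto a Gaussian.
    """
    phi = np.linspace(mu - 6*tau, mu + 6*tau, grid_size)
    post = stats.norm(mu, tau).pdf(phi) * stats.cauchy(0, np.exp(phi)).pdf(x)
    post /= np.trapz(post, phi)                                 # normalize on the grid
    new_mu = np.trapz(phi * post, phi)                          # match the mean
    new_tau = np.sqrt(np.trapz((phi - new_mu)**2 * post, phi))  # match the std
    return new_mu, new_tau

np.random.seed(42)
true_sigma = 2.0
data = stats.cauchy(0, true_sigma).rvs(50)

mu, tau = 0.0, 1.0  # prior: log(σ) ~ Normal(0, 1)
for x in data:
    mu, tau = adf_update(mu, tau, x)

print(f"ADF estimate of σ: {np.exp(mu):.2f} (true σ = {true_sigma})")
```

Each step forces the posterior back into the Gaussian family, so the projection error at one step becomes part of the prior at the next: exactly the compounding effect described above.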
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def particle_filter_update(particles, weights, observation, likelihood_func):
    """
    Sequential Monte Carlo (particle filter) update.

    particles: Current particle positions (samples from posterior)
    weights: Current particle weights
    observation: New data point
    likelihood_func: Function computing p(observation | particle)

    Returns updated particles and weights.
    """
    n_particles = len(particles)

    # Update weights by likelihood
    log_likelihoods = np.array([np.log(likelihood_func(p, observation) + 1e-300)
                                for p in particles])
    log_weights = np.log(weights + 1e-300) + log_likelihoods
    log_weights -= np.max(log_weights)  # Numerical stability
    weights = np.exp(log_weights)
    weights /= weights.sum()  # Normalize

    # Effective sample size
    ess = 1 / np.sum(weights**2)

    # Resample if ESS too low
    if ess < n_particles / 2:
        indices = np.random.choice(n_particles, size=n_particles,
                                   replace=True, p=weights)
        particles = particles[indices]
        weights = np.ones(n_particles) / n_particles

    return particles, weights, ess

# Example: Non-conjugate problem - learning the scale of a Cauchy distribution
# Prior: log(σ) ~ Normal(0, 1) → σ ~ Log-Normal(0, 1)
# Likelihood: x ~ Cauchy(0, σ)

def cauchy_likelihood(sigma, x):
    """p(x | σ) for Cauchy(0, σ)"""
    return 1 / (np.pi * sigma * (1 + (x/sigma)**2))

# True parameter
true_sigma = 2.0

# Generate data
np.random.seed(42)
n_obs = 50
data = stats.cauchy(0, true_sigma).rvs(n_obs)

# Initialize particle filter
n_particles = 5000
# Sample from prior: σ ~ LogNormal(0, 1)
particles = np.random.lognormal(0, 1, n_particles)
weights = np.ones(n_particles) / n_particles

# Track posterior evolution
history = [{
    'n': 0,
    'mean': np.average(particles, weights=weights),
    'std': np.sqrt(np.average((particles - np.average(particles, weights=weights))**2,
                              weights=weights)),
    'ess': n_particles
}]

# Sequential updates
for i, x in enumerate(data):
    particles, weights, ess = particle_filter_update(
        particles, weights, x, cauchy_likelihood)

    posterior_mean = np.average(particles, weights=weights)
    posterior_std = np.sqrt(np.average((particles - posterior_mean)**2,
                                       weights=weights))

    history.append({
        'n': i + 1,
        'mean': posterior_mean,
        'std': posterior_std,
        'ess': ess
    })

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Final posterior approximation
ax = axes[0, 0]
ax.hist(particles, bins=50, density=True, alpha=0.7, weights=weights,
        label='Particle posterior')
ax.axvline(true_sigma, color='red', linestyle='--', linewidth=2,
           label=f'True σ = {true_sigma}')
ax.axvline(history[-1]['mean'], color='green', linestyle='-', linewidth=2,
           label=f'Post. mean = {history[-1]["mean"]:.2f}')
ax.set_xlabel('σ')
ax.set_ylabel('Density')
ax.set_title(f'Final Posterior (Particle Filter, n={n_obs})')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 8)

# Plot 2: Posterior mean evolution
ax = axes[0, 1]
ns = [h['n'] for h in history]
means = [h['mean'] for h in history]
stds = [h['std'] for h in history]
ax.fill_between(ns, [m-s for m, s in zip(means, stds)],
                [m+s for m, s in zip(means, stds)], alpha=0.3)
ax.plot(ns, means, 'b-', linewidth=2)
ax.axhline(true_sigma, color='red', linestyle='--', label=f'True σ = {true_sigma}')
ax.set_xlabel('Number of observations')
ax.set_ylabel('E[σ|data]')
ax.set_title('Posterior Mean Convergence')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 3: Effective sample size
ax = axes[1, 0]
ess_values = [h['ess'] for h in history]
ax.plot(ns, ess_values, 'purple', linewidth=2)
ax.axhline(n_particles/2, color='red', linestyle='--', label='Resample threshold')
ax.set_xlabel('Number of observations')
ax.set_ylabel('Effective Sample Size')
ax.set_title('Particle Filter Efficiency')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, n_particles)

# Plot 4: Data histogram (for context)
ax = axes[1, 1]
ax.hist(data, bins=30, density=True, alpha=0.7, label='Observed data')
x_plot = np.linspace(-15, 15, 200)
ax.plot(x_plot, stats.cauchy(0, true_sigma).pdf(x_plot), 'r-', linewidth=2,
        label=f'True Cauchy(0, {true_sigma})')
ax.plot(x_plot, stats.cauchy(0, history[-1]['mean']).pdf(x_plot), 'g--', linewidth=2,
        label=f'Estimated Cauchy(0, {history[-1]["mean"]:.2f})')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_title('Data vs Fitted Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(-15, 15)

plt.tight_layout()

print("\nParticle filter successfully approximated the non-conjugate posterior!")
print(f"True σ = {true_sigma:.2f}")
print(f"Estimated σ = {history[-1]['mean']:.2f} ± {history[-1]['std']:.2f}")
```

Bayesian updating isn't just about beliefs—it's about making decisions under uncertainty. The posterior distribution enables optimal decision-making through expected utility maximization.
The Decision-Theoretic Framework:

Given a utility function U(a, θ) scoring action a when the parameter is θ, the Bayes-optimal action maximizes posterior expected utility:

$$a^* = \arg\max_a \int U(a, \theta) \, p(\theta | D) \, d\theta$$
The posterior uncertainty is automatically incorporated: actions robust to parameter uncertainty are favored over those that are optimal only for specific parameter values.
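As a small illustration of this rule (the action names and utility numbers are hypothetical), the sketch below scores three possible actions against a Beta posterior over a conversion rate via Monte Carlo expected utility:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Posterior over a conversion rate θ after some data (illustrative): Beta(12, 28)
posterior = stats.beta(12, 28)
theta_draws = posterior.rvs(100_000, random_state=rng)

# Hypothetical utilities for three actions, each a function of the unknown θ
actions = {
    'launch':        lambda th: 100 * th - 25,      # big upside, fixed cost
    'small_rollout': lambda th: 40 * th - 5,        # modest upside, less risk
    'do_nothing':    lambda th: np.zeros_like(th),  # safe baseline
}

# Bayes-optimal action: maximize posterior expected utility
expected_utility = {name: u(theta_draws).mean() for name, u in actions.items()}
best = max(expected_utility, key=expected_utility.get)

for name, eu in expected_utility.items():
    print(f"E[U({name})] = {eu:+.2f}")
print(f"Optimal action: {best}")
```

Because the utility is averaged over the whole posterior, an action that does well across plausible values of θ can beat one that is optimal only at the posterior mode.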
Example: When to Stop Experimenting
Consider A/B testing: you're comparing two website designs. At each point, you can:

- stop the experiment and deploy the design that currently looks better, or
- pay the cost of collecting more data to reduce uncertainty before committing.
Bayesian updating tells you your current beliefs; expected utility tells you when continued learning outweighs its cost.
The 'Value of Information' (VOI) quantifies how much reducing uncertainty is worth for decision-making. VOI = Expected utility with more info − Expected utility now. When VOI < cost of information, stop collecting data.
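Here is a minimal sketch of the idea, computing the expected value of perfect information (EVPI), an upper bound on VOI, for a two-design A/B test with Beta posteriors; the posterior parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Current posteriors over the two designs' conversion rates (illustrative)
post_a = stats.beta(40, 60)  # design A: mean 0.40
post_b = stats.beta(42, 58)  # design B: mean 0.42, still uncertain

draws_a = post_a.rvs(200_000, random_state=rng)
draws_b = post_b.rvs(200_000, random_state=rng)

# Utility of deploying a design = its conversion rate (per-visitor scale).
# Acting now: commit to whichever design has the higher posterior mean.
value_now = max(draws_a.mean(), draws_b.mean())

# Acting with perfect information: always pick the truly better design.
value_perfect = np.maximum(draws_a, draws_b).mean()

evpi = value_perfect - value_now
print(f"Value acting now:          {value_now:.4f}")
print(f"Value with perfect info:   {value_perfect:.4f}")
print(f"EVPI (upper bound on VOI): {evpi:.4f}")
# If the cost of more data exceeds this bound, stop experimenting.
```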
Sequential Decision Problems:
Many real problems involve sequences of decisions, each informed by the posterior at that moment:

- Clinical trials that adapt treatment allocation as evidence accumulates
- Recommender systems choosing what to show next based on observed responses
- Multi-armed bandit problems balancing exploration and exploitation
In all these cases, Bayesian updating provides the engine for learning, while decision theory provides the framework for acting optimally.
Thompson Sampling:
A beautiful algorithm that combines Bayesian updating with exploration:
For each decision point:
1. Sample θ̃ from current posterior p(θ|data_so_far)
2. Take action that maximizes utility assuming θ̃ is true
3. Observe outcome
4. Update posterior
This naturally balances exploration (trying uncertain options) with exploitation (choosing best known options). Theoretical guarantees show it achieves near-optimal regret in many settings.
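A minimal sketch of Thompson sampling for a two-armed Bernoulli bandit with Beta posteriors (the true success rates are illustrative and hidden from the algorithm):

```python
import numpy as np

rng = np.random.default_rng(7)

true_rates = [0.45, 0.55]  # unknown to the algorithm (illustrative)
a = np.ones(2)             # Beta posterior parameters per arm,
b = np.ones(2)             # starting from uniform Beta(1, 1) priors
pulls = np.zeros(2, dtype=int)

for t in range(5000):
    # 1. Sample θ̃ for each arm from its current posterior
    theta_tilde = rng.beta(a, b)
    # 2. Take the action that is best if the sampled θ̃ were true
    arm = int(np.argmax(theta_tilde))
    # 3. Observe the outcome
    reward = rng.random() < true_rates[arm]
    # 4. Update that arm's posterior
    a[arm] += reward
    b[arm] += 1 - reward
    pulls[arm] += 1

print(f"Pulls per arm:   {pulls}")        # the better arm dominates over time
print(f"Posterior means: {a / (a + b)}")
```

Notice how the four steps above map directly onto the loop body: sample, act, observe, update.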
Stepping back from the mathematics, Bayesian updating embodies a compelling theory of rational learning with deep philosophical implications.
Dutch Book Coherence:
As mentioned earlier, beliefs that don't update according to Bayes' theorem are 'incoherent'—they make the believer vulnerable to sure-loss betting sequences (Dutch books). This isn't just an abstract concern; it implies that any consistent, non-exploitable learning process must be Bayesian.
Convergence to Truth:
Under mild conditions (correctly specified model, prior with broad support), Bayesian posteriors concentrate around the true parameter as data accumulates. This posterior consistency means Bayesian learning is self-correcting:

- A poorly chosen prior is eventually overwhelmed by the data.
- Analysts who start from different priors converge toward essentially the same posterior.
This provides a resolution to the subjectivity objection: while priors may differ, learning pathways converge.
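A quick simulation makes this concrete (the prior choices are illustrative): two analysts start from opposing Beta priors, observe the same flips, and their posteriors grow closer as n increases:

```python
import numpy as np
from scipy import stats

np.random.seed(3)
true_theta = 0.6
flips = np.random.binomial(1, true_theta, 1000)
theta = np.linspace(0, 1, 2001)

# Two analysts with opposing prior beliefs (prior means 0.2 vs 0.8)
priors = {'skeptic': (2, 8), 'optimist': (8, 2)}

for n in [0, 10, 100, 1000]:
    heads = flips[:n].sum()
    tails = n - heads
    pdfs = {name: stats.beta(a + heads, b + tails).pdf(theta)
            for name, (a, b) in priors.items()}
    # Approximate total variation distance between the two posteriors
    tv = 0.5 * np.trapz(np.abs(pdfs['skeptic'] - pdfs['optimist']), theta)
    print(f"n = {n:>4}: total variation between posteriors ≈ {tv:.3f}")
```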
The Role of the Prior:
Priors encode the inductive biases necessary for learning. Without some prior constraints, learning from finite data is impossible—every dataset is consistent with infinitely many generalizations. The prior makes learning tractable by ruling out 'unnatural' hypotheses.
This connects to machine learning regularization, Occam's razor, and minimum description length principles—all can be viewed as implicit choices of prior.
No Free Lunch theorems in machine learning state that no algorithm outperforms others across all problems. Bayesian inference clarifies this: the prior implicitly defines which problems we expect to encounter. Different priors optimize for different problem classes. There's no 'prior-free' learning.
Bayesian Computation in the Mind:
Beyond statistics, Bayesian updating has become influential in cognitive science as a theory of how minds work. Evidence suggests humans perform approximate Bayesian inference for:

- Perception: combining noisy sensory cues with prior expectations
- Sensorimotor control: integrating feedback with learned dynamics
- Language comprehension and causal reasoning
While the brain likely doesn't compute exact posteriors, it may implement efficient approximations—making Bayesian updating not just normative (how we should learn) but potentially descriptive (how we do learn).
Limits of Bayesian Updating:
Despite its elegance, Bayesian updating has limitations:

- Model misspecification: if the likelihood is wrong, the posterior can converge confidently to the wrong answer.
- Computational cost: exact updating is intractable outside conjugate families, forcing approximations whose errors accumulate.
- The problem of new hypotheses: Bayes' theorem reallocates belief among hypotheses already in the model; it cannot invent hypotheses it never considered.
- Prior sensitivity: with little data, conclusions can depend heavily on the choice of prior.
These aren't fatal flaws, but reminders that Bayesian updating is a tool—powerful within its domain, but not a solution to all learning problems.
Bayesian updating transforms statistical inference from a static procedure into a dynamic learning process. Let's consolidate the key insights:

- Today's posterior is tomorrow's prior: every update has the same likelihood-times-prior form, applied recursively.
- For i.i.d. data, the final posterior is order independent: batch, online, and mini-batch updating all agree.
- Surprising observations (improbable under current beliefs) move the posterior more than expected ones.
- Conjugate models make exact sequential updating trivial; non-conjugate models require approximations such as particle filtering.
- The posterior feeds directly into decision-making via expected utility, value of information, and algorithms like Thompson sampling.
What's Next:
With the updating mechanism understood, we now turn to point estimation from the posterior—how to extract single 'best guesses' when decisions or communication require them. We'll explore the MAP estimator, posterior mean, and their connections to frequentist methods and regularization.
You now understand Bayesian updating as continuous learning—how beliefs evolve with evidence, why order doesn't matter, and how updating enables both inference and decision-making. Next, we'll explore extracting point estimates from the rich information contained in posteriors.