The true power of Bayesian inference emerges when we view it not as a static calculation, but as a dynamic process of learning. Each new piece of evidence refines our beliefs, the posterior from yesterday becomes the prior for today, and knowledge accumulates coherently over time.
This perspective—Bayesian updating—transforms statistical inference from a one-shot procedure into a framework for continuous learning. It explains how rational agents should update beliefs, how machine learning systems can incorporate streaming data, and how scientific knowledge progresses as experiments accumulate.
In this page, we explore the mechanics and philosophy of Bayesian updating, from single sequential observations to batch updates, from the mathematics of belief change to practical algorithms for online learning.
By the end of this page, you will understand the recursive nature of Bayesian updating, implement sequential learning with conjugate priors, appreciate the order-independence of Bayesian updates, recognize connections to online learning algorithms, and understand how Bayesian updating embodies rational learning.
At its core, Bayesian updating is elegantly simple: today's posterior is tomorrow's prior.
The Sequential Update:
Suppose we observe data in a sequence: D₁, then D₂, then D₃, and so on. The updating process is:

1. Start with a prior p(θ).
2. Observe D₁ and update: p(θ|D₁) ∝ p(D₁|θ) p(θ).
3. Treat p(θ|D₁) as the new prior; observe D₂ and update: p(θ|D₁, D₂) ∝ p(D₂|θ) p(θ|D₁).
4. Repeat for each new observation.
Mathematically:
$$p(\theta | D_{1:n}) \propto p(D_n | \theta) \cdot p(\theta | D_{1:n-1})$$
The Key Insight:
Each update has the same form: likelihood times prior. The only difference is that the 'prior' at step n is the posterior from step n−1. This recursive structure is powerful because:

- Memory efficiency: we never need to revisit old data; the current posterior summarizes everything seen so far.
- Streaming compatibility: observations can be processed as they arrive.
- Coherence: the state of belief at any point is a proper posterior, ready for inference or decision-making.
Bayesian posteriors are martingales with respect to the prior: E[p(θ|D₁,...,Dₙ)] = p(θ), where expectation is over the joint distribution of future data. In words: before seeing data, we expect our future beliefs to equal our current beliefs. Learning moves beliefs around but doesn't systematically bias them in any direction.
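To see the martingale property in action, here is a minimal simulation (a sketch, assuming the Beta-Bernoulli coin model used throughout this page): draw θ from the prior, generate flips, and average the resulting posterior means across many simulated worlds.

```python
import numpy as np

rng = np.random.default_rng(0)
prior_a, prior_b = 3, 2            # prior mean = 3/5 = 0.6
n_sims, n_flips = 100_000, 20

# For each simulated world: draw θ from the prior, flip the coin n_flips times,
# then compute the posterior mean E[θ|data] = (a + heads) / (a + b + n_flips)
thetas = rng.beta(prior_a, prior_b, n_sims)
heads = rng.binomial(n_flips, thetas)
posterior_means = (prior_a + heads) / (prior_a + prior_b + n_flips)

print(f"Prior mean:             {prior_a / (prior_a + prior_b):.4f}")
print(f"Average posterior mean: {posterior_means.mean():.4f}")
# The two agree: before seeing data, expected future belief = current belief.
```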
Equivalence of Batch and Sequential Updates:
A fundamental property of Bayesian updating is order independence (for i.i.d. data): the same posterior is obtained whether we:

- process all the data at once in a single batch update,
- process observations one at a time, in any order, or
- process the data in mini-batches of any size.
Proof sketch:
For i.i.d. data: $$p(\theta | D_1, D_2) = \frac{p(D_1, D_2 | \theta) p(\theta)}{p(D_1, D_2)} = \frac{p(D_1|\theta) p(D_2|\theta) p(\theta)}{p(D_1, D_2)}$$
Sequentially: $$p(\theta | D_1, D_2) \propto p(D_2|\theta) p(\theta|D_1) \propto p(D_2|\theta) p(D_1|\theta) p(\theta)$$
Same result up to normalization. This equivalence means we can choose whichever formulation is most convenient.
Let's implement sequential Bayesian updating for several common scenarios to build intuition.
Example 1: Learning a Coin's Bias
Suppose you're unsure about a coin's fairness and flip it repeatedly, updating your beliefs after each flip.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def sequential_coin_learning(true_theta=0.7, n_flips=100, prior_a=1, prior_b=1):
    """
    Demonstrate sequential Bayesian learning of a coin's bias.

    Each flip updates our belief about θ (probability of heads).
    Prior: Beta(prior_a, prior_b)
    Likelihood: Bernoulli(θ)
    """
    np.random.seed(42)

    # Generate flip sequence
    flips = np.random.binomial(1, true_theta, n_flips)

    # Track posterior evolution
    theta_grid = np.linspace(0, 1, 500)

    # Storage for animation frames
    history = []

    # Initial state
    current_a, current_b = prior_a, prior_b
    history.append({
        'n': 0, 'heads': 0, 'tails': 0,
        'a': current_a, 'b': current_b,
        'mean': current_a / (current_a + current_b),
        'std': np.sqrt(stats.beta(current_a, current_b).var())
    })

    # Sequential updates
    cumulative_heads = 0
    cumulative_tails = 0

    for i, flip in enumerate(flips):
        # Update sufficient statistics
        if flip == 1:
            cumulative_heads += 1
            current_a += 1
        else:
            cumulative_tails += 1
            current_b += 1

        # Record state
        posterior_mean = current_a / (current_a + current_b)
        posterior_std = np.sqrt(stats.beta(current_a, current_b).var())
        history.append({
            'n': i + 1,
            'heads': cumulative_heads,
            'tails': cumulative_tails,
            'a': current_a,
            'b': current_b,
            'mean': posterior_mean,
            'std': posterior_std
        })

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Plot 1: Posterior at key points
    ax = axes[0, 0]
    milestones = [0, 5, 20, 50, 100]
    colors = plt.cm.viridis(np.linspace(0, 1, len(milestones)))
    for milestone, color in zip(milestones, colors):
        if milestone <= len(history) - 1:
            h = history[milestone]
            pdf = stats.beta(h['a'], h['b']).pdf(theta_grid)
            ax.plot(theta_grid, pdf, color=color, linewidth=2,
                    label=f"n={h['n']}: Beta({h['a']}, {h['b']})")
    ax.axvline(true_theta, color='red', linestyle='--', label=f'True θ={true_theta}')
    ax.set_xlabel('θ')
    ax.set_ylabel('p(θ|data)')
    ax.set_title('Posterior Evolution')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, 1)

    # Plot 2: Posterior mean over time
    ax = axes[0, 1]
    ns = [h['n'] for h in history]
    means = [h['mean'] for h in history]
    stds = [h['std'] for h in history]
    ax.fill_between(ns, [m - 2*s for m, s in zip(means, stds)],
                    [m + 2*s for m, s in zip(means, stds)],
                    alpha=0.3, color='blue', label='±2 std')
    ax.plot(ns, means, 'b-', linewidth=2, label='Posterior mean')
    ax.axhline(true_theta, color='red', linestyle='--', label=f'True θ={true_theta}')
    ax.set_xlabel('Number of flips')
    ax.set_ylabel('E[θ|data]')
    ax.set_title('Posterior Mean Convergence')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, n_flips)
    ax.set_ylim(0, 1)

    # Plot 3: Posterior standard deviation
    ax = axes[1, 0]
    ax.plot(ns, stds, 'purple', linewidth=2)
    ax.set_xlabel('Number of flips')
    ax.set_ylabel('Std[θ|data]')
    ax.set_title('Uncertainty Reduction Over Time')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, n_flips)

    # Annotate rate of decay
    theoretical_std = lambda n: 0.5 / np.sqrt(n + 4)  # Approximate for Beta
    ns_theory = np.linspace(1, n_flips, 100)
    ax.plot(ns_theory, [theoretical_std(n) for n in ns_theory],
            'r--', alpha=0.5, label='~1/√n decay')
    ax.legend()

    # Plot 4: 95% credible interval width
    ax = axes[1, 1]
    ci_widths = []
    for h in history:
        dist = stats.beta(h['a'], h['b'])
        width = dist.ppf(0.975) - dist.ppf(0.025)
        ci_widths.append(width)
    ax.plot(ns, ci_widths, 'green', linewidth=2)
    ax.set_xlabel('Number of flips')
    ax.set_ylabel('95% CI Width')
    ax.set_title('Credible Interval Shrinkage')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(0, n_flips)

    plt.tight_layout()
    return history

history = sequential_coin_learning()

print("\n" + "="*60)
print("SEQUENTIAL LEARNING SUMMARY")
print("="*60)
print(f"{'n':<6} {'Heads':<8} {'Tails':<8} {'Post Mean':<12} {'Post Std':<12}")
print("-"*60)
for h in history[::20]:  # Every 20 observations
    print(f"{h['n']:<6} {h['heads']:<8} {h['tails']:<8} {h['mean']:<12.4f} {h['std']:<12.4f}")
```

Key Observations:

- The posterior concentrates around the true θ = 0.7 as flips accumulate.
- Uncertainty (posterior standard deviation and credible-interval width) shrinks at roughly a 1/√n rate.
- Early flips move the posterior the most; once beliefs are sharp, each additional observation changes little.
Not all observations are equally informative. Bayesian updating naturally weighs evidence by its information content—surprising observations update beliefs more than expected ones.
Measuring Belief Change:
Several measures quantify how much an observation changes our beliefs:
1. KL Divergence (Relative Entropy): $$D_{KL}(p_{\text{post}} || p_{\text{prior}}) = \int p(\theta|D) \log \frac{p(\theta|D)}{p(\theta)} d\theta$$
Measures the 'information gained' from the prior to the posterior. Always non-negative; zero only if no update occurred.
2. Bayes Factor: $$BF = \frac{p(D|H_1)}{p(D|H_2)}$$
For comparing hypotheses, the Bayes factor measures relative evidence. A BF of 10 means the data are 10× more probable under H₁ than under H₂ (a worked example follows this list).
3. Log Posterior Ratio: $$\log \frac{p(\theta_1|D)}{p(\theta_2|D)} - \log \frac{p(\theta_1)}{p(\theta_2)}$$
How much the observation changed the odds between two parameter values.
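To make the Bayes factor concrete, here is a small worked example (the counts and hypotheses are illustrative, not taken from this page's experiments): comparing H₁: θ = 0.5 exactly against H₂: θ ~ Beta(1, 1), where the marginal likelihood under H₂ has the Beta-Binomial form.

```python
import numpy as np
from scipy.special import betaln, comb

# Illustrative data: 8 heads in 10 flips
n, k = 10, 8

# H1: fair coin, θ = 0.5 exactly
p_data_h1 = comb(n, k) * 0.5**n

# H2: unknown bias with θ ~ Beta(a, b); marginal likelihood is Beta-Binomial:
# p(D|H2) = C(n,k) · B(a+k, b+n-k) / B(a,b)
a, b = 1, 1
p_data_h2 = comb(n, k) * np.exp(betaln(a + k, b + n - k) - betaln(a, b))

bf_21 = p_data_h2 / p_data_h1
print(f"p(D|H1) = {p_data_h1:.4f},  p(D|H2) = {p_data_h2:.4f}")
print(f"Bayes factor BF_21 = {bf_21:.2f}")  # > 1: data favor the unknown-bias model
```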
In information theory, the 'surprisal' of an event with probability p is -log(p). Rare events carry more information. Bayesian updating incorporates this: observations that are improbable under your current beliefs (surprising) cause larger updates than those you expected.
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
import matplotlib.pyplot as plt

def kl_divergence_beta(post_a, post_b, prior_a, prior_b):
    """
    KL divergence from prior to posterior for Beta distributions.
    D_KL(posterior || prior)
    """
    from scipy.special import betaln, digamma

    # Analytical formula for KL divergence between Beta distributions
    kl = (betaln(prior_a, prior_b) - betaln(post_a, post_b)
          + (post_a - prior_a) * digamma(post_a)
          + (post_b - prior_b) * digamma(post_b)
          + (prior_a - post_a + prior_b - post_b) * digamma(post_a + post_b))
    return kl

def analyze_update_information(prior_a, prior_b, observation):
    """
    Analyze the information content of a single observation.

    observation: 1 for heads, 0 for tails
    """
    # Posterior parameters
    post_a = prior_a + observation
    post_b = prior_b + (1 - observation)

    # Prior and posterior means
    prior_mean = prior_a / (prior_a + prior_b)
    post_mean = post_a / (post_a + post_b)

    # KL divergence
    kl = kl_divergence_beta(post_a, post_b, prior_a, prior_b)

    # Probability of this observation under current belief
    prob_obs = prior_mean if observation == 1 else (1 - prior_mean)

    # Surprisal
    surprisal = -np.log2(prob_obs)

    return {
        'post_a': post_a, 'post_b': post_b,
        'prior_mean': prior_mean, 'post_mean': post_mean,
        'kl_divergence': kl,
        'prob_obs': prob_obs,
        'surprisal': surprisal,
        'mean_shift': abs(post_mean - prior_mean)
    }

# Example: Different priors, same observation
print("=" * 70)
print("INFORMATION CONTENT OF OBSERVATIONS")
print("=" * 70)

scenarios = [
    ("Uniform prior (uncertain)", 1, 1, 1),            # Heads with no prior knowledge
    ("Uniform prior (uncertain)", 1, 1, 0),            # Tails with no prior knowledge
    ("Prior believes fair", 10, 10, 1),                # Heads when expecting 50-50
    ("Prior believes fair", 10, 10, 0),                # Tails when expecting 50-50
    ("Prior believes biased toward heads", 20, 5, 1),  # Expected heads
    ("Prior believes biased toward heads", 20, 5, 0),  # Surprising tails
    ("Strong prior, θ≈0.9", 90, 10, 1),                # Expected heads
    ("Strong prior, θ≈0.9", 90, 10, 0),                # Very surprising tails
]

print(f"{'Scenario':<35} {'Obs':<5} {'P(obs)':<8} {'Surpr.':<8} {'KL Div':<10} {'|Δmean|':<10}")
print("-" * 70)

for name, a, b, obs in scenarios:
    result = analyze_update_information(a, b, obs)
    obs_name = "H" if obs == 1 else "T"
    print(f"{name:<35} {obs_name:<5} {result['prob_obs']:<8.3f} "
          f"{result['surprisal']:<8.3f} {result['kl_divergence']:<10.5f} "
          f"{result['mean_shift']:<10.5f}")

print()
print("Key insight: Surprising observations (low P(obs)) cause larger updates")
print("             (higher KL divergence and mean shift)")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Case study: updating from different priors with the same observation (tails)
priors = [(1, 1), (5, 5), (10, 10), (20, 5), (50, 50)]
obs = 0  # Tails

theta = np.linspace(0, 1, 500)

# Plot 1: Prior → Posterior for different starting points
ax = axes[0]
for a, b in priors:
    prior = stats.beta(a, b).pdf(theta)
    post_a = a + obs
    post_b = b + (1 - obs)
    posterior = stats.beta(post_a, post_b).pdf(theta)
    ax.plot(theta, prior, '--', alpha=0.5, linewidth=1)
    ax.plot(theta, posterior, '-', linewidth=2,
            label=f'Beta({a},{b}) → Beta({post_a},{post_b})')

ax.set_xlabel('θ')
ax.set_ylabel('Density')
ax.set_title('Same Observation (Tails), Different Priors')
ax.legend(fontsize=8)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 1)

# Plot 2: KL divergence vs prior strength
ax = axes[1]
prior_strengths = np.arange(2, 102, 2)  # α + β from 2 to 100
kl_heads = []
kl_tails = []

for strength in prior_strengths:
    # Symmetric prior
    a = strength / 2
    b = strength / 2
    # KL for heads vs tails
    result_h = analyze_update_information(a, b, 1)
    result_t = analyze_update_information(a, b, 0)
    kl_heads.append(result_h['kl_divergence'])
    kl_tails.append(result_t['kl_divergence'])

ax.plot(prior_strengths, kl_heads, 'b-', linewidth=2, label='Heads')
ax.plot(prior_strengths, kl_tails, 'r-', linewidth=2, label='Tails')
ax.set_xlabel('Prior Strength (α + β)')
ax.set_ylabel('KL Divergence (nats)')
ax.set_title('Information Gain vs Prior Concentration\n(Symmetric priors, 50-50 expected)')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 100)

plt.tight_layout()
```

In practice, we often have choices about how to structure updates. Two common modes are:
Batch Updating: collect all available data, then compute the posterior in a single pass from the original prior.

Online Updating: process observations one at a time (or in small mini-batches), using the current posterior as the prior for the next update.
Mathematical Equivalence:
For conjugate models with i.i.d. data, batch and online updating give identical results:
$$p(\theta | D_1, D_2, ..., D_n) = p(\theta | S(D_1, ..., D_n))$$
where S is the sufficient statistic. The updating path doesn't matter—only the final data summary.
```python
import numpy as np
from scipy import stats

def demonstrate_equivalence():
    """
    Show that batch and online updates give identical final posteriors.
    """
    np.random.seed(42)

    # Data: 100 coin flips
    true_theta = 0.65
    data = np.random.binomial(1, true_theta, 100)
    total_heads = data.sum()
    total_tails = len(data) - total_heads

    # Initial prior
    prior_a, prior_b = 2, 2

    # Method 1: Batch update (all at once)
    batch_post_a = prior_a + total_heads
    batch_post_b = prior_b + total_tails

    # Method 2: Online update (one at a time)
    online_a, online_b = prior_a, prior_b
    for flip in data:
        online_a += flip
        online_b += (1 - flip)
    online_post_a, online_post_b = online_a, online_b

    # Method 3: Mini-batch updates (groups of 10)
    minibatch_a, minibatch_b = prior_a, prior_b
    for i in range(0, 100, 10):
        batch = data[i:i+10]
        minibatch_a += batch.sum()
        minibatch_b += len(batch) - batch.sum()
    minibatch_post_a, minibatch_post_b = minibatch_a, minibatch_b

    # Method 4: Random order (shuffled data)
    shuffled_data = np.random.permutation(data)
    shuffled_a, shuffled_b = prior_a, prior_b
    for flip in shuffled_data:
        shuffled_a += flip
        shuffled_b += (1 - flip)

    print("=" * 60)
    print("BATCH vs ONLINE UPDATE EQUIVALENCE")
    print("=" * 60)
    print(f"Data: {total_heads} heads, {total_tails} tails (n=100)")
    print(f"Prior: Beta({prior_a}, {prior_b})")
    print()
    print(f"{'Method':<25} {'Posterior α':<15} {'Posterior β':<15}")
    print("-" * 60)
    print(f"{'Batch (all at once)':<25} {batch_post_a:<15} {batch_post_b:<15}")
    print(f"{'Online (one by one)':<25} {online_post_a:<15} {online_post_b:<15}")
    print(f"{'Mini-batch (10s)':<25} {minibatch_post_a:<15} {minibatch_post_b:<15}")
    print(f"{'Random order':<25} {shuffled_a:<15} {shuffled_b:<15}")
    print()

    # Verify all are identical
    all_match = (batch_post_a == online_post_a == minibatch_post_a == shuffled_a
                 and batch_post_b == online_post_b == minibatch_post_b == shuffled_b)
    print(f"All methods give identical results: {all_match}")
    print()
    print("This demonstrates ORDER INDEPENDENCE: the final posterior")
    print("depends only on the sufficient statistic (total heads/tails),")
    print("not on how we organized the updates.")

demonstrate_equivalence()
```

When conjugacy doesn't hold, sequential updating becomes more challenging. The posterior after each observation doesn't have a simple closed form, and approximations are necessary.
The Challenge:
$$p(\theta | D_{1:n}) = \frac{p(D_n | \theta) \cdot p(\theta | D_{1:n-1})}{p(D_n | D_{1:n-1})}$$
Even if we could represent p(θ | D₁:n-1) exactly, multiplying by the likelihood generally produces a distribution we can't represent in closed form.
Approximation Strategies:
1. Particle Filtering (Sequential Monte Carlo): represent the posterior by a cloud of weighted samples; reweight the samples by the likelihood of each new observation and resample when the weights degenerate.

2. Assumed Density Filtering: after each update, project the intractable posterior back onto a tractable family (e.g., a Gaussian) by moment matching.

3. Online Variational Inference: maintain a parametric variational approximation and update its parameters as each observation (or mini-batch) arrives.

4. Periodic Batch Resampling: run cheap approximate updates online, but occasionally rerun full batch inference on stored data to correct accumulated error.
When using approximations for sequential updates, errors compound over time. A small error at step 1 alters the 'prior' for step 2, which alters the posterior at step 2, which becomes an inexact prior for step 3, etc. This error accumulation is the main challenge of online Bayesian inference.
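Before the particle-filter example below, here is a minimal sketch of strategy 2, assumed density filtering, for the same non-conjugate setup used in that example (the scale σ of a Cauchy with a log-normal prior): after each observation, the posterior over log σ is projected back onto a Gaussian by grid-based moment matching. The grid size and projection scheme are illustrative choices, not a library routine.

```python
import numpy as np
from scipy import stats

def adf_update(mu, tau, x, grid_size=2001):
    """One assumed-density-filtering step for φ = log(σ), x ~ Cauchy(0, exp(φ)).

    The running belief is constrained to Normal(mu, tau²): multiply by the
    new likelihood on a grid, then moment-match back onto a Gaussian.
    """
    phi = np.linspace(mu - 6*tau, mu + 6*tau, grid_size)
    post = stats.norm(mu, tau).pdf(phi) * stats.cauchy(0, np.exp(phi)).pdf(x)
    post /= np.trapz(post, phi)                                 # normalize on the grid
    new_mu = np.trapz(phi * post, phi)                          # match the mean
    new_tau = np.sqrt(np.trapz((phi - new_mu)**2 * post, phi))  # match the std
    return new_mu, new_tau

np.random.seed(42)
true_sigma = 2.0
data = stats.cauchy(0, true_sigma).rvs(50)

mu, tau = 0.0, 1.0  # prior: log(σ) ~ Normal(0, 1)
for x in data:
    mu, tau = adf_update(mu, tau, x)

print(f"ADF estimate of σ: {np.exp(mu):.2f} (true σ = {true_sigma})")
```

Each step forces the posterior back into the Gaussian family, so the projection error at one step becomes part of the prior at the next: exactly the compounding effect described above.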
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def particle_filter_update(particles, weights, observation, likelihood_func):
    """
    Sequential Monte Carlo (particle filter) update.

    particles: Current particle positions (samples from posterior)
    weights: Current particle weights
    observation: New data point
    likelihood_func: Function computing p(observation | particle)

    Returns updated particles and weights.
    """
    n_particles = len(particles)

    # Update weights by likelihood
    log_likelihoods = np.array([np.log(likelihood_func(p, observation) + 1e-300)
                                for p in particles])
    log_weights = np.log(weights + 1e-300) + log_likelihoods
    log_weights -= np.max(log_weights)  # Numerical stability
    weights = np.exp(log_weights)
    weights /= weights.sum()  # Normalize

    # Effective sample size
    ess = 1 / np.sum(weights**2)

    # Resample if ESS too low
    if ess < n_particles / 2:
        indices = np.random.choice(n_particles, size=n_particles,
                                   replace=True, p=weights)
        particles = particles[indices]
        weights = np.ones(n_particles) / n_particles

    return particles, weights, ess

# Example: Non-conjugate problem - learning the scale of a Cauchy distribution
# Prior: log(σ) ~ Normal(0, 1) → σ ~ Log-Normal(0, 1)
# Likelihood: x ~ Cauchy(0, σ)

def cauchy_likelihood(sigma, x):
    """p(x | σ) for Cauchy(0, σ)"""
    return 1 / (np.pi * sigma * (1 + (x/sigma)**2))

# True parameter
true_sigma = 2.0

# Generate data
np.random.seed(42)
n_obs = 50
data = stats.cauchy(0, true_sigma).rvs(n_obs)

# Initialize particle filter
n_particles = 5000
# Sample from prior: σ ~ LogNormal(0, 1)
particles = np.random.lognormal(0, 1, n_particles)
weights = np.ones(n_particles) / n_particles

# Track posterior evolution
history = [{
    'n': 0,
    'mean': np.average(particles, weights=weights),
    'std': np.sqrt(np.average((particles - np.average(particles, weights=weights))**2,
                              weights=weights)),
    'ess': n_particles
}]

# Sequential updates
for i, x in enumerate(data):
    particles, weights, ess = particle_filter_update(
        particles, weights, x, cauchy_likelihood)

    posterior_mean = np.average(particles, weights=weights)
    posterior_std = np.sqrt(np.average((particles - posterior_mean)**2,
                                       weights=weights))

    history.append({
        'n': i + 1,
        'mean': posterior_mean,
        'std': posterior_std,
        'ess': ess
    })

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Final posterior approximation
ax = axes[0, 0]
ax.hist(particles, bins=50, density=True, alpha=0.7, weights=weights,
        label='Particle posterior')
ax.axvline(true_sigma, color='red', linestyle='--', linewidth=2,
           label=f'True σ = {true_sigma}')
ax.axvline(history[-1]['mean'], color='green', linestyle='-', linewidth=2,
           label=f'Post. mean = {history[-1]["mean"]:.2f}')
ax.set_xlabel('σ')
ax.set_ylabel('Density')
ax.set_title(f'Final Posterior (Particle Filter, n={n_obs})')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 8)

# Plot 2: Posterior mean evolution
ax = axes[0, 1]
ns = [h['n'] for h in history]
means = [h['mean'] for h in history]
stds = [h['std'] for h in history]
ax.fill_between(ns, [m-s for m, s in zip(means, stds)],
                [m+s for m, s in zip(means, stds)], alpha=0.3)
ax.plot(ns, means, 'b-', linewidth=2)
ax.axhline(true_sigma, color='red', linestyle='--', label=f'True σ = {true_sigma}')
ax.set_xlabel('Number of observations')
ax.set_ylabel('E[σ|data]')
ax.set_title('Posterior Mean Convergence')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 3: Effective sample size
ax = axes[1, 0]
ess_values = [h['ess'] for h in history]
ax.plot(ns, ess_values, 'purple', linewidth=2)
ax.axhline(n_particles/2, color='red', linestyle='--', label='Resample threshold')
ax.set_xlabel('Number of observations')
ax.set_ylabel('Effective Sample Size')
ax.set_title('Particle Filter Efficiency')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(0, n_particles)

# Plot 4: Data histogram (for context)
ax = axes[1, 1]
ax.hist(data, bins=30, density=True, alpha=0.7, label='Observed data')
x_plot = np.linspace(-15, 15, 200)
ax.plot(x_plot, stats.cauchy(0, true_sigma).pdf(x_plot), 'r-', linewidth=2,
        label=f'True Cauchy(0, {true_sigma})')
ax.plot(x_plot, stats.cauchy(0, history[-1]['mean']).pdf(x_plot), 'g--', linewidth=2,
        label=f'Estimated Cauchy(0, {history[-1]["mean"]:.2f})')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_title('Data vs Fitted Distribution')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim(-15, 15)

plt.tight_layout()

print("\nParticle filter successfully approximated the non-conjugate posterior!")
print(f"True σ = {true_sigma:.2f}")
print(f"Estimated σ = {history[-1]['mean']:.2f} ± {history[-1]['std']:.2f}")
```

Bayesian updating isn't just about beliefs—it's about making decisions under uncertainty. The posterior distribution enables optimal decision-making through expected utility maximization.
The Decision-Theoretic Framework:

Given a utility function U(a, θ) scoring action a when the parameter is θ, the Bayes-optimal action maximizes posterior expected utility:

$$a^* = \arg\max_a \int U(a, \theta) \, p(\theta | D) \, d\theta$$
The posterior uncertainty is automatically incorporated: actions robust to parameter uncertainty are favored over those that are optimal only for specific parameter values.
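As a small illustration of this rule (the action names and utility numbers are hypothetical), the sketch below scores three possible actions against a Beta posterior over a conversion rate via Monte Carlo expected utility:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Posterior over a conversion rate θ after some data (illustrative): Beta(12, 28)
posterior = stats.beta(12, 28)
theta_draws = posterior.rvs(100_000, random_state=rng)

# Hypothetical utilities for three actions, each a function of the unknown θ
actions = {
    'launch':        lambda th: 100 * th - 25,      # big upside, fixed cost
    'small_rollout': lambda th: 40 * th - 5,        # modest upside, less risk
    'do_nothing':    lambda th: np.zeros_like(th),  # safe baseline
}

# Bayes-optimal action: maximize posterior expected utility
expected_utility = {name: u(theta_draws).mean() for name, u in actions.items()}
best = max(expected_utility, key=expected_utility.get)

for name, eu in expected_utility.items():
    print(f"E[U({name})] = {eu:+.2f}")
print(f"Optimal action: {best}")
```

Because the utility is averaged over the whole posterior, an action that does well across plausible values of θ can beat one that is optimal only at the posterior mode.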
Example: When to Stop Experimenting
Consider A/B testing: you're comparing two website designs. At each point, you can:

- stop the experiment and deploy the design that currently looks better, or
- pay the cost of collecting more data to reduce uncertainty before committing.
Bayesian updating tells you your current beliefs; expected utility tells you when continued learning outweighs its cost.
The 'Value of Information' (VOI) quantifies how much reducing uncertainty is worth for decision-making. VOI = Expected utility with more info − Expected utility now. When VOI < cost of information, stop collecting data.
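Here is a minimal sketch of the idea, computing the expected value of perfect information (EVPI), an upper bound on VOI, for a two-design A/B test with Beta posteriors; the posterior parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Current posteriors over the two designs' conversion rates (illustrative)
post_a = stats.beta(40, 60)  # design A: mean 0.40
post_b = stats.beta(42, 58)  # design B: mean 0.42, still uncertain

draws_a = post_a.rvs(200_000, random_state=rng)
draws_b = post_b.rvs(200_000, random_state=rng)

# Utility of deploying a design = its conversion rate (per-visitor scale).
# Acting now: commit to whichever design has the higher posterior mean.
value_now = max(draws_a.mean(), draws_b.mean())

# Acting with perfect information: always pick the truly better design.
value_perfect = np.maximum(draws_a, draws_b).mean()

evpi = value_perfect - value_now
print(f"Value acting now:          {value_now:.4f}")
print(f"Value with perfect info:   {value_perfect:.4f}")
print(f"EVPI (upper bound on VOI): {evpi:.4f}")
# If the cost of more data exceeds this bound, stop experimenting.
```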
Sequential Decision Problems:
Many real problems involve sequences of decisions, each informed by the posterior at that moment:

- Clinical trials that adapt treatment allocation as evidence accumulates
- Recommender systems choosing what to show next based on observed responses
- Multi-armed bandit problems balancing exploration and exploitation
In all these cases, Bayesian updating provides the engine for learning, while decision theory provides the framework for acting optimally.
Thompson Sampling:
A beautiful algorithm that combines Bayesian updating with exploration:
For each decision point:
1. Sample θ̃ from current posterior p(θ|data_so_far)
2. Take action that maximizes utility assuming θ̃ is true
3. Observe outcome
4. Update posterior
This naturally balances exploration (trying uncertain options) with exploitation (choosing best known options). Theoretical guarantees show it achieves near-optimal regret in many settings.
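A minimal sketch of Thompson sampling for a two-armed Bernoulli bandit with Beta posteriors (the true success rates are illustrative and hidden from the algorithm):

```python
import numpy as np

rng = np.random.default_rng(7)

true_rates = [0.45, 0.55]  # unknown to the algorithm (illustrative)
a = np.ones(2)             # Beta posterior parameters per arm,
b = np.ones(2)             # starting from uniform Beta(1, 1) priors
pulls = np.zeros(2, dtype=int)

for t in range(5000):
    # 1. Sample θ̃ for each arm from its current posterior
    theta_tilde = rng.beta(a, b)
    # 2. Take the action that is best if the sampled θ̃ were true
    arm = int(np.argmax(theta_tilde))
    # 3. Observe the outcome
    reward = rng.random() < true_rates[arm]
    # 4. Update that arm's posterior
    a[arm] += reward
    b[arm] += 1 - reward
    pulls[arm] += 1

print(f"Pulls per arm:   {pulls}")        # the better arm dominates over time
print(f"Posterior means: {a / (a + b)}")
```

Notice how the four steps above map directly onto the loop body: sample, act, observe, update.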
Stepping back from the mathematics, Bayesian updating embodies a compelling theory of rational learning with deep philosophical implications.
Dutch Book Coherence:
As mentioned earlier, beliefs that don't update according to Bayes' theorem are 'incoherent'—they make the believer vulnerable to sure-loss betting sequences (Dutch books). This isn't just an abstract concern; it implies that any consistent, non-exploitable learning process must be Bayesian.
Convergence to Truth:
Under mild conditions (correctly specified model, prior with broad support), Bayesian posteriors concentrate around the true parameter as data accumulates. This posterior consistency means Bayesian learning is self-correcting:

- A poorly chosen prior is eventually overwhelmed by the data.
- Analysts who start from different priors converge toward essentially the same posterior.
This provides a resolution to the subjectivity objection: while priors may differ, learning pathways converge.
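A quick simulation makes this concrete (the prior choices are illustrative): two analysts start from opposing Beta priors, observe the same flips, and their posteriors grow closer as n increases:

```python
import numpy as np
from scipy import stats

np.random.seed(3)
true_theta = 0.6
flips = np.random.binomial(1, true_theta, 1000)
theta = np.linspace(0, 1, 2001)

# Two analysts with opposing prior beliefs (prior means 0.2 vs 0.8)
priors = {'skeptic': (2, 8), 'optimist': (8, 2)}

for n in [0, 10, 100, 1000]:
    heads = flips[:n].sum()
    tails = n - heads
    pdfs = {name: stats.beta(a + heads, b + tails).pdf(theta)
            for name, (a, b) in priors.items()}
    # Approximate total variation distance between the two posteriors
    tv = 0.5 * np.trapz(np.abs(pdfs['skeptic'] - pdfs['optimist']), theta)
    print(f"n = {n:>4}: total variation between posteriors ≈ {tv:.3f}")
```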
The Role of the Prior:
Priors encode the inductive biases necessary for learning. Without some prior constraints, learning from finite data is impossible—every dataset is consistent with infinitely many generalizations. The prior makes learning tractable by ruling out 'unnatural' hypotheses.
This connects to machine learning regularization, Occam's razor, and minimum description length principles—all can be viewed as implicit choices of prior.
No Free Lunch theorems in machine learning state that no algorithm outperforms others across all problems. Bayesian inference clarifies this: the prior implicitly defines which problems we expect to encounter. Different priors optimize for different problem classes. There's no 'prior-free' learning.
Bayesian Computation in the Mind:
Beyond statistics, Bayesian updating has become influential in cognitive science as a theory of how minds work. Evidence suggests humans perform approximate Bayesian inference for:

- Perception: combining noisy sensory cues with prior expectations
- Sensorimotor control: integrating feedback with learned dynamics
- Language comprehension and causal reasoning
While the brain likely doesn't compute exact posteriors, it may implement efficient approximations—making Bayesian updating not just normative (how we should learn) but potentially descriptive (how we do learn).
Limits of Bayesian Updating:
Despite its elegance, Bayesian updating has limitations:

- Model misspecification: if the likelihood is wrong, the posterior can converge confidently to the wrong answer.
- Computational cost: exact updating is intractable outside conjugate families, forcing approximations whose errors accumulate.
- The problem of new hypotheses: Bayes' theorem reallocates belief among hypotheses already in the model; it cannot invent hypotheses it never considered.
- Prior sensitivity: with little data, conclusions can depend heavily on the choice of prior.
These aren't fatal flaws, but reminders that Bayesian updating is a tool—powerful within its domain, but not a solution to all learning problems.
Bayesian updating transforms statistical inference from a static procedure into a dynamic learning process. Let's consolidate the key insights:

- Today's posterior is tomorrow's prior: every update has the same likelihood-times-prior form, applied recursively.
- For i.i.d. data, the final posterior is order independent: batch, online, and mini-batch updating all agree.
- Surprising observations (improbable under current beliefs) move the posterior more than expected ones.
- Conjugate models make exact sequential updating trivial; non-conjugate models require approximations such as particle filtering.
- The posterior feeds directly into decision-making via expected utility, value of information, and algorithms like Thompson sampling.
What's Next:
With the updating mechanism understood, we now turn to point estimation from the posterior—how to extract single 'best guesses' when decisions or communication require them. We'll explore the MAP estimator, posterior mean, and their connections to frequentist methods and regularization.
You now understand Bayesian updating as continuous learning—how beliefs evolve with evidence, why order doesn't matter, and how updating enables both inference and decision-making. Next, we'll explore extracting point estimates from the rich information contained in posteriors.