Metropolis-Hastings updates all parameters simultaneously, which can be inefficient when the joint distribution has special structure that we're not exploiting. Many Bayesian models—particularly hierarchical models—have conditional distributions that are easy to sample from, even when the joint is complex.
Gibbs sampling leverages this structure. Instead of proposing a move for all parameters at once, we update each parameter (or group) one at a time, sampling from its full conditional distribution—the distribution of that parameter given all others and the data.
The key insight: if every full conditional is easy to sample, we can cycle through the parameters, drawing each from its conditional in turn; the resulting Markov chain has the joint posterior as its stationary distribution.
Named after the physicist Josiah Willard Gibbs (who studied statistical mechanics), Gibbs sampling became the foundation of practical Bayesian computation in the 1990s, powering software like BUGS and WinBUGS that made Bayesian inference accessible to applied researchers.
This page develops Gibbs sampling from first principles. You will understand full conditional distributions, see the algorithm for both systematic and random scan, learn blocking strategies for faster mixing, and understand when Gibbs excels versus when other methods are preferable.
The full conditional distribution of a parameter $\theta_j$ is its conditional distribution given all other parameters $\theta_{-j}$ and the data $\mathcal{D}$:
$$p(\theta_j | \theta_{-j}, \mathcal{D})$$
where $\theta_{-j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_d)$ denotes all parameters except $\theta_j$.
Deriving full conditionals:
The full conditional is proportional to the joint posterior, viewed as a function of $\theta_j$ only:
$$p(\theta_j | \theta_{-j}, \mathcal{D}) \propto p(\theta_1, \ldots, \theta_d | \mathcal{D})$$
with $\theta_{-j}$ treated as fixed constants. Terms not involving $\theta_j$ become part of the proportionality constant.
Example: Multivariate Gaussian
For $(\theta_1, \theta_2) \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right)$
The full conditional of $\theta_1$ given $\theta_2$ is: $$\theta_1 | \theta_2 \sim \mathcal{N}\left(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(\theta_2 - \mu_2), \sigma_1^2(1 - \rho^2)\right)$$
A univariate Gaussian—easy to sample!
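This is easy to check numerically: draw from the joint, keep only the draws where $\theta_2$ lands near a fixed value, and compare against the analytic conditional. A quick sketch (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 1.5, 0.5, 0.7

# Draw from the joint, then keep draws where theta_2 falls near a fixed value
joint = rng.multivariate_normal(
    [mu1, mu2],
    [[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]],
    size=500_000,
)
theta2_star = -1.5
near = joint[np.abs(joint[:, 1] - theta2_star) < 0.01, 0]

# Analytic full conditional of theta_1 | theta_2 = theta2_star
cond_mean = mu1 + rho * (s1 / s2) * (theta2_star - mu2)
cond_var = s1**2 * (1 - rho**2)

print(f"empirical mean {near.mean():.3f} vs analytic {cond_mean:.3f}")
print(f"empirical var  {near.var():.3f} vs analytic {cond_var:.3f}")
```

The empirical mean and variance of the retained slice agree with the conditional-mean and conditional-variance formulas above.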
The power of Gibbs sampling comes from recognizing that full conditionals often belong to standard families:
• Gaussian likelihood + Gaussian prior → Gaussian posterior (conjugate)
• Binomial likelihood + Beta prior → Beta posterior (conjugate)
• Gaussian with unknown precision → Gamma posterior for precision
• Mixture model → Multinomial for component membership
Once you identify the kernel (the part involving θⱼ), match it to a known distribution.
```python
import numpy as np
from scipy import stats

def derive_normal_normal_conditional():
    """
    Demonstrate deriving full conditionals for a simple hierarchical model.

    Model:
        mu ~ N(0, tau^2)          -- prior on mean
        X_i | mu ~ N(mu, sigma^2) -- likelihood for n observations

    Full conditional of mu | X, sigma^2:
        mu | X ~ N(posterior_mean, posterior_var)
    """
    # Hyperparameters
    tau_sq = 10.0    # Prior variance
    sigma_sq = 1.0   # Likelihood variance

    # Data
    X = np.array([4.2, 3.8, 5.1, 4.5, 4.9])
    n = len(X)
    x_bar = np.mean(X)

    # Derive full conditional
    # p(mu | X) ∝ p(X | mu) * p(mu)
    #           ∝ exp(-sum(X_i - mu)^2 / 2sigma^2) * exp(-mu^2 / 2tau^2)
    # Complete the square in mu...

    # Posterior precision = prior precision + data precision
    posterior_precision = 1/tau_sq + n/sigma_sq
    posterior_var = 1 / posterior_precision

    # Posterior mean = weighted average of prior mean (0) and data mean
    posterior_mean = posterior_var * (0/tau_sq + n*x_bar/sigma_sq)

    print("=== Normal-Normal Conjugate Model ===")
    print(f"Prior: mu ~ N(0, {tau_sq})")
    print(f"Likelihood: X_i | mu ~ N(mu, {sigma_sq})")
    print(f"Data: n={n}, mean={x_bar:.2f}")
    print()
    print(f"Full conditional: mu | X ~ N({posterior_mean:.4f}, {posterior_var:.4f})")
    print(f"  Posterior mean: {posterior_mean:.4f}")
    print(f"  Posterior std:  {np.sqrt(posterior_var):.4f}")

    # Sample from the full conditional
    samples = np.random.normal(posterior_mean, np.sqrt(posterior_var), 10000)
    print("Verification (10000 samples):")
    print(f"  Sample mean: {np.mean(samples):.4f}")
    print(f"  Sample std:  {np.std(samples):.4f}")

derive_normal_normal_conditional()

def identify_full_conditional_form():
    """
    Show how to identify the form of a full conditional by examining
    the kernel (terms containing the parameter).
    """
    print("=== Identifying Full Conditional Forms ===")
    examples = [
        {
            "model": "Y ~ Binomial(n, p), p ~ Beta(a, b)",
            "parameter": "p",
            "kernel": "p^y (1-p)^{n-y} * p^{a-1} (1-p)^{b-1}",
            "form": "p^{y+a-1} (1-p)^{n-y+b-1}",
            "distribution": "Beta(y + a, n - y + b)"
        },
        {
            "model": "Y ~ Poisson(λ), λ ~ Gamma(a, b)",
            "parameter": "λ",
            "kernel": "λ^y e^{-λ} * λ^{a-1} e^{-bλ}",
            "form": "λ^{y+a-1} e^{-(1+b)λ}",
            "distribution": "Gamma(y + a, 1 + b)"
        },
        {
            "model": "Y ~ N(μ, σ²), μ ~ N(μ₀, τ²), σ² known",
            "parameter": "μ",
            "kernel": "exp(-(y-μ)²/2σ²) * exp(-(μ-μ₀)²/2τ²)",
            "form": "Quadratic in μ → complete the square",
            "distribution": "N(weighted_mean, combined_variance)"
        }
    ]
    for ex in examples:
        print(f"Model: {ex['model']}")
        print(f"  Parameter: {ex['parameter']}")
        print(f"  Kernel: {ex['kernel']}")
        print(f"  → Full conditional: {ex['distribution']}")

identify_full_conditional_form()
```

With full conditionals in hand, the Gibbs sampling algorithm is straightforward.
Systematic scan Gibbs sampler:
Initialize θ^(0) = (θ_1^(0), θ_2^(0), ..., θ_d^(0))
For t = 1, 2, ..., T:
Draw θ_1^(t) ~ p(θ_1 | θ_2^(t-1), θ_3^(t-1), ..., θ_d^(t-1), D)
Draw θ_2^(t) ~ p(θ_2 | θ_1^(t), θ_3^(t-1), ..., θ_d^(t-1), D)
Draw θ_3^(t) ~ p(θ_3 | θ_1^(t), θ_2^(t), ..., θ_d^(t-1), D)
...
Draw θ_d^(t) ~ p(θ_d | θ_1^(t), θ_2^(t), ..., θ_{d-1}^(t), D)
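The pseudocode above can be wrapped in a small generic driver, sketched here with each full conditional supplied as a function of the current state (the bivariate-normal conditionals used to exercise it are just an illustration):

```python
import numpy as np

def systematic_scan_gibbs(conditionals, theta0, n_iter, rng=None):
    """Run a systematic-scan Gibbs sampler.

    conditionals[j](theta, rng) must return one draw from
    p(theta_j | theta_{-j}, D); theta holds the current state,
    with components < j already updated this sweep.
    """
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    out = np.empty((n_iter, len(theta)))
    for t in range(n_iter):
        for j, draw in enumerate(conditionals):
            theta[j] = draw(theta, rng)   # uses freshest values of the others
        out[t] = theta
    return out

# Exercise it on a standard bivariate normal with correlation rho = 0.5
rho = 0.5
conds = [
    lambda th, rng: rng.normal(rho * th[1], np.sqrt(1 - rho**2)),
    lambda th, rng: rng.normal(rho * th[0], np.sqrt(1 - rho**2)),
]
chain = systematic_scan_gibbs(conds, [0.0, 0.0], 20_000,
                              rng=np.random.default_rng(1))
print(np.corrcoef(chain[1000:].T)[0, 1])  # should be close to rho
```

The driver knows nothing about the model; all structure lives in the list of conditional samplers.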
Key observations:
• Each draw conditions on the most recent values of the other parameters (note the mix of superscripts $(t)$ and $(t-1)$ above).
• Every draw is accepted; there is no proposal distribution to design or tune.
Why Gibbs is a special case of MH:
Gibbs sampling is Metropolis-Hastings with a clever proposal: propose $\theta_j'$ from the full conditional $p(\theta_j | \theta_{-j}, \mathcal{D})$. The MH acceptance ratio becomes:
$$\alpha = \frac{\pi(\theta')\, q(\theta | \theta')}{\pi(\theta)\, q(\theta' | \theta)} = \frac{p(\theta'_{-j})\, p(\theta'_j | \theta'_{-j}) \cdot p(\theta_j | \theta'_{-j})}{p(\theta_{-j})\, p(\theta_j | \theta_{-j}) \cdot p(\theta'_j | \theta'_{-j})} = 1$$
Since $\theta'_{-j} = \theta_{-j}$, everything cancels! The acceptance probability is always 1.
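This cancellation can be checked numerically: treat one full-conditional draw as an MH proposal and compute the log acceptance ratio, which comes out as zero up to floating-point error. A sketch on a standard bivariate normal (the state and correlation are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.8
cov = np.array([[1, rho], [rho, 1]])
log_joint = lambda th: stats.multivariate_normal.logpdf(th, mean=[0, 0], cov=cov)

# Current state; propose a new theta_1 from its full conditional given theta_2
theta = np.array([0.3, -1.2])
cond_mean, cond_sd = rho * theta[1], np.sqrt(1 - rho**2)
theta_new = theta.copy()
theta_new[0] = rng.normal(cond_mean, cond_sd)

# log alpha = log pi(theta') + log q(theta|theta') - log pi(theta) - log q(theta'|theta)
log_q = lambda x: stats.norm.logpdf(x, cond_mean, cond_sd)
log_alpha = (log_joint(theta_new) + log_q(theta[0])
             - log_joint(theta) - log_q(theta_new[0]))
print(f"log acceptance ratio: {log_alpha:.2e}")  # ~ 0
```

Since $\theta_2$ is unchanged, both $q$ terms are evaluations of the same full conditional, and everything cancels exactly as in the algebra above.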
Systematic scan: Update θ₁, θ₂, ..., θ_d in fixed order each iteration.
Random scan: Each step, choose a random parameter j uniformly and update θⱼ.
Both produce valid MCMC chains with the correct stationary distribution. Systematic scan is more common and often more efficient. Random scan has stronger theoretical guarantees (automatic reversibility) but requires d updates per 'iteration' for fair comparison.
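A random-scan sampler differs from the systematic version only in choosing the coordinate uniformly at each step. A sketch, again using bivariate-normal conditionals for illustration:

```python
import numpy as np

def random_scan_gibbs(conditionals, theta0, n_steps, rng=None):
    """Each step updates ONE randomly chosen coordinate from its
    full conditional (d such steps ~ one systematic sweep)."""
    rng = rng or np.random.default_rng()
    d = len(conditionals)
    theta = np.array(theta0, dtype=float)
    out = np.empty((n_steps, d))
    for t in range(n_steps):
        j = rng.integers(d)                 # uniform coordinate choice
        theta[j] = conditionals[j](theta, rng)
        out[t] = theta
    return out

rho = 0.5
conds = [
    lambda th, rng: rng.normal(rho * th[1], np.sqrt(1 - rho**2)),
    lambda th, rng: rng.normal(rho * th[0], np.sqrt(1 - rho**2)),
]
# 40_000 single-coordinate steps ~ 20_000 systematic sweeps for d = 2
chain = random_scan_gibbs(conds, [0.0, 0.0], 40_000,
                          rng=np.random.default_rng(2))
print(np.corrcoef(chain[2000:].T)[0, 1])  # close to rho
```

Note the step count is doubled so the comparison with a systematic sweep is fair, as discussed above.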
```python
import numpy as np

def gibbs_sampler_bivariate_normal(mu, Sigma, n_samples, x0=None):
    """
    Gibbs sampler for a bivariate normal distribution.
    Demonstrates component-wise sampling from full conditionals.

    Parameters
    ----------
    mu : array [mu1, mu2]
        Mean vector
    Sigma : 2x2 array
        Covariance matrix
    """
    mu1, mu2 = mu
    sigma1 = np.sqrt(Sigma[0, 0])
    sigma2 = np.sqrt(Sigma[1, 1])
    rho = Sigma[0, 1] / (sigma1 * sigma2)

    # Full conditional parameters
    # x1 | x2 ~ N(mu1 + rho*(s1/s2)*(x2-mu2), s1^2*(1-rho^2))
    # x2 | x1 ~ N(mu2 + rho*(s2/s1)*(x1-mu1), s2^2*(1-rho^2))
    cond_var1 = sigma1**2 * (1 - rho**2)
    cond_var2 = sigma2**2 * (1 - rho**2)

    # Initialize
    if x0 is None:
        x1, x2 = 0.0, 0.0
    else:
        x1, x2 = x0

    samples = []
    for t in range(n_samples):
        # Sample x1 | x2
        cond_mean1 = mu1 + rho * (sigma1/sigma2) * (x2 - mu2)
        x1 = np.random.normal(cond_mean1, np.sqrt(cond_var1))

        # Sample x2 | x1 (using the updated x1)
        cond_mean2 = mu2 + rho * (sigma2/sigma1) * (x1 - mu1)
        x2 = np.random.normal(cond_mean2, np.sqrt(cond_var2))

        samples.append([x1, x2])

    return np.array(samples)

# Test on a correlated bivariate normal
mu = [2, 3]
rho = 0.8
Sigma = np.array([[1, rho], [rho, 1]])

samples = gibbs_sampler_bivariate_normal(mu, Sigma, 10000)

print("=== Gibbs Sampler: Bivariate Normal ===")
print(f"Target: N([{mu[0]}, {mu[1]}], [[1, {rho}], [{rho}, 1]])")
print()
print(f"Sample mean: [{np.mean(samples[:,0]):.4f}, {np.mean(samples[:,1]):.4f}]")
print(f"True mean:   [{mu[0]}, {mu[1]}]")
print()
print(f"Sample covariance:\n{np.cov(samples.T)}")
print(f"True covariance:\n{Sigma}")

# Compute autocorrelation to check mixing
from scipy.signal import correlate

def autocorr(x, maxlag=50):
    result = correlate(x - np.mean(x), x - np.mean(x), mode='full')
    result = result[len(result)//2:]
    return result[:maxlag] / result[0]

acf_x1 = autocorr(samples[:, 0])
print("Autocorrelation of x1 at lags 1, 5, 10:")
print(f"  lag 1:  {acf_x1[1]:.4f}")
print(f"  lag 5:  {acf_x1[5]:.4f}")
print(f"  lag 10: {acf_x1[10]:.4f}")
print(f"Note: High correlation (ρ={rho}) leads to slow mixing.")
```

Let's implement Gibbs sampling for a classic hierarchical model, showcasing the power of exploiting conditional structure.
The Model:
$$\begin{aligned} \mu &\sim \mathcal{N}(\mu_0, \tau_0^2) && \text{(prior on overall mean)} \\ \sigma^2 &\sim \text{Inv-Gamma}(a_0, b_0) && \text{(prior on variance)} \\ X_i | \mu, \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, n && \text{(likelihood)} \end{aligned}$$
Full Conditionals:
1. Full conditional for $\mu$: $$\mu | \sigma^2, X \sim \mathcal{N}\left(\frac{\frac{\mu_0}{\tau_0^2} + \frac{n\bar{X}}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right)$$
2. Full conditional for $\sigma^2$: $$\sigma^2 | \mu, X \sim \text{Inv-Gamma}\left(a_0 + \frac{n}{2}, b_0 + \frac{1}{2}\sum_{i=1}^n(X_i - \mu)^2\right)$$
Both are standard distributions we can sample directly!
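One practical detail: NumPy has no inverse-gamma sampler, so the $\sigma^2$ draw is usually done by inverting a Gamma draw, with the rate parameter $b$ passed as `scale = 1/b`. The sketch below checks this against `scipy.stats.invgamma` (the parameter values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 6.0, 5.0   # illustrative Inv-Gamma shape and scale

# If Y ~ Gamma(a, rate=b) then 1/Y ~ Inv-Gamma(a, b);
# NumPy's gamma takes a scale, so pass scale = 1/b
draws = 1.0 / rng.gamma(a, 1.0 / b, size=200_000)

# scipy parameterizes Inv-Gamma(a, b) as invgamma(a, scale=b)
print(f"sample mean {draws.mean():.3f} vs analytic {stats.invgamma.mean(a, scale=b):.3f}")
print(f"sample var  {draws.var():.3f} vs analytic {stats.invgamma.var(a, scale=b):.3f}")
```

The same reciprocal-Gamma trick appears inside the sampler below.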
```python
import numpy as np

def gibbs_normal_hierarchical(X, mu0, tau0_sq, a0, b0, n_samples, burn_in=1000):
    """
    Gibbs sampler for the hierarchical normal model:
        mu ~ N(mu0, tau0^2)
        sigma^2 ~ Inv-Gamma(a0, b0)
        X_i | mu, sigma^2 ~ N(mu, sigma^2)

    Returns posterior samples for (mu, sigma^2).
    """
    n = len(X)
    X_sum = np.sum(X)

    # Initialize
    mu = np.mean(X)
    sigma_sq = np.var(X) if np.var(X) > 0 else 1.0

    samples = []
    for t in range(n_samples + burn_in):
        # --- Sample mu | sigma^2, X ---
        # Posterior precision = prior precision + data precision
        precision_mu = 1/tau0_sq + n/sigma_sq
        var_mu = 1 / precision_mu
        # Posterior mean = weighted average
        mean_mu = var_mu * (mu0/tau0_sq + X_sum/sigma_sq)
        mu = np.random.normal(mean_mu, np.sqrt(var_mu))

        # --- Sample sigma^2 | mu, X ---
        # Inv-Gamma posterior
        a_post = a0 + n / 2
        b_post = b0 + 0.5 * np.sum((X - mu)**2)
        # Sample from Inv-Gamma: if Y ~ Gamma(a, rate=b), then 1/Y ~ Inv-Gamma(a, b)
        sigma_sq = 1 / np.random.gamma(a_post, 1/b_post)

        if t >= burn_in:
            samples.append([mu, sigma_sq])

    return np.array(samples)

# Generate synthetic data
np.random.seed(42)
true_mu = 5.0
true_sigma_sq = 2.0
n_obs = 50
X = np.random.normal(true_mu, np.sqrt(true_sigma_sq), n_obs)

print("=== Gibbs Sampling: Hierarchical Normal Model ===")
print(f"True parameters: μ = {true_mu}, σ² = {true_sigma_sq}")
print(f"Data: n = {n_obs}, mean = {np.mean(X):.3f}, var = {np.var(X):.3f}")
print()

# Weak priors
mu0 = 0
tau0_sq = 100   # Vague prior on mu
a0 = 0.01       # Vague prior on sigma^2
b0 = 0.01

# Run Gibbs sampler
samples = gibbs_normal_hierarchical(X, mu0, tau0_sq, a0, b0,
                                    n_samples=10000, burn_in=1000)

mu_samples = samples[:, 0]
sigma_sq_samples = samples[:, 1]

print("Posterior summaries:")
print(f"  μ:  mean = {np.mean(mu_samples):.4f}, "
      f"95% CI = [{np.percentile(mu_samples, 2.5):.4f}, {np.percentile(mu_samples, 97.5):.4f}]")
print(f"  σ²: mean = {np.mean(sigma_sq_samples):.4f}, "
      f"95% CI = [{np.percentile(sigma_sq_samples, 2.5):.4f}, {np.percentile(sigma_sq_samples, 97.5):.4f}]")

# Transform to standard deviation
sigma_samples = np.sqrt(sigma_sq_samples)
print(f"  σ:  mean = {np.mean(sigma_samples):.4f}, "
      f"95% CI = [{np.percentile(sigma_samples, 2.5):.4f}, {np.percentile(sigma_samples, 97.5):.4f}]")

# Check mixing with a crude effective-sample-size estimate
acf_sum = sum(np.corrcoef(mu_samples[:-k], mu_samples[k:])[0, 1]
              for k in range(1, min(100, len(mu_samples)//2)))
print("Effective Sample Size:")
print(f"  μ: ESS ≈ {len(mu_samples) / (1 + 2 * acf_sum):.0f}")
```

Pure component-wise Gibbs sampling updates one parameter at a time. When parameters are highly correlated, this leads to slow mixing—the chain takes small, correlated steps.
The correlation problem:
Consider a bivariate normal with correlation $\rho = 0.99$. Component-wise Gibbs can only move parallel to the axes, but the posterior is a thin ellipse at 45°. The chain zigzags slowly along the ellipse.
Solution: Block updates
Block Gibbs sampling groups correlated parameters and updates them jointly:
$$\theta = (\theta_{B_1}, \theta_{B_2}, \ldots, \theta_{B_K})$$
where each $B_k$ is a block of parameters updated together from: $$p(\theta_{B_k} | \theta_{-B_k}, \mathcal{D})$$
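For a Gaussian target, the block conditional is itself Gaussian, with mean and covariance given by the standard conditioning formulas (the conditional covariance is a Schur complement). A sketch for a hypothetical 3-dimensional target, updating the correlated pair $(\theta_1, \theta_2)$ jointly given $\theta_3$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-d Gaussian target; block B = {0, 1} updated jointly given {2}
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
B, R = [0, 1], [2]

S_BB = Sigma[np.ix_(B, B)]
S_BR = Sigma[np.ix_(B, R)]
S_RR = Sigma[np.ix_(R, R)]
# Conditional covariance of the block (Schur complement)
V = S_BB - S_BR @ np.linalg.solve(S_RR, S_BR.T)

def sample_block(theta_R):
    """One joint draw of theta_B from p(theta_B | theta_R)."""
    m = mu[B] + S_BR @ np.linalg.solve(S_RR, theta_R - mu[R])
    return rng.multivariate_normal(m, V)

# Check the conditional mean empirically at a fixed theta_3
theta_R = np.array([2.5])
m_expected = mu[B] + S_BR @ np.linalg.solve(S_RR, theta_R - mu[R])
draws = np.array([sample_block(theta_R) for _ in range(20_000)])
print(draws.mean(axis=0), "vs analytic", m_expected)
```

Within a full Gibbs sweep, `sample_block` would replace two separate scalar updates with one joint draw for the correlated pair.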
Choosing blocks: group parameters that are strongly correlated in the posterior, and keep each block small enough that its joint conditional remains tractable to sample.
Blocked updates may require MH:
Block conditionals aren't always available in closed form. In such cases, use Metropolis-Hastings for the block update. This hybrid—Metropolis-within-Gibbs—is extremely common in practice.
An alternative to blocking: transform parameters to reduce posterior correlation. For example, if θ₁ and θ₂ are correlated, define φ = θ₁ + θ₂ and ψ = θ₁ - θ₂. If the transform decorrelates the posterior, component-wise Gibbs on (φ, ψ) mixes faster. This is the idea behind non-centered parameterizations in hierarchical models.
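For a bivariate normal with equal variances this rotation decorrelates exactly: $\text{Var}(\phi) = 2(1+\rho)$, $\text{Var}(\psi) = 2(1-\rho)$, and $\text{Cov}(\phi, \psi) = 0$, so Gibbs on $(\phi, \psi)$ reduces to independent draws. A sketch comparing lag-1 autocorrelation, with $\rho = 0.99$ chosen to match the correlated example above:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.99, 20_000
cvar = 1 - rho**2

def lag1(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Component-wise Gibbs on the original, highly correlated (theta1, theta2)
t1 = t2 = 0.0
chain = np.empty(n)
for i in range(n):
    t1 = rng.normal(rho * t2, np.sqrt(cvar))
    t2 = rng.normal(rho * t1, np.sqrt(cvar))
    chain[i] = t1

# After rotating to phi = t1 + t2, psi = t1 - t2, the target factorizes:
# Var(phi) = 2(1+rho), Var(psi) = 2(1-rho), Cov(phi, psi) = 0,
# so Gibbs on (phi, psi) is just independent sampling
phi = rng.normal(0, np.sqrt(2 * (1 + rho)), n)
psi = rng.normal(0, np.sqrt(2 * (1 - rho)), n)
t1_rot = (phi + psi) / 2   # map back to the original parameterization

print(f"lag-1 autocorr, component-wise:  {lag1(chain):.3f}")
print(f"lag-1 autocorr, reparameterized: {lag1(t1_rot):.3f}")
```

The component-wise chain is nearly frozen (lag-1 autocorrelation close to $\rho^2$), while the reparameterized sampler produces effectively independent draws.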
```python
import numpy as np
import time

def compare_componentwise_vs_blocked():
    """
    Compare mixing of component-wise vs blocked Gibbs
    for a highly correlated bivariate normal.
    """
    # Highly correlated target
    mu = [0, 0]
    rho = 0.99
    Sigma = np.array([[1, rho], [rho, 1]])
    n_samples = 10000

    # --- Component-wise Gibbs ---
    cond_var = 1 - rho**2
    samples_cw = np.zeros((n_samples, 2))
    x1, x2 = 0.0, 0.0

    start = time.time()
    for t in range(n_samples):
        # Update x1 | x2
        x1 = np.random.normal(rho * x2, np.sqrt(cond_var))
        # Update x2 | x1
        x2 = np.random.normal(rho * x1, np.sqrt(cond_var))
        samples_cw[t] = [x1, x2]
    time_cw = time.time() - start

    # --- Blocked (joint) sampling ---
    # Sample both components jointly from the bivariate normal
    start = time.time()
    samples_blocked = np.random.multivariate_normal(mu, Sigma, n_samples)
    time_blocked = time.time() - start

    # Compute lag-1 autocorrelation
    def autocorr_lag1(samples):
        return np.corrcoef(samples[:-1], samples[1:])[0, 1]

    acf1_cw = autocorr_lag1(samples_cw[:, 0])
    acf1_blocked = autocorr_lag1(samples_blocked[:, 0])

    # Estimate ESS by summing autocorrelations until they become negligible
    def estimate_ess(samples):
        n = len(samples)
        acf_sum = 0
        for lag in range(1, min(n//3, 500)):
            ac = np.corrcoef(samples[:-lag], samples[lag:])[0, 1]
            if ac < 0.05:
                break
            acf_sum += ac
        return n / (1 + 2 * acf_sum)

    ess_cw = estimate_ess(samples_cw[:, 0])
    ess_blocked = estimate_ess(samples_blocked[:, 0])

    print("=== Component-wise vs Blocked Gibbs ===")
    print(f"Target: Bivariate Normal with ρ = {rho} (highly correlated)")
    print(f"N = {n_samples} samples")
    print()
    print("                      | Component-wise | Blocked (joint)")
    print("-" * 60)
    print(f"Lag-1 autocorrelation | {acf1_cw:14.4f} | {acf1_blocked:.4f}")
    print(f"Effective Sample Size | {ess_cw:14.0f} | {ess_blocked:.0f}")
    print(f"ESS per second        | {ess_cw/time_cw:14.0f} | {ess_blocked/time_blocked:.0f}")
    print()
    print(f"Blocked sampling achieves {ess_blocked/ess_cw:.0f}x higher ESS!")
    print("This demonstrates the importance of blocking correlated parameters.")

compare_componentwise_vs_blocked()
```

A powerful technique closely related to Gibbs sampling is data augmentation—introducing latent (auxiliary) variables that simplify the conditional structure.
The idea:
Sometimes the posterior $p(\theta | \mathcal{D})$ has intractable conditionals, but we can introduce auxiliary variables $Z$ such that the augmented conditionals $p(\theta | Z, \mathcal{D})$ and $p(Z | \theta, \mathcal{D})$ are both easy to sample, while marginalizing out $Z$ recovers the original posterior.
Classic example: Mixture models
For a Gaussian mixture model: $$p(x_i | \pi, \mu, \sigma) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i; \mu_k, \sigma_k^2)$$
Direct inference is hard because the sum prevents closed-form conditionals.
Solution: Introduce latent component assignments $Z_i \in \{1, \ldots, K\}$:
$$p(Z_i = k | x_i, \pi, \mu, \sigma) \propto \pi_k \mathcal{N}(x_i; \mu_k, \sigma_k^2)$$
Given $Z$, each observation is assigned to one component, and updating $(\mu_k, \sigma_k)$ becomes a standard normal-normal conjugate problem.
The Gibbs sampler for mixtures alternates three updates: draw each assignment $Z_i$ from its categorical conditional, draw the weights $\pi$ from a Dirichlet given the component counts, and draw each mean $\mu_k$ from its normal-normal conjugate conditional given the observations assigned to component $k$.
Data augmentation isn't cheating—the auxiliary variables have a genuine interpretation. In mixture models, Z_i represents 'which component generated observation i.' We're not just adding variables for convenience; we're making explicit the hidden structure that the model implies. The marginal posterior over θ is exactly what we want; we simply discard the Z samples.
```python
import numpy as np
from scipy import stats

def gibbs_gaussian_mixture(X, K, n_samples, burn_in=500):
    """
    Gibbs sampler for a K-component Gaussian mixture model.
    Uses data augmentation with latent component assignments.

    Model:
        pi ~ Dirichlet(1, ..., 1)
        mu_k ~ N(0, 10^2) for k = 1..K
        sigma_k^2 = 1 (fixed for simplicity)
        Z_i | pi ~ Categorical(pi)
        X_i | Z_i, mu ~ N(mu_{Z_i}, sigma_{Z_i}^2)
    """
    n = len(X)
    sigma_sq = 1.0  # Fixed variance

    # Initialize: spread initial means across the data range
    mu = np.linspace(np.min(X) - 1, np.max(X) + 1, K)
    pi = np.ones(K) / K
    Z = np.random.choice(K, n)

    samples = []
    for t in range(n_samples + burn_in):
        # --- Sample Z | X, mu, pi ---
        # p(Z_i = k | x_i, mu, pi) ∝ pi_k * N(x_i; mu_k, sigma^2)
        log_probs = np.zeros((n, K))
        for k in range(K):
            log_probs[:, k] = np.log(pi[k] + 1e-10) + stats.norm.logpdf(X, mu[k], np.sqrt(sigma_sq))
        # Normalize (softmax)
        log_probs -= np.max(log_probs, axis=1, keepdims=True)
        probs = np.exp(log_probs)
        probs /= probs.sum(axis=1, keepdims=True)
        # Sample from the categorical
        for i in range(n):
            Z[i] = np.random.choice(K, p=probs[i])

        # --- Sample pi | Z ---
        # Dirichlet posterior with counts
        counts = np.array([np.sum(Z == k) for k in range(K)])
        pi = np.random.dirichlet(1 + counts)

        # --- Sample mu_k | Z, X ---
        for k in range(K):
            X_k = X[Z == k]
            n_k = len(X_k)
            if n_k > 0:
                # Normal-Normal conjugate update
                prior_var = 100.0
                post_precision = 1/prior_var + n_k/sigma_sq
                post_var = 1 / post_precision
                post_mean = post_var * (0/prior_var + np.sum(X_k)/sigma_sq)
                mu[k] = np.random.normal(post_mean, np.sqrt(post_var))
            else:
                # No observations assigned; sample from the prior
                mu[k] = np.random.normal(0, 10)

        if t >= burn_in:
            samples.append({'mu': mu.copy(), 'pi': pi.copy(), 'Z': Z.copy()})

    return samples

# Generate data from a true mixture
np.random.seed(42)
true_mu = [-3, 0, 4]
true_pi = [0.3, 0.5, 0.2]
n_obs = 200

Z_true = np.random.choice(3, n_obs, p=true_pi)
X = np.array([np.random.normal(true_mu[z], 1) for z in Z_true])

print("=== Gibbs Sampling: Gaussian Mixture Model ===")
print(f"True parameters: mu = {true_mu}, pi = {true_pi}")
print(f"Data: n = {n_obs}")
print()

# Fit with Gibbs
samples = gibbs_gaussian_mixture(X, K=3, n_samples=2000, burn_in=500)

# Extract posterior samples for mu (accounting for label switching)
mu_samples = np.array([s['mu'] for s in samples])
pi_samples = np.array([s['pi'] for s in samples])

# Sort components by mean for identifiability
mu_sorted = np.sort(mu_samples, axis=1)

print("Posterior means for components (sorted):")
for k in range(3):
    print(f"  mu_{k+1}: mean = {np.mean(mu_sorted[:, k]):.3f}, "
          f"95% CI = [{np.percentile(mu_sorted[:, k], 2.5):.3f}, "
          f"{np.percentile(mu_sorted[:, k], 97.5):.3f}]")
print(f"True mu (sorted): {sorted(true_mu)}")
```

Gibbs sampling and Metropolis-Hastings are complementary, not competing approaches. Understanding when each excels helps design efficient samplers.
Gibbs sampling excels when:
• Full conditionals belong to standard families (conjugate structure)
• Parameters are only weakly correlated in the posterior
• The model has hierarchical or latent-variable structure that conditioning simplifies
Metropolis-Hastings is preferable when:
• Full conditionals are intractable or non-standard
• Parameters are strongly correlated and better proposed jointly
• You want a generic sampler without model-specific derivations
Hybrid approaches:
In practice, most samplers mix both: Gibbs updates for parameters with tractable conditionals, and MH steps (Metropolis-within-Gibbs) for the rest.
| Aspect | Gibbs | Metropolis-Hastings |
|---|---|---|
| Acceptance rate | 100% | Tunable (20-50% optimal) |
| Requires | Tractable conditionals | Density evaluation only |
| Proposal design | None (conditionals define it) | Critical for efficiency |
| Correlation handling | Slow for correlated params | Can propose jointly |
| Implementation | Model-specific | Generic algorithm |
| Software | BUGS, JAGS, Stan | Stan, PyMC, custom code |
Modern probabilistic programming languages (Stan, PyMC, NumPyro) automatically derive efficient samplers:
• Stan: Uses Hamiltonian Monte Carlo (gradient-based MH) by default
• JAGS/BUGS: Automatically identifies conjugate structure for Gibbs
• NumPyro: Combines HMC with automatic differentiation
These tools abstract away the choice between Gibbs and MH, selecting the best strategy for each parameter based on model structure.
Several practical issues arise when implementing Gibbs samplers for real problems.
Update order:
For systematic scan, the order of updates can affect mixing, although any fixed order yields a valid sampler.
Convergence assessment:
All standard MCMC diagnostics apply: trace plots, the Gelman-Rubin statistic ($\hat{R}$) across multiple chains, and effective sample size.
The label switching problem in mixtures:
In mixture models, the posterior is invariant to relabeling components. The chain may switch labels during sampling, causing problems for inference on component-specific parameters. Common remedies: impose an ordering constraint (e.g., $\mu_1 < \mu_2 < \cdots < \mu_K$), or relabel the samples post hoc, as the sorting step in the mixture example above does.
Sampling efficiency: each sweep requires $d$ conditional draws, and strong posterior correlation inflates autocorrelation, so judge a Gibbs sampler by effective sample size per second rather than raw iteration count.
Gibbs sampling exploits conditional structure to sample from complex distributions by iteratively updating each parameter from its full conditional. When conditionals are tractable, it provides an elegant, efficient alternative to general-purpose MH.
Module Complete:
You have now mastered the core sampling-based approaches to approximate Bayesian inference, from basic Monte Carlo estimation through Metropolis-Hastings to Gibbs sampling and data augmentation.
These methods form the computational backbone of Bayesian machine learning, enabling inference in complex hierarchical models, latent variable models, and beyond.
Congratulations! You have completed the Approximate Inference: Sampling module. You now have a comprehensive understanding of Monte Carlo methods for Bayesian computation—from the foundational principles to practical implementation. These techniques empower you to perform inference in virtually any probabilistic model, unlocking the full power of Bayesian machine learning.