Metropolis-Hastings updates all parameters simultaneously, which can be inefficient when the joint distribution has special structure that we're not exploiting. Many Bayesian models—particularly hierarchical models—have conditional distributions that are easy to sample from, even when the joint is complex.
Gibbs sampling leverages this structure. Instead of proposing a move for all parameters at once, we update each parameter (or group) one at a time, sampling from its full conditional distribution—the distribution of that parameter given all others and the data.
The key insight: if every full conditional is easy to sample, we can cycle through the parameters, drawing each from its conditional in turn; the resulting Markov chain has the joint posterior as its stationary distribution.
Named after the physicist Josiah Willard Gibbs (who studied statistical mechanics), Gibbs sampling became the foundation of practical Bayesian computation in the 1990s, powering software like BUGS and WinBUGS that made Bayesian inference accessible to applied researchers.
This page develops Gibbs sampling from first principles. You will understand full conditional distributions, see the algorithm for both systematic and random scan, learn blocking strategies for faster mixing, and understand when Gibbs excels versus when other methods are preferable.
The full conditional distribution of a parameter $\theta_j$ is its conditional distribution given all other parameters $\theta_{-j}$ and the data $\mathcal{D}$:
$$p(\theta_j | \theta_{-j}, \mathcal{D})$$
where $\theta_{-j} = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_d)$ denotes all parameters except $\theta_j$.
Deriving full conditionals:
The full conditional is proportional to the joint posterior, viewed as a function of $\theta_j$ only:
$$p(\theta_j | \theta_{-j}, \mathcal{D}) \propto p(\theta_1, \ldots, \theta_d | \mathcal{D})$$
with $\theta_{-j}$ treated as fixed constants. Terms not involving $\theta_j$ become part of the proportionality constant.
Example: Multivariate Gaussian
For $(\theta_1, \theta_2) \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right)$
The full conditional of $\theta_1$ given $\theta_2$ is: $$\theta_1 | \theta_2 \sim \mathcal{N}\left(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(\theta_2 - \mu_2), \sigma_1^2(1 - \rho^2)\right)$$
A univariate Gaussian—easy to sample!
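This is easy to check numerically: draw from the joint, keep only the draws where $\theta_2$ lands near a fixed value, and compare against the analytic conditional. A quick sketch (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 1.5, 0.5, 0.7

# Draw from the joint, then keep draws where theta_2 falls near a fixed value
joint = rng.multivariate_normal(
    [mu1, mu2],
    [[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]],
    size=500_000,
)
theta2_star = -1.5
near = joint[np.abs(joint[:, 1] - theta2_star) < 0.01, 0]

# Analytic full conditional of theta_1 | theta_2 = theta2_star
cond_mean = mu1 + rho * (s1 / s2) * (theta2_star - mu2)
cond_var = s1**2 * (1 - rho**2)

print(f"empirical mean {near.mean():.3f} vs analytic {cond_mean:.3f}")
print(f"empirical var  {near.var():.3f} vs analytic {cond_var:.3f}")
```

The empirical mean and variance of the retained slice agree with the conditional-mean and conditional-variance formulas above.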
The power of Gibbs sampling comes from recognizing that full conditionals often belong to standard families:
• Gaussian likelihood + Gaussian prior → Gaussian posterior (conjugate)
• Binomial likelihood + Beta prior → Beta posterior (conjugate)
• Gaussian with unknown precision → Gamma posterior for precision
• Mixture model → Multinomial for component membership
Once you identify the kernel (the part involving θⱼ), match it to a known distribution.
```python
import numpy as np
from scipy import stats

def derive_normal_normal_conditional():
    """
    Demonstrate deriving full conditionals for a simple hierarchical model.

    Model:
        mu ~ N(0, tau^2)          -- prior on mean
        X_i | mu ~ N(mu, sigma^2) -- likelihood for n observations

    Full conditional of mu | X, sigma^2:
        mu | X ~ N(posterior_mean, posterior_var)
    """
    # Hyperparameters
    tau_sq = 10.0    # Prior variance
    sigma_sq = 1.0   # Likelihood variance

    # Data
    X = np.array([4.2, 3.8, 5.1, 4.5, 4.9])
    n = len(X)
    x_bar = np.mean(X)

    # Derive full conditional
    # p(mu | X) ∝ p(X | mu) * p(mu)
    #           ∝ exp(-sum(X_i - mu)^2 / 2sigma^2) * exp(-mu^2 / 2tau^2)
    # Complete the square in mu...

    # Posterior precision = prior precision + data precision
    posterior_precision = 1/tau_sq + n/sigma_sq
    posterior_var = 1 / posterior_precision

    # Posterior mean = weighted average of prior mean (0) and data mean
    posterior_mean = posterior_var * (0/tau_sq + n*x_bar/sigma_sq)

    print("=== Normal-Normal Conjugate Model ===")
    print(f"Prior: mu ~ N(0, {tau_sq})")
    print(f"Likelihood: X_i | mu ~ N(mu, {sigma_sq})")
    print(f"Data: n={n}, mean={x_bar:.2f}")
    print()
    print(f"Full conditional: mu | X ~ N({posterior_mean:.4f}, {posterior_var:.4f})")
    print(f"  Posterior mean: {posterior_mean:.4f}")
    print(f"  Posterior std:  {np.sqrt(posterior_var):.4f}")

    # Sample from the full conditional
    samples = np.random.normal(posterior_mean, np.sqrt(posterior_var), 10000)
    print("Verification (10000 samples):")
    print(f"  Sample mean: {np.mean(samples):.4f}")
    print(f"  Sample std:  {np.std(samples):.4f}")

derive_normal_normal_conditional()

def identify_full_conditional_form():
    """
    Show how to identify the form of a full conditional by examining
    the kernel (terms containing the parameter).
    """
    print("=== Identifying Full Conditional Forms ===")
    examples = [
        {
            "model": "Y ~ Binomial(n, p), p ~ Beta(a, b)",
            "parameter": "p",
            "kernel": "p^y (1-p)^{n-y} * p^{a-1} (1-p)^{b-1}",
            "form": "p^{y+a-1} (1-p)^{n-y+b-1}",
            "distribution": "Beta(y + a, n - y + b)"
        },
        {
            "model": "Y ~ Poisson(λ), λ ~ Gamma(a, b)",
            "parameter": "λ",
            "kernel": "λ^y e^{-λ} * λ^{a-1} e^{-bλ}",
            "form": "λ^{y+a-1} e^{-(1+b)λ}",
            "distribution": "Gamma(y + a, 1 + b)"
        },
        {
            "model": "Y ~ N(μ, σ²), μ ~ N(μ₀, τ²), σ² known",
            "parameter": "μ",
            "kernel": "exp(-(y-μ)²/2σ²) * exp(-(μ-μ₀)²/2τ²)",
            "form": "Quadratic in μ → complete the square",
            "distribution": "N(weighted_mean, combined_variance)"
        }
    ]
    for ex in examples:
        print(f"Model: {ex['model']}")
        print(f"  Parameter: {ex['parameter']}")
        print(f"  Kernel: {ex['kernel']}")
        print(f"  → Full conditional: {ex['distribution']}")

identify_full_conditional_form()
```

With full conditionals in hand, the Gibbs sampling algorithm is straightforward.
Systematic scan Gibbs sampler:
Initialize θ^(0) = (θ_1^(0), θ_2^(0), ..., θ_d^(0))
For t = 1, 2, ..., T:
Draw θ_1^(t) ~ p(θ_1 | θ_2^(t-1), θ_3^(t-1), ..., θ_d^(t-1), D)
Draw θ_2^(t) ~ p(θ_2 | θ_1^(t), θ_3^(t-1), ..., θ_d^(t-1), D)
Draw θ_3^(t) ~ p(θ_3 | θ_1^(t), θ_2^(t), ..., θ_d^(t-1), D)
...
Draw θ_d^(t) ~ p(θ_d | θ_1^(t), θ_2^(t), ..., θ_{d-1}^(t), D)
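The pseudocode above can be wrapped in a small generic driver, sketched here with each full conditional supplied as a function of the current state (the bivariate-normal conditionals used to exercise it are just an illustration):

```python
import numpy as np

def systematic_scan_gibbs(conditionals, theta0, n_iter, rng=None):
    """Run a systematic-scan Gibbs sampler.

    conditionals[j](theta, rng) must return one draw from
    p(theta_j | theta_{-j}, D); theta holds the current state,
    with components < j already updated this sweep.
    """
    rng = rng or np.random.default_rng()
    theta = np.array(theta0, dtype=float)
    out = np.empty((n_iter, len(theta)))
    for t in range(n_iter):
        for j, draw in enumerate(conditionals):
            theta[j] = draw(theta, rng)   # uses freshest values of the others
        out[t] = theta
    return out

# Exercise it on a standard bivariate normal with correlation rho = 0.5
rho = 0.5
conds = [
    lambda th, rng: rng.normal(rho * th[1], np.sqrt(1 - rho**2)),
    lambda th, rng: rng.normal(rho * th[0], np.sqrt(1 - rho**2)),
]
chain = systematic_scan_gibbs(conds, [0.0, 0.0], 20_000,
                              rng=np.random.default_rng(1))
print(np.corrcoef(chain[1000:].T)[0, 1])  # should be close to rho
```

The driver knows nothing about the model; all structure lives in the list of conditional samplers.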
Key observations:
• Each draw conditions on the most recent values of the other parameters (note the mix of superscripts $(t)$ and $(t-1)$ above).
• Every draw is accepted; there is no proposal distribution to design or tune.
Why Gibbs is a special case of MH:
Gibbs sampling is Metropolis-Hastings with a clever proposal: propose $\theta_j'$ from the full conditional $p(\theta_j | \theta_{-j}, \mathcal{D})$. The MH acceptance ratio becomes:
$$\alpha = \frac{\pi(\theta')\, q(\theta | \theta')}{\pi(\theta)\, q(\theta' | \theta)} = \frac{p(\theta'_{-j})\, p(\theta'_j | \theta'_{-j}) \cdot p(\theta_j | \theta'_{-j})}{p(\theta_{-j})\, p(\theta_j | \theta_{-j}) \cdot p(\theta'_j | \theta'_{-j})} = 1$$
Since $\theta'_{-j} = \theta_{-j}$, everything cancels! The acceptance probability is always 1.
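This cancellation can be checked numerically: treat one full-conditional draw as an MH proposal and compute the log acceptance ratio, which comes out as zero up to floating-point error. A sketch on a standard bivariate normal (the state and correlation are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.8
cov = np.array([[1, rho], [rho, 1]])
log_joint = lambda th: stats.multivariate_normal.logpdf(th, mean=[0, 0], cov=cov)

# Current state; propose a new theta_1 from its full conditional given theta_2
theta = np.array([0.3, -1.2])
cond_mean, cond_sd = rho * theta[1], np.sqrt(1 - rho**2)
theta_new = theta.copy()
theta_new[0] = rng.normal(cond_mean, cond_sd)

# log alpha = log pi(theta') + log q(theta|theta') - log pi(theta) - log q(theta'|theta)
log_q = lambda x: stats.norm.logpdf(x, cond_mean, cond_sd)
log_alpha = (log_joint(theta_new) + log_q(theta[0])
             - log_joint(theta) - log_q(theta_new[0]))
print(f"log acceptance ratio: {log_alpha:.2e}")  # ~ 0
```

Since $\theta_2$ is unchanged, both $q$ terms are evaluations of the same full conditional, and everything cancels exactly as in the algebra above.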
Systematic scan: Update θ₁, θ₂, ..., θ_d in fixed order each iteration.
Random scan: Each step, choose a random parameter j uniformly and update θⱼ.
Both produce valid MCMC chains with the correct stationary distribution. Systematic scan is more common and often more efficient. Random scan has stronger theoretical guarantees (automatic reversibility) but requires d updates per 'iteration' for fair comparison.
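A random-scan sampler differs from the systematic version only in choosing the coordinate uniformly at each step. A sketch, again using bivariate-normal conditionals for illustration:

```python
import numpy as np

def random_scan_gibbs(conditionals, theta0, n_steps, rng=None):
    """Each step updates ONE randomly chosen coordinate from its
    full conditional (d such steps ~ one systematic sweep)."""
    rng = rng or np.random.default_rng()
    d = len(conditionals)
    theta = np.array(theta0, dtype=float)
    out = np.empty((n_steps, d))
    for t in range(n_steps):
        j = rng.integers(d)                 # uniform coordinate choice
        theta[j] = conditionals[j](theta, rng)
        out[t] = theta
    return out

rho = 0.5
conds = [
    lambda th, rng: rng.normal(rho * th[1], np.sqrt(1 - rho**2)),
    lambda th, rng: rng.normal(rho * th[0], np.sqrt(1 - rho**2)),
]
# 40_000 single-coordinate steps ~ 20_000 systematic sweeps for d = 2
chain = random_scan_gibbs(conds, [0.0, 0.0], 40_000,
                          rng=np.random.default_rng(2))
print(np.corrcoef(chain[2000:].T)[0, 1])  # close to rho
```

Note the step count is doubled so the comparison with a systematic sweep is fair, as discussed above.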
```python
import numpy as np

def gibbs_sampler_bivariate_normal(mu, Sigma, n_samples, x0=None):
    """
    Gibbs sampler for a bivariate normal distribution.
    Demonstrates component-wise sampling from full conditionals.

    Parameters
    ----------
    mu : array [mu1, mu2]
        Mean vector
    Sigma : 2x2 array
        Covariance matrix
    """
    mu1, mu2 = mu
    sigma1 = np.sqrt(Sigma[0, 0])
    sigma2 = np.sqrt(Sigma[1, 1])
    rho = Sigma[0, 1] / (sigma1 * sigma2)

    # Full conditional parameters
    # x1 | x2 ~ N(mu1 + rho*(s1/s2)*(x2-mu2), s1^2*(1-rho^2))
    # x2 | x1 ~ N(mu2 + rho*(s2/s1)*(x1-mu1), s2^2*(1-rho^2))
    cond_var1 = sigma1**2 * (1 - rho**2)
    cond_var2 = sigma2**2 * (1 - rho**2)

    # Initialize
    if x0 is None:
        x1, x2 = 0.0, 0.0
    else:
        x1, x2 = x0

    samples = []
    for t in range(n_samples):
        # Sample x1 | x2
        cond_mean1 = mu1 + rho * (sigma1/sigma2) * (x2 - mu2)
        x1 = np.random.normal(cond_mean1, np.sqrt(cond_var1))

        # Sample x2 | x1 (using the updated x1)
        cond_mean2 = mu2 + rho * (sigma2/sigma1) * (x1 - mu1)
        x2 = np.random.normal(cond_mean2, np.sqrt(cond_var2))

        samples.append([x1, x2])

    return np.array(samples)

# Test on a correlated bivariate normal
mu = [2, 3]
rho = 0.8
Sigma = np.array([[1, rho], [rho, 1]])

samples = gibbs_sampler_bivariate_normal(mu, Sigma, 10000)

print("=== Gibbs Sampler: Bivariate Normal ===")
print(f"Target: N([{mu[0]}, {mu[1]}], [[1, {rho}], [{rho}, 1]])")
print()
print(f"Sample mean: [{np.mean(samples[:,0]):.4f}, {np.mean(samples[:,1]):.4f}]")
print(f"True mean:   [{mu[0]}, {mu[1]}]")
print()
print(f"Sample covariance:\n{np.cov(samples.T)}")
print(f"True covariance:\n{Sigma}")

# Compute autocorrelation to check mixing
from scipy.signal import correlate

def autocorr(x, maxlag=50):
    result = correlate(x - np.mean(x), x - np.mean(x), mode='full')
    result = result[len(result)//2:]
    return result[:maxlag] / result[0]

acf_x1 = autocorr(samples[:, 0])
print("Autocorrelation of x1 at lags 1, 5, 10:")
print(f"  lag 1:  {acf_x1[1]:.4f}")
print(f"  lag 5:  {acf_x1[5]:.4f}")
print(f"  lag 10: {acf_x1[10]:.4f}")
print(f"Note: High correlation (ρ={rho}) leads to slow mixing.")
```

Let's implement Gibbs sampling for a classic hierarchical model, showcasing the power of exploiting conditional structure.
The Model:
$$\begin{aligned} \mu &\sim \mathcal{N}(\mu_0, \tau_0^2) && \text{(prior on overall mean)} \\ \sigma^2 &\sim \text{Inv-Gamma}(a_0, b_0) && \text{(prior on variance)} \\ X_i | \mu, \sigma^2 &\sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \ldots, n && \text{(likelihood)} \end{aligned}$$
Full Conditionals:
1. Full conditional for $\mu$: $$\mu | \sigma^2, X \sim \mathcal{N}\left(\frac{\frac{\mu_0}{\tau_0^2} + \frac{n\bar{X}}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}, \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right)$$
2. Full conditional for $\sigma^2$: $$\sigma^2 | \mu, X \sim \text{Inv-Gamma}\left(a_0 + \frac{n}{2}, b_0 + \frac{1}{2}\sum_{i=1}^n(X_i - \mu)^2\right)$$
Both are standard distributions we can sample directly!
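One practical detail: NumPy has no inverse-gamma sampler, so the $\sigma^2$ draw is usually done by inverting a Gamma draw, with the rate parameter $b$ passed as `scale = 1/b`. The sketch below checks this against `scipy.stats.invgamma` (the parameter values are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 6.0, 5.0   # illustrative Inv-Gamma shape and scale

# If Y ~ Gamma(a, rate=b) then 1/Y ~ Inv-Gamma(a, b);
# NumPy's gamma takes a scale, so pass scale = 1/b
draws = 1.0 / rng.gamma(a, 1.0 / b, size=200_000)

# scipy parameterizes Inv-Gamma(a, b) as invgamma(a, scale=b)
print(f"sample mean {draws.mean():.3f} vs analytic {stats.invgamma.mean(a, scale=b):.3f}")
print(f"sample var  {draws.var():.3f} vs analytic {stats.invgamma.var(a, scale=b):.3f}")
```

The same reciprocal-Gamma trick appears inside the sampler below.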
```python
import numpy as np

def gibbs_normal_hierarchical(X, mu0, tau0_sq, a0, b0, n_samples, burn_in=1000):
    """
    Gibbs sampler for the hierarchical normal model:
        mu ~ N(mu0, tau0^2)
        sigma^2 ~ Inv-Gamma(a0, b0)
        X_i | mu, sigma^2 ~ N(mu, sigma^2)

    Returns posterior samples for (mu, sigma^2).
    """
    n = len(X)
    X_sum = np.sum(X)

    # Initialize
    mu = np.mean(X)
    sigma_sq = np.var(X) if np.var(X) > 0 else 1.0

    samples = []
    for t in range(n_samples + burn_in):
        # --- Sample mu | sigma^2, X ---
        # Posterior precision = prior precision + data precision
        precision_mu = 1/tau0_sq + n/sigma_sq
        var_mu = 1 / precision_mu
        # Posterior mean = weighted average
        mean_mu = var_mu * (mu0/tau0_sq + X_sum/sigma_sq)
        mu = np.random.normal(mean_mu, np.sqrt(var_mu))

        # --- Sample sigma^2 | mu, X ---
        # Inv-Gamma posterior
        a_post = a0 + n / 2
        b_post = b0 + 0.5 * np.sum((X - mu)**2)
        # Sample from Inv-Gamma: if Y ~ Gamma(a, rate=b), then 1/Y ~ Inv-Gamma(a, b)
        sigma_sq = 1 / np.random.gamma(a_post, 1/b_post)

        if t >= burn_in:
            samples.append([mu, sigma_sq])

    return np.array(samples)

# Generate synthetic data
np.random.seed(42)
true_mu = 5.0
true_sigma_sq = 2.0
n_obs = 50
X = np.random.normal(true_mu, np.sqrt(true_sigma_sq), n_obs)

print("=== Gibbs Sampling: Hierarchical Normal Model ===")
print(f"True parameters: μ = {true_mu}, σ² = {true_sigma_sq}")
print(f"Data: n = {n_obs}, mean = {np.mean(X):.3f}, var = {np.var(X):.3f}")
print()

# Weak priors
mu0 = 0
tau0_sq = 100   # Vague prior on mu
a0 = 0.01       # Vague prior on sigma^2
b0 = 0.01

# Run Gibbs sampler
samples = gibbs_normal_hierarchical(X, mu0, tau0_sq, a0, b0,
                                    n_samples=10000, burn_in=1000)

mu_samples = samples[:, 0]
sigma_sq_samples = samples[:, 1]

print("Posterior summaries:")
print(f"  μ:  mean = {np.mean(mu_samples):.4f}, "
      f"95% CI = [{np.percentile(mu_samples, 2.5):.4f}, {np.percentile(mu_samples, 97.5):.4f}]")
print(f"  σ²: mean = {np.mean(sigma_sq_samples):.4f}, "
      f"95% CI = [{np.percentile(sigma_sq_samples, 2.5):.4f}, {np.percentile(sigma_sq_samples, 97.5):.4f}]")

# Transform to standard deviation
sigma_samples = np.sqrt(sigma_sq_samples)
print(f"  σ:  mean = {np.mean(sigma_samples):.4f}, "
      f"95% CI = [{np.percentile(sigma_samples, 2.5):.4f}, {np.percentile(sigma_samples, 97.5):.4f}]")

# Check mixing with a crude effective-sample-size estimate
acf_sum = sum(np.corrcoef(mu_samples[:-k], mu_samples[k:])[0, 1]
              for k in range(1, min(100, len(mu_samples)//2)))
print("Effective Sample Size:")
print(f"  μ: ESS ≈ {len(mu_samples) / (1 + 2 * acf_sum):.0f}")
```

Pure component-wise Gibbs sampling updates one parameter at a time. When parameters are highly correlated, this leads to slow mixing—the chain takes small, correlated steps.
The correlation problem:
Consider a bivariate normal with correlation $\rho = 0.99$. Component-wise Gibbs can only move parallel to the axes, but the posterior is a thin ellipse at 45°. The chain zigzags slowly along the ellipse.
Solution: Block updates
Block Gibbs sampling groups correlated parameters and updates them jointly:
$$\theta = (\theta_{B_1}, \theta_{B_2}, \ldots, \theta_{B_K})$$
where each $B_k$ is a block of parameters updated together from: $$p(\theta_{B_k} | \theta_{-B_k}, \mathcal{D})$$
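For a Gaussian target, the block conditional is itself Gaussian, with mean and covariance given by the standard conditioning formulas (the conditional covariance is a Schur complement). A sketch for a hypothetical 3-dimensional target, updating the correlated pair $(\theta_1, \theta_2)$ jointly given $\theta_3$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-d Gaussian target; block B = {0, 1} updated jointly given {2}
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
B, R = [0, 1], [2]

S_BB = Sigma[np.ix_(B, B)]
S_BR = Sigma[np.ix_(B, R)]
S_RR = Sigma[np.ix_(R, R)]
# Conditional covariance of the block (Schur complement)
V = S_BB - S_BR @ np.linalg.solve(S_RR, S_BR.T)

def sample_block(theta_R):
    """One joint draw of theta_B from p(theta_B | theta_R)."""
    m = mu[B] + S_BR @ np.linalg.solve(S_RR, theta_R - mu[R])
    return rng.multivariate_normal(m, V)

# Check the conditional mean empirically at a fixed theta_3
theta_R = np.array([2.5])
m_expected = mu[B] + S_BR @ np.linalg.solve(S_RR, theta_R - mu[R])
draws = np.array([sample_block(theta_R) for _ in range(20_000)])
print(draws.mean(axis=0), "vs analytic", m_expected)
```

Within a full Gibbs sweep, `sample_block` would replace two separate scalar updates with one joint draw for the correlated pair.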
Choosing blocks: group parameters that are strongly correlated in the posterior, and keep each block small enough that its joint conditional remains tractable to sample.
Blocked updates may require MH:
Block conditionals aren't always available in closed form. In such cases, use Metropolis-Hastings for the block update. This hybrid—Metropolis-within-Gibbs—is extremely common in practice.
An alternative to blocking: transform parameters to reduce posterior correlation. For example, if θ₁ and θ₂ are correlated, define φ = θ₁ + θ₂ and ψ = θ₁ - θ₂. If the transform decorrelates the posterior, component-wise Gibbs on (φ, ψ) mixes faster. This is the idea behind non-centered parameterizations in hierarchical models.
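For a bivariate normal with equal variances this rotation decorrelates exactly: $\text{Var}(\phi) = 2(1+\rho)$, $\text{Var}(\psi) = 2(1-\rho)$, and $\text{Cov}(\phi, \psi) = 0$, so Gibbs on $(\phi, \psi)$ reduces to independent draws. A sketch comparing lag-1 autocorrelation, with $\rho = 0.99$ chosen to match the correlated example above:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.99, 20_000
cvar = 1 - rho**2

def lag1(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

# Component-wise Gibbs on the original, highly correlated (theta1, theta2)
t1 = t2 = 0.0
chain = np.empty(n)
for i in range(n):
    t1 = rng.normal(rho * t2, np.sqrt(cvar))
    t2 = rng.normal(rho * t1, np.sqrt(cvar))
    chain[i] = t1

# After rotating to phi = t1 + t2, psi = t1 - t2, the target factorizes:
# Var(phi) = 2(1+rho), Var(psi) = 2(1-rho), Cov(phi, psi) = 0,
# so Gibbs on (phi, psi) is just independent sampling
phi = rng.normal(0, np.sqrt(2 * (1 + rho)), n)
psi = rng.normal(0, np.sqrt(2 * (1 - rho)), n)
t1_rot = (phi + psi) / 2   # map back to the original parameterization

print(f"lag-1 autocorr, component-wise:  {lag1(chain):.3f}")
print(f"lag-1 autocorr, reparameterized: {lag1(t1_rot):.3f}")
```

The component-wise chain is nearly frozen (lag-1 autocorrelation close to $\rho^2$), while the reparameterized sampler produces effectively independent draws.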
```python
import numpy as np
import time

def compare_componentwise_vs_blocked():
    """
    Compare mixing of component-wise vs blocked Gibbs
    for a highly correlated bivariate normal.
    """
    # Highly correlated target
    mu = [0, 0]
    rho = 0.99
    Sigma = np.array([[1, rho], [rho, 1]])
    n_samples = 10000

    # --- Component-wise Gibbs ---
    cond_var = 1 - rho**2
    samples_cw = np.zeros((n_samples, 2))
    x1, x2 = 0.0, 0.0

    start = time.time()
    for t in range(n_samples):
        # Update x1 | x2
        x1 = np.random.normal(rho * x2, np.sqrt(cond_var))
        # Update x2 | x1
        x2 = np.random.normal(rho * x1, np.sqrt(cond_var))
        samples_cw[t] = [x1, x2]
    time_cw = time.time() - start

    # --- Blocked (joint) sampling ---
    # Sample both components jointly from the bivariate normal
    start = time.time()
    samples_blocked = np.random.multivariate_normal(mu, Sigma, n_samples)
    time_blocked = time.time() - start

    # Compute lag-1 autocorrelation
    def autocorr_lag1(samples):
        return np.corrcoef(samples[:-1], samples[1:])[0, 1]

    acf1_cw = autocorr_lag1(samples_cw[:, 0])
    acf1_blocked = autocorr_lag1(samples_blocked[:, 0])

    # Estimate ESS by summing autocorrelations until they become negligible
    def estimate_ess(samples):
        n = len(samples)
        acf_sum = 0
        for lag in range(1, min(n//3, 500)):
            ac = np.corrcoef(samples[:-lag], samples[lag:])[0, 1]
            if ac < 0.05:
                break
            acf_sum += ac
        return n / (1 + 2 * acf_sum)

    ess_cw = estimate_ess(samples_cw[:, 0])
    ess_blocked = estimate_ess(samples_blocked[:, 0])

    print("=== Component-wise vs Blocked Gibbs ===")
    print(f"Target: Bivariate Normal with ρ = {rho} (highly correlated)")
    print(f"N = {n_samples} samples")
    print()
    print("                      | Component-wise | Blocked (joint)")
    print("-" * 60)
    print(f"Lag-1 autocorrelation | {acf1_cw:14.4f} | {acf1_blocked:.4f}")
    print(f"Effective Sample Size | {ess_cw:14.0f} | {ess_blocked:.0f}")
    print(f"ESS per second        | {ess_cw/time_cw:14.0f} | {ess_blocked/time_blocked:.0f}")
    print()
    print(f"Blocked sampling achieves {ess_blocked/ess_cw:.0f}x higher ESS!")
    print("This demonstrates the importance of blocking correlated parameters.")

compare_componentwise_vs_blocked()
```

A powerful technique closely related to Gibbs sampling is data augmentation—introducing latent (auxiliary) variables that simplify the conditional structure.
The idea:
Sometimes the posterior $p(\theta | \mathcal{D})$ has intractable conditionals, but we can introduce auxiliary variables $Z$ such that the augmented conditionals $p(\theta | Z, \mathcal{D})$ and $p(Z | \theta, \mathcal{D})$ are both easy to sample, while marginalizing out $Z$ recovers the original posterior.
Classic example: Mixture models
For a Gaussian mixture model: $$p(x_i | \pi, \mu, \sigma) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i; \mu_k, \sigma_k^2)$$
Direct inference is hard because the sum prevents closed-form conditionals.
Solution: Introduce latent component assignments $Z_i \in \{1, \ldots, K\}$:
$$p(Z_i = k | x_i, \pi, \mu, \sigma) \propto \pi_k \mathcal{N}(x_i; \mu_k, \sigma_k^2)$$
Given $Z$, each observation is assigned to one component, and updating $(\mu_k, \sigma_k)$ becomes a standard normal-normal conjugate problem.
The Gibbs sampler for mixtures alternates three updates: draw each assignment $Z_i$ from its categorical conditional, draw the weights $\pi$ from a Dirichlet given the component counts, and draw each mean $\mu_k$ from its normal-normal conjugate conditional given the observations assigned to component $k$.
Data augmentation isn't cheating—the auxiliary variables have a genuine interpretation. In mixture models, Z_i represents 'which component generated observation i.' We're not just adding variables for convenience; we're making explicit the hidden structure that the model implies. The marginal posterior over θ is exactly what we want; we simply discard the Z samples.
```python
import numpy as np
from scipy import stats

def gibbs_gaussian_mixture(X, K, n_samples, burn_in=500):
    """
    Gibbs sampler for a K-component Gaussian mixture model.
    Uses data augmentation with latent component assignments.

    Model:
        pi ~ Dirichlet(1, ..., 1)
        mu_k ~ N(0, 10^2) for k = 1..K
        sigma_k^2 = 1 (fixed for simplicity)
        Z_i | pi ~ Categorical(pi)
        X_i | Z_i, mu ~ N(mu_{Z_i}, sigma_{Z_i}^2)
    """
    n = len(X)
    sigma_sq = 1.0  # Fixed variance

    # Initialize: spread initial means across the data range
    mu = np.linspace(np.min(X) - 1, np.max(X) + 1, K)
    pi = np.ones(K) / K
    Z = np.random.choice(K, n)

    samples = []
    for t in range(n_samples + burn_in):
        # --- Sample Z | X, mu, pi ---
        # p(Z_i = k | x_i, mu, pi) ∝ pi_k * N(x_i; mu_k, sigma^2)
        log_probs = np.zeros((n, K))
        for k in range(K):
            log_probs[:, k] = np.log(pi[k] + 1e-10) + stats.norm.logpdf(X, mu[k], np.sqrt(sigma_sq))
        # Normalize (softmax)
        log_probs -= np.max(log_probs, axis=1, keepdims=True)
        probs = np.exp(log_probs)
        probs /= probs.sum(axis=1, keepdims=True)
        # Sample from the categorical
        for i in range(n):
            Z[i] = np.random.choice(K, p=probs[i])

        # --- Sample pi | Z ---
        # Dirichlet posterior with counts
        counts = np.array([np.sum(Z == k) for k in range(K)])
        pi = np.random.dirichlet(1 + counts)

        # --- Sample mu_k | Z, X ---
        for k in range(K):
            X_k = X[Z == k]
            n_k = len(X_k)
            if n_k > 0:
                # Normal-Normal conjugate update
                prior_var = 100.0
                post_precision = 1/prior_var + n_k/sigma_sq
                post_var = 1 / post_precision
                post_mean = post_var * (0/prior_var + np.sum(X_k)/sigma_sq)
                mu[k] = np.random.normal(post_mean, np.sqrt(post_var))
            else:
                # No observations assigned; sample from the prior
                mu[k] = np.random.normal(0, 10)

        if t >= burn_in:
            samples.append({'mu': mu.copy(), 'pi': pi.copy(), 'Z': Z.copy()})

    return samples

# Generate data from a true mixture
np.random.seed(42)
true_mu = [-3, 0, 4]
true_pi = [0.3, 0.5, 0.2]
n_obs = 200

Z_true = np.random.choice(3, n_obs, p=true_pi)
X = np.array([np.random.normal(true_mu[z], 1) for z in Z_true])

print("=== Gibbs Sampling: Gaussian Mixture Model ===")
print(f"True parameters: mu = {true_mu}, pi = {true_pi}")
print(f"Data: n = {n_obs}")
print()

# Fit with Gibbs
samples = gibbs_gaussian_mixture(X, K=3, n_samples=2000, burn_in=500)

# Extract posterior samples for mu (accounting for label switching)
mu_samples = np.array([s['mu'] for s in samples])
pi_samples = np.array([s['pi'] for s in samples])

# Sort components by mean for identifiability
mu_sorted = np.sort(mu_samples, axis=1)

print("Posterior means for components (sorted):")
for k in range(3):
    print(f"  mu_{k+1}: mean = {np.mean(mu_sorted[:, k]):.3f}, "
          f"95% CI = [{np.percentile(mu_sorted[:, k], 2.5):.3f}, "
          f"{np.percentile(mu_sorted[:, k], 97.5):.3f}]")
print(f"True mu (sorted): {sorted(true_mu)}")
```

Gibbs sampling and Metropolis-Hastings are complementary, not competing approaches. Understanding when each excels helps design efficient samplers.
Gibbs sampling excels when:
• Full conditionals belong to standard families (conjugate structure)
• Parameters are only weakly correlated in the posterior
• The model has hierarchical or latent-variable structure that conditioning simplifies
Metropolis-Hastings is preferable when:
• Full conditionals are intractable or non-standard
• Parameters are strongly correlated and better proposed jointly
• You want a generic sampler without model-specific derivations
Hybrid approaches:
In practice, most samplers mix both: Gibbs updates for parameters with tractable conditionals, and MH steps (Metropolis-within-Gibbs) for the rest.
| Aspect | Gibbs | Metropolis-Hastings |
|---|---|---|
| Acceptance rate | 100% | Tunable (20-50% optimal) |
| Requires | Tractable conditionals | Density evaluation only |
| Proposal design | None (conditionals define it) | Critical for efficiency |
| Correlation handling | Slow for correlated params | Can propose jointly |
| Implementation | Model-specific | Generic algorithm |
| Software | BUGS, JAGS, Stan | Stan, PyMC, custom code |
Modern probabilistic programming languages (Stan, PyMC, NumPyro) automatically derive efficient samplers:
• Stan: Uses Hamiltonian Monte Carlo (gradient-based MH) by default
• JAGS/BUGS: Automatically identifies conjugate structure for Gibbs
• NumPyro: Combines HMC with automatic differentiation
These tools abstract away the choice between Gibbs and MH, selecting the best strategy for each parameter based on model structure.
Several practical issues arise when implementing Gibbs samplers for real problems.
Update order:
For systematic scan, the order of updates can affect mixing, although any fixed order yields a valid sampler.
Convergence assessment:
All standard MCMC diagnostics apply: trace plots, the Gelman-Rubin statistic ($\hat{R}$) across multiple chains, and effective sample size.
The label switching problem in mixtures:
In mixture models, the posterior is invariant to relabeling components. The chain may switch labels during sampling, causing problems for inference on component-specific parameters. Common remedies: impose an ordering constraint (e.g., $\mu_1 < \mu_2 < \cdots < \mu_K$), or relabel the samples post hoc, as the sorting step in the mixture example above does.
Sampling efficiency: each sweep requires $d$ conditional draws, and strong posterior correlation inflates autocorrelation, so judge a Gibbs sampler by effective sample size per second rather than raw iteration count.
Gibbs sampling exploits conditional structure to sample from complex distributions by iteratively updating each parameter from its full conditional. When conditionals are tractable, it provides an elegant, efficient alternative to general-purpose MH.
Module Complete:
You have now mastered the core sampling-based approaches to approximate Bayesian inference, from basic Monte Carlo estimation through Metropolis-Hastings to Gibbs sampling and data augmentation.
These methods form the computational backbone of Bayesian machine learning, enabling inference in complex hierarchical models, latent variable models, and beyond.
Congratulations! You have completed the Approximate Inference: Sampling module. You now have a comprehensive understanding of Monte Carlo methods for Bayesian computation—from the foundational principles to practical implementation. These techniques empower you to perform inference in virtually any probabilistic model, unlocking the full power of Bayesian machine learning.