A generative model is only as useful as its ability to sample—to produce new, realistic instances from the learned distribution. Density estimation gives us $p_\theta(x)$, but users want actual samples: new images, new text, new molecules.
Sampling is the process of drawing random values $x$ such that $x \sim p(x)$. For simple distributions (Gaussian, uniform), standard libraries provide efficient sampling. But for complex, high-dimensional distributions learned by deep generative models, sampling becomes a sophisticated challenge.
This page explores the spectrum of sampling techniques, from foundational methods that work for any distribution in principle, to specialized approaches designed for modern generative architectures. Understanding these techniques is essential for deploying, evaluating, and improving generative models.
By the end of this page, you will understand:

- Ancestral sampling and the reparameterization trick
- Rejection and importance sampling, with their trade-offs
- Markov Chain Monte Carlo (MCMC) fundamentals
- The Metropolis-Hastings and Gibbs sampling algorithms
- Modern sampling strategies in neural generative models (flow sampling, diffusion denoising, autoregressive generation)
- Practical considerations for sample quality and diversity
Ancestral sampling (also called forward sampling) is the most direct approach: sample from a generative model by following its generative process.
For directed graphical models:
If a model factorizes as:
$$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n p(x_i | \text{parents}(x_i))$$
we sample each variable in topological order (parents before children), drawing $x_i \sim p(x_i \mid \text{parents}(x_i))$ using the parent values already sampled.
Each conditional $p(x_i | \cdot)$ should be a simple distribution we can sample from (e.g., Gaussian, categorical).
Example: Bayesian Network
Consider a weather model with four binary variables: cloudy influences both sprinkler and rain, and wet_grass depends on sprinkler and rain.
Sampling proceeds parents-first: cloudy → {sprinkler, rain} → wet_grass.
For autoregressive models:
Autoregressive models are explicitly designed for ancestral sampling:
$$p(x) = p(x_1) \cdot p(x_2|x_1) \cdot p(x_3|x_1,x_2) \cdots$$
Generate $x_1$, then $x_2$ given $x_1$, then $x_3$ given $x_1, x_2$, etc. This is how GPT generates text: token by token, each conditioned on all previous.
Ancestral sampling in autoregressive models is inherently sequential—each step depends on all previous. For a 1024-token sequence, we need 1024 forward passes. This is why autoregressive LLMs have high latency despite fast individual forward passes. Parallel sampling methods (speculative decoding, Jacobi iteration) try to mitigate this, but remain an active research area.
```python
import numpy as np

# Example: Ancestral sampling from a Bayesian network
np.random.seed(42)

def sample_weather():
    """
    Bayesian Network:
        cloudy -> sprinkler -> wet_grass
        cloudy -> rain      -> wet_grass
    """
    # P(cloudy) = 0.5
    cloudy = np.random.random() < 0.5
    # P(sprinkler | cloudy)
    p_sprinkler = 0.1 if cloudy else 0.5
    sprinkler = np.random.random() < p_sprinkler
    # P(rain | cloudy)
    p_rain = 0.8 if cloudy else 0.2
    rain = np.random.random() < p_rain
    # P(wet_grass | sprinkler, rain)
    if sprinkler and rain:
        p_wet = 0.99
    elif sprinkler or rain:
        p_wet = 0.9
    else:
        p_wet = 0.01
    wet_grass = np.random.random() < p_wet
    return {'cloudy': cloudy, 'sprinkler': sprinkler,
            'rain': rain, 'wet_grass': wet_grass}

# Generate samples
n_samples = 10000
samples = [sample_weather() for _ in range(n_samples)]

# Estimate marginal probabilities
print("Ancestral Sampling from Weather Bayesian Network:")
print(f"  Generated {n_samples} samples")
print("  Estimated marginal probabilities:")
print(f"    P(cloudy)    = {np.mean([s['cloudy'] for s in samples]):.3f}")
print(f"    P(rain)      = {np.mean([s['rain'] for s in samples]):.3f}")
print(f"    P(wet_grass) = {np.mean([s['wet_grass'] for s in samples]):.3f}")

# Estimate a conditional
rain_samples = [s for s in samples if s['rain']]
print("  Conditional estimates:")
print(f"    P(cloudy | rain) = {np.mean([s['cloudy'] for s in rain_samples]):.3f}")

# Autoregressive sampling example
print("\n--- Autoregressive Text Generation (simplified) ---")

# Vocabulary and transition probabilities (toy example)
vocab = ['the', 'cat', 'dog', 'sat', 'ran', '.']
# P(word | prev_word) - simplified Markov model
transitions = {
    '<start>': {'the': 0.8, 'cat': 0.1, 'dog': 0.1},
    'the':     {'cat': 0.5, 'dog': 0.5},
    'cat':     {'sat': 0.6, 'ran': 0.4},
    'dog':     {'sat': 0.3, 'ran': 0.7},
    'sat':     {'.': 1.0},
    'ran':     {'.': 1.0},
}

def sample_sentence():
    """Autoregressive sampling from a Markov language model."""
    sentence = []
    prev = '<start>'
    while True:
        probs = transitions.get(prev, {'.': 1.0})
        words = list(probs.keys())
        ps = list(probs.values())
        word = np.random.choice(words, p=ps)
        if word == '.':
            break
        sentence.append(word)
        prev = word
    return ' '.join(sentence)

print("  Generated sentences:")
for i in range(5):
    print(f"    {i+1}. {sample_sentence()}")
```

The reparameterization trick is a crucial technique enabling gradient-based training through stochastic layers. It's foundational for variational autoencoders and many modern generative models.
The Problem:
Suppose we need to optimize:
$$\mathcal{L}(\theta) = \mathbb{E}_{z \sim p_\theta(z)}[f(z)]$$
where $p_\theta(z)$ depends on parameters $\theta$. How do we compute $\nabla_\theta \mathcal{L}$?
Naively: sample $z \sim p_\theta(z)$, compute $f(z)$. But $z$ is stochastic—we can't backpropagate through the sampling operation.
The Solution: Reparameterize
Express $z$ as a deterministic function of $\theta$ and independent noise:
$$z = g_\theta(\epsilon), \quad \epsilon \sim q(\epsilon)$$
where $q(\epsilon)$ does NOT depend on $\theta$.
Now: $$\mathcal{L}(\theta) = \mathbb{E}_{\epsilon \sim q(\epsilon)}[f(g_\theta(\epsilon))]$$
Gradients flow through $g_\theta$: $$\nabla_\theta \mathcal{L} = \mathbb{E}_{\epsilon \sim q(\epsilon)}[\nabla_\theta f(g_\theta(\epsilon))]$$
Example: Gaussian
For $z \sim \mathcal{N}(\mu, \sigma^2)$:
$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$
Now $\nabla_\mu z = 1$ and $\nabla_\sigma z = \epsilon$, both well-defined.
The reparameterization trick is essential for VAE training. The encoder outputs parameters (μ, σ) of a Gaussian latent distribution. We need gradients through samples from this distribution to train end-to-end. Without reparameterization, the sampling operation blocks gradients, making training impossible.
Beyond Gaussians:
The reparameterization trick extends to many distributions:
Continuous distributions: location-scale families (Gaussian, Laplace, logistic) reparameterize as $z = \mu + \sigma \epsilon$; distributions with a tractable inverse CDF reparameterize as $z = F_\theta^{-1}(u)$ with $u \sim \text{Uniform}(0, 1)$.
Discrete distributions (harder):
For discrete variables, the mapping $g$ is inherently discontinuous. Solutions:
Gumbel-Softmax (Concrete): Differentiable relaxation of categorical sampling $$z_i = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_j \exp((\log \pi_j + g_j)/\tau)}$$ where $g_i \sim \text{Gumbel}(0,1)$ and $\tau$ is temperature.
REINFORCE / Score function estimator: Unbiased but high-variance gradient estimate
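The variance gap between these two estimators is easy to see numerically. The sketch below is illustrative, not from this page's models: it assumes the toy objective $f(z) = z^2$ with $z \sim \mathcal{N}(\mu, 1)$, for which the true gradient is $2\mu$, and compares the score-function (REINFORCE) estimate against the reparameterized (pathwise) one:

```python
import numpy as np

np.random.seed(0)
mu, n = 1.5, 100_000  # true gradient of E[z^2] w.r.t. mu is 2*mu = 3.0

eps = np.random.standard_normal(n)
z = mu + eps  # z ~ N(mu, 1)

# Score-function (REINFORCE) estimator: f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu)
score_grads = z**2 * (z - mu)

# Reparameterized (pathwise) estimator: d/dmu f(mu + eps) = 2 * (mu + eps)
pathwise_grads = 2 * z

print(f"True gradient: {2*mu:.3f}")
print(f"REINFORCE: mean = {score_grads.mean():.3f}, var = {score_grads.var():.1f}")
print(f"Pathwise:  mean = {pathwise_grads.mean():.3f}, var = {pathwise_grads.var():.1f}")
```

Both estimators are unbiased, but the score-function variance is an order of magnitude larger here, which is why reparameterization is preferred whenever the distribution allows it.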
```python
import numpy as np

# Demonstrate reparameterization for a Gaussian
np.random.seed(42)

# Parameters of our distribution
mu = 3.0
sigma = 2.0

# Standard sampling (cannot differentiate through this!)
def sample_naive(mu, sigma, n=1000):
    """Sample directly - gradients blocked."""
    return np.random.normal(mu, sigma, n)

# Reparameterized sampling (gradients flow!)
def sample_reparameterized(mu, sigma, n=1000):
    """Sample via reparameterization: z = mu + sigma * epsilon."""
    epsilon = np.random.standard_normal(n)  # Independent of mu, sigma
    z = mu + sigma * epsilon                # Deterministic transform
    return z, epsilon

# Both produce the same distribution
z_naive = sample_naive(mu, sigma, 10000)
z_reparam, epsilon = sample_reparameterized(mu, sigma, 10000)

print("Reparameterization Trick Demonstration:")
print(f"  Target: N(μ={mu}, σ={sigma})")
print("  Naive sampling:")
print(f"    Mean: {z_naive.mean():.3f}, Std: {z_naive.std():.3f}")
print("  Reparameterized sampling:")
print(f"    Mean: {z_reparam.mean():.3f}, Std: {z_reparam.std():.3f}")

# Show gradients
print("  Gradients (reparameterized):")
print("    dz/dμ = 1 (for all samples)")
print("    dz/dσ = ε (sample-dependent)")
print(f"    Mean dz/dσ: {epsilon.mean():.3f} (≈ 0)")

# Gumbel-Softmax for discrete variables
print("\n--- Gumbel-Softmax for Categorical ---")

def gumbel_softmax(logits, temperature=1.0):
    """Differentiable approximation to categorical sampling."""
    # Sample Gumbel noise
    u = np.random.uniform(0, 1, len(logits))
    gumbel = -np.log(-np.log(u + 1e-10) + 1e-10)
    # Add noise and apply temperature-scaled softmax
    y = (logits + gumbel) / temperature
    y = np.exp(y) / np.exp(y).sum()
    return y

# Logits for 3 categories
logits = np.array([2.0, 1.0, 0.5])  # Category 0 most likely
true_probs = np.exp(logits) / np.exp(logits).sum()

print(f"  Logits: {logits}")
print(f"  True probabilities: {true_probs.round(3)}")

# Sample at different temperatures
for temp in [0.1, 0.5, 1.0, 2.0]:
    samples = [gumbel_softmax(logits, temp) for _ in range(1000)]
    avg = np.mean(samples, axis=0)
    print(f"  τ={temp}: avg samples = {avg.round(3)}")
print("  (Lower τ → closer to one-hot, higher τ → closer to uniform)")
```

When direct sampling isn't possible, we turn to indirect methods that sample from a different distribution and correct for the discrepancy.
Rejection Sampling
Idea: Sample from a simpler proposal distribution $q(x)$ that we can sample from, then accept/reject samples to match target $p(x)$.
Algorithm:
1. Find a constant $M$ such that $p(x) \le M q(x)$ for all $x$.
2. Sample $x \sim q(x)$ and $u \sim \text{Uniform}(0, 1)$.
3. Accept $x$ if $u < \frac{p(x)}{M q(x)}$; otherwise reject it and repeat from step 2.
Properties:
- Accepted samples are exact draws from $p(x)$.
- The expected acceptance rate is $1/M$, so a tight bound $M$ is crucial for efficiency.
- We only need to evaluate $p(x)$ pointwise, not sample from it directly.
The curse: In high dimensions, even a well-chosen $q$ typically requires exponentially large $M$, making rejection sampling impractical.
Importance Sampling
Idea: Don't reject samples—instead, weight them to correct for using the wrong distribution.
To estimate $\mathbb{E}_{x \sim p}[f(x)]$:
$$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[f(x) \cdot \frac{p(x)}{q(x)}\right]$$
The importance weight $w(x) = p(x)/q(x)$ re-weights samples from $q$ to be representative of $p$.
Self-Normalized Importance Sampling:
When we only know $p$ up to a normalizing constant:
$$\hat{\mathbb{E}}[f(x)] = \frac{\sum_i w(x_i) f(x_i)}{\sum_i w(x_i)}$$
where $x_i \sim q(x)$.
Importance sampling works well when p and q are similar. In high dimensions, even small mismatches cause weights to have enormous variance—a few samples have huge weights, others near zero. The 'effective sample size' can be tiny even with many samples. This limits the practical utility of importance sampling for complex generative models.
```python
import numpy as np
from scipy.stats import norm

np.random.seed(42)

# Target distribution: mixture of Gaussians (hard to sample directly)
def target_pdf(x):
    return 0.5 * norm.pdf(x, -2, 0.5) + 0.5 * norm.pdf(x, 2, 0.8)

def target_unnormalized(x):
    return target_pdf(x)  # In this case, same as the normalized density

# Proposal: broader Gaussian that covers both modes
proposal = norm(0, 3)

# Rejection Sampling
print("=== Rejection Sampling ===")

# Find M such that target(x) <= M * proposal(x) for all x
x_test = np.linspace(-6, 6, 1000)
ratios = target_pdf(x_test) / proposal.pdf(x_test)
M = np.max(ratios) * 1.1  # Add margin for safety

print("  Proposal: N(0, 3)")
print(f"  Bound M: {M:.3f}")
print(f"  Expected acceptance rate: {1/M:.3f}")

# Rejection sampling loop
n_target = 10000
accepted = []
attempts = 0
while len(accepted) < n_target:
    x = proposal.rvs()
    u = np.random.uniform(0, 1)
    attempts += 1
    if u < target_pdf(x) / (M * proposal.pdf(x)):
        accepted.append(x)

accepted = np.array(accepted)
print(f"  Requested: {n_target} samples")
print(f"  Attempts: {attempts}")
print(f"  Actual acceptance rate: {n_target/attempts:.3f}")
print(f"  Sample mean: {accepted.mean():.3f} (expected: 0)")
print(f"  Sample std: {accepted.std():.3f}")

# Importance Sampling
print("\n=== Importance Sampling ===")

# Estimate E[X^2] under the target distribution
def estimand(x):
    return x ** 2

# True value (analytically computed)
true_expectation = 0.5 * (4 + 0.5**2) + 0.5 * (4 + 0.8**2)  # E[X^2] for the mixture
print(f"  True E[X²]: {true_expectation:.3f}")

# Importance sampling estimate
n_samples = 10000
samples = proposal.rvs(n_samples)
weights = target_pdf(samples) / proposal.pdf(samples)
normalized_weights = weights / weights.sum()

is_estimate = np.sum(normalized_weights * estimand(samples))
print(f"  IS estimate: {is_estimate:.3f}")

# Effective sample size
ess = 1 / np.sum(normalized_weights ** 2)
print(f"  Effective Sample Size: {ess:.0f} out of {n_samples}")
print(f"  ESS ratio: {ess/n_samples:.1%}")
print("  (Low ESS = high weight variance, poor estimate)")

# Show weight distribution
print("  Weight statistics:")
print(f"    Max weight: {weights.max():.3f}")
print(f"    Min weight: {weights.min():.6f}")
print(f"    Max/Min ratio: {weights.max()/weights.min():.0f}")
```

Markov Chain Monte Carlo (MCMC) is a family of algorithms that sample from complex distributions by constructing a Markov chain whose stationary distribution is the target.
Core Idea:
Instead of sampling independently, we construct a sequence of samples $x_0, x_1, x_2, \ldots$ where each sample depends on the previous one (a Markov chain). If we design the transition probabilities correctly, the distribution of samples converges to the target $p(x)$ as we run the chain.
Key Concepts:
Markov Chain: A sequence where $x_{t+1}$ depends only on $x_t$: $$P(x_{t+1} | x_t, x_{t-1}, \ldots, x_0) = P(x_{t+1} | x_t) = T(x_{t+1} | x_t)$$
where $T$ is the transition kernel.
Stationary Distribution: A distribution $\pi(x)$ such that: $$\pi(x') = \int T(x'|x) \pi(x) dx$$
If we sample $x_t \sim \pi$, then $x_{t+1} \sim \pi$ too. The distribution is 'stable.'
Ergodicity: Under certain conditions (irreducibility, aperiodicity), the chain converges to its stationary distribution regardless of initialization: $$\lim_{t \to \infty} P(x_t | x_0) = \pi(x)$$
MCMC goal: Design $T$ such that $\pi = p$ (target distribution).
A sufficient (not necessary) condition for π to be stationary is detailed balance: π(x)T(x'|x) = π(x')T(x|x'). This says the probability flow from x to x' equals the flow from x' to x. Most MCMC algorithms are designed by constructing T that satisfies detailed balance with respect to the target distribution.
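Detailed balance is easy to verify numerically for a small discrete chain. The sketch below (an illustrative 3-state target, not taken from the text) builds a Metropolis-style transition matrix from a uniform proposal and checks both detailed balance and stationarity:

```python
import numpy as np

# Target distribution over 3 states (illustrative choice)
pi = np.array([0.2, 0.3, 0.5])

# Build a Metropolis transition kernel from a uniform proposal:
# propose state j uniformly, accept with probability min(1, pi[j]/pi[i])
K = 3
T = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        if i != j:
            T[i, j] = (1.0 / K) * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()  # rejected proposals stay at state i

# Detailed balance: pi[i] * T[i, j] == pi[j] * T[j, i] for all i, j
flow = pi[:, None] * T
assert np.allclose(flow, flow.T), "detailed balance violated"

# Stationarity follows: pi @ T == pi
print("pi @ T =", pi @ T)

# Ergodicity: any starting distribution converges to pi under repeated T
dist = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    dist = dist @ T
print("after 100 steps:", dist.round(4))
```

Summing the off-diagonal flows confirms the Metropolis rule works: the accepted mass $\frac{1}{K}\min(\pi_i, \pi_j)$ is symmetric in $i$ and $j$ by construction.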
Practical Considerations:
Burn-in: Initial samples are influenced by the starting point, not the target distribution. We discard these 'burn-in' samples.
Mixing: How quickly does the chain explore the distribution? Poor mixing = chain gets stuck in local regions, samples are highly correlated, effective sample size is low.
Thinning: To reduce autocorrelation, keep only every $k$-th sample. Trading samples for computational efficiency.
Diagnostics:
- Trace plots: plot $x_t$ over iterations; trends or sudden level shifts signal non-convergence.
- $\hat{R}$ (Gelman-Rubin): compare variance within and between multiple chains; values above ~1.1 indicate trouble.
- Autocorrelation / effective sample size (ESS): how many "independent-equivalent" samples the chain actually provides.
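As one concrete diagnostic, a simplified Gelman-Rubin $\hat{R}$ can be computed from a few parallel chains. The sketch below uses synthetic "chains" rather than real MCMC output, and implements the basic (non-split) variant of the formula:

```python
import numpy as np

def r_hat(chains):
    """Simplified Gelman-Rubin statistic for m equal-length chains.

    chains: array of shape (m, n) - m chains of n samples each.
    Values near 1.0 suggest convergence; > 1.1 is a common warning sign.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.normal(0, 1, size=(4, 2000))           # 4 chains agreeing on the target
stuck = mixed + np.array([[0], [0], [3], [3]])     # two chains offset: not converged

print(f"R-hat (well-mixed chains):  {r_hat(mixed):.3f}")   # ~1.0
print(f"R-hat (disagreeing chains): {r_hat(stuck):.2f}")   # well above 1.1
```

The second case mimics chains stuck in different modes: each chain looks stationary on its own, but the between-chain variance exposes the problem.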
Limitations of MCMC:
| Issue | Symptom | Remedy |
|---|---|---|
| Poor mixing | Autocorrelation high, ESS low | Tune step size, use HMC, reparameterize |
| Not converged | Trace plot trends, R-hat > 1.1 | Run longer, better initialization |
| Stuck in mode | Multimodal target, chain in one mode | Parallel tempering, multiple chains |
| High dimensionality | Acceptance rate too low/high | HMC, NUTS, blocked updates |
Metropolis-Hastings (MH) is the foundational MCMC algorithm. It constructs a Markov chain with the target as its stationary distribution by using a proposal distribution and accept/reject steps.
Algorithm:
1. Start from an initial state $x_0$.
2. Propose $x' \sim q(x'|x_t)$.
3. Compute the acceptance probability: $$\alpha = \min\left(1, \frac{p(x')\, q(x_t|x')}{p(x_t)\, q(x'|x_t)}\right)$$
4. With probability $\alpha$, set $x_{t+1} = x'$; otherwise set $x_{t+1} = x_t$.
5. Repeat from step 2.
Key Properties:
- Only the ratio $p(x')/p(x_t)$ is needed, so $p$ can be unnormalized: the normalizing constant cancels.
- The accept/reject rule enforces detailed balance, making $p$ the stationary distribution.
- Works for any target we can evaluate pointwise, at the cost of correlated samples.
Symmetric Proposals (Metropolis):
If $q(x'|x) = q(x|x')$ (symmetric, like Gaussian random walk), the formula simplifies: $$\alpha = \min\left(1, \frac{p(x')}{p(x_t)}\right)$$
Accept if we move to higher probability; randomly accept if we move to lower.
Tuning the Proposal:
The proposal distribution critically affects performance:
Too narrow (small steps): high acceptance rate, but the chain crawls through the space and consecutive samples are highly correlated.
Too wide (large steps): most proposals land in low-probability regions and are rejected, so the chain rarely moves.
Sweet spot: Acceptance rate around 20-50% for random walk proposals in high dimensions (theoretical optimal ~23.4% for high-dimensional Gaussian targets).
Adaptive MH:
Modern implementations adapt the proposal during the run: tuning the step size toward a target acceptance rate, and estimating the target's covariance online to shape the proposal. Adaptation must diminish over time (or be frozen) to preserve the correct stationary distribution.
```python
import numpy as np

# Metropolis-Hastings for sampling from a bimodal distribution
np.random.seed(42)

# Target: mixture of two Gaussians
def log_target(x):
    log_p1 = -0.5 * ((x - (-3)) ** 2) / 1.0
    log_p2 = -0.5 * ((x - 3) ** 2) / 1.0
    # log-sum-exp for numerical stability
    max_log = max(log_p1, log_p2)
    return max_log + np.log(np.exp(log_p1 - max_log) + np.exp(log_p2 - max_log))

def metropolis_hastings(log_target, x0, proposal_std, n_samples, burn_in=1000):
    """
    Random walk Metropolis-Hastings.

    Args:
        log_target: Function returning log p(x) (up to a constant)
        x0: Initial state
        proposal_std: Standard deviation of the Gaussian proposal
        n_samples: Number of samples after burn-in
        burn_in: Number of initial samples to discard
    """
    x = x0
    samples = []
    acceptances = 0
    total_iter = n_samples + burn_in
    for t in range(total_iter):
        # Propose
        x_prime = x + np.random.normal(0, proposal_std)
        # Compute acceptance probability (in log space)
        log_alpha = log_target(x_prime) - log_target(x)
        # Accept/reject
        if np.log(np.random.random()) < log_alpha:
            x = x_prime
            if t >= burn_in:
                acceptances += 1
        if t >= burn_in:
            samples.append(x)
    return np.array(samples), acceptances / n_samples

def compute_ess(samples):
    """Rough effective sample size from the autocorrelation function."""
    n = len(samples)
    mean = samples.mean()
    var = samples.var()
    if var == 0:
        return n
    autocorr = np.correlate(samples - mean, samples - mean, 'full')
    autocorr = autocorr[n-1:] / (var * np.arange(n, 0, -1))
    # Sum autocorrelations while they remain meaningfully positive
    ess_denom = 1 + 2 * sum(autocorr[1:min(100, n)] * (autocorr[1:min(100, n)] > 0.05))
    return n / max(1, ess_denom)

# Run with different proposal widths
n_samples = 10000
print("Metropolis-Hastings Sampling from Bimodal Distribution")
print("=" * 55)

for std in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    samples, acc_rate = metropolis_hastings(
        log_target, x0=0.0, proposal_std=std, n_samples=n_samples
    )
    ess = compute_ess(samples)
    print(f"  σ_proposal = {std:4.1f}: Accept rate = {acc_rate:.3f}, "
          f"ESS = {ess:6.0f} ({ess/n_samples*100:4.1f}%)")

print("  (Best: acceptance rate 20-50%, high ESS)")

# Check that both modes are visited
best_samples, _ = metropolis_hastings(
    log_target, x0=0.0, proposal_std=2.0, n_samples=50000, burn_in=5000
)
n_left_mode = np.sum(best_samples < 0)
n_right_mode = np.sum(best_samples >= 0)
print("  Mode coverage (σ=2.0):")
print(f"    Left mode (x<0):  {n_left_mode} samples ({n_left_mode/len(best_samples):.1%})")
print(f"    Right mode (x≥0): {n_right_mode} samples ({n_right_mode/len(best_samples):.1%})")
```

Gibbs sampling is a special case of MCMC particularly useful when the joint distribution is hard to sample from, but the conditional distributions are easy.
Algorithm:
For a multivariate distribution $p(x_1, x_2, \ldots, x_d)$:
1. Initialize $x^{(0)} = (x_1^{(0)}, \ldots, x_d^{(0)})$.
2. At each iteration $t$, cycle through the coordinates, sampling: $$x_i^{(t)} \sim p(x_i \mid x_1^{(t)}, \ldots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \ldots, x_d^{(t-1)})$$
Each step updates one variable by sampling from its full conditional distribution (conditioned on current values of all others).
Why it works:
Gibbs sampling is a special case of MH where the proposal is the full conditional, and the acceptance probability is always 1 (the updates are always accepted).
Advantages:
- No proposal distribution to tune and no rejected samples.
- Exploits conditional structure: each update is an exact draw from a simple distribution.
- A natural fit for graphical models and conjugate Bayesian models.
Gibbs sampling shines in graphical models where the conditional p(x_i | x_{-i}) depends only on the Markov blanket of x_i (parents, children, and co-parents). This locality makes conditionals tractable even in large models, and enables block updates where subsets of variables are updated together.
Block Gibbs Sampling:
Instead of updating one variable at a time, update blocks of variables together: $$x_A^{(t)} \sim p(x_A | x_B^{(t-1)})$$ $$x_B^{(t)} \sim p(x_B | x_A^{(t)})$$
Larger blocks improve mixing (less autocorrelation) but require sampling from more complex conditionals.
Collapsed Gibbs Sampling:
Integrate out (marginalize) some variables analytically, sampling only the remaining variables from the collapsed conditionals. This can dramatically improve mixing by reducing dependencies.
Limitations:
- Requires tractable full conditionals, which many models lack.
- Mixes slowly when variables are strongly correlated, since updates move in small axis-aligned steps.
- Like other single-chain MCMC methods, it can fail to hop between well-separated modes.
```python
import numpy as np

np.random.seed(42)

# Gibbs sampling for a bivariate Gaussian
# Target: p(x, y) = N([0, 0], [[1, ρ], [ρ, 1]])
# Conditionals:
#   p(x | y) = N(ρy, 1 - ρ²)
#   p(y | x) = N(ρx, 1 - ρ²)

def gibbs_bivariate_gaussian(rho, n_samples, burn_in=1000):
    """Gibbs sampling for a bivariate Gaussian with correlation rho."""
    conditional_std = np.sqrt(1 - rho ** 2)
    x, y = 0.0, 0.0  # Initialize
    samples = []
    for t in range(n_samples + burn_in):
        # Update x given y
        x = np.random.normal(rho * y, conditional_std)
        # Update y given x
        y = np.random.normal(rho * x, conditional_std)
        if t >= burn_in:
            samples.append([x, y])
    return np.array(samples)

# Sample with different correlations
print("Gibbs Sampling for Bivariate Gaussian")
print("=" * 45)

for rho in [0.0, 0.5, 0.9, 0.99]:
    samples = gibbs_bivariate_gaussian(rho, n_samples=10000)
    # Check statistics
    sample_corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
    # Estimate ESS (rough)
    centered = samples[:, 0] - samples[:, 0].mean()
    x_autocorr = np.correlate(centered, centered, 'full')
    x_autocorr = x_autocorr[len(samples)-1:len(samples)+50] / x_autocorr[len(samples)-1]
    ess_approx = len(samples) / (1 + 2 * np.sum(x_autocorr[1:] * (x_autocorr[1:] > 0.05)))
    print(f"  ρ = {rho:.2f}: Sample corr = {sample_corr:.3f}, "
          f"ESS ≈ {ess_approx:.0f} ({ess_approx/10000*100:.1f}%)")

print("  Note: Higher ρ → slower mixing, lower ESS")

# Gibbs for a Gaussian Mixture Model (toy example)
print("\n--- Gibbs for Gaussian Mixture Model ---")

# Generate data from a mixture
n_data = 200
true_labels = np.random.choice([0, 1], n_data, p=[0.4, 0.6])
true_means = [-3, 3]
data = np.array([np.random.normal(true_means[z], 1) for z in true_labels])

# Gibbs sampling: alternate between z (labels) and μ (means)
def gibbs_gmm(data, n_iter=500, burn_in=100):
    n = len(data)
    K = 2
    # Initialize
    mu = np.array([0.0, 1.0])     # Means
    z = np.random.choice(K, n)    # Assignments
    pi = np.array([0.5, 0.5])     # Mixing weights (held fixed here)
    samples = []
    for t in range(n_iter + burn_in):
        # Sample z given mu (assignments)
        for i in range(n):
            log_probs = np.log(pi) - 0.5 * (data[i] - mu) ** 2
            probs = np.exp(log_probs - log_probs.max())
            probs /= probs.sum()
            z[i] = np.random.choice(K, p=probs)
        # Sample mu given z (means, with prior N(0, 10))
        for k in range(K):
            mask = z == k
            n_k = mask.sum()
            if n_k > 0:
                precision = n_k + 1/10
                mean = (data[mask].sum() + 0/10) / precision
                mu[k] = np.random.normal(mean, 1/np.sqrt(precision))
        if t >= burn_in:
            samples.append(mu.copy())
    return np.array(samples)

mu_samples = gibbs_gmm(data)
print(f"  True means: {true_means}")
print(f"  Posterior mean estimates: [{mu_samples[:, 0].mean():.2f}, {mu_samples[:, 1].mean():.2f}]")
```

Modern deep generative models have developed specialized sampling procedures that diverge from classical MCMC. Let's examine how different architectures approach sampling.
Normalizing Flows: One-Shot Sampling
Flows transform a simple base distribution through an invertible function: $$z \sim \mathcal{N}(0, I) \rightarrow x = f_\theta(z)$$
Sampling is a single forward pass—no chains, no iterations, no rejection. This makes flows attractive for applications requiring fast sampling.
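A toy sketch of this one-shot pattern, using a single hand-picked affine layer in place of a trained flow: the forward pass maps base noise to samples in one step, and the change-of-variables formula gives exact densities:

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)

# Toy invertible flow: x = f(z) = a*z + b (a trained flow composes many such layers)
a, b = 3.0, 2.0
f = lambda z: a * z + b
f_inv = lambda x: (x - b) / a

# One-shot sampling: draw base noise, apply a single forward pass
z = np.random.standard_normal(100_000)
x = f(z)
print(f"Samples: mean = {x.mean():.2f}, std = {x.std():.2f}")  # ≈ 2, 3

# Exact density via change of variables: log p(x) = log N(f_inv(x); 0, 1) - log|a|
def log_px(x):
    z = f_inv(x)
    return -0.5 * (z**2 + np.log(2 * np.pi)) - np.log(abs(a))

# Matches the N(2, 3²) density at a test point
print(np.isclose(np.exp(log_px(2.5)), norm.pdf(2.5, 2, 3)))  # True
```

The same two ingredients, an invertible forward map and a tractable Jacobian determinant, are what real flows like RealNVP and Glow scale up to deep networks.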
VAEs: Latent Space Sampling
VAEs sample from a simple prior in latent space, then decode: $$z \sim \mathcal{N}(0, I) \rightarrow x = \text{Decoder}_\theta(z)$$
Also one-shot. The quality depends on how well the decoder captures the data distribution conditional on $z$.
GANs: Generator Forward Pass
Like VAEs, GANs sample noise and transform: $$z \sim \mathcal{N}(0, I) \rightarrow x = G_\theta(z)$$
One-shot, very fast. No density evaluation—purely implicit sampling.
Autoregressive Models: Sequential Generation
Autoregressive models (GPT, PixelCNN) sample dimension by dimension: $$x_1 \sim p(x_1), \quad x_2 \sim p(x_2|x_1), \quad \ldots$$
This is ancestral sampling, but sequential—one forward pass per dimension. For long sequences (text) or high-resolution images (millions of pixels), this can be slow.
Sampling Strategies for Autoregressive:
- Temperature scaling: divide logits by $\tau$ to sharpen ($\tau < 1$) or flatten ($\tau > 1$) the distribution.
- Top-$k$: sample only from the $k$ most probable tokens.
- Nucleus (top-$p$): sample from the smallest set of tokens whose cumulative probability exceeds $p$.
- Greedy / beam search: deterministic alternatives that maximize (approximate) likelihood at the cost of diversity.
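Common decoding strategies such as temperature scaling, top-$k$, and nucleus (top-$p$) sampling can be sketched in a few lines of numpy. The `sample_token` helper below is illustrative, not a real decoder; production systems add batching and KV caching:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from logits with common decoding strategies."""
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:  # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:  # nucleus: smallest set with cumulative mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()

    return rng.choice(len(probs), p=probs)

logits = [4.0, 3.0, 1.0, 0.5, -1.0]
print("low temperature (τ=0.1):", [sample_token(logits, temperature=0.1) for _ in range(5)])
print("top-k=2:                ", [sample_token(logits, top_k=2) for _ in range(5)])
print("nucleus p=0.9:          ", [sample_token(logits, top_p=0.9) for _ in range(5)])
```

For these logits, top-$k{=}2$ and top-$p{=}0.9$ both restrict sampling to tokens 0 and 1, while low temperature concentrates almost all mass on token 0.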
Diffusion Models: Iterative Denoising
Diffusion models sample through many denoising steps: $$x_T \sim \mathcal{N}(0, I) \rightarrow x_{T-1} \rightarrow \cdots \rightarrow x_0$$
Typically 20-1000 steps, each requiring a neural network forward pass. This is slower than one-shot methods but produces remarkably high-quality samples.
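The iterative noise-to-data loop can be illustrated with a close relative of diffusion sampling: unadjusted Langevin dynamics driven by a closed-form score. In a real diffusion model, a time-conditioned network replaces the analytic score below and the noise schedule is learned rather than fixed; this is only a sketch of the principle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target the sampler should reach: N(2, 1.5²).
# A trained diffusion model would replace this closed-form score with a network.
m, s = 2.0, 1.5
score = lambda x: -(x - m) / s**2   # d/dx log p(x)

# Start from pure noise and iteratively refine (unadjusted Langevin dynamics)
n, eta, steps = 20_000, 0.01, 5000
x = rng.standard_normal(n)          # "x_T": pure Gaussian noise
for _ in range(steps):
    x = x + 0.5 * eta * score(x) + np.sqrt(eta) * rng.standard_normal(n)

print(f"Final samples: mean = {x.mean():.2f}, std = {x.std():.2f}")  # ≈ 2.0, 1.5
```

Each step nudges samples toward high-density regions while injecting fresh noise, the same push-and-perturb structure that a DDPM denoising step performs with a learned network.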
Fast Diffusion Sampling:
- DDIM: a deterministic, non-Markovian sampler that produces good samples in roughly 20-50 steps.
- Higher-order ODE/SDE solvers (e.g., DPM-Solver) reduce the step count further.
- Distillation approaches (progressive distillation, consistency models) train student models that sample in 1-4 steps.
Faster sampling methods (flows, VAEs, GANs) often sacrifice sample quality or likelihood. Slower methods (diffusion, autoregressive) often produce better samples. Choosing the right model involves balancing sampling speed, sample quality, training stability, and mode coverage for your specific application.
| Model | Sampling Method | Steps/Sample | Typical Speed |
|---|---|---|---|
| Normalizing Flow | Forward pass | 1 | ~10ms |
| VAE | Sample + decode | 1 | ~5-10ms |
| GAN | Forward pass | 1 | ~5-10ms |
| Autoregressive (text) | Sequential | ~100-1000 | ~100-1000ms |
| Diffusion (fast) | DDIM denoising | ~20-50 | ~200-500ms |
| Diffusion (quality) | DDPM denoising | ~100-1000 | ~1-10s |
Sampling is where generative models meet the real world—transforming probability distributions into tangible new data. We've covered the spectrum from classical techniques to modern neural approaches.
What's next:
We've seen how generative models learn distributions (density estimation) and produce samples (sampling). But many of the most powerful generative models introduce latent variables—hidden representations that capture meaningful structure in the data. The next page explores this crucial concept, laying the groundwork for VAEs and other latent variable models.
You now understand the full landscape of sampling techniques for generative models. From classical MCMC to modern neural approaches, each method navigates the fundamental challenge of converting probability distributions into tangible samples. This knowledge is essential for implementing, debugging, and improving generative systems.