Variational inference requires optimizing an objective that involves an expectation over a distribution we're trying to learn. This creates a fundamental challenge: how do we compute gradients when the thing we're differentiating through depends on the parameters we're optimizing?
This page tackles this challenge head-on, explaining why naive approaches fail and introducing the foundational concepts that enable gradient-based learning in variational models.
By the end of this page, you will: (1) Understand why gradients through expectations are non-trivial, (2) Distinguish between pathwise and score function gradient estimators, (3) Analyze variance and bias properties of different estimators, (4) Apply baseline and control variate techniques to reduce variance.
Consider the ELBO we want to maximize:
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
We need gradients with respect to both sets of parameters. The gradient with respect to $\theta$ poses no special difficulty, but the gradient with respect to $\phi$ does, because the expectation itself is taken under $q_\phi$.
The Core Issue:
$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z})}[f(\mathbf{z})] = \nabla_\phi \int q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
The difficulty is that the distribution $q_\phi$ itself depends on $\phi$: pushing the gradient inside gives $\int \nabla_\phi q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$, which is no longer an expectation under $q_\phi$, so it cannot be estimated by simply drawing samples from $q_\phi$ and averaging.
You might think: 'Just sample z ~ q_φ and compute gradients.' But ∇_φ(sample from q_φ) is undefined! Sampling is a non-differentiable operation. We need special techniques to obtain gradient information.
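To see this concretely, here is a minimal PyTorch check (variable names are illustrative): a draw produced by `.sample()` is disconnected from the computation graph, so no gradient can reach the variational parameters.

```python
import torch
from torch.distributions import Normal

# mu is the parameter we would like gradients for.
mu = torch.zeros(3, requires_grad=True)
q = Normal(mu, torch.ones(3))

z = q.sample()                      # draw a sample from q
print(z.requires_grad, z.grad_fn)   # False, None: the graph is cut at the sample
```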
Two Fundamental Approaches:
There are two main strategies for obtaining gradients through expectations:
Score Function Estimator (REINFORCE): Use the log-derivative trick to express the gradient as an expectation we can sample
Pathwise Gradient (Reparameterization): Rewrite the sampling process so randomness is external to the parameters
Both produce unbiased gradient estimates, but with very different variance properties. Understanding both is essential for choosing the right approach for your problem.
The score function estimator (also called REINFORCE or likelihood ratio estimator) uses a clever mathematical identity:
$$\nabla_\phi \log q_\phi(\mathbf{z}) = \frac{\nabla_\phi q_\phi(\mathbf{z})}{q_\phi(\mathbf{z})}$$
Rearranging: $\nabla_\phi q_\phi(\mathbf{z}) = q_\phi(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z})$
Derivation:
$$\nabla_\phi \mathbb{E}_{q_\phi}[f(\mathbf{z})] = \nabla_\phi \int q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
$$= \int \nabla_\phi q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
$$= \int q_\phi(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
$$= \mathbb{E}_{q_\phi}[f(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z})]$$
This is an expectation we can estimate via Monte Carlo sampling!
The term ∇_φ log q_φ(z) is called the 'score function' in statistics. It measures how the log-probability of z changes with parameters. The estimator multiplies rewards f(z) by scores, crediting parameters that made high-reward samples more likely.
```python
import torch
from torch.distributions import Normal

def score_function_gradient(encoder, f, x, n_samples=10):
    """
    Surrogate objective for the score function estimator: backpropagating
    through the returned scalar yields
        ∇_φ E_q[f(z)] ≈ (1/N) Σ f(z_i) ∇_φ log q_φ(z_i)
    """
    z_mean, z_logvar = encoder(x)
    z_std = torch.exp(0.5 * z_logvar)
    q = Normal(z_mean, z_std)

    gradient_estimate = 0
    for _ in range(n_samples):
        z = q.sample()                          # Non-differentiable sample
        log_prob = q.log_prob(z).sum(dim=-1)    # log q_φ(z)
        reward = f(z)                           # Some scalar function of z

        # Score function surrogate: reward-weighted log-probability.
        # Differentiating this term w.r.t. φ gives f(z) ∇_φ log q_φ(z),
        # because z itself carries no gradient back to φ.
        gradient_estimate += reward * log_prob

    gradient_estimate /= n_samples
    return gradient_estimate.mean()  # Scalar for backprop
```

The Variance Problem:
The score function estimator is unbiased but often has extremely high variance. The gradient estimate is:
$$\hat{g} = f(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z})$$
The variance of $\hat{g}$ depends on the variance of $f(\mathbf{z})$. When $f$ can take very different values for different $\mathbf{z}$, the gradient estimates fluctuate wildly, making training slow or impossible.
This is why the reparameterization trick (next pages) was such a breakthrough—it often achieves orders of magnitude lower variance.
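To make the variance concrete, here is a hedged toy experiment (the setup $q = \mathcal{N}(\mu, 1)$ and $f(z) = z^2$ is chosen purely for illustration): the single-sample score function estimate of $\nabla_\mu \mathbb{E}[z^2] = 2\mu$ is correct on average, but its per-sample spread is several times larger than the gradient itself.

```python
import torch

torch.manual_seed(0)
mu, n = 1.5, 100_000
z = mu + torch.randn(n)             # z ~ N(mu, 1)

# Single-sample score function estimates: f(z) * d/dmu log q(z),
# with d/dmu log N(z; mu, 1) = (z - mu).
g = z**2 * (z - mu)

print(f"true gradient : {2 * mu:.2f}")
print(f"estimate mean : {g.mean().item():.2f}")   # close to 2*mu (unbiased)
print(f"estimate std  : {g.std().item():.2f}")    # much larger than the gradient itself
```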
The pathwise gradient or reparameterization trick takes a fundamentally different approach. Instead of computing gradients through the probability density, it computes gradients through the samples themselves.
Key Insight:
Many distributions can be written as deterministic transformations of fixed noise:
$$\mathbf{z} = g_\phi(\epsilon), \quad \epsilon \sim p(\epsilon)$$
For example, Gaussian: $\mathbf{z} = \mu_\phi + \sigma_\phi \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
The Gradient:
$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z})}[f(\mathbf{z})] = \nabla_\phi \mathbb{E}_{p(\epsilon)}[f(g_\phi(\epsilon))]$$
$$= \mathbb{E}_{p(\epsilon)}[\nabla_\phi f(g_\phi(\epsilon))]$$
$$= \mathbb{E}_{p(\epsilon)}[\nabla_{\mathbf{z}} f(\mathbf{z}) \cdot \nabla_\phi g_\phi(\epsilon)]$$
Now the expectation is over $p(\epsilon)$, which doesn't depend on $\phi$! We can move the gradient inside.
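Continuing the same illustrative toy problem (an assumption for this page, not part of any particular model), the sketch below shows autograd differentiating straight through $z = \mu + \sigma\epsilon$, with per-sample pathwise gradients that have a far smaller spread than the score function estimates above.

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

eps = torch.randn(100_000)          # randomness is external to the parameters
z = mu + sigma * eps                # z = g_phi(eps)

# Per-sample pathwise gradients of f(z) = z^2 w.r.t. mu: df/dz * dz/dmu = 2z * 1
g = 2 * z.detach()
print(f"pathwise mean/std: {g.mean().item():.2f} / {g.std().item():.2f}")  # ~2*mu, std ~2*sigma

# In practice one simply backpropagates through the Monte Carlo average:
(z**2).mean().backward()
print(mu.grad.item(), sigma.grad.item())   # ~2*mu and ~2*sigma
```

The table below compares the two estimators.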
| Property | Score Function | Pathwise (Reparam) |
|---|---|---|
| Unbiased | Yes | Yes |
| Variance | High | Low |
| Requirements | Only log q_φ(z) | Differentiable transformation g_φ |
| Discrete z | Works | Doesn't work (usually) |
| Black-box f | Works | Requires ∇_z f |
Pathwise gradients require f to be differentiable with respect to z and the distribution to have a reparameterizable form. For discrete distributions or non-differentiable f, score function methods (or their relaxations) are necessary.
When using score function estimators, variance reduction is critical. Several techniques exist:
1. Baselines:
Subtract a baseline $b$ from the reward that doesn't depend on the action:
$$\nabla_\phi \mathbb{E}_q[f(\mathbf{z})] = \mathbb{E}_q[(f(\mathbf{z}) - b) \nabla_\phi \log q_\phi(\mathbf{z})]$$
This remains unbiased because $\mathbb{E}_q[b \nabla_\phi \log q_\phi(\mathbf{z})] = b \nabla_\phi \int q_\phi(\mathbf{z}) \, d\mathbf{z} = b \nabla_\phi 1 = 0$.
The optimal baseline minimizes variance: $b^* = \frac{\mathbb{E}[f(\mathbf{z}) \, \|\nabla_\phi \log q_\phi(\mathbf{z})\|^2]}{\mathbb{E}[\|\nabla_\phi \log q_\phi(\mathbf{z})\|^2]}$
In practice, a running average of $f(\mathbf{z})$ works well.
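A quick numeric check of this, on the same illustrative toy problem as above (the baseline here is the batch average of the rewards, standing in for a running mean):

```python
import torch

torch.manual_seed(0)
mu, n = 1.5, 100_000
z = mu + torch.randn(n)
score = z - mu                      # d/dmu log N(z; mu, 1)

g_plain = z**2 * score              # raw score function estimates
b = (z**2).mean()                   # baseline: average reward (stand-in for a running mean)
g_base = (z**2 - b) * score         # baseline-subtracted estimates

print(f"plain    mean/std: {g_plain.mean().item():.2f} / {g_plain.std().item():.2f}")
print(f"baseline mean/std: {g_base.mean().item():.2f} / {g_base.std().item():.2f}")  # same mean, smaller std
```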
2. Control Variates:
More generally, we can subtract any function whose expectation we know analytically:
$$\hat{g} = f(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z}) - c \cdot \big(h(\mathbf{z}) - \mathbb{E}_q[h(\mathbf{z})]\big) \nabla_\phi \log q_\phi(\mathbf{z})$$
where $c$ is chosen to minimize variance and $h$ is correlated with $f$.
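As a sketch of one way to put this into practice (the choices $q = \mathcal{N}(\mu, 1)$, $f(z) = z^2$, $h(z) = z$ with $\mathbb{E}_q[h] = \mu$ known in closed form are illustrative assumptions), the correlated term is subtracted from the reward and its analytically known gradient contribution is added back, so the estimator stays unbiased for any $c$:

```python
import torch

torch.manual_seed(0)
mu, n = 1.5, 100_000
z = mu + torch.randn(n)
score = z - mu                      # d/dmu log N(z; mu, 1)

f = z**2                            # reward
h = z                               # control variate, correlated with f; E_q[h] = mu

# Pick c to minimize variance: c ~= Cov(f*score, h*score) / Var(h*score)
a, b = f * score, h * score
c = ((a - a.mean()) * (b - b.mean())).mean() / b.var()

# Corrected estimator: (f - c*h) * score + c * d/dmu E_q[h], and d/dmu E_q[h] = 1 here
g_plain = a
g_cv = (f - c * h) * score + c

print(f"plain   mean/std: {g_plain.mean().item():.2f} / {g_plain.std().item():.2f}")
print(f"with cv mean/std: {g_cv.mean().item():.2f} / {g_cv.std().item():.2f}")  # both ~2*mu, smaller std
```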
3. Antithetic Sampling:
Use pairs of samples that are negatively correlated. For a distribution symmetric about zero: $$\hat{g} = \frac{1}{2}\big(f(\mathbf{z}) + f(-\mathbf{z})\big) \nabla_\phi \log q_\phi(\mathbf{z})$$ More generally, the antithetic partner is the sample mirrored about the mean.
This can reduce variance when the two paired terms are negatively correlated, which depends on the structure of $f$.
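A minimal sketch for the Gaussian case (same illustrative toy problem), where the antithetic partner is the sample mirrored about the mean:

```python
import torch

torch.manual_seed(0)
mu, n_pairs = 1.5, 50_000
eps = torch.randn(n_pairs)
f = lambda z: z**2

# Antithetic pair: mirror each sample about the mean; the scores flip sign.
z_plus, z_minus = mu + eps, mu - eps
g_anti = 0.5 * (f(z_plus) * eps + f(z_minus) * (-eps))

# Same compute budget with two independent samples per estimate, for comparison.
eps2 = torch.randn(n_pairs)
g_indep = 0.5 * (f(mu + eps) * eps + f(mu + eps2) * eps2)

print(f"antithetic  mean/std: {g_anti.mean().item():.2f} / {g_anti.std().item():.2f}")
print(f"independent mean/std: {g_indep.mean().item():.2f} / {g_indep.std().item():.2f}")  # both ~2*mu
```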
4. Multiple Samples:
Simply using more samples reduces the variance of the averaged estimate by a factor of $1/N$ (its standard deviation by $1/\sqrt{N}$), but increases compute proportionally.
```python
class ScoreFunctionWithBaseline:
    """Score function estimator with learned baseline."""

    def __init__(self):
        self.running_mean = 0
        self.count = 0

    def compute_gradient(self, encoder, f, x, n_samples=10):
        z_mean, z_logvar = encoder(x)
        z_std = torch.exp(0.5 * z_logvar)
        q = Normal(z_mean, z_std)

        rewards = []
        log_probs = []
        for _ in range(n_samples):
            z = q.sample()
            reward = f(z)
            log_prob = q.log_prob(z).sum(dim=-1)
            rewards.append(reward)
            log_probs.append(log_prob)

        rewards = torch.stack(rewards)
        log_probs = torch.stack(log_probs)

        # Baseline: running mean of rewards
        baseline = self.running_mean

        # Centered rewards
        centered_rewards = rewards - baseline

        # Gradient estimate
        gradient = (centered_rewards * log_probs).mean()

        # Update running mean
        batch_mean = rewards.mean().item()
        self.count += 1
        self.running_mean += (batch_mean - self.running_mean) / self.count

        return gradient
```

The choice between score function and pathwise gradients depends on your problem structure:
For most VAE applications with Gaussian posteriors, always use the reparameterization trick. It's implemented automatically in frameworks like PyTorch via rsample() (reparameterized sample) vs sample() (non-differentiable).
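A quick check of that distinction (the variable names are illustrative):

```python
import torch
from torch.distributions import Normal

mu = torch.zeros(4, requires_grad=True)
q = Normal(mu, torch.ones(4))

z = q.rsample()              # reparameterized: mu + std * eps, so gradients can flow
(z**2).sum().backward()
print(mu.grad)               # non-None: the gradient reached the variational parameters

print(q.sample().requires_grad)  # False: a plain sample() is cut off from the graph
```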
You now understand the fundamental challenge of gradient estimation through stochastic objectives. Next, we'll dive deep into the reparameterization trick—the technique that made VAEs practical.