Variational inference requires optimizing an objective that involves an expectation over a distribution we're trying to learn. This creates a fundamental challenge: how do we compute gradients when the thing we're differentiating through depends on the parameters we're optimizing?
This page tackles this challenge head-on, explaining why naive approaches fail and introducing the foundational concepts that enable gradient-based learning in variational models.
By the end of this page, you will: (1) Understand why gradients through expectations are non-trivial, (2) Distinguish between pathwise and score function gradient estimators, (3) Analyze variance and bias properties of different estimators, (4) Apply baseline and control variate techniques to reduce variance.
Consider the ELBO we want to maximize:
$$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$
We need gradients with respect to both sets of parameters. The gradient with respect to $\theta$ poses no special difficulty, but the gradient with respect to $\phi$ does, because the expectation itself is taken under $q_\phi$.
The Core Issue:
$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z})}[f(\mathbf{z})] = \nabla_\phi \int q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
The difficulty is that the distribution $q_\phi$ itself depends on $\phi$: pushing the gradient inside gives $\int \nabla_\phi q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$, which is no longer an expectation under $q_\phi$, so it cannot be estimated by simply drawing samples from $q_\phi$ and averaging.
You might think: 'Just sample z ~ q_φ and compute gradients.' But ∇_φ(sample from q_φ) is undefined! Sampling is a non-differentiable operation. We need special techniques to obtain gradient information.
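To see this concretely, here is a minimal PyTorch check (variable names are illustrative): a draw produced by `.sample()` is disconnected from the computation graph, so no gradient can reach the variational parameters.

```python
import torch
from torch.distributions import Normal

# mu is the parameter we would like gradients for.
mu = torch.zeros(3, requires_grad=True)
q = Normal(mu, torch.ones(3))

z = q.sample()                      # draw a sample from q
print(z.requires_grad, z.grad_fn)   # False, None: the graph is cut at the sample
```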
Two Fundamental Approaches:
There are two main strategies for obtaining gradients through expectations:
Score Function Estimator (REINFORCE): Use the log-derivative trick to express the gradient as an expectation we can sample
Pathwise Gradient (Reparameterization): Rewrite the sampling process so randomness is external to the parameters
Both produce unbiased gradient estimates, but with very different variance properties. Understanding both is essential for choosing the right approach for your problem.
The score function estimator (also called REINFORCE or likelihood ratio estimator) uses a clever mathematical identity:
$$\nabla_\phi \log q_\phi(\mathbf{z}) = \frac{\nabla_\phi q_\phi(\mathbf{z})}{q_\phi(\mathbf{z})}$$
Rearranging: $\nabla_\phi q_\phi(\mathbf{z}) = q_\phi(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z})$
Derivation:
$$\nabla_\phi \mathbb{E}_{q_\phi}[f(\mathbf{z})] = \nabla_\phi \int q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
$$= \int \nabla_\phi q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
$$= \int q_\phi(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z}) f(\mathbf{z}) \, d\mathbf{z}$$
$$= \mathbb{E}_{q_\phi}[f(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z})]$$
This is an expectation we can estimate via Monte Carlo sampling!
The term ∇_φ log q_φ(z) is called the 'score function' in statistics. It measures how the log-probability of z changes with parameters. The estimator multiplies rewards f(z) by scores, crediting parameters that made high-reward samples more likely.
```python
import torch
from torch.distributions import Normal

def score_function_gradient(encoder, f, x, n_samples=10):
    """
    Surrogate objective for the score function estimator: backpropagating
    through the returned scalar yields
        ∇_φ E_q[f(z)] ≈ (1/N) Σ f(z_i) ∇_φ log q_φ(z_i)
    """
    z_mean, z_logvar = encoder(x)
    z_std = torch.exp(0.5 * z_logvar)
    q = Normal(z_mean, z_std)

    gradient_estimate = 0
    for _ in range(n_samples):
        z = q.sample()                          # Non-differentiable sample
        log_prob = q.log_prob(z).sum(dim=-1)    # log q_φ(z)
        reward = f(z)                           # Some scalar function of z

        # Score function surrogate: reward-weighted log-probability.
        # Differentiating this term w.r.t. φ gives f(z) ∇_φ log q_φ(z),
        # because z itself carries no gradient back to φ.
        gradient_estimate += reward * log_prob

    gradient_estimate /= n_samples
    return gradient_estimate.mean()  # Scalar for backprop
```

The Variance Problem:
The score function estimator is unbiased but often has extremely high variance. The gradient estimate is:
$$\hat{g} = f(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z})$$
The variance of $\hat{g}$ depends on the variance of $f(\mathbf{z})$. When $f$ can take very different values for different $\mathbf{z}$, the gradient estimates fluctuate wildly, making training slow or impossible.
This is why the reparameterization trick (next pages) was such a breakthrough—it often achieves orders of magnitude lower variance.
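To make the variance concrete, here is a hedged toy experiment (the setup $q = \mathcal{N}(\mu, 1)$ and $f(z) = z^2$ is chosen purely for illustration): the single-sample score function estimate of $\nabla_\mu \mathbb{E}[z^2] = 2\mu$ is correct on average, but its per-sample spread is several times larger than the gradient itself.

```python
import torch

torch.manual_seed(0)
mu, n = 1.5, 100_000
z = mu + torch.randn(n)             # z ~ N(mu, 1)

# Single-sample score function estimates: f(z) * d/dmu log q(z),
# with d/dmu log N(z; mu, 1) = (z - mu).
g = z**2 * (z - mu)

print(f"true gradient : {2 * mu:.2f}")
print(f"estimate mean : {g.mean().item():.2f}")   # close to 2*mu (unbiased)
print(f"estimate std  : {g.std().item():.2f}")    # much larger than the gradient itself
```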
The pathwise gradient or reparameterization trick takes a fundamentally different approach. Instead of computing gradients through the probability density, it computes gradients through the samples themselves.
Key Insight:
Many distributions can be written as deterministic transformations of fixed noise:
$$\mathbf{z} = g_\phi(\epsilon), \quad \epsilon \sim p(\epsilon)$$
For example, Gaussian: $\mathbf{z} = \mu_\phi + \sigma_\phi \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
The Gradient:
$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z})}[f(\mathbf{z})] = \nabla_\phi \mathbb{E}_{p(\epsilon)}[f(g_\phi(\epsilon))]$$
$$= \mathbb{E}_{p(\epsilon)}[\nabla_\phi f(g_\phi(\epsilon))]$$
$$= \mathbb{E}_{p(\epsilon)}[\nabla_{\mathbf{z}} f(\mathbf{z}) \cdot \nabla_\phi g_\phi(\epsilon)]$$
Now the expectation is over $p(\epsilon)$, which doesn't depend on $\phi$! We can move the gradient inside.
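Continuing the same illustrative toy problem (an assumption for this page, not part of any particular model), the sketch below shows autograd differentiating straight through $z = \mu + \sigma\epsilon$, with per-sample pathwise gradients that have a far smaller spread than the score function estimates above.

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)
sigma = torch.tensor(1.0, requires_grad=True)

eps = torch.randn(100_000)          # randomness is external to the parameters
z = mu + sigma * eps                # z = g_phi(eps)

# Per-sample pathwise gradients of f(z) = z^2 w.r.t. mu: df/dz * dz/dmu = 2z * 1
g = 2 * z.detach()
print(f"pathwise mean/std: {g.mean().item():.2f} / {g.std().item():.2f}")  # ~2*mu, std ~2*sigma

# In practice one simply backpropagates through the Monte Carlo average:
(z**2).mean().backward()
print(mu.grad.item(), sigma.grad.item())   # ~2*mu and ~2*sigma
```

The table below compares the two estimators.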
| Property | Score Function | Pathwise (Reparam) |
|---|---|---|
| Unbiased | Yes | Yes |
| Variance | High | Low |
| Requirements | Only log q_φ(z) | Differentiable transformation g_φ |
| Discrete z | Works | Doesn't work (usually) |
| Black-box f | Works | Requires ∇_z f |
Pathwise gradients require f to be differentiable with respect to z and the distribution to have a reparameterizable form. For discrete distributions or non-differentiable f, score function methods (or their relaxations) are necessary.
When using score function estimators, variance reduction is critical. Several techniques exist:
1. Baselines:
Subtract a baseline $b$ from the reward that doesn't depend on the action:
$$\nabla_\phi \mathbb{E}_q[f(\mathbf{z})] = \mathbb{E}_q[(f(\mathbf{z}) - b) \nabla_\phi \log q_\phi(\mathbf{z})]$$
This remains unbiased because $\mathbb{E}_q[b \nabla_\phi \log q_\phi(\mathbf{z})] = b \nabla_\phi \int q_\phi(\mathbf{z}) \, d\mathbf{z} = b \nabla_\phi 1 = 0$.
The optimal baseline minimizes variance: $b^* = \frac{\mathbb{E}[f(\mathbf{z}) \, \|\nabla_\phi \log q_\phi(\mathbf{z})\|^2]}{\mathbb{E}[\|\nabla_\phi \log q_\phi(\mathbf{z})\|^2]}$
In practice, a running average of $f(\mathbf{z})$ works well.
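A quick numeric check of this, on the same illustrative toy problem as above (the baseline here is the batch average of the rewards, standing in for a running mean):

```python
import torch

torch.manual_seed(0)
mu, n = 1.5, 100_000
z = mu + torch.randn(n)
score = z - mu                      # d/dmu log N(z; mu, 1)

g_plain = z**2 * score              # raw score function estimates
b = (z**2).mean()                   # baseline: average reward (stand-in for a running mean)
g_base = (z**2 - b) * score         # baseline-subtracted estimates

print(f"plain    mean/std: {g_plain.mean().item():.2f} / {g_plain.std().item():.2f}")
print(f"baseline mean/std: {g_base.mean().item():.2f} / {g_base.std().item():.2f}")  # same mean, smaller std
```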
2. Control Variates:
More generally, we can subtract any function whose expectation we know analytically:
$$\hat{g} = f(\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z}) - c \cdot \big(h(\mathbf{z}) - \mathbb{E}_q[h(\mathbf{z})]\big) \nabla_\phi \log q_\phi(\mathbf{z})$$
where $c$ is chosen to minimize variance and $h$ is correlated with $f$.
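As a sketch of one way to put this into practice (the choices $q = \mathcal{N}(\mu, 1)$, $f(z) = z^2$, $h(z) = z$ with $\mathbb{E}_q[h] = \mu$ known in closed form are illustrative assumptions), the correlated term is subtracted from the reward and its analytically known gradient contribution is added back, so the estimator stays unbiased for any $c$:

```python
import torch

torch.manual_seed(0)
mu, n = 1.5, 100_000
z = mu + torch.randn(n)
score = z - mu                      # d/dmu log N(z; mu, 1)

f = z**2                            # reward
h = z                               # control variate, correlated with f; E_q[h] = mu

# Pick c to minimize variance: c ~= Cov(f*score, h*score) / Var(h*score)
a, b = f * score, h * score
c = ((a - a.mean()) * (b - b.mean())).mean() / b.var()

# Corrected estimator: (f - c*h) * score + c * d/dmu E_q[h], and d/dmu E_q[h] = 1 here
g_plain = a
g_cv = (f - c * h) * score + c

print(f"plain   mean/std: {g_plain.mean().item():.2f} / {g_plain.std().item():.2f}")
print(f"with cv mean/std: {g_cv.mean().item():.2f} / {g_cv.std().item():.2f}")  # both ~2*mu, smaller std
```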
3. Antithetic Sampling:
Use pairs of samples that are negatively correlated. For a distribution symmetric about zero: $$\hat{g} = \frac{1}{2}\big(f(\mathbf{z}) + f(-\mathbf{z})\big) \nabla_\phi \log q_\phi(\mathbf{z})$$ More generally, the antithetic partner is the sample mirrored about the mean.
This can reduce variance when the two paired terms are negatively correlated, which depends on the structure of $f$.
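A minimal sketch for the Gaussian case (same illustrative toy problem), where the antithetic partner is the sample mirrored about the mean:

```python
import torch

torch.manual_seed(0)
mu, n_pairs = 1.5, 50_000
eps = torch.randn(n_pairs)
f = lambda z: z**2

# Antithetic pair: mirror each sample about the mean; the scores flip sign.
z_plus, z_minus = mu + eps, mu - eps
g_anti = 0.5 * (f(z_plus) * eps + f(z_minus) * (-eps))

# Same compute budget with two independent samples per estimate, for comparison.
eps2 = torch.randn(n_pairs)
g_indep = 0.5 * (f(mu + eps) * eps + f(mu + eps2) * eps2)

print(f"antithetic  mean/std: {g_anti.mean().item():.2f} / {g_anti.std().item():.2f}")
print(f"independent mean/std: {g_indep.mean().item():.2f} / {g_indep.std().item():.2f}")  # both ~2*mu
```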
4. Multiple Samples:
Simply using more samples reduces the variance of the averaged estimate by a factor of $1/N$ (its standard deviation by $1/\sqrt{N}$), but increases compute proportionally.
```python
class ScoreFunctionWithBaseline:
    """Score function estimator with learned baseline."""

    def __init__(self):
        self.running_mean = 0
        self.count = 0

    def compute_gradient(self, encoder, f, x, n_samples=10):
        z_mean, z_logvar = encoder(x)
        z_std = torch.exp(0.5 * z_logvar)
        q = Normal(z_mean, z_std)

        rewards = []
        log_probs = []
        for _ in range(n_samples):
            z = q.sample()
            reward = f(z)
            log_prob = q.log_prob(z).sum(dim=-1)
            rewards.append(reward)
            log_probs.append(log_prob)

        rewards = torch.stack(rewards)
        log_probs = torch.stack(log_probs)

        # Baseline: running mean of rewards
        baseline = self.running_mean

        # Centered rewards
        centered_rewards = rewards - baseline

        # Gradient estimate
        gradient = (centered_rewards * log_probs).mean()

        # Update running mean
        batch_mean = rewards.mean().item()
        self.count += 1
        self.running_mean += (batch_mean - self.running_mean) / self.count

        return gradient
```

The choice between score function and pathwise gradients depends on your problem structure:
For most VAE applications with Gaussian posteriors, always use the reparameterization trick. It's implemented automatically in frameworks like PyTorch via rsample() (reparameterized sample) vs sample() (non-differentiable).
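A quick check of that distinction (the variable names are illustrative):

```python
import torch
from torch.distributions import Normal

mu = torch.zeros(4, requires_grad=True)
q = Normal(mu, torch.ones(4))

z = q.rsample()              # reparameterized: mu + std * eps, so gradients can flow
(z**2).sum().backward()
print(mu.grad)               # non-None: the gradient reached the variational parameters

print(q.sample().requires_grad)  # False: a plain sample() is cut off from the graph
```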
You now understand the fundamental challenge of gradient estimation through stochastic objectives. Next, we'll dive deep into the reparameterization trick—the technique that made VAEs practical.