We established that reversing the diffusion process requires estimating the score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$—the gradient of the log probability density at each noise level. But how do we train a neural network to estimate this quantity when we don't have access to the true probability density?
The answer lies in denoising score matching, a remarkable technique that sidesteps the intractable normalization constant problem and provides a practical training objective. This page derives the theory, explains the equivalences, and presents the practical training algorithms.
After this page, you will understand: (1) why direct score matching is intractable, (2) how denoising score matching provides an equivalent but tractable objective, (3) the connection between noise prediction and score estimation, (4) the complete training algorithm, and (5) practical implementation details.
Definition:
The score function of a probability distribution $p(\mathbf{x})$ is the gradient of its log-density:
$$\mathbf{s}(\mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x})$$
Why is the score useful?
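One answer, worth making explicit: the score is independent of the normalization constant. If $p(\mathbf{x}) = \tilde{p}(\mathbf{x})/Z$ with an intractable $Z = \int \tilde{p}(\mathbf{x})\, d\mathbf{x}$, then

$$\nabla_\mathbf{x} \log p(\mathbf{x}) = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_\mathbf{x} \log Z}_{=\,\mathbf{0}} = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x})$$

since $Z$ does not depend on $\mathbf{x}$. The score also points in the direction of increasing probability, which is exactly the signal reverse diffusion needs.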
The score matching objective (Hyvärinen, 2005):
To train a score network $\mathbf{s}_\theta(\mathbf{x})$ to approximate the true score, we minimize:
$$\mathcal{J}(\theta) = \mathbb{E}_{p(\mathbf{x})} \left[ \frac{1}{2} \left\| \mathbf{s}_\theta(\mathbf{x}) - \nabla_\mathbf{x} \log p(\mathbf{x}) \right\|^2 \right]$$
The problem: We don't know $\nabla_\mathbf{x} \log p(\mathbf{x})$—that's exactly what we're trying to learn!
The solution: Through integration by parts, this objective can be rewritten, up to a constant independent of $\theta$, in a form that doesn't require the true score:
$$\mathcal{J}(\theta) = \mathbb{E}_{p(\mathbf{x})} \left[ \text{tr}\left(\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})\right) + \frac{1}{2} \left\| \mathbf{s}_\theta(\mathbf{x}) \right\|^2 \right]$$
But this requires the trace of the Jacobian $\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})$, and computing that trace exactly takes $O(d)$ backward passes for $d$-dimensional data. For images with millions of dimensions, this is prohibitively expensive, as the sketch below makes concrete. We need a better approach.
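To make the cost concrete, here is a naive sketch that computes the exact Jacobian trace with one backward pass per dimension. The `score_net` interface (flattened `[B, d]` inputs to `[B, d]` scores) and the function name are illustrative assumptions:

```python
import torch

def jacobian_trace_naive(score_net, x: torch.Tensor) -> torch.Tensor:
    """Exact trace of the score Jacobian, one backward pass per dimension."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)  # assumed shape [B, d]
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):  # d backward passes: the O(d) bottleneck
        grad_i = torch.autograd.grad(s[:, i].sum(), x, retain_graph=True)[0]
        trace = trace + grad_i[:, i]  # accumulate diagonal entry ds_i/dx_i
    return trace
```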
Vincent (2011) introduced Denoising Score Matching, which provides an equivalent yet tractable objective.
The key insight:
If we perturb data $\mathbf{x}_0$ with known noise to get $\mathbf{x}_t$, we know the score of the perturbed distribution analytically!
For Gaussian perturbation: $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$
The conditional score is: $$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0}{1 - \bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}$$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the noise that was added.
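A quick numerical sanity check of this identity (all values below are arbitrary test inputs, not part of any real schedule):

```python
import torch

torch.manual_seed(0)
alpha_bar_t = torch.tensor(0.7)  # arbitrary schedule value for the check
x_0 = torch.randn(4, 3)
epsilon = torch.randn_like(x_0)
x_t = alpha_bar_t.sqrt() * x_0 + (1 - alpha_bar_t).sqrt() * epsilon

# Both forms of the conditional score agree
score_a = -(x_t - alpha_bar_t.sqrt() * x_0) / (1 - alpha_bar_t)
score_b = -epsilon / (1 - alpha_bar_t).sqrt()
assert torch.allclose(score_a, score_b, atol=1e-6)
```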
Denoising score matching proves that training to match the conditional score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0)$ yields the same optimum as matching the intractable marginal score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$: the two objectives differ only by a constant independent of $\theta$. By regressing onto conditional scores averaged over all data points, the network automatically learns the correct marginal score.
The denoising score matching objective:
$$\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \mathbf{s}_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) \right\|^2 \right]$$
$$= \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \mathbf{s}_\theta(\mathbf{x}_t, t) + \frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}} \right\|^2 \right]$$
This is tractable! We can sample $t$, $\mathbf{x}_0$, and $\boldsymbol{\epsilon}$, then compute the loss without any expensive Jacobian calculations.
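As a minimal sketch of this objective in code (the names `score_net` and `alphas_cumprod` are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def dsm_loss(score_net, x_0, alphas_cumprod, T):
    """Denoising score matching loss for a score network s_theta(x_t, t)."""
    t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)
    abar = alphas_cumprod[t].view(-1, *([1] * (x_0.dim() - 1)))
    epsilon = torch.randn_like(x_0)
    x_t = abar.sqrt() * x_0 + (1 - abar).sqrt() * epsilon
    target = -epsilon / (1 - abar).sqrt()  # conditional score of q(x_t | x_0)
    return F.mse_loss(score_net(x_t, t), target)
```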
Instead of directly predicting the score, we can reparameterize to predict the noise $\boldsymbol{\epsilon}$.
The relationship:
If $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ predicts the noise, then: $$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$
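In code this conversion is a one-liner (a sketch; `epsilon_pred` and `alpha_bar_t` are assumed to be broadcast-compatible tensors):

```python
import torch

def score_from_epsilon(epsilon_pred: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """s_theta(x_t, t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)."""
    return -epsilon_pred / torch.sqrt(1.0 - alpha_bar_t)
```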
The simplified training objective (DDPM):
$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, t) \right\|^2 \right]$$
This is the objective used in the original DDPM paper—beautifully simple: predict the noise that was added.
A minimal reference implementation of this loss:

```python
import torch
import torch.nn.functional as F


def ddpm_loss(
    model: torch.nn.Module,
    x_0: torch.Tensor,
    sqrt_alphas_cumprod: torch.Tensor,
    sqrt_one_minus_alphas_cumprod: torch.Tensor,
    T: int,
) -> torch.Tensor:
    """
    Compute the DDPM training loss (simplified objective).

    Args:
        model: Neural network that predicts noise epsilon(x_t, t)
        x_0: Clean data samples [B, C, H, W]
        sqrt_alphas_cumprod: Precomputed schedule values
        sqrt_one_minus_alphas_cumprod: Precomputed schedule values
        T: Total number of diffusion timesteps

    Returns:
        loss: Mean squared error between true and predicted noise
    """
    batch_size = x_0.shape[0]
    device = x_0.device

    # Sample random timesteps uniformly
    t = torch.randint(0, T, (batch_size,), device=device)

    # Sample random noise
    epsilon = torch.randn_like(x_0)

    # Compute x_t using the closed-form forward formula
    sqrt_alpha_bar = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * epsilon

    # Predict the noise
    epsilon_pred = model(x_t, t)

    # Simple MSE loss
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
```

Noise prediction is only one of several possible targets:

| Target | Formula | Advantages | Usage |
|---|---|---|---|
| ε (noise) | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | Simple objective, stable training | DDPM, most implementations |
| Score | $\mathbf{s}_\theta(\mathbf{x}_t, t)$ | Direct theoretical connection | Score-based models |
| x₀ (clean data) | $\mathbf{x}_{0,\theta}(\mathbf{x}_t, t)$ | Interpretable predictions | Some variants |
| v (velocity) | $\mathbf{v}_\theta = \sqrt{\bar{\alpha}_t}\boldsymbol{\epsilon} - \sqrt{1-\bar{\alpha}_t}\mathbf{x}_0$ | Better for high-res | Progressive distillation |
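Because all four targets are linear in $(\mathbf{x}_0, \boldsymbol{\epsilon})$ given $\mathbf{x}_t$, a prediction in one parameterization can be converted to the others. A sketch (the helper name and interface are hypothetical):

```python
import torch

def convert_prediction(x_t, pred, alpha_bar_t, target: str):
    """Recover (x_0, epsilon) from a prediction in any parameterization,
    using x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * epsilon."""
    a = torch.sqrt(alpha_bar_t)
    s = torch.sqrt(1.0 - alpha_bar_t)
    if target == "epsilon":
        return (x_t - s * pred) / a, pred
    if target == "x0":
        return pred, (x_t - a * pred) / s
    if target == "v":
        # v = a * eps - s * x_0  =>  x_0 = a * x_t - s * v,  eps = s * x_t + a * v
        return a * x_t - s * pred, s * x_t + a * pred
    raise ValueError(f"unknown target: {target}")
```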
Not all timesteps are equally important. The simplified objective weights all timesteps equally, but we can improve by considering the signal-to-noise ratio.
The variational bound objective:
The full ELBO-derived objective weights each timestep differently:
$$\mathcal{L}_{\text{vlb}}(\theta) = \mathbb{E}_{t} \left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2 \right]$$
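As an illustration, these weights can be tabulated directly from a schedule (a sketch assuming a linear $\beta$ schedule and the common DDPM choice $\sigma_t^2 = \beta_t$):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sigma2 = betas                          # common choice: sigma_t^2 = beta_t

# Per-timestep VLB weight on the epsilon-prediction MSE
vlb_weight = betas**2 / (2 * sigma2 * alphas * (1 - alphas_cumprod))
```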
Why weighting matters: relative to the VLB, the uniform weighting of $\mathcal{L}_{\text{simple}}$ down-weights the low-noise (small $t$) terms, which Ho et al. (2020) found improves sample quality; different weightings trade off likelihood against perceptual quality.
The simple uniform-weighted objective works remarkably well in practice. For improved quality, Min-SNR-γ weighting (with γ=5) provides consistent improvements across datasets and model architectures.
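A sketch of how such a weight could be computed, following Hang et al. (2023); the form $\min(\mathrm{SNR}(t), \gamma)/\mathrm{SNR}(t)$ for $\boldsymbol{\epsilon}$-prediction and the helper name are assumptions based on that paper:

```python
import torch

def min_snr_weight(alphas_cumprod: torch.Tensor, t: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """Min-SNR-gamma loss weight for epsilon-prediction."""
    abar = alphas_cumprod[t]
    snr = abar / (1.0 - abar)            # SNR(t) = abar_t / (1 - abar_t)
    return torch.clamp(snr, max=gamma) / snr
```

The per-sample MSE would be multiplied by this weight before averaging over the batch.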
The denoising network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ must accept both noisy data and the timestep. The standard architecture is a U-Net with timestep conditioning.
Key architectural components:

- Sinusoidal timestep embeddings (implemented below), typically processed by a small MLP before conditioning
- Residual blocks that inject the timestep embedding, commonly as a per-channel bias or scale
- Self-attention layers at the lower spatial resolutions
- Skip connections between the encoder and decoder paths

A minimal sketch of a time-conditioned residual block follows.
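This is a hedged sketch, not the exact DDPM architecture; the names, group counts, and additive conditioning are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeConditionedResBlock(nn.Module):
    """Residual block that injects a timestep embedding as a per-channel bias."""

    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)   # channels assumed divisible by 8
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.time_proj(t_emb)[:, :, None, None]  # broadcast over H, W
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h
```

The timestep embedding fed into `time_proj` is typically sinusoidal, as implemented next.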
```python
import torch
import math


def sinusoidal_timestep_embedding(
    timesteps: torch.Tensor,
    embedding_dim: int,
    max_period: float = 10000.0,
) -> torch.Tensor:
    """
    Create sinusoidal timestep embeddings (similar to Transformer positional encoding).

    Args:
        timesteps: 1D tensor of timesteps [B]
        embedding_dim: Dimension of the embedding
        max_period: Maximum period for the sinusoidal functions

    Returns:
        embeddings: Timestep embeddings [B, embedding_dim]
    """
    half_dim = embedding_dim // 2

    # Compute frequencies
    freqs = torch.exp(
        -math.log(max_period)
        * torch.arange(half_dim, dtype=torch.float32, device=timesteps.device)
        / half_dim
    )

    # Compute embeddings
    args = timesteps[:, None].float() * freqs[None, :]
    embeddings = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    # Handle odd embedding dimensions
    if embedding_dim % 2:
        embeddings = torch.cat([embeddings, torch.zeros_like(embeddings[:, :1])], dim=-1)

    return embeddings
```

Putting it all together, the training algorithm is elegantly simple:
```python
import torch
from torch.optim import AdamW
from tqdm import tqdm


def train_diffusion_model(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    schedule_params: dict,
    num_epochs: int,
    lr: float = 1e-4,
    device: str = "cuda",
) -> None:
    """
    Train a diffusion model using the DDPM objective.
    """
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr)

    sqrt_alphas_cumprod = schedule_params['sqrt_alphas_cumprod'].to(device)
    sqrt_one_minus_alphas_cumprod = schedule_params['sqrt_one_minus_alphas_cumprod'].to(device)
    T = len(sqrt_alphas_cumprod)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0

        for x_0 in tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}"):
            x_0 = x_0.to(device)
            batch_size = x_0.shape[0]

            # Algorithm 1 from the DDPM paper:
            # 1. Sample timesteps uniformly: t ~ Uniform({1,...,T})
            #    (0-indexed here so t indexes the precomputed schedule arrays)
            t = torch.randint(0, T, (batch_size,), device=device)

            # 2. Sample noise: epsilon ~ N(0, I)
            epsilon = torch.randn_like(x_0)

            # 3. Compute noisy samples
            sqrt_alpha_bar = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
            sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
            x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * epsilon

            # 4. Predict noise and compute loss
            epsilon_pred = model(x_t, t)
            loss = torch.nn.functional.mse_loss(epsilon_pred, epsilon)

            # 5. Gradient descent step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}: Average Loss = {avg_loss:.6f}")
```

The training algorithm is remarkably straightforward compared to GANs (no adversarial training dynamics) or VAEs (no KL term balancing). Each iteration is a simple supervised regression problem: predict the noise that was added.
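As a usage sketch, the schedule parameters could be precomputed like this before calling `train_diffusion_model` (the linear $\beta$ range follows the DDPM paper; everything else here is illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # linear schedule from the DDPM paper
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
schedule_params = {
    "sqrt_alphas_cumprod": alphas_cumprod.sqrt(),
    "sqrt_one_minus_alphas_cumprod": (1.0 - alphas_cumprod).sqrt(),
}
# train_diffusion_model(model, dataloader, schedule_params, num_epochs=100)
```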
You now understand how to train diffusion models using denoising score matching. Next, we'll dive into DDPM—the specific formulation that made diffusion models practical for high-quality image generation.