We established that reversing the diffusion process requires estimating the score function $\nabla_\mathbf{x} \log p_t(\mathbf{x})$—the gradient of the log probability density at each noise level. But how do we train a neural network to estimate this quantity when we don't have access to the true probability density?
The answer lies in denoising score matching, a remarkable technique that sidesteps the intractable normalization constant problem and provides a practical training objective. This page derives the theory, explains the equivalences, and presents the practical training algorithms.
After this page, you will understand: (1) why direct score matching is intractable, (2) how denoising score matching provides an equivalent but tractable objective, (3) the connection between noise prediction and score estimation, (4) the complete training algorithm, and (5) practical implementation details.
Definition:
The score function of a probability distribution $p(\mathbf{x})$ is the gradient of its log-density:
$$\mathbf{s}(\mathbf{x}) = \nabla_\mathbf{x} \log p(\mathbf{x})$$
Why is the score useful?
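One answer, worth making explicit: the score is independent of the normalization constant. If $p(\mathbf{x}) = \tilde{p}(\mathbf{x})/Z$ with an intractable $Z = \int \tilde{p}(\mathbf{x})\, d\mathbf{x}$, then

$$\nabla_\mathbf{x} \log p(\mathbf{x}) = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_\mathbf{x} \log Z}_{=\,\mathbf{0}} = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x})$$

since $Z$ does not depend on $\mathbf{x}$. The score also points in the direction of increasing probability, which is exactly the signal reverse diffusion needs.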
The score matching objective (Hyvärinen, 2005):
To train a score network $\mathbf{s}_\theta(\mathbf{x})$ to approximate the true score, we minimize:
$$\mathcal{J}(\theta) = \mathbb{E}_{p(\mathbf{x})} \left[ \frac{1}{2} \left\| \mathbf{s}_\theta(\mathbf{x}) - \nabla_\mathbf{x} \log p(\mathbf{x}) \right\|^2 \right]$$
The problem: We don't know $\nabla_\mathbf{x} \log p(\mathbf{x})$—that's exactly what we're trying to learn!
The solution: Through integration by parts, this objective can be rewritten, up to a constant independent of $\theta$, in a form that doesn't require the true score:
$$\mathcal{J}(\theta) = \mathbb{E}_{p(\mathbf{x})} \left[ \text{tr}\left(\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})\right) + \frac{1}{2} \left\| \mathbf{s}_\theta(\mathbf{x}) \right\|^2 \right]$$
But this requires the trace of the Jacobian $\nabla_\mathbf{x} \mathbf{s}_\theta(\mathbf{x})$, and computing that trace exactly takes $O(d)$ backward passes for $d$-dimensional data. For images with millions of dimensions, this is prohibitively expensive, as the sketch below makes concrete. We need a better approach.
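To make the cost concrete, here is a naive sketch that computes the exact Jacobian trace with one backward pass per dimension. The `score_net` interface (flattened `[B, d]` inputs to `[B, d]` scores) and the function name are illustrative assumptions:

```python
import torch

def jacobian_trace_naive(score_net, x: torch.Tensor) -> torch.Tensor:
    """Exact trace of the score Jacobian, one backward pass per dimension."""
    x = x.detach().requires_grad_(True)
    s = score_net(x)  # assumed shape [B, d]
    trace = torch.zeros(x.shape[0], device=x.device)
    for i in range(x.shape[1]):  # d backward passes: the O(d) bottleneck
        grad_i = torch.autograd.grad(s[:, i].sum(), x, retain_graph=True)[0]
        trace = trace + grad_i[:, i]  # accumulate diagonal entry ds_i/dx_i
    return trace
```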
Vincent (2011) introduced Denoising Score Matching, which provides an equivalent yet tractable objective.
The key insight:
If we perturb data $\mathbf{x}_0$ with known noise to get $\mathbf{x}_t$, we know the score of the perturbed distribution analytically!
For Gaussian perturbation: $q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$
The conditional score is: $$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) = -\frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0}{1 - \bar{\alpha}_t} = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}$$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the noise that was added.
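A quick numerical sanity check of this identity (all values below are arbitrary test inputs, not part of any real schedule):

```python
import torch

torch.manual_seed(0)
alpha_bar_t = torch.tensor(0.7)  # arbitrary schedule value for the check
x_0 = torch.randn(4, 3)
epsilon = torch.randn_like(x_0)
x_t = alpha_bar_t.sqrt() * x_0 + (1 - alpha_bar_t).sqrt() * epsilon

# Both forms of the conditional score agree
score_a = -(x_t - alpha_bar_t.sqrt() * x_0) / (1 - alpha_bar_t)
score_b = -epsilon / (1 - alpha_bar_t).sqrt()
assert torch.allclose(score_a, score_b, atol=1e-6)
```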
Denoising score matching proves that training to match the conditional score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t \mid \mathbf{x}_0)$ yields the same optimum as matching the intractable marginal score $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t)$: the two objectives differ only by a constant independent of $\theta$. By regressing onto conditional scores averaged over all data points, the network automatically learns the correct marginal score.
The denoising score matching objective:
$$\mathcal{L}_{\text{DSM}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \mathbf{s}_\theta(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t | \mathbf{x}_0) \right\|^2 \right]$$
$$= \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \mathbf{s}_\theta(\mathbf{x}_t, t) + \frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}} \right\|^2 \right]$$
This is tractable! We can sample $t$, $\mathbf{x}_0$, and $\boldsymbol{\epsilon}$, then compute the loss without any expensive Jacobian calculations.
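As a minimal sketch of this objective in code (the names `score_net` and `alphas_cumprod` are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def dsm_loss(score_net, x_0, alphas_cumprod, T):
    """Denoising score matching loss for a score network s_theta(x_t, t)."""
    t = torch.randint(0, T, (x_0.shape[0],), device=x_0.device)
    abar = alphas_cumprod[t].view(-1, *([1] * (x_0.dim() - 1)))
    epsilon = torch.randn_like(x_0)
    x_t = abar.sqrt() * x_0 + (1 - abar).sqrt() * epsilon
    target = -epsilon / (1 - abar).sqrt()  # conditional score of q(x_t | x_0)
    return F.mse_loss(score_net(x_t, t), target)
```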
Instead of directly predicting the score, we can reparameterize to predict the noise $\boldsymbol{\epsilon}$.
The relationship:
If $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ predicts the noise, then: $$\mathbf{s}_\theta(\mathbf{x}_t, t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$
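In code this conversion is a one-liner (a sketch; `epsilon_pred` and `alpha_bar_t` are assumed to be broadcast-compatible tensors):

```python
import torch

def score_from_epsilon(epsilon_pred: torch.Tensor, alpha_bar_t: torch.Tensor) -> torch.Tensor:
    """s_theta(x_t, t) = -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)."""
    return -epsilon_pred / torch.sqrt(1.0 - alpha_bar_t)
```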
The simplified training objective (DDPM):
$$\mathcal{L}_{\text{simple}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\, \boldsymbol{\epsilon}, t) \right\|^2 \right]$$
This is the objective used in the original DDPM paper—beautifully simple: predict the noise that was added.
A minimal reference implementation of this loss:

```python
import torch
import torch.nn.functional as F


def ddpm_loss(
    model: torch.nn.Module,
    x_0: torch.Tensor,
    sqrt_alphas_cumprod: torch.Tensor,
    sqrt_one_minus_alphas_cumprod: torch.Tensor,
    T: int,
) -> torch.Tensor:
    """
    Compute the DDPM training loss (simplified objective).

    Args:
        model: Neural network that predicts noise epsilon(x_t, t)
        x_0: Clean data samples [B, C, H, W]
        sqrt_alphas_cumprod: Precomputed schedule values
        sqrt_one_minus_alphas_cumprod: Precomputed schedule values
        T: Total number of diffusion timesteps

    Returns:
        loss: Mean squared error between true and predicted noise
    """
    batch_size = x_0.shape[0]
    device = x_0.device

    # Sample random timesteps uniformly
    t = torch.randint(0, T, (batch_size,), device=device)

    # Sample random noise
    epsilon = torch.randn_like(x_0)

    # Compute x_t using the closed-form forward formula
    sqrt_alpha_bar = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * epsilon

    # Predict the noise
    epsilon_pred = model(x_t, t)

    # Simple MSE loss
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
```

Noise prediction is only one of several possible targets:

| Target | Formula | Advantages | Usage |
|---|---|---|---|
| ε (noise) | $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ | Simple objective, stable training | DDPM, most implementations |
| Score | $\mathbf{s}_\theta(\mathbf{x}_t, t)$ | Direct theoretical connection | Score-based models |
| x₀ (clean data) | $\mathbf{x}_{0,\theta}(\mathbf{x}_t, t)$ | Interpretable predictions | Some variants |
| v (velocity) | $\mathbf{v}_\theta = \sqrt{\bar{\alpha}_t}\boldsymbol{\epsilon} - \sqrt{1-\bar{\alpha}_t}\mathbf{x}_0$ | Better for high-res | Progressive distillation |
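Because all four targets are linear in $(\mathbf{x}_0, \boldsymbol{\epsilon})$ given $\mathbf{x}_t$, a prediction in one parameterization can be converted to the others. A sketch (the helper name and interface are hypothetical):

```python
import torch

def convert_prediction(x_t, pred, alpha_bar_t, target: str):
    """Recover (x_0, epsilon) from a prediction in any parameterization,
    using x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * epsilon."""
    a = torch.sqrt(alpha_bar_t)
    s = torch.sqrt(1.0 - alpha_bar_t)
    if target == "epsilon":
        return (x_t - s * pred) / a, pred
    if target == "x0":
        return pred, (x_t - a * pred) / s
    if target == "v":
        # v = a * eps - s * x_0  =>  x_0 = a * x_t - s * v,  eps = s * x_t + a * v
        return a * x_t - s * pred, s * x_t + a * pred
    raise ValueError(f"unknown target: {target}")
```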
Not all timesteps are equally important. The simplified objective weights all timesteps equally, but we can improve by considering the signal-to-noise ratio.
The variational bound objective:
The full ELBO-derived objective weights each timestep differently:
$$\mathcal{L}_{\text{vlb}}(\theta) = \mathbb{E}_{t} \left[ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\|^2 \right]$$
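As an illustration, these weights can be tabulated directly from a schedule (a sketch assuming a linear $\beta$ schedule and the common DDPM choice $\sigma_t^2 = \beta_t$):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sigma2 = betas                          # common choice: sigma_t^2 = beta_t

# Per-timestep VLB weight on the epsilon-prediction MSE
vlb_weight = betas**2 / (2 * sigma2 * alphas * (1 - alphas_cumprod))
```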
Why weighting matters: relative to the VLB, the uniform weighting of $\mathcal{L}_{\text{simple}}$ down-weights the low-noise (small $t$) terms, which Ho et al. (2020) found improves sample quality; different weightings trade off likelihood against perceptual quality.
The simple uniform-weighted objective works remarkably well in practice. For improved quality, Min-SNR-γ weighting (with γ=5) provides consistent improvements across datasets and model architectures.
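A sketch of how such a weight could be computed, following Hang et al. (2023); the form $\min(\mathrm{SNR}(t), \gamma)/\mathrm{SNR}(t)$ for $\boldsymbol{\epsilon}$-prediction and the helper name are assumptions based on that paper:

```python
import torch

def min_snr_weight(alphas_cumprod: torch.Tensor, t: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """Min-SNR-gamma loss weight for epsilon-prediction."""
    abar = alphas_cumprod[t]
    snr = abar / (1.0 - abar)            # SNR(t) = abar_t / (1 - abar_t)
    return torch.clamp(snr, max=gamma) / snr
```

The per-sample MSE would be multiplied by this weight before averaging over the batch.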
The denoising network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ must accept both noisy data and the timestep. The standard architecture is a U-Net with timestep conditioning.
Key architectural components:

- Sinusoidal timestep embeddings (implemented below), typically processed by a small MLP before conditioning
- Residual blocks that inject the timestep embedding, commonly as a per-channel bias or scale
- Self-attention layers at the lower spatial resolutions
- Skip connections between the encoder and decoder paths

A minimal sketch of a time-conditioned residual block follows.
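This is a hedged sketch, not the exact DDPM architecture; the names, group counts, and additive conditioning are illustrative choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeConditionedResBlock(nn.Module):
    """Residual block that injects a timestep embedding as a per-channel bias."""

    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)   # channels assumed divisible by 8
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.conv1(F.silu(self.norm1(x)))
        h = h + self.time_proj(t_emb)[:, :, None, None]  # broadcast over H, W
        h = self.conv2(F.silu(self.norm2(h)))
        return x + h
```

The timestep embedding fed into `time_proj` is typically sinusoidal, as implemented next.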
```python
import torch
import math


def sinusoidal_timestep_embedding(
    timesteps: torch.Tensor,
    embedding_dim: int,
    max_period: float = 10000.0,
) -> torch.Tensor:
    """
    Create sinusoidal timestep embeddings (similar to Transformer positional encoding).

    Args:
        timesteps: 1D tensor of timesteps [B]
        embedding_dim: Dimension of the embedding
        max_period: Maximum period for the sinusoidal functions

    Returns:
        embeddings: Timestep embeddings [B, embedding_dim]
    """
    half_dim = embedding_dim // 2

    # Compute frequencies
    freqs = torch.exp(
        -math.log(max_period)
        * torch.arange(half_dim, dtype=torch.float32, device=timesteps.device)
        / half_dim
    )

    # Compute embeddings
    args = timesteps[:, None].float() * freqs[None, :]
    embeddings = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

    # Handle odd embedding dimensions
    if embedding_dim % 2:
        embeddings = torch.cat([embeddings, torch.zeros_like(embeddings[:, :1])], dim=-1)

    return embeddings
```

Putting it all together, the training algorithm is elegantly simple:
```python
import torch
from torch.optim import AdamW
from tqdm import tqdm


def train_diffusion_model(
    model: torch.nn.Module,
    dataloader: torch.utils.data.DataLoader,
    schedule_params: dict,
    num_epochs: int,
    lr: float = 1e-4,
    device: str = "cuda",
) -> None:
    """
    Train a diffusion model using the DDPM objective.
    """
    model = model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr)

    sqrt_alphas_cumprod = schedule_params['sqrt_alphas_cumprod'].to(device)
    sqrt_one_minus_alphas_cumprod = schedule_params['sqrt_one_minus_alphas_cumprod'].to(device)
    T = len(sqrt_alphas_cumprod)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0

        for x_0 in tqdm(dataloader, desc=f"Epoch {epoch+1}/{num_epochs}"):
            x_0 = x_0.to(device)
            batch_size = x_0.shape[0]

            # Algorithm 1 from the DDPM paper:
            # 1. Sample timesteps uniformly: t ~ Uniform({1,...,T})
            #    (0-indexed here so t indexes the precomputed schedule arrays)
            t = torch.randint(0, T, (batch_size,), device=device)

            # 2. Sample noise: epsilon ~ N(0, I)
            epsilon = torch.randn_like(x_0)

            # 3. Compute noisy samples
            sqrt_alpha_bar = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
            sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
            x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * epsilon

            # 4. Predict noise and compute loss
            epsilon_pred = model(x_t, t)
            loss = torch.nn.functional.mse_loss(epsilon_pred, epsilon)

            # 5. Gradient descent step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        print(f"Epoch {epoch+1}: Average Loss = {avg_loss:.6f}")
```

The training algorithm is remarkably straightforward compared to GANs (no adversarial training dynamics) or VAEs (no KL term balancing). Each iteration is a simple supervised regression problem: predict the noise that was added.
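As a usage sketch, the schedule parameters could be precomputed like this before calling `train_diffusion_model` (the linear $\beta$ range follows the DDPM paper; everything else here is illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # linear schedule from the DDPM paper
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
schedule_params = {
    "sqrt_alphas_cumprod": alphas_cumprod.sqrt(),
    "sqrt_one_minus_alphas_cumprod": (1.0 - alphas_cumprod).sqrt(),
}
# train_diffusion_model(model, dataloader, schedule_params, num_epochs=100)
```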
You now understand how to train diffusion models using denoising score matching. Next, we'll dive into DDPM—the specific formulation that made diffusion models practical for high-quality image generation.