In 2020, a remarkable paper introduced Denoising Diffusion Probabilistic Models (DDPM), demonstrating that iteratively denoising random noise could generate images rivaling GANs—without adversarial training's instabilities. By 2022, diffusion models powered DALL-E 2, Stable Diffusion, and Imagen, fundamentally transforming AI-generated content.
The key insight is deceptively simple: destroying data is easy; learning to reverse that destruction teaches a model to create. This page explores the mathematical and conceptual foundations of the diffusion process—the forward noising mechanism that makes this possible.
By completing this page, you will understand: (1) the forward diffusion process that gradually adds noise to data, (2) the Markov chain formulation and its mathematical properties, (3) the variance schedule and its importance, (4) the closed-form expression for sampling any timestep, and (5) why reversing this process enables generation.
Traditional generative models (VAEs, GANs, flows) learn to map a simple distribution (typically Gaussian) directly to the complex data distribution. This is an extraordinarily difficult task—the mapping must be learned in a single step, requiring the model to capture all data complexity at once.
Diffusion models invert this philosophy. Instead of learning one giant transformation, they:

1. Define a fixed forward process that gradually corrupts data with small amounts of Gaussian noise over many steps.
2. Train a model to undo a single small corruption step at a time.
3. Generate new data by starting from pure noise and applying the learned reverse step repeatedly.
Each reverse step is a much simpler task than the full generation—the model only needs to denoise slightly, not create entire images from scratch.
Imagine teaching someone to sculpt by first showing them how marble erodes into dust, grain by grain. If they can learn to reverse each tiny erosion step, they can reconstruct the statue from dust. Diffusion models learn to sculpt data from noise by mastering this reversal at every granularity.
Why destruction is easier than creation: adding a small amount of Gaussian noise is a fixed, analytically specified operation that requires no learning, whereas creating realistic data in a single shot requires modeling the entire data distribution at once. Reversing one small noising step, by contrast, is a local, well-posed prediction problem.
This decomposition is the fundamental insight: a hard generative problem becomes many easy denoising problems.
The forward diffusion process is a Markov chain that progressively adds Gaussian noise to data over $T$ timesteps. Starting from clean data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, we define a sequence of increasingly noisy latents $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$.
The forward transition kernel:
$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I})$$
where:
- $\mathbf{x}_{t-1}$ is the sample at the previous timestep,
- $\beta_t \in (0, 1)$ is the variance of the noise added at step $t$ (set by the variance schedule, discussed below), and
- $\mathbf{I}$ is the identity matrix.
```python
import torch

def forward_diffusion_step(x_t_minus_1: torch.Tensor, beta_t: float) -> torch.Tensor:
    """
    Perform one step of the forward diffusion process.

    Args:
        x_t_minus_1: Data at timestep t-1, shape [B, C, H, W]
        beta_t: Variance schedule value at timestep t

    Returns:
        x_t: Noised data at timestep t
    """
    # Sample noise from standard Gaussian
    epsilon = torch.randn_like(x_t_minus_1)

    # Apply the forward transition
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * epsilon
    scale = torch.sqrt(torch.tensor(1.0 - beta_t))
    noise_scale = torch.sqrt(torch.tensor(beta_t))
    x_t = scale * x_t_minus_1 + noise_scale * epsilon

    return x_t
```

The Markov property:
The forward process is Markovian—each step depends only on the immediately preceding state. The full forward trajectory factorizes as:
$$q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t | \mathbf{x}_{t-1})$$
The terminal distribution:
As $T \to \infty$ (or for sufficiently large $T$ with appropriate $\beta_t$), the distribution $q(\mathbf{x}_T)$ converges to an isotropic Gaussian: $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
This is crucial: the endpoint of the forward process is a known, simple distribution from which we can easily sample.
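As a quick sanity check, here is a minimal sketch that runs the full forward chain using the `forward_diffusion_step` function above. The batch shape and data range are illustrative; the schedule values ($T = 1000$, betas from $10^{-4}$ to $0.02$) follow the original DDPM defaults.

```python
import torch

# Illustrative settings: DDPM-style linear schedule
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

# Arbitrary "data" in [-1, 1], shape [B, C, H, W]
x = torch.rand(8, 3, 32, 32) * 2 - 1

# Run the full forward chain x_0 -> x_1 -> ... -> x_T
for t in range(T):
    x = forward_diffusion_step(x, betas[t].item())

# After enough steps, the per-element statistics approach N(0, I)
print(f"mean={x.mean().item():.3f}, std={x.std().item():.3f}")
```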
The sequence $\{\beta_1, \beta_2, \ldots, \beta_T\}$ is called the variance schedule or noise schedule. Its design critically impacts model performance—too aggressive destroys information too fast; too gentle wastes computation.
Common variance schedules:
| Schedule | Formula | Characteristics | Use Case |
|---|---|---|---|
| Linear | $\beta_t = \beta_{\min} + \frac{t-1}{T-1}(\beta_{\max} - \beta_{\min})$ | Constant noise increase rate | Original DDPM, simple baseline |
| Cosine | $\bar{\alpha}_t = \cos^2\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$ | Gentler at start and end | Improved DDPM, better quality |
| Quadratic | $\beta_t = \beta_{\min} + (\beta_{\max} - \beta_{\min}) \cdot (t/T)^2$ | Slow start, fast finish | Alternative exploration |
| Sigmoid | $\beta_t = \sigma(-6 + 12t/T) \cdot (\beta_{\max} - \beta_{\min}) + \beta_{\min}$ | S-curve progression | Smooth transitions |
```python
import torch
import numpy as np

def linear_schedule(T: int, beta_min: float = 1e-4, beta_max: float = 0.02):
    """Linear variance schedule from DDPM paper."""
    return torch.linspace(beta_min, beta_max, T)

def cosine_schedule(T: int, s: float = 0.008):
    """
    Cosine variance schedule from 'Improved DDPM'.
    Produces gentler noise addition at boundaries.
    """
    steps = torch.arange(T + 1, dtype=torch.float64)
    alphas_cumprod = torch.cos(((steps / T) + s) / (1 + s) * np.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]  # Normalize
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999).float()

def get_schedule_params(betas: torch.Tensor):
    """Compute all derived quantities from beta schedule."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

    # For sampling x_t directly from x_0
    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

    return {
        'betas': betas,
        'alphas': alphas,
        'alphas_cumprod': alphas_cumprod,
        'sqrt_alphas_cumprod': sqrt_alphas_cumprod,
        'sqrt_one_minus_alphas_cumprod': sqrt_one_minus_alphas_cumprod,
    }
```

The cosine schedule was introduced specifically because linear schedules destroy too much information too quickly in the early steps. With linear scheduling, models often struggle to recover fine details. The cosine schedule preserves more signal early on, leading to better generation quality.
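The quadratic and sigmoid rows of the table are not included in the utilities above. As a sketch under the same conventions (each function returns a length-$T$ tensor of betas), they might look like:

```python
import torch

def quadratic_schedule(T: int, beta_min: float = 1e-4, beta_max: float = 0.02):
    """Quadratic schedule: beta_t grows with (t/T)^2, so noise is added
    slowly at first and faster toward the end."""
    t = torch.arange(1, T + 1, dtype=torch.float32)
    return beta_min + (beta_max - beta_min) * (t / T) ** 2

def sigmoid_schedule(T: int, beta_min: float = 1e-4, beta_max: float = 0.02):
    """Sigmoid schedule: betas follow an S-curve between beta_min and beta_max."""
    t = torch.arange(1, T + 1, dtype=torch.float32)
    return torch.sigmoid(-6 + 12 * t / T) * (beta_max - beta_min) + beta_min
```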
A crucial mathematical property enables efficient training: we can sample $\mathbf{x}_t$ for any timestep $t$ directly from $\mathbf{x}_0$ without iterating through intermediate steps.
Derivation:
Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ (cumulative product).
Since each forward step adds independent Gaussian noise, and sums of independent Gaussians are Gaussian, we can show:
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})$$
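This follows by induction on $t$. Expanding two consecutive steps shows how the independent noise terms merge into one:

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} = \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}}_{t-2}$$

because the two Gaussian noise terms, with variances $\alpha_t(1-\alpha_{t-1})$ and $1-\alpha_t$, combine into a single Gaussian with variance $1 - \alpha_t\alpha_{t-1}$. Repeating this substitution down to $\mathbf{x}_0$ yields the coefficients $\sqrt{\bar{\alpha}_t}$ and $1 - \bar{\alpha}_t$ above.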
The reparameterization form:
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
```python
import torch

def sample_xt_given_x0(
    x_0: torch.Tensor,
    t: torch.Tensor,
    sqrt_alphas_cumprod: torch.Tensor,
    sqrt_one_minus_alphas_cumprod: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Sample x_t directly from x_0 using closed-form formula.

    Args:
        x_0: Clean data, shape [B, C, H, W]
        t: Timesteps, shape [B]
        sqrt_alphas_cumprod: Precomputed sqrt(alpha_bar_t)
        sqrt_one_minus_alphas_cumprod: Precomputed sqrt(1 - alpha_bar_t)

    Returns:
        x_t: Noised samples at timestep t
        epsilon: The noise that was added (needed for training)
    """
    # Sample standard Gaussian noise
    epsilon = torch.randn_like(x_0)

    # Get schedule values for each sample's timestep
    # Reshape for broadcasting: [B] -> [B, 1, 1, 1]
    sqrt_alpha_bar = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)

    # Apply closed-form formula
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * epsilon

    return x_t, epsilon
```

This closed-form expression is essential for efficient training. Instead of sequentially noising through all $t$ steps, we can: (1) sample a random timestep $t$ uniformly, (2) compute $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in one step, and (3) train the model to predict the noise $\boldsymbol{\epsilon}$ that was added. This enables parallelization across timesteps during training.
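Here is a minimal sketch of that training step. The network `model(x_t, t)` is a hypothetical noise predictor that returns a tensor shaped like its input, and `params` is assumed to be the dict produced by `get_schedule_params` above; the mean-squared error on the noise is the simplified objective popularized by DDPM (training is covered in detail on the next page).

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x_0, params, T: int):
    """One noise-prediction training step (sketch).

    Assumes `model`, `x_0`, and the schedule tensors in `params`
    all live on the same device.
    """
    B = x_0.shape[0]
    # (1) Sample a random timestep for each example in the batch
    t = torch.randint(0, T, (B,), device=x_0.device)

    # (2) Jump straight to x_t with the closed-form formula
    x_t, epsilon = sample_xt_given_x0(
        x_0, t,
        params['sqrt_alphas_cumprod'],
        params['sqrt_one_minus_alphas_cumprod'],
    )

    # (3) Train the network to predict the added noise
    epsilon_pred = model(x_t, t)
    return F.mse_loss(epsilon_pred, epsilon)
```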
The Signal-to-Noise Ratio (SNR) provides insight into how information degrades during diffusion:
$$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$
Interpretation: at small $t$, $\bar{\alpha}_t \approx 1$, so the SNR is high and $\mathbf{x}_t$ is still mostly signal; as $t \to T$, $\bar{\alpha}_t \to 0$ and the SNR falls toward zero, leaving almost pure noise. Fine, high-frequency details are lost first, while coarse global structure survives until the SNR becomes very low.
During generation (reverse process), the model first decides 'what should exist' (high-level semantics at low SNR), then 'how things are arranged' (structure at medium SNR), finally 'what details look like' (textures at high SNR). This coarse-to-fine progression naturally emerges from the SNR dynamics.
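As a quick sketch, the SNR curve can be computed directly from any schedule; here it is for the linear schedule with DDPM-default values.

```python
import torch

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a linear schedule
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
snr = alphas_cumprod / (1.0 - alphas_cumprod)

# SNR decreases monotonically: very high at t=1 (mostly signal),
# near zero at t=T (mostly noise)
print(snr[0].item(), snr[T // 2].item(), snr[-1].item())
```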
The fundamental question: how can we reverse a noise-adding process that destroys information?
The key insight from Feller (1949) and Anderson (1982):
For a forward diffusion process defined by the SDE: $$d\mathbf{x} = f(\mathbf{x}, t)dt + g(t)d\mathbf{w}$$
There exists a reverse-time SDE: $$d\mathbf{x} = [f(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})]dt + g(t)d\bar{\mathbf{w}}$$
where $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the score function—the gradient of the log probability density at time $t$.
The practical implication:
If we can estimate the score function $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ at each timestep, we can reverse the diffusion and generate data from noise.
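To make this concrete, here is a minimal Euler-Maruyama sketch of a single reverse-time step. The callables `f(x, t)`, `g(t)`, and `score(x, t)` are generic placeholders (none are defined on this page); `score` stands for any estimate of $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$.

```python
import torch

def reverse_sde_step(x, t, dt, f, g, score):
    """One Euler-Maruyama step of the reverse-time SDE (sketch).

    f(x, t) and g(t) are the drift and diffusion coefficients of the
    forward SDE; score(x, t) estimates grad_x log p_t(x). dt > 0 is the
    step size, and integration runs backward from t = T toward t = 0.
    """
    z = torch.randn_like(x)
    drift = f(x, t) - g(t) ** 2 * score(x, t)
    # Step backward in time: x_{t-dt} = x_t - drift * dt + g(t) * sqrt(dt) * z
    return x - drift * dt + g(t) * (dt ** 0.5) * z
```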
Connection to denoising:
It can be shown that predicting the noise $\boldsymbol{\epsilon}$ added to create $\mathbf{x}_t$ is equivalent to estimating the score:
$$\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}$$
This connects score matching (a principled statistical framework) with the intuitive task of denoising.
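In code, the conversion is a one-liner. This sketch assumes `epsilon_pred` comes from a noise-prediction network and `sqrt_one_minus_alphas_cumprod` is the precomputed tensor from `get_schedule_params` above.

```python
import torch

def noise_to_score(epsilon_pred: torch.Tensor, t: torch.Tensor,
                   sqrt_one_minus_alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Convert predicted noise into a score estimate:
    score(x_t, t) = -epsilon / sqrt(1 - alpha_bar_t).
    Assumes image-shaped inputs [B, C, H, W], as in the code above."""
    denom = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
    return -epsilon_pred / denom
```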
Denoising and score estimation are two perspectives on the same task. A neural network trained to denoise is implicitly learning the score function, which enables principled generation through the reverse SDE. This duality unifies practical intuition with rigorous probability theory.
You now understand the forward diffusion process—the mathematical foundation of diffusion models. Next, we'll explore how to train neural networks to reverse this process through denoising score matching.