While DDPM uses discrete timesteps, Score-Based Generative Models through Stochastic Differential Equations (SDEs) provide a unified continuous-time framework. This perspective, developed by Song et al. in "Score-Based Generative Modeling through Stochastic Differential Equations", reveals deep connections between different diffusion model variants and enables powerful new sampling methods.
The SDE viewpoint shows that diffusion models and score matching are two sides of the same coin—both learn to estimate the gradient of the log probability density.
After this page, you will understand: (1) the score function and score matching fundamentals, (2) the continuous-time SDE formulation, (3) Variance Exploding (VE) and Variance Preserving (VP) SDEs, (4) the reverse-time SDE for generation, (5) probability flow ODEs, and (6) the connection to DDPM.
The score function is the gradient of the log probability density:
$$\nabla_\mathbf{x} \log p(\mathbf{x})$$
Geometrically, this is a vector field that points toward higher density regions. The score has a remarkable property: it doesn't depend on the normalizing constant!
Why scores are special:
If $p(\mathbf{x}) = \frac{1}{Z} \tilde{p}(\mathbf{x})$ where $Z$ is intractable: $$\nabla_\mathbf{x} \log p(\mathbf{x}) = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x}) - \nabla_\mathbf{x} \log Z = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x})$$
The partition function $Z$ vanishes! This is why score-based methods can model complex unnormalized densities.
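To make this concrete, here is a minimal sketch (PyTorch, with a hypothetical 2D Gaussian density) showing that the score computed from the unnormalized log-density is already exact; no normalizing constant is ever needed:

```python
import torch

def unnormalized_log_density(x: torch.Tensor) -> torch.Tensor:
    """log p~(x) for a 2D Gaussian, deliberately *without* the constant Z."""
    mu = torch.tensor([1.0, -2.0])
    return -0.5 * ((x - mu) ** 2).sum(dim=-1)

def score(x: torch.Tensor) -> torch.Tensor:
    """∇_x log p(x) = ∇_x log p~(x), obtained by automatic differentiation."""
    x = x.detach().requires_grad_(True)
    log_p = unnormalized_log_density(x).sum()
    return torch.autograd.grad(log_p, x)[0]

x = torch.randn(4, 2)
print(score(x))   # equals (mu - x): a vector field pointing toward the mode
```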
Langevin dynamics for sampling:
Given the score, we can generate samples via Langevin Monte Carlo:
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \frac{\epsilon}{2} \nabla_\mathbf{x} \log p(\mathbf{x}_k) + \sqrt{\epsilon} \mathbf{z}_k, \quad \mathbf{z}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
As $\epsilon \to 0$ and $k \to \infty$, samples converge to $p(\mathbf{x})$.
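The update above is only a few lines of code. This sketch reuses a generic `score_fn` (for example, the analytic score from the previous snippet, or a learned network):

```python
import torch

def langevin_sampling(score_fn, num_samples: int = 512,
                      num_steps: int = 1000, eps: float = 1e-2) -> torch.Tensor:
    """Unadjusted Langevin dynamics:
    x_{k+1} = x_k + (eps / 2) * score(x_k) + sqrt(eps) * z_k."""
    x = torch.randn(num_samples, 2)            # arbitrary initialization
    for _ in range(num_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_fn(x) + eps ** 0.5 * z
    return x

# Usage with the analytic Gaussian score from the sketch above:
# samples = langevin_sampling(score)           # concentrates near mu = (1, -2)
```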
The challenge: In low-density regions, the score estimate is inaccurate (few training samples there). This makes it hard to traverse from random initialization to data regions.
Song & Ermon (2019) proposed training score networks at multiple noise levels. At high noise, data spreads everywhere, making scores accurate globally. At low noise, scores are accurate near data. Annealing noise levels during sampling guides random points to data.
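A simplified sketch of annealed Langevin dynamics in the spirit of Song & Ermon (2019); the noise-conditional score network interface `score_model(x, sigma)` and the default step settings are assumptions:

```python
import torch

def annealed_langevin(score_model, shape, sigmas,
                      steps_per_level: int = 100, eps: float = 2e-5) -> torch.Tensor:
    """Annealed Langevin dynamics: run Langevin at each noise level, largest sigma first.

    Args:
        score_model: network s_theta(x, sigma) ≈ ∇_x log p_sigma(x)  (assumed interface)
        sigmas: decreasing sequence of noise levels, sigma_1 > ... > sigma_L
    """
    x = torch.rand(shape)                                  # start from an arbitrary point
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2            # per-level step size
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_model(x, sigma) + alpha ** 0.5 * z
    return x
```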
The SDE framework generalizes diffusion to continuous time. The forward SDE describes how data evolves into noise:
$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t) dt + g(t) d\mathbf{w}$$
where $\mathbf{f}(\mathbf{x}, t)$ is the drift coefficient, $g(t)$ is the diffusion coefficient, and $\mathbf{w}$ is a standard Wiener process (Brownian motion).
The marginal distribution $p_t(\mathbf{x})$ evolves according to the Fokker-Planck equation.
| SDE Type | Forward SDE | Properties | DDPM Connection |
|---|---|---|---|
| Variance Exploding (VE) | $d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}} d\mathbf{w}$ | Variance grows unboundedly | NCSN / SMLD |
| Variance Preserving (VP) | $d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x} dt + \sqrt{\beta(t)} d\mathbf{w}$ | Variance stays bounded | DDPM exactly |
| sub-VP | Modified VP with smaller variance | Tighter variance bound | Better likelihoods |
VP-SDE (connects to DDPM):
The Variance Preserving SDE with $\beta(t)$ corresponding to the DDPM schedule: $$d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x} dt + \sqrt{\beta(t)} d\mathbf{w}$$
This SDE, when discretized with the Euler-Maruyama method, recovers exactly the DDPM forward process. The marginal distribution at any time $t$ satisfies: $$p_{0t}(\mathbf{x}(t)|\mathbf{x}(0)) = \mathcal{N}(\mathbf{x}(t); \sqrt{\bar{\alpha}(t)}\mathbf{x}(0), (1-\bar{\alpha}(t))\mathbf{I})$$
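As a quick check of this claim, here is a minimal sketch (assuming PyTorch and a linear $\beta(t)$ schedule on $t \in [0, 1]$, for which $\bar{\alpha}(t) = \exp(-\int_0^t \beta(s)\,ds)$ is available in closed form) comparing the Euler-Maruyama discretization of the forward VP-SDE with direct sampling from the Gaussian marginal:

```python
import torch

# Linear beta(t) schedule on t in [0, 1] (an assumption; DDPM-like values)
beta_min, beta_max = 0.1, 20.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha_bar(t):
    # alpha_bar(t) = exp(-∫_0^t beta(s) ds), closed form for the linear schedule
    return torch.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2))

x0 = torch.randn(10_000, 2) * 0.2 + 3.0        # toy "data" far from N(0, I)

# (1) Euler-Maruyama discretization of dx = -0.5 beta(t) x dt + sqrt(beta(t)) dw
x, num_steps = x0.clone(), 1000
dt = 1.0 / num_steps
for i in range(num_steps):
    t = torch.tensor(i * dt)
    x = x - 0.5 * beta(t) * x * dt + beta(t).sqrt() * dt ** 0.5 * torch.randn_like(x)

# (2) Closed-form marginal: x(1) ~ N(sqrt(alpha_bar(1)) x0, (1 - alpha_bar(1)) I)
t1 = torch.tensor(1.0)
x_closed = alpha_bar(t1).sqrt() * x0 + (1 - alpha_bar(t1)).sqrt() * torch.randn_like(x0)

# Both empirical distributions end up approximately N(0, I)
print(x.mean().item(), x.std().item())
print(x_closed.mean().item(), x_closed.std().item())
```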
A remarkable result from Anderson (1982): for any forward SDE, there exists a reverse-time SDE that traverses the same distribution in reverse:
$$d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})] dt + g(t) d\bar{\mathbf{w}}$$
where $\bar{\mathbf{w}}$ is a Wiener process in reverse time, and $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ is the time-dependent score function.
The generation procedure:
The only unknown in the reverse SDE is the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$. If we can estimate it with a neural network $\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_\mathbf{x} \log p_t(\mathbf{x})$, we can generate samples by solving the reverse SDE numerically.
```python
import math

import torch
from tqdm import tqdm


def euler_maruyama_reverse_sde(
    score_model: torch.nn.Module,
    shape: tuple,
    num_steps: int = 1000,
    sigma_max: float = 80.0,
    sigma_min: float = 0.002,
    device: str = "cuda",
) -> torch.Tensor:
    """
    Sample from the VE-SDE (geometric noise schedule) with an
    Euler-Maruyama discretization of the reverse-time SDE.
    """
    # Time schedule, integrated backward from t = 1 to t = 0
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    dt = 1.0 / num_steps

    # Start from noise at the largest noise level
    x = torch.randn(shape, device=device) * sigma_max

    for i in tqdm(range(num_steps), desc="Reverse SDE"):
        t = timesteps[i]

        # Noise level sigma(t) = sigma_min * (sigma_max / sigma_min)^t
        sigma_t = sigma_min * (sigma_max / sigma_min) ** t

        # g(t)^2 = d[sigma^2(t)]/dt for the geometric schedule
        g2 = 2 * sigma_t ** 2 * math.log(sigma_max / sigma_min)

        # Estimate the score (often parameterized as -epsilon / sigma)
        with torch.no_grad():
            score = score_model(x, t.expand(shape[0]))

        # Reverse-time VE-SDE: dx = -g(t)^2 * score * dt + g(t) * dw_bar
        # Stepping backward in time flips the sign of the drift term
        drift = g2 * score

        # Euler-Maruyama step
        noise = torch.randn_like(x)
        x = x + drift * dt + (g2 * dt) ** 0.5 * noise

    return x
```

A stunning result: for any SDE, there exists an Ordinary Differential Equation (ODE) with the same marginal distributions but no stochasticity:
$$\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})$$
This is the Probability Flow ODE.
Why this is remarkable:
The probability flow ODE is a continuous normalizing flow! The score network implicitly defines a velocity field, and integrating this ODE gives a bijection between data and noise. This unifies diffusion models with the flow literature.
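A minimal sketch of deterministic sampling via Euler integration of the probability flow ODE, assuming the same geometric noise schedule and `score_model` interface as the reverse-SDE sampler above; note that no noise is injected during sampling:

```python
import math

import torch


@torch.no_grad()
def probability_flow_ode_sample(score_model, shape, num_steps: int = 200,
                                sigma_max: float = 80.0, sigma_min: float = 0.002,
                                device: str = "cuda") -> torch.Tensor:
    """Deterministic sampling by Euler integration of the probability flow ODE.

    For the VE-SDE the drift f is zero, so dx/dt = -0.5 * g(t)^2 * score(x, t).
    """
    x = torch.randn(shape, device=device) * sigma_max
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.tensor(1.0 - i * dt, device=device)
        sigma_t = sigma_min * (sigma_max / sigma_min) ** t
        g2 = 2 * sigma_t ** 2 * math.log(sigma_max / sigma_min)   # d[sigma^2(t)]/dt
        score = score_model(x, t.expand(shape[0]))
        # Integrating backward in time flips the sign of the -0.5 * g^2 * score drift
        x = x + 0.5 * g2 * score * dt
    return x
```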
The SDE/ODE perspective enables sophisticated numerical methods for faster, higher-quality sampling:
| Sampler | Type | Steps | Quality | Notes |
|---|---|---|---|---|
| DDPM | Ancestral (SDE) | 1000 | Good | Original, slow |
| DDIM | ODE | 50-100 | Good | Deterministic, fast |
| Euler-Maruyama | SDE | ~1000 | Good | Simple discretization |
| Heun (2nd order) | ODE | ~100 | Better | Improved accuracy |
| DPM-Solver | ODE | 10-20 | Excellent | Exponential integrator |
| DPM-Solver++ | ODE | 10-20 | Excellent | Multi-step variant |
| UniPC | ODE + Predictor-Corrector | 10-20 | Excellent | Unified framework |
DPM-Solver key insight:
The probability flow ODE has the semi-linear form: $$\frac{d\mathbf{x}}{dt} = f(t)\mathbf{x} + \frac{g^2(t)}{2\sigma_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}, t)$$
The linear part $f(t)\mathbf{x}$ can be solved exactly. DPM-Solver uses exponential integrators that separate the linear and nonlinear parts, achieving high accuracy with few steps.
Practical impact: DPM-Solver can generate high-quality images in 10-20 steps, a 50-100x speedup over DDPM.
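For intuition, here is a sketch of a single first-order DPM-Solver step under the $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$ parameterization, with $\lambda_t = \log(\alpha_t / \sigma_t)$. The `eps_model`, `alpha`, and `sigma` interfaces are assumptions; higher-order variants add correction terms on top of the same exponential-integrator structure:

```python
import torch

def dpm_solver_1_step(eps_model, x, t_cur, t_next, alpha, sigma):
    """One first-order DPM-Solver step from t_cur to t_next (t_next < t_cur).

    alpha(t), sigma(t): signal / noise coefficients of the marginal
        x_t = alpha(t) * x_0 + sigma(t) * eps.
    The linear part is solved exactly through the alpha ratio; the noise
    prediction enters via the exponential-integrator factor (exp(h) - 1).
    """
    lam = lambda t: torch.log(alpha(t) / sigma(t))        # half log-SNR
    h = lam(t_next) - lam(t_cur)
    eps = eps_model(x, t_cur)
    x_next = (alpha(t_next) / alpha(t_cur)) * x - sigma(t_next) * torch.expm1(h) * eps
    return x_next
```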
For most applications, DPM-Solver++ with 20-25 steps provides an excellent speed-quality tradeoff. For maximum quality, adding a few extra steps (50-100) with DDIM or ancestral sampling helps. For fastest generation, distillation methods (consistency models, LCM) enable 1-4 step generation.
Conditional generation controls what the model generates (e.g., generating images of a specific class or matching a text prompt).
Classifier Guidance (Dhariwal & Nichol, 2021):
Train a classifier $p_\phi(y|\mathbf{x}_t)$ on noisy images. The guided score is: $$\tilde{\nabla} \log p_t(\mathbf{x}|y) = \nabla_\mathbf{x} \log p_t(\mathbf{x}) + s \cdot \nabla_\mathbf{x} \log p_\phi(y|\mathbf{x}_t)$$
where $s > 1$ amplifies the class signal for stronger conditioning.
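A minimal sketch of the guided score computation; the `classifier` (returning class logits for noisy inputs) and `score_model` interfaces are assumptions:

```python
import torch
import torch.nn.functional as F

def classifier_guided_score(score_model, classifier, x_t, t, y, guidance_scale=2.0):
    """Guided score: ∇ log p_t(x) + s * ∇_x log p_phi(y | x_t), with the classifier
    gradient obtained by backpropagating log-softmax through the noisy input."""
    # Unconditional score (no gradient needed through the score network)
    with torch.no_grad():
        score = score_model(x_t, t)

    # Classifier gradient with respect to the noisy input
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)
    log_prob_y = F.log_softmax(logits, dim=-1)[torch.arange(x_t.shape[0]), y]
    grad = torch.autograd.grad(log_prob_y.sum(), x_in)[0]

    return score + guidance_scale * grad
```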
Classifier-Free Guidance (Ho & Salimans, 2022):
No separate classifier needed! Train a single model on both conditional and unconditional generation: $$\tilde{\boldsymbol{\epsilon}}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \emptyset) + s \cdot (\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, c) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \emptyset))$$
```python
import torch


def classifier_free_guidance_step(
    model: torch.nn.Module,
    x_t: torch.Tensor,
    t: torch.Tensor,
    condition: torch.Tensor,   # e.g., text embedding
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """
    Compute the guided noise prediction using classifier-free guidance.

    Args:
        model: Noise prediction network (accepts a condition or a null condition)
        x_t: Noisy samples
        t: Timesteps
        condition: Conditioning signal (text embedding, class label, etc.)
        guidance_scale: CFG scale (1 = no guidance, 7.5 = typical for images)

    Returns:
        Guided noise prediction
    """
    # Unconditional prediction (null condition)
    # Often implemented by replacing the condition with a learned null embedding
    null_condition = torch.zeros_like(condition)
    epsilon_uncond = model(x_t, t, null_condition)

    # Conditional prediction
    epsilon_cond = model(x_t, t, condition)

    # CFG interpolation (extrapolation when scale > 1)
    epsilon_guided = epsilon_uncond + guidance_scale * (epsilon_cond - epsilon_uncond)

    return epsilon_guided
```

Classifier-free guidance is the standard for modern text-to-image models. It's simpler (one model), more flexible (any conditioning), and produces higher-quality results. Stable Diffusion, DALL-E 2, Imagen, and Midjourney all use CFG.
You now understand score-based models and the SDE perspective that unifies diffusion methods. Next, we'll survey the state of the art—the systems and architectures that have made diffusion models the dominant approach for generative AI.