While DDPM uses discrete timesteps, Score-Based Generative Models through Stochastic Differential Equations (SDEs) provide a unified continuous-time framework. This perspective, developed by Song et al. in "Score-Based Generative Modeling through Stochastic Differential Equations", reveals deep connections between different diffusion model variants and enables powerful new sampling methods.
The SDE viewpoint shows that diffusion models and score matching are two sides of the same coin—both learn to estimate the gradient of the log probability density.
After this page, you will understand: (1) the score function and score matching fundamentals, (2) the continuous-time SDE formulation, (3) Variance Exploding (VE) and Variance Preserving (VP) SDEs, (4) the reverse-time SDE for generation, (5) probability flow ODEs, and (6) the connection to DDPM.
The score function is the gradient of the log probability density:
$$\nabla_\mathbf{x} \log p(\mathbf{x})$$
Geometrically, this is a vector field that points toward higher density regions. The score has a remarkable property: it doesn't depend on the normalizing constant!
Why scores are special:
If $p(\mathbf{x}) = \frac{1}{Z} \tilde{p}(\mathbf{x})$ where $Z$ is intractable: $$\nabla_\mathbf{x} \log p(\mathbf{x}) = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x}) - \nabla_\mathbf{x} \log Z = \nabla_\mathbf{x} \log \tilde{p}(\mathbf{x})$$
The partition function $Z$ vanishes! This is why score-based methods can model complex unnormalized densities.
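To make this concrete, here is a minimal sketch (PyTorch, with a hypothetical 2D Gaussian density) showing that the score computed from the unnormalized log-density is already exact; no normalizing constant is ever needed:

```python
import torch

def unnormalized_log_density(x: torch.Tensor) -> torch.Tensor:
    """log p~(x) for a 2D Gaussian, deliberately *without* the constant Z."""
    mu = torch.tensor([1.0, -2.0])
    return -0.5 * ((x - mu) ** 2).sum(dim=-1)

def score(x: torch.Tensor) -> torch.Tensor:
    """∇_x log p(x) = ∇_x log p~(x), obtained by automatic differentiation."""
    x = x.detach().requires_grad_(True)
    log_p = unnormalized_log_density(x).sum()
    return torch.autograd.grad(log_p, x)[0]

x = torch.randn(4, 2)
print(score(x))   # equals (mu - x): a vector field pointing toward the mode
```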
Langevin dynamics for sampling:
Given the score, we can generate samples via Langevin Monte Carlo:
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \frac{\epsilon}{2} \nabla_\mathbf{x} \log p(\mathbf{x}_k) + \sqrt{\epsilon} \mathbf{z}_k, \quad \mathbf{z}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
As $\epsilon \to 0$ and $k \to \infty$, samples converge to $p(\mathbf{x})$.
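The update above is only a few lines of code. This sketch reuses a generic `score_fn` (for example, the analytic score from the previous snippet, or a learned network):

```python
import torch

def langevin_sampling(score_fn, num_samples: int = 512,
                      num_steps: int = 1000, eps: float = 1e-2) -> torch.Tensor:
    """Unadjusted Langevin dynamics:
    x_{k+1} = x_k + (eps / 2) * score(x_k) + sqrt(eps) * z_k."""
    x = torch.randn(num_samples, 2)            # arbitrary initialization
    for _ in range(num_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_fn(x) + eps ** 0.5 * z
    return x

# Usage with the analytic Gaussian score from the sketch above:
# samples = langevin_sampling(score)           # concentrates near mu = (1, -2)
```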
The challenge: In low-density regions, the score estimate is inaccurate (few training samples there). This makes it hard to traverse from random initialization to data regions.
Song & Ermon (2019) proposed training score networks at multiple noise levels. At high noise, data spreads everywhere, making scores accurate globally. At low noise, scores are accurate near data. Annealing noise levels during sampling guides random points to data.
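A simplified sketch of annealed Langevin dynamics in the spirit of Song & Ermon (2019); the noise-conditional score network interface `score_model(x, sigma)` and the default step settings are assumptions:

```python
import torch

def annealed_langevin(score_model, shape, sigmas,
                      steps_per_level: int = 100, eps: float = 2e-5) -> torch.Tensor:
    """Annealed Langevin dynamics: run Langevin at each noise level, largest sigma first.

    Args:
        score_model: network s_theta(x, sigma) ≈ ∇_x log p_sigma(x)  (assumed interface)
        sigmas: decreasing sequence of noise levels, sigma_1 > ... > sigma_L
    """
    x = torch.rand(shape)                                  # start from an arbitrary point
    for sigma in sigmas:
        alpha = eps * (sigma / sigmas[-1]) ** 2            # per-level step size
        for _ in range(steps_per_level):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_model(x, sigma) + alpha ** 0.5 * z
    return x
```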
The SDE framework generalizes diffusion to continuous time. The forward SDE describes how data evolves into noise:
$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t) dt + g(t) d\mathbf{w}$$
where $\mathbf{f}(\mathbf{x}, t)$ is the drift coefficient, $g(t)$ is the diffusion coefficient, and $\mathbf{w}$ is a standard Wiener process (Brownian motion).
The marginal distribution $p_t(\mathbf{x})$ evolves according to the Fokker-Planck equation.
| SDE Type | Forward SDE | Properties | DDPM Connection |
|---|---|---|---|
| Variance Exploding (VE) | $d\mathbf{x} = \sqrt{\frac{d[\sigma^2(t)]}{dt}} d\mathbf{w}$ | Variance grows unboundedly | NCSN / SMLD |
| Variance Preserving (VP) | $d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x} dt + \sqrt{\beta(t)} d\mathbf{w}$ | Variance stays bounded | DDPM exactly |
| sub-VP | Modified VP with smaller variance | Tighter variance bound | Better likelihoods |
VP-SDE (connects to DDPM):
The Variance Preserving SDE with $\beta(t)$ corresponding to the DDPM schedule: $$d\mathbf{x} = -\frac{1}{2}\beta(t)\mathbf{x} dt + \sqrt{\beta(t)} d\mathbf{w}$$
This SDE, when discretized with the Euler-Maruyama method, recovers exactly the DDPM forward process. The marginal distribution at any time $t$ satisfies: $$p_{0t}(\mathbf{x}(t)|\mathbf{x}(0)) = \mathcal{N}(\mathbf{x}(t); \sqrt{\bar{\alpha}(t)}\mathbf{x}(0), (1-\bar{\alpha}(t))\mathbf{I})$$
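As a quick check of this claim, here is a minimal sketch (assuming PyTorch and a linear $\beta(t)$ schedule on $t \in [0, 1]$, for which $\bar{\alpha}(t) = \exp(-\int_0^t \beta(s)\,ds)$ is available in closed form) comparing the Euler-Maruyama discretization of the forward VP-SDE with direct sampling from the Gaussian marginal:

```python
import torch

# Linear beta(t) schedule on t in [0, 1] (an assumption; DDPM-like values)
beta_min, beta_max = 0.1, 20.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha_bar(t):
    # alpha_bar(t) = exp(-∫_0^t beta(s) ds), closed form for the linear schedule
    return torch.exp(-(beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2))

x0 = torch.randn(10_000, 2) * 0.2 + 3.0        # toy "data" far from N(0, I)

# (1) Euler-Maruyama discretization of dx = -0.5 beta(t) x dt + sqrt(beta(t)) dw
x, num_steps = x0.clone(), 1000
dt = 1.0 / num_steps
for i in range(num_steps):
    t = torch.tensor(i * dt)
    x = x - 0.5 * beta(t) * x * dt + beta(t).sqrt() * dt ** 0.5 * torch.randn_like(x)

# (2) Closed-form marginal: x(1) ~ N(sqrt(alpha_bar(1)) x0, (1 - alpha_bar(1)) I)
t1 = torch.tensor(1.0)
x_closed = alpha_bar(t1).sqrt() * x0 + (1 - alpha_bar(t1)).sqrt() * torch.randn_like(x0)

# Both empirical distributions end up approximately N(0, I)
print(x.mean().item(), x.std().item())
print(x_closed.mean().item(), x_closed.std().item())
```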
A remarkable result from Anderson (1982): for any forward SDE, there exists a reverse-time SDE that traverses the same distribution in reverse:
$$d\mathbf{x} = [\mathbf{f}(\mathbf{x}, t) - g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})] dt + g(t) d\bar{\mathbf{w}}$$
where $\bar{\mathbf{w}}$ is a Wiener process in reverse time, and $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ is the time-dependent score function.
The generation procedure:
The only unknown in the reverse SDE is the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$. If we can estimate it with a neural network $\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_\mathbf{x} \log p_t(\mathbf{x})$, we can generate samples by solving the reverse SDE numerically.
```python
import math

import torch
from tqdm import tqdm


def euler_maruyama_reverse_sde(
    score_model: torch.nn.Module,
    shape: tuple,
    num_steps: int = 1000,
    sigma_max: float = 80.0,
    sigma_min: float = 0.002,
    device: str = "cuda",
) -> torch.Tensor:
    """
    Sample from the VE-SDE (geometric noise schedule) with an
    Euler-Maruyama discretization of the reverse-time SDE.
    """
    # Time schedule, integrated backward from t = 1 to t = 0
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    dt = 1.0 / num_steps

    # Start from noise at the largest noise level
    x = torch.randn(shape, device=device) * sigma_max

    for i in tqdm(range(num_steps), desc="Reverse SDE"):
        t = timesteps[i]

        # Noise level sigma(t) = sigma_min * (sigma_max / sigma_min)^t
        sigma_t = sigma_min * (sigma_max / sigma_min) ** t

        # g(t)^2 = d[sigma^2(t)]/dt for the geometric schedule
        g2 = 2 * sigma_t ** 2 * math.log(sigma_max / sigma_min)

        # Estimate the score (often parameterized as -epsilon / sigma)
        with torch.no_grad():
            score = score_model(x, t.expand(shape[0]))

        # Reverse-time VE-SDE: dx = -g(t)^2 * score * dt + g(t) * dw_bar
        # Stepping backward in time flips the sign of the drift term
        drift = g2 * score

        # Euler-Maruyama step
        noise = torch.randn_like(x)
        x = x + drift * dt + (g2 * dt) ** 0.5 * noise

    return x
```

A stunning result: for any SDE, there exists an Ordinary Differential Equation (ODE) with the same marginal distributions but no stochasticity:
$$\frac{d\mathbf{x}}{dt} = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} g(t)^2 \nabla_\mathbf{x} \log p_t(\mathbf{x})$$
This is the Probability Flow ODE.
Why this is remarkable:
The probability flow ODE is a continuous normalizing flow! The score network implicitly defines a velocity field, and integrating this ODE gives a bijection between data and noise. This unifies diffusion models with the flow literature.
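A minimal sketch of deterministic sampling via Euler integration of the probability flow ODE, assuming the same geometric noise schedule and `score_model` interface as the reverse-SDE sampler above; note that no noise is injected during sampling:

```python
import math

import torch


@torch.no_grad()
def probability_flow_ode_sample(score_model, shape, num_steps: int = 200,
                                sigma_max: float = 80.0, sigma_min: float = 0.002,
                                device: str = "cuda") -> torch.Tensor:
    """Deterministic sampling by Euler integration of the probability flow ODE.

    For the VE-SDE the drift f is zero, so dx/dt = -0.5 * g(t)^2 * score(x, t).
    """
    x = torch.randn(shape, device=device) * sigma_max
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.tensor(1.0 - i * dt, device=device)
        sigma_t = sigma_min * (sigma_max / sigma_min) ** t
        g2 = 2 * sigma_t ** 2 * math.log(sigma_max / sigma_min)   # d[sigma^2(t)]/dt
        score = score_model(x, t.expand(shape[0]))
        # Integrating backward in time flips the sign of the -0.5 * g^2 * score drift
        x = x + 0.5 * g2 * score * dt
    return x
```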
The SDE/ODE perspective enables sophisticated numerical methods for faster, higher-quality sampling:
| Sampler | Type | Steps | Quality | Notes |
|---|---|---|---|---|
| DDPM | Ancestral (SDE) | 1000 | Good | Original, slow |
| DDIM | ODE | 50-100 | Good | Deterministic, fast |
| Euler-Maruyama | SDE | ~1000 | Good | Simple discretization |
| Heun (2nd order) | ODE | ~100 | Better | Improved accuracy |
| DPM-Solver | ODE | 10-20 | Excellent | Exponential integrator |
| DPM-Solver++ | ODE | 10-20 | Excellent | Multi-step variant |
| UniPC | ODE + Predictor-Corrector | 10-20 | Excellent | Unified framework |
DPM-Solver key insight:
The probability flow ODE has the semi-linear form: $$\frac{d\mathbf{x}}{dt} = f(t)\mathbf{x} + \frac{g^2(t)}{2\sigma_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}, t)$$
The linear part $f(t)\mathbf{x}$ can be solved exactly. DPM-Solver uses exponential integrators that separate the linear and nonlinear parts, achieving high accuracy with few steps.
Practical impact: DPM-Solver can generate high-quality images in 10-20 steps, a 50-100x speedup over DDPM.
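For intuition, here is a sketch of a single first-order DPM-Solver step under the $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$ parameterization, with $\lambda_t = \log(\alpha_t / \sigma_t)$. The `eps_model`, `alpha`, and `sigma` interfaces are assumptions; higher-order variants add correction terms on top of the same exponential-integrator structure:

```python
import torch

def dpm_solver_1_step(eps_model, x, t_cur, t_next, alpha, sigma):
    """One first-order DPM-Solver step from t_cur to t_next (t_next < t_cur).

    alpha(t), sigma(t): signal / noise coefficients of the marginal
        x_t = alpha(t) * x_0 + sigma(t) * eps.
    The linear part is solved exactly through the alpha ratio; the noise
    prediction enters via the exponential-integrator factor (exp(h) - 1).
    """
    lam = lambda t: torch.log(alpha(t) / sigma(t))        # half log-SNR
    h = lam(t_next) - lam(t_cur)
    eps = eps_model(x, t_cur)
    x_next = (alpha(t_next) / alpha(t_cur)) * x - sigma(t_next) * torch.expm1(h) * eps
    return x_next
```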
For most applications, DPM-Solver++ with 20-25 steps provides an excellent speed-quality tradeoff. For maximum quality, adding a few extra steps (50-100) with DDIM or ancestral sampling helps. For fastest generation, distillation methods (consistency models, LCM) enable 1-4 step generation.
Conditional generation controls what the model generates (e.g., generating images of a specific class or matching a text prompt).
Classifier Guidance (Dhariwal & Nichol, 2021):
Train a classifier $p_\phi(y|\mathbf{x}_t)$ on noisy images. The guided score is: $$\tilde{\nabla} \log p_t(\mathbf{x}|y) = \nabla_\mathbf{x} \log p_t(\mathbf{x}) + s \cdot \nabla_\mathbf{x} \log p_\phi(y|\mathbf{x}_t)$$
where $s > 1$ amplifies the class signal for stronger conditioning.
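A minimal sketch of the guided score computation; the `classifier` (returning class logits for noisy inputs) and `score_model` interfaces are assumptions:

```python
import torch
import torch.nn.functional as F

def classifier_guided_score(score_model, classifier, x_t, t, y, guidance_scale=2.0):
    """Guided score: ∇ log p_t(x) + s * ∇_x log p_phi(y | x_t), with the classifier
    gradient obtained by backpropagating log-softmax through the noisy input."""
    # Unconditional score (no gradient needed through the score network)
    with torch.no_grad():
        score = score_model(x_t, t)

    # Classifier gradient with respect to the noisy input
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)
    log_prob_y = F.log_softmax(logits, dim=-1)[torch.arange(x_t.shape[0]), y]
    grad = torch.autograd.grad(log_prob_y.sum(), x_in)[0]

    return score + guidance_scale * grad
```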
Classifier-Free Guidance (Ho & Salimans, 2022):
No separate classifier needed! Train a single model on both conditional and unconditional generation: $$\tilde{\boldsymbol{\epsilon}}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \emptyset) + s \cdot (\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, c) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \emptyset))$$
```python
import torch


def classifier_free_guidance_step(
    model: torch.nn.Module,
    x_t: torch.Tensor,
    t: torch.Tensor,
    condition: torch.Tensor,   # e.g., text embedding
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """
    Compute the guided noise prediction using classifier-free guidance.

    Args:
        model: Noise prediction network (accepts a condition or a null condition)
        x_t: Noisy samples
        t: Timesteps
        condition: Conditioning signal (text embedding, class label, etc.)
        guidance_scale: CFG scale (1 = no guidance, 7.5 = typical for images)

    Returns:
        Guided noise prediction
    """
    # Unconditional prediction (null condition)
    # Often implemented by replacing the condition with a learned null embedding
    null_condition = torch.zeros_like(condition)
    epsilon_uncond = model(x_t, t, null_condition)

    # Conditional prediction
    epsilon_cond = model(x_t, t, condition)

    # CFG interpolation (extrapolation when scale > 1)
    epsilon_guided = epsilon_uncond + guidance_scale * (epsilon_cond - epsilon_uncond)

    return epsilon_guided
```

Classifier-free guidance is the standard for modern text-to-image models. It's simpler (one model), more flexible (any conditioning), and produces higher-quality results. Stable Diffusion, DALL-E 2, Imagen, and Midjourney all use CFG.
You now understand score-based models and the SDE perspective that unifies diffusion methods. Next, we'll survey the state of the art—the systems and architectures that have made diffusion models the dominant approach for generative AI.