In 2020, a remarkable paper introduced Denoising Diffusion Probabilistic Models (DDPM), demonstrating that iteratively denoising random noise could generate images rivaling GANs—without adversarial training's instabilities. By 2022, diffusion models powered DALL-E 2, Stable Diffusion, and Imagen, fundamentally transforming AI-generated content.
The key insight is deceptively simple: destroying data is easy; learning to reverse that destruction teaches a model to create. This page explores the mathematical and conceptual foundations of the diffusion process—the forward noising mechanism that makes this possible.
By completing this page, you will understand: (1) the forward diffusion process that gradually adds noise to data, (2) the Markov chain formulation and its mathematical properties, (3) the variance schedule and its importance, (4) the closed-form expression for sampling any timestep, and (5) why reversing this process enables generation.
Traditional generative models (VAEs, GANs, flows) learn to map a simple distribution (typically Gaussian) directly to the complex data distribution. This is an extraordinarily difficult task—the mapping must be learned in a single step, requiring the model to capture all data complexity at once.
Diffusion models invert this philosophy. Instead of learning one giant transformation, they:

1. Define a fixed forward process that gradually corrupts data with small amounts of Gaussian noise over many steps.
2. Train a model to undo a single small corruption step at a time.
3. Generate new data by starting from pure noise and applying the learned reverse step repeatedly.
Each reverse step is a much simpler task than the full generation—the model only needs to denoise slightly, not create entire images from scratch.
Imagine teaching someone to sculpt by first showing them how marble erodes into dust, grain by grain. If they can learn to reverse each tiny erosion step, they can reconstruct the statue from dust. Diffusion models learn to sculpt data from noise by mastering this reversal at every granularity.
Why destruction is easier than creation: adding a small amount of Gaussian noise is a fixed, analytically specified operation that requires no learning, whereas creating realistic data in a single shot requires modeling the entire data distribution at once. Reversing one small noising step, by contrast, is a local, well-posed prediction problem.
This decomposition is the fundamental insight: a hard generative problem becomes many easy denoising problems.
The forward diffusion process is a Markov chain that progressively adds Gaussian noise to data over $T$ timesteps. Starting from clean data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, we define a sequence of increasingly noisy latents $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$.
The forward transition kernel:
$$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I})$$
where:
- $\mathbf{x}_{t-1}$ is the sample at the previous timestep,
- $\beta_t \in (0, 1)$ is the variance of the noise added at step $t$ (set by the variance schedule, discussed below), and
- $\mathbf{I}$ is the identity matrix.
```python
import torch

def forward_diffusion_step(x_t_minus_1: torch.Tensor, beta_t: float) -> torch.Tensor:
    """
    Perform one step of the forward diffusion process.

    Args:
        x_t_minus_1: Data at timestep t-1, shape [B, C, H, W]
        beta_t: Variance schedule value at timestep t

    Returns:
        x_t: Noised data at timestep t
    """
    # Sample noise from standard Gaussian
    epsilon = torch.randn_like(x_t_minus_1)

    # Apply the forward transition
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * epsilon
    scale = torch.sqrt(torch.tensor(1.0 - beta_t))
    noise_scale = torch.sqrt(torch.tensor(beta_t))
    x_t = scale * x_t_minus_1 + noise_scale * epsilon

    return x_t
```

The Markov property:
The forward process is Markovian—each step depends only on the immediately preceding state. The full forward trajectory factorizes as:
$$q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t | \mathbf{x}_{t-1})$$
The terminal distribution:
As $T \to \infty$ (or for sufficiently large $T$ with appropriate $\beta_t$), the distribution $q(\mathbf{x}_T)$ converges to an isotropic Gaussian: $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
This is crucial: the endpoint of the forward process is a known, simple distribution from which we can easily sample.
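As a quick sanity check, here is a minimal sketch that runs the full forward chain using the `forward_diffusion_step` function above. The batch shape and data range are illustrative; the schedule values ($T = 1000$, betas from $10^{-4}$ to $0.02$) follow the original DDPM defaults.

```python
import torch

# Illustrative settings: DDPM-style linear schedule
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

# Arbitrary "data" in [-1, 1], shape [B, C, H, W]
x = torch.rand(8, 3, 32, 32) * 2 - 1

# Run the full forward chain x_0 -> x_1 -> ... -> x_T
for t in range(T):
    x = forward_diffusion_step(x, betas[t].item())

# After enough steps, the per-element statistics approach N(0, I)
print(f"mean={x.mean().item():.3f}, std={x.std().item():.3f}")
```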
The sequence $\{\beta_1, \beta_2, \ldots, \beta_T\}$ is called the variance schedule or noise schedule. Its design critically impacts model performance—too aggressive destroys information too fast; too gentle wastes computation.
Common variance schedules:
| Schedule | Formula | Characteristics | Use Case |
|---|---|---|---|
| Linear | $\beta_t = \beta_{\min} + \frac{t-1}{T-1}(\beta_{\max} - \beta_{\min})$ | Constant noise increase rate | Original DDPM, simple baseline |
| Cosine | $\bar{\alpha}_t = \cos^2\left(\frac{t/T + s}{1+s} \cdot \frac{\pi}{2}\right)$ | Gentler at start and end | Improved DDPM, better quality |
| Quadratic | $\beta_t = \beta_{\min} + (\beta_{\max} - \beta_{\min}) \cdot (t/T)^2$ | Slow start, fast finish | Alternative exploration |
| Sigmoid | $\beta_t = \sigma(-6 + 12t/T) \cdot (\beta_{\max} - \beta_{\min}) + \beta_{\min}$ | S-curve progression | Smooth transitions |
```python
import torch
import numpy as np

def linear_schedule(T: int, beta_min: float = 1e-4, beta_max: float = 0.02):
    """Linear variance schedule from DDPM paper."""
    return torch.linspace(beta_min, beta_max, T)

def cosine_schedule(T: int, s: float = 0.008):
    """
    Cosine variance schedule from 'Improved DDPM'.
    Produces gentler noise addition at boundaries.
    """
    steps = torch.arange(T + 1, dtype=torch.float64)
    alphas_cumprod = torch.cos(((steps / T) + s) / (1 + s) * np.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]  # Normalize
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999).float()

def get_schedule_params(betas: torch.Tensor):
    """Compute all derived quantities from beta schedule."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    alphas_cumprod_prev = torch.cat([torch.tensor([1.0]), alphas_cumprod[:-1]])

    # For sampling x_t directly from x_0
    sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
    sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

    return {
        'betas': betas,
        'alphas': alphas,
        'alphas_cumprod': alphas_cumprod,
        'sqrt_alphas_cumprod': sqrt_alphas_cumprod,
        'sqrt_one_minus_alphas_cumprod': sqrt_one_minus_alphas_cumprod,
    }
```

The cosine schedule was introduced specifically because linear schedules destroy too much information too quickly in the early steps. With linear scheduling, models often struggle to recover fine details. The cosine schedule preserves more signal early on, leading to better generation quality.
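The quadratic and sigmoid rows of the table are not included in the utilities above. As a sketch under the same conventions (each function returns a length-$T$ tensor of betas), they might look like:

```python
import torch

def quadratic_schedule(T: int, beta_min: float = 1e-4, beta_max: float = 0.02):
    """Quadratic schedule: beta_t grows with (t/T)^2, so noise is added
    slowly at first and faster toward the end."""
    t = torch.arange(1, T + 1, dtype=torch.float32)
    return beta_min + (beta_max - beta_min) * (t / T) ** 2

def sigmoid_schedule(T: int, beta_min: float = 1e-4, beta_max: float = 0.02):
    """Sigmoid schedule: betas follow an S-curve between beta_min and beta_max."""
    t = torch.arange(1, T + 1, dtype=torch.float32)
    return torch.sigmoid(-6 + 12 * t / T) * (beta_max - beta_min) + beta_min
```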
A crucial mathematical property enables efficient training: we can sample $\mathbf{x}_t$ for any timestep $t$ directly from $\mathbf{x}_0$ without iterating through intermediate steps.
Derivation:
Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ (cumulative product).
Since each forward step adds independent Gaussian noise, and sums of independent Gaussians are Gaussian, we can show:
$$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I})$$
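This follows by induction on $t$. Expanding two consecutive steps shows how the independent noise terms merge into one:

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} = \sqrt{\alpha_t \alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar{\boldsymbol{\epsilon}}_{t-2}$$

because the two Gaussian noise terms, with variances $\alpha_t(1-\alpha_{t-1})$ and $1-\alpha_t$, combine into a single Gaussian with variance $1 - \alpha_t\alpha_{t-1}$. Repeating this substitution down to $\mathbf{x}_0$ yields the coefficients $\sqrt{\bar{\alpha}_t}$ and $1 - \bar{\alpha}_t$ above.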
The reparameterization form:
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
```python
import torch

def sample_xt_given_x0(
    x_0: torch.Tensor,
    t: torch.Tensor,
    sqrt_alphas_cumprod: torch.Tensor,
    sqrt_one_minus_alphas_cumprod: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Sample x_t directly from x_0 using closed-form formula.

    Args:
        x_0: Clean data, shape [B, C, H, W]
        t: Timesteps, shape [B]
        sqrt_alphas_cumprod: Precomputed sqrt(alpha_bar_t)
        sqrt_one_minus_alphas_cumprod: Precomputed sqrt(1 - alpha_bar_t)

    Returns:
        x_t: Noised samples at timestep t
        epsilon: The noise that was added (needed for training)
    """
    # Sample standard Gaussian noise
    epsilon = torch.randn_like(x_0)

    # Get schedule values for each sample's timestep
    # Reshape for broadcasting: [B] -> [B, 1, 1, 1]
    sqrt_alpha_bar = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
    sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)

    # Apply closed-form formula
    x_t = sqrt_alpha_bar * x_0 + sqrt_one_minus_alpha_bar * epsilon

    return x_t, epsilon
```

This closed-form expression is essential for efficient training. Instead of sequentially noising through all $t$ steps, we can: (1) sample a random timestep $t$ uniformly, (2) compute $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in one step, and (3) train the model to predict the noise $\boldsymbol{\epsilon}$ that was added. This enables parallelization across timesteps during training.
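Here is a minimal sketch of that training step. The network `model(x_t, t)` is a hypothetical noise predictor that returns a tensor shaped like its input, and `params` is assumed to be the dict produced by `get_schedule_params` above; the mean-squared error on the noise is the simplified objective popularized by DDPM (training is covered in detail on the next page).

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x_0, params, T: int):
    """One noise-prediction training step (sketch).

    Assumes `model`, `x_0`, and the schedule tensors in `params`
    all live on the same device.
    """
    B = x_0.shape[0]
    # (1) Sample a random timestep for each example in the batch
    t = torch.randint(0, T, (B,), device=x_0.device)

    # (2) Jump straight to x_t with the closed-form formula
    x_t, epsilon = sample_xt_given_x0(
        x_0, t,
        params['sqrt_alphas_cumprod'],
        params['sqrt_one_minus_alphas_cumprod'],
    )

    # (3) Train the network to predict the added noise
    epsilon_pred = model(x_t, t)
    return F.mse_loss(epsilon_pred, epsilon)
```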
The Signal-to-Noise Ratio (SNR) provides insight into how information degrades during diffusion:
$$\text{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$
Interpretation: at small $t$, $\bar{\alpha}_t \approx 1$, so the SNR is high and $\mathbf{x}_t$ is still mostly signal; as $t \to T$, $\bar{\alpha}_t \to 0$ and the SNR falls toward zero, leaving almost pure noise. Fine, high-frequency details are lost first, while coarse global structure survives until the SNR becomes very low.
During generation (reverse process), the model first decides 'what should exist' (high-level semantics at low SNR), then 'how things are arranged' (structure at medium SNR), finally 'what details look like' (textures at high SNR). This coarse-to-fine progression naturally emerges from the SNR dynamics.
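As a quick sketch, the SNR curve can be computed directly from any schedule; here it is for the linear schedule with DDPM-default values.

```python
import torch

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t) for a linear schedule
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
snr = alphas_cumprod / (1.0 - alphas_cumprod)

# SNR decreases monotonically: very high at t=1 (mostly signal),
# near zero at t=T (mostly noise)
print(snr[0].item(), snr[T // 2].item(), snr[-1].item())
```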
The fundamental question: how can we reverse a noise-adding process that destroys information?
The key insight from Feller (1949) and Anderson (1982):
For a forward diffusion process defined by the SDE: $$d\mathbf{x} = f(\mathbf{x}, t)dt + g(t)d\mathbf{w}$$
There exists a reverse-time SDE: $$d\mathbf{x} = [f(\mathbf{x}, t) - g(t)^2 \nabla_{\mathbf{x}} \log p_t(\mathbf{x})]dt + g(t)d\bar{\mathbf{w}}$$
where $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the score function—the gradient of the log probability density at time $t$.
The practical implication:
If we can estimate the score function $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ at each timestep, we can reverse the diffusion and generate data from noise.
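To make this concrete, here is a minimal Euler-Maruyama sketch of a single reverse-time step. The callables `f(x, t)`, `g(t)`, and `score(x, t)` are generic placeholders (none are defined on this page); `score` stands for any estimate of $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$.

```python
import torch

def reverse_sde_step(x, t, dt, f, g, score):
    """One Euler-Maruyama step of the reverse-time SDE (sketch).

    f(x, t) and g(t) are the drift and diffusion coefficients of the
    forward SDE; score(x, t) estimates grad_x log p_t(x). dt > 0 is the
    step size, and integration runs backward from t = T toward t = 0.
    """
    z = torch.randn_like(x)
    drift = f(x, t) - g(t) ** 2 * score(x, t)
    # Step backward in time: x_{t-dt} = x_t - drift * dt + g(t) * sqrt(dt) * z
    return x - drift * dt + g(t) * (dt ** 0.5) * z
```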
Connection to denoising:
It can be shown that predicting the noise $\boldsymbol{\epsilon}$ added to create $\mathbf{x}_t$ is equivalent to estimating the score:
$$\nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) = -\frac{\boldsymbol{\epsilon}}{\sqrt{1 - \bar{\alpha}_t}}$$
This connects score matching (a principled statistical framework) with the intuitive task of denoising.
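In code, the conversion is a one-liner. This sketch assumes `epsilon_pred` comes from a noise-prediction network and `sqrt_one_minus_alphas_cumprod` is the precomputed tensor from `get_schedule_params` above.

```python
import torch

def noise_to_score(epsilon_pred: torch.Tensor, t: torch.Tensor,
                   sqrt_one_minus_alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Convert predicted noise into a score estimate:
    score(x_t, t) = -epsilon / sqrt(1 - alpha_bar_t).
    Assumes image-shaped inputs [B, C, H, W], as in the code above."""
    denom = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
    return -epsilon_pred / denom
```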
Denoising and score estimation are two perspectives on the same task. A neural network trained to denoise is implicitly learning the score function, which enables principled generation through the reverse SDE. This duality unifies practical intuition with rigorous probability theory.
You now understand the forward diffusion process—the mathematical foundation of diffusion models. Next, we'll explore how to train neural networks to reverse this process through denoising score matching.