The 2020 paper "Denoising Diffusion Probabilistic Models" by Ho, Jain, and Abbeel unified decades of theoretical work into a practical framework that achieved image quality comparable to GANs. This page dissects the complete DDPM framework: the probabilistic formulation, the reverse process derivation, and the sampling algorithms that convert noise into data.
DDPM established the template that all subsequent diffusion models build upon—understanding it deeply is essential for working with modern generative AI.
After this page, you will understand: (1) the variational formulation and ELBO derivation, (2) the parameterized reverse process, (3) the posterior distribution q(x_{t-1}|x_t, x_0), (4) the complete sampling algorithm, and (5) practical considerations for implementation.
DDPM frames diffusion as a latent variable model where $\mathbf{x}_1, \ldots, \mathbf{x}_T$ are latent variables. Generation maximizes a variational lower bound (ELBO) on the data log-likelihood.
The generative model:
$$p_\theta(\mathbf{x}_0) = \int p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \, d\mathbf{x}_{1:T}$$
where $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the prior.
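The sampling code later on this page consumes a `schedule_params` dictionary of precomputed tensors. As an illustrative sketch, here is one way to build it with the linear β schedule from the DDPM paper (the helper name and dictionary keys are chosen to match the code below, not prescribed by the paper):

```python
import torch

def make_linear_schedule(T: int = 1000,
                         beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> dict:
    """Precompute the schedule quantities used throughout DDPM."""
    betas = torch.linspace(beta_start, beta_end, T)    # beta_t
    alphas = 1.0 - betas                               # alpha_t = 1 - beta_t
    alphas_cumprod = torch.cumprod(alphas, dim=0)      # alpha_bar_t
    return {
        "betas": betas,
        "alphas": alphas,
        "alphas_cumprod": alphas_cumprod,
        "sqrt_one_minus_alphas_cumprod": torch.sqrt(1.0 - alphas_cumprod),
    }
```

With these defaults, $\bar{\alpha}_T$ is tiny, so $\mathbf{x}_T$ is essentially indistinguishable from the standard Gaussian prior.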
The ELBO derivation:
$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \right] = -\mathcal{L}_{vlb}$$
The decomposed ELBO:
The variational bound decomposes into interpretable terms:
$$\mathcal{L}_{vlb} = \underbrace{D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0) \,\|\, p(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{L_0}$$
Interpreting each term:

- $L_T$: how close the fully noised distribution is to the standard Gaussian prior. With a fixed schedule it has no trainable parameters and is nearly zero.
- $L_{t-1}$: KL divergences matching each learned reverse step $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ to the tractable posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$. These terms dominate training.
- $L_0$: the reconstruction log-likelihood for the final denoising step.
DDPM showed that the simplified loss (uniform weighting, MSE on noise prediction) empirically outperforms the full weighted VLB. The simple loss emphasizes perceptual quality over exact likelihood, which aligns with image generation goals.
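The simplified objective is just an MSE between the true and predicted noise at a uniformly random timestep. A minimal sketch, assuming a `model(x_t, t)` noise-prediction network as in the sampling code on this page (the helper name is illustrative):

```python
import torch

def ddpm_simple_loss(model, x_0: torch.Tensor,
                     alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """L_simple: uniform-weight MSE between true and predicted noise."""
    B = x_0.shape[0]
    T = len(alphas_cumprod)
    t = torch.randint(0, T, (B,), device=x_0.device)  # random timestep per sample
    noise = torch.randn_like(x_0)                     # epsilon ~ N(0, I)
    # Broadcast alpha_bar_t over the non-batch dimensions
    abar = alphas_cumprod[t].view(B, *([1] * (x_0.dim() - 1)))
    # Forward process in closed form: x_t = sqrt(abar) x_0 + sqrt(1-abar) eps
    x_t = torch.sqrt(abar) * x_0 + torch.sqrt(1 - abar) * noise
    return torch.nn.functional.mse_loss(model(x_t, t), noise)
```

Note the uniform weighting: every timestep contributes equally, in contrast to the VLB's per-term weights.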
A key mathematical result: the posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is tractable and Gaussian.
Derivation via Bayes' theorem:
$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) \, q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)}$$
Since all these distributions are Gaussian, the posterior is also Gaussian:
$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$$
The posterior parameters:
$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} \mathbf{x}_t$$
$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$$
The crucial insight: The posterior mean depends on both $\mathbf{x}_t$ and $\mathbf{x}_0$. Since we don't have $\mathbf{x}_0$ during generation, we must estimate it from the predicted noise:
$$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}$$
Predicting noise ε is equivalent to predicting x_0. The model implicitly estimates what the clean data would be at each step, then uses this to compute the posterior mean for the next step.
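This inversion is a one-liner, and it is easy to sanity-check: if we hand it the exact noise used in the forward process, it recovers $\mathbf{x}_0$ to numerical precision. A minimal sketch (the helper name is illustrative):

```python
import torch

def predict_x0_from_eps(x_t: torch.Tensor, eps_pred: torch.Tensor,
                        abar_t: torch.Tensor) -> torch.Tensor:
    """Invert the closed-form forward process: x_0_hat from x_t and predicted noise."""
    return (x_t - torch.sqrt(1 - abar_t) * eps_pred) / torch.sqrt(abar_t)

# Round-trip check: noising with known eps, then inverting, recovers x_0
abar_t = torch.tensor(0.5)
x0 = torch.randn(3)
eps = torch.randn(3)
x_t = torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps
assert torch.allclose(predict_x0_from_eps(x_t, eps, abar_t), x0, atol=1e-5)
```

During sampling the true noise is unknown, so $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ takes its place and $\hat{\mathbf{x}}_0$ is only an estimate.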
The learned reverse process approximates the posterior:
$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$
Parameterizing the mean:
Using the noise prediction network: $$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$
Choosing the variance:
DDPM explored two options:

- $\sigma_t^2 = \beta_t$ (the forward process variance)
- $\sigma_t^2 = \tilde{\beta}_t$ (the true posterior variance)
Both work well empirically; some later works learn $\sigma_t$ as well.
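As a quick numerical consistency check (not from the paper, just a sanity test of the formulas on this page): substituting $\hat{\mathbf{x}}_0$ into the posterior mean $\tilde{\boldsymbol{\mu}}_t$ reproduces the $\boldsymbol{\epsilon}$-parameterized mean $\boldsymbol{\mu}_\theta$ exactly:

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 100)
alphas = 1 - betas
abar = torch.cumprod(alphas, dim=0)
t = 50
x_t, eps = torch.randn(4), torch.randn(4)

# Route 1: estimate x_0 from eps, plug into the posterior mean formula
x0_hat = (x_t - torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(abar[t])
mu_post = (torch.sqrt(abar[t - 1]) * betas[t] / (1 - abar[t])) * x0_hat \
        + (torch.sqrt(alphas[t]) * (1 - abar[t - 1]) / (1 - abar[t])) * x_t

# Route 2: the direct epsilon parameterization
mu_eps = (1 / torch.sqrt(alphas[t])) * (
    x_t - betas[t] / torch.sqrt(1 - abar[t]) * eps
)

assert torch.allclose(mu_post, mu_eps, atol=1e-4)
```

The two routes agree because $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$ and $\beta_t + \alpha_t = 1$ make the $\mathbf{x}_t$ coefficients collapse to $1/\sqrt{\alpha_t}$.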
```python
import torch


def ddpm_reverse_step(
    model: torch.nn.Module,
    x_t: torch.Tensor,
    t: int,
    betas: torch.Tensor,
    alphas: torch.Tensor,
    alphas_cumprod: torch.Tensor,
    sqrt_one_minus_alphas_cumprod: torch.Tensor,
) -> torch.Tensor:
    """
    Perform one reverse diffusion step (denoising).

    Args:
        model: Trained noise prediction network
        x_t: Current noisy sample at timestep t
        t: Current timestep
        betas, alphas, etc.: Precomputed schedule values

    Returns:
        x_{t-1}: Denoised sample at timestep t-1
    """
    # Predict the noise
    t_tensor = torch.tensor([t], device=x_t.device).expand(x_t.shape[0])
    epsilon_pred = model(x_t, t_tensor)

    # Compute the mean of p(x_{t-1} | x_t)
    alpha_t = alphas[t]
    beta_t = betas[t]
    sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t]

    # mu = (1/sqrt(alpha_t)) * (x_t - beta_t/sqrt(1-alpha_bar_t) * epsilon)
    mu = (1.0 / torch.sqrt(alpha_t)) * (
        x_t - (beta_t / sqrt_one_minus_alpha_bar) * epsilon_pred
    )

    # Add noise (except at t=0)
    if t > 0:
        noise = torch.randn_like(x_t)
        sigma_t = torch.sqrt(beta_t)  # Can also use posterior variance
        x_t_minus_1 = mu + sigma_t * noise
    else:
        x_t_minus_1 = mu  # No noise at final step

    return x_t_minus_1
```

The full generation algorithm iteratively applies the reverse process from $t=T$ down to $t=0$:
```python
import torch
from tqdm import tqdm


@torch.no_grad()
def ddpm_sample(
    model: torch.nn.Module,
    shape: tuple,
    schedule_params: dict,
    device: str = "cuda",
) -> torch.Tensor:
    """
    Generate samples using the DDPM sampling algorithm.

    Args:
        model: Trained noise prediction network
        shape: Shape of samples to generate (B, C, H, W)
        schedule_params: Precomputed schedule values
        device: Device to use

    Returns:
        x_0: Generated samples
    """
    model.eval()

    betas = schedule_params['betas'].to(device)
    alphas = schedule_params['alphas'].to(device)
    alphas_cumprod = schedule_params['alphas_cumprod'].to(device)
    sqrt_one_minus_alphas_cumprod = schedule_params['sqrt_one_minus_alphas_cumprod'].to(device)
    T = len(betas)

    # Start from pure noise: x_T ~ N(0, I)
    x_t = torch.randn(shape, device=device)

    # Iteratively denoise from t=T-1 to t=0
    for t in tqdm(reversed(range(T)), total=T, desc="Sampling"):
        t_tensor = torch.tensor([t], device=device).expand(shape[0])

        # Predict noise
        epsilon_pred = model(x_t, t_tensor)

        # Compute posterior mean
        alpha_t = alphas[t]
        beta_t = betas[t]
        sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t]

        mu = (1.0 / torch.sqrt(alpha_t)) * (
            x_t - (beta_t / sqrt_one_minus_alpha_bar) * epsilon_pred
        )

        # Sample x_{t-1}
        if t > 0:
            noise = torch.randn_like(x_t)
            sigma_t = torch.sqrt(beta_t)
            x_t = mu + sigma_t * noise
        else:
            x_t = mu

    return x_t
```

DDPM requires T=1000 sequential neural network evaluations to generate one sample. This is much slower than GANs (one forward pass) or VAEs (one encoder + decoder pass). Accelerated sampling methods (DDIM, DPM-Solver) address this limitation.
The follow-up paper "Improved Denoising Diffusion Probabilistic Models" introduced several enhancements:
| Aspect | Original DDPM | Improved DDPM | Impact |
|---|---|---|---|
| Noise schedule | Linear | Cosine | Better image quality, especially details |
| Variance | Fixed (β_t or posterior) | Learned interpolation | Improved log-likelihood |
| Training objective | L_simple only | Hybrid L_simple + L_vlb | Better likelihood + quality |
| Steps at inference | Same as training (T) | Can use fewer | More flexibility |
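The cosine schedule from the table can be sketched as follows, following the construction in the Improved DDPM paper: define $\bar{\alpha}_t$ from a squared cosine, then recover each $\beta_t$ from the ratio of consecutive $\bar{\alpha}$ values (the offset $s = 0.008$ and the 0.999 clip are the paper's choices; the helper name is illustrative):

```python
import math
import torch

def cosine_schedule(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Betas derived from the cosine alpha_bar schedule of Improved DDPM."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alphas_cumprod = f / f[0]                             # normalize so abar_0 = 1
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]  # beta_t from abar ratio
    return betas.clamp(max=0.999).float()                 # clip as in the paper
```

Compared to the linear schedule, the cosine schedule destroys information more gradually at early timesteps, which is where fine image details live.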
Learning the variance:
Improved DDPM parameterizes variance as an interpolation: $$\Sigma_\theta(\mathbf{x}_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)$$
where $v = v_\theta(\mathbf{x}_t, t)$ is a learned output. This allows the model to adaptively choose variance based on the input.
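The interpolation itself is a one-liner; a minimal sketch, assuming the network's extra output head has already been squashed to $v \in [0, 1]$ (e.g. via a sigmoid):

```python
import torch

def interpolate_variance(v: torch.Tensor, beta_t: torch.Tensor,
                         beta_tilde_t: torch.Tensor) -> torch.Tensor:
    """Sigma^2 = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    return torch.exp(v * torch.log(beta_t) + (1 - v) * torch.log(beta_tilde_t))
```

Interpolating in log space keeps the result strictly between $\tilde{\beta}_t$ and $\beta_t$, the two fixed choices from the original DDPM.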
The combination of cosine schedule, learned variance, and hybrid loss in Improved DDPM became the foundation for subsequent models like Guided Diffusion and Stable Diffusion. These seemingly small changes yield significant quality improvements.
Denoising Diffusion Implicit Models (DDIM) by Song et al. revealed that the DDPM training objective actually supports a family of sampling procedures, not just the original stochastic one.
Key insight: The same trained model can be sampled using:

- the original stochastic DDPM procedure ($\eta = 1$),
- a fully deterministic procedure ($\eta = 0$), or
- anything in between, controlled by the noise scale $\sigma_t$.
The DDIM update rule:
$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \, \hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + \sigma_t \boldsymbol{\epsilon}_t$$
Setting $\sigma_t = 0$ gives deterministic sampling.
```python
import torch


@torch.no_grad()
def ddim_sample(
    model: torch.nn.Module,
    shape: tuple,
    schedule_params: dict,
    num_steps: int = 50,  # Much fewer than T!
    eta: float = 0.0,     # 0 = deterministic, 1 = DDPM
    device: str = "cuda",
) -> torch.Tensor:
    """
    DDIM sampling with configurable number of steps.
    """
    alphas_cumprod = schedule_params['alphas_cumprod'].to(device)
    T = len(alphas_cumprod)

    # Select subset of timesteps
    timesteps = torch.linspace(T - 1, 0, num_steps, dtype=torch.long, device=device)

    x_t = torch.randn(shape, device=device)

    for i, t in enumerate(timesteps):
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else torch.tensor(0)

        # Predict noise
        epsilon_pred = model(x_t, t.expand(shape[0]))

        # Predict x_0
        alpha_bar_t = alphas_cumprod[t]
        x0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * epsilon_pred) / torch.sqrt(alpha_bar_t)

        # Compute coefficients for x_{t-1}
        alpha_bar_t_next = alphas_cumprod[t_next] if t_next > 0 else torch.tensor(1.0)
        sigma_t = eta * torch.sqrt((1 - alpha_bar_t_next) / (1 - alpha_bar_t)) * \
            torch.sqrt(1 - alpha_bar_t / alpha_bar_t_next)

        # DDIM update
        x_t = torch.sqrt(alpha_bar_t_next) * x0_pred + \
            torch.sqrt(1 - alpha_bar_t_next - sigma_t**2) * epsilon_pred

        if sigma_t > 0 and t > 0:
            x_t += sigma_t * torch.randn_like(x_t)

    return x_t
```

You now have a deep understanding of the DDPM framework, the foundation of modern diffusion models. Next, we'll explore score-based models, which provide an alternative continuous-time perspective on the same underlying principles.