The 2020 paper "Denoising Diffusion Probabilistic Models" by Ho, Jain, and Abbeel unified decades of theoretical work into a practical framework that achieved image quality comparable to GANs. This page dissects the complete DDPM framework: the probabilistic formulation, the reverse process derivation, and the sampling algorithms that convert noise into data.
DDPM established the template that all subsequent diffusion models build upon—understanding it deeply is essential for working with modern generative AI.
After this page, you will understand: (1) the variational formulation and ELBO derivation, (2) the parameterized reverse process, (3) the posterior distribution q(x_{t-1}|x_t, x_0), (4) the complete sampling algorithm, and (5) practical considerations for implementation.
DDPM frames diffusion as a latent variable model where $\mathbf{x}_1, \ldots, \mathbf{x}_T$ are latent variables. Generation maximizes a variational lower bound (ELBO) on the data log-likelihood.
The generative model:
$$p_\theta(\mathbf{x}_0) = \int p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) \, d\mathbf{x}_{1:T}$$
where $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the prior.
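The sampling code later on this page consumes a `schedule_params` dictionary of precomputed tensors. As an illustrative sketch, here is one way to build it with the linear β schedule from the DDPM paper (the helper name and dictionary keys are chosen to match the code below, not prescribed by the paper):

```python
import torch

def make_linear_schedule(T: int = 1000,
                         beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> dict:
    """Precompute the schedule quantities used throughout DDPM."""
    betas = torch.linspace(beta_start, beta_end, T)    # beta_t
    alphas = 1.0 - betas                               # alpha_t = 1 - beta_t
    alphas_cumprod = torch.cumprod(alphas, dim=0)      # alpha_bar_t
    return {
        "betas": betas,
        "alphas": alphas,
        "alphas_cumprod": alphas_cumprod,
        "sqrt_one_minus_alphas_cumprod": torch.sqrt(1.0 - alphas_cumprod),
    }
```

With these defaults, $\bar{\alpha}_T$ is tiny, so $\mathbf{x}_T$ is essentially indistinguishable from the standard Gaussian prior.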
The ELBO derivation:
$$\log p_\theta(\mathbf{x}_0) \geq \mathbb{E}_q \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \right] = -\mathcal{L}_{vlb}$$
The decomposed ELBO:
The variational bound decomposes into interpretable terms:
$$\mathcal{L}_{vlb} = \underbrace{D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0) \,\|\, p(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{L_0}$$
Interpreting each term:

- $L_T$: how close the fully noised distribution is to the standard Gaussian prior. With a fixed schedule it has no trainable parameters and is nearly zero.
- $L_{t-1}$: KL divergences matching each learned reverse step $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ to the tractable posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$. These terms dominate training.
- $L_0$: the reconstruction log-likelihood for the final denoising step.
DDPM showed that the simplified loss (uniform weighting, MSE on noise prediction) empirically outperforms the full weighted VLB. The simple loss emphasizes perceptual quality over exact likelihood, which aligns with image generation goals.
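The simplified objective is just an MSE between the true and predicted noise at a uniformly random timestep. A minimal sketch, assuming a `model(x_t, t)` noise-prediction network as in the sampling code on this page (the helper name is illustrative):

```python
import torch

def ddpm_simple_loss(model, x_0: torch.Tensor,
                     alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """L_simple: uniform-weight MSE between true and predicted noise."""
    B = x_0.shape[0]
    T = len(alphas_cumprod)
    t = torch.randint(0, T, (B,), device=x_0.device)  # random timestep per sample
    noise = torch.randn_like(x_0)                     # epsilon ~ N(0, I)
    # Broadcast alpha_bar_t over the non-batch dimensions
    abar = alphas_cumprod[t].view(B, *([1] * (x_0.dim() - 1)))
    # Forward process in closed form: x_t = sqrt(abar) x_0 + sqrt(1-abar) eps
    x_t = torch.sqrt(abar) * x_0 + torch.sqrt(1 - abar) * noise
    return torch.nn.functional.mse_loss(model(x_t, t), noise)
```

Note the uniform weighting: every timestep contributes equally, in contrast to the VLB's per-term weights.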
A key mathematical result: the posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ is tractable and Gaussian.
Derivation via Bayes' theorem:
$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \frac{q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) \, q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)}$$
Since all these distributions are Gaussian, the posterior is also Gaussian:
$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$$
The posterior parameters:
$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} \mathbf{x}_t$$
$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$$
The crucial insight: The posterior mean depends on both $\mathbf{x}_t$ and $\mathbf{x}_0$. Since we don't have $\mathbf{x}_0$ during generation, we must estimate it from the predicted noise:
$$\hat{\mathbf{x}}_0 = \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}$$
Predicting noise ε is equivalent to predicting x_0. The model implicitly estimates what the clean data would be at each step, then uses this to compute the posterior mean for the next step.
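This inversion is a one-liner, and it is easy to sanity-check: if we hand it the exact noise used in the forward process, it recovers $\mathbf{x}_0$ to numerical precision. A minimal sketch (the helper name is illustrative):

```python
import torch

def predict_x0_from_eps(x_t: torch.Tensor, eps_pred: torch.Tensor,
                        abar_t: torch.Tensor) -> torch.Tensor:
    """Invert the closed-form forward process: x_0_hat from x_t and predicted noise."""
    return (x_t - torch.sqrt(1 - abar_t) * eps_pred) / torch.sqrt(abar_t)

# Round-trip check: noising with known eps, then inverting, recovers x_0
abar_t = torch.tensor(0.5)
x0 = torch.randn(3)
eps = torch.randn(3)
x_t = torch.sqrt(abar_t) * x0 + torch.sqrt(1 - abar_t) * eps
assert torch.allclose(predict_x0_from_eps(x_t, eps, abar_t), x0, atol=1e-5)
```

During sampling the true noise is unknown, so $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ takes its place and $\hat{\mathbf{x}}_0$ is only an estimate.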
The learned reverse process approximates the posterior:
$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$
Parameterizing the mean:
Using the noise prediction network: $$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)$$
Choosing the variance:
DDPM explored two options:

- $\sigma_t^2 = \beta_t$ (the forward process variance)
- $\sigma_t^2 = \tilde{\beta}_t$ (the true posterior variance)
Both work well empirically; some later works learn $\sigma_t$ as well.
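As a quick numerical consistency check (not from the paper, just a sanity test of the formulas on this page): substituting $\hat{\mathbf{x}}_0$ into the posterior mean $\tilde{\boldsymbol{\mu}}_t$ reproduces the $\boldsymbol{\epsilon}$-parameterized mean $\boldsymbol{\mu}_\theta$ exactly:

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 100)
alphas = 1 - betas
abar = torch.cumprod(alphas, dim=0)
t = 50
x_t, eps = torch.randn(4), torch.randn(4)

# Route 1: estimate x_0 from eps, plug into the posterior mean formula
x0_hat = (x_t - torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(abar[t])
mu_post = (torch.sqrt(abar[t - 1]) * betas[t] / (1 - abar[t])) * x0_hat \
        + (torch.sqrt(alphas[t]) * (1 - abar[t - 1]) / (1 - abar[t])) * x_t

# Route 2: the direct epsilon parameterization
mu_eps = (1 / torch.sqrt(alphas[t])) * (
    x_t - betas[t] / torch.sqrt(1 - abar[t]) * eps
)

assert torch.allclose(mu_post, mu_eps, atol=1e-4)
```

The two routes agree because $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$ and $\beta_t + \alpha_t = 1$ make the $\mathbf{x}_t$ coefficients collapse to $1/\sqrt{\alpha_t}$.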
```python
import torch


def ddpm_reverse_step(
    model: torch.nn.Module,
    x_t: torch.Tensor,
    t: int,
    betas: torch.Tensor,
    alphas: torch.Tensor,
    alphas_cumprod: torch.Tensor,
    sqrt_one_minus_alphas_cumprod: torch.Tensor,
) -> torch.Tensor:
    """
    Perform one reverse diffusion step (denoising).

    Args:
        model: Trained noise prediction network
        x_t: Current noisy sample at timestep t
        t: Current timestep
        betas, alphas, etc.: Precomputed schedule values

    Returns:
        x_{t-1}: Denoised sample at timestep t-1
    """
    # Predict the noise
    t_tensor = torch.tensor([t], device=x_t.device).expand(x_t.shape[0])
    epsilon_pred = model(x_t, t_tensor)

    # Compute the mean of p(x_{t-1} | x_t)
    alpha_t = alphas[t]
    beta_t = betas[t]
    sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t]

    # mu = (1/sqrt(alpha_t)) * (x_t - beta_t/sqrt(1-alpha_bar_t) * epsilon)
    mu = (1.0 / torch.sqrt(alpha_t)) * (
        x_t - (beta_t / sqrt_one_minus_alpha_bar) * epsilon_pred
    )

    # Add noise (except at t=0)
    if t > 0:
        noise = torch.randn_like(x_t)
        sigma_t = torch.sqrt(beta_t)  # Can also use posterior variance
        x_t_minus_1 = mu + sigma_t * noise
    else:
        x_t_minus_1 = mu  # No noise at final step

    return x_t_minus_1
```

The full generation algorithm iteratively applies the reverse process from $t=T$ down to $t=0$:
```python
import torch
from tqdm import tqdm


@torch.no_grad()
def ddpm_sample(
    model: torch.nn.Module,
    shape: tuple,
    schedule_params: dict,
    device: str = "cuda",
) -> torch.Tensor:
    """
    Generate samples using the DDPM sampling algorithm.

    Args:
        model: Trained noise prediction network
        shape: Shape of samples to generate (B, C, H, W)
        schedule_params: Precomputed schedule values
        device: Device to use

    Returns:
        x_0: Generated samples
    """
    model.eval()

    betas = schedule_params['betas'].to(device)
    alphas = schedule_params['alphas'].to(device)
    alphas_cumprod = schedule_params['alphas_cumprod'].to(device)
    sqrt_one_minus_alphas_cumprod = schedule_params['sqrt_one_minus_alphas_cumprod'].to(device)
    T = len(betas)

    # Start from pure noise: x_T ~ N(0, I)
    x_t = torch.randn(shape, device=device)

    # Iteratively denoise from t=T-1 to t=0
    for t in tqdm(reversed(range(T)), total=T, desc="Sampling"):
        t_tensor = torch.tensor([t], device=device).expand(shape[0])

        # Predict noise
        epsilon_pred = model(x_t, t_tensor)

        # Compute posterior mean
        alpha_t = alphas[t]
        beta_t = betas[t]
        sqrt_one_minus_alpha_bar = sqrt_one_minus_alphas_cumprod[t]

        mu = (1.0 / torch.sqrt(alpha_t)) * (
            x_t - (beta_t / sqrt_one_minus_alpha_bar) * epsilon_pred
        )

        # Sample x_{t-1}
        if t > 0:
            noise = torch.randn_like(x_t)
            sigma_t = torch.sqrt(beta_t)
            x_t = mu + sigma_t * noise
        else:
            x_t = mu

    return x_t
```

DDPM requires T=1000 sequential neural network evaluations to generate one sample. This is much slower than GANs (one forward pass) or VAEs (one encoder + decoder pass). Accelerated sampling methods (DDIM, DPM-Solver) address this limitation.
The follow-up paper "Improved Denoising Diffusion Probabilistic Models" introduced several enhancements:
| Aspect | Original DDPM | Improved DDPM | Impact |
|---|---|---|---|
| Noise schedule | Linear | Cosine | Better image quality, especially details |
| Variance | Fixed (β_t or posterior) | Learned interpolation | Improved log-likelihood |
| Training objective | L_simple only | Hybrid L_simple + L_vlb | Better likelihood + quality |
| Steps at inference | Same as training (T) | Can use fewer | More flexibility |
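The cosine schedule from the table can be sketched as follows, following the construction in the Improved DDPM paper: define $\bar{\alpha}_t$ from a squared cosine, then recover each $\beta_t$ from the ratio of consecutive $\bar{\alpha}$ values (the offset $s = 0.008$ and the 0.999 clip are the paper's choices; the helper name is illustrative):

```python
import math
import torch

def cosine_schedule(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Betas derived from the cosine alpha_bar schedule of Improved DDPM."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alphas_cumprod = f / f[0]                             # normalize so abar_0 = 1
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]  # beta_t from abar ratio
    return betas.clamp(max=0.999).float()                 # clip as in the paper
```

Compared to the linear schedule, the cosine schedule destroys information more gradually at early timesteps, which is where fine image details live.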
Learning the variance:
Improved DDPM parameterizes variance as an interpolation: $$\Sigma_\theta(\mathbf{x}_t, t) = \exp(v \log \beta_t + (1-v) \log \tilde{\beta}_t)$$
where $v = v_\theta(\mathbf{x}_t, t)$ is a learned output. This allows the model to adaptively choose variance based on the input.
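The interpolation itself is a one-liner; a minimal sketch, assuming the network's extra output head has already been squashed to $v \in [0, 1]$ (e.g. via a sigmoid):

```python
import torch

def interpolate_variance(v: torch.Tensor, beta_t: torch.Tensor,
                         beta_tilde_t: torch.Tensor) -> torch.Tensor:
    """Sigma^2 = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t))."""
    return torch.exp(v * torch.log(beta_t) + (1 - v) * torch.log(beta_tilde_t))
```

Interpolating in log space keeps the result strictly between $\tilde{\beta}_t$ and $\beta_t$, the two fixed choices from the original DDPM.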
The combination of cosine schedule, learned variance, and hybrid loss in Improved DDPM became the foundation for subsequent models like Guided Diffusion and Stable Diffusion. These seemingly small changes yield significant quality improvements.
Denoising Diffusion Implicit Models (DDIM) by Song et al. revealed that the DDPM training objective actually supports a family of sampling procedures, not just the original stochastic one.
Key insight: The same trained model can be sampled using:

- the original stochastic DDPM procedure ($\eta = 1$),
- a fully deterministic procedure ($\eta = 0$), or
- anything in between, controlled by the noise scale $\sigma_t$.
The DDIM update rule:
$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \, \hat{\mathbf{x}}_0 + \sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2} \cdot \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + \sigma_t \boldsymbol{\epsilon}_t$$
Setting $\sigma_t = 0$ gives deterministic sampling.
```python
import torch


@torch.no_grad()
def ddim_sample(
    model: torch.nn.Module,
    shape: tuple,
    schedule_params: dict,
    num_steps: int = 50,  # Much fewer than T!
    eta: float = 0.0,     # 0 = deterministic, 1 = DDPM
    device: str = "cuda",
) -> torch.Tensor:
    """
    DDIM sampling with configurable number of steps.
    """
    alphas_cumprod = schedule_params['alphas_cumprod'].to(device)
    T = len(alphas_cumprod)

    # Select subset of timesteps
    timesteps = torch.linspace(T - 1, 0, num_steps, dtype=torch.long, device=device)

    x_t = torch.randn(shape, device=device)

    for i, t in enumerate(timesteps):
        t_next = timesteps[i + 1] if i + 1 < len(timesteps) else torch.tensor(0)

        # Predict noise
        epsilon_pred = model(x_t, t.expand(shape[0]))

        # Predict x_0
        alpha_bar_t = alphas_cumprod[t]
        x0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * epsilon_pred) / torch.sqrt(alpha_bar_t)

        # Compute coefficients for x_{t-1}
        alpha_bar_t_next = alphas_cumprod[t_next] if t_next > 0 else torch.tensor(1.0)
        sigma_t = eta * torch.sqrt((1 - alpha_bar_t_next) / (1 - alpha_bar_t)) * \
            torch.sqrt(1 - alpha_bar_t / alpha_bar_t_next)

        # DDIM update
        x_t = torch.sqrt(alpha_bar_t_next) * x0_pred + \
            torch.sqrt(1 - alpha_bar_t_next - sigma_t**2) * epsilon_pred

        if sigma_t > 0 and t > 0:
            x_t += sigma_t * torch.randn_like(x_t)

    return x_t
```

You now have a deep understanding of the DDPM framework, the foundation of modern diffusion models. Next, we'll explore score-based models, which provide an alternative continuous-time perspective on the same underlying principles.