RealNVP (Real-valued Non-Volume Preserving) and Glow (Generative Flow with Invertible 1×1 Convolutions) represent the maturation of normalizing flows into practical, high-quality generative models for images. These architectures combine the coupling layer foundations we studied with multi-scale processing, specialized convolutions, and careful engineering to achieve compelling image generation while maintaining exact likelihood computation.
RealNVP (2016) demonstrated that flows could generate recognizable images, while Glow (2018) pushed quality toward GAN levels, generating near-photorealistic 256×256 faces. Understanding these architectures provides both practical implementation knowledge and insight into the design principles that make flows work at scale.
Understand the multi-scale architecture for efficient image processing, master the specific components of RealNVP and Glow (squeeze operations, split operations, invertible 1×1 convolutions), and learn practical training techniques for flow-based image models.
RealNVP builds on affine coupling layers with several key innovations for processing images efficiently.
Multi-Scale Architecture:
Images are high-dimensional (e.g., 256×256×3 = 196,608 dimensions), making it expensive to process all dimensions at every layer. RealNVP uses a multi-scale architecture that progressively factors out dimensions:

- Apply several coupling layers at the current resolution.
- Squeeze: halve the spatial resolution while quadrupling the number of channels.
- Split: factor half of the dimensions directly out into the latent representation, so only the remaining half is processed by deeper layers.
- Repeat at the next, coarser scale.
This creates a pyramid structure where coarse features are modeled at lower resolutions and fine details at higher resolutions.
Squeeze Operation:
The squeeze operation trades spatial size for channel depth: a tensor of shape $H \times W \times C$ is reshaped to $\frac{H}{2} \times \frac{W}{2} \times 4C$.
This is a simple reshaping—taking 2×2 spatial blocks and stacking them as channels. It's invertible with determinant 1 (just a permutation of elements).
Split Operation:
After squeezing, split factors out half the channels: the factored-out half is modeled directly as part of the latent representation (under a learned prior), while the other half continues through further flow layers.
This progressively reduces computation while allowing the model to allocate capacity where needed.
Checkerboard and Channel Masking:
RealNVP uses two types of masks for coupling layers:

- Checkerboard masks: partition pixels spatially in an alternating (checkerboard) pattern, so half the pixels are transformed conditioned on the other half.
- Channel-wise masks: hold half of the channels fixed and transform the other half (used after squeeze operations).

These ensure different dimensions interact across layers; a small sketch of both mask types follows.
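Below is a minimal sketch of how such masks can be constructed in PyTorch. The function names and broadcasting conventions are illustrative, not RealNVP's exact implementation.

```python
import torch

def checkerboard_mask(height, width, invert=False):
    """Binary mask that alternates spatially, like a checkerboard.

    Positions where (row + col) is even are 1; `invert` flips the pattern so
    successive coupling layers condition on complementary pixels.
    """
    rows = torch.arange(height).view(-1, 1)
    cols = torch.arange(width).view(1, -1)
    mask = ((rows + cols) % 2 == 0).float()
    if invert:
        mask = 1.0 - mask
    return mask.view(1, 1, height, width)  # broadcast over batch and channels

def channel_mask(num_channels, invert=False):
    """Binary mask that keeps the first half of the channels fixed."""
    mask = torch.zeros(num_channels)
    mask[: num_channels // 2] = 1.0
    if invert:
        mask = 1.0 - mask
    return mask.view(1, -1, 1, 1)

# In a masked coupling layer, the fixed part is x * mask and the transformed
# part is x * (1 - mask), e.g.:
#   y = x * mask + (1 - mask) * (x * torch.exp(s(x * mask)) + t(x * mask))
```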
```python
import math

import torch
import torch.nn as nn

class Squeeze(nn.Module):
    """
    Squeeze operation: trades spatial size for channels.
    H×W×C -> (H/2)×(W/2)×(4C)
    """
    def forward(self, x):
        B, C, H, W = x.shape
        x = x.view(B, C, H // 2, 2, W // 2, 2)
        x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
        x = x.view(B, C * 4, H // 2, W // 2)
        return x, torch.zeros(B, device=x.device)

    def inverse(self, x):
        B, C, H, W = x.shape
        x = x.view(B, C // 4, 2, 2, H, W)
        x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
        x = x.view(B, C // 4, H * 2, W * 2)
        return x, torch.zeros(B, device=x.device)

class Split(nn.Module):
    """
    Split operation: factor out half the channels.
    Models factored channels with a learned prior.
    """
    def __init__(self, num_channels):
        super().__init__()
        # Prior parameters (mean and log-std) conditioned on remaining channels
        self.prior_net = nn.Sequential(
            nn.Conv2d(num_channels // 2, num_channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(num_channels, num_channels, 1)
        )

    def forward(self, x):
        z1, z2 = x.chunk(2, dim=1)
        # z2 is factored out; compute its prior params from z1
        prior_params = self.prior_net(z1)
        mean, log_std = prior_params.chunk(2, dim=1)
        # Log prob of z2 under the learned prior (enters the objective like a log-det term)
        log_prob = -0.5 * (math.log(2 * math.pi) + 2 * log_std
                           + (z2 - mean) ** 2 / torch.exp(2 * log_std))
        log_det = log_prob.sum(dim=[1, 2, 3])
        return z1, log_det, z2  # z2 is stored for reconstruction

    def inverse(self, z1, z2=None, temperature=1.0):
        if z2 is None:
            # Sample from the prior
            prior_params = self.prior_net(z1)
            mean, log_std = prior_params.chunk(2, dim=1)
            z2 = mean + torch.exp(log_std) * torch.randn_like(mean) * temperature
        return torch.cat([z1, z2], dim=1), torch.zeros(z1.shape[0], device=z1.device)
```

Glow builds on RealNVP with three significant improvements that push flow sample quality toward GAN levels.
1. Invertible 1×1 Convolutions:
RealNVP uses fixed permutations between coupling layers. Glow replaces these with learned invertible 1×1 convolutions—essentially learned linear mixing of channels.
For input with $C$ channels, a 1×1 convolution is a $C \times C$ matrix $\mathbf{W}$ applied identically at each spatial position. The log-determinant is: $$\log|\det(J)| = H \cdot W \cdot \log|\det(\mathbf{W})|$$
Glow uses LU decomposition for efficient computation: $\mathbf{W} = \mathbf{P}\mathbf{L}\mathbf{U}$ where $\mathbf{P}$ is a fixed permutation and $\mathbf{L}, \mathbf{U}$ are triangular and learned.
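The Glow step code later in this section references an `Invertible1x1Conv2d` module. Here is a minimal sketch of how such a layer might be implemented with the LU parameterization; it is illustrative rather than the official Glow code, and the class and attribute names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Invertible1x1Conv2d(nn.Module):
    """Learned channel mixing via an invertible 1x1 convolution, W = P L U."""
    def __init__(self, num_channels):
        super().__init__()
        # Initialize W to a random rotation, then factor it as W = P L U
        w, _ = torch.linalg.qr(torch.randn(num_channels, num_channels))
        P, L, U = torch.linalg.lu(w)
        self.register_buffer('P', P)                               # fixed permutation
        self.L = nn.Parameter(L)                                   # lower-triangular (unit diagonal)
        self.log_s = nn.Parameter(torch.log(torch.abs(torch.diagonal(U))))
        self.register_buffer('sign_s', torch.sign(torch.diagonal(U)))
        self.U = nn.Parameter(torch.triu(U, diagonal=1))           # strictly upper-triangular
        self.register_buffer('l_mask', torch.tril(torch.ones_like(L), diagonal=-1))
        self.register_buffer('eye', torch.eye(num_channels))

    def _weight(self):
        L = self.L * self.l_mask + self.eye
        U = self.U * self.l_mask.T + torch.diag(self.sign_s * torch.exp(self.log_s))
        return self.P @ L @ U

    def forward(self, x):
        B, C, H, W = x.shape
        weight = self._weight().view(C, C, 1, 1)
        y = F.conv2d(x, weight)
        # log|det J| = H * W * sum(log|s|): the diagonal of U carries the determinant
        log_det = H * W * self.log_s.sum()
        return y, log_det.expand(B)

    def inverse(self, y):
        B, C, H, W = y.shape
        weight_inv = torch.inverse(self._weight()).view(C, C, 1, 1)
        x = F.conv2d(y, weight_inv)
        log_det = -H * W * self.log_s.sum()
        return x, log_det.expand(B)
```

Because the determinant is read off the diagonal of $\mathbf{U}$, the log-determinant costs $O(C)$ instead of the $O(C^3)$ a dense matrix would require.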
2. ActNorm (Activation Normalization):
Instead of batch normalization (which is data-dependent and thus not truly invertible), Glow uses ActNorm: a learnable affine transformation per channel: $$y = s \odot x + b$$
The parameters $s$ and $b$ are initialized data-dependently on the first batch to normalize activations to zero mean and unit variance, then trained as regular parameters.
3. Affine Coupling with Neural Networks:
Glow strengthens the conditioner networks in affine coupling layers, using deep residual networks. The coupling function is: $$x_B = z_B \odot \sigma(s(z_A)) + t(z_A)$$
where $\sigma$ is a sigmoid that bounds the scale to prevent numerical issues.
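The Glow step code below also references an `AffineCoupling2d` module. Here is a minimal sketch of such a channel-wise affine coupling; the conditioner is a small CNN rather than the deeper network used in the actual Glow implementation, and the names and `+2` sigmoid offset are illustrative choices.

```python
import torch
import torch.nn as nn

class AffineCoupling2d(nn.Module):
    """Channel-wise affine coupling: half the channels condition the other half."""
    def __init__(self, num_channels, hidden_channels=512):
        super().__init__()
        # Conditioner: maps the fixed half to (log-scale, shift) for the other half
        self.net = nn.Sequential(
            nn.Conv2d(num_channels // 2, hidden_channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, hidden_channels, 1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, num_channels, 3, padding=1),
        )
        # Zero-init the last layer so the coupling starts near the identity
        # (scale = sigmoid(0 + 2) ≈ 0.88 at initialization)
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=1)
        raw_scale, shift = self.net(x_a).chunk(2, dim=1)
        scale = torch.sigmoid(raw_scale + 2.0)   # bounded scale for numerical stability
        y_b = x_b * scale + shift
        log_det = torch.log(scale).sum(dim=[1, 2, 3])
        return torch.cat([x_a, y_b], dim=1), log_det

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        raw_scale, shift = self.net(y_a).chunk(2, dim=1)
        scale = torch.sigmoid(raw_scale + 2.0)
        x_b = (y_b - shift) / scale
        log_det = -torch.log(scale).sum(dim=[1, 2, 3])
        return torch.cat([y_a, x_b], dim=1), log_det
```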
The Complete Glow Block:
One flow step in Glow consists of:

1. ActNorm (per-channel affine normalization)
2. Invertible 1×1 convolution (learned channel mixing)
3. Affine coupling layer
Multiple steps are stacked, with squeeze and split operations at scale boundaries.
Multi-Scale Structure:
Glow uses $L$ levels with $K$ steps per level: each level squeezes its input, applies $K$ flow steps, and then splits off half the channels into the latent representation (the final level keeps everything).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActNorm2d(nn.Module):
    """
    Activation Normalization for 2D inputs.
    Data-dependent initialization, then trained.
    """
    def __init__(self, num_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.register_buffer('initialized', torch.tensor(False))

    def initialize(self, x):
        with torch.no_grad():
            mean = x.mean(dim=[0, 2, 3], keepdim=True)
            std = x.std(dim=[0, 2, 3], keepdim=True) + 1e-6
            self.bias.data = -mean
            self.scale.data = 1.0 / std
            self.initialized.fill_(True)

    def forward(self, x):
        if not self.initialized:
            self.initialize(x)
        y = self.scale * (x + self.bias)
        # Log det: H * W * sum(log|scale|), since the same scale applies at all positions
        B, C, H, W = x.shape
        log_det = H * W * torch.sum(torch.log(torch.abs(self.scale)))
        return y, log_det.expand(B)

    def inverse(self, y):
        x = y / self.scale - self.bias
        B, C, H, W = y.shape
        log_det = -H * W * torch.sum(torch.log(torch.abs(self.scale)))
        return x, log_det.expand(B)

class GlowStep(nn.Module):
    """Single Glow step: ActNorm -> 1x1Conv -> AffineCoupling.

    Uses the Invertible1x1Conv2d and AffineCoupling2d modules sketched above.
    """
    def __init__(self, num_channels, hidden_channels=512):
        super().__init__()
        self.actnorm = ActNorm2d(num_channels)
        self.invconv = Invertible1x1Conv2d(num_channels)
        self.coupling = AffineCoupling2d(num_channels, hidden_channels)

    def forward(self, x):
        log_det = torch.zeros(x.shape[0], device=x.device)
        y, ld = self.actnorm.forward(x)
        log_det += ld
        y, ld = self.invconv.forward(y)
        log_det += ld
        y, ld = self.coupling.forward(y)
        log_det += ld
        return y, log_det

    def inverse(self, y):
        log_det = torch.zeros(y.shape[0], device=y.device)
        x, ld = self.coupling.inverse(y)
        log_det += ld
        x, ld = self.invconv.inverse(x)
        log_det += ld
        x, ld = self.actnorm.inverse(x)
        log_det += ld
        return x, log_det
```

| Component | RealNVP | Glow |
|---|---|---|
| Permutation | Fixed (reverse, shuffle) | Learned 1×1 convolution |
| Normalization | Batch normalization | ActNorm (data-dependent init) |
| Coupling | Affine with CNNs | Affine with deeper ResNets |
| Image resolution | 32×32, 64×64 | Up to 256×256 |
| Sample quality | Recognizable but blurry | Near-photorealistic faces |
The multi-scale architecture is crucial for efficient image processing. Let's trace through how an image flows through a Glow model.
Example: 64×64×3 Image with 3 Levels, 8 Steps per Level
Level 1 (64×64×3):
- Squeeze: 64×64×3 → 32×32×12
- 8 flow steps at 32×32×12
- Split: 32×32×6 factored out as $\mathbf{z}_1$; 32×32×6 continues

Level 2 (32×32×6):
- Squeeze: 32×32×6 → 16×16×24
- 8 flow steps at 16×16×24
- Split: 16×16×12 factored out as $\mathbf{z}_2$; 16×16×12 continues

Level 3 (16×16×12):
- Squeeze: 16×16×12 → 8×8×48
- 8 flow steps at 8×8×48
- No split at the final level: all of 8×8×48 becomes $\mathbf{z}_3$

Total latent representation:
- $\mathbf{z}_1$: 32×32×6 = 6,144 dimensions
- $\mathbf{z}_2$: 16×16×12 = 3,072 dimensions
- $\mathbf{z}_3$: 8×8×48 = 3,072 dimensions
- Total: 12,288 dimensions = 64×64×3, exactly matching the input, as a bijection requires
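Putting the pieces together, here is a minimal sketch of how the multi-scale wiring traced above might be assembled from the `Squeeze`, `Split`, and `GlowStep` modules defined in this section. It is illustrative rather than a reproduction of the official Glow code: the class name `MultiScaleGlow`, the `encode`/`decode` method names, and the `get_z_shapes` helper (which the sampling code later assumes) are assumptions; the training and sampling helpers later use an `inverse`/`forward` naming convention for the same two directions.

```python
import torch
import torch.nn as nn

class MultiScaleGlow(nn.Module):
    """L levels; each level: squeeze -> K GlowSteps -> split (no split at the last level)."""
    def __init__(self, in_channels=3, image_size=64, num_levels=3, steps_per_level=8):
        super().__init__()
        self.num_levels = num_levels
        self.squeezes = nn.ModuleList()
        self.steps = nn.ModuleList()
        self.splits = nn.ModuleList()
        self.z_shapes = []

        C, S = in_channels, image_size
        for level in range(num_levels):
            C, S = C * 4, S // 2                      # effect of squeeze
            self.squeezes.append(Squeeze())
            self.steps.append(nn.ModuleList(
                [GlowStep(C) for _ in range(steps_per_level)]))
            if level < num_levels - 1:
                self.splits.append(Split(C))
                self.z_shapes.append((C // 2, S, S))  # factored-out half
                C = C // 2                            # the other half continues
            else:
                self.z_shapes.append((C, S, S))       # final level: everything is latent

    def get_z_shapes(self):
        """Shapes of the per-scale latents, e.g. for sampling the base distribution."""
        return self.z_shapes

    def encode(self, x):
        """Data -> list of per-scale latents, plus accumulated log-det / log-prior terms."""
        log_det = torch.zeros(x.shape[0], device=x.device)
        zs, h = [], x
        for level in range(self.num_levels):
            h, _ = self.squeezes[level].forward(h)
            for step in self.steps[level]:
                h, ld = step.forward(h)
                log_det += ld
            if level < self.num_levels - 1:
                h, ld, z = self.splits[level].forward(h)
                log_det += ld
                zs.append(z)
        zs.append(h)
        return zs, log_det

    def decode(self, zs):
        """List of per-scale latents -> data, inverting each level in reverse order."""
        h = zs[-1]
        for level in reversed(range(self.num_levels)):
            if level < self.num_levels - 1:
                h, _ = self.splits[level].inverse(h, z2=zs[level])
            for step in reversed(list(self.steps[level])):
                h, _ = step.inverse(h)
            h, _ = self.squeezes[level].inverse(h)
        return h
```

With the default arguments, `get_z_shapes()` returns `[(6, 32, 32), (12, 16, 16), (48, 8, 8)]`, matching the dimension count traced above.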
Multi-scale processing means most computation happens at lower resolutions. A 1024×1024 image might have final flow steps at 32×32, reducing computation 1000×. The early splits also allow the model to allocate detail-level vs structure-level capacity appropriately.
Training:
Training maximizes the expected log-likelihood (equivalently, minimizes the negative log-likelihood) via gradient descent: $$\mathcal{L} = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log p_\theta(\mathbf{x})]$$
For images with discrete pixel values, dequantization is essential: $$\tilde{\mathbf{x}} = \mathbf{x} + \mathbf{u}, \quad \mathbf{u} \sim \text{Uniform}(0, 1/256)$$
Performance metric: Bits-per-dimension (BPD): $$\text{BPD} = \frac{-\log_2 p(\mathbf{x})}{d}$$
where $d$ is the number of dimensions; lower is better. For example, a 32×32×3 CIFAR-10 image has $d = 3072$, so a BPD of 3.35 corresponds to roughly $3072 \times 3.35 \approx 10{,}300$ bits per image.
Sampling:
Sampling is straightforward: draw $\mathbf{z}$ from the base distribution (one tensor per scale in the multi-scale case) and run it forward through the flow in a single pass.
Temperature controls sample diversity: the base samples are scaled by a temperature $T$ before decoding. $T < 1$ (Glow used 0.7) trades diversity for sharper, more typical samples, while $T = 1$ samples from the model's full distribution.
```python
import math

import torch

def train_glow(model, dataloader, optimizer, epochs=100):
    """Training loop for a Glow model."""
    for epoch in range(epochs):
        total_bpd = 0
        num_batches = 0
        for batch in dataloader:
            images = batch[0].cuda()

            # Dequantization: add uniform noise to the discrete pixel values
            images = (images * 255 + torch.rand_like(images)) / 256

            # Map images to latents (data -> z direction) and get the log-det
            z, log_det = model.inverse(images)

            # Log prob under the standard-normal base distribution
            log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=[1, 2, 3])

            # Total log probability
            log_px = log_pz + log_det

            # Negative log-likelihood loss
            loss = -log_px.mean()

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            # Compute bits-per-dimension
            d = images[0].numel()
            bpd = -log_px.mean() / (d * math.log(2.0))
            total_bpd += bpd.item()
            num_batches += 1

        print(f"Epoch {epoch}: BPD = {total_bpd / num_batches:.3f}")

def sample_glow(model, num_samples, temperature=0.7):
    """Generate samples from a trained Glow model."""
    model.eval()
    with torch.no_grad():
        # Sample from the base distribution, one tensor per scale
        z_shapes = model.get_z_shapes()
        zs = [torch.randn(num_samples, *shape).cuda() * temperature
              for shape in z_shapes]

        # Map latents to images (z -> data direction)
        images = model.forward(zs)

        # Clamp to valid range
        images = torch.clamp(images, 0, 1)

    return images
```

| Model | CIFAR-10 | ImageNet 32×32 | ImageNet 64×64 |
|---|---|---|---|
| RealNVP | 3.49 | 4.28 | 3.98 |
| Glow | 3.35 | 4.09 | 3.81 |
| Flow++ | 3.08 | 3.86 | 3.69 |
| FFJORD | 3.40 | – | – |
| PixelCNN++ (autoregressive) | 2.92 | – | – |
Key observations:
Flows vs. autoregressive models: Autoregressive models (like PixelCNN++) achieve better BPD but sample sequentially (slow). Flows sample in one pass.
Glow face generation: Glow achieved impressive 256×256 face synthesis on CelebA-HQ, demonstrating that flows can scale to high resolution.
Interpolation: Flows enable smooth latent interpolation since the mapping is bijective—every point in latent space maps to a valid image.
Attribute manipulation: By finding directions in latent space corresponding to attributes (age, glasses, smile), Glow enables semantic image editing.
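To make the last two points concrete, here is a small sketch of latent interpolation and attribute manipulation. It assumes a trained flow exposing `inverse` (image to latent) and `forward` (latent to image) methods that operate on a single latent tensor, as in the training loop above; with the multi-scale list convention, the same operations would simply be applied per scale. The helper names and the difference-of-means recipe for attribute directions are illustrative.

```python
import torch

@torch.no_grad()
def interpolate(model, image_a, image_b, num_steps=8):
    """Linearly interpolate between two images in latent space and decode each point."""
    za, _ = model.inverse(image_a.unsqueeze(0))
    zb, _ = model.inverse(image_b.unsqueeze(0))
    frames = []
    for alpha in torch.linspace(0, 1, num_steps):
        z = (1 - alpha) * za + alpha * zb   # every latent decodes to a valid image
        frames.append(model.forward(z))
    return torch.cat(frames, dim=0)

@torch.no_grad()
def attribute_direction(model, images_with_attr, images_without_attr):
    """Estimate a latent direction for an attribute as a difference of latent means."""
    z_pos, _ = model.inverse(images_with_attr)
    z_neg, _ = model.inverse(images_without_attr)
    return z_pos.mean(dim=0, keepdim=True) - z_neg.mean(dim=0, keepdim=True)

@torch.no_grad()
def manipulate(model, image, direction, strength=1.0):
    """Add a scaled attribute direction to an image's latent and decode."""
    z, _ = model.inverse(image.unsqueeze(0))
    return model.forward(z + strength * direction)
```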
While diffusion models now achieve better image quality than flows, flows remain valuable for exact likelihood computation, fast sampling, and applications requiring bijective mappings. Hybrid approaches combining flows with other methods continue to be an active research area.
We've covered the discrete-layer flow architectures that dominated early flow research. Next, we'll explore continuous normalizing flows based on neural ODEs—a fundamentally different approach that treats the transformation as continuous through time, enabling free-form Jacobians and new theoretical insights.