Coupling layers are the architectural breakthrough that made normalizing flows practical for high-dimensional data. Before coupling layers, flows faced an awkward trade-off: transformations with tractable Jacobian determinants were too simple to be expressive, while expressive transformations required intractable $O(d^3)$ Jacobian determinant computations.
Coupling layers resolve this dilemma through a clever insight: partition the input dimensions and transform one partition using an arbitrary function of the other. This structure yields a triangular Jacobian with tractable $O(d)$ determinant computation, while allowing the transformation itself to be arbitrarily complex (parameterized by deep neural networks).
This single innovation, introduced in NICE (2014) and refined in RealNVP (2016), unlocked flows for images, audio, and other high-dimensional domains.
In this lesson you will understand the coupling layer construction, derive its triangular Jacobian structure, implement additive and affine coupling layers, and learn how to compose layers with permutations for full expressiveness.
A coupling layer partitions the input into two parts and transforms one part based on the other.
The Partition:
Given input $\mathbf{z} \in \mathbb{R}^d$, split it into two parts: $\mathbf{z}_A = \mathbf{z}_{1:k}$ (the first $k$ dimensions) and $\mathbf{z}_B = \mathbf{z}_{k+1:d}$ (the remaining $d-k$ dimensions).
The Transformation:
$$\mathbf{x}_A = \mathbf{z}_A$$ $$\mathbf{x}_B = g(\mathbf{z}_B; \theta(\mathbf{z}_A))$$
where $g$ is a transformation of $\mathbf{z}_B$ that is invertible for any fixed parameters (for example, an element-wise affine map), and $\theta(\mathbf{z}_A)$ is a conditioner network that computes those parameters from $\mathbf{z}_A$. Because $\mathbf{x}_A = \mathbf{z}_A$ passes through unchanged, the inverse can recompute $\theta(\mathbf{x}_A)$ exactly and undo $g$, so $\theta$ itself never needs to be invertible.
The Jacobian Structure:
The Jacobian of this transformation has a special block structure:
$$J = \begin{bmatrix} \frac{\partial \mathbf{x}_A}{\partial \mathbf{z}_A} & \frac{\partial \mathbf{x}_A}{\partial \mathbf{z}_B} \\ \frac{\partial \mathbf{x}_B}{\partial \mathbf{z}_A} & \frac{\partial \mathbf{x}_B}{\partial \mathbf{z}_B} \end{bmatrix} = \begin{bmatrix} \mathbf{I}_k & \mathbf{0} \\ \frac{\partial \mathbf{x}_B}{\partial \mathbf{z}_A} & \frac{\partial g}{\partial \mathbf{z}_B} \end{bmatrix}$$
This is a block lower triangular matrix! Its determinant is: $$\det(J) = \det(\mathbf{I}_k) \cdot \det\left(\frac{\partial g}{\partial \mathbf{z}_B}\right) = \det\left(\frac{\partial g}{\partial \mathbf{z}_B}\right)$$
The complex coupling network $\theta(\mathbf{z}_A)$ doesn't appear in the determinant at all—only the transformation $g$ of the $B$ dimensions matters.
The conditioner network $\theta(\mathbf{z}_A)$ can be arbitrarily complex—it could be a 100-layer ResNet—and it doesn't affect the Jacobian determinant computation. All that complexity is 'free' from a tractability standpoint. Only the transformation $g$ of $\mathbf{z}_B$ must have a tractable Jacobian.
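To make this concrete, here is a small self-contained check of the block structure using autograd; the tanh and squaring functions are arbitrary stand-ins for the conditioner networks, and the 4-dimensional input is just for illustration:

```python
import torch

# Toy coupling map on a 4-D input: copy z_A = z[:2], transform z_B = z[2:]
# with a scale and shift that depend only on z_A (stand-ins for real networks).
def coupling(z):
    z_A, z_B = z[:2], z[2:]
    s = torch.tanh(z_A)        # pretend "scale network"
    t = z_A ** 2               # pretend "translation network"
    return torch.cat([z_A, z_B * torch.exp(s) + t])

z = torch.randn(4)
J = torch.autograd.functional.jacobian(coupling, z)

print(J[:2, 2:])   # upper-right block d x_A / d z_B: all zeros
# Block triangular => det(J) is the product of the diagonal of d x_B / d z_B
print(torch.det(J), torch.exp(torch.tanh(z[:2]).sum()))   # should match up to float error
```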
The simplest coupling layer uses additive coupling, introduced in NICE:
$$\mathbf{x}_A = \mathbf{z}_A$$ $$\mathbf{x}_B = \mathbf{z}_B + m(\mathbf{z}_A)$$
where $m: \mathbb{R}^k \to \mathbb{R}^{d-k}$ is an arbitrary neural network.
Properties:
- Inversion is trivial: $\mathbf{z}_B = \mathbf{x}_B - m(\mathbf{x}_A)$, reusing the same network $m$.
- Volume-preserving: $\frac{\partial \mathbf{x}_B}{\partial \mathbf{z}_B} = \mathbf{I}$, so $\det(J) = 1$ and $\log|\det(J)| = 0$.
- $m$ can be any network; it is never inverted, only re-evaluated.
Additive coupling is extremely simple, but it is volume-preserving ($\det(J) = 1$), which limits expressiveness: all density changes must come from warping the base distribution, not from local expansion or contraction of volume.
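To make the volume-preservation claim concrete: the shift $m(\mathbf{z}_A)$ does not depend on $\mathbf{z}_B$, so the lower-right Jacobian block is the identity,

$$\frac{\partial \mathbf{x}_B}{\partial \mathbf{z}_B} = \mathbf{I}_{d-k} \quad\Rightarrow\quad \det(J) = 1, \qquad \log|\det(J)| = 0.$$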
```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """
    Additive coupling layer (NICE).

    x_A = z_A
    x_B = z_B + m(z_A)

    Volume-preserving: log|det(J)| = 0
    """
    def __init__(self, dim, hidden_dim=256, mask_type='left'):
        super().__init__()
        self.dim = dim
        self.split = dim // 2

        # Conditioner network: arbitrary architecture
        self.m = nn.Sequential(
            nn.Linear(self.split, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim - self.split)
        )

        # Which half to transform
        self.mask_type = mask_type

    def forward(self, z):
        if self.mask_type == 'left':
            z_A, z_B = z[:, :self.split], z[:, self.split:]
        else:
            z_B, z_A = z[:, :self.split], z[:, self.split:]

        x_A = z_A
        x_B = z_B + self.m(z_A)

        if self.mask_type == 'left':
            x = torch.cat([x_A, x_B], dim=1)
        else:
            x = torch.cat([x_B, x_A], dim=1)

        # Log determinant is 0 (volume-preserving)
        log_det = torch.zeros(z.shape[0], device=z.device)
        return x, log_det

    def inverse(self, x):
        if self.mask_type == 'left':
            x_A, x_B = x[:, :self.split], x[:, self.split:]
        else:
            x_B, x_A = x[:, :self.split], x[:, self.split:]

        z_A = x_A
        z_B = x_B - self.m(x_A)  # Simple subtraction!

        if self.mask_type == 'left':
            z = torch.cat([z_A, z_B], dim=1)
        else:
            z = torch.cat([z_B, z_A], dim=1)

        log_det = torch.zeros(x.shape[0], device=x.device)
        return z, log_det
```

Affine coupling extends additive coupling with learnable scaling, introduced in RealNVP:
$$\mathbf{x}_A = \mathbf{z}_A$$ $$\mathbf{x}_B = \mathbf{z}_B \odot \exp(s(\mathbf{z}_A)) + t(\mathbf{z}_A)$$
where $s, t: \mathbb{R}^k \to \mathbb{R}^{d-k}$ are neural networks producing the log-scale and translation, and $\odot$ denotes element-wise multiplication.
Properties:
- Inversion is closed-form: $\mathbf{z}_B = (\mathbf{x}_B - t(\mathbf{x}_A)) \odot \exp(-s(\mathbf{x}_A))$.
- The log-determinant is a simple sum: $\log|\det(J)| = \sum_i s_i(\mathbf{z}_A)$.
- Not volume-preserving: the flow can locally expand or contract density.
The scaling allows non-volume-preserving transformations, dramatically increasing expressiveness.
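The log-determinant used in the implementation below follows directly from the Jacobian structure: the lower-right block is diagonal, so

$$\frac{\partial \mathbf{x}_B}{\partial \mathbf{z}_B} = \mathrm{diag}\!\big(\exp(s(\mathbf{z}_A))\big) \quad\Rightarrow\quad \log|\det(J)| = \sum_{i} s_i(\mathbf{z}_A).$$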
```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """
    Affine coupling layer (RealNVP).

    x_A = z_A
    x_B = z_B * exp(s(z_A)) + t(z_A)

    log|det(J)| = sum(s(z_A))
    """
    def __init__(self, dim, hidden_dim=256, mask_type='left'):
        super().__init__()
        self.dim = dim
        self.split = dim // 2
        self.mask_type = mask_type

        # Network outputs both scale and translation
        self.net = nn.Sequential(
            nn.Linear(self.split, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2 * (dim - self.split))  # s and t
        )

        # Initialize to identity transform
        self.net[-1].weight.data.zero_()
        self.net[-1].bias.data.zero_()

    def forward(self, z):
        z_A, z_B = self._split(z)

        # Get scale and translation
        st = self.net(z_A)
        s, t = st.chunk(2, dim=1)

        # Affine transformation
        x_A = z_A
        x_B = z_B * torch.exp(s) + t

        x = self._merge(x_A, x_B)

        # Log determinant = sum of scales
        log_det = s.sum(dim=1)
        return x, log_det

    def inverse(self, x):
        x_A, x_B = self._split(x)

        st = self.net(x_A)
        s, t = st.chunk(2, dim=1)

        # Inverse affine
        z_A = x_A
        z_B = (x_B - t) * torch.exp(-s)

        z = self._merge(z_A, z_B)

        # Log det of inverse is negative
        log_det = -s.sum(dim=1)
        return z, log_det

    def _split(self, x):
        if self.mask_type == 'left':
            return x[:, :self.split], x[:, self.split:]
        else:
            return x[:, self.split:], x[:, :self.split]

    def _merge(self, x_A, x_B):
        if self.mask_type == 'left':
            return torch.cat([x_A, x_B], dim=1)
        else:
            return torch.cat([x_B, x_A], dim=1)
```

The scale factor exp(s) can cause numerical issues if s is too large or small. Common remedies: (1) Initialize the scale network to output zeros (identity transform), (2) Clamp s to a reasonable range like [-5, 5], (3) Use a tanh activation on s and scale appropriately.
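As one concrete way to apply remedy (3), the sketch below subclasses the `AffineCoupling` above and bounds the log-scale with a scaled tanh. The class name and the `scale_limit` value are illustrative choices, not part of RealNVP:

```python
import torch

# Assumes AffineCoupling from the previous block is in scope.
class BoundedAffineCoupling(AffineCoupling):
    """Affine coupling with the log-scale squashed into (-scale_limit, scale_limit)."""
    def __init__(self, dim, hidden_dim=256, mask_type='left', scale_limit=3.0):
        super().__init__(dim, hidden_dim, mask_type)
        self.scale_limit = scale_limit

    def _bounded(self, s_raw):
        # Keeps exp(s) within [exp(-limit), exp(limit)], avoiding overflow/underflow
        return self.scale_limit * torch.tanh(s_raw)

    def forward(self, z):
        z_A, z_B = self._split(z)
        s_raw, t = self.net(z_A).chunk(2, dim=1)
        s = self._bounded(s_raw)
        x = self._merge(z_A, z_B * torch.exp(s) + t)
        return x, s.sum(dim=1)

    def inverse(self, x):
        x_A, x_B = self._split(x)
        s_raw, t = self.net(x_A).chunk(2, dim=1)
        s = self._bounded(s_raw)
        z = self._merge(x_A, (x_B - t) * torch.exp(-s))
        return z, -s.sum(dim=1)
```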
A single coupling layer only transforms half the dimensions. To transform all dimensions, we must alternate which dimensions are transformed across layers.
Alternating Masks:
The simplest strategy alternates which half is transformed from layer to layer: odd-numbered layers update $\mathbf{z}_B$ conditioned on $\mathbf{z}_A$, and even-numbered layers update $\mathbf{z}_A$ conditioned on $\mathbf{z}_B$.
This ensures all dimensions are eventually transformed.
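A minimal sketch of this alternation, reusing the `AdditiveCoupling` class defined above (the dimensionality and batch size are arbitrary):

```python
import torch

# Two additive coupling layers with opposite masks: together they touch every dimension.
dim = 6
layers = [AdditiveCoupling(dim, mask_type='left'),
          AdditiveCoupling(dim, mask_type='right')]

z = torch.randn(8, dim)
x, total_log_det = z, torch.zeros(8)
for layer in layers:
    x, ld = layer(x)
    total_log_det += ld      # stays 0: additive coupling is volume-preserving

# Invert in reverse order to recover the input exactly
for layer in reversed(layers):
    x, _ = layer.inverse(x)
print(torch.allclose(x, z, atol=1e-5))   # True
```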
Permutations Between Layers:
More sophisticated approaches permute dimensions between coupling layers:
| Strategy | Description | Jacobian Det | Pros/Cons |
|---|---|---|---|
| Alternating masks | Flip which half is conditioned | 1 | Simple, but limited mixing |
| Reverse permutation | Reverse dimension order | ±1 | Deterministic, moderate mixing |
| Random permutation | Random shuffle (fixed) | ±1 | Better mixing, no learning |
| 1×1 convolution | Learned linear mixing | $\det(\mathbf{W})$ | Full flexibility, needs LU decomposition |
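The 1×1-convolution row relies on a (P, L, U) factorization of the mixing matrix, which reduces the determinant to the diagonal of a triangular factor:

$$\mathbf{W} = \mathbf{P}\mathbf{L}\mathbf{U} \quad\Rightarrow\quad \log|\det(\mathbf{W})| = \sum_i \log|U_{ii}|,$$

since $\mathbf{P}$ is a permutation ($|\det\mathbf{P}| = 1$) and $\mathbf{L}$ is unit lower triangular ($\det\mathbf{L} = 1$). This makes the log-determinant an $O(c)$ computation instead of $O(c^3)$, which is exactly what the `Invertible1x1Conv` implementation below exploits.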
```python
import torch
import torch.nn as nn

class ReversePermutation(nn.Module):
    """Simple reverse permutation layer."""
    def forward(self, z):
        return torch.flip(z, dims=[1]), torch.zeros(z.shape[0], device=z.device)

    def inverse(self, x):
        return torch.flip(x, dims=[1]), torch.zeros(x.shape[0], device=x.device)

class Invertible1x1Conv(nn.Module):
    """
    Learned permutation via invertible 1x1 convolution (Glow).

    Uses LU decomposition for efficient determinant computation.
    """
    def __init__(self, num_channels):
        super().__init__()
        # Initialize as random rotation
        W = torch.randn(num_channels, num_channels)
        Q, _ = torch.linalg.qr(W)

        # LU decomposition for efficient det computation
        P, L, U = torch.linalg.lu(Q)
        self.register_buffer('P', P)
        self.L = nn.Parameter(L)
        self.U = nn.Parameter(U)
        self.register_buffer('L_mask', torch.tril(torch.ones_like(L), -1))
        self.register_buffer('U_mask', torch.triu(torch.ones_like(U), 1))

    def _get_weight(self):
        L = self.L * self.L_mask + torch.eye(self.L.shape[0], device=self.L.device)
        U = self.U * self.U_mask + torch.diag(torch.diag(self.U))
        return self.P @ L @ U

    def forward(self, z):
        W = self._get_weight()
        x = z @ W.T

        # Log det = sum of log|diagonal of U|
        log_det = torch.sum(torch.log(torch.abs(torch.diag(self.U))))
        log_det = log_det * torch.ones(z.shape[0], device=z.device)
        return x, log_det

    def inverse(self, x):
        W = self._get_weight()
        W_inv = torch.linalg.inv(W)
        z = x @ W_inv.T

        log_det = -torch.sum(torch.log(torch.abs(torch.diag(self.U))))
        log_det = log_det * torch.ones(x.shape[0], device=x.device)
        return z, log_det
```

A complete flow is built by stacking coupling blocks, each consisting of:
1. An ActNorm layer (per-feature affine normalization),
2. A permutation (reverse permutation or learned invertible 1×1 convolution),
3. An affine coupling layer.
Stacking many such blocks creates an expressive flow that can model complex distributions while maintaining tractable likelihood computation.
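The coupling block below also uses an ActNorm layer, which is not defined in this section. Here is a minimal sketch, assuming ActNorm is a per-feature affine normalization as in Glow, with the data-dependent initialization omitted for brevity:

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Minimal ActNorm sketch: per-feature scale and shift (data-dependent init omitted)."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        x = z * torch.exp(self.log_scale) + self.shift
        log_det = self.log_scale.sum() * torch.ones(z.shape[0], device=z.device)
        return x, log_det

    def inverse(self, x):
        z = (x - self.shift) * torch.exp(-self.log_scale)
        log_det = -self.log_scale.sum() * torch.ones(x.shape[0], device=x.device)
        return z, log_det
```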
```python
import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    """
    Complete coupling block: ActNorm + Permutation + AffineCoupling
    """
    def __init__(self, dim, hidden_dim=256, use_1x1_conv=True):
        super().__init__()
        self.actnorm = ActNorm(dim)
        if use_1x1_conv:
            self.permute = Invertible1x1Conv(dim)
        else:
            self.permute = ReversePermutation()
        self.coupling = AffineCoupling(dim, hidden_dim)

    def forward(self, z):
        log_det = torch.zeros(z.shape[0], device=z.device)

        x, ld = self.actnorm.forward(z)
        log_det += ld

        x, ld = self.permute.forward(x)
        log_det += ld

        x, ld = self.coupling.forward(x)
        log_det += ld

        return x, log_det

    def inverse(self, x):
        log_det = torch.zeros(x.shape[0], device=x.device)

        z, ld = self.coupling.inverse(x)
        log_det += ld

        z, ld = self.permute.inverse(z)
        log_det += ld

        z, ld = self.actnorm.inverse(z)
        log_det += ld

        return z, log_det

class CouplingFlow(nn.Module):
    """Complete flow model with multiple coupling blocks."""
    def __init__(self, dim, num_blocks=8, hidden_dim=256):
        super().__init__()
        self.blocks = nn.ModuleList([
            CouplingBlock(dim, hidden_dim) for _ in range(num_blocks)
        ])

    def forward(self, z):
        log_det = torch.zeros(z.shape[0], device=z.device)
        x = z
        for block in self.blocks:
            x, ld = block.forward(x)
            log_det += ld
        return x, log_det

    def inverse(self, x):
        log_det = torch.zeros(x.shape[0], device=x.device)
        z = x
        for block in reversed(self.blocks):
            z, ld = block.inverse(z)
            log_det += ld
        return z, log_det
```

You now understand the coupling layer architecture that powers modern flow models. Next, we'll see how RealNVP and Glow build on these foundations with multi-scale architectures and additional innovations to achieve state-of-the-art image generation.
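As a final sanity check that the pieces fit together, here is a hedged sketch of likelihood-based training with the `CouplingFlow` above; the Gaussian base distribution, the random stand-in data batch, and the optimizer settings are illustrative assumptions:

```python
import torch

dim = 8
flow = CouplingFlow(dim, num_blocks=4, hidden_dim=128)
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

x = torch.randn(64, dim)   # stand-in batch of training data

# Change of variables: log p(x) = log p_Z(z) + log|det dz/dx|,
# where z = f^{-1}(x) and inverse() returns exactly that log-determinant.
z, log_det = flow.inverse(x)
log_prob = base.log_prob(z).sum(dim=1) + log_det
loss = -log_prob.mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Sampling: push base samples through the forward (z -> x) direction
with torch.no_grad():
    samples, _ = flow.forward(base.sample((16,)))
```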