RealNVP (Real-valued Non-Volume Preserving) and Glow (Generative Flow with Invertible 1×1 Convolutions) represent the maturation of normalizing flows into practical, high-quality generative models for images. These architectures combine the coupling layer foundations we studied with multi-scale processing, specialized convolutions, and careful engineering to achieve compelling image generation while maintaining exact likelihood computation.
RealNVP (2016) demonstrated that flows could generate recognizable images, while Glow (2018) pushed quality toward GAN levels, generating near-photorealistic 256×256 faces. Understanding these architectures provides both practical implementation knowledge and insight into the design principles that make flows work at scale.
Understand the multi-scale architecture for efficient image processing, master the specific components of RealNVP and Glow (squeeze operations, split operations, invertible 1×1 convolutions), and learn practical training techniques for flow-based image models.
RealNVP builds on affine coupling layers with several key innovations for processing images efficiently.
Multi-Scale Architecture:
Images are high-dimensional (e.g., 256×256×3 = 196,608 dimensions), making it expensive to process all dimensions at every layer. RealNVP uses a multi-scale architecture that progressively factors out dimensions:

- Apply several coupling layers at the current resolution.
- Squeeze: halve the spatial resolution while quadrupling the number of channels.
- Split: factor half of the dimensions directly out into the latent representation, so only the remaining half is processed by deeper layers.
- Repeat at the next, coarser scale.
This creates a pyramid structure where coarse features are modeled at lower resolutions and fine details at higher resolutions.
Squeeze Operation:
The squeeze operation trades spatial size for channel depth: a tensor of shape $H \times W \times C$ is reshaped to $\frac{H}{2} \times \frac{W}{2} \times 4C$.
This is a simple reshaping—taking 2×2 spatial blocks and stacking them as channels. It's invertible with determinant 1 (just a permutation of elements).
Split Operation:
After squeezing, split factors out half the channels: the factored-out half is modeled directly as part of the latent representation (under a learned prior), while the other half continues through further flow layers.
This progressively reduces computation while allowing the model to allocate capacity where needed.
Checkerboard and Channel Masking:
RealNVP uses two types of masks for coupling layers:

- Checkerboard masks: partition pixels spatially in an alternating (checkerboard) pattern, so half the pixels are transformed conditioned on the other half.
- Channel-wise masks: hold half of the channels fixed and transform the other half (used after squeeze operations).

These ensure different dimensions interact across layers; a small sketch of both mask types follows.
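Below is a minimal sketch of how such masks can be constructed in PyTorch. The function names and broadcasting conventions are illustrative, not RealNVP's exact implementation.

```python
import torch

def checkerboard_mask(height, width, invert=False):
    """Binary mask that alternates spatially, like a checkerboard.

    Positions where (row + col) is even are 1; `invert` flips the pattern so
    successive coupling layers condition on complementary pixels.
    """
    rows = torch.arange(height).view(-1, 1)
    cols = torch.arange(width).view(1, -1)
    mask = ((rows + cols) % 2 == 0).float()
    if invert:
        mask = 1.0 - mask
    return mask.view(1, 1, height, width)  # broadcast over batch and channels

def channel_mask(num_channels, invert=False):
    """Binary mask that keeps the first half of the channels fixed."""
    mask = torch.zeros(num_channels)
    mask[: num_channels // 2] = 1.0
    if invert:
        mask = 1.0 - mask
    return mask.view(1, -1, 1, 1)

# In a masked coupling layer, the fixed part is x * mask and the transformed
# part is x * (1 - mask), e.g.:
#   y = x * mask + (1 - mask) * (x * torch.exp(s(x * mask)) + t(x * mask))
```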
```python
import math

import torch
import torch.nn as nn

class Squeeze(nn.Module):
    """
    Squeeze operation: trades spatial size for channels.
    H×W×C -> (H/2)×(W/2)×(4C)
    """
    def forward(self, x):
        B, C, H, W = x.shape
        x = x.view(B, C, H // 2, 2, W // 2, 2)
        x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
        x = x.view(B, C * 4, H // 2, W // 2)
        return x, torch.zeros(B, device=x.device)

    def inverse(self, x):
        B, C, H, W = x.shape
        x = x.view(B, C // 4, 2, 2, H, W)
        x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
        x = x.view(B, C // 4, H * 2, W * 2)
        return x, torch.zeros(B, device=x.device)

class Split(nn.Module):
    """
    Split operation: factor out half the channels.
    Models factored channels with a learned prior.
    """
    def __init__(self, num_channels):
        super().__init__()
        # Prior parameters (mean and log-std) conditioned on remaining channels
        self.prior_net = nn.Sequential(
            nn.Conv2d(num_channels // 2, num_channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(num_channels, num_channels, 1)
        )

    def forward(self, x):
        z1, z2 = x.chunk(2, dim=1)
        # z2 is factored out; compute its prior params from z1
        prior_params = self.prior_net(z1)
        mean, log_std = prior_params.chunk(2, dim=1)
        # Log prob of z2 under the learned prior (enters the objective like a log-det term)
        log_prob = -0.5 * (math.log(2 * math.pi) + 2 * log_std
                           + (z2 - mean) ** 2 / torch.exp(2 * log_std))
        log_det = log_prob.sum(dim=[1, 2, 3])
        return z1, log_det, z2  # z2 is stored for reconstruction

    def inverse(self, z1, z2=None, temperature=1.0):
        if z2 is None:
            # Sample from the prior
            prior_params = self.prior_net(z1)
            mean, log_std = prior_params.chunk(2, dim=1)
            z2 = mean + torch.exp(log_std) * torch.randn_like(mean) * temperature
        return torch.cat([z1, z2], dim=1), torch.zeros(z1.shape[0], device=z1.device)
```

Glow builds on RealNVP with three significant improvements that push flow sample quality toward GAN levels.
1. Invertible 1×1 Convolutions:
RealNVP uses fixed permutations between coupling layers. Glow replaces these with learned invertible 1×1 convolutions—essentially learned linear mixing of channels.
For input with $C$ channels, a 1×1 convolution is a $C \times C$ matrix $\mathbf{W}$ applied identically at each spatial position. The log-determinant is: $$\log|\det(J)| = H \cdot W \cdot \log|\det(\mathbf{W})|$$
Glow uses LU decomposition for efficient computation: $\mathbf{W} = \mathbf{P}\mathbf{L}\mathbf{U}$ where $\mathbf{P}$ is a fixed permutation and $\mathbf{L}, \mathbf{U}$ are triangular and learned.
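The Glow step code later in this section references an `Invertible1x1Conv2d` module. Here is a minimal sketch of how such a layer might be implemented with the LU parameterization; it is illustrative rather than the official Glow code, and the class and attribute names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Invertible1x1Conv2d(nn.Module):
    """Learned channel mixing via an invertible 1x1 convolution, W = P L U."""
    def __init__(self, num_channels):
        super().__init__()
        # Initialize W to a random rotation, then factor it as W = P L U
        w, _ = torch.linalg.qr(torch.randn(num_channels, num_channels))
        P, L, U = torch.linalg.lu(w)
        self.register_buffer('P', P)                               # fixed permutation
        self.L = nn.Parameter(L)                                   # lower-triangular (unit diagonal)
        self.log_s = nn.Parameter(torch.log(torch.abs(torch.diagonal(U))))
        self.register_buffer('sign_s', torch.sign(torch.diagonal(U)))
        self.U = nn.Parameter(torch.triu(U, diagonal=1))           # strictly upper-triangular
        self.register_buffer('l_mask', torch.tril(torch.ones_like(L), diagonal=-1))
        self.register_buffer('eye', torch.eye(num_channels))

    def _weight(self):
        L = self.L * self.l_mask + self.eye
        U = self.U * self.l_mask.T + torch.diag(self.sign_s * torch.exp(self.log_s))
        return self.P @ L @ U

    def forward(self, x):
        B, C, H, W = x.shape
        weight = self._weight().view(C, C, 1, 1)
        y = F.conv2d(x, weight)
        # log|det J| = H * W * sum(log|s|): the diagonal of U carries the determinant
        log_det = H * W * self.log_s.sum()
        return y, log_det.expand(B)

    def inverse(self, y):
        B, C, H, W = y.shape
        weight_inv = torch.inverse(self._weight()).view(C, C, 1, 1)
        x = F.conv2d(y, weight_inv)
        log_det = -H * W * self.log_s.sum()
        return x, log_det.expand(B)
```

Because the determinant is read off the diagonal of $\mathbf{U}$, the log-determinant costs $O(C)$ instead of the $O(C^3)$ a dense matrix would require.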
2. ActNorm (Activation Normalization):
Instead of batch normalization (which is data-dependent and thus not truly invertible), Glow uses ActNorm: a learnable affine transformation per channel: $$y = s \odot x + b$$
The parameters $s$ and $b$ are initialized data-dependently on the first batch to normalize activations to zero mean and unit variance, then trained as regular parameters.
3. Affine Coupling with Neural Networks:
Glow strengthens the conditioner networks in affine coupling layers, using deep residual networks. The coupling function is: $$x_B = z_B \odot \sigma(s(z_A)) + t(z_A)$$
where $\sigma$ is a sigmoid that bounds the scale to prevent numerical issues.
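The Glow step code below also references an `AffineCoupling2d` module. Here is a minimal sketch of such a channel-wise affine coupling; the conditioner is a small CNN rather than the deeper network used in the actual Glow implementation, and the names and `+2` sigmoid offset are illustrative choices.

```python
import torch
import torch.nn as nn

class AffineCoupling2d(nn.Module):
    """Channel-wise affine coupling: half the channels condition the other half."""
    def __init__(self, num_channels, hidden_channels=512):
        super().__init__()
        # Conditioner: maps the fixed half to (log-scale, shift) for the other half
        self.net = nn.Sequential(
            nn.Conv2d(num_channels // 2, hidden_channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, hidden_channels, 1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, num_channels, 3, padding=1),
        )
        # Zero-init the last layer so the coupling starts near the identity
        # (scale = sigmoid(0 + 2) ≈ 0.88 at initialization)
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=1)
        raw_scale, shift = self.net(x_a).chunk(2, dim=1)
        scale = torch.sigmoid(raw_scale + 2.0)   # bounded scale for numerical stability
        y_b = x_b * scale + shift
        log_det = torch.log(scale).sum(dim=[1, 2, 3])
        return torch.cat([x_a, y_b], dim=1), log_det

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        raw_scale, shift = self.net(y_a).chunk(2, dim=1)
        scale = torch.sigmoid(raw_scale + 2.0)
        x_b = (y_b - shift) / scale
        log_det = -torch.log(scale).sum(dim=[1, 2, 3])
        return torch.cat([y_a, x_b], dim=1), log_det
```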
The Complete Glow Block:
One flow step in Glow consists of:

1. ActNorm (per-channel affine normalization)
2. Invertible 1×1 convolution (learned channel mixing)
3. Affine coupling layer
Multiple steps are stacked, with squeeze and split operations at scale boundaries.
Multi-Scale Structure:
Glow uses $L$ levels with $K$ steps per level: each level squeezes its input, applies $K$ flow steps, and then splits off half the channels into the latent representation (the final level keeps everything).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActNorm2d(nn.Module):
    """
    Activation Normalization for 2D inputs.
    Data-dependent initialization, then trained.
    """
    def __init__(self, num_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.register_buffer('initialized', torch.tensor(False))

    def initialize(self, x):
        with torch.no_grad():
            mean = x.mean(dim=[0, 2, 3], keepdim=True)
            std = x.std(dim=[0, 2, 3], keepdim=True) + 1e-6
            self.bias.data = -mean
            self.scale.data = 1.0 / std
            self.initialized.fill_(True)

    def forward(self, x):
        if not self.initialized:
            self.initialize(x)
        y = self.scale * (x + self.bias)
        # Log det: H * W * sum(log|scale|), since the same scale applies at all positions
        B, C, H, W = x.shape
        log_det = H * W * torch.sum(torch.log(torch.abs(self.scale)))
        return y, log_det.expand(B)

    def inverse(self, y):
        x = y / self.scale - self.bias
        B, C, H, W = y.shape
        log_det = -H * W * torch.sum(torch.log(torch.abs(self.scale)))
        return x, log_det.expand(B)

class GlowStep(nn.Module):
    """Single Glow step: ActNorm -> 1x1Conv -> AffineCoupling.

    Uses the Invertible1x1Conv2d and AffineCoupling2d modules sketched above.
    """
    def __init__(self, num_channels, hidden_channels=512):
        super().__init__()
        self.actnorm = ActNorm2d(num_channels)
        self.invconv = Invertible1x1Conv2d(num_channels)
        self.coupling = AffineCoupling2d(num_channels, hidden_channels)

    def forward(self, x):
        log_det = torch.zeros(x.shape[0], device=x.device)
        y, ld = self.actnorm.forward(x)
        log_det += ld
        y, ld = self.invconv.forward(y)
        log_det += ld
        y, ld = self.coupling.forward(y)
        log_det += ld
        return y, log_det

    def inverse(self, y):
        log_det = torch.zeros(y.shape[0], device=y.device)
        x, ld = self.coupling.inverse(y)
        log_det += ld
        x, ld = self.invconv.inverse(x)
        log_det += ld
        x, ld = self.actnorm.inverse(x)
        log_det += ld
        return x, log_det
```

| Component | RealNVP | Glow |
|---|---|---|
| Permutation | Fixed (reverse, shuffle) | Learned 1×1 convolution |
| Normalization | Batch normalization | ActNorm (data-dependent init) |
| Coupling | Affine with CNNs | Affine with deeper ResNets |
| Image resolution | 32×32, 64×64 | Up to 256×256 |
| Sample quality | Recognizable but blurry | Near-photorealistic faces |
The multi-scale architecture is crucial for efficient image processing. Let's trace through how an image flows through a Glow model.
Example: 64×64×3 Image with 3 Levels, 8 Steps per Level
Level 1 (64×64×3):
- Squeeze: 64×64×3 → 32×32×12
- 8 flow steps at 32×32×12
- Split: 32×32×6 factored out as $\mathbf{z}_1$; 32×32×6 continues

Level 2 (32×32×6):
- Squeeze: 32×32×6 → 16×16×24
- 8 flow steps at 16×16×24
- Split: 16×16×12 factored out as $\mathbf{z}_2$; 16×16×12 continues

Level 3 (16×16×12):
- Squeeze: 16×16×12 → 8×8×48
- 8 flow steps at 8×8×48
- No split at the final level: all of 8×8×48 becomes $\mathbf{z}_3$

Total latent representation:
- $\mathbf{z}_1$: 32×32×6 = 6,144 dimensions
- $\mathbf{z}_2$: 16×16×12 = 3,072 dimensions
- $\mathbf{z}_3$: 8×8×48 = 3,072 dimensions
- Total: 12,288 dimensions = 64×64×3, exactly matching the input, as a bijection requires
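Putting the pieces together, here is a minimal sketch of how the multi-scale wiring traced above might be assembled from the `Squeeze`, `Split`, and `GlowStep` modules defined in this section. It is illustrative rather than a reproduction of the official Glow code: the class name `MultiScaleGlow`, the `encode`/`decode` method names, and the `get_z_shapes` helper (which the sampling code later assumes) are assumptions; the training and sampling helpers later use an `inverse`/`forward` naming convention for the same two directions.

```python
import torch
import torch.nn as nn

class MultiScaleGlow(nn.Module):
    """L levels; each level: squeeze -> K GlowSteps -> split (no split at the last level)."""
    def __init__(self, in_channels=3, image_size=64, num_levels=3, steps_per_level=8):
        super().__init__()
        self.num_levels = num_levels
        self.squeezes = nn.ModuleList()
        self.steps = nn.ModuleList()
        self.splits = nn.ModuleList()
        self.z_shapes = []

        C, S = in_channels, image_size
        for level in range(num_levels):
            C, S = C * 4, S // 2                      # effect of squeeze
            self.squeezes.append(Squeeze())
            self.steps.append(nn.ModuleList(
                [GlowStep(C) for _ in range(steps_per_level)]))
            if level < num_levels - 1:
                self.splits.append(Split(C))
                self.z_shapes.append((C // 2, S, S))  # factored-out half
                C = C // 2                            # the other half continues
            else:
                self.z_shapes.append((C, S, S))       # final level: everything is latent

    def get_z_shapes(self):
        """Shapes of the per-scale latents, e.g. for sampling the base distribution."""
        return self.z_shapes

    def encode(self, x):
        """Data -> list of per-scale latents, plus accumulated log-det / log-prior terms."""
        log_det = torch.zeros(x.shape[0], device=x.device)
        zs, h = [], x
        for level in range(self.num_levels):
            h, _ = self.squeezes[level].forward(h)
            for step in self.steps[level]:
                h, ld = step.forward(h)
                log_det += ld
            if level < self.num_levels - 1:
                h, ld, z = self.splits[level].forward(h)
                log_det += ld
                zs.append(z)
        zs.append(h)
        return zs, log_det

    def decode(self, zs):
        """List of per-scale latents -> data, inverting each level in reverse order."""
        h = zs[-1]
        for level in reversed(range(self.num_levels)):
            if level < self.num_levels - 1:
                h, _ = self.splits[level].inverse(h, z2=zs[level])
            for step in reversed(list(self.steps[level])):
                h, _ = step.inverse(h)
            h, _ = self.squeezes[level].inverse(h)
        return h
```

With the default arguments, `get_z_shapes()` returns `[(6, 32, 32), (12, 16, 16), (48, 8, 8)]`, matching the dimension count traced above.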
Multi-scale processing means most computation happens at lower resolutions. A 1024×1024 image might have final flow steps at 32×32, reducing computation 1000×. The early splits also allow the model to allocate detail-level vs structure-level capacity appropriately.
Training:
Training maximizes the expected log-likelihood (equivalently, minimizes the negative log-likelihood) via gradient descent: $$\mathcal{L} = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log p_\theta(\mathbf{x})]$$
For images with discrete pixel values, dequantization is essential: $$\tilde{\mathbf{x}} = \mathbf{x} + \mathbf{u}, \quad \mathbf{u} \sim \text{Uniform}(0, 1/256)$$
Performance metric: Bits-per-dimension (BPD): $$\text{BPD} = \frac{-\log_2 p(\mathbf{x})}{d}$$
where $d$ is the number of dimensions; lower is better. For example, a 32×32×3 CIFAR-10 image has $d = 3072$, so a BPD of 3.35 corresponds to roughly $3072 \times 3.35 \approx 10{,}300$ bits per image.
Sampling:
Sampling is straightforward: draw $\mathbf{z}$ from the base distribution (one tensor per scale in the multi-scale case) and run it forward through the flow in a single pass.
Temperature controls sample diversity: the base samples are scaled by a temperature $T$ before decoding. $T < 1$ (Glow used 0.7) trades diversity for sharper, more typical samples, while $T = 1$ samples from the model's full distribution.
```python
import math

import torch

def train_glow(model, dataloader, optimizer, epochs=100):
    """Training loop for a Glow model."""
    for epoch in range(epochs):
        total_bpd = 0
        num_batches = 0
        for batch in dataloader:
            images = batch[0].cuda()

            # Dequantization: add uniform noise to the discrete pixel values
            images = (images * 255 + torch.rand_like(images)) / 256

            # Map images to latents (data -> z direction) and get the log-det
            z, log_det = model.inverse(images)

            # Log prob under the standard-normal base distribution
            log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=[1, 2, 3])

            # Total log probability
            log_px = log_pz + log_det

            # Negative log-likelihood loss
            loss = -log_px.mean()

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            # Compute bits-per-dimension
            d = images[0].numel()
            bpd = -log_px.mean() / (d * math.log(2.0))
            total_bpd += bpd.item()
            num_batches += 1

        print(f"Epoch {epoch}: BPD = {total_bpd / num_batches:.3f}")

def sample_glow(model, num_samples, temperature=0.7):
    """Generate samples from a trained Glow model."""
    model.eval()
    with torch.no_grad():
        # Sample from the base distribution, one tensor per scale
        z_shapes = model.get_z_shapes()
        zs = [torch.randn(num_samples, *shape).cuda() * temperature
              for shape in z_shapes]

        # Map latents to images (z -> data direction)
        images = model.forward(zs)

        # Clamp to valid range
        images = torch.clamp(images, 0, 1)

    return images
```

| Model | CIFAR-10 | ImageNet 32×32 | ImageNet 64×64 |
|---|---|---|---|
| RealNVP | 3.49 | 4.28 | 3.98 |
| Glow | 3.35 | 4.09 | 3.81 |
| Flow++ | 3.08 | 3.86 | 3.69 |
| FFJORD | 3.40 | – | – |
| PixelCNN++ (autoregressive) | 2.92 | – | – |
Key observations:
Flows vs. autoregressive models: Autoregressive models (like PixelCNN++) achieve better BPD but sample sequentially (slow). Flows sample in one pass.
Glow face generation: Glow achieved impressive 256×256 face synthesis on CelebA-HQ, demonstrating that flows can scale to high resolution.
Interpolation: Flows enable smooth latent interpolation since the mapping is bijective—every point in latent space maps to a valid image.
Attribute manipulation: By finding directions in latent space corresponding to attributes (age, glasses, smile), Glow enables semantic image editing.
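To make the last two points concrete, here is a small sketch of latent interpolation and attribute manipulation. It assumes a trained flow exposing `inverse` (image to latent) and `forward` (latent to image) methods that operate on a single latent tensor, as in the training loop above; with the multi-scale list convention, the same operations would simply be applied per scale. The helper names and the difference-of-means recipe for attribute directions are illustrative.

```python
import torch

@torch.no_grad()
def interpolate(model, image_a, image_b, num_steps=8):
    """Linearly interpolate between two images in latent space and decode each point."""
    za, _ = model.inverse(image_a.unsqueeze(0))
    zb, _ = model.inverse(image_b.unsqueeze(0))
    frames = []
    for alpha in torch.linspace(0, 1, num_steps):
        z = (1 - alpha) * za + alpha * zb   # every latent decodes to a valid image
        frames.append(model.forward(z))
    return torch.cat(frames, dim=0)

@torch.no_grad()
def attribute_direction(model, images_with_attr, images_without_attr):
    """Estimate a latent direction for an attribute as a difference of latent means."""
    z_pos, _ = model.inverse(images_with_attr)
    z_neg, _ = model.inverse(images_without_attr)
    return z_pos.mean(dim=0, keepdim=True) - z_neg.mean(dim=0, keepdim=True)

@torch.no_grad()
def manipulate(model, image, direction, strength=1.0):
    """Add a scaled attribute direction to an image's latent and decode."""
    z, _ = model.inverse(image.unsqueeze(0))
    return model.forward(z + strength * direction)
```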
While diffusion models now achieve better image quality than flows, flows remain valuable for exact likelihood computation, fast sampling, and applications requiring bijective mappings. Hybrid approaches combining flows with other methods continue to be an active research area.
We've covered the discrete-layer flow architectures that dominated early flow research. Next, we'll explore continuous normalizing flows based on neural ODEs—a fundamentally different approach that treats the transformation as continuous through time, enabling free-form Jacobians and new theoretical insights.