At the heart of every GAN lie two neural networks locked in perpetual competition: the Generator and the Discriminator. These networks have fundamentally different objectives, yet their adversarial relationship drives both toward excellence. Understanding their individual architectures, roles, and the delicate balance between them is essential for successfully training GANs.
The generator's task seems almost magical: starting from random noise, it must learn to produce samples indistinguishable from real data. The discriminator, meanwhile, serves as an increasingly sophisticated critic, learning to spot the subtle tells that distinguish real from fake. This page explores both networks in depth, from their mathematical formulations to practical implementation details.
By the end of this page, you will understand: the generator's role as a learned transformation from noise to data, the discriminator's function as an adaptive binary classifier, architectural guidelines for both networks, the importance of capacity balance, and practical considerations for network design.
The generator $G: \mathcal{Z} \rightarrow \mathcal{X}$ learns a deterministic mapping from a simple latent space to the complex data space. This transformation is the core of the GAN's generative capability.
Latent Space $\mathcal{Z}$:
The latent space is typically a low-dimensional space with a simple, tractable distribution:
$$\mathbf{z} \sim p_z(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$$
Alternatively, uniform distributions $\mathbf{z} \sim \text{Uniform}(-1, 1)^{d_z}$ are sometimes used. The choice matters less than ensuring the distribution is easy to sample and has full support.
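Either prior is a one-liner to sample in PyTorch; a minimal sketch:

```python
import torch

latent_dim = 100
batch_size = 4

# Gaussian latent: z ~ N(0, I)
z_gauss = torch.randn(batch_size, latent_dim)

# Uniform latent: z ~ Uniform(-1, 1)^{d_z}
z_unif = torch.rand(batch_size, latent_dim) * 2 - 1

print(z_gauss.shape)  # torch.Size([4, 100])
```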
The Transformation:
The generator learns to warp this simple distribution into the complex data distribution. Conceptually:
$$p_g(\mathbf{x}) = \int_{\mathbf{z}: G(\mathbf{z}) = \mathbf{x}} p_z(\mathbf{z}) |\det(\partial G / \partial \mathbf{z})|^{-1}$$
Unlike normalizing flows, we don't compute this density—we only sample from it.
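Since the density is never evaluated, sampling from $p_g$ is simply a forward pass: draw $\mathbf{z}$ and apply $G$. A toy sketch with a fixed linear map standing in for the generator (illustrative only; here the pushforward distribution is known in closed form):

```python
import torch

# A fixed linear "generator" G(z) = A z + b warps N(0, I) into N(b, A A^T).
# (A real generator is a trained neural network; this is just for illustration.)
A = torch.tensor([[2.0, 0.0], [1.0, 0.5]])
b = torch.tensor([1.0, -1.0])

z = torch.randn(100_000, 2)   # sample from the latent prior p_z
x = z @ A.T + b               # push samples through G: these are samples from p_g

# Empirical statistics match the pushforward density N(b, A A^T)
print(x.mean(dim=0))          # approx [1.0, -1.0]
print(torch.cov(x.T))         # approx A @ A.T = [[4.0, 2.0], [2.0, 1.25]]
```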
Architectural Principles:
```python
"""Generator Architectures: From MLP to Deep Convolutional Networks"""
import torch
import torch.nn as nn


class DCGANGenerator(nn.Module):
    """
    Deep Convolutional Generator following DCGAN guidelines.
    Maps latent vector z to image through progressive upsampling.

    Architecture: z -> FC -> Reshape -> ConvT -> ConvT -> ... -> Image
    """
    def __init__(self, latent_dim=100, feature_maps=64, channels=3):
        super().__init__()
        self.latent_dim = latent_dim

        # Project and reshape: z -> 4x4 spatial with many features
        self.project = nn.Sequential(
            nn.Linear(latent_dim, feature_maps * 8 * 4 * 4),
            nn.BatchNorm1d(feature_maps * 8 * 4 * 4),
            nn.ReLU(True)
        )

        # Progressive upsampling: 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64
        self.conv_blocks = nn.Sequential(
            # 4x4 -> 8x8
            nn.ConvTranspose2d(feature_maps * 8, feature_maps * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 4),
            nn.ReLU(True),
            # 8x8 -> 16x16
            nn.ConvTranspose2d(feature_maps * 4, feature_maps * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 2),
            nn.ReLU(True),
            # 16x16 -> 32x32
            nn.ConvTranspose2d(feature_maps * 2, feature_maps, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps),
            nn.ReLU(True),
            # 32x32 -> 64x64 (output)
            nn.ConvTranspose2d(feature_maps, channels, 4, 2, 1, bias=False),
            nn.Tanh()  # Output in [-1, 1]
        )

    def forward(self, z):
        x = self.project(z)
        x = x.view(x.size(0), -1, 4, 4)  # Reshape to spatial
        return self.conv_blocks(x)


# Test
gen = DCGANGenerator(latent_dim=100)
z = torch.randn(4, 100)
fake_images = gen(z)
print(f"Generator output shape: {fake_images.shape}")  # [4, 3, 64, 64]
```

The discriminator $D: \mathcal{X} \rightarrow [0, 1]$ serves as a binary classifier distinguishing real samples from generated ones. However, its role extends beyond classification: it provides the training signal that guides the generator toward realistic outputs.
The Discriminator's Dual Role:

- Classifier: estimate the probability that an input sample came from the real data distribution rather than from the generator.
- Teacher: provide gradients through $D(G(\mathbf{z}))$ that tell the generator how to make its samples more realistic.
Optimal Discriminator Reminder:
For a fixed generator, the optimal discriminator is:
$$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$
This formula reveals that the discriminator implicitly estimates the density ratio between real and generated distributions.
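The formula can be checked on a toy example where both densities are known in closed form; here two Gaussians stand in for $p_{\text{data}}$ and $p_g$ (purely illustrative, not a training setup):

```python
import torch
from torch.distributions import Normal

p_data = Normal(0.0, 1.0)  # stand-in for the "real" distribution
p_g = Normal(2.0, 1.0)     # stand-in for the "generated" distribution

def optimal_D(x):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    d = p_data.log_prob(x).exp()
    g = p_g.log_prob(x).exp()
    return d / (d + g)

x = torch.tensor([0.0, 1.0, 2.0])
print(optimal_D(x))
# At x = 1.0, equidistant from both means, the two densities are equal,
# so D*(1.0) = 0.5 exactly.
```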
Architectural Principles:
```python
"""Discriminator Architectures: From Images to Probability"""
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class DCGANDiscriminator(nn.Module):
    """
    Deep Convolutional Discriminator following DCGAN guidelines.
    Maps image to probability through progressive downsampling.

    Architecture: Image -> Conv -> Conv -> ... -> FC -> Probability
    """
    def __init__(self, channels=3, feature_maps=64):
        super().__init__()
        self.conv_blocks = nn.Sequential(
            # 64x64 -> 32x32 (no batchnorm on first layer)
            nn.Conv2d(channels, feature_maps, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 32x32 -> 16x16
            nn.Conv2d(feature_maps, feature_maps * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # 16x16 -> 8x8
            nn.Conv2d(feature_maps * 2, feature_maps * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # 8x8 -> 4x4
            nn.Conv2d(feature_maps * 4, feature_maps * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # 4x4 -> 1x1 (output)
            nn.Conv2d(feature_maps * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.conv_blocks(x).view(-1, 1)


class SpectralNormDiscriminator(nn.Module):
    """
    Discriminator with Spectral Normalization for stable training.

    Spectral norm constrains the Lipschitz constant of each layer,
    preventing the discriminator from becoming too confident.
    """
    def __init__(self, channels=3, feature_maps=64):
        super().__init__()
        self.conv_blocks = nn.Sequential(
            spectral_norm(nn.Conv2d(channels, feature_maps, 4, 2, 1)),
            nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(feature_maps, feature_maps * 2, 4, 2, 1)),
            nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(feature_maps * 2, feature_maps * 4, 4, 2, 1)),
            nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(feature_maps * 4, feature_maps * 8, 4, 2, 1)),
            nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(feature_maps * 8, 1, 4, 1, 0)),
        )

    def forward(self, x):
        return self.conv_blocks(x).view(-1, 1)


# Test
disc = DCGANDiscriminator()
images = torch.randn(4, 3, 64, 64)
probs = disc(images)
print(f"Discriminator output shape: {probs.shape}")  # [4, 1]
```

One of the most critical, and often overlooked, aspects of GAN design is balancing the capacities of the generator and discriminator. An imbalanced setup leads to training pathologies.
The Discriminator's Advantage:
Discrimination is fundamentally easier than generation. The discriminator only needs to find any difference between real and fake distributions, while the generator must match all aspects of the real distribution. This asymmetry creates natural imbalance.
Consequences of Imbalance:

- Discriminator too strong: it confidently rejects every generated sample, its sigmoid output saturates, and the generator receives vanishing gradients.
- Generator too strong (or discriminator too weak): the discriminator's feedback becomes uninformative, and the generator can exploit its blind spots rather than learning the true data distribution, often collapsing to a few modes.
Balancing Strategies:

- Adjust relative capacity: change the depth or width of one network (e.g., the `feature_maps` parameter above).
- Adjust update ratios: train the discriminator for $k$ steps per generator step, or vice versa.
- Use different learning rates for the two networks.
- Regularize the discriminator, for example with spectral normalization as shown above.
The ideal balance is problem-dependent and often requires experimentation. Monitor discriminator accuracy—if it hovers around 50-70%, the balance is likely good.
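One way to track this, assuming the discriminator outputs probabilities as in the DCGAN architecture above (the helper name here is our own):

```python
import torch

@torch.no_grad()
def discriminator_accuracy(d_real, d_fake, threshold=0.5):
    """Fraction of samples the discriminator classifies correctly.

    d_real: D(x) on a batch of real samples, values in [0, 1]
    d_fake: D(G(z)) on a batch of generated samples
    """
    correct_real = (d_real > threshold).float().sum()
    correct_fake = (d_fake <= threshold).float().sum()
    total = d_real.numel() + d_fake.numel()
    return ((correct_real + correct_fake) / total).item()

# Example: a discriminator that is only slightly ahead of the generator
d_real = torch.tensor([0.7, 0.6, 0.4, 0.8])
d_fake = torch.tensor([0.3, 0.55, 0.2, 0.4])
print(discriminator_accuracy(d_real, d_fake))  # 0.75 -- in the healthy range
```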
The discriminator should be strong enough to provide meaningful feedback but not so strong that it perfectly distinguishes real from fake. Think of it as a teacher: too easy and the student learns nothing; too hard and the student gives up. The ideal discriminator stays just slightly ahead of the generator.
A remarkable property of trained discriminators is that they learn rich, semantically meaningful representations in the process of distinguishing real from fake. These learned features have value beyond their original purpose.
Why Discriminators Learn Good Features:
To distinguish real from fake, the discriminator must understand what makes data realistic. For images, this includes:

- Low-level statistics: edges, textures, and color distributions.
- Mid-level structure: object parts and their shapes.
- High-level semantics: object identity and plausible scene composition.
This hierarchical understanding emerges automatically from the adversarial objective.
Applications of Discriminator Features:

- Feature extraction: reuse the convolutional trunk as a pretrained representation for downstream classifiers.
- Feature matching: compare real and generated samples in the discriminator's feature space to stabilize training.
- Anomaly detection: flag inputs whose discriminator features or scores deviate from the training distribution.
Bidirectional GANs (BiGAN) and Adversarially Learned Inference (ALI) extend the GAN framework to also learn an encoder that maps data back to latent space. This enables the discriminator's learned features to be explicitly used for representation learning.
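As a sketch of the idea, the discriminator's convolutional trunk can be reused as a feature extractor. The small untrained network below is a hypothetical stand-in for a trained discriminator:

```python
import torch
import torch.nn as nn

class SmallDiscriminator(nn.Module):
    """Toy stand-in for a trained GAN discriminator (untrained, for illustration)."""
    def __init__(self, channels=3, feature_maps=16):
        super().__init__()
        # Convolutional trunk: this is the part reused as a feature extractor
        self.features = nn.Sequential(
            nn.Conv2d(channels, feature_maps, 4, 2, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(feature_maps, feature_maps * 2, 4, 2, 1),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Real/fake head, discarded when extracting features
        self.classifier = nn.Sequential(
            nn.Conv2d(feature_maps * 2, 1, 16, 1, 0),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.classifier(self.features(x)).view(-1, 1)

disc = SmallDiscriminator()
images = torch.randn(4, 3, 64, 64)

# Reuse the trunk as a representation: pool spatial dims to a feature vector
with torch.no_grad():
    feats = disc.features(images).mean(dim=(2, 3))  # [4, 32]
print(feats.shape)
```

These pooled activations could then feed a linear probe or any downstream model.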
You now understand the two networks at the heart of GANs—their architectures, roles, and the delicate balance between them. Next, we'll examine the minimax objective in mathematical detail, understanding its properties and practical modifications.