In 2018, researchers at NVIDIA unveiled StyleGAN, a generative architecture that fundamentally reimagined how generators create images. While ProGAN had solved the high-resolution challenge through progressive growing, StyleGAN asked a different question: how can we give artists and researchers meaningful control over generation?
The core insight was revolutionary: instead of feeding the latent vector z directly into the generator, StyleGAN introduces an intermediate latent space W and uses style injection at each layer. This seemingly simple change had profound implications for disentanglement, controllability, and image quality.
StyleGAN2 further refined the architecture, eliminating artifacts and improving quality. Together, the StyleGAN family represents the pinnacle of GAN-based image synthesis.
By the end of this page, you will understand the mapping network and W latent space, adaptive instance normalization (AdaIN), style injection mechanism, noise injection for stochastic detail, the constant input paradigm, style mixing and its implications, and StyleGAN2's improvements including weight demodulation.
The first innovation of StyleGAN is the mapping network f: Z → W, an 8-layer MLP that transforms the input latent code z into an intermediate latent code w. This might seem like added complexity for no reason, but the W space has fundamentally different properties than Z.
Why Z is problematic:
Z is drawn from a spherical Gaussian N(0, I). This imposes a specific geometry on the latent space. But the space of realistic images is not spherical—it has complex, non-convex structure. Forcing the generator to map from a sphere to this complex manifold creates entanglement: changing one dimension of z affects multiple image attributes simultaneously.
Why W is better:
The mapping network learns to 'warp' the spherical Z into a W space that better matches the structure of images. W is not constrained to any particular distribution—it naturally takes whatever shape best represents the data. This leads to much better disentanglement.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class MappingNetwork(nn.Module):
    """
    StyleGAN mapping network: f(z) → w

    8 fully-connected layers with LeakyReLU activation.
    Transforms the spherical latent Z into the disentangled W space.

    Key design choices:
    - Uses equalized learning rate (like ProGAN)
    - LeakyReLU activation throughout
    - Output dimension equals input (both 512 typically)
    - No normalization layers (pure MLP)
    """
    def __init__(
        self,
        z_dim=512,
        w_dim=512,
        num_layers=8,
        lr_multiplier=0.01  # Lower learning rate for the mapping network
    ):
        super().__init__()

        layers = []
        in_dim = z_dim
        for i in range(num_layers):
            out_dim = w_dim
            # Equalized linear layer
            layers.append(EqualizedLinear(in_dim, out_dim, lr_mul=lr_multiplier))
            layers.append(nn.LeakyReLU(0.2))
            in_dim = out_dim

        self.mapping = nn.Sequential(*layers)

    def forward(self, z):
        """
        Map z → w.

        z: [batch, z_dim] - sampled from N(0, I)
        returns: [batch, w_dim] - in learned W space
        """
        # Normalize z to the unit sphere (pixel norm style)
        z = z / (z.norm(dim=1, keepdim=True) + 1e-8)
        return self.mapping(z)


class EqualizedLinear(nn.Module):
    """Linear layer with equalized learning rate."""
    def __init__(self, in_features, out_features, lr_mul=1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Scale for equalized learning rate
        self.scale = (1 / np.sqrt(in_features)) * lr_mul
        self.lr_mul = lr_mul

    def forward(self, x):
        return F.linear(x, self.weight * self.scale, self.bias * self.lr_mul)


# The W space emerges from training
"""
During training, the mapping network learns to:
1. Cluster similar image attributes together
2. Separate independent attributes into different directions
3. Create smooth interpolation paths

Empirical observations:
- Interpolating in Z: often passes through unrealistic images
- Interpolating in W: smooth, realistic transitions throughout
- Linear directions in W correspond to semantic attributes
  (age, gender, smile, glasses, etc.)

This is not explicitly supervised - it emerges from the generator's
pressure to create realistic images.
"""

The truncation trick pulls w vectors toward the mean w̄ (averaged over many samples): w' = w̄ + ψ(w - w̄). With ψ < 1, samples are more 'average' but of higher quality; with ψ > 1, samples are more varied but may contain artifacts. This works much better in W than in Z because W has a more regular distribution—its mean is meaningful.
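As a minimal sketch of the truncation trick, assuming a MappingNetwork instance like the one above (the sample count and the ψ value are arbitrary choices for illustration):

import torch

@torch.no_grad()
def truncate_w(mapping, z, psi=0.7, n_samples=10000):
    """Pull w toward the mean w̄: w' = w̄ + ψ(w - w̄)."""
    # Estimate w̄ by averaging the mapped w of many random z samples
    z_samples = torch.randn(n_samples, z.size(1), device=z.device)
    w_mean = mapping(z_samples).mean(dim=0, keepdim=True)  # [1, w_dim]

    # Interpolate each sample's w toward the mean
    w = mapping(z)
    return w_mean + psi * (w - w_mean)

# Example: ψ = 0.7 trades some diversity for higher sample quality
# mapping = MappingNetwork()
# w_truncated = truncate_w(mapping, torch.randn(8, 512), psi=0.7)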
Adaptive Instance Normalization (AdaIN) is the mechanism by which style information from W space is injected into the generator. Rather than using the latent code as input to the generator, StyleGAN uses it to modulate the activations at each layer.
The AdaIN operation:
$$\text{AdaIN}(x_i, y) = y_{s,i} \cdot \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
where x_i is the i-th feature channel, μ(x_i) and σ(x_i) are its spatial mean and standard deviation, and y_{s,i}, y_{b,i} are the per-channel scale and bias produced from the style code y, itself a learned affine transformation of w.
In words: normalize each feature channel, then scale and shift it based on the style code. The style code thus controls the statistics of the activations, not their spatial structure.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaIN(nn.Module):
    """
    Adaptive Instance Normalization.

    Takes feature maps and a style vector, outputs modulated features.
    The style controls the 'style' (statistics) while the features
    provide the 'content' (spatial structure).
    """
    def __init__(self, num_features, w_dim=512):
        super().__init__()
        # Affine transformation from w to scale and bias
        # 2 * num_features: one scale and one bias per feature channel
        self.style_transform = nn.Linear(w_dim, 2 * num_features)

        # Initialize to identity: scale=1, bias=0
        self.style_transform.weight.data.zero_()
        self.style_transform.bias.data[:num_features] = 1.0  # scales
        self.style_transform.bias.data[num_features:] = 0.0  # biases

        self.num_features = num_features

    def forward(self, x, w):
        """
        x: [batch, channels, height, width] - feature maps
        w: [batch, w_dim] - style vector
        returns: modulated feature maps
        """
        batch_size = x.size(0)

        # Get per-channel scale and bias from the style
        style = self.style_transform(w)  # [batch, 2*channels]
        scale = style[:, :self.num_features].view(batch_size, -1, 1, 1)
        bias = style[:, self.num_features:].view(batch_size, -1, 1, 1)

        # Instance normalization: normalize each sample's each channel
        # [B, C, H, W] → normalize over H, W for each (B, C) pair
        mean = x.mean(dim=[2, 3], keepdim=True)
        std = x.std(dim=[2, 3], keepdim=True) + 1e-8
        x_normalized = (x - mean) / std

        # Apply style modulation
        return scale * x_normalized + bias


# Why AdaIN works for style control
"""
Insight from neural style transfer:
- Content is captured by spatial patterns (where features activate)
- Style is captured by feature statistics (mean, variance, correlations)

By having AdaIN control only the statistics, we separate:
- Structure/layout: determined by the generator's learned spatial patterns
- Style/appearance: controlled by the W latent code

This separation is why StyleGAN achieves such good disentanglement:
- Early layers control global style (pose, face shape)
- Middle layers control features (hair style, face features)
- Late layers control fine details (colors, microstructure)
"""


class StyleGANSynthesisBlock(nn.Module):
    """
    A single block in the StyleGAN synthesis network.

    Each block:
    1. Upsamples the input (except the first block)
    2. Applies 2 convolutions with style modulation
    3. Adds noise for stochastic variation
    """
    def __init__(
        self,
        in_channels,
        out_channels,
        w_dim=512,
        resolution=None,  # For noise generation
        is_first_block=False
    ):
        super().__init__()
        self.is_first_block = is_first_block
        self.resolution = resolution

        if not is_first_block:
            self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

        # Two convolutional layers
        self.conv1 = ModulatedConv2d(in_channels, out_channels, 3)
        self.conv2 = ModulatedConv2d(out_channels, out_channels, 3)

        # Style transform for each conv
        self.style1 = nn.Linear(w_dim, in_channels)
        self.style2 = nn.Linear(w_dim, out_channels)

        # Noise injection after each conv
        self.noise_weight1 = nn.Parameter(torch.zeros(1))
        self.noise_weight2 = nn.Parameter(torch.zeros(1))

        self.activation = nn.LeakyReLU(0.2)

    def forward(self, x, w, noise=None):
        if not self.is_first_block:
            x = self.upsample(x)

        # First conv + style + noise
        style1 = self.style1(w)
        x = self.conv1(x, style1)
        x = x + self.noise_weight1 * self._get_noise(x, noise)
        x = self.activation(x)

        # Second conv + style + noise
        style2 = self.style2(w)
        x = self.conv2(x, style2)
        x = x + self.noise_weight2 * self._get_noise(x, noise)
        x = self.activation(x)

        return x

    def _get_noise(self, x, noise=None):
        if noise is None:
            noise = torch.randn(x.size(0), 1, x.size(2), x.size(3), device=x.device)
        return noise

In StyleGAN, a (potentially different) w vector is fed to each layer. This extended space is called W+ ('W-plus'). Using the same w at every layer is plain W space; using a different w per layer is W+ space. W+ provides more control but makes optimization harder when inverting real images.
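A small sketch of the distinction, assuming 18 style-injection layers (the helper name below is illustrative): in W space the same w is broadcast to every layer, while in W+ each layer gets its own row that can be edited independently.

import torch

def make_w_plus(w, num_layers=18):
    """Broadcast a single w [batch, w_dim] to W+ shape [batch, num_layers, w_dim]."""
    return w.unsqueeze(1).expand(-1, num_layers, -1).clone()

w = torch.randn(4, 512)       # stand-in for mapping(z)
w_plus = make_w_plus(w)       # identical rows: still equivalent to plain W

# In W+, one layer's style can be changed on its own,
# e.g. replacing only the style used at layer 5:
w_plus[:, 5] = torch.randn(4, 512)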
A key innovation in StyleGAN is per-pixel noise injection after each convolutional layer. This might seem counterintuitive—why add randomness to a generative process? The answer reveals a fundamental insight about image structure.
The observation:
Real images have two types of variation: meaningful, structural attributes (identity, pose, facial features) and purely stochastic details (the exact placement of hair strands, freckles, and skin texture).
If we force the latent code to control everything, including random details, it becomes overloaded and less disentangled. By providing explicit noise channels, we free the latent space to focus on meaningful variations.
import torch
import torch.nn as nn


class NoiseInjection(nn.Module):
    """
    Inject per-pixel Gaussian noise into feature maps.

    Noise is scaled by a learned per-channel weight, allowing
    the network to learn how much stochastic variation each
    feature channel should have.
    """
    def __init__(self, num_channels):
        super().__init__()
        # One learnable weight per channel, initialized to 0
        self.weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x, noise=None):
        """
        x: [batch, channels, height, width]
        noise: optional [batch, 1, height, width] - shared across channels
        """
        if noise is None:
            batch, _, height, width = x.shape
            noise = torch.randn(batch, 1, height, width, device=x.device)

        # Broadcast noise across channels, scale by the learned weight
        return x + self.weight * noise


# Demonstration: Effect of noise
def demonstrate_noise_effect(generator, w):
    """
    Generate multiple images with the same w but different noise.

    Result: Images have identical global structure (same person)
    but different stochastic details (different exact hair strands,
    slightly different freckle patterns, etc.)
    """
    images = []
    for _ in range(5):
        # Same w, different noise realization
        noise = [torch.randn(1, 1, 2**i, 2**i) for i in range(2, 11)]
        img = generator(w, noise=noise)
        images.append(img)

    return images  # Same identity, different fine details


# What noise controls at different resolutions:
"""
Resolution  | Noise controls
4×4         | Barely visible effect (very coarse)
8×8         | Large-scale texture variations
16×16       | Hair shape variations
32×32       | Hair strand patterns, facial texture
64×64       | Finer hair details, skin texture
128×128     | Individual hair strands, pore patterns
256×256     | Very fine texture details
512×512     | Sub-pixel variations
1024×1024   | Finest details, almost imperceptible

Key insight: This hierarchy emerges automatically from training.
The network learns to use noise for resolution-appropriate details.
"""


class StyleGANGenerator(nn.Module):
    """
    Complete StyleGAN generator with noise injection.
    """
    def __init__(self, z_dim=512, w_dim=512, resolution=1024):
        super().__init__()
        self.mapping = MappingNetwork(z_dim, w_dim)

        # Learned constant input (replaces ProGAN's first layer)
        self.constant = nn.Parameter(torch.randn(1, 512, 4, 4))

        # Build synthesis network
        self.synthesis_blocks = nn.ModuleList()
        self.to_rgb_blocks = nn.ModuleList()

        channels = {4: 512, 8: 512, 16: 512, 32: 512, 64: 256,
                    128: 128, 256: 64, 512: 32, 1024: 16}
        # ... build blocks as shown in previous sections

    def forward(self, z, noise=None, truncation_psi=1.0):
        """
        Generate an image from z.

        z: latent code from N(0, I)
        noise: optional list of noise tensors per layer
        truncation_psi: interpolate toward the mean w (1.0 = no truncation)
        """
        # Map z → w
        w = self.mapping(z)

        # Apply the truncation trick
        if truncation_psi != 1.0:
            w_mean = self.get_mean_w()  # Compute from many z samples
            w = w_mean + truncation_psi * (w - w_mean)

        # Generate noise if not provided
        if noise is None:
            noise = self._generate_noise()

        # Start from the learned constant
        x = self.constant.expand(z.size(0), -1, -1, -1)

        # Apply synthesis blocks with style and noise
        for block, rgb, layer_noise in zip(
            self.synthesis_blocks, self.to_rgb_blocks, noise
        ):
            x = block(x, w, layer_noise)

        return x

Perhaps StyleGAN's most surprising design choice is that the generator starts from a learned constant rather than from the latent code z. In ProGAN and DCGAN, z is reshaped into a 4×4 feature map as the first layer. In StyleGAN, a 4×4×512 tensor is instead learned during training and shared by every generated image; all variation comes from style modulation and noise.
Why this works:
The constant input provides a stable 'canvas' for the synthesis network; the style (from W) and the noise completely determine the output. This makes the role of each component clear: the constant supplies the neutral spatial layout, the styles control the appearance, and the noise contributes stochastic detail.
This separation is key to StyleGAN's interpretability and control.
import torch
import torch.nn as nn


class StyleGANSynthesisNetwork(nn.Module):
    """
    StyleGAN synthesis network with constant input.

    Architecture:
    constant([1, 512, 4, 4]) → modulated convs → RGB output

    The constant is the only 'seed' - all variation comes from
    style injection and noise.
    """
    def __init__(self, w_dim=512, img_resolution=1024):
        super().__init__()

        # THE learned constant (4×4×512)
        # This is the same for every generated image
        self.constant = nn.Parameter(torch.ones(1, 512, 4, 4))
        nn.init.normal_(self.constant)

        # Channel progression
        self.channels = {
            4: 512, 8: 512, 16: 512, 32: 512,
            64: 256, 128: 128, 256: 64, 512: 32, 1024: 16
        }

        # First layer: modulate the constant
        self.first_style = nn.Linear(w_dim, 512)
        self.first_conv = ModulatedConv2d(512, 512, 3)

        # ... rest of synthesis blocks

    def forward(self, w, noise=None):
        """
        w: [batch, w_dim] or [batch, num_layers, w_dim] for W+ space
        """
        batch_size = w.size(0)

        # Expand constant to batch size
        x = self.constant.expand(batch_size, -1, -1, -1)

        # Apply first style modulation
        # (note: we can also apply noise here)
        w0 = w[:, 0] if w.dim() == 3 else w
        style = self.first_style(w0)
        x = self.first_conv(x, style)

        # Continue through synthesis blocks...
        return x


# Visualization: What the constant looks like
def analyze_constant(generator):
    """
    The constant is a 4×4×512 tensor. We can't visualize 512 channels,
    but we can analyze its statistics.
    """
    c = generator.synthesis.constant.data.squeeze(0)  # [512, 4, 4]

    print(f"Constant shape: {c.shape}")
    print(f"Mean: {c.mean():.4f}")
    print(f"Std: {c.std():.4f}")
    print(f"Min: {c.min():.4f}, Max: {c.max():.4f}")

    # The constant typically converges to a pattern that
    # encodes a "neutral" face layout that can be modulated
    # into any specific face via style injection


# Implications for image editing:
"""
Because all variation comes from style and noise, we can:

1. Style Mixing: Take styles from different images at different layers
   - Layers 0-3: coarse (pose, face shape) from image A
   - Layers 4-7: middle (hair, features) from image B
   - Layers 8+: fine (colors, texture) from image C

2. GAN Inversion: Find w (or w+) that generates a real image
   - Easier than finding z because W space is more regular
   - Enables editing real photos

3. Semantic Editing: Find directions in W that change specific attributes
   - Add "smile" vector to make any face smile
   - These directions often linearly combine
"""

Style mixing is both a training regularization technique and a powerful generation control mechanism. The idea is simple: instead of using one w vector for all layers, use different w vectors from different sources.
Training with style mixing:
During training, with probability p (often 0.9), a second latent code z2 is sampled alongside z1, a random crossover layer is chosen, and the generator uses w1 = f(z1) for the layers before the crossover and w2 = f(z2) for the layers after it.
This regularizes the network, preventing it from expecting correlated styles across layers and improving disentanglement.
import torch
import torch.nn as nn
import numpy as np


class StyleMixer:
    """
    Style mixing utilities for StyleGAN.
    """
    def __init__(self, generator, num_layers=18):
        """
        num_layers: total number of style injection points
        (e.g., 18 for 1024×1024 = 2 per resolution × 9 resolutions)
        """
        self.G = generator
        self.num_layers = num_layers

    def mix_styles(self, w1, w2, crossover_layer):
        """
        Create a mixed style tensor using w1 for early layers, w2 for late.

        w1, w2: [batch, w_dim]
        crossover_layer: int, layer index where we switch from w1 to w2
        returns: [batch, num_layers, w_dim]
        """
        batch_size = w1.size(0)
        w_dim = w1.size(1)

        # Expand to [batch, num_layers, w_dim]
        w = torch.zeros(batch_size, self.num_layers, w_dim, device=w1.device)
        for layer_idx in range(self.num_layers):
            if layer_idx < crossover_layer:
                w[:, layer_idx] = w1
            else:
                w[:, layer_idx] = w2

        return w

    def generate_mixed(self, z1, z2, crossover_layer):
        """Generate an image with style mixing."""
        w1 = self.G.mapping(z1)
        w2 = self.G.mapping(z2)
        w_mixed = self.mix_styles(w1, w2, crossover_layer)
        return self.G.synthesis(w_mixed)

    def style_mixing_matrix(self, num_sources=5, num_destinations=5):
        """
        Generate a grid showing style mixing between multiple latents.

        Rows: source images (provide coarse styles)
        Cols: destination images (provide fine styles)
        Cell (i,j): coarse from source i, fine from destination j
        """
        z_source = torch.randn(num_sources, 512)
        z_dest = torch.randn(num_destinations, 512)

        w_source = self.G.mapping(z_source)
        w_dest = self.G.mapping(z_dest)

        crossover_layer = 4  # Switch after 4 layers (controls coarse structure)

        grid = []
        for i in range(num_sources):
            row = []
            for j in range(num_destinations):
                # Coarse from source[i], fine from dest[j]
                w_mixed = self.mix_styles(
                    w_source[i:i+1], w_dest[j:j+1], crossover_layer
                )
                img = self.G.synthesis(w_mixed)
                row.append(img)
            grid.append(row)

        return grid


# What each layer range controls (for 1024×1024):
"""
Layer range | Resolution | Controls
0-1         | 4×4        | Overall face shape, pose
2-3         | 8×8        | Face shape details
4-5         | 16×16      | Hair style, eyes
6-7         | 32×32      | Hair, face features
8-9         | 64×64      | Smaller features
10-11       | 128×128    | Colors, textures
12-13       | 256×256    | Fine details
14-15       | 512×512    | Very fine details
16-17       | 1024×1024  | Microstructure

General categories:
- Coarse (0-3): Pose, face shape, general structure
- Middle (4-7): Facial features, hairstyle
- Fine (8+): Colors, textures, details

These emerge from training without explicit supervision!
"""


# Training with style mixing
def train_with_mixing(G, D, z_batch, mixing_prob=0.9):
    """Apply style mixing regularization during training."""
    batch_size = z_batch.size(0)

    if np.random.random() < mixing_prob:
        # Mix styles
        z2 = torch.randn_like(z_batch)
        crossover = np.random.randint(1, G.num_layers)

        w1 = G.mapping(z_batch)
        w2 = G.mapping(z2)

        # Create per-layer w
        w = torch.zeros(batch_size, G.num_layers, 512, device=z_batch.device)
        for layer in range(G.num_layers):
            w[:, layer] = w1 if layer < crossover else w2
    else:
        # No mixing - same w for all layers
        w = G.mapping(z_batch).unsqueeze(1).expand(-1, G.num_layers, -1)

    return G.synthesis(w)

Style mixing enables powerful creative applications: (1) creating a person with one face's structure but another's coloring; (2) transferring just the hairstyle from one image to another; (3) maintaining identity while changing lighting and texture. These operations are impossible or very difficult with pre-StyleGAN architectures.
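As a usage sketch of the StyleMixer class above, the first of those applications might look like the following (the pretrained generator object and the crossover point are assumptions for illustration):

import torch

# Hypothetical usage: combine one latent's coarse structure with
# another's middle/fine styles. `generator` is assumed to expose the
# mapping/synthesis interface used throughout this page.
mixer = StyleMixer(generator, num_layers=18)

z_structure = torch.randn(1, 512)   # supplies pose and face shape
z_appearance = torch.randn(1, 512)  # supplies hair, colors, texture

# Layers 0-3 take styles from z_structure, layers 4-17 from z_appearance
mixed = mixer.generate_mixed(z_structure, z_appearance, crossover_layer=4)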
StyleGAN produced stunning results, but careful inspection revealed characteristic artifacts—'water droplet' blobs that appeared in many generated images. StyleGAN2 (2019) traced these to the AdaIN normalization step and introduced elegant solutions.
The problem with AdaIN:
AdaIN normalizes feature statistics, destroying information about the relative magnitudes of features. When certain features should dominate (e.g., edge features in high-contrast areas), normalization inappropriately equalizes them, creating artifacts.
The key StyleGAN2 changes covered here are weight demodulation, which replaces AdaIN's normalization of activations with an equivalent rescaling of the convolution weights, and path length regularization, which encourages a smoother, more uniform mapping from W to images. Both are shown below.
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class ModulatedConv2d(nn.Module):
    """
    Modulated convolution used in StyleGAN2.

    Instead of modulating features (AdaIN), we modulate the weights.
    This achieves similar style control without normalizing features.

    Process:
    1. Scale conv weights by the style vector
    2. Demodulate weights to maintain expected signal statistics
    3. Apply convolution

    This is mathematically similar to AdaIN but avoids the artifacts.
    """
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        demodulate=True,
        lr_mul=1.0
    ):
        super().__init__()
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.demodulate = demodulate

        # Base convolution weight
        self.weight = nn.Parameter(
            torch.randn(out_channels, in_channels, kernel_size, kernel_size)
        )

        # Scale for equalized learning rate
        self.scale = (1 / np.sqrt(in_channels * kernel_size ** 2)) * lr_mul
        self.padding = kernel_size // 2

    def forward(self, x, style):
        """
        x: [batch, in_channels, height, width]
        style: [batch, in_channels] - per-channel modulation scales
        returns: [batch, out_channels, height, width]
        """
        batch_size, in_channels, height, width = x.shape

        # Scale base weight
        weight = self.weight * self.scale  # [out, in, k, k]

        # Modulate: multiply weight by style (per input channel)
        # style: [batch, in] → [batch, 1, in, 1, 1]
        style = style.view(batch_size, 1, in_channels, 1, 1)
        weight = weight.unsqueeze(0) * style  # [batch, out, in, k, k]

        if self.demodulate:
            # Demodulate: normalize by the expected output std
            # For each output channel, compute the std over input channels
            # and kernel positions: sigma = sqrt(sum(w^2))
            demod = torch.rsqrt(
                weight.pow(2).sum(dim=[2, 3, 4], keepdim=True) + 1e-8
            )  # [batch, out, 1, 1, 1]
            weight = weight * demod

        # Reshape for grouped convolution
        # This implements per-sample convolution efficiently
        weight = weight.view(
            batch_size * self.out_channels, in_channels,
            self.kernel_size, self.kernel_size
        )

        # Group convolution: each sample uses its own weights
        x = x.view(1, batch_size * in_channels, height, width)
        out = F.conv2d(x, weight, padding=self.padding, groups=batch_size)
        out = out.view(batch_size, self.out_channels, height, width)

        return out


# Why weight modulation works better:
"""
AdaIN approach (StyleGAN1):
1. Convolve:  y = conv(x)
2. Normalize: y' = (y - μ) / σ
3. Scale:     out = γ * y' + β

Problem: Step 2 destroys relative magnitude information.
If one feature is 10x larger than another (which conveys meaning),
normalization makes them similar in magnitude.

Weight modulation approach (StyleGAN2):
1. Modulate weights:   w' = w * s
2. Demodulate weights: w'' = w' / ||w'||
3. Convolve:           out = conv(x, w'')

The key insight: we can achieve the same statistical effect by
modifying the weights instead of the activations. The demodulation
ensures the output has unit variance (like instance norm), but we never
actually normalize the activations themselves.

Result: Same style control, no artifacts from feature normalization.
"""
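Written out in the notation of the StyleGAN2 paper, the modulate and demodulate steps implemented above are

$$w'_{ijk} = s_i \cdot w_{ijk}, \qquad w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} \left(w'_{ijk}\right)^2 + \epsilon}}$$

where s_i is the style scale for input channel i, j indexes output channels, and k runs over the kernel's spatial positions. Demodulation rescales each output channel's weights to unit expected variance, reproducing the statistical effect of instance normalization without ever normalizing the activations. StyleGAN2's second addition, path length regularization, is sketched next.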
import torch
import numpy as np


def path_length_regularization(generator, latents, mean_path_length, decay=0.01):
    """
    Path length regularization for StyleGAN2.

    Encourages: ||J_w|| ≈ constant for all w

    where J_w is the Jacobian of the generator output w.r.t. the latents.
    This makes the mapping from W to images locally stable, improving
    interpolation quality and inversion.

    Note: `latents` must be part of the autograd graph (e.g. the output
    of the mapping network) so gradients w.r.t. it can be computed.
    """
    # Generate images and track gradients w.r.t. the latents
    images = generator(latents)

    # Random image-space direction, scaled so the estimate does not
    # depend on resolution
    noise = torch.randn_like(images) / np.sqrt(images.shape[2] * images.shape[3])

    # Gradient of (images * noise).sum() w.r.t. the latents:
    # a stochastic estimate of ||J||
    grad = torch.autograd.grad(
        outputs=(images * noise).sum(),
        inputs=latents,
        create_graph=True,
    )[0]

    # Path length for this batch
    path_lengths = torch.sqrt(grad.pow(2).sum(dim=1).mean(dim=0))

    # Update the running mean
    mean_path_length = mean_path_length + decay * (
        path_lengths.mean().item() - mean_path_length
    )

    # Penalty: encourage the path length to match the running mean
    path_penalty = (path_lengths - mean_path_length).pow(2).mean()

    return path_penalty, mean_path_length


# The intuition:
"""
If the generator is well-behaved, small perturbations in W
should cause small perturbations in the output image.

Mathematically: ||∂G(w)/∂w|| should be roughly constant.

The path length regularizer encourages this by:
1. Computing the Jacobian norm (path length)
2. Penalizing deviation from the average

Benefits:
- Smoother interpolations in W space
- Better GAN inversion (finding w for real images)
- More predictable edits when moving in W space
"""

StyleGAN's controllable, high-quality generation has enabled numerous applications beyond simple image synthesis.
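Before turning to those broader applications, here is a hypothetical sketch of where the path length penalty above would plug into a generator update; the non-saturating loss, the pl_weight value, and the G/D interfaces are assumptions for illustration, not the official training code.

import torch
import torch.nn.functional as F

def generator_step(G, D, g_optim, batch_size, mean_path_length, pl_weight=2.0):
    """One generator update with the path length penalty (illustrative)."""
    z = torch.randn(batch_size, 512)
    w = G.mapping(z)              # w stays in the autograd graph via the mapping
    fake = G.synthesis(w)

    # Non-saturating adversarial loss for the generator
    adv_loss = F.softplus(-D(fake)).mean()

    # Path length penalty (runs its own forward pass on G.synthesis)
    pl_penalty, mean_path_length = path_length_regularization(
        G.synthesis, w, mean_path_length
    )

    g_optim.zero_grad()
    (adv_loss + pl_weight * pl_penalty).backward()
    g_optim.step()

    return mean_path_length   # carry the running average to the next step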
StyleGAN's realism has raised significant ethical concerns. Generated faces can be used for fake profiles, disinformation, and fraud. The research community has responded with detection methods and watermarking techniques. Understanding these capabilities and their potential misuse is essential for responsible deployment.
StyleGAN and StyleGAN2 represent the pinnacle of GAN-based image synthesis, combining unprecedented quality with meaningful controllability.
StyleGAN's innovations—disentangled latent spaces, layer-wise control, separating content from style—have influenced architectures well beyond GANs. Many ideas in diffusion models and vision-language models trace conceptual ancestry to the StyleGAN paradigm. The next page covers Conditional GANs, which add explicit control through labels or input conditions.