While L2 regularization and max-norm constraints control the aggregate or per-unit magnitude of weights, they don't directly address a more fundamental property: how much the network's output can change relative to changes in its input. This is the Lipschitz constant—a measure of the "steepness" of the function.
Spectral normalization provides a principled way to control the Lipschitz constant by constraining the spectral norm (largest singular value) of each weight matrix. By ensuring every layer has spectral norm at most 1, we guarantee the entire network is 1-Lipschitz: changes in output are bounded by changes in input.
This technique, introduced for training stable Generative Adversarial Networks (GANs), has broader applications in:
- stabilizing GAN discriminators (critics), including the 1-Lipschitz requirement of Wasserstein GANs
- robustness to input perturbations
- controlling gradient magnitudes and tightening generalization bounds
This page covers spectral normalization in depth: spectral norm definition, Lipschitz continuity, power iteration for efficient computation, implementation, relationship to other normalizations, and applications in GANs and beyond.
The spectral norm of a matrix $\mathbf{W}$ is its largest singular value, denoted $\sigma(\mathbf{W})$ or $\|\mathbf{W}\|_2$.
Formal Definition: $$\sigma(\mathbf{W}) = \max_{\mathbf{x} \neq 0} \frac{\|\mathbf{W}\mathbf{x}\|_2}{\|\mathbf{x}\|_2} = \max_{\|\mathbf{x}\|_2 = 1} \|\mathbf{W}\mathbf{x}\|_2$$
Intuitively, the spectral norm measures the maximum "stretching" the matrix applies to any vector.
Connection to Singular Values:
For the Singular Value Decomposition (SVD) $\mathbf{W} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$: $$\sigma(\mathbf{W}) = \sigma_1 = \max_i \sigma_i$$
where $\sigma_1 \geq \sigma_2 \geq ... \geq \sigma_r$ are the singular values.
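As a quick sanity check on this definition (a minimal sketch with an arbitrary random matrix, using PyTorch's linear-algebra routines), the largest singular value matches the operator 2-norm and upper-bounds the stretching of any sampled unit vector:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(64, 32)  # arbitrary weight matrix

# Exact spectral norm: largest singular value from the SVD
sigma_exact = torch.linalg.svdvals(W)[0]
# Equivalent: operator 2-norm of the matrix
sigma_norm = torch.linalg.matrix_norm(W, ord=2)

# Empirical "stretching": ||Wx||_2 for random unit vectors x
x = F.normalize(torch.randn(1000, 32), dim=1)
stretch = (x @ W.t()).norm(dim=1)

print(f"sigma (SVD):         {sigma_exact:.4f}")
print(f"sigma (matrix norm): {sigma_norm:.4f}")
print(f"max random stretch:  {stretch.max():.4f}  # never exceeds sigma")
```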
Comparison with Other Matrix Norms:
| Norm | Formula | Interpretation |
|---|---|---|
| Frobenius | $\sqrt{\sum_{i,j} W_{ij}^2}$ | Square root of the sum of squared elements (like L2) |
| Spectral | $\sigma_1(\mathbf{W})$ | Maximum stretching factor |
| Max-norm | $\max_j \|\mathbf{W}_{:,j}\|_2$ | Largest column norm |
| Operator L1 | $\max_j \sum_i|W_{ij}|$ | Maximum column sum |
| Nuclear | $\sum_i \sigma_i$ | Sum of singular values |
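To make the comparison concrete, the short sketch below (an arbitrary random matrix; norm choices follow the table above) evaluates each of these norms in PyTorch:

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 6)

frobenius = W.pow(2).sum().sqrt()          # sqrt of sum of squared elements
spectral  = torch.linalg.svdvals(W)[0]     # largest singular value
max_norm  = W.norm(dim=0).max()            # largest column L2 norm
op_l1     = W.abs().sum(dim=0).max()       # maximum absolute column sum
nuclear   = torch.linalg.svdvals(W).sum()  # sum of singular values

for name, val in [("Frobenius", frobenius), ("Spectral", spectral),
                  ("Max-norm", max_norm), ("Operator L1", op_l1),
                  ("Nuclear", nuclear)]:
    print(f"{name:12s} {val.item():.4f}")
```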
A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz continuous with constant $L$ if: $$\|f(\mathbf{x}_1) - f(\mathbf{x}_2)\|_2 \leq L \|\mathbf{x}_1 - \mathbf{x}_2\|_2$$
for all $\mathbf{x}_1, \mathbf{x}_2$. The smallest such $L$ is the Lipschitz constant of $f$.
For Linear Layers:
A linear transformation $f(\mathbf{x}) = \mathbf{W}\mathbf{x}$ has Lipschitz constant exactly $\sigma(\mathbf{W})$: $$\|\mathbf{W}\mathbf{x}_1 - \mathbf{W}\mathbf{x}_2\|_2 = \|\mathbf{W}(\mathbf{x}_1 - \mathbf{x}_2)\|_2 \leq \sigma(\mathbf{W}) \|\mathbf{x}_1 - \mathbf{x}_2\|_2$$
For Neural Networks:
A feedforward network $f = f_L \circ f_{L-1} \circ \dots \circ f_1$ has Lipschitz constant bounded by the product of the per-layer constants: $$\mathrm{Lip}(f) \leq \prod_{l=1}^{L} L_l$$
where $L_l$ is the Lipschitz constant of layer $l$ (including activation).
Common activations are 1-Lipschitz: ReLU, Leaky ReLU (with slope at most 1), and tanh have Lipschitz constant 1, while sigmoid has constant 1/4. If each weight matrix has spectral norm ≤ 1 and activations are 1-Lipschitz, the entire network is 1-Lipschitz.
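As an illustrative check of this product bound (a sketch with a small random ReLU MLP, not part of the derivation above), the observed output-to-input change ratios never exceed the product of the layers' spectral norms:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                    nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 8))

# Upper bound: product of per-layer spectral norms (ReLU is 1-Lipschitz)
L_bound = 1.0
for m in net:
    if isinstance(m, nn.Linear):
        L_bound *= torch.linalg.matrix_norm(m.weight, ord=2).item()

# Empirical ratio ||f(x1) - f(x2)|| / ||x1 - x2|| for random input pairs
x1, x2 = torch.randn(1000, 16), torch.randn(1000, 16)
with torch.no_grad():
    ratio = (net(x1) - net(x2)).norm(dim=1) / (x1 - x2).norm(dim=1)

print(f"product of spectral norms: {L_bound:.3f}")
print(f"largest observed ratio:    {ratio.max():.3f}")  # always <= the bound
```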
Why Lipschitz Bounds Matter:
- Gradient Control: For a Lipschitz network, gradients cannot explode: $\|\nabla_\mathbf{x} f\|_2 \leq L$ (a quick check appears after this list)
- Robustness: Small input perturbations cause bounded output changes
- GAN Stability: The WGAN discriminator must be 1-Lipschitz for the Wasserstein distance formulation to hold
- Generalization: Networks with controlled Lipschitz constants have tighter generalization bounds
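Here is the quick check referenced above (a sketch: a scalar-output MLP whose weights are manually divided by their spectral norms, so the network is 1-Lipschitz and its input gradients stay below 1):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Scalar-output MLP; divide each weight by its spectral norm so every layer is 1-Lipschitz
f = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
with torch.no_grad():
    for m in f:
        if isinstance(m, nn.Linear):
            m.weight /= torch.linalg.matrix_norm(m.weight, ord=2)

x = torch.randn(512, 16, requires_grad=True)
f(x).sum().backward()  # samples are independent, so this gives per-sample gradients

print(f"max input-gradient norm: {x.grad.norm(dim=1).max():.4f}")  # <= 1 (up to numerical error)
```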
Computing the spectral norm exactly requires SVD, which is $O(\min(mn^2, m^2n))$ for an $m \times n$ matrix—too expensive for every training iteration.
Power Iteration provides an efficient approximation. The basic idea: repeatedly apply $\mathbf{W}$ and $\mathbf{W}^\top$ to running vectors $\mathbf{u}$ and $\mathbf{v}$; the iterates converge to the leading left and right singular vectors, and $\mathbf{u}^\top \mathbf{W} \mathbf{v}$ then estimates $\sigma(\mathbf{W})$.
Key Insight: We don't need convergence at each step! With only one power iteration per training step, $\mathbf{u}$ and $\mathbf{v}$ track the leading singular vectors as $\mathbf{W}$ slowly changes. This gives an accurate running estimate of $\sigma(\mathbf{W})$ for roughly the cost of two matrix-vector products per step.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def power_iteration(W, u, v, num_iters=1):
    """
    Estimate spectral norm using power iteration.

    Args:
        W: Weight matrix (out_features, in_features)
        u: Left singular vector estimate (out_features,)
        v: Right singular vector estimate (in_features,)
        num_iters: Number of power iteration steps

    Returns:
        sigma: Spectral norm estimate
        u_new: Updated left singular vector
        v_new: Updated right singular vector
    """
    for _ in range(num_iters):
        # v <- W^T u / ||W^T u||
        v_new = F.normalize(torch.mv(W.t(), u), dim=0)
        # u <- W v / ||W v||
        u_new = F.normalize(torch.mv(W, v_new), dim=0)
        u, v = u_new, v_new

    # Spectral norm estimate: u^T W v
    sigma = torch.dot(u, torch.mv(W, v))
    return sigma, u, v


def spectral_normalize(W, u, v, num_iters=1):
    """
    Apply spectral normalization to weight matrix.

    W_sn = W / σ(W)

    Returns normalized weight and updated singular vectors.
    """
    sigma, u_new, v_new = power_iteration(W, u, v, num_iters)
    W_normalized = W / sigma
    return W_normalized, u_new, v_new, sigma
```

During training, u and v must be detached from the computation graph: we don't want gradients flowing through the spectral norm estimation. They're updated as buffers, not parameters.
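As a usage sketch for the helpers above (assuming the power_iteration function just defined; the drift magnitude is arbitrary), a single iteration per step quickly converges to, and then tracks, the exact spectral norm while the matrix changes slowly:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(128, 64)
u = F.normalize(torch.randn(128), dim=0)
v = F.normalize(torch.randn(64), dim=0)

for step in range(5):
    # One power iteration per "training step", as described above
    sigma_est, u, v = power_iteration(W, u, v, num_iters=1)
    sigma_exact = torch.linalg.svdvals(W)[0]
    print(f"step {step}: estimate {sigma_est:.4f} vs exact {sigma_exact:.4f}")
    # Simulate a small weight update between steps
    W = W + 0.01 * torch.randn_like(W)
```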
PyTorch provides built-in spectral normalization through torch.nn.utils.spectral_norm.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm


# Method 1: Using built-in spectral_norm
class SpectralNormDiscriminator(nn.Module):
    """GAN discriminator with spectral normalization."""

    def __init__(self, input_dim=784, hidden_dim=256):
        super().__init__()
        # Apply spectral_norm wrapper to each Linear layer
        self.layers = nn.Sequential(
            spectral_norm(nn.Linear(input_dim, hidden_dim)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(hidden_dim, hidden_dim)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(hidden_dim, 1))
        )

    def forward(self, x):
        return self.layers(x)


# Method 2: Applying to existing model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Apply spectral norm to all Linear layers
def add_spectral_norm(model):
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            spectral_norm(module)
        else:
            add_spectral_norm(module)

add_spectral_norm(model)


# Method 3: Custom implementation for understanding
class SpectralNormLinear(nn.Module):
    """
    Linear layer with spectral normalization.
    Demonstrates the internal mechanics.
    """

    def __init__(self, in_features, out_features, n_power_iters=1):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.n_power_iters = n_power_iters

        # Weight parameter (unnormalized)
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

        # Singular vector estimates (buffers, not parameters)
        self.register_buffer('u', F.normalize(torch.randn(out_features), dim=0))
        self.register_buffer('v', F.normalize(torch.randn(in_features), dim=0))

    def forward(self, x):
        # Update singular vectors (no gradients through the estimation)
        with torch.no_grad():
            for _ in range(self.n_power_iters):
                self.v = F.normalize(torch.mv(self.weight.t(), self.u), dim=0)
                self.u = F.normalize(torch.mv(self.weight, self.v), dim=0)

        # Compute spectral norm (allow gradients here)
        sigma = torch.dot(self.u, torch.mv(self.weight, self.v))

        # Normalize weight
        weight_normalized = self.weight / sigma

        return F.linear(x, weight_normalized, self.bias)
```
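To see what the built-in wrapper actually does, here is a quick inspection sketch (values are approximate because the wrapper's internal power iteration refines over forward passes):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

layer = spectral_norm(nn.Linear(64, 32))

# The original parameter is kept as weight_orig; weight is recomputed
# as weight_orig / sigma by a pre-forward hook on every call.
for _ in range(5):                      # let the power iteration settle
    layer(torch.randn(8, 64))

sigma = torch.linalg.svdvals(layer.weight.detach())[0]
print(f"spectral norm of normalized weight: {sigma:.4f}")   # ~1.0
```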
Spectral normalization was developed primarily for training GANs, where it addresses the core challenge of discriminator (critic) training.

The GAN Stability Problem: as the discriminator's weights grow, its gradients can explode and the generator receives erratic, uninformative training signals; the Wasserstein (WGAN) formulation additionally requires the critic to be 1-Lipschitz.
Why Spectral Norm Helps: dividing each weight matrix by its spectral norm caps every layer at Lipschitz constant 1, so the discriminator's overall Lipschitz constant, and hence its input gradients, stay bounded throughout training without the per-sample cost of a gradient penalty.
```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class SNGANDiscriminator(nn.Module):
    """
    Spectral Normalization GAN Discriminator.
    From "Spectral Normalization for Generative Adversarial Networks" (Miyato et al., 2018)
    """

    def __init__(self, img_channels=3, hidden_dim=64):
        super().__init__()
        self.main = nn.Sequential(
            # Input: 64x64
            spectral_norm(nn.Conv2d(img_channels, hidden_dim, 4, 2, 1)),
            nn.LeakyReLU(0.1),
            # 32x32
            spectral_norm(nn.Conv2d(hidden_dim, hidden_dim*2, 4, 2, 1)),
            nn.LeakyReLU(0.1),
            # 16x16
            spectral_norm(nn.Conv2d(hidden_dim*2, hidden_dim*4, 4, 2, 1)),
            nn.LeakyReLU(0.1),
            # 8x8
            spectral_norm(nn.Conv2d(hidden_dim*4, hidden_dim*8, 4, 2, 1)),
            nn.LeakyReLU(0.1),
            # 4x4
            spectral_norm(nn.Conv2d(hidden_dim*8, 1, 4, 1, 0))
        )

    def forward(self, x):
        return self.main(x).view(-1)


# Note: Generator typically does NOT use spectral norm
# (or uses it more selectively)
class SNGANGenerator(nn.Module):
    """Generator can optionally use spectral norm too."""

    def __init__(self, latent_dim=128, hidden_dim=64, img_channels=3):
        super().__init__()
        self.main = nn.Sequential(
            # Latent -> 4x4
            spectral_norm(nn.ConvTranspose2d(latent_dim, hidden_dim*8, 4, 1, 0)),
            nn.BatchNorm2d(hidden_dim*8),
            nn.ReLU(),
            # ... (upsampling layers)
        )

    def forward(self, z):
        return self.main(z.view(-1, z.size(1), 1, 1))
```

Apply spectral norm to ALL layers of the discriminator, including the final layer. For the generator, spectral norm is optional but can help. Use LeakyReLU (slope ~0.1-0.2) with spectral norm. Avoid batch norm in the discriminator when using spectral norm.
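For context, here is a minimal sketch of one discriminator update and one generator update using the hinge loss commonly paired with spectrally normalized discriminators; the D, G, real_images, latent_dim, and optimizer arguments are hypothetical placeholders for the models above and your own data pipeline:

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, real_images, latent_dim, opt_D):
    z = torch.randn(real_images.size(0), latent_dim)
    fake_images = G(z).detach()          # do not backprop into the generator here
    # Hinge loss for the discriminator
    loss_D = F.relu(1.0 - D(real_images)).mean() + F.relu(1.0 + D(fake_images)).mean()
    opt_D.zero_grad()
    loss_D.backward()                    # spectral norm keeps D's gradients bounded
    opt_D.step()
    return loss_D.item()

def generator_step(D, G, batch_size, latent_dim, opt_G):
    z = torch.randn(batch_size, latent_dim)
    loss_G = -D(G(z)).mean()             # hinge generator loss
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_G.item()
```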
While developed for GANs, spectral normalization has broader applications, including robustness to input perturbations and any model that needs explicit Lipschitz control. The table below compares it with the other weight-control techniques covered in this module.
| Technique | Controls | Computation | Use Case |
|---|---|---|---|
| L2 regularization | Weight magnitude | O(n) | General regularization |
| Max-norm | Per-neuron norm | O(n) | Dropout combination |
| Spectral norm | Lipschitz/stretching | O(mn) | GANs, robustness |
| Gradient penalty | Gradient norm at sampled points | O(extra forward/backward) | WGAN-GP (expensive) |
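For contrast with the last row, here is a minimal sketch of the WGAN-GP gradient penalty (standard formulation; critic, real, and fake are hypothetical inputs); note the extra forward and backward pass through the critic that spectral normalization avoids:

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    """WGAN-GP penalty: push the critic's gradient norm toward 1
    at random interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)                              # extra forward pass
    grads = torch.autograd.grad(outputs=scores.sum(),    # extra backward pass
                                inputs=interp,
                                create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()
```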
PyTorch provides the spectral_norm wrapper for easy application.

You've completed the Weight Regularization module! You now understand three fundamental approaches: penalty-based (L2/L1), constraint-based (max-norm), and spectral (Lipschitz control). These tools form the foundation for preventing overfitting in deep networks, each suited to different scenarios and architectural choices.