Machine LearningGenerative Adversarial Networks

Generative Adversarial Networks

LevelAdvanced

Duration90 mins

TopicGenerative Adversarial Networks

5 / 5

Mode Collapse

The Diversity Crisis in GANs

Imagine training a GAN to generate human faces. After hours of training, the samples look photorealistic—but something is wrong. Every face looks eerily similar: same nose shape, same eye spacing, same expression. You've encountered mode collapse, one of the most insidious failure modes in generative adversarial networks.

Mode collapse occurs when the generator learns to produce only a limited variety of outputs, ignoring large portions of the true data distribution. The samples may be high quality, but they lack the diversity that characterizes real data. Understanding mode collapse—its causes, detection, and prevention—is essential for building robust generative models.

What You Will Learn

By the end of this page, you will understand: the mathematical definition of mode collapse, why the GAN objective permits this behavior, how to detect mode collapse in practice, and the various techniques developed to prevent or mitigate it.

Understanding Mode Collapse

Definition:

Mode collapse occurs when the generator maps many different latent vectors to the same (or very similar) outputs, effectively collapsing the output distribution onto a few modes of the true data distribution.

Formally, if $p_{\text{data}}$ has $K$ distinct modes (clusters in data space), a mode-collapsed generator $G$ might only cover $k << K$ of these modes.

Types of Mode Collapse:

Complete Collapse: Generator produces essentially identical outputs regardless of input noise $\mathbf{z}$. All generated samples look the same.
Partial Collapse: Generator covers some modes but ignores others. For face generation, it might produce only female faces, or only faces of a certain ethnicity.
Intra-class Collapse: Within each class (for conditional GANs), diversity is limited. Generating "cats" produces only tabby cats, never Siamese.

Why It Matters:

Mode collapse fundamentally violates the goal of generative modeling. If a model only generates a subset of realistic data, it's not truly modeling $p_{\text{data}}$—it's modeling a distorted, less diverse approximation.

Mode Collapse Severity Spectrum
Severity	Symptoms	Impact	Detection Difficulty
Complete	All samples nearly identical	Model unusable	Easy—visual inspection
Severe	Few distinct sample types	Very limited diversity	Easy—sample grids
Moderate	Missing major categories	Biased generation	Medium—requires coverage analysis
Subtle	Reduced within-category variety	Slight quality issues	Hard—requires statistical tests

Why Mode Collapse Happens

Mode collapse emerges from the adversarial training dynamic itself. Understanding its root causes helps us prevent it.

The Game-Theoretic Origin:

Consider the generator's perspective. Its goal is to fool the discriminator. If it finds a single sample that reliably fools the current discriminator, why explore other modes? Producing diverse samples is not rewarded—only fooling the discriminator is.

This creates a local optimum: the generator can achieve low loss by producing one "perfect" sample repeatedly, rather than learning the full distribution.

The Discriminator's Failure:

In theory, the discriminator should prevent this by learning that all samples are the same (hence clearly from the generator). In practice, discriminator training lags behind:

G collapses to mode A
D learns to reject mode A
G switches to mode B
D learns to reject mode B
G switches back to mode A
Cycle continues...

This is mode oscillation—related to mode collapse but even more pathological.

The Missing Diversity Term

The GAN objective doesn't explicitly encourage diversity. Forward KL divergence (used in MLE) heavily penalizes modes that G misses. But the JS divergence in GANs treats mode-seeking equally—G isn't punished for ignoring modes as long as what it produces looks real.

Mathematical Perspective:

Recall the discriminator's optimal response: $$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$

If $p_g$ concentrates on a small region, $D^*$ approaches $0.5$ only in that region. The generator receives gradients only from that region, reinforcing concentration rather than spreading.

Capacity Mismatch:

If the generator has limited capacity, it may not be able to represent the full data distribution. It will naturally focus on easier-to-generate modes.

Data Imbalance:

If some modes are more common in training data, the discriminator sees them more often and focuses on them. The generator follows, ignoring rare modes.

Detecting Mode Collapse

Early detection is crucial—mode collapse often worsens if uncaught. Here are methods for detection:

1. Visual Inspection:

Generate a grid of samples from different $\mathbf{z}$ vectors. Look for:

Repeated patterns or identical faces
Missing categories (e.g., no elderly faces)
Lack of variety in features (same pose, background, lighting)

2. Latent Space Analysis:

Generate samples from distant points in latent space. They should be distinct:

Sample $G(\mathbf{z}_1)$ and $G(\mathbf{z}_2)$ where $|\mathbf{z}_1 - \mathbf{z}_2| >> 0$
Compute image-space distance
Low distance despite high latent distance indicates collapse

3. Reverse KL Divergence (Approximated):

Measures whether generated samples cover all modes: $$D_{KL}(p_g | p_{\text{data}})$$ is low when $p_g$ only produces samples where $p_{\text{data}}$ is high—but this doesn't penalize missing modes.

mode_collapse_detection.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
"""
Methods for Detecting Mode Collapse
"""
import torch
import numpy as np
from scipy.stats import entropy
 
def latent_diversity_check(generator, num_samples=100, device='cuda'):
    """
    Check if different latent vectors produce different outputs.
    Low diversity score indicates mode collapse.
    """
    with torch.no_grad():
        z_samples = torch.randn(num_samples, generator.latent_dim, device=device)
        generated = generator(z_samples)
        
        # Flatten samples
        generated_flat = generated.view(num_samples, -1)
        
        # Compute pairwise L2 distances
        dists = torch.cdist(generated_flat, generated_flat)
        
        # Average non-diagonal distance
        mask = ~torch.eye(num_samples, dtype=bool, device=device)
        avg_dist = dists[mask].mean().item()
        
    return avg_dist
 
 
def class_coverage_check(generator, classifier, num_samples=1000, 
                         num_classes=10, device='cuda'):
    """
    For conditional or class-producing GANs, check coverage of all classes.
    Uses a pretrained classifier to label generated samples.
    """
    with torch.no_grad():
        z = torch.randn(num_samples, generator.latent_dim, device=device)
        generated = generator(z)
        
        # Classify generated samples
        logits = classifier(generated)
        predictions = logits.argmax(dim=1)
        
        # Count samples per class
        class_counts = torch.bincount(predictions, minlength=num_classes)
        coverage = (class_counts > 0).float().mean().item()
        
        # Entropy of class distribution (higher = more balanced)
        probs = class_counts.float() / num_samples
        class_entropy = entropy(probs.cpu().numpy())
        
    return {
        'coverage': coverage,  # Fraction of classes with at least one sample
        'entropy': class_entropy,  # Uniformity of class distribution
        'class_counts': class_counts.cpu().numpy()
    }
 
 
def inception_based_diversity(generator, inception_model, num_samples=1000):
    """
    Compute diversity using Inception features (for FID-style analysis).
    Low feature variance indicates mode collapse.
    """
    with torch.no_grad():
        # Generate samples and extract features
        z = torch.randn(num_samples, generator.latent_dim)
        generated = generator(z)
        features = inception_model(generated)  # [N, feature_dim]
        
        # Compute feature covariance determinant
        # Low determinant = features concentrated = mode collapse
        cov = torch.cov(features.T)
        det = torch.linalg.det(cov).item()
        
    return {'feature_covariance_det': det}

Prevention Techniques

Numerous techniques have been developed to combat mode collapse. Here are the most effective:

1. Minibatch Discrimination:

Give the discriminator access to multiple samples simultaneously. If all samples look similar, it should flag them as fake.

Implementation: Add a layer that computes statistics across the minibatch (e.g., average pairwise distances) and appends them to each sample's features.

2. Feature Matching:

Instead of the standard GAN loss, train G to match intermediate features of the discriminator:

$$\mathcal{L}_G = |\mathbb{E}_z[f(G(\mathbf{z}))] - \mathbb{E}_x[f(\mathbf{x})]|^2$$

where $f(\cdot)$ is an intermediate layer of D. This encourages G to match statistics of real data, not just fool D.

3. Unrolled GANs:

Train G against a "future" version of D by unrolling D's optimization steps:

Compute D's update: $D' = D - \alpha \nabla_D \mathcal{L}_D$
Use $D'$ to compute G's loss
This makes G optimize against where D is going, not where it is

Architecture-Based Solutions

•Multiple Generators: Train several G's against one D. Diversity emerges from ensemble.
•Progressive Growing: Start at low resolution, add layers gradually. Prevents early mode concentration.
•Self-Attention: Long-range dependencies help capture global modes, not just local patterns.

Objective-Based Solutions

•Mode Regularization: Add explicit diversity term to G's loss (e.g., maximize entropy of outputs).
•Energy-Based GANs: Modify D to output energy rather than probability. Better gradient coverage.
•Wasserstein Distance: WGAN provides gradients even for disjoint supports.

Minibatch Discrimination Deep Dive

Minibatch discrimination is one of the most effective anti-collapse techniques. Let's examine it in detail.

Intuition:

Mode collapse means all samples look similar. A normal discriminator processes samples independently—it can't detect this. Minibatch discrimination gives D information about sample similarity within the batch.

Mechanism:

Extract features from each sample: $f(x_i) \in \mathbb{R}^A$
Multiply by tensor $T \in \mathbb{R}^{A \times B \times C}$: $M_i = f(x_i) \cdot T$
Compute pairwise distances: $c_b(x_i, x_j) = \exp(-|M_{i,b} - M_{j,b}|)$
Sum over batch: $o(x_i)_b = \sum_j c_b(x_i, x_j)$
Concatenate to features: $[f(x_i), o(x_i)]$ goes to next layer

If all samples are similar, $o(x_i)$ will be large (many close neighbors), signaling mode collapse to D.

minibatch_discrimination.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
"""
Minibatch Discrimination Layer
"""
import torch
import torch.nn as nn
 
class MinibatchDiscrimination(nn.Module):
    """
    Computes similarity statistics across minibatch.
    Helps discriminator detect when all samples are similar (mode collapse).
    """
    def __init__(self, input_features, output_features, kernel_dim=5):
        super().__init__()
        self.input_features = input_features
        self.output_features = output_features
        self.kernel_dim = kernel_dim
        
        # Tensor T that transforms input to comparison space
        self.T = nn.Parameter(torch.randn(
            input_features, output_features * kernel_dim
        ) * 0.01)
    
    def forward(self, x):
        # x: [batch, input_features]
        batch_size = x.size(0)
        
        # Transform to comparison space
        # [batch, output_features * kernel_dim]
        activation = x @ self.T
        
        # Reshape for pairwise comparison
        # [batch, output_features, kernel_dim]
        activation = activation.view(batch_size, self.output_features, 
                                     self.kernel_dim)
        
        # Compute L1 distance for all pairs
        # [batch, batch, output_features, kernel_dim]
        diffs = activation.unsqueeze(0) - activation.unsqueeze(1)
        abs_diffs = torch.abs(diffs).sum(dim=3)  # [batch, batch, output_features]
        
        # Convert to similarity (negative exponent of distance)
        similarities = torch.exp(-abs_diffs)  # [batch, batch, output_features]
        
        # Sum over other samples in batch (excluding self)
        # [batch, output_features]
        minibatch_features = similarities.sum(dim=1) - 1  # Subtract self-similarity
        
        # Concatenate with original features
        return torch.cat([x, minibatch_features], dim=1)
 
 
# Example usage in discriminator
class DiscriminatorWithMinibatch(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.2)
        )
        self.minibatch_disc = MinibatchDiscrimination(hidden_dim, 32, 5)
        self.classifier = nn.Linear(hidden_dim + 32, 1)
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        features = self.features(x)
        features_with_mb = self.minibatch_disc(features)
        return torch.sigmoid(self.classifier(features_with_mb))

How Wasserstein GAN Helps

Wasserstein GAN (WGAN) addresses mode collapse through a fundamentally different objective.

The Problem with JS Divergence:

When $p_g$ and $p_{\text{data}}$ have disjoint supports (don't overlap), JS divergence is constant, providing no gradient. This allows G to concentrate on a tiny region without penalty.

Wasserstein Distance:

Also called Earth Mover's Distance (EMD). Intuitively, it measures the minimum "work" to transform $p_g$ into $p_{\text{data}}$:

$$W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x,y) \sim \gamma}[|x - y|]$$

Key Properties:

Always finite: Even for disjoint distributions
Meaningful gradients: Tells G which direction to move
Smooth: Small changes in $p_g$ produce small changes in $W$

WGAN Objective:

Using Kantorovich-Rubinstein duality:

$$W(p_{\text{data}}, p_g) = \sup_{|D|L \leq 1} \mathbb{E}{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{x \sim p_g}[D(x)]$$

The discriminator (now called "critic") must be 1-Lipschitz. Enforced via weight clipping (original) or gradient penalty (WGAN-GP).

Why WGAN Reduces Mode Collapse

WGAN provides meaningful gradients even when distributions don't overlap. If G concentrates on one mode, there's still a gradient telling it about other modes (because EMD measures distance to all of p_data, not just overlap regions). This prevents the "ignoring modes" failure of JS divergence.

Summary: Mode Collapse

Key Takeaways

•Mode Collapse Definition: Generator produces limited variety, covering only some modes of the data distribution despite producing realistic samples.
•Root Cause: GAN objective doesn't penalize missing modes—only producing unrealistic samples. Generator finds a local optimum by perfecting a few modes.
•Detection Methods: Visual inspection of sample grids, latent diversity analysis, class coverage metrics, and feature covariance analysis.
•Minibatch Discrimination: Gives D access to batch-level statistics, enabling it to detect when all samples are similar.
•Feature Matching: Train G to match intermediate features of D, providing richer gradient signal than just fool/not-fool.
•WGAN: Wasserstein distance provides meaningful gradients even for disjoint distributions, inherently discouraging mode collapse.

Module Complete

Congratulations! You have completed the Generative Adversarial Networks module. You now understand the GAN framework, generator and discriminator architectures, the minimax objective, training dynamics, and the mode collapse problem. This foundation prepares you for exploring advanced GAN variants like DCGAN, Wasserstein GAN, StyleGAN, and conditional GANs.

5 / 5

Loading learning content...

Machine LearningGenerative Adversarial Networks

Generative Adversarial Networks

LevelAdvanced

Duration90 mins

TopicGenerative Adversarial Networks

5 / 5

Mode Collapse

The Diversity Crisis in GANs

What You Will Learn

Understanding Mode Collapse

Definition:

Formally, if $p_{\text{data}}$ has $K$ distinct modes (clusters in data space), a mode-collapsed generator $G$ might only cover $k << K$ of these modes.

Types of Mode Collapse:

Complete Collapse: Generator produces essentially identical outputs regardless of input noise $\mathbf{z}$. All generated samples look the same.
Partial Collapse: Generator covers some modes but ignores others. For face generation, it might produce only female faces, or only faces of a certain ethnicity.
Intra-class Collapse: Within each class (for conditional GANs), diversity is limited. Generating "cats" produces only tabby cats, never Siamese.

Why It Matters:

Mode Collapse Severity Spectrum
Severity	Symptoms	Impact	Detection Difficulty
Complete	All samples nearly identical	Model unusable	Easy—visual inspection
Severe	Few distinct sample types	Very limited diversity	Easy—sample grids
Moderate	Missing major categories	Biased generation	Medium—requires coverage analysis
Subtle	Reduced within-category variety	Slight quality issues	Hard—requires statistical tests

Why Mode Collapse Happens

Mode collapse emerges from the adversarial training dynamic itself. Understanding its root causes helps us prevent it.

The Game-Theoretic Origin:

This creates a local optimum: the generator can achieve low loss by producing one "perfect" sample repeatedly, rather than learning the full distribution.

The Discriminator's Failure:

In theory, the discriminator should prevent this by learning that all samples are the same (hence clearly from the generator). In practice, discriminator training lags behind:

G collapses to mode A
D learns to reject mode A
G switches to mode B
D learns to reject mode B
G switches back to mode A
Cycle continues...

This is mode oscillation—related to mode collapse but even more pathological.

The Missing Diversity Term

Mathematical Perspective:

Recall the discriminator's optimal response: $$D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$

If $p_g$ concentrates on a small region, $D^*$ approaches $0.5$ only in that region. The generator receives gradients only from that region, reinforcing concentration rather than spreading.

Capacity Mismatch:

If the generator has limited capacity, it may not be able to represent the full data distribution. It will naturally focus on easier-to-generate modes.

Data Imbalance:

If some modes are more common in training data, the discriminator sees them more often and focuses on them. The generator follows, ignoring rare modes.

Detecting Mode Collapse

Early detection is crucial—mode collapse often worsens if uncaught. Here are methods for detection:

1. Visual Inspection:

Generate a grid of samples from different $\mathbf{z}$ vectors. Look for:

Repeated patterns or identical faces
Missing categories (e.g., no elderly faces)
Lack of variety in features (same pose, background, lighting)

2. Latent Space Analysis:

Generate samples from distant points in latent space. They should be distinct:

Sample $G(\mathbf{z}_1)$ and $G(\mathbf{z}_2)$ where $|\mathbf{z}_1 - \mathbf{z}_2| >> 0$
Compute image-space distance
Low distance despite high latent distance indicates collapse

3. Reverse KL Divergence (Approximated):

mode_collapse_detection.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
"""
Methods for Detecting Mode Collapse
"""
import torch
import numpy as np
from scipy.stats import entropy
 
def latent_diversity_check(generator, num_samples=100, device='cuda'):
    """
    Check if different latent vectors produce different outputs.
    Low diversity score indicates mode collapse.
    """
    with torch.no_grad():
        z_samples = torch.randn(num_samples, generator.latent_dim, device=device)
        generated = generator(z_samples)
        
        # Flatten samples
        generated_flat = generated.view(num_samples, -1)
        
        # Compute pairwise L2 distances
        dists = torch.cdist(generated_flat, generated_flat)
        
        # Average non-diagonal distance
        mask = ~torch.eye(num_samples, dtype=bool, device=device)
        avg_dist = dists[mask].mean().item()
        
    return avg_dist
 
 
def class_coverage_check(generator, classifier, num_samples=1000, 
                         num_classes=10, device='cuda'):
    """
    For conditional or class-producing GANs, check coverage of all classes.
    Uses a pretrained classifier to label generated samples.
    """
    with torch.no_grad():
        z = torch.randn(num_samples, generator.latent_dim, device=device)
        generated = generator(z)
        
        # Classify generated samples
        logits = classifier(generated)
        predictions = logits.argmax(dim=1)
        
        # Count samples per class
        class_counts = torch.bincount(predictions, minlength=num_classes)
        coverage = (class_counts > 0).float().mean().item()
        
        # Entropy of class distribution (higher = more balanced)
        probs = class_counts.float() / num_samples
        class_entropy = entropy(probs.cpu().numpy())
        
    return {
        'coverage': coverage,  # Fraction of classes with at least one sample
        'entropy': class_entropy,  # Uniformity of class distribution
        'class_counts': class_counts.cpu().numpy()
    }
 
 
def inception_based_diversity(generator, inception_model, num_samples=1000):
    """
    Compute diversity using Inception features (for FID-style analysis).
    Low feature variance indicates mode collapse.
    """
    with torch.no_grad():
        # Generate samples and extract features
        z = torch.randn(num_samples, generator.latent_dim)
        generated = generator(z)
        features = inception_model(generated)  # [N, feature_dim]
        
        # Compute feature covariance determinant
        # Low determinant = features concentrated = mode collapse
        cov = torch.cov(features.T)
        det = torch.linalg.det(cov).item()
        
    return {'feature_covariance_det': det}

Prevention Techniques

Numerous techniques have been developed to combat mode collapse. Here are the most effective:

1. Minibatch Discrimination:

Give the discriminator access to multiple samples simultaneously. If all samples look similar, it should flag them as fake.

Implementation: Add a layer that computes statistics across the minibatch (e.g., average pairwise distances) and appends them to each sample's features.

2. Feature Matching:

Instead of the standard GAN loss, train G to match intermediate features of the discriminator:

$$\mathcal{L}_G = |\mathbb{E}_z[f(G(\mathbf{z}))] - \mathbb{E}_x[f(\mathbf{x})]|^2$$

where $f(\cdot)$ is an intermediate layer of D. This encourages G to match statistics of real data, not just fool D.

3. Unrolled GANs:

Train G against a "future" version of D by unrolling D's optimization steps:

Compute D's update: $D' = D - \alpha \nabla_D \mathcal{L}_D$
Use $D'$ to compute G's loss
This makes G optimize against where D is going, not where it is

Architecture-Based Solutions

•Multiple Generators: Train several G's against one D. Diversity emerges from ensemble.
•Progressive Growing: Start at low resolution, add layers gradually. Prevents early mode concentration.
•Self-Attention: Long-range dependencies help capture global modes, not just local patterns.

Objective-Based Solutions

•Mode Regularization: Add explicit diversity term to G's loss (e.g., maximize entropy of outputs).
•Energy-Based GANs: Modify D to output energy rather than probability. Better gradient coverage.
•Wasserstein Distance: WGAN provides gradients even for disjoint supports.

Minibatch Discrimination Deep Dive

Minibatch discrimination is one of the most effective anti-collapse techniques. Let's examine it in detail.

Intuition:

Mechanism:

Extract features from each sample: $f(x_i) \in \mathbb{R}^A$
Multiply by tensor $T \in \mathbb{R}^{A \times B \times C}$: $M_i = f(x_i) \cdot T$
Compute pairwise distances: $c_b(x_i, x_j) = \exp(-|M_{i,b} - M_{j,b}|)$
Sum over batch: $o(x_i)_b = \sum_j c_b(x_i, x_j)$
Concatenate to features: $[f(x_i), o(x_i)]$ goes to next layer

If all samples are similar, $o(x_i)$ will be large (many close neighbors), signaling mode collapse to D.

minibatch_discrimination.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
"""
Minibatch Discrimination Layer
"""
import torch
import torch.nn as nn
 
class MinibatchDiscrimination(nn.Module):
    """
    Computes similarity statistics across minibatch.
    Helps discriminator detect when all samples are similar (mode collapse).
    """
    def __init__(self, input_features, output_features, kernel_dim=5):
        super().__init__()
        self.input_features = input_features
        self.output_features = output_features
        self.kernel_dim = kernel_dim
        
        # Tensor T that transforms input to comparison space
        self.T = nn.Parameter(torch.randn(
            input_features, output_features * kernel_dim
        ) * 0.01)
    
    def forward(self, x):
        # x: [batch, input_features]
        batch_size = x.size(0)
        
        # Transform to comparison space
        # [batch, output_features * kernel_dim]
        activation = x @ self.T
        
        # Reshape for pairwise comparison
        # [batch, output_features, kernel_dim]
        activation = activation.view(batch_size, self.output_features, 
                                     self.kernel_dim)
        
        # Compute L1 distance for all pairs
        # [batch, batch, output_features, kernel_dim]
        diffs = activation.unsqueeze(0) - activation.unsqueeze(1)
        abs_diffs = torch.abs(diffs).sum(dim=3)  # [batch, batch, output_features]
        
        # Convert to similarity (negative exponent of distance)
        similarities = torch.exp(-abs_diffs)  # [batch, batch, output_features]
        
        # Sum over other samples in batch (excluding self)
        # [batch, output_features]
        minibatch_features = similarities.sum(dim=1) - 1  # Subtract self-similarity
        
        # Concatenate with original features
        return torch.cat([x, minibatch_features], dim=1)
 
 
# Example usage in discriminator
class DiscriminatorWithMinibatch(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.2)
        )
        self.minibatch_disc = MinibatchDiscrimination(hidden_dim, 32, 5)
        self.classifier = nn.Linear(hidden_dim + 32, 1)
    
    def forward(self, x):
        x = x.view(x.size(0), -1)
        features = self.features(x)
        features_with_mb = self.minibatch_disc(features)
        return torch.sigmoid(self.classifier(features_with_mb))

How Wasserstein GAN Helps

Wasserstein GAN (WGAN) addresses mode collapse through a fundamentally different objective.

The Problem with JS Divergence:

When $p_g$ and $p_{\text{data}}$ have disjoint supports (don't overlap), JS divergence is constant, providing no gradient. This allows G to concentrate on a tiny region without penalty.

Wasserstein Distance:

Also called Earth Mover's Distance (EMD). Intuitively, it measures the minimum "work" to transform $p_g$ into $p_{\text{data}}$:

$$W(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x,y) \sim \gamma}[|x - y|]$$

Key Properties:

Always finite: Even for disjoint distributions
Meaningful gradients: Tells G which direction to move
Smooth: Small changes in $p_g$ produce small changes in $W$

WGAN Objective:

Using Kantorovich-Rubinstein duality:

$$W(p_{\text{data}}, p_g) = \sup_{|D|L \leq 1} \mathbb{E}{x \sim p_{\text{data}}}[D(x)] - \mathbb{E}_{x \sim p_g}[D(x)]$$

The discriminator (now called "critic") must be 1-Lipschitz. Enforced via weight clipping (original) or gradient penalty (WGAN-GP).

Why WGAN Reduces Mode Collapse

Summary: Mode Collapse

Key Takeaways

•Mode Collapse Definition: Generator produces limited variety, covering only some modes of the data distribution despite producing realistic samples.
•Root Cause: GAN objective doesn't penalize missing modes—only producing unrealistic samples. Generator finds a local optimum by perfecting a few modes.
•Detection Methods: Visual inspection of sample grids, latent diversity analysis, class coverage metrics, and feature covariance analysis.
•Minibatch Discrimination: Gives D access to batch-level statistics, enabling it to detect when all samples are similar.
•Feature Matching: Train G to match intermediate features of D, providing richer gradient signal than just fool/not-fool.
•WGAN: Wasserstein distance provides meaningful gradients even for disjoint distributions, inherently discouraging mode collapse.

Module Complete

5 / 5