The GAN objective is not a simple loss function to minimize. It is a minimax game in which two players with opposing goals seek equilibrium. This game-theoretic formulation sets GANs apart from likelihood-based generative models and is both the source of their power and of their notorious training difficulties.
Understanding the minimax objective deeply reveals why GANs work, when they fail, and how various modifications address their shortcomings. This page provides a rigorous mathematical treatment of the objective, its theoretical guarantees, and the practical alternatives that have improved GAN training.
By the end of this page, you will understand: the formal minimax formulation and its game-theoretic interpretation, the derivation of the optimal discriminator, why the vanilla objective causes training problems, and the non-saturating and other loss variants that address these issues.
The GAN objective defines a two-player minimax game:
$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$$
Parsing the Objective:
The value function $V(D, G)$ has two terms:
$\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})]$: Expected log-probability the discriminator assigns to real data being real. Maximized when $D(\mathbf{x}) \rightarrow 1$ for real $\mathbf{x}$.
$\mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$: Expected log-probability the discriminator assigns to fake data being fake. Maximized when $D(G(\mathbf{z})) \rightarrow 0$ for generated samples.
The Game:
The discriminator $D$ plays the maximizing role: it pushes $V$ up by assigning high probability to real samples and low probability to generated ones. The generator $G$ plays the minimizing role: it pushes $V$ down by producing samples that $D$ scores as real.
Relation to Binary Cross-Entropy:
The discriminator's objective is, up to a constant factor, the negative binary cross-entropy for classifying real vs. fake:
$$\mathcal{L}_D = -\frac{1}{2}\left(\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{x} \sim p_g}[\log(1 - D(\mathbf{x}))]\right)$$
Maximizing $V$ over $D$ is therefore equivalent to minimizing the standard binary cross-entropy loss with real samples labeled 1 and fake samples labeled 0.
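To see this equivalence numerically, here is a minimal NumPy sketch; the discriminator outputs are made-up illustrative values, not from a trained model:

```python
"""Numerical check: maximizing V(D, G) == minimizing binary cross-entropy."""
import numpy as np

rng = np.random.default_rng(0)
d_real = rng.uniform(0.6, 0.99, size=1000)  # hypothetical D(x) on real samples
d_fake = rng.uniform(0.01, 0.4, size=1000)  # hypothetical D(G(z)) on fake samples

# Value function V(D, G): what the discriminator maximizes
V = np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# Binary cross-entropy with labels real -> 1, fake -> 0, averaged over both halves
bce = -0.5 * (np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

print(f"V(D, G)  = {V:.4f}")
print(f"BCE loss = {bce:.4f}")
print(f"-V/2 == BCE? {np.isclose(-V / 2, bce)}")  # True: same objective, opposite sign
```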
The GAN minimax game is zero-sum: the generator's loss is the negative of the discriminator's gain. At equilibrium, neither player can improve unilaterally. This is a Nash equilibrium in game theory—but reaching it through gradient descent is not guaranteed.
For any fixed generator $G$, we can derive the optimal discriminator analytically. This derivation is fundamental to understanding GANs.
Derivation:
For fixed $G$, the value function can be written as:
$$V(D, G) = \int_{\mathbf{x}} \left[\, p_{\text{data}}(\mathbf{x}) \log D(\mathbf{x}) + p_g(\mathbf{x}) \log(1 - D(\mathbf{x})) \,\right] d\mathbf{x}$$
For each point $\mathbf{x}$, we're maximizing:
$$f(y) = a \log y + b \log(1 - y)$$
where $a = p_{\text{data}}(\mathbf{x})$, $b = p_g(\mathbf{x})$, and $y = D(\mathbf{x})$.
Taking the derivative and setting to zero:
$$\frac{\partial f}{\partial y} = \frac{a}{y} - \frac{b}{1-y} = 0$$
Solving: $a(1-y) = by$, so $a = y(a+b)$, giving:
$$D^*_G(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$
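The algebra can also be verified symbolically. The following SymPy sketch solves $\partial f / \partial y = 0$ and confirms the critical point is a maximum:

```python
"""Symbolic check of the optimal-discriminator derivation."""
import sympy as sp

a, b, y = sp.symbols("a b y", positive=True)
f = a * sp.log(y) + b * sp.log(1 - y)

# Solve df/dy = 0 for y
print(sp.solve(sp.diff(f, y), y))  # [a/(a + b)]  ->  D* = p_data / (p_data + p_g)

# Second derivative at the critical point is negative, so it is a maximum
print(sp.simplify(sp.diff(f, y, 2).subs(y, a / (a + b))))  # -(a + b)**3 / (a*b)
```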
"""Analysis of the Optimal Discriminator Properties"""import numpy as npimport matplotlib.pyplot as plt def optimal_discriminator(p_data, p_g): """D*(x) = p_data(x) / (p_data(x) + p_g(x))""" return p_data / (p_data + p_g + 1e-8) # Scenario Analysisprint("Optimal Discriminator Behavior")print("=" * 50) # Case 1: Region where only real data existsp_data, p_g = 1.0, 0.0print(f"Real-only region: D* = {optimal_discriminator(p_data, p_g):.3f} (confident real)") # Case 2: Region where only fake data exists p_data, p_g = 0.0, 1.0print(f"Fake-only region: D* = {optimal_discriminator(p_data, p_g):.3f} (confident fake)") # Case 3: Equal density (perfect generator)p_data, p_g = 0.5, 0.5print(f"Equal density: D* = {optimal_discriminator(p_data, p_g):.3f} (can't distinguish)") # Case 4: Generator approaching real distributionfor ratio in [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]: p_data = 1.0 p_g = ratio d_star = optimal_discriminator(p_data, p_g) print(f"p_g/p_data = {ratio:.1f}: D* = {d_star:.3f}")Key Insights from $D^*$:
At equilibrium: When $p_g = p_{\text{data}}$, $D^*(\mathbf{x}) = 1/2$ everywhere. The discriminator cannot distinguish real from fake.
Density ratio estimation: $\frac{D^*(\mathbf{x})}{1 - D^*(\mathbf{x})} = \frac{p_{\text{data}}(\mathbf{x})}{p_g(\mathbf{x})}$
Gradient information: Even where $D^*(\mathbf{x}) < 1/2$, its value tells the generator where the generated density exceeds the real data density.
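The density-ratio identity is easy to check directly; in this NumPy sketch the densities are made-up illustrative numbers:

```python
"""The optimal discriminator implicitly encodes the density ratio p_data / p_g."""
import numpy as np

p_data = np.array([0.8, 0.5, 0.2, 0.1])  # hypothetical real densities at four points
p_g    = np.array([0.2, 0.5, 0.8, 0.1])  # hypothetical generator densities

d_star = p_data / (p_data + p_g)  # optimal discriminator D*
ratio  = d_star / (1 - d_star)    # D* / (1 - D*)

print(ratio)         # [4.   1.   0.25 1.  ]
print(p_data / p_g)  # identical: the discriminator recovers the density ratio
```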
Substituting $D^*$ back into $V$ reveals what the generator is truly minimizing.
The Reformulated Objective:
$$C(G) = \max_D V(D, G) = V(D^*_G, G)$$
After substitution and algebraic manipulation:
$$C(G) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \,\|\, p_g)$$
where the Jensen-Shannon divergence is:
$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}\!\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\Big\|\, \frac{P + Q}{2}\right)$$
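The identity $C(G) = -\log 4 + 2 D_{JS}$ can be checked on a toy discrete example. This NumPy sketch uses two made-up three-point distributions:

```python
"""Numeric check: V(D*, G) == -log 4 + 2 * JSD(p_data || p_g)."""
import numpy as np

p_data = np.array([0.7, 0.2, 0.1])  # hypothetical real distribution
p_g    = np.array([0.1, 0.3, 0.6])  # hypothetical generator distribution

# Value function evaluated at the optimal discriminator
d_star = p_data / (p_data + p_g)
V = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# Jensen-Shannon divergence from its definition
m = 0.5 * (p_data + p_g)
jsd = 0.5 * np.sum(p_data * np.log(p_data / m)) + 0.5 * np.sum(p_g * np.log(p_g / m))

print(f"V(D*, G)      = {V:.6f}")
print(f"-log4 + 2*JSD = {-np.log(4) + 2 * jsd:.6f}")  # matches V(D*, G)
```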
Properties of JS Divergence:
| Property | Description | Implication for GANs |
|---|---|---|
| Symmetry | $D_{JS}(P \Vert Q) = D_{JS}(Q \Vert P)$ | Fair treatment of real and fake distributions |
| Bounded | $0 \leq D_{JS} \leq \log 2$ | Loss is bounded, unlike KL divergence |
| Zero iff equal | $D_{JS} = 0 \Leftrightarrow P = Q$ | Global optimum at perfect generation |
| Defined for disjoint | Finite even if supports don't overlap | Always computable, but gradients vanish |
When $p_{\text{data}}$ and $p_g$ have disjoint supports (no overlap), the JS divergence is constant at $\log 2$, so its gradient with respect to the generator is zero. Early in training, high-dimensional data distributions rarely overlap significantly, causing vanishing gradients. This is a fundamental limitation addressed by Wasserstein GAN.
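A small discrete example makes the saturation visible; the four-point distributions below are illustrative, not from any dataset:

```python
"""JS divergence saturates at log(2) once supports are disjoint."""
import numpy as np

def kl(p, q):
    mask = p > 0  # KL convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Overlapping supports: JS varies smoothly and carries a learning signal
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(f"Overlapping: JS = {js(p, q):.4f}")

# Disjoint supports: JS is pinned at log 2, regardless of how far apart they are
q = np.array([0.0, 0.0, 0.5, 0.5])
print(f"Disjoint:    JS = {js(p, q):.4f}  (log 2 = {np.log(2):.4f})")
```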
The vanilla GAN objective has a critical flaw: it provides weak gradients for the generator when training starts.
The Problem:
The generator's loss is $\mathbb{E}[\log(1 - D(G(\mathbf{z})))]$.
Early in training:
The generator produces obviously fake samples, so the discriminator rejects them confidently: $D(G(\mathbf{z})) \approx 0$.
Near $D(G(\mathbf{z})) = 0$, the curve $\log(1 - D(G(\mathbf{z})))$ is nearly flat, so only a weak gradient flows back to the generator.
This is gradient saturation: when the generator needs the most learning signal, it receives the weakest gradients.
Visualization:
Consider $f(D) = \log(1 - D)$: near $D = 0$ the slope is $-1$ (shallow), and the gradient magnitude $\frac{1}{1 - D}$ grows without bound only as $D \rightarrow 1$. Gradients are weak exactly where early-training samples land ($D(G(\mathbf{z})) \approx 0$).
The generator only gets strong gradients when it's already doing well!
"""Analyzing the Gradient Saturation Problem"""import numpy as np def vanilla_generator_gradient(D_G_z): """Gradient of log(1 - D(G(z))) w.r.t. D(G(z))""" return -1 / (1 - D_G_z + 1e-8) def non_saturating_gradient(D_G_z): """Gradient of -log(D(G(z))) w.r.t. D(G(z))""" return 1 / (D_G_z + 1e-8) print("Gradient Comparison: Vanilla vs Non-Saturating")print("=" * 55) for d in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]: v_grad = abs(vanilla_generator_gradient(d)) ns_grad = abs(non_saturating_gradient(d)) print(f"D(G(z)) = {d:.2f}: Vanilla gradient = {v_grad:8.2f}, " f"Non-sat gradient = {ns_grad:8.2f}") # Key insight: When D(G(z)) is small (early training), # vanilla gradient ~1, but non-saturating gradient is hugeGoodfellow et al. proposed a simple fix in the original paper: instead of minimizing $\log(1 - D(G(\mathbf{z})))$, maximize $\log D(G(\mathbf{z}))$.
The Non-Saturating Loss:
$$\mathcal{L}_G = -\mathbb{E}_{\mathbf{z} \sim p_z}[\log D(G(\mathbf{z}))]$$
Why This Works:
The gradient of $-\log D(G(\mathbf{z}))$ with respect to $D(G(\mathbf{z}))$ has magnitude $1/D(G(\mathbf{z}))$: enormous when the discriminator confidently rejects samples ($D \approx 0$) and small when it is fooled ($D \approx 1$). This flips the gradient behavior: the generator gets a strong learning signal precisely when it needs it most (early training).
Mathematical Subtlety:
The non-saturating loss changes the game from zero-sum to general-sum: the generator and discriminator are no longer playing exactly opposite games. In practice this rarely matters; training still converges to similar solutions, but with much better gradient flow.
The non-saturating loss is used in virtually all GAN implementations despite theoretical impurity. It's the practical default. When papers say 'GAN loss' without qualification, they typically mean the non-saturating variant.
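As a concrete illustration, here is a minimal PyTorch-style sketch of one non-saturating generator update. The names `G`, `D`, `opt_G`, `batch_size`, and `z_dim` are placeholders, and `D` is assumed to end in a sigmoid so its outputs are probabilities:

```python
"""One non-saturating generator step (sketch; G, D, opt_G are assumed to exist)."""
import torch
import torch.nn.functional as F

def generator_step(G, D, opt_G, batch_size, z_dim, device="cpu"):
    z = torch.randn(batch_size, z_dim, device=device)
    d_fake = D(G(z))  # probabilities in (0, 1), assuming D ends in a sigmoid
    # Maximize log D(G(z)) == minimize -log D(G(z)),
    # i.e. binary cross-entropy against the "real" label (ones):
    loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```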
Beyond the non-saturating loss, several alternatives have been proposed to improve training stability:
Least Squares GAN (LSGAN):
$$\mathcal{L}_D = \mathbb{E}[(D(\mathbf{x}) - 1)^2] + \mathbb{E}[D(G(\mathbf{z}))^2]$$ $$\mathcal{L}_G = \mathbb{E}[(D(G(\mathbf{z})) - 1)^2]$$
Uses MSE instead of cross-entropy. Provides non-vanishing gradients for samples far from decision boundary.
Hinge Loss:
$$\mathcal{L}_D = \mathbb{E}[\max(0, 1 - D(\mathbf{x}))] + \mathbb{E}[\max(0, 1 + D(G(\mathbf{z})))]$$ $$\mathcal{L}_G = -\mathbb{E}[D(G(\mathbf{z}))]$$
Popular in BigGAN and other large-scale models. Provides stable gradients.
Wasserstein Loss (WGAN):
$$\mathcal{L}_D = \mathbb{E}[D(G(\mathbf{z}))] - \mathbb{E}[D(\mathbf{x})]$$ $$\mathcal{L}_G = -\mathbb{E}[D(G(\mathbf{z}))]$$
Minimizes Earth Mover's distance. Requires Lipschitz constraint on discriminator.
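For comparison, the three variants can be written as small loss functions. This sketch assumes `d_real` and `d_fake` are raw discriminator outputs (logits, no sigmoid), as the hinge and Wasserstein formulations expect; in a real training loop the generator loss would use a fresh forward pass:

```python
"""LSGAN, hinge, and WGAN losses side by side (sketch over raw scores)."""
import torch
import torch.nn.functional as F

def lsgan_losses(d_real, d_fake):
    """LSGAN: least squares against targets 1 (real) and 0 (fake)."""
    loss_d = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    loss_g = ((d_fake - 1) ** 2).mean()
    return loss_d, loss_g

def hinge_losses(d_real, d_fake):
    """Hinge loss: margin-based, as used in BigGAN-scale models."""
    loss_d = F.relu(1 - d_real).mean() + F.relu(1 + d_fake).mean()
    loss_g = -d_fake.mean()
    return loss_d, loss_g

def wgan_losses(d_real, d_fake):
    """WGAN critic loss (Lipschitz constraint must be enforced separately)."""
    loss_d = d_fake.mean() - d_real.mean()
    loss_g = -d_fake.mean()
    return loss_d, loss_g
```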
| Loss Type | Divergence Minimized | Key Advantage | Consideration |
|---|---|---|---|
| Vanilla | Jensen-Shannon | Theoretically grounded | Vanishing gradients |
| Non-Saturating | ~JS (modified) | Strong early gradients | Standard practice |
| LSGAN | Pearson χ² | Stable training | May underfit |
| Hinge | ~Support matching | Works at scale | Less theoretical basis |
| Wasserstein | Earth Mover's | Meaningful loss curve | Requires Lipschitz constraint |
You now understand the mathematical foundation of GAN training—the minimax objective, optimal discriminator properties, and practical loss modifications. Next, we'll examine training dynamics: how these objectives play out in practice, convergence challenges, and stabilization techniques.