The GAN objective is not a simple loss function to minimize. It is a minimax game in which two players with opposing goals seek equilibrium. This game-theoretic formulation sets GANs apart from likelihood-based generative models and is both the source of their power and of their notorious training difficulties.
Understanding the minimax objective deeply reveals why GANs work, when they fail, and how various modifications address their shortcomings. This page provides a rigorous mathematical treatment of the objective, its theoretical guarantees, and the practical alternatives that have improved GAN training.
By the end of this page, you will understand: the formal minimax formulation and its game-theoretic interpretation, the derivation of the optimal discriminator, why the vanilla objective causes training problems, and the non-saturating and other loss variants that address these issues.
The GAN objective defines a two-player minimax game:
$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$$
Parsing the Objective:
The value function $V(D, G)$ has two terms:
$\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})]$: Expected log-probability the discriminator assigns to real data being real. Maximized when $D(\mathbf{x}) \rightarrow 1$ for real $\mathbf{x}$.
$\mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))]$: Expected log-probability the discriminator assigns to fake data being fake. Maximized when $D(G(\mathbf{z})) \rightarrow 0$ for generated samples.
The Game:
The discriminator $D$ plays the maximizing role: it pushes $V$ up by assigning high probability to real samples and low probability to generated ones. The generator $G$ plays the minimizing role: it pushes $V$ down by producing samples that $D$ scores as real.
Relation to Binary Cross-Entropy:
The discriminator's objective is, up to a constant factor, the negative binary cross-entropy for classifying real vs. fake:
$$\mathcal{L}_D = -\frac{1}{2}\left(\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{x} \sim p_g}[\log(1 - D(\mathbf{x}))]\right)$$
Maximizing $V$ over $D$ is therefore equivalent to minimizing the standard binary cross-entropy loss with real samples labeled 1 and fake samples labeled 0.
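To see this equivalence numerically, here is a minimal NumPy sketch; the discriminator outputs are made-up illustrative values, not from a trained model:

```python
"""Numerical check: maximizing V(D, G) == minimizing binary cross-entropy."""
import numpy as np

rng = np.random.default_rng(0)
d_real = rng.uniform(0.6, 0.99, size=1000)  # hypothetical D(x) on real samples
d_fake = rng.uniform(0.01, 0.4, size=1000)  # hypothetical D(G(z)) on fake samples

# Value function V(D, G): what the discriminator maximizes
V = np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# Binary cross-entropy with labels real -> 1, fake -> 0, averaged over both halves
bce = -0.5 * (np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

print(f"V(D, G)  = {V:.4f}")
print(f"BCE loss = {bce:.4f}")
print(f"-V/2 == BCE? {np.isclose(-V / 2, bce)}")  # True: same objective, opposite sign
```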
The GAN minimax game is zero-sum: the generator's loss is the negative of the discriminator's gain. At equilibrium, neither player can improve unilaterally. This is a Nash equilibrium in game theory—but reaching it through gradient descent is not guaranteed.
For any fixed generator $G$, we can derive the optimal discriminator analytically. This derivation is fundamental to understanding GANs.
Derivation:
For fixed $G$, the value function can be written as:
$$V(D, G) = \int_{\mathbf{x}} \left[\, p_{\text{data}}(\mathbf{x}) \log D(\mathbf{x}) + p_g(\mathbf{x}) \log(1 - D(\mathbf{x})) \,\right] d\mathbf{x}$$
For each point $\mathbf{x}$, we're maximizing:
$$f(y) = a \log y + b \log(1 - y)$$
where $a = p_{\text{data}}(\mathbf{x})$, $b = p_g(\mathbf{x})$, and $y = D(\mathbf{x})$.
Taking the derivative and setting to zero:
$$\frac{\partial f}{\partial y} = \frac{a}{y} - \frac{b}{1-y} = 0$$
Solving: $a(1-y) = by$, so $a = y(a+b)$, giving:
$$D^*_G(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}$$
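The algebra can also be verified symbolically. The following SymPy sketch solves $\partial f / \partial y = 0$ and confirms the critical point is a maximum:

```python
"""Symbolic check of the optimal-discriminator derivation."""
import sympy as sp

a, b, y = sp.symbols("a b y", positive=True)
f = a * sp.log(y) + b * sp.log(1 - y)

# Solve df/dy = 0 for y
print(sp.solve(sp.diff(f, y), y))  # [a/(a + b)]  ->  D* = p_data / (p_data + p_g)

# Second derivative at the critical point is negative, so it is a maximum
print(sp.simplify(sp.diff(f, y, 2).subs(y, a / (a + b))))  # -(a + b)**3 / (a*b)
```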
"""Analysis of the Optimal Discriminator Properties"""import numpy as npimport matplotlib.pyplot as plt def optimal_discriminator(p_data, p_g): """D*(x) = p_data(x) / (p_data(x) + p_g(x))""" return p_data / (p_data + p_g + 1e-8) # Scenario Analysisprint("Optimal Discriminator Behavior")print("=" * 50) # Case 1: Region where only real data existsp_data, p_g = 1.0, 0.0print(f"Real-only region: D* = {optimal_discriminator(p_data, p_g):.3f} (confident real)") # Case 2: Region where only fake data exists p_data, p_g = 0.0, 1.0print(f"Fake-only region: D* = {optimal_discriminator(p_data, p_g):.3f} (confident fake)") # Case 3: Equal density (perfect generator)p_data, p_g = 0.5, 0.5print(f"Equal density: D* = {optimal_discriminator(p_data, p_g):.3f} (can't distinguish)") # Case 4: Generator approaching real distributionfor ratio in [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]: p_data = 1.0 p_g = ratio d_star = optimal_discriminator(p_data, p_g) print(f"p_g/p_data = {ratio:.1f}: D* = {d_star:.3f}")Key Insights from $D^*$:
At equilibrium: When $p_g = p_{\text{data}}$, $D^*(\mathbf{x}) = 1/2$ everywhere. The discriminator cannot distinguish real from fake.
Density ratio estimation: $\frac{D^*(\mathbf{x})}{1 - D^*(\mathbf{x})} = \frac{p_{\text{data}}(\mathbf{x})}{p_g(\mathbf{x})}$
Gradient information: Even where $D^*(\mathbf{x}) < 1/2$, its value tells the generator where the generated density exceeds the real data density.
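The density-ratio identity is easy to check directly; in this NumPy sketch the densities are made-up illustrative numbers:

```python
"""The optimal discriminator implicitly encodes the density ratio p_data / p_g."""
import numpy as np

p_data = np.array([0.8, 0.5, 0.2, 0.1])  # hypothetical real densities at four points
p_g    = np.array([0.2, 0.5, 0.8, 0.1])  # hypothetical generator densities

d_star = p_data / (p_data + p_g)  # optimal discriminator D*
ratio  = d_star / (1 - d_star)    # D* / (1 - D*)

print(ratio)         # [4.   1.   0.25 1.  ]
print(p_data / p_g)  # identical: the discriminator recovers the density ratio
```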
Substituting $D^*$ back into $V$ reveals what the generator is truly minimizing.
The Reformulated Objective:
$$C(G) = \max_D V(D, G) = V(D^*_G, G)$$
After substitution and algebraic manipulation:
$$C(G) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \,\|\, p_g)$$
where the Jensen-Shannon divergence is:
$$D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}\!\left(P \,\Big\|\, \frac{P + Q}{2}\right) + \frac{1}{2} D_{KL}\!\left(Q \,\Big\|\, \frac{P + Q}{2}\right)$$
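The identity $C(G) = -\log 4 + 2 D_{JS}$ can be checked on a toy discrete example. This NumPy sketch uses two made-up three-point distributions:

```python
"""Numeric check: V(D*, G) == -log 4 + 2 * JSD(p_data || p_g)."""
import numpy as np

p_data = np.array([0.7, 0.2, 0.1])  # hypothetical real distribution
p_g    = np.array([0.1, 0.3, 0.6])  # hypothetical generator distribution

# Value function evaluated at the optimal discriminator
d_star = p_data / (p_data + p_g)
V = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# Jensen-Shannon divergence from its definition
m = 0.5 * (p_data + p_g)
jsd = 0.5 * np.sum(p_data * np.log(p_data / m)) + 0.5 * np.sum(p_g * np.log(p_g / m))

print(f"V(D*, G)      = {V:.6f}")
print(f"-log4 + 2*JSD = {-np.log(4) + 2 * jsd:.6f}")  # matches V(D*, G)
```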
Properties of JS Divergence:
| Property | Description | Implication for GANs |
|---|---|---|
| Symmetry | $D_{JS}(P \Vert Q) = D_{JS}(Q \Vert P)$ | Fair treatment of real and fake distributions |
| Bounded | $0 \leq D_{JS} \leq \log 2$ | Loss is bounded, unlike KL divergence |
| Zero iff equal | $D_{JS} = 0 \Leftrightarrow P = Q$ | Global optimum at perfect generation |
| Defined for disjoint | Finite even if supports don't overlap | Always computable, but gradients vanish |
When $p_{\text{data}}$ and $p_g$ have disjoint supports (no overlap), the JS divergence is constant at $\log 2$, so its gradient with respect to the generator is zero. Early in training, high-dimensional data distributions rarely overlap significantly, causing vanishing gradients. This is a fundamental limitation addressed by Wasserstein GAN.
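A small discrete example makes the saturation visible; the four-point distributions below are illustrative, not from any dataset:

```python
"""JS divergence saturates at log(2) once supports are disjoint."""
import numpy as np

def kl(p, q):
    mask = p > 0  # KL convention: 0 * log(0/q) = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Overlapping supports: JS varies smoothly and carries a learning signal
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(f"Overlapping: JS = {js(p, q):.4f}")

# Disjoint supports: JS is pinned at log 2, regardless of how far apart they are
q = np.array([0.0, 0.0, 0.5, 0.5])
print(f"Disjoint:    JS = {js(p, q):.4f}  (log 2 = {np.log(2):.4f})")
```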
The vanilla GAN objective has a critical flaw: it provides weak gradients for the generator when training starts.
The Problem:
The generator's loss is $\mathbb{E}[\log(1 - D(G(\mathbf{z})))]$.
Early in training:
The generator produces obviously fake samples, so the discriminator rejects them confidently: $D(G(\mathbf{z})) \approx 0$.
Near $D(G(\mathbf{z})) = 0$, the curve $\log(1 - D(G(\mathbf{z})))$ is nearly flat, so only a weak gradient flows back to the generator.
This is gradient saturation: when the generator needs the most learning signal, it receives the weakest gradients.
Visualization:
Consider $f(D) = \log(1 - D)$: near $D = 0$ the slope is $-1$ (shallow), and the gradient magnitude $\frac{1}{1 - D}$ grows without bound only as $D \rightarrow 1$. Gradients are weak exactly where early-training samples land ($D(G(\mathbf{z})) \approx 0$).
The generator only gets strong gradients when it's already doing well!
"""Analyzing the Gradient Saturation Problem"""import numpy as np def vanilla_generator_gradient(D_G_z): """Gradient of log(1 - D(G(z))) w.r.t. D(G(z))""" return -1 / (1 - D_G_z + 1e-8) def non_saturating_gradient(D_G_z): """Gradient of -log(D(G(z))) w.r.t. D(G(z))""" return 1 / (D_G_z + 1e-8) print("Gradient Comparison: Vanilla vs Non-Saturating")print("=" * 55) for d in [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]: v_grad = abs(vanilla_generator_gradient(d)) ns_grad = abs(non_saturating_gradient(d)) print(f"D(G(z)) = {d:.2f}: Vanilla gradient = {v_grad:8.2f}, " f"Non-sat gradient = {ns_grad:8.2f}") # Key insight: When D(G(z)) is small (early training), # vanilla gradient ~1, but non-saturating gradient is hugeGoodfellow et al. proposed a simple fix in the original paper: instead of minimizing $\log(1 - D(G(\mathbf{z})))$, maximize $\log D(G(\mathbf{z}))$.
The Non-Saturating Loss:
$$\mathcal{L}_G = -\mathbb{E}_{\mathbf{z} \sim p_z}[\log D(G(\mathbf{z}))]$$
Why This Works:
The gradient of $-\log D(G(\mathbf{z}))$ with respect to $D(G(\mathbf{z}))$ has magnitude $1/D(G(\mathbf{z}))$: enormous when the discriminator confidently rejects samples ($D \approx 0$) and small when it is fooled ($D \approx 1$). This flips the gradient behavior: the generator gets a strong learning signal precisely when it needs it most (early training).
Mathematical Subtlety:
The non-saturating loss changes the game from zero-sum to general-sum: the generator and discriminator are no longer playing exactly opposite games. In practice this rarely matters; training still converges to similar solutions, but with much better gradient flow.
The non-saturating loss is used in virtually all GAN implementations despite theoretical impurity. It's the practical default. When papers say 'GAN loss' without qualification, they typically mean the non-saturating variant.
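As a concrete illustration, here is a minimal PyTorch-style sketch of one non-saturating generator update. The names `G`, `D`, `opt_G`, `batch_size`, and `z_dim` are placeholders, and `D` is assumed to end in a sigmoid so its outputs are probabilities:

```python
"""One non-saturating generator step (sketch; G, D, opt_G are assumed to exist)."""
import torch
import torch.nn.functional as F

def generator_step(G, D, opt_G, batch_size, z_dim, device="cpu"):
    z = torch.randn(batch_size, z_dim, device=device)
    d_fake = D(G(z))  # probabilities in (0, 1), assuming D ends in a sigmoid
    # Maximize log D(G(z)) == minimize -log D(G(z)),
    # i.e. binary cross-entropy against the "real" label (ones):
    loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()
```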
Beyond the non-saturating loss, several alternatives have been proposed to improve training stability:
Least Squares GAN (LSGAN):
$$\mathcal{L}_D = \mathbb{E}[(D(\mathbf{x}) - 1)^2] + \mathbb{E}[D(G(\mathbf{z}))^2]$$ $$\mathcal{L}_G = \mathbb{E}[(D(G(\mathbf{z})) - 1)^2]$$
Uses MSE instead of cross-entropy. Provides non-vanishing gradients for samples far from decision boundary.
Hinge Loss:
$$\mathcal{L}_D = \mathbb{E}[\max(0, 1 - D(\mathbf{x}))] + \mathbb{E}[\max(0, 1 + D(G(\mathbf{z})))]$$ $$\mathcal{L}_G = -\mathbb{E}[D(G(\mathbf{z}))]$$
Popular in BigGAN and other large-scale models. Provides stable gradients.
Wasserstein Loss (WGAN):
$$\mathcal{L}_D = \mathbb{E}[D(G(\mathbf{z}))] - \mathbb{E}[D(\mathbf{x})]$$ $$\mathcal{L}_G = -\mathbb{E}[D(G(\mathbf{z}))]$$
Minimizes Earth Mover's distance. Requires Lipschitz constraint on discriminator.
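For comparison, the three variants can be written as small loss functions. This sketch assumes `d_real` and `d_fake` are raw discriminator outputs (logits, no sigmoid), as the hinge and Wasserstein formulations expect; in a real training loop the generator loss would use a fresh forward pass:

```python
"""LSGAN, hinge, and WGAN losses side by side (sketch over raw scores)."""
import torch
import torch.nn.functional as F

def lsgan_losses(d_real, d_fake):
    """LSGAN: least squares against targets 1 (real) and 0 (fake)."""
    loss_d = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    loss_g = ((d_fake - 1) ** 2).mean()
    return loss_d, loss_g

def hinge_losses(d_real, d_fake):
    """Hinge loss: margin-based, as used in BigGAN-scale models."""
    loss_d = F.relu(1 - d_real).mean() + F.relu(1 + d_fake).mean()
    loss_g = -d_fake.mean()
    return loss_d, loss_g

def wgan_losses(d_real, d_fake):
    """WGAN critic loss (Lipschitz constraint must be enforced separately)."""
    loss_d = d_fake.mean() - d_real.mean()
    loss_g = -d_fake.mean()
    return loss_d, loss_g
```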
| Loss Type | Divergence Minimized | Key Advantage | Consideration |
|---|---|---|---|
| Vanilla | Jensen-Shannon | Theoretically grounded | Vanishing gradients |
| Non-Saturating | ~JS (modified) | Strong early gradients | Standard practice |
| LSGAN | Pearson χ² | Stable training | May underfit |
| Hinge | ~Support matching | Works at scale | Less theoretical basis |
| Wasserstein | Earth Mover's | Meaningful loss curve | Requires Lipschitz constraint |
You now understand the mathematical foundation of GAN training—the minimax objective, optimal discriminator properties, and practical loss modifications. Next, we'll examine training dynamics: how these objectives play out in practice, convergence challenges, and stabilization techniques.