Evaluating generative models is perhaps the hardest unsolved problem in the field. Unlike discriminative tasks—where accuracy, F1, or AUC provide clear metrics—generative evaluation has no gold standard. What does it mean for a generated image to be 'good'? How do we compare a model that produces realistic but repetitive faces to one with diverse but occasionally distorted outputs?
The difficulty is fundamental: the true data distribution $p^*(x)$ is unknown. We only have samples from it. We want to evaluate whether $p_\theta(x)$ matches $p^*$, but we can't directly compute distances between these distributions.
This page explores the landscape of evaluation approaches: their intuitions, their mathematical foundations, their failure modes, and the practical wisdom that guides experimental work in generative modeling.
By the end of this page, you will understand: why generative evaluation is fundamentally challenging; what log-likelihood measures and when it fails; sample quality metrics (Inception Score, FID, and their variants); the trade-off between sample diversity and sample quality; human evaluation approaches and their limitations; and best practices for rigorous generative model evaluation.
The Core Difficulty:
We want to answer: 'Does $p_\theta(x) = p^*(x)$?' But $p^*$ is only available through finite samples, and for many models $p_\theta$ is itself only accessible through samples, so no direct comparison between the two distributions is possible.
Multiple Desiderata:
What makes a generative model 'good'? Multiple properties matter:
No single metric captures all these aspects. Different metrics emphasize different properties, and optimizing one can hurt others.
A model that memorizes training data has perfect sample quality (every sample is a real image) but zero diversity and no generalization. A model that produces uniform noise has maximum entropy (diversity) but zero realism. Good generative models navigate between these extremes, but metrics often fail to capture this balance properly.
Mode Collapse and Coverage:
A particularly pernicious failure mode is mode collapse: the model generates high-quality samples from only a subset of the data distribution. Consider:
These models might score well on quality metrics (samples are realistic!) but fail catastrophically on coverage. Detecting mode collapse requires metrics that measure diversity, which is itself challenging.
Perceptual vs. Statistical Quality:
Humans perceive image quality differently than statistical measures suggest:
This mismatch between perceptual and statistical quality motivates learned perceptual metrics.
| Failure Mode | Symptom | Detection Challenge |
|---|---|---|
| Mode collapse | Low diversity, missing modes | Requires measuring coverage of true distribution |
| Memorization | Samples identical to training data | Need to compare to training set at scale |
| Blurriness | Individually okay, lack sharp details | Perceptual metrics, not pixel-wise |
| Artifacts | Unrealistic local patterns | May require human inspection |
| Semantic errors | Wrong structure (6-fingered hands) | Requires semantic understanding |
The most principled evaluation approach: measure how much probability the model assigns to held-out data.
Log-Likelihood:
$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i)$$
Higher likelihood = model considers real data more probable = better fit.
Bits Per Dimension (BPD):
For images, normalize by dimensionality and convert to bits:
$$\text{BPD} = -\frac{1}{n \cdot d \cdot \log 2} \sum_{i=1}^n \log p_\theta(x_i)$$
where $d$ is the number of dimensions (pixels × channels). Lower BPD = better compression = better model.
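The nats-to-bits conversion is easy to get wrong in code, so a minimal helper is worth sketching; the function name and the CIFAR-10-sized example are illustrative:

```python
import numpy as np

def bits_per_dim(log_liks, d):
    """Convert per-example log-likelihoods (in nats) to bits per dimension."""
    return -np.mean(log_liks) / (d * np.log(2))

# Example: a model that assigns exactly 3 bits/dim to each CIFAR-10-sized image
d = 32 * 32 * 3  # 3072 dimensions
log_liks = np.full(10, -3.0 * d * np.log(2))  # log-likelihoods in nats
print(bits_per_dim(log_liks, d))  # ≈ 3.0
```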
Typical values for natural images:
Advantages of Likelihood:
A disturbing finding: high likelihood does NOT guarantee good samples. Models with excellent BPD can produce blurry, unrealistic samples. Conversely, GANs with amazing samples have undefined or infinite likelihood. This disconnect—perhaps the most important fact in generative model evaluation—means likelihood alone is insufficient.
Why Likelihood Can Fail:
1. Likelihood rewards 'safe' predictions.
A model predicting the average image (a gray blob) assigns reasonable probability to most images—it's never 'surprised' by data. But samples from this model are useless gray blobs. Likelihood rewards avoiding low-probability predictions more than generating high-quality samples.
2. Likelihood is dominated by noise modeling.
For images: $\log p(x) = \log p(\text{structure}) + \log p(\text{noise} \mid \text{structure})$.
Much of the likelihood comes from modeling imperceptible high-frequency noise. A model that perfectly captures pixel noise but misses semantics can beat one with reverse priorities.
3. Likelihood only applies to explicit density models.
GANs, implicit models, and diffusion models (without careful derivation) don't provide tractable likelihoods. We can't compare them to likelihood-based models on this metric.
4. Likelihood is sensitive to distribution assumptions.
If we assume Gaussian observation noise with variance $\sigma^2$, the likelihood depends heavily on $\sigma$. Different $\sigma$ choices are not comparable.
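A small synthetic illustration of this sensitivity (the data and $\sigma$ values here are arbitrary): the same data yields wildly different average log-likelihoods under different assumed noise scales, and only the $\sigma$ matching the data looks good.

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)
data = 0.5 * np.random.randn(10_000)  # samples whose true noise std is 0.5

# Same model mean (0), three different assumed observation-noise scales
lls = {sigma: norm(0, sigma).logpdf(data).mean() for sigma in (0.1, 0.5, 2.0)}
for sigma, ll in lls.items():
    print(f"sigma = {sigma}: avg log-lik = {ll:.3f}")
# The numbers differ by an order of magnitude, so they are not comparable
# across sigma choices.
```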
Addressing Limitations:
```python
import numpy as np
from scipy.stats import multivariate_normal

np.random.seed(42)

# Demonstrate the likelihood-quality disconnect

# True distribution: sharp image (represented as a 2D point at [1, 1])
true_mean = np.array([1.0, 1.0])
true_cov = np.array([[0.1, 0], [0, 0.1]])  # Sharp (low variance)

# Model A: "safe" model - predicts the mean, high variance
model_A_mean = np.array([0.0, 0.0])  # Center of space
model_A_cov = np.array([[2.0, 0], [0, 2.0]])  # High variance

# Model B: "bold" model - predicts the correct location, low variance
model_B_mean = np.array([1.0, 1.0])
model_B_cov = np.array([[0.1, 0], [0, 0.1]])

# Model C: "wrong" model - confidently wrong
model_C_mean = np.array([-1.0, -1.0])
model_C_cov = np.array([[0.1, 0], [0, 0.1]])

# Generate test data from the true distribution
n_test = 1000
test_data = np.random.multivariate_normal(true_mean, true_cov, n_test)

# Compute mean log-likelihood of the test data under a model
def evaluate_model(mean, cov, data):
    model = multivariate_normal(mean, cov)
    return model.logpdf(data).mean()

ll_A = evaluate_model(model_A_mean, model_A_cov, test_data)
ll_B = evaluate_model(model_B_mean, model_B_cov, test_data)
ll_C = evaluate_model(model_C_mean, model_C_cov, test_data)

print("=== Likelihood-Quality Disconnect Demo ===")
print("True distribution: N([1, 1], small variance)")
print(f"Model A (safe, high variance): mean log-lik = {ll_A:.3f}")
print(f"Model B (correct): mean log-lik = {ll_B:.3f}")
print(f"Model C (wrong): mean log-lik = {ll_C:.3f}")

# But sample quality differs!
print("--- Sample Quality ---")
samples_A = np.random.multivariate_normal(model_A_mean, model_A_cov, 5)
samples_B = np.random.multivariate_normal(model_B_mean, model_B_cov, 5)

print("Model A samples (centered at 0, spread out - unrealistic):")
print(samples_A.round(2))
print("Model B samples (centered at [1, 1] - realistic):")
print(samples_B.round(2))

# 'Sample quality' proxy: distance to the true mean
def sample_quality(samples, true_mean):
    return np.mean(np.linalg.norm(samples - true_mean, axis=1))

quality_A = sample_quality(samples_A, true_mean)
quality_B = sample_quality(samples_B, true_mean)

print("Mean distance from true mean (lower = better):")
print(f"  Model A: {quality_A:.3f}")
print(f"  Model B: {quality_B:.3f}")

# Bits per dimension
d = 2
bpd_A = -ll_A / (d * np.log(2))
bpd_B = -ll_B / (d * np.log(2))
print("Bits per dimension (lower = better):")
print(f"  Model A: {bpd_A:.3f}")
print(f"  Model B: {bpd_B:.3f}")

print("Conclusion: Model B has better likelihood AND sample quality.")
print("But: high-variance models can sometimes have deceptively okay likelihood.")
```

The Inception Score was one of the first widely-adopted metrics for evaluating GAN-generated images. It uses a pretrained Inception network to measure both quality and diversity.
Definition:
$$\text{IS} = \exp\left(\mathbb{E}_x\left[D_{\text{KL}}\left(p(y|x) \,\|\, p(y)\right)\right]\right)$$
where $p(y|x)$ is the Inception network's predicted class distribution for a generated image $x$, and $p(y) = \mathbb{E}_x[p(y|x)]$ is the marginal class distribution over generated samples.
Intuition:
Quality: Good samples should be confidently classified. The conditional $p(y|x)$ should have low entropy (pick one class).
Diversity: Across samples, all classes should be represented. The marginal $p(y)$ should be uniform (high entropy).
KL divergence between a peaked conditional and uniform marginal is high → high IS.
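Equivalently, $\log \text{IS} = H(p(y)) - \mathbb{E}_x[H(p(y|x))]$: marginal entropy minus average conditional entropy. A quick numerical check of this identity, using random Dirichlet vectors as stand-in conditionals:

```python
import numpy as np
from scipy.stats import entropy

np.random.seed(0)
# Random "classifier outputs" p(y|x) for 500 samples, 10 classes
probs = np.random.dirichlet(0.3 * np.ones(10), size=500)
marginal = probs.mean(axis=0)

# log IS as defined: average KL to the marginal
mean_kl = np.mean([entropy(p, marginal) for p in probs])
# log IS via the entropy decomposition
decomposed = entropy(marginal) - np.mean([entropy(p) for p in probs])
print(mean_kl, decomposed)  # identical up to floating point
```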
Computing IS:
Real ImageNet images: IS ≈ 250. Early GANs: IS ≈ 2-5. Modern GANs on ImageNet: IS ≈ 100-200. CIFAR-10 GANs: IS ≈ 9-10 (max ~11 for real data). Higher is better, but the scale is dataset-dependent.
Limitations of Inception Score:
1. Ignores real data distribution.
IS only measures properties of generated images. A model producing perfect ImageNet-like images in badly wrong class proportions (99% dogs, 1% cats) is not penalized, because IS never compares against the real data distribution.
2. Requires ImageNet-like images.
Inception was trained on ImageNet. For other domains (faces, medical images, text-to-image), its classifications are meaningless. A beautiful face image might get classified as various objects nonsensically.
3. Sensitive to mode collapse in subtle ways.
If the model produces diverse images across Inception classes but with low within-class diversity, IS can be high despite severe mode collapse.
4. Not comparable across datasets.
CIFAR-10 IS and ImageNet IS are on different scales, making cross-dataset comparisons impossible.
5. Can be gamed.
Models can be optimized to fool Inception specifically, producing adversarial examples that score high on IS but look wrong to humans.
When to Use IS:
When NOT to Use IS:
```python
import numpy as np
from scipy.stats import entropy

np.random.seed(42)

# Conceptual demonstration of the Inception Score
# (a real implementation requires the Inception network)

def compute_is_from_logits(conditional_probs):
    """
    Compute IS given p(y|x) for each generated sample.

    Args:
        conditional_probs: Array of shape (n_samples, n_classes);
            each row is p(y|x_i).
    """
    n_samples, n_classes = conditional_probs.shape

    # Marginal p(y) = E[p(y|x)]
    marginal = conditional_probs.mean(axis=0)

    # KL divergence for each sample: KL(p(y|x_i) || p(y))
    kl_divs = [entropy(conditional_probs[i], marginal) for i in range(n_samples)]

    # IS = exp(E[KL])
    return np.exp(np.mean(kl_divs)), marginal

n_samples = 1000
n_classes = 10

# Scenario 1: Good model (confident, diverse)
# Each sample is confidently classified (low-entropy conditional);
# different samples use different classes (near-uniform marginal)
good_conditionals = np.zeros((n_samples, n_classes))
for i in range(n_samples):
    true_class = i % n_classes  # Cycle through classes
    good_conditionals[i, true_class] = 0.9
    good_conditionals[i, (true_class + 1) % n_classes] = 0.1

is_good, marg_good = compute_is_from_logits(good_conditionals)
print("=== Inception Score Demonstration ===")
print("Scenario 1: Good model (confident + diverse)")
print(f"  IS = {is_good:.2f}")
print(f"  Marginal entropy: {entropy(marg_good):.3f} (max = {np.log(n_classes):.3f})")

# Scenario 2: Mode collapse (confident but all the same class)
collapsed_conditionals = np.zeros((n_samples, n_classes))
collapsed_conditionals[:, 0] = 0.95
collapsed_conditionals[:, 1] = 0.05

is_collapsed, marg_collapsed = compute_is_from_logits(collapsed_conditionals)
print("Scenario 2: Mode collapse (all same class)")
print(f"  IS = {is_collapsed:.2f}")
print(f"  Marginal entropy: {entropy(marg_collapsed):.3f}")

# Scenario 3: Poor quality (uncertain classifications)
uncertain_conditionals = np.ones((n_samples, n_classes)) / n_classes  # Uniform

is_uncertain, marg_uncertain = compute_is_from_logits(uncertain_conditionals)
print("Scenario 3: Poor quality (uncertain)")
print(f"  IS = {is_uncertain:.2f}")
print(f"  Marginal entropy: {entropy(marg_uncertain):.3f}")

# Scenario 4: Near-ideal model (confident + perfectly uniform marginal)
real_conditionals = np.zeros((n_samples, n_classes))
for i in range(n_samples):
    true_class = i % n_classes
    real_conditionals[i, true_class] = 0.95
    for j in range(n_classes):
        if j != true_class:
            real_conditionals[i, j] = 0.05 / (n_classes - 1)

is_real, marg_real = compute_is_from_logits(real_conditionals)
print("Scenario 4: Near-ideal (confident + perfectly uniform)")
print(f"  IS = {is_real:.2f} (approaches the max of {n_classes} for {n_classes} classes)")

print("  Higher IS = better (confident + diverse)")
print("  Max IS = n_classes when perfectly confident and uniform")
```

Fréchet Inception Distance (FID) is the current standard metric for evaluating generative image models. It compares statistics of real and generated images in a learned feature space.
Definition:
$$\text{FID} = \|\mu_r - \mu_g\|_2^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$
where $\mu_r, \Sigma_r$ are the mean and covariance of Inception (pool3) features computed on real images, and $\mu_g, \Sigma_g$ are the same statistics computed on generated images.
Intuition:
FID measures the distance between two multivariate Gaussians fitted to real and generated features. It captures:
Key properties:
Use at least 50,000 samples. FID is biased with fewer samples—always report sample count. Use the same preprocessing as original FID paper. Different resize methods and crops give different FIDs. Compare only within the same dataset and resolution. FID at 64×64 vs 256×256 are not comparable. Report confidence intervals when possible.
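Given two sets of feature vectors, the FID formula can be computed directly; a minimal sketch (not a replacement for a standardized implementation such as Clean FID, and with the common fix for numerically complex matrix square roots):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """FID between two sets of feature vectors (rows = samples)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny numerical imaginary parts
        covmean = covmean.real
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean)

# Same distribution -> small FID; shifted distribution -> large FID
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
same = rng.normal(size=(2000, 8))
shifted = rng.normal(loc=2.0, size=(2000, 8))
print(fid(real, same), fid(real, shifted))
```

With real Inception features ($d = 2048$), far more samples are needed for stable covariance estimates, which is exactly why the 50,000-sample convention exists.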
Typical FID Values:
Advantages over IS:
Compares to real data. FID directly measures discrepancy from real distribution, not just generated image properties.
Works beyond ImageNet. Features from Inception generalize somewhat to other domains (faces, art), making FID more broadly applicable.
Better mode collapse detection. Covariance mismatch penalizes mode collapse more effectively.
More consistent with human judgments in comparative studies.
Limitations of FID:
Relies on Inception features. Like IS, features are learned on ImageNet. May miss domain-specific quality issues.
Gaussian assumption. Real features may not be Gaussian-distributed; FID only captures first two moments.
Sample size sensitivity. FID is biased with small sample sizes: the reported value overestimates the true FID, and the bias shrinks as sample count grows.
Point estimate only. FID yields a single biased estimate with no built-in uncertainty quantification, so small differences between models may not be statistically meaningful.
Ignores spatial structure. Two images with same global statistics but different compositions get same FID contribution.
| Property | Inception Score (IS) | Fréchet Inception Distance (FID) |
|---|---|---|
| Direction | Higher is better | Lower is better |
| Uses real data? | No | Yes |
| Mode collapse detection | Limited | Better |
| Feature level | Logits (1000 classes) | Pool3 (2048 features) |
| Sample size needed | ~50K | ~50K (more stable) |
| Correlation with human eval | Moderate | Good |
FID Variants:
Kernel Inception Distance (KID): Uses polynomial kernel instead of Fréchet distance. Unbiased with any sample size. More principled but less commonly reported.
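Concretely, KID is the squared MMD under the polynomial kernel $k(x, y) = (x^\top y / d + 1)^3$ with an unbiased estimator; a sketch on synthetic feature vectors:

```python
import numpy as np

def kid(real_feats, gen_feats):
    """Unbiased MMD^2 with the cubic polynomial kernel used by KID."""
    d = real_feats.shape[1]
    kern = lambda a, b: (a @ b.T / d + 1.0) ** 3
    k_rr = kern(real_feats, real_feats)
    k_gg = kern(gen_feats, gen_feats)
    k_rg = kern(real_feats, gen_feats)
    m, n = len(real_feats), len(gen_feats)
    # Unbiased estimator: drop diagonal (self-similarity) terms
    t_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    t_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return t_rr + t_gg - 2.0 * k_rg.mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
print(kid(real, rng.normal(size=(500, 16))))           # near 0 (may be slightly negative)
print(kid(real, rng.normal(loc=1.0, size=(500, 16))))  # clearly positive
```

Because the estimator is unbiased, a perfect model yields values fluctuating around zero at any sample size, unlike FID.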
Clean FID: Addresses preprocessing inconsistencies by standardizing resize, crop, and quantization. Improves reproducibility.
FID-CLIP: Replaces Inception with CLIP features. Better for text-to-image models where CLIP representations are more relevant.
Precision and Recall: Decomposes FID-like comparison into precision (quality) and recall (coverage). Precision = fraction of generated samples near real data. Recall = fraction of real data near generated samples.
FID provides a single number, but we often want to understand quality vs. diversity separately. Precision and Recall for generative models address this.
Intuition:
A mode-collapsed model has high precision (all samples are realistic) but low recall (misses modes). An overly-diverse model might have high recall but low precision (covers distribution but includes unrealistic samples).
Manifold-Based Computation:
Estimate supports of real ($R$) and generated ($G$) distributions in feature space:
$$\text{Precision} = \frac{|\{g \in G : g \in \text{manifold}(R)\}|}{|G|} \qquad \text{Recall} = \frac{|\{r \in R : r \in \text{manifold}(G)\}|}{|R|}$$
where 'manifold' is estimated via k-nearest neighbor balls.
Improved Precision and Recall:
The original formulation had artifacts. Improved versions use:
Extensions define Density (quality-weighted precision—dense regions count more) and Coverage (fraction of real modes covered, not weighted by generated density). These provide more nuanced understanding of quality-diversity trade-offs, especially for mode collapse diagnosis.
Interpreting Precision-Recall:
| Model Type | Precision | Recall | Interpretation |
|---|---|---|---|
| Perfect | High | High | Quality + Coverage |
| Mode collapse | High | Low | Realistic but repetitive |
| Over-diverse | Low | High | Good coverage, artifacts |
| Poor | Low | Low | Bad quality and coverage |
F1-Score for Generative Models:
Can combine precision and recall: $$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
$\beta = 1$ weights equally; $\beta > 1$ emphasizes recall; $\beta < 1$ emphasizes precision.
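A direct transcription of this formula, with a guard for the all-zero edge case (the scenario numbers are made up):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta combination of generative precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Mode-collapsed model: high precision, low recall
p, r = 0.9, 0.3
print(f_beta(p, r, beta=0.5), f_beta(p, r, beta=1.0), f_beta(p, r, beta=2.0))
# beta < 1 rewards the high precision; beta > 1 punishes the low recall
```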
Use Cases:
```python
import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(42)

def compute_precision_recall(real_features, gen_features, k=3):
    """
    Simplified precision/recall computation.
    Real implementations use more sophisticated manifold estimation.
    """
    # Pairwise distances in feature space
    real_to_real = cdist(real_features, real_features)
    gen_to_real = cdist(gen_features, real_features)
    real_to_gen = cdist(real_features, gen_features)

    n_real = len(real_features)
    n_gen = len(gen_features)

    # For each real point, radius to its k-th nearest real neighbor
    np.fill_diagonal(real_to_real, np.inf)  # Exclude self
    real_radii = np.sort(real_to_real, axis=1)[:, k - 1]

    # Precision: fraction of generated points inside the real manifold
    precision_count = 0
    for i in range(n_gen):
        # Is this generated point inside any real point's ball?
        if np.any(gen_to_real[i] <= real_radii):
            precision_count += 1
    precision = precision_count / n_gen

    # For each generated point, radius to its k-th nearest generated neighbor
    gen_to_gen = cdist(gen_features, gen_features)
    np.fill_diagonal(gen_to_gen, np.inf)
    gen_radii = np.sort(gen_to_gen, axis=1)[:, k - 1]

    # Recall: fraction of real points inside the generated manifold
    recall_count = 0
    for i in range(n_real):
        if np.any(real_to_gen[i] <= gen_radii):
            recall_count += 1
    recall = recall_count / n_real

    return precision, recall

# Demo scenarios
print("=== Precision-Recall for Generative Models ===")
n_real = 200
n_gen = 200
dim = 10

# Scenario 1: Good model (matches the real distribution)
real_data = np.random.multivariate_normal(np.zeros(dim), np.eye(dim), n_real)
good_gen = np.random.multivariate_normal(np.zeros(dim), np.eye(dim), n_gen)

p_good, r_good = compute_precision_recall(real_data, good_gen)
print("Good model (matches real):")
print(f"  Precision = {p_good:.3f}, Recall = {r_good:.3f}")

# Scenario 2: Mode collapse (generates only part of the distribution)
collapsed_gen = np.random.multivariate_normal(
    np.zeros(dim), 0.3 * np.eye(dim), n_gen  # Smaller variance
)
p_col, r_col = compute_precision_recall(real_data, collapsed_gen)
print("Mode collapse (too narrow):")
print(f"  Precision = {p_col:.3f}, Recall = {r_col:.3f}")

# Scenario 3: Over-diverse (spreads beyond the real data)
diverse_gen = np.random.multivariate_normal(
    np.zeros(dim), 3 * np.eye(dim), n_gen  # Larger variance
)
p_div, r_div = compute_precision_recall(real_data, diverse_gen)
print("Over-diverse (too wide):")
print(f"  Precision = {p_div:.3f}, Recall = {r_div:.3f}")

# Scenario 4: Wrong mode (shifted)
wrong_gen = np.random.multivariate_normal(
    5 * np.ones(dim), np.eye(dim), n_gen  # Shifted mean
)
p_wrong, r_wrong = compute_precision_recall(real_data, wrong_gen)
print("Wrong mode (shifted):")
print(f"  Precision = {p_wrong:.3f}, Recall = {r_wrong:.3f}")

print("  → Mode collapse: High P, Low R")
print("  → Over-diverse: Low P, High R")
print("  → Wrong mode: Low P, Low R")
```

Despite advances in automated metrics, human evaluation remains the gold standard—especially for perceptual quality and semantic correctness.
Common Human Evaluation Protocols:
1. Single-Stimulus Rating: Show a single image to raters; ask 'How realistic is this image?' on a 1-5 Likert scale.
2. Two-Alternative Forced Choice (2AFC): Show real and generated image; ask 'Which is real?'
3. Side-by-Side Comparison: Show two generated images (from different models); ask 'Which is more realistic?'
4. Turing-Style Tests: Complex scenarios where evaluators converse or interact with generated content.
Human studies are expensive, slow, and surprisingly unreliable. Inter-rater agreement on 'realism' is often low. Raters fatigue and lose attention. Results depend heavily on the specific images shown (cherry-picking is tempting). Small sample sizes lead to high variance. For these reasons, human evaluation is often used to validate automated metrics rather than as the primary evaluation.
Crowdsourcing Considerations:
When Human Evaluation Is Essential:
Reporting Human Evaluation Results:
| Protocol | Measures | Pros | Cons |
|---|---|---|---|
| Single rating | Absolute quality | Simple, fast | Calibration issues |
| 2AFC (real vs fake) | Deception rate | Grounded in real data | Doesn't compare models |
| Side-by-side | Relative preference | Removes calibration | Needs many comparisons |
| Turing test | Full realism | Most ecological validity | Very expensive |
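For 2AFC in particular, an interval around the fooling rate matters: chance is 50%, and a point estimate alone cannot show whether raters were truly fooled. A Wilson score interval sketch (the study counts are hypothetical):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical 2AFC study: raters picked the generated image in 150 of 300 trials
lo, hi = wilson_interval(150, 300)
print(f"fooling rate = 0.500, 95% CI = ({lo:.3f}, {hi:.3f})")
# The interval contains 0.5 (chance), so raters could not distinguish real from fake
```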
Different generative domains require specialized evaluation approaches beyond general-purpose metrics like FID.
Text Generation:
Text-to-Image:
Audio Generation:
3D and Video Generation:
Molecular Generation:
Domain-Specific Wisdom:
Each domain has idiosyncratic quality notions:
Generic metrics may miss domain-critical quality issues. Always complement with domain experts and specialized metrics.
Best practice: use a suite of metrics rather than any single number. Report log-likelihood (if available), FID, precision/recall, and ideally human evaluation. Different metrics catch different failure modes. Agree in advance which metrics matter most for your specific application.
Evaluating generative models is fundamentally challenging because we cannot directly compare learned distributions to ground truth. No single metric captures all aspects of quality, diversity, and faithfulness. Practical evaluation requires multiple complementary metrics, domain expertise, and often human judgment.
Module Complete:
You have now completed the foundational module on Generative Model Fundamentals. You understand the generative-discriminative distinction, density estimation, sampling, latent variables, and the challenges of evaluation. This foundation prepares you for the specific model architectures—VAEs, GANs, flows, and diffusion models—covered in subsequent modules.
Congratulations! You now possess a deep understanding of generative model fundamentals. The concepts of density estimation, sampling, latent variables, and evaluation challenges form the intellectual foundation for all specific architectures. You're ready to dive into Variational Autoencoders, GANs, Flow-based Models, and Diffusion Models—each a different answer to the central question: 'How do we learn to generate?'