Evaluating generative models is perhaps the hardest unsolved problem in the field. Unlike discriminative tasks—where accuracy, F1, or AUC provide clear metrics—generative evaluation has no gold standard. What does it mean for a generated image to be 'good'? How do we compare a model that produces realistic but repetitive faces to one with diverse but occasionally distorted outputs?
The difficulty is fundamental: the true data distribution $p^*(x)$ is unknown. We only have samples from it. We want to evaluate whether $p_\theta(x)$ matches $p^*$, but we can't directly compute distances between these distributions.
This page explores the landscape of evaluation approaches: their intuitions, their mathematical foundations, their failure modes, and the practical wisdom that guides experimental work in generative modeling.
By the end of this page, you will understand: why generative evaluation is fundamentally challenging; what log-likelihood measures and when it fails; sample quality metrics (Inception Score, FID, and their variants); the trade-off between sample diversity and sample quality; human evaluation approaches and their limitations; and best practices for rigorous generative model evaluation.
The Core Difficulty:
We want to answer: 'Does $p_\theta(x) = p^*(x)$?' But $p^*$ is only available through finite samples, and for many models $p_\theta$ is itself only accessible through samples, so no direct comparison between the two distributions is possible.
Multiple Desiderata:
What makes a generative model 'good'? Multiple properties matter:
No single metric captures all these aspects. Different metrics emphasize different properties, and optimizing one can hurt others.
A model that memorizes training data has perfect sample quality (every sample is a real image) but zero diversity and no generalization. A model that produces uniform noise has maximum entropy (diversity) but zero realism. Good generative models navigate between these extremes, but metrics often fail to capture this balance properly.
Mode Collapse and Coverage:
A particularly pernicious failure mode is mode collapse: the model generates high-quality samples from only a subset of the data distribution. Consider:
These models might score well on quality metrics (samples are realistic!) but fail catastrophically on coverage. Detecting mode collapse requires metrics that measure diversity, which is itself challenging.
Perceptual vs. Statistical Quality:
Humans perceive image quality differently than statistical measures suggest:
This mismatch between perceptual and statistical quality motivates learned perceptual metrics.
| Failure Mode | Symptom | Detection Challenge |
|---|---|---|
| Mode collapse | Low diversity, missing modes | Requires measuring coverage of true distribution |
| Memorization | Samples identical to training data | Need to compare to training set at scale |
| Blurriness | Individually okay, lack sharp details | Perceptual metrics, not pixel-wise |
| Artifacts | Unrealistic local patterns | May require human inspection |
| Semantic errors | Wrong structure (6-fingered hands) | Requires semantic understanding |
The most principled evaluation approach: measure how much probability the model assigns to held-out data.
Log-Likelihood:
$$\mathcal{L} = \frac{1}{n} \sum_{i=1}^n \log p_\theta(x_i)$$
Higher likelihood = model considers real data more probable = better fit.
Bits Per Dimension (BPD):
For images, normalize by dimensionality and convert to bits:
$$\text{BPD} = -\frac{1}{n \cdot d \cdot \log 2} \sum_{i=1}^n \log p_\theta(x_i)$$
where $d$ is the number of dimensions (pixels × channels). Lower BPD = better compression = better model.
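The nats-to-bits conversion is easy to get wrong in code, so a minimal helper is worth sketching; the function name and the CIFAR-10-sized example are illustrative:

```python
import numpy as np

def bits_per_dim(log_liks, d):
    """Convert per-example log-likelihoods (in nats) to bits per dimension."""
    return -np.mean(log_liks) / (d * np.log(2))

# Example: a model that assigns exactly 3 bits/dim to each CIFAR-10-sized image
d = 32 * 32 * 3  # 3072 dimensions
log_liks = np.full(10, -3.0 * d * np.log(2))  # log-likelihoods in nats
print(bits_per_dim(log_liks, d))  # ≈ 3.0
```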
Typical values for natural images:
Advantages of Likelihood:
A disturbing finding: high likelihood does NOT guarantee good samples. Models with excellent BPD can produce blurry, unrealistic samples. Conversely, GANs with amazing samples have undefined or infinite likelihood. This disconnect—perhaps the most important fact in generative model evaluation—means likelihood alone is insufficient.
Why Likelihood Can Fail:
1. Likelihood rewards 'safe' predictions.
A model predicting the average image (a gray blob) assigns reasonable probability to most images—it's never 'surprised' by data. But samples from this model are useless gray blobs. Likelihood rewards avoiding low-probability predictions more than generating high-quality samples.
2. Likelihood is dominated by noise modeling.
For images: $\log p(x) = \log p(\text{structure}) + \log p(\text{noise} \mid \text{structure})$.
Much of the likelihood comes from modeling imperceptible high-frequency noise. A model that perfectly captures pixel noise but misses semantics can beat one with reverse priorities.
3. Likelihood only applies to explicit density models.
GANs, implicit models, and diffusion models (without careful derivation) don't provide tractable likelihoods. We can't compare them to likelihood-based models on this metric.
4. Likelihood is sensitive to distribution assumptions.
If we assume Gaussian observation noise with variance $\sigma^2$, the likelihood depends heavily on $\sigma$. Different $\sigma$ choices are not comparable.
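A small synthetic illustration of this sensitivity (the data and $\sigma$ values here are arbitrary): the same data yields wildly different average log-likelihoods under different assumed noise scales, and only the $\sigma$ matching the data looks good.

```python
import numpy as np
from scipy.stats import norm

np.random.seed(0)
data = 0.5 * np.random.randn(10_000)  # samples whose true noise std is 0.5

# Same model mean (0), three different assumed observation-noise scales
lls = {sigma: norm(0, sigma).logpdf(data).mean() for sigma in (0.1, 0.5, 2.0)}
for sigma, ll in lls.items():
    print(f"sigma = {sigma}: avg log-lik = {ll:.3f}")
# The numbers differ by an order of magnitude, so they are not comparable
# across sigma choices.
```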
Addressing Limitations:
```python
import numpy as np
from scipy.stats import multivariate_normal

np.random.seed(42)

# Demonstrate the likelihood-quality disconnect

# True distribution: sharp image (represented as a 2D point at [1, 1])
true_mean = np.array([1.0, 1.0])
true_cov = np.array([[0.1, 0], [0, 0.1]])  # Sharp (low variance)

# Model A: "safe" model - predicts the mean, high variance
model_A_mean = np.array([0.0, 0.0])  # Center of space
model_A_cov = np.array([[2.0, 0], [0, 2.0]])  # High variance

# Model B: "bold" model - predicts the correct location, low variance
model_B_mean = np.array([1.0, 1.0])
model_B_cov = np.array([[0.1, 0], [0, 0.1]])

# Model C: "wrong" model - confidently wrong
model_C_mean = np.array([-1.0, -1.0])
model_C_cov = np.array([[0.1, 0], [0, 0.1]])

# Generate test data from the true distribution
n_test = 1000
test_data = np.random.multivariate_normal(true_mean, true_cov, n_test)

# Compute mean log-likelihood of the test data under a model
def evaluate_model(mean, cov, data):
    model = multivariate_normal(mean, cov)
    return model.logpdf(data).mean()

ll_A = evaluate_model(model_A_mean, model_A_cov, test_data)
ll_B = evaluate_model(model_B_mean, model_B_cov, test_data)
ll_C = evaluate_model(model_C_mean, model_C_cov, test_data)

print("=== Likelihood-Quality Disconnect Demo ===")
print("True distribution: N([1, 1], small variance)")
print(f"Model A (safe, high variance): mean log-lik = {ll_A:.3f}")
print(f"Model B (correct): mean log-lik = {ll_B:.3f}")
print(f"Model C (wrong): mean log-lik = {ll_C:.3f}")

# But sample quality differs!
print("--- Sample Quality ---")
samples_A = np.random.multivariate_normal(model_A_mean, model_A_cov, 5)
samples_B = np.random.multivariate_normal(model_B_mean, model_B_cov, 5)

print("Model A samples (centered at 0, spread out - unrealistic):")
print(samples_A.round(2))
print("Model B samples (centered at [1, 1] - realistic):")
print(samples_B.round(2))

# 'Sample quality' proxy: distance to the true mean
def sample_quality(samples, true_mean):
    return np.mean(np.linalg.norm(samples - true_mean, axis=1))

quality_A = sample_quality(samples_A, true_mean)
quality_B = sample_quality(samples_B, true_mean)

print("Mean distance from true mean (lower = better):")
print(f"  Model A: {quality_A:.3f}")
print(f"  Model B: {quality_B:.3f}")

# Bits per dimension
d = 2
bpd_A = -ll_A / (d * np.log(2))
bpd_B = -ll_B / (d * np.log(2))
print("Bits per dimension (lower = better):")
print(f"  Model A: {bpd_A:.3f}")
print(f"  Model B: {bpd_B:.3f}")

print("Conclusion: Model B has better likelihood AND sample quality.")
print("But: high-variance models can sometimes have deceptively okay likelihood.")
```

The Inception Score was one of the first widely-adopted metrics for evaluating GAN-generated images. It uses a pretrained Inception network to measure both quality and diversity.
Definition:
$$\text{IS} = \exp\left(\mathbb{E}_x\left[D_{\text{KL}}\left(p(y|x) \,\|\, p(y)\right)\right]\right)$$
where $p(y|x)$ is the Inception network's predicted class distribution for a generated image $x$, and $p(y) = \mathbb{E}_x[p(y|x)]$ is the marginal class distribution over generated samples.
Intuition:
Quality: Good samples should be confidently classified. The conditional $p(y|x)$ should have low entropy (pick one class).
Diversity: Across samples, all classes should be represented. The marginal $p(y)$ should be uniform (high entropy).
KL divergence between a peaked conditional and uniform marginal is high → high IS.
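Equivalently, $\log \text{IS} = H(p(y)) - \mathbb{E}_x[H(p(y|x))]$: marginal entropy minus average conditional entropy. A quick numerical check of this identity, using random Dirichlet vectors as stand-in conditionals:

```python
import numpy as np
from scipy.stats import entropy

np.random.seed(0)
# Random "classifier outputs" p(y|x) for 500 samples, 10 classes
probs = np.random.dirichlet(0.3 * np.ones(10), size=500)
marginal = probs.mean(axis=0)

# log IS as defined: average KL to the marginal
mean_kl = np.mean([entropy(p, marginal) for p in probs])
# log IS via the entropy decomposition
decomposed = entropy(marginal) - np.mean([entropy(p) for p in probs])
print(mean_kl, decomposed)  # identical up to floating point
```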
Computing IS:
Real ImageNet images: IS ≈ 250. Early GANs: IS ≈ 2-5. Modern GANs on ImageNet: IS ≈ 100-200. CIFAR-10 GANs: IS ≈ 9-10 (max ~11 for real data). Higher is better, but the scale is dataset-dependent.
Limitations of Inception Score:
1. Ignores real data distribution.
IS only measures properties of generated images. A model producing perfect ImageNet-like images in badly wrong class proportions (99% dogs, 1% cats) is not penalized, because IS never compares against the real data distribution.
2. Requires ImageNet-like images.
Inception was trained on ImageNet. For other domains (faces, medical images, text-to-image), its classifications are meaningless. A beautiful face image might get classified as various objects nonsensically.
3. Sensitive to mode collapse in subtle ways.
If the model produces diverse images across Inception classes but with low within-class diversity, IS can be high despite severe mode collapse.
4. Not comparable across datasets.
CIFAR-10 IS and ImageNet IS are on different scales, making cross-dataset comparisons impossible.
5. Can be gamed.
Models can be optimized to fool Inception specifically, producing adversarial examples that score high on IS but look wrong to humans.
When to Use IS:
When NOT to Use IS:
```python
import numpy as np
from scipy.stats import entropy

np.random.seed(42)

# Conceptual demonstration of the Inception Score
# (a real implementation requires the Inception network)

def compute_is_from_logits(conditional_probs):
    """
    Compute IS given p(y|x) for each generated sample.

    Args:
        conditional_probs: Array of shape (n_samples, n_classes);
            each row is p(y|x_i).
    """
    n_samples, n_classes = conditional_probs.shape

    # Marginal p(y) = E[p(y|x)]
    marginal = conditional_probs.mean(axis=0)

    # KL divergence for each sample: KL(p(y|x_i) || p(y))
    kl_divs = [entropy(conditional_probs[i], marginal) for i in range(n_samples)]

    # IS = exp(E[KL])
    return np.exp(np.mean(kl_divs)), marginal

n_samples = 1000
n_classes = 10

# Scenario 1: Good model (confident, diverse)
# Each sample is confidently classified (low-entropy conditional);
# different samples use different classes (near-uniform marginal)
good_conditionals = np.zeros((n_samples, n_classes))
for i in range(n_samples):
    true_class = i % n_classes  # Cycle through classes
    good_conditionals[i, true_class] = 0.9
    good_conditionals[i, (true_class + 1) % n_classes] = 0.1

is_good, marg_good = compute_is_from_logits(good_conditionals)
print("=== Inception Score Demonstration ===")
print("Scenario 1: Good model (confident + diverse)")
print(f"  IS = {is_good:.2f}")
print(f"  Marginal entropy: {entropy(marg_good):.3f} (max = {np.log(n_classes):.3f})")

# Scenario 2: Mode collapse (confident but all the same class)
collapsed_conditionals = np.zeros((n_samples, n_classes))
collapsed_conditionals[:, 0] = 0.95
collapsed_conditionals[:, 1] = 0.05

is_collapsed, marg_collapsed = compute_is_from_logits(collapsed_conditionals)
print("Scenario 2: Mode collapse (all same class)")
print(f"  IS = {is_collapsed:.2f}")
print(f"  Marginal entropy: {entropy(marg_collapsed):.3f}")

# Scenario 3: Poor quality (uncertain classifications)
uncertain_conditionals = np.ones((n_samples, n_classes)) / n_classes  # Uniform

is_uncertain, marg_uncertain = compute_is_from_logits(uncertain_conditionals)
print("Scenario 3: Poor quality (uncertain)")
print(f"  IS = {is_uncertain:.2f}")
print(f"  Marginal entropy: {entropy(marg_uncertain):.3f}")

# Scenario 4: Near-ideal model (confident + perfectly uniform marginal)
real_conditionals = np.zeros((n_samples, n_classes))
for i in range(n_samples):
    true_class = i % n_classes
    real_conditionals[i, true_class] = 0.95
    for j in range(n_classes):
        if j != true_class:
            real_conditionals[i, j] = 0.05 / (n_classes - 1)

is_real, marg_real = compute_is_from_logits(real_conditionals)
print("Scenario 4: Near-ideal (confident + perfectly uniform)")
print(f"  IS = {is_real:.2f} (approaches the max of {n_classes} for {n_classes} classes)")

print("  Higher IS = better (confident + diverse)")
print("  Max IS = n_classes when perfectly confident and uniform")
```

Fréchet Inception Distance (FID) is the current standard metric for evaluating generative image models. It compares statistics of real and generated images in a learned feature space.
Definition:
$$\text{FID} = \|\mu_r - \mu_g\|_2^2 + \text{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)$$
where $\mu_r, \Sigma_r$ are the mean and covariance of Inception (pool3) features computed on real images, and $\mu_g, \Sigma_g$ are the same statistics computed on generated images.
Intuition:
FID measures the distance between two multivariate Gaussians fitted to real and generated features. It captures:
Key properties:
Use at least 50,000 samples. FID is biased with fewer samples—always report sample count. Use the same preprocessing as original FID paper. Different resize methods and crops give different FIDs. Compare only within the same dataset and resolution. FID at 64×64 vs 256×256 are not comparable. Report confidence intervals when possible.
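Given two sets of feature vectors, the FID formula can be computed directly; a minimal sketch (not a replacement for a standardized implementation such as Clean FID, and with the common fix for numerically complex matrix square roots):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """FID between two sets of feature vectors (rows = samples)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny numerical imaginary parts
        covmean = covmean.real
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean)

# Same distribution -> small FID; shifted distribution -> large FID
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
same = rng.normal(size=(2000, 8))
shifted = rng.normal(loc=2.0, size=(2000, 8))
print(fid(real, same), fid(real, shifted))
```

With real Inception features ($d = 2048$), far more samples are needed for stable covariance estimates, which is exactly why the 50,000-sample convention exists.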
Typical FID Values:
Advantages over IS:
Compares to real data. FID directly measures discrepancy from real distribution, not just generated image properties.
Works beyond ImageNet. Features from Inception generalize somewhat to other domains (faces, art), making FID more broadly applicable.
Better mode collapse detection. Covariance mismatch penalizes mode collapse more effectively.
More consistent with human judgments in comparative studies.
Limitations of FID:
Relies on Inception features. Like IS, features are learned on ImageNet. May miss domain-specific quality issues.
Gaussian assumption. Real features may not be Gaussian-distributed; FID only captures first two moments.
Sample size sensitivity. FID is biased with small sample sizes: the reported value overestimates the true FID, and the bias shrinks as sample count grows.
Point estimate only. FID yields a single biased estimate with no built-in uncertainty quantification, so small differences between models may not be statistically meaningful.
Ignores spatial structure. Two images with same global statistics but different compositions get same FID contribution.
| Property | Inception Score (IS) | Fréchet Inception Distance (FID) |
|---|---|---|
| Direction | Higher is better | Lower is better |
| Uses real data? | No | Yes |
| Mode collapse detection | Limited | Better |
| Feature level | Logits (1000 classes) | Pool3 (2048 features) |
| Sample size needed | ~50K | ~50K (more stable) |
| Correlation with human eval | Moderate | Good |
FID Variants:
Kernel Inception Distance (KID): Uses polynomial kernel instead of Fréchet distance. Unbiased with any sample size. More principled but less commonly reported.
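Concretely, KID is the squared MMD under the polynomial kernel $k(x, y) = (x^\top y / d + 1)^3$ with an unbiased estimator; a sketch on synthetic feature vectors:

```python
import numpy as np

def kid(real_feats, gen_feats):
    """Unbiased MMD^2 with the cubic polynomial kernel used by KID."""
    d = real_feats.shape[1]
    kern = lambda a, b: (a @ b.T / d + 1.0) ** 3
    k_rr = kern(real_feats, real_feats)
    k_gg = kern(gen_feats, gen_feats)
    k_rg = kern(real_feats, gen_feats)
    m, n = len(real_feats), len(gen_feats)
    # Unbiased estimator: drop diagonal (self-similarity) terms
    t_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    t_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return t_rr + t_gg - 2.0 * k_rg.mean()

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
print(kid(real, rng.normal(size=(500, 16))))           # near 0 (may be slightly negative)
print(kid(real, rng.normal(loc=1.0, size=(500, 16))))  # clearly positive
```

Because the estimator is unbiased, a perfect model yields values fluctuating around zero at any sample size, unlike FID.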
Clean FID: Addresses preprocessing inconsistencies by standardizing resize, crop, and quantization. Improves reproducibility.
FID-CLIP: Replaces Inception with CLIP features. Better for text-to-image models where CLIP representations are more relevant.
Precision and Recall: Decomposes FID-like comparison into precision (quality) and recall (coverage). Precision = fraction of generated samples near real data. Recall = fraction of real data near generated samples.
FID provides a single number, but we often want to understand quality vs. diversity separately. Precision and Recall for generative models address this.
Intuition:
A mode-collapsed model has high precision (all samples are realistic) but low recall (misses modes). An overly-diverse model might have high recall but low precision (covers distribution but includes unrealistic samples).
Manifold-Based Computation:
Estimate supports of real ($R$) and generated ($G$) distributions in feature space:
$$\text{Precision} = \frac{|\{g \in G : g \in \text{manifold}(R)\}|}{|G|} \qquad \text{Recall} = \frac{|\{r \in R : r \in \text{manifold}(G)\}|}{|R|}$$
where 'manifold' is estimated via k-nearest neighbor balls.
Improved Precision and Recall:
The original formulation had artifacts. Improved versions use:
Extensions define Density (quality-weighted precision—dense regions count more) and Coverage (fraction of real modes covered, not weighted by generated density). These provide more nuanced understanding of quality-diversity trade-offs, especially for mode collapse diagnosis.
Interpreting Precision-Recall:
| Model Type | Precision | Recall | Interpretation |
|---|---|---|---|
| Perfect | High | High | Quality + Coverage |
| Mode collapse | High | Low | Realistic but repetitive |
| Over-diverse | Low | High | Good coverage, artifacts |
| Poor | Low | Low | Bad quality and coverage |
F1-Score for Generative Models:
Can combine precision and recall: $$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
$\beta = 1$ weights equally; $\beta > 1$ emphasizes recall; $\beta < 1$ emphasizes precision.
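A direct transcription of this formula, with a guard for the all-zero edge case (the scenario numbers are made up):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta combination of generative precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Mode-collapsed model: high precision, low recall
p, r = 0.9, 0.3
print(f_beta(p, r, beta=0.5), f_beta(p, r, beta=1.0), f_beta(p, r, beta=2.0))
# beta < 1 rewards the high precision; beta > 1 punishes the low recall
```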
Use Cases:
```python
import numpy as np
from scipy.spatial.distance import cdist

np.random.seed(42)

def compute_precision_recall(real_features, gen_features, k=3):
    """
    Simplified precision/recall computation.
    Real implementations use more sophisticated manifold estimation.
    """
    # Pairwise distances in feature space
    real_to_real = cdist(real_features, real_features)
    gen_to_real = cdist(gen_features, real_features)
    real_to_gen = cdist(real_features, gen_features)

    n_real = len(real_features)
    n_gen = len(gen_features)

    # For each real point, radius to its k-th nearest real neighbor
    np.fill_diagonal(real_to_real, np.inf)  # Exclude self
    real_radii = np.sort(real_to_real, axis=1)[:, k - 1]

    # Precision: fraction of generated points inside the real manifold
    precision_count = 0
    for i in range(n_gen):
        # Is this generated point inside any real point's ball?
        if np.any(gen_to_real[i] <= real_radii):
            precision_count += 1
    precision = precision_count / n_gen

    # For each generated point, radius to its k-th nearest generated neighbor
    gen_to_gen = cdist(gen_features, gen_features)
    np.fill_diagonal(gen_to_gen, np.inf)
    gen_radii = np.sort(gen_to_gen, axis=1)[:, k - 1]

    # Recall: fraction of real points inside the generated manifold
    recall_count = 0
    for i in range(n_real):
        if np.any(real_to_gen[i] <= gen_radii):
            recall_count += 1
    recall = recall_count / n_real

    return precision, recall

# Demo scenarios
print("=== Precision-Recall for Generative Models ===")
n_real = 200
n_gen = 200
dim = 10

# Scenario 1: Good model (matches the real distribution)
real_data = np.random.multivariate_normal(np.zeros(dim), np.eye(dim), n_real)
good_gen = np.random.multivariate_normal(np.zeros(dim), np.eye(dim), n_gen)

p_good, r_good = compute_precision_recall(real_data, good_gen)
print("Good model (matches real):")
print(f"  Precision = {p_good:.3f}, Recall = {r_good:.3f}")

# Scenario 2: Mode collapse (generates only part of the distribution)
collapsed_gen = np.random.multivariate_normal(
    np.zeros(dim), 0.3 * np.eye(dim), n_gen  # Smaller variance
)
p_col, r_col = compute_precision_recall(real_data, collapsed_gen)
print("Mode collapse (too narrow):")
print(f"  Precision = {p_col:.3f}, Recall = {r_col:.3f}")

# Scenario 3: Over-diverse (spreads beyond the real data)
diverse_gen = np.random.multivariate_normal(
    np.zeros(dim), 3 * np.eye(dim), n_gen  # Larger variance
)
p_div, r_div = compute_precision_recall(real_data, diverse_gen)
print("Over-diverse (too wide):")
print(f"  Precision = {p_div:.3f}, Recall = {r_div:.3f}")

# Scenario 4: Wrong mode (shifted)
wrong_gen = np.random.multivariate_normal(
    5 * np.ones(dim), np.eye(dim), n_gen  # Shifted mean
)
p_wrong, r_wrong = compute_precision_recall(real_data, wrong_gen)
print("Wrong mode (shifted):")
print(f"  Precision = {p_wrong:.3f}, Recall = {r_wrong:.3f}")

print("  → Mode collapse: High P, Low R")
print("  → Over-diverse: Low P, High R")
print("  → Wrong mode: Low P, Low R")
```

Despite advances in automated metrics, human evaluation remains the gold standard—especially for perceptual quality and semantic correctness.
Common Human Evaluation Protocols:
1. Single-Stimulus Rating: Show a single image to raters; ask 'How realistic is this image?' on a 1-5 Likert scale.
2. Two-Alternative Forced Choice (2AFC): Show real and generated image; ask 'Which is real?'
3. Side-by-Side Comparison: Show two generated images (from different models); ask 'Which is more realistic?'
4. Turing-Style Tests: Complex scenarios where evaluators converse or interact with generated content.
Human studies are expensive, slow, and surprisingly unreliable. Inter-rater agreement on 'realism' is often low. Raters fatigue and lose attention. Results depend heavily on the specific images shown (cherry-picking is tempting). Small sample sizes lead to high variance. For these reasons, human evaluation is often used to validate automated metrics rather than as the primary evaluation.
Crowdsourcing Considerations:
When Human Evaluation Is Essential:
Reporting Human Evaluation Results:
| Protocol | Measures | Pros | Cons |
|---|---|---|---|
| Single rating | Absolute quality | Simple, fast | Calibration issues |
| 2AFC (real vs fake) | Deception rate | Grounded in real data | Doesn't compare models |
| Side-by-side | Relative preference | Removes calibration | Needs many comparisons |
| Turing test | Full realism | Most ecological validity | Very expensive |
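For 2AFC in particular, an interval around the fooling rate matters: chance is 50%, and a point estimate alone cannot show whether raters were truly fooled. A Wilson score interval sketch (the study counts are hypothetical):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

# Hypothetical 2AFC study: raters picked the generated image in 150 of 300 trials
lo, hi = wilson_interval(150, 300)
print(f"fooling rate = 0.500, 95% CI = ({lo:.3f}, {hi:.3f})")
# The interval contains 0.5 (chance), so raters could not distinguish real from fake
```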
Different generative domains require specialized evaluation approaches beyond general-purpose metrics like FID.
Text Generation:
Text-to-Image:
Audio Generation:
3D and Video Generation:
Molecular Generation:
Domain-Specific Wisdom:
Each domain has idiosyncratic quality notions:
Generic metrics may miss domain-critical quality issues. Always complement with domain experts and specialized metrics.
Best practice: use a suite of metrics rather than any single number. Report log-likelihood (if available), FID, precision/recall, and ideally human evaluation. Different metrics catch different failure modes. Agree in advance which metrics matter most for your specific application.
Evaluating generative models is fundamentally challenging because we cannot directly compare learned distributions to ground truth. No single metric captures all aspects of quality, diversity, and faithfulness. Practical evaluation requires multiple complementary metrics, domain expertise, and often human judgment.
Module Complete:
You have now completed the foundational module on Generative Model Fundamentals. You understand the generative-discriminative distinction, density estimation, sampling, latent variables, and the challenges of evaluation. This foundation prepares you for the specific model architectures—VAEs, GANs, flows, and diffusion models—covered in subsequent modules.
Congratulations! You now possess a deep understanding of generative model fundamentals. The concepts of density estimation, sampling, latent variables, and evaluation challenges form the intellectual foundation for all specific architectures. You're ready to dive into Variational Autoencoders, GANs, Flow-based Models, and Diffusion Models—each a different answer to the central question: 'How do we learn to generate?'