Evaluating semi-supervised learning methods presents unique challenges that don't arise in standard supervised learning. The presence of unlabeled data, the sensitivity to labeled sample selection, and the complex interactions between labeled and unlabeled distributions create a minefield of evaluation pitfalls.
Many published SSL papers suffer from subtle evaluation flaws that inflate reported improvements. Understanding these pitfalls is essential not only for conducting rigorous research but also for critically reading the literature and making informed decisions about method selection.
The fundamental question we address: How do we fairly and rigorously compare SSL methods, and what does 'improvement' actually mean in the semi-supervised setting?
This page provides comprehensive coverage of SSL evaluation. You will understand: (1) Standard evaluation protocols and their limitations, (2) Common pitfalls that invalidate comparisons, (3) Proper experimental design for SSL, (4) Statistical considerations with limited labels, and (5) Best practices for reporting SSL results.
Let's first establish the standard evaluation protocol for semi-supervised learning, then examine its limitations.
1. Dataset Preparation:
2. Training:
3. Evaluation:
| Metric | Formula | Use Case | SSL-Specific Notes |
|---|---|---|---|
| Test Accuracy | Correct / Total | Balanced classification | Compare to supervised baseline |
| Balanced Accuracy | Mean per-class accuracy | Imbalanced classification | Critical when few labels per class |
| F1-Score (macro) | Mean of class F1s | Imbalanced multi-class | Robust to class imbalance |
| AUC-ROC | Area under ROC | Binary/ranking problems | Threshold-independent |
| Error Rate Reduction | (Base - SSL) / Base | Comparing methods | Normalizes across datasets |
| Label Efficiency Ratio | SSL accuracy / SL accuracy | Core SSL metric | Should exceed 1.0 for benefit |
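The last two metrics in the table are simple ratios; a minimal sketch with hypothetical numbers (the function names are illustrative, not from a library):

```python
def error_rate_reduction(base_err: float, ssl_err: float) -> float:
    """(Base - SSL) / Base, as in the table above."""
    return (base_err - ssl_err) / base_err

def label_efficiency_ratio(ssl_acc: float, sl_acc: float) -> float:
    """SSL accuracy / supervised accuracy; > 1.0 means SSL helps."""
    return ssl_acc / sl_acc

# Hypothetical numbers: supervised 84.2% accuracy (15.8% error),
# SSL 90.1% accuracy (9.9% error)
print(f"{error_rate_reduction(15.8, 9.9):.3f}")     # 0.373
print(f"{label_efficiency_ratio(90.1, 84.2):.3f}")  # 1.070
```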
The most fundamental comparison in SSL evaluation:
Supervised Baseline (Lower Bound):
Oracle (Upper Bound):
SSL Method:
The relative improvement is often reported as:
$$\text{Relative Improvement} = \frac{\text{Error}_{SL} - \text{Error}_{SSL}}{\text{Error}_{SL} - \text{Error}_{Oracle}}$$
A value of 1.0 means SSL matches the oracle; 0.0 means no improvement over supervised; negative means SSL hurts.
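As a worked example with hypothetical error rates:

```python
def relative_improvement(err_sl: float, err_ssl: float, err_oracle: float) -> float:
    """Fraction of the supervised-to-oracle gap closed by the SSL method."""
    return (err_sl - err_ssl) / (err_sl - err_oracle)

# Hypothetical error rates: supervised 15.8%, SSL 10.7%, oracle 4.5%
print(f"{relative_improvement(15.8, 10.7, 4.5):.2f}")  # 0.45 -> SSL closes ~45% of the gap
```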
A weak supervised baseline inflates SSL improvements. Always use a properly tuned supervised baseline—same architecture, same data augmentation, same training budget. Many papers use under-tuned baselines that make SSL look better than it is.
SSL evaluation is prone to subtle mistakes that can dramatically bias results. Understanding these pitfalls is essential for both conducting and interpreting research.
The Problem: Hyperparameters are tuned using validation data. If the validation set is small (which it often is when labels are scarce), tuning becomes noisy and can inadvertently overfit to the specific validation samples.
Worse: Some papers use test accuracy for model selection, which is a fatal flaw that invalidates all results.
The Solution:
The Problem: Which l samples are designated as 'labeled' matters enormously. Random selection from the training set can produce wildly different results across runs.
Example: If a class has 10 training samples and you randomly select 1 as labeled, you might get:
The Solution:
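One common remedy is to fix the labeled split with a seeded, class-stratified draw so every compared method trains on the identical labeled samples. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def select_labeled_indices(labels: np.ndarray, per_class: int, seed: int) -> np.ndarray:
    """Class-stratified labeled-subset selection with a fixed seed,
    so all methods and runs can share the identical labeled split."""
    rng = np.random.default_rng(seed)
    chosen = [
        rng.choice(np.flatnonzero(labels == c), size=per_class, replace=False)
        for c in np.unique(labels)
    ]
    return np.sort(np.concatenate(chosen))

# Toy example: 10 classes x 100 samples, 4 labels per class -> 40 indices
labels = np.repeat(np.arange(10), 100)
idx = select_labeled_indices(labels, per_class=4, seed=0)
print(len(idx))  # 40
```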
The Problem: SSL methods often come bundled with stronger architectures or augmentations than baselines. Reported improvements conflate architecture/augmentation gains with algorithmic gains.
Example: FixMatch uses RandAugment with strong augmentations. If the supervised baseline doesn't use augmentation, 'FixMatch improvement' might come entirely from augmentation, not the SSL algorithm.
The Solution:
The Problem: SSL methods often require more compute (multiple forward passes, more epochs, larger batches). A method that takes 10x more compute should provide commensurate improvement.
The Solution:
The paper 'Realistic Evaluation of Deep Semi-Supervised Learning Algorithms' (Oliver et al., 2018) systematically exposed many of these pitfalls, finding that properly controlled experiments dramatically reduced reported SSL improvements. It's essential reading for anyone working in this field.
When working with few labeled samples, statistical considerations become critical. Standard evaluation practices from supervised learning often fail in the SSL regime.
With only 10-100 labeled samples per class, results can vary dramatically based on:
Empirical observation: With 40 CIFAR-10 labels (4 per class), test accuracy standard deviation across runs can exceed 5%—larger than many reported SSL improvements!
Paired comparisons: When comparing method A to method B, use the same:
Then apply paired statistical tests (paired t-test, Wilcoxon signed-rank) to the differences.
Multiple comparisons: When comparing many methods, apply corrections (Bonferroni, Holm-Bonferroni) to avoid false positives from multiple testing.
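The Holm-Bonferroni step-down procedure can be sketched in a few lines (a minimal implementation for illustration; in practice a statistics library routine would typically be used):

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down: sort p-values ascending, compare the
    i-th smallest against alpha / (m - i), and stop at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# Three hypothetical method-vs-baseline p-values
print(holm_bonferroni([0.01, 0.04, 0.20]))  # [True, False, False]
```

Note that 0.04 would pass an uncorrected test at α = 0.05 but fails the corrected threshold of 0.025.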
```python
import numpy as np
from scipy import stats
from typing import List, Dict, Tuple


def analyze_ssl_results(
    supervised_accs: List[float],
    ssl_accs: List[float],
    alpha: float = 0.05
) -> Dict[str, float]:
    """
    Statistical analysis of SSL vs supervised comparison.

    Args:
        supervised_accs: Accuracies from supervised baseline (multiple seeds)
        ssl_accs: Accuracies from SSL method (same seeds, paired)
        alpha: Significance level

    Returns:
        Dictionary with statistical analysis results
    """
    supervised = np.array(supervised_accs)
    ssl = np.array(ssl_accs)

    # Basic statistics
    sup_mean, sup_std = np.mean(supervised), np.std(supervised, ddof=1)
    ssl_mean, ssl_std = np.mean(ssl), np.std(ssl, ddof=1)

    # Improvement
    improvements = ssl - supervised
    mean_improvement = np.mean(improvements)

    # Paired t-test (assumes normal distribution of differences)
    t_stat, t_pvalue = stats.ttest_rel(ssl, supervised)

    # Wilcoxon signed-rank test (non-parametric alternative)
    try:
        w_stat, w_pvalue = stats.wilcoxon(improvements)
    except ValueError:  # All differences are zero
        w_stat, w_pvalue = 0, 1.0

    # Effect size (Cohen's d for paired samples)
    cohens_d = mean_improvement / np.std(improvements, ddof=1)

    # 95% confidence interval for mean improvement
    se = stats.sem(improvements)
    ci_low, ci_high = stats.t.interval(
        1 - alpha, len(improvements) - 1,
        loc=mean_improvement, scale=se
    )

    # Win/Tie/Loss analysis
    wins = np.sum(ssl > supervised)
    losses = np.sum(ssl < supervised)
    ties = np.sum(ssl == supervised)

    return {
        'supervised_mean': sup_mean,
        'supervised_std': sup_std,
        'ssl_mean': ssl_mean,
        'ssl_std': ssl_std,
        'mean_improvement': mean_improvement,
        'improvement_ci': (ci_low, ci_high),
        'paired_t_stat': t_stat,
        'paired_t_pvalue': t_pvalue,
        'wilcoxon_stat': w_stat,
        'wilcoxon_pvalue': w_pvalue,
        'cohens_d': cohens_d,
        'wins': wins,
        'losses': losses,
        'ties': ties,
        'statistically_significant': t_pvalue < alpha and ci_low > 0,
        'practical_significance': abs(cohens_d) > 0.5,  # Medium effect
    }


def power_analysis_ssl(
    expected_improvement: float,
    expected_std: float,
    alpha: float = 0.05,
    power: float = 0.8
) -> int:
    """
    Calculate required number of runs for statistical power.

    Args:
        expected_improvement: Expected accuracy improvement
        expected_std: Expected standard deviation of improvements
        alpha: Significance level
        power: Desired statistical power

    Returns:
        Minimum number of paired runs required
    """
    from scipy.stats import norm

    effect_size = expected_improvement / expected_std
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)

    n = ((z_alpha + z_beta) / effect_size) ** 2
    return int(np.ceil(n))


# Example usage:
# results = analyze_ssl_results(
#     supervised_accs=[85.2, 83.4, 86.1, 84.8, 85.5],
#     ssl_accs=[88.3, 87.1, 88.9, 87.5, 88.2]
# )
# print(f"Mean improvement: {results['mean_improvement']:.2f}%")
# print(f"95% CI: [{results['improvement_ci'][0]:.2f}, {results['improvement_ci'][1]:.2f}]")
# print(f"Statistically significant: {results['statistically_significant']}")
```

A common question: how many random seeds should we report?
Rule of thumb: At least 5 runs, preferably 10+, especially when:
Power analysis: To detect a 2% improvement with 80% power, given typical variance:
Many SSL papers report only 3 runs. With typical variance, this yields ~50% power to detect true 2% improvements—essentially a coin flip. If a result 'almost reaches significance,' more runs are needed, not creative interpretation.
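The arithmetic behind this warning can be sketched with a self-contained, standard-library variant of the normal-approximation power calculation (the ~2.5-point standard deviation of paired accuracy differences is a hypothetical figure, plausible for the low-label regime):

```python
from math import ceil
from statistics import NormalDist

def runs_needed(improvement: float, std: float,
                alpha: float = 0.05, power: float = 0.8) -> int:
    """Normal-approximation sample size for a paired comparison."""
    z = NormalDist()
    effect = improvement / std
    return ceil(((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / effect) ** 2)

# Assuming (hypothetically) a 2.5-point std of paired accuracy differences:
print(runs_needed(improvement=2.0, std=2.5))  # 13 paired runs
```

Under this assumption, 3 runs falls far short of the 13 needed for 80% power to detect a true 2-point improvement.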
The SSL literature has converged on several standard benchmarks. Understanding these benchmarks—including their limitations—is essential for interpreting results.
| Dataset | Classes | Train Size | Test Size | Common Label Counts | Notes |
|---|---|---|---|---|---|
| CIFAR-10 | 10 | 50,000 | 10,000 | 40, 250, 4000 | Most common; may be saturated |
| CIFAR-100 | 100 | 50,000 | 10,000 | 400, 2500, 10000 | Harder; more realistic class count |
| SVHN | 10 | 73,257 | 26,032 | 40, 250, 1000 | Digit recognition; easier than CIFAR |
| STL-10 | 10 | 5,000 labeled + 100,000 unlabeled | 8,000 | 1000 total | Designed for SSL; unlabeled pool drawn from a broader distribution |
| ImageNet | 1000 | 1.28M | 50,000 | 1%, 10% | Full-scale; most realistic |
| Mini-ImageNet | 100 | 50,000 | 10,000 | Varies | ImageNet subset for faster iteration |
Fixed splits:
Class-balanced sampling:
Imbalanced protocols:
The field is evolving toward more realistic evaluation:
1. Distribution-Shifted Unlabeled Data:
2. Open-Set SSL:
3. Class-Imbalanced SSL:
4. Noisy Labels:
5. Continual/Online SSL:
Use standardized SSL frameworks for evaluation: TorchSSL, USB (Unified Semi-supervised Learning Benchmark), or libsvm-ssl. These provide consistent data splits, evaluation protocols, and baseline implementations, ensuring reproducible and comparable results.
Rigorous experimental design is essential for meaningful SSL evaluation. Here we provide a comprehensive checklist and best practices.
Architecture Control:
Augmentation Control:
Training Control:
Evaluation Control:
```python
from dataclasses import dataclass, field
from typing import List, Optional
import json


@dataclass
class SSLExperimentConfig:
    """
    Configuration for rigorous SSL experiments.
    Document all decisions for reproducibility.
    """
    # Dataset configuration
    dataset: str = "cifar10"
    num_labeled: int = 40
    num_unlabeled: int = 49960  # Rest of training set
    label_split_seed: int = 0   # For reproducible label selection

    # Architecture (same for all methods)
    architecture: str = "WideResNet-28-2"
    num_classes: int = 10

    # Training (same for all methods)
    batch_size_labeled: int = 64
    batch_size_unlabeled: int = 448  # μB in FixMatch notation
    total_steps: int = 1_000_000
    learning_rate: float = 0.03
    weight_decay: float = 5e-4
    optimizer: str = "SGD"
    momentum: float = 0.9
    lr_schedule: str = "cosine"

    # Augmentation (apply same to supervised baseline)
    weak_augmentation: str = "random_crop_flip"
    strong_augmentation: str = "randaugment"  # If used by SSL method

    # Evaluation
    eval_every_steps: int = 1000
    test_split: str = "test"
    primary_metric: str = "accuracy"

    # Statistical design
    random_seeds: List[int] = field(
        default_factory=lambda: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    )
    min_runs: int = 5
    significance_level: float = 0.05

    # Methods to compare
    methods: List[str] = field(
        default_factory=lambda: [
            "supervised_baseline",
            "pseudo_label",
            "mean_teacher",
            "mixmatch",
            "fixmatch",
            "supervised_oracle"  # All labels
        ]
    )

    def save(self, path: str):
        """Save config for reproducibility."""
        with open(path, 'w') as f:
            json.dump(self.__dict__, f, indent=2)

    @classmethod
    def load(cls, path: str):
        """Load config from file."""
        with open(path, 'r') as f:
            return cls(**json.load(f))

    def describe(self) -> str:
        """Human-readable description of experiment."""
        return f"""SSL Experiment Configuration
============================
Dataset: {self.dataset} with {self.num_labeled} labels
Architecture: {self.architecture}
Training: {self.total_steps} steps, LR={self.learning_rate}
Augmentations: weak={self.weak_augmentation}, strong={self.strong_augmentation}
Statistical: {len(self.random_seeds)} seeds, α={self.significance_level}
Methods: {', '.join(self.methods)}""".strip()
```

To understand why an SSL method works, systematic ablations are essential:
1. Additive Ablation: Start with supervised baseline, add components one by one:
2. Subtractive Ablation: Start with full method, remove components one by one:
3. Hyperparameter Sensitivity: Vary key hyperparameters one at a time:
Before comparing a new method to baselines, ablate to identify which components provide benefit. If a component provides no improvement in ablation, remove it before final comparison. This prevents overclaiming credit from unnecessary complexity.
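A one-at-a-time sensitivity sweep can be generated mechanically from a base configuration; a minimal sketch (the hyperparameter names and value grids are hypothetical):

```python
# Hypothetical one-at-a-time sensitivity sweep around a base configuration.
base = {"confidence_threshold": 0.95, "unlabeled_weight": 1.0, "ema_decay": 0.999}
grids = {
    "confidence_threshold": [0.80, 0.90, 0.95, 0.99],
    "unlabeled_weight": [0.25, 0.5, 1.0, 2.0],
    "ema_decay": [0.99, 0.999, 0.9999],
}

def sensitivity_configs(base, grids):
    """Yield configs that differ from the base in exactly one hyperparameter."""
    for name, values in grids.items():
        for v in values:
            if v != base[name]:
                yield {**base, name: v}

configs = list(sensitivity_configs(base, grids))
print(len(configs))  # 8 extra runs: 3 + 3 + 2 alternative values
```

Keeping all other hyperparameters at their base values isolates each parameter's effect, at far lower cost than a full grid search.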
How you report SSL results matters for reproducibility and fair interpretation. Here we provide guidelines for transparent reporting.
| Category | Information to Include | Why It Matters |
|---|---|---|
| Data | Dataset, label counts, split seeds/indices | Reproducibility |
| Architecture | Network, parameters, initialization | Fair comparison |
| Training | Epochs, batch size, LR, schedule | Reproducibility |
| Augmentation | All augmentations applied | Often source of gains |
| Method | All hyperparameters, thresholds | Implementation details |
| Evaluation | Metric, test split, selection criterion | Result interpretation |
| Statistics | Runs, seeds, mean ± std | Significance assessment |
| Baselines | Supervised, oracle, prior methods | Context for improvement |
| Compute | GPU type, hours, FLOPs | Efficiency comparison |
| Code | Repository URL, version/commit | Reproducibility |
Main Table Format:
| Method | CIFAR-10 (40) | CIFAR-10 (250) | CIFAR-10 (4000) |
|---|---|---|---|
| Supervised | 84.2 ± 1.3 | 89.5 ± 0.8 | 94.1 ± 0.3 |
| Pseudo-Label | 86.1 ± 1.5 | 91.2 ± 0.6 | 94.8 ± 0.2 |
| MixMatch | 89.3 ± 0.9 | 93.1 ± 0.4 | 95.4 ± 0.2 |
| Ours | 90.1 ± 0.7 | 93.8 ± 0.3 | 95.7 ± 0.1 |
| Oracle | 95.5 ± 0.2 | 95.5 ± 0.2 | 95.5 ± 0.2 |
Results: mean ± std over 10 runs. Bold: best excluding oracle.
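Cells in this format can be generated consistently from per-seed results; a minimal helper (illustrative, not from a library):

```python
from statistics import mean, stdev

def format_cell(accs):
    """Render one results cell as 'mean ± std' over seeds."""
    return f"{mean(accs):.1f} ± {stdev(accs):.1f}"

# Five hypothetical per-seed accuracies for one method/label-count cell
print(format_cell([89.9, 90.8, 90.1, 89.5, 90.2]))  # 90.1 ± 0.5
```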
Include in appendix or supplementary material:
A good test: Could someone reproduce your results to within reported variance from the paper alone? If the answer is 'only with the code,' your paper is under-documented. Code should supplement, not replace, method description.
Academic benchmarks differ from real-world deployment. Here we discuss evaluation considerations specific to practical SSL deployment.
Beyond accuracy, production systems care about:
Real unlabeled data often differs from labeled data:
Covariate shift evaluation:
Prior shift evaluation:
Open-set evaluation:
The ultimate test of SSL value in production:
A/B testing captures real-world value that academic metrics may miss. A 2% accuracy improvement means nothing if it doesn't change business outcomes.
Methods that excel on academic benchmarks may fail in production. Clean benchmarks favor aggressive pseudo-labeling; noisy production data punishes it. Always validate SSL methods on data resembling your production distribution before broad deployment.
We have examined the complex landscape of semi-supervised learning evaluation. Proper evaluation is not merely methodological hygiene—it's essential for understanding what works, what doesn't, and why. Let's consolidate the key insights:
Module Complete:
With this page, we have concluded Module 1: The Label Scarcity Problem. You have learned:
You are now equipped to understand why semi-supervised learning is important, what it is formally, when it can help (assumptions), and how to rigorously assess whether it's working. The subsequent modules will dive into specific SSL methods armed with this foundational understanding.
Congratulations! You have completed Module 1: The Label Scarcity Problem. You now have a principled understanding of the economic, mathematical, and methodological foundations of semi-supervised learning. This knowledge will serve as the bedrock for understanding and applying the specific SSL methods covered in subsequent modules.