Self-supervised learning presents a unique evaluation challenge: how do we measure the quality of representations learned without labels? Unlike supervised learning where accuracy on held-out data is the natural metric, SSL requires proxy evaluations that predict how well representations will transfer to downstream tasks.
The field has developed standardized evaluation protocols that enable fair comparison across methods. Understanding these protocols is essential for both conducting rigorous research and selecting methods for practical applications.
By the end of this page, you will master the standard evaluation protocols (linear probe, fine-tuning, k-NN), understand benchmarks and datasets used for SSL evaluation, analyze the relationship between evaluation metrics and downstream performance, and design rigorous evaluation pipelines for your own SSL research.
The linear evaluation protocol is the gold standard for comparing self-supervised methods. The procedure is simple: freeze the pretrained encoder, train a single linear classifier on top of its frozen features using the labeled training data, and report classification accuracy on a held-out test set.
The key insight: if a linear classifier achieves high accuracy, the representation has already done the hard work of separating semantic categories. Complex non-linear decision boundaries aren't needed.
Linear probing measures representation quality in isolation. Since the linear classifier has minimal capacity, all discriminative power must come from the frozen representations. This prevents conflating 'good SSL method' with 'good fine-tuning procedure' and enables fair comparison.
| Aspect | Common Choice | Rationale |
|---|---|---|
| Optimizer | SGD with momentum | More stable than Adam for linear probes |
| Learning rate | 0.1-30 (with decay) | Optimal LR varies by representation |
| Batch size | 256-4096 | Larger batches often help |
| Training epochs | 100 | Enough for convergence |
| Augmentation | Standard crop + flip | Consistent with supervised baselines |
| Feature extraction | Global average pool | For CNNs; CLS token for Transformers |
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from typing import Tuple


class LinearEvaluator:
    """
    Standard linear evaluation protocol for SSL representations.
    """

    def __init__(
        self,
        encoder: nn.Module,
        feature_dim: int,
        num_classes: int,
        device: str = "cuda"
    ):
        self.encoder = encoder.to(device)
        self.encoder.eval()  # Freeze encoder

        # Single linear layer classifier
        self.classifier = nn.Linear(feature_dim, num_classes).to(device)
        self.device = device

        # Freeze encoder parameters
        for param in self.encoder.parameters():
            param.requires_grad = False

    def extract_features(self, dataloader: DataLoader) -> Tuple[torch.Tensor, torch.Tensor]:
        """Extract features for entire dataset."""
        features_list = []
        labels_list = []

        with torch.no_grad():
            for images, labels in dataloader:
                images = images.to(self.device)
                features = self.encoder(images)

                # Global average pooling if needed
                if features.dim() == 4:
                    features = features.mean(dim=[2, 3])

                features_list.append(features.cpu())
                labels_list.append(labels)

        return torch.cat(features_list), torch.cat(labels_list)

    def train(
        self,
        train_loader: DataLoader,
        val_loader: DataLoader,
        epochs: int = 100,
        lr: float = 0.1
    ) -> dict:
        """Train linear classifier and evaluate."""
        optimizer = optim.SGD(
            self.classifier.parameters(),
            lr=lr,
            momentum=0.9,
            weight_decay=0
        )
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)
        criterion = nn.CrossEntropyLoss()

        best_acc = 0.0

        for epoch in range(epochs):
            self.classifier.train()

            for images, labels in train_loader:
                images, labels = images.to(self.device), labels.to(self.device)

                # Extract features (frozen encoder)
                with torch.no_grad():
                    features = self.encoder(images)
                    if features.dim() == 4:
                        features = features.mean(dim=[2, 3])

                # Train classifier
                logits = self.classifier(features)
                loss = criterion(logits, labels)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            scheduler.step()

            # Evaluate
            val_acc = self.evaluate(val_loader)
            best_acc = max(best_acc, val_acc)

        return {"best_accuracy": best_acc, "final_accuracy": val_acc}

    @torch.no_grad()
    def evaluate(self, dataloader: DataLoader) -> float:
        """Evaluate on validation/test set."""
        self.classifier.eval()
        correct = 0
        total = 0

        for images, labels in dataloader:
            images, labels = images.to(self.device), labels.to(self.device)

            features = self.encoder(images)
            if features.dim() == 4:
                features = features.mean(dim=[2, 3])

            logits = self.classifier(features)
            preds = logits.argmax(dim=1)

            correct += (preds == labels).sum().item()
            total += labels.size(0)

        return correct / total
```

k-NN evaluation provides a training-free alternative to linear probing. The idea is simple: classify each test sample by the majority vote of its k nearest neighbors in the training set.
The table below summarizes why k-NN evaluation is valuable and how it differs from linear probing:
| Aspect | Linear Probe | k-NN |
|---|---|---|
| Training | Required | None |
| Computational cost | Moderate | Low (one-time distance computation) |
| What it measures | Linear separability | Local neighborhood structure |
| Hyperparameters | LR, epochs, WD | k value, distance metric |
| Typical accuracy | Higher | ~5-10% lower than linear probe |
| Use case | Final comparison | Quick validation, checkpointing |
k-NN evaluation is commonly computed periodically during SSL pretraining to monitor representation quality. Since it requires no training, it adds minimal overhead. Sudden drops in k-NN accuracy can signal collapse or other training issues.
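As a sketch of how such a check can be implemented, the snippet below runs a cosine-similarity k-NN classifier on frozen features (for example, the outputs of the `extract_features` method above). The function name and the choice of `k=20` are illustrative assumptions; this uses a plain majority vote, whereas some methods (e.g., DINO) use a similarity-weighted vote instead.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def knn_evaluate(train_features, train_labels, test_features, test_labels, k=20):
    """Classify each test sample by majority vote of its k nearest
    training neighbors under cosine similarity."""
    # L2-normalize so that dot products equal cosine similarities
    train_features = F.normalize(train_features, dim=1)
    test_features = F.normalize(test_features, dim=1)

    correct = 0
    for i in range(0, test_features.size(0), 256):  # process the test set in chunks
        chunk = test_features[i:i + 256]
        sims = chunk @ train_features.T                    # (chunk, num_train)
        _, idx = sims.topk(k, dim=1)                       # indices of k nearest neighbors
        neighbor_labels = train_labels[idx]                # (chunk, k)
        preds = torch.mode(neighbor_labels, dim=1).values  # majority vote
        correct += (preds == test_labels[i:i + 256]).sum().item()

    return correct / test_features.size(0)
```

Because the entire computation is a handful of matrix multiplications over precomputed features, it can be run every few pretraining epochs at negligible cost.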
While linear probing measures frozen representation quality, fine-tuning evaluation measures how well pretrained weights serve as initialization for downstream task training.
Fine-tuning protocols typically include end-to-end fine-tuning on the full labeled dataset and semi-supervised fine-tuning on small label fractions (e.g., 1% or 10% of ImageNet labels).
Fine-tuning often yields higher absolute accuracy than linear probing because the encoder can adapt to the target task.
With enough fine-tuning, even poor initializations can reach good final accuracy. This is why linear probing remains the primary comparison metric—it isolates the contribution of pretraining from the contribution of downstream training.
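For concreteness, here is a minimal sketch of a fine-tuning setup, reusing the encoder/classifier naming from the linear evaluation code above. The helper name and the discriminative learning rates are illustrative assumptions rather than values prescribed by any particular paper.

```python
import torch.nn as nn
import torch.optim as optim


def build_finetune_optimizer(encoder: nn.Module, classifier: nn.Module,
                             encoder_lr: float = 1e-3, head_lr: float = 1e-2):
    """Unfreeze the encoder and train it jointly with the classification head.
    A smaller learning rate on the encoder helps preserve pretrained features."""
    for param in encoder.parameters():
        param.requires_grad = True  # unlike linear probing, the encoder is updated

    return optim.SGD(
        [
            {"params": encoder.parameters(), "lr": encoder_lr},
            {"params": classifier.parameters(), "lr": head_lr},
        ],
        momentum=0.9,
        weight_decay=1e-4,
    )
```

The two-group optimizer is the key difference from linear probing: the head learns quickly while the encoder is nudged gently away from its pretrained weights.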
The SSL community has converged on standard benchmarks enabling fair comparison across methods.
| Benchmark | Task | Scale | What It Tests |
|---|---|---|---|
| ImageNet Linear | Classification (1000 classes) | 1.28M images | General visual representations |
| ImageNet 1% / 10% | Semi-supervised classification | ~13K / ~128K images | Label efficiency |
| COCO Detection | Object detection | 118K images | Transfer to detection |
| VOC Segmentation | Semantic segmentation | ~10K images | Dense prediction transfer |
| Transfer Suite | Multiple classification tasks | 12+ datasets | Cross-domain generalization |
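A transfer-suite evaluation can be assembled by looping the `LinearEvaluator` above over several torchvision classification datasets. The sketch below uses three datasets and a shortened training schedule purely for illustration, and assumes `encoder` and `feature_dim` come from your pretrained SSL model.

```python
import torchvision.transforms as T
from torchvision import datasets
from torch.utils.data import DataLoader

# Assumes `encoder`, `feature_dim`, and the LinearEvaluator class defined above.
transform = T.Compose([
    T.Resize(224), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

transfer_datasets = {
    "cifar10":  (datasets.CIFAR10,  {"train": True},     {"train": False},    10),
    "cifar100": (datasets.CIFAR100, {"train": True},     {"train": False},    100),
    "stl10":    (datasets.STL10,    {"split": "train"},  {"split": "test"},   10),
}

results = {}
for name, (cls, train_kwargs, test_kwargs, num_classes) in transfer_datasets.items():
    train_set = cls(root="./data", download=True, transform=transform, **train_kwargs)
    test_set = cls(root="./data", download=True, transform=transform, **test_kwargs)

    evaluator = LinearEvaluator(encoder, feature_dim, num_classes)
    results[name] = evaluator.train(
        DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4),
        DataLoader(test_set, batch_size=256, shuffle=False, num_workers=4),
        epochs=30,  # shorter schedule for a quick cross-dataset comparison
    )["best_accuracy"]
```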
Classification accuracy doesn't capture all aspects of representation quality. Comprehensive evaluation requires additional metrics:
The uniformity-alignment framework:
Wang & Isola (2020) proposed that good representations should have two properties: alignment, meaning features of positive pairs (two augmented views of the same sample) lie close together, and uniformity, meaning features are spread roughly uniformly over the unit hypersphere.
These properties are directly measurable and correlate with downstream performance, providing insight into representation quality beyond accuracy metrics.
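The sketch below computes both quantities following the definitions in Wang & Isola (2020), assuming `z1` and `z2` are L2-normalized features of two augmented views of the same batch; the defaults `alpha=2` and `t=2` are the paper's common choices, and the function names are illustrative.

```python
import torch


def alignment(z1: torch.Tensor, z2: torch.Tensor, alpha: int = 2) -> torch.Tensor:
    """Alignment: expected distance between features of positive pairs (lower is better)."""
    return (z1 - z2).norm(dim=1).pow(alpha).mean()


def uniformity(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Uniformity: log of the average Gaussian potential over all feature pairs
    (lower means features cover the hypersphere more uniformly)."""
    sq_dists = torch.pdist(z, p=2).pow(2)  # pairwise squared Euclidean distances
    return sq_dists.mul(-t).exp().mean().log()
```

Tracking these two numbers alongside k-NN accuracy during pretraining gives an early warning for collapse: a collapsing model shows excellent alignment but rapidly worsening uniformity.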
You have now mastered self-supervised learning! From pretext task design through contrastive and non-contrastive methods to rigorous evaluation—you possess the complete toolkit for leveraging vast unlabeled data to learn powerful representations.