Self-supervised learning presents a unique evaluation challenge: how do we measure the quality of representations learned without labels? Unlike supervised learning where accuracy on held-out data is the natural metric, SSL requires proxy evaluations that predict how well representations will transfer to downstream tasks.
The field has developed standardized evaluation protocols that enable fair comparison across methods. Understanding these protocols is essential for both conducting rigorous research and selecting methods for practical applications.
By the end of this page, you will master the standard evaluation protocols (linear probe, fine-tuning, k-NN), understand benchmarks and datasets used for SSL evaluation, analyze the relationship between evaluation metrics and downstream performance, and design rigorous evaluation pipelines for your own SSL research.
The linear evaluation protocol is the gold standard for comparing self-supervised methods. The procedure is simple: freeze the pretrained encoder, train a single linear classifier on top of its frozen features using the labeled training data, and report classification accuracy on a held-out test set.
The key insight: if a linear classifier achieves high accuracy, the representation has already done the hard work of separating semantic categories. Complex non-linear decision boundaries aren't needed.
Linear probing measures representation quality in isolation. Since the linear classifier has minimal capacity, all discriminative power must come from the frozen representations. This prevents conflating 'good SSL method' with 'good fine-tuning procedure' and enables fair comparison.
| Aspect | Common Choice | Rationale |
|---|---|---|
| Optimizer | SGD with momentum | More stable than Adam for linear probes |
| Learning rate | 0.1-30 (with decay) | Optimal LR varies by representation |
| Batch size | 256-4096 | Larger batches often help |
| Training epochs | 100 | Enough for convergence |
| Augmentation | Standard crop + flip | Consistent with supervised baselines |
| Feature extraction | Global average pool | For CNNs; CLS token for Transformers |
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from typing import Tuple


class LinearEvaluator:
    """
    Standard linear evaluation protocol for SSL representations.
    """

    def __init__(
        self,
        encoder: nn.Module,
        feature_dim: int,
        num_classes: int,
        device: str = "cuda"
    ):
        self.encoder = encoder.to(device)
        self.encoder.eval()  # Freeze encoder

        # Single linear layer classifier
        self.classifier = nn.Linear(feature_dim, num_classes).to(device)
        self.device = device

        # Freeze encoder parameters
        for param in self.encoder.parameters():
            param.requires_grad = False

    def extract_features(self, dataloader: DataLoader) -> Tuple[torch.Tensor, torch.Tensor]:
        """Extract features for entire dataset."""
        features_list = []
        labels_list = []

        with torch.no_grad():
            for images, labels in dataloader:
                images = images.to(self.device)
                features = self.encoder(images)

                # Global average pooling if needed
                if features.dim() == 4:
                    features = features.mean(dim=[2, 3])

                features_list.append(features.cpu())
                labels_list.append(labels)

        return torch.cat(features_list), torch.cat(labels_list)

    def train(
        self,
        train_loader: DataLoader,
        val_loader: DataLoader,
        epochs: int = 100,
        lr: float = 0.1
    ) -> dict:
        """Train linear classifier and evaluate."""
        optimizer = optim.SGD(
            self.classifier.parameters(),
            lr=lr,
            momentum=0.9,
            weight_decay=0
        )
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)
        criterion = nn.CrossEntropyLoss()

        best_acc = 0.0

        for epoch in range(epochs):
            self.classifier.train()

            for images, labels in train_loader:
                images, labels = images.to(self.device), labels.to(self.device)

                # Extract features (frozen encoder)
                with torch.no_grad():
                    features = self.encoder(images)
                    if features.dim() == 4:
                        features = features.mean(dim=[2, 3])

                # Train classifier
                logits = self.classifier(features)
                loss = criterion(logits, labels)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            scheduler.step()

            # Evaluate
            val_acc = self.evaluate(val_loader)
            best_acc = max(best_acc, val_acc)

        return {"best_accuracy": best_acc, "final_accuracy": val_acc}

    @torch.no_grad()
    def evaluate(self, dataloader: DataLoader) -> float:
        """Evaluate on validation/test set."""
        self.classifier.eval()
        correct = 0
        total = 0

        for images, labels in dataloader:
            images, labels = images.to(self.device), labels.to(self.device)

            features = self.encoder(images)
            if features.dim() == 4:
                features = features.mean(dim=[2, 3])

            logits = self.classifier(features)
            preds = logits.argmax(dim=1)

            correct += (preds == labels).sum().item()
            total += labels.size(0)

        return correct / total
```

k-NN evaluation provides a training-free alternative to linear probing. The idea is simple: classify each test sample by the majority vote of its k nearest neighbors in the training set.
The table below summarizes why k-NN evaluation is valuable and how it differs from linear probing:
| Aspect | Linear Probe | k-NN |
|---|---|---|
| Training | Required | None |
| Computational cost | Moderate | Low (one-time distance computation) |
| What it measures | Linear separability | Local neighborhood structure |
| Hyperparameters | LR, epochs, WD | k value, distance metric |
| Typical accuracy | Higher | ~5-10% lower than linear probe |
| Use case | Final comparison | Quick validation, checkpointing |
k-NN evaluation is commonly computed periodically during SSL pretraining to monitor representation quality. Since it requires no training, it adds minimal overhead. Sudden drops in k-NN accuracy can signal collapse or other training issues.
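As a sketch of how such a check can be implemented, the snippet below runs a cosine-similarity k-NN classifier on frozen features (for example, the outputs of the `extract_features` method above). The function name and the choice of `k=20` are illustrative assumptions; this uses a plain majority vote, whereas some methods (e.g., DINO) use a similarity-weighted vote instead.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def knn_evaluate(train_features, train_labels, test_features, test_labels, k=20):
    """Classify each test sample by majority vote of its k nearest
    training neighbors under cosine similarity."""
    # L2-normalize so that dot products equal cosine similarities
    train_features = F.normalize(train_features, dim=1)
    test_features = F.normalize(test_features, dim=1)

    correct = 0
    for i in range(0, test_features.size(0), 256):  # process the test set in chunks
        chunk = test_features[i:i + 256]
        sims = chunk @ train_features.T                    # (chunk, num_train)
        _, idx = sims.topk(k, dim=1)                       # indices of k nearest neighbors
        neighbor_labels = train_labels[idx]                # (chunk, k)
        preds = torch.mode(neighbor_labels, dim=1).values  # majority vote
        correct += (preds == test_labels[i:i + 256]).sum().item()

    return correct / test_features.size(0)
```

Because the entire computation is a handful of matrix multiplications over precomputed features, it can be run every few pretraining epochs at negligible cost.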
While linear probing measures frozen representation quality, fine-tuning evaluation measures how well pretrained weights serve as initialization for downstream task training.
Fine-tuning protocols typically include end-to-end fine-tuning on the full labeled dataset and semi-supervised fine-tuning on small label fractions (e.g., 1% or 10% of ImageNet labels).
Fine-tuning often yields higher absolute accuracy than linear probing because the encoder can adapt to the target task.
With enough fine-tuning, even poor initializations can reach good final accuracy. This is why linear probing remains the primary comparison metric—it isolates the contribution of pretraining from the contribution of downstream training.
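For concreteness, here is a minimal sketch of a fine-tuning setup, reusing the encoder/classifier naming from the linear evaluation code above. The helper name and the discriminative learning rates are illustrative assumptions rather than values prescribed by any particular paper.

```python
import torch.nn as nn
import torch.optim as optim


def build_finetune_optimizer(encoder: nn.Module, classifier: nn.Module,
                             encoder_lr: float = 1e-3, head_lr: float = 1e-2):
    """Unfreeze the encoder and train it jointly with the classification head.
    A smaller learning rate on the encoder helps preserve pretrained features."""
    for param in encoder.parameters():
        param.requires_grad = True  # unlike linear probing, the encoder is updated

    return optim.SGD(
        [
            {"params": encoder.parameters(), "lr": encoder_lr},
            {"params": classifier.parameters(), "lr": head_lr},
        ],
        momentum=0.9,
        weight_decay=1e-4,
    )
```

The two-group optimizer is the key difference from linear probing: the head learns quickly while the encoder is nudged gently away from its pretrained weights.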
The SSL community has converged on standard benchmarks enabling fair comparison across methods.
| Benchmark | Task | Scale | What It Tests |
|---|---|---|---|
| ImageNet Linear | Classification (1000 classes) | 1.28M images | General visual representations |
| ImageNet 1% / 10% | Semi-supervised classification | ~13K / ~128K images | Label efficiency |
| COCO Detection | Object detection | 118K images | Transfer to detection |
| VOC Segmentation | Semantic segmentation | ~10K images | Dense prediction transfer |
| Transfer Suite | Multiple classification tasks | 12+ datasets | Cross-domain generalization |
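A transfer-suite evaluation can be assembled by looping the `LinearEvaluator` above over several torchvision classification datasets. The sketch below uses three datasets and a shortened training schedule purely for illustration, and assumes `encoder` and `feature_dim` come from your pretrained SSL model.

```python
import torchvision.transforms as T
from torchvision import datasets
from torch.utils.data import DataLoader

# Assumes `encoder`, `feature_dim`, and the LinearEvaluator class defined above.
transform = T.Compose([
    T.Resize(224), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

transfer_datasets = {
    "cifar10":  (datasets.CIFAR10,  {"train": True},     {"train": False},    10),
    "cifar100": (datasets.CIFAR100, {"train": True},     {"train": False},    100),
    "stl10":    (datasets.STL10,    {"split": "train"},  {"split": "test"},   10),
}

results = {}
for name, (cls, train_kwargs, test_kwargs, num_classes) in transfer_datasets.items():
    train_set = cls(root="./data", download=True, transform=transform, **train_kwargs)
    test_set = cls(root="./data", download=True, transform=transform, **test_kwargs)

    evaluator = LinearEvaluator(encoder, feature_dim, num_classes)
    results[name] = evaluator.train(
        DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4),
        DataLoader(test_set, batch_size=256, shuffle=False, num_workers=4),
        epochs=30,  # shorter schedule for a quick cross-dataset comparison
    )["best_accuracy"]
```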
Classification accuracy doesn't capture all aspects of representation quality. Comprehensive evaluation requires additional metrics:
The uniformity-alignment framework:
Wang & Isola (2020) proposed that good representations should have two properties: alignment, meaning features of positive pairs (two augmented views of the same sample) lie close together, and uniformity, meaning features are spread roughly uniformly over the unit hypersphere.
These properties are directly measurable and correlate with downstream performance, providing insight into representation quality beyond accuracy metrics.
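The sketch below computes both quantities following the definitions in Wang & Isola (2020), assuming `z1` and `z2` are L2-normalized features of two augmented views of the same batch; the defaults `alpha=2` and `t=2` are the paper's common choices, and the function names are illustrative.

```python
import torch


def alignment(z1: torch.Tensor, z2: torch.Tensor, alpha: int = 2) -> torch.Tensor:
    """Alignment: expected distance between features of positive pairs (lower is better)."""
    return (z1 - z2).norm(dim=1).pow(alpha).mean()


def uniformity(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Uniformity: log of the average Gaussian potential over all feature pairs
    (lower means features cover the hypersphere more uniformly)."""
    sq_dists = torch.pdist(z, p=2).pow(2)  # pairwise squared Euclidean distances
    return sq_dists.mul(-t).exp().mean().log()
```

Tracking these two numbers alongside k-NN accuracy during pretraining gives an early warning for collapse: a collapsing model shows excellent alignment but rapidly worsening uniformity.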
You have now mastered self-supervised learning! From pretext task design through contrastive and non-contrastive methods to rigorous evaluation—you possess the complete toolkit for leveraging vast unlabeled data to learn powerful representations.