The ultimate goal of self-supervised learning isn't to solve pretext tasks—it's to learn representations that capture the essential structure of data. These representations serve as a universal foundation that can be adapted to countless downstream tasks with minimal additional training.
But what exactly is a 'good' representation? How do we measure representation quality? And what architectural and training choices lead to representations that transfer effectively? This page provides definitive answers to these fundamental questions.
By the end of this page, you will understand the mathematical definition of representations and embeddings, master the criteria that define high-quality representations, analyze the relationship between pretext tasks and downstream transfer, and design representation learning pipelines for various applications.
A representation is a transformation of raw data into a form more suitable for downstream processing. Mathematically, given input space X (e.g., images, text), a representation function f: X → Z maps inputs to a representation space Z, typically R^d for some dimensionality d.
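As a concrete illustration, here is a minimal sketch (assuming torchvision is available) that treats a ResNet-50 with its classification head removed as the representation function f, mapping image tensors in X to vectors in Z = R^2048:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# A representation function f: X -> Z, here Z = R^2048.
# Replacing the final classification layer with the identity makes the
# network output the pooled feature vector instead of class logits.
encoder = models.resnet50(weights=None)  # pretrained weights would be used in practice
encoder.fc = nn.Identity()
encoder.eval()

x = torch.randn(8, 3, 224, 224)   # a batch of inputs from X
with torch.no_grad():
    z = encoder(x)                # representations in Z
print(z.shape)                    # torch.Size([8, 2048])
```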
Common forms the representation space can take:
| Type | Description | Example | Use Case |
|---|---|---|---|
| Dense Embeddings | Fixed-size continuous vectors | ResNet features, BERT [CLS] | Classification, retrieval |
| Sequence Representations | Variable-length token embeddings | BERT token embeddings | NER, question answering |
| Hierarchical Features | Multi-scale feature pyramids | FPN features | Object detection, segmentation |
| Probabilistic Embeddings | Distributions, not points | VAE latent space | Generation, uncertainty |
The representation hypothesis:
The central assumption of representation learning is that there exists an underlying structure in data that, once extracted, makes downstream tasks easier. A good representation 'unfolds' the data manifold, making previously entangled factors of variation separable.
Consider images: raw pixels entangle object identity, pose, lighting, and background. A good representation disentangles these factors, placing images of the same object nearby regardless of pose or lighting.
What distinguishes a useful representation from a useless one? Research has identified several key properties that correlate with downstream performance.
A powerful way to evaluate representations: freeze the encoder, train only a linear classifier on top. If linear probe accuracy is high, the representation has already separated semantic categories—the hardest work is done. This is the gold standard for comparing self-supervised methods.
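A minimal linear-probe sketch, assuming features have already been extracted by a frozen encoder and are held in memory (the function name and full-batch training loop are simplifications for illustration):

```python
import torch
import torch.nn as nn

def linear_probe(train_feats, train_labels, val_feats, val_labels,
                 feature_dim, num_classes, epochs=100, lr=1e-2):
    """Train only a linear classifier on frozen features; return val accuracy."""
    probe = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(probe(train_feats), train_labels)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        preds = probe(val_feats).argmax(dim=1)
        return (preds == val_labels).float().mean().item()
```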
The invariance-discriminability tradeoff:
Representations must balance two competing demands: invariance to nuisance factors (pose, lighting, augmentation) and discriminability between semantically distinct inputs.
Too much invariance: different objects become indistinguishable (collapse). Too little invariance: the same object under different conditions appears different.
Self-supervised methods must carefully navigate this tradeoff through task design and data augmentation strategies.
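For example, a SimCLR-style augmentation pipeline (a sketch using standard torchvision transforms; the exact parameters are illustrative) encodes exactly which factors the representation is asked to become invariant to:

```python
import torchvision.transforms as T

# Each transform names a factor of variation the representation should ignore;
# removing or weakening one shifts the tradeoff back toward discriminability
# for that factor.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),                   # scale / framing
    T.RandomHorizontalFlip(),                                     # left-right pose
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),    # color
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),                               # fine texture
    T.ToTensor(),
])

# Two independently augmented "views" of the same image: the pretext task
# pulls their representations together, enforcing the chosen invariances.
def two_views(pil_image):
    return augment(pil_image), augment(pil_image)
```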
The choice of neural network architecture profoundly affects representation quality. Different architectures have different inductive biases that shape what they learn.
| Architecture | Inductive Bias | Representation Type | Best For |
|---|---|---|---|
| ResNet | Spatial locality, hierarchical features | Global pooled features | Image classification, transfer |
| Vision Transformer | Global attention, patch-based | CLS token or avg patches | Large-scale pretraining |
| Transformer (NLP) | Sequence modeling, attention | CLS token or mean pooling | Language understanding |
| U-Net / FPN | Multi-scale, encoder-decoder | Dense feature maps | Segmentation, detection |
```python
import torch
import torch.nn as nn
from typing import Dict, Optional


class RepresentationExtractor(nn.Module):
    """
    Flexible representation extractor supporting multiple extraction
    strategies from pretrained encoders.
    """

    def __init__(
        self,
        encoder: nn.Module,
        projection_dim: Optional[int] = None,
        extraction_method: str = "pool",  # pool, cls, mean
    ):
        super().__init__()
        self.encoder = encoder
        self.extraction_method = extraction_method

        # Optional projection head for contrastive learning
        if projection_dim is not None:
            # Determine encoder output dimension
            self.projector = nn.Sequential(
                nn.Linear(encoder.output_dim, encoder.output_dim),
                nn.ReLU(inplace=True),
                nn.Linear(encoder.output_dim, projection_dim),
            )
        else:
            self.projector = None

    def extract_features(self, x: torch.Tensor) -> torch.Tensor:
        """Extract representations using the configured method."""
        features = self.encoder(x)

        if self.extraction_method == "pool":
            # Global average pooling (for CNNs)
            if features.dim() == 4:  # [B, C, H, W]
                features = features.mean(dim=[2, 3])
        elif self.extraction_method == "cls":
            # CLS token (for Transformers)
            features = features[:, 0]
        elif self.extraction_method == "mean":
            # Mean pooling over sequence
            features = features.mean(dim=1)

        return features

    def forward(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
        """Forward pass returning both representations and projections."""
        # Extract base representations
        representations = self.extract_features(x)
        output = {"representations": representations}

        # Apply projection if available
        if self.projector is not None:
            projections = self.projector(representations)
            output["projections"] = projections

        return output

    def get_representation(self, x: torch.Tensor) -> torch.Tensor:
        """Get only the representation (for downstream tasks)."""
        return self.extract_features(x)
```

The value of self-supervised representations is realized through transfer learning—applying learned features to new tasks. Understanding transfer strategies is essential for practical deployment.
Representations optimized for pretext tasks may not perfectly align with downstream needs. This gap motivates research into task-agnostic representation learning, domain adaptation, and prompt-based methods that better bridge pretraining and downstream objectives.
Factors affecting transfer success include the similarity between the pretraining and downstream domains, the amount of labeled data available for the target task, and whether the encoder is kept frozen or fine-tuned (two strategies sketched below).
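A sketch of the two most common transfer strategies, assuming the encoder outputs a flat vector of size `feature_dim` (the function names are hypothetical):

```python
import torch
import torch.nn as nn

def build_transfer_model(encoder, feature_dim, num_classes, freeze=True):
    """Attach a task head; freeze=True gives feature extraction,
    freeze=False gives end-to-end fine-tuning."""
    if freeze:
        for p in encoder.parameters():
            p.requires_grad = False
    return nn.Sequential(encoder, nn.Linear(feature_dim, num_classes))

def make_finetune_optimizer(encoder, head, base_lr=1e-3):
    """Common practice when fine-tuning: a smaller learning rate for the
    pretrained encoder than for the freshly initialized head."""
    return torch.optim.AdamW([
        {"params": encoder.parameters(), "lr": base_lr * 0.1},
        {"params": head.parameters(), "lr": base_lr},
    ])
```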
Rigorous evaluation of representation quality requires multiple complementary metrics. No single measure captures all aspects of a good representation.
| Metric | What It Measures | How to Compute | Limitations |
|---|---|---|---|
| Linear Probe Accuracy | Semantic separability | Train linear classifier on frozen features | Task-specific, ignores fine-grained info |
| k-NN Accuracy | Neighborhood structure | Classify by k nearest neighbors | Sensitive to distance metric |
| Clustering Quality (NMI) | Natural grouping | Cluster features, compare to labels | Assumes clusters match categories |
| Feature Rank | Utilization of dimensions | Effective rank of covariance matrix | Doesn't directly measure semantics |
| Centered Kernel Alignment | Similarity between representations | Compare representation kernels | Computational cost for large datasets |
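Two of these metrics can be computed in a few lines with scikit-learn; a sketch assuming features and labels are NumPy arrays:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=20):
    """k-NN probe: classify each validation feature by its nearest neighbors."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return knn.score(val_feats, val_labels)

def clustering_nmi(feats, labels, num_classes):
    """Cluster the features and compare the assignment to ground-truth labels."""
    clusters = KMeans(n_clusters=num_classes, n_init=10).fit_predict(feats)
    return normalized_mutual_info_score(labels, clusters)
```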
A critical failure mode in self-supervised learning is representation collapse—when the model learns to map all inputs to the same point or to a low-dimensional subspace. This produces trivially 'invariant' representations with no discriminative power. Methods must actively prevent collapse through architectural choices, loss design, or negative examples.
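One way to monitor for collapse during training, a sketch based on the feature-rank idea from the table above: compute the effective rank (the exponential of the entropy of normalized singular values) of a batch of representations; a value far below the embedding dimension signals that features occupy only a low-dimensional subspace.

```python
import torch

def effective_rank(features: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank of the centered feature matrix.
    Values near 1.0 indicate severe representation collapse."""
    feats = features - features.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(feats)       # singular values
    p = s / (s.sum() + eps)               # normalize to a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

# Healthy features span many dimensions; collapsed ones do not.
healthy = torch.randn(512, 128)
collapsed = torch.randn(512, 1) @ torch.randn(1, 128)   # rank-1 features
print(effective_rank(healthy))     # large (spans many dimensions)
print(effective_rank(collapsed))   # close to 1
```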
You now understand what makes representations powerful and how to evaluate their quality. Next, we'll dive into contrastive learning—the dominant paradigm for learning representations through comparison.