The ultimate goal of self-supervised learning isn't to solve pretext tasks—it's to learn representations that capture the essential structure of data. These representations serve as a universal foundation that can be adapted to countless downstream tasks with minimal additional training.
But what exactly is a 'good' representation? How do we measure representation quality? And what architectural and training choices lead to representations that transfer effectively? This page provides definitive answers to these fundamental questions.
By the end of this page, you will understand the mathematical definition of representations and embeddings, master the criteria that define high-quality representations, analyze the relationship between pretext tasks and downstream transfer, and design representation learning pipelines for various applications.
A representation is a transformation of raw data into a form more suitable for downstream processing. Mathematically, given input space X (e.g., images, text), a representation function f: X → Z maps inputs to a representation space Z, typically R^d for some dimensionality d.
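As a concrete illustration, here is a minimal sketch (assuming torchvision is available) that treats a ResNet-50 with its classification head removed as the representation function f, mapping image tensors in X to vectors in Z = R^2048:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# A representation function f: X -> Z, here Z = R^2048.
# Replacing the final classification layer with the identity makes the
# network output the pooled feature vector instead of class logits.
encoder = models.resnet50(weights=None)  # pretrained weights would be used in practice
encoder.fc = nn.Identity()
encoder.eval()

x = torch.randn(8, 3, 224, 224)   # a batch of inputs from X
with torch.no_grad():
    z = encoder(x)                # representations in Z
print(z.shape)                    # torch.Size([8, 2048])
```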
Common forms the representation space can take:
| Type | Description | Example | Use Case |
|---|---|---|---|
| Dense Embeddings | Fixed-size continuous vectors | ResNet features, BERT [CLS] | Classification, retrieval |
| Sequence Representations | Variable-length token embeddings | BERT token embeddings | NER, question answering |
| Hierarchical Features | Multi-scale feature pyramids | FPN features | Object detection, segmentation |
| Probabilistic Embeddings | Distributions, not points | VAE latent space | Generation, uncertainty |
The representation hypothesis:
The central assumption of representation learning is that there exists an underlying structure in data that, once extracted, makes downstream tasks easier. A good representation 'unfolds' the data manifold, making previously entangled factors of variation separable.
Consider images: raw pixels entangle object identity, pose, lighting, and background. A good representation disentangles these factors, placing images of the same object nearby regardless of pose or lighting.
What distinguishes a useful representation from a useless one? Research has identified several key properties that correlate with downstream performance.
A powerful way to evaluate representations: freeze the encoder, train only a linear classifier on top. If linear probe accuracy is high, the representation has already separated semantic categories—the hardest work is done. This is the gold standard for comparing self-supervised methods.
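A minimal linear-probe sketch, assuming features have already been extracted by a frozen encoder and are held in memory (the function name and full-batch training loop are simplifications for illustration):

```python
import torch
import torch.nn as nn

def linear_probe(train_feats, train_labels, val_feats, val_labels,
                 feature_dim, num_classes, epochs=100, lr=1e-2):
    """Train only a linear classifier on frozen features; return val accuracy."""
    probe = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(probe(train_feats), train_labels)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        preds = probe(val_feats).argmax(dim=1)
        return (preds == val_labels).float().mean().item()
```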
The invariance-discriminability tradeoff:
Representations must balance two competing demands: invariance to nuisance factors (pose, lighting, augmentation) and discriminability between semantically distinct inputs.
Too much invariance: different objects become indistinguishable (collapse). Too little invariance: the same object under different conditions appears different.
Self-supervised methods must carefully navigate this tradeoff through task design and data augmentation strategies.
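For example, a SimCLR-style augmentation pipeline (a sketch using standard torchvision transforms; the exact parameters are illustrative) encodes exactly which factors the representation is asked to become invariant to:

```python
import torchvision.transforms as T

# Each transform names a factor of variation the representation should ignore;
# removing or weakening one shifts the tradeoff back toward discriminability
# for that factor.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),                   # scale / framing
    T.RandomHorizontalFlip(),                                     # left-right pose
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),    # color
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),                               # fine texture
    T.ToTensor(),
])

# Two independently augmented "views" of the same image: the pretext task
# pulls their representations together, enforcing the chosen invariances.
def two_views(pil_image):
    return augment(pil_image), augment(pil_image)
```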
The choice of neural network architecture profoundly affects representation quality. Different architectures have different inductive biases that shape what they learn.
| Architecture | Inductive Bias | Representation Type | Best For |
|---|---|---|---|
| ResNet | Spatial locality, hierarchical features | Global pooled features | Image classification, transfer |
| Vision Transformer | Global attention, patch-based | CLS token or avg patches | Large-scale pretraining |
| Transformer (NLP) | Sequence modeling, attention | CLS token or mean pooling | Language understanding |
| U-Net / FPN | Multi-scale, encoder-decoder | Dense feature maps | Segmentation, detection |
```python
import torch
import torch.nn as nn
from typing import Dict, Optional


class RepresentationExtractor(nn.Module):
    """
    Flexible representation extractor supporting multiple extraction
    strategies from pretrained encoders.
    """

    def __init__(
        self,
        encoder: nn.Module,
        projection_dim: Optional[int] = None,
        extraction_method: str = "pool",  # pool, cls, mean
    ):
        super().__init__()
        self.encoder = encoder
        self.extraction_method = extraction_method

        # Optional projection head for contrastive learning
        if projection_dim is not None:
            # Determine encoder output dimension
            self.projector = nn.Sequential(
                nn.Linear(encoder.output_dim, encoder.output_dim),
                nn.ReLU(inplace=True),
                nn.Linear(encoder.output_dim, projection_dim),
            )
        else:
            self.projector = None

    def extract_features(self, x: torch.Tensor) -> torch.Tensor:
        """Extract representations using the configured method."""
        features = self.encoder(x)

        if self.extraction_method == "pool":
            # Global average pooling (for CNNs)
            if features.dim() == 4:  # [B, C, H, W]
                features = features.mean(dim=[2, 3])
        elif self.extraction_method == "cls":
            # CLS token (for Transformers)
            features = features[:, 0]
        elif self.extraction_method == "mean":
            # Mean pooling over sequence
            features = features.mean(dim=1)

        return features

    def forward(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
        """Forward pass returning both representations and projections."""
        # Extract base representations
        representations = self.extract_features(x)
        output = {"representations": representations}

        # Apply projection if available
        if self.projector is not None:
            projections = self.projector(representations)
            output["projections"] = projections

        return output

    def get_representation(self, x: torch.Tensor) -> torch.Tensor:
        """Get only the representation (for downstream tasks)."""
        return self.extract_features(x)
```

The value of self-supervised representations is realized through transfer learning—applying learned features to new tasks. Understanding transfer strategies is essential for practical deployment.
Representations optimized for pretext tasks may not perfectly align with downstream needs. This gap motivates research into task-agnostic representation learning, domain adaptation, and prompt-based methods that better bridge pretraining and downstream objectives.
Factors affecting transfer success include the similarity between the pretraining and downstream domains, the amount of labeled data available for the target task, and whether the encoder is kept frozen or fine-tuned (two strategies sketched below).
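A sketch of the two most common transfer strategies, assuming the encoder outputs a flat vector of size `feature_dim` (the function names are hypothetical):

```python
import torch
import torch.nn as nn

def build_transfer_model(encoder, feature_dim, num_classes, freeze=True):
    """Attach a task head; freeze=True gives feature extraction,
    freeze=False gives end-to-end fine-tuning."""
    if freeze:
        for p in encoder.parameters():
            p.requires_grad = False
    return nn.Sequential(encoder, nn.Linear(feature_dim, num_classes))

def make_finetune_optimizer(encoder, head, base_lr=1e-3):
    """Common practice when fine-tuning: a smaller learning rate for the
    pretrained encoder than for the freshly initialized head."""
    return torch.optim.AdamW([
        {"params": encoder.parameters(), "lr": base_lr * 0.1},
        {"params": head.parameters(), "lr": base_lr},
    ])
```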
Rigorous evaluation of representation quality requires multiple complementary metrics. No single measure captures all aspects of a good representation.
| Metric | What It Measures | How to Compute | Limitations |
|---|---|---|---|
| Linear Probe Accuracy | Semantic separability | Train linear classifier on frozen features | Task-specific, ignores fine-grained info |
| k-NN Accuracy | Neighborhood structure | Classify by k nearest neighbors | Sensitive to distance metric |
| Clustering Quality (NMI) | Natural grouping | Cluster features, compare to labels | Assumes clusters match categories |
| Feature Rank | Utilization of dimensions | Effective rank of covariance matrix | Doesn't directly measure semantics |
| Centered Kernel Alignment | Similarity between representations | Compare representation kernels | Computational cost for large datasets |
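Two of these metrics can be computed in a few lines with scikit-learn; a sketch assuming features and labels are NumPy arrays:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=20):
    """k-NN probe: classify each validation feature by its nearest neighbors."""
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return knn.score(val_feats, val_labels)

def clustering_nmi(feats, labels, num_classes):
    """Cluster the features and compare the assignment to ground-truth labels."""
    clusters = KMeans(n_clusters=num_classes, n_init=10).fit_predict(feats)
    return normalized_mutual_info_score(labels, clusters)
```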
A critical failure mode in self-supervised learning is representation collapse—when the model learns to map all inputs to the same point or to a low-dimensional subspace. This produces trivially 'invariant' representations with no discriminative power. Methods must actively prevent collapse through architectural choices, loss design, or negative examples.
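One way to monitor for collapse during training, a sketch based on the feature-rank idea from the table above: compute the effective rank (the exponential of the entropy of normalized singular values) of a batch of representations; a value far below the embedding dimension signals that features occupy only a low-dimensional subspace.

```python
import torch

def effective_rank(features: torch.Tensor, eps: float = 1e-12) -> float:
    """Effective rank of the centered feature matrix.
    Values near 1.0 indicate severe representation collapse."""
    feats = features - features.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(feats)       # singular values
    p = s / (s.sum() + eps)               # normalize to a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

# Healthy features span many dimensions; collapsed ones do not.
healthy = torch.randn(512, 128)
collapsed = torch.randn(512, 1) @ torch.randn(1, 128)   # rank-1 features
print(effective_rank(healthy))     # large (spans many dimensions)
print(effective_rank(collapsed))   # close to 1
```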
You now understand what makes representations powerful and how to evaluate their quality. Next, we'll dive into contrastive learning—the dominant paradigm for learning representations through comparison.