Contrastive learning has emerged as the dominant paradigm in self-supervised representation learning. The core idea is elegantly simple: learn representations by pulling similar samples together while pushing dissimilar samples apart. This seemingly straightforward principle, when properly implemented, yields representations that rival or exceed those learned with full supervision.
The breakthrough came when researchers realized that data augmentation could define similarity—two augmented views of the same image should have similar representations. This insight spawned a family of methods (SimCLR, MoCo, SwAV, BYOL) that have transformed the field.
By the end of this page, you will understand the mathematical foundations of contrastive learning through InfoNCE and related losses, master the key methods (SimCLR, MoCo, BYOL) and their design choices, analyze why contrastive learning works from information-theoretic and geometric perspectives, and implement production-ready contrastive learning systems.
At its core, contrastive learning answers the question: which samples are similar, and which are different? The framework consists of several key components, summarized in the table below.
In the standard self-supervised setup, we have unlabeled data and must construct the training signal ourselves. For images, the most common approach creates positive pairs through data augmentation: each image x is transformed by two independently sampled augmentations to produce views x₁ and x₂, which are passed through an encoder (and projection head) to obtain representations z₁ and z₂.
The loss encourages z₁ and z₂ to be close while being far from the representations of other images.
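The sketch below illustrates this two-view pipeline. It is a minimal illustration, not a reference implementation: `two_view_transform` and `encoder` are placeholder names for a standard augmentation pipeline and any backbone network.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Hypothetical two-view augmentation: each call produces one random view.
# The specific transforms (crop, flip, color jitter, blur) follow common practice.
two_view_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def make_positive_pair(image, encoder: nn.Module):
    """Create two augmented views of one image and encode them."""
    x1 = two_view_transform(image)   # first random view
    x2 = two_view_transform(image)   # second random view (independent randomness)
    z1 = encoder(x1.unsqueeze(0))    # representation of view 1
    z2 = encoder(x2.unsqueeze(0))    # representation of view 2
    return z1, z2                    # the contrastive loss pulls z1 and z2 together
```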
| Component | Purpose | Design Choices | Impact |
|---|---|---|---|
| Data Augmentation | Define semantic similarity | Crop, color, blur, rotation | Critical—determines what invariances are learned |
| Encoder | Extract representations | ResNet, ViT, CNN depth | Larger encoders generally better |
| Projection Head | Map to contrastive space | MLP depth and width | Improves performance significantly |
| Negative Samples | Prevent representation collapse | Batch size, memory bank | More negatives generally help |
| Temperature | Control similarity distribution | Typically 0.05-0.5 | Lower = harder negatives |
The InfoNCE (Information Noise-Contrastive Estimation) loss is the foundation of modern contrastive learning. It optimizes a lower bound on the mutual information between the two views.
Mathematical formulation:
Given a positive pair (z_i, z_j) and N-1 negative samples, the InfoNCE loss is:
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{N} \exp(\text{sim}(z_i, z_k) / \tau)}$$
where sim(·,·) is cosine similarity and τ is a temperature parameter.
This is essentially a softmax classification problem: identify the positive among all negatives.
Temperature τ controls the 'hardness' of the contrastive task. Low temperature (e.g., 0.07) makes the loss focus on hard negatives—samples that are similar but shouldn't be. High temperature (e.g., 0.5) treats all negatives more equally. Finding the right temperature is critical for performance.
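The short sketch below makes this concrete. The similarity values are made up for illustration; it simply compares how the softmax over the denominator distributes weight at two temperatures.

```python
import torch
import torch.nn.functional as F

# Illustrative similarities of one anchor to a positive, a hard negative,
# and an easy negative (numbers invented for this demonstration).
sims = torch.tensor([0.9, 0.8, 0.1])

for tau in (0.07, 0.5):
    weights = F.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {weights.tolist()}")

# At tau=0.07, nearly all of the non-positive probability mass sits on the hard
# negative and the easy negative is effectively ignored, so gradients focus on
# the hard case. At tau=0.5, the easy negative also contributes noticeably.
```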
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InfoNCELoss(nn.Module):
    """
    InfoNCE (Noise Contrastive Estimation) loss for contrastive learning.

    Implements the NT-Xent (Normalized Temperature-scaled Cross Entropy) variant.
    """

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(
        self,
        z_i: torch.Tensor,
        z_j: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute InfoNCE loss for a batch of positive pairs.

        Args:
            z_i: First view embeddings [batch_size, embedding_dim]
            z_j: Second view embeddings [batch_size, embedding_dim]

        Returns:
            Scalar loss value
        """
        batch_size = z_i.size(0)

        # Normalize embeddings to unit sphere
        z_i = F.normalize(z_i, dim=1)
        z_j = F.normalize(z_j, dim=1)

        # Concatenate both views: [2*batch_size, embedding_dim]
        representations = torch.cat([z_i, z_j], dim=0)

        # Compute similarity matrix: [2*batch_size, 2*batch_size]
        similarity_matrix = torch.mm(representations, representations.t())

        # Scale by temperature
        similarity_matrix = similarity_matrix / self.temperature

        # Create labels: positive pairs are at positions
        # (i, batch_size + i) and (batch_size + i, i)
        labels = torch.cat([
            torch.arange(batch_size, 2 * batch_size),
            torch.arange(batch_size)
        ], dim=0).to(z_i.device)

        # Mask out self-similarities (diagonal)
        mask = torch.eye(2 * batch_size, dtype=torch.bool, device=z_i.device)
        similarity_matrix = similarity_matrix.masked_fill(mask, float('-inf'))

        # Compute cross-entropy loss
        loss = F.cross_entropy(similarity_matrix, labels)

        return loss
```

SimCLR (Simple Framework for Contrastive Learning of Visual Representations) by Chen et al. (2020) demonstrated that surprisingly good results can be achieved with a simple approach, provided certain design choices are made.
Key components of SimCLR:
- Strong, composed data augmentations (random crop, horizontal flip, color jitter, grayscale, Gaussian blur); the combination of cropping and color distortion is especially important.
- A standard encoder (ResNet-50 in the original paper) whose output representation is used for downstream tasks.
- A nonlinear projection head (2-layer MLP) that maps representations into the space where the contrastive loss is computed.
- The NT-Xent (InfoNCE) loss computed over very large batches, with the other in-batch samples serving as negatives.
A puzzling finding: the projection head improves contrastive loss but representations extracted BEFORE the projection head transfer better to downstream tasks. The hypothesis is that the projection head discards information useful for downstream tasks but harmful to the contrastive objective.
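Below is a minimal sketch of one SimCLR-style training step that reuses the `InfoNCELoss` defined above. The ResNet-50 backbone and 2-layer MLP projection head follow the paper's recipe, but `SimCLRModel`, `training_step`, and the other names are our own, and the data-loading and optimizer details are omitted.

```python
import torch
import torch.nn as nn
import torchvision


class SimCLRModel(nn.Module):
    """Encoder plus projection head, in the spirit of SimCLR."""

    def __init__(self, projection_dim: int = 128):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
        backbone.fc = nn.Identity()             # keep the pooled features
        self.encoder = backbone
        self.projector = nn.Sequential(         # 2-layer MLP projection head
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, projection_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)      # representation h: transferred to downstream tasks
        z = self.projector(h)    # projection z: used only for the contrastive loss
        return z


def training_step(model, criterion, x1, x2):
    """One contrastive step on a batch of two augmented views x1, x2."""
    z1 = model(x1)
    z2 = model(x2)
    return criterion(z1, z2)     # InfoNCE / NT-Xent loss defined above
```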
MoCo (Momentum Contrast) by He et al. (2020) addresses a practical limitation of SimCLR: the need for very large batch sizes. MoCo introduces two key innovations:
- A queue (memory bank) of encoded keys from previous batches, which serves as a large pool of negatives decoupled from the batch size.
- A momentum-updated key encoder, which keeps the keys in the queue consistent over time.
This allows using many more negatives (e.g., 65536) without the memory cost of large batches.
The momentum update mechanism works as follows. MoCo maintains two encoders: a query encoder with parameters θ_q, updated by backpropagation, and a key encoder with parameters θ_k, updated as an exponential moving average of the query encoder: θ_k ← m·θ_k + (1 − m)·θ_q.
The momentum coefficient m (typically 0.999) ensures the key encoder changes slowly. This consistency is crucial—if keys in the queue came from very different encoders, the contrastive task would be inconsistent.
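For concreteness, here is a minimal sketch of the momentum update and queue maintenance, assuming the queue is a plain [queue_size, dim] tensor; the function names `momentum_update` and `update_queue` are illustrative, not taken from the official MoCo code.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def momentum_update(query_encoder: nn.Module, key_encoder: nn.Module, m: float = 0.999):
    """key params <- m * key params + (1 - m) * query params."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)


@torch.no_grad()
def update_queue(queue: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Enqueue the newest keys and drop the oldest ones (FIFO).

    queue: [queue_size, dim] tensor of past key embeddings (the negatives)
    keys:  [batch_size, dim] key embeddings from the current batch
    """
    return torch.cat([keys, queue], dim=0)[: queue.size(0)]
```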
MoCo v2 combined the best of both worlds: MoCo's queue mechanism with SimCLR's augmentations and projection head.
| Aspect | SimCLR | MoCo |
|---|---|---|
| Negative source | Current batch | Memory queue |
| Batch size requirement | Very large (4096+) | Standard (256) |
| Memory usage | High | Moderate |
| Number of negatives | Limited by batch | Large (65536) |
| Encoder updates | Single encoder | Momentum encoder for keys |
| Implementation complexity | Simple | Moderate |
Understanding why contrastive learning produces good representations requires examining multiple theoretical perspectives. From an information-theoretic view, minimizing InfoNCE maximizes a lower bound on the mutual information between the two views, encouraging the encoder to keep the information shared across augmentations. From a geometric view, the contrastive loss can be understood as jointly optimizing alignment (positive pairs map to nearby points) and uniformity (representations spread out over the unit hypersphere) (Wang & Isola, 2020).
Negatives are essential in standard contrastive learning—without them, the trivial solution of mapping all inputs to the same point minimizes the loss. Negatives provide the 'push apart' force that prevents collapse. This is why methods like BYOL and SimSiam that eliminate negatives are so surprising.
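The alignment/uniformity view makes this concrete: a collapsed encoder achieves perfect alignment but the worst possible uniformity. Below is a small sketch using the standard definitions with their usual settings (α = 2, t = 2); the helper names are ours.

```python
import torch


def alignment(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between positive pairs (lower is better)."""
    return (z1 - z2).pow(2).sum(dim=1).mean()


def uniformity(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Log of the average Gaussian potential between all pairs (lower is better)."""
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()


# Collapsed embeddings: every input maps to the same unit vector.
collapsed = torch.nn.functional.normalize(torch.ones(128, 64), dim=1)

# Alignment is perfect (0), but uniformity sits at its maximum (0 rather than a
# negative value), which is exactly the failure mode that negatives prevent.
print(alignment(collapsed, collapsed), uniformity(collapsed))
```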
You now understand contrastive learning—the dominant paradigm in self-supervised learning. Next, we'll explore non-contrastive methods that achieve similar results without explicit negative samples.