Contrastive learning has emerged as the dominant paradigm in self-supervised representation learning. The core idea is elegantly simple: learn representations by pulling similar samples together while pushing dissimilar samples apart. This seemingly straightforward principle, when properly implemented, yields representations that rival or exceed those learned with full supervision.
The breakthrough came when researchers realized that data augmentation could define similarity—two augmented views of the same image should have similar representations. This insight spawned a family of methods (SimCLR, MoCo, SwAV, BYOL) that have transformed the field.
By the end of this page, you will understand the mathematical foundations of contrastive learning through InfoNCE and related losses, master the key methods (SimCLR, MoCo, BYOL) and their design choices, analyze why contrastive learning works from information-theoretic and geometric perspectives, and implement production-ready contrastive learning systems.
At its core, contrastive learning answers the question: which samples are similar, and which are different? The framework consists of several key components, summarized in the table below.
In the standard self-supervised setup, we have unlabeled data and must construct the training signal ourselves. For images, the most common approach creates positive pairs through data augmentation: each image x is transformed by two independently sampled augmentations to produce views x₁ and x₂, which are passed through an encoder (and projection head) to obtain representations z₁ and z₂.
The loss encourages z₁ and z₂ to be close while being far from the representations of other images.
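The sketch below illustrates this two-view pipeline. It is a minimal illustration, not a reference implementation: `two_view_transform` and `encoder` are placeholder names for a standard augmentation pipeline and any backbone network.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Hypothetical two-view augmentation: each call produces one random view.
# The specific transforms (crop, flip, color jitter, blur) follow common practice.
two_view_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def make_positive_pair(image, encoder: nn.Module):
    """Create two augmented views of one image and encode them."""
    x1 = two_view_transform(image)   # first random view
    x2 = two_view_transform(image)   # second random view (independent randomness)
    z1 = encoder(x1.unsqueeze(0))    # representation of view 1
    z2 = encoder(x2.unsqueeze(0))    # representation of view 2
    return z1, z2                    # the contrastive loss pulls z1 and z2 together
```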
| Component | Purpose | Design Choices | Impact |
|---|---|---|---|
| Data Augmentation | Define semantic similarity | Crop, color, blur, rotation | Critical—determines what invariances are learned |
| Encoder | Extract representations | ResNet, ViT, CNN depth | Larger encoders generally better |
| Projection Head | Map to contrastive space | MLP depth and width | Improves performance significantly |
| Negative Samples | Prevent representation collapse | Batch size, memory bank | More negatives generally help |
| Temperature | Control similarity distribution | Typically 0.05-0.5 | Lower = harder negatives |
The InfoNCE (Information Noise-Contrastive Estimation) loss is the foundation of modern contrastive learning. It optimizes a lower bound on the mutual information between the two views.
Mathematical formulation:
Given a positive pair (z_i, z_j) and N-1 negative samples, the InfoNCE loss is:
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{N} \exp(\text{sim}(z_i, z_k) / \tau)}$$
where sim(·,·) is cosine similarity and τ is a temperature parameter.
This is essentially a softmax classification problem: identify the positive among all negatives.
Temperature τ controls the 'hardness' of the contrastive task. Low temperature (e.g., 0.07) makes the loss focus on hard negatives—samples that are similar but shouldn't be. High temperature (e.g., 0.5) treats all negatives more equally. Finding the right temperature is critical for performance.
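The short sketch below makes this concrete. The similarity values are made up for illustration; it simply compares how the softmax over the denominator distributes weight at two temperatures.

```python
import torch
import torch.nn.functional as F

# Illustrative similarities of one anchor to a positive, a hard negative,
# and an easy negative (numbers invented for this demonstration).
sims = torch.tensor([0.9, 0.8, 0.1])

for tau in (0.07, 0.5):
    weights = F.softmax(sims / tau, dim=0)
    print(f"tau={tau}: {weights.tolist()}")

# At tau=0.07, nearly all of the non-positive probability mass sits on the hard
# negative and the easy negative is effectively ignored, so gradients focus on
# the hard case. At tau=0.5, the easy negative also contributes noticeably.
```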
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InfoNCELoss(nn.Module):
    """
    InfoNCE (Noise Contrastive Estimation) loss for contrastive learning.

    Implements the NT-Xent (Normalized Temperature-scaled Cross Entropy) variant.
    """

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(
        self,
        z_i: torch.Tensor,
        z_j: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute InfoNCE loss for a batch of positive pairs.

        Args:
            z_i: First view embeddings [batch_size, embedding_dim]
            z_j: Second view embeddings [batch_size, embedding_dim]

        Returns:
            Scalar loss value
        """
        batch_size = z_i.size(0)

        # Normalize embeddings to unit sphere
        z_i = F.normalize(z_i, dim=1)
        z_j = F.normalize(z_j, dim=1)

        # Concatenate both views: [2*batch_size, embedding_dim]
        representations = torch.cat([z_i, z_j], dim=0)

        # Compute similarity matrix: [2*batch_size, 2*batch_size]
        similarity_matrix = torch.mm(representations, representations.t())

        # Scale by temperature
        similarity_matrix = similarity_matrix / self.temperature

        # Create labels: positive pairs are at positions
        # (i, batch_size + i) and (batch_size + i, i)
        labels = torch.cat([
            torch.arange(batch_size, 2 * batch_size),
            torch.arange(batch_size)
        ], dim=0).to(z_i.device)

        # Mask out self-similarities (diagonal)
        mask = torch.eye(2 * batch_size, dtype=torch.bool, device=z_i.device)
        similarity_matrix = similarity_matrix.masked_fill(mask, float('-inf'))

        # Compute cross-entropy loss
        loss = F.cross_entropy(similarity_matrix, labels)

        return loss
```

SimCLR (Simple Framework for Contrastive Learning of Visual Representations) by Chen et al. (2020) demonstrated that surprisingly good results can be achieved with a simple approach, provided certain design choices are made.
Key components of SimCLR:
- Strong, composed data augmentations (random crop, horizontal flip, color jitter, grayscale, Gaussian blur); the combination of cropping and color distortion is especially important.
- A standard encoder (ResNet-50 in the original paper) whose output representation is used for downstream tasks.
- A nonlinear projection head (2-layer MLP) that maps representations into the space where the contrastive loss is computed.
- The NT-Xent (InfoNCE) loss computed over very large batches, with the other in-batch samples serving as negatives.
A puzzling finding: the projection head improves contrastive loss but representations extracted BEFORE the projection head transfer better to downstream tasks. The hypothesis is that the projection head discards information useful for downstream tasks but harmful to the contrastive objective.
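Below is a minimal sketch of one SimCLR-style training step that reuses the `InfoNCELoss` defined above. The ResNet-50 backbone and 2-layer MLP projection head follow the paper's recipe, but `SimCLRModel`, `training_step`, and the other names are our own, and the data-loading and optimizer details are omitted.

```python
import torch
import torch.nn as nn
import torchvision


class SimCLRModel(nn.Module):
    """Encoder plus projection head, in the spirit of SimCLR."""

    def __init__(self, projection_dim: int = 128):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
        backbone.fc = nn.Identity()             # keep the pooled features
        self.encoder = backbone
        self.projector = nn.Sequential(         # 2-layer MLP projection head
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, projection_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)      # representation h: transferred to downstream tasks
        z = self.projector(h)    # projection z: used only for the contrastive loss
        return z


def training_step(model, criterion, x1, x2):
    """One contrastive step on a batch of two augmented views x1, x2."""
    z1 = model(x1)
    z2 = model(x2)
    return criterion(z1, z2)     # InfoNCE / NT-Xent loss defined above
```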
MoCo (Momentum Contrast) by He et al. (2020) addresses a practical limitation of SimCLR: the need for very large batch sizes. MoCo introduces two key innovations:
- A queue (memory bank) of encoded keys from previous batches, which serves as a large pool of negatives decoupled from the batch size.
- A momentum-updated key encoder, which keeps the keys in the queue consistent over time.
This allows using many more negatives (e.g., 65536) without the memory cost of large batches.
The momentum update mechanism works as follows. MoCo maintains two encoders: a query encoder with parameters θ_q, updated by backpropagation, and a key encoder with parameters θ_k, updated as an exponential moving average of the query encoder: θ_k ← m·θ_k + (1 − m)·θ_q.
The momentum coefficient m (typically 0.999) ensures the key encoder changes slowly. This consistency is crucial—if keys in the queue came from very different encoders, the contrastive task would be inconsistent.
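For concreteness, here is a minimal sketch of the momentum update and queue maintenance, assuming the queue is a plain [queue_size, dim] tensor; the function names `momentum_update` and `update_queue` are illustrative, not taken from the official MoCo code.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def momentum_update(query_encoder: nn.Module, key_encoder: nn.Module, m: float = 0.999):
    """key params <- m * key params + (1 - m) * query params."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)


@torch.no_grad()
def update_queue(queue: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
    """Enqueue the newest keys and drop the oldest ones (FIFO).

    queue: [queue_size, dim] tensor of past key embeddings (the negatives)
    keys:  [batch_size, dim] key embeddings from the current batch
    """
    return torch.cat([keys, queue], dim=0)[: queue.size(0)]
```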
MoCo v2 combined the best of both worlds: MoCo's queue mechanism with SimCLR's augmentations and projection head.
| Aspect | SimCLR | MoCo |
|---|---|---|
| Negative source | Current batch | Memory queue |
| Batch size requirement | Very large (4096+) | Standard (256) |
| Memory usage | High | Moderate |
| Number of negatives | Limited by batch | Large (65536) |
| Encoder updates | Single encoder | Momentum encoder for keys |
| Implementation complexity | Simple | Moderate |
Understanding why contrastive learning produces good representations requires examining multiple theoretical perspectives. From an information-theoretic view, minimizing InfoNCE maximizes a lower bound on the mutual information between the two views, encouraging the encoder to keep the information shared across augmentations. From a geometric view, the contrastive loss can be understood as jointly optimizing alignment (positive pairs map to nearby points) and uniformity (representations spread out over the unit hypersphere) (Wang & Isola, 2020).
Negatives are essential in standard contrastive learning—without them, the trivial solution of mapping all inputs to the same point minimizes the loss. Negatives provide the 'push apart' force that prevents collapse. This is why methods like BYOL and SimSiam that eliminate negatives are so surprising.
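The alignment/uniformity view makes this concrete: a collapsed encoder achieves perfect alignment but the worst possible uniformity. Below is a small sketch using the standard definitions with their usual settings (α = 2, t = 2); the helper names are ours.

```python
import torch


def alignment(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between positive pairs (lower is better)."""
    return (z1 - z2).pow(2).sum(dim=1).mean()


def uniformity(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Log of the average Gaussian potential between all pairs (lower is better)."""
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()


# Collapsed embeddings: every input maps to the same unit vector.
collapsed = torch.nn.functional.normalize(torch.ones(128, 64), dim=1)

# Alignment is perfect (0), but uniformity sits at its maximum (0 rather than a
# negative value), which is exactly the failure mode that negatives prevent.
print(alignment(collapsed, collapsed), uniformity(collapsed))
```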
You now understand contrastive learning—the dominant paradigm in self-supervised learning. Next, we'll explore non-contrastive methods that achieve similar results without explicit negative samples.