SimCLR (a Simple framework for Contrastive Learning of visual Representations) represents a watershed moment in self-supervised learning. Published by Chen et al. (Google Research, 2020), SimCLR demonstrated that, with the right combination of simple components, contrastive learning could match or exceed supervised pre-training on ImageNet—without using any labels.
What made SimCLR revolutionary wasn't novel mathematics or complex architectures. Instead, it was the systematic study of what makes contrastive learning work and the discovery that simplicity, scale, and careful design choices could achieve breakthrough results. SimCLR became the reference implementation against which all subsequent contrastive methods are measured.
By the end of this page, you will understand: (1) SimCLR's complete architecture and training procedure, (2) Why each component (augmentations, projection head, batch size) matters, (3) The systematic ablation studies that revealed key insights, (4) Implementation details for reproducing results, and (5) SimCLR v2 improvements and their significance.
SimCLR's framework is elegantly simple, consisting of four major components working in concert: a stochastic data augmentation module that produces two views of each image, a base encoder $f(\cdot)$, a small MLP projection head $g(\cdot)$, and a contrastive loss (NT-Xent) applied to the projected embeddings.
For each image $x$ in a mini-batch of $N$ images:
$$x \xrightarrow{\text{aug}} (\tilde{x}_i, \tilde{x}_j) \xrightarrow{f(\cdot)} (h_i, h_j) \xrightarrow{g(\cdot)} (z_i, z_j) \xrightarrow{\mathcal{L}} \text{loss}$$
Where:
- $\tilde{x}_i, \tilde{x}_j$ are two independently augmented views of the same image $x$ (the positive pair)
- $f(\cdot)$ is the base encoder producing representations $h_i, h_j$
- $g(\cdot)$ is the projection head producing embeddings $z_i, z_j$
- $\mathcal{L}$ is the NT-Xent contrastive loss, computed on the projections $z$
The projection head is discarded after training—downstream tasks use h, not z. This seems wasteful but is crucial: the projection head allows the contrastive loss to discard information not useful for instance discrimination (like color) while preserving it in h for downstream tasks.
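In practice this means downstream evaluation attaches a new head to the frozen encoder output $h$. Below is a minimal linear-evaluation sketch, assuming a pretrained encoder (projection head already removed) and a labeled dataloader; the function name and hyperparameters are illustrative, not the paper's exact protocol.

```python
import torch
import torch.nn as nn

# Hypothetical objects: `encoder` is a pretrained SimCLR encoder f(.) with its
# projection head removed; `train_loader` yields (images, labels).
def linear_eval(encoder, train_loader, feature_dim=2048, num_classes=1000, epochs=90):
    """Train a linear classifier on frozen representations h = f(x)."""
    encoder.eval()                      # freeze the encoder
    for p in encoder.parameters():
        p.requires_grad = False

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():       # h is computed without gradients
                h = encoder(images)     # (batch, feature_dim)
            loss = criterion(classifier(h), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```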
SimCLR's ablation studies revealed a surprising finding: data augmentation is the single most important factor in contrastive learning performance. The choice and composition of augmentations dramatically affects representation quality.
The default augmentation sequence applies transformations in order:
| Order | Augmentation | Parameters | Probability |
|---|---|---|---|
| 1 | Random Resized Crop | Scale: [0.08, 1.0], Ratio: [3/4, 4/3] | 1.0 |
| 2 | Random Horizontal Flip | — | 0.5 |
| 3 | Color Jittering | Brightness: 0.8, Contrast: 0.8, Saturation: 0.8, Hue: 0.2 | 0.8 |
| 4 | Random Grayscale | — | 0.2 |
| 5 | Gaussian Blur | Kernel: 10% of image size | 0.5 |
Random Resized Crop is perhaps the most important single augmentation. By randomly cropping different regions of an image at different scales, it forces the network to recognize objects from partial, differently scaled views and to relate local patches to the global composition of the image, rather than relying on fixed pixel layouts.
Color Distortion (jittering + grayscale) prevents the network from using color histograms as a shortcut. Without color augmentation, the network can distinguish images based on color alone, never learning semantic features.
SimCLR's ablation study systematically tested augmentation combinations:
| Augmentation Combination | Top-1 Accuracy |
|---|---|
| Random Crop only | 63.2% |
| Random Crop + Color | 74.3% |
| Random Crop + Color + Blur | 74.9% |
| Full pipeline | 76.5% |
| Supervised baseline | 76.5% |
```python
import torchvision.transforms as T
from PIL import ImageFilter
import random


class GaussianBlur:
    """Gaussian blur augmentation as used in SimCLR."""

    def __init__(self, sigma=[0.1, 2.0]):
        self.sigma = sigma

    def __call__(self, x):
        sigma = random.uniform(self.sigma[0], self.sigma[1])
        x = x.filter(ImageFilter.GaussianBlur(radius=sigma))
        return x


def get_simclr_augmentation(image_size=224, s=1.0):
    """
    Returns SimCLR augmentation pipeline.

    Args:
        image_size: Target image size after augmentation
        s: Color jitter strength multiplier

    Returns:
        Composition of augmentations
    """
    color_jitter = T.ColorJitter(
        brightness=0.8 * s,
        contrast=0.8 * s,
        saturation=0.8 * s,
        hue=0.2 * s
    )

    augmentation = T.Compose([
        T.RandomResizedCrop(
            size=image_size,
            scale=(0.08, 1.0),
            interpolation=T.InterpolationMode.BICUBIC
        ),
        T.RandomHorizontalFlip(p=0.5),
        T.RandomApply([color_jitter], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([GaussianBlur(sigma=[0.1, 2.0])], p=0.5),
        T.ToTensor(),
        T.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        ),
    ])
    return augmentation


class SimCLRTransform:
    """
    Creates two correlated views of the same image.
    """

    def __init__(self, image_size=224, s=1.0):
        self.transform = get_simclr_augmentation(image_size, s)

    def __call__(self, x):
        return self.transform(x), self.transform(x)
```

SimCLR's augmentations are tuned for natural images. For medical imaging, satellite imagery, or other domains, you must carefully design augmentations that preserve task-relevant information while creating meaningful variation. Random crop might destroy crucial spatial relationships in X-rays; color jitter might be harmful for histopathology.
SimCLR is encoder-agnostic—any architecture can serve as the base encoder. The original paper primarily used ResNet variants:
| Encoder | Parameters | Linear Eval | Fine-tuned |
|---|---|---|---|
| ResNet-50 | 23M | 69.3% | 76.5% |
| ResNet-50 (2x) | 94M | 74.2% | — |
| ResNet-50 (4x) | 375M | 76.5% | — |
| ResNet-152 | 58M | 71.7% | — |
The encoder outputs a representation $h = f(x) \in \mathbb{R}^{d}$ where $d = 2048$ for ResNet-50.
Perhaps SimCLR's most important finding: a simple MLP projection head dramatically improves representation quality.
The projection head $g(\cdot)$ maps the encoder representation to a lower-dimensional space where contrastive loss is applied:
$$z = g(h) = W^{(2)} \sigma(W^{(1)} h)$$
where $\sigma$ is ReLU and $z \in \mathbb{R}^{128}$.
Why does this help?
The contrastive loss removes information that distinguishes instances but isn't useful for downstream tasks. By applying the loss to $z$ instead of $h$, this information loss is "absorbed" by the projection head, while the representation $h$, one layer upstream, retains more of that information for downstream tasks.
Ablation on projection head depth:
| Projection Head | Output Dim | Linear Eval Accuracy |
|---|---|---|
| None (use h directly) | 2048 | 64.7% |
| Linear | 128 | 66.3% |
| MLP (1 hidden layer) | 128 | 69.3% |
| MLP (2 hidden layers) | 128 | 69.0% |
| MLP (3 hidden layers) | 128 | 68.4% |
```python
import torch
import torch.nn as nn
import torchvision.models as models


class SimCLR(nn.Module):
    """
    SimCLR model with ResNet encoder and MLP projection head.
    """

    def __init__(
        self,
        base_encoder='resnet50',
        projection_dim=128,
        hidden_dim=2048,
        pretrained=False
    ):
        super().__init__()

        # Base encoder
        if base_encoder == 'resnet50':
            self.encoder = models.resnet50(pretrained=pretrained)
            self.encoder_dim = self.encoder.fc.in_features  # 2048
            self.encoder.fc = nn.Identity()  # Remove classification head
        else:
            raise ValueError(f"Unknown encoder: {base_encoder}")

        # Projection head: h -> z
        self.projection_head = nn.Sequential(
            nn.Linear(self.encoder_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, projection_dim)
        )

    def forward(self, x):
        """
        Forward pass returning both representation and projection.

        Args:
            x: Input images (batch, C, H, W)

        Returns:
            h: Encoder representation (batch, encoder_dim) - for downstream
            z: Projection (batch, projection_dim) - for contrastive loss
        """
        h = self.encoder(x)
        z = self.projection_head(h)
        return h, z

    def encode(self, x):
        """Get representation only (for downstream tasks)."""
        return self.encoder(x)
```

SimCLR uses NT-Xent (Normalized Temperature-scaled Cross Entropy), a specific instantiation of InfoNCE for in-batch negatives.
Given a mini-batch of $N$ images, we generate $2N$ augmented views. For a positive pair $(i, j)$:
$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)}$$
where:
- $\text{sim}(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$ is cosine similarity between the $\ell_2$-normalized projections
- $\tau$ is the temperature hyperparameter
- $\mathbf{1}_{[k \neq i]}$ is an indicator that excludes the anchor's similarity with itself from the denominator
The total loss averages over all positive pairs:
$$\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} [\ell_{2k-1, 2k} + \ell_{2k, 2k-1}]$$
What makes SimCLR distinctive is using other samples in the same batch as negatives. For a batch of 4096 images (8192 views), each positive pair has 8190 negatives.
This is computationally efficient—no separate negative sampling required—but creates a dependency between batch size and number of negatives.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NTXentLoss(nn.Module):
    """
    Normalized Temperature-scaled Cross Entropy Loss for SimCLR.
    """

    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, z_i, z_j):
        """
        Compute NT-Xent loss.

        Args:
            z_i: Projections from first view (batch_size, dim)
            z_j: Projections from second view (batch_size, dim)

        Returns:
            Scalar loss
        """
        batch_size = z_i.size(0)

        # Normalize
        z_i = F.normalize(z_i, dim=1)
        z_j = F.normalize(z_j, dim=1)

        # Concatenate views: [z_1, z_2, ..., z_N, z'_1, z'_2, ..., z'_N]
        z = torch.cat([z_i, z_j], dim=0)  # (2*batch_size, dim)

        # Compute similarity matrix: (2N, 2N)
        sim_matrix = torch.mm(z, z.t()) / self.temperature

        # Create mask for positive pairs
        # For sample i, positive is at i+N (or i-N if i >= N)
        pos_mask = torch.zeros(2 * batch_size, 2 * batch_size, device=z.device)
        pos_mask[:batch_size, batch_size:] = torch.eye(batch_size, device=z.device)
        pos_mask[batch_size:, :batch_size] = torch.eye(batch_size, device=z.device)

        # Mask out self-similarities
        self_mask = torch.eye(2 * batch_size, device=z.device).bool()
        sim_matrix = sim_matrix.masked_fill(self_mask, -1e9)

        # Get positive similarities
        pos_sim = (sim_matrix * pos_mask).sum(dim=1)  # (2N,)

        # Log-sum-exp over all (including negatives)
        logsumexp = torch.logsumexp(sim_matrix, dim=1)  # (2N,)

        # Loss: -log(exp(pos) / sum(exp))
        loss = -pos_sim + logsumexp
        return loss.mean()
```

SimCLR found τ=0.07 optimal for ImageNet. But this depends on embedding normalization—with unit-norm vectors, cosine similarities range [-1, 1]. At τ=0.07, a similarity of 0.5 becomes exp(0.5/0.07) ≈ 1300. Small temperature creates very sharp distributions.
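Putting the pieces together, here is a minimal single-GPU pre-training loop wiring up the `SimCLRTransform`, `SimCLR`, and `NTXentLoss` classes defined above. The dataset path, batch size, and plain-SGD optimizer are illustrative placeholders; the paper's full recipe uses much larger batches, LARS, and cosine learning-rate decay.

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Illustrative settings, not the paper's exact configuration.
device = "cuda" if torch.cuda.is_available() else "cpu"

dataset = ImageFolder("/path/to/unlabeled/images",          # placeholder path
                      transform=SimCLRTransform(image_size=224))
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=8, drop_last=True)

model = SimCLR(base_encoder='resnet50', projection_dim=128).to(device)
criterion = NTXentLoss(temperature=0.07)
optimizer = torch.optim.SGD(model.parameters(), lr=0.3,
                            momentum=0.9, weight_decay=1e-6)

for epoch in range(100):
    for (view_1, view_2), _ in loader:       # class labels are ignored
        view_1, view_2 = view_1.to(device), view_2.to(device)

        _, z_i = model(view_1)               # projections for view 1
        _, z_j = model(view_2)               # projections for view 2
        loss = criterion(z_i, z_j)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```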
SimCLR's most controversial requirement is its need for very large batch sizes—typically 4096 to 8192 samples, requiring specialized hardware.
More negatives — With 4096 samples, each positive has 8190 negatives. This provides a much better approximation of the full softmax over the dataset and a stronger, lower-variance learning signal per update.
Harder negatives — Larger batches increase the probability of "hard negatives"—samples that are semantically similar but not the positive pair. These provide the strongest learning signal.
| Batch Size | Epochs | Top-1 Accuracy | Hardware Required |
|---|---|---|---|
| 256 | 800 | 61.9% | 1 GPU |
| 512 | 800 | 64.2% | 2 GPUs |
| 1024 | 800 | 66.0% | 4 GPUs |
| 2048 | 800 | 68.3% | 8 GPUs |
| 4096 | 800 | 69.3% | 16 GPUs |
| 8192 | 800 | 70.2% | 32 GPUs |
Distributed Training: the global batch is split across many GPUs or TPU cores, with embeddings and batch-norm statistics synchronized across devices so the similarity matrix spans the full batch.

Gradient Accumulation: accumulating gradients over several small batches mimics a large batch for the optimizer, but each NT-Xent computation still only sees the negatives in its own small batch, so it is not a full substitute for a truly large batch.

LARS Optimizer: layer-wise adaptive rate scaling keeps optimization stable at very large batch sizes, where plain SGD with a linearly scaled learning rate tends to diverge; SimCLR pairs it with learning-rate warmup and cosine decay (see the sketch below).
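A hedged sketch of the large-batch schedule: the linear learning-rate scaling rule and warmup-plus-cosine decay follow the paper's description, but core PyTorch has no LARS implementation, so plain SGD stands in here and a real run would substitute a LARS optimizer.

```python
import math
import torch

def simclr_optimizer(model, batch_size, epochs, steps_per_epoch, warmup_epochs=10):
    """Large-batch schedule: linearly scaled LR, linear warmup, cosine decay.

    SGD stands in for LARS, which core PyTorch does not provide.
    """
    base_lr = 0.3 * batch_size / 256        # linear scaling rule
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-6)

    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:              # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```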
SimCLR with batch size 4096 on ImageNet requires 32 TPU v3 cores for ~3 days. This was a significant limitation that motivated MoCo's memory bank approach. If you don't have massive compute, consider MoCo or other memory-efficient alternatives.
SimCLR v2 (Chen et al., 2020) extended the framework with three key improvements:
Self-supervised learning benefits more from model size than supervised learning. The gap between self-supervised and supervised narrows (and eventually inverts) as models grow:
| Model | Self-Supervised | Supervised | Gap |
|---|---|---|---|
| ResNet-50 | 69.3% | 76.5% | -7.2% |
| ResNet-50 (2x) | 74.2% | 77.8% | -3.6% |
| ResNet-152 (3x) | 79.0% | 78.3% | +0.7% |
SimCLR v2 uses a 3-layer projection head with batch normalization:
$$z = \text{BN}(W_3 \cdot \text{ReLU}(\text{BN}(W_2 \cdot \text{ReLU}(\text{BN}(W_1 \cdot h)))))$$
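A minimal PyTorch sketch of this deeper head, translating the formula directly; the dimensions follow the ResNet-50 setup used earlier, and this is an illustration rather than the official implementation.

```python
import torch.nn as nn

def simclr_v2_projection_head(encoder_dim=2048, hidden_dim=2048, projection_dim=128):
    """3-layer MLP projection head with batch normalization, as in SimCLR v2."""
    return nn.Sequential(
        nn.Linear(encoder_dim, hidden_dim, bias=False),   # W1
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim, bias=False),    # W2
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, projection_dim, bias=False),  # W3
        nn.BatchNorm1d(projection_dim),
    )
```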
SimCLR v2 introduced a three-stage recipe for semi-supervised learning:
1. Pretrain a large encoder with SimCLR-style contrastive learning on unlabeled data.
2. Fine-tune the pretrained network on the small labeled fraction (e.g., 1% or 10% of ImageNet labels).
3. Distill the fine-tuned model into a (possibly smaller) student network using unlabeled data again, as sketched below.
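For the distillation stage, here is a minimal sketch of the teacher-student objective on unlabeled images; the function name, temperature default, and usage comments are illustrative assumptions rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau=1.0):
    """Cross-entropy between temperature-scaled teacher and student predictions."""
    teacher_probs = F.softmax(teacher_logits / tau, dim=1)
    student_log_probs = F.log_softmax(student_logits / tau, dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()

# Usage on an unlabeled batch (teacher frozen, student trained):
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits)
```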
SimCLR's contribution was as much methodological as algorithmic. By systematically ablating each component, it revealed the fundamental principles of contrastive learning success.
You now understand SimCLR's complete framework—from augmentations through projection heads to NT-Xent loss. SimCLR's main limitation is its batch size requirement. Next, we'll explore MoCo, which achieves comparable results with standard batch sizes through a clever momentum-based approach.