SimCLR (a Simple framework for Contrastive Learning of visual Representations) represents a watershed moment in self-supervised learning. Published by Chen et al. (Google Research, 2020), SimCLR demonstrated that, with the right combination of simple components, contrastive learning could match or exceed supervised pre-training on ImageNet—without using any labels.
What made SimCLR revolutionary wasn't novel mathematics or complex architectures. Instead, it was the systematic study of what makes contrastive learning work and the discovery that simplicity, scale, and careful design choices could achieve breakthrough results. SimCLR became the reference implementation against which all subsequent contrastive methods are measured.
By the end of this page, you will understand: (1) SimCLR's complete architecture and training procedure, (2) Why each component (augmentations, projection head, batch size) matters, (3) The systematic ablation studies that revealed key insights, (4) Implementation details for reproducing results, and (5) SimCLR v2 improvements and their significance.
SimCLR's framework is elegantly simple, consisting of four major components working in concert: a stochastic data augmentation module that produces two views of each image, a base encoder $f(\cdot)$, a small MLP projection head $g(\cdot)$, and a contrastive loss (NT-Xent) applied to the projected embeddings.
For each image $x$ in a mini-batch of $N$ images:
$$x \xrightarrow{\text{aug}} (\tilde{x}_i, \tilde{x}_j) \xrightarrow{f(\cdot)} (h_i, h_j) \xrightarrow{g(\cdot)} (z_i, z_j) \xrightarrow{\mathcal{L}} \text{loss}$$
Where:
- $\tilde{x}_i, \tilde{x}_j$ are two independently augmented views of the same image $x$ (the positive pair)
- $f(\cdot)$ is the base encoder producing representations $h_i, h_j$
- $g(\cdot)$ is the projection head producing embeddings $z_i, z_j$
- $\mathcal{L}$ is the NT-Xent contrastive loss, computed on the projections $z$
The projection head is discarded after training—downstream tasks use h, not z. This seems wasteful but is crucial: the projection head allows the contrastive loss to discard information not useful for instance discrimination (like color) while preserving it in h for downstream tasks.
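In practice this means downstream evaluation attaches a new head to the frozen encoder output $h$. Below is a minimal linear-evaluation sketch, assuming a pretrained encoder (projection head already removed) and a labeled dataloader; the function name and hyperparameters are illustrative, not the paper's exact protocol.

```python
import torch
import torch.nn as nn

# Hypothetical objects: `encoder` is a pretrained SimCLR encoder f(.) with its
# projection head removed; `train_loader` yields (images, labels).
def linear_eval(encoder, train_loader, feature_dim=2048, num_classes=1000, epochs=90):
    """Train a linear classifier on frozen representations h = f(x)."""
    encoder.eval()                      # freeze the encoder
    for p in encoder.parameters():
        p.requires_grad = False

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():       # h is computed without gradients
                h = encoder(images)     # (batch, feature_dim)
            loss = criterion(classifier(h), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```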
SimCLR's ablation studies revealed a surprising finding: data augmentation is the single most important factor in contrastive learning performance. The choice and composition of augmentations dramatically affects representation quality.
The default augmentation sequence applies transformations in order:
| Order | Augmentation | Parameters | Probability |
|---|---|---|---|
| 1 | Random Resized Crop | Scale: [0.08, 1.0], Ratio: [3/4, 4/3] | 1.0 |
| 2 | Random Horizontal Flip | — | 0.5 |
| 3 | Color Jittering | Brightness: 0.8, Contrast: 0.8, Saturation: 0.8, Hue: 0.2 | 0.8 |
| 4 | Random Grayscale | — | 0.2 |
| 5 | Gaussian Blur | Kernel: 10% of image size | 0.5 |
Random Resized Crop is perhaps the most important single augmentation. By randomly cropping different regions of an image at different scales, it forces the network to recognize objects from partial, differently scaled views and to relate local patches to the global composition of the image, rather than relying on fixed pixel layouts.
Color Distortion (jittering + grayscale) prevents the network from using color histograms as a shortcut. Without color augmentation, the network can distinguish images based on color alone, never learning semantic features.
SimCLR's ablation study systematically tested augmentation combinations:
| Augmentation Combination | Top-1 Accuracy |
|---|---|
| Random Crop only | 63.2% |
| Random Crop + Color | 74.3% |
| Random Crop + Color + Blur | 74.9% |
| Full pipeline | 76.5% |
| Supervised baseline | 76.5% |
```python
import torchvision.transforms as T
from PIL import ImageFilter
import random


class GaussianBlur:
    """Gaussian blur augmentation as used in SimCLR."""

    def __init__(self, sigma=[0.1, 2.0]):
        self.sigma = sigma

    def __call__(self, x):
        sigma = random.uniform(self.sigma[0], self.sigma[1])
        x = x.filter(ImageFilter.GaussianBlur(radius=sigma))
        return x


def get_simclr_augmentation(image_size=224, s=1.0):
    """
    Returns SimCLR augmentation pipeline.

    Args:
        image_size: Target image size after augmentation
        s: Color jitter strength multiplier

    Returns:
        Composition of augmentations
    """
    color_jitter = T.ColorJitter(
        brightness=0.8 * s,
        contrast=0.8 * s,
        saturation=0.8 * s,
        hue=0.2 * s
    )

    augmentation = T.Compose([
        T.RandomResizedCrop(
            size=image_size,
            scale=(0.08, 1.0),
            interpolation=T.InterpolationMode.BICUBIC
        ),
        T.RandomHorizontalFlip(p=0.5),
        T.RandomApply([color_jitter], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([GaussianBlur(sigma=[0.1, 2.0])], p=0.5),
        T.ToTensor(),
        T.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        ),
    ])
    return augmentation


class SimCLRTransform:
    """
    Creates two correlated views of the same image.
    """

    def __init__(self, image_size=224, s=1.0):
        self.transform = get_simclr_augmentation(image_size, s)

    def __call__(self, x):
        return self.transform(x), self.transform(x)
```

SimCLR's augmentations are tuned for natural images. For medical imaging, satellite imagery, or other domains, you must carefully design augmentations that preserve task-relevant information while creating meaningful variation. Random crop might destroy crucial spatial relationships in X-rays; color jitter might be harmful for histopathology.
SimCLR is encoder-agnostic—any architecture can serve as the base encoder. The original paper primarily used ResNet variants:
| Encoder | Parameters | Linear Eval | Fine-tuned |
|---|---|---|---|
| ResNet-50 | 23M | 69.3% | 76.5% |
| ResNet-50 (2x) | 94M | 74.2% | — |
| ResNet-50 (4x) | 375M | 76.5% | — |
| ResNet-152 | 58M | 71.7% | — |
The encoder outputs a representation $h = f(x) \in \mathbb{R}^{d}$ where $d = 2048$ for ResNet-50.
Perhaps SimCLR's most important finding: a simple MLP projection head dramatically improves representation quality.
The projection head $g(\cdot)$ maps the encoder representation to a lower-dimensional space where contrastive loss is applied:
$$z = g(h) = W^{(2)} \sigma(W^{(1)} h)$$
where $\sigma$ is ReLU and $z \in \mathbb{R}^{128}$.
Why does this help?
The contrastive loss removes information that distinguishes instances but isn't useful for downstream tasks. By applying the loss to $z$ instead of $h$, this information loss is "absorbed" by the projection head, while the representation $h$, one layer upstream, retains more of that information for downstream tasks.
Ablation on projection head depth:
| Projection Head | Output Dim | Linear Eval Accuracy |
|---|---|---|
| None (use h directly) | 2048 | 64.7% |
| Linear | 128 | 66.3% |
| MLP (1 hidden layer) | 128 | 69.3% |
| MLP (2 hidden layers) | 128 | 69.0% |
| MLP (3 hidden layers) | 128 | 68.4% |
```python
import torch
import torch.nn as nn
import torchvision.models as models


class SimCLR(nn.Module):
    """
    SimCLR model with ResNet encoder and MLP projection head.
    """

    def __init__(
        self,
        base_encoder='resnet50',
        projection_dim=128,
        hidden_dim=2048,
        pretrained=False
    ):
        super().__init__()

        # Base encoder
        if base_encoder == 'resnet50':
            self.encoder = models.resnet50(pretrained=pretrained)
            self.encoder_dim = self.encoder.fc.in_features  # 2048
            self.encoder.fc = nn.Identity()  # Remove classification head
        else:
            raise ValueError(f"Unknown encoder: {base_encoder}")

        # Projection head: h -> z
        self.projection_head = nn.Sequential(
            nn.Linear(self.encoder_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, projection_dim)
        )

    def forward(self, x):
        """
        Forward pass returning both representation and projection.

        Args:
            x: Input images (batch, C, H, W)

        Returns:
            h: Encoder representation (batch, encoder_dim) - for downstream
            z: Projection (batch, projection_dim) - for contrastive loss
        """
        h = self.encoder(x)
        z = self.projection_head(h)
        return h, z

    def encode(self, x):
        """Get representation only (for downstream tasks)."""
        return self.encoder(x)
```

SimCLR uses NT-Xent (Normalized Temperature-scaled Cross Entropy), a specific instantiation of InfoNCE for in-batch negatives.
Given a mini-batch of $N$ images, we generate $2N$ augmented views. For a positive pair $(i, j)$:
$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k) / \tau)}$$
where:
- $\text{sim}(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$ is cosine similarity between the $\ell_2$-normalized projections
- $\tau$ is the temperature hyperparameter
- $\mathbf{1}_{[k \neq i]}$ is an indicator that excludes the anchor's similarity with itself from the denominator
The total loss averages over all positive pairs:
$$\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} [\ell_{2k-1, 2k} + \ell_{2k, 2k-1}]$$
What makes SimCLR distinctive is using other samples in the same batch as negatives. For a batch of 4096 images (8192 views), each positive pair has 8190 negatives.
This is computationally efficient—no separate negative sampling required—but creates a dependency between batch size and number of negatives.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NTXentLoss(nn.Module):
    """
    Normalized Temperature-scaled Cross Entropy Loss for SimCLR.
    """

    def __init__(self, temperature=0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, z_i, z_j):
        """
        Compute NT-Xent loss.

        Args:
            z_i: Projections from first view (batch_size, dim)
            z_j: Projections from second view (batch_size, dim)

        Returns:
            Scalar loss
        """
        batch_size = z_i.size(0)

        # Normalize
        z_i = F.normalize(z_i, dim=1)
        z_j = F.normalize(z_j, dim=1)

        # Concatenate views: [z_1, z_2, ..., z_N, z'_1, z'_2, ..., z'_N]
        z = torch.cat([z_i, z_j], dim=0)  # (2*batch_size, dim)

        # Compute similarity matrix: (2N, 2N)
        sim_matrix = torch.mm(z, z.t()) / self.temperature

        # Create mask for positive pairs
        # For sample i, positive is at i+N (or i-N if i >= N)
        pos_mask = torch.zeros(2 * batch_size, 2 * batch_size, device=z.device)
        pos_mask[:batch_size, batch_size:] = torch.eye(batch_size, device=z.device)
        pos_mask[batch_size:, :batch_size] = torch.eye(batch_size, device=z.device)

        # Mask out self-similarities
        self_mask = torch.eye(2 * batch_size, device=z.device).bool()
        sim_matrix = sim_matrix.masked_fill(self_mask, -1e9)

        # Get positive similarities
        pos_sim = (sim_matrix * pos_mask).sum(dim=1)  # (2N,)

        # Log-sum-exp over all (including negatives)
        logsumexp = torch.logsumexp(sim_matrix, dim=1)  # (2N,)

        # Loss: -log(exp(pos) / sum(exp))
        loss = -pos_sim + logsumexp
        return loss.mean()
```

SimCLR found τ=0.07 optimal for ImageNet. But this depends on embedding normalization—with unit-norm vectors, cosine similarities range [-1, 1]. At τ=0.07, a similarity of 0.5 becomes exp(0.5/0.07) ≈ 1300. Small temperature creates very sharp distributions.
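Putting the pieces together, here is a minimal single-GPU pre-training loop wiring up the `SimCLRTransform`, `SimCLR`, and `NTXentLoss` classes defined above. The dataset path, batch size, and plain-SGD optimizer are illustrative placeholders; the paper's full recipe uses much larger batches, LARS, and cosine learning-rate decay.

```python
import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Illustrative settings, not the paper's exact configuration.
device = "cuda" if torch.cuda.is_available() else "cpu"

dataset = ImageFolder("/path/to/unlabeled/images",          # placeholder path
                      transform=SimCLRTransform(image_size=224))
loader = DataLoader(dataset, batch_size=256, shuffle=True,
                    num_workers=8, drop_last=True)

model = SimCLR(base_encoder='resnet50', projection_dim=128).to(device)
criterion = NTXentLoss(temperature=0.07)
optimizer = torch.optim.SGD(model.parameters(), lr=0.3,
                            momentum=0.9, weight_decay=1e-6)

for epoch in range(100):
    for (view_1, view_2), _ in loader:       # class labels are ignored
        view_1, view_2 = view_1.to(device), view_2.to(device)

        _, z_i = model(view_1)               # projections for view 1
        _, z_j = model(view_2)               # projections for view 2
        loss = criterion(z_i, z_j)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```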
SimCLR's most controversial requirement is its need for very large batch sizes—typically 4096 to 8192 samples, requiring specialized hardware.
More negatives — With 4096 samples, each positive has 8190 negatives. This provides a much better approximation of the full softmax over the dataset and a stronger, lower-variance learning signal per update.
Harder negatives — Larger batches increase the probability of "hard negatives"—samples that are semantically similar but not the positive pair. These provide the strongest learning signal.
| Batch Size | Epochs | Top-1 Accuracy | Hardware Required |
|---|---|---|---|
| 256 | 800 | 61.9% | 1 GPU |
| 512 | 800 | 64.2% | 2 GPUs |
| 1024 | 800 | 66.0% | 4 GPUs |
| 2048 | 800 | 68.3% | 8 GPUs |
| 4096 | 800 | 69.3% | 16 GPUs |
| 8192 | 800 | 70.2% | 32 GPUs |
Distributed Training: the global batch is split across many GPUs or TPU cores, with embeddings and batch-norm statistics synchronized across devices so the similarity matrix spans the full batch.

Gradient Accumulation: accumulating gradients over several small batches mimics a large batch for the optimizer, but each NT-Xent computation still only sees the negatives in its own small batch, so it is not a full substitute for a truly large batch.

LARS Optimizer: layer-wise adaptive rate scaling keeps optimization stable at very large batch sizes, where plain SGD with a linearly scaled learning rate tends to diverge; SimCLR pairs it with learning-rate warmup and cosine decay (see the sketch below).
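A hedged sketch of the large-batch schedule: the linear learning-rate scaling rule and warmup-plus-cosine decay follow the paper's description, but core PyTorch has no LARS implementation, so plain SGD stands in here and a real run would substitute a LARS optimizer.

```python
import math
import torch

def simclr_optimizer(model, batch_size, epochs, steps_per_epoch, warmup_epochs=10):
    """Large-batch schedule: linearly scaled LR, linear warmup, cosine decay.

    SGD stands in for LARS, which core PyTorch does not provide.
    """
    base_lr = 0.3 * batch_size / 256        # linear scaling rule
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-6)

    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:              # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```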
SimCLR with batch size 4096 on ImageNet requires 32 TPU v3 cores for ~3 days. This was a significant limitation that motivated MoCo's memory bank approach. If you don't have massive compute, consider MoCo or other memory-efficient alternatives.
SimCLR v2 (Chen et al., 2020) extended the framework with three key improvements:
Self-supervised learning benefits more from model size than supervised learning. The gap between self-supervised and supervised narrows (and eventually inverts) as models grow:
| Model | Self-Supervised | Supervised | Gap |
|---|---|---|---|
| ResNet-50 | 69.3% | 76.5% | -7.2% |
| ResNet-50 (2x) | 74.2% | 77.8% | -3.6% |
| ResNet-152 (3x) | 79.0% | 78.3% | +0.7% |
SimCLR v2 uses a 3-layer projection head with batch normalization:
$$z = \text{BN}(W_3 \cdot \text{ReLU}(\text{BN}(W_2 \cdot \text{ReLU}(\text{BN}(W_1 \cdot h)))))$$
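A minimal PyTorch sketch of this deeper head, translating the formula directly; the dimensions follow the ResNet-50 setup used earlier, and this is an illustration rather than the official implementation.

```python
import torch.nn as nn

def simclr_v2_projection_head(encoder_dim=2048, hidden_dim=2048, projection_dim=128):
    """3-layer MLP projection head with batch normalization, as in SimCLR v2."""
    return nn.Sequential(
        nn.Linear(encoder_dim, hidden_dim, bias=False),   # W1
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, hidden_dim, bias=False),    # W2
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, projection_dim, bias=False),  # W3
        nn.BatchNorm1d(projection_dim),
    )
```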
SimCLR v2 introduced a three-stage recipe for semi-supervised learning:
1. Pretrain a large encoder with SimCLR-style contrastive learning on unlabeled data.
2. Fine-tune the pretrained network on the small labeled fraction (e.g., 1% or 10% of ImageNet labels).
3. Distill the fine-tuned model into a (possibly smaller) student network using unlabeled data again, as sketched below.
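For the distillation stage, here is a minimal sketch of the teacher-student objective on unlabeled images; the function name, temperature default, and usage comments are illustrative assumptions rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau=1.0):
    """Cross-entropy between temperature-scaled teacher and student predictions."""
    teacher_probs = F.softmax(teacher_logits / tau, dim=1)
    student_log_probs = F.log_softmax(student_logits / tau, dim=1)
    return -(teacher_probs * student_log_probs).sum(dim=1).mean()

# Usage on an unlabeled batch (teacher frozen, student trained):
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits)
```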
SimCLR's contribution was as much methodological as algorithmic. By systematically ablating each component, it revealed the fundamental principles of contrastive learning success.
You now understand SimCLR's complete framework—from augmentations through projection heads to NT-Xent loss. SimCLR's main limitation is its batch size requirement. Next, we'll explore MoCo, which achieves comparable results with standard batch sizes through a clever momentum-based approach.