MoCo (Momentum Contrast), developed by Facebook AI Research (He et al., 2020), addressed SimCLR's most significant limitation: the requirement for massive batch sizes. MoCo's insight was elegant—instead of drawing negatives from the current batch, maintain a queue of recent embeddings that serves as a large, consistent dictionary for contrastive learning.
This simple architectural change has profound implications. MoCo achieves comparable or superior results to SimCLR while training with standard 256-sample batches on 8 GPUs—democratizing self-supervised learning for labs without access to TPU pods or massive GPU clusters.
By the end of this page, you will understand: (1) The dictionary-as-queue paradigm, (2) Why momentum encoders are necessary for consistency, (3) The complete MoCo training algorithm, (4) The evolution from MoCo v1 through v3, and (5) How to implement MoCo for your own applications.
MoCo reframes contrastive learning as building and querying a dynamic dictionary. This perspective illuminates what makes contrastive learning work and why certain design choices matter.
In contrastive learning, we have an encoded query $q$ and a dictionary of encoded keys $\{k_0, k_1, k_2, \ldots\}$, exactly one of which, the positive key $k_+$, matches the query.
The contrastive loss asks: "Can we identify the matching key for a given query?"
$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\exp(q \cdot k_+ / \tau) + \sum_{k_-} \exp(q \cdot k_- / \tau)}$$
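To make the formula concrete, here is a minimal PyTorch sketch for a single query; the embedding dimension, number of negatives, and temperature below are illustrative choices, not values fixed by the method.

```python
import torch
import torch.nn.functional as F

# One query `q`, one positive key `k_pos`, and a stack of negative keys `k_negs`.
# All embeddings are L2-normalized, as in MoCo.
dim, num_negatives, tau = 128, 4096, 0.07
q = F.normalize(torch.randn(dim), dim=0)
k_pos = F.normalize(torch.randn(dim), dim=0)
k_negs = F.normalize(torch.randn(num_negatives, dim), dim=1)

logits = torch.cat([(q * k_pos).sum().view(1),   # q . k+
                    k_negs @ q]) / tau           # q . k- for every negative
loss = -F.log_softmax(logits, dim=0)[0]          # positive sits at index 0
```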
For effective contrastive learning, the dictionary should be:
Large — More keys provide more negatives, enabling the model to learn finer distinctions
Consistent — Keys should be encoded by the same or similar encoder; mixing old and new encoder versions creates inconsistent comparisons
This is where SimCLR and MoCo differ fundamentally.
The queue contains keys computed at different training steps. If we used the main encoder (which changes every step), old keys would be inconsistent with new ones. The solution: a momentum encoder that evolves slowly.
Let $\theta_q$ be the query encoder parameters (updated by backprop) and $\theta_k$ be the key encoder parameters. The key encoder is updated as:
$$\theta_k \leftarrow m \cdot \theta_k + (1 - m) \cdot \theta_q$$
where $m \in [0, 1)$ is the momentum coefficient, typically $m = 0.999$.
With $m = 0.999$, each update moves the key encoder only 0.1% of the way toward the current query encoder, so the key encoder changes very gradually even though the query encoder is updated by backprop at every step.
This slow evolution means keys computed 1000 steps apart are still reasonably consistent—they come from similar (though not identical) encoders.
The queue operates as a FIFO buffer: at each training step, the current batch of keys is enqueued and the oldest batch is dequeued, so the dictionary always holds the $K$ most recent keys and its size is decoupled from the batch size.
The momentum coefficient $m$ is crucial. Too low (e.g., $m = 0.9$) and keys become inconsistent quickly. Too high (e.g., $m = 0.9999$) and the key encoder lags too far behind, potentially encoding outdated features. $m = 0.999$ balances consistency with adaptability.
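A quick back-of-the-envelope check in plain Python makes this trade-off visible: after $N$ momentum updates, a key-encoder parameter retains roughly a fraction $m^N$ of its old value. The step counts below are illustrative.

```python
# Residual influence of an old key-encoder state after N momentum updates: m ** N
for m in (0.9, 0.999, 0.9999):
    print(m, [round(m ** n, 4) for n in (100, 1000, 10000)])
# 0.9    -> [0.0,    0.0,    0.0]     keys drift almost immediately
# 0.999  -> [0.9048, 0.3677, 0.0]     keys ~1000 steps old remain consistent
# 0.9999 -> [0.99,   0.9048, 0.3679]  the key encoder lags far behind
```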
Let's walk through MoCo's training procedure step by step.
Input: Dataset $\mathcal{D}$, queue size $K$, momentum $m$, temperature $\tau$
Initialize: the query encoder $\theta_q$, the key encoder $\theta_k \leftarrow \theta_q$, and a queue of $K$ random, L2-normalized embeddings.
```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class MoCo(nn.Module):
    """
    MoCo: Momentum Contrast for Unsupervised Representation Learning.
    """

    def __init__(
        self,
        encoder,
        dim=128,
        queue_size=65536,
        momentum=0.999,
        temperature=0.07
    ):
        super().__init__()
        self.queue_size = queue_size
        self.momentum = momentum
        self.temperature = temperature

        # Query encoder (updated by backprop), with an MLP projection head
        # replacing the classifier of a torchvision-style backbone.
        self.encoder_q = encoder
        hidden_dim = self.encoder_q.fc.in_features
        self.encoder_q.fc = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim)
        )

        # Key encoder (updated by momentum): starts as an exact copy
        self.encoder_k = copy.deepcopy(self.encoder_q)
        for param in self.encoder_k.parameters():
            param.requires_grad = False  # No gradients for momentum encoder

        # Initialize queue with random, L2-normalized embeddings
        self.register_buffer("queue", torch.randn(dim, queue_size))
        self.queue = F.normalize(self.queue, dim=0)
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update_key_encoder(self):
        """Momentum update of the key encoder."""
        for param_q, param_k in zip(
            self.encoder_q.parameters(), self.encoder_k.parameters()
        ):
            param_k.data = (
                self.momentum * param_k.data
                + (1 - self.momentum) * param_q.data
            )

    @torch.no_grad()
    def _dequeue_and_enqueue(self, keys):
        """Update the queue with new keys (FIFO)."""
        batch_size = keys.size(0)
        ptr = int(self.queue_ptr)

        # Replace the oldest keys
        if ptr + batch_size <= self.queue_size:
            self.queue[:, ptr:ptr + batch_size] = keys.T
        else:
            # Wrap around the end of the queue
            remaining = self.queue_size - ptr
            self.queue[:, ptr:] = keys[:remaining].T
            self.queue[:, :batch_size - remaining] = keys[remaining:].T

        # Move the pointer
        self.queue_ptr[0] = (ptr + batch_size) % self.queue_size

    def forward(self, x_q, x_k):
        """
        Forward pass for MoCo training.

        Args:
            x_q: Query images (first augmented view)
            x_k: Key images (second augmented view)

        Returns:
            logits: Similarity logits (batch, 1 + queue_size)
            labels: Ground-truth labels (all zeros - positive at index 0)
        """
        # Compute query features
        q = self.encoder_q(x_q)  # (batch, dim)
        q = F.normalize(q, dim=1)

        # Compute key features (no gradient)
        with torch.no_grad():
            self._momentum_update_key_encoder()
            k = self.encoder_k(x_k)  # (batch, dim)
            k = F.normalize(k, dim=1)

        # Positive logits: (batch, 1)
        l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)

        # Negative logits from the queue: (batch, queue_size)
        l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])

        # Concatenate: positive at position 0
        logits = torch.cat([l_pos, l_neg], dim=1) / self.temperature

        # Labels: the positive is always at index 0
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)

        # Update the queue with the current keys
        self._dequeue_and_enqueue(k)

        return logits, labels
```

The training loop simply calls `logits, labels = moco(x_q, x_k)` and then `loss = F.cross_entropy(logits, labels)`. Because the positive is always placed at index 0, cross-entropy with all-zero labels is exactly the InfoNCE loss.
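As a concrete illustration, here is a minimal single-GPU training-loop sketch. The backbone choice, hyperparameters, and the two-view `train_loader` are assumptions made for illustration (the official multi-GPU recipe additionally shuffles BatchNorm statistics across GPUs, which is omitted here).

```python
import torch
import torch.nn.functional as F
import torchvision

# Sketch only: assumes `train_loader` yields a pair of augmented views
# (x_q, x_k) per image, e.g. via a two-crop transform wrapper.
model = MoCo(torchvision.models.resnet50(weights=None)).cuda()
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.03, momentum=0.9, weight_decay=1e-4,
)

for epoch in range(200):
    for (x_q, x_k), _ in train_loader:
        x_q = x_q.cuda(non_blocking=True)
        x_k = x_k.cuda(non_blocking=True)

        logits, labels = model(x_q, x_k)        # (batch, 1 + K), labels all zero
        loss = F.cross_entropy(logits, labels)  # InfoNCE

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```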
MoCo evolved significantly across three major versions, each incorporating lessons from the rapidly advancing field.
The original MoCo (v1) established the momentum encoder and queue mechanism: a ResNet-50 backbone with a linear projection head, a queue of 65,536 keys, momentum $m = 0.999$, and temperature $\tau = 0.07$, reaching 60.6% linear-evaluation accuracy on ImageNet.
After SimCLR revealed the importance of projection heads and augmentations, MoCo v2 incorporated these insights:
Changes from v1: replace the linear projection head with a 2-layer MLP, adopt stronger SimCLR-style augmentation (notably Gaussian blur), and use a cosine learning-rate schedule.
Result: 71.1% linear eval (+10.5% over v1) with the same compute.
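To make the augmentation upgrade concrete, the sketch below shows a v2-style pipeline built from torchvision transforms; the exact probabilities and parameters are illustrative approximations of the SimCLR-style recipe, not the official release.

```python
from torchvision import transforms

# SimCLR-style augmentation in the spirit of MoCo v2 (parameters illustrative)
moco_v2_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```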
| Component | MoCo v1 | MoCo v2 | MoCo v3 |
|---|---|---|---|
| Projection Head | Linear | MLP (2-layer) | MLP (3-layer) |
| Encoder | ResNet-50 | ResNet-50 | ViT-B/16 |
| Augmentation | Basic | SimCLR-style (+ blur, stronger color jitter) | SimCLR/BYOL-style |
| Queue | Yes | Yes | No (in-batch) |
| Momentum | 0.999 | 0.999 | 0.99 → 1 (cosine schedule) |
| Accuracy | 60.6% | 71.1% | 76.7% |
MoCo v3 adapted the framework for Vision Transformers (ViT), which have different training dynamics:
Key Changes: the memory queue is dropped in favor of negatives drawn from large batches, the backbone becomes a Vision Transformer whose patch-embedding layer is frozen at its random initialization to stabilize training, a prediction head is added on the query branch, and the contrastive loss is symmetrized over the two views.
MoCo v3 demonstrated that contrastive learning scales effectively to Transformers, achieving 76.7% with ViT-B and 81.0% with ViT-L.
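To show what the queue-free objective looks like, here is a minimal sketch of a symmetrized, in-batch contrastive loss in the spirit of MoCo v3; `q1`, `q2` are query-branch outputs and `k1`, `k2` momentum-encoder keys for the two views, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def ctr(q, k, temperature=0.2):
    """In-batch InfoNCE: the positive for row i is key i (same image, other view)."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.T / temperature                      # (batch, batch)
    labels = torch.arange(q.size(0), device=q.device)   # diagonal positives
    return F.cross_entropy(logits, labels)

# Symmetrized over the two augmented views (no queue; negatives come from the batch):
# loss = ctr(q1, k2) + ctr(q2, k1)
```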
For ResNets with limited compute: use MoCo v2. For ViT or when you have large batches available: consider MoCo v3 or SimCLR. The queue mechanism is most valuable when batch sizes are constrained.
Both MoCo and SimCLR achieve similar final performance—the choice depends on your constraints.
| Aspect | MoCo v2 | SimCLR |
|---|---|---|
| Batch Size Required | 256 works well | 4096+ for best results |
| GPU Memory | Lower (queue stores only compact embeddings, no activations) | Higher (activations for the full large batch) |
| Implementation Complexity | Higher (queue, momentum) | Lower (straightforward) |
| Multi-GPU Scaling | Easier (smaller batches) | Harder (sync across many) |
| Negative Diversity | Queue provides diversity | Batch provides diversity |
| Key Consistency | Approximate (momentum) | Exact (same encoder) |
| Final Accuracy (ResNet-50) | 71.1% | 69.3% |
MoCo's insight—that contrastive learning can be viewed as dictionary lookup—opened new design possibilities and made self-supervised learning accessible to teams without massive compute.
You now understand MoCo's elegant solution to the batch size problem. The momentum encoder and queue mechanism are fundamental techniques that appear in many later methods. Next, we'll examine positive and negative pairs more deeply—the fundamental unit of contrastive learning.