MoCo (Momentum Contrast), developed by Facebook AI Research (He et al., 2020), addressed SimCLR's most significant limitation: the requirement for massive batch sizes. MoCo's insight was elegant—instead of drawing negatives from the current batch, maintain a queue of recent embeddings that serves as a large, consistent dictionary for contrastive learning.
This simple architectural change has profound implications. MoCo achieves comparable or superior results to SimCLR while training with standard 256-sample batches on 8 GPUs—democratizing self-supervised learning for labs without access to TPU pods or massive GPU clusters.
By the end of this page, you will understand: (1) The dictionary-as-queue paradigm, (2) Why momentum encoders are necessary for consistency, (3) The complete MoCo training algorithm, (4) The evolution from MoCo v1 through v3, and (5) How to implement MoCo for your own applications.
MoCo reframes contrastive learning as building and querying a dynamic dictionary. This perspective illuminates what makes contrastive learning work and why certain design choices matter.
In contrastive learning, we have an encoded query $q$ and a dictionary of encoded keys $\{k_0, k_1, k_2, \ldots\}$, exactly one of which, the positive key $k_+$, matches the query.
The contrastive loss asks: "Can we identify the matching key for a given query?"
$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\exp(q \cdot k_+ / \tau) + \sum_{k_-} \exp(q \cdot k_- / \tau)}$$
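To make the formula concrete, here is a minimal PyTorch sketch for a single query; the embedding dimension, number of negatives, and temperature below are illustrative choices, not values fixed by the method.

```python
import torch
import torch.nn.functional as F

# One query `q`, one positive key `k_pos`, and a stack of negative keys `k_negs`.
# All embeddings are L2-normalized, as in MoCo.
dim, num_negatives, tau = 128, 4096, 0.07
q = F.normalize(torch.randn(dim), dim=0)
k_pos = F.normalize(torch.randn(dim), dim=0)
k_negs = F.normalize(torch.randn(num_negatives, dim), dim=1)

logits = torch.cat([(q * k_pos).sum().view(1),   # q . k+
                    k_negs @ q]) / tau           # q . k- for every negative
loss = -F.log_softmax(logits, dim=0)[0]          # positive sits at index 0
```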
For effective contrastive learning, the dictionary should be:
Large — More keys provide more negatives, enabling the model to learn finer distinctions
Consistent — Keys should be encoded by the same or similar encoder; mixing old and new encoder versions creates inconsistent comparisons
This is where SimCLR and MoCo differ fundamentally.
The queue contains keys computed at different training steps. If we used the main encoder (which changes every step), old keys would be inconsistent with new ones. The solution: a momentum encoder that evolves slowly.
Let $\theta_q$ be the query encoder parameters (updated by backprop) and $\theta_k$ be the key encoder parameters. The key encoder is updated as:
$$\theta_k \leftarrow m \cdot \theta_k + (1 - m) \cdot \theta_q$$
where $m \in [0, 1)$ is the momentum coefficient, typically $m = 0.999$.
With $m = 0.999$, each update moves the key encoder only 0.1% of the way toward the current query encoder, so the key encoder changes very gradually even though the query encoder is updated by backprop at every step.
This slow evolution means keys computed 1000 steps apart are still reasonably consistent—they come from similar (though not identical) encoders.
The queue operates as a FIFO buffer: at each training step, the current batch of keys is enqueued and the oldest batch is dequeued, so the dictionary always holds the $K$ most recent keys and its size is decoupled from the batch size.
The momentum coefficient $m$ is crucial. Too low (e.g., $m = 0.9$) and keys become inconsistent quickly. Too high (e.g., $m = 0.9999$) and the key encoder lags too far behind, potentially encoding outdated features. $m = 0.999$ balances consistency with adaptability.
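A quick back-of-the-envelope check in plain Python makes this trade-off visible: after $N$ momentum updates, a key-encoder parameter retains roughly a fraction $m^N$ of its old value. The step counts below are illustrative.

```python
# Residual influence of an old key-encoder state after N momentum updates: m ** N
for m in (0.9, 0.999, 0.9999):
    print(m, [round(m ** n, 4) for n in (100, 1000, 10000)])
# 0.9    -> [0.0,    0.0,    0.0]     keys drift almost immediately
# 0.999  -> [0.9048, 0.3677, 0.0]     keys ~1000 steps old remain consistent
# 0.9999 -> [0.99,   0.9048, 0.3679]  the key encoder lags far behind
```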
Let's walk through MoCo's training procedure step by step.
Input: Dataset $\mathcal{D}$, queue size $K$, momentum $m$, temperature $\tau$
Initialize: the query encoder $\theta_q$, the key encoder $\theta_k \leftarrow \theta_q$, and a queue of $K$ random, L2-normalized embeddings.
```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class MoCo(nn.Module):
    """
    MoCo: Momentum Contrast for Unsupervised Representation Learning.
    """

    def __init__(
        self,
        encoder,
        dim=128,
        queue_size=65536,
        momentum=0.999,
        temperature=0.07
    ):
        super().__init__()
        self.queue_size = queue_size
        self.momentum = momentum
        self.temperature = temperature

        # Query encoder (updated by backprop), with an MLP projection head
        # replacing the classifier of a torchvision-style backbone.
        self.encoder_q = encoder
        hidden_dim = self.encoder_q.fc.in_features
        self.encoder_q.fc = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, dim)
        )

        # Key encoder (updated by momentum): starts as an exact copy
        self.encoder_k = copy.deepcopy(self.encoder_q)
        for param in self.encoder_k.parameters():
            param.requires_grad = False  # No gradients for momentum encoder

        # Initialize queue with random, L2-normalized embeddings
        self.register_buffer("queue", torch.randn(dim, queue_size))
        self.queue = F.normalize(self.queue, dim=0)
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update_key_encoder(self):
        """Momentum update of the key encoder."""
        for param_q, param_k in zip(
            self.encoder_q.parameters(), self.encoder_k.parameters()
        ):
            param_k.data = (
                self.momentum * param_k.data
                + (1 - self.momentum) * param_q.data
            )

    @torch.no_grad()
    def _dequeue_and_enqueue(self, keys):
        """Update the queue with new keys (FIFO)."""
        batch_size = keys.size(0)
        ptr = int(self.queue_ptr)

        # Replace the oldest keys
        if ptr + batch_size <= self.queue_size:
            self.queue[:, ptr:ptr + batch_size] = keys.T
        else:
            # Wrap around the end of the queue
            remaining = self.queue_size - ptr
            self.queue[:, ptr:] = keys[:remaining].T
            self.queue[:, :batch_size - remaining] = keys[remaining:].T

        # Move the pointer
        self.queue_ptr[0] = (ptr + batch_size) % self.queue_size

    def forward(self, x_q, x_k):
        """
        Forward pass for MoCo training.

        Args:
            x_q: Query images (first augmented view)
            x_k: Key images (second augmented view)

        Returns:
            logits: Similarity logits (batch, 1 + queue_size)
            labels: Ground-truth labels (all zeros - positive at index 0)
        """
        # Compute query features
        q = self.encoder_q(x_q)  # (batch, dim)
        q = F.normalize(q, dim=1)

        # Compute key features (no gradient)
        with torch.no_grad():
            self._momentum_update_key_encoder()
            k = self.encoder_k(x_k)  # (batch, dim)
            k = F.normalize(k, dim=1)

        # Positive logits: (batch, 1)
        l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)

        # Negative logits from the queue: (batch, queue_size)
        l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])

        # Concatenate: positive at position 0
        logits = torch.cat([l_pos, l_neg], dim=1) / self.temperature

        # Labels: the positive is always at index 0
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)

        # Update the queue with the current keys
        self._dequeue_and_enqueue(k)

        return logits, labels
```

The training loop simply calls `logits, labels = moco(x_q, x_k)` and then `loss = F.cross_entropy(logits, labels)`. Because the positive is always placed at index 0, cross-entropy with all-zero labels is exactly the InfoNCE loss.
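As a concrete illustration, here is a minimal single-GPU training-loop sketch. The backbone choice, hyperparameters, and the two-view `train_loader` are assumptions made for illustration (the official multi-GPU recipe additionally shuffles BatchNorm statistics across GPUs, which is omitted here).

```python
import torch
import torch.nn.functional as F
import torchvision

# Sketch only: assumes `train_loader` yields a pair of augmented views
# (x_q, x_k) per image, e.g. via a two-crop transform wrapper.
model = MoCo(torchvision.models.resnet50(weights=None)).cuda()
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.03, momentum=0.9, weight_decay=1e-4,
)

for epoch in range(200):
    for (x_q, x_k), _ in train_loader:
        x_q = x_q.cuda(non_blocking=True)
        x_k = x_k.cuda(non_blocking=True)

        logits, labels = model(x_q, x_k)        # (batch, 1 + K), labels all zero
        loss = F.cross_entropy(logits, labels)  # InfoNCE

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```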
MoCo evolved significantly across three major versions, each incorporating lessons from the rapidly advancing field.
The original MoCo (v1) established the momentum encoder and queue mechanism: a ResNet-50 backbone with a linear projection head, a queue of 65,536 keys, momentum $m = 0.999$, and temperature $\tau = 0.07$, reaching 60.6% linear-evaluation accuracy on ImageNet.
After SimCLR revealed the importance of projection heads and augmentations, MoCo v2 incorporated these insights:
Changes from v1: replace the linear projection head with a 2-layer MLP, adopt stronger SimCLR-style augmentation (notably Gaussian blur), and use a cosine learning-rate schedule.
Result: 71.1% linear eval (+10.5% over v1) with the same compute.
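To make the augmentation upgrade concrete, the sketch below shows a v2-style pipeline built from torchvision transforms; the exact probabilities and parameters are illustrative approximations of the SimCLR-style recipe, not the official release.

```python
from torchvision import transforms

# SimCLR-style augmentation in the spirit of MoCo v2 (parameters illustrative)
moco_v2_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```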
| Component | MoCo v1 | MoCo v2 | MoCo v3 |
|---|---|---|---|
| Projection Head | Linear | MLP (2-layer) | MLP (3-layer) |
| Encoder | ResNet-50 | ResNet-50 | ViT-B/16 |
| Augmentation | Basic | SimCLR-style (+ blur, stronger color jitter) | SimCLR/BYOL-style |
| Queue | Yes | Yes | No (in-batch) |
| Momentum | 0.999 | 0.999 | 0.99 → 1 (cosine schedule) |
| Accuracy | 60.6% | 71.1% | 76.7% |
MoCo v3 adapted the framework for Vision Transformers (ViT), which have different training dynamics:
Key Changes: the memory queue is dropped in favor of negatives drawn from large batches, the backbone becomes a Vision Transformer whose patch-embedding layer is frozen at its random initialization to stabilize training, a prediction head is added on the query branch, and the contrastive loss is symmetrized over the two views.
MoCo v3 demonstrated that contrastive learning scales effectively to Transformers, achieving 76.7% with ViT-B and 81.0% with ViT-L.
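To show what the queue-free objective looks like, here is a minimal sketch of a symmetrized, in-batch contrastive loss in the spirit of MoCo v3; `q1`, `q2` are query-branch outputs and `k1`, `k2` momentum-encoder keys for the two views, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def ctr(q, k, temperature=0.2):
    """In-batch InfoNCE: the positive for row i is key i (same image, other view)."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.T / temperature                      # (batch, batch)
    labels = torch.arange(q.size(0), device=q.device)   # diagonal positives
    return F.cross_entropy(logits, labels)

# Symmetrized over the two augmented views (no queue; negatives come from the batch):
# loss = ctr(q1, k2) + ctr(q2, k1)
```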
For ResNets with limited compute: use MoCo v2. For ViT or when you have large batches available: consider MoCo v3 or SimCLR. The queue mechanism is most valuable when batch sizes are constrained.
Both MoCo and SimCLR achieve similar final performance—the choice depends on your constraints.
| Aspect | MoCo v2 | SimCLR |
|---|---|---|
| Batch Size Required | 256 works well | 4096+ for best results |
| GPU Memory | Lower (queue stores only compact embeddings, no activations) | Higher (activations for the full large batch) |
| Implementation Complexity | Higher (queue, momentum) | Lower (straightforward) |
| Multi-GPU Scaling | Easier (smaller batches) | Harder (sync across many) |
| Negative Diversity | Queue provides diversity | Batch provides diversity |
| Key Consistency | Approximate (momentum) | Exact (same encoder) |
| Final Accuracy (ResNet-50) | 71.1% | 69.3% |
MoCo's insight—that contrastive learning can be viewed as dictionary lookup—opened new design possibilities and made self-supervised learning accessible to teams without massive compute.
You now understand MoCo's elegant solution to the batch size problem. The momentum encoder and queue mechanism are fundamental techniques that appear in many later methods. Next, we'll examine positive and negative pairs more deeply—the fundamental unit of contrastive learning.