The discovery that self-supervised learning can work without explicit negative samples represents one of the most surprising developments in recent machine learning. Methods like BYOL and SimSiam achieve competitive or superior performance using only positive pairs, a result that initially seemed to contradict the prevailing view that negatives are necessary to prevent representation collapse.
These non-contrastive methods simplify training pipelines, reduce memory requirements, and challenge our understanding of what makes self-supervised learning work. Understanding them is essential for both practical applications and theoretical insight.
By the end of this page, you will understand why non-contrastive methods don't collapse, master the architectural innovations that enable learning without negatives, implement BYOL and SimSiam from first principles, and compare the tradeoffs between contrastive and non-contrastive approaches.
Before understanding non-contrastive methods, we must understand the problem they solve: representation collapse.
What is collapse?
If we train a network to maximize similarity between positive pairs without any countervailing force, the trivial solution is to map all inputs to the same constant representation. This achieves perfect similarity (similarity = 1) but produces completely useless representations.
Mathematically, the optimal collapsed solution for a positive-only objective is: $$f(x) = c \quad \forall x$$
where c is any constant vector.
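To make the failure mode concrete, here is a minimal sketch (hypothetical tensor shapes) showing that a collapsed encoder scores perfectly on a positive-only similarity objective while carrying no information:

```python
import torch
import torch.nn.functional as F

# Hypothetical collapsed encoder: every input maps to the same constant vector c.
batch_size, dim = 8, 128
c = torch.randn(dim)
embeddings = c.expand(batch_size, dim)      # f(x) = c for all x

view1 = F.normalize(embeddings, dim=-1)
view2 = F.normalize(embeddings, dim=-1)

# Positive-pair cosine similarity is exactly 1, so a positive-only loss is minimized...
print((view1 * view2).sum(dim=-1))          # all ones (up to float precision)

# ...yet the representation is useless: zero variance across inputs.
print(view1.std(dim=0).mean())              # ~0
```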
How contrastive methods prevent collapse:
Contrastive methods use negatives as a repulsive force. While the loss pulls positive pairs together, it simultaneously pushes negative pairs apart. This prevents the model from taking the easy path of mapping everything to a single point.
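This repulsive force is visible directly in the InfoNCE-style loss used by contrastive methods: the positive pair sits in the numerator while every negative pair enters the denominator, so pulling negatives closer increases the loss. A minimal sketch with in-batch negatives (the temperature `tau` is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of z1 is the positive for row i of z2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau              # (N, N) similarity matrix
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    # Cross-entropy = -log(exp(positive) / row sum); the row sum includes
    # all negatives, which is exactly the repulsive term.
    return F.cross_entropy(logits, targets)
```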
The question that drove non-contrastive research: Can we prevent collapse without explicit negatives?
BYOL (Bootstrap Your Own Latent) by Grill et al. (2020) was the first method to demonstrate that competitive self-supervised learning is possible without negatives. The key innovation is an asymmetric architecture that prevents collapse.
BYOL architecture:
The online network learns to predict the target network's representations. The asymmetry (predictor only in online) combined with the moving average target is what prevents collapse.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from copy import deepcopy


class BYOL(nn.Module):
    """
    Bootstrap Your Own Latent (BYOL) implementation.
    Learns representations without negative samples.
    """

    def __init__(
        self,
        encoder: nn.Module,
        hidden_dim: int = 4096,
        projection_dim: int = 256,
        momentum: float = 0.996
    ):
        super().__init__()

        # Get encoder output dimension
        self.encoder_dim = encoder.output_dim

        # Online network: encoder + projector + predictor
        self.online_encoder = encoder
        self.online_projector = self._build_projector(hidden_dim, projection_dim)
        self.predictor = self._build_predictor(hidden_dim, projection_dim)

        # Target network: encoder + projector (no predictor!)
        self.target_encoder = deepcopy(encoder)
        self.target_projector = deepcopy(self.online_projector)

        # Stop gradient on target network
        for param in self.target_encoder.parameters():
            param.requires_grad = False
        for param in self.target_projector.parameters():
            param.requires_grad = False

        self.momentum = momentum

    def _build_projector(self, hidden_dim: int, output_dim: int) -> nn.Module:
        return nn.Sequential(
            nn.Linear(self.encoder_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, output_dim)
        )

    def _build_predictor(self, hidden_dim: int, input_dim: int) -> nn.Module:
        return nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, input_dim)
        )

    @torch.no_grad()
    def update_target_network(self):
        """Update target network with exponential moving average."""
        for online, target in zip(
            self.online_encoder.parameters(), self.target_encoder.parameters()
        ):
            target.data = self.momentum * target.data + (1 - self.momentum) * online.data
        for online, target in zip(
            self.online_projector.parameters(), self.target_projector.parameters()
        ):
            target.data = self.momentum * target.data + (1 - self.momentum) * online.data

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        """
        Forward pass computing BYOL loss.

        Args:
            x1, x2: Two augmented views of the same batch

        Returns:
            BYOL loss (symmetrized)
        """
        # Online network forward
        online_proj_1 = self.online_projector(self.online_encoder(x1))
        online_proj_2 = self.online_projector(self.online_encoder(x2))
        online_pred_1 = self.predictor(online_proj_1)
        online_pred_2 = self.predictor(online_proj_2)

        # Target network forward (no gradients)
        with torch.no_grad():
            target_proj_1 = self.target_projector(self.target_encoder(x1))
            target_proj_2 = self.target_projector(self.target_encoder(x2))
            target_proj_1 = target_proj_1.detach()
            target_proj_2 = target_proj_2.detach()

        # BYOL loss: predict target from online
        loss_1 = self._regression_loss(online_pred_1, target_proj_2)
        loss_2 = self._regression_loss(online_pred_2, target_proj_1)
        return (loss_1 + loss_2) / 2

    def _regression_loss(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Normalized MSE loss."""
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        return 2 - 2 * (x * y).sum(dim=-1).mean()
```

SimSiam (Simple Siamese Representation Learning) by Chen & He (2021) demonstrated that even the momentum encoder in BYOL is unnecessary. SimSiam achieves competitive results with just stop-gradient and a predictor.
SimSiam's surprising simplicity:
| Method | Negatives | Momentum Encoder | Predictor | Stop-Gradient |
|---|---|---|---|---|
| SimCLR | ✓ (large batch) | ✗ | ✗ | ✗ |
| MoCo | ✓ (memory queue) | ✓ | ✗ | ✗ |
| BYOL | ✗ | ✓ | ✓ | ✓ |
| SimSiam | ✗ | ✗ | ✓ | ✓ |
| Barlow Twins | ✗ | ✗ | ✗ | ✗ |
SimSiam can be understood as an implicit EM-like alternating optimization: the stop-gradient fixes one set of variables (analogous to cluster assignments) while gradient descent updates the other (analogous to cluster centers), with the predictor approximating the expectation of the representation over augmentations. Under this view, it is the alternating structure of the updates, rather than an explicit repulsive force, that keeps the representations from collapsing.
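A minimal SimSiam-style sketch, following the same conventions as the BYOL code above (the encoder is assumed to expose an `output_dim` attribute); the only collapse-prevention ingredients are the predictor and the stop-gradient (`detach`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimSiam(nn.Module):
    """Minimal SimSiam sketch: shared encoder + projector, predictor, stop-gradient."""

    def __init__(self, encoder: nn.Module, proj_dim: int = 2048, pred_hidden: int = 512):
        super().__init__()
        self.encoder = encoder  # assumed to expose .output_dim
        self.projector = nn.Sequential(
            nn.Linear(encoder.output_dim, proj_dim),
            nn.BatchNorm1d(proj_dim),
            nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),
            nn.BatchNorm1d(proj_dim),
        )
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, pred_hidden),
            nn.BatchNorm1d(pred_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(pred_hidden, proj_dim),
        )

    @staticmethod
    def _neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Stop-gradient on the target branch is the key ingredient.
        z = z.detach()
        return -F.cosine_similarity(p, z, dim=-1).mean()

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        z1 = self.projector(self.encoder(x1))
        z2 = self.projector(self.encoder(x2))
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Symmetrized loss: each prediction regresses the other view's detached projection.
        return 0.5 * (self._neg_cosine(p1, z2) + self._neg_cosine(p2, z1))
```

Note there is no momentum encoder and no `update_target_network` step: both views pass through the same weights, and asymmetry comes only from where the predictor and the `detach` are applied.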
The question of why BYOL and SimSiam don't collapse has generated significant research, and several complementary explanations have emerged: the asymmetry between the predictor branch and the stop-gradient branch creates learning dynamics whose stable solutions are non-collapsed; the slowly updated target network in BYOL supplies stable regression targets; and analyses of simplified linear models suggest that the predictor and stop-gradient together steer training away from the collapsed fixed point.
Non-contrastive methods are more sensitive to hyperparameters than contrastive methods. Small changes in architecture (e.g., removing BatchNorm), learning rate, or weight decay can lead to collapse. Careful monitoring and adherence to proven recipes are essential.
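One practical check, popularized by the SimSiam paper, is to track the per-dimension standard deviation of the L2-normalized projections: it hovers around $1/\sqrt{d}$ during healthy training and falls toward zero as the model collapses. A minimal monitoring sketch (the projection tensor name is assumed):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def output_std(z: torch.Tensor) -> float:
    """Mean per-dimension std of L2-normalized embeddings.

    Healthy training: roughly 1 / sqrt(d).  Collapse: approaches 0.
    """
    z = F.normalize(z, dim=-1)
    return z.std(dim=0).mean().item()

# Example: log this alongside the loss each step
# (online_proj_1 as in the BYOL code above):
# std = output_std(online_proj_1)   # compare against online_proj_1.size(1) ** -0.5
```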
Beyond BYOL and SimSiam, other approaches prevent collapse through different mechanisms:
The Barlow Twins objective:
$$\mathcal{L}_{BT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2$$
where C is the cross-correlation matrix between the two views' embeddings. The first term pushes diagonal elements to 1 (invariance), while the second term pushes off-diagonal elements to 0 (decorrelation).
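A minimal sketch of this objective, assuming two batches of embeddings `z1`, `z2` of shape `(N, D)` and a hypothetical trade-off weight `lambd`:

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    """Invariance (diagonal -> 1) plus redundancy reduction (off-diagonal -> 0)."""
    n, d = z1.shape
    # Standardize each dimension across the batch so C is a cross-correlation matrix.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                                          # (D, D)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelation term
    return on_diag + lambd * off_diag
```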
You now understand non-contrastive methods—the surprising approaches that learn without explicit negatives. Next, we'll explore evaluation protocols for assessing self-supervised representations.