The discovery that self-supervised learning can work without explicit negative samples represents one of the most surprising developments in recent machine learning. Methods like BYOL and SimSiam achieve competitive or superior performance using only positive pairs, a result that initially seemed to contradict the prevailing view that negatives are necessary to prevent representation collapse.
These non-contrastive methods simplify training pipelines, reduce memory requirements, and challenge our understanding of what makes self-supervised learning work. Understanding them is essential for both practical applications and theoretical insight.
By the end of this page, you will understand why non-contrastive methods don't collapse, master the architectural innovations that enable learning without negatives, implement BYOL and SimSiam from first principles, and compare the tradeoffs between contrastive and non-contrastive approaches.
Before understanding non-contrastive methods, we must understand the problem they solve: representation collapse.
What is collapse?
If we train a network to maximize similarity between positive pairs without any countervailing force, the trivial solution is to map all inputs to the same constant representation. This achieves perfect similarity (similarity = 1) but produces completely useless representations.
Mathematically, the optimal collapsed solution for a positive-only objective is: $$f(x) = c \quad \forall x$$
where c is any constant vector.
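To make the failure mode concrete, here is a minimal sketch (hypothetical tensor shapes) showing that a collapsed encoder scores perfectly on a positive-only similarity objective while carrying no information:

```python
import torch
import torch.nn.functional as F

# Hypothetical collapsed encoder: every input maps to the same constant vector c.
batch_size, dim = 8, 128
c = torch.randn(dim)
embeddings = c.expand(batch_size, dim)      # f(x) = c for all x

view1 = F.normalize(embeddings, dim=-1)
view2 = F.normalize(embeddings, dim=-1)

# Positive-pair cosine similarity is exactly 1, so a positive-only loss is minimized...
print((view1 * view2).sum(dim=-1))          # all ones (up to float precision)

# ...yet the representation is useless: zero variance across inputs.
print(view1.std(dim=0).mean())              # ~0
```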
How contrastive methods prevent collapse:
Contrastive methods use negatives as a repulsive force. While the loss pulls positive pairs together, it simultaneously pushes negative pairs apart. This prevents the model from taking the easy path of mapping everything to a single point.
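This repulsive force is visible directly in the InfoNCE-style loss used by contrastive methods: the positive pair sits in the numerator while every negative pair enters the denominator, so pulling negatives closer increases the loss. A minimal sketch with in-batch negatives (the temperature `tau` is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of z1 is the positive for row i of z2."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau              # (N, N) similarity matrix
    targets = torch.arange(z1.size(0))      # positives lie on the diagonal
    # Cross-entropy = -log(exp(positive) / row sum); the row sum includes
    # all negatives, which is exactly the repulsive term.
    return F.cross_entropy(logits, targets)
```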
The question that drove non-contrastive research: Can we prevent collapse without explicit negatives?
BYOL (Bootstrap Your Own Latent) by Grill et al. (2020) was the first method to demonstrate that competitive self-supervised learning is possible without negatives. The key innovation is an asymmetric architecture that prevents collapse.
BYOL architecture:
The online network learns to predict the target network's representations. The asymmetry (predictor only in online) combined with the moving average target is what prevents collapse.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from copy import deepcopy


class BYOL(nn.Module):
    """
    Bootstrap Your Own Latent (BYOL) implementation.
    Learns representations without negative samples.
    """

    def __init__(
        self,
        encoder: nn.Module,
        hidden_dim: int = 4096,
        projection_dim: int = 256,
        momentum: float = 0.996
    ):
        super().__init__()

        # Get encoder output dimension
        self.encoder_dim = encoder.output_dim

        # Online network: encoder + projector + predictor
        self.online_encoder = encoder
        self.online_projector = self._build_projector(hidden_dim, projection_dim)
        self.predictor = self._build_predictor(hidden_dim, projection_dim)

        # Target network: encoder + projector (no predictor!)
        self.target_encoder = deepcopy(encoder)
        self.target_projector = deepcopy(self.online_projector)

        # Stop gradient on target network
        for param in self.target_encoder.parameters():
            param.requires_grad = False
        for param in self.target_projector.parameters():
            param.requires_grad = False

        self.momentum = momentum

    def _build_projector(self, hidden_dim: int, output_dim: int) -> nn.Module:
        return nn.Sequential(
            nn.Linear(self.encoder_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, output_dim)
        )

    def _build_predictor(self, hidden_dim: int, input_dim: int) -> nn.Module:
        return nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, input_dim)
        )

    @torch.no_grad()
    def update_target_network(self):
        """Update target network with exponential moving average."""
        for online, target in zip(
            self.online_encoder.parameters(), self.target_encoder.parameters()
        ):
            target.data = self.momentum * target.data + (1 - self.momentum) * online.data
        for online, target in zip(
            self.online_projector.parameters(), self.target_projector.parameters()
        ):
            target.data = self.momentum * target.data + (1 - self.momentum) * online.data

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        """
        Forward pass computing BYOL loss.

        Args:
            x1, x2: Two augmented views of the same batch

        Returns:
            BYOL loss (symmetrized)
        """
        # Online network forward
        online_proj_1 = self.online_projector(self.online_encoder(x1))
        online_proj_2 = self.online_projector(self.online_encoder(x2))
        online_pred_1 = self.predictor(online_proj_1)
        online_pred_2 = self.predictor(online_proj_2)

        # Target network forward (no gradients)
        with torch.no_grad():
            target_proj_1 = self.target_projector(self.target_encoder(x1))
            target_proj_2 = self.target_projector(self.target_encoder(x2))
            target_proj_1 = target_proj_1.detach()
            target_proj_2 = target_proj_2.detach()

        # BYOL loss: predict target from online
        loss_1 = self._regression_loss(online_pred_1, target_proj_2)
        loss_2 = self._regression_loss(online_pred_2, target_proj_1)
        return (loss_1 + loss_2) / 2

    def _regression_loss(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Normalized MSE loss."""
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        return 2 - 2 * (x * y).sum(dim=-1).mean()
```

SimSiam (Simple Siamese Representation Learning) by Chen & He (2021) demonstrated that even the momentum encoder in BYOL is unnecessary. SimSiam achieves competitive results with just stop-gradient and a predictor.
SimSiam's surprising simplicity:
| Method | Negatives | Momentum Encoder | Predictor | Stop-Gradient |
|---|---|---|---|---|
| SimCLR | ✓ (large batch) | ✗ | ✗ | ✗ |
| MoCo | ✓ (memory queue) | ✓ | ✗ | ✗ |
| BYOL | ✗ | ✓ | ✓ | ✓ |
| SimSiam | ✗ | ✗ | ✓ | ✓ |
| Barlow Twins | ✗ | ✗ | ✗ | ✗ |
SimSiam can be understood as an implicit EM-like alternating optimization: the stop-gradient fixes one set of variables (analogous to cluster assignments) while gradient descent updates the other (analogous to cluster centers), with the predictor approximating the expectation of the representation over augmentations. Under this view, it is the alternating structure of the updates, rather than an explicit repulsive force, that keeps the representations from collapsing.
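A minimal SimSiam-style sketch, following the same conventions as the BYOL code above (the encoder is assumed to expose an `output_dim` attribute); the only collapse-prevention ingredients are the predictor and the stop-gradient (`detach`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimSiam(nn.Module):
    """Minimal SimSiam sketch: shared encoder + projector, predictor, stop-gradient."""

    def __init__(self, encoder: nn.Module, proj_dim: int = 2048, pred_hidden: int = 512):
        super().__init__()
        self.encoder = encoder  # assumed to expose .output_dim
        self.projector = nn.Sequential(
            nn.Linear(encoder.output_dim, proj_dim),
            nn.BatchNorm1d(proj_dim),
            nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),
            nn.BatchNorm1d(proj_dim),
        )
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, pred_hidden),
            nn.BatchNorm1d(pred_hidden),
            nn.ReLU(inplace=True),
            nn.Linear(pred_hidden, proj_dim),
        )

    @staticmethod
    def _neg_cosine(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Stop-gradient on the target branch is the key ingredient.
        z = z.detach()
        return -F.cosine_similarity(p, z, dim=-1).mean()

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        z1 = self.projector(self.encoder(x1))
        z2 = self.projector(self.encoder(x2))
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Symmetrized loss: each prediction regresses the other view's detached projection.
        return 0.5 * (self._neg_cosine(p1, z2) + self._neg_cosine(p2, z1))
```

Note there is no momentum encoder and no `update_target_network` step: both views pass through the same weights, and asymmetry comes only from where the predictor and the `detach` are applied.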
The question of why BYOL and SimSiam don't collapse has generated significant research, and several complementary explanations have emerged: the asymmetry between the predictor branch and the stop-gradient branch creates learning dynamics whose stable solutions are non-collapsed; the slowly updated target network in BYOL supplies stable regression targets; and analyses of simplified linear models suggest that the predictor and stop-gradient together steer training away from the collapsed fixed point.
Non-contrastive methods are more sensitive to hyperparameters than contrastive methods. Small changes in architecture (e.g., removing BatchNorm), learning rate, or weight decay can lead to collapse. Careful monitoring and adherence to proven recipes are essential.
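One practical check, popularized by the SimSiam paper, is to track the per-dimension standard deviation of the L2-normalized projections: it hovers around $1/\sqrt{d}$ during healthy training and falls toward zero as the model collapses. A minimal monitoring sketch (the projection tensor name is assumed):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def output_std(z: torch.Tensor) -> float:
    """Mean per-dimension std of L2-normalized embeddings.

    Healthy training: roughly 1 / sqrt(d).  Collapse: approaches 0.
    """
    z = F.normalize(z, dim=-1)
    return z.std(dim=0).mean().item()

# Example: log this alongside the loss each step
# (online_proj_1 as in the BYOL code above):
# std = output_std(online_proj_1)   # compare against online_proj_1.size(1) ** -0.5
```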
Beyond BYOL and SimSiam, other approaches prevent collapse through different mechanisms:
The Barlow Twins objective:
$$\mathcal{L}_{BT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2$$
where C is the cross-correlation matrix between the two views' embeddings. The first term pushes diagonal elements to 1 (invariance), while the second term pushes off-diagonal elements to 0 (decorrelation).
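A minimal sketch of this objective, assuming two batches of embeddings `z1`, `z2` of shape `(N, D)` and a hypothetical trade-off weight `lambd`:

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    """Invariance (diagonal -> 1) plus redundancy reduction (off-diagonal -> 0)."""
    n, d = z1.shape
    # Standardize each dimension across the batch so C is a cross-correlation matrix.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    c = (z1.T @ z2) / n                                          # (D, D)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelation term
    return on_diag + lambd * off_diag
```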
You now understand non-contrastive methods—the surprising approaches that learn without explicit negatives. Next, we'll explore evaluation protocols for assessing self-supervised representations.