At its core, contrastive learning is about comparison. We learn representations by contrasting similar things (positive pairs) against dissimilar things (negative pairs). This simple principle—learning what goes together and what doesn't—underlies all contrastive methods.
But this simplicity hides deep questions: What makes a good positive pair? How do we sample effective negatives? What happens when our assumptions about positives and negatives are violated? This page explores the foundational concepts that determine whether contrastive learning succeeds or fails.
By the end of this page, you will understand: (1) How positive pairs encode invariances we want to learn, (2) The spectrum of negative sampling strategies, (3) The false negative problem and its solutions, (4) Hard negative mining and its tradeoffs, and (5) How pair selection affects learned representations.
A positive pair consists of two inputs that should have similar representations. The choice of positive pairs implicitly defines what the representation should be invariant to.
When we train a model to give similar representations to $(x, x^+)$, we're saying: "Whatever differs between $x$ and $x^+$ is not important for representation."
For augmentation-based positives, the representation becomes invariant to the augmentation transformations that were applied.
| Strategy | How Positives Are Created | Invariance Learned | Example Methods |
|---|---|---|---|
| Data Augmentation | Apply random transforms to same image | Color, scale, crop, orientation | SimCLR, MoCo, BYOL |
| Temporal Adjacency | Nearby frames in video | Short-term motion, viewpoint | CPC, VideoMoCo |
| Multi-view | Different camera angles of same scene | Viewpoint, lighting | Multi-view learning |
| Cross-modal | Different modalities of same content | Modality-specific noise | CLIP (image-text) |
| Supervised | Same-class samples as positives | Intra-class variation | Supervised Contrastive |
Augmentation-based positives rely on a critical assumption: semantic content is preserved under augmentations. When we crop an image of a dog, it's still a dog. When we change the colors, it's still the same dog.
This assumption can break in subtle ways. Problematic cases include: an aggressive crop that excludes the object entirely (the "dog" view contains only grass), color jitter applied to categories that are defined by color, and heavy blur or occlusion that removes the fine-grained details a downstream task depends on.
The art of contrastive learning is designing augmentations that are strong enough to make the contrastive task non-trivial, yet gentle enough to preserve the semantic content the downstream task cares about.
The invariances you encode through augmentation should match your downstream task. If you make representations invariant to color (via color jitter), they won't work well for tasks where color matters (e.g., distinguishing ripe vs unripe fruit). Design augmentations with your end task in mind.
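To make this concrete, here is a minimal sketch of how augmentation-based positive pairs are typically produced with torchvision; the specific transform magnitudes below are illustrative choices, not the canonical recipe of any particular method.

```python
import torchvision.transforms as T

# Two independently sampled views of one image form a positive pair.
# The transforms chosen here define the invariances the encoder will learn;
# the magnitudes are illustrative, not a canonical recipe.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),  # invariance to scale and crop
    T.RandomHorizontalFlip(),                    # invariance to left-right flips
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),           # invariance to color shifts
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def make_positive_pair(pil_image):
    """Return two augmented views (a positive pair) of the same image."""
    return augment(pil_image), augment(pil_image)
```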
Negative pairs are samples that should have dissimilar representations. Without negatives, the trivial solution is to map everything to the same point—perfect positive alignment but useless representations.
Negatives serve as a repulsive force in representation space:
$$\mathcal{L} = -\log \frac{e^{sim(q, k^+)/\tau}}{e^{sim(q, k^+)/\tau} + \sum_{i} e^{sim(q, k^-_i)/\tau}}$$
The denominator pushes the query away from negatives. More negatives = more repulsion from more directions = more uniform representation distribution.
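A minimal sketch of this loss in PyTorch, assuming the query, its positive key, and a shared pool of negative keys are already encoded as tensors:

```python
import torch
import torch.nn.functional as F

def info_nce(query, pos_key, neg_keys, temperature=0.07):
    """InfoNCE: pull the query toward its positive, push it away from negatives.

    query:    (batch, dim) query embeddings
    pos_key:  (batch, dim) matching positive embeddings
    neg_keys: (num_neg, dim) shared negatives, e.g. from a memory bank
    """
    query = F.normalize(query, dim=-1)
    pos_key = F.normalize(pos_key, dim=-1)
    neg_keys = F.normalize(neg_keys, dim=-1)

    pos_logit = (query * pos_key).sum(dim=-1, keepdim=True) / temperature  # (batch, 1)
    neg_logits = query @ neg_keys.T / temperature                          # (batch, num_neg)

    # The positive sits at index 0; cross-entropy implements the -log softmax.
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(len(query), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)
```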
The number of negatives has a profound impact:
| Negatives | Effect | Practical Consideration |
|---|---|---|
| Too few (<100) | Weak repulsion; easy task; less learning | May lead to collapse |
| Moderate (256-4K) | Balanced learning; good for smaller setups | Works with careful tuning |
| Many (4K-64K) | Strong repulsion; diverse negatives | Requires compute or memory bank |
| Very many (>64K) | Diminishing returns; redundancy | May not help beyond this point |
The InfoNCE bound on mutual information ($\log(K+1)$) explains why: with $K=65535$ negatives, we can estimate up to ~11 nats of MI. Beyond this, more negatives provide marginal gains.
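As a quick check of that figure:

$$I(q; k^+) \le \log(K + 1) = \log(65{,}536) = 16 \ln 2 \approx 11.1 \text{ nats}$$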
A critical but often overlooked issue: some negatives aren't truly negative. When sampling negatives randomly, some may actually share semantic content with the query.
Consider ImageNet with 1,000 roughly balanced classes: a randomly drawn negative has about a 1-in-1,000 chance of belonging to the query's class, so a large negative set is essentially guaranteed to contain false negatives (see the quick calculation below).
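Assuming negatives are drawn uniformly at random and classes are perfectly balanced, the expected number of false negatives in a MoCo-sized queue of $K = 65{,}536$ over $C = 1000$ classes is roughly

$$\mathbb{E}[\#\text{false negatives}] = \frac{K}{C} \approx \frac{65{,}536}{1000} \approx 65 \text{ per query}$$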
These false negatives create conflicting gradients: the contrastive loss pushes the query away from samples it should, semantically, stay close to, while true positives of the same class pull in the opposite direction. The model receives contradictory supervision.
| Aspect | Effect of False Negatives | Severity |
|---|---|---|
| Gradient direction | Conflicting signals slow learning | Moderate |
| Representation structure | Same-class items pushed apart | High for fine-grained tasks |
| Loss optimization | Lower bound on achievable loss | Creates floor |
| Cluster formation | Clusters may split along false boundaries | Depends on false-negative rate |
1. Temperature Scaling
Higher temperature softens the softmax distribution, reducing the weight on any single negative: $$p_i = \frac{e^{s_i/\tau}}{\sum_j e^{s_j/\tau}}$$
With high $\tau$, even false negatives with high similarity don't dominate gradients.
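A small numerical sketch of this effect, using made-up similarity values: it measures how much of the repulsive gradient mass falls on one near-duplicate negative at two temperatures.

```python
import torch

# Similarities between a query and its negatives; the first one (0.85) is a
# near-duplicate and likely a false negative. Each negative's gradient is
# proportional to its softmax weight, so we look at how the repulsive force
# is distributed over the negatives at two temperatures.
neg_sims = torch.tensor([0.85, 0.30, 0.10, -0.20])

for tau in (0.07, 0.2):
    weights = torch.softmax(neg_sims / tau, dim=0)
    print(f"tau={tau}: share of repulsion on the suspected false negative = {weights[0]:.3f}")

# Approximate output: tau=0.07 -> ~0.9996 (prints as 1.000), tau=0.2 -> ~0.915.
# Low temperature focuses nearly all repulsion on the hardest (possibly false)
# negative; a higher temperature spreads it across the other negatives.
```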
2. Debiased Contrastive Loss
Chuang et al. (2020) proposed explicitly correcting the negative term for the presence of false negatives: $$\mathcal{L}_{\text{debiased}} = -\log \frac{e^{f(x, x^+)/\tau}}{e^{f(x, x^+)/\tau} + \frac{N}{1 - \tau^+}\left(g^- - \tau^+ g^+\right)}$$
where $\tau^+$ is the prior probability that a randomly drawn "negative" actually shares the query's class, and $g^-$, $g^+$ are empirical estimates of the expected exponentiated similarity to random samples and to true positives, respectively (in practice clamped from below so the denominator stays positive).
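A hedged sketch of this estimator, following the form in Chuang et al. (2020): the class prior $\tau^+$ must be supplied or estimated, and for simplicity the known positive stands in for the extra positive samples used to compute $g^+$.

```python
import math
import torch

def debiased_info_nce(pos_sim, neg_sims, tau=0.07, tau_plus=0.001):
    """Debiased contrastive loss (sketch after Chuang et al., 2020).

    pos_sim:  (batch,)          similarity sim(q, k+) per query
    neg_sims: (batch, num_neg)  similarities sim(q, k-) per query
    tau_plus: prior probability that a random "negative" shares the query's class
    """
    N = neg_sims.shape[1]
    pos_exp = torch.exp(pos_sim / tau)   # stands in for the g+ estimate
    neg_exp = torch.exp(neg_sims / tau)

    # Corrected estimate of the true-negative term: subtract the expected
    # contribution of false negatives, then clamp at its theoretical minimum.
    g = (neg_exp.mean(dim=1) - tau_plus * pos_exp) / (1.0 - tau_plus)
    g = torch.clamp(g, min=math.exp(-1.0 / tau))

    return -torch.log(pos_exp / (pos_exp + N * g)).mean()
```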
3. Supervised Contrastive Loss
When labels are available, explicitly exclude same-class samples from negatives: $$\mathcal{L}_{\text{sup}} = \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{e^{z_i \cdot z_p / \tau}}{\sum_{a \neq i} e^{z_i \cdot z_a / \tau}}$$
where $P(i)$ is the set of all positives for sample $i$.
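A minimal single-view sketch of this loss (the released implementation additionally handles multiple augmented views per sample); embeddings are assumed to be L2-normalized:

```python
import torch

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss over one batch of L2-normalized embeddings.

    z:      (batch, dim) normalized embeddings z_i
    labels: (batch,)     integer class labels
    """
    sim = z @ z.T / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)

    # The denominator sums over all a != i, so remove self-similarity.
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # P(i): same-label samples, excluding i itself.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    num_pos = pos_mask.sum(dim=1).clamp(min=1)  # guard against samples with no positive

    # Average log-probability over each sample's positives, then over the batch.
    per_sample = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob)).sum(dim=1) / num_pos
    return -per_sample.mean()
```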
For most applications, moderately higher temperature (0.1-0.2 instead of 0.07) provides robustness to false negatives with minimal performance loss. True debiasing requires density estimation which adds complexity. If you have labels, supervised contrastive loss is the cleanest solution.
Not all negatives contribute equally to learning. Hard negatives—samples that are similar to the query but semantically different—provide the strongest learning signal.
Recall the gradient of InfoNCE with respect to the logit for negative $i$ (its similarity divided by $\tau$): $$\frac{\partial \mathcal{L}}{\partial \left(f(q, k^-_i)/\tau\right)} = p^-_i$$
where $p^-_i$ is the softmax probability assigned to negative $i$. Hard negatives (high similarity) receive higher softmax probability and thus contribute more to the gradient.
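A quick autograd check with made-up similarities makes this concrete; the printed gradients are approximate.

```python
import torch
import torch.nn.functional as F

temperature = 0.07
pos_sim = torch.tensor([0.9])
neg_sims = torch.tensor([0.8, 0.3, -0.1], requires_grad=True)  # index 0 is a hard negative

logits = torch.cat([pos_sim, neg_sims]) / temperature
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))  # positive at index 0
loss.backward()

# d(loss)/d(neg_sim_i) = p_i / temperature, so the hard negative dominates:
print(neg_sims.grad)  # approximately [2.8, 2e-3, 7e-6]
```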
Hard negative mining can backfire:
1. False Negative Amplification
Hard negatives may actually be false negatives—same-class samples that look similar. Mining them amplifies the false negative problem.
2. Sampling Bias
Overemphasizing hard negatives can bias representations toward distinguishing within difficult pairs while ignoring broader structure.
3. Training Instability
Very hard negatives create large gradients that can destabilize training, especially early on when the encoder is still poor.
```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(
    query,
    positive,
    negatives,
    temperature=0.07,
    hard_negative_weight=0.5,
    num_hard=None,
):
    """
    Contrastive loss with hard negative emphasis.

    Args:
        query: Query embeddings (batch, dim)
        positive: Positive embeddings (batch, dim)
        negatives: Negative embeddings (batch, num_neg, dim)
        temperature: Temperature scaling
        hard_negative_weight: How much to weight hard negatives
        num_hard: Number of hardest negatives to emphasize

    Returns:
        Weighted contrastive loss
    """
    batch_size, num_neg, _ = negatives.shape

    # Normalize embeddings onto the unit hypersphere
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Positive similarity: (batch, 1)
    pos_sim = (query * positive).sum(dim=-1, keepdim=True) / temperature

    # Negative similarities: (batch, num_neg)
    neg_sim = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1) / temperature

    if num_hard is None:
        num_hard = num_neg // 4  # Top 25% as hard negatives

    # Find hardest negatives (highest similarity to the query)
    hard_neg_sim, _ = neg_sim.topk(num_hard, dim=-1)

    # Standard InfoNCE loss over all negatives (positive at index 0)
    labels = torch.zeros(batch_size, dtype=torch.long, device=query.device)
    logits_all = torch.cat([pos_sim, neg_sim], dim=-1)
    loss_all = F.cross_entropy(logits_all, labels)

    # InfoNCE loss over only the hardest negatives
    logits_hard = torch.cat([pos_sim, hard_neg_sim], dim=-1)
    loss_hard = F.cross_entropy(logits_hard, labels)

    # Weighted combination of the two losses
    return (1 - hard_negative_weight) * loss_all + hard_negative_weight * loss_hard
```

In most scenarios, the implicit hard negative weighting from low temperature is sufficient. Explicit hard negative mining is most valuable when you have verified labels (to avoid false negatives) or are working in domains with very clear semantic boundaries.
The choice of positive and negative pairs fundamentally determines what the learned representation captures.
Easy positives (minor augmentations): the two views can be matched from low-level cues such as color statistics or texture, so the task is solved quickly and the learned features tend to stay shallow.
Hard positives (aggressive augmentations): matching the views requires semantic understanding, which encourages transferable features but risks destroying the shared content altogether and makes collapse more likely without enough negatives.
Wang & Isola (2020) characterized representations by two properties:
Alignment: Positive pairs should be close $$\mathcal{L}_{\text{align}} = \mathbb{E}_{(x, x^+)} \left\| f(x) - f(x^+) \right\|^2$$
Uniformity: Representations should cover the hypersphere uniformly $$\mathcal{L}_{\text{uniform}} = \log \mathbb{E}_{x, y}\, e^{-2\left\| f(x) - f(y) \right\|^2}$$
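Both quantities are easy to monitor during training; a small sketch assuming you already have L2-normalized embeddings for the two views of each batch:

```python
import torch

def alignment(z1, z2):
    """Mean squared distance between positive pairs (lower is better)."""
    return (z1 - z2).pow(2).sum(dim=1).mean()

def uniformity(z, t=2.0):
    """Log of the mean pairwise Gaussian potential (lower is better)."""
    sq_dists = torch.pdist(z, p=2).pow(2)
    return torch.exp(-t * sq_dists).mean().log()

# Usage: z1, z2 are L2-normalized embeddings of the two views, shape (batch, dim);
# track alignment(z1, z2) and uniformity(z1) over training to see the tradeoff.
```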
Strong positives improve alignment but may hurt uniformity (everything pulled to same region). Many negatives improve uniformity but may hurt alignment (too much repulsion).
| Configuration | Alignment | Uniformity | Downstream Effect |
|---|---|---|---|
| Easy positives, few negatives | Moderate | Poor | Clustered but collapsed representation |
| Easy positives, many negatives | Good | Good | Well-spread but shallow features |
| Hard positives, few negatives | Poor | Poor | Potential collapse |
| Hard positives, many negatives | Good | Excellent | Semantic, transferable representations |
The interplay of positive difficulty and negative quantity explains several empirical observations:
SimCLR needs large batches because its strong augmentations create hard positives that require many negatives to prevent collapse.
Weaker augmentations need fewer negatives because the contrastive task is easier—the model can distinguish with less repulsive force.
Transfer learning benefits from harder positives because the model must learn semantic features to solve the task.
In-domain performance may prefer easier positives because task-relevant augmentation-sensitive features are preserved.
Positive and negative pairs are more than implementation details—they are the vocabulary through which we communicate to the model what we want it to learn.
You now understand the fundamental unit of contrastive learning—the pair. This understanding is essential for diagnosing problems, designing new methods, and adapting contrastive learning to new domains. Next, we'll examine data augmentation in depth—the practical mechanism for creating effective positive pairs.