InfoNCE (Information Noise Contrastive Estimation) stands as the mathematical cornerstone upon which modern contrastive learning is built. Before SimCLR achieved breakthrough results, before MoCo demonstrated the power of momentum encoders, and before contrastive methods revolutionized self-supervised learning—there was InfoNCE, a principled information-theoretic objective derived from noise contrastive estimation.
Understanding InfoNCE isn't merely academic. Every major contrastive learning method—SimCLR, MoCo, CLIP, BYOL's implicit contrastive interpretation, and countless others—either directly uses InfoNCE or builds upon its theoretical foundations. Mastering this loss function gives you the conceptual vocabulary to understand, compare, and innovate within the entire contrastive learning landscape.
By the end of this page, you will understand: (1) The information-theoretic motivation behind InfoNCE, (2) The complete mathematical derivation and its connection to mutual information, (3) Why the loss involves a softmax over negatives, (4) The role of temperature scaling, (5) Practical implementation considerations, and (6) The theoretical guarantees and limitations of the objective.
To truly understand InfoNCE, we must begin with its information-theoretic roots. The fundamental insight driving contrastive learning is elegant: good representations should preserve information about the input that is useful for downstream tasks.
Mutual information $I(X; Y)$ measures how much knowing one variable tells us about another:
$$I(X; Y) = \mathbb{E}_{p(x,y)} \left[ \log \frac{p(x, y)}{p(x)p(y)} \right]$$
In representation learning, we want the representation $Z = f(X)$ to preserve maximal information about the input $X$, or more specifically, about aspects of $X$ that correlate with relevant structure. If we have access to positive pairs $(X, X^+)$—two views of the same underlying content—we can frame our objective as maximizing:
$$I(f(X); f(X^+))$$
This captures the intuition that representations of related inputs should contain information about each other.
Mutual information provides a principled objective that doesn't require labels. If two augmented views of the same image share high mutual information in representation space, the representation must capture semantically meaningful features that survive augmentation—exactly what we want for transfer learning.
Direct computation of mutual information is generally intractable for high-dimensional continuous distributions. We can't compute the density ratio $\frac{p(x, y)}{p(x)p(y)}$ without knowing the underlying distributions, which we don't have access to.
This is where Noise Contrastive Estimation enters the picture. Instead of estimating densities directly, NCE reframes density estimation as a binary classification problem: can we distinguish samples from the true joint distribution $p(x, y)$ from samples drawn from a noise distribution?
Given samples from the true joint distribution $p(x, y)$ (positive pairs) and samples from a noise distribution, typically the product of marginals $p(x)p(y)$ (negative pairs), we train a critic function $f(x, y)$ to assign high scores to positive pairs and low scores to negative pairs. The key insight: the optimal critic recovers the density ratio, and from this, we can bound mutual information.
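To see why, consider the binary version of the problem with equal numbers of joint and noise samples. The Bayes-optimal discriminator and its logit are

$$D^*(x, y) = \frac{p(x, y)}{p(x, y) + p(x)p(y)}, \qquad \log \frac{D^*(x, y)}{1 - D^*(x, y)} = \log \frac{p(x, y)}{p(x)p(y)},$$

so a perfectly trained critic's score is, up to a constant, exactly the log density ratio that appears inside the mutual information expectation above.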
Let's derive InfoNCE from first principles, understanding each step's significance.
Consider a $(K+1)$-way classification problem. We're given one positive pair $(x, x^+)$ and $K$ negative samples $\{x^-_1, \ldots, x^-_K\}$. The task: identify which sample is the true positive among the $K+1$ candidates.
Let $f(x, y)$ be a scoring function (our representation similarity). For practical implementations, this is typically:
$$f(x, y) = \frac{\text{sim}(g(x), g(y))}{\tau}$$
where $g$ is an encoder network, $\text{sim}$ is cosine similarity, and $\tau$ is a temperature parameter.
The probability of correctly identifying the positive is given by a softmax:
$$p(\text{positive} = x^+ \mid x, \{x^+, x^-_1, \ldots, x^-_K\}) = \frac{e^{f(x, x^+)}}{e^{f(x, x^+)} + \sum_{i=1}^{K} e^{f(x, x^-_i)}}$$
The InfoNCE loss is the negative log-likelihood of this classification:
$$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E} \left[ \log \frac{e^{f(x, x^+)}}{e^{f(x, x^+)} + \sum_{i=1}^{K} e^{f(x, x^-_i)}} \right]$$
```python
import torch
import torch.nn.functional as F


def infonce_loss(query, positive, negatives, temperature=0.07):
    """
    Compute InfoNCE loss.

    Args:
        query: Tensor of shape (batch_size, embedding_dim)
        positive: Tensor of shape (batch_size, embedding_dim)
        negatives: Tensor of shape (batch_size, num_negatives, embedding_dim)
        temperature: Temperature scaling parameter

    Returns:
        InfoNCE loss value
    """
    # Normalize embeddings for cosine similarity
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Compute positive similarity: (batch_size,)
    pos_sim = torch.sum(query * positive, dim=-1) / temperature

    # Compute negative similarities: (batch_size, num_negatives)
    neg_sim = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1) / temperature

    # Concatenate: positive in position 0
    # Shape: (batch_size, 1 + num_negatives)
    logits = torch.cat([pos_sim.unsqueeze(-1), neg_sim], dim=-1)

    # Labels: positive is always at index 0
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)

    # Cross-entropy loss
    loss = F.cross_entropy(logits, labels)
    return loss
```

The crucial theoretical result is that InfoNCE provides a lower bound on mutual information:
$$I(X; Y) \geq \log(K+1) - \mathcal{L}_{\text{InfoNCE}}$$
This bound becomes tighter as:
- the number of negatives $K$ increases, raising the $\log(K+1)$ ceiling, and
- the critic $f$ approaches the optimal density-ratio critic described below.
Proof sketch: The optimal critic satisfies $f^*(x, y) \propto \log \frac{p(y|x)}{p(y)}$. When this optimal critic is used, the InfoNCE loss equals the negative log probability of correct classification under a uniform prior, which can be shown to lower-bound mutual information.
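Filling in the intermediate steps of that sketch, in this page's notation (this follows the original InfoNCE derivation and its approximations; it is a heuristic argument, not a rigorous proof): substituting the optimal critic into the loss, and using the fact that the $K$ negatives are drawn independently from the marginal so their density ratios sum to roughly $K$ in expectation, gives

$$\mathcal{L}^{\text{opt}}_{\text{InfoNCE}} = \mathbb{E}\left[\log\left(1 + \frac{p(x^+)}{p(x^+ \mid x)} \sum_{i=1}^{K} \frac{p(x^-_i \mid x)}{p(x^-_i)}\right)\right] \approx \mathbb{E}\left[\log\left(1 + K \, \frac{p(x^+)}{p(x^+ \mid x)}\right)\right] \geq \log(K+1) - \mathbb{E}\left[\log \frac{p(x^+ \mid x)}{p(x^+)}\right],$$

and the remaining expectation is exactly $I(X; X^+)$, which rearranges to the bound stated above.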
The bound $I(X;Y) \geq \log(K+1) - \mathcal{L}$ reveals why more negatives help: they raise the ceiling on estimable mutual information. With $K=255$ negatives, you can estimate up to ~5.5 nats of MI. With $K=65535$, up to ~11 nats. This explains why methods like MoCo use large memory banks.
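A quick check of those ceilings, in nats:

```python
import math

# Maximum mutual information the InfoNCE bound can certify with K negatives
for K in (255, 4095, 65535):
    print(f"K = {K:5d}  ->  log(K + 1) = {math.log(K + 1):.2f} nats")
```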
Temperature $\tau$ is perhaps the most underappreciated hyperparameter in contrastive learning. Its effects are profound and nuanced.
With temperature, our scoring function becomes:
$$f(x, y) = \frac{\text{sim}(g(x), g(y))}{\tau}$$
For cosine similarity (range $[-1, 1]$), typical temperatures are $\tau \in [0.05, 0.5]$.
Low temperature ($\tau \rightarrow 0$):
- The softmax becomes sharp: the loss is dominated by the hardest negatives (those most similar to the query).
- Gradients concentrate on a few examples, which speeds up separation but can make training unstable.
- Clusters become very tight; with aggressive settings the representation can collapse.

High temperature ($\tau \rightarrow \infty$):
- The softmax flattens: all negatives are weighted nearly uniformly, regardless of difficulty.
- The repulsive signal is spread thin, producing looser structure and slower separation.
- Training is more forgiving of noisy or false negatives but may underfit.
| Temperature | Gradient Behavior | Representation Effect | Practical Consideration |
|---|---|---|---|
| τ = 0.01 | Extremely sharp; dominated by nearest negative | Very tight clusters; may collapse | Often unstable; requires careful negative sampling |
| τ = 0.07 | Focused on hard negatives; the MoCo default | Well-separated clusters; good transfer | Sweet spot for many vision tasks |
| τ = 0.1-0.2 | Balanced consideration of negatives | Moderate separation; smoother manifold | Good for smaller batch sizes |
| τ = 0.5-1.0 | Nearly uniform weighting | Loose structure; may underfit | Can help with noisy negatives |
The gradient of InfoNCE with respect to the positive similarity provides insight:
$$\frac{\partial \mathcal{L}}{\partial f(x, x^+)} = -(1 - p^+)$$
where $p^+$ is the softmax probability assigned to the positive. This means:
- when the model already ranks the positive confidently ($p^+ \rightarrow 1$), the gradient vanishes and the pair contributes little learning signal;
- when the positive is poorly ranked ($p^+ \rightarrow 0$), the gradient magnitude approaches its maximum, so learning concentrates on the pairs the model still gets wrong.
Temperature controls the transition between these regimes. Lower temperature makes the transition sharper, concentrating learning on the hardest examples.
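A small numerical illustration of this sharpening, using arbitrary toy similarity values (not taken from any real model):

```python
import torch
import torch.nn.functional as F

# One query's cosine similarities: index 0 is the positive, the rest are
# negatives ranging from hard (0.75) to easy (-0.20). Values are illustrative.
sims = torch.tensor([0.80, 0.75, 0.40, 0.10, -0.20])

for tau in (0.01, 0.07, 0.2, 0.5):
    probs = F.softmax(sims / tau, dim=0)
    # Lower tau concentrates almost all probability mass on the positive and
    # the hardest negative; higher tau spreads it nearly uniformly.
    print(f"tau={tau:<5} p+={probs[0]:.3f}  hardest neg={probs[1]:.3f}  easiest neg={probs[-1]:.3f}")
```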
In practice, MoCo uses τ=0.07, SimCLR reports its best results around τ=0.1, and CLIP makes the temperature learnable (initialized at 0.07 and clipped so it never drops below 0.01). These values aren't arbitrary; they're carefully tuned, and a factor of 2x in temperature can significantly impact downstream performance. Always tune temperature when adapting methods to new domains.
The quality and quantity of negative samples fundamentally determines contrastive learning success. This section examines why negatives matter and how different methods approach negative sampling.
Without negatives, the trivial solution is to map all inputs to the same representation—achieving perfect positive similarity but learning nothing useful. Negatives provide the repulsive force that prevents representation collapse.
Mathematically, the gradient with respect to a negative sample is:
$$\frac{\partial \mathcal{L}}{\partial f(x, x^-_i)} = p^-_i$$
where $p^-_i$ is the softmax probability assigned to that negative. This means:
- hard negatives (those the model currently confuses with the positive) receive large gradients and dominate the repulsive signal;
- easy negatives with near-zero probability contribute almost nothing, so much of a large batch is effectively inert;
- both gradient identities can be verified numerically, as the short autograd check below shows.
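A minimal sketch of that check, for a single query with toy logits (values are arbitrary):

```python
import torch
import torch.nn.functional as F

# Toy scores f(x, .) for one query: index 0 is the positive, the rest are negatives.
logits = torch.tensor([2.0, 1.5, 0.3, -0.7], requires_grad=True)

# InfoNCE for a single query is cross-entropy with the positive at index 0.
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
loss.backward()

probs = F.softmax(logits.detach(), dim=0)
print(logits.grad)                         # [-(1 - p+), p-_1, p-_2, p-_3]
print(-(1 - probs[0]).item(), probs[1:])   # matches the analytic expressions above
```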
A critical but often overlooked issue: some negatives aren't truly negative. In a batch of 4096 ImageNet images, statistically some will belong to the same class. These "false negatives" are treated as negatives but actually share semantic content with the query.
Impact of false negatives:
- The loss actively pushes apart representations that share semantic content, directly opposing the class structure we hope to recover.
- The signal from true negatives is diluted, and larger negative pools inevitably contain more same-class collisions, so simply scaling up negatives does not make the problem disappear.

Mitigation strategies:
- Debiased contrastive objectives that correct the loss for the expected rate of same-class collisions.
- False-negative elimination or attraction: detect suspiciously similar negatives (for example via a similarity threshold or clustering) and either drop them from the denominator (see the sketch below) or treat them as additional positives.
- Using label information when it exists, as in supervised contrastive learning.
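A deliberately simple sketch of the elimination idea; the threshold value and the large-negative-constant masking trick are illustrative assumptions, not taken from a specific published method:

```python
import torch
import torch.nn.functional as F


def infonce_with_fn_elimination(query, positive, negatives,
                                temperature=0.07, fn_threshold=0.9):
    """InfoNCE where negatives that look 'too similar' to the query are
    excluded from the denominator (a simple false-negative elimination sketch)."""
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = torch.sum(query * positive, dim=-1, keepdim=True)      # (B, 1)
    neg_sim = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1)  # (B, K)

    # Suspected false negatives: cosine similarity above the threshold.
    suspected = neg_sim > fn_threshold
    neg_logits = neg_sim / temperature - suspected.float() * 1e9     # drop from logsumexp

    logits = torch.cat([pos_sim / temperature, neg_logits], dim=-1)
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)
```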
Implementing InfoNCE correctly requires attention to numerical stability, memory efficiency, and distributed training considerations.
The naive implementation can suffer from numerical overflow in the exponentials. Always use the log-sum-exp trick:
```python
import torch
import torch.nn.functional as F
import torch.distributed as dist


def stable_infonce_loss(z_i, z_j, temperature=0.07):
    """
    Numerically stable InfoNCE for in-batch negatives (SimCLR-style).

    Args:
        z_i: First view embeddings (batch_size, dim)
        z_j: Second view embeddings (batch_size, dim)
        temperature: Temperature parameter

    Returns:
        Contrastive loss value
    """
    batch_size = z_i.size(0)

    # Normalize embeddings
    z_i = F.normalize(z_i, dim=-1)
    z_j = F.normalize(z_j, dim=-1)

    # Gather all embeddings if distributed (note: all_gather does not
    # backpropagate into the other ranks' embeddings)
    if dist.is_initialized():
        z_i_all = gather_from_all(z_i)
        z_j_all = gather_from_all(z_j)
    else:
        z_i_all = z_i
        z_j_all = z_j
    total_batch = z_i_all.size(0)

    # Similarities of local embeddings against all gathered embeddings
    # Shapes: (batch_size, total_batch)
    sim_i_j = torch.mm(z_i, z_j_all.t()) / temperature
    sim_j_i = torch.mm(z_j, z_i_all.t()) / temperature
    sim_i_i = torch.mm(z_i, z_i_all.t()) / temperature
    sim_j_j = torch.mm(z_j, z_j_all.t()) / temperature

    # Local samples occupy columns [offset, offset + batch_size) of the gathered batch
    rank = dist.get_rank() if dist.is_initialized() else 0
    offset = rank * batch_size
    mask = torch.eye(batch_size, device=z_i.device)

    # For z_i: the positive is the corresponding z_j;
    # negatives are all other z_i and z_j (excluding self and the positive)
    pos_i = sim_i_j[:, offset:offset + batch_size].diag()

    neg_i = torch.cat([sim_i_i, sim_i_j], dim=1)  # (batch_size, 2 * total_batch)
    mask_full = torch.zeros(batch_size, 2 * total_batch, device=z_i.device)
    mask_full[:, offset:offset + batch_size] = mask                               # self in z_i_all
    mask_full[:, total_batch + offset:total_batch + offset + batch_size] = mask   # positive in z_j_all
    neg_i = neg_i - mask_full * 1e9  # mask out of the logsumexp

    # InfoNCE: -log(exp(pos) / (exp(pos) + sum(exp(neg))))
    #        = -pos + logsumexp([pos, neg_1, neg_2, ...])
    logits_i = torch.cat([pos_i.unsqueeze(-1), neg_i], dim=-1)
    loss_i = -pos_i + torch.logsumexp(logits_i, dim=-1)

    # Symmetric loss for z_j (same masking pattern applies)
    pos_j = sim_j_i[:, offset:offset + batch_size].diag()
    neg_j = torch.cat([sim_j_j, sim_j_i], dim=1)
    neg_j = neg_j - mask_full * 1e9
    logits_j = torch.cat([pos_j.unsqueeze(-1), neg_j], dim=-1)
    loss_j = -pos_j + torch.logsumexp(logits_j, dim=-1)

    loss = (loss_i.mean() + loss_j.mean()) / 2
    return loss


def gather_from_all(tensor):
    """Gather tensors from all processes."""
    tensors_gather = [torch.zeros_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(tensors_gather, tensor)
    return torch.cat(tensors_gather, dim=0)
```

For very large batches, compute the similarity matrix in chunks rather than all at once. With 32K batch size and 256-dim embeddings, the full similarity matrix requires on the order of 32GB just for the logits. Chunk the computation along the batch dimension.
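One way to realize that chunking is to combine it with gradient checkpointing, so each chunk's logits are recomputed during the backward pass instead of being stored. The sketch below assumes a simplified, one-directional setup (local queries `z_i` against keys `z_j`, positives on the diagonal); the chunk size is arbitrary:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def chunked_infonce(z_i, z_j, temperature=0.07, chunk_size=1024):
    """InfoNCE where the (batch x batch) logits are computed in row chunks.

    Checkpointing recomputes each chunk's logits during backward, so the
    full similarity matrix is never materialized at once.
    """
    z_i = F.normalize(z_i, dim=-1)
    z_j = F.normalize(z_j, dim=-1)
    batch = z_i.size(0)

    def chunk_loss(queries, keys, labels):
        logits = queries @ keys.t() / temperature  # (chunk, batch)
        return F.cross_entropy(logits, labels, reduction="sum")

    total = z_i.new_zeros(())
    for start in range(0, batch, chunk_size):
        queries = z_i[start:start + chunk_size]
        labels = torch.arange(start, start + queries.size(0), device=z_i.device)
        total = total + checkpoint(chunk_loss, queries, z_j, labels, use_reentrant=False)
    return total / batch
```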
Recent theoretical work has clarified what InfoNCE actually learns:
Alignment and Uniformity Framework:
Wang & Isola (2020) decomposed contrastive learning into two properties:
Alignment: Positive pairs should have similar representations
$$\mathcal{L}_{\text{align}} = \mathbb{E}_{(x, x^+)} \left[ \| f(x) - f(x^+) \|^2 \right]$$

Uniformity: Representations should be uniformly distributed on the hypersphere
$$\mathcal{L}_{\text{uniform}} = \log \mathbb{E}_{x, x'} \left[ e^{-2 \| f(x) - f(x') \|^2} \right]$$
InfoNCE optimizes both implicitly: the numerator encourages alignment, and the denominator (through repulsion from negatives) encourages uniformity.
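Both quantities are cheap to monitor during training. A minimal sketch of the corresponding metrics, using the quadratic settings from the formulas above:

```python
import torch
import torch.nn.functional as F


def alignment(z, z_pos, alpha=2):
    """Average distance between positive pairs (lower = better aligned)."""
    z = F.normalize(z, dim=-1)
    z_pos = F.normalize(z_pos, dim=-1)
    return (z - z_pos).norm(p=2, dim=1).pow(alpha).mean()


def uniformity(z, t=2):
    """Log of the average Gaussian potential over all pairs (lower = more uniform)."""
    z = F.normalize(z, dim=-1)
    sq_dists = torch.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```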
Even with InfoNCE, representations can suffer from dimensional collapse, where only a subset of embedding dimensions carry meaningful information. This manifests as:
- a singular value spectrum of the embedding matrix that drops sharply after a few leading dimensions;
- strongly correlated, redundant embedding coordinates;
- an effective dimensionality far below the nominal embedding size, which caps what linear probes can extract.
Causes include augmentations that are strong along certain feature directions, so the encoder learns to discard those directions entirely, and the implicit regularization of deep networks, which biases solutions toward low-rank embeddings (see the diagnostic sketch below).
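A practical check is the singular value spectrum of a batch of embeddings; a sketch of that diagnostic, where the 1% threshold for counting "active" dimensions is an arbitrary illustrative choice:

```python
import torch


def embedding_spectrum(z):
    """Singular value spectrum of a batch of embeddings (batch_size, dim).

    A spectrum that falls to near zero after a handful of dimensions is the
    signature of dimensional collapse.
    """
    z = z - z.mean(dim=0, keepdim=True)      # center the batch
    s = torch.linalg.svdvals(z)              # sorted descending, length min(batch, dim)
    active = (s > 0.01 * s[0]).sum().item()  # crude count of "active" dimensions
    return s, active
```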
InfoNCE provides the mathematical bedrock for contrastive representation learning. Its elegance lies in transforming the intractable problem of mutual information maximization into a tractable classification task.
You now understand the mathematical heart of contrastive learning. InfoNCE isn't just a loss function—it's a principled approach to representation learning grounded in information theory. Next, we'll see how SimCLR operationalizes these principles into a complete framework.