Imagine searching for images using natural language: "a sunset over a calm ocean with silhouetted palm trees." Or finding videos that match a hummed melody. Or retrieving documents that relate to a hand-drawn sketch. These scenarios represent cross-modal retrieval—the task of searching across different data modalities using queries from one modality to find relevant items in another.
Cross-modal retrieval is fundamental to modern search and discovery systems. It powers image search engines, video recommendation, content moderation, and accessibility tools. The technical challenge is profound: how do we compare a text query to an image database when text and images have fundamentally different representations?
This page covers the theory and practice of cross-modal retrieval, including joint embedding spaces, retrieval architectures, training objectives, efficient search at scale, and evaluation methodologies. You will understand how to build and deploy retrieval systems that work across modalities.
The foundation of cross-modal retrieval is the joint embedding space—a shared representation where items from different modalities can be directly compared using distance or similarity metrics.
The Key Insight:
Instead of comparing raw representations (impossible—how do you compare pixels to words?), we learn encoder functions that map each modality to a common space:
$$f_{\text{image}}: \mathcal{I} \rightarrow \mathbb{R}^d$$ $$f_{\text{text}}: \mathcal{T} \rightarrow \mathbb{R}^d$$
In this shared space, semantic similarity corresponds to geometric proximity:
$$\text{similarity}(\text{image}, \text{text}) = \cos(f_{\text{image}}(\text{image}), f_{\text{text}}(\text{text}))$$
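As a toy illustration (with made-up 4-dimensional vectors standing in for real encoder outputs), comparison in the shared space reduces to a cosine similarity between whatever the two encoders produce:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these 4-d vectors came from f_image and f_text
img_emb = np.array([0.9, 0.1, 0.0, 0.4])    # f_image(sunset photo)
txt_match = np.array([0.8, 0.2, 0.1, 0.5])  # f_text("a sunset over the ocean")
txt_other = np.array([0.0, 0.9, 0.8, 0.0])  # f_text("a cat on a sofa")

# The matching caption scores higher than the unrelated one
assert cosine_similarity(img_emb, txt_match) > cosine_similarity(img_emb, txt_other)
```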
Properties of Good Embedding Spaces:
- Alignment: matched cross-modal pairs map to nearby points
- Discrimination: unmatched pairs are well separated
- Uniformity: embeddings spread over the space rather than collapsing into a few clusters
Mathematical Perspective:
The embedding space implicitly defines a metric on cross-modal similarity:
$$d(x_i, x_j) = 1 - \frac{f(x_i) \cdot f(x_j)}{\|f(x_i)\| \, \|f(x_j)\|}$$
Training objectives like contrastive learning optimize this metric to satisfy semantic constraints from paired data.
Embedding dimension (typically 256-1024) balances expressiveness vs. efficiency. Higher dimensions capture more nuance but require more storage and computation. CLIP uses 512 dimensions; some models use variable dimensions with Matryoshka representations that allow truncation.
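A minimal sketch of Matryoshka-style truncation, assuming the encoder was trained so that embedding prefixes are themselves valid embeddings: keep the first dimensions and re-normalize.

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length.

    Only valid if the encoder was trained with a Matryoshka-style
    objective, so that prefixes are usable embeddings on their own.
    """
    prefix = emb[:dim]
    return prefix / np.linalg.norm(prefix)

rng = np.random.default_rng(0)
full = rng.standard_normal(512)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)  # half the storage per vector
assert small.shape == (256,)
assert np.isclose(np.linalg.norm(small), 1.0)
```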
Several training objectives have been developed to learn effective joint embedding spaces. Each has different properties regarding computational efficiency, gradient dynamics, and final retrieval quality.
Contrastive Learning (InfoNCE):
The dominant approach uses contrastive learning to pull together positive pairs (matched items) and push apart negative pairs (unmatched items):
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(s(q, k^+)/\tau)}{\exp(s(q, k^+)/\tau) + \sum_{k^-} \exp(s(q, k^-)/\tau)}$$
Where $s(\cdot, \cdot)$ is similarity (dot product or cosine) and $\tau$ is temperature.
Triplet Loss:
An older but still-used approach that considers anchor-positive-negative triplets:
$$\mathcal{L}_{\text{triplet}} = \max(0, d(a, p) - d(a, n) + \alpha)$$
Pushes positive closer to anchor than negative by margin $\alpha$.
Multiple Negatives Ranking Loss:
For batch training with in-batch negatives:
$$\mathcal{L}_{\text{MNR}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s_i^+)}{\sum_{j=1}^{N}\exp(s_{ij})}$$
Sigmoid Loss (SigLIP):
Recent work replaces softmax with per-pair sigmoid, avoiding the need for large batches:
$$\mathcal{L}_{\text{sigmoid}} = -\sum_{i,j}\left[y_{ij}\log\sigma(s_{ij}) + (1-y_{ij})\log(1-\sigma(s_{ij}))\right]$$
| Objective | Negatives Required | Batch Dependency | Key Advantage |
|---|---|---|---|
| InfoNCE | Many (in-batch) | High | Strong theoretical foundation |
| Triplet | One per triplet | Low | Simple, interpretable margin |
| MNR | In-batch | Medium | Efficient multi-pair optimization |
| Sigmoid (SigLIP) | In-batch | Low | Works well with small batches |
```python
import torch
import torch.nn.functional as F


def infonce_loss(
    query_emb: torch.Tensor,
    key_emb: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """
    InfoNCE loss for cross-modal retrieval training.

    Args:
        query_emb: (B, D) normalized query embeddings
        key_emb: (B, D) normalized key embeddings
        temperature: Softmax temperature

    Returns:
        Scalar loss
    """
    # Similarity matrix: (B, B)
    logits = query_emb @ key_emb.T / temperature
    # Positive pairs on diagonal
    labels = torch.arange(len(query_emb), device=query_emb.device)
    # Cross-entropy treats each row as a classification problem
    loss = F.cross_entropy(logits, labels)
    return loss


def sigmoid_loss(
    query_emb: torch.Tensor,
    key_emb: torch.Tensor,
    temperature: float = 0.1,
    bias: float = -10.0,
) -> torch.Tensor:
    """
    SigLIP-style sigmoid loss - does not require large batches.

    Each pair is treated as an independent binary classification.
    Positive pairs should have high similarity, negatives low.
    """
    batch_size = query_emb.shape[0]
    # Similarity matrix
    logits = query_emb @ key_emb.T / temperature + bias
    # Labels: diagonal is positive (1), off-diagonal is negative (0)
    labels = torch.eye(batch_size, device=query_emb.device)
    # Binary cross-entropy per pair
    loss = F.binary_cross_entropy_with_logits(
        logits, labels, reduction='mean'
    )
    return loss


def triplet_loss(
    anchor: torch.Tensor,
    positive: torch.Tensor,
    negative: torch.Tensor,
    margin: float = 0.2,
) -> torch.Tensor:
    """
    Triplet margin loss for retrieval.
    """
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    loss = F.relu(d_pos - d_neg + margin)
    return loss.mean()
```

Practical retrieval systems must balance accuracy with efficiency. Different architectures make different trade-offs.
Dual Encoder Architecture:
The standard approach for large-scale retrieval:
```
Database items → Encoder → Embeddings → Index (ANN)
                                              ↓
Query → Encoder → Query Embedding → kNN Search → Top-K Results
```
Advantages:
- Item embeddings are precomputed offline and stored in an ANN index
- Query-time cost is a single encoder forward pass plus an ANN lookup
- Scales to billions of items

Limitations:
- No interaction between query and candidate during encoding
- A single vector per item loses fine-grained detail
- Accuracy trails joint query-candidate scoring
Cross-Encoder Architecture:
For maximum accuracy, cross-encoders process query-candidate pairs jointly:
[Query, Candidate] → Cross-Encoder → Relevance Score
Advantages:
- Full attention between query and candidate tokens captures fine-grained interactions
- Highest relevance accuracy of the architectures discussed here

Limitations:
- Scores cannot be precomputed, so nothing can be indexed ahead of time
- One forward pass per query-candidate pair makes scoring O(N) per query, infeasible as a first stage at scale
Two-Stage Retrieval:
The practical pattern combines both: a dual encoder with an ANN index retrieves the top-K candidates cheaply, then a cross-encoder re-ranks only those K candidates for the final ordering.
| Architecture | Query Latency | Index Size | Accuracy | Use Case |
|---|---|---|---|---|
| Dual Encoder | O(1) + ANN | N × d | Good | First-stage retrieval |
| Cross-Encoder | O(N) | None (pairs) | Excellent | Re-ranking top-K |
| ColBERT | O(1) + ANN | N × L × d | Very Good | Token-level retrieval |
| Two-Stage | O(K) rerank | N × d | Excellent | Production systems |
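The two-stage pattern can be sketched as follows; the database here is random and `rerank_fn` is a stand-in for a real cross-encoder:

```python
import numpy as np

def two_stage_search(query_emb, db_embs, rerank_fn, k_retrieve=100, k_final=10):
    """Stage 1: cheap dot-product retrieval over all items.
    Stage 2: expensive re-ranking of only the retrieved candidates."""
    sims = db_embs @ query_emb                   # dual-encoder scores, O(N)
    candidates = np.argsort(-sims)[:k_retrieve]  # top-K candidate ids
    scores = np.array([rerank_fn(i) for i in candidates])  # re-rank, O(K)
    order = np.argsort(-scores)[:k_final]
    return candidates[order]

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64))
query = db[42] + 0.01 * rng.standard_normal(64)  # query very close to item 42

# Stand-in re-ranker: reuses the dual-encoder score; a real system
# would run a cross-encoder on the (query, item) pair here
results = two_stage_search(query, db, rerank_fn=lambda i: float(db[i] @ query))
assert len(results) == 10
assert 42 in results
```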
ColBERT (Contextualized Late Interaction over BERT) represents a middle ground: it stores token-level embeddings and computes MaxSim scores between query and document tokens. This preserves fine-grained matching while allowing pre-computation of document tokens.
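A compact sketch of the MaxSim computation described above, with one-hot vectors standing in for normalized token embeddings:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction: each query token takes its best match among
    the document tokens; the per-token maxima are summed.

    query_tokens: (Lq, d) L2-normalized token embeddings
    doc_tokens:   (Ld, d) L2-normalized token embeddings
    """
    sim = query_tokens @ doc_tokens.T    # (Lq, Ld) token-level similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed

# Toy check: a document that contains every query token scores a perfect Lq
q = np.eye(4)[:2]       # 2 query tokens
d_good = np.eye(4)[:3]  # contains both query tokens
d_bad = np.eye(4)[2:]   # contains neither
assert maxsim_score(q, d_good) == 2.0
assert maxsim_score(q, d_bad) == 0.0
```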
Real-world retrieval systems operate over billions of items. Exact nearest neighbor search is O(N) per query—infeasible at scale. Approximate Nearest Neighbor (ANN) algorithms trade small accuracy losses for dramatic speedups.
Key ANN Approaches:
1. Product Quantization (PQ): Compress vectors into compact codes by quantizing subvectors: each vector is split into m subvectors, and each subvector is replaced by the index of its nearest centroid in a learned codebook, cutting memory by an order of magnitude or more.
2. Hierarchical Navigable Small Worlds (HNSW): Graph-based index enabling logarithmic search: a multi-layer proximity graph is traversed greedily from sparse upper layers down to a dense base layer, giving very high recall at low latency.
3. Inverted File Index (IVF): Partition space into clusters: at query time only the vectors in the few closest clusters (controlled by a probe parameter) are compared, skipping most of the database.
4. Locality-Sensitive Hashing (LSH): Hash similar items to same buckets: random projections send nearby vectors to the same hash codes with high probability, so candidates are gathered by bucket lookup instead of a full scan.
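To make the PQ step concrete, here is a minimal encode/decode sketch; the codebooks are random for illustration, whereas a real index learns them with k-means:

```python
import numpy as np

def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Split x into m subvectors; store, per subvector, the index of its
    nearest centroid. codebooks: (m, k, d//m) centroids per subspace."""
    m, k, ds = codebooks.shape
    codes = np.empty(m, dtype=np.int64)
    for j in range(m):
        sub = x[j * ds:(j + 1) * ds]
        codes[j] = int(np.argmin(((codebooks[j] - sub) ** 2).sum(axis=1)))
    return codes  # d floats compressed to m small integers

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Approximate reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[j, c] for j, c in enumerate(codes)])

rng = np.random.default_rng(0)
codebooks = rng.standard_normal((4, 256, 8))      # m=4 subquantizers, k=256, d=32
x = pq_decode(np.array([3, 1, 4, 1]), codebooks)  # a vector lying exactly on centroids
assert np.array_equal(pq_encode(x, codebooks), np.array([3, 1, 4, 1]))
```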
```python
import faiss
import numpy as np


class VectorSearchIndex:
    """
    Efficient vector search using FAISS.

    Supports multiple index types for different scale/accuracy trade-offs.
    """

    def __init__(
        self,
        dimension: int,
        index_type: str = "IVF",
        n_clusters: int = 1024,
        use_gpu: bool = False,
    ):
        self.dimension = dimension
        self.use_gpu = use_gpu

        if index_type == "Flat":
            # Exact search (brute force)
            self.index = faiss.IndexFlatIP(dimension)
        elif index_type == "IVF":
            # Inverted file with product quantization
            quantizer = faiss.IndexFlatIP(dimension)
            self.index = faiss.IndexIVFPQ(
                quantizer,
                dimension,
                n_clusters,      # Number of clusters
                dimension // 4,  # Number of PQ subquantizers
                8,               # Bits per subquantizer code
            )
            self.trained = False
        elif index_type == "HNSW":
            # Hierarchical Navigable Small Worlds
            self.index = faiss.IndexHNSWFlat(dimension, 32)  # 32 neighbors per node
        else:
            raise ValueError(f"Unknown index_type: {index_type}")

        if use_gpu:
            res = faiss.StandardGpuResources()
            self.index = faiss.index_cpu_to_gpu(res, 0, self.index)

    def add(self, embeddings: np.ndarray):
        """Add embeddings to the index, training it first if required."""
        if hasattr(self, 'trained') and not self.trained:
            self.index.train(embeddings)
            self.trained = True
        self.index.add(embeddings)

    def search(self, query: np.ndarray, k: int = 10):
        """
        Search for the k nearest neighbors.

        Returns:
            distances: (n_queries, k) similarity scores
            indices: (n_queries, k) database indices
        """
        # Set search parameters for IVF
        if hasattr(self.index, 'nprobe'):
            self.index.nprobe = 32  # Search 32 clusters
        distances, indices = self.index.search(query, k)
        return distances, indices


# Example usage
index = VectorSearchIndex(dimension=512, index_type="IVF")
index.add(database_embeddings)  # Shape: (1M, 512)
distances, indices = index.search(query_embeddings, k=10)
```

| Algorithm | Index Build | Query Time | Memory | Recall@10 |
|---|---|---|---|---|
| Flat (Exact) | O(1) | O(N) | N × d × 4B | 100% |
| IVF-PQ | O(N) | O(N/C + k) | ~10× smaller | 95-99% |
| HNSW | O(N log N) | O(log N) | N × d × 4B + graph | 99%+ |
| ScaNN | O(N) | O(N/C) | Quantized | 95-99% |
Evaluating retrieval systems requires metrics that capture different aspects of retrieval quality. The choice of metric depends on the application scenario.
Recall@K:
The fraction of relevant items found in the top K results:
$$\text{Recall@K} = \frac{|\text{Relevant} \cap \text{Retrieved@K}|}{|\text{Relevant}|}$$
Often used as R@1, R@5, R@10. R@1 measures if the correct item is ranked first.
Mean Reciprocal Rank (MRR):
Average of reciprocal ranks of the first relevant result:
$$\text{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\text{rank}(q)}$$
MRR is 1.0 if the correct result is always first, lower otherwise.
Mean Average Precision (mAP):
For cases with multiple relevant items per query:
$$\text{AP} = \frac{1}{|\text{Rel}|}\sum_{k=1}^{K}P(k) \cdot \text{rel}(k)$$
Where P(k) is precision at position k and rel(k) indicates if item k is relevant.
Normalized Discounted Cumulative Gain (nDCG):
For graded relevance (not just binary):
$$\text{DCG@K} = \sum_{i=1}^{K}\frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}$$
nDCG normalizes DCG by the ideal ranking's DCG.
```python
import numpy as np
from typing import List


def recall_at_k(
    predictions: np.ndarray,   # (N, K) - top K predicted indices
    targets: List[List[int]],  # List of relevant indices per query
    k: int = 10,
) -> float:
    """
    Compute Recall@K: fraction of queries where a target is in the top-K predictions.
    """
    hits = 0
    for i, pred in enumerate(predictions):
        top_k = set(pred[:k])
        relevant = set(targets[i])
        if top_k & relevant:  # Any overlap counts as a hit
            hits += 1
    return hits / len(predictions)


def mean_reciprocal_rank(
    predictions: np.ndarray,  # (N, K) ranked predictions
    targets: np.ndarray,      # (N,) single target per query
) -> float:
    """
    Compute Mean Reciprocal Rank.
    """
    reciprocal_ranks = []
    for i, pred in enumerate(predictions):
        target = targets[i]
        if target in pred:
            rank = np.where(pred == target)[0][0] + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return np.mean(reciprocal_ranks)


def ndcg_at_k(
    predictions: np.ndarray,  # (N, K) ranked predictions
    relevances: np.ndarray,   # (N, M) relevance scores in ranked order
    k: int = 10,
) -> float:
    """
    Compute normalized Discounted Cumulative Gain (simplified for binary relevance).
    """
    dcg = np.sum(
        relevances[:, :k] / np.log2(np.arange(2, k + 2)),
        axis=1
    )
    # Ideal DCG (perfect ranking)
    sorted_rel = np.sort(relevances, axis=1)[:, ::-1]
    idcg = np.sum(
        sorted_rel[:, :k] / np.log2(np.arange(2, k + 2)),
        axis=1
    )
    ndcg = dcg / (idcg + 1e-8)
    return np.mean(ndcg)
```

Use R@1 for single-answer tasks (find THE matching item). Use R@K for recall-focused tasks (find ANY relevant item in top K, like first-stage retrieval). Use MRR when rank position matters but there's one right answer. Use mAP/nDCG when multiple items are relevant with varying importance.
Standardized benchmarks enable fair comparison of cross-modal retrieval systems. Different benchmarks test different capabilities.
| Benchmark | Modalities | Size | Task |
|---|---|---|---|
| MS-COCO | Image-Text | 123K images, 5 captions each | Image-caption retrieval |
| Flickr30K | Image-Text | 31K images, 5 captions each | Image-caption retrieval |
| MSCOCO-5K | Image-Text | 5K test images | Standard test split |
| MSR-VTT | Video-Text | 10K videos, 200K captions | Video-text retrieval |
| ActivityNet | Video-Text | 20K videos | Dense video retrieval |
| AudioSet | Audio-Video | 2M clips | Audio-visual retrieval |
| LAION-5B | Image-Text | 5.85B pairs | Large-scale pretraining |
In the final page of this module, we'll explore unified multimodal architectures—models that process any combination of modalities with a single architecture, representing the frontier of multimodal AI.