Imagine searching for images using natural language: "a sunset over a calm ocean with silhouetted palm trees." Or finding videos that match a hummed melody. Or retrieving documents that relate to a hand-drawn sketch. These scenarios represent cross-modal retrieval—the task of searching across different data modalities using queries from one modality to find relevant items in another.
Cross-modal retrieval is fundamental to modern search and discovery systems. It powers image search engines, video recommendation, content moderation, and accessibility tools. The technical challenge is profound: how do we compare a text query to an image database when text and images have fundamentally different representations?
This page covers the theory and practice of cross-modal retrieval, including joint embedding spaces, retrieval architectures, training objectives, efficient search at scale, and evaluation methodologies. You will understand how to build and deploy retrieval systems that work across modalities.
The foundation of cross-modal retrieval is the joint embedding space—a shared representation where items from different modalities can be directly compared using distance or similarity metrics.
The Key Insight:
Instead of comparing raw representations (impossible—how do you compare pixels to words?), we learn encoder functions that map each modality to a common space:
$$f_{\text{image}}: \mathcal{I} \rightarrow \mathbb{R}^d$$ $$f_{\text{text}}: \mathcal{T} \rightarrow \mathbb{R}^d$$
In this shared space, semantic similarity corresponds to geometric proximity:
$$\text{similarity}(\text{image}, \text{text}) = \cos(f_{\text{image}}(\text{image}), f_{\text{text}}(\text{text}))$$
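As a toy illustration (with made-up 4-dimensional vectors standing in for real encoder outputs), comparison in the shared space reduces to a cosine similarity between whatever the two encoders produce:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these 4-d vectors came from f_image and f_text
img_emb = np.array([0.9, 0.1, 0.0, 0.4])    # f_image(sunset photo)
txt_match = np.array([0.8, 0.2, 0.1, 0.5])  # f_text("a sunset over the ocean")
txt_other = np.array([0.0, 0.9, 0.8, 0.0])  # f_text("a cat on a sofa")

# The matching caption scores higher than the unrelated one
assert cosine_similarity(img_emb, txt_match) > cosine_similarity(img_emb, txt_other)
```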
Properties of Good Embedding Spaces:
- Alignment: matched cross-modal pairs map to nearby points
- Discrimination: unmatched pairs are well separated
- Uniformity: embeddings spread over the space rather than collapsing into a few clusters
Mathematical Perspective:
The embedding space implicitly defines a metric on cross-modal similarity:
$$d(x_i, x_j) = 1 - \frac{f(x_i) \cdot f(x_j)}{\|f(x_i)\| \, \|f(x_j)\|}$$
Training objectives like contrastive learning optimize this metric to satisfy semantic constraints from paired data.
Embedding dimension (typically 256-1024) balances expressiveness vs. efficiency. Higher dimensions capture more nuance but require more storage and computation. CLIP uses 512 dimensions; some models use variable dimensions with Matryoshka representations that allow truncation.
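A minimal sketch of Matryoshka-style truncation, assuming the encoder was trained so that embedding prefixes are themselves valid embeddings: keep the first dimensions and re-normalize.

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length.

    Only valid if the encoder was trained with a Matryoshka-style
    objective, so that prefixes are usable embeddings on their own.
    """
    prefix = emb[:dim]
    return prefix / np.linalg.norm(prefix)

rng = np.random.default_rng(0)
full = rng.standard_normal(512)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)  # half the storage per vector
assert small.shape == (256,)
assert np.isclose(np.linalg.norm(small), 1.0)
```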
Several training objectives have been developed to learn effective joint embedding spaces. Each has different properties regarding computational efficiency, gradient dynamics, and final retrieval quality.
Contrastive Learning (InfoNCE):
The dominant approach uses contrastive learning to pull together positive pairs (matched items) and push apart negative pairs (unmatched items):
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(s(q, k^+)/\tau)}{\exp(s(q, k^+)/\tau) + \sum_{k^-} \exp(s(q, k^-)/\tau)}$$
Where $s(\cdot, \cdot)$ is similarity (dot product or cosine) and $\tau$ is temperature.
Triplet Loss:
An older but still-used approach that considers anchor-positive-negative triplets:
$$\mathcal{L}_{\text{triplet}} = \max(0, d(a, p) - d(a, n) + \alpha)$$
Pushes positive closer to anchor than negative by margin $\alpha$.
Multiple Negatives Ranking Loss:
For batch training with in-batch negatives:
$$\mathcal{L}_{\text{MNR}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s_i^+)}{\sum_{j=1}^{N}\exp(s_{ij})}$$
Sigmoid Loss (SigLIP):
Recent work replaces softmax with per-pair sigmoid, avoiding the need for large batches:
$$\mathcal{L}_{\text{sigmoid}} = -\sum_{i,j}\left[y_{ij}\log\sigma(s_{ij}) + (1-y_{ij})\log(1-\sigma(s_{ij}))\right]$$
| Objective | Negatives Required | Batch Dependency | Key Advantage |
|---|---|---|---|
| InfoNCE | Many (in-batch) | High | Strong theoretical foundation |
| Triplet | One per triplet | Low | Simple, interpretable margin |
| MNR | In-batch | Medium | Efficient multi-pair optimization |
| Sigmoid (SigLIP) | In-batch | Low | Works well with small batches |
```python
import torch
import torch.nn.functional as F


def infonce_loss(
    query_emb: torch.Tensor,
    key_emb: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """
    InfoNCE loss for cross-modal retrieval training.

    Args:
        query_emb: (B, D) normalized query embeddings
        key_emb: (B, D) normalized key embeddings
        temperature: Softmax temperature

    Returns:
        Scalar loss
    """
    # Similarity matrix: (B, B)
    logits = query_emb @ key_emb.T / temperature
    # Positive pairs on diagonal
    labels = torch.arange(len(query_emb), device=query_emb.device)
    # Cross-entropy treats each row as a classification problem
    loss = F.cross_entropy(logits, labels)
    return loss


def sigmoid_loss(
    query_emb: torch.Tensor,
    key_emb: torch.Tensor,
    temperature: float = 0.1,
    bias: float = -10.0,
) -> torch.Tensor:
    """
    SigLIP-style sigmoid loss - does not require large batches.

    Each pair is treated as an independent binary classification.
    Positive pairs should have high similarity, negatives low.
    """
    batch_size = query_emb.shape[0]
    # Similarity matrix
    logits = query_emb @ key_emb.T / temperature + bias
    # Labels: diagonal is positive (1), off-diagonal is negative (0)
    labels = torch.eye(batch_size, device=query_emb.device)
    # Binary cross-entropy per pair
    loss = F.binary_cross_entropy_with_logits(
        logits, labels, reduction='mean'
    )
    return loss


def triplet_loss(
    anchor: torch.Tensor,
    positive: torch.Tensor,
    negative: torch.Tensor,
    margin: float = 0.2,
) -> torch.Tensor:
    """
    Triplet margin loss for retrieval.
    """
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    loss = F.relu(d_pos - d_neg + margin)
    return loss.mean()
```

Practical retrieval systems must balance accuracy with efficiency. Different architectures make different trade-offs.
Dual Encoder Architecture:
The standard approach for large-scale retrieval:
```
Database items → Encoder → Embeddings → Index (ANN)
                                              ↓
Query → Encoder → Query Embedding → kNN Search → Top-K Results
```
Advantages:
- Item embeddings are precomputed offline and stored in an ANN index
- Query-time cost is a single encoder forward pass plus an ANN lookup
- Scales to billions of items

Limitations:
- No interaction between query and candidate during encoding
- A single vector per item loses fine-grained detail
- Accuracy trails joint query-candidate scoring
Cross-Encoder Architecture:
For maximum accuracy, cross-encoders process query-candidate pairs jointly:
[Query, Candidate] → Cross-Encoder → Relevance Score
Advantages:
- Full attention between query and candidate tokens captures fine-grained interactions
- Highest relevance accuracy of the architectures discussed here

Limitations:
- Scores cannot be precomputed, so nothing can be indexed ahead of time
- One forward pass per query-candidate pair makes scoring O(N) per query, infeasible as a first stage at scale
Two-Stage Retrieval:
The practical pattern combines both: a dual encoder with an ANN index retrieves the top-K candidates cheaply, then a cross-encoder re-ranks only those K candidates for the final ordering.
| Architecture | Query Latency | Index Size | Accuracy | Use Case |
|---|---|---|---|---|
| Dual Encoder | O(1) + ANN | N × d | Good | First-stage retrieval |
| Cross-Encoder | O(N) | None (pairs) | Excellent | Re-ranking top-K |
| ColBERT | O(1) + ANN | N × L × d | Very Good | Token-level retrieval |
| Two-Stage | O(K) rerank | N × d | Excellent | Production systems |
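The two-stage pattern can be sketched as follows; the database here is random and `rerank_fn` is a stand-in for a real cross-encoder:

```python
import numpy as np

def two_stage_search(query_emb, db_embs, rerank_fn, k_retrieve=100, k_final=10):
    """Stage 1: cheap dot-product retrieval over all items.
    Stage 2: expensive re-ranking of only the retrieved candidates."""
    sims = db_embs @ query_emb                   # dual-encoder scores, O(N)
    candidates = np.argsort(-sims)[:k_retrieve]  # top-K candidate ids
    scores = np.array([rerank_fn(i) for i in candidates])  # re-rank, O(K)
    order = np.argsort(-scores)[:k_final]
    return candidates[order]

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64))
query = db[42] + 0.01 * rng.standard_normal(64)  # query very close to item 42

# Stand-in re-ranker: reuses the dual-encoder score; a real system
# would run a cross-encoder on the (query, item) pair here
results = two_stage_search(query, db, rerank_fn=lambda i: float(db[i] @ query))
assert len(results) == 10
assert 42 in results
```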
ColBERT (Contextualized Late Interaction over BERT) represents a middle ground: it stores token-level embeddings and computes MaxSim scores between query and document tokens. This preserves fine-grained matching while allowing pre-computation of document tokens.
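A compact sketch of the MaxSim computation described above, with one-hot vectors standing in for normalized token embeddings:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late interaction: each query token takes its best match among
    the document tokens; the per-token maxima are summed.

    query_tokens: (Lq, d) L2-normalized token embeddings
    doc_tokens:   (Ld, d) L2-normalized token embeddings
    """
    sim = query_tokens @ doc_tokens.T    # (Lq, Ld) token-level similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed

# Toy check: a document that contains every query token scores a perfect Lq
q = np.eye(4)[:2]       # 2 query tokens
d_good = np.eye(4)[:3]  # contains both query tokens
d_bad = np.eye(4)[2:]   # contains neither
assert maxsim_score(q, d_good) == 2.0
assert maxsim_score(q, d_bad) == 0.0
```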
Real-world retrieval systems operate over billions of items. Exact nearest neighbor search is O(N) per query—infeasible at scale. Approximate Nearest Neighbor (ANN) algorithms trade small accuracy losses for dramatic speedups.
Key ANN Approaches:
1. Product Quantization (PQ): Compress vectors into compact codes by quantizing subvectors: each vector is split into m subvectors, and each subvector is replaced by the index of its nearest centroid in a learned codebook, cutting memory by an order of magnitude or more.
2. Hierarchical Navigable Small Worlds (HNSW): Graph-based index enabling logarithmic search: a multi-layer proximity graph is traversed greedily from sparse upper layers down to a dense base layer, giving very high recall at low latency.
3. Inverted File Index (IVF): Partition space into clusters: at query time only the vectors in the few closest clusters (controlled by a probe parameter) are compared, skipping most of the database.
4. Locality-Sensitive Hashing (LSH): Hash similar items to same buckets: random projections send nearby vectors to the same hash codes with high probability, so candidates are gathered by bucket lookup instead of a full scan.
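To make the PQ step concrete, here is a minimal encode/decode sketch; the codebooks are random for illustration, whereas a real index learns them with k-means:

```python
import numpy as np

def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Split x into m subvectors; store, per subvector, the index of its
    nearest centroid. codebooks: (m, k, d//m) centroids per subspace."""
    m, k, ds = codebooks.shape
    codes = np.empty(m, dtype=np.int64)
    for j in range(m):
        sub = x[j * ds:(j + 1) * ds]
        codes[j] = int(np.argmin(((codebooks[j] - sub) ** 2).sum(axis=1)))
    return codes  # d floats compressed to m small integers

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Approximate reconstruction: concatenate the chosen centroids."""
    return np.concatenate([codebooks[j, c] for j, c in enumerate(codes)])

rng = np.random.default_rng(0)
codebooks = rng.standard_normal((4, 256, 8))      # m=4 subquantizers, k=256, d=32
x = pq_decode(np.array([3, 1, 4, 1]), codebooks)  # a vector lying exactly on centroids
assert np.array_equal(pq_encode(x, codebooks), np.array([3, 1, 4, 1]))
```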
```python
import faiss
import numpy as np


class VectorSearchIndex:
    """
    Efficient vector search using FAISS.

    Supports multiple index types for different scale/accuracy trade-offs.
    """

    def __init__(
        self,
        dimension: int,
        index_type: str = "IVF",
        n_clusters: int = 1024,
        use_gpu: bool = False,
    ):
        self.dimension = dimension
        self.use_gpu = use_gpu

        if index_type == "Flat":
            # Exact search (brute force)
            self.index = faiss.IndexFlatIP(dimension)
        elif index_type == "IVF":
            # Inverted file with product quantization
            quantizer = faiss.IndexFlatIP(dimension)
            self.index = faiss.IndexIVFPQ(
                quantizer,
                dimension,
                n_clusters,      # Number of clusters
                dimension // 4,  # Number of PQ subquantizers
                8,               # Bits per subquantizer code
            )
            self.trained = False
        elif index_type == "HNSW":
            # Hierarchical Navigable Small Worlds
            self.index = faiss.IndexHNSWFlat(dimension, 32)  # 32 neighbors per node
        else:
            raise ValueError(f"Unknown index_type: {index_type}")

        if use_gpu:
            res = faiss.StandardGpuResources()
            self.index = faiss.index_cpu_to_gpu(res, 0, self.index)

    def add(self, embeddings: np.ndarray):
        """Add embeddings to the index, training it first if required."""
        if hasattr(self, 'trained') and not self.trained:
            self.index.train(embeddings)
            self.trained = True
        self.index.add(embeddings)

    def search(self, query: np.ndarray, k: int = 10):
        """
        Search for the k nearest neighbors.

        Returns:
            distances: (n_queries, k) similarity scores
            indices: (n_queries, k) database indices
        """
        # Set search parameters for IVF
        if hasattr(self.index, 'nprobe'):
            self.index.nprobe = 32  # Search 32 clusters
        distances, indices = self.index.search(query, k)
        return distances, indices


# Example usage
index = VectorSearchIndex(dimension=512, index_type="IVF")
index.add(database_embeddings)  # Shape: (1M, 512)
distances, indices = index.search(query_embeddings, k=10)
```

| Algorithm | Index Build | Query Time | Memory | Recall@10 |
|---|---|---|---|---|
| Flat (Exact) | O(1) | O(N) | N × d × 4B | 100% |
| IVF-PQ | O(N) | O(N/C + k) | ~10× smaller | 95-99% |
| HNSW | O(N log N) | O(log N) | N × d × 4B + graph | 99%+ |
| ScaNN | O(N) | O(N/C) | Quantized | 95-99% |
Evaluating retrieval systems requires metrics that capture different aspects of retrieval quality. The choice of metric depends on the application scenario.
Recall@K:
The fraction of relevant items found in the top K results:
$$\text{Recall@K} = \frac{|\text{Relevant} \cap \text{Retrieved@K}|}{|\text{Relevant}|}$$
Often used as R@1, R@5, R@10. R@1 measures if the correct item is ranked first.
Mean Reciprocal Rank (MRR):
Average of reciprocal ranks of the first relevant result:
$$\text{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{\text{rank}(q)}$$
MRR is 1.0 if the correct result is always first, lower otherwise.
Mean Average Precision (mAP):
For cases with multiple relevant items per query:
$$\text{AP} = \frac{1}{|\text{Rel}|}\sum_{k=1}^{K}P(k) \cdot \text{rel}(k)$$
Where P(k) is precision at position k and rel(k) indicates if item k is relevant.
Normalized Discounted Cumulative Gain (nDCG):
For graded relevance (not just binary):
$$\text{DCG@K} = \sum_{i=1}^{K}\frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}$$
nDCG normalizes DCG by the ideal ranking's DCG.
```python
import numpy as np
from typing import List


def recall_at_k(
    predictions: np.ndarray,   # (N, K) - top K predicted indices
    targets: List[List[int]],  # List of relevant indices per query
    k: int = 10,
) -> float:
    """
    Compute Recall@K: fraction of queries where a target is in the top-K predictions.
    """
    hits = 0
    for i, pred in enumerate(predictions):
        top_k = set(pred[:k])
        relevant = set(targets[i])
        if top_k & relevant:  # Any overlap counts as a hit
            hits += 1
    return hits / len(predictions)


def mean_reciprocal_rank(
    predictions: np.ndarray,  # (N, K) ranked predictions
    targets: np.ndarray,      # (N,) single target per query
) -> float:
    """
    Compute Mean Reciprocal Rank.
    """
    reciprocal_ranks = []
    for i, pred in enumerate(predictions):
        target = targets[i]
        if target in pred:
            rank = np.where(pred == target)[0][0] + 1
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return np.mean(reciprocal_ranks)


def ndcg_at_k(
    predictions: np.ndarray,  # (N, K) ranked predictions
    relevances: np.ndarray,   # (N, M) relevance scores in ranked order
    k: int = 10,
) -> float:
    """
    Compute normalized Discounted Cumulative Gain (simplified for binary relevance).
    """
    dcg = np.sum(
        relevances[:, :k] / np.log2(np.arange(2, k + 2)),
        axis=1
    )
    # Ideal DCG (perfect ranking)
    sorted_rel = np.sort(relevances, axis=1)[:, ::-1]
    idcg = np.sum(
        sorted_rel[:, :k] / np.log2(np.arange(2, k + 2)),
        axis=1
    )
    ndcg = dcg / (idcg + 1e-8)
    return np.mean(ndcg)
```

Use R@1 for single-answer tasks (find THE matching item). Use R@K for recall-focused tasks (find ANY relevant item in top K, like first-stage retrieval). Use MRR when rank position matters but there's one right answer. Use mAP/nDCG when multiple items are relevant with varying importance.
Standardized benchmarks enable fair comparison of cross-modal retrieval systems. Different benchmarks test different capabilities.
| Benchmark | Modalities | Size | Task |
|---|---|---|---|
| MS-COCO | Image-Text | 123K images, 5 captions each | Image-caption retrieval |
| Flickr30K | Image-Text | 31K images, 5 captions each | Image-caption retrieval |
| MSCOCO-5K | Image-Text | 5K test images | Standard test split |
| MSR-VTT | Video-Text | 10K videos, 200K captions | Video-text retrieval |
| ActivityNet | Video-Text | 20K videos | Dense video retrieval |
| AudioSet | Audio-Video | 2M clips | Audio-visual retrieval |
| LAION-5B | Image-Text | 5.85B pairs | Large-scale pretraining |
In the final page of this module, we'll explore unified multimodal architectures—models that process any combination of modalities with a single architecture, representing the frontier of multimodal AI.