Humans effortlessly integrate visual perception with linguistic understanding. We can describe what we see, answer questions about images, follow visual instructions, and imagine scenes from textual descriptions. For decades, artificial intelligence treated vision and language as separate domains—computer vision systems could recognize objects, while natural language processing systems could parse sentences, but bridging these modalities remained an elusive goal.
Vision-Language Models (VLMs) represent a paradigm shift in how we build AI systems that understand and reason across modalities. These models learn joint representations that capture the semantic relationships between visual content and natural language, enabling capabilities that were impossible just a few years ago: generating detailed image descriptions, answering complex visual questions, creating images from text prompts, and performing zero-shot visual recognition using only textual class descriptions.
By the end of this page, you will understand the theoretical foundations of vision-language models, master the key architectural paradigms (dual-encoder, fusion-encoder, encoder-decoder), comprehend different pre-training objectives, and appreciate how these models have transformed both research and practical applications in multimodal AI.
Before diving into architectures, we must understand why combining vision and language is fundamentally challenging. The difficulty stems from the inherent differences in how these modalities represent information.
The Representation Gap:
Visual data is continuous, high-dimensional, and spatially structured. An image is a grid of pixel values encoding color intensities, where meaning emerges from spatial relationships, textures, edges, and compositional patterns. In contrast, language is discrete, sequential, and symbolically structured. Words are tokens from a finite vocabulary, combined according to grammatical rules to express meaning.
Bridging these modalities requires learning a shared semantic space where visual concepts and linguistic concepts can be meaningfully compared and combined. This is non-trivial because:
| Aspect | Vision | Language |
|---|---|---|
| Data Type | Continuous (pixel intensities) | Discrete (tokens from vocabulary) |
| Structure | 2D spatial grid (or 3D for video) | 1D sequential |
| Basic Units | Pixels, patches, regions | Words, subwords, characters |
| Compositionality | Spatial relationships, part-whole | Syntactic trees, dependency structures |
| Dimensionality | Very high (millions of pixels) | Variable (sentence length × vocab) |
| Ambiguity | One image, many descriptions | One description, many images |
Vision-language models must solve the 'symbol grounding problem'—connecting abstract linguistic symbols to perceptual experiences. When a model learns that 'red' corresponds to certain pixel patterns, it is performing a form of grounding that philosophers have debated for centuries in the context of human cognition.
The journey toward modern vision-language models spans several decades, with each era contributing crucial insights and techniques.
Era 1: Feature Engineering (Pre-2012)
Early approaches relied on hand-crafted features for both modalities. Visual features like SIFT, HOG, and bag-of-visual-words were combined with linguistic features like TF-IDF or word2vec embeddings. Systems were task-specific, requiring separate models for image captioning, visual question answering, and image retrieval. The semantic gap between visual and textual features limited performance.
Era 2: Deep Learning Fusion (2012-2018)
The deep learning revolution, sparked by AlexNet in 2012, enabled end-to-end learning of visual features. Researchers began combining CNN image encoders with RNN/LSTM text encoders. Key innovations included attention-based image captioning, large-scale visual question answering benchmarks, and learned joint embeddings for image-text retrieval.
Era 3: Transformer Revolution (2018-2020)
Transformers, introduced for NLP in 2017, were adapted for vision-language tasks. Self-attention enabled modeling of long-range dependencies and cross-modal interactions, and BERT-style pre-training was extended to paired image-text data by models such as ViLBERT, LXMERT, and UNITER.
Era 4: Large-Scale Pre-training (2021-Present)
The current era is defined by massive-scale pre-training on internet-scale image-text pairs, leading to emergent zero-shot and few-shot capabilities: contrastive models such as CLIP and ALIGN, and generative models such as Flamingo, BLIP-2, and LLaVA that couple vision encoders with large language models.
Modern vision-language models follow three primary architectural paradigms, each with distinct trade-offs in computational efficiency, flexibility, and capability. Understanding these paradigms is essential for selecting the right approach for a given application.
Paradigm 1: Dual-Encoder Architecture
The dual-encoder architecture processes each modality independently through separate encoders, producing fixed-dimensional embeddings that are compared in a shared semantic space.
```
Image → Vision Encoder → Image Embedding (d-dim)
                               ↓
                        Similarity Score
                               ↑
Text  → Text Encoder  → Text Embedding (d-dim)
```
Characteristics:
- Each modality is encoded independently; cross-modal interaction happens only at the level of global embeddings
- Image and text embeddings can be pre-computed and indexed, making large-scale retrieval extremely efficient
- No text generation capability
Examples: CLIP, ALIGN, SigLIP, OpenCLIP
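To make the pattern concrete, here is a minimal sketch of a dual encoder. It is not any particular model's implementation: the `vision_encoder` and `text_encoder` arguments, the dimensions, and the CLIP-style learnable logit scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Minimal dual-encoder sketch: each modality is embedded independently,
    and the only cross-modal interaction is a similarity score."""

    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module,
                 vision_dim: int, text_dim: int, embed_dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder   # images    -> (B, vision_dim)
        self.text_encoder = text_encoder       # token ids -> (B, text_dim)
        # Linear projections into the shared d-dimensional space
        self.image_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable logit scale (CLIP-style, initialized to 1/0.07)
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_proj(self.vision_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(token_ids)), dim=-1)
        # (B_images, B_texts) matrix of temperature-scaled cosine similarities
        return self.logit_scale.exp() * img @ txt.T
```

Because the similarity is a simple dot product of pre-computable embeddings, retrieval over millions of items reduces to a nearest-neighbor search.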
Paradigm 2: Fusion-Encoder Architecture
Fusion encoders concatenate or interleave visual and textual tokens, processing them jointly through cross-attention layers that model fine-grained interactions.
```
Image → Vision Encoder → [CLS, v1, v2, ..., vN]
                                ↓
         Fusion Transformer ← [CLS, w1, w2, ..., wM]
                                ↓
                         Joint Embedding
```
Characteristics:
- Visual and textual tokens are processed jointly, enabling fine-grained, token-level cross-modal interaction
- Well suited to reasoning-heavy tasks such as visual question answering
- Retrieval is expensive because every image-text pair must be re-encoded jointly; generation capability is limited
Examples: ViLBERT, LXMERT, UNITER, BEiT-3
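The sketch below shows the core fusion mechanism: text tokens attending to image patch tokens through cross-attention. Real fusion encoders (ViLBERT, LXMERT, UNITER) stack many such blocks and differ in details such as bidirectional co-attention; this single layer is illustrative only.

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """One illustrative fusion block: text tokens attend to image patch tokens."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, M, dim), image_tokens: (B, N, dim)
        x = text_tokens
        # Self-attention over the text sequence
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention: queries from text, keys/values from image patches
        x = x + self.cross_attn(self.norm2(x), image_tokens, image_tokens)[0]
        # Position-wise feed-forward network
        x = x + self.ffn(self.norm3(x))
        return x
```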
Paradigm 3: Encoder-Decoder Architecture
Encoder-decoder models process the input (image, or image+text) through an encoder and generate output text autoregressively through a decoder.
```
Image → Vision Encoder → Visual Tokens
                              ↓
[Prompt] → Text Encoder → [Prompt Tokens, Visual Tokens]
                              ↓
                   Decoder → Generated Text
```
Characteristics:
- Generates text autoregressively, conditioned on visual tokens and an optional prompt
- Supports open-ended tasks: captioning, visual question answering, instruction following
- The most computationally expensive paradigm; retrieval is possible but not its strength
Examples: Flamingo, BLIP-2, LLaVA, GPT-4V
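A minimal sketch of the visual-prefix pattern used by many encoder-decoder VLMs: projected visual tokens are prepended to the prompt embeddings before autoregressive decoding. The `vision_encoder` and `language_model` components are placeholders, and the `inputs_embeds` keyword assumes a Hugging Face-style language model.

```python
import torch
import torch.nn as nn

class VisualPrefixDecoder(nn.Module):
    """Sketch of the encoder-decoder / visual-prefix pattern: project visual
    tokens into the decoder's embedding space, prepend them to the prompt,
    and let the language model generate text autoregressively."""

    def __init__(self, vision_encoder, language_model, vision_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder        # images -> (B, N, vision_dim)
        self.language_model = language_model        # embeddings -> token logits
        self.projector = nn.Linear(vision_dim, lm_dim)  # aligns the two spaces

    def forward(self, images: torch.Tensor, prompt_embeds: torch.Tensor):
        visual_tokens = self.projector(self.vision_encoder(images))  # (B, N, lm_dim)
        # [visual tokens, prompt tokens] form the decoder's input sequence
        inputs = torch.cat([visual_tokens, prompt_embeds], dim=1)
        # Assumes a Hugging Face-style LM that accepts pre-computed embeddings
        return self.language_model(inputs_embeds=inputs)
```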
| Paradigm | Interaction Level | Retrieval | Generation | Efficiency |
|---|---|---|---|---|
| Dual-Encoder | Embedding (global) | ✓ Excellent | ✗ None | Very High |
| Fusion-Encoder | Token (local) | ✗ Poor | Limited | Medium |
| Encoder-Decoder | Token + Autoregressive | Possible | ✓ Excellent | Low |
Modern systems often combine paradigms. BLIP-2 uses a dual-encoder for efficient pre-training, a fusion module (Q-Former) for alignment, and an LLM decoder for generation. This modular design allows mixing frozen pre-trained components with lightweight trainable adapters.
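The core idea behind such query-token adapters can be sketched in a few lines. This is not BLIP-2's actual Q-Former (which is a full transformer with its own pre-training objectives); it only illustrates how a small set of learnable queries compresses many frozen patch tokens into a handful of tokens for the LLM.

```python
import torch
import torch.nn as nn

class QueryTokenPooler(nn.Module):
    """Illustrative Q-Former-style adapter: learnable queries cross-attend to
    frozen image features, yielding a compact set of visual tokens."""

    def __init__(self, num_queries: int = 32, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N_patches, dim), typically from a frozen vision encoder
        q = self.queries.expand(image_tokens.size(0), -1, -1)
        pooled, _ = self.cross_attn(q, image_tokens, image_tokens)
        return pooled  # (B, num_queries, dim): compact visual representation
```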
The vision encoder transforms raw pixel data into semantic representations that can interface with language models. The choice of vision encoder significantly impacts model capability and efficiency.
Convolutional Neural Networks (CNNs)
CNNs like ResNet were the dominant vision encoders in early VLMs. They process images hierarchically through convolution and pooling operations, producing progressively more abstract feature maps; the final pooled features (or region features from an object detector) were then passed to the language side.
Vision Transformers (ViT)
ViT applies transformers directly to image patches, treating each patch as a token:
```
Image (224×224) → Patch Embedding (16×16 patches) → 196 tokens + [CLS]
                              ↓
                     Transformer Encoder
                              ↓
                 Patch Embeddings (d-dim each)
```
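The patch-embedding step in this pipeline is just a strided convolution followed by a flatten. The snippet below sketches the standard recipe with assumed sizes (224px input, 16×16 patches, 768-dim tokens):

```python
import torch
import torch.nn as nn

# A strided convolution splits the image into non-overlapping 16x16 patches
# and linearly projects each patch to a 768-dimensional token.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)               # (B, C, H, W)
tokens = patch_embed(image)                        # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)         # (1, 196, 768): 196 patch tokens
cls_token = torch.zeros(1, 1, 768)                 # learnable [CLS] token in practice
sequence = torch.cat([cls_token, tokens], dim=1)   # (1, 197, 768) fed to the encoder
```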
Advantages:
- Patch tokens interface naturally with transformer-based language models
- Self-attention gives a global receptive field from the first layer
- Scales well with data and model size, which suits large-scale image-text pre-training
Modern Vision Encoders:
| Encoder | Key Innovation | Use Cases |
|---|---|---|
| ViT-L/14 | Standard ViT, 14×14-pixel patches (224px/336px inputs) | CLIP, OpenCLIP |
| SigLIP | Sigmoid loss (no softmax) | PaLI, Gemini |
| EVA | Masked image modeling + CLIP | EVA-CLIP |
| DINOv2 | Self-supervised objectives | Feature extraction |
```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor


class VisionEncoder(nn.Module):
    """
    Vision encoder wrapper for multimodal models.
    Processes images into patch embeddings compatible
    with language model architectures.
    """

    def __init__(
        self,
        model_name: str = "openai/clip-vit-large-patch14-336",
        output_dim: int = 768,
        num_query_tokens: int = 32,
    ):
        super().__init__()

        # Load pre-trained vision encoder
        self.vision_model = CLIPVisionModel.from_pretrained(model_name)
        self.processor = CLIPImageProcessor.from_pretrained(model_name)

        # Vision encoder hidden dimension
        vision_dim = self.vision_model.config.hidden_size  # 1024 for ViT-L

        # Optional: Projection to match LLM dimension
        if vision_dim != output_dim:
            self.projection = nn.Linear(vision_dim, output_dim)
        else:
            self.projection = nn.Identity()

        # Optional: Learnable query tokens (like Q-Former)
        self.query_tokens = nn.Parameter(
            torch.randn(1, num_query_tokens, output_dim)
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        """
        Args:
            images: Tensor of shape (B, C, H, W)

        Returns:
            Tensor of shape (B, num_patches, output_dim)
        """
        # Extract patch embeddings (exclude [CLS] token)
        vision_outputs = self.vision_model(images)
        patch_embeddings = vision_outputs.last_hidden_state[:, 1:, :]

        # Project to target dimension
        projected = self.projection(patch_embeddings)

        return projected

    def get_image_features(self, images: torch.Tensor) -> torch.Tensor:
        """Get pooled image representation for retrieval."""
        vision_outputs = self.vision_model(images)
        # Use [CLS] token as image representation
        cls_embedding = vision_outputs.last_hidden_state[:, 0, :]
        return self.projection(cls_embedding)
```

Pre-training objectives determine what knowledge the model acquires from large-scale data. Different objectives develop different capabilities, and modern VLMs often combine multiple objectives.
Contrastive Learning (Image-Text Matching)
Contrastive objectives learn to align matching image-text pairs while separating non-matching pairs in embedding space:
$$\mathcal{L}_{\text{contrastive}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right]$$
Where $s_{ij} = \text{sim}(v_i, t_j)$ is the similarity between image $i$ and text $j$, and $\tau$ is a temperature parameter.
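A direct implementation of this loss is short. The sketch below assumes a batch of N matching image-text embedding pairs; note that CLIP additionally learns the temperature and averages (rather than sums) the two directional terms.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss from the formula above.
    image_embeds, text_embeds: (N, d) embeddings of N matching pairs."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # s_ij = sim(v_i, t_j), scaled by the temperature tau
    logits = image_embeds @ text_embeds.T / temperature   # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Image-to-text direction: softmax over each row (all texts for image i)
    loss_i2t = F.cross_entropy(logits, targets)
    # Text-to-image direction: softmax over each column (all images for text i)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return loss_i2t + loss_t2i
```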
Key Properties:
- Symmetric: the loss is applied in both the image-to-text and text-to-image directions
- Every other example in the batch serves as a negative, so large batch sizes matter
- The temperature $\tau$ controls how sharply the softmax focuses on hard negatives
- Yields embeddings that directly support retrieval and zero-shot classification
Masked Language/Image Modeling
Inspired by BERT, these objectives mask portions of the input and train the model to reconstruct them: masked language modeling predicts hidden text tokens from the remaining text plus the image, while masked image modeling reconstructs masked patches or region features from the surrounding visual and textual context.
Generative Objectives
Train the model to generate text conditioned on images, typically with an autoregressive (next-token) cross-entropy loss over caption tokens, as sketched below.
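A minimal sketch of this objective follows; the `decoder` argument is a placeholder for any module that maps visual tokens plus previous caption tokens to vocabulary logits.

```python
import torch
import torch.nn.functional as F

def captioning_loss(decoder, visual_tokens: torch.Tensor,
                    caption_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Autoregressive captioning objective (sketch): predict each caption token
    from the visual tokens plus all previous caption tokens.

    decoder:       (visual_tokens, input_ids) -> logits of shape (B, T, vocab)
    visual_tokens: (B, N, d) output of the vision encoder
    caption_ids:   (B, T+1) tokenized captions, including BOS/EOS
    """
    input_ids = caption_ids[:, :-1]              # condition on tokens < t
    target_ids = caption_ids[:, 1:]              # predict the token at position t
    logits = decoder(visual_tokens, input_ids)   # (B, T, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,                     # skip padding positions
    )
```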
Pre-training objectives are only as good as the data they're trained on. CLIP's success came partly from carefully curated image-text pairs. Models trained on noisy web data often learn spurious correlations (e.g., associating 'beautiful' with faces rather than landscapes).
One of the most remarkable properties of modern VLMs is their ability to generalize to new tasks and domains without task-specific training—a capability known as zero-shot transfer.
Zero-Shot Image Classification
Traditional classifiers require labeled examples for each class. VLMs like CLIP can classify images into arbitrary categories by comparing image embeddings to text embeddings of class names:
```python
# Zero-shot classification with CLIP
# (encode_text / encode_image stand in for the model's text and image
#  encoders; a full runnable version appears later on this page)
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
text_embeddings = encode_text(class_prompts)        # Shape: (3, d)
image_embedding = encode_image(image)               # Shape: (1, d)
similarities = image_embedding @ text_embeddings.T  # Shape: (1, 3)
predicted_class = similarities.argmax()             # Most similar class
```
Why Zero-Shot Works: contrastive pre-training on internet-scale image-text pairs places images and free-form textual descriptions in a shared embedding space. Any category that can be described in words therefore defines a classifier immediately, with no labeled examples or gradient updates required.
Prompt Engineering for Zero-Shot:
Prompt design significantly impacts zero-shot performance:
| Prompt Template | Example | Use Case |
|---|---|---|
"a photo of a {class}" | "a photo of a cat" | General classification |
"a {class} in the wild" | "a lion in the wild" | Natural scenes |
"satellite imagery of {class}" | "satellite imagery of forest" | Remote sensing |
"a drawing of a {class}" | "a drawing of a house" | Sketch recognition |
```python
import torch
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel


def zero_shot_classify(
    images: list,
    class_names: list[str],
    model_name: str = "openai/clip-vit-large-patch14",
) -> list[str]:
    """
    Perform zero-shot image classification using CLIP.

    Args:
        images: List of PIL images or image paths
        class_names: List of class labels
        model_name: CLIP model variant

    Returns:
        List of predicted class names
    """
    # Load model and processor
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    # Create prompts with template
    prompts = [f"a photo of a {name}" for name in class_names]

    # Process inputs
    inputs = processor(
        text=prompts,
        images=images,
        return_tensors="pt",
        padding=True
    )

    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
        image_embeds = outputs.image_embeds  # (N_images, D)
        text_embeds = outputs.text_embeds    # (N_classes, D)

    # Normalize embeddings
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Compute similarities
    logits = image_embeds @ text_embeds.T  # (N_images, N_classes)
    predictions = logits.argmax(dim=-1)

    return [class_names[p] for p in predictions]
```

CLIP's official implementation averages predictions across 80+ prompt templates per class (e.g., 'a photo of a {class}', 'a blurry photo of a {class}', 'a sculpture of a {class}'). This ensemble approach improves robustness and reduces sensitivity to specific phrasings.
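A sketch of that ensembling idea, reusing the same CLIPModel/CLIPProcessor classes as above; the template list is truncated to three entries for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPProcessor, CLIPModel

TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sculpture of a {}.",
]  # CLIP's official list contains 80+ such templates


def ensembled_class_embeddings(class_names: list[str],
                               model_name: str = "openai/clip-vit-large-patch14") -> torch.Tensor:
    """Average text embeddings over prompt templates, one vector per class."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    class_embeds = []
    with torch.no_grad():
        for name in class_names:
            prompts = [t.format(name) for t in TEMPLATES]
            inputs = processor(text=prompts, return_tensors="pt", padding=True)
            embeds = F.normalize(model.get_text_features(**inputs), dim=-1)
            # Mean over templates, then re-normalize to unit length
            class_embeds.append(F.normalize(embeds.mean(dim=0), dim=-1))
    return torch.stack(class_embeds)  # (N_classes, D), usable in place of text_embeds above
```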
Vision-language models have enabled a wide range of applications that were previously impossible or required extensive task-specific engineering.
| Application | Architecture | Key Capability | Example Models |
|---|---|---|---|
| Image Search | Dual-Encoder | Efficient retrieval | CLIP, ALIGN |
| VQA | Fusion/Encoder-Decoder | Fine-grained reasoning | BLIP-2, LLaVA |
| Captioning | Encoder-Decoder | Fluent generation | CoCa, Flamingo |
| Text-to-Image | Diffusion + VLM | Visual generation | DALL-E, Stable Diffusion |
| Document AI | Layout-aware VLM | Spatial understanding | LayoutLMv3, Donut |
We've covered the foundations of vision-language models, establishing the conceptual and architectural groundwork for the rest of this module.
In the next page, we'll dive deep into CLIP and DALL-E—two landmark models that demonstrated the power of large-scale vision-language pre-training and opened new frontiers in zero-shot learning and text-to-image generation.