Human intelligence is inherently multimodal. We see, hear, read, and speak—integrating information across senses seamlessly. We don't just understand text about cats; we recognize cats in images, hear them meow, and can describe what we see.
Multimodal AI aims to replicate this integration. Rather than separate models for vision, language, and audio, multimodal models learn to process and relate information across modalities within a single architecture. This represents the natural extension of the foundation model paradigm—from 'one model for text' to 'one model for everything.'
This page explores multimodal foundation models: how they work, why they matter, and what they can do that unimodal models cannot.
By the end of this page, you will understand: (1) the architecture of vision-language models like CLIP, (2) contrastive learning for multimodal alignment, (3) generative multimodal models (DALL-E, Stable Diffusion, GPT-4V), (4) unified architectures that process all modalities, (5) cross-modal capabilities and applications, and (6) the challenges unique to multimodal learning.
Before diving into architectures, let's establish why multimodal models represent such an important direction for AI.
The Limitations of Unimodal Learning:
Models trained on single modalities face fundamental constraints:
Language-only models: they learn about the visual and physical world only through text descriptions—they can discuss color, shape, and motion without ever perceiving them.
Vision-only models: they learn rich perceptual features but lack the abstract concepts, structured knowledge, and reasoning that language provides.
The Information Density Argument:
Different modalities carry different types and densities of information:
| Modality | Typical raw information rate | What it captures well |
|---|---|---|
| Text | ~30-60 bits/s (reading) | Abstract concepts, structured knowledge, reasoning |
| Audio | ~44,100 × 16 bits/s (CD audio) | Temporal patterns, emotion, music, speech |
| Image | Megapixels per frame | Spatial relationships, appearance, scene composition |
| Video | GB per minute | Dynamics, actions, causality over time |
A model that processes all modalities can access information that's natural in each: abstract reasoning from text, spatial understanding from images, temporal dynamics from audio/video.
The Scaling Hypothesis for Multimodality:
Just as scale has proven crucial for unimodal language models, researchers hypothesize that scaling multimodal models—more modalities, more data, more parameters—will produce emergent cross-modal capabilities not achievable otherwise. Early evidence from GPT-4V and Gemini supports this view.
Some researchers argue that grounding in perception is necessary for true understanding—that 'meaning' requires connection to the world, not just text patterns. Multimodal models represent a step toward this grounding, though they still lack physical embodiment and interaction with the real world.
CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, demonstrated that you can learn powerful visual representations by training on natural language supervision at scale. It became a foundational component for many subsequent multimodal systems.
The Core Idea:
Instead of training a vision model with fixed image categories (ImageNet's 1000 classes), CLIP learns a joint embedding space where images and their text descriptions are close together:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPModel(nn.Module):
    """
    Simplified CLIP architecture.

    Key insight: Learn a joint embedding space through contrastive learning.
    Images and their descriptions should have similar embeddings.
    """

    def __init__(
        self,
        vision_encoder,   # e.g., ViT-B/32
        text_encoder,     # e.g., Transformer
        embed_dim: int = 512,
        temperature: float = 0.07,
    ):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder

        # Projection heads to shared embedding space
        self.vision_projection = nn.Linear(vision_encoder.output_dim, embed_dim)
        self.text_projection = nn.Linear(text_encoder.output_dim, embed_dim)

        # Learnable temperature parameter
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / temperature)))

    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        """Encode images to joint embedding space."""
        features = self.vision_encoder(images)
        embeddings = self.vision_projection(features)
        return F.normalize(embeddings, dim=-1)

    def encode_text(self, text_tokens: torch.Tensor) -> torch.Tensor:
        """Encode text to joint embedding space."""
        features = self.text_encoder(text_tokens)
        embeddings = self.text_projection(features)
        return F.normalize(embeddings, dim=-1)

    def forward(
        self,
        images: torch.Tensor,
        text_tokens: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Compute similarity scores between all image-text pairs.
        Returns logits for contrastive loss.
        """
        image_embeddings = self.encode_image(images)      # [B, embed_dim]
        text_embeddings = self.encode_text(text_tokens)   # [B, embed_dim]

        # Scale by learnable temperature
        logit_scale = self.logit_scale.exp()

        # Compute similarity matrix: [B, B]
        # logits[i, j] = similarity between image i and text j
        logits_per_image = logit_scale * image_embeddings @ text_embeddings.T
        logits_per_text = logits_per_image.T

        return logits_per_image, logits_per_text


def clip_contrastive_loss(
    logits_per_image: torch.Tensor,
    logits_per_text: torch.Tensor,
) -> torch.Tensor:
    """
    InfoNCE contrastive loss for CLIP training.

    Key insight: The diagonal of the similarity matrix contains matching pairs.
    We want to maximize these while minimizing off-diagonal (non-matching) pairs.

    This is equivalent to a symmetric cross-entropy loss where:
    - For each image, predict which text it matches
    - For each text, predict which image it matches
    """
    batch_size = logits_per_image.shape[0]

    # Labels: diagonal elements (index i matches index i)
    labels = torch.arange(batch_size, device=logits_per_image.device)

    # Cross-entropy: each image should match its corresponding text
    loss_image = F.cross_entropy(logits_per_image, labels)

    # Cross-entropy: each text should match its corresponding image
    loss_text = F.cross_entropy(logits_per_text, labels)

    # Symmetric loss
    return (loss_image + loss_text) / 2


# Zero-shot classification with CLIP
def zero_shot_classify(
    model: CLIPModel,
    image: torch.Tensor,
    class_names: list[str],
    prompt_template: str = "A photo of a {}",
) -> list[float]:
    """
    Classify an image into arbitrary text categories.

    This is remarkably powerful: no training on these specific classes,
    just comparison to text descriptions.
    """
    # Encode the image
    image_embedding = model.encode_image(image.unsqueeze(0))

    # Encode all class descriptions
    text_prompts = [prompt_template.format(name) for name in class_names]
    text_tokens = tokenize(text_prompts)  # Tokenization step
    text_embeddings = model.encode_text(text_tokens)

    # Compute similarities
    similarities = (image_embedding @ text_embeddings.T).squeeze(0)

    # Convert to probabilities
    probs = F.softmax(similarities * model.logit_scale.exp(), dim=-1)

    return probs.tolist()


# Example usage:
# probs = zero_shot_classify(clip_model, dog_image,
#                            ["dog", "cat", "car", "tree", "person"])
# Result: [0.92, 0.05, 0.01, 0.01, 0.01] - recognizes it's a dog!
```

Why CLIP Works:
Scale of Training Data: CLIP was trained on 400 million image-text pairs from the internet. This dwarfs any manually curated dataset.
Natural Language Supervision: Instead of fixed labels, the supervision comes from natural language—capturing richer descriptions than category labels.
Contrastive Objective: The InfoNCE loss creates a strong learning signal by contrasting matching pairs against many non-matching pairs in each batch.
Emergent Concepts: CLIP learns concepts never explicitly labeled because they appear in natural text descriptions (actions, attributes, relationships).
CLIP's Revolutionary Capabilities:
CLIP's vision encoder has become a building block for many systems: it provides visual features for DALL-E, Stable Diffusion, and many VLMs. The pretrained CLIP embeddings are often more useful than training visual representations from scratch.
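For readers who want to try zero-shot classification themselves, below is a minimal sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name and the local image path are illustrative assumptions, not part of the original discussion.

```python
# pip install transformers pillow torch  (assumed environment)
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Tokenize the prompts and preprocess the image in one call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# logits_per_image has shape [1, num_labels]; softmax gives class probabilities
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```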
While CLIP understands image-text relationships, it doesn't generate images. A separate line of research has developed text-to-image generation—models that create images from text descriptions.
DALL-E and Autoregressive Image Generation (2021):
The original DALL-E took an autoregressive approach: a discrete VAE compresses each image into a grid of discrete tokens drawn from a fixed codebook, and a transformer is trained to predict those image tokens autoregressively, conditioned on the text tokens.
This worked but had limitations: autoregressive generation was slow, and the discrete bottleneck lost high-frequency details.
Diffusion Models: The New Paradigm (2022-present):
Diffusion models revolutionized image generation with a different approach: corrupt training images with progressively more Gaussian noise, train a network to predict and remove that noise at every level, then generate by starting from pure noise and denoising step by step, guided by a text embedding:
"""Diffusion models for image generation (conceptual overview). Key idea: Learn to denoise images at every noise level.Generation: Start from pure noise, progressively denoise.""" import torchimport torch.nn as nn class DiffusionModel(nn.Module): """ Simplified diffusion for image generation. Training: Add noise to images, predict the noise. Inference: Start from noise, iteratively remove predicted noise. """ def __init__(self, denoiser: nn.Module, num_timesteps: int = 1000): super().__init__() self.denoiser = denoiser # Usually a U-Net self.num_timesteps = num_timesteps # Precompute noise schedule (beta_t values) self.register_buffer( 'betas', torch.linspace(1e-4, 0.02, num_timesteps) ) self.register_buffer('alphas', 1.0 - self.betas) self.register_buffer('alpha_bars', torch.cumprod(self.alphas, dim=0)) def forward_diffusion( self, x_0: torch.Tensor, t: torch.Tensor, ) -> tuple[torch.Tensor, torch.Tensor]: """ Add noise to images according to noise schedule. x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise """ noise = torch.randn_like(x_0) alpha_bar_t = self.alpha_bars[t][:, None, None, None] x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise return x_t, noise def training_step( self, images: torch.Tensor, text_embeddings: torch.Tensor = None, # For conditional generation ) -> torch.Tensor: """ Training: Predict the noise that was added. Simple objective: MSE between predicted and actual noise. """ batch_size = images.shape[0] # Sample random timesteps t = torch.randint(0, self.num_timesteps, (batch_size,), device=images.device) # Add noise x_t, noise = self.forward_diffusion(images, t) # Predict noise (conditioned on timestep and optionally text) predicted_noise = self.denoiser(x_t, t, text_embeddings) # MSE loss loss = nn.functional.mse_loss(predicted_noise, noise) return loss @torch.no_grad() def generate( self, text_embeddings: torch.Tensor, image_shape: tuple = (3, 512, 512), guidance_scale: float = 7.5, ) -> torch.Tensor: """ Generate images from text using iterative denoising. Classifier-free guidance: Interpolate between conditional and unconditional predictions for better text adherence. 
""" batch_size = text_embeddings.shape[0] # Start from pure noise x_t = torch.randn(batch_size, *image_shape, device=text_embeddings.device) # Iteratively denoise for t in reversed(range(self.num_timesteps)): t_tensor = torch.full((batch_size,), t, device=x_t.device) # Classifier-free guidance: predict with and without text noise_cond = self.denoiser(x_t, t_tensor, text_embeddings) noise_uncond = self.denoiser(x_t, t_tensor, None) # Unconditional # Guided noise prediction predicted_noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond) # Denoise one step x_t = self.denoise_step(x_t, predicted_noise, t) return x_t def denoise_step(self, x_t, predicted_noise, t): """Single denoising step (DDPM sampling).""" alpha_t = self.alphas[t] alpha_bar_t = self.alpha_bars[t] beta_t = self.betas[t] # Predicted x_0 from x_t and predicted noise x_0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t) # Add noise for next timestep (unless t=0) if t > 0: noise = torch.randn_like(x_t) alpha_bar_t_minus_1 = self.alpha_bars[t - 1] # Posterior mean and variance mean = (torch.sqrt(alpha_bar_t_minus_1) * beta_t * x_0_pred + torch.sqrt(alpha_t) * (1 - alpha_bar_t_minus_1) * x_t) / (1 - alpha_bar_t) var = beta_t * (1 - alpha_bar_t_minus_1) / (1 - alpha_bar_t) x_t_minus_1 = mean + torch.sqrt(var) * noise else: x_t_minus_1 = x_0_pred return x_t_minus_1| Model | Organization | Approach | Key Features |
|---|---|---|---|
| DALL-E | OpenAI | VQ-VAE + Autoregressive | First large-scale text-to-image |
| DALL-E 2 | OpenAI | CLIP + Diffusion | Higher quality, editing capabilities |
| DALL-E 3 | OpenAI | Diffusion + GPT-4 | Better prompt following, integrated with ChatGPT |
| Stable Diffusion | Stability AI | Latent Diffusion | Open-source, widely adopted |
| Midjourney | Midjourney | Proprietary Diffusion | Artistic style, aesthetic focus |
| Imagen | Google | Diffusion + T5 | Strong text encoder, high fidelity |
| Flux | Black Forest Labs | Rectified Flow | Fast, high quality, open weights |
Key Innovations in Text-to-Image:
Latent Diffusion: Stable Diffusion operates in a compressed latent space (via a VAE) rather than pixel space, dramatically reducing computation (a usage sketch follows this list).
Classifier-Free Guidance: Interpolate between conditional and unconditional generation to improve text adherence. Higher guidance = more faithful to prompt but potentially less diverse.
Cross-Attention for Conditioning: Text embeddings attend to image features via cross-attention, enabling fine-grained control over generation.
Rectified Flow: Newer models (Flux, SD3) use rectified flow trajectories for faster, higher-quality generation.
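To see how latent diffusion, classifier-free guidance, and text conditioning come together in practice, here is a hedged usage sketch with the open-source diffusers library; the specific checkpoint name, output path, and GPU availability are assumptions.

```python
# pip install diffusers transformers accelerate  (assumed environment with a CUDA GPU)
import torch
from diffusers import StableDiffusionPipeline

# Latent diffusion pipeline: a VAE, a U-Net denoiser, and a CLIP text encoder under the hood.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    guidance_scale=7.5,        # classifier-free guidance strength (text adherence vs. diversity)
    num_inference_steps=30,    # denoising steps, performed in the compressed latent space
).images[0]
image.save("lighthouse.png")
```

The `guidance_scale` knob corresponds directly to the interpolation between conditional and unconditional predictions described above.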
Despite impressive results, text-to-image models struggle with: counting (generating exactly N objects), spatial relationships (left/right, above/below), text rendering (generating legible words in images), and compositional scenarios (multiple objects with specific attributes). These remain active research areas.
Vision-Language Models (VLMs) combine the text generation capabilities of LLMs with the ability to understand images. They can answer questions about images, describe visual content, and reason across text and images.
The General VLM Architecture:
Most VLMs follow a common pattern: a pretrained vision encoder converts each image into a sequence of features, a projection (or cross-attention) module maps those features into the LLM's embedding space, and the LLM then processes visual and text tokens together.
The key design choice is how to integrate visual information into the LLM.
```python
import torch
import torch.nn as nn


class VisionLanguageModel(nn.Module):
    """
    Simplified Vision-Language Model architecture.

    This follows the pattern used by LLaVA, GPT-4V, Gemini, etc.
    Images are encoded and projected into the LLM's embedding space.
    """

    def __init__(
        self,
        vision_encoder,                # Pretrained, e.g., CLIP ViT
        language_model,                # Pretrained LLM
        vision_dim: int = 768,         # Vision encoder output dim
        llm_dim: int = 4096,           # LLM hidden dim
        num_vision_tokens: int = 256,  # Vision tokens per image
    ):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model

        # Projection: map vision features to LLM space
        # This is the key learnable component for modality alignment
        self.vision_projection = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

        # Alternative: Use cross-attention for finer-grained integration
        # self.cross_attention = nn.MultiheadAttention(llm_dim, 8)

        self.num_vision_tokens = num_vision_tokens

    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        """
        Encode images to sequences of tokens in LLM space.

        Returns: [batch, num_vision_tokens, llm_dim]
        """
        # Get patch features from vision encoder
        vision_features = self.vision_encoder(images)  # [batch, patches, vision_dim]

        # Project to LLM space
        vision_tokens = self.vision_projection(vision_features)  # [batch, patches, llm_dim]

        return vision_tokens

    def forward(
        self,
        images: torch.Tensor,
        text_tokens: torch.Tensor,
        labels: torch.Tensor = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass with interleaved image and text.

        Images become a sequence of tokens that the LLM processes
        alongside text tokens.
        """
        batch_size = text_tokens.shape[0]

        # Encode images to token sequences
        vision_tokens = self.encode_image(images)

        # Get text embeddings
        text_embeddings = self.language_model.embed_tokens(text_tokens)

        # Interleave: [IMAGE_START] [vision tokens] [IMAGE_END] [text tokens]
        # In practice, special tokens mark image positions
        combined = torch.cat([vision_tokens, text_embeddings], dim=1)

        # Forward through LLM
        outputs = self.language_model(inputs_embeds=combined)

        # Only compute loss on text portion
        if labels is not None:
            # Shift for next-token prediction, only on text tokens
            text_logits = outputs.logits[:, self.num_vision_tokens:, :]
            loss = nn.functional.cross_entropy(
                text_logits.reshape(-1, text_logits.size(-1)),
                labels.reshape(-1),
            )
            return outputs, loss

        return outputs, None

    def generate(
        self,
        image: torch.Tensor,
        prompt: str,
        max_length: int = 512,
    ) -> str:
        """Generate text response given image and text prompt."""
        vision_tokens = self.encode_image(image.unsqueeze(0))
        prompt_tokens = self.tokenizer(prompt)

        # Combine and generate
        combined = self.prepare_inputs(vision_tokens, prompt_tokens)
        output_ids = self.language_model.generate(
            inputs_embeds=combined,
            max_length=max_length,
        )
        return self.tokenizer.decode(output_ids[0])


# Training approaches for VLMs:
training_stages = {
    'stage_1': {
        'name': 'Alignment Pretraining',
        'frozen': ['vision_encoder', 'language_model'],
        'trained': ['vision_projection'],
        'data': 'Image-caption pairs (CC3M, LAION)',
        'objective': 'Learn to project visual features into LLM space',
    },
    'stage_2': {
        'name': 'Instruction Tuning',
        'frozen': ['vision_encoder'],
        'trained': ['vision_projection', 'language_model'],
        'data': 'Visual instruction following (LLaVA-Instruct, etc.)',
        'objective': 'Learn to follow multimodal instructions',
    },
}
```

| Model | Organization | Architecture | Key Capabilities |
|---|---|---|---|
| GPT-4V | OpenAI | Vision encoder + GPT-4 | State-of-the-art visual reasoning, chat |
| Gemini Ultra | Google | Native multimodal (unified) | Natively trained on mixed modalities |
| Claude 3.5 | Anthropic | Vision encoder + Claude | Strong document understanding |
| LLaVA | Academic/Open | CLIP + LLaMA | Open weights, instruction following |
| Qwen-VL | Alibaba | Vision encoder + Qwen | Strong multilingual visual understanding |
| InternVL | Open | InternViT + InternLM | Large-scale open VLM |
What VLMs Can Do:
Visual Question Answering: Answer questions about image content, from simple ('What color is the car?') to complex ('Why might the person be upset?'). A usage sketch follows this list.
Document Understanding: Read and analyze documents, charts, tables, and screenshots—extracting and reasoning about structured information.
Visual Reasoning: Solve problems requiring visual understanding: geometry, physics simulations, spatial reasoning.
Multi-Image Reasoning: Compare images, identify differences, track objects across frames.
Grounded Generation: Generate text that accurately describes or refers to specific image regions.
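As a concrete illustration of these capabilities, the sketch below queries an open VLM checkpoint through Hugging Face transformers; the model ID, prompt format, input file, and hardware requirements are assumptions rather than a definitive recipe.

```python
# pip install transformers accelerate pillow torch  (assumed environment with a GPU)
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative open VLM checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # hypothetical local file
prompt = "USER: <image>\nWhat trend does this chart show? ASSISTANT:"  # LLaVA-1.5-style prompt

# The processor inserts the image patches where the <image> placeholder appears
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```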
Models like GPT-4V 'stitch' a vision encoder onto an LLM. Models like Gemini are 'natively multimodal'—trained from scratch on mixed modality data. Native multimodality may enable better cross-modal reasoning, though stitched approaches can leverage strong unimodal pretrained components.
The multimodal paradigm extends beyond text and images to encompass audio, video, and other modalities.
Audio Models:
Speech Recognition (Whisper):
OpenAI's Whisper demonstrated that scaling a straightforward encoder-decoder transformer on roughly 680,000 hours of multilingual, multitask speech data produces robust transcription and translation across languages, accents, and noisy conditions.
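A minimal transcription sketch using the open-source openai-whisper package is shown below; the chosen model size and the audio file name are assumptions.

```python
# pip install openai-whisper  (assumes ffmpeg is installed)
import whisper

model = whisper.load_model("base")          # small multilingual checkpoint
result = model.transcribe("meeting.mp3")    # hypothetical local audio file
print(result["language"])                   # detected language
print(result["text"])                       # transcription
```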
Text-to-Speech:
Models like Tortoise, XTTS, and StyleTTS synthesize natural-sounding speech from text, complete with emotion, prosody, and speaker identity control.
Audio Language Models:
Models like AudioLM and MusicLM generate audio (music, sound effects) from text descriptions, extending generative AI to the audio domain.
| Modality | Understanding | Generation |
|---|---|---|
| Text | LLMs (GPT, Claude, LLaMA) | LLMs (same models) |
| Images | CLIP, VLMs (GPT-4V, Gemini) | DALL-E, Stable Diffusion, Midjourney |
| Audio/Speech | Whisper, wav2vec | Tortoise, ElevenLabs, XTTS |
| Music | Jukebox (limited) | MusicLM, Suno, Udio |
| Video | Video LLMs (VideoLLaMA, etc.) | Sora, Gen-2, Pika |
| 3D/Mesh | Point-E (limited) | Point-E, GET3D |
| Actions/Robotics | RT-2 | RT-2, Gato |
Video Understanding and Generation:
Video Understanding:
Video adds temporal complexity. Models must track objects, understand actions, and reason about cause and effect over time.
Approaches include sampling frames and encoding them with image encoders, adding temporal attention or 3D convolutions to model motion, and training video-text models on captioned clips.
Video Generation (Sora and Beyond):
OpenAI's Sora (2024) demonstrated remarkable video generation from text: a diffusion transformer operating on spacetime patches of compressed video, producing clips up to a minute long with coherent scenes, camera motion, and object persistence.
Video generation represents a major scaling challenge: a minute of video at 30fps is ~1,800 frames, each equivalent to generating an image.
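A rough back-of-envelope calculation makes this scaling challenge concrete; the patch-grid size below is an illustrative assumption, not any specific model's configuration.

```python
# Back-of-envelope token count for one minute of generated video.
fps = 30
seconds = 60
frames = fps * seconds                # 1,800 frames

# Assume each frame becomes a 32 x 32 grid of latent patches (illustrative only).
patches_per_frame = 32 * 32           # 1,024 tokens per frame

total_tokens = frames * patches_per_frame
print(f"{frames} frames -> ~{total_tokens:,} tokens")  # ~1,843,200 tokens per minute
```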
Some researchers view video generation as a step toward 'world models'—systems that understand and can simulate how the world works. If a model can generate plausible video, it must have some understanding of physics, object permanence, and causality. Whether current models truly have such understanding or merely approximate surface statistics is debated.
The logical endpoint of multimodality is a unified architecture that handles all modalities natively—taking any combination of inputs and producing any combination of outputs.
The Vision: Any-to-Any Models
Imagine a single model that can accept any mix of text, images, audio, and video as input and produce any mix as output—describe an image aloud, illustrate a spoken story, or edit a video from a written instruction.
This is 'any-to-any' multimodality. Rather than separate models for each modality pair, one model handles everything.
Approaches to Unified Architectures:
Approach 1: Unified Tokenization
Reduce all modalities to tokens in a shared vocabulary: text via subword tokenization (e.g., BPE), images via a discrete autoencoder codebook (VQ-VAE/VQGAN), and audio via a neural codec.
Then train a transformer on sequences mixing these tokens. This is the approach of models like Chameleon and CM3Leon.
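A small sketch of what such a shared vocabulary can look like is shown below; the vocabulary and codebook sizes are assumptions chosen purely for illustration.

```python
# Illustrative shared vocabulary layout for unified tokenization (assumed sizes).
TEXT_VOCAB = 50_000        # e.g., BPE text tokens
IMAGE_CODES = 8_192        # e.g., VQ-VAE / VQGAN codebook entries
AUDIO_CODES = 4_096        # e.g., neural audio codec entries

IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_CODES

def to_shared_vocab(token_id: int, modality: str) -> int:
    """Map a modality-local token id into the single shared vocabulary."""
    if modality == "text":
        return token_id
    if modality == "image":
        return IMAGE_OFFSET + token_id
    if modality == "audio":
        return AUDIO_OFFSET + token_id
    raise ValueError(f"unknown modality: {modality}")

# One mixed sequence the transformer trains on: text, then image, then audio tokens.
sequence = (
    [to_shared_vocab(t, "text") for t in [17, 942, 3001]] +
    [to_shared_vocab(c, "image") for c in [5, 812, 4096]] +
    [to_shared_vocab(a, "audio") for a in [33, 1024]]
)
print(sequence)
```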
Approach 2: Modality-Specific Encoders/Decoders, Shared Core
Use specialized encoders and decoders for each modality, but route everything through a shared transformer core: modality-specific encoders map inputs into a common token space, the shared core performs cross-modal reasoning, and modality-specific decoders render outputs.
This is similar to Gemini's approach.
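The sketch below illustrates the shape of this design—separate encoders and decoder heads around one shared transformer—under assumed dimensions; it is a conceptual skeleton, not any production architecture.

```python
import torch
import torch.nn as nn

class SharedCoreModel(nn.Module):
    """Conceptual any-to-any skeleton: per-modality adapters around a shared core."""

    def __init__(self, d_model: int = 1024):
        super().__init__()
        # One lightweight encoder per input modality maps features into shared token space.
        self.encoders = nn.ModuleDict({
            'text':  nn.Embedding(50_000, d_model),   # token ids -> embeddings
            'image': nn.Linear(768, d_model),         # patch features -> embeddings
            'audio': nn.Linear(512, d_model),         # frame features -> embeddings
        })
        # Shared reasoning core: a standard transformer stack.
        self.core = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One output head per target modality.
        self.decoders = nn.ModuleDict({
            'text':  nn.Linear(d_model, 50_000),   # next-token logits
            'image': nn.Linear(d_model, 8_192),    # image codebook logits
        })

    def forward(self, inputs: dict[str, torch.Tensor], target: str) -> torch.Tensor:
        # Encode each modality separately, then concatenate into one sequence.
        tokens = [self.encoders[m](x) for m, x in inputs.items()]
        sequence = torch.cat(tokens, dim=1)      # [batch, total_tokens, d_model]
        hidden = self.core(sequence)             # shared cross-modal processing
        return self.decoders[target](hidden)     # decode in the requested modality

# Usage: text ids plus image patch features in, image-token logits out.
model = SharedCoreModel()
logits = model({'text': torch.randint(0, 50_000, (2, 16)),
                'image': torch.randn(2, 64, 768)}, target='image')
print(logits.shape)  # [2, 80, 8192]
```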
Approach 3: External Tools and Routing
A language model routes to specialized models: it decides when a request needs an image generator, speech synthesizer, or video model, forwards a prompt to that specialist, and weaves the result into its response.
This is pragmatic but not truly unified—it's orchestration of specialists.
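A minimal routing sketch follows; every interface in it is a hypothetical placeholder rather than a real vendor API.

```python
# Tool-based routing sketch; llm_decide, llm_answer, and the tool functions are
# hypothetical placeholders, not a real API.
from typing import Callable, Optional

def route(
    llm_decide: Callable[[str], dict],          # returns e.g. {"tool": "image_gen", "prompt": "..."}
    llm_answer: Callable[[str], str],           # plain text answer when no tool is needed
    tools: dict[str, Callable[[str], bytes]],   # specialist models keyed by name
    user_message: str,
) -> tuple[str, Optional[bytes]]:
    """The language model plans; a specialist executes; the result is attached to the reply."""
    plan = llm_decide(user_message)
    tool_name = plan.get("tool")
    if tool_name in tools:
        artifact = tools[tool_name](plan["prompt"])   # e.g., generated image bytes
        return f"Generated with the '{tool_name}' tool.", artifact
    return llm_answer(user_message), None
```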
| Approach | Example Models | Pros | Cons |
|---|---|---|---|
| Unified Tokenization | Chameleon, CM3Leon, Emu | Elegant, single training objective | Tokenization quality limits performance |
| Shared Core + Specialists | Gemini, GPT-4o | Best of both worlds | Complex engineering |
| Tool-Based Routing | ChatGPT + DALL-E | Leverages best specialist models | Not truly unified, integration overhead |
| Pixel-Level Autoregression | Research stage | True end-to-end, no tokenization loss | Extremely compute intensive |
GPT-4o: A Step Toward True Unification
OpenAI's GPT-4o (2024) represents significant progress toward any-to-any: a single model trained end-to-end across text, vision, and audio, accepting any combination of those inputs and responding with text, audio, or images, with audio response latency approaching that of human conversation.
This allows natural interactions like: 'Look at this [shows image] and describe it in a song [sings response].'
Underlying most unified approaches is a deep insight: transformers process sequences of tokens, and any information can be tokenized. The challenge is finding tokenizations that preserve the essential information of each modality while remaining compatible with transformer processing.
Multimodal AI has made remarkable progress but faces significant open challenges.
Current Limitations:
Compositional Understanding:
Even the best multimodal models struggle with compositional scenarios: binding the right attributes to the right objects ('a red cube on a blue sphere'), counting specific quantities, and satisfying multiple spatial constraints at once.
Grounding and Factuality:
Multimodal hallucination is multifaceted: models may describe objects that are not present in an image, misread embedded text, or confidently assert visual details they cannot actually verify.
Consistency Across Modalities:
Information may not transfer cleanly between modalities: a model may answer a question correctly when the facts are given as text yet fail when the same information appears in an image, or caption a scene accurately but fail to reason over its own description.
The Path Forward:
Several trends are likely to shape multimodal AI's future:
Continued Scaling: Following the unimodal playbook, simply scaling up data, compute, and parameters will likely continue to improve capabilities.
Native Multimodality: More models trained from scratch on mixed modalities rather than bolting on encoders post-hoc.
Real-Time Interaction: Models that process continuous audio/video streams in real-time, enabling more natural interaction.
Embodied AI: Connecting multimodal understanding to robotic action and physical world interaction.
Efficiency Improvements: Making multimodal models practical to deploy—smaller, faster, cheaper while maintaining capability.
Multimodal capabilities introduce new safety challenges: generating deepfake images/videos, bypassing text-based safety filters with visual inputs, and creating harmful multimedia content. As capabilities advance, safety measures must keep pace.
We have explored the multimodal frontier of foundation models—from contrastive learning through generation to unified architectures. Let's consolidate the key insights: contrastive pretraining (CLIP) aligns modalities in a shared embedding space; diffusion models turn text into images by iterative denoising; VLMs graft vision encoders onto LLMs for visual reasoning; and unified, natively multimodal architectures point toward any-to-any models.
What's Next:
Having explored scale, emergence, LLMs, and multimodality, we conclude this module by examining the foundation model paradigm itself—the broader implications of these models for AI research, applications, and society. We'll synthesize the themes of this module into a cohesive understanding of what foundation models are and why they matter.
You now understand the multimodal landscape of foundation models—from CLIP's contrastive learning through diffusion-based generation to unified any-to-any architectures. This knowledge is essential as multimodality becomes the default paradigm for frontier AI systems.