Human intelligence is inherently multimodal. We see, hear, read, and speak—integrating information across senses seamlessly. We don't just understand text about cats; we recognize cats in images, hear them meow, and can describe what we see.
Multimodal AI aims to replicate this integration. Rather than separate models for vision, language, and audio, multimodal models learn to process and relate information across modalities within a single architecture. This represents the natural extension of the foundation model paradigm—from 'one model for text' to 'one model for everything.'
This page explores multimodal foundation models: how they work, why they matter, and what they can do that unimodal models cannot.
By the end of this page, you will understand: (1) the architecture of vision-language models like CLIP, (2) contrastive learning for multimodal alignment, (3) generative multimodal models (DALL-E, Stable Diffusion, GPT-4V), (4) unified architectures that process all modalities, (5) cross-modal capabilities and applications, and (6) the challenges unique to multimodal learning.
Before diving into architectures, let's establish why multimodal models represent such an important direction for AI.
The Limitations of Unimodal Learning:
Models trained on single modalities face fundamental constraints:
Language-only models: they learn about the visual and physical world only through text descriptions—they can discuss color, shape, and motion without ever perceiving them.
Vision-only models: they learn rich perceptual features but lack the abstract concepts, structured knowledge, and reasoning that language provides.
The Information Density Argument:
Different modalities carry different types and densities of information:
| Modality | Typical raw information rate | What it captures well |
|---|---|---|
| Text | ~30-60 bits/s (reading) | Abstract concepts, structured knowledge, reasoning |
| Audio | ~44,100 × 16 bits/s (CD audio) | Temporal patterns, emotion, music, speech |
| Image | Megapixels per frame | Spatial relationships, appearance, scene composition |
| Video | GB per minute | Dynamics, actions, causality over time |
A model that processes all modalities can access information that's natural in each: abstract reasoning from text, spatial understanding from images, temporal dynamics from audio/video.
The Scaling Hypothesis for Multimodality:
Just as scale has proven crucial for unimodal language models, researchers hypothesize that scaling multimodal models—more modalities, more data, more parameters—will produce emergent cross-modal capabilities not achievable otherwise. Early evidence from GPT-4V and Gemini supports this view.
Some researchers argue that grounding in perception is necessary for true understanding—that 'meaning' requires connection to the world, not just text patterns. Multimodal models represent a step toward this grounding, though they still lack physical embodiment and interaction with the real world.
CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, demonstrated that you can learn powerful visual representations by training on natural language supervision at scale. It became a foundational component for many subsequent multimodal systems.
The Core Idea:
Instead of training a vision model with fixed image categories (ImageNet's 1000 classes), CLIP learns a joint embedding space where images and their text descriptions are close together:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPModel(nn.Module):
    """
    Simplified CLIP architecture.

    Key insight: Learn a joint embedding space through contrastive learning.
    Images and their descriptions should have similar embeddings.
    """

    def __init__(
        self,
        vision_encoder,   # e.g., ViT-B/32
        text_encoder,     # e.g., Transformer
        embed_dim: int = 512,
        temperature: float = 0.07,
    ):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder

        # Projection heads to shared embedding space
        self.vision_projection = nn.Linear(vision_encoder.output_dim, embed_dim)
        self.text_projection = nn.Linear(text_encoder.output_dim, embed_dim)

        # Learnable temperature parameter
        self.logit_scale = nn.Parameter(torch.log(torch.tensor(1.0 / temperature)))

    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        """Encode images to joint embedding space."""
        features = self.vision_encoder(images)
        embeddings = self.vision_projection(features)
        return F.normalize(embeddings, dim=-1)

    def encode_text(self, text_tokens: torch.Tensor) -> torch.Tensor:
        """Encode text to joint embedding space."""
        features = self.text_encoder(text_tokens)
        embeddings = self.text_projection(features)
        return F.normalize(embeddings, dim=-1)

    def forward(
        self,
        images: torch.Tensor,
        text_tokens: torch.Tensor,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Compute similarity scores between all image-text pairs.
        Returns logits for contrastive loss.
        """
        image_embeddings = self.encode_image(images)      # [B, embed_dim]
        text_embeddings = self.encode_text(text_tokens)   # [B, embed_dim]

        # Scale by learnable temperature
        logit_scale = self.logit_scale.exp()

        # Compute similarity matrix: [B, B]
        # logits[i, j] = similarity between image i and text j
        logits_per_image = logit_scale * image_embeddings @ text_embeddings.T
        logits_per_text = logits_per_image.T

        return logits_per_image, logits_per_text


def clip_contrastive_loss(
    logits_per_image: torch.Tensor,
    logits_per_text: torch.Tensor,
) -> torch.Tensor:
    """
    InfoNCE contrastive loss for CLIP training.

    Key insight: The diagonal of the similarity matrix contains matching pairs.
    We want to maximize these while minimizing off-diagonal (non-matching) pairs.

    This is equivalent to a symmetric cross-entropy loss where:
    - For each image, predict which text it matches
    - For each text, predict which image it matches
    """
    batch_size = logits_per_image.shape[0]

    # Labels: diagonal elements (index i matches index i)
    labels = torch.arange(batch_size, device=logits_per_image.device)

    # Cross-entropy: each image should match its corresponding text
    loss_image = F.cross_entropy(logits_per_image, labels)

    # Cross-entropy: each text should match its corresponding image
    loss_text = F.cross_entropy(logits_per_text, labels)

    # Symmetric loss
    return (loss_image + loss_text) / 2


# Zero-shot classification with CLIP
def zero_shot_classify(
    model: CLIPModel,
    image: torch.Tensor,
    class_names: list[str],
    prompt_template: str = "A photo of a {}",
) -> list[float]:
    """
    Classify an image into arbitrary text categories.

    This is remarkably powerful: no training on these specific classes,
    just comparison to text descriptions.
    """
    # Encode the image
    image_embedding = model.encode_image(image.unsqueeze(0))

    # Encode all class descriptions
    text_prompts = [prompt_template.format(name) for name in class_names]
    text_tokens = tokenize(text_prompts)  # Tokenization step
    text_embeddings = model.encode_text(text_tokens)

    # Compute similarities
    similarities = (image_embedding @ text_embeddings.T).squeeze(0)

    # Convert to probabilities
    probs = F.softmax(similarities * model.logit_scale.exp(), dim=-1)

    return probs.tolist()


# Example usage:
# probs = zero_shot_classify(clip_model, dog_image,
#                            ["dog", "cat", "car", "tree", "person"])
# Result: [0.92, 0.05, 0.01, 0.01, 0.01] - recognizes it's a dog!
```

Why CLIP Works:
Scale of Training Data: CLIP was trained on 400 million image-text pairs from the internet. This dwarfs any manually curated dataset.
Natural Language Supervision: Instead of fixed labels, the supervision comes from natural language—capturing richer descriptions than category labels.
Contrastive Objective: The InfoNCE loss creates a strong learning signal by contrasting matching pairs against many non-matching pairs in each batch.
Emergent Concepts: CLIP learns concepts never explicitly labeled because they appear in natural text descriptions (actions, attributes, relationships).
CLIP's Revolutionary Capabilities:
CLIP's vision encoder has become a building block for many systems: it provides visual features for DALL-E, Stable Diffusion, and many VLMs. The pretrained CLIP embeddings are often more useful than training visual representations from scratch.
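For readers who want to try zero-shot classification themselves, below is a minimal sketch using the Hugging Face transformers implementation of CLIP; the checkpoint name and the local image path are illustrative assumptions, not part of the original discussion.

```python
# pip install transformers pillow torch  (assumed environment)
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # hypothetical local image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Tokenize the prompts and preprocess the image in one call
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

# logits_per_image has shape [1, num_labels]; softmax gives class probabilities
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```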
While CLIP understands image-text relationships, it doesn't generate images. A separate line of research has developed text-to-image generation—models that create images from text descriptions.
DALL-E and Autoregressive Image Generation (2021):
The original DALL-E took an autoregressive approach: a discrete VAE compresses each image into a grid of discrete tokens drawn from a fixed codebook, and a transformer is trained to predict those image tokens autoregressively, conditioned on the text tokens.
This worked but had limitations: autoregressive generation was slow, and the discrete bottleneck lost high-frequency details.
Diffusion Models: The New Paradigm (2022-present):
Diffusion models revolutionized image generation with a different approach: corrupt training images with progressively more Gaussian noise, train a network to predict and remove that noise at every level, then generate by starting from pure noise and denoising step by step, guided by a text embedding:
"""Diffusion models for image generation (conceptual overview). Key idea: Learn to denoise images at every noise level.Generation: Start from pure noise, progressively denoise.""" import torchimport torch.nn as nn class DiffusionModel(nn.Module): """ Simplified diffusion for image generation. Training: Add noise to images, predict the noise. Inference: Start from noise, iteratively remove predicted noise. """ def __init__(self, denoiser: nn.Module, num_timesteps: int = 1000): super().__init__() self.denoiser = denoiser # Usually a U-Net self.num_timesteps = num_timesteps # Precompute noise schedule (beta_t values) self.register_buffer( 'betas', torch.linspace(1e-4, 0.02, num_timesteps) ) self.register_buffer('alphas', 1.0 - self.betas) self.register_buffer('alpha_bars', torch.cumprod(self.alphas, dim=0)) def forward_diffusion( self, x_0: torch.Tensor, t: torch.Tensor, ) -> tuple[torch.Tensor, torch.Tensor]: """ Add noise to images according to noise schedule. x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise """ noise = torch.randn_like(x_0) alpha_bar_t = self.alpha_bars[t][:, None, None, None] x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * noise return x_t, noise def training_step( self, images: torch.Tensor, text_embeddings: torch.Tensor = None, # For conditional generation ) -> torch.Tensor: """ Training: Predict the noise that was added. Simple objective: MSE between predicted and actual noise. """ batch_size = images.shape[0] # Sample random timesteps t = torch.randint(0, self.num_timesteps, (batch_size,), device=images.device) # Add noise x_t, noise = self.forward_diffusion(images, t) # Predict noise (conditioned on timestep and optionally text) predicted_noise = self.denoiser(x_t, t, text_embeddings) # MSE loss loss = nn.functional.mse_loss(predicted_noise, noise) return loss @torch.no_grad() def generate( self, text_embeddings: torch.Tensor, image_shape: tuple = (3, 512, 512), guidance_scale: float = 7.5, ) -> torch.Tensor: """ Generate images from text using iterative denoising. Classifier-free guidance: Interpolate between conditional and unconditional predictions for better text adherence. 
""" batch_size = text_embeddings.shape[0] # Start from pure noise x_t = torch.randn(batch_size, *image_shape, device=text_embeddings.device) # Iteratively denoise for t in reversed(range(self.num_timesteps)): t_tensor = torch.full((batch_size,), t, device=x_t.device) # Classifier-free guidance: predict with and without text noise_cond = self.denoiser(x_t, t_tensor, text_embeddings) noise_uncond = self.denoiser(x_t, t_tensor, None) # Unconditional # Guided noise prediction predicted_noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond) # Denoise one step x_t = self.denoise_step(x_t, predicted_noise, t) return x_t def denoise_step(self, x_t, predicted_noise, t): """Single denoising step (DDPM sampling).""" alpha_t = self.alphas[t] alpha_bar_t = self.alpha_bars[t] beta_t = self.betas[t] # Predicted x_0 from x_t and predicted noise x_0_pred = (x_t - torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(alpha_bar_t) # Add noise for next timestep (unless t=0) if t > 0: noise = torch.randn_like(x_t) alpha_bar_t_minus_1 = self.alpha_bars[t - 1] # Posterior mean and variance mean = (torch.sqrt(alpha_bar_t_minus_1) * beta_t * x_0_pred + torch.sqrt(alpha_t) * (1 - alpha_bar_t_minus_1) * x_t) / (1 - alpha_bar_t) var = beta_t * (1 - alpha_bar_t_minus_1) / (1 - alpha_bar_t) x_t_minus_1 = mean + torch.sqrt(var) * noise else: x_t_minus_1 = x_0_pred return x_t_minus_1| Model | Organization | Approach | Key Features |
|---|---|---|---|
| DALL-E | OpenAI | VQ-VAE + Autoregressive | First large-scale text-to-image |
| DALL-E 2 | OpenAI | CLIP + Diffusion | Higher quality, editing capabilities |
| DALL-E 3 | OpenAI | Diffusion + GPT-4 | Better prompt following, integrated with ChatGPT |
| Stable Diffusion | Stability AI | Latent Diffusion | Open-source, widely adopted |
| Midjourney | Midjourney | Proprietary Diffusion | Artistic style, aesthetic focus |
| Imagen | Google | Diffusion + T5 | Strong text encoder, high fidelity |
| Flux | Black Forest Labs | Rectified Flow | Fast, high quality, open weights |
Key Innovations in Text-to-Image:
Latent Diffusion: Stable Diffusion operates in a compressed latent space (via a VAE) rather than pixel space, dramatically reducing computation (a usage sketch follows this list).
Classifier-Free Guidance: Interpolate between conditional and unconditional generation to improve text adherence. Higher guidance = more faithful to prompt but potentially less diverse.
Cross-Attention for Conditioning: Text embeddings attend to image features via cross-attention, enabling fine-grained control over generation.
Rectified Flow: Newer models (Flux, SD3) use rectified flow trajectories for faster, higher-quality generation.
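To see how latent diffusion, classifier-free guidance, and text conditioning come together in practice, here is a hedged usage sketch with the open-source diffusers library; the specific checkpoint name, output path, and GPU availability are assumptions.

```python
# pip install diffusers transformers accelerate  (assumed environment with a CUDA GPU)
import torch
from diffusers import StableDiffusionPipeline

# Latent diffusion pipeline: a VAE, a U-Net denoiser, and a CLIP text encoder under the hood.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    guidance_scale=7.5,        # classifier-free guidance strength (text adherence vs. diversity)
    num_inference_steps=30,    # denoising steps, performed in the compressed latent space
).images[0]
image.save("lighthouse.png")
```

The `guidance_scale` knob corresponds directly to the interpolation between conditional and unconditional predictions described above.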
Despite impressive results, text-to-image models struggle with: counting (generating exactly N objects), spatial relationships (left/right, above/below), text rendering (generating legible words in images), and compositional scenarios (multiple objects with specific attributes). These remain active research areas.
Vision-Language Models (VLMs) combine the text generation capabilities of LLMs with the ability to understand images. They can answer questions about images, describe visual content, and reason across text and images.
The General VLM Architecture:
Most VLMs follow a common pattern: a pretrained vision encoder converts each image into a sequence of features, a projection (or cross-attention) module maps those features into the LLM's embedding space, and the LLM then processes visual and text tokens together.
The key design choice is how to integrate visual information into the LLM.
```python
import torch
import torch.nn as nn


class VisionLanguageModel(nn.Module):
    """
    Simplified Vision-Language Model architecture.

    This follows the pattern used by LLaVA, GPT-4V, Gemini, etc.
    Images are encoded and projected into the LLM's embedding space.
    """

    def __init__(
        self,
        vision_encoder,                # Pretrained, e.g., CLIP ViT
        language_model,                # Pretrained LLM
        vision_dim: int = 768,         # Vision encoder output dim
        llm_dim: int = 4096,           # LLM hidden dim
        num_vision_tokens: int = 256,  # Vision tokens per image
    ):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model

        # Projection: map vision features to LLM space
        # This is the key learnable component for modality alignment
        self.vision_projection = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

        # Alternative: Use cross-attention for finer-grained integration
        # self.cross_attention = nn.MultiheadAttention(llm_dim, 8)

        self.num_vision_tokens = num_vision_tokens

    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        """
        Encode images to sequences of tokens in LLM space.

        Returns: [batch, num_vision_tokens, llm_dim]
        """
        # Get patch features from vision encoder
        vision_features = self.vision_encoder(images)  # [batch, patches, vision_dim]

        # Project to LLM space
        vision_tokens = self.vision_projection(vision_features)  # [batch, patches, llm_dim]

        return vision_tokens

    def forward(
        self,
        images: torch.Tensor,
        text_tokens: torch.Tensor,
        labels: torch.Tensor = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass with interleaved image and text.

        Images become a sequence of tokens that the LLM processes
        alongside text tokens.
        """
        batch_size = text_tokens.shape[0]

        # Encode images to token sequences
        vision_tokens = self.encode_image(images)

        # Get text embeddings
        text_embeddings = self.language_model.embed_tokens(text_tokens)

        # Interleave: [IMAGE_START] [vision tokens] [IMAGE_END] [text tokens]
        # In practice, special tokens mark image positions
        combined = torch.cat([vision_tokens, text_embeddings], dim=1)

        # Forward through LLM
        outputs = self.language_model(inputs_embeds=combined)

        # Only compute loss on text portion
        if labels is not None:
            # Shift for next-token prediction, only on text tokens
            text_logits = outputs.logits[:, self.num_vision_tokens:, :]
            loss = nn.functional.cross_entropy(
                text_logits.reshape(-1, text_logits.size(-1)),
                labels.reshape(-1),
            )
            return outputs, loss

        return outputs, None

    def generate(
        self,
        image: torch.Tensor,
        prompt: str,
        max_length: int = 512,
    ) -> str:
        """Generate text response given image and text prompt."""
        vision_tokens = self.encode_image(image.unsqueeze(0))
        prompt_tokens = self.tokenizer(prompt)

        # Combine and generate
        combined = self.prepare_inputs(vision_tokens, prompt_tokens)
        output_ids = self.language_model.generate(
            inputs_embeds=combined,
            max_length=max_length,
        )
        return self.tokenizer.decode(output_ids[0])


# Training approaches for VLMs:
training_stages = {
    'stage_1': {
        'name': 'Alignment Pretraining',
        'frozen': ['vision_encoder', 'language_model'],
        'trained': ['vision_projection'],
        'data': 'Image-caption pairs (CC3M, LAION)',
        'objective': 'Learn to project visual features into LLM space',
    },
    'stage_2': {
        'name': 'Instruction Tuning',
        'frozen': ['vision_encoder'],
        'trained': ['vision_projection', 'language_model'],
        'data': 'Visual instruction following (LLaVA-Instruct, etc.)',
        'objective': 'Learn to follow multimodal instructions',
    },
}
```

| Model | Organization | Architecture | Key Capabilities |
|---|---|---|---|
| GPT-4V | OpenAI | Vision encoder + GPT-4 | State-of-the-art visual reasoning, chat |
| Gemini Ultra | Google | Native multimodal (unified) | Natively trained on mixed modalities |
| Claude 3.5 | Anthropic | Vision encoder + Claude | Strong document understanding |
| LLaVA | Academic/Open | CLIP + LLaMA | Open weights, instruction following |
| Qwen-VL | Alibaba | Vision encoder + Qwen | Strong multilingual visual understanding |
| InternVL | Open | InternViT + InternLM | Large-scale open VLM |
What VLMs Can Do:
Visual Question Answering: Answer questions about image content, from simple ('What color is the car?') to complex ('Why might the person be upset?'). A usage sketch follows this list.
Document Understanding: Read and analyze documents, charts, tables, and screenshots—extracting and reasoning about structured information.
Visual Reasoning: Solve problems requiring visual understanding: geometry, physics simulations, spatial reasoning.
Multi-Image Reasoning: Compare images, identify differences, track objects across frames.
Grounded Generation: Generate text that accurately describes or refers to specific image regions.
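As a concrete illustration of these capabilities, the sketch below queries an open VLM checkpoint through Hugging Face transformers; the model ID, prompt format, input file, and hardware requirements are assumptions rather than a definitive recipe.

```python
# pip install transformers accelerate pillow torch  (assumed environment with a GPU)
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative open VLM checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png")  # hypothetical local file
prompt = "USER: <image>\nWhat trend does this chart show? ASSISTANT:"  # LLaVA-1.5-style prompt

# The processor inserts the image patches where the <image> placeholder appears
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```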
Models like GPT-4V 'stitch' a vision encoder onto an LLM. Models like Gemini are 'natively multimodal'—trained from scratch on mixed modality data. Native multimodality may enable better cross-modal reasoning, though stitched approaches can leverage strong unimodal pretrained components.
The multimodal paradigm extends beyond text and images to encompass audio, video, and other modalities.
Audio Models:
Speech Recognition (Whisper):
OpenAI's Whisper demonstrated that scaling a straightforward encoder-decoder transformer on roughly 680,000 hours of multilingual, multitask speech data produces robust transcription and translation across languages, accents, and noisy conditions.
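A minimal transcription sketch using the open-source openai-whisper package is shown below; the chosen model size and the audio file name are assumptions.

```python
# pip install openai-whisper  (assumes ffmpeg is installed)
import whisper

model = whisper.load_model("base")          # small multilingual checkpoint
result = model.transcribe("meeting.mp3")    # hypothetical local audio file
print(result["language"])                   # detected language
print(result["text"])                       # transcription
```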
Text-to-Speech:
Models like Tortoise, XTTS, and StyleTTS synthesize natural-sounding speech from text, complete with emotion, prosody, and speaker identity control.
Audio Language Models:
Models like AudioLM and MusicLM generate audio (music, sound effects) from text descriptions, extending generative AI to the audio domain.
| Modality | Understanding | Generation |
|---|---|---|
| Text | LLMs (GPT, Claude, LLaMA) | LLMs (same models) |
| Images | CLIP, VLMs (GPT-4V, Gemini) | DALL-E, Stable Diffusion, Midjourney |
| Audio/Speech | Whisper, wav2vec | Tortoise, ElevenLabs, XTTS |
| Music | Jukebox (limited) | MusicLM, Suno, Udio |
| Video | Video LLMs (VideoLLaMA, etc.) | Sora, Gen-2, Pika |
| 3D/Mesh | Point-E (limited) | Point-E, GET3D |
| Actions/Robotics | RT-2 | RT-2, Gato |
Video Understanding and Generation:
Video Understanding:
Video adds temporal complexity. Models must track objects, understand actions, and reason about cause and effect over time.
Approaches include sampling frames and encoding them with image encoders, adding temporal attention or 3D convolutions to model motion, and training video-text models on captioned clips.
Video Generation (Sora and Beyond):
OpenAI's Sora (2024) demonstrated remarkable video generation from text: a diffusion transformer operating on spacetime patches of compressed video, producing clips up to a minute long with coherent scenes, camera motion, and object persistence.
Video generation represents a major scaling challenge: a minute of video at 30fps is ~1,800 frames, each equivalent to generating an image.
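A rough back-of-envelope calculation makes this scaling challenge concrete; the patch-grid size below is an illustrative assumption, not any specific model's configuration.

```python
# Back-of-envelope token count for one minute of generated video.
fps = 30
seconds = 60
frames = fps * seconds                # 1,800 frames

# Assume each frame becomes a 32 x 32 grid of latent patches (illustrative only).
patches_per_frame = 32 * 32           # 1,024 tokens per frame

total_tokens = frames * patches_per_frame
print(f"{frames} frames -> ~{total_tokens:,} tokens")  # ~1,843,200 tokens per minute
```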
Some researchers view video generation as a step toward 'world models'—systems that understand and can simulate how the world works. If a model can generate plausible video, it must have some understanding of physics, object permanence, and causality. Whether current models truly have such understanding or merely approximate surface statistics is debated.
The logical endpoint of multimodality is a unified architecture that handles all modalities natively—taking any combination of inputs and producing any combination of outputs.
The Vision: Any-to-Any Models
Imagine a single model that can accept any mix of text, images, audio, and video as input and produce any mix as output—describe an image aloud, illustrate a spoken story, or edit a video from a written instruction.
This is 'any-to-any' multimodality. Rather than separate models for each modality pair, one model handles everything.
Approaches to Unified Architectures:
Approach 1: Unified Tokenization
Reduce all modalities to tokens in a shared vocabulary: text via subword tokenization (e.g., BPE), images via a discrete autoencoder codebook (VQ-VAE/VQGAN), and audio via a neural codec.
Then train a transformer on sequences mixing these tokens. This is the approach of models like Chameleon and CM3Leon.
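A small sketch of what such a shared vocabulary can look like is shown below; the vocabulary and codebook sizes are assumptions chosen purely for illustration.

```python
# Illustrative shared vocabulary layout for unified tokenization (assumed sizes).
TEXT_VOCAB = 50_000        # e.g., BPE text tokens
IMAGE_CODES = 8_192        # e.g., VQ-VAE / VQGAN codebook entries
AUDIO_CODES = 4_096        # e.g., neural audio codec entries

IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_CODES

def to_shared_vocab(token_id: int, modality: str) -> int:
    """Map a modality-local token id into the single shared vocabulary."""
    if modality == "text":
        return token_id
    if modality == "image":
        return IMAGE_OFFSET + token_id
    if modality == "audio":
        return AUDIO_OFFSET + token_id
    raise ValueError(f"unknown modality: {modality}")

# One mixed sequence the transformer trains on: text, then image, then audio tokens.
sequence = (
    [to_shared_vocab(t, "text") for t in [17, 942, 3001]] +
    [to_shared_vocab(c, "image") for c in [5, 812, 4096]] +
    [to_shared_vocab(a, "audio") for a in [33, 1024]]
)
print(sequence)
```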
Approach 2: Modality-Specific Encoders/Decoders, Shared Core
Use specialized encoders and decoders for each modality, but route everything through a shared transformer core: modality-specific encoders map inputs into a common token space, the shared core performs cross-modal reasoning, and modality-specific decoders render outputs.
This is similar to Gemini's approach.
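The sketch below illustrates the shape of this design—separate encoders and decoder heads around one shared transformer—under assumed dimensions; it is a conceptual skeleton, not any production architecture.

```python
import torch
import torch.nn as nn

class SharedCoreModel(nn.Module):
    """Conceptual any-to-any skeleton: per-modality adapters around a shared core."""

    def __init__(self, d_model: int = 1024):
        super().__init__()
        # One lightweight encoder per input modality maps features into shared token space.
        self.encoders = nn.ModuleDict({
            'text':  nn.Embedding(50_000, d_model),   # token ids -> embeddings
            'image': nn.Linear(768, d_model),         # patch features -> embeddings
            'audio': nn.Linear(512, d_model),         # frame features -> embeddings
        })
        # Shared reasoning core: a standard transformer stack.
        self.core = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One output head per target modality.
        self.decoders = nn.ModuleDict({
            'text':  nn.Linear(d_model, 50_000),   # next-token logits
            'image': nn.Linear(d_model, 8_192),    # image codebook logits
        })

    def forward(self, inputs: dict[str, torch.Tensor], target: str) -> torch.Tensor:
        # Encode each modality separately, then concatenate into one sequence.
        tokens = [self.encoders[m](x) for m, x in inputs.items()]
        sequence = torch.cat(tokens, dim=1)      # [batch, total_tokens, d_model]
        hidden = self.core(sequence)             # shared cross-modal processing
        return self.decoders[target](hidden)     # decode in the requested modality

# Usage: text ids plus image patch features in, image-token logits out.
model = SharedCoreModel()
logits = model({'text': torch.randint(0, 50_000, (2, 16)),
                'image': torch.randn(2, 64, 768)}, target='image')
print(logits.shape)  # [2, 80, 8192]
```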
Approach 3: External Tools and Routing
A language model routes to specialized models: it decides when a request needs an image generator, speech synthesizer, or video model, forwards a prompt to that specialist, and weaves the result into its response.
This is pragmatic but not truly unified—it's orchestration of specialists.
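A minimal routing sketch follows; every interface in it is a hypothetical placeholder rather than a real vendor API.

```python
# Tool-based routing sketch; llm_decide, llm_answer, and the tool functions are
# hypothetical placeholders, not a real API.
from typing import Callable, Optional

def route(
    llm_decide: Callable[[str], dict],          # returns e.g. {"tool": "image_gen", "prompt": "..."}
    llm_answer: Callable[[str], str],           # plain text answer when no tool is needed
    tools: dict[str, Callable[[str], bytes]],   # specialist models keyed by name
    user_message: str,
) -> tuple[str, Optional[bytes]]:
    """The language model plans; a specialist executes; the result is attached to the reply."""
    plan = llm_decide(user_message)
    tool_name = plan.get("tool")
    if tool_name in tools:
        artifact = tools[tool_name](plan["prompt"])   # e.g., generated image bytes
        return f"Generated with the '{tool_name}' tool.", artifact
    return llm_answer(user_message), None
```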
| Approach | Example Models | Pros | Cons |
|---|---|---|---|
| Unified Tokenization | Chameleon, CM3Leon, Emu | Elegant, single training objective | Tokenization quality limits performance |
| Shared Core + Specialists | Gemini, GPT-4o | Best of both worlds | Complex engineering |
| Tool-Based Routing | ChatGPT + DALL-E | Leverages best specialist models | Not truly unified, integration overhead |
| Pixel-Level Autoregression | Research stage | True end-to-end, no tokenization loss | Extremely compute intensive |
GPT-4o: A Step Toward True Unification
OpenAI's GPT-4o (2024) represents significant progress toward any-to-any: a single model trained end-to-end across text, vision, and audio, accepting any combination of those inputs and responding with text, audio, or images, with audio response latency approaching that of human conversation.
This allows natural interactions like: 'Look at this [shows image] and describe it in a song [sings response].'
Underlying most unified approaches is a deep insight: transformers process sequences of tokens, and any information can be tokenized. The challenge is finding tokenizations that preserve the essential information of each modality while remaining compatible with transformer processing.
Multimodal AI has made remarkable progress but faces significant open challenges.
Current Limitations:
Compositional Understanding:
Even the best multimodal models struggle with compositional scenarios: binding the right attributes to the right objects ('a red cube on a blue sphere'), counting specific quantities, and satisfying multiple spatial constraints at once.
Grounding and Factuality:
Multimodal hallucination is multifaceted: models may describe objects that are not present in an image, misread embedded text, or confidently assert visual details they cannot actually verify.
Consistency Across Modalities:
Information may not transfer cleanly between modalities: a model may answer a question correctly when the facts are given as text yet fail when the same information appears in an image, or caption a scene accurately but fail to reason over its own description.
The Path Forward:
Several trends are likely to shape multimodal AI's future:
Continued Scaling: Following the unimodal playbook, simply scaling up data, compute, and parameters will likely continue to improve capabilities.
Native Multimodality: More models trained from scratch on mixed modalities rather than bolting on encoders post-hoc.
Real-Time Interaction: Models that process continuous audio/video streams in real-time, enabling more natural interaction.
Embodied AI: Connecting multimodal understanding to robotic action and physical world interaction.
Efficiency Improvements: Making multimodal models practical to deploy—smaller, faster, cheaper while maintaining capability.
Multimodal capabilities introduce new safety challenges: generating deepfake images/videos, bypassing text-based safety filters with visual inputs, and creating harmful multimedia content. As capabilities advance, safety measures must keep pace.
We have explored the multimodal frontier of foundation models—from contrastive learning through generation to unified architectures. Let's consolidate the key insights: contrastive pretraining (CLIP) aligns modalities in a shared embedding space; diffusion models turn text into images by iterative denoising; VLMs graft vision encoders onto LLMs for visual reasoning; and unified, natively multimodal architectures point toward any-to-any models.
What's Next:
Having explored scale, emergence, LLMs, and multimodality, we conclude this module by examining the foundation model paradigm itself—the broader implications of these models for AI research, applications, and society. We'll synthesize the themes of this module into a cohesive understanding of what foundation models are and why they matter.
You now understand the multimodal landscape of foundation models—from CLIP's contrastive learning through diffusion-based generation to unified any-to-any architectures. This knowledge is essential as multimodality becomes the default paradigm for frontier AI systems.