The progression of multimodal AI has followed a natural trajectory: from specialized models for specific modality pairs (image-text, audio-video) toward unified architectures that can process any combination of modalities within a single model. This represents both a practical engineering goal—reducing the complexity of building and maintaining separate systems—and a scientific hypothesis about the nature of intelligence.
Unified multimodal models aim to accept and produce arbitrary combinations of modalities, share representations and knowledge across them, and replace a collection of specialized systems with a single model.
This page explores the architectures, training strategies, and key models that are pushing toward truly unified multimodal AI.
You will understand the design principles behind unified multimodal architectures, the key models advancing this frontier (GPT-4V, Gemini, LLaVA, Flamingo), and the technical challenges in building systems that seamlessly integrate multiple modalities.
Building a single model that handles multiple modalities requires careful architectural decisions. Several design principles have emerged from recent research.
Principle 1: Modality-Specific Tokenization, Shared Processing
The most successful unified architectures use modality-specific encoders to convert inputs into a common token format, followed by a shared transformer backbone:
```
Image → Vision Tokenizer → Visual Tokens ─┐
                                          ├→ Shared Transformer → Outputs
Text  → Text Tokenizer   → Text Tokens  ──┘
```
This preserves the efficiency of specialized encoders while enabling cross-modal reasoning in the shared layers.
Principle 2: Flexibility in Input and Output
Unified models should handle arbitrary input-output combinations: image-to-text, text-to-image, audio-plus-text-to-text, and so on, without a separate model for each pairing.
Principle 3: Scalable Training
Unified architectures must efficiently leverage diverse datasets, from web-scale image-caption pairs to text-only corpora, without requiring fully paired data across every modality.
Principle 4: Emergent Cross-Modal Capabilities
The unified model should develop abilities it was not explicitly trained for, such as reasoning that combines visual and textual evidence or transferring knowledge learned in one modality to tasks in another.
| Design Choice | Options | Trade-off |
|---|---|---|
| Token space | Shared vs. modality-specific | Flexibility vs. optimization |
| Encoder | Frozen vs. trained | Efficiency vs. adaptation |
| Fusion | Early vs. late | Interaction depth vs. compute |
| Decoder | Autoregressive vs. non-AR | Generality vs. speed |
A key challenge is that different modalities have vastly different information densities. A single image may contain as much information as hundreds of text tokens, but processing hundreds of visual tokens per image is computationally expensive. Solutions include pooling, resampling, or learned query tokens (Q-Former).
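The sketch below illustrates two of these reduction strategies with illustrative shapes: simple average pooling over adjacent patch tokens, and a small Perceiver-style resampler with learned query tokens. The module names and dimensions are assumptions for the example, not taken from any specific model.

```python
import torch
import torch.nn as nn

def pool_visual_tokens(features: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Average-pool groups of adjacent patch tokens (e.g., 576 -> 144)."""
    B, N, D = features.shape
    assert N % stride == 0, "number of visual tokens must divide evenly"
    return features.view(B, N // stride, stride, D).mean(dim=2)

class PerceiverResampler(nn.Module):
    """Flamingo/Q-Former-style idea: a fixed set of learned queries
    cross-attends to all visual tokens and returns a constant-length summary."""

    def __init__(self, dim: int = 1024, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        queries = self.latents.expand(features.shape[0], -1, -1)
        out, _ = self.attn(queries, features, features)
        return out  # (B, num_latents, dim) regardless of input length

features = torch.randn(2, 576, 1024)          # e.g., 24x24 patches from a ViT
print(pool_visual_tokens(features).shape)     # torch.Size([2, 144, 1024])
print(PerceiverResampler()(features).shape)   # torch.Size([2, 64, 1024])
```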
A practical approach to unified multimodal models is to connect pre-trained unimodal components with learnable adapters. This leverages existing powerful models while minimizing training costs.
The Adapter Pattern: keep a powerful pre-trained vision encoder and a pre-trained LLM (largely) frozen, and train a small bridging module that maps the encoder's outputs into the language model's input space.
LLaVA (Large Language-and-Vision Assistant):
LLaVA connects CLIP's vision encoder to Vicuna (LLaMA-based LLM):
```
Image → CLIP ViT → Visual Tokens
                ↓
     Linear Projector (Adapter)
                ↓
Visual + Text Tokens → LLM → Response
```
The projector is trained on image-caption pairs, then the full model is fine-tuned on visual instruction data.
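As a rough sketch of this two-stage recipe, the helper below toggles which components receive gradients at each stage. The `vision_encoder`, `projector`, and `llm` arguments are hypothetical placeholder modules, and the learning rates are only indicative.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int, vision_encoder, projector, llm):
    """Return an optimizer over the parameters that are trainable in this stage."""
    if stage == 1:
        # Stage 1: align on image-caption pairs; only the projector learns.
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(projector, True)
    elif stage == 2:
        # Stage 2: visual instruction tuning; the LLM is unfrozen as well,
        # while the vision encoder typically stays frozen.
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, True)
    params = [p for m in (vision_encoder, projector, llm)
              for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=1e-3 if stage == 1 else 2e-5)
```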
BLIP-2 (Bootstrapped Language-Image Pre-training):
BLIP-2 uses a Q-Former (Querying Transformer) as the adapter:
```
Image → Frozen ViT → Image Features
                ↓
  Q-Former (32 learnable queries)
                ↓
32 Visual Tokens → Frozen LLM → Response
```
The Q-Former learns to extract the most relevant information from visual features into a fixed number of tokens, reducing computational cost.
```python
import torch
import torch.nn as nn


class MultimodalAdapter(nn.Module):
    """
    Adapter module to connect vision encoder to language model.
    Inspired by LLaVA and BLIP-2 approaches.
    """

    def __init__(
        self,
        vision_dim: int = 1024,        # CLIP ViT-L/14
        llm_dim: int = 4096,           # LLaMA-7B
        num_query_tokens: int = 32,    # For Q-Former style
        adapter_type: str = "linear",  # "linear", "mlp", "qformer"
    ):
        super().__init__()
        self.adapter_type = adapter_type

        if adapter_type == "linear":
            # Simple linear projection (LLaVA v1)
            self.adapter = nn.Linear(vision_dim, llm_dim)

        elif adapter_type == "mlp":
            # Two-layer MLP (LLaVA v1.5)
            self.adapter = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        elif adapter_type == "qformer":
            # Learnable query tokens (BLIP-2 style)
            self.query_tokens = nn.Parameter(
                torch.randn(1, num_query_tokens, llm_dim)
            )
            self.cross_attention = nn.MultiheadAttention(
                embed_dim=llm_dim,
                num_heads=16,
                kdim=vision_dim,
                vdim=vision_dim,
                batch_first=True,
            )
            self.ffn = nn.Sequential(
                nn.Linear(llm_dim, llm_dim * 4),
                nn.GELU(),
                nn.Linear(llm_dim * 4, llm_dim),
            )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        """
        Transform visual features for LLM input.

        Args:
            visual_features: (B, N, vision_dim) from vision encoder

        Returns:
            (B, M, llm_dim) visual tokens for LLM
        """
        B = visual_features.shape[0]

        if self.adapter_type in ["linear", "mlp"]:
            # Project each visual token
            return self.adapter(visual_features)

        elif self.adapter_type == "qformer":
            # Use learnable queries to extract information
            queries = self.query_tokens.expand(B, -1, -1)
            # Cross-attention: queries attend to visual features
            attended, _ = self.cross_attention(
                queries, visual_features, visual_features
            )
            # FFN with residual connection
            output = attended + self.ffn(attended)
            return output
```

Adapter-based approaches enable training multimodal models with modest resources. LLaVA's projection layer has only ~20M parameters; with the full model frozen initially, training requires a fraction of the compute needed for end-to-end training.
While adapter approaches are efficient, the most capable unified models are trained end-to-end across modalities, allowing deep integration of multimodal knowledge.
GPT-4V (GPT-4 Vision):
OpenAI's GPT-4V extends GPT-4 with vision capabilities: images can be interleaved with text in a single prompt and reasoned over in context, though the architecture and training details have not been publicly disclosed.
Gemini (Google DeepMind):
Gemini is natively multimodal from the ground up: rather than bolting a vision encoder onto a text-only model, it is trained jointly on text, images, audio, and video from the start.
Flamingo (DeepMind):
Flamingo pioneered the adapter approach at scale: a frozen vision encoder is connected to a frozen language model through a Perceiver Resampler and gated cross-attention layers interleaved with the LLM blocks.
Key Architectural Pattern:
```
Multiple Images/Frames → Vision Encoder → Resampler → Fixed Visual Tokens
                                                              ↓
Text Prompt → Tokenizer → Text Tokens ─────────────────→ Fused Sequence
                                                              ↓
                                                     Transformer Decoder
                                                              ↓
                                                          Text Output
```
| Model | Type | Modalities | Context | Training Approach |
|---|---|---|---|---|
| GPT-4V | Proprietary | Text, Image | 128K tokens | End-to-end (speculated) |
| Gemini Pro | Proprietary | Text, Image, Audio, Video | 32K→1M tokens | Native multimodal |
| LLaVA 1.5 | Open | Text, Image | 4K tokens | Adapter + fine-tune |
| Flamingo | Research | Text, Image, Video | ~4K tokens | Frozen + adapters |
| InternVL | Open | Text, Image | 8K tokens | End-to-end |
| Idefics2 | Open | Text, Image | 8K tokens | Adapter + partial tune |
A fundamental question in unified architectures is how to represent different modalities in a common format. Several tokenization strategies have emerged.
Continuous Embeddings:
The simplest approach: modality-specific encoders produce continuous embeddings that are projected and concatenated with the text embeddings before entering the transformer, as in LLaVA and similar adapter-based models (see the sketch below).
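A minimal sketch of this splicing, with illustrative dimensions and an assumed position for the `<image>` placeholder:

```python
import torch
import torch.nn as nn

llm_dim, vision_dim = 4096, 1024
text_embed = nn.Embedding(32000, llm_dim)      # stand-in for the LLM's embedding table
project = nn.Linear(vision_dim, llm_dim)       # continuous-embedding adapter

text_ids = torch.randint(0, 32000, (1, 16))    # prompt tokens surrounding the image
visual_feats = torch.randn(1, 64, vision_dim)  # 64 visual tokens from the encoder

text_emb = text_embed(text_ids)                # (1, 16, 4096)
vis_emb = project(visual_feats)                # (1, 64, 4096)

image_pos = 4                                  # where the <image> placeholder sits
inputs_embeds = torch.cat(
    [text_emb[:, :image_pos], vis_emb, text_emb[:, image_pos:]], dim=1
)                                              # (1, 80, 4096), fed directly to the LLM
```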
Discrete Visual Tokens (VQ-VAE/VQGAN):
Quantize visual content into discrete tokens drawn from a learned codebook, so that images can be modeled with the same next-token prediction objective as text (see the sketch below).
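A minimal sketch of the quantization step, with an illustrative codebook size and feature grid rather than an exact VQGAN configuration:

```python
import torch

codebook = torch.randn(8192, 256)              # 8K learned codes (VQGAN-style)
features = torch.randn(1, 1024, 256)           # 32x32 grid of latent vectors

# Nearest-neighbor lookup: each latent vector becomes a discrete code index.
dists = torch.cdist(features, codebook.unsqueeze(0))  # (1, 1024, 8192)
visual_token_ids = dists.argmin(dim=-1)               # (1, 1024) discrete ids

# Offsetting the ids past the text vocabulary yields one unified sequence,
# as in Chameleon-style unified autoregressive models.
text_vocab_size = 32000
unified_ids = visual_token_ids + text_vocab_size
```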
Chameleon (Meta, 2024):
Truly unified tokenization for text and images: both modalities are mapped into a single discrete vocabulary and generated autoregressively by one transformer.
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer


class UnifiedMultimodalTokenizer:
    """
    Unified tokenizer that handles text, images, and audio.

    Converts all modalities to a common token sequence format
    with special tokens marking modality boundaries.
    """

    def __init__(
        self,
        text_tokenizer_name: str = "meta-llama/Llama-2-7b",
        vision_encoder=None,   # Pre-trained vision encoder
        audio_encoder=None,    # Pre-trained audio encoder
    ):
        # Text tokenizer
        self.text_tokenizer = AutoTokenizer.from_pretrained(
            text_tokenizer_name
        )

        # Add special tokens for modality markers
        special_tokens = {
            'additional_special_tokens': [
                '<image>', '</image>',
                '<audio>', '</audio>',
                '<video>', '</video>',
            ]
        }
        self.text_tokenizer.add_special_tokens(special_tokens)

        # Modality encoders
        self.vision_encoder = vision_encoder
        self.audio_encoder = audio_encoder

    def tokenize(self, inputs: dict) -> dict:
        """
        Tokenize multimodal inputs into unified format.

        Args:
            inputs: {
                'text': str or List[str],
                'images': Optional[Tensor] (B, C, H, W),
                'audio': Optional[Tensor] (B, T),
            }

        Returns:
            {
                'input_ids': Tensor (B, L) for text,
                'image_embeds': Tensor (B, N, D) for images,
                'audio_embeds': Tensor (B, T, D) for audio,
            }
        """
        result = {}

        # Tokenize text
        if 'text' in inputs:
            text_tokens = self.text_tokenizer(
                inputs['text'],
                return_tensors='pt',
                padding=True,
            )
            result['input_ids'] = text_tokens['input_ids']
            result['attention_mask'] = text_tokens['attention_mask']

        # Encode images
        if 'images' in inputs and self.vision_encoder:
            with torch.no_grad():
                image_embeds = self.vision_encoder(inputs['images'])
            result['image_embeds'] = image_embeds

        # Encode audio
        if 'audio' in inputs and self.audio_encoder:
            with torch.no_grad():
                audio_embeds = self.audio_encoder(inputs['audio'])
            result['audio_embeds'] = audio_embeds

        return result
```

Discrete visual tokens require large codebooks (8K-16K codes) to capture visual diversity. Combined with a text vocabulary (32K-128K), unified vocabularies can become very large. This has implications for embedding table size and output softmax computation.
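A rough back-of-the-envelope calculation makes the scale concrete; the vocabulary and hidden sizes below are illustrative, not those of any particular model:

```python
text_vocab, visual_codes, d_model = 128_000, 16_384, 4096

unified_vocab = text_vocab + visual_codes
embedding_params = unified_vocab * d_model     # input embedding table
output_params = unified_vocab * d_model        # untied output projection

print(unified_vocab)                                       # 144384
print(f"embedding table: {embedding_params / 1e6:.0f}M")   # ~591M parameters
print(f"output head:     {output_params / 1e6:.0f}M")      # ~591M parameters
```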
Training unified multimodal models requires careful orchestration of diverse data and objectives.
Multi-Stage Training:
Most successful unified models use staged training:
Stage 1: Unimodal Pre-training. Train (or reuse) strong single-modality components: a text-only LLM plus vision or audio encoders, each pre-trained on its own large-scale data.
Stage 2: Multimodal Alignment. Train the connecting layers on paired data (for example, image-caption pairs) so that non-text representations map into the language model's embedding space.
Stage 3: Instruction Fine-tuning. Fine-tune on multimodal instruction and dialogue data so the model follows user requests across modalities.
Data Mixing Strategies:
Unified models must balance different data types:
| Data Type | Examples | Purpose |
|---|---|---|
| Image-Caption | LAION, CC3M | Basic alignment |
| Visual Q&A | VQAv2, GQA | Fine-grained understanding |
| Document | DocVQA, ChartQA | OCR, structured data |
| Conversation | ShareGPT4V | Dialogue capability |
| Text-only | OpenWebText | Maintain language ability |
Loss Balancing:
Different modalities and tasks may have different loss scales, so the overall objective is a weighted sum:

$$\mathcal{L}_{\text{total}} = \sum_i \lambda_i \mathcal{L}_i$$
Weights $\lambda_i$ can be fixed or learned (e.g., uncertainty weighting).
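A small sketch of both options, with hypothetical task names; the learned variant follows the common uncertainty-weighting form $\sum_i e^{-s_i} \mathcal{L}_i + s_i$:

```python
import torch
import torch.nn as nn

losses = {
    "text_lm": torch.tensor(2.3),
    "image_caption": torch.tensor(1.1),
    "vqa": torch.tensor(0.7),
}

# Option 1: fixed weights lambda_i
fixed_weights = {"text_lm": 1.0, "image_caption": 0.5, "vqa": 0.5}
total_fixed = sum(fixed_weights[k] * v for k, v in losses.items())

# Option 2: learned uncertainty weighting (one log-variance s_i per task)
class UncertaintyWeighting(nn.Module):
    def __init__(self, task_names):
        super().__init__()
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in task_names}
        )

    def forward(self, losses: dict) -> torch.Tensor:
        return sum(torch.exp(-self.log_vars[k]) * v + self.log_vars[k]
                   for k, v in losses.items())

weighter = UncertaintyWeighting(losses.keys())
total_learned = weighter(losses)
```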
LLaVA achieved strong results with only 150K visual instruction examples by carefully curating high-quality data. GPT-4 was used to generate detailed image descriptions and conversations, providing richer supervision than raw web data.
Unified multimodal architectures are rapidly evolving. Several emerging directions point toward the future of this field.
| Model | Input Modalities | Output Modalities | Architecture |
|---|---|---|---|
| CoDi | Text, Image, Audio, Video | Text, Image, Audio, Video | Composable diffusion |
| NExT-GPT | Text, Image, Audio, Video | Text, Image, Audio, Video | LLM + modality decoders |
| Chameleon | Text, Image | Text, Image | Unified AR with discrete tokens |
| Gemini | Text, Image, Audio, Video | Text, Image (limited) | End-to-end transformer |
Some researchers argue that unified multimodal models represent a step toward artificial general intelligence. The ability to reason across modalities—connecting visual observations, linguistic knowledge, and auditory information—mirrors a key aspect of human cognition. However, significant gaps remain in reasoning, planning, and physical understanding.
Congratulations! You've completed the Multimodal Learning module. You now understand vision-language models, CLIP and DALL-E, audio-visual learning, cross-modal retrieval, and unified architectures—the cutting edge of multimodal AI research.