The progression of multimodal AI has followed a natural trajectory: from specialized models for specific modality pairs (image-text, audio-video) toward unified architectures that can process any combination of modalities within a single model. This represents both a practical engineering goal—reducing the complexity of building and maintaining separate systems—and a scientific hypothesis about the nature of intelligence.
Unified multimodal models aim to accept and produce arbitrary combinations of modalities, share representations and knowledge across them, and replace a collection of specialized systems with a single model.
This page explores the architectures, training strategies, and key models that are pushing toward truly unified multimodal AI.
You will understand the design principles behind unified multimodal architectures, the key models advancing this frontier (GPT-4V, Gemini, LLaVA, Flamingo), and the technical challenges in building systems that seamlessly integrate multiple modalities.
Building a single model that handles multiple modalities requires careful architectural decisions. Several design principles have emerged from recent research.
Principle 1: Modality-Specific Tokenization, Shared Processing
The most successful unified architectures use modality-specific encoders to convert inputs into a common token format, followed by a shared transformer backbone:
```
Image → Vision Tokenizer → Visual Tokens ─┐
                                          ├→ Shared Transformer → Outputs
Text  → Text Tokenizer   → Text Tokens  ──┘
```
This preserves the efficiency of specialized encoders while enabling cross-modal reasoning in the shared layers.
Principle 2: Flexibility in Input and Output
Unified models should handle arbitrary input-output combinations: image-to-text, text-to-image, audio-plus-text-to-text, and so on, without a separate model for each pairing.
Principle 3: Scalable Training
Unified architectures must efficiently leverage diverse datasets, from web-scale image-caption pairs to text-only corpora, without requiring fully paired data across every modality.
Principle 4: Emergent Cross-Modal Capabilities
The unified model should develop abilities it was not explicitly trained for, such as reasoning that combines visual and textual evidence or transferring knowledge learned in one modality to tasks in another.
| Design Choice | Options | Trade-off |
|---|---|---|
| Token space | Shared vs. modality-specific | Flexibility vs. optimization |
| Encoder | Frozen vs. trained | Efficiency vs. adaptation |
| Fusion | Early vs. late | Interaction depth vs. compute |
| Decoder | Autoregressive vs. non-AR | Generality vs. speed |
A key challenge is that different modalities have vastly different information densities. A single image may contain as much information as hundreds of text tokens, but processing hundreds of visual tokens per image is computationally expensive. Solutions include pooling, resampling, or learned query tokens (Q-Former).
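The sketch below illustrates two of these reduction strategies with illustrative shapes: simple average pooling over adjacent patch tokens, and a small Perceiver-style resampler with learned query tokens. The module names and dimensions are assumptions for the example, not taken from any specific model.

```python
import torch
import torch.nn as nn

def pool_visual_tokens(features: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Average-pool groups of adjacent patch tokens (e.g., 576 -> 144)."""
    B, N, D = features.shape
    assert N % stride == 0, "number of visual tokens must divide evenly"
    return features.view(B, N // stride, stride, D).mean(dim=2)

class PerceiverResampler(nn.Module):
    """Flamingo/Q-Former-style idea: a fixed set of learned queries
    cross-attends to all visual tokens and returns a constant-length summary."""

    def __init__(self, dim: int = 1024, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        queries = self.latents.expand(features.shape[0], -1, -1)
        out, _ = self.attn(queries, features, features)
        return out  # (B, num_latents, dim) regardless of input length

features = torch.randn(2, 576, 1024)          # e.g., 24x24 patches from a ViT
print(pool_visual_tokens(features).shape)     # torch.Size([2, 144, 1024])
print(PerceiverResampler()(features).shape)   # torch.Size([2, 64, 1024])
```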
A practical approach to unified multimodal models is to connect pre-trained unimodal components with learnable adapters. This leverages existing powerful models while minimizing training costs.
The Adapter Pattern: keep a powerful pre-trained vision encoder and a pre-trained LLM (largely) frozen, and train a small bridging module that maps the encoder's outputs into the language model's input space.
LLaVA (Large Language-and-Vision Assistant):
LLaVA connects CLIP's vision encoder to Vicuna (LLaMA-based LLM):
```
Image → CLIP ViT → Visual Tokens
                ↓
     Linear Projector (Adapter)
                ↓
Visual + Text Tokens → LLM → Response
```
The projector is trained on image-caption pairs, then the full model is fine-tuned on visual instruction data.
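As a rough sketch of this two-stage recipe, the helper below toggles which components receive gradients at each stage. The `vision_encoder`, `projector`, and `llm` arguments are hypothetical placeholder modules, and the learning rates are only indicative.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage: int, vision_encoder, projector, llm):
    """Return an optimizer over the parameters that are trainable in this stage."""
    if stage == 1:
        # Stage 1: align on image-caption pairs; only the projector learns.
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(projector, True)
    elif stage == 2:
        # Stage 2: visual instruction tuning; the LLM is unfrozen as well,
        # while the vision encoder typically stays frozen.
        set_trainable(vision_encoder, False)
        set_trainable(projector, True)
        set_trainable(llm, True)
    params = [p for m in (vision_encoder, projector, llm)
              for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=1e-3 if stage == 1 else 2e-5)
```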
BLIP-2 (Bootstrapped Language-Image Pre-training):
BLIP-2 uses a Q-Former (Querying Transformer) as the adapter:
```
Image → Frozen ViT → Image Features
                ↓
  Q-Former (32 learnable queries)
                ↓
32 Visual Tokens → Frozen LLM → Response
```
The Q-Former learns to extract the most relevant information from visual features into a fixed number of tokens, reducing computational cost.
```python
import torch
import torch.nn as nn


class MultimodalAdapter(nn.Module):
    """
    Adapter module to connect vision encoder to language model.
    Inspired by LLaVA and BLIP-2 approaches.
    """

    def __init__(
        self,
        vision_dim: int = 1024,        # CLIP ViT-L/14
        llm_dim: int = 4096,           # LLaMA-7B
        num_query_tokens: int = 32,    # For Q-Former style
        adapter_type: str = "linear",  # "linear", "mlp", "qformer"
    ):
        super().__init__()
        self.adapter_type = adapter_type

        if adapter_type == "linear":
            # Simple linear projection (LLaVA v1)
            self.adapter = nn.Linear(vision_dim, llm_dim)

        elif adapter_type == "mlp":
            # Two-layer MLP (LLaVA v1.5)
            self.adapter = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        elif adapter_type == "qformer":
            # Learnable query tokens (BLIP-2 style)
            self.query_tokens = nn.Parameter(
                torch.randn(1, num_query_tokens, llm_dim)
            )
            self.cross_attention = nn.MultiheadAttention(
                embed_dim=llm_dim,
                num_heads=16,
                kdim=vision_dim,
                vdim=vision_dim,
                batch_first=True,
            )
            self.ffn = nn.Sequential(
                nn.Linear(llm_dim, llm_dim * 4),
                nn.GELU(),
                nn.Linear(llm_dim * 4, llm_dim),
            )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        """
        Transform visual features for LLM input.

        Args:
            visual_features: (B, N, vision_dim) from vision encoder

        Returns:
            (B, M, llm_dim) visual tokens for LLM
        """
        B = visual_features.shape[0]

        if self.adapter_type in ["linear", "mlp"]:
            # Project each visual token
            return self.adapter(visual_features)

        elif self.adapter_type == "qformer":
            # Use learnable queries to extract information
            queries = self.query_tokens.expand(B, -1, -1)
            # Cross-attention: queries attend to visual features
            attended, _ = self.cross_attention(
                queries, visual_features, visual_features
            )
            # FFN with residual connection
            output = attended + self.ffn(attended)
            return output
```

Adapter-based approaches enable training multimodal models with modest resources. LLaVA's projection layer has only ~20M parameters; with the full model frozen initially, training requires a fraction of the compute needed for end-to-end training.
While adapter approaches are efficient, the most capable unified models are trained end-to-end across modalities, allowing deep integration of multimodal knowledge.
GPT-4V (GPT-4 Vision):
OpenAI's GPT-4V extends GPT-4 with vision capabilities: images can be interleaved with text in a single prompt and reasoned over in context, though the architecture and training details have not been publicly disclosed.
Gemini (Google DeepMind):
Gemini is natively multimodal from the ground up: rather than bolting a vision encoder onto a text-only model, it is trained jointly on text, images, audio, and video from the start.
Flamingo (DeepMind):
Flamingo pioneered the adapter approach at scale: a frozen vision encoder is connected to a frozen language model through a Perceiver Resampler and gated cross-attention layers interleaved with the LLM blocks.
Key Architectural Pattern:
```
Multiple Images/Frames → Vision Encoder → Resampler → Fixed Visual Tokens
                                                              ↓
Text Prompt → Tokenizer → Text Tokens ─────────────────→ Fused Sequence
                                                              ↓
                                                     Transformer Decoder
                                                              ↓
                                                          Text Output
```
| Model | Type | Modalities | Context | Training Approach |
|---|---|---|---|---|
| GPT-4V | Proprietary | Text, Image | 128K tokens | End-to-end (speculated) |
| Gemini Pro | Proprietary | Text, Image, Audio, Video | 32K→1M tokens | Native multimodal |
| LLaVA 1.5 | Open | Text, Image | 4K tokens | Adapter + fine-tune |
| Flamingo | Research | Text, Image, Video | ~4K tokens | Frozen + adapters |
| InternVL | Open | Text, Image | 8K tokens | End-to-end |
| Idefics2 | Open | Text, Image | 8K tokens | Adapter + partial tune |
A fundamental question in unified architectures is how to represent different modalities in a common format. Several tokenization strategies have emerged.
Continuous Embeddings:
The simplest approach: modality-specific encoders produce continuous embeddings that are projected and concatenated with the text embeddings before entering the transformer, as in LLaVA and similar adapter-based models (see the sketch below).
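A minimal sketch of this splicing, with illustrative dimensions and an assumed position for the `<image>` placeholder:

```python
import torch
import torch.nn as nn

llm_dim, vision_dim = 4096, 1024
text_embed = nn.Embedding(32000, llm_dim)      # stand-in for the LLM's embedding table
project = nn.Linear(vision_dim, llm_dim)       # continuous-embedding adapter

text_ids = torch.randint(0, 32000, (1, 16))    # prompt tokens surrounding the image
visual_feats = torch.randn(1, 64, vision_dim)  # 64 visual tokens from the encoder

text_emb = text_embed(text_ids)                # (1, 16, 4096)
vis_emb = project(visual_feats)                # (1, 64, 4096)

image_pos = 4                                  # where the <image> placeholder sits
inputs_embeds = torch.cat(
    [text_emb[:, :image_pos], vis_emb, text_emb[:, image_pos:]], dim=1
)                                              # (1, 80, 4096), fed directly to the LLM
```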
Discrete Visual Tokens (VQ-VAE/VQGAN):
Quantize visual content into discrete tokens drawn from a learned codebook, so that images can be modeled with the same next-token prediction objective as text (see the sketch below).
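A minimal sketch of the quantization step, with an illustrative codebook size and feature grid rather than an exact VQGAN configuration:

```python
import torch

codebook = torch.randn(8192, 256)              # 8K learned codes (VQGAN-style)
features = torch.randn(1, 1024, 256)           # 32x32 grid of latent vectors

# Nearest-neighbor lookup: each latent vector becomes a discrete code index.
dists = torch.cdist(features, codebook.unsqueeze(0))  # (1, 1024, 8192)
visual_token_ids = dists.argmin(dim=-1)               # (1, 1024) discrete ids

# Offsetting the ids past the text vocabulary yields one unified sequence,
# as in Chameleon-style unified autoregressive models.
text_vocab_size = 32000
unified_ids = visual_token_ids + text_vocab_size
```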
Chameleon (Meta, 2024):
Truly unified tokenization for text and images: both modalities are mapped into a single discrete vocabulary and generated autoregressively by one transformer.
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer


class UnifiedMultimodalTokenizer:
    """
    Unified tokenizer that handles text, images, and audio.

    Converts all modalities to a common token sequence format
    with special tokens marking modality boundaries.
    """

    def __init__(
        self,
        text_tokenizer_name: str = "meta-llama/Llama-2-7b",
        vision_encoder=None,   # Pre-trained vision encoder
        audio_encoder=None,    # Pre-trained audio encoder
    ):
        # Text tokenizer
        self.text_tokenizer = AutoTokenizer.from_pretrained(
            text_tokenizer_name
        )

        # Add special tokens for modality markers
        special_tokens = {
            'additional_special_tokens': [
                '<image>', '</image>',
                '<audio>', '</audio>',
                '<video>', '</video>',
            ]
        }
        self.text_tokenizer.add_special_tokens(special_tokens)

        # Modality encoders
        self.vision_encoder = vision_encoder
        self.audio_encoder = audio_encoder

    def tokenize(self, inputs: dict) -> dict:
        """
        Tokenize multimodal inputs into unified format.

        Args:
            inputs: {
                'text': str or List[str],
                'images': Optional[Tensor] (B, C, H, W),
                'audio': Optional[Tensor] (B, T),
            }

        Returns:
            {
                'input_ids': Tensor (B, L) for text,
                'image_embeds': Tensor (B, N, D) for images,
                'audio_embeds': Tensor (B, T, D) for audio,
            }
        """
        result = {}

        # Tokenize text
        if 'text' in inputs:
            text_tokens = self.text_tokenizer(
                inputs['text'],
                return_tensors='pt',
                padding=True,
            )
            result['input_ids'] = text_tokens['input_ids']
            result['attention_mask'] = text_tokens['attention_mask']

        # Encode images
        if 'images' in inputs and self.vision_encoder:
            with torch.no_grad():
                image_embeds = self.vision_encoder(inputs['images'])
            result['image_embeds'] = image_embeds

        # Encode audio
        if 'audio' in inputs and self.audio_encoder:
            with torch.no_grad():
                audio_embeds = self.audio_encoder(inputs['audio'])
            result['audio_embeds'] = audio_embeds

        return result
```

Discrete visual tokens require large codebooks (8K-16K codes) to capture visual diversity. Combined with a text vocabulary (32K-128K), unified vocabularies can become very large. This has implications for embedding table size and output softmax computation.
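A rough back-of-the-envelope calculation makes the scale concrete; the vocabulary and hidden sizes below are illustrative, not those of any particular model:

```python
text_vocab, visual_codes, d_model = 128_000, 16_384, 4096

unified_vocab = text_vocab + visual_codes
embedding_params = unified_vocab * d_model     # input embedding table
output_params = unified_vocab * d_model        # untied output projection

print(unified_vocab)                                       # 144384
print(f"embedding table: {embedding_params / 1e6:.0f}M")   # ~591M parameters
print(f"output head:     {output_params / 1e6:.0f}M")      # ~591M parameters
```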
Training unified multimodal models requires careful orchestration of diverse data and objectives.
Multi-Stage Training:
Most successful unified models use staged training:
Stage 1: Unimodal Pre-training. Train (or reuse) strong single-modality components: a text-only LLM plus vision or audio encoders, each pre-trained on its own large-scale data.
Stage 2: Multimodal Alignment. Train the connecting layers on paired data (for example, image-caption pairs) so that non-text representations map into the language model's embedding space.
Stage 3: Instruction Fine-tuning. Fine-tune on multimodal instruction and dialogue data so the model follows user requests across modalities.
Data Mixing Strategies:
Unified models must balance different data types:
| Data Type | Examples | Purpose |
|---|---|---|
| Image-Caption | LAION, CC3M | Basic alignment |
| Visual Q&A | VQAv2, GQA | Fine-grained understanding |
| Document | DocVQA, ChartQA | OCR, structured data |
| Conversation | ShareGPT4V | Dialogue capability |
| Text-only | OpenWebText | Maintain language ability |
Loss Balancing:
Different modalities and tasks may have different loss scales, so the overall objective is a weighted sum:

$$\mathcal{L}_{\text{total}} = \sum_i \lambda_i \mathcal{L}_i$$
Weights $\lambda_i$ can be fixed or learned (e.g., uncertainty weighting).
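A small sketch of both options, with hypothetical task names; the learned variant follows the common uncertainty-weighting form $\sum_i e^{-s_i} \mathcal{L}_i + s_i$:

```python
import torch
import torch.nn as nn

losses = {
    "text_lm": torch.tensor(2.3),
    "image_caption": torch.tensor(1.1),
    "vqa": torch.tensor(0.7),
}

# Option 1: fixed weights lambda_i
fixed_weights = {"text_lm": 1.0, "image_caption": 0.5, "vqa": 0.5}
total_fixed = sum(fixed_weights[k] * v for k, v in losses.items())

# Option 2: learned uncertainty weighting (one log-variance s_i per task)
class UncertaintyWeighting(nn.Module):
    def __init__(self, task_names):
        super().__init__()
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in task_names}
        )

    def forward(self, losses: dict) -> torch.Tensor:
        return sum(torch.exp(-self.log_vars[k]) * v + self.log_vars[k]
                   for k, v in losses.items())

weighter = UncertaintyWeighting(losses.keys())
total_learned = weighter(losses)
```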
LLaVA achieved strong results with only 150K visual instruction examples by carefully curating high-quality data. GPT-4 was used to generate detailed image descriptions and conversations, providing richer supervision than raw web data.
Unified multimodal architectures are rapidly evolving. Several emerging directions point toward the future of this field.
| Model | Input Modalities | Output Modalities | Architecture |
|---|---|---|---|
| CoDi | Text, Image, Audio, Video | Text, Image, Audio, Video | Composable diffusion |
| NExT-GPT | Text, Image, Audio, Video | Text, Image, Audio, Video | LLM + modality decoders |
| Chameleon | Text, Image | Text, Image | Unified AR with discrete tokens |
| Gemini | Text, Image, Audio, Video | Text, Image (limited) | End-to-end transformer |
Some researchers argue that unified multimodal models represent a step toward artificial general intelligence. The ability to reason across modalities—connecting visual observations, linguistic knowledge, and auditory information—mirrors a key aspect of human cognition. However, significant gaps remain in reasoning, planning, and physical understanding.
Congratulations! You've completed the Multimodal Learning module. You now understand vision-language models, CLIP and DALL-E, audio-visual learning, cross-modal retrieval, and unified architectures—the cutting edge of multimodal AI research.