In January 2021, OpenAI released two models that fundamentally changed the landscape of vision-language AI: CLIP (Contrastive Language-Image Pre-training) and DALL-E. While conceptually distinct—CLIP learns to understand the relationship between images and text, while DALL-E generates images from text—these models share a common insight: large-scale pre-training on image-text pairs enables remarkable emergent capabilities.
CLIP demonstrated that a model trained to match images with their captions could perform zero-shot image classification competitive with supervised methods trained on millions of labeled examples. DALL-E showed that the same principles could be inverted: by learning the mapping from text to images, AI could generate novel, creative imagery from natural language descriptions.
Together, these models represent a paradigm shift from task-specific, supervised learning to flexible, general-purpose multimodal understanding.
This page provides exhaustive coverage of CLIP and DALL-E architectures, training methodologies, mathematical foundations, and practical implementations. You will understand the design decisions that made these models successful and how to apply them effectively.
CLIP's architecture is elegantly simple: two separate encoders—one for images, one for text—that map their respective inputs to a shared embedding space where semantic similarity can be measured via dot product.
Image Encoder Options:
OpenAI trained CLIP with two vision encoder architectures:
ResNet variants: Modified ResNets (RN50, RN101, RN50x4, RN50x16, RN50x64) with attention pooling instead of global average pooling. The attention pooling is a single transformer-style multi-head QKV attention layer whose query is conditioned on the globally average-pooled image features.
Vision Transformer (ViT): ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px. Images are divided into patches, linearly projected, and processed by a standard transformer encoder.
Text Encoder:
The text encoder is a transformer with masked self-attention (GPT-2 style), taking text tokens as input and producing a sequence-level representation from the [EOS] token position.
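A small but important detail: the released implementation selects the feature at the [EOS] position rather than pooling over the whole sequence, locating it with `argmax` because the end-of-text token has the highest id in CLIP's BPE vocabulary. A minimal sketch of that pooling step (the `pool_eos` helper name is ours, not from the official code):

```python
import torch

def pool_eos(x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """Pool transformer outputs at the [EOS] position of each sequence.

    x:    (batch, seq_len, width) token features from the text transformer
    text: (batch, seq_len) token ids; the end-of-text token has the largest
          id in CLIP's BPE vocabulary, so argmax over the sequence finds it.
    """
    eos_positions = text.argmax(dim=-1)
    return x[torch.arange(x.shape[0]), eos_positions]
```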
Projection Heads:
Both encoders output features that are linearly projected to the shared embedding space (dimension 512 or 768). These projections are learned during training and are critical for alignment.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIP(nn.Module):
    """
    Simplified CLIP implementation demonstrating core architecture.
    """

    def __init__(
        self,
        embed_dim: int = 512,
        vision_width: int = 768,
        text_width: int = 512,
        vision_layers: int = 12,
        text_layers: int = 12,
        vocab_size: int = 49152,
        context_length: int = 77,
    ):
        super().__init__()

        # Vision encoder (simplified ViT)
        self.visual = VisionTransformer(
            width=vision_width,
            layers=vision_layers,
            output_dim=embed_dim,
        )

        # Text encoder
        self.transformer = TextTransformer(
            width=text_width,
            layers=text_layers,
            vocab_size=vocab_size,
            context_length=context_length,
        )

        # Projection layers
        self.visual_projection = nn.Linear(vision_width, embed_dim, bias=False)
        self.text_projection = nn.Linear(text_width, embed_dim, bias=False)

        # Learnable temperature parameter (log scale for stability)
        self.logit_scale = nn.Parameter(torch.ones([]) * torch.log(torch.tensor(1 / 0.07)))

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        """Encode images to normalized embeddings."""
        features = self.visual(image)
        projected = self.visual_projection(features)
        return F.normalize(projected, dim=-1)

    def encode_text(self, text: torch.Tensor) -> torch.Tensor:
        """Encode text tokens to normalized embeddings."""
        features = self.transformer(text)
        projected = self.text_projection(features)
        return F.normalize(projected, dim=-1)

    def forward(self, image: torch.Tensor, text: torch.Tensor):
        """
        Compute image-text similarity matrix.

        Returns:
            logits_per_image: (N, N) similarity from image perspective
            logits_per_text: (N, N) similarity from text perspective
        """
        image_features = self.encode_image(image)
        text_features = self.encode_text(text)

        # Cosine similarity with learned temperature
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.T
        logits_per_text = logits_per_image.T

        return logits_per_image, logits_per_text
```

CLIP's training procedure is a masterclass in scalable contrastive learning. Understanding the mathematical formulation and practical considerations is essential for reproducing or extending CLIP.
Contrastive Pre-training Objective:
Given a batch of N image-text pairs, CLIP learns to maximize similarity between matching pairs while minimizing similarity between non-matching pairs. This is formulated as a symmetric cross-entropy loss:
$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}\right)$$
Where: $$\mathcal{L}_{I \rightarrow T} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}$$
Here, $s_{ij} = \text{sim}(I_i, T_j) = I_i^T T_j$ for normalized embeddings, and $\tau$ is the temperature parameter.
The Role of Temperature:
The temperature parameter $\tau$ controls the sharpness of the softmax distribution: a low $\tau$ sharpens it, concentrating the loss on separating the correct pair from the hardest negatives, while a high $\tau$ flattens it and weakens the training signal.
CLIP learns $\tau$ as a parameter, initialized to 0.07 (actually learns $\log(1/\tau)$ for numerical stability).
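To see why the temperature matters, the short sketch below compares the softmax over one row of similarity scores at two temperatures; the similarity values are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

# One row of cosine similarities between an image and 4 candidate captions
# (made-up values for illustration).
similarities = torch.tensor([0.31, 0.28, 0.05, -0.10])

for tau in (1.0, 0.07):
    probs = F.softmax(similarities / tau, dim=-1)
    print(f"tau={tau}: {probs.tolist()}")

# tau=1.0 gives a nearly uniform distribution; tau=0.07 sharpens it so the
# loss focuses on separating the correct caption from the hardest negatives.
```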
Training Data: WIT (WebImageText)
CLIP was trained on 400 million image-text pairs collected from the internet. The key training settings were:
| Parameter | Value | Notes |
|---|---|---|
| Batch Size | 32,768 | Large batch critical for contrastive learning |
| Learning Rate | 5e-4 (ViT-L) | Cosine decay schedule |
| Training Steps | ~12.8B image-text pairs seen | 32 epochs over 400M pairs |
| Temperature Init | 0.07 | Learned during training |
| Image Resolution | 224×224 (336 for large) | Standard ViT resolution |
| Precision | Mixed (FP16/FP32) | Gradient scaling for stability |
```python
import torch
import torch.nn.functional as F
import torch.distributed as dist


def clip_loss(
    image_embeddings: torch.Tensor,
    text_embeddings: torch.Tensor,
    logit_scale: torch.Tensor,
    gather_with_grad: bool = True,
) -> torch.Tensor:
    """
    Compute symmetric CLIP contrastive loss.

    Args:
        image_embeddings: (N, D) normalized image embeddings
        text_embeddings: (N, D) normalized text embeddings
        logit_scale: Learned temperature (log scale)
        gather_with_grad: Whether to gather across GPUs with gradients

    Returns:
        Scalar loss value
    """
    # For distributed training, gather embeddings from all GPUs
    if dist.is_initialized() and gather_with_grad:
        # All-gather with gradient support
        all_image_embeddings = gather_with_gradient(image_embeddings)
        all_text_embeddings = gather_with_gradient(text_embeddings)
    else:
        all_image_embeddings = image_embeddings
        all_text_embeddings = text_embeddings

    # Similarity of the local embeddings against the full gathered batch,
    # scaled by temperature
    scale = logit_scale.exp()
    logits_per_image = scale * image_embeddings @ all_text_embeddings.T
    logits_per_text = scale * text_embeddings @ all_image_embeddings.T

    # Ground truth: each local example matches its own position in the
    # gathered batch (offset by rank in distributed training)
    batch_size = image_embeddings.shape[0]
    labels = torch.arange(batch_size, device=image_embeddings.device)
    if dist.is_initialized():
        labels = labels + dist.get_rank() * batch_size

    # Symmetric cross-entropy loss
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)

    return (loss_i2t + loss_t2i) / 2


def gather_with_gradient(tensor: torch.Tensor) -> torch.Tensor:
    """All-gather tensors while preserving gradients for the local shard."""
    world_size = dist.get_world_size()
    tensors = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(tensors, tensor)
    # all_gather returns detached tensors; re-insert the local tensor
    # so its gradient still flows
    rank = dist.get_rank()
    tensors[rank] = tensor
    return torch.cat(tensors, dim=0)
```

CLIP's performance degrades significantly with small batch sizes. The contrastive objective requires many negative examples to learn discriminative representations. Techniques like gradient checkpointing, mixed precision, and distributed training are essential for achieving large effective batch sizes.
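Putting the hyperparameters from the table and the loss function above together, a single training step might look like the sketch below. It assumes the simplified `CLIP` module and `clip_loss` function from the earlier code blocks; the schedule and hyperparameters are illustrative, not the exact OpenAI recipe:

```python
import math
import torch


def train_clip(model, dataloader, steps: int, base_lr: float = 5e-4,
               warmup: int = 2000, device: str = "cuda"):
    """Sketch of a mixed-precision CLIP training loop (not the official recipe)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.2)
    scaler = torch.cuda.amp.GradScaler()  # FP16 gradient scaling for stability

    for step, (images, texts) in enumerate(dataloader):
        if step >= steps:
            break

        # Linear warmup followed by cosine decay
        if step < warmup:
            lr = base_lr * (step + 1) / warmup
        else:
            progress = (step - warmup) / max(1, steps - warmup)
            lr = 0.5 * base_lr * (1 + math.cos(math.pi * progress))
        for group in optimizer.param_groups:
            group["lr"] = lr

        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            image_emb = model.encode_image(images.to(device))
            text_emb = model.encode_text(texts.to(device))
            loss = clip_loss(image_emb, text_emb, model.logit_scale,
                             gather_with_grad=False)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Clamp the learned temperature so logits stay bounded (max scale 100)
        with torch.no_grad():
            model.logit_scale.clamp_(max=math.log(100))
```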
CLIP's zero-shot classification ability is its most celebrated feature. By learning a joint embedding space, CLIP can classify images into arbitrary categories using only textual descriptions of those categories.
The Zero-Shot Protocol:
1. Insert each candidate class name into one or more prompt templates (e.g., "a photo of a {class}").
2. Encode the prompts with the text encoder and normalize to obtain one embedding per class.
3. Encode the test image with the image encoder.
4. Predict the class whose text embedding has the highest cosine similarity with the image embedding.
Prompt Engineering:
Raw class names often underperform compared to full sentences because CLIP was trained on image captions, not isolated labels. The official CLIP implementation uses an ensemble of 80 prompt templates:
"a photo of a {class}"
"a blurry photo of a {class}"
"a low contrast photo of a {class}"
"a high contrast photo of a {class}"
"a bad photo of a {class}"
"a good photo of a {class}"
"a photo of a small {class}"
"a photo of a big {class}"
"a photo of the {class}"
"a {class} in a video game"
...
The embeddings from all prompts are averaged to create a more robust class representation.
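The whole zero-shot pipeline, including prompt ensembling, fits in a few lines. The sketch below assumes a CLIP-style model such as the simplified `CLIP` class above and a `tokenize` function compatible with it; both are stand-ins for whatever implementation you actually use (e.g., the official `clip` package or `open_clip`):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(model, tokenize, images, class_names, templates):
    """Classify images against arbitrary class names via prompt ensembling.

    model:       CLIP-style model with encode_image / encode_text
    tokenize:    function mapping a list of strings to token id tensors
    images:      (B, 3, H, W) preprocessed image batch
    class_names: list of strings, e.g. ["dog", "cat", "car"]
    templates:   list of templates like "a photo of a {class}"
    """
    # Build one averaged, renormalized text embedding per class
    class_embeddings = []
    for name in class_names:
        prompts = tokenize([t.replace("{class}", name) for t in templates])
        text_emb = model.encode_text(prompts)           # (T, D), already normalized
        mean_emb = F.normalize(text_emb.mean(dim=0), dim=-1)
        class_embeddings.append(mean_emb)
    class_matrix = torch.stack(class_embeddings)         # (C, D)

    # Encode images and pick the most similar class
    image_emb = model.encode_image(images)               # (B, D), normalized
    logits = image_emb @ class_matrix.T                  # cosine similarities
    return logits.argmax(dim=-1)                         # (B,) predicted class index
```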
| Dataset | Classes | CLIP ViT-L/14@336 (zero-shot) | Best Supervised |
|---|---|---|---|
| ImageNet | 1000 | 76.2% | ~90% (EfficientNet-L2) |
| CIFAR-10 | 10 | 95.7% | 99.5% |
| CIFAR-100 | 100 | 77.5% | 93.0% |
| Food-101 | 101 | 93.8% | 93.0% (first zero-shot win!) |
| Stanford Cars | 196 | 77.3% | 94.0% |
| ObjectNet | 113 | 69.0% | 55.0% (ResNet-152) |
For specialized domains, tailored prompts dramatically improve performance. Medical imaging might use 'an X-ray showing {condition}'; satellite imagery might use 'satellite view of {terrain}'. Investing time in prompt engineering often yields better returns than model fine-tuning.
DALL-E flips the vision-language paradigm: instead of understanding images through text, it generates images from text. The original DALL-E (2021) used a discrete VAE and autoregressive transformer, while DALL-E 2 (2022) and DALL-E 3 (2023) adopted diffusion-based generation.
DALL-E 1: Discrete Token Generation
DALL-E 1's architecture combines two stages:
Stage 1: dVAE (Discrete Variational Autoencoder). A discrete VAE compresses each 256×256 image into a 32×32 grid of tokens drawn from an 8,192-entry visual codebook, so an image becomes 1,024 discrete tokens.
Stage 2: Autoregressive Transformer. A 12-billion-parameter decoder-only transformer models the concatenated sequence of up to 256 BPE text tokens followed by the 1,024 image tokens, generating image tokens left to right at sampling time (a generation-time sketch follows below).
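The two stages compose at generation time: the transformer autoregressively samples image codes conditioned on the caption tokens, and the dVAE decoder renders those codes into pixels. The sketch below is a conceptual outline with hypothetical `transformer` and `dvae_decoder` callables, not the released model:

```python
import torch


def dalle1_generate(text_tokens, transformer, dvae_decoder,
                    num_image_tokens: int = 1024):
    """Conceptual sketch of DALL-E 1 generation (hypothetical components).

    text_tokens:  (1, T) BPE-encoded caption, T <= 256
    transformer:  autoregressive model over [text tokens ; image tokens]
    dvae_decoder: maps a 32x32 grid of discrete codes back to pixels
    """
    sequence = text_tokens
    image_tokens = []
    for _ in range(num_image_tokens):
        logits = transformer(sequence)                  # (1, len, vocab)
        next_token = torch.multinomial(
            torch.softmax(logits[:, -1], dim=-1), num_samples=1
        )                                               # sample the next image code
        image_tokens.append(next_token)
        sequence = torch.cat([sequence, next_token], dim=1)

    # Code-index bookkeeping (text vs. image vocab offsets) is omitted here.
    codes = torch.cat(image_tokens, dim=1).view(1, 32, 32)
    return dvae_decoder(codes)                          # (1, 3, 256, 256) image
```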
DALL-E 2: Diffusion-Based Generation
DALL-E 2 replaced the discrete approach with diffusion models. Its pipeline has three parts: a frozen CLIP text encoder that embeds the prompt, a prior that maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder that generates the image conditioned on that image embedding, followed by diffusion upsamplers that raise the resolution from 64×64 to 256×256 and then 1024×1024.
This architecture leverages CLIP's well-structured embedding space while using diffusion for higher-quality image generation.
DALL-E 3: Native Caption Understanding
DALL-E 3 improved prompt following primarily by training on highly descriptive synthetic captions: an image captioner was trained to write detailed captions, these were mixed with original web captions during training, and at deployment user prompts are expanded by GPT-4 into similarly descriptive text.
DALL-E 2's prior model is crucial: it learns the distribution of CLIP image embeddings conditioned on CLIP text embeddings. This is trained on the same image-text pairs as CLIP, learning P(z_image | z_text). The prior can be autoregressive or diffusion-based.
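Putting the DALL-E 2 components together, generation proceeds text → CLIP text embedding → prior → CLIP image embedding → diffusion decoder. A high-level sketch, with all component names hypothetical:

```python
import torch


def unclip_generate(prompt: str, clip_text_encoder, prior, decoder,
                    num_decoder_steps: int = 50):
    """High-level sketch of the DALL-E 2 (unCLIP) generation pipeline.

    clip_text_encoder: text -> CLIP text embedding z_text
    prior:             z_text -> CLIP image embedding, modeling P(z_image | z_text)
    decoder:           diffusion model producing pixels conditioned on z_image
                       (and optionally the text embedding)
    """
    z_text = clip_text_encoder(prompt)                 # (1, D)
    z_image = prior(z_text)                            # sample from P(z_image | z_text)

    # Diffusion decoder: start from noise and iteratively denoise,
    # conditioned on the predicted image embedding.
    x = torch.randn(1, 3, 64, 64)                      # base resolution before upsampling
    for t in reversed(range(num_decoder_steps)):
        x = decoder(x, t, z_image, z_text)             # one denoising step
    return x                                           # upsamplers then go 64 -> 256 -> 1024
```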
Both CLIP and DALL-E introduced technical innovations that enabled their success. Understanding these innovations provides insight into why they work and how to build on them.
CLIP's Key Innovations:
Natural Language Supervision at Scale: instead of fixed label sets, CLIP learns from 400M noisy web image-caption pairs, so its supervision covers a far broader range of visual concepts.
Efficient Contrastive Learning: predicting which caption goes with which image in a batch trains several times faster than predicting the exact words of each caption.
Learned Temperature: treating the softmax temperature as a trainable parameter removes a sensitive hyperparameter and lets the model adjust how heavily negatives are weighted as training progresses.
DALL-E's Key Innovations:
Discrete Visual Tokens: compressing images to 1,024 dVAE codes makes the sequence short enough for a standard autoregressive transformer to model text and image jointly.
Sparse Attention Patterns: DALL-E 1 interleaves row, column, and convolutional attention masks over the image-token grid, keeping attention tractable at this sequence length.
Classifier-Free Guidance (DALL-E 2): training the diffusion decoder with and without conditioning, then extrapolating toward the conditional prediction at sampling time, trades diversity for much stronger prompt adherence (see the code below).
```python
import torch


def classifier_free_guidance(
    model: "DiffusionModel",
    noisy_image: torch.Tensor,
    timestep: torch.Tensor,
    text_embedding: torch.Tensor,
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """
    Apply classifier-free guidance during diffusion sampling.

    The model predicts noise for both conditional (with text) and
    unconditional (without text) settings. The final prediction
    extrapolates in the direction of the conditional prediction.

    Args:
        model: Diffusion model that accepts (x, t, condition)
        noisy_image: Current noisy image x_t
        timestep: Current diffusion timestep
        text_embedding: CLIP text embedding for conditioning
        guidance_scale: How much to amplify text conditioning

    Returns:
        Guided noise prediction
    """
    # Predict noise with text conditioning
    noise_pred_cond = model(noisy_image, timestep, text_embedding)

    # Predict noise without conditioning (null prompt)
    null_embedding = torch.zeros_like(text_embedding)
    noise_pred_uncond = model(noisy_image, timestep, null_embedding)

    # Classifier-free guidance: extrapolate from unconditional
    # toward conditional prediction
    noise_pred = noise_pred_uncond + guidance_scale * (
        noise_pred_cond - noise_pred_uncond
    )

    return noise_pred


# Note: Higher guidance_scale (7.5-15) = more prompt adherence
# Lower guidance_scale (1-3) = more diversity, less adherence
```

CLIP and DALL-E have enabled a wide range of practical applications, from research tools to commercial products.
These models have notable limitations: CLIP can encode societal biases from training data; DALL-E can generate misleading or harmful imagery; both struggle with fine-grained details like text in images. Responsible deployment requires careful content filtering, watermarking, and user guidelines.
In the next page, we'll explore audio-visual learning—extending multimodal AI beyond vision and language to incorporate sound, enabling applications in video understanding, speech-visual alignment, and multi-sensory AI systems.