In January 2021, OpenAI released two models that fundamentally changed the landscape of vision-language AI: CLIP (Contrastive Language-Image Pre-training) and DALL-E. While conceptually distinct—CLIP learns to understand the relationship between images and text, while DALL-E generates images from text—these models share a common insight: large-scale pre-training on image-text pairs enables remarkable emergent capabilities.
CLIP demonstrated that a model trained to match images with their captions could perform zero-shot image classification competitive with supervised methods trained on millions of labeled examples. DALL-E showed that the same principles could be inverted: by learning the mapping from text to images, AI could generate novel, creative imagery from natural language descriptions.
Together, these models represent a paradigm shift from task-specific, supervised learning to flexible, general-purpose multimodal understanding.
This page provides exhaustive coverage of CLIP and DALL-E architectures, training methodologies, mathematical foundations, and practical implementations. You will understand the design decisions that made these models successful and how to apply them effectively.
CLIP's architecture is elegantly simple: two separate encoders—one for images, one for text—that map their respective inputs to a shared embedding space where semantic similarity can be measured via dot product.
Image Encoder Options:
OpenAI trained CLIP with two vision encoder architectures:
ResNet variants: Modified ResNets (RN50, RN101, RN50x4, RN50x16, RN50x64) with attention pooling instead of global average pooling. The attention pooling is a single transformer-style multi-head QKV attention layer whose query is conditioned on the globally average-pooled image features.
Vision Transformer (ViT): ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px. Images are divided into patches, linearly projected, and processed by a standard transformer encoder.
Text Encoder:
The text encoder is a transformer with masked self-attention (GPT-2 style), taking text tokens as input and producing a sequence-level representation from the [EOS] token position.
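A small but important detail: the released implementation selects the feature at the [EOS] position rather than pooling over the whole sequence, locating it with `argmax` because the end-of-text token has the highest id in CLIP's BPE vocabulary. A minimal sketch of that pooling step (the `pool_eos` helper name is ours, not from the official code):

```python
import torch

def pool_eos(x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """Pool transformer outputs at the [EOS] position of each sequence.

    x:    (batch, seq_len, width) token features from the text transformer
    text: (batch, seq_len) token ids; the end-of-text token has the largest
          id in CLIP's BPE vocabulary, so argmax over the sequence finds it.
    """
    eos_positions = text.argmax(dim=-1)
    return x[torch.arange(x.shape[0]), eos_positions]
```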
Projection Heads:
Both encoders output features that are linearly projected to the shared embedding space (dimension 512 or 768). These projections are learned during training and are critical for alignment.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIP(nn.Module):
    """
    Simplified CLIP implementation demonstrating core architecture.
    """

    def __init__(
        self,
        embed_dim: int = 512,
        vision_width: int = 768,
        text_width: int = 512,
        vision_layers: int = 12,
        text_layers: int = 12,
        vocab_size: int = 49152,
        context_length: int = 77,
    ):
        super().__init__()

        # Vision encoder (simplified ViT)
        self.visual = VisionTransformer(
            width=vision_width,
            layers=vision_layers,
            output_dim=embed_dim,
        )

        # Text encoder
        self.transformer = TextTransformer(
            width=text_width,
            layers=text_layers,
            vocab_size=vocab_size,
            context_length=context_length,
        )

        # Projection layers
        self.visual_projection = nn.Linear(vision_width, embed_dim, bias=False)
        self.text_projection = nn.Linear(text_width, embed_dim, bias=False)

        # Learnable temperature parameter (log scale for stability)
        self.logit_scale = nn.Parameter(torch.ones([]) * torch.log(torch.tensor(1 / 0.07)))

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        """Encode images to normalized embeddings."""
        features = self.visual(image)
        projected = self.visual_projection(features)
        return F.normalize(projected, dim=-1)

    def encode_text(self, text: torch.Tensor) -> torch.Tensor:
        """Encode text tokens to normalized embeddings."""
        features = self.transformer(text)
        projected = self.text_projection(features)
        return F.normalize(projected, dim=-1)

    def forward(self, image: torch.Tensor, text: torch.Tensor):
        """
        Compute image-text similarity matrix.

        Returns:
            logits_per_image: (N, N) similarity from image perspective
            logits_per_text: (N, N) similarity from text perspective
        """
        image_features = self.encode_image(image)
        text_features = self.encode_text(text)

        # Cosine similarity with learned temperature
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.T
        logits_per_text = logits_per_image.T

        return logits_per_image, logits_per_text
```

CLIP's training procedure is a masterclass in scalable contrastive learning. Understanding the mathematical formulation and practical considerations is essential for reproducing or extending CLIP.
Contrastive Pre-training Objective:
Given a batch of N image-text pairs, CLIP learns to maximize similarity between matching pairs while minimizing similarity between non-matching pairs. This is formulated as a symmetric cross-entropy loss:
$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{I \rightarrow T} + \mathcal{L}_{T \rightarrow I}\right)$$
Where: $$\mathcal{L}_{I \rightarrow T} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)}$$
Here, $s_{ij} = \text{sim}(I_i, T_j) = I_i^T T_j$ for normalized embeddings, and $\tau$ is the temperature parameter.
The Role of Temperature:
The temperature parameter $\tau$ controls the sharpness of the softmax distribution: a low $\tau$ sharpens it, concentrating the loss on separating the correct pair from the hardest negatives, while a high $\tau$ flattens it and weakens the training signal.
CLIP learns $\tau$ as a parameter, initialized to 0.07 (actually learns $\log(1/\tau)$ for numerical stability).
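To see why the temperature matters, the short sketch below compares the softmax over one row of similarity scores at two temperatures; the similarity values are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

# One row of cosine similarities between an image and 4 candidate captions
# (made-up values for illustration).
similarities = torch.tensor([0.31, 0.28, 0.05, -0.10])

for tau in (1.0, 0.07):
    probs = F.softmax(similarities / tau, dim=-1)
    print(f"tau={tau}: {probs.tolist()}")

# tau=1.0 gives a nearly uniform distribution; tau=0.07 sharpens it so the
# loss focuses on separating the correct caption from the hardest negatives.
```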
Training Data: WIT (WebImageText)
CLIP was trained on 400 million image-text pairs collected from the internet. The key training settings were:
| Parameter | Value | Notes |
|---|---|---|
| Batch Size | 32,768 | Large batch critical for contrastive learning |
| Learning Rate | 5e-4 (ViT-L) | Cosine decay schedule |
| Training Steps | ~12.8B image-text pairs seen | 32 epochs over 400M pairs |
| Temperature Init | 0.07 | Learned during training |
| Image Resolution | 224×224 (336 for large) | Standard ViT resolution |
| Precision | Mixed (FP16/FP32) | Gradient scaling for stability |
```python
import torch
import torch.nn.functional as F
import torch.distributed as dist


def clip_loss(
    image_embeddings: torch.Tensor,
    text_embeddings: torch.Tensor,
    logit_scale: torch.Tensor,
    gather_with_grad: bool = True,
) -> torch.Tensor:
    """
    Compute symmetric CLIP contrastive loss.

    Args:
        image_embeddings: (N, D) normalized image embeddings
        text_embeddings: (N, D) normalized text embeddings
        logit_scale: Learned temperature (log scale)
        gather_with_grad: Whether to gather across GPUs with gradients

    Returns:
        Scalar loss value
    """
    # For distributed training, gather embeddings from all GPUs
    if dist.is_initialized() and gather_with_grad:
        # All-gather with gradient support
        all_image_embeddings = gather_with_gradient(image_embeddings)
        all_text_embeddings = gather_with_gradient(text_embeddings)
    else:
        all_image_embeddings = image_embeddings
        all_text_embeddings = text_embeddings

    # Similarity of the local embeddings against the full gathered batch,
    # scaled by temperature
    scale = logit_scale.exp()
    logits_per_image = scale * image_embeddings @ all_text_embeddings.T
    logits_per_text = scale * text_embeddings @ all_image_embeddings.T

    # Ground truth: each local example matches its own position in the
    # gathered batch (offset by rank in distributed training)
    batch_size = image_embeddings.shape[0]
    labels = torch.arange(batch_size, device=image_embeddings.device)
    if dist.is_initialized():
        labels = labels + dist.get_rank() * batch_size

    # Symmetric cross-entropy loss
    loss_i2t = F.cross_entropy(logits_per_image, labels)
    loss_t2i = F.cross_entropy(logits_per_text, labels)

    return (loss_i2t + loss_t2i) / 2


def gather_with_gradient(tensor: torch.Tensor) -> torch.Tensor:
    """All-gather tensors while preserving gradients for the local shard."""
    world_size = dist.get_world_size()
    tensors = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(tensors, tensor)
    # all_gather returns detached tensors; re-insert the local tensor
    # so its gradient still flows
    rank = dist.get_rank()
    tensors[rank] = tensor
    return torch.cat(tensors, dim=0)
```

CLIP's performance degrades significantly with small batch sizes. The contrastive objective requires many negative examples to learn discriminative representations. Techniques like gradient checkpointing, mixed precision, and distributed training are essential for achieving large effective batch sizes.
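Putting the hyperparameters from the table and the loss function above together, a single training step might look like the sketch below. It assumes the simplified `CLIP` module and `clip_loss` function from the earlier code blocks; the schedule and hyperparameters are illustrative, not the exact OpenAI recipe:

```python
import math
import torch


def train_clip(model, dataloader, steps: int, base_lr: float = 5e-4,
               warmup: int = 2000, device: str = "cuda"):
    """Sketch of a mixed-precision CLIP training loop (not the official recipe)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.2)
    scaler = torch.cuda.amp.GradScaler()  # FP16 gradient scaling for stability

    for step, (images, texts) in enumerate(dataloader):
        if step >= steps:
            break

        # Linear warmup followed by cosine decay
        if step < warmup:
            lr = base_lr * (step + 1) / warmup
        else:
            progress = (step - warmup) / max(1, steps - warmup)
            lr = 0.5 * base_lr * (1 + math.cos(math.pi * progress))
        for group in optimizer.param_groups:
            group["lr"] = lr

        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            image_emb = model.encode_image(images.to(device))
            text_emb = model.encode_text(texts.to(device))
            loss = clip_loss(image_emb, text_emb, model.logit_scale,
                             gather_with_grad=False)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        # Clamp the learned temperature so logits stay bounded (max scale 100)
        with torch.no_grad():
            model.logit_scale.clamp_(max=math.log(100))
```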
CLIP's zero-shot classification ability is its most celebrated feature. By learning a joint embedding space, CLIP can classify images into arbitrary categories using only textual descriptions of those categories.
The Zero-Shot Protocol:
1. Insert each candidate class name into one or more prompt templates (e.g., "a photo of a {class}").
2. Encode the prompts with the text encoder and normalize to obtain one embedding per class.
3. Encode the test image with the image encoder.
4. Predict the class whose text embedding has the highest cosine similarity with the image embedding.
Prompt Engineering:
Raw class names often underperform compared to full sentences because CLIP was trained on image captions, not isolated labels. The official CLIP implementation uses an ensemble of 80 prompt templates:
"a photo of a {class}"
"a blurry photo of a {class}"
"a low contrast photo of a {class}"
"a high contrast photo of a {class}"
"a bad photo of a {class}"
"a good photo of a {class}"
"a photo of a small {class}"
"a photo of a big {class}"
"a photo of the {class}"
"a {class} in a video game"
...
The embeddings from all prompts are averaged to create a more robust class representation.
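The whole zero-shot pipeline, including prompt ensembling, fits in a few lines. The sketch below assumes a CLIP-style model such as the simplified `CLIP` class above and a `tokenize` function compatible with it; both are stand-ins for whatever implementation you actually use (e.g., the official `clip` package or `open_clip`):

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def zero_shot_classify(model, tokenize, images, class_names, templates):
    """Classify images against arbitrary class names via prompt ensembling.

    model:       CLIP-style model with encode_image / encode_text
    tokenize:    function mapping a list of strings to token id tensors
    images:      (B, 3, H, W) preprocessed image batch
    class_names: list of strings, e.g. ["dog", "cat", "car"]
    templates:   list of templates like "a photo of a {class}"
    """
    # Build one averaged, renormalized text embedding per class
    class_embeddings = []
    for name in class_names:
        prompts = tokenize([t.replace("{class}", name) for t in templates])
        text_emb = model.encode_text(prompts)           # (T, D), already normalized
        mean_emb = F.normalize(text_emb.mean(dim=0), dim=-1)
        class_embeddings.append(mean_emb)
    class_matrix = torch.stack(class_embeddings)         # (C, D)

    # Encode images and pick the most similar class
    image_emb = model.encode_image(images)               # (B, D), normalized
    logits = image_emb @ class_matrix.T                  # cosine similarities
    return logits.argmax(dim=-1)                         # (B,) predicted class index
```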
| Dataset | Classes | CLIP ViT-L/14@336 (zero-shot) | Best Supervised |
|---|---|---|---|
| ImageNet | 1000 | 76.2% | ~90% (EfficientNet-L2) |
| CIFAR-10 | 10 | 95.7% | 99.5% |
| CIFAR-100 | 100 | 77.5% | 93.0% |
| Food-101 | 101 | 93.8% | 93.0% (first zero-shot win!) |
| Stanford Cars | 196 | 77.3% | 94.0% |
| ObjectNet | 113 | 69.0% | 55.0% (ResNet-152) |
For specialized domains, tailored prompts dramatically improve performance. Medical imaging might use 'an X-ray showing {condition}'; satellite imagery might use 'satellite view of {terrain}'. Investing time in prompt engineering often yields better returns than model fine-tuning.
DALL-E flips the vision-language paradigm: instead of understanding images through text, it generates images from text. The original DALL-E (2021) used a discrete VAE and autoregressive transformer, while DALL-E 2 (2022) and DALL-E 3 (2023) adopted diffusion-based generation.
DALL-E 1: Discrete Token Generation
DALL-E 1's architecture combines two stages:
Stage 1: dVAE (Discrete Variational Autoencoder). A discrete VAE compresses each 256×256 image into a 32×32 grid of tokens drawn from an 8,192-entry visual codebook, so an image becomes 1,024 discrete tokens.
Stage 2: Autoregressive Transformer. A 12-billion-parameter decoder-only transformer models the concatenated sequence of up to 256 BPE text tokens followed by the 1,024 image tokens, generating image tokens left to right at sampling time (a generation-time sketch follows below).
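The two stages compose at generation time: the transformer autoregressively samples image codes conditioned on the caption tokens, and the dVAE decoder renders those codes into pixels. The sketch below is a conceptual outline with hypothetical `transformer` and `dvae_decoder` callables, not the released model:

```python
import torch


def dalle1_generate(text_tokens, transformer, dvae_decoder,
                    num_image_tokens: int = 1024):
    """Conceptual sketch of DALL-E 1 generation (hypothetical components).

    text_tokens:  (1, T) BPE-encoded caption, T <= 256
    transformer:  autoregressive model over [text tokens ; image tokens]
    dvae_decoder: maps a 32x32 grid of discrete codes back to pixels
    """
    sequence = text_tokens
    image_tokens = []
    for _ in range(num_image_tokens):
        logits = transformer(sequence)                  # (1, len, vocab)
        next_token = torch.multinomial(
            torch.softmax(logits[:, -1], dim=-1), num_samples=1
        )                                               # sample the next image code
        image_tokens.append(next_token)
        sequence = torch.cat([sequence, next_token], dim=1)

    # Code-index bookkeeping (text vs. image vocab offsets) is omitted here.
    codes = torch.cat(image_tokens, dim=1).view(1, 32, 32)
    return dvae_decoder(codes)                          # (1, 3, 256, 256) image
```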
DALL-E 2: Diffusion-Based Generation
DALL-E 2 replaced the discrete approach with diffusion models. Its pipeline has three parts: a frozen CLIP text encoder that embeds the prompt, a prior that maps the CLIP text embedding to a CLIP image embedding, and a diffusion decoder that generates the image conditioned on that image embedding, followed by diffusion upsamplers that raise the resolution from 64×64 to 256×256 and then 1024×1024.
This architecture leverages CLIP's well-structured embedding space while using diffusion for higher-quality image generation.
DALL-E 3: Native Caption Understanding
DALL-E 3 improved prompt following primarily by training on highly descriptive synthetic captions: an image captioner was trained to write detailed captions, these were mixed with original web captions during training, and at deployment user prompts are expanded by GPT-4 into similarly descriptive text.
DALL-E 2's prior model is crucial: it learns the distribution of CLIP image embeddings conditioned on CLIP text embeddings. This is trained on the same image-text pairs as CLIP, learning P(z_image | z_text). The prior can be autoregressive or diffusion-based.
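Putting the DALL-E 2 components together, generation proceeds text → CLIP text embedding → prior → CLIP image embedding → diffusion decoder. A high-level sketch, with all component names hypothetical:

```python
import torch


def unclip_generate(prompt: str, clip_text_encoder, prior, decoder,
                    num_decoder_steps: int = 50):
    """High-level sketch of the DALL-E 2 (unCLIP) generation pipeline.

    clip_text_encoder: text -> CLIP text embedding z_text
    prior:             z_text -> CLIP image embedding, modeling P(z_image | z_text)
    decoder:           diffusion model producing pixels conditioned on z_image
                       (and optionally the text embedding)
    """
    z_text = clip_text_encoder(prompt)                 # (1, D)
    z_image = prior(z_text)                            # sample from P(z_image | z_text)

    # Diffusion decoder: start from noise and iteratively denoise,
    # conditioned on the predicted image embedding.
    x = torch.randn(1, 3, 64, 64)                      # base resolution before upsampling
    for t in reversed(range(num_decoder_steps)):
        x = decoder(x, t, z_image, z_text)             # one denoising step
    return x                                           # upsamplers then go 64 -> 256 -> 1024
```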
Both CLIP and DALL-E introduced technical innovations that enabled their success. Understanding these innovations provides insight into why they work and how to build on them.
CLIP's Key Innovations:
Natural Language Supervision at Scale: instead of fixed label sets, CLIP learns from 400M noisy web image-caption pairs, so its supervision covers a far broader range of visual concepts.
Efficient Contrastive Learning: predicting which caption goes with which image in a batch trains several times faster than predicting the exact words of each caption.
Learned Temperature: treating the softmax temperature as a trainable parameter removes a sensitive hyperparameter and lets the model adjust how heavily negatives are weighted as training progresses.
DALL-E's Key Innovations:
Discrete Visual Tokens: compressing images to 1,024 dVAE codes makes the sequence short enough for a standard autoregressive transformer to model text and image jointly.
Sparse Attention Patterns: DALL-E 1 interleaves row, column, and convolutional attention masks over the image-token grid, keeping attention tractable at this sequence length.
Classifier-Free Guidance (DALL-E 2): training the diffusion decoder with and without conditioning, then extrapolating toward the conditional prediction at sampling time, trades diversity for much stronger prompt adherence (see the code below).
```python
import torch


def classifier_free_guidance(
    model: "DiffusionModel",
    noisy_image: torch.Tensor,
    timestep: torch.Tensor,
    text_embedding: torch.Tensor,
    guidance_scale: float = 7.5,
) -> torch.Tensor:
    """
    Apply classifier-free guidance during diffusion sampling.

    The model predicts noise for both conditional (with text) and
    unconditional (without text) settings. The final prediction
    extrapolates in the direction of the conditional prediction.

    Args:
        model: Diffusion model that accepts (x, t, condition)
        noisy_image: Current noisy image x_t
        timestep: Current diffusion timestep
        text_embedding: CLIP text embedding for conditioning
        guidance_scale: How much to amplify text conditioning

    Returns:
        Guided noise prediction
    """
    # Predict noise with text conditioning
    noise_pred_cond = model(noisy_image, timestep, text_embedding)

    # Predict noise without conditioning (null prompt)
    null_embedding = torch.zeros_like(text_embedding)
    noise_pred_uncond = model(noisy_image, timestep, null_embedding)

    # Classifier-free guidance: extrapolate from unconditional
    # toward conditional prediction
    noise_pred = noise_pred_uncond + guidance_scale * (
        noise_pred_cond - noise_pred_uncond
    )

    return noise_pred


# Note: Higher guidance_scale (7.5-15) = more prompt adherence
# Lower guidance_scale (1-3) = more diversity, less adherence
```

CLIP and DALL-E have enabled a wide range of practical applications, from research tools to commercial products.
These models have notable limitations: CLIP can encode societal biases from training data; DALL-E can generate misleading or harmful imagery; both struggle with fine-grained details like text in images. Responsible deployment requires careful content filtering, watermarking, and user guidelines.
In the next page, we'll explore audio-visual learning—extending multimodal AI beyond vision and language to incorporate sound, enabling applications in video understanding, speech-visual alignment, and multi-sensory AI systems.