Vision-language pre-training represents a paradigm shift in self-supervised learning: instead of learning visual representations in isolation, we jointly learn visual and textual representations that share a common embedding space.
The internet provides billions of image-text pairs: photos with captions, products with descriptions, social media posts with images. This naturally occurring multimodal data enables self-supervised learning at unprecedented scale.
CLIP was trained on 400 million image-text pairs; ALIGN on 1.8 billion. This scale, combined with contrastive learning between modalities, produces representations with remarkable zero-shot capabilities—matching or exceeding supervised models on many benchmarks without seeing a single labeled example.
CLIP (Contrastive Language-Image Pre-training) by Radford et al. (2021) from OpenAI learns to align images and text in a shared embedding space through contrastive learning.
Image Encoder: ResNet or Vision Transformer
Text Encoder: Transformer
Training: For a batch of N image-text pairs, maximize similarity of correct pairs while minimizing similarity of incorrect pairs.
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(v_i \cdot t_i / \tau)}{\sum_j \exp(v_i \cdot t_j / \tau)} + \log\frac{\exp(t_i \cdot v_i / \tau)}{\sum_j \exp(t_i \cdot v_j / \tau)}\right]$$
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIP(nn.Module):
    """CLIP: Contrastive Language-Image Pre-training."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512, temp=0.07):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.temperature = nn.Parameter(torch.ones([]) * temp)
        # Projection heads to shared space
        self.image_proj = nn.Linear(image_encoder.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.output_dim, embed_dim)

    def encode_image(self, images):
        features = self.image_encoder(images)
        return F.normalize(self.image_proj(features), dim=-1)

    def encode_text(self, text):
        features = self.text_encoder(text)
        return F.normalize(self.text_proj(features), dim=-1)

    def forward(self, images, text):
        # Encode both modalities
        image_embeds = self.encode_image(images)
        text_embeds = self.encode_text(text)

        # Compute similarity matrix
        logits = image_embeds @ text_embeds.T / self.temperature

        # Symmetric contrastive loss
        labels = torch.arange(len(images), device=images.device)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2

    def zero_shot_classify(self, images, class_texts):
        """Zero-shot classification using text prompts."""
        image_embeds = self.encode_image(images)
        text_embeds = self.encode_text(class_texts)
        similarity = image_embeds @ text_embeds.T
        return similarity.argmax(dim=-1)
```

CLIP's zero-shot performance depends heavily on text prompts. Instead of just "dog", using "a photo of a dog" significantly improves accuracy. Ensembling prompts such as "a photo of a [class]", "a good photo of a [class]", and "a photo of the large [class]" boosts performance further.
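Prompt ensembling can be sketched as averaging the text embeddings of several templates per class. This is a minimal illustration, assuming an `encode_text` that maps a list of strings to L2-normalized embeddings, as in the CLIP class above; the templates are examples, not an exhaustive list.

```python
import torch
import torch.nn.functional as F

PROMPT_TEMPLATES = [
    "a photo of a {}.",
    "a good photo of a {}.",
    "a photo of the large {}.",
]

def build_ensemble_text_embeds(encode_text, class_names, templates=PROMPT_TEMPLATES):
    """Average text embeddings over multiple prompt templates per class."""
    class_embeds = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        embeds = encode_text(prompts)        # (num_templates, embed_dim)
        mean_embed = embeds.mean(dim=0)      # average over templates
        # Re-normalize so cosine similarity remains well-defined
        class_embeds.append(F.normalize(mean_embed, dim=-1))
    return torch.stack(class_embeds)         # (num_classes, embed_dim)
```

Image embeddings are then compared against these class embeddings exactly as in single-prompt zero-shot classification.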
ALIGN (A Large-scale ImaGe and Noisy-text embedding) by Jia et al. (2021) from Google demonstrates that scale can compensate for noise in training data.
| Model | Data Size | Image Encoder | Text Encoder | ImageNet Zero-shot |
|---|---|---|---|---|
| CLIP | 400M | ViT-L/14 | Transformer | 75.5% |
| ALIGN | 1.8B | EfficientNet-L2 | BERT-Large | 76.4% |
| OpenCLIP | 2B | ViT-G/14 | Transformer | 78.0% |
| EVA-CLIP | 2B | EVA-G | Transformer | 78.5% |
```python
import numpy as np

def analyze_scaling_behavior():
    """Analyze how performance scales with data and model size.

    Empirical observations from the CLIP/ALIGN papers:
    - Zero-shot accuracy scales roughly log-linearly with compute
    - Doubling data consistently improves accuracy by ~1-2%
    - Model size matters, but data size matters more
    """
    # Example scaling-law fit: accuracy ≈ a * log10(compute) + b
    compute = np.array([1e18, 1e19, 1e20, 1e21])  # FLOPs
    accuracy = np.array([55, 65, 72, 76])         # ImageNet zero-shot (%)

    # Log-linear fit
    log_compute = np.log10(compute)
    coeffs = np.polyfit(log_compute, accuracy, 1)
    print(f"Scaling law: acc = {coeffs[0]:.1f} * log10(compute) + {coeffs[1]:.1f}")
    # Typical result: acc = 7 * log10(compute) - 71
    return coeffs
```

While CLIP and ALIGN use pure contrastive learning, newer methods combine multiple objectives for richer representations.
BLIP by Li et al. (2022) combines three objectives: image-text contrastive learning (ITC), image-text matching (ITM), and language modeling (LM) for captioning.
CoCa by Yu et al. (2022) unifies contrastive and generative objectives: a CLIP-style contrastive loss between image and text embeddings, plus a captioning loss from a multimodal text decoder.
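The CoCa-style combination can be sketched as a weighted sum of the two losses. This is a simplified illustration, not the paper's implementation: `caption_logits` and `caption_targets` stand in for a multimodal decoder's output, and the `caption_weight` value here is illustrative.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_embeds, text_embeds, caption_logits, caption_targets,
                    temperature=0.07, caption_weight=2.0):
    """Sketch of a CoCa-style objective: contrastive + captioning.

    image_embeds, text_embeds: L2-normalized, shape (batch, embed_dim)
    caption_logits: (batch, seq_len, vocab); caption_targets: (batch, seq_len)
    """
    # Symmetric contrastive loss, as in CLIP
    logits = image_embeds @ text_embeds.T / temperature
    labels = torch.arange(len(image_embeds))
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels)) / 2

    # Captioning loss: per-token cross-entropy over the vocabulary
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1))

    return contrastive + caption_weight * captioning
```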
```python
class BLIP(nn.Module):
    """BLIP: Bootstrapping Language-Image Pre-training."""

    def __init__(self, image_encoder, text_encoder, multimodal_encoder, text_decoder):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.multimodal_encoder = multimodal_encoder  # Cross-attention
        self.text_decoder = text_decoder              # Captioning
        self.itm_head = nn.Linear(text_encoder.hidden_size, 2)

    def forward(self, images, text, mode='all'):
        image_embeds = self.image_encoder(images)
        losses = {}

        if mode in ['all', 'itc']:
            # Image-text contrastive (helper loss methods assumed defined)
            text_embeds = self.text_encoder(text, mode='cls')
            losses['itc'] = self.contrastive_loss(image_embeds, text_embeds)

        if mode in ['all', 'itm']:
            # Image-text matching with hard negatives
            multimodal_out = self.multimodal_encoder(image_embeds, text)
            itm_logits = self.itm_head(multimodal_out[:, 0])
            # Hard negatives created by in-batch shuffling
            losses['itm'] = self.itm_loss(itm_logits, images, text)

        if mode in ['all', 'lm']:
            # Language modeling (captioning)
            losses['lm'] = self.text_decoder(image_embeds, text)

        return losses
```

SigLIP by Zhai et al. (2023) replaces the softmax-based contrastive loss with a sigmoid loss, enabling better scaling and simpler distributed training.
Softmax (CLIP): each image competes with all texts in the batch $$\mathcal{L} = -\log\frac{\exp(s_{ii})}{\sum_j \exp(s_{ij})}$$
Sigmoid (SigLIP): Independent binary classification for each pair $$\mathcal{L} = -\sum_{i,j} \left[y_{ij}\log\sigma(s_{ij}) + (1-y_{ij})\log(1-\sigma(s_{ij}))\right]$$
where $y_{ij} = 1$ if $(i,j)$ is a correct pair.
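To make the difference concrete, here is a toy comparison of the two losses on the same similarity matrix. This is a self-contained sketch; the matrix values are illustrative.

```python
import torch
import torch.nn.functional as F

def softmax_contrastive_loss(logits):
    """CLIP-style symmetric softmax loss over an (N, N) similarity matrix."""
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

def sigmoid_pairwise_loss(logits):
    """SigLIP-style loss: an independent binary decision for each pair."""
    targets = torch.eye(logits.size(0))  # 1 on the diagonal, 0 elsewhere
    return F.binary_cross_entropy_with_logits(logits, targets)

# Toy similarity matrix: strong diagonal (correct pairs), weak off-diagonal
logits = torch.full((4, 4), -2.0) + 4.0 * torch.eye(4)

print(f"softmax loss: {softmax_contrastive_loss(logits):.4f}")
print(f"sigmoid loss: {sigmoid_pairwise_loss(logits):.4f}")
```

The key practical difference: the softmax loss couples every row through its normalizer, so all pairwise similarities for a batch must be gathered in one place, while the sigmoid loss decomposes into independent per-pair terms, which simplifies sharding across devices.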
```python
class SigLIP(nn.Module):
    """SigLIP: Sigmoid Loss for Language-Image Pre-training."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.logit_scale = nn.Parameter(torch.ones([]) * 10)
        # The paper initializes the bias to a large negative value (-10)
        # to offset the heavy imbalance of negative pairs over positives
        self.logit_bias = nn.Parameter(torch.ones([]) * -10)

    def forward(self, images, text):
        image_embeds = F.normalize(self.image_encoder(images), dim=-1)
        text_embeds = F.normalize(self.text_encoder(text), dim=-1)

        # Similarity matrix
        logits = image_embeds @ text_embeds.T * self.logit_scale + self.logit_bias

        # Labels: 1 on the diagonal (correct pairs), 0 elsewhere
        n = len(images)
        labels = torch.eye(n, device=images.device)

        # Binary cross-entropy for each pair independently
        return F.binary_cross_entropy_with_logits(logits, labels)
```

Vision-language pre-training enables a wide range of downstream applications, often in zero-shot or few-shot settings.
```python
def zero_shot_classification(model, images, class_names):
    """Zero-shot image classification with CLIP."""
    # Create text prompts
    prompts = [f"a photo of a {c}" for c in class_names]
    text_embeds = model.encode_text(prompts)
    image_embeds = model.encode_image(images)

    # Compute similarity and pick the closest class for each image
    similarity = image_embeds @ text_embeds.T
    predictions = similarity.argmax(dim=-1)
    return [class_names[p] for p in predictions]

def image_text_retrieval(model, captions, query_image):
    """Retrieve the top-5 captions for a query image."""
    image_embed = model.encode_image(query_image.unsqueeze(0))
    text_embeds = model.encode_text(captions)

    similarities = (image_embed @ text_embeds.T).squeeze()
    top_k = similarities.argsort(descending=True)[:5]
    return [captions[i] for i in top_k]
```

You now understand vision-language pre-training from CLIP through modern advances like SigLIP. These multimodal methods learn rich representations by connecting visual and textual understanding. Next, we'll explore foundation model training: how these techniques scale to create general-purpose AI systems.