Vision-language pre-training represents a paradigm shift in self-supervised learning: instead of learning visual representations in isolation, we jointly learn visual and textual representations that share a common embedding space.
The internet provides billions of image-text pairs: photos with captions, products with descriptions, social media posts with images. This naturally occurring multimodal data enables self-supervised learning at unprecedented scale.
CLIP was trained on 400 million image-text pairs; ALIGN on 1.8 billion. This scale, combined with contrastive learning between modalities, produces representations with remarkable zero-shot capabilities—matching or exceeding supervised models on many benchmarks without seeing a single labeled example.
CLIP (Contrastive Language-Image Pre-training) by Radford et al. (2021) from OpenAI learns to align images and text in a shared embedding space through contrastive learning.
Image Encoder: ResNet or Vision Transformer
Text Encoder: Transformer
Training: For a batch of N image-text pairs, maximize similarity of correct pairs while minimizing similarity of incorrect pairs.
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(v_i \cdot t_i / \tau)}{\sum_j \exp(v_i \cdot t_j / \tau)} + \log\frac{\exp(t_i \cdot v_i / \tau)}{\sum_j \exp(t_i \cdot v_j / \tau)}\right]$$
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIP(nn.Module):
    """CLIP: Contrastive Language-Image Pre-training."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512, temp=0.07):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.temperature = nn.Parameter(torch.ones([]) * temp)
        # Projection heads to shared space
        self.image_proj = nn.Linear(image_encoder.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.output_dim, embed_dim)

    def encode_image(self, images):
        features = self.image_encoder(images)
        return F.normalize(self.image_proj(features), dim=-1)

    def encode_text(self, text):
        features = self.text_encoder(text)
        return F.normalize(self.text_proj(features), dim=-1)

    def forward(self, images, text):
        # Encode both modalities
        image_embeds = self.encode_image(images)
        text_embeds = self.encode_text(text)

        # Compute similarity matrix
        logits = image_embeds @ text_embeds.T / self.temperature

        # Symmetric contrastive loss
        labels = torch.arange(len(images), device=images.device)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.T, labels)
        return (loss_i2t + loss_t2i) / 2

    def zero_shot_classify(self, images, class_texts):
        """Zero-shot classification using text prompts."""
        image_embeds = self.encode_image(images)
        text_embeds = self.encode_text(class_texts)
        similarity = image_embeds @ text_embeds.T
        return similarity.argmax(dim=-1)
```

CLIP's zero-shot performance depends heavily on text prompts. Instead of just "dog", using "a photo of a dog" significantly improves accuracy. Ensembling prompts such as "a photo of a [class]", "a good photo of a [class]", and "a photo of the large [class]" boosts performance further.
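Prompt ensembling can be sketched as averaging the text embeddings of several templates per class. This is a minimal illustration, assuming an `encode_text` that maps a list of strings to L2-normalized embeddings, as in the CLIP class above; the templates are examples, not an exhaustive list.

```python
import torch
import torch.nn.functional as F

PROMPT_TEMPLATES = [
    "a photo of a {}.",
    "a good photo of a {}.",
    "a photo of the large {}.",
]

def build_ensemble_text_embeds(encode_text, class_names, templates=PROMPT_TEMPLATES):
    """Average text embeddings over multiple prompt templates per class."""
    class_embeds = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        embeds = encode_text(prompts)        # (num_templates, embed_dim)
        mean_embed = embeds.mean(dim=0)      # average over templates
        # Re-normalize so cosine similarity remains well-defined
        class_embeds.append(F.normalize(mean_embed, dim=-1))
    return torch.stack(class_embeds)         # (num_classes, embed_dim)
```

Image embeddings are then compared against these class embeddings exactly as in single-prompt zero-shot classification.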
ALIGN (A Large-scale ImaGe and Noisy-text embedding) by Jia et al. (2021) from Google demonstrates that scale can compensate for noise in training data.
| Model | Data Size | Image Encoder | Text Encoder | ImageNet Zero-shot |
|---|---|---|---|---|
| CLIP | 400M | ViT-L/14 | Transformer | 75.5% |
| ALIGN | 1.8B | EfficientNet-L2 | BERT-Large | 76.4% |
| OpenCLIP | 2B | ViT-G/14 | Transformer | 78.0% |
| EVA-CLIP | 2B | EVA-G | Transformer | 78.5% |
```python
import numpy as np

def analyze_scaling_behavior():
    """Analyze how performance scales with data and model size.

    Empirical observations from the CLIP/ALIGN papers:
    - Zero-shot accuracy scales roughly log-linearly with compute
    - Doubling data consistently improves accuracy by ~1-2%
    - Model size matters, but data size matters more
    """
    # Example scaling-law fit: accuracy ≈ a * log10(compute) + b
    compute = np.array([1e18, 1e19, 1e20, 1e21])  # FLOPs
    accuracy = np.array([55, 65, 72, 76])         # ImageNet zero-shot (%)

    # Log-linear fit
    log_compute = np.log10(compute)
    coeffs = np.polyfit(log_compute, accuracy, 1)
    print(f"Scaling law: acc = {coeffs[0]:.1f} * log10(compute) + {coeffs[1]:.1f}")
    # Typical result: acc = 7 * log10(compute) - 71
    return coeffs
```

While CLIP and ALIGN use pure contrastive learning, newer methods combine multiple objectives for richer representations.
BLIP by Li et al. (2022) combines three objectives: image-text contrastive learning (ITC), image-text matching (ITM), and language modeling (LM) for captioning.
CoCa by Yu et al. (2022) unifies contrastive and generative objectives: a CLIP-style contrastive loss between image and text embeddings, plus a captioning loss from a multimodal text decoder.
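The CoCa-style combination can be sketched as a weighted sum of the two losses. This is a simplified illustration, not the paper's implementation: `caption_logits` and `caption_targets` stand in for a multimodal decoder's output, and the `caption_weight` value here is illustrative.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(image_embeds, text_embeds, caption_logits, caption_targets,
                    temperature=0.07, caption_weight=2.0):
    """Sketch of a CoCa-style objective: contrastive + captioning.

    image_embeds, text_embeds: L2-normalized, shape (batch, embed_dim)
    caption_logits: (batch, seq_len, vocab); caption_targets: (batch, seq_len)
    """
    # Symmetric contrastive loss, as in CLIP
    logits = image_embeds @ text_embeds.T / temperature
    labels = torch.arange(len(image_embeds))
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels)) / 2

    # Captioning loss: per-token cross-entropy over the vocabulary
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1))

    return contrastive + caption_weight * captioning
```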
```python
class BLIP(nn.Module):
    """BLIP: Bootstrapping Language-Image Pre-training."""

    def __init__(self, image_encoder, text_encoder, multimodal_encoder, text_decoder):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.multimodal_encoder = multimodal_encoder  # Cross-attention
        self.text_decoder = text_decoder              # Captioning
        self.itm_head = nn.Linear(text_encoder.hidden_size, 2)

    def forward(self, images, text, mode='all'):
        image_embeds = self.image_encoder(images)
        losses = {}

        if mode in ['all', 'itc']:
            # Image-text contrastive (helper loss methods assumed defined)
            text_embeds = self.text_encoder(text, mode='cls')
            losses['itc'] = self.contrastive_loss(image_embeds, text_embeds)

        if mode in ['all', 'itm']:
            # Image-text matching with hard negatives
            multimodal_out = self.multimodal_encoder(image_embeds, text)
            itm_logits = self.itm_head(multimodal_out[:, 0])
            # Hard negatives created by in-batch shuffling
            losses['itm'] = self.itm_loss(itm_logits, images, text)

        if mode in ['all', 'lm']:
            # Language modeling (captioning)
            losses['lm'] = self.text_decoder(image_embeds, text)

        return losses
```

SigLIP by Zhai et al. (2023) replaces the softmax-based contrastive loss with a sigmoid loss, enabling better scaling and simpler distributed training.
Softmax (CLIP): each image competes with all texts in the batch $$\mathcal{L} = -\log\frac{\exp(s_{ii})}{\sum_j \exp(s_{ij})}$$
Sigmoid (SigLIP): Independent binary classification for each pair $$\mathcal{L} = -\sum_{i,j} \left[y_{ij}\log\sigma(s_{ij}) + (1-y_{ij})\log(1-\sigma(s_{ij}))\right]$$
where $y_{ij} = 1$ if $(i,j)$ is a correct pair.
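To make the difference concrete, here is a toy comparison of the two losses on the same similarity matrix. This is a self-contained sketch; the matrix values are illustrative.

```python
import torch
import torch.nn.functional as F

def softmax_contrastive_loss(logits):
    """CLIP-style symmetric softmax loss over an (N, N) similarity matrix."""
    labels = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

def sigmoid_pairwise_loss(logits):
    """SigLIP-style loss: an independent binary decision for each pair."""
    targets = torch.eye(logits.size(0))  # 1 on the diagonal, 0 elsewhere
    return F.binary_cross_entropy_with_logits(logits, targets)

# Toy similarity matrix: strong diagonal (correct pairs), weak off-diagonal
logits = torch.full((4, 4), -2.0) + 4.0 * torch.eye(4)

print(f"softmax loss: {softmax_contrastive_loss(logits):.4f}")
print(f"sigmoid loss: {sigmoid_pairwise_loss(logits):.4f}")
```

The key practical difference: the softmax loss couples every row through its normalizer, so all pairwise similarities for a batch must be gathered in one place, while the sigmoid loss decomposes into independent per-pair terms, which simplifies sharding across devices.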
```python
class SigLIP(nn.Module):
    """SigLIP: Sigmoid Loss for Language-Image Pre-training."""

    def __init__(self, image_encoder, text_encoder, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.logit_scale = nn.Parameter(torch.ones([]) * 10)
        # The paper initializes the bias to a large negative value (-10)
        # to offset the heavy imbalance of negative pairs over positives
        self.logit_bias = nn.Parameter(torch.ones([]) * -10)

    def forward(self, images, text):
        image_embeds = F.normalize(self.image_encoder(images), dim=-1)
        text_embeds = F.normalize(self.text_encoder(text), dim=-1)

        # Similarity matrix
        logits = image_embeds @ text_embeds.T * self.logit_scale + self.logit_bias

        # Labels: 1 on the diagonal (correct pairs), 0 elsewhere
        n = len(images)
        labels = torch.eye(n, device=images.device)

        # Binary cross-entropy for each pair independently
        return F.binary_cross_entropy_with_logits(logits, labels)
```

Vision-language pre-training enables a wide range of downstream applications, often in zero-shot or few-shot settings.
```python
def zero_shot_classification(model, images, class_names):
    """Zero-shot image classification with CLIP."""
    # Create text prompts
    prompts = [f"a photo of a {c}" for c in class_names]
    text_embeds = model.encode_text(prompts)
    image_embeds = model.encode_image(images)

    # Compute similarity and pick the closest class for each image
    similarity = image_embeds @ text_embeds.T
    predictions = similarity.argmax(dim=-1)
    return [class_names[p] for p in predictions]

def image_text_retrieval(model, captions, query_image):
    """Retrieve the top-5 captions for a query image."""
    image_embed = model.encode_image(query_image.unsqueeze(0))
    text_embeds = model.encode_text(captions)

    similarities = (image_embed @ text_embeds.T).squeeze()
    top_k = similarities.argsort(descending=True)[:5]
    return [captions[i] for i in top_k]
```

You now understand vision-language pre-training from CLIP through modern advances like SigLIP. These multimodal methods learn rich representations by connecting visual and textual understanding. Next, we'll explore foundation model training: how these techniques scale to create general-purpose AI systems.