Self-supervised learning represents one of the most profound paradigm shifts in modern machine learning. At its core lies a deceptively simple yet powerful idea: create supervision signals from the data itself. Rather than relying on expensive human annotations, self-supervised methods exploit the inherent structure within raw data to learn rich, transferable representations.
The concept of pretext tasks forms the foundation of this approach. A pretext task is an auxiliary objective—carefully designed by researchers—that forces a model to learn meaningful features about the data in order to solve it. The task itself is not the end goal; rather, it serves as a vehicle for representation learning. The representations learned through solving pretext tasks can then be transferred to various downstream tasks where labeled data may be scarce.
By the end of this page, you will deeply understand pretext task design principles, master the taxonomy of pretext tasks across modalities, analyze the mathematical foundations of why pretext tasks work, and evaluate the strengths and limitations of different pretext task families.
To understand pretext tasks deeply, we must first grasp the philosophical insight that underlies them. The key observation is this: real-world data possesses rich internal structure that can serve as a free supervisory signal.
Consider a simple example. Given an image of a cat, if we remove a portion of the image and ask a model to predict what was removed, the model must understand:

- the texture and color of fur in the surrounding region,
- the shape of parts such as ears, eyes, and whiskers,
- how those parts fit together spatially to form a plausible cat.
No human labeled this image as 'cat'. Yet to solve this simple prediction task, the model effectively learns what a cat is. This is the power of pretext tasks.
Pretext tasks work because solving them requires understanding the underlying data distribution p(x). A model that can predict missing pixels must implicitly model what pixels are likely given context—meaning it has learned the structure of images in that domain.
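As a rough sketch of how such a "predict the missing region" objective can be set up (the mask size, mask placement, and the reconstruction model below are illustrative assumptions, not a specific published recipe):

```python
import torch
import torch.nn.functional as F

def inpainting_loss(model, images: torch.Tensor, mask_size: int = 32) -> torch.Tensor:
    """Hide a square region of each image, ask the model to reconstruct the
    full image, and score it only on the pixels that were hidden."""
    b, c, h, w = images.shape
    masked = images.clone()
    mask = torch.zeros(b, 1, h, w, device=images.device)
    for idx in range(b):
        top = torch.randint(0, h - mask_size + 1, (1,)).item()
        left = torch.randint(0, w - mask_size + 1, (1,)).item()
        masked[idx, :, top:top + mask_size, left:left + mask_size] = 0.0
        mask[idx, :, top:top + mask_size, left:left + mask_size] = 1.0
    reconstruction = model(masked)  # assumed to output the same shape as images
    # the supervision signal is the original pixels themselves -- no human labels
    return F.mse_loss(reconstruction * mask, images * mask)
```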
The pretext-downstream distinction:
The term 'pretext' is carefully chosen. It implies that the task is not the true objective—it's a pretext, an excuse, a means to an end. The workflow is:

1. Pretrain a model on the pretext task using large amounts of unlabeled data.
2. Discard the pretext-specific head and keep the learned encoder.
3. Transfer the encoder to downstream tasks, either by fine-tuning it or by training a lightweight head (such as a linear probe) on the limited labeled data available.
The quality of a pretext task is measured not by performance on the pretext task itself, but by how well the learned representations transfer to downstream tasks.
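To make the workflow concrete, here is a minimal sketch of the transfer step, assuming an encoder that has already been pretrained on some pretext task; the labeled loader, feature dimension, and class count are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder: nn.Module, feature_dim: int, num_classes: int,
                 labeled_loader, epochs: int = 10, device: str = "cpu"):
    """Freeze a pretext-pretrained encoder and fit a linear classifier on top."""
    encoder.to(device).eval()
    for p in encoder.parameters():              # freeze: no updates to the encoder
        p.requires_grad_(False)

    probe = nn.Linear(feature_dim, num_classes).to(device)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

    for _ in range(epochs):
        for images, labels in labeled_loader:   # small labeled downstream set
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                features = encoder(images)      # representations from pretext training
            loss = F.cross_entropy(probe(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return probe
```

Linear probing like this is also a standard way to evaluate how well pretext-learned representations transfer.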
| Criterion | Description | Why It Matters |
|---|---|---|
| Non-trivial | Task cannot be solved with simple shortcuts | Prevents learning degenerate representations |
| Semantic | Solving requires understanding high-level concepts | Ensures transferable representations |
| Scalable | Labels generated automatically at scale | Leverages unlimited unlabeled data |
| Domain-aligned | Task relates to downstream applications | Maximizes transfer effectiveness |
| Efficient | Reasonable computational requirements | Enables practical training |
Computer vision has been the primary domain for pretext task innovation. The rich spatial structure of images provides numerous opportunities for creating self-supervisory signals. Let's examine the major families of image-based pretext tasks.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RotationPrediction(nn.Module):
    """
    Rotation prediction pretext task.
    Learns representations by predicting image rotation angle.
    """
    def __init__(self, backbone: nn.Module, feature_dim: int = 512):
        super().__init__()
        self.backbone = backbone
        # Classifier for 4 rotation angles: 0°, 90°, 180°, 270°
        self.rotation_classifier = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 4)
        )

    def rotate_batch(self, images: torch.Tensor) -> tuple:
        """
        Create rotated versions of images with labels.
        Returns concatenated rotations and corresponding labels.
        """
        batch_size = images.size(0)
        rotations = []
        labels = []
        for k in range(4):  # 0, 90, 180, 270 degrees
            # torch.rot90 rotates counterclockwise
            rotated = torch.rot90(images, k, dims=[2, 3])
            rotations.append(rotated)
            labels.append(torch.full((batch_size,), k, dtype=torch.long))
        # Concatenate all rotations: [4*B, C, H, W]
        all_images = torch.cat(rotations, dim=0)
        all_labels = torch.cat(labels, dim=0).to(images.device)
        return all_images, all_labels

    def forward(self, images: torch.Tensor) -> dict:
        """
        Forward pass for rotation prediction.
        """
        # Create rotated batch with labels
        rotated_images, rotation_labels = self.rotate_batch(images)
        # Extract features
        features = self.backbone(rotated_images)
        # Predict rotation
        logits = self.rotation_classifier(features)
        # Compute loss
        loss = F.cross_entropy(logits, rotation_labels)
        # Compute accuracy
        preds = logits.argmax(dim=1)
        accuracy = (preds == rotation_labels).float().mean()
        return {
            'loss': loss,
            'accuracy': accuracy,
            'features': features,
            'logits': logits
        }
```

Natural language processing has seen extraordinary success with self-supervised pretext tasks. The sequential nature of text provides natural opportunities for prediction-based learning.
Masked language modeling's power comes from forcing the model to build rich bidirectional representations. To predict [MASK] in 'The [MASK] sat on the mat', the model must understand syntax (noun expected), semantics (animate entity that sits), and context (relationship with 'mat'). This creates dense, contextual representations that transfer remarkably well.
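A minimal sketch of the masking step behind BERT-style masked language modeling is shown below. The 15% masking rate and the 80/10/10 mask/random/keep split follow the commonly reported BERT recipe; the token IDs, mask token, and vocabulary size are placeholders:

```python
import torch

def mask_tokens(token_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mask_prob: float = 0.15):
    """Select ~15% of positions as prediction targets, then replace 80% of them
    with [MASK], 10% with a random token, and leave 10% unchanged."""
    labels = token_ids.clone()
    target = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    labels[~target] = -100                                 # non-target positions are ignored

    corrupted = token_ids.clone()
    split = torch.rand_like(token_ids, dtype=torch.float)
    to_mask = target & (split < 0.8)                       # 80% -> [MASK]
    to_random = target & (split >= 0.8) & (split < 0.9)    # 10% -> random token
    corrupted[to_mask] = mask_token_id
    corrupted[to_random] = torch.randint(vocab_size, (int(to_random.sum()),),
                                         device=token_ids.device)
    # the remaining 10% of targets keep their original token
    return corrupted, labels
```

The corrupted sequence is fed to the model, and cross-entropy is computed only at the selected positions (label -100 is skipped by PyTorch's default ignore_index).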
When multiple modalities are available (vision, language, audio), their natural correspondence provides powerful supervisory signals. Multi-modal pretext tasks exploit the alignment between modalities.
| Task Type | Modalities | Supervision Signal | Example Methods |
|---|---|---|---|
| Cross-Modal Matching | Image + Text | Do image and caption match? | CLIP, ALIGN, Florence |
| Audio-Visual Correspondence | Video + Audio | Does audio match video? | SoundNet, AVE, XDC |
| Temporal Alignment | Video + Text | Align narration with video frames | MIL-NCE, VideoBERT |
| Cross-Modal Generation | Image → Text | Generate captions from images | Image Captioning pretraining |
| Masked Multi-Modal Modeling | Image + Text | Mask in one modality, predict from other | FLAVA, BEiT-3 |
The internet provides billions of image-text pairs (images with alt-text, captions, surrounding content). This natural correspondence enables training on unprecedented scales—CLIP used 400 million pairs, LAION-5B contains 5 billion. No manual labeling required.
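The cross-modal matching objective in the table above can be sketched as a symmetric contrastive loss over a batch of paired embeddings. This follows the general CLIP-style formulation, but the encoders producing image_emb and text_emb are assumed, not shown:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched image-text pairs.
    Matched pairs (the diagonal) are positives; all other pairings are negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # [B, B] similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)       # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)   # match each caption to its image
    return (loss_i2t + loss_t2i) / 2
```

Each image learns to pick out its own caption from all captions in the batch, and vice versa.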
Why do pretext tasks lead to useful representations? Several theoretical frameworks provide insight into this phenomenon.
Information-Theoretic View:
A good pretext task requires the model to preserve mutual information between the input and learned representation. Formally, if z = f(x) is the representation and y is the pretext label derived from x:
$$I(x; z) \geq I(y; z)$$
To successfully predict y (the pretext target), z must contain information about x that's relevant to y. If we design y to capture semantic properties of x, then z will necessarily encode semantics.
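One way to make this precise, under the assumption (left implicit above) that the pretext label is itself a deterministic function of the input, $y = g(x)$: since $z = f(x)$ depends on $x$ alone, $y \rightarrow x \rightarrow z$ forms a Markov chain, and the data processing inequality yields

$$I(y; z) \leq I(x; z)$$

Minimizing the pretext prediction loss tightens a variational lower bound on $I(y; z)$, so good pretext performance forces $z$ to retain at least the information about $x$ needed to determine $y$.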
The Minimum Description Length Perspective:
From an MDL viewpoint, a model that can efficiently predict pretext targets has learned a compressed representation of the data distribution. This compression forces the model to discover regularities and structure—exactly the features useful for downstream tasks.
The Invariance Hypothesis:
Many pretext tasks implicitly encourage learning representations that are invariant to irrelevant factors while remaining sensitive to semantic content. Rotation prediction, for example, cannot be solved from low-level texture statistics alone: predicting the rotation angle requires recognizing object identity and its canonical orientation. This aligns with what downstream tasks typically need.
A critical failure mode: models may solve pretext tasks using shortcuts that don't require semantic understanding. For jigsaw puzzles, models might use chromatic aberration patterns at patch boundaries. For colorization, they might learn simple color statistics. Careful task design is essential to prevent degenerate solutions.
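One common family of mitigations is to disrupt the low-level cues that shortcuts rely on. The sketch below is an illustrative example for jigsaw-style tasks (the grid size, patch size, and jitter strength are assumptions, not any specific paper's recipe): each patch is cropped at a random offset inside its grid cell, leaving unpredictable gaps, and colors are jittered independently per patch.

```python
import random
import torch

def extract_jigsaw_patches(image: torch.Tensor, grid: int = 3,
                           patch: int = 64) -> torch.Tensor:
    """Cut a [C, H, W] image into grid x grid patches, sampling each patch at a
    random position inside its cell and jittering colors per patch, so the model
    cannot exploit boundary continuity or chromatic-aberration shortcuts."""
    c, h, w = image.shape
    cell_h, cell_w = h // grid, w // grid
    assert cell_h >= patch and cell_w >= patch, "grid cells must fit the patch size"
    patches = []
    for i in range(grid):
        for j in range(grid):
            dy = random.randint(0, cell_h - patch)   # random offset => random gap
            dx = random.randint(0, cell_w - patch)
            p = image[:, i * cell_h + dy: i * cell_h + dy + patch,
                         j * cell_w + dx: j * cell_w + dx + patch]
            # independent per-patch color jitter breaks cross-patch color cues
            p = p * (1.0 + 0.1 * torch.randn(c, 1, 1, device=image.device))
            patches.append(p)
    return torch.stack(patches)   # [grid*grid, C, patch, patch]
```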
You now understand the foundations of pretext tasks—the clever mechanisms that transform unlabeled data into supervised learning problems. Next, we'll explore how these pretext tasks lead to powerful learned representations.