In the modern machine learning era, training from scratch is often unnecessary and wasteful. The realization that representations learned on one task can be repurposed for entirely different tasks has revolutionized how we approach learning problems—particularly when labeled data is scarce.
Consider this: a neural network trained on ImageNet to classify objects has learned far more than just "what is a dog" or "what is a car." In its hidden layers, it has developed sophisticated feature detectors—representations that capture edges, textures, shapes, parts, and semantic concepts. These learned features encode visual knowledge that generalizes far beyond the original 1,000 ImageNet categories.
This page provides a rigorous, comprehensive exploration of pre-trained representations: what they are, why they work, how they're structured, and when they transfer effectively. Understanding pre-trained representations is the foundation for all feature-based transfer learning techniques.
By the end of this page, you will understand the theoretical foundations of representation learning, the hierarchical structure of learned features, the mathematical properties that make representations transferable, and the landscape of pre-trained models across different domains.
At its core, machine learning is fundamentally about learning representations. The raw input data—pixels, words, audio samples—exists in a high-dimensional space that is poorly suited for decision-making. The job of a machine learning model is to transform this raw input into a representation where the relevant patterns become apparent.
Formally, consider a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that maps inputs to outputs. In practice, we decompose this into:
$$f(x) = g(\phi(x))$$
where $\phi: \mathcal{X} \rightarrow \mathcal{Z}$ is the representation function (often called the encoder or feature extractor), and $g: \mathcal{Z} \rightarrow \mathcal{Y}$ is the task head (often called the classifier or predictor).
The space $\mathcal{Z}$ is the representation space or latent space. The entire machinery of deep learning can be viewed as learning $\phi$ that produces useful representations $z = \phi(x)$.
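To make this decomposition concrete, here is a minimal PyTorch sketch (the module names below are illustrative, not from any library): the encoder $\phi$ and the task head $g$ are separate modules, so a pre-trained $\phi$ can later be dropped in while only $g$ is trained.

```python
import torch
import torch.nn as nn

class SmallEncoder(nn.Module):
    """phi: X -> Z, maps raw inputs to a representation space."""
    def __init__(self, in_dim: int = 784, z_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, z_dim), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class TaskHead(nn.Module):
    """g: Z -> Y, a task-specific predictor on top of the representation."""
    def __init__(self, z_dim: int = 128, n_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(z_dim, n_classes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.fc(z)

phi = SmallEncoder()
g = TaskHead()
x = torch.randn(32, 784)   # a batch of raw inputs
z = phi(x)                 # representation z = phi(x)
y_hat = g(z)               # prediction f(x) = g(phi(x))
```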
The revolutionary insight of transfer learning is that a representation function $\phi$ learned for one task often produces representations $z$ that are useful for many other tasks. If $\phi$ captures fundamental structure in the data, we can reuse it rather than learning from scratch.
Why representations transfer:
The transferability of representations rests on a key assumption: different tasks in the same domain share underlying structure. Consider vision tasks such as image classification, object detection, semantic segmentation, and image retrieval.
All of these tasks benefit from low-level features (edges, gradients, textures) and mid-level features (shapes, parts, spatial relationships). A representation that captures these building blocks is useful across all of them.
The manifold hypothesis:
Deep learning research has revealed that high-dimensional data often lies on or near low-dimensional manifolds. Images of natural scenes, for instance, occupy a tiny fraction of all possible pixel arrangements. A good representation function learns to map data to this underlying manifold structure.
Mathematically, if data lies on a $d$-dimensional manifold $\mathcal{M}$ embedded in $\mathbb{R}^D$ where $d \ll D$, then a representation with $\text{dim}(\mathcal{Z}) \approx d$ can capture the essential structure while discarding irrelevant variation.
Deep neural networks learn representations in a hierarchical, compositional manner. This hierarchical structure is not an accident—it emerges naturally from the architecture and optimization process, and it mirrors the compositional structure of natural data.
The layer-by-layer abstraction:
In a convolutional neural network for vision, the hierarchy is well-characterized:
| Layer Depth | Feature Type | Receptive Field | Example Detectors |
|---|---|---|---|
| Layer 1 | Edges, gradients | 3×3 to 7×7 | Gabor-like filters at various orientations |
| Layer 2-3 | Textures, corners | 15×15 to 50×50 | Fur patterns, mesh patterns, fabric textures |
| Layer 4-5 | Object parts | 50×50 to 150×150 | Eyes, wheels, windows, handles |
| Layer 6+ | Objects, scenes | Full image | Complete faces, vehicles, indoor scenes |
This progression from general to specific is the key to transferability. Early layers learn features relevant to virtually any visual task, while later layers become increasingly task-specific.
The landmark 2014 paper by Zeiler & Fergus visualized CNN features using deconvolution, revealing exactly this hierarchy. They showed that conv1 learns edge detectors, conv2 learns texture patterns, conv3 learns more complex textures and parts, and deeper layers learn object-level concepts.
Mathematical formulation:
Consider a deep network with $L$ layers. Each layer $l$ computes:
$$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)})$$
where $h^{(0)} = x$ is the input. The representation at layer $l$ is $h^{(l)} \in \mathbb{R}^{d_l}$.
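As a minimal illustration (the toy MLP below is purely for demonstration), each layer's output $h^{(l)}$ can be recorded and treated as a candidate representation for transfer:

```python
import torch
import torch.nn as nn

# A toy stack of layers; each stage plays the role of one h^(l).
layers = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 128), nn.ReLU()),
    nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
])

x = torch.randn(16, 784)   # h^(0) = x
hidden_states = [x]
h = x
for layer in layers:
    h = layer(h)           # h^(l) = sigma(W^(l) h^(l-1) + b^(l))
    hidden_states.append(h)

for l, h_l in enumerate(hidden_states):
    print(f"h^({l}): shape {tuple(h_l.shape)}")
```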
For transfer learning, we're interested in which layers produce representations that transfer well. Empirically, the answer depends on the domain similarity between source and target:
Transferability curves:
Research by Yosinski et al. (2014) measured transferability by pre-training on ImageNet, transferring layers up to different depths, and then training on target tasks of varying similarity to the source. They found that early-layer features are general and transfer broadly, while deeper features become increasingly specific to the source task, so transfer performance degrades with depth as source and target diverge.
This is often called the transferability curve: a function that maps layer depth to transfer performance.
| Source/Target Similarity | Early Layers (1-2) | Middle Layers (3-4) | Deep Layers (5+) |
|---|---|---|---|
| High (ImageNet → Natural Images) | Excellent ✓ | Excellent ✓ | Good ✓ |
| Medium (ImageNet → Medical Images) | Excellent ✓ | Moderate ~ | Poor ✗ |
| Low (ImageNet → Satellite Images) | Good ✓ | Poor ✗ | Poor ✗ |
| Very Low (ImageNet → Audio Spectrograms) | Variable ~ | Poor ✗ | Very Poor ✗ |
The co-adaptation problem:
An important consideration is that neural network layers are co-adapted—they learn to work together as a unit. When you transfer only some layers, you break this co-adaptation: the retained layers expect inputs shaped by the layers that were discarded or re-initialized, and Yosinski et al. observed performance drops from splitting a network mid-way even when transferring back to a closely related task.
The solution is careful initialization and learning rate strategies (covered in Module 3), but the fundamental issue highlights that representations don't exist in isolation—they're part of a system.
The last decade has witnessed the rise of foundation models—large, general-purpose models pre-trained on massive datasets that serve as the starting point for countless downstream applications. These models embody the culmination of representation learning at scale.
The foundation model paradigm: pre-train one large model on broad data with a general objective, then adapt that single model (via fine-tuning, feature extraction, or prompting) to many downstream tasks.
This paradigm has transformed multiple fields:
| Domain | Model | Pre-training Data | Pre-training Objective | Parameters |
|---|---|---|---|---|
| Vision | ResNet-152 | ImageNet (1.2M images) | Classification | 60M |
| Vision | ViT-Large | ImageNet-21K (14M images) | Classification | 307M |
| Vision | CLIP ViT-L/14 | 400M image-text pairs | Contrastive | 428M |
| NLP | BERT-Large | BooksCorpus + Wikipedia | MLM + NSP | 340M |
| NLP | GPT-3 | Common Crawl + Books + Wikipedia | Autoregressive LM | 175B |
| Multimodal | Flamingo | Interleaved image-text | Mixed | 80B |
| Audio | Wav2Vec 2.0 | LibriSpeech + LibriLight | Contrastive + MLM | 317M |
Pre-training objectives and their impact on representations:
The choice of pre-training objective fundamentally shapes what representations are learned:
Supervised pre-training (e.g., ImageNet classification): representations are shaped by the label set, producing features that are highly discriminative for the annotated categories but that may discard information irrelevant to those labels.
Self-supervised pre-training (e.g., contrastive learning, masked prediction): no labels are required, so representations tend to preserve more general-purpose structure and often transfer better to tasks and domains far from the pre-training labels.
Language modeling pre-training: predicting the next or masked token forces the model to encode syntax, semantics, and a substantial amount of world knowledge, which is why a single language model transfers to such a wide range of NLP tasks.
Richard Sutton's 'Bitter Lesson' observes that general methods leveraging computation outperform special-purpose approaches. Foundation models embody this: massive scale and general objectives produce representations that transfer better than hand-crafted features ever could.
What makes a good pre-training dataset?
The pre-training dataset critically determines representation quality: scale (more examples generally yield richer features), diversity (broad coverage of the variation the model will see downstream), and data quality (label noise and near-duplicates degrade what is learned) all matter.
The ImageNet story:
ImageNet's impact on computer vision cannot be overstated. Before ImageNet pre-training became standard (around 2012-2014), most vision systems trained from scratch on small task-specific datasets. Post-ImageNet, pre-training became the default: classification, detection, and segmentation systems alike began from ImageNet-initialized backbones and typically gained both accuracy and data efficiency.
This represented a paradigm shift: starting from random weights became the exception, not the rule.
Understanding why representations transfer requires examining their mathematical properties. Several theoretical frameworks attempt to explain transferability.
Intrinsic dimensionality:
Representations often have intrinsic dimensionality much lower than their embedding dimension. If a 2048-dimensional representation effectively lies on a manifold of dimension ~50, then most of the embedding coordinates are redundant, and relatively few directions carry the task-relevant variation.
Measuring intrinsic dimensionality:
Several methods estimate intrinsic dimensionality:
$$\text{ID}_{\text{MLE}} = \left[ \frac{1}{n} \sum_{i=1}^n \log \frac{r_k(x_i)}{r_1(x_i)} \right]^{-1}$$
where $r_k(x)$ is the distance to the $k$-th nearest neighbor. This maximum likelihood estimator gives the local dimensionality around each point.
Studies have shown that ImageNet-trained ResNet representations have intrinsic dimensionality around 50-100, despite living in 2048 dimensions. This low intrinsic dimensionality partially explains why even simple linear classifiers work well on top of pre-trained features.
Representation geometry:
The geometry of learned representations determines their utility for downstream tasks:
Linear separability: For classification, we want class conditional distributions $P(z|y)$ to be linearly separable. Pre-trained representations often achieve this:
$$\text{margin}(\phi) = \min_{y \neq y'} \frac{\|\mu_y - \mu_{y'}\|}{\sigma_y + \sigma_{y'}}$$
where $\mu_y$, $\sigma_y$ are the mean and standard deviation of class $y$ representations.
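As a rough illustration of this margin heuristic, the sketch below is my own simplification: it treats the mean per-dimension standard deviation as a scalar spread for each class, since the formula above does not pin down how $\sigma_y$ is computed for vector-valued representations.

```python
import numpy as np

def class_separation_margin(Z: np.ndarray, y: np.ndarray) -> float:
    """Smallest ratio of between-class mean distance to summed
    within-class spread (illustrative, not a standard library metric)."""
    classes = np.unique(y)
    means = {c: Z[y == c].mean(axis=0) for c in classes}
    # Scalar spread per class: mean of the per-dimension standard deviations
    spreads = {c: Z[y == c].std(axis=0).mean() for c in classes}

    margin = np.inf
    for i, c1 in enumerate(classes):
        for c2 in classes[i + 1:]:
            dist = np.linalg.norm(means[c1] - means[c2])
            margin = min(margin, dist / (spreads[c1] + spreads[c2] + 1e-12))
    return margin
```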
Clustering quality metrics:
To measure how well representations cluster by class without training a classifier, common choices include the silhouette score, the Davies-Bouldin index, and the adjusted Rand index between k-means assignments and the true labels (all three are computed in the code below).
Representation similarity analysis:
We can compare representations across layers, models, and even modalities using:
$$\text{CKA}(X, Y) = \frac{\text{HSIC}(X, Y)}{\sqrt{\text{HSIC}(X, X) \cdot \text{HSIC}(Y, Y)}}$$
where HSIC is the Hilbert-Schmidt Independence Criterion. This Centered Kernel Alignment metric reveals how similar two representations are in terms of the similarity structure they induce.
```python
import numpy as np


def centered_kernel_alignment(X: np.ndarray, Y: np.ndarray) -> float:
    """
    Compute Centered Kernel Alignment between two representations.

    Args:
        X: First representation matrix, shape (n_samples, d1)
        Y: Second representation matrix, shape (n_samples, d2)

    Returns:
        CKA score in [0, 1], where 1 indicates identical similarity structure
    """
    # Center the data
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Compute Gram matrices (linear kernels)
    K = X @ X.T
    L = Y @ Y.T

    # Center the Gram matrices
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K_c = H @ K @ H
    L_c = H @ L @ H

    # Compute CKA from the HSIC estimates
    hsic_xy = np.sum(K_c * L_c) / (n - 1) ** 2
    hsic_xx = np.sum(K_c * K_c) / (n - 1) ** 2
    hsic_yy = np.sum(L_c * L_c) / (n - 1) ** 2

    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)


def intrinsic_dimensionality_mle(X: np.ndarray, k: int = 5) -> float:
    """
    Estimate intrinsic dimensionality using the Maximum Likelihood Estimator.

    Args:
        X: Data matrix, shape (n_samples, d)
        k: Number of neighbors for estimation

    Returns:
        Estimated intrinsic dimensionality
    """
    from sklearn.neighbors import NearestNeighbors

    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nbrs.kneighbors(X)

    # Exclude distance to self (first column)
    distances = distances[:, 1:]

    # MLE estimator: ratios of the k-th neighbor distance to closer neighbors
    log_ratios = np.log(distances[:, -1:] / distances[:, :-1])
    id_estimate = (k - 1) / np.sum(log_ratios) * len(X)

    return id_estimate


def representation_cluster_quality(representations: np.ndarray,
                                   labels: np.ndarray) -> dict:
    """
    Compute various clustering quality metrics for representations.

    Returns dict with silhouette_score, davies_bouldin, adjusted_rand.
    """
    from sklearn.metrics import (
        silhouette_score, davies_bouldin_score, adjusted_rand_score
    )
    from sklearn.cluster import KMeans

    n_classes = len(np.unique(labels))

    # K-means clustering in representation space
    kmeans = KMeans(n_clusters=n_classes, random_state=42)
    pred_labels = kmeans.fit_predict(representations)

    return {
        'silhouette': silhouette_score(representations, labels),
        'davies_bouldin': davies_bouldin_score(representations, labels),
        'adjusted_rand': adjusted_rand_score(labels, pred_labels)
    }
```

A key question in transfer learning is: how do we predict whether a pre-trained representation will transfer well to a new task? This has practical implications—evaluating transfer performance for every possible source/target combination is computationally prohibitive.
Probing tasks:
One approach is to use probing tasks (also called diagnostic classifiers): train a simple, typically linear, classifier on top of frozen representations to predict a property of interest, such as class labels, object attributes, or syntactic structure.
High probe accuracy indicates that the representation encodes information relevant to the probe task.
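A minimal linear-probing sketch is shown below; the feature matrix and labels are random placeholders standing in for representations extracted from your own pre-trained encoder and the probe targets for your task.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in practice, `features` would be frozen representations
# phi(x) from a pre-trained encoder, and `labels` the probe targets.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))
labels = rng.integers(0, 10, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# The probe itself: a simple linear classifier trained on frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"Probe accuracy: {probe.score(X_test, y_test):.3f}")
```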
The probe-transfer correlation:
Research has established that probe task performance correlates with transfer learning performance, but imperfectly. A representation may excel at one probe task while failing at transfer to a related task, for instance because the relevant information is present but not linearly accessible, because the probe task only loosely resembles the downstream objective, or because fine-tuning reshapes the representation in ways that probing cannot anticipate.
LEEP and transferability estimators:
More sophisticated transferability estimators attempt to predict transfer performance directly:
Log Expected Empirical Prediction (LEEP):
$$\text{LEEP}(\phi, \mathcal{D}_T) = \frac{1}{n} \sum_{i=1}^n \log \sum_{z \in \mathcal{Z}} P(y_i \mid z)\, \hat{P}(z \mid \phi(x_i))$$
where $\hat{P}(z | \phi(x))$ is estimated from the source model's predictions and $P(y|z)$ is estimated on the target data.
LEEP gives a score that correlates highly with actual transfer performance, enabling efficient model selection without full fine-tuning.
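Below is a sketch of the LEEP computation following the formula above (my own implementation, not an official reference one): it needs only the source model's softmax outputs on the target inputs and the target labels.

```python
import numpy as np

def leep_score(source_probs: np.ndarray, target_labels: np.ndarray) -> float:
    """
    Estimate LEEP from source-model predictions on the target dataset.

    Args:
        source_probs: shape (n, |Z|), softmax outputs of the source model
                      over its own label set Z, evaluated on target inputs.
        target_labels: shape (n,), integer target labels y_i.

    Returns:
        LEEP score (higher suggests better expected transfer).
    """
    n, n_source = source_probs.shape
    n_target = int(target_labels.max()) + 1

    # Empirical joint distribution P_hat(y, z) over target labels y and
    # source "dummy" labels z, averaged over the target dataset.
    joint = np.zeros((n_target, n_source))
    for i in range(n):
        joint[target_labels[i]] += source_probs[i]
    joint /= n

    # Conditional P_hat(y | z) = P_hat(y, z) / P_hat(z)
    marginal_z = joint.sum(axis=0, keepdims=True)        # shape (1, |Z|)
    cond_y_given_z = joint / np.clip(marginal_z, 1e-12, None)

    # LEEP = (1/n) sum_i log( sum_z P_hat(y_i | z) * P_hat(z | x_i) )
    expected = source_probs @ cond_y_given_z.T           # shape (n, n_target)
    probs_of_true = expected[np.arange(n), target_labels]
    return float(np.mean(np.log(np.clip(probs_of_true, 1e-12, None))))
```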
While transferability estimators are much faster than full fine-tuning, they still require forward passes through the pre-trained model and aren't suitable for all scenarios. For very large models or real-time model selection, even LEEP-style estimates may be too expensive.
Scaling laws for transfer:
Recent research has established scaling laws relating pre-training compute, model size, and transfer performance:
$$\text{Transfer Loss} \propto \frac{C_0}{N^{\alpha}} + \frac{C_1}{D^{\beta}} + \epsilon$$
where $N$ is model size, $D$ is pre-training data size, and $\alpha, \beta$ are power-law exponents (typically $\alpha \approx 0.4$, $\beta \approx 0.4$).
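As a quick numerical illustration (the exponents are the only meaningful quantities here): with $\alpha \approx 0.4$, doubling model size shrinks the model-limited term by a factor of $2^{0.4} \approx 1.32$, roughly a 24% reduction, and the same holds for the data term.

```python
alpha, beta = 0.4, 0.4

# Fractional reduction of each power-law term when doubling N or D
model_term_reduction = 1 - 2 ** (-alpha)   # ~0.24
data_term_reduction = 1 - 2 ** (-beta)     # ~0.24
print(f"Doubling N shrinks the N-term by ~{model_term_reduction:.0%}")
print(f"Doubling D shrinks the D-term by ~{data_term_reduction:.0%}")
```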
Key implications: increasing either model size or pre-training data improves downstream transfer, but with diminishing returns set by the power-law exponents, and neither axis can push the transfer loss below the irreducible term $\epsilon$.
With hundreds of pre-trained models available, choosing the right one for your task requires systematic thinking. This section provides practical guidance for model selection.
Decision framework:
Start with these questions: How similar is your data to the model's pre-training data? How much labeled target data do you have? What are your compute and deployment constraints? The table below maps common scenarios to sensible defaults.
| Scenario | Recommended Model | Reasoning |
|---|---|---|
| Natural images, ample data | ResNet-50/101 or ViT-Base | Well-understood, strong baselines, efficient |
| Natural images, limited data | CLIP or larger ImageNet models | Better representations compensate for limited fine-tuning data |
| Medical imaging | ImageNet pre-trained + domain adaptation | Generic features + careful adaptation for domain shift |
| Satellite/aerial imagery | Specialized models (e.g., SatlasPretrain) | Domain-specific pre-training critical for non-natural images |
| Need zero-shot capability | CLIP, OpenCLIP | Vision-language models enable text-based classification |
| Edge deployment | MobileNet, EfficientNet-Lite | Designed for compute-constrained environments |
Model hubs and repositories:
Pre-trained models are distributed through various platforms, including the Hugging Face Hub, torchvision and PyTorch Hub, the timm library, and TensorFlow Hub.
Practical considerations:
- License compatibility: Check that the model license permits your use case (commercial, derivative works, distribution)
- Training details: Understand what data the model was trained on—biases in pre-training propagate to downstream tasks
- Documentation quality: Prefer models with clear documentation of architecture, training procedure, and known limitations
- Community adoption: Widely-used models have more debugging resources and known issues documented
- Reproducibility: Check if training code is available—this indicates higher quality and enables verification
```python
import torch
import torchvision.models as models
from transformers import AutoModel, AutoImageProcessor

# =============================================================
# Method 1: torchvision - Standard vision models
# =============================================================

# Load ResNet-50 with ImageNet-1K pre-trained weights
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Access specific layers for feature extraction:
# remove the final classification layer to get features
feature_extractor = torch.nn.Sequential(*list(resnet50.children())[:-1])

# Load ViT (Vision Transformer) with ImageNet-1K weights
vit_b_16 = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)

# =============================================================
# Method 2: Hugging Face - Broader model variety
# =============================================================

# Load CLIP for vision-language capabilities
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load DINOv2 - Strong self-supervised vision model
dinov2 = AutoModel.from_pretrained("facebook/dinov2-base")
dinov2_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

# =============================================================
# Method 3: Custom feature extraction function
# =============================================================

def extract_features(model: torch.nn.Module,
                     images: torch.Tensor,
                     layer_name: str = None) -> torch.Tensor:
    """
    Extract features from a specific layer of a pre-trained model.

    Args:
        model: Pre-trained model
        images: Batch of preprocessed images, shape (B, C, H, W)
        layer_name: Name of layer to extract from (if None, returns the
                    model's final output)

    Returns:
        Feature tensor; shape depends on the chosen layer
    """
    features = {}

    def hook_fn(name):
        def hook(module, input, output):
            features[name] = output.detach()
        return hook

    # Register hook on the specified layer
    handle = None
    if layer_name:
        for name, module in model.named_modules():
            if name == layer_name:
                handle = module.register_forward_hook(hook_fn(name))
                break

    # Forward pass without gradient tracking
    model.eval()
    with torch.no_grad():
        outputs = model(images)

    if handle is not None:
        handle.remove()

    if layer_name and layer_name in features:
        return features[layer_name]

    # Default: return the model's final output
    return outputs
```

Working with pre-trained representations is not without hazards. Understanding common failure modes helps you avoid them.
One of the most common bugs in transfer learning is preprocessing mismatch. ImageNet models expect pixel values normalized by mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. Using mean=0.5, std=0.5, or unnormalized inputs will silently produce garbage features. Always verify normalization matches the pre-training setup.
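For example, a standard preprocessing pipeline for ImageNet-pre-trained torchvision models looks like the sketch below (exact resize and crop sizes vary by model, so check the documented transforms for the weights you load):

```python
from torchvision import transforms

# Standard preprocessing for ImageNet-pre-trained torchvision models.
# The normalization constants must match those used during pre-training.
imagenet_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Newer torchvision weight enums also ship their own transforms, e.g.:
# preprocess = models.ResNet50_Weights.IMAGENET1K_V2.transforms()
```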
We've covered the foundations of pre-trained representations—the backbone of modern transfer learning. The key takeaways: models decompose into a reusable encoder $\phi$ and a task head $g$; learned features are hierarchical, running from general (edges, textures) to task-specific (objects, scenes); how well layers transfer depends on source/target similarity; foundation models pre-trained at scale are now the default starting point; and properties such as intrinsic dimensionality, linear separability, and CKA similarity, together with estimators like LEEP, help predict transfer success before committing to fine-tuning.
What's next:
Now that we understand what pre-trained representations are and why they work, the next page explores frozen features—the simplest approach to using pre-trained representations, where the encoder remains fixed while only a new task head is trained. This approach establishes a baseline and reveals the limits of pure feature extraction without adaptation.
You now understand the theoretical and practical foundations of pre-trained representations. This knowledge forms the basis for all feature-based transfer techniques: frozen features, feature extraction, and feature adaptation. Next, we'll see how to use these representations directly without modification.