The simplest form of transfer learning requires no additional training of the pre-trained model at all. Instead, we treat the pre-trained network as a fixed feature extractor—a black box that transforms raw inputs into meaningful representations. Only a lightweight classifier or regressor is trained on top of these frozen features.
This approach, known as frozen feature transfer or simply training a linear probe, establishes the baseline for all transfer learning methods. Before investing compute in fine-tuning, we should ask: how far can we get without modifying the pre-trained weights at all?
The answer, surprisingly, is often "quite far." For many tasks, especially those with limited labeled data or high similarity to the pre-training domain, frozen features provide competitive or even superior performance to more complex approaches.
This page provides a rigorous, comprehensive exploration of frozen feature transfer: when it works, why it fails, how to implement it correctly, and what it reveals about representation quality.
By the end of this page, you will understand the theoretical basis for frozen feature transfer, know when to use this approach versus fine-tuning, master the implementation details including layer selection and classifier design, and interpret what frozen feature performance tells you about domain similarity.
In frozen feature transfer, we decompose a model into two components:
$$f(x) = g(\phi(x))$$
where:
- $\phi: \mathcal{X} \to \mathbb{R}^d$ is the pre-trained encoder, kept frozen, and
- $g$ is a lightweight task head (e.g., a linear classifier or small MLP) trained on the target data.
The training procedure:
1. Pass each target input through the frozen encoder to obtain features $z = \phi(x)$.
2. Train only the head $g$ on the pairs $(z, y)$ with standard supervised learning.
3. At inference, predict with $g(\phi(x))$; the encoder weights never change.
This is computationally efficient because:
- No gradients are computed or stored for the encoder.
- The number of trainable parameters is a small fraction of the full network.
- Features can be computed once and cached, so later experiments skip the encoder entirely.
For a ResNet-50, the encoder has ~25M parameters while a linear probe has only ~2M (2048 features × 1000 classes). Training the probe takes seconds per epoch versus minutes for full fine-tuning. More importantly, you can pre-compute and cache features, making classifier experimentation extremely fast.
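As a quick illustration of this parameter asymmetry, here is a minimal sketch (the 10-class head is an illustrative choice, not fixed by the text):

```python
import torch.nn as nn
from torchvision import models

# Load a pre-trained ResNet-50 and freeze every encoder weight
encoder = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in encoder.parameters():
    p.requires_grad = False

# Replace the original 1000-class layer with a fresh 10-class probe;
# this new layer is the only part that receives gradients
encoder.fc = nn.Linear(encoder.fc.in_features, 10)

trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters() if not p.requires_grad)
print(f"Trainable: {trainable:,}  Frozen: {frozen:,}")
# Roughly 20K trainable parameters vs. ~23.5M frozen ones
```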
Why frozen features work:
The success of frozen features rests on the assumption that the pre-trained encoder has learned task-agnostic, general-purpose representations. If $\phi$ maps semantically similar inputs to nearby points in representation space, then a simple linear classifier can find decision boundaries.
Mathematically, if the representation induces a clustering structure where:
$$\text{intra-class distance} \ll \text{inter-class distance}$$
then even linear classifiers achieve high accuracy. This is precisely what good pre-trained representations achieve—they linearly separate many downstream classes, even those not seen during pre-training.
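A quick way to sanity-check this clustering structure on your own data is to compare average intra-class and inter-class distances in feature space. The sketch below assumes `features` and `labels` are tensors you have already extracted with a frozen encoder:

```python
import torch

def clustering_ratio(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Ratio of mean intra-class to mean inter-class distance.

    Values well below 1.0 suggest the frozen features already cluster
    by class, so a linear probe should do well.
    """
    # Pairwise Euclidean distances between all feature vectors
    dists = torch.cdist(features, features)            # (N, N)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (N, N) same-class mask
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)

    intra = dists[same & off_diag].mean()  # same class, excluding self-pairs
    inter = dists[~same].mean()            # different classes
    return (intra / inter).item()

# Example with random placeholders (replace with real cached features)
feats = torch.randn(512, 2048)
labs = torch.randint(0, 10, (512,))
print(f"intra/inter distance ratio: {clustering_ratio(feats, labs):.3f}")
```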
Theoretical guarantee (informal):
If the representation $\phi$ satisfies certain smoothness and separability conditions, then the sample complexity of learning $g$ scales only with the dimension $d$ of the representation and the complexity of $g$, not with the complexity of the full input space $\mathcal{X}$.
This is the core benefit: we've reduced a complex learning problem (raw pixels to labels) into a simpler one (learned features to labels).
Understanding when frozen features succeed or fail requires formal analysis of the transfer process.
Problem formulation:
Let $\mathcal{D}_S = \{(x_i, y_i)\}_{i=1}^{n}$ be the source domain data (used for pre-training) and $\mathcal{D}_T = \{(x_j, y_j)\}_{j=1}^{m}$ be the target domain data (used for training $g$). The source and target may have different:
- input distributions ($P_S(X) \neq P_T(X)$),
- label spaces and numbers of classes, and
- labeling functions (the same input can have different meanings in the two tasks).
The transfer error decomposition:
The error on the target task can be decomposed as:
$$\epsilon_T(g \circ \phi) = \epsilon_T(g \circ \phi^*) + \underbrace{[\epsilon_T(g \circ \phi) - \epsilon_T(g \circ \phi^*)]}_{\text{representation gap}}$$
where $\phi^*$ is the optimal representation for the target task. The representation gap measures how much we lose by using the transferred representation instead of the ideal one.
The representation gap is zero when the pre-trained φ happens to be optimal for the target task. This occurs when source and target tasks share the same underlying structure. In practice, we aim for this gap to be small, not zero.
Bounding the transfer error:
Under certain conditions, we can bound the target error:
$$\epsilon_T(g \circ \phi) \leq \epsilon_S(g \circ \phi) + d_{\mathcal{H}}(P_S^\phi, P_T^\phi) + \lambda$$
where:
- $\epsilon_S(g \circ \phi)$ is the error of the same hypothesis on the source distribution,
- $d_{\mathcal{H}}(P_S^\phi, P_T^\phi)$ is a divergence between the source and target distributions measured in representation space, and
- $\lambda$ is the error of the best joint hypothesis on both domains, an irreducible term.
Important implications:
Low source error isn't enough: The pre-trained model might have zero source error but fail on the target if the distributions differ significantly.
Representation matters: The divergence term operates in representation space, not input space. A good $\phi$ maps both domains to overlapping regions.
Some tasks are inherently harder: The $\lambda$ term represents an irreducible component—if source and target are fundamentally different, no representation helps.
Linear probe theory:
For linear probes $g(z) = w^\top z + b$, the generalization error satisfies:
$$\mathbb{E}[\epsilon(g)] \leq \frac{\|w\|^2 \cdot \text{Var}(\phi(X))}{m} + O\left(\sqrt{\frac{d}{m}}\right)$$
where $m$ is the number of target examples and $d$ is the representation dimension. This shows:
- the error shrinks as the number of target examples $m$ grows,
- higher-dimensional representations need more target data (the $\sqrt{d/m}$ term), and
- controlling $\|w\|$ through regularization (e.g., weight decay) tightens the bound, which is why linear probes generalize well even from few examples.
| Factor | Effect on Transfer | How to Measure |
|---|---|---|
| Source-target domain similarity | Higher similarity → better transfer | Distribution divergence metrics, visual inspection |
| Representation quality | Better representations → easier downstream learning | Linear probe accuracy on source tasks |
| Target data quantity | More data → lower variance, better probe training | Learning curve analysis |
| Number of target classes | More classes → harder classification | Random baseline (1/K for K classes) |
| Representation dimension | Higher dimension → need more data (curse of dimensionality) | Effective dimension, PCA analysis |
| Class imbalance | Imbalance → biased probes, misleading accuracy | Class distribution analysis, stratified metrics |
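The divergence entries in this table can be approximated cheaply. One crude proxy, sketched below under the assumption that you already have feature matrices for both domains, compares the first and second moments of the two feature distributions:

```python
import torch

def feature_stats_distance(src: torch.Tensor, tgt: torch.Tensor) -> float:
    """Crude domain-divergence proxy: compare the mean and covariance of
    source vs. target features in representation space."""
    mu_s, mu_t = src.mean(dim=0), tgt.mean(dim=0)
    # Covariance matrices (rows of src/tgt are samples, columns are features)
    cov_s = torch.cov(src.T)
    cov_t = torch.cov(tgt.T)
    mean_term = (mu_s - mu_t).pow(2).sum()
    cov_term = (cov_s - cov_t).norm(p="fro")
    return (mean_term + cov_term).item()

# Larger values hint at a bigger representation-space gap between the
# pre-training and target domains; compare against a same-domain split
# to calibrate what "large" means for your encoder.
src_feats = torch.randn(1000, 2048)        # placeholder source features
tgt_feats = torch.randn(1000, 2048) + 0.5  # placeholder shifted target features
print(f"divergence proxy: {feature_stats_distance(src_feats, tgt_feats):.2f}")
```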
A critical decision in frozen feature transfer is which layer to extract features from. Different layers encode different levels of abstraction, and the optimal choice depends on the source-target relationship.
The layer selection principle:
As discussed in Page 0, neural networks learn hierarchical features:
- Early layers capture edges, colors, and textures that are generic across domains.
- Middle layers capture motifs and object parts.
- Late layers capture high-level semantics that are increasingly specific to the pre-training task.
Optimal layer depends on domain similarity:
| Domain Similarity | Optimal Extraction Layer | Reasoning |
|---|---|---|
| Very High | Penultimate (pre-logits) | High-level features directly applicable |
| Moderate | Middle layers | Balance between generality and specificity |
| Low | Early layers | Only low-level features transfer |
| Very Low | Consider from scratch | Transfer may hurt more than help |
For most transfer scenarios, start with the penultimate layer (the layer before the final classification layer). This is the default in most frameworks. It captures high-level semantics while remaining somewhat general. Only move to earlier layers if this underperforms your expectations.
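In torchvision, this default amounts to chopping off the final classification layer. A minimal sketch (any backbone exposing a `.fc` attribute works the same way):

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()   # drop the 1000-class layer; keep the avgpool output
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(4, 3, 224, 224))
print(feats.shape)  # torch.Size([4, 2048]) -- penultimate (avgpool) features
```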
Feature dimensionality at different layers:
Consider a ResNet-50 architecture:
| Layer | Output Shape | Feature Dimension | Characteristics |
|---|---|---|---|
| conv1 | 112×112×64 | 802,816 | Gabor-like filters, very generic |
| layer1 (res2) | 56×56×256 | 802,816 | Low-level compositions |
| layer2 (res3) | 28×28×512 | 401,408 | Mid-level patterns |
| layer3 (res4) | 14×14×1024 | 200,704 | Object parts |
| layer4 (res5) | 7×7×2048 | 100,352 | High-level semantics |
| avgpool | 1×1×2048 | 2,048 | Global representation |
Practical approaches to layer selection:
Single-layer extraction: Extract from one layer (usually avgpool), train a single probe.
Multi-layer concatenation: Concatenate features from multiple layers, increasing expressivity at the cost of dimensionality.
Multi-scale extraction: Pool spatial features at different resolutions and combine. Captures both local and global information.
Learned combination: Train a lightweight network to combine features from multiple layers.
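For the learned-combination option, a lightweight version is a softmax-weighted sum over per-layer features projected to a common width. This is a sketch only; the extractor class below implements the concatenation and mean strategies:

```python
import torch
import torch.nn as nn
from typing import List

class LearnedLayerCombiner(nn.Module):
    """Project each layer's features to a shared width, then mix them
    with learned softmax weights. Trained jointly with the head."""

    def __init__(self, layer_dims: List[int], out_dim: int = 512):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(d, out_dim) for d in layer_dims]
        )
        self.layer_logits = nn.Parameter(torch.zeros(len(layer_dims)))

    def forward(self, layer_feats: List[torch.Tensor]) -> torch.Tensor:
        weights = torch.softmax(self.layer_logits, dim=0)
        projected = [proj(f) for proj, f in zip(self.projections, layer_feats)]
        return sum(w * p for w, p in zip(weights, projected))

# e.g. ResNet-50 layer3 / layer4 / avgpool features (1024, 2048, 2048 dims)
combiner = LearnedLayerCombiner([1024, 2048, 2048])
feats = [torch.randn(8, 1024), torch.randn(8, 2048), torch.randn(8, 2048)]
print(combiner(feats).shape)  # torch.Size([8, 512])
```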
```python
import torch
import torch.nn as nn
from torchvision import models
from typing import Dict, List


class MultiLayerFeatureExtractor(nn.Module):
    """
    Extract features from multiple layers of a pre-trained model.
    Supports flexible layer selection and feature combination strategies.
    """

    def __init__(
        self,
        model_name: str = "resnet50",
        layers: List[str] = ["layer3", "layer4", "avgpool"],
        combine_strategy: str = "concat"  # "concat", "mean"
    ):
        super().__init__()

        # Load pre-trained model; "DEFAULT" selects the best available
        # ImageNet weights for the named architecture
        base_model = getattr(models, model_name)(weights="DEFAULT")

        # Freeze all parameters
        for param in base_model.parameters():
            param.requires_grad = False

        # Store model components for hook access
        self.model = base_model
        self.layers = layers
        self.combine_strategy = combine_strategy

        # Storage for intermediate features
        self.features: Dict[str, torch.Tensor] = {}

        # Register forward hooks on specified layers
        for name in layers:
            layer = dict(base_model.named_modules())[name]
            layer.register_forward_hook(self._get_hook(name))

        # Compute output dimension for downstream heads
        self._compute_output_dim()

    def _get_hook(self, name: str):
        def hook(module, input, output):
            # Flatten spatial dimensions if present
            if output.dim() == 4:  # B x C x H x W
                output = output.flatten(start_dim=2).mean(dim=2)  # Global avg pool
            self.features[name] = output
        return hook

    def _compute_output_dim(self):
        """Determine output dimension by doing a dummy forward pass."""
        with torch.no_grad():
            dummy = torch.randn(1, 3, 224, 224)
            self.forward(dummy)

        if self.combine_strategy == "concat":
            self.output_dim = sum(
                self.features[l].shape[1] for l in self.layers
            )
        else:
            # Assumes all layers have same dim (may need projection)
            self.output_dim = self.features[self.layers[0]].shape[1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clear previous features
        self.features = {}

        # Forward through full model (hooks capture intermediate outputs)
        _ = self.model(x)

        # Combine features according to strategy
        feature_list = [self.features[l] for l in self.layers]

        if self.combine_strategy == "concat":
            return torch.cat(feature_list, dim=1)
        elif self.combine_strategy == "mean":
            # Stack and average (assumes same dimension)
            return torch.stack(feature_list, dim=0).mean(dim=0)
        else:
            raise ValueError(f"Unknown strategy: {self.combine_strategy}")


class FrozenFeatureClassifier(nn.Module):
    """
    Complete frozen feature classifier: extractor + trainable head.
    """

    def __init__(
        self,
        num_classes: int,
        extractor: MultiLayerFeatureExtractor = None,
        head_type: str = "linear",  # "linear", "mlp", "attention"
        hidden_dim: int = 512,
        dropout: float = 0.5
    ):
        super().__init__()

        # Create default extractor if none provided
        self.extractor = extractor or MultiLayerFeatureExtractor()
        input_dim = self.extractor.output_dim

        # Build classification head (this is what we train)
        if head_type == "linear":
            self.head = nn.Linear(input_dim, num_classes)
        elif head_type == "mlp":
            self.head = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, num_classes)
            )
        elif head_type == "attention":
            # Gated head (GLU): half the projection gates the other half,
            # a lightweight stand-in for attention over feature dimensions.
            # (nn.MultiheadAttention expects token sequences, so it cannot
            # be dropped into nn.Sequential on pooled feature vectors.)
            self.head = nn.Sequential(
                nn.Linear(input_dim, 2 * hidden_dim),
                nn.GLU(dim=1),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, num_classes)
            )
        else:
            raise ValueError(f"Unknown head_type: {head_type}")

        # Initialize head with good defaults
        self._init_head()

    def _init_head(self):
        for m in self.head.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.extractor(x)
        return self.head(features)

    def get_trainable_params(self):
        """Return only the parameters that should be trained."""
        return self.head.parameters()


# Usage example
if __name__ == "__main__":
    # Create classifier
    classifier = FrozenFeatureClassifier(
        num_classes=10,
        head_type="mlp"
    )

    # Only train the head!
    optimizer = torch.optim.Adam(
        classifier.get_trainable_params(),
        lr=0.001
    )

    # Dummy batch
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 10, (8,))

    # Forward pass
    logits = classifier(images)
    loss = nn.CrossEntropyLoss()(logits, labels)

    print(f"Feature dimension: {classifier.extractor.output_dim}")
    print(f"Output shape: {logits.shape}")
    print(f"Loss: {loss.item():.4f}")
```

Given frozen features, the classification head $g$ determines how we map representations to predictions. The choice of head architecture involves trade-offs between expressivity, regularization, and sample efficiency.
Linear probes:
The simplest head is a linear classifier:
$$g(z) = W z + b$$
where $W \in \mathbb{R}^{K \times d}$ for $K$ classes and $d$-dimensional features.
Advantages:
- Training is a convex optimization problem: fast, stable, and reproducible.
- Very few parameters, so overfitting is rare even with small datasets.
- Directly measures how linearly separable the frozen features are.
Disadvantages:
- Cannot model non-linear decision boundaries in feature space.
- Performance is capped by the quality of the frozen representation; it cannot compensate for missing features.
Linear probe accuracy is widely used as a metric for representation quality. If a linear classifier achieves 75% accuracy on ImageNet using features from model A, and 80% using features from model B, we say B has learned better representations. This metric is used in nearly all self-supervised learning papers.
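In practice, linear probes are often fit with scikit-learn's logistic regression directly on cached features rather than with gradient descent. A minimal sketch, where the feature and label arrays are placeholders for features loaded from your cache:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder cached features; replace with arrays loaded from your cache
train_feats = np.random.randn(2000, 2048).astype(np.float32)
train_labels = np.random.randint(0, 10, size=2000)
val_feats = np.random.randn(500, 2048).astype(np.float32)
val_labels = np.random.randint(0, 10, size=500)

# Standardize features, then fit an L2-regularized multinomial probe.
# C controls regularization strength; sweep it on a validation split.
probe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000)
)
probe.fit(train_feats, train_labels)
print(f"Linear probe accuracy: {probe.score(val_feats, val_labels):.3f}")
```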
MLP heads:
A multi-layer perceptron adds non-linear capacity:
$$g(z) = W_2 \sigma(W_1 z + b_1) + b_2$$
When to use MLPs:
- The classes are not linearly separable in the frozen feature space.
- You have enough target data (roughly 1,000+ examples; see the table below) to support the extra capacity.
- The linear probe clearly underperforms and fine-tuning is not an option.
MLP design choices:
- Depth: one or two hidden layers is usually enough; deeper heads mostly overfit.
- Width: hidden dimensions of 512-1024 are common defaults.
- Regularization: dropout and weight decay matter more than the exact architecture; batch normalization often stabilizes training on cached features.
Attention-based heads:
For sequence or spatial features (before global pooling), attention heads can learn to weight different positions:
$$g(z_1, ..., z_n) = \text{Attention}(Q, K, V)$$
This is useful when (see the sketch below):
- you extract spatial or token-level features (before global pooling) rather than a single pooled vector,
- the informative regions vary from input to input, so uniform pooling discards signal, or
- the inputs are sequences (text, audio) where position matters.
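A working attention-pooling head for spatial feature maps can be built with a single learned query attending over the H×W positions. This sketch assumes you extract features before global pooling, e.g. ResNet-50's 7×7×2048 `layer4` output:

```python
import torch
import torch.nn as nn

class AttentionPoolHead(nn.Module):
    """Learned-query attention pooling over spatial positions,
    followed by a linear classifier."""

    def __init__(self, feat_dim: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) -> tokens of shape (B, H*W, C)
        b, c, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2)
        query = self.query.expand(b, -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)  # (B, 1, C)
        return self.classifier(pooled.squeeze(1))

head = AttentionPoolHead(feat_dim=2048, num_classes=10)
logits = head(torch.randn(8, 2048, 7, 7))  # frozen layer4 feature maps
print(logits.shape)  # torch.Size([8, 10])
```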
Head capacity and overfitting:
A crucial principle: more head capacity requires more target data. The relationship follows:
| Target Data Size | Recommended Head | Why |
|---|---|---|
| < 100 examples | Linear | Prevent overfitting |
| 100 - 1,000 | Linear or small MLP | Limited capacity okay |
| 1,000 - 10,000 | MLP (1-2 layers) | Can support non-linearity |
| > 10,000 | Deeper MLP or consider fine-tuning | Sufficient data for more expressivity |
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class LinearProbe(nn.Module):
    """
    Standard linear probe for frozen feature evaluation.
    Includes optional normalization and temperature scaling.
    """

    def __init__(
        self,
        input_dim: int,
        num_classes: int,
        normalize: bool = True,
        temperature: float = 1.0
    ):
        super().__init__()
        self.normalize = normalize
        self.temperature = temperature
        self.classifier = nn.Linear(input_dim, num_classes)

        # Initialize with scaling aware of normalization
        nn.init.normal_(self.classifier.weight, std=0.01)
        nn.init.zeros_(self.classifier.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.normalize:
            x = F.normalize(x, p=2, dim=1)
        logits = self.classifier(x) / self.temperature
        return logits


class MLPHead(nn.Module):
    """
    Flexible MLP head with configurable architecture.
    """

    def __init__(
        self,
        input_dim: int,
        num_classes: int,
        hidden_dims: list = [512],
        activation: str = "relu",
        dropout: float = 0.5,
        batch_norm: bool = True
    ):
        super().__init__()

        layers = []
        prev_dim = input_dim

        # Activation function selection
        act_fn = {
            "relu": nn.ReLU,
            "gelu": nn.GELU,
            "silu": nn.SiLU
        }[activation]

        # Build hidden layers
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            if batch_norm:
                layers.append(nn.BatchNorm1d(hidden_dim))
            layers.append(act_fn())
            layers.append(nn.Dropout(dropout))
            prev_dim = hidden_dim

        # Final classification layer
        layers.append(nn.Linear(prev_dim, num_classes))

        self.mlp = nn.Sequential(*layers)
        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm1d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class PrototypicalHead(nn.Module):
    """
    Prototype-based classification head.
    Learns class prototypes and classifies by nearest prototype.
    Particularly effective for few-shot scenarios.
    """

    def __init__(
        self,
        input_dim: int,
        num_classes: int,
        metric: str = "euclidean"  # or "cosine"
    ):
        super().__init__()
        self.metric = metric

        # Learnable prototypes (one per class)
        self.prototypes = nn.Parameter(
            torch.randn(num_classes, input_dim)
        )
        nn.init.xavier_uniform_(self.prototypes)

        # Optional learnable temperature
        self.temperature = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.metric == "euclidean":
            # Negative squared distance (higher = closer)
            dists = -torch.cdist(x, self.prototypes).pow(2)
        elif self.metric == "cosine":
            # Cosine similarity
            x_norm = F.normalize(x, dim=1)
            p_norm = F.normalize(self.prototypes, dim=1)
            dists = x_norm @ p_norm.T

        return dists / self.temperature


def select_head(
    input_dim: int,
    num_classes: int,
    target_data_size: int,
    task_type: str = "classification"
) -> nn.Module:
    """
    Heuristic head selection based on data availability.

    Args:
        input_dim: Dimension of frozen features
        num_classes: Number of output classes
        target_data_size: Number of labeled examples available
        task_type: "classification" or "regression"

    Returns:
        Appropriately-sized classification head
    """
    samples_per_class = target_data_size / num_classes

    if samples_per_class < 10:
        # Very few examples: use prototypical head
        print("Selected: PrototypicalHead (few-shot regime)")
        return PrototypicalHead(input_dim, num_classes, metric="cosine")

    elif samples_per_class < 50:
        # Limited data: use linear probe
        print("Selected: LinearProbe (limited data regime)")
        return LinearProbe(input_dim, num_classes, normalize=True)

    elif samples_per_class < 200:
        # Moderate data: small MLP
        print("Selected: MLPHead with 1 hidden layer")
        return MLPHead(
            input_dim, num_classes,
            hidden_dims=[512],
            dropout=0.5
        )

    else:
        # Ample data: larger MLP
        print("Selected: MLPHead with 2 hidden layers")
        return MLPHead(
            input_dim, num_classes,
            hidden_dims=[1024, 512],
            dropout=0.3
        )
```

A major advantage of frozen features is that we can pre-compute and cache representations, dramatically accelerating subsequent training. This section covers efficient implementation strategies.
The caching workflow:
1. Run every example through the frozen encoder exactly once.
2. Store the resulting feature vectors and labels to disk.
3. Train and iterate on classification heads directly from the cached features.
4. Reuse the same cache across hyperparameter sweeps, head architectures, and even different tasks.
Computational savings:
Consider training a classifier on 50,000 images with ResNet-50:
| Approach | Time per Epoch | Forward Pass | Backward Pass |
|---|---|---|---|
| Full fine-tune | ~5 min | Encoder + Head | Encoder + Head |
| Frozen (no cache) | ~3 min | Encoder + Head | Head only |
| Frozen (cached) | ~5 sec | Head only | Head only |
Caching speeds up each head-training epoch by roughly 36x compared to uncached frozen training, and by about 60x compared to full fine-tuning.
For 50,000 images with 2048-dim features in float32, you need ~400MB of storage. This is negligible compared to the original images (~5GB for ImageNet-scale). The storage-compute tradeoff strongly favors caching for most scenarios.
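The storage math is easy to sanity-check for your own setup. A small sketch; adjust the counts and dtype to your dataset and to whether you cache in half precision as the cacher below does:

```python
num_samples = 50_000
feat_dim = 2048
bytes_per_value = 4          # float32; use 2 for float16 caching

total_mb = num_samples * feat_dim * bytes_per_value / 1e6
print(f"Cache size: {total_mb:.0f} MB")  # ~410 MB in float32, ~205 MB in float16
```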
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from pathlib import Path
from tqdm import tqdm
from typing import Tuple, Optional
import h5py


class FeatureCacher:
    """
    Efficient feature extraction and caching for frozen transfer.
    Supports multiple storage backends and streaming extraction.
    """

    def __init__(
        self,
        encoder: nn.Module,
        cache_dir: str = "./feature_cache",
        device: str = "cuda",
        dtype: torch.dtype = torch.float16  # Half precision saves 50% storage
    ):
        self.encoder = encoder.to(device).eval()
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.device = device
        self.dtype = dtype

        # Freeze encoder
        for param in self.encoder.parameters():
            param.requires_grad = False

    @torch.no_grad()
    def extract_and_cache(
        self,
        dataloader: DataLoader,
        cache_name: str,
        backend: str = "hdf5"  # "hdf5", "numpy", "torch"
    ) -> Path:
        """
        Extract features from all data and save to disk.

        Args:
            dataloader: DataLoader yielding (images, labels)
            cache_name: Name for the cached features file
            backend: Storage format

        Returns:
            Path to cached features
        """
        all_features = []
        all_labels = []

        print(f"Extracting features from {len(dataloader.dataset)} samples...")

        for images, labels in tqdm(dataloader, desc="Extracting"):
            images = images.to(self.device)

            # Extract features
            features = self.encoder(images)

            # Flatten if needed (e.g., from spatial features)
            if features.dim() > 2:
                features = features.flatten(start_dim=1)

            # Convert to target dtype for storage efficiency
            features = features.to(self.dtype).cpu()

            all_features.append(features)
            all_labels.append(labels)

        # Concatenate all batches
        features_tensor = torch.cat(all_features, dim=0)
        labels_tensor = torch.cat(all_labels, dim=0)

        # Save based on backend
        cache_path = self.cache_dir / f"{cache_name}.{backend}"

        if backend == "hdf5":
            with h5py.File(cache_path, 'w') as f:
                f.create_dataset('features', data=features_tensor.numpy(),
                                 compression='gzip', compression_opts=4)
                f.create_dataset('labels', data=labels_tensor.numpy())
        elif backend == "numpy":
            np.savez_compressed(
                cache_path.with_suffix('.npz'),
                features=features_tensor.numpy(),
                labels=labels_tensor.numpy()
            )
            cache_path = cache_path.with_suffix('.npz')
        elif backend == "torch":
            torch.save({
                'features': features_tensor,
                'labels': labels_tensor
            }, cache_path.with_suffix('.pt'))
            cache_path = cache_path.with_suffix('.pt')

        print(f"Cached {len(features_tensor)} samples to {cache_path}")
        print(f"Feature shape: {features_tensor.shape}")
        print(f"Storage size: {cache_path.stat().st_size / 1e6:.1f} MB")

        return cache_path

    @staticmethod
    def load_cached_features(
        cache_path: Path,
        device: str = "cpu"
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Load cached features back into memory."""
        suffix = cache_path.suffix

        if suffix == '.hdf5' or suffix == '.h5':
            with h5py.File(cache_path, 'r') as f:
                features = torch.tensor(f['features'][:], dtype=torch.float32)
                labels = torch.tensor(f['labels'][:], dtype=torch.long)
        elif suffix == '.npz':
            data = np.load(cache_path)
            features = torch.tensor(data['features'], dtype=torch.float32)
            labels = torch.tensor(data['labels'], dtype=torch.long)
        elif suffix == '.pt':
            data = torch.load(cache_path)
            features = data['features'].float()
            labels = data['labels']

        return features.to(device), labels.to(device)


class CachedFeatureDataset(torch.utils.data.Dataset):
    """
    Dataset that loads features from disk lazily.
    Useful for very large datasets that don't fit in memory.
    """

    def __init__(self, cache_path: Path):
        self.cache_path = cache_path
        # Open file handle for lazy loading
        self.h5_file = h5py.File(cache_path, 'r')
        self.features = self.h5_file['features']
        self.labels = self.h5_file['labels']

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        feature = torch.tensor(self.features[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return feature, label

    def __del__(self):
        self.h5_file.close()


def train_on_cached_features(
    features: torch.Tensor,
    labels: torch.Tensor,
    head: nn.Module,
    num_epochs: int = 100,
    batch_size: int = 256,
    lr: float = 0.01,
    weight_decay: float = 1e-4,
    device: str = "cuda"
) -> nn.Module:
    """
    Train a classification head on pre-cached features.
    This is extremely fast since there's no encoder forward pass.
    """
    head = head.to(device)

    # Create simple dataset from tensors
    dataset = TensorDataset(features, labels)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        pin_memory=True
    )

    # Optimizer
    optimizer = torch.optim.AdamW(
        head.parameters(),
        lr=lr,
        weight_decay=weight_decay
    )

    # Cosine schedule
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs
    )

    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        head.train()
        total_loss = 0
        correct = 0
        total = 0

        for batch_features, batch_labels in loader:
            batch_features = batch_features.to(device)
            batch_labels = batch_labels.to(device)

            optimizer.zero_grad()
            logits = head(batch_features)
            loss = criterion(logits, batch_labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            correct += (logits.argmax(1) == batch_labels).sum().item()
            total += len(batch_labels)

        scheduler.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Loss={total_loss/len(loader):.4f}, "
                  f"Acc={100*correct/total:.2f}%")

    return head
```

While frozen features provide an efficient baseline, they have fundamental limitations that motivate more sophisticated transfer approaches.
Limitation 1: Domain shift
When the target domain differs significantly from the pre-training domain, frozen representations may encode irrelevant information while lacking task-relevant features.
Example: ImageNet features for medical imaging. ImageNet encoders emphasize the textures, colors, and object shapes of natural photographs, while tasks such as retinal OCT classification depend on subtle intensity and structural cues that those features barely encode; this is why the medical row in the results table below shows by far the largest gap between frozen features and fine-tuning.
If your target domain is substantially different from the pre-training domain (e.g., natural images → satellite imagery, English → Arabic, photos → sketches), frozen features often underperform. The representation gap becomes too large for the task head to compensate.
Limitation 2: Fixed resolution and architecture
Frozen feature extraction inherits all the architectural constraints of the pre-trained model:
- The input resolution and preprocessing must match what the encoder expects (e.g., 224×224 inputs with ImageNet normalization).
- The feature dimension is fixed by the architecture.
- The backbone's inductive biases (receptive field, pooling, tokenization) cannot be changed to suit your data.
Limitation 3: No task-specific adaptation
The representation is frozen—it cannot adjust to emphasize features relevant to your specific task:
$$\frac{\partial \mathcal{L}}{\partial \phi} = 0$$
The gradient signal from the target task cannot flow back to improve the representation. If the pre-trained features don't linearly separate your classes, you're limited to what a non-linear head can achieve on top of them.
Limitation 4: Layer-feature mismatch
The optimal extraction layer varies by task, but we can only choose one (or a fixed combination). If your task needs:
- low-level texture cues, they live mostly in early layers;
- high-level semantics, they live mostly in late layers;
- both at once, any single extraction point (or fixed combination) forces a compromise.
Limitation 5: Representation bottleneck
The representation dimension $d$ is fixed. If task-relevant information was discarded when the encoder compressed its inputs down to $d$ dimensions during pre-training, that information cannot be recovered downstream.
Quantifying the limitations:
Research has measured the gap between frozen features and fine-tuning:
| Dataset | Frozen Linear Probe | Full Fine-tune | Gap |
|---|---|---|---|
| CIFAR-10 | 93.2% | 97.4% | 4.2% |
| CIFAR-100 | 78.5% | 86.9% | 8.4% |
| Oxford Flowers | 94.1% | 98.2% | 4.1% |
| DTD Textures | 73.8% | 79.4% | 5.6% |
| Retinal OCT | 68.2% | 89.5% | 21.3% |
The gap is small when the target domain is similar to ImageNet (CIFAR, Flowers) but large for domain-shifted data (medical images).
Given the trade-offs, when should you choose frozen features over fine-tuning?
Decision framework:
| Criterion | Favor Frozen | Favor Fine-Tuning |
|---|---|---|
| Target data size | < 1,000 examples | > 5,000 examples |
| Domain similarity | High (same domain) | Low (domain shift) |
| Compute budget | Limited / CPU only | GPU hours available |
| Iteration speed | Need rapid experiments | Can wait for training |
| Linear probe accuracy | > 80% of target | < 60% of target |
| Task type | Coarse-grained | Fine-grained distinctions |
| Model size | Large model, limited VRAM | Sufficient VRAM for backprop |
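The decision framework can be turned into a rough heuristic. This is a sketch only; the thresholds mirror the table above and should be tuned to your setting:

```python
def recommend_strategy(
    num_target_examples: int,
    probe_accuracy: float,        # linear probe accuracy, 0-1
    target_accuracy: float,       # accuracy you need, 0-1
    high_domain_similarity: bool,
    has_gpu_budget: bool,
) -> str:
    """Rule-of-thumb choice between frozen features and fine-tuning."""
    probe_ratio = probe_accuracy / target_accuracy

    if probe_ratio >= 0.9:
        return "frozen features (probe already near target; fine-tuning ROI is low)"
    if num_target_examples < 1_000 or not has_gpu_budget:
        return "frozen features (too little data or compute to fine-tune safely)"
    if probe_ratio < 0.6 and not high_domain_similarity:
        return "fine-tune (large representation gap under domain shift)"
    return "try a stronger head / multi-layer features first, then fine-tune if needed"

print(recommend_strategy(20_000, probe_accuracy=0.62, target_accuracy=0.90,
                         high_domain_similarity=False, has_gpu_budget=True))
```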
The hybrid strategy:
In practice, frozen features serve as a critical starting point and a cheap estimate of the performance floor:
1. Cache features and train a linear probe (minutes of compute).
2. If the probe meets your accuracy target, stop.
3. If it is close, try a small MLP head or multi-layer feature extraction before anything heavier.
4. Only if a substantial gap remains, and you have enough target data, move on to fine-tuning.
This workflow ensures you don't waste compute on fine-tuning when frozen features suffice, while identifying cases where adaptation is necessary.
Special use cases for frozen features:
1. Multi-task learning: Extract features once, train multiple heads for different tasks. Each task head trains independently, enabling:
- new tasks to be added without re-running or retraining the encoder,
- no interference between tasks, since each head has its own parameters, and
- a single shared encoder forward pass serving all heads at inference time (see the sketch below).
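A minimal sketch of this pattern, assuming features have already been cached once and each task simply attaches its own linear head (the task names are illustrative):

```python
import torch
import torch.nn as nn

feat_dim = 2048
shared_features = torch.randn(1000, feat_dim)  # cached once, reused by all tasks

# One independent head per task; the encoder is never touched again
task_heads = nn.ModuleDict({
    "species": nn.Linear(feat_dim, 200),
    "habitat": nn.Linear(feat_dim, 12),
    "is_rare": nn.Linear(feat_dim, 2),
})

# Each head trains (and can be added or retrained later) independently,
# but all tasks share the single encoder forward pass that produced the cache.
outputs = {name: head(shared_features) for name, head in task_heads.items()}
for name, out in outputs.items():
    print(name, out.shape)
```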
2. Inference efficiency: In deployment, frozen feature + lightweight head can be faster than a fully fine-tuned model if:
- one encoder forward pass is shared across several task heads, so each additional task costs only a tiny head, or
- features for a static corpus can be pre-computed offline, leaving only the head to run at request time.
3. Representation analysis: Frozen features enable studying what representations encode without confounding from fine-tuning:
- probing which attributes are linearly decodable from each layer,
- comparing pre-training methods (supervised vs. self-supervised) on an equal footing via linear probe accuracy, and
- tracking how representation quality changes across layers and checkpoints.
4. Few-shot and zero-shot: With extremely limited data, frozen features are essential:
- in few-shot settings, nearest-prototype or nearest-neighbor classifiers over frozen features (as in the PrototypicalHead above) are often the strongest option, and
- in zero-shot settings there is no target training at all, so classification reduces to similarity in the frozen embedding space.
Always compute frozen feature baseline first. If it achieves >90% of your target performance, consider whether the compute cost of fine-tuning is justified by the remaining gap. Often, improving data quality or adding more labeled examples provides better ROI than switching to fine-tuning.
We've covered frozen feature transfer comprehensively—from theory to implementation to practical decision-making. Let's consolidate the key points:
- Frozen feature transfer trains only a lightweight head $g$ on top of a frozen encoder $\phi$; the encoder never changes.
- It works when the pre-trained representation already (nearly) linearly separates the target classes, which hinges on domain similarity.
- Caching features makes head training and experimentation orders of magnitude faster.
- Layer choice and head capacity should match domain similarity and target data size.
- Its main limits are domain shift and the inability to adapt the representation; the linear probe baseline tells you when those limits bind.
What's next:
Having understood frozen features as a baseline, the next page dives deeper into feature extraction—more sophisticated techniques for deriving useful representations from pre-trained models, including multi-scale feature aggregation, attention-based pooling, and representation dimensionality reduction. These techniques improve upon naive feature extraction while still avoiding the cost of fine-tuning.
You now understand frozen feature transfer: when it works, how to implement it efficiently, and what its limitations are. This baseline establishes the floor for transfer performance and guides decisions about when more sophisticated methods are needed.