The simplest form of transfer learning requires no additional training of the pre-trained model at all. Instead, we treat the pre-trained network as a fixed feature extractor—a black box that transforms raw inputs into meaningful representations. Only a lightweight classifier or regressor is trained on top of these frozen features.
This approach, known as frozen feature transfer or simply training a linear probe, establishes the baseline for all transfer learning methods. Before investing compute in fine-tuning, we should ask: how far can we get without modifying the pre-trained weights at all?
The answer, surprisingly, is often "quite far." For many tasks, especially those with limited labeled data or high similarity to the pre-training domain, frozen features provide competitive or even superior performance to more complex approaches.
This page provides a rigorous, comprehensive exploration of frozen feature transfer: when it works, why it fails, how to implement it correctly, and what it reveals about representation quality.
By the end of this page, you will understand the theoretical basis for frozen feature transfer, know when to use this approach versus fine-tuning, master the implementation details including layer selection and classifier design, and interpret what frozen feature performance tells you about domain similarity.
In frozen feature transfer, we decompose a model into two components:
$$f(x) = g(\phi(x))$$
where:
- $\phi: \mathcal{X} \to \mathbb{R}^d$ is the pre-trained encoder, kept frozen, and
- $g$ is a lightweight task head (e.g., a linear classifier or small MLP) trained on the target data.
The training procedure:
1. Pass each target input through the frozen encoder to obtain features $z = \phi(x)$.
2. Train only the head $g$ on the pairs $(z, y)$ with standard supervised learning.
3. At inference, predict with $g(\phi(x))$; the encoder weights never change.
This is computationally efficient because:
- No gradients are computed or stored for the encoder.
- The number of trainable parameters is a small fraction of the full network.
- Features can be computed once and cached, so later experiments skip the encoder entirely.
For a ResNet-50, the encoder has ~25M parameters while a linear probe has only ~2M (2048 features × 1000 classes). Training the probe takes seconds per epoch versus minutes for full fine-tuning. More importantly, you can pre-compute and cache features, making classifier experimentation extremely fast.
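As a quick illustration of this parameter asymmetry, here is a minimal sketch (the 10-class head is an illustrative choice, not fixed by the text):

```python
import torch.nn as nn
from torchvision import models

# Load a pre-trained ResNet-50 and freeze every encoder weight
encoder = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
for p in encoder.parameters():
    p.requires_grad = False

# Replace the original 1000-class layer with a fresh 10-class probe;
# this new layer is the only part that receives gradients
encoder.fc = nn.Linear(encoder.fc.in_features, 10)

trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters() if not p.requires_grad)
print(f"Trainable: {trainable:,}  Frozen: {frozen:,}")
# Roughly 20K trainable parameters vs. ~23.5M frozen ones
```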
Why frozen features work:
The success of frozen features rests on the assumption that the pre-trained encoder has learned task-agnostic, general-purpose representations. If $\phi$ maps semantically similar inputs to nearby points in representation space, then a simple linear classifier can find decision boundaries.
Mathematically, if the representation induces a clustering structure where:
$$\text{intra-class distance} \ll \text{inter-class distance}$$
then even linear classifiers achieve high accuracy. This is precisely what good pre-trained representations achieve—they linearly separate many downstream classes, even those not seen during pre-training.
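A quick way to sanity-check this clustering structure on your own data is to compare average intra-class and inter-class distances in feature space. The sketch below assumes `features` and `labels` are tensors you have already extracted with a frozen encoder:

```python
import torch

def clustering_ratio(features: torch.Tensor, labels: torch.Tensor) -> float:
    """Ratio of mean intra-class to mean inter-class distance.

    Values well below 1.0 suggest the frozen features already cluster
    by class, so a linear probe should do well.
    """
    # Pairwise Euclidean distances between all feature vectors
    dists = torch.cdist(features, features)            # (N, N)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (N, N) same-class mask
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)

    intra = dists[same & off_diag].mean()  # same class, excluding self-pairs
    inter = dists[~same].mean()            # different classes
    return (intra / inter).item()

# Example with random placeholders (replace with real cached features)
feats = torch.randn(512, 2048)
labs = torch.randint(0, 10, (512,))
print(f"intra/inter distance ratio: {clustering_ratio(feats, labs):.3f}")
```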
Theoretical guarantee (informal):
If the representation $\phi$ satisfies certain smoothness and separability conditions, then the sample complexity of learning $g$ scales only with the dimension $d$ of the representation and the complexity of $g$, not with the complexity of the full input space $\mathcal{X}$.
This is the core benefit: we've reduced a complex learning problem (raw pixels to labels) into a simpler one (learned features to labels).
Understanding when frozen features succeed or fail requires formal analysis of the transfer process.
Problem formulation:
Let $\mathcal{D}_S = \{(x_i, y_i)\}_{i=1}^{n}$ be the source domain data (used for pre-training) and $\mathcal{D}_T = \{(x_j, y_j)\}_{j=1}^{m}$ be the target domain data (used for training $g$). The source and target may have different:
- input distributions ($P_S(X) \neq P_T(X)$),
- label spaces and numbers of classes, and
- labeling functions (the same input can have different meanings in the two tasks).
The transfer error decomposition:
The error on the target task can be decomposed as:
$$\epsilon_T(g \circ \phi) = \epsilon_T(g \circ \phi^*) + \underbrace{[\epsilon_T(g \circ \phi) - \epsilon_T(g \circ \phi^*)]}_{\text{representation gap}}$$
where $\phi^*$ is the optimal representation for the target task. The representation gap measures how much we lose by using the transferred representation instead of the ideal one.
The representation gap is zero when the pre-trained φ happens to be optimal for the target task. This occurs when source and target tasks share the same underlying structure. In practice, we aim for this gap to be small, not zero.
Bounding the transfer error:
Under certain conditions, we can bound the target error:
$$\epsilon_T(g \circ \phi) \leq \epsilon_S(g \circ \phi) + d_{\mathcal{H}}(P_S^\phi, P_T^\phi) + \lambda$$
where:
- $\epsilon_S(g \circ \phi)$ is the error of the same hypothesis on the source distribution,
- $d_{\mathcal{H}}(P_S^\phi, P_T^\phi)$ is a divergence between the source and target distributions measured in representation space, and
- $\lambda$ is the error of the best joint hypothesis on both domains, an irreducible term.
Important implications:
Low source error isn't enough: The pre-trained model might have zero source error but fail on the target if the distributions differ significantly.
Representation matters: The divergence term operates in representation space, not input space. A good $\phi$ maps both domains to overlapping regions.
Some tasks are inherently harder: The $\lambda$ term represents an irreducible component—if source and target are fundamentally different, no representation helps.
Linear probe theory:
For linear probes $g(z) = w^\top z + b$, the generalization error satisfies:
$$\mathbb{E}[\epsilon(g)] \leq \frac{\|w\|^2 \cdot \text{Var}(\phi(X))}{m} + O\left(\sqrt{\frac{d}{m}}\right)$$
where $m$ is the number of target examples and $d$ is the representation dimension. This shows:
- the error shrinks as the number of target examples $m$ grows,
- higher-dimensional representations need more target data (the $\sqrt{d/m}$ term), and
- controlling $\|w\|$ through regularization (e.g., weight decay) tightens the bound, which is why linear probes generalize well even from few examples.
| Factor | Effect on Transfer | How to Measure |
|---|---|---|
| Source-target domain similarity | Higher similarity → better transfer | Distribution divergence metrics, visual inspection |
| Representation quality | Better representations → easier downstream learning | Linear probe accuracy on source tasks |
| Target data quantity | More data → lower variance, better probe training | Learning curve analysis |
| Number of target classes | More classes → harder classification | Random baseline (1/K for K classes) |
| Representation dimension | Higher dimension → need more data (curse of dimensionality) | Effective dimension, PCA analysis |
| Class imbalance | Imbalance → biased probes, misleading accuracy | Class distribution analysis, stratified metrics |
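The divergence entries in this table can be approximated cheaply. One crude proxy, sketched below under the assumption that you already have feature matrices for both domains, compares the first and second moments of the two feature distributions:

```python
import torch

def feature_stats_distance(src: torch.Tensor, tgt: torch.Tensor) -> float:
    """Crude domain-divergence proxy: compare the mean and covariance of
    source vs. target features in representation space."""
    mu_s, mu_t = src.mean(dim=0), tgt.mean(dim=0)
    # Covariance matrices (rows of src/tgt are samples, columns are features)
    cov_s = torch.cov(src.T)
    cov_t = torch.cov(tgt.T)
    mean_term = (mu_s - mu_t).pow(2).sum()
    cov_term = (cov_s - cov_t).norm(p="fro")
    return (mean_term + cov_term).item()

# Larger values hint at a bigger representation-space gap between the
# pre-training and target domains; compare against a same-domain split
# to calibrate what "large" means for your encoder.
src_feats = torch.randn(1000, 2048)        # placeholder source features
tgt_feats = torch.randn(1000, 2048) + 0.5  # placeholder shifted target features
print(f"divergence proxy: {feature_stats_distance(src_feats, tgt_feats):.2f}")
```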
A critical decision in frozen feature transfer is which layer to extract features from. Different layers encode different levels of abstraction, and the optimal choice depends on the source-target relationship.
The layer selection principle:
As discussed in Page 0, neural networks learn hierarchical features:
- Early layers capture edges, colors, and textures that are generic across domains.
- Middle layers capture motifs and object parts.
- Late layers capture high-level semantics that are increasingly specific to the pre-training task.
Optimal layer depends on domain similarity:
| Domain Similarity | Optimal Extraction Layer | Reasoning |
|---|---|---|
| Very High | Penultimate (pre-logits) | High-level features directly applicable |
| Moderate | Middle layers | Balance between generality and specificity |
| Low | Early layers | Only low-level features transfer |
| Very Low | Consider from scratch | Transfer may hurt more than help |
For most transfer scenarios, start with the penultimate layer (the layer before the final classification layer). This is the default in most frameworks. It captures high-level semantics while remaining somewhat general. Only move to earlier layers if this underperforms your expectations.
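In torchvision, this default amounts to chopping off the final classification layer. A minimal sketch (any backbone exposing a `.fc` attribute works the same way):

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = nn.Identity()   # drop the 1000-class layer; keep the avgpool output
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.randn(4, 3, 224, 224))
print(feats.shape)  # torch.Size([4, 2048]) -- penultimate (avgpool) features
```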
Feature dimensionality at different layers:
Consider a ResNet-50 architecture:
| Layer | Output Shape | Feature Dimension | Characteristics |
|---|---|---|---|
| conv1 | 112×112×64 | 802,816 | Gabor-like filters, very generic |
| layer1 (res2) | 56×56×256 | 802,816 | Low-level compositions |
| layer2 (res3) | 28×28×512 | 401,408 | Mid-level patterns |
| layer3 (res4) | 14×14×1024 | 200,704 | Object parts |
| layer4 (res5) | 7×7×2048 | 100,352 | High-level semantics |
| avgpool | 1×1×2048 | 2,048 | Global representation |
Practical approaches to layer selection:
Single-layer extraction: Extract from one layer (usually avgpool), train a single probe.
Multi-layer concatenation: Concatenate features from multiple layers, increasing expressivity at the cost of dimensionality.
Multi-scale extraction: Pool spatial features at different resolutions and combine. Captures both local and global information.
Learned combination: Train a lightweight network to combine features from multiple layers.
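For the learned-combination option, a lightweight version is a softmax-weighted sum over per-layer features projected to a common width. This is a sketch only; the extractor class below implements the concatenation and mean strategies:

```python
import torch
import torch.nn as nn
from typing import List

class LearnedLayerCombiner(nn.Module):
    """Project each layer's features to a shared width, then mix them
    with learned softmax weights. Trained jointly with the head."""

    def __init__(self, layer_dims: List[int], out_dim: int = 512):
        super().__init__()
        self.projections = nn.ModuleList(
            [nn.Linear(d, out_dim) for d in layer_dims]
        )
        self.layer_logits = nn.Parameter(torch.zeros(len(layer_dims)))

    def forward(self, layer_feats: List[torch.Tensor]) -> torch.Tensor:
        weights = torch.softmax(self.layer_logits, dim=0)
        projected = [proj(f) for proj, f in zip(self.projections, layer_feats)]
        return sum(w * p for w, p in zip(weights, projected))

# e.g. ResNet-50 layer3 / layer4 / avgpool features (1024, 2048, 2048 dims)
combiner = LearnedLayerCombiner([1024, 2048, 2048])
feats = [torch.randn(8, 1024), torch.randn(8, 2048), torch.randn(8, 2048)]
print(combiner(feats).shape)  # torch.Size([8, 512])
```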
```python
import torch
import torch.nn as nn
from torchvision import models
from typing import Dict, List


class MultiLayerFeatureExtractor(nn.Module):
    """
    Extract features from multiple layers of a pre-trained model.
    Supports flexible layer selection and feature combination strategies.
    """

    def __init__(
        self,
        model_name: str = "resnet50",
        layers: List[str] = ["layer3", "layer4", "avgpool"],
        combine_strategy: str = "concat"  # "concat", "mean"
    ):
        super().__init__()

        # Load pre-trained model; "DEFAULT" selects the best available
        # ImageNet weights for the named architecture
        base_model = getattr(models, model_name)(weights="DEFAULT")

        # Freeze all parameters
        for param in base_model.parameters():
            param.requires_grad = False

        # Store model components for hook access
        self.model = base_model
        self.layers = layers
        self.combine_strategy = combine_strategy

        # Storage for intermediate features
        self.features: Dict[str, torch.Tensor] = {}

        # Register forward hooks on specified layers
        for name in layers:
            layer = dict(base_model.named_modules())[name]
            layer.register_forward_hook(self._get_hook(name))

        # Compute output dimension for downstream heads
        self._compute_output_dim()

    def _get_hook(self, name: str):
        def hook(module, input, output):
            # Flatten spatial dimensions if present
            if output.dim() == 4:  # B x C x H x W
                output = output.flatten(start_dim=2).mean(dim=2)  # Global avg pool
            self.features[name] = output
        return hook

    def _compute_output_dim(self):
        """Determine output dimension by doing a dummy forward pass."""
        with torch.no_grad():
            dummy = torch.randn(1, 3, 224, 224)
            self.forward(dummy)

        if self.combine_strategy == "concat":
            self.output_dim = sum(
                self.features[l].shape[1] for l in self.layers
            )
        else:
            # Assumes all layers have same dim (may need projection)
            self.output_dim = self.features[self.layers[0]].shape[1]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clear previous features
        self.features = {}

        # Forward through full model (hooks capture intermediate outputs)
        _ = self.model(x)

        # Combine features according to strategy
        feature_list = [self.features[l] for l in self.layers]

        if self.combine_strategy == "concat":
            return torch.cat(feature_list, dim=1)
        elif self.combine_strategy == "mean":
            # Stack and average (assumes same dimension)
            return torch.stack(feature_list, dim=0).mean(dim=0)
        else:
            raise ValueError(f"Unknown strategy: {self.combine_strategy}")


class FrozenFeatureClassifier(nn.Module):
    """
    Complete frozen feature classifier: extractor + trainable head.
    """

    def __init__(
        self,
        num_classes: int,
        extractor: MultiLayerFeatureExtractor = None,
        head_type: str = "linear",  # "linear", "mlp", "attention"
        hidden_dim: int = 512,
        dropout: float = 0.5
    ):
        super().__init__()

        # Create default extractor if none provided
        self.extractor = extractor or MultiLayerFeatureExtractor()
        input_dim = self.extractor.output_dim

        # Build classification head (this is what we train)
        if head_type == "linear":
            self.head = nn.Linear(input_dim, num_classes)
        elif head_type == "mlp":
            self.head = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, num_classes)
            )
        elif head_type == "attention":
            # Gated head (GLU): half the projection gates the other half,
            # a lightweight stand-in for attention over feature dimensions.
            # (nn.MultiheadAttention expects token sequences, so it cannot
            # be dropped into nn.Sequential on pooled feature vectors.)
            self.head = nn.Sequential(
                nn.Linear(input_dim, 2 * hidden_dim),
                nn.GLU(dim=1),
                nn.Dropout(dropout),
                nn.Linear(hidden_dim, num_classes)
            )
        else:
            raise ValueError(f"Unknown head_type: {head_type}")

        # Initialize head with good defaults
        self._init_head()

    def _init_head(self):
        for m in self.head.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.extractor(x)
        return self.head(features)

    def get_trainable_params(self):
        """Return only the parameters that should be trained."""
        return self.head.parameters()


# Usage example
if __name__ == "__main__":
    # Create classifier
    classifier = FrozenFeatureClassifier(
        num_classes=10,
        head_type="mlp"
    )

    # Only train the head!
    optimizer = torch.optim.Adam(
        classifier.get_trainable_params(),
        lr=0.001
    )

    # Dummy batch
    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 10, (8,))

    # Forward pass
    logits = classifier(images)
    loss = nn.CrossEntropyLoss()(logits, labels)

    print(f"Feature dimension: {classifier.extractor.output_dim}")
    print(f"Output shape: {logits.shape}")
    print(f"Loss: {loss.item():.4f}")
```

Given frozen features, the classification head $g$ determines how we map representations to predictions. The choice of head architecture involves trade-offs between expressivity, regularization, and sample efficiency.
Linear probes:
The simplest head is a linear classifier:
$$g(z) = W z + b$$
where $W \in \mathbb{R}^{K \times d}$ for $K$ classes and $d$-dimensional features.
Advantages:
- Training is a convex optimization problem: fast, stable, and reproducible.
- Very few parameters, so overfitting is rare even with small datasets.
- Directly measures how linearly separable the frozen features are.
Disadvantages:
- Cannot model non-linear decision boundaries in feature space.
- Performance is capped by the quality of the frozen representation; it cannot compensate for missing features.
Linear probe accuracy is widely used as a metric for representation quality. If a linear classifier achieves 75% accuracy on ImageNet using features from model A, and 80% using features from model B, we say B has learned better representations. This metric is used in nearly all self-supervised learning papers.
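In practice, linear probes are often fit with scikit-learn's logistic regression directly on cached features rather than with gradient descent. A minimal sketch, where the feature and label arrays are placeholders for features loaded from your cache:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Placeholder cached features; replace with arrays loaded from your cache
train_feats = np.random.randn(2000, 2048).astype(np.float32)
train_labels = np.random.randint(0, 10, size=2000)
val_feats = np.random.randn(500, 2048).astype(np.float32)
val_labels = np.random.randint(0, 10, size=500)

# Standardize features, then fit an L2-regularized multinomial probe.
# C controls regularization strength; sweep it on a validation split.
probe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, max_iter=1000)
)
probe.fit(train_feats, train_labels)
print(f"Linear probe accuracy: {probe.score(val_feats, val_labels):.3f}")
```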
MLP heads:
A multi-layer perceptron adds non-linear capacity:
$$g(z) = W_2 \sigma(W_1 z + b_1) + b_2$$
When to use MLPs:
- The classes are not linearly separable in the frozen feature space.
- You have enough target data (roughly 1,000+ examples; see the table below) to support the extra capacity.
- The linear probe clearly underperforms and fine-tuning is not an option.
MLP design choices:
- Depth: one or two hidden layers is usually enough; deeper heads mostly overfit.
- Width: hidden dimensions of 512-1024 are common defaults.
- Regularization: dropout and weight decay matter more than the exact architecture; batch normalization often stabilizes training on cached features.
Attention-based heads:
For sequence or spatial features (before global pooling), attention heads can learn to weight different positions:
$$g(z_1, ..., z_n) = \text{Attention}(Q, K, V)$$
This is useful when (see the sketch below):
- you extract spatial or token-level features (before global pooling) rather than a single pooled vector,
- the informative regions vary from input to input, so uniform pooling discards signal, or
- the inputs are sequences (text, audio) where position matters.
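A working attention-pooling head for spatial feature maps can be built with a single learned query attending over the H×W positions. This sketch assumes you extract features before global pooling, e.g. ResNet-50's 7×7×2048 `layer4` output:

```python
import torch
import torch.nn as nn

class AttentionPoolHead(nn.Module):
    """Learned-query attention pooling over spatial positions,
    followed by a linear classifier."""

    def __init__(self, feat_dim: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        # fmap: (B, C, H, W) -> tokens of shape (B, H*W, C)
        b, c, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2)
        query = self.query.expand(b, -1, -1)
        pooled, _ = self.attn(query, tokens, tokens)  # (B, 1, C)
        return self.classifier(pooled.squeeze(1))

head = AttentionPoolHead(feat_dim=2048, num_classes=10)
logits = head(torch.randn(8, 2048, 7, 7))  # frozen layer4 feature maps
print(logits.shape)  # torch.Size([8, 10])
```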
Head capacity and overfitting:
A crucial principle: more head capacity requires more target data. The relationship follows:
| Target Data Size | Recommended Head | Why |
|---|---|---|
| < 100 examples | Linear | Prevent overfitting |
| 100 - 1,000 | Linear or small MLP | Limited capacity okay |
| 1,000 - 10,000 | MLP (1-2 layers) | Can support non-linearity |
| > 10,000 | Deeper MLP or consider fine-tuning | Sufficient data for more expressivity |
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional


class LinearProbe(nn.Module):
    """
    Standard linear probe for frozen feature evaluation.
    Includes optional normalization and temperature scaling.
    """

    def __init__(
        self,
        input_dim: int,
        num_classes: int,
        normalize: bool = True,
        temperature: float = 1.0
    ):
        super().__init__()
        self.normalize = normalize
        self.temperature = temperature
        self.classifier = nn.Linear(input_dim, num_classes)

        # Initialize with scaling aware of normalization
        nn.init.normal_(self.classifier.weight, std=0.01)
        nn.init.zeros_(self.classifier.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.normalize:
            x = F.normalize(x, p=2, dim=1)
        logits = self.classifier(x) / self.temperature
        return logits


class MLPHead(nn.Module):
    """
    Flexible MLP head with configurable architecture.
    """

    def __init__(
        self,
        input_dim: int,
        num_classes: int,
        hidden_dims: list = [512],
        activation: str = "relu",
        dropout: float = 0.5,
        batch_norm: bool = True
    ):
        super().__init__()

        layers = []
        prev_dim = input_dim

        # Activation function selection
        act_fn = {
            "relu": nn.ReLU,
            "gelu": nn.GELU,
            "silu": nn.SiLU
        }[activation]

        # Build hidden layers
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            if batch_norm:
                layers.append(nn.BatchNorm1d(hidden_dim))
            layers.append(act_fn())
            layers.append(nn.Dropout(dropout))
            prev_dim = hidden_dim

        # Final classification layer
        layers.append(nn.Linear(prev_dim, num_classes))

        self.mlp = nn.Sequential(*layers)
        self._init_weights()

    def _init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm1d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class PrototypicalHead(nn.Module):
    """
    Prototype-based classification head.
    Learns class prototypes and classifies by nearest prototype.
    Particularly effective for few-shot scenarios.
    """

    def __init__(
        self,
        input_dim: int,
        num_classes: int,
        metric: str = "euclidean"  # or "cosine"
    ):
        super().__init__()
        self.metric = metric

        # Learnable prototypes (one per class)
        self.prototypes = nn.Parameter(
            torch.randn(num_classes, input_dim)
        )
        nn.init.xavier_uniform_(self.prototypes)

        # Optional learnable temperature
        self.temperature = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.metric == "euclidean":
            # Negative squared distance (higher = closer)
            dists = -torch.cdist(x, self.prototypes).pow(2)
        elif self.metric == "cosine":
            # Cosine similarity
            x_norm = F.normalize(x, dim=1)
            p_norm = F.normalize(self.prototypes, dim=1)
            dists = x_norm @ p_norm.T

        return dists / self.temperature


def select_head(
    input_dim: int,
    num_classes: int,
    target_data_size: int,
    task_type: str = "classification"
) -> nn.Module:
    """
    Heuristic head selection based on data availability.

    Args:
        input_dim: Dimension of frozen features
        num_classes: Number of output classes
        target_data_size: Number of labeled examples available
        task_type: "classification" or "regression"

    Returns:
        Appropriately-sized classification head
    """
    samples_per_class = target_data_size / num_classes

    if samples_per_class < 10:
        # Very few examples: use prototypical head
        print("Selected: PrototypicalHead (few-shot regime)")
        return PrototypicalHead(input_dim, num_classes, metric="cosine")

    elif samples_per_class < 50:
        # Limited data: use linear probe
        print("Selected: LinearProbe (limited data regime)")
        return LinearProbe(input_dim, num_classes, normalize=True)

    elif samples_per_class < 200:
        # Moderate data: small MLP
        print("Selected: MLPHead with 1 hidden layer")
        return MLPHead(
            input_dim, num_classes,
            hidden_dims=[512],
            dropout=0.5
        )

    else:
        # Ample data: larger MLP
        print("Selected: MLPHead with 2 hidden layers")
        return MLPHead(
            input_dim, num_classes,
            hidden_dims=[1024, 512],
            dropout=0.3
        )
```

A major advantage of frozen features is that we can pre-compute and cache representations, dramatically accelerating subsequent training. This section covers efficient implementation strategies.
The caching workflow:
1. Run every example through the frozen encoder exactly once.
2. Store the resulting feature vectors and labels to disk.
3. Train and iterate on classification heads directly from the cached features.
4. Reuse the same cache across hyperparameter sweeps, head architectures, and even different tasks.
Computational savings:
Consider training a classifier on 50,000 images with ResNet-50:
| Approach | Time per Epoch | Forward Pass | Backward Pass |
|---|---|---|---|
| Full fine-tune | ~5 min | Encoder + Head | Encoder + Head |
| Frozen (no cache) | ~3 min | Encoder + Head | Head only |
| Frozen (cached) | ~5 sec | Head only | Head only |
Caching speeds up each head-training epoch by roughly 36x compared to uncached frozen training, and by about 60x compared to full fine-tuning.
For 50,000 images with 2048-dim features in float32, you need ~400MB of storage. This is negligible compared to the original images (~5GB for ImageNet-scale). The storage-compute tradeoff strongly favors caching for most scenarios.
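The storage math is easy to sanity-check for your own setup. A small sketch; adjust the counts and dtype to your dataset and to whether you cache in half precision as the cacher below does:

```python
num_samples = 50_000
feat_dim = 2048
bytes_per_value = 4          # float32; use 2 for float16 caching

total_mb = num_samples * feat_dim * bytes_per_value / 1e6
print(f"Cache size: {total_mb:.0f} MB")  # ~410 MB in float32, ~205 MB in float16
```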
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
from pathlib import Path
from tqdm import tqdm
from typing import Tuple, Optional
import h5py


class FeatureCacher:
    """
    Efficient feature extraction and caching for frozen transfer.
    Supports multiple storage backends and streaming extraction.
    """

    def __init__(
        self,
        encoder: nn.Module,
        cache_dir: str = "./feature_cache",
        device: str = "cuda",
        dtype: torch.dtype = torch.float16  # Half precision saves 50% storage
    ):
        self.encoder = encoder.to(device).eval()
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.device = device
        self.dtype = dtype

        # Freeze encoder
        for param in self.encoder.parameters():
            param.requires_grad = False

    @torch.no_grad()
    def extract_and_cache(
        self,
        dataloader: DataLoader,
        cache_name: str,
        backend: str = "hdf5"  # "hdf5", "numpy", "torch"
    ) -> Path:
        """
        Extract features from all data and save to disk.

        Args:
            dataloader: DataLoader yielding (images, labels)
            cache_name: Name for the cached features file
            backend: Storage format

        Returns:
            Path to cached features
        """
        all_features = []
        all_labels = []

        print(f"Extracting features from {len(dataloader.dataset)} samples...")

        for images, labels in tqdm(dataloader, desc="Extracting"):
            images = images.to(self.device)

            # Extract features
            features = self.encoder(images)

            # Flatten if needed (e.g., from spatial features)
            if features.dim() > 2:
                features = features.flatten(start_dim=1)

            # Convert to target dtype for storage efficiency
            features = features.to(self.dtype).cpu()

            all_features.append(features)
            all_labels.append(labels)

        # Concatenate all batches
        features_tensor = torch.cat(all_features, dim=0)
        labels_tensor = torch.cat(all_labels, dim=0)

        # Save based on backend
        cache_path = self.cache_dir / f"{cache_name}.{backend}"

        if backend == "hdf5":
            with h5py.File(cache_path, 'w') as f:
                f.create_dataset('features', data=features_tensor.numpy(),
                                 compression='gzip', compression_opts=4)
                f.create_dataset('labels', data=labels_tensor.numpy())
        elif backend == "numpy":
            np.savez_compressed(
                cache_path.with_suffix('.npz'),
                features=features_tensor.numpy(),
                labels=labels_tensor.numpy()
            )
            cache_path = cache_path.with_suffix('.npz')
        elif backend == "torch":
            torch.save({
                'features': features_tensor,
                'labels': labels_tensor
            }, cache_path.with_suffix('.pt'))
            cache_path = cache_path.with_suffix('.pt')

        print(f"Cached {len(features_tensor)} samples to {cache_path}")
        print(f"Feature shape: {features_tensor.shape}")
        print(f"Storage size: {cache_path.stat().st_size / 1e6:.1f} MB")

        return cache_path

    @staticmethod
    def load_cached_features(
        cache_path: Path,
        device: str = "cpu"
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Load cached features back into memory."""
        suffix = cache_path.suffix

        if suffix == '.hdf5' or suffix == '.h5':
            with h5py.File(cache_path, 'r') as f:
                features = torch.tensor(f['features'][:], dtype=torch.float32)
                labels = torch.tensor(f['labels'][:], dtype=torch.long)
        elif suffix == '.npz':
            data = np.load(cache_path)
            features = torch.tensor(data['features'], dtype=torch.float32)
            labels = torch.tensor(data['labels'], dtype=torch.long)
        elif suffix == '.pt':
            data = torch.load(cache_path)
            features = data['features'].float()
            labels = data['labels']

        return features.to(device), labels.to(device)


class CachedFeatureDataset(torch.utils.data.Dataset):
    """
    Dataset that loads features from disk lazily.
    Useful for very large datasets that don't fit in memory.
    """

    def __init__(self, cache_path: Path):
        self.cache_path = cache_path
        # Open file handle for lazy loading
        self.h5_file = h5py.File(cache_path, 'r')
        self.features = self.h5_file['features']
        self.labels = self.h5_file['labels']

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        feature = torch.tensor(self.features[idx], dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return feature, label

    def __del__(self):
        self.h5_file.close()


def train_on_cached_features(
    features: torch.Tensor,
    labels: torch.Tensor,
    head: nn.Module,
    num_epochs: int = 100,
    batch_size: int = 256,
    lr: float = 0.01,
    weight_decay: float = 1e-4,
    device: str = "cuda"
) -> nn.Module:
    """
    Train a classification head on pre-cached features.
    This is extremely fast since there's no encoder forward pass.
    """
    head = head.to(device)

    # Create simple dataset from tensors
    dataset = TensorDataset(features, labels)
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        pin_memory=True
    )

    # Optimizer
    optimizer = torch.optim.AdamW(
        head.parameters(),
        lr=lr,
        weight_decay=weight_decay
    )

    # Cosine schedule
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_epochs
    )

    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        head.train()
        total_loss = 0
        correct = 0
        total = 0

        for batch_features, batch_labels in loader:
            batch_features = batch_features.to(device)
            batch_labels = batch_labels.to(device)

            optimizer.zero_grad()
            logits = head(batch_features)
            loss = criterion(logits, batch_labels)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            correct += (logits.argmax(1) == batch_labels).sum().item()
            total += len(batch_labels)

        scheduler.step()

        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Loss={total_loss/len(loader):.4f}, "
                  f"Acc={100*correct/total:.2f}%")

    return head
```

While frozen features provide an efficient baseline, they have fundamental limitations that motivate more sophisticated transfer approaches.
Limitation 1: Domain shift
When the target domain differs significantly from the pre-training domain, frozen representations may encode irrelevant information while lacking task-relevant features.
Example: ImageNet features for medical imaging. ImageNet encoders emphasize the textures, colors, and object shapes of natural photographs, while tasks such as retinal OCT classification depend on subtle intensity and structural cues that those features barely encode; this is why the medical row in the results table below shows by far the largest gap between frozen features and fine-tuning.
If your target domain is substantially different from the pre-training domain (e.g., natural images → satellite imagery, English → Arabic, photos → sketches), frozen features often underperform. The representation gap becomes too large for the task head to compensate.
Limitation 2: Fixed resolution and architecture
Frozen feature extraction inherits all the architectural constraints of the pre-trained model:
- The input resolution and preprocessing must match what the encoder expects (e.g., 224×224 inputs with ImageNet normalization).
- The feature dimension is fixed by the architecture.
- The backbone's inductive biases (receptive field, pooling, tokenization) cannot be changed to suit your data.
Limitation 3: No task-specific adaptation
The representation is frozen—it cannot adjust to emphasize features relevant to your specific task:
$$\frac{\partial \mathcal{L}}{\partial \phi} = 0$$
The gradient signal from the target task cannot flow back to improve the representation. If the pre-trained features don't linearly separate your classes, you're limited to what a non-linear head can achieve on top of them.
Limitation 4: Layer-feature mismatch
The optimal extraction layer varies by task, but we can only choose one (or a fixed combination). If your task needs:
- low-level texture cues, they live mostly in early layers;
- high-level semantics, they live mostly in late layers;
- both at once, any single extraction point (or fixed combination) forces a compromise.
Limitation 5: Representation bottleneck
The representation dimension $d$ is fixed. If task-relevant information was discarded when the encoder compressed its inputs down to $d$ dimensions during pre-training, that information cannot be recovered downstream.
Quantifying the limitations:
Research has measured the gap between frozen features and fine-tuning:
| Dataset | Frozen Linear Probe | Full Fine-tune | Gap |
|---|---|---|---|
| CIFAR-10 | 93.2% | 97.4% | 4.2% |
| CIFAR-100 | 78.5% | 86.9% | 8.4% |
| Oxford Flowers | 94.1% | 98.2% | 4.1% |
| DTD Textures | 73.8% | 79.4% | 5.6% |
| Retinal OCT | 68.2% | 89.5% | 21.3% |
The gap is small when the target domain is similar to ImageNet (CIFAR, Flowers) but large for domain-shifted data (medical images).
Given the trade-offs, when should you choose frozen features over fine-tuning?
Decision framework:
| Criterion | Favor Frozen | Favor Fine-Tuning |
|---|---|---|
| Target data size | < 1,000 examples | > 5,000 examples |
| Domain similarity | High (same domain) | Low (domain shift) |
| Compute budget | Limited / CPU only | GPU hours available |
| Iteration speed | Need rapid experiments | Can wait for training |
| Linear probe accuracy | > 80% of target | < 60% of target |
| Task type | Coarse-grained | Fine-grained distinctions |
| Model size | Large model, limited VRAM | Sufficient VRAM for backprop |
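The decision framework can be turned into a rough heuristic. This is a sketch only; the thresholds mirror the table above and should be tuned to your setting:

```python
def recommend_strategy(
    num_target_examples: int,
    probe_accuracy: float,        # linear probe accuracy, 0-1
    target_accuracy: float,       # accuracy you need, 0-1
    high_domain_similarity: bool,
    has_gpu_budget: bool,
) -> str:
    """Rule-of-thumb choice between frozen features and fine-tuning."""
    probe_ratio = probe_accuracy / target_accuracy

    if probe_ratio >= 0.9:
        return "frozen features (probe already near target; fine-tuning ROI is low)"
    if num_target_examples < 1_000 or not has_gpu_budget:
        return "frozen features (too little data or compute to fine-tune safely)"
    if probe_ratio < 0.6 and not high_domain_similarity:
        return "fine-tune (large representation gap under domain shift)"
    return "try a stronger head / multi-layer features first, then fine-tune if needed"

print(recommend_strategy(20_000, probe_accuracy=0.62, target_accuracy=0.90,
                         high_domain_similarity=False, has_gpu_budget=True))
```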
The hybrid strategy:
In practice, frozen features serve as a critical starting point and a cheap estimate of the performance floor:
1. Cache features and train a linear probe (minutes of compute).
2. If the probe meets your accuracy target, stop.
3. If it is close, try a small MLP head or multi-layer feature extraction before anything heavier.
4. Only if a substantial gap remains, and you have enough target data, move on to fine-tuning.
This workflow ensures you don't waste compute on fine-tuning when frozen features suffice, while identifying cases where adaptation is necessary.
Special use cases for frozen features:
1. Multi-task learning: Extract features once, train multiple heads for different tasks. Each task head trains independently, enabling:
- new tasks to be added without re-running or retraining the encoder,
- no interference between tasks, since each head has its own parameters, and
- a single shared encoder forward pass serving all heads at inference time (see the sketch below).
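A minimal sketch of this pattern, assuming features have already been cached once and each task simply attaches its own linear head (the task names are illustrative):

```python
import torch
import torch.nn as nn

feat_dim = 2048
shared_features = torch.randn(1000, feat_dim)  # cached once, reused by all tasks

# One independent head per task; the encoder is never touched again
task_heads = nn.ModuleDict({
    "species": nn.Linear(feat_dim, 200),
    "habitat": nn.Linear(feat_dim, 12),
    "is_rare": nn.Linear(feat_dim, 2),
})

# Each head trains (and can be added or retrained later) independently,
# but all tasks share the single encoder forward pass that produced the cache.
outputs = {name: head(shared_features) for name, head in task_heads.items()}
for name, out in outputs.items():
    print(name, out.shape)
```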
2. Inference efficiency: In deployment, frozen feature + lightweight head can be faster than a fully fine-tuned model if:
- one encoder forward pass is shared across several task heads, so each additional task costs only a tiny head, or
- features for a static corpus can be pre-computed offline, leaving only the head to run at request time.
3. Representation analysis: Frozen features enable studying what representations encode without confounding from fine-tuning:
- probing which attributes are linearly decodable from each layer,
- comparing pre-training methods (supervised vs. self-supervised) on an equal footing via linear probe accuracy, and
- tracking how representation quality changes across layers and checkpoints.
4. Few-shot and zero-shot: With extremely limited data, frozen features are essential:
- in few-shot settings, nearest-prototype or nearest-neighbor classifiers over frozen features (as in the PrototypicalHead above) are often the strongest option, and
- in zero-shot settings there is no target training at all, so classification reduces to similarity in the frozen embedding space.
Always compute frozen feature baseline first. If it achieves >90% of your target performance, consider whether the compute cost of fine-tuning is justified by the remaining gap. Often, improving data quality or adding more labeled examples provides better ROI than switching to fine-tuning.
We've covered frozen feature transfer comprehensively—from theory to implementation to practical decision-making. Let's consolidate the key points:
- Frozen feature transfer trains only a lightweight head $g$ on top of a frozen encoder $\phi$; the encoder never changes.
- It works when the pre-trained representation already (nearly) linearly separates the target classes, which hinges on domain similarity.
- Caching features makes head training and experimentation orders of magnitude faster.
- Layer choice and head capacity should match domain similarity and target data size.
- Its main limits are domain shift and the inability to adapt the representation; the linear probe baseline tells you when those limits bind.
What's next:
Having understood frozen features as a baseline, the next page dives deeper into feature extraction—more sophisticated techniques for deriving useful representations from pre-trained models, including multi-scale feature aggregation, attention-based pooling, and representation dimensionality reduction. These techniques improve upon naive feature extraction while still avoiding the cost of fine-tuning.
You now understand frozen feature transfer: when it works, how to implement it efficiently, and what its limitations are. This baseline establishes the floor for transfer performance and guides decisions about when more sophisticated methods are needed.