While frozen features from the penultimate layer provide a solid baseline, advanced feature extraction techniques can significantly improve transfer performance without requiring fine-tuning. These methods leverage the rich intermediate representations within pre-trained networks more effectively.
This page explores sophisticated feature extraction strategies: multi-scale aggregation, attention-based pooling, feature selection, and dimensionality reduction. These techniques bridge the gap between simple frozen features and full fine-tuning.
Master advanced feature extraction: multi-layer aggregation, spatial pooling strategies, feature selection methods, and efficient dimensionality reduction for transfer learning.
Different layers capture information at different scales and abstraction levels. Multi-scale feature aggregation combines features from multiple layers to create richer representations.
Why multi-scale matters:
| Layer Level | Information Captured | Best For |
|---|---|---|
| Early (conv1-2) | Edges, gradients, colors | Texture classification |
| Middle (conv3-4) | Textures, patterns, parts | Object parts recognition |
| Late (conv5+) | Semantic concepts, objects | Category classification |
| Combined | All levels | General-purpose transfer |
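When choosing which layers to tap, it helps to look at what a concrete backbone exposes. The short sketch below lists the top-level blocks of a torchvision ResNet-50 (the same backbone used later on this page); in ResNet terms, `layer1` roughly covers the early stages, `layer2`/`layer3` the middle ones, and `layer4` the late, semantic stage.

```python
from torchvision import models

# Instantiate the backbone used elsewhere on this page (downloads weights on first run).
backbone = models.resnet50(weights="IMAGENET1K_V2")

# Top-level blocks: conv1, bn1, relu, maxpool, layer1..layer4 (residual stages),
# avgpool, fc. These names are what forward hooks are registered against below.
for name, module in backbone.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name:10s} {type(module).__name__:18s} {n_params / 1e6:7.2f}M params")
```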
Aggregation strategies:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleExtractor(nn.Module):
    """Extract and aggregate features from multiple network layers."""

    def __init__(self, backbone: nn.Module, layer_names: list,
                 aggregation: str = "concat", project_dim: int = None):
        super().__init__()
        self.backbone = backbone
        self.layer_names = layer_names
        self.aggregation = aggregation
        self.features = {}

        # Register hooks on the requested layers
        for name, module in backbone.named_modules():
            if name in layer_names:
                module.register_forward_hook(self._make_hook(name))

        # Optional projection to reduce dimensionality
        if project_dim:
            self.projector = nn.LazyLinear(project_dim)
        else:
            self.projector = None

    def _make_hook(self, name):
        def hook(module, input, output):
            # Global average pool if the output is spatial
            if output.dim() == 4:
                output = F.adaptive_avg_pool2d(output, 1).flatten(1)
            self.features[name] = output
        return hook

    def forward(self, x):
        self.features = {}
        _ = self.backbone(x)

        # Aggregate features from the hooked layers
        feats = [self.features[n] for n in self.layer_names]
        if self.aggregation == "concat":
            z = torch.cat(feats, dim=1)
        elif self.aggregation == "mean":
            # Assumes same dimension or uses projection
            z = torch.stack(feats).mean(dim=0)
        elif self.aggregation == "max":
            z = torch.stack(feats).max(dim=0)[0]

        if self.projector:
            z = self.projector(z)
        return z
```

Before the final classification layer, convolutional features have spatial structure (H×W×C). How we pool this spatial information affects what's preserved in the final representation.
Global Average Pooling (GAP): $$z = \frac{1}{HW} \sum_{i,j} f_{i,j}$$
Averages over all spatial positions. Simple but loses spatial information.
Global Max Pooling (GMP): $$z = \max_{i,j} f_{i,j}$$
Keeps strongest activation per channel. Good for detecting presence of features.
Spatial Pyramid Pooling (SPP): Pool at multiple resolutions and concatenate: $$z = [\text{GAP}(f);\ \text{Pool}_{2\times 2}(f);\ \text{Pool}_{4\times 4}(f)]$$
Generalized Mean Pooling (GeM): $$z = \left(\frac{1}{HW} \sum_{i,j} f_{i,j}^p\right)^{1/p}$$
The parameter $p$ interpolates between average pooling ($p=1$) and max pooling ($p\to\infty$); a learnable $p$ adapts the pooling to the task.
Generalized Mean Pooling with learned p often outperforms both GAP and GMP. Start with p=3 and let the network learn the optimal value. This is standard in image retrieval and fine-grained recognition.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeneralizedMeanPooling(nn.Module):
    """GeM pooling with learnable power parameter."""

    def __init__(self, p: float = 3.0, eps: float = 1e-6, learn_p: bool = True):
        super().__init__()
        self.eps = eps
        if learn_p:
            self.p = nn.Parameter(torch.tensor(p))
        else:
            self.register_buffer('p', torch.tensor(p))

    def forward(self, x):
        # x: (B, C, H, W)
        return F.adaptive_avg_pool2d(
            x.clamp(min=self.eps).pow(self.p), 1
        ).pow(1.0 / self.p).flatten(1)


class SpatialPyramidPooling(nn.Module):
    """Multi-scale spatial pooling."""

    def __init__(self, levels: list = [1, 2, 4]):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        B, C, H, W = x.shape
        pooled = []
        for level in self.levels:
            pool = F.adaptive_avg_pool2d(x, level)  # (B, C, level, level)
            pooled.append(pool.flatten(2))          # (B, C, level^2)
        return torch.cat(pooled, dim=2).flatten(1)  # (B, C * sum(level^2))


class AttentionPooling(nn.Module):
    """Learn to attend to important spatial positions."""

    def __init__(self, in_channels: int, hidden_dim: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv2d(in_channels, hidden_dim, 1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, 1, 1)
        )

    def forward(self, x):
        # x: (B, C, H, W)
        attn = self.attention(x)  # (B, 1, H, W)
        attn = F.softmax(attn.flatten(2), dim=2).view_as(attn)
        return (x * attn).sum(dim=[2, 3])  # (B, C)
```

Not all features from a pre-trained model are equally relevant to your target task. Feature selection identifies which dimensions are most informative, potentially improving performance and reducing computation.
Why select features?

- High-dimensional representations (2048+ dimensions) inevitably contain dimensions that are noisy or irrelevant to the target task.
- Fewer dimensions mean smaller stored feature banks and faster downstream classifiers.
- With small target datasets, discarding uninformative dimensions acts as a form of regularization.
Selection methods:
Filter methods (pre-training): score each dimension with statistics computed independently of any downstream model, e.g. variance thresholding, mutual information with the labels, or ANOVA F-scores.

Wrapper methods (with training): evaluate candidate feature subsets using the downstream classifier itself, e.g. recursive feature elimination (RFE).

Embedded methods: perform selection as a by-product of training, e.g. L1-regularized logistic regression or a learned soft feature mask.
```python
import numpy as np
from sklearn.feature_selection import (
    SelectKBest, mutual_info_classif, f_classif,
    VarianceThreshold, RFE
)
from sklearn.linear_model import LogisticRegression
import torch
import torch.nn as nn


def select_features_mutual_info(features, labels, k=512):
    """Select top-k features by mutual information with labels."""
    selector = SelectKBest(mutual_info_classif, k=k)
    selected = selector.fit_transform(features, labels)
    return selected, selector.get_support()


def select_features_variance(features, threshold=0.01):
    """Remove near-constant features."""
    selector = VarianceThreshold(threshold=threshold)
    selected = selector.fit_transform(features)
    return selected, selector.get_support()


def select_features_l1(features, labels, C=0.1):
    """Use L1-regularized logistic regression for selection."""
    clf = LogisticRegression(penalty='l1', C=C, solver='saga', max_iter=1000)
    clf.fit(features, labels)
    # Features with non-zero weights are selected
    importance = np.abs(clf.coef_).sum(axis=0)
    return importance > 0, importance


class LearnedFeatureSelector(nn.Module):
    """Learn a soft feature mask during training."""

    def __init__(self, feature_dim: int, temperature: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(feature_dim))
        self.temperature = temperature

    def forward(self, x):
        # Sigmoid gate gives a differentiable soft selection mask
        mask = torch.sigmoid(self.logits / self.temperature)
        return x * mask

    def get_selected_features(self, threshold=0.5):
        with torch.no_grad():
            mask = torch.sigmoid(self.logits)
            return (mask > threshold).cpu().numpy()
```

High-dimensional features (2048+) can be reduced to lower dimensions while preserving most discriminative information. This improves computational efficiency and can regularize downstream classifiers.
Principal Component Analysis (PCA):
Project to top-$k$ principal components: $$z_{\text{reduced}} = W_k^\top (z - \mu)$$
where $W_k$ contains the top-$k$ eigenvectors of the covariance matrix.
Typical dimensionality reduction: 2048-dimensional backbone features are commonly reduced to 128-512 components, which usually retains most of the variance while making downstream classifiers much cheaper.
Whitening:
Normalize features to unit variance: $$z_{\text{white}} = D^{-1/2} W^\top (z - \mu)$$
This decorrelates features and can improve linear probe performance.
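As a minimal sketch of the formula above (function and variable names are illustrative), the snippet below estimates $\mu$, the top-$k$ eigenvectors $W_k$, and eigenvalues $D$ from training features, then applies $z_{\text{white}} = D^{-1/2} W_k^\top (z - \mu)$; scikit-learn's `PCA(whiten=True)`, used later on this page, performs the equivalent computation.

```python
import numpy as np


def fit_pca_whitening(train_feats: np.ndarray, k: int = 256):
    """Estimate mean, top-k eigenvectors W_k, and eigenvalues D from training features."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)     # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the top-k components
    return mu, eigvecs[:, top], eigvals[top]


def whiten(feats: np.ndarray, mu, W, D, eps: float = 1e-8):
    """Apply z_white = D^{-1/2} W^T (z - mu) to a batch of feature vectors."""
    return (feats - mu) @ W / np.sqrt(D + eps)


# Usage with hypothetical train/test feature arrays:
# mu, W, D = fit_pca_whitening(train_features, k=256)
# train_white = whiten(train_features, mu, W, D)
# test_white = whiten(test_features, mu, W, D)   # reuse the training statistics
```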
| Method | Linear? | Preserves | Cost |
|---|---|---|---|
| PCA | Yes | Variance | O(d³) once, O(dk) per sample |
| Whitened PCA | Yes | Decorrelated variance | Same as PCA |
| Random Projection | Yes | Distances (approx.) | O(dk) per sample |
| UMAP/t-SNE | No | Local structure | High (O(n²)) |
| Autoencoder | No | Learned reconstruction | Training required |
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
import torch
import torch.nn as nn


class FeatureReducer:
    """Efficient feature dimensionality reduction."""

    def __init__(self, method: str = "pca", n_components: int = 512,
                 whiten: bool = False):
        self.method = method
        self.n_components = n_components
        self.whiten = whiten
        self.reducer = None

    def fit(self, features: np.ndarray):
        if self.method == "pca":
            self.reducer = PCA(n_components=self.n_components, whiten=self.whiten)
        elif self.method == "random":
            self.reducer = GaussianRandomProjection(
                n_components=self.n_components
            )
        self.reducer.fit(features)

        if self.method == "pca":
            explained = self.reducer.explained_variance_ratio_.sum()
            print(f"PCA: {self.n_components} components explain "
                  f"{explained*100:.1f}% variance")
        return self

    def transform(self, features: np.ndarray) -> np.ndarray:
        return self.reducer.transform(features)

    def fit_transform(self, features: np.ndarray) -> np.ndarray:
        return self.fit(features).transform(features)


class LearnedProjection(nn.Module):
    """Learnable linear projection for feature reduction."""

    def __init__(self, in_dim: int, out_dim: int, init: str = "orthogonal"):
        super().__init__()
        self.projection = nn.Linear(in_dim, out_dim, bias=False)
        if init == "orthogonal":
            nn.init.orthogonal_(self.projection.weight)
        elif init == "xavier":
            nn.init.xavier_uniform_(self.projection.weight)

    def forward(self, x):
        return self.projection(x)
```

How features are normalized significantly impacts downstream classifier performance. Different normalization schemes have different properties.
L2 Normalization: $$z_{\text{norm}} = \frac{z}{\|z\|_2}$$
Maps all features to the unit hypersphere. Converts dot product to cosine similarity.
Standardization (Z-score): $$z_{\text{std}} = \frac{z - \mu}{\sigma}$$
Zero mean, unit variance per dimension. Computed on training set, applied to all.
Power normalization: $$z_{\text{power}} = \text{sign}(z) |z|^\alpha$$
Typically $\alpha = 0.5$ (square root). Reduces influence of large activations.
Always compute normalization statistics (mean, std) on the training set only, then apply the same transformation to validation and test sets. Data leakage from including test data in statistics biases evaluation.
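A minimal NumPy sketch of the three schemes (function names are illustrative); note that the standardization statistics come from the training features only and are then reused unchanged, as the warning above requires.

```python
import numpy as np


def l2_normalize(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project each feature vector onto the unit hypersphere."""
    return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)


def fit_standardizer(train_feats: np.ndarray, eps: float = 1e-8):
    """Compute per-dimension mean/std on the training set and return a transform."""
    mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0) + eps
    return lambda feats: (feats - mu) / sigma


def power_normalize(feats: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Signed power normalization: sign(z) * |z|^alpha damps large activations."""
    return np.sign(feats) * np.abs(feats) ** alpha


# Usage with hypothetical train/test feature arrays:
# standardize = fit_standardizer(train_features)
# train_z = standardize(train_features)
# test_z = standardize(test_features)   # same statistics, no leakage
```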
Combining the techniques above into a coherent pipeline:
Recommended pipeline (implemented below):

1. Extract activations from one or more late layers of a frozen backbone.
2. Pool each layer's spatial map with GeM and concatenate the results.
3. Fit whitened PCA on the training features and reduce dimensionality.
4. L2-normalize the reduced features.
5. Train a lightweight linear classifier on top.
```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.decomposition import PCA
import numpy as np


class FeatureExtractionPipeline:
    """Complete pipeline from images to classifier-ready features."""

    def __init__(
        self,
        backbone_name: str = "resnet50",
        layers: list = ["layer3", "layer4"],
        gem_p: float = 3.0,
        pca_dim: int = 512,
        normalize: bool = True
    ):
        # Load backbone and freeze all parameters
        self.backbone = getattr(models, backbone_name)(
            weights="IMAGENET1K_V2"
        ).eval()
        for p in self.backbone.parameters():
            p.requires_grad = False

        self.layers = layers
        self.gem_p = gem_p
        self.pca_dim = pca_dim
        self.normalize = normalize
        self.features = {}
        self._register_hooks()
        self.pca = None

    def _register_hooks(self):
        for name, module in self.backbone.named_modules():
            if name in self.layers:
                module.register_forward_hook(
                    lambda m, i, o, n=name: self.features.update({n: o})
                )

    def _gem_pool(self, x, p=3.0):
        return x.clamp(min=1e-6).pow(p).mean(dim=[2, 3]).pow(1/p)

    @torch.no_grad()
    def extract(self, dataloader, device="cuda"):
        """Extract features from all samples."""
        self.backbone = self.backbone.to(device)
        all_features, all_labels = [], []

        for images, labels in dataloader:
            self.features = {}
            images = images.to(device)
            _ = self.backbone(images)

            # GeM pool each layer and concatenate
            layer_feats = []
            for layer in self.layers:
                pooled = self._gem_pool(self.features[layer], self.gem_p)
                layer_feats.append(pooled)

            features = torch.cat(layer_feats, dim=1).cpu().numpy()
            all_features.append(features)
            all_labels.append(labels.numpy())

        return np.vstack(all_features), np.concatenate(all_labels)

    def fit_reduction(self, features):
        """Fit PCA on training features."""
        self.pca = PCA(n_components=self.pca_dim, whiten=True)
        reduced = self.pca.fit_transform(features)
        print(f"Variance explained: {self.pca.explained_variance_ratio_.sum():.2%}")
        return self._final_normalize(reduced)

    def transform(self, features):
        """Apply fitted PCA to new features."""
        reduced = self.pca.transform(features)
        return self._final_normalize(reduced)

    def _final_normalize(self, features):
        if self.normalize:
            norms = np.linalg.norm(features, axis=1, keepdims=True)
            return features / (norms + 1e-8)
        return features
```
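A sketch of how this pipeline might be driven end to end, assuming hypothetical `train_loader` and `test_loader` DataLoaders yielding `(images, labels)` batches, with a scikit-learn logistic regression as the linear probe:

```python
from sklearn.linear_model import LogisticRegression

# train_loader / test_loader are assumed to exist and yield (images, labels) batches.
pipeline = FeatureExtractionPipeline(
    backbone_name="resnet50", layers=["layer3", "layer4"], pca_dim=512
)

train_feats, train_labels = pipeline.extract(train_loader, device="cuda")
test_feats, test_labels = pipeline.extract(test_loader, device="cuda")

# Fit PCA whitening + L2 normalization on training features only.
train_z = pipeline.fit_reduction(train_feats)
test_z = pipeline.transform(test_feats)

# Linear probe on the reduced features.
clf = LogisticRegression(max_iter=2000)
clf.fit(train_z, train_labels)
print(f"Test accuracy: {clf.score(test_z, test_labels):.3f}")
```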
What's next: The next page explores domain discrepancy, looking at how to measure and understand the gap between source and target domains, which is fundamental to predicting when transfer will succeed or fail.

You now understand advanced feature extraction techniques that maximize the utility of frozen pre-trained representations.