While frozen features from the penultimate layer provide a solid baseline, advanced feature extraction techniques can significantly improve transfer performance without requiring fine-tuning. These methods leverage the rich intermediate representations within pre-trained networks more effectively.
This page explores sophisticated feature extraction strategies: multi-scale aggregation, attention-based pooling, feature selection, and dimensionality reduction. These techniques bridge the gap between simple frozen features and full fine-tuning.
Master advanced feature extraction: multi-layer aggregation, spatial pooling strategies, feature selection methods, and efficient dimensionality reduction for transfer learning.
Different layers capture information at different scales and abstraction levels. Multi-scale feature aggregation combines features from multiple layers to create richer representations.
Why multi-scale matters:
| Layer Level | Information Captured | Best For |
|---|---|---|
| Early (conv1-2) | Edges, gradients, colors | Texture classification |
| Middle (conv3-4) | Textures, patterns, parts | Object parts recognition |
| Late (conv5+) | Semantic concepts, objects | Category classification |
| Combined | All levels | General-purpose transfer |
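When choosing which layers to tap, it helps to look at what a concrete backbone exposes. The short sketch below lists the top-level blocks of a torchvision ResNet-50 (the same backbone used later on this page); in ResNet terms, `layer1` roughly covers the early stages, `layer2`/`layer3` the middle ones, and `layer4` the late, semantic stage.

```python
from torchvision import models

# Instantiate the backbone used elsewhere on this page (downloads weights on first run).
backbone = models.resnet50(weights="IMAGENET1K_V2")

# Top-level blocks: conv1, bn1, relu, maxpool, layer1..layer4 (residual stages),
# avgpool, fc. These names are what forward hooks are registered against below.
for name, module in backbone.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name:10s} {type(module).__name__:18s} {n_params / 1e6:7.2f}M params")
```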
Aggregation strategies:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleExtractor(nn.Module):
    """Extract and aggregate features from multiple network layers."""

    def __init__(self, backbone: nn.Module, layer_names: list,
                 aggregation: str = "concat", project_dim: int = None):
        super().__init__()
        self.backbone = backbone
        self.layer_names = layer_names
        self.aggregation = aggregation
        self.features = {}

        # Register hooks on the requested layers
        for name, module in backbone.named_modules():
            if name in layer_names:
                module.register_forward_hook(self._make_hook(name))

        # Optional projection to reduce dimensionality
        if project_dim:
            self.projector = nn.LazyLinear(project_dim)
        else:
            self.projector = None

    def _make_hook(self, name):
        def hook(module, input, output):
            # Global average pool if the output is spatial
            if output.dim() == 4:
                output = F.adaptive_avg_pool2d(output, 1).flatten(1)
            self.features[name] = output
        return hook

    def forward(self, x):
        self.features = {}
        _ = self.backbone(x)

        # Aggregate features from the hooked layers
        feats = [self.features[n] for n in self.layer_names]
        if self.aggregation == "concat":
            z = torch.cat(feats, dim=1)
        elif self.aggregation == "mean":
            # Assumes same dimension or uses projection
            z = torch.stack(feats).mean(dim=0)
        elif self.aggregation == "max":
            z = torch.stack(feats).max(dim=0)[0]

        if self.projector:
            z = self.projector(z)
        return z
```

Before the final classification layer, convolutional features have spatial structure (H×W×C). How we pool this spatial information affects what's preserved in the final representation.
Global Average Pooling (GAP): $$z = \frac{1}{HW} \sum_{i,j} f_{i,j}$$
Averages over all spatial positions. Simple but loses spatial information.
Global Max Pooling (GMP): $$z = \max_{i,j} f_{i,j}$$
Keeps strongest activation per channel. Good for detecting presence of features.
Spatial Pyramid Pooling (SPP): Pool at multiple resolutions and concatenate: $$z = [\text{GAP}(f);\ \text{Pool}_{2\times 2}(f);\ \text{Pool}_{4\times 4}(f)]$$
Generalized Mean Pooling (GeM): $$z = \left(\frac{1}{HW} \sum_{i,j} f_{i,j}^p\right)^{1/p}$$
The parameter $p$ interpolates between average pooling ($p=1$) and max pooling ($p\to\infty$); a learnable $p$ adapts the pooling to the task.
Generalized Mean Pooling with learned p often outperforms both GAP and GMP. Start with p=3 and let the network learn the optimal value. This is standard in image retrieval and fine-grained recognition.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeneralizedMeanPooling(nn.Module):
    """GeM pooling with learnable power parameter."""

    def __init__(self, p: float = 3.0, eps: float = 1e-6, learn_p: bool = True):
        super().__init__()
        self.eps = eps
        if learn_p:
            self.p = nn.Parameter(torch.tensor(p))
        else:
            self.register_buffer('p', torch.tensor(p))

    def forward(self, x):
        # x: (B, C, H, W)
        return F.adaptive_avg_pool2d(
            x.clamp(min=self.eps).pow(self.p), 1
        ).pow(1.0 / self.p).flatten(1)


class SpatialPyramidPooling(nn.Module):
    """Multi-scale spatial pooling."""

    def __init__(self, levels: list = [1, 2, 4]):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        B, C, H, W = x.shape
        pooled = []
        for level in self.levels:
            pool = F.adaptive_avg_pool2d(x, level)  # (B, C, level, level)
            pooled.append(pool.flatten(2))          # (B, C, level^2)
        return torch.cat(pooled, dim=2).flatten(1)  # (B, C * sum(level^2))


class AttentionPooling(nn.Module):
    """Learn to attend to important spatial positions."""

    def __init__(self, in_channels: int, hidden_dim: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv2d(in_channels, hidden_dim, 1),
            nn.ReLU(),
            nn.Conv2d(hidden_dim, 1, 1)
        )

    def forward(self, x):
        # x: (B, C, H, W)
        attn = self.attention(x)  # (B, 1, H, W)
        attn = F.softmax(attn.flatten(2), dim=2).view_as(attn)
        return (x * attn).sum(dim=[2, 3])  # (B, C)
```

Not all features from a pre-trained model are equally relevant to your target task. Feature selection identifies which dimensions are most informative, potentially improving performance and reducing computation.
Why select features?

- High-dimensional representations (2048+ dimensions) inevitably contain dimensions that are noisy or irrelevant to the target task.
- Fewer dimensions mean smaller stored feature banks and faster downstream classifiers.
- With small target datasets, discarding uninformative dimensions acts as a form of regularization.
Selection methods:
Filter methods (pre-training): score each dimension with statistics computed independently of any downstream model, e.g. variance thresholding, mutual information with the labels, or ANOVA F-scores.

Wrapper methods (with training): evaluate candidate feature subsets using the downstream classifier itself, e.g. recursive feature elimination (RFE).

Embedded methods: perform selection as a by-product of training, e.g. L1-regularized logistic regression or a learned soft feature mask.
```python
import numpy as np
from sklearn.feature_selection import (
    SelectKBest, mutual_info_classif, f_classif,
    VarianceThreshold, RFE
)
from sklearn.linear_model import LogisticRegression
import torch
import torch.nn as nn


def select_features_mutual_info(features, labels, k=512):
    """Select top-k features by mutual information with labels."""
    selector = SelectKBest(mutual_info_classif, k=k)
    selected = selector.fit_transform(features, labels)
    return selected, selector.get_support()


def select_features_variance(features, threshold=0.01):
    """Remove near-constant features."""
    selector = VarianceThreshold(threshold=threshold)
    selected = selector.fit_transform(features)
    return selected, selector.get_support()


def select_features_l1(features, labels, C=0.1):
    """Use L1-regularized logistic regression for selection."""
    clf = LogisticRegression(penalty='l1', C=C, solver='saga', max_iter=1000)
    clf.fit(features, labels)
    # Features with non-zero weights are selected
    importance = np.abs(clf.coef_).sum(axis=0)
    return importance > 0, importance


class LearnedFeatureSelector(nn.Module):
    """Learn a soft feature mask during training."""

    def __init__(self, feature_dim: int, temperature: float = 1.0):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(feature_dim))
        self.temperature = temperature

    def forward(self, x):
        # Sigmoid gate gives a differentiable soft selection mask
        mask = torch.sigmoid(self.logits / self.temperature)
        return x * mask

    def get_selected_features(self, threshold=0.5):
        with torch.no_grad():
            mask = torch.sigmoid(self.logits)
            return (mask > threshold).cpu().numpy()
```

High-dimensional features (2048+) can be reduced to lower dimensions while preserving most discriminative information. This improves computational efficiency and can regularize downstream classifiers.
Principal Component Analysis (PCA):
Project to top-$k$ principal components: $$z_{\text{reduced}} = W_k^\top (z - \mu)$$
where $W_k$ contains the top-$k$ eigenvectors of the covariance matrix.
Typical dimensionality reduction: 2048-dimensional backbone features are commonly reduced to 128-512 components, which usually retains most of the variance while making downstream classifiers much cheaper.
Whitening:
Normalize features to unit variance: $$z_{\text{white}} = D^{-1/2} W^\top (z - \mu)$$
This decorrelates features and can improve linear probe performance.
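As a minimal sketch of the formula above (function and variable names are illustrative), the snippet below estimates $\mu$, the top-$k$ eigenvectors $W_k$, and eigenvalues $D$ from training features, then applies $z_{\text{white}} = D^{-1/2} W_k^\top (z - \mu)$; scikit-learn's `PCA(whiten=True)`, used later on this page, performs the equivalent computation.

```python
import numpy as np


def fit_pca_whitening(train_feats: np.ndarray, k: int = 256):
    """Estimate mean, top-k eigenvectors W_k, and eigenvalues D from training features."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)     # (d, d) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the top-k components
    return mu, eigvecs[:, top], eigvals[top]


def whiten(feats: np.ndarray, mu, W, D, eps: float = 1e-8):
    """Apply z_white = D^{-1/2} W^T (z - mu) to a batch of feature vectors."""
    return (feats - mu) @ W / np.sqrt(D + eps)


# Usage with hypothetical train/test feature arrays:
# mu, W, D = fit_pca_whitening(train_features, k=256)
# train_white = whiten(train_features, mu, W, D)
# test_white = whiten(test_features, mu, W, D)   # reuse the training statistics
```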
| Method | Linear? | Preserves | Cost |
|---|---|---|---|
| PCA | Yes | Variance | O(d³) once, O(dk) per sample |
| Whitened PCA | Yes | Decorrelated variance | Same as PCA |
| Random Projection | Yes | Distances (approx.) | O(dk) per sample |
| UMAP/t-SNE | No | Local structure | High (O(n²)) |
| Autoencoder | No | Learned reconstruction | Training required |
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
import torch
import torch.nn as nn


class FeatureReducer:
    """Efficient feature dimensionality reduction."""

    def __init__(self, method: str = "pca", n_components: int = 512,
                 whiten: bool = False):
        self.method = method
        self.n_components = n_components
        self.whiten = whiten
        self.reducer = None

    def fit(self, features: np.ndarray):
        if self.method == "pca":
            self.reducer = PCA(n_components=self.n_components, whiten=self.whiten)
        elif self.method == "random":
            self.reducer = GaussianRandomProjection(
                n_components=self.n_components
            )
        self.reducer.fit(features)

        if self.method == "pca":
            explained = self.reducer.explained_variance_ratio_.sum()
            print(f"PCA: {self.n_components} components explain "
                  f"{explained*100:.1f}% variance")
        return self

    def transform(self, features: np.ndarray) -> np.ndarray:
        return self.reducer.transform(features)

    def fit_transform(self, features: np.ndarray) -> np.ndarray:
        return self.fit(features).transform(features)


class LearnedProjection(nn.Module):
    """Learnable linear projection for feature reduction."""

    def __init__(self, in_dim: int, out_dim: int, init: str = "orthogonal"):
        super().__init__()
        self.projection = nn.Linear(in_dim, out_dim, bias=False)
        if init == "orthogonal":
            nn.init.orthogonal_(self.projection.weight)
        elif init == "xavier":
            nn.init.xavier_uniform_(self.projection.weight)

    def forward(self, x):
        return self.projection(x)
```

How features are normalized significantly impacts downstream classifier performance. Different normalization schemes have different properties.
L2 Normalization: $$z_{\text{norm}} = \frac{z}{\|z\|_2}$$
Maps all features to the unit hypersphere. Converts dot product to cosine similarity.
Standardization (Z-score): $$z_{\text{std}} = \frac{z - \mu}{\sigma}$$
Zero mean, unit variance per dimension. Computed on training set, applied to all.
Power normalization: $$z_{\text{power}} = \text{sign}(z) |z|^\alpha$$
Typically $\alpha = 0.5$ (square root). Reduces influence of large activations.
Always compute normalization statistics (mean, std) on the training set only, then apply the same transformation to validation and test sets. Data leakage from including test data in statistics biases evaluation.
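A minimal NumPy sketch of the three schemes (function names are illustrative); note that the standardization statistics come from the training features only and are then reused unchanged, as the warning above requires.

```python
import numpy as np


def l2_normalize(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project each feature vector onto the unit hypersphere."""
    return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)


def fit_standardizer(train_feats: np.ndarray, eps: float = 1e-8):
    """Compute per-dimension mean/std on the training set and return a transform."""
    mu, sigma = train_feats.mean(axis=0), train_feats.std(axis=0) + eps
    return lambda feats: (feats - mu) / sigma


def power_normalize(feats: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Signed power normalization: sign(z) * |z|^alpha damps large activations."""
    return np.sign(feats) * np.abs(feats) ** alpha


# Usage with hypothetical train/test feature arrays:
# standardize = fit_standardizer(train_features)
# train_z = standardize(train_features)
# test_z = standardize(test_features)   # same statistics, no leakage
```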
Combining the techniques above into a coherent pipeline:
Recommended pipeline (implemented below):

1. Extract activations from one or more late layers of a frozen backbone.
2. Pool each layer's spatial map with GeM and concatenate the results.
3. Fit whitened PCA on the training features and reduce dimensionality.
4. L2-normalize the reduced features.
5. Train a lightweight linear classifier on top.
```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.decomposition import PCA
import numpy as np


class FeatureExtractionPipeline:
    """Complete pipeline from images to classifier-ready features."""

    def __init__(
        self,
        backbone_name: str = "resnet50",
        layers: list = ["layer3", "layer4"],
        gem_p: float = 3.0,
        pca_dim: int = 512,
        normalize: bool = True
    ):
        # Load backbone and freeze all parameters
        self.backbone = getattr(models, backbone_name)(
            weights="IMAGENET1K_V2"
        ).eval()
        for p in self.backbone.parameters():
            p.requires_grad = False

        self.layers = layers
        self.gem_p = gem_p
        self.pca_dim = pca_dim
        self.normalize = normalize
        self.features = {}
        self._register_hooks()
        self.pca = None

    def _register_hooks(self):
        for name, module in self.backbone.named_modules():
            if name in self.layers:
                module.register_forward_hook(
                    lambda m, i, o, n=name: self.features.update({n: o})
                )

    def _gem_pool(self, x, p=3.0):
        return x.clamp(min=1e-6).pow(p).mean(dim=[2, 3]).pow(1/p)

    @torch.no_grad()
    def extract(self, dataloader, device="cuda"):
        """Extract features from all samples."""
        self.backbone = self.backbone.to(device)
        all_features, all_labels = [], []

        for images, labels in dataloader:
            self.features = {}
            images = images.to(device)
            _ = self.backbone(images)

            # GeM pool each layer and concatenate
            layer_feats = []
            for layer in self.layers:
                pooled = self._gem_pool(self.features[layer], self.gem_p)
                layer_feats.append(pooled)

            features = torch.cat(layer_feats, dim=1).cpu().numpy()
            all_features.append(features)
            all_labels.append(labels.numpy())

        return np.vstack(all_features), np.concatenate(all_labels)

    def fit_reduction(self, features):
        """Fit PCA on training features."""
        self.pca = PCA(n_components=self.pca_dim, whiten=True)
        reduced = self.pca.fit_transform(features)
        print(f"Variance explained: {self.pca.explained_variance_ratio_.sum():.2%}")
        return self._final_normalize(reduced)

    def transform(self, features):
        """Apply fitted PCA to new features."""
        reduced = self.pca.transform(features)
        return self._final_normalize(reduced)

    def _final_normalize(self, features):
        if self.normalize:
            norms = np.linalg.norm(features, axis=1, keepdims=True)
            return features / (norms + 1e-8)
        return features
```
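A sketch of how this pipeline might be driven end to end, assuming hypothetical `train_loader` and `test_loader` DataLoaders yielding `(images, labels)` batches, with a scikit-learn logistic regression as the linear probe:

```python
from sklearn.linear_model import LogisticRegression

# train_loader / test_loader are assumed to exist and yield (images, labels) batches.
pipeline = FeatureExtractionPipeline(
    backbone_name="resnet50", layers=["layer3", "layer4"], pca_dim=512
)

train_feats, train_labels = pipeline.extract(train_loader, device="cuda")
test_feats, test_labels = pipeline.extract(test_loader, device="cuda")

# Fit PCA whitening + L2 normalization on training features only.
train_z = pipeline.fit_reduction(train_feats)
test_z = pipeline.transform(test_feats)

# Linear probe on the reduced features.
clf = LogisticRegression(max_iter=2000)
clf.fit(train_z, train_labels)
print(f"Test accuracy: {clf.score(test_z, test_labels):.3f}")
```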
What's next: The next page explores domain discrepancy, looking at how to measure and understand the gap between source and target domains, which is fundamental to predicting when transfer will succeed or fail.

You now understand advanced feature extraction techniques that maximize the utility of frozen pre-trained representations.