When a convolutional layer processes an image, what exactly does it produce? Not a single number, not a flat vector, but a rich three-dimensional structure called a feature map—a spatial grid where each position contains activations for multiple learned features.
A feature map is the language CNNs use to describe what they see. Each position indicates 'here is evidence for these features at this location.' Early layers might say 'there's a vertical edge here' or 'this region is blue-ish.' Deeper layers might say 'this looks like fur texture' or 'this region contains an eye-like pattern.'
Understanding feature maps is essential for interpreting what a network has learned, debugging training and architecture problems, and making informed decisions about channel counts and layer design.
This page provides a comprehensive treatment of feature maps—their structure, computation, interpretation, and role in CNN learning.
This page covers the complete theory and practice of feature maps. You will understand their mathematical structure, how they're computed via convolution, interpretation through visualization techniques, design principles for channel counts, and their role in the hierarchical feature learning that makes CNNs powerful.
Let's establish precise terminology for discussing feature maps.
Feature Map Definition:
A feature map is a 2D array of activations produced by applying a single filter to the input. The term refers to both the single 2D activation grid produced by one filter and, more loosely, the full stack of such grids (all output channels) produced by a layer.
Tensor Dimensions:
In deep learning frameworks, feature map tensors typically have shape:
[Batch, Channels, Height, Width] (PyTorch, NCHW) or [Batch, Height, Width, Channels] (TensorFlow default, NHWC)
| Dimension | Meaning |
|---|---|
| Batch (N) | Number of samples processed together |
| Channels (C) | Number of feature maps / filters |
| Height (H) | Spatial height of feature maps |
| Width (W) | Spatial width of feature maps |
Example:
A conv layer processing 224×224 RGB images with 64 filters produces an output of shape [N, 64, H', W']: for example [N, 64, 224, 224] with stride 1 and 'same' padding, or [N, 64, 112, 112] with stride 2 (as in the table below).
These terms are often used interchangeably: 'channels' (dimension of tensor), 'feature maps' (spatial activations), 'filters' (learned weights). A layer with 64 filters produces 64 output channels, where each channel is a feature map. The terms emphasize different perspectives: learned weights (filters), activations (feature maps), or tensor structure (channels).
Spatial vs Channel Dimensions:
Feature map tensors have two fundamentally different types of dimensions:
Spatial Dimensions (H, W): preserve the spatial layout of the input; position (i, j) in a feature map corresponds to a region (its receptive field) around the same relative location in the image.
Channel Dimension (C): indexes the different learned features; at each spatial position, the C values report how strongly each of the layer's filters responds there.
Visualization:
W
┌────────────────┐
H │ Channel 1 │ (e.g., "vertical edge")
└────────────────┘
┌────────────────┐
H │ Channel 2 │ (e.g., "horizontal edge")
└────────────────┘
┌────────────────┐
H │ Channel 3 │ (e.g., "diagonal edge")
└────────────────┘
...
┌────────────────┐
H │ Channel C │ (e.g., "texture pattern")
└────────────────┘
C feature maps stacked along channel dimension
| Layer | Channels | Height | Width | Total Activations |
|---|---|---|---|---|
| Input | 3 | 224 | 224 | 150,528 |
| Conv1 (64 filters) | 64 | 112 | 112 | 802,816 |
| Conv2 (128 filters) | 128 | 56 | 56 | 401,408 |
| Conv3 (256 filters) | 256 | 28 | 28 | 200,704 |
| Conv4 (512 filters) | 512 | 14 | 14 | 100,352 |
| Conv5 (512 filters) | 512 | 7 | 7 | 25,088 |
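The progression in this table can be reproduced with a short sketch: a stack of stride-2 3×3 convolutions (layer sizes chosen to mirror the table, not taken from any specific architecture) applied to a dummy input, printing each stage's feature-map shape.

import torch
import torch.nn as nn

# Stride-2 3×3 convolutions that mirror the table's channel/spatial progression.
convs = [
    nn.Conv2d(3, 64, 3, stride=2, padding=1),    # 224 -> 112
    nn.Conv2d(64, 128, 3, stride=2, padding=1),  # 112 -> 56
    nn.Conv2d(128, 256, 3, stride=2, padding=1), # 56 -> 28
    nn.Conv2d(256, 512, 3, stride=2, padding=1), # 28 -> 14
    nn.Conv2d(512, 512, 3, stride=2, padding=1), # 14 -> 7
]

x = torch.randn(1, 3, 224, 224)  # [N, C, H, W]
print("Input :", tuple(x.shape), "->", x.numel(), "activations")
for i, conv in enumerate(convs, start=1):
    x = conv(x)
    print(f"Conv{i} :", tuple(x.shape), "->", x.numel(), "activations")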
Understanding the precise computation of feature maps reveals their structure and interpretation.
Convolution Operation:
For an input with C_in channels and a conv layer with C_out filters:
$$Y_{c_{out}}[i, j] = \sigma\left(\sum_{c_{in}=1}^{C_{in}} \sum_{m,n} X_{c_{in}}[i+m, j+n] \cdot K_{c_{out}, c_{in}}[m, n] + b_{c_{out}}\right)$$
where $Y_{c_{out}}$ is the output feature map for filter $c_{out}$, $X_{c_{in}}$ is input channel $c_{in}$, $K_{c_{out}, c_{in}}$ is the kernel connecting input channel $c_{in}$ to output channel $c_{out}$, $b_{c_{out}}$ is the bias, and $\sigma$ is the nonlinearity (typically ReLU); the sums run over all input channels and all kernel offsets $(m, n)$.
Key Insights:
A 'kernel' or 'filter' isn't just k×k—it's C_in × k × k. When we say '64 3×3 filters on RGB input,' we actually have 64 filters each of size 3 × 3 × 3 = 27 weights (plus bias). The filter slides spatially but covers all input channels at once.
Step-by-Step Example:
Input: RGB image [3, 8, 8] — 3 channels, 8×8 spatial
Conv layer: 64 filters, 3×3, stride 1, padding 1
Computation for one output position (i=4, j=4) in output channel c=0:
# X: input [3, 8, 8]; K: filter weights [64, 3, 3, 3]; bias: [64]
# Extract the 3×3 patch centered at (4, 4) from all 3 input channels
patch = X[:, 3:6, 3:6]  # Shape: [3, 3, 3]
# Filter 0 has shape [3, 3, 3] — one 3×3 kernel per input channel
filter_0 = K[0]  # Shape: [3, 3, 3]
# Element-wise multiply and sum over channels and positions
activation = (patch * filter_0).sum() + bias[0]
# Apply the nonlinearity
Y[0, 4, 4] = torch.relu(activation)
Repeat for all 64 filters at all 8×8 positions → Output [64, 8, 8]
Computational Complexity:
For input [C_in, H, W] and output [C_out, H', W'] with k×k kernels:
$$\text{FLOPs} = 2 \cdot C_{out} \cdot H' \cdot W' \cdot C_{in} \cdot k^2$$
(The factor of 2 counts multiply + add separately)
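As a quick sanity check, a small helper (a hypothetical convenience function, not part of any library) plugs the Conv1 layer from the table above into this formula.

def conv2d_flops(c_in, c_out, h_out, w_out, k):
    """FLOPs for a k×k convolution, counting multiplies and adds separately."""
    return 2 * c_out * h_out * w_out * c_in * k * k

# Conv1 from the table: 3 -> 64 channels, 112×112 output, 3×3 kernels
print(f"{conv2d_flops(3, 64, 112, 112, 3):,} FLOPs")  # 43,352,064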
import torch
import torch.nn as nn
import torch.nn.functional as F

def manual_conv2d(x, weight, bias, padding=0):
    """
    Manual implementation of 2D convolution to show feature map computation.

    Args:
        x: Input tensor [C_in, H, W]
        weight: Filter weights [C_out, C_in, k, k]
        bias: Bias vector [C_out]
        padding: Zero padding
    Returns:
        Feature map tensor [C_out, H', W']
    """
    c_in, h, w = x.shape
    c_out, _, k, _ = weight.shape

    # Apply padding
    if padding > 0:
        x = F.pad(x, (padding, padding, padding, padding))
    _, h_pad, w_pad = x.shape

    h_out = h_pad - k + 1
    w_out = w_pad - k + 1

    # Initialize output
    output = torch.zeros(c_out, h_out, w_out)

    # Compute each output position
    for cout in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                # Extract patch from ALL input channels
                patch = x[:, i:i+k, j:j+k]  # [C_in, k, k]
                # Dot product with this output channel's filter
                # Filter has shape [C_in, k, k]
                output[cout, i, j] = (patch * weight[cout]).sum() + bias[cout]

    return output

# Example
x = torch.randn(3, 8, 8)  # RGB 8x8 image
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Our manual implementation
manual_out = manual_conv2d(x, conv.weight, conv.bias, padding=1)

# PyTorch implementation
pytorch_out = conv(x.unsqueeze(0)).squeeze(0)

# Verify they match
print(f"Max difference: {(manual_out - pytorch_out).abs().max().item():.6f}")
print(f"Output shape: {manual_out.shape}")  # [64, 8, 8]

# Output:
# Max difference: 0.000000
# Output shape: torch.Size([64, 8, 8])

Feature maps aren't just mathematical objects—they encode meaningful visual information. Understanding how to interpret them is crucial for model understanding and debugging.
What Feature Maps Represent:
Each feature map is a spatial detector for a particular pattern:
Layer-by-Layer Interpretation:
Early Layers (Conv1-2):
Middle Layers (Conv3-4):
Deep Layers (Conv5+):
ReLU makes feature maps sparse by zeroing negative values. This sparsity has interpretation: a ReLU feature map shows where patterns are present (positive activation) and where they're absent or contradicted (zero). Without ReLU, feature maps would have both positive and negative values, making interpretation harder.
Visualization Techniques:
1. Direct Visualization
Display feature map activations as grayscale images:
import matplotlib.pyplot as plt

def visualize_feature_maps(feature_maps, num_show=16):
    """
    Display the first `num_show` channels of a [C, H, W] feature map tensor.
    """
    fig, axes = plt.subplots(4, 4, figsize=(12, 12))
    for i, ax in enumerate(axes.flat):
        if i < min(num_show, feature_maps.shape[0]):
            ax.imshow(feature_maps[i].cpu().numpy(), cmap='viridis')
            ax.set_title(f'Channel {i}')
        ax.axis('off')  # hide axes, including any unused panels
    plt.tight_layout()
    plt.show()
2. Activation Maximization
Generate input patterns that maximally activate a specific feature, typically by gradient ascent on the input image.
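A minimal sketch of this idea, assuming a pretrained ResNet-18; the target layer (`layer3[0].conv1`) and channel index (47) are arbitrary illustrative choices.

import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image will be optimized

activation = {}
def hook(module, inp, out):
    activation['map'] = out

handle = model.layer3[0].conv1.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    model(img)
    # Negative mean activation of the target channel -> gradient ascent
    loss = -activation['map'][0, 47].mean()
    loss.backward()
    optimizer.step()

handle.remove()
# `img` now contains a pattern that strongly drives channel 47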
3. Occlusion Sensitivity
Slide an occluding patch across the input and record how activations (or the class score) change; regions whose occlusion causes a large drop are the ones the feature depends on.
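A sketch of the procedure, assuming a classifier `model` and a preprocessed input `image` of shape [1, 3, H, W]; the patch size and stride are arbitrary.

import torch

def occlusion_sensitivity(model, image, target_class, patch=32, stride=16):
    """Slide a zero patch over the image and record the drop in class score."""
    model.eval()
    _, _, H, W = image.shape
    with torch.no_grad():
        baseline = model(image)[0, target_class].item()
        heatmap = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, :, y:y+patch, x:x+patch] = 0.0  # occlude this region
                score = model(occluded)[0, target_class].item()
                heatmap[i, j] = baseline - score  # large drop => important region
    return heatmap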
4. Gradient-based Attribution (Grad-CAM)
Weight feature maps by gradient of class score: $$L^c = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$
where $\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}$
Creates class-specific attention maps showing which regions matter for a prediction.
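A compact sketch following this formula, hooking the last convolutional stage of a pretrained ResNet-18 (the layer choice is an assumption; any late conv layer can be used).

import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True).eval()

feats = {}
def fwd_hook(module, inp, out):
    out.retain_grad()        # keep the gradient w.r.t. this feature map
    feats['A'] = out

handle = model.layer4.register_forward_hook(fwd_hook)  # last conv stage: [1, 512, 7, 7]

image = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
logits = model(image)
class_idx = logits.argmax(dim=1).item()
logits[0, class_idx].backward()

A, dA = feats['A'], feats['A'].grad
alpha = dA.mean(dim=(2, 3), keepdim=True)              # alpha_k^c: pooled gradients
cam = F.relu((alpha * A).sum(dim=1, keepdim=True))     # [1, 1, 7, 7]
cam = F.interpolate(cam, size=image.shape[2:], mode='bilinear', align_corners=False)

handle.remove()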
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import models, transforms

def extract_feature_maps(model, input_tensor, layer_name):
    """
    Extract feature maps from a specific layer.
    """
    feature_maps = {}

    def hook(module, input, output):
        feature_maps['output'] = output.detach()

    # Register hook on target layer
    for name, module in model.named_modules():
        if name == layer_name:
            handle = module.register_forward_hook(hook)
            break

    # Forward pass
    with torch.no_grad():
        model(input_tensor)

    handle.remove()
    return feature_maps['output']

def visualize_top_activating_channels(feature_maps, top_k=8):
    """
    Visualize channels with highest mean activation.
    """
    # feature_maps: [1, C, H, W]
    maps = feature_maps.squeeze(0)  # [C, H, W]

    # Find top-k channels by mean activation
    mean_activations = maps.mean(dim=(1, 2))  # [C]
    top_indices = mean_activations.argsort(descending=True)[:top_k]

    fig, axes = plt.subplots(2, top_k // 2, figsize=(15, 6))
    for i, ax in enumerate(axes.flat):
        idx = top_indices[i]
        ax.imshow(maps[idx].cpu().numpy(), cmap='viridis')
        ax.set_title(f'Ch {idx}: mean={mean_activations[idx]:.2f}')
        ax.axis('off')
    plt.suptitle('Top Activating Feature Channels')
    return fig

def compute_channel_correlation(feature_maps):
    """
    Compute correlation matrix between feature map channels.
    Shows which features co-activate.
    """
    # feature_maps: [1, C, H, W]
    maps = feature_maps.squeeze(0)  # [C, H, W]
    C = maps.shape[0]

    # Flatten spatial dimensions
    maps_flat = maps.view(C, -1)  # [C, H*W]

    # Compute correlation matrix
    maps_centered = maps_flat - maps_flat.mean(dim=1, keepdim=True)
    maps_std = maps_flat.std(dim=1, keepdim=True) + 1e-8
    maps_normalized = maps_centered / maps_std
    correlation = torch.mm(maps_normalized, maps_normalized.t()) / maps_flat.shape[1]

    return correlation.cpu().numpy()

# Example usage with pretrained ResNet
model = models.resnet18(pretrained=True)
model.eval()

# Create sample input
sample_input = torch.randn(1, 3, 224, 224)

# Extract from different layers
for layer in ['layer1.0.conv1', 'layer2.0.conv1', 'layer3.0.conv1', 'layer4.0.conv1']:
    try:
        fmaps = extract_feature_maps(model, sample_input, layer)
        print(f"{layer}: {fmaps.shape}")
    except:
        print(f"{layer}: layer not found")

Feature maps exhibit interesting statistical properties—sparsity and redundancy—that have profound implications for network compression and understanding.
Sparsity:
After ReLU, feature maps are typically sparse: a large fraction of the values are exactly zero.
Why Sparsity Matters: zero activations can be skipped by sparse-aware hardware and compressed cheaply, and they simplify interpretation, since a zero simply means 'this pattern is not present here.'
Measuring Sparsity:
def measure_sparsity(feature_maps, threshold=0.0):
"""
Compute fraction of near-zero values in feature maps.
"""
total_values = feature_maps.numel()
sparse_values = (feature_maps.abs() <= threshold).sum().item()
return sparse_values / total_values
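For example, hooking this helper onto the ReLU outputs of a pretrained ResNet-18 (a quick sketch; exact numbers vary with the model and input).

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True).eval()

results, handles = [], []
for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        handles.append(module.register_forward_hook(
            lambda m, inp, out, n=name: results.append((n, measure_sparsity(out)))
        ))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
for h in handles:
    h.remove()

for name, sparsity in results[:5]:
    print(f"{name}: {sparsity:.1%} zeros")  # commonly around half the values are zero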
Recent research suggests that networks represent more features than they have neurons by using sparse, distributed codes. A feature map with 512 channels might encode thousands of features via combinations of channels. This 'superposition' explains why networks can be compressed without losing capability.
Redundancy:
Despite having hundreds of channels, feature maps often contain redundant information:
Intra-Channel Redundancy: Spatially adjacent positions often have similar values (smoothness)
Inter-Channel Redundancy: Many channels are highly correlated—they detect similar features
Exploiting Redundancy:
1. Channel Pruning
Remove channels that are highly correlated with other channels or that rarely activate, as in the sketch below.
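A rough sketch of activation-based channel selection (a simple heuristic, not a full pruning pipeline): rank channels by their mean activation over a calibration batch and keep the top fraction.

import torch

def select_channels_to_keep(feature_maps, keep_ratio=0.75):
    """
    feature_maps: [N, C, H, W] activations collected from a calibration batch.
    Returns indices of the channels with the highest mean activation.
    """
    mean_act = feature_maps.mean(dim=(0, 2, 3))           # [C] mean per channel
    num_keep = max(1, int(keep_ratio * mean_act.numel()))
    return mean_act.argsort(descending=True)[:num_keep]

# Example: keep 75% of the channels of a hypothetical 64-channel layer
acts = torch.relu(torch.randn(8, 64, 56, 56))
keep = select_channels_to_keep(acts, keep_ratio=0.75)
print(f"Keeping {len(keep)} of 64 channels")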
2. Low-Rank Factorization
Approximate weight matrices with lower-rank versions: $$W \approx UV^T$$
For a [C_out, C_in, k, k] kernel, factorize into two smaller convolutions.
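One concrete realization (a spatially separable factorization, one of several possible schemes) replaces a 3×3 convolution with a 3×1 followed by a 1×3.

import torch
import torch.nn as nn

c_in, c_out = 256, 256

full = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)           # 256*256*9 weights
factored = nn.Sequential(
    nn.Conv2d(c_in, c_out, kernel_size=(3, 1), padding=(1, 0)),   # 256*256*3 weights
    nn.Conv2d(c_out, c_out, kernel_size=(1, 3), padding=(0, 1)),  # 256*256*3 weights
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, c_in, 28, 28)
print(full(x).shape, factored(x).shape)          # same output shape
print(n_params(full), "vs", n_params(factored))  # roughly one third fewer parameters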
3. Quantization
Reduce the precision of feature map values, for example from 32-bit floats to 8-bit integers.
Sparse, redundant feature maps tolerate aggressive quantization.
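As an illustration, a quick sketch quantizes a post-ReLU feature map to 8-bit integers and measures the round-trip error (a simplified affine scheme, not a full quantization pipeline).

import torch

fmap = torch.relu(torch.randn(1, 64, 56, 56))   # sparse post-ReLU activations

# Simple affine quantization to uint8
scale = fmap.max() / 255.0
q = torch.clamp((fmap / scale).round(), 0, 255).to(torch.uint8)
deq = q.float() * scale

err = (fmap - deq).abs().max()
print(f"Max round-trip error: {err:.4f} (scale = {scale:.4f})")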
How many feature maps (channels) should each layer have? This fundamental design question affects capacity, computation, and performance.
Traditional Approach: Doubling Pattern
VGG and ResNet popularized the pattern of doubling the channel count each time the spatial resolution is halved: 64 → 128 → 256 → 512.
Rationale: as resolution shrinks, each position summarizes a larger region of the image, so more channels are needed to describe the richer feature combinations present there.
Mathematical Justification:
For input [C, H, W] and output [2C, H/2, W/2], the total number of activations goes from $C \cdot H \cdot W$ to $2C \cdot \tfrac{H}{2} \cdot \tfrac{W}{2} = \tfrac{1}{2} C \cdot H \cdot W$, so the representation shrinks by only 2× per stage rather than the 4× that spatial downsampling alone would cause.
The doubling compensates for the spatial reduction, maintaining representational capacity.
Neural Architecture Search (NAS) has found that the doubling pattern isn't always optimal. EfficientNet uses compound scaling that increases depth, width, and resolution together. ConvNeXt uses a 96 → 192 → 384 → 768 progression (still 2× doubling, but starting from a wider 96-channel stem rather than 64). The optimal pattern depends on the task and compute budget.
Factors Affecting Optimal Channel Count:
1. Task Complexity
2. Input Resolution
3. Depth
4. Computational Budget
Width Multiplier Pattern:
MobileNet introduced width multipliers:
base_channels = [32, 64, 128, 256, 512]
width_mult = 0.75 # 75% of base width
actual_channels = [int(c * width_mult) for c in base_channels]
# Result: [24, 48, 96, 192, 384]
This allows trading accuracy for efficiency uniformly across the network.
| Architecture | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Pattern |
|---|---|---|---|---|---|---|
| VGG-16 | 64 | 128 | 256 | 512 | 512 | 2× doubling |
| ResNet-50 | 64 | 256 | 512 | 1024 | 2048 | 2× + bottleneck |
| MobileNet V2 | 32 | 24 | 32 | 64 | 320 | Inverted bottleneck |
| EfficientNet-B0 | 32 | 24 | 40 | 80 | 192 | Compound scaling |
| ConvNeXt-T | 96 | 192 | 384 | 768 | — | 2× doubling |
| Swin-T | 96 | 192 | 384 | 768 | — | 2× doubling |
Feature maps are the medium through which information flows through a CNN. Understanding this flow reveals how representations are built and transformed.
Information Transformation:
As data flows through a CNN, feature maps undergo systematic transformations:
Spatial Transformation: resolution shrinks stage by stage (e.g., 224 → 112 → 56 → 28 → 14 → 7), while each position's receptive field grows to cover more of the image.
Channel Transformation: the channel count grows (e.g., 3 → 64 → 128 → 256 → 512), so each position is described by an increasingly rich feature vector.
Information Content: the total number of activations typically decreases (compare the table earlier), so the network progressively compresses the input, discarding task-irrelevant detail.
Bottleneck Analysis:
The layer with minimum representation size (often the final conv layer) is the information bottleneck:
Input: 224×224×3 = 150,528 values
Conv5: 7×7×512 = 25,088 values
Global Pool: 1×1×512 = 512 values ← Bottleneck
FC: 1000 values (class logits)
The 512-dimensional representation must contain all information needed for classification—a compression factor of ~300×.
According to Information Bottleneck theory, networks learn to compress input information while preserving task-relevant information. Feature maps represent this compressed encoding—discarding irrelevant details (background, exact textures) while retaining relevant ones (object categories, spatial relationships).
Skip Connections and Information Flow:
ResNets demonstrated that skip connections fundamentally improve information flow:
Without Skips:
Input → Conv → ReLU → Conv → ReLU → ... → Output
Information must flow through all layers sequentially.
Gradients must backprop through all layers.
Deep networks suffer gradient vanishing.
With Skips:
Input ─┬─▶ Conv → ReLU → Conv ─┬─▶ + → Output
└────────────────────────┘
Information can bypass convolutional layers.
Gradients have direct paths to early layers.
Even 100+ layer networks train well.
Feature Map Analysis with Skips:
In ResNets, feature maps at layer L contain the identity-mapped features carried forward from earlier layers plus the residual that the current block learns: roughly $x_L = x_{L-1} + F(x_{L-1})$.
This explains why deeper ResNets work: each layer needs only to learn what to ADD to existing features, not replace them entirely.
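A minimal residual block sketch makes this additive structure explicit: the output feature map is literally the input feature map plus a learned correction.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = input + learned residual."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + residual)   # earlier feature maps pass through unchanged

x = torch.randn(1, 64, 56, 56)
block = ResidualBlock(64)
print(block(x).shape)  # [1, 64, 56, 56] — same shape, refined features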
import torch
import torch.nn as nn
import numpy as np

def analyze_feature_statistics(model, sample_input, num_layers=10):
    """
    Analyze feature map statistics through the network.
    """
    stats = []

    def hook_fn(name):
        def hook(module, input, output):
            if isinstance(output, torch.Tensor) and len(output.shape) == 4:
                o = output.detach()
                stats.append({
                    'name': name,
                    'shape': list(o.shape),
                    'mean': o.mean().item(),
                    'std': o.std().item(),
                    'sparsity': (o == 0).float().mean().item(),
                    'spatial_size': o.shape[2] * o.shape[3],
                    'channels': o.shape[1],
                    'total_values': o.numel(),
                })
        return hook

    # Register hooks
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.BatchNorm2d, nn.ReLU, nn.MaxPool2d)):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    # Forward pass
    with torch.no_grad():
        model(sample_input)

    # Remove hooks
    for h in hooks:
        h.remove()

    return stats

def visualize_information_flow(stats):
    """
    Print analysis of how information flows through layers.
    """
    print("Feature Map Analysis:")
    print("=" * 80)
    print(f"{'Layer':<25} {'Shape':>15} {'Mean':>8} {'Std':>8} {'Sparsity':>10}")
    print("-" * 80)

    for s in stats[:20]:  # Show first 20 layers
        shape_str = f"{s['channels']}×{int(np.sqrt(s['spatial_size']))}²"
        print(f"{s['name'][:24]:<25} {shape_str:>15} "
              f"{s['mean']:>8.3f} {s['std']:>8.3f} {s['sparsity']:>9.1%}")

    # Compute information metrics
    print("\n" + "=" * 80)
    print("Information Summary:")
    print("-" * 80)

    if len(stats) > 0:
        initial_size = stats[0]['total_values']
        final_size = stats[-1]['total_values']
        print(f"Initial representation: {initial_size:,} values")
        print(f"Final representation: {final_size:,} values")
        print(f"Compression ratio: {initial_size / final_size:.1f}x")

# Example with a small model
from torchvision import models
model = models.resnet18(pretrained=True)
model.eval()

sample = torch.randn(1, 3, 224, 224)
stats = analyze_feature_statistics(model, sample)
visualize_information_flow(stats)

Perhaps the most fascinating aspect of feature maps is their emergent semantic content—they learn to represent meaningful visual concepts without explicit supervision.
Emergence of Semantic Features:
Studies visualizing individual channels reveal semantic specificity:
| Layer | Example Channels | What They Detect |
|---|---|---|
| Conv1 | Various | Edges, colors (generic) |
| Conv3 | Channel 12 | Wheel-like patterns |
| Conv3 | Channel 47 | Fur/hair textures |
| Conv4 | Channel 83 | Faces |
| Conv4 | Channel 156 | Text characters |
| Conv5 | Channel 201 | Dogs |
| Conv5 | Channel 276 | Buildings |
Network Dissection:
The Network Dissection framework (Bau et al., 2017) systematically identifies what each channel detects: each channel's activation map is thresholded and compared against pixel-level concept annotations, and channels whose activations consistently overlap a labeled concept (measured by IoU) are counted as detectors for that concept.
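The core measurement is an IoU between a thresholded, upsampled channel activation and a labeled concept mask; the sketch below is a simplified version (the actual framework uses the Broden dataset and per-channel activation quantiles as thresholds).

import torch
import torch.nn.functional as F

def channel_concept_iou(channel_map, concept_mask, threshold):
    """
    channel_map:  [h, w] activations of one channel for one image
    concept_mask: [H, W] binary mask of a labeled concept (e.g., 'wheel')
    threshold:    activation threshold (Network Dissection uses a per-channel quantile)
    """
    # Upsample the activation map to the mask resolution, then binarize
    act = F.interpolate(channel_map[None, None], size=concept_mask.shape,
                        mode='bilinear', align_corners=False)[0, 0]
    act_mask = act > threshold
    intersection = (act_mask & concept_mask).sum().item()
    union = (act_mask | concept_mask).sum().item()
    return intersection / union if union > 0 else 0.0

# Toy example with random data
iou = channel_concept_iou(torch.rand(7, 7), torch.rand(224, 224) > 0.8, threshold=0.5)
print(f"IoU: {iou:.3f}")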
Networks spontaneously learn to detect high-level concepts like faces, wheels, and even specific objects—without ever being trained on these labels. A network trained only for scene classification develops face detectors because faces are useful for identifying indoor scenes. This emergence of semantic features is a remarkable property of deep learning.
Hierarchical Feature Semantics:
Layer 1: Color and Edge
├── Ch 1: Horizontal edge
├── Ch 2: Vertical edge
├── Ch 3: Red color
└── Ch 4: Blue color
Layer 3: Texture and Pattern
├── Ch 12: Honeycomb texture
├── Ch 47: Striped pattern
├── Ch 83: Dotted pattern
└── Ch 156: Grid pattern
Layer 5: Object and Part
├── Ch 201: Dog heads
├── Ch 276: Buildings
├── Ch 312: Wheels
└── Ch 445: Eyes
Implications for Transfer Learning:
The semantic content of feature maps explains why transfer learning works: early layers learn generic features (edges, colors, textures) that are useful for almost any visual task, while deeper layers learn increasingly task-specific semantics.
Best Practice: reuse the pretrained early and middle layers, and fine-tune or replace the later layers and the classifier head; the smaller the target dataset, the more layers should stay frozen.
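A common recipe, sketched below under the assumption of a ResNet-18 backbone and a hypothetical 10-class target task: freeze the early stages and train only the last stage plus a new classifier head.

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# Freeze everything, then unfreeze the last stage
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True

# New head for a hypothetical 10-class task (trainable by default)
model.fc = nn.Linear(model.fc.in_features, 10)

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"Training {sum(p.numel() for p in trainable):,} of "
      f"{sum(p.numel() for p in model.parameters()):,} parameters")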
Feature maps are the intermediate representations that give CNNs their power. They transform raw pixel data into rich, hierarchical, semantically meaningful descriptions that enable visual understanding.
Connection to Next Topic:
So far we've treated each channel as detecting patterns from the previous layer's features. But what happens when we have multiple input channels, like RGB images or multi-channel feature maps from previous layers? The next page explores multiple channels—how convolution operates across channels, the distinction between input and output channels, and techniques like depthwise separable convolutions that exploit channel structure for efficiency.
You now understand feature maps—the rich intermediate representations that encode what a CNN has learned to see. You've explored their structure, computation, interpretation through visualization, sparsity properties, design principles for channel counts, and the emergence of semantic meaning. Next, we'll examine how multiple input and output channels work together in convolutional layers.