When a convolutional layer processes an image, what exactly does it produce? Not a single number, not a flat vector, but a rich three-dimensional structure called a feature map—a spatial grid where each position contains activations for multiple learned features.
A feature map is the language CNNs use to describe what they see. Each position indicates 'here is evidence for these features at this location.' Early layers might say 'there's a vertical edge here' or 'this region is blue-ish.' Deeper layers might say 'this looks like fur texture' or 'this region contains an eye-like pattern.'
Understanding feature maps is essential for interpreting what a network has learned, debugging training and architecture problems, and making informed decisions about channel counts and layer design.
This page provides a comprehensive treatment of feature maps—their structure, computation, interpretation, and role in CNN learning.
This page covers the complete theory and practice of feature maps. You will understand their mathematical structure, how they're computed via convolution, interpretation through visualization techniques, design principles for channel counts, and their role in the hierarchical feature learning that makes CNNs powerful.
Let's establish precise terminology for discussing feature maps.
Feature Map Definition:
A feature map is a 2D array of activations produced by applying a single filter to the input. The term refers to both the single 2D activation grid produced by one filter and, more loosely, the full stack of such grids (all output channels) produced by a layer.
Tensor Dimensions:
In deep learning frameworks, feature map tensors typically have shape:
[Batch, Channels, Height, Width] (PyTorch, NCHW) or [Batch, Height, Width, Channels] (TensorFlow default, NHWC)
| Dimension | Meaning |
|---|---|
| Batch (N) | Number of samples processed together |
| Channels (C) | Number of feature maps / filters |
| Height (H) | Spatial height of feature maps |
| Width (W) | Spatial width of feature maps |
Example:
A conv layer processing 224×224 RGB images with 64 filters produces an output of shape [N, 64, H', W']: for example [N, 64, 224, 224] with stride 1 and 'same' padding, or [N, 64, 112, 112] with stride 2 (as in the table below).
These terms are often used interchangeably: 'channels' (dimension of tensor), 'feature maps' (spatial activations), 'filters' (learned weights). A layer with 64 filters produces 64 output channels, where each channel is a feature map. The terms emphasize different perspectives: learned weights (filters), activations (feature maps), or tensor structure (channels).
Spatial vs Channel Dimensions:
Feature map tensors have two fundamentally different types of dimensions:
Spatial Dimensions (H, W): preserve the spatial layout of the input; position (i, j) in a feature map corresponds to a region (its receptive field) around the same relative location in the image.
Channel Dimension (C): indexes the different learned features; at each spatial position, the C values report how strongly each of the layer's filters responds there.
Visualization:
W
┌────────────────┐
H │ Channel 1 │ (e.g., "vertical edge")
└────────────────┘
┌────────────────┐
H │ Channel 2 │ (e.g., "horizontal edge")
└────────────────┘
┌────────────────┐
H │ Channel 3 │ (e.g., "diagonal edge")
└────────────────┘
...
┌────────────────┐
H │ Channel C │ (e.g., "texture pattern")
└────────────────┘
C feature maps stacked along channel dimension
| Layer | Channels | Height | Width | Total Activations |
|---|---|---|---|---|
| Input | 3 | 224 | 224 | 150,528 |
| Conv1 (64 filters) | 64 | 112 | 112 | 802,816 |
| Conv2 (128 filters) | 128 | 56 | 56 | 401,408 |
| Conv3 (256 filters) | 256 | 28 | 28 | 200,704 |
| Conv4 (512 filters) | 512 | 14 | 14 | 100,352 |
| Conv5 (512 filters) | 512 | 7 | 7 | 25,088 |
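The progression in this table can be reproduced with a short sketch: a stack of stride-2 3×3 convolutions (layer sizes chosen to mirror the table, not taken from any specific architecture) applied to a dummy input, printing each stage's feature-map shape.

import torch
import torch.nn as nn

# Stride-2 3×3 convolutions that mirror the table's channel/spatial progression.
convs = [
    nn.Conv2d(3, 64, 3, stride=2, padding=1),    # 224 -> 112
    nn.Conv2d(64, 128, 3, stride=2, padding=1),  # 112 -> 56
    nn.Conv2d(128, 256, 3, stride=2, padding=1), # 56 -> 28
    nn.Conv2d(256, 512, 3, stride=2, padding=1), # 28 -> 14
    nn.Conv2d(512, 512, 3, stride=2, padding=1), # 14 -> 7
]

x = torch.randn(1, 3, 224, 224)  # [N, C, H, W]
print("Input :", tuple(x.shape), "->", x.numel(), "activations")
for i, conv in enumerate(convs, start=1):
    x = conv(x)
    print(f"Conv{i} :", tuple(x.shape), "->", x.numel(), "activations")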
Understanding the precise computation of feature maps reveals their structure and interpretation.
Convolution Operation:
For an input with C_in channels and a conv layer with C_out filters:
$$Y_{c_{out}}[i, j] = \sigma\left(\sum_{c_{in}=1}^{C_{in}} \sum_{m,n} X_{c_{in}}[i+m, j+n] \cdot K_{c_{out}, c_{in}}[m, n] + b_{c_{out}}\right)$$
where $Y_{c_{out}}$ is the output feature map for filter $c_{out}$, $X_{c_{in}}$ is input channel $c_{in}$, $K_{c_{out}, c_{in}}$ is the kernel connecting input channel $c_{in}$ to output channel $c_{out}$, $b_{c_{out}}$ is the bias, and $\sigma$ is the nonlinearity (typically ReLU); the sums run over all input channels and all kernel offsets $(m, n)$.
Key Insights:
A 'kernel' or 'filter' isn't just k×k—it's C_in × k × k. When we say '64 3×3 filters on RGB input,' we actually have 64 filters each of size 3 × 3 × 3 = 27 weights (plus bias). The filter slides spatially but covers all input channels at once.
Step-by-Step Example:
Input: RGB image [3, 8, 8] — 3 channels, 8×8 spatial
Conv layer: 64 filters, 3×3, stride 1, padding 1
Computation for one output position (i=4, j=4) in output channel c=0:
# X: input [3, 8, 8]; K: filter weights [64, 3, 3, 3]; bias: [64]
# Extract the 3×3 patch centered at (4, 4) from all 3 input channels
patch = X[:, 3:6, 3:6]  # Shape: [3, 3, 3]
# Filter 0 has shape [3, 3, 3] — one 3×3 kernel per input channel
filter_0 = K[0]  # Shape: [3, 3, 3]
# Element-wise multiply and sum over channels and positions
activation = (patch * filter_0).sum() + bias[0]
# Apply the nonlinearity
Y[0, 4, 4] = torch.relu(activation)
Repeat for all 64 filters at all 8×8 positions → Output [64, 8, 8]
Computational Complexity:
For input [C_in, H, W] and output [C_out, H', W'] with k×k kernels:
$$\text{FLOPs} = 2 \cdot C_{out} \cdot H' \cdot W' \cdot C_{in} \cdot k^2$$
(The factor of 2 counts multiply + add separately)
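As a quick sanity check, a small helper (a hypothetical convenience function, not part of any library) plugs the Conv1 layer from the table above into this formula.

def conv2d_flops(c_in, c_out, h_out, w_out, k):
    """FLOPs for a k×k convolution, counting multiplies and adds separately."""
    return 2 * c_out * h_out * w_out * c_in * k * k

# Conv1 from the table: 3 -> 64 channels, 112×112 output, 3×3 kernels
print(f"{conv2d_flops(3, 64, 112, 112, 3):,} FLOPs")  # 43,352,064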
import torch
import torch.nn as nn
import torch.nn.functional as F

def manual_conv2d(x, weight, bias, padding=0):
    """
    Manual implementation of 2D convolution to show feature map computation.

    Args:
        x: Input tensor [C_in, H, W]
        weight: Filter weights [C_out, C_in, k, k]
        bias: Bias vector [C_out]
        padding: Zero padding
    Returns:
        Feature map tensor [C_out, H', W']
    """
    c_in, h, w = x.shape
    c_out, _, k, _ = weight.shape

    # Apply padding
    if padding > 0:
        x = F.pad(x, (padding, padding, padding, padding))
    _, h_pad, w_pad = x.shape

    h_out = h_pad - k + 1
    w_out = w_pad - k + 1

    # Initialize output
    output = torch.zeros(c_out, h_out, w_out)

    # Compute each output position
    for cout in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                # Extract patch from ALL input channels
                patch = x[:, i:i+k, j:j+k]  # [C_in, k, k]
                # Dot product with this output channel's filter
                # Filter has shape [C_in, k, k]
                output[cout, i, j] = (patch * weight[cout]).sum() + bias[cout]

    return output

# Example
x = torch.randn(3, 8, 8)  # RGB 8x8 image
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)

# Our manual implementation
manual_out = manual_conv2d(x, conv.weight, conv.bias, padding=1)

# PyTorch implementation
pytorch_out = conv(x.unsqueeze(0)).squeeze(0)

# Verify they match
print(f"Max difference: {(manual_out - pytorch_out).abs().max().item():.6f}")
print(f"Output shape: {manual_out.shape}")  # [64, 8, 8]

# Output:
# Max difference: 0.000000
# Output shape: torch.Size([64, 8, 8])

Feature maps aren't just mathematical objects—they encode meaningful visual information. Understanding how to interpret them is crucial for model understanding and debugging.
What Feature Maps Represent:
Each feature map is a spatial detector for a particular pattern:
Layer-by-Layer Interpretation:
Early Layers (Conv1-2):
Middle Layers (Conv3-4):
Deep Layers (Conv5+):
ReLU makes feature maps sparse by zeroing negative values. This sparsity has interpretation: a ReLU feature map shows where patterns are present (positive activation) and where they're absent or contradicted (zero). Without ReLU, feature maps would have both positive and negative values, making interpretation harder.
Visualization Techniques:
1. Direct Visualization
Display feature map activations as grayscale images:
import matplotlib.pyplot as plt

def visualize_feature_maps(feature_maps, num_show=16):
    """
    Display the first `num_show` channels of a [C, H, W] feature map tensor.
    """
    fig, axes = plt.subplots(4, 4, figsize=(12, 12))
    for i, ax in enumerate(axes.flat):
        if i < min(num_show, feature_maps.shape[0]):
            ax.imshow(feature_maps[i].cpu().numpy(), cmap='viridis')
            ax.set_title(f'Channel {i}')
        ax.axis('off')  # hide axes, including any unused panels
    plt.tight_layout()
    plt.show()
2. Activation Maximization
Generate input patterns that maximally activate a specific feature, typically by gradient ascent on the input image.
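A minimal sketch of this idea, assuming a pretrained ResNet-18; the target layer (`layer3[0].conv1`) and channel index (47) are arbitrary illustrative choices.

import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input image will be optimized

activation = {}
def hook(module, inp, out):
    activation['map'] = out

handle = model.layer3[0].conv1.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.1)

for step in range(100):
    optimizer.zero_grad()
    model(img)
    # Negative mean activation of the target channel -> gradient ascent
    loss = -activation['map'][0, 47].mean()
    loss.backward()
    optimizer.step()

handle.remove()
# `img` now contains a pattern that strongly drives channel 47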
3. Occlusion Sensitivity
Slide an occluding patch across the input and record how activations (or the class score) change; regions whose occlusion causes a large drop are the ones the feature depends on.
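A sketch of the procedure, assuming a classifier `model` and a preprocessed input `image` of shape [1, 3, H, W]; the patch size and stride are arbitrary.

import torch

def occlusion_sensitivity(model, image, target_class, patch=32, stride=16):
    """Slide a zero patch over the image and record the drop in class score."""
    model.eval()
    _, _, H, W = image.shape
    with torch.no_grad():
        baseline = model(image)[0, target_class].item()
        heatmap = torch.zeros((H - patch) // stride + 1, (W - patch) // stride + 1)
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, :, y:y+patch, x:x+patch] = 0.0  # occlude this region
                score = model(occluded)[0, target_class].item()
                heatmap[i, j] = baseline - score  # large drop => important region
    return heatmap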
4. Gradient-based Attribution (Grad-CAM)
Weight feature maps by gradient of class score: $$L^c = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$
where $\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}$
Creates class-specific attention maps showing which regions matter for a prediction.
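A compact sketch following this formula, hooking the last convolutional stage of a pretrained ResNet-18 (the layer choice is an assumption; any late conv layer can be used).

import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(pretrained=True).eval()

feats = {}
def fwd_hook(module, inp, out):
    out.retain_grad()        # keep the gradient w.r.t. this feature map
    feats['A'] = out

handle = model.layer4.register_forward_hook(fwd_hook)  # last conv stage: [1, 512, 7, 7]

image = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
logits = model(image)
class_idx = logits.argmax(dim=1).item()
logits[0, class_idx].backward()

A, dA = feats['A'], feats['A'].grad
alpha = dA.mean(dim=(2, 3), keepdim=True)              # alpha_k^c: pooled gradients
cam = F.relu((alpha * A).sum(dim=1, keepdim=True))     # [1, 1, 7, 7]
cam = F.interpolate(cam, size=image.shape[2:], mode='bilinear', align_corners=False)

handle.remove()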
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torchvision import models, transforms

def extract_feature_maps(model, input_tensor, layer_name):
    """
    Extract feature maps from a specific layer.
    """
    feature_maps = {}

    def hook(module, input, output):
        feature_maps['output'] = output.detach()

    # Register hook on target layer
    for name, module in model.named_modules():
        if name == layer_name:
            handle = module.register_forward_hook(hook)
            break

    # Forward pass
    with torch.no_grad():
        model(input_tensor)

    handle.remove()
    return feature_maps['output']

def visualize_top_activating_channels(feature_maps, top_k=8):
    """
    Visualize channels with highest mean activation.
    """
    # feature_maps: [1, C, H, W]
    maps = feature_maps.squeeze(0)  # [C, H, W]

    # Find top-k channels by mean activation
    mean_activations = maps.mean(dim=(1, 2))  # [C]
    top_indices = mean_activations.argsort(descending=True)[:top_k]

    fig, axes = plt.subplots(2, top_k // 2, figsize=(15, 6))
    for i, ax in enumerate(axes.flat):
        idx = top_indices[i]
        ax.imshow(maps[idx].cpu().numpy(), cmap='viridis')
        ax.set_title(f'Ch {idx}: mean={mean_activations[idx]:.2f}')
        ax.axis('off')
    plt.suptitle('Top Activating Feature Channels')
    return fig

def compute_channel_correlation(feature_maps):
    """
    Compute correlation matrix between feature map channels.
    Shows which features co-activate.
    """
    # feature_maps: [1, C, H, W]
    maps = feature_maps.squeeze(0)  # [C, H, W]
    C = maps.shape[0]

    # Flatten spatial dimensions
    maps_flat = maps.view(C, -1)  # [C, H*W]

    # Compute correlation matrix
    maps_centered = maps_flat - maps_flat.mean(dim=1, keepdim=True)
    maps_std = maps_flat.std(dim=1, keepdim=True) + 1e-8
    maps_normalized = maps_centered / maps_std
    correlation = torch.mm(maps_normalized, maps_normalized.t()) / maps_flat.shape[1]

    return correlation.cpu().numpy()

# Example usage with pretrained ResNet
model = models.resnet18(pretrained=True)
model.eval()

# Create sample input
sample_input = torch.randn(1, 3, 224, 224)

# Extract from different layers
for layer in ['layer1.0.conv1', 'layer2.0.conv1', 'layer3.0.conv1', 'layer4.0.conv1']:
    try:
        fmaps = extract_feature_maps(model, sample_input, layer)
        print(f"{layer}: {fmaps.shape}")
    except:
        print(f"{layer}: layer not found")

Feature maps exhibit interesting statistical properties—sparsity and redundancy—that have profound implications for network compression and understanding.
Sparsity:
After ReLU, feature maps are typically sparse: a large fraction of the values are exactly zero.
Why Sparsity Matters: zero activations can be skipped by sparse-aware hardware and compressed cheaply, and they simplify interpretation, since a zero simply means 'this pattern is not present here.'
Measuring Sparsity:
def measure_sparsity(feature_maps, threshold=0.0):
"""
Compute fraction of near-zero values in feature maps.
"""
total_values = feature_maps.numel()
sparse_values = (feature_maps.abs() <= threshold).sum().item()
return sparse_values / total_values
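For example, hooking this helper onto the ReLU outputs of a pretrained ResNet-18 (a quick sketch; exact numbers vary with the model and input).

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True).eval()

results, handles = [], []
for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        handles.append(module.register_forward_hook(
            lambda m, inp, out, n=name: results.append((n, measure_sparsity(out)))
        ))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
for h in handles:
    h.remove()

for name, sparsity in results[:5]:
    print(f"{name}: {sparsity:.1%} zeros")  # commonly around half the values are zero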
Recent research suggests that networks represent more features than they have neurons by using sparse, distributed codes. A feature map with 512 channels might encode thousands of features via combinations of channels. This 'superposition' explains why networks can be compressed without losing capability.
Redundancy:
Despite having hundreds of channels, feature maps often contain redundant information:
Intra-Channel Redundancy: Spatially adjacent positions often have similar values (smoothness)
Inter-Channel Redundancy: Many channels are highly correlated—they detect similar features
Exploiting Redundancy:
1. Channel Pruning
Remove channels that are highly correlated with other channels or that rarely activate, as in the sketch below.
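A rough sketch of activation-based channel selection (a simple heuristic, not a full pruning pipeline): rank channels by their mean activation over a calibration batch and keep the top fraction.

import torch

def select_channels_to_keep(feature_maps, keep_ratio=0.75):
    """
    feature_maps: [N, C, H, W] activations collected from a calibration batch.
    Returns indices of the channels with the highest mean activation.
    """
    mean_act = feature_maps.mean(dim=(0, 2, 3))           # [C] mean per channel
    num_keep = max(1, int(keep_ratio * mean_act.numel()))
    return mean_act.argsort(descending=True)[:num_keep]

# Example: keep 75% of the channels of a hypothetical 64-channel layer
acts = torch.relu(torch.randn(8, 64, 56, 56))
keep = select_channels_to_keep(acts, keep_ratio=0.75)
print(f"Keeping {len(keep)} of 64 channels")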
2. Low-Rank Factorization
Approximate weight matrices with lower-rank versions: $$W \approx UV^T$$
For a [C_out, C_in, k, k] kernel, factorize into two smaller convolutions.
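One concrete realization (a spatially separable factorization, one of several possible schemes) replaces a 3×3 convolution with a 3×1 followed by a 1×3.

import torch
import torch.nn as nn

c_in, c_out = 256, 256

full = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)           # 256*256*9 weights
factored = nn.Sequential(
    nn.Conv2d(c_in, c_out, kernel_size=(3, 1), padding=(1, 0)),   # 256*256*3 weights
    nn.Conv2d(c_out, c_out, kernel_size=(1, 3), padding=(0, 1)),  # 256*256*3 weights
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, c_in, 28, 28)
print(full(x).shape, factored(x).shape)          # same output shape
print(n_params(full), "vs", n_params(factored))  # roughly one third fewer parameters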
3. Quantization
Reduce the precision of feature map values, for example from 32-bit floats to 8-bit integers.
Sparse, redundant feature maps tolerate aggressive quantization.
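As an illustration, a quick sketch quantizes a post-ReLU feature map to 8-bit integers and measures the round-trip error (a simplified affine scheme, not a full quantization pipeline).

import torch

fmap = torch.relu(torch.randn(1, 64, 56, 56))   # sparse post-ReLU activations

# Simple affine quantization to uint8
scale = fmap.max() / 255.0
q = torch.clamp((fmap / scale).round(), 0, 255).to(torch.uint8)
deq = q.float() * scale

err = (fmap - deq).abs().max()
print(f"Max round-trip error: {err:.4f} (scale = {scale:.4f})")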
How many feature maps (channels) should each layer have? This fundamental design question affects capacity, computation, and performance.
Traditional Approach: Doubling Pattern
VGG and ResNet popularized the pattern of doubling the channel count each time the spatial resolution is halved: 64 → 128 → 256 → 512.
Rationale: as resolution shrinks, each position summarizes a larger region of the image, so more channels are needed to describe the richer feature combinations present there.
Mathematical Justification:
For input [C, H, W] and output [2C, H/2, W/2], the total number of activations goes from $C \cdot H \cdot W$ to $2C \cdot \tfrac{H}{2} \cdot \tfrac{W}{2} = \tfrac{1}{2} C \cdot H \cdot W$, so the representation shrinks by only 2× per stage rather than the 4× that spatial downsampling alone would cause.
The doubling compensates for the spatial reduction, maintaining representational capacity.
Neural Architecture Search (NAS) has found that the doubling pattern isn't always optimal. EfficientNet uses compound scaling that increases depth, width, and resolution together. ConvNeXt uses a 96 → 192 → 384 → 768 progression (still 2× doubling, but starting from a wider 96-channel stem rather than 64). The optimal pattern depends on the task and compute budget.
Factors Affecting Optimal Channel Count:
1. Task Complexity
2. Input Resolution
3. Depth
4. Computational Budget
Width Multiplier Pattern:
MobileNet introduced width multipliers:
base_channels = [32, 64, 128, 256, 512]
width_mult = 0.75 # 75% of base width
actual_channels = [int(c * width_mult) for c in base_channels]
# Result: [24, 48, 96, 192, 384]
This allows trading accuracy for efficiency uniformly across the network.
| Architecture | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 | Pattern |
|---|---|---|---|---|---|---|
| VGG-16 | 64 | 128 | 256 | 512 | 512 | 2× doubling |
| ResNet-50 | 64 | 256 | 512 | 1024 | 2048 | 2× + bottleneck |
| MobileNet V2 | 32 | 24 | 32 | 64 | 320 | Inverted bottleneck |
| EfficientNet-B0 | 32 | 24 | 40 | 80 | 192 | Compound scaling |
| ConvNeXt-T | 96 | 192 | 384 | 768 | — | 2× doubling |
| Swin-T | 96 | 192 | 384 | 768 | — | 2× doubling |
Feature maps are the medium through which information flows through a CNN. Understanding this flow reveals how representations are built and transformed.
Information Transformation:
As data flows through a CNN, feature maps undergo systematic transformations:
Spatial Transformation: resolution shrinks stage by stage (e.g., 224 → 112 → 56 → 28 → 14 → 7), while each position's receptive field grows to cover more of the image.
Channel Transformation: the channel count grows (e.g., 3 → 64 → 128 → 256 → 512), so each position is described by an increasingly rich feature vector.
Information Content: the total number of activations typically decreases (compare the table earlier), so the network progressively compresses the input, discarding task-irrelevant detail.
Bottleneck Analysis:
The layer with minimum representation size (often the final conv layer) is the information bottleneck:
Input: 224×224×3 = 150,528 values
Conv5: 7×7×512 = 25,088 values
Global Pool: 1×1×512 = 512 values ← Bottleneck
FC: 1000 values (class logits)
The 512-dimensional representation must contain all information needed for classification—a compression factor of ~300×.
According to Information Bottleneck theory, networks learn to compress input information while preserving task-relevant information. Feature maps represent this compressed encoding—discarding irrelevant details (background, exact textures) while retaining relevant ones (object categories, spatial relationships).
Skip Connections and Information Flow:
ResNets demonstrated that skip connections fundamentally improve information flow:
Without Skips:
Input → Conv → ReLU → Conv → ReLU → ... → Output
Information must flow through all layers sequentially.
Gradients must backprop through all layers.
Deep networks suffer gradient vanishing.
With Skips:
Input ─┬─▶ Conv → ReLU → Conv ─┬─▶ + → Output
└────────────────────────┘
Information can bypass convolutional layers.
Gradients have direct paths to early layers.
Even 100+ layer networks train well.
Feature Map Analysis with Skips:
In ResNets, feature maps at layer L contain the identity-mapped features carried forward from earlier layers plus the residual that the current block learns: roughly $x_L = x_{L-1} + F(x_{L-1})$.
This explains why deeper ResNets work: each layer needs only to learn what to ADD to existing features, not replace them entirely.
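A minimal residual block sketch makes this additive structure explicit: the output feature map is literally the input feature map plus a learned correction.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = input + learned residual."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + residual)   # earlier feature maps pass through unchanged

x = torch.randn(1, 64, 56, 56)
block = ResidualBlock(64)
print(block(x).shape)  # [1, 64, 56, 56] — same shape, refined features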
import torch
import torch.nn as nn
import numpy as np

def analyze_feature_statistics(model, sample_input, num_layers=10):
    """
    Analyze feature map statistics through the network.
    """
    stats = []

    def hook_fn(name):
        def hook(module, input, output):
            if isinstance(output, torch.Tensor) and len(output.shape) == 4:
                o = output.detach()
                stats.append({
                    'name': name,
                    'shape': list(o.shape),
                    'mean': o.mean().item(),
                    'std': o.std().item(),
                    'sparsity': (o == 0).float().mean().item(),
                    'spatial_size': o.shape[2] * o.shape[3],
                    'channels': o.shape[1],
                    'total_values': o.numel(),
                })
        return hook

    # Register hooks
    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.BatchNorm2d, nn.ReLU, nn.MaxPool2d)):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    # Forward pass
    with torch.no_grad():
        model(sample_input)

    # Remove hooks
    for h in hooks:
        h.remove()

    return stats

def visualize_information_flow(stats):
    """
    Print analysis of how information flows through layers.
    """
    print("Feature Map Analysis:")
    print("=" * 80)
    print(f"{'Layer':<25} {'Shape':>15} {'Mean':>8} {'Std':>8} {'Sparsity':>10}")
    print("-" * 80)

    for s in stats[:20]:  # Show first 20 layers
        shape_str = f"{s['channels']}×{int(np.sqrt(s['spatial_size']))}²"
        print(f"{s['name'][:24]:<25} {shape_str:>15} "
              f"{s['mean']:>8.3f} {s['std']:>8.3f} {s['sparsity']:>9.1%}")

    # Compute information metrics
    print("\n" + "=" * 80)
    print("Information Summary:")
    print("-" * 80)

    if len(stats) > 0:
        initial_size = stats[0]['total_values']
        final_size = stats[-1]['total_values']
        print(f"Initial representation: {initial_size:,} values")
        print(f"Final representation: {final_size:,} values")
        print(f"Compression ratio: {initial_size / final_size:.1f}x")

# Example with a small model
from torchvision import models
model = models.resnet18(pretrained=True)
model.eval()

sample = torch.randn(1, 3, 224, 224)
stats = analyze_feature_statistics(model, sample)
visualize_information_flow(stats)

Perhaps the most fascinating aspect of feature maps is their emergent semantic content—they learn to represent meaningful visual concepts without explicit supervision.
Emergence of Semantic Features:
Studies visualizing individual channels reveal semantic specificity:
| Layer | Example Channels | What They Detect |
|---|---|---|
| Conv1 | Various | Edges, colors (generic) |
| Conv3 | Channel 12 | Wheel-like patterns |
| Conv3 | Channel 47 | Fur/hair textures |
| Conv4 | Channel 83 | Faces |
| Conv4 | Channel 156 | Text characters |
| Conv5 | Channel 201 | Dogs |
| Conv5 | Channel 276 | Buildings |
Network Dissection:
The Network Dissection framework (Bau et al., 2017) systematically identifies what each channel detects: each channel's activation map is thresholded and compared against pixel-level concept annotations, and channels whose activations consistently overlap a labeled concept (measured by IoU) are counted as detectors for that concept.
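The core measurement is an IoU between a thresholded, upsampled channel activation and a labeled concept mask; the sketch below is a simplified version (the actual framework uses the Broden dataset and per-channel activation quantiles as thresholds).

import torch
import torch.nn.functional as F

def channel_concept_iou(channel_map, concept_mask, threshold):
    """
    channel_map:  [h, w] activations of one channel for one image
    concept_mask: [H, W] binary mask of a labeled concept (e.g., 'wheel')
    threshold:    activation threshold (Network Dissection uses a per-channel quantile)
    """
    # Upsample the activation map to the mask resolution, then binarize
    act = F.interpolate(channel_map[None, None], size=concept_mask.shape,
                        mode='bilinear', align_corners=False)[0, 0]
    act_mask = act > threshold
    intersection = (act_mask & concept_mask).sum().item()
    union = (act_mask | concept_mask).sum().item()
    return intersection / union if union > 0 else 0.0

# Toy example with random data
iou = channel_concept_iou(torch.rand(7, 7), torch.rand(224, 224) > 0.8, threshold=0.5)
print(f"IoU: {iou:.3f}")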
Networks spontaneously learn to detect high-level concepts like faces, wheels, and even specific objects—without ever being trained on these labels. A network trained only for scene classification develops face detectors because faces are useful for identifying indoor scenes. This emergence of semantic features is a remarkable property of deep learning.
Hierarchical Feature Semantics:
Layer 1: Color and Edge
├── Ch 1: Horizontal edge
├── Ch 2: Vertical edge
├── Ch 3: Red color
└── Ch 4: Blue color
Layer 3: Texture and Pattern
├── Ch 12: Honeycomb texture
├── Ch 47: Striped pattern
├── Ch 83: Dotted pattern
└── Ch 156: Grid pattern
Layer 5: Object and Part
├── Ch 201: Dog heads
├── Ch 276: Buildings
├── Ch 312: Wheels
└── Ch 445: Eyes
Implications for Transfer Learning:
The semantic content of feature maps explains why transfer learning works: early layers learn generic features (edges, colors, textures) that are useful for almost any visual task, while deeper layers learn increasingly task-specific semantics.
Best Practice: reuse the pretrained early and middle layers, and fine-tune or replace the later layers and the classifier head; the smaller the target dataset, the more layers should stay frozen.
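A common recipe, sketched below under the assumption of a ResNet-18 backbone and a hypothetical 10-class target task: freeze the early stages and train only the last stage plus a new classifier head.

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# Freeze everything, then unfreeze the last stage
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True

# New head for a hypothetical 10-class task (trainable by default)
model.fc = nn.Linear(model.fc.in_features, 10)

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"Training {sum(p.numel() for p in trainable):,} of "
      f"{sum(p.numel() for p in model.parameters()):,} parameters")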
Feature maps are the intermediate representations that give CNNs their power. They transform raw pixel data into rich, hierarchical, semantically meaningful descriptions that enable visual understanding.
Connection to Next Topic:
So far we've treated each channel as detecting patterns from the previous layer's features. But what happens when we have multiple input channels, like RGB images or multi-channel feature maps from previous layers? The next page explores multiple channels—how convolution operates across channels, the distinction between input and output channels, and techniques like depthwise separable convolutions that exploit channel structure for efficiency.
You now understand feature maps—the rich intermediate representations that encode what a CNN has learned to see. You've explored their structure, computation, interpretation through visualization, sparsity properties, design principles for channel counts, and the emergence of semantic meaning. Next, we'll examine how multiple input and output channels work together in convolutional layers.