Consider a profound question: When you shift an image of a cat two pixels to the right, what should happen to the features detected by a neural network?
With a fully-connected network, the answer is unpredictable. Shifting pixels rearranges the input vector, potentially activating completely different neurons. A network trained to recognize cats at the image center might fail catastrophically when the same cat appears at the corner.
With a convolutional network, something elegant happens: the detected features shift by exactly the same amount as the input. A cat detector that fires at position (100, 100) for a cat centered there will fire at position (102, 100) when that cat shifts right by two pixels. This property—where output features shift in perfect correspondence with input shifts—is called translation equivariance.
Translation equivariance isn't an accident or an optimization; it's a mathematical guarantee that emerges directly from parameter sharing. Understanding this property is essential for understanding why convolutional networks are so effective for spatial data.
This page provides a rigorous treatment of translation equivariance. You will understand the formal mathematical definition, prove why convolution is equivariant, distinguish equivariance from invariance, explore connections to group theory and geometric deep learning, and examine practical implications for CNN design and generalization.
Let's begin with precise mathematical definitions.
Translation Operator:
Define the translation operator T_τ that shifts a function (or image) by displacement vector τ = (τₓ, τᵧ):
$$[T_\tau f](x, y) = f(x - \tau_x, y - \tau_y)$$
This shifts the function in the positive direction by τ. If f has a peak at (10, 10), then T_τf with τ = (3, 5) has a peak at (13, 15).
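As a concrete illustration, the sketch below applies a discrete shift with torch.roll (a circular shift used purely for convenience; boundary handling is ignored). The image size and shift values are arbitrary choices for the example.

```python
import torch

# Discrete translation: (T_tau f)[y, x] = f[y - tau_y, x - tau_x]
f = torch.zeros(20, 20)
f[10, 10] = 1.0                                   # peak at (x, y) = (10, 10)

tau_x, tau_y = 3, 5                               # shift right by 3, down by 5
shifted = torch.roll(f, shifts=(tau_y, tau_x), dims=(0, 1))

y, x = divmod(shifted.argmax().item(), 20)
print(x, y)                                       # 13 15 -> the peak moved to (13, 15)
```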
Equivariance Definition:
A function (or layer) Φ is equivariant to a transformation T if applying T before Φ gives the same result as applying T after Φ:
$$\Phi(T_\tau[f]) = T_\tau[\Phi(f)]$$
In words: transforming the input then processing equals processing then transforming the output. The transformation 'commutes' with the function.
Visualizing Equivariance:
                Φ
    f ──────────────────▶ Φ(f)
    │                       │
 T_τ│                       │T_τ
    ▼                       ▼
 T_τ[f] ────────────────▶ T_τ[Φ(f)] = Φ(T_τ[f])
                Φ
Both paths lead to the same result!
Equivariance and invariance are often confused. Invariance means the output doesn't change under transformation: Φ(T[f]) = Φ(f). Equivariance means the output transforms in the same way: Φ(T[f]) = T[Φ(f)]. Individual conv layers are equivariant; the full network with pooling achieves partial invariance.
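The distinction can be checked numerically. The sketch below is a minimal illustration with random weights and an arbitrary shift (the margin used to ignore boundary effects is also an arbitrary choice): a convolution's feature map shifts with the input, while a global-average-pooled summary is nearly unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)
gap = nn.AdaptiveAvgPool2d(1)                        # global average pooling

x = torch.randn(1, 1, 32, 32)
x_shift = torch.roll(x, shifts=(2, 3), dims=(2, 3))  # shift down 2, right 3

# Equivariance: conv features shift along with the input (compare away from borders)
feat, feat_shift = conv(x), conv(x_shift)
diff = (torch.roll(feat, shifts=(2, 3), dims=(2, 3)) - feat_shift)[:, :, 6:-6, 6:-6]
print(diff.abs().max().item())                       # ~0: features shifted in lockstep

# Invariance: the globally pooled summary barely changes under the shift
print((gap(feat) - gap(feat_shift)).abs().max().item())  # small: approximately invariant
```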
Why Equivariance Matters:
1. Efficient Generalization
If a network is translation equivariant, then learning to detect a feature at one location automatically provides detection at all locations. A single training example of a cat at position A teaches the network about cats at every position.
2. Predictable Feature Behavior
Equivariance guarantees how features transform. We can reason about what a layer does mathematically, not just empirically.
3. Robust Recognition
Objects in real images appear at arbitrary positions. Equivariant representations ensure consistent recognition regardless of position.
4. Meaningful Feature Maps
With equivariance, feature map positions have semantic meaning—activations correspond to spatial locations in the input.
| Property | Definition | Where Used | Effect |
|---|---|---|---|
| Translation Equivariance | Φ(Translate(x)) = Translate(Φ(x)) | Convolutional layers | Features shift with input |
| Translation Invariance | Φ(Translate(x)) = Φ(x) | Global pooling, final classifier | Output unchanged by shift |
| Rotation Equivariance | Φ(Rotate(x)) = Rotate(Φ(x)) | Specialized architectures (Group CNNs) | Features rotate with input |
| Scale Equivariance | Φ(Scale(x)) = Scale(Φ(x)) | Scale-space networks, FPN | Features scale with input |
Let's rigorously prove that discrete 2D convolution is translation equivariant.
Setup:
Let f: ℤ² → ℝ be an input image and k: ℤ² → ℝ be a convolution kernel. The convolution is defined as:
$$(f * k)[i, j] = \sum_{m} \sum_{n} f[m, n] \cdot k[i-m, j-n]$$
Alternatively, using the cross-correlation convention common in deep learning:
$$(f \star k)[i, j] = \sum_{m} \sum_{n} f[i+m, j+n] \cdot k[m, n]$$
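A note on conventions: deep-learning libraries implement the cross-correlation form. The short sketch below (arbitrary random tensors) shows that PyTorch's F.conv2d does not flip the kernel, and that flipping it recovers the textbook convolution:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 6, 6)
k = torch.randn(1, 1, 3, 3)

# Deep-learning "convolution" (F.conv2d) is the cross-correlation: the kernel is not flipped
cross_corr = F.conv2d(x, k)

# The textbook convolution f * k is recovered by flipping the kernel in both spatial dims
true_conv = F.conv2d(x, torch.flip(k, dims=(2, 3)))

print(torch.allclose(cross_corr, true_conv))  # usually False: the conventions differ unless k is symmetric
```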
The Theorem:
For any translation τ = (τₓ, τᵧ):
$$(T_\tau f) * k = T_\tau[f * k]$$
That is, convolving a translated input yields the translated output of convolving the original input. The same identity holds for the cross-correlation ⋆.
Proof:
Let g = f * k be the convolution output. We need to show that translating f before convolving gives the same result as translating g.
The proof hinges on regrouping the indices inside the summation. Because the same kernel is applied at every position (parameter sharing), translating the input simply relabels where the kernel 'sees' each pattern, producing an output translated by the same amount.
Detailed Proof:
Start with the left side, $(T_\tau f) \star k$, at position (i, j), using the cross-correlation convention (the argument for the flipped-kernel convolution is identical):
$$[(T_\tau f) \star k][i, j] = \sum_{m} \sum_{n} (T_\tau f)[i+m, j+n] \cdot k[m, n]$$
By definition of translation: $$(T_\tau f)[i+m, j+n] = f[i+m-\tau_x, j+n-\tau_y]$$
Substituting: $$= \sum_{m} \sum_{n} f[i+m-\tau_x, j+n-\tau_y] \cdot k[m, n]$$
Regroup the arguments, using $i + m - \tau_x = (i - \tau_x) + m$ and $j + n - \tau_y = (j - \tau_y) + n$: $$= \sum_{m} \sum_{n} f[(i-\tau_x)+m, (j-\tau_y)+n] \cdot k[m, n]$$
This is exactly the cross-correlation evaluated at (i-τₓ, j-τᵧ): $$= (f \star k)[i-\tau_x, j-\tau_y]$$
By definition of translation: $$= (T_\tau[f \star k])[i, j]$$
QED: $(T_\tau f) \star k = T_\tau[f \star k]$ ∎
The Key Insight:
Equivariance emerges from the homogeneity of convolution—the same operation is applied everywhere. Parameter sharing guarantees this homogeneity, making equivariance a mathematical certainty, not an empirical observation.
```python
import torch
import torch.nn.functional as F

def verify_translation_equivariance():
    """
    Numerically verify that convolution is translation equivariant.
    """
    # Create a simple input image (batch=1, channels=1, H=10, W=10)
    x = torch.randn(1, 1, 10, 10)

    # Create a kernel
    kernel = torch.randn(1, 1, 3, 3)

    # Define translation (shift right by 2, down by 1)
    tau_x, tau_y = 2, 1

    # Method 1: Translate input, then convolve
    # Use a larger input to avoid boundary effects
    x_padded = F.pad(x, (2, 2, 2, 2))  # Add padding

    # Translate via roll (circular shift for demonstration)
    x_translated = torch.roll(x_padded, shifts=(tau_y, tau_x), dims=(2, 3))
    output_method1 = F.conv2d(x_translated, kernel, padding=0)

    # Method 2: Convolve, then translate output
    output_original = F.conv2d(x_padded, kernel, padding=0)
    output_method2 = torch.roll(output_original, shifts=(tau_y, tau_x), dims=(2, 3))

    # Compare (within numerical precision)
    # They should be equal in the valid region, away from the wrap-around seam
    valid_region_1 = output_method1[:, :, tau_y:, tau_x:]
    valid_region_2 = output_method2[:, :, tau_y:, tau_x:]

    max_diff = (valid_region_1 - valid_region_2).abs().max().item()
    print(f"Max difference between methods: {max_diff:.2e}")
    print(f"Equivariance holds: {max_diff < 1e-6}")

    return max_diff < 1e-6

# Verify
result = verify_translation_equivariance()
print(f"\nConvolution is translation equivariant: {result}")

# Output:
# Max difference between methods: ~0.00e+00
# Equivariance holds: True
# Convolution is translation equivariant: True
```

Translation equivariance is best understood through the lens of group theory—the mathematical study of symmetry. This perspective unifies CNNs with a broader family of equivariant architectures.
The Translation Group:
The set of all 2D translations forms a group (ℝ², +) where:
- composing two translations gives another translation, τ₁ + τ₂,
- the identity element is the zero translation (0, 0),
- every translation τ has an inverse −τ,
- composition is associative.
This is an abelian (commutative) group: τ₁ + τ₂ = τ₂ + τ₁
Group Actions:
The translation group acts on the space of images. Each translation τ defines a transformation T_τ of images: $$T_\tau: f \mapsto T_\tau[f]$$
where $(T_\tau f)(x) = f(x - \tau)$.
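To see that these operators compose like the group itself (a homomorphism property), apply two translations in sequence:

$$\big(T_{\tau_1}[T_{\tau_2} f]\big)(x) = (T_{\tau_2} f)(x - \tau_1) = f(x - \tau_1 - \tau_2) = \big(T_{\tau_1 + \tau_2} f\big)(x)$$

so $T_{\tau_1} \circ T_{\tau_2} = T_{\tau_1 + \tau_2}$.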
Equivariant Maps:
A function Φ is G-equivariant if it commutes with all group actions: $$\Phi(T_g[f]) = T_g[\Phi(f)] \quad \forall g \in G$$
Convolution is equivariant to the translation group.
Group theory provides a principled framework for designing architectures with built-in symmetries. Beyond translations, we can design networks equivariant to rotations (SO(2)), reflections (O(2)), 3D rotations (SO(3)), or arbitrary Lie groups. This is the foundation of Geometric Deep Learning.
Extending Beyond Translations:
The group-theoretic view suggests natural extensions:
1. Rotation Equivariance
Replace the translation group with the rotation group SO(2). Group-equivariant CNNs (G-CNNs) achieve rotation equivariance by convolving with rotated copies of filters:
$$[f * k]_\theta = \int f(r_{-\theta}(x)) \, k(x) \, dx$$
where r_θ is rotation by angle θ. (A minimal code sketch of this idea for 90° rotations appears after this list.)
2. Scale Equivariance
The multiplicative group (ℝ⁺, ×) of positive scalings. Scale-equivariant networks process images at multiple scales with shared weights.
3. Affine and Projective Equivariance
More complex groups handling perspective transformations, useful for 3D vision tasks.
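As a concrete illustration of the rotation case above, here is a minimal sketch of a lifting correlation for the four 90° rotations (the cyclic group C₄), which avoids any interpolation. The function name and tensor shapes are illustrative assumptions, not a full G-CNN implementation.

```python
import torch
import torch.nn.functional as F

def c4_lifting_conv(x, kernel):
    """Correlate the input with the four rotated copies of a filter bank.

    x:      (N, C, H, W) input
    kernel: (C_out, C, kH, kW) base filters
    Returns (N, 4, C_out, H', W'), one group of channels per rotation
    angle in {0°, 90°, 180°, 270°}.
    """
    outputs = []
    for r in range(4):
        k_rot = torch.rot90(kernel, k=r, dims=(2, 3))  # rotate filters by r * 90°
        outputs.append(F.conv2d(x, k_rot))
    return torch.stack(outputs, dim=1)

x = torch.randn(1, 3, 16, 16)
kernel = torch.randn(8, 3, 3, 3)
print(c4_lifting_conv(x, kernel).shape)  # torch.Size([1, 4, 8, 14, 14])
```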
The Group Convolution Theorem:
A linear map Φ: L²(G) → L²(G) is G-equivariant if and only if it can be expressed as convolution on the group:
$$\Phi[f] = f * k$$
for some kernel k. This establishes convolution as the unique linear equivariant operation!
| Group | Symmetry | Architecture | Application |
|---|---|---|---|
| (ℝ², +) | 2D Translation | Standard CNN | Image recognition |
| SO(2) | 2D Rotation | G-CNN, Harmonic Networks | Medical imaging, satellite |
| SE(2) | Translation + Rotation | SE(2)-CNN | Robotics, autonomous driving |
| SO(3) | 3D Rotation | Spherical CNN | Molecular modeling, 3D shapes |
| E(3) | 3D Euclidean | EGNN, SchNet | Physics simulation, chemistry |
| S_n | Permutation | Message Passing GNN | Graphs, sets, point clouds |
A complete CNN consists of multiple layers. Let's trace how equivariance propagates and where it's intentionally broken.
Layer-by-Layer Analysis:
1. Convolutional Layers: Equivariant ✓
As proven, convolution is translation equivariant. Stacking multiple conv layers preserves equivariance: $$\Phi_2(\Phi_1(T_\tau[x])) = \Phi_2(T_\tau[\Phi_1(x)]) = T_\tau[\Phi_2(\Phi_1(x))]$$
2. Pointwise Nonlinearities (ReLU, etc.): Equivariant ✓
Activation functions applied elementwise preserve equivariance: $$\sigma(T_\tau[f])[i,j] = \sigma(f[i-\tau_x, j-\tau_y]) = T_\tau[\sigma(f)][i,j]$$
3. Batch/Layer Normalization: Approximately Equivariant ~
Normalization statistics are computed over spatial positions (and, for batch norm, over the batch), which is itself insensitive to shifts. Boundary effects and running statistics break equivariance slightly, but the effect is minor.
4. Pooling Layers: Break Equivariance ✗
Max pooling and average pooling over windows break strict equivariance but introduce desired invariance to small translations.
Pooling layers downsample the feature map, which breaks strict translation equivariance. A 1-pixel shift in input doesn't map to a 1-pixel shift in output after 2×2 pooling—it might map to 0 or 1 pixel depending on alignment. This is actually desirable for building translation invariance gradually.
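The sketch below makes points 2 and 4 concrete, using torch.roll as a circular shift on an arbitrary test tensor: a pointwise ReLU commutes exactly with a shift, while a 2×2 max pool does not.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)

# 2. Pointwise nonlinearity: shifting and ReLU commute exactly
a = torch.relu(torch.roll(x, shifts=(1, 1), dims=(2, 3)))
b = torch.roll(torch.relu(x), shifts=(1, 1), dims=(2, 3))
print(torch.equal(a, b))                                       # True

# 4. Max pooling: a 1-pixel input shift has no exact counterpart in the pooled output
pooled_after_shift = F.max_pool2d(torch.roll(x, shifts=(1, 1), dims=(2, 3)), 2)
shifted_after_pool = torch.roll(F.max_pool2d(x, 2), shifts=(1, 1), dims=(2, 3))
print((pooled_after_shift - shifted_after_pool).abs().max())   # generally > 0: strict equivariance broken
```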
From Equivariance to Invariance:
CNNs build partial invariance through a hierarchy of equivariant layers followed by pooling:
Translation sensitivity decreases monotonically with layer depth:
Input pixels (high) → Conv1 → Pool1 → Conv2 → Pool2 → Global Pool (low) → Classifier (none)
Design Principle:
Equivariance preserves where features are found. Invariance ultimately discards this for classification (a cat is a cat regardless of position). The key insight is that we want equivariance in intermediate representations (to detect and localize features) but invariance in final predictions (to classify regardless of position).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def analyze_equivariance(model, x, shift=(2, 3)):
    """
    Analyze how equivariance changes through a CNN.

    For each layer, compute:
    - f(T[x]): Apply shift to input, then forward pass
    - T[f(x)]: Forward pass, then shift output
    Compare to measure equivariance preservation.
    """
    shift_y, shift_x = shift

    # Create shifted input
    x_shifted = torch.roll(x, shifts=(shift_y, shift_x), dims=(2, 3))

    results = {}

    # Get intermediate activations for original and shifted inputs
    activations_original = []
    activations_shifted = []

    def hook_fn(storage):
        def hook(module, input, output):
            storage.append(output.detach())
        return hook

    # Register hooks
    hooks = []
    for name, layer in model.named_children():
        hooks.append((name, layer.register_forward_hook(hook_fn(activations_original))))

    # Forward original
    _ = model(x)

    # Clear hooks and re-register for shifted input
    for name, h in hooks:
        h.remove()
    hooks = []
    for name, layer in model.named_children():
        hooks.append((name, layer.register_forward_hook(hook_fn(activations_shifted))))

    # Forward shifted
    _ = model(x_shifted)

    # Remove hooks and compare activations layer by layer
    for name, h in hooks:
        h.remove()

    layer_names = [name for name, _ in model.named_children()]
    for i, name in enumerate(layer_names):
        orig = activations_original[i]
        shifted = activations_shifted[i]

        # Shift the original activation
        orig_shifted = torch.roll(orig, shifts=(shift_y, shift_x), dims=(2, 3))

        # Check if feature maps have the same spatial size
        if orig.shape == shifted.shape:
            # Measure equivariance: ||T[f(x)] - f(T[x])||
            # (ignoring boundary effects by comparing the central region)
            h, w = orig.shape[2], orig.shape[3]
            margin = max(abs(shift_y), abs(shift_x)) + 2
            if h > 2 * margin and w > 2 * margin:
                central_orig = orig_shifted[:, :, margin:-margin, margin:-margin]
                central_shifted = shifted[:, :, margin:-margin, margin:-margin]
                diff = (central_orig - central_shifted).abs().mean().item()
                results[name] = {
                    'equivariance_error': diff,
                    'spatial_size': (h, w)
                }

    return results

# Example simple CNN
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),    # Equivariant
    nn.ReLU(),                         # Equivariant
    nn.MaxPool2d(2),                   # Breaks equivariance
    nn.Conv2d(64, 128, 3, padding=1),  # Equivariant
    nn.ReLU(),                         # Equivariant
)

x = torch.randn(1, 3, 32, 32)
results = analyze_equivariance(model, x)

for layer, data in results.items():
    print(f"{layer}: size={data['spatial_size']}, " +
          f"equiv_error={data['equivariance_error']:.4f}")
```

Understanding equivariance has profound practical implications for training, testing, and deploying CNNs.
1. Data Efficiency
Equivariance provides implicit data augmentation. A network that learns to detect an edge at position (10, 10) automatically detects edges at all positions. Without equivariance, the network would need separate training examples for edges at each location, inflating the required dataset by roughly the number of positions.
Formal Analysis:
Consider an image in which an object could appear at H × W different positions. Without equivariance, we might need on the order of H × W training examples per object class, one for each position. With equivariance, a single example teaches the network about all positions, reducing sample complexity by a factor of roughly H × W.
2. Generalization to Unseen Positions
Objects in test images may appear at positions never seen during training. Equivariance guarantees consistent recognition regardless:
$$P(\text{cat} | \text{image with cat at } (x_1, y_1)) = P(\text{cat} | \text{image with cat at } (x_2, y_2))$$
This position-invariant prediction emerges from equivariant representations combined with global pooling.
ImageNet-trained CNNs generalize to objects at arbitrary positions despite training images having centered objects. Equivariance enables this crucial generalization—without it, networks would fail on off-center objects or require exhaustive position augmentation during training.
3. Localization and Detection
Equivariance enables spatial localization. Because features shift with the input, we can determine WHERE an object is by finding WHERE its features activate; detection and segmentation heads read object locations directly off the feature maps.
Without equivariance, there would be no spatial correspondence between features and input locations.
4. Transfer Learning
Features learned on one dataset transfer to new datasets with different object positions. An edge detector learned on ImageNet works on medical images where edges appear at different locations. Equivariance ensures this transfer works regardless of spatial statistics.
5. Fully Convolutional Networks
Equivariance enables processing images of arbitrary size. Because the network applies the same local operations everywhere, a fully convolutional model trained at one resolution can be run on larger or smaller inputs, producing correspondingly sized output maps.
While equivariance is generally desirable, some tasks require position-dependent processing. Modern architectures selectively break equivariance when beneficial.
When Position Matters:
1. Structured Documents: forms, invoices, and ID cards place fields at consistent absolute positions, so location itself carries meaning.
2. Face Recognition: aligned faces have a canonical layout of parts (eyes above nose above mouth) that position-aware models can exploit.
3. Scene Understanding: scene layout is biased (sky near the top, ground near the bottom), making absolute position informative.
4. Video Prediction: forecasting future frames depends on where objects are in the frame, not only on what they are.
Vision Transformers (ViTs) explicitly add position embeddings, allowing the model to learn position-dependent patterns. CoordConv concatenates coordinate channels. Positional attention modulates attention based on position. These approaches blend equivariant processing with position awareness.
Methods for Position-Aware Processing:
1. Position Embeddings (Vision Transformers)
Add learnable position vectors to input patches: $$x'_i = x_i + p_i$$
where p_i encodes position i. This allows learning of position-dependent patterns while preserving local equivariance within patches.
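A minimal sketch of this idea follows; the patch size, dimensions, and module name are illustrative assumptions. A strided convolution produces patch tokens, and a learnable tensor of per-position offsets is added, making the representation position dependent.

```python
import torch
import torch.nn as nn

class PatchEmbedWithPosition(nn.Module):
    """Patchify an image and add learnable position embeddings (hypothetical, minimal module)."""
    def __init__(self, in_ch=3, embed_dim=64, patch=8, img_size=32):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (N, num_patches, embed_dim)
        return tokens + self.pos_embed                    # position-dependent offsets break equivariance

x = torch.randn(2, 3, 32, 32)
print(PatchEmbedWithPosition()(x).shape)  # torch.Size([2, 16, 64])
```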
2. CoordConv
Concatenate coordinate channels to input:
x_coord = normalized_x_coordinates # -1 to 1
y_coord = normalized_y_coordinates # -1 to 1
x_augmented = concat([x, x_coord, y_coord], dim=1)
The network can now learn position-dependent filters using these coordinate channels.
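A minimal runnable version of the idea above (the helper name and the [-1, 1] normalization range are assumptions): build normalized coordinate grids and concatenate them as extra channels before an ordinary convolution.

```python
import torch

def add_coord_channels(x):
    """Append normalized x/y coordinate channels, CoordConv-style (minimal sketch)."""
    n, _, h, w = x.shape
    ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([x, xs, ys], dim=1)

x = torch.randn(4, 3, 32, 32)
print(add_coord_channels(x).shape)  # torch.Size([4, 5, 32, 32])
```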
3. Relative Position Bias
Add position-dependent bias to attention scores: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + B\right)V$$
where B is a relative position bias matrix.
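A minimal sketch of attention with an additive bias term; the sequence length, dimensions, and the random B are placeholders (in practice B is gathered from a learnable table indexed by relative offsets, as in Swin Transformer).

```python
import torch
import torch.nn.functional as F

def attention_with_relative_bias(q, k, v, bias):
    """Scaled dot-product attention with an additive position bias B (minimal sketch)."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias  # bias depends only on relative positions
    return F.softmax(scores, dim=-1) @ v

L, d = 16, 32
q, k, v = (torch.randn(1, L, d) for _ in range(3))
bias = torch.randn(L, L)   # placeholder for a learned relative-position table
print(attention_with_relative_bias(q, k, v, bias).shape)  # torch.Size([1, 16, 32])
```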
4. Spatially-Varying Convolutions
Use different kernels at different positions: $$Y[i,j] = \sum_{m,n} K_{i,j}[m,n] \cdot X[i+m, j+n]$$
This fully breaks equivariance but allows position-specific feature extraction.
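A minimal single-channel sketch of this operation using F.unfold; the shapes, 'same' padding, and function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def spatially_varying_conv(x, kernels):
    """Apply a different 2D kernel at every output position (single-channel sketch).

    x:       (1, 1, H, W) input
    kernels: (H, W, kH, kW), one kernel per output position, 'same' padding assumed
    """
    _, _, h, w = x.shape
    kh, kw = kernels.shape[-2:]
    patches = F.unfold(x, (kh, kw), padding=(kh // 2, kw // 2))  # (1, kh*kw, H*W)
    patches = patches.transpose(1, 2).reshape(h, w, kh * kw)     # one patch per output position
    return (patches * kernels.reshape(h, w, kh * kw)).sum(-1)    # per-position dot products

x = torch.randn(1, 1, 8, 8)
kernels = torch.randn(8, 8, 3, 3)
print(spatially_varying_conv(x, kernels).shape)  # torch.Size([8, 8])
```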
| Method | Mechanism | Equivariance | Use Case |
|---|---|---|---|
| Standard Conv | Shared weights | Fully preserved | General vision |
| CoordConv | Coordinate channels | Softly broken | Object detection, rendering |
| Position Embedding | Additive embeddings | Fully broken | ViTs, structured data |
| Relative Position Bias | Attention bias | Partially preserved | Swin Transformer |
| Spatially-Varying Conv | Per-position kernels | Fully broken | Documents, medical imaging |
Translation equivariance in CNNs is a special case of a broader principle: designing neural networks with built-in geometric priors. This field, called Geometric Deep Learning, provides a unified framework for understanding architectures across domains.
The Geometric Deep Learning Blueprint:
The recipe mirrors what we traced in CNNs:
- Identify the symmetry group of the data domain (translations for image grids, permutations for graphs).
- Build linear layers that are equivariant to that group (convolution, message passing).
- Interleave pointwise nonlinearities, which preserve equivariance.
- Coarsen with pooling to build invariance gradually.
- Finish with an invariant global readout for the task.
Examples Across Domains:
Images (Grids): translation symmetry on the pixel grid, realized by standard CNNs through convolution.
Graphs: permutation symmetry of the nodes, realized by message-passing GNNs that apply the same update at every node.
Point Clouds: permutation symmetry of the points, often combined with 3D rotation and translation symmetry handled by E(3)-equivariant networks.
CNNs, Graph Neural Networks, Transformers, and other architectures can all be viewed through the lens of equivariance. Different architectures correspond to different symmetry groups. Understanding this unifies apparently disparate architectures under a single mathematical framework.
Mathematical Foundation:
The key result connecting group theory to neural networks:
Theorem (Characterization of Equivariant Linear Maps): Let G be a compact group acting on input space X and output space Y. A linear map Φ: X → Y is G-equivariant if and only if:
$$\Phi(x) = \int_G T_g^{(Y)} \cdot k(g) \cdot T_g^{(X)}[x] \, dg$$
for some kernel function k: G → L(X, Y). For the translation group on grids, this integral becomes the familiar convolution.
Implications:
Translation equivariance is the mathematical property that makes CNNs effective for visual recognition. It emerges directly from parameter sharing and has profound implications for network behavior.
Connection to Next Topic:
Equivariance tells us about global properties—how the entire feature map transforms. The next page explores the receptive field—the local region of input that influences each feature. Understanding receptive fields reveals how CNNs build hierarchical representations, from local edges to global objects.
You now understand translation equivariance—the property that makes CNN features predictably shift with the input. You've seen the formal definition, mathematical proof, connections to group theory, and practical implications. Next, we'll explore receptive fields: the local regions that each feature 'sees'.