Consider a profound question: When you shift an image of a cat two pixels to the right, what should happen to the features detected by a neural network?
With a fully-connected network, the answer is unpredictable. Shifting pixels rearranges the input vector, potentially activating completely different neurons. A network trained to recognize cats at the image center might fail catastrophically when the same cat appears at the corner.
With a convolutional network, something elegant happens: the detected features shift by exactly the same amount as the input. A cat detector that fires at position (100, 100) for a cat centered there will fire at position (102, 100) when that cat shifts right by two pixels. This property—where output features shift in perfect correspondence with input shifts—is called translation equivariance.
Translation equivariance isn't an accident or an optimization; it's a mathematical guarantee that emerges directly from parameter sharing. Understanding this property is essential for understanding why convolutional networks are so effective for spatial data.
This page provides a rigorous treatment of translation equivariance. You will understand the formal mathematical definition, prove why convolution is equivariant, distinguish equivariance from invariance, explore connections to group theory and geometric deep learning, and examine practical implications for CNN design and generalization.
Let's begin with precise mathematical definitions.
Translation Operator:
Define the translation operator T_τ that shifts a function (or image) by displacement vector τ = (τₓ, τᵧ):
$$[T_\tau f](x, y) = f(x - \tau_x, y - \tau_y)$$
This shifts the function in the positive direction by τ. If f has a peak at (10, 10), then T_τf with τ = (3, 5) has a peak at (13, 15).
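As a concrete illustration, the sketch below applies a discrete shift with torch.roll (a circular shift used purely for convenience; boundary handling is ignored). The image size and shift values are arbitrary choices for the example.

```python
import torch

# Discrete translation: (T_tau f)[y, x] = f[y - tau_y, x - tau_x]
f = torch.zeros(20, 20)
f[10, 10] = 1.0                                   # peak at (x, y) = (10, 10)

tau_x, tau_y = 3, 5                               # shift right by 3, down by 5
shifted = torch.roll(f, shifts=(tau_y, tau_x), dims=(0, 1))

y, x = divmod(shifted.argmax().item(), 20)
print(x, y)                                       # 13 15 -> the peak moved to (13, 15)
```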
Equivariance Definition:
A function (or layer) Φ is equivariant to a transformation T if applying T before Φ gives the same result as applying T after Φ:
$$\Phi(T_\tau[f]) = T_\tau[\Phi(f)]$$
In words: transforming the input then processing equals processing then transforming the output. The transformation 'commutes' with the function.
Visualizing Equivariance:
                Φ
    f ──────────────────▶ Φ(f)
    │                       │
 T_τ│                       │T_τ
    ▼                       ▼
 T_τ[f] ────────────────▶ T_τ[Φ(f)] = Φ(T_τ[f])
                Φ
Both paths lead to the same result!
Equivariance and invariance are often confused. Invariance means the output doesn't change under transformation: Φ(T[f]) = Φ(f). Equivariance means the output transforms in the same way: Φ(T[f]) = T[Φ(f)]. Individual conv layers are equivariant; the full network with pooling achieves partial invariance.
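The distinction can be checked numerically. The sketch below is a minimal illustration with random weights and an arbitrary shift (the margin used to ignore boundary effects is also an arbitrary choice): a convolution's feature map shifts with the input, while a global-average-pooled summary is nearly unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)
gap = nn.AdaptiveAvgPool2d(1)                        # global average pooling

x = torch.randn(1, 1, 32, 32)
x_shift = torch.roll(x, shifts=(2, 3), dims=(2, 3))  # shift down 2, right 3

# Equivariance: conv features shift along with the input (compare away from borders)
feat, feat_shift = conv(x), conv(x_shift)
diff = (torch.roll(feat, shifts=(2, 3), dims=(2, 3)) - feat_shift)[:, :, 6:-6, 6:-6]
print(diff.abs().max().item())                       # ~0: features shifted in lockstep

# Invariance: the globally pooled summary barely changes under the shift
print((gap(feat) - gap(feat_shift)).abs().max().item())  # small: approximately invariant
```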
Why Equivariance Matters:
1. Efficient Generalization
If a network is translation equivariant, then learning to detect a feature at one location automatically provides detection at all locations. A single training example of a cat at position A teaches the network about cats at every position.
2. Predictable Feature Behavior
Equivariance guarantees how features transform. We can reason about what a layer does mathematically, not just empirically.
3. Robust Recognition
Objects in real images appear at arbitrary positions. Equivariant representations ensure consistent recognition regardless of position.
4. Meaningful Feature Maps
With equivariance, feature map positions have semantic meaning—activations correspond to spatial locations in the input.
| Property | Definition | Where Used | Effect |
|---|---|---|---|
| Translation Equivariance | Φ(Translate(x)) = Translate(Φ(x)) | Convolutional layers | Features shift with input |
| Translation Invariance | Φ(Translate(x)) = Φ(x) | Global pooling, final classifier | Output unchanged by shift |
| Rotation Equivariance | Φ(Rotate(x)) = Rotate(Φ(x)) | Specialized architectures (Group CNNs) | Features rotate with input |
| Scale Equivariance | Φ(Scale(x)) = Scale(Φ(x)) | Scale-space networks, FPN | Features scale with input |
Let's rigorously prove that discrete 2D convolution is translation equivariant.
Setup:
Let f: ℤ² → ℝ be an input image and k: ℤ² → ℝ be a convolution kernel. The convolution is defined as:
$$(f * k)[i, j] = \sum_{m} \sum_{n} f[m, n] \cdot k[i-m, j-n]$$
Alternatively, using the cross-correlation convention common in deep learning:
$$(f \star k)[i, j] = \sum_{m} \sum_{n} f[i+m, j+n] \cdot k[m, n]$$
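A note on conventions: deep-learning libraries implement the cross-correlation form. The short sketch below (arbitrary random tensors) shows that PyTorch's F.conv2d does not flip the kernel, and that flipping it recovers the textbook convolution:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 6, 6)
k = torch.randn(1, 1, 3, 3)

# Deep-learning "convolution" (F.conv2d) is the cross-correlation: the kernel is not flipped
cross_corr = F.conv2d(x, k)

# The textbook convolution f * k is recovered by flipping the kernel in both spatial dims
true_conv = F.conv2d(x, torch.flip(k, dims=(2, 3)))

print(torch.allclose(cross_corr, true_conv))  # usually False: the conventions differ unless k is symmetric
```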
The Theorem:
For any translation τ = (τₓ, τᵧ):
$$(T_\tau f) * k = T_\tau[f * k]$$
That is, convolving a translated input yields the translated output of convolving the original input. The same identity holds for the cross-correlation ⋆.
Proof:
Let g = f * k be the convolution output. We need to show that translating f before convolving gives the same result as translating g.
The proof hinges on regrouping the indices inside the summation. Because the same kernel is applied at every position (parameter sharing), translating the input simply relabels where the kernel 'sees' each pattern, producing an output translated by the same amount.
Detailed Proof:
Start with the left side, $(T_\tau f) \star k$, at position (i, j), using the cross-correlation convention (the argument for the flipped-kernel convolution is identical):
$$[(T_\tau f) \star k][i, j] = \sum_{m} \sum_{n} (T_\tau f)[i+m, j+n] \cdot k[m, n]$$
By definition of translation: $$(T_\tau f)[i+m, j+n] = f[i+m-\tau_x, j+n-\tau_y]$$
Substituting: $$= \sum_{m} \sum_{n} f[i+m-\tau_x, j+n-\tau_y] \cdot k[m, n]$$
Regroup the arguments, using $i + m - \tau_x = (i - \tau_x) + m$ and $j + n - \tau_y = (j - \tau_y) + n$: $$= \sum_{m} \sum_{n} f[(i-\tau_x)+m, (j-\tau_y)+n] \cdot k[m, n]$$
This is exactly the cross-correlation evaluated at (i-τₓ, j-τᵧ): $$= (f \star k)[i-\tau_x, j-\tau_y]$$
By definition of translation: $$= (T_\tau[f \star k])[i, j]$$
QED: $(T_\tau f) \star k = T_\tau[f \star k]$ ∎
The Key Insight:
Equivariance emerges from the homogeneity of convolution—the same operation is applied everywhere. Parameter sharing guarantees this homogeneity, making equivariance a mathematical certainty, not an empirical observation.
```python
import torch
import torch.nn.functional as F

def verify_translation_equivariance():
    """
    Numerically verify that convolution is translation equivariant.
    """
    # Create a simple input image (batch=1, channels=1, H=10, W=10)
    x = torch.randn(1, 1, 10, 10)

    # Create a kernel
    kernel = torch.randn(1, 1, 3, 3)

    # Define translation (shift right by 2, down by 1)
    tau_x, tau_y = 2, 1

    # Method 1: Translate input, then convolve
    # Use a larger input to avoid boundary effects
    x_padded = F.pad(x, (2, 2, 2, 2))  # Add padding

    # Translate via roll (circular shift for demonstration)
    x_translated = torch.roll(x_padded, shifts=(tau_y, tau_x), dims=(2, 3))
    output_method1 = F.conv2d(x_translated, kernel, padding=0)

    # Method 2: Convolve, then translate output
    output_original = F.conv2d(x_padded, kernel, padding=0)
    output_method2 = torch.roll(output_original, shifts=(tau_y, tau_x), dims=(2, 3))

    # Compare (within numerical precision)
    # They should be equal in the valid region, away from the wrap-around seam
    valid_region_1 = output_method1[:, :, tau_y:, tau_x:]
    valid_region_2 = output_method2[:, :, tau_y:, tau_x:]

    max_diff = (valid_region_1 - valid_region_2).abs().max().item()
    print(f"Max difference between methods: {max_diff:.2e}")
    print(f"Equivariance holds: {max_diff < 1e-6}")

    return max_diff < 1e-6

# Verify
result = verify_translation_equivariance()
print(f"\nConvolution is translation equivariant: {result}")

# Output:
# Max difference between methods: ~0.00e+00
# Equivariance holds: True
# Convolution is translation equivariant: True
```

Translation equivariance is best understood through the lens of group theory—the mathematical study of symmetry. This perspective unifies CNNs with a broader family of equivariant architectures.
The Translation Group:
The set of all 2D translations forms a group (ℝ², +) where:
- composing two translations gives another translation, τ₁ + τ₂,
- the identity element is the zero translation (0, 0),
- every translation τ has an inverse −τ,
- composition is associative.
This is an abelian (commutative) group: τ₁ + τ₂ = τ₂ + τ₁
Group Actions:
The translation group acts on the space of images. Each translation τ defines a transformation T_τ of images: $$T_\tau: f \mapsto T_\tau[f]$$
where $(T_\tau f)(x) = f(x - \tau)$.
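To see that these operators compose like the group itself (a homomorphism property), apply two translations in sequence:

$$\big(T_{\tau_1}[T_{\tau_2} f]\big)(x) = (T_{\tau_2} f)(x - \tau_1) = f(x - \tau_1 - \tau_2) = \big(T_{\tau_1 + \tau_2} f\big)(x)$$

so $T_{\tau_1} \circ T_{\tau_2} = T_{\tau_1 + \tau_2}$.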
Equivariant Maps:
A function Φ is G-equivariant if it commutes with all group actions: $$\Phi(T_g[f]) = T_g[\Phi(f)] \quad \forall g \in G$$
Convolution is equivariant to the translation group.
Group theory provides a principled framework for designing architectures with built-in symmetries. Beyond translations, we can design networks equivariant to rotations (SO(2)), reflections (O(2)), 3D rotations (SO(3)), or arbitrary Lie groups. This is the foundation of Geometric Deep Learning.
Extending Beyond Translations:
The group-theoretic view suggests natural extensions:
1. Rotation Equivariance
Replace the translation group with the rotation group SO(2). Group-equivariant CNNs (G-CNNs) achieve rotation equivariance by convolving with rotated copies of filters:
$$[f * k]_\theta = \int f(r_{-\theta}(x)) \, k(x) \, dx$$
where r_θ is rotation by angle θ. (A minimal code sketch of this idea for 90° rotations appears after this list.)
2. Scale Equivariance
The multiplicative group (ℝ⁺, ×) of positive scalings. Scale-equivariant networks process images at multiple scales with shared weights.
3. Affine and Projective Equivariance
More complex groups handling perspective transformations, useful for 3D vision tasks.
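As a concrete illustration of the rotation case above, here is a minimal sketch of a lifting correlation for the four 90° rotations (the cyclic group C₄), which avoids any interpolation. The function name and tensor shapes are illustrative assumptions, not a full G-CNN implementation.

```python
import torch
import torch.nn.functional as F

def c4_lifting_conv(x, kernel):
    """Correlate the input with the four rotated copies of a filter bank.

    x:      (N, C, H, W) input
    kernel: (C_out, C, kH, kW) base filters
    Returns (N, 4, C_out, H', W'), one group of channels per rotation
    angle in {0°, 90°, 180°, 270°}.
    """
    outputs = []
    for r in range(4):
        k_rot = torch.rot90(kernel, k=r, dims=(2, 3))  # rotate filters by r * 90°
        outputs.append(F.conv2d(x, k_rot))
    return torch.stack(outputs, dim=1)

x = torch.randn(1, 3, 16, 16)
kernel = torch.randn(8, 3, 3, 3)
print(c4_lifting_conv(x, kernel).shape)  # torch.Size([1, 4, 8, 14, 14])
```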
The Group Convolution Theorem:
A linear map Φ: L²(G) → L²(G) is G-equivariant if and only if it can be expressed as convolution on the group:
$$\Phi[f] = f * k$$
for some kernel k. This establishes convolution as the unique linear equivariant operation!
| Group | Symmetry | Architecture | Application |
|---|---|---|---|
| (ℝ², +) | 2D Translation | Standard CNN | Image recognition |
| SO(2) | 2D Rotation | G-CNN, Harmonic Networks | Medical imaging, satellite |
| SE(2) | Translation + Rotation | SE(2)-CNN | Robotics, autonomous driving |
| SO(3) | 3D Rotation | Spherical CNN | Molecular modeling, 3D shapes |
| E(3) | 3D Euclidean | EGNN, SchNet | Physics simulation, chemistry |
| S_n | Permutation | Message Passing GNN | Graphs, sets, point clouds |
A complete CNN consists of multiple layers. Let's trace how equivariance propagates and where it's intentionally broken.
Layer-by-Layer Analysis:
1. Convolutional Layers: Equivariant ✓
As proven, convolution is translation equivariant. Stacking multiple conv layers preserves equivariance: $$\Phi_2(\Phi_1(T_\tau[x])) = \Phi_2(T_\tau[\Phi_1(x)]) = T_\tau[\Phi_2(\Phi_1(x))]$$
2. Pointwise Nonlinearities (ReLU, etc.): Equivariant ✓
Activation functions applied elementwise preserve equivariance: $$\sigma(T_\tau[f])[i,j] = \sigma(f[i-\tau_x, j-\tau_y]) = T_\tau[\sigma(f)][i,j]$$
3. Batch/Layer Normalization: Approximately Equivariant ~
Normalization statistics are computed over spatial positions (and, for batch norm, over the batch), which is itself insensitive to shifts. Boundary effects and running statistics break equivariance slightly, but the effect is minor.
4. Pooling Layers: Break Equivariance ✗
Max pooling and average pooling over windows break strict equivariance but introduce desired invariance to small translations.
Pooling layers downsample the feature map, which breaks strict translation equivariance. A 1-pixel shift in input doesn't map to a 1-pixel shift in output after 2×2 pooling—it might map to 0 or 1 pixel depending on alignment. This is actually desirable for building translation invariance gradually.
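The sketch below makes points 2 and 4 concrete, using torch.roll as a circular shift on an arbitrary test tensor: a pointwise ReLU commutes exactly with a shift, while a 2×2 max pool does not.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 8, 8)

# 2. Pointwise nonlinearity: shifting and ReLU commute exactly
a = torch.relu(torch.roll(x, shifts=(1, 1), dims=(2, 3)))
b = torch.roll(torch.relu(x), shifts=(1, 1), dims=(2, 3))
print(torch.equal(a, b))                                       # True

# 4. Max pooling: a 1-pixel input shift has no exact counterpart in the pooled output
pooled_after_shift = F.max_pool2d(torch.roll(x, shifts=(1, 1), dims=(2, 3)), 2)
shifted_after_pool = torch.roll(F.max_pool2d(x, 2), shifts=(1, 1), dims=(2, 3))
print((pooled_after_shift - shifted_after_pool).abs().max())   # generally > 0: strict equivariance broken
```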
From Equivariance to Invariance:
CNNs build partial invariance through a hierarchy of equivariant layers followed by pooling:
Translation sensitivity decreases monotonically with layer depth:
Input pixels (high) → Conv1 → Pool1 → Conv2 → Pool2 → Global Pool (low) → Classifier (none)
Design Principle:
Equivariance preserves where features are found. Invariance ultimately discards this for classification (a cat is a cat regardless of position). The key insight is that we want equivariance in intermediate representations (to detect and localize features) but invariance in final predictions (to classify regardless of position).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def analyze_equivariance(model, x, shift=(2, 3)):
    """
    Analyze how equivariance changes through a CNN.

    For each layer, compute:
    - f(T[x]): Apply shift to input, then forward pass
    - T[f(x)]: Forward pass, then shift output
    Compare to measure equivariance preservation.
    """
    shift_y, shift_x = shift

    # Create shifted input
    x_shifted = torch.roll(x, shifts=(shift_y, shift_x), dims=(2, 3))

    results = {}

    # Get intermediate activations for original and shifted inputs
    activations_original = []
    activations_shifted = []

    def hook_fn(storage):
        def hook(module, input, output):
            storage.append(output.detach())
        return hook

    # Register hooks
    hooks = []
    for name, layer in model.named_children():
        hooks.append((name, layer.register_forward_hook(hook_fn(activations_original))))

    # Forward original
    _ = model(x)

    # Clear hooks and re-register for shifted input
    for name, h in hooks:
        h.remove()
    hooks = []
    for name, layer in model.named_children():
        hooks.append((name, layer.register_forward_hook(hook_fn(activations_shifted))))

    # Forward shifted
    _ = model(x_shifted)

    # Remove hooks and compare activations layer by layer
    for name, h in hooks:
        h.remove()

    layer_names = [name for name, _ in model.named_children()]
    for i, name in enumerate(layer_names):
        orig = activations_original[i]
        shifted = activations_shifted[i]

        # Shift the original activation
        orig_shifted = torch.roll(orig, shifts=(shift_y, shift_x), dims=(2, 3))

        # Check if feature maps have the same spatial size
        if orig.shape == shifted.shape:
            # Measure equivariance: ||T[f(x)] - f(T[x])||
            # (ignoring boundary effects by comparing the central region)
            h, w = orig.shape[2], orig.shape[3]
            margin = max(abs(shift_y), abs(shift_x)) + 2
            if h > 2 * margin and w > 2 * margin:
                central_orig = orig_shifted[:, :, margin:-margin, margin:-margin]
                central_shifted = shifted[:, :, margin:-margin, margin:-margin]
                diff = (central_orig - central_shifted).abs().mean().item()
                results[name] = {
                    'equivariance_error': diff,
                    'spatial_size': (h, w)
                }

    return results

# Example simple CNN
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),    # Equivariant
    nn.ReLU(),                         # Equivariant
    nn.MaxPool2d(2),                   # Breaks equivariance
    nn.Conv2d(64, 128, 3, padding=1),  # Equivariant
    nn.ReLU(),                         # Equivariant
)

x = torch.randn(1, 3, 32, 32)
results = analyze_equivariance(model, x)

for layer, data in results.items():
    print(f"{layer}: size={data['spatial_size']}, " +
          f"equiv_error={data['equivariance_error']:.4f}")
```

Understanding equivariance has profound practical implications for training, testing, and deploying CNNs.
1. Data Efficiency
Equivariance provides implicit data augmentation. A network that learns to detect an edge at position (10, 10) automatically detects edges at all positions. Without equivariance, the network would need separate training examples for edges at each location, inflating the required dataset by roughly the number of positions.
Formal Analysis:
Consider an image in which an object could appear at H × W different positions. Without equivariance, we might need on the order of H × W training examples per object class, one for each position. With equivariance, a single example teaches the network about all positions, reducing sample complexity by a factor of roughly H × W.
2. Generalization to Unseen Positions
Objects in test images may appear at positions never seen during training. Equivariance guarantees consistent recognition regardless:
$$P(\text{cat} | \text{image with cat at } (x_1, y_1)) = P(\text{cat} | \text{image with cat at } (x_2, y_2))$$
This position-invariant prediction emerges from equivariant representations combined with global pooling.
ImageNet-trained CNNs generalize to objects at arbitrary positions despite training images having centered objects. Equivariance enables this crucial generalization—without it, networks would fail on off-center objects or require exhaustive position augmentation during training.
3. Localization and Detection
Equivariance enables spatial localization. Because features shift with the input, we can determine WHERE an object is by finding WHERE its features activate; detection and segmentation heads read object locations directly off the feature maps.
Without equivariance, there would be no spatial correspondence between features and input locations.
4. Transfer Learning
Features learned on one dataset transfer to new datasets with different object positions. An edge detector learned on ImageNet works on medical images where edges appear at different locations. Equivariance ensures this transfer works regardless of spatial statistics.
5. Fully Convolutional Networks
Equivariance enables processing images of arbitrary size. Because the network applies the same local operations everywhere, a fully convolutional model trained at one resolution can be run on larger or smaller inputs, producing correspondingly sized output maps.
While equivariance is generally desirable, some tasks require position-dependent processing. Modern architectures selectively break equivariance when beneficial.
When Position Matters:
1. Structured Documents: forms, invoices, and ID cards place fields at consistent absolute positions, so location itself carries meaning.
2. Face Recognition: aligned faces have a canonical layout of parts (eyes above nose above mouth) that position-aware models can exploit.
3. Scene Understanding: scene layout is biased (sky near the top, ground near the bottom), making absolute position informative.
4. Video Prediction: forecasting future frames depends on where objects are in the frame, not only on what they are.
Vision Transformers (ViTs) explicitly add position embeddings, allowing the model to learn position-dependent patterns. CoordConv concatenates coordinate channels. Positional attention modulates attention based on position. These approaches blend equivariant processing with position awareness.
Methods for Position-Aware Processing:
1. Position Embeddings (Vision Transformers)
Add learnable position vectors to input patches: $$x'_i = x_i + p_i$$
where p_i encodes position i. This allows learning of position-dependent patterns while preserving local equivariance within patches.
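A minimal sketch of this idea follows; the patch size, dimensions, and module name are illustrative assumptions. A strided convolution produces patch tokens, and a learnable tensor of per-position offsets is added, making the representation position dependent.

```python
import torch
import torch.nn as nn

class PatchEmbedWithPosition(nn.Module):
    """Patchify an image and add learnable position embeddings (hypothetical, minimal module)."""
    def __init__(self, in_ch=3, embed_dim=64, patch=8, img_size=32):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (N, num_patches, embed_dim)
        return tokens + self.pos_embed                    # position-dependent offsets break equivariance

x = torch.randn(2, 3, 32, 32)
print(PatchEmbedWithPosition()(x).shape)  # torch.Size([2, 16, 64])
```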
2. CoordConv
Concatenate coordinate channels to input:
x_coord = normalized_x_coordinates # -1 to 1
y_coord = normalized_y_coordinates # -1 to 1
x_augmented = concat([x, x_coord, y_coord], dim=1)
The network can now learn position-dependent filters using these coordinate channels.
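A minimal runnable version of the idea above (the helper name and the [-1, 1] normalization range are assumptions): build normalized coordinate grids and concatenate them as extra channels before an ordinary convolution.

```python
import torch

def add_coord_channels(x):
    """Append normalized x/y coordinate channels, CoordConv-style (minimal sketch)."""
    n, _, h, w = x.shape
    ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([x, xs, ys], dim=1)

x = torch.randn(4, 3, 32, 32)
print(add_coord_channels(x).shape)  # torch.Size([4, 5, 32, 32])
```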
3. Relative Position Bias
Add position-dependent bias to attention scores: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + B\right)V$$
where B is a relative position bias matrix.
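A minimal sketch of attention with an additive bias term; the sequence length, dimensions, and the random B are placeholders (in practice B is gathered from a learnable table indexed by relative offsets, as in Swin Transformer).

```python
import torch
import torch.nn.functional as F

def attention_with_relative_bias(q, k, v, bias):
    """Scaled dot-product attention with an additive position bias B (minimal sketch)."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias  # bias depends only on relative positions
    return F.softmax(scores, dim=-1) @ v

L, d = 16, 32
q, k, v = (torch.randn(1, L, d) for _ in range(3))
bias = torch.randn(L, L)   # placeholder for a learned relative-position table
print(attention_with_relative_bias(q, k, v, bias).shape)  # torch.Size([1, 16, 32])
```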
4. Spatially-Varying Convolutions
Use different kernels at different positions: $$Y[i,j] = \sum_{m,n} K_{i,j}[m,n] \cdot X[i+m, j+n]$$
This fully breaks equivariance but allows position-specific feature extraction.
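A minimal single-channel sketch of this operation using F.unfold; the shapes, 'same' padding, and function name are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def spatially_varying_conv(x, kernels):
    """Apply a different 2D kernel at every output position (single-channel sketch).

    x:       (1, 1, H, W) input
    kernels: (H, W, kH, kW), one kernel per output position, 'same' padding assumed
    """
    _, _, h, w = x.shape
    kh, kw = kernels.shape[-2:]
    patches = F.unfold(x, (kh, kw), padding=(kh // 2, kw // 2))  # (1, kh*kw, H*W)
    patches = patches.transpose(1, 2).reshape(h, w, kh * kw)     # one patch per output position
    return (patches * kernels.reshape(h, w, kh * kw)).sum(-1)    # per-position dot products

x = torch.randn(1, 1, 8, 8)
kernels = torch.randn(8, 8, 3, 3)
print(spatially_varying_conv(x, kernels).shape)  # torch.Size([8, 8])
```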
| Method | Mechanism | Equivariance | Use Case |
|---|---|---|---|
| Standard Conv | Shared weights | Fully preserved | General vision |
| CoordConv | Coordinate channels | Softly broken | Object detection, rendering |
| Position Embedding | Additive embeddings | Fully broken | ViTs, structured data |
| Relative Position Bias | Attention bias | Partially preserved | Swin Transformer |
| Spatially-Varying Conv | Per-position kernels | Fully broken | Documents, medical imaging |
Translation equivariance in CNNs is a special case of a broader principle: designing neural networks with built-in geometric priors. This field, called Geometric Deep Learning, provides a unified framework for understanding architectures across domains.
The Geometric Deep Learning Blueprint:
The recipe mirrors what we traced in CNNs:
- Identify the symmetry group of the data domain (translations for image grids, permutations for graphs).
- Build linear layers that are equivariant to that group (convolution, message passing).
- Interleave pointwise nonlinearities, which preserve equivariance.
- Coarsen with pooling to build invariance gradually.
- Finish with an invariant global readout for the task.
Examples Across Domains:
Images (Grids): translation symmetry on the pixel grid, realized by standard CNNs through convolution.
Graphs: permutation symmetry of the nodes, realized by message-passing GNNs that apply the same update at every node.
Point Clouds: permutation symmetry of the points, often combined with 3D rotation and translation symmetry handled by E(3)-equivariant networks.
CNNs, Graph Neural Networks, Transformers, and other architectures can all be viewed through the lens of equivariance. Different architectures correspond to different symmetry groups. Understanding this unifies apparently disparate architectures under a single mathematical framework.
Mathematical Foundation:
The key result connecting group theory to neural networks:
Theorem (Characterization of Equivariant Linear Maps): Let G be a compact group acting on input space X and output space Y. A linear map Φ: X → Y is G-equivariant if and only if:
$$\Phi(x) = \int_G T_g^{(Y)} \cdot k(g) \cdot T_g^{(X)}[x] \, dg$$
for some kernel function k: G → L(X, Y). For the translation group on grids, this integral becomes the familiar convolution.
Implications:
Translation equivariance is the mathematical property that makes CNNs effective for visual recognition. It emerges directly from parameter sharing and has profound implications for network behavior.
Connection to Next Topic:
Equivariance tells us about global properties—how the entire feature map transforms. The next page explores the receptive field—the local region of input that influences each feature. Understanding receptive fields reveals how CNNs build hierarchical representations, from local edges to global objects.
You now understand translation equivariance—the property that makes CNN features predictably shift with the input. You've seen the formal definition, mathematical proof, connections to group theory, and practical implications. Next, we'll explore receptive fields: the local regions that each feature 'sees'.