When designing a CNN, choosing kernel sizes is only part of the story. Two additional parameters profoundly affect both the output dimensions and the computational characteristics of convolutional layers: stride and padding.
Stride controls how far the kernel moves between output positions. A stride of 1 means the kernel shifts one pixel at a time; a stride of 2 means it jumps two pixels, effectively downsampling the output.
Padding controls what happens at the input boundaries. Zero-padding adds extra values around the input, allowing the kernel to process edge regions fully and, critically, enabling control over output dimensions.
Together, stride and padding determine the output spatial dimensions, the computational cost of each layer, and how information at the input boundaries is treated.
This page provides complete mastery of these parameters, including the essential dimension formulas that every CNN practitioner must know.
By the end of this page, you will understand stride and padding from first principles, memorize the output dimension formulas, know when to use different stride/padding configurations, and appreciate the design tradeoffs in CNN architectures.
Definition:
The stride (s) specifies the step size of the kernel as it slides across the input. With stride s, the kernel moves s positions between consecutive output computations.
Stride 1 (default): the kernel visits every position, producing the densest possible output and preserving spatial resolution.
Stride > 1: the kernel skips positions, producing a smaller output and reducing computation—an implicit downsampling.
Visual Intuition:
Imagine a 3×3 kernel sliding over a 5×5 input.
With stride 1: The kernel can be placed at positions (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2). Output is 3×3.
With stride 2: The kernel jumps two positions: (0,0), (0,2), (2,0), (2,2). Output is 2×2.
| Stride | Kernel Positions (per row, 7×7 input, 3×3 kernel, no padding) | Output Dimension | Output Elements |
|---|---|---|---|
| s = 1 | Positions 0, 1, 2, 3, 4 (5 positions) | 5 × 5 | 25 |
| s = 2 | Positions 0, 2, 4 (3 positions) | 3 × 3 | 9 |
| s = 3 | Positions 0, 3 (2 positions) | 2 × 2 | 4 |
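A minimal sketch (plain NumPy and a naive loop rather than a framework call) of how stride changes the number of kernel placements, matching the table above:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Naive 'valid' convolution (cross-correlation) with a given stride."""
    H, W = image.shape
    K = kernel.shape[0]
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+K, j*stride:j*stride+K]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3)) / 9.0  # simple averaging kernel

for s in (1, 2, 3):
    print(f"stride {s}: output shape {conv2d_valid(image, kernel, stride=s).shape}")
# stride 1: (5, 5), stride 2: (3, 3), stride 3: (2, 2)
```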
Stride as Controlled Downsampling:
Before strided convolutions became popular, CNNs used pooling layers (max-pool, average-pool) to reduce spatial dimensions. Strided convolutions achieve the same downsampling effect while simultaneously applying learned filters.
Benefits of strided convolutions over pooling: the downsampling is learned rather than fixed, filtering and downsampling happen in a single operation, and no separate pooling layer is needed.
Potential drawbacks: skipped positions can discard fine detail, strided sampling can introduce aliasing, and the layer adds parameters where pooling has none.
Asymmetric Strides:
Stride can differ between dimensions. Stride (s_h, s_w) specifies separate step sizes for height and width. This is useful when inputs have non-square aspect ratios, though square strides (s, s) are most common.
Stride 1 is used for most convolutional layers to preserve spatial resolution. Stride 2 is used at 'downsampling' points to halve spatial dimensions (replacing pooling in many modern architectures). Strides > 2 are rare because aggressive downsampling loses too much information.
The Boundary Problem:
Consider a 5×5 input and a 3×3 kernel with stride 1. Without padding, the kernel can only be placed where it fully fits within the input bounds—positions (0,0) through (2,2). The output is 3×3: smaller than the input.
This shrinkage compounds through layers: with 3×3 kernels and no padding, each layer removes two pixels per dimension, so ten layers shrink a 32×32 input to 12×12.
Moreover, edge pixels participate in fewer convolutions than center pixels, underweighting boundary information.
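A small sketch of that underweighting: count how many valid 3×3 kernel placements cover each pixel of a 5×5 input with no padding.

```python
import numpy as np

# Count how many 3×3 kernel placements cover each pixel of a 5×5 input (no padding).
H, W, K = 5, 5, 3
coverage = np.zeros((H, W), dtype=int)
for i in range(H - K + 1):        # valid top-left rows
    for j in range(W - K + 1):    # valid top-left columns
        coverage[i:i+K, j:j+K] += 1
print(coverage)
# Corner pixels are covered once; the center pixel nine times.
```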
Padding to the Rescue:
Padding adds extra values around the input's borders, allowing the kernel to extend beyond the original boundaries. The most common padding value is zero (hence 'zero-padding'), but other strategies exist.
Padding Amount:
Padding p specifies how many rows/columns to add on each side: p = 1 turns an H×W input into (H+2)×(W+2), p = 2 yields (H+4)×(W+4), and so on.
```python
import numpy as np

def demonstrate_padding():
    """
    Visualize zero-padding and its effect on convolution.
    """
    # 4×4 input image
    image = np.array([
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ])

    print("Original image (4×4):")
    print(image)
    print(f"Shape: {image.shape}")

    # Padding p=1: add one row/column of zeros on each side
    padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
    print("\nZero-padded image (p=1, becomes 6×6):")
    print(padded)
    print(f"Shape: {padded.shape}")

    # With a 3×3 kernel:
    # - No padding: output is 4-3+1 = 2×2
    # - Padding=1: output is (4+2)-3+1 = 4×4 (same as input!)

    # Padding p=2: add two rows/columns of zeros on each side
    padded_2 = np.pad(image, pad_width=2, mode='constant', constant_values=0)
    print("\nZero-padded image (p=2, becomes 8×8):")
    print(padded_2)
    print(f"Shape: {padded_2.shape}")

def calculate_output_dimensions():
    """
    Calculate output dimensions for various padding amounts.
    """
    H, W = 7, 7  # Input height, width
    K = 3        # Kernel size
    s = 1        # Stride

    print(f"Input: {H}×{W}, Kernel: {K}×{K}, Stride: {s}")
    print("-" * 40)

    for p in range(4):
        H_out = (H + 2*p - K) // s + 1
        W_out = (W + 2*p - K) // s + 1
        print(f"Padding p={p}: Output is {H_out}×{W_out}")

if __name__ == "__main__":
    demonstrate_padding()
    print()
    calculate_output_dimensions()
```

Padding Strategies:
Zero Padding: fills the border with zeros—the simplest strategy and the default in most frameworks.
Replicate/Edge Padding: repeats the nearest edge value outward, avoiding the artificial zero edge.
Reflect Padding: mirrors the values just inside the border, preserving local statistics near the edge.
Circular/Periodic Padding: wraps around to the opposite side, treating the input as periodic (useful for inputs that genuinely wrap, such as panoramas).
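NumPy's pad modes map directly onto these strategies; a minimal illustration:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Each np.pad mode corresponds to one of the strategies above.
print(np.pad(x, 1, mode='constant'))  # zero padding
print(np.pad(x, 1, mode='edge'))      # replicate/edge padding
print(np.pad(x, 1, mode='reflect'))   # reflect padding
print(np.pad(x, 1, mode='wrap'))      # circular/periodic padding
```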
Deep learning frameworks primarily use zero padding for its simplicity and efficiency.
Zero-padding introduces artificial edges where the image transitions to zeros. CNNs can learn to detect these artifacts, potentially making predictions boundary-dependent. For applications where boundary behavior matters (e.g., medical imaging, satellite imagery), consider non-zero padding strategies or analyze boundary effects carefully.
The most important formula in CNN architecture design computes the output spatial dimensions given input size, kernel size, padding, and stride.
The Formula:
For input dimension H (height) or W (width):
$$H_{out} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1$$
Where: H is the input dimension, p is the padding added to each side, K is the kernel size, s is the stride, and ⌊·⌋ denotes the floor function.
Deriving the Formula: after padding, the input has H + 2p positions. The kernel's top edge can start anywhere from position 0 to H + 2p − K. With stride s, the valid starting positions are 0, s, 2s, …, so there are ⌊(H + 2p − K)/s⌋ + 1 of them—one output element per valid position.
| Input | Kernel | Padding | Stride | Calculation | Output |
|---|---|---|---|---|---|
| 32×32 | 3×3 | 0 | 1 | ⌊(32+0-3)/1⌋+1 = 30 | 30×30 |
| 32×32 | 3×3 | 1 | 1 | ⌊(32+2-3)/1⌋+1 = 32 | 32×32 ✓ same |
| 32×32 | 3×3 | 1 | 2 | ⌊(32+2-3)/2⌋+1 = 16 | 16×16 |
| 224×224 | 7×7 | 3 | 2 | ⌊(224+6-7)/2⌋+1 = 112 | 112×112 |
| 112×112 | 3×3 | 1 | 2 | ⌊(112+2-3)/2⌋+1 = 56 | 56×56 |
Special Cases:
'Same' Padding (output size equals input size, stride 1):
To achieve H_out = H with s = 1:
$$p = \frac{K - 1}{2}$$
For odd K (e.g., K=3 → p=1, K=5 → p=2), this is an integer. For even K, 'same' padding requires asymmetric padding (different amounts on each side)—less common.
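A minimal check in PyTorch (assuming a version recent enough to accept padding='same', which applies to stride 1 only):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

# For odd kernels, padding=(K-1)//2 reproduces 'same' behavior explicitly.
conv_explicit = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding=2)
print(conv_explicit(x).shape)  # torch.Size([1, 1, 32, 32])

# Newer PyTorch versions also accept padding='same' directly (stride 1 only).
conv_same = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding='same')
print(conv_same(x).shape)      # torch.Size([1, 1, 32, 32])
```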
'Valid' Padding (no padding, output shrinks):
With p = 0:
$$H_{out} = \left\lfloor \frac{H - K}{s} \right\rfloor + 1$$
The output is smaller than the input, and edge pixels participate in fewer computations.
Dimensions Must Be Positive:
The output dimension must satisfy H_out > 0. This requires:
$$H + 2p \geq K$$
The padded input must be at least as large as the kernel—otherwise, we can't place the kernel anywhere.
```python
def conv_output_dim(input_dim: int, kernel: int, padding: int, stride: int) -> int:
    """
    Calculate the output dimension for convolution.

    Args:
        input_dim: Input height or width
        kernel: Kernel size
        padding: Padding added to each side
        stride: Stride of the convolution

    Returns:
        Output dimension

    Raises:
        ValueError: If output dimension would be < 1
    """
    output = (input_dim + 2 * padding - kernel) // stride + 1
    if output < 1:
        raise ValueError(
            f"Invalid configuration: input={input_dim}, kernel={kernel}, "
            f"padding={padding}, stride={stride} yields output={output}"
        )
    return output

def same_padding(kernel: int) -> int:
    """
    Calculate padding needed for 'same' output (stride 1).

    For odd kernels, returns (kernel - 1) / 2.
    For even kernels, would require asymmetric padding.
    """
    if kernel % 2 == 0:
        print(f"Warning: Even kernel {kernel} requires asymmetric padding for 'same'")
    return (kernel - 1) // 2

def trace_cnn_dimensions():
    """
    Trace dimensions through a typical CNN architecture.
    """
    # Example: Simplified ResNet-style stem and blocks
    H, W = 224, 224
    print(f"Input: {H}×{W}")
    print("-" * 50)

    # Layer 1: 7×7 conv, stride 2, pad 3
    H = conv_output_dim(H, kernel=7, padding=3, stride=2)
    W = conv_output_dim(W, kernel=7, padding=3, stride=2)
    print(f"After 7×7 conv (s=2, p=3): {H}×{W}")

    # Max pool: 3×3, stride 2, pad 1
    H = conv_output_dim(H, kernel=3, padding=1, stride=2)
    W = conv_output_dim(W, kernel=3, padding=1, stride=2)
    print(f"After 3×3 maxpool (s=2, p=1): {H}×{W}")

    # ResNet Block 1: 3×3 conv, stride 1, pad 1 (same)
    H = conv_output_dim(H, kernel=3, padding=1, stride=1)
    W = conv_output_dim(W, kernel=3, padding=1, stride=1)
    print(f"After 3×3 conv (s=1, p=1): {H}×{W}")

    # Downsampling block: 3×3 conv, stride 2, pad 1
    H = conv_output_dim(H, kernel=3, padding=1, stride=2)
    W = conv_output_dim(W, kernel=3, padding=1, stride=2)
    print(f"After 3×3 conv (s=2, p=1): {H}×{W}")

    # Another 3×3 same
    H = conv_output_dim(H, kernel=3, padding=1, stride=1)
    W = conv_output_dim(W, kernel=3, padding=1, stride=1)
    print(f"After 3×3 conv (s=1, p=1): {H}×{W}")

    # Downsample again
    H = conv_output_dim(H, kernel=3, padding=1, stride=2)
    W = conv_output_dim(W, kernel=3, padding=1, stride=2)
    print(f"After 3×3 conv (s=2, p=1): {H}×{W}")

if __name__ == "__main__":
    trace_cnn_dimensions()
```

The output dimension formula is essential for architecture design and debugging. When dimensions don't match in your network, the issue almost always traces to incorrect stride/padding settings. Internalize: Output = floor((Input + 2×Padding - Kernel) / Stride) + 1
While symmetric configurations (same padding on all sides, same stride in both directions) are most common, asymmetric setups have important uses.
Asymmetric Padding:
Instead of padding p on all sides, specify (p_top, p_bottom, p_left, p_right) or equivalently (p_h, p_w) for height and width paddings.
Use cases: achieving 'same' output with even kernel sizes, handling input dimensions that don't divide evenly by the stride, and causal (one-sided) padding for temporal or autoregressive convolutions.
Framework syntax:
- PyTorch nn.Conv2d: accepts padding=(p_h, p_w); use nn.ZeroPad2d for fully asymmetric (per-side) padding
- TensorFlow/Keras: padding='same' handles asymmetric padding automatically, or specify it explicitly

Asymmetric Strides:
Stride (s_h, s_w) allows different step sizes vertically and horizontally.
Use cases: inputs with strongly non-square aspect ratios (e.g., spectrograms or panoramic frames), or whenever one dimension should be downsampled faster than the other.
```python
import torch
import torch.nn as nn

def asymmetric_padding_example():
    """
    Demonstrate asymmetric padding for 'same' output with even kernel.
    """
    # Input: 8×8
    # Kernel: 4×4 (even)
    # For 'same' output with stride 1:
    #   need total padding = kernel - 1 = 3
    #   Asymmetric: 1 on top, 2 on bottom, 1 on left, 2 on right
    #   (or any distribution summing to 3 in each dimension)
    x = torch.randn(1, 1, 8, 8)

    # Asymmetric padding: (left, right, top, bottom) = (1, 2, 1, 2)
    pad = nn.ZeroPad2d((1, 2, 1, 2))
    x_padded = pad(x)

    print(f"Original shape: {x.shape}")
    print(f"After asymmetric padding (1,2,1,2): {x_padded.shape}")

    # Now apply 4×4 conv with no additional padding
    conv = nn.Conv2d(1, 1, kernel_size=4, padding=0, stride=1)
    y = conv(x_padded)
    print(f"After 4×4 conv: {y.shape}")
    print("Output matches input spatial dims: 8×8 ✓")

def asymmetric_stride_example():
    """
    Demonstrate asymmetric strides.
    """
    # Wide input: 16×32
    x = torch.randn(1, 3, 16, 32)

    # Downsample height by 2, width by 4
    conv = nn.Conv2d(3, 16, kernel_size=3, stride=(2, 4), padding=1)
    y = conv(x)

    print(f"Input shape: {x.shape}")
    print(f"Stride (2, 4) output: {y.shape}")
    # Output: (1, 16, 8, 8) - now square!

def fractional_strides():
    """
    Fractional strides via transposed convolution (for upsampling).
    """
    # Upsampling: 'fractional stride' = transposed/deconvolution
    x = torch.randn(1, 16, 8, 8)

    # Transposed conv with stride 2: conceptually stride 1/2, doubles size
    conv_transpose = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)
    y = conv_transpose(x)

    print(f"Input shape: {x.shape}")
    print(f"Transposed conv output: {y.shape}")
    # Output: (1, 8, 16, 16) - upsampled 2x

if __name__ == "__main__":
    asymmetric_padding_example()
    print()
    asymmetric_stride_example()
    print()
    fractional_strides()
```

Fractional Strides (Transposed Convolutions):
Stride < 1 doesn't make sense for standard convolution (can't move less than one position). But transposed convolutions (also called deconvolutions or fractionally-strided convolutions) achieve the conceptual equivalent: upsampling.
A transposed convolution with stride 2 doubles spatial dimensions—the 'inverse' of a regular stride-2 convolution. This is crucial for: semantic segmentation decoders, generative models (e.g., GAN generators), autoencoders, and super-resolution—any architecture that must recover spatial resolution.
Output Formula for Transposed Convolution:
$$H_{out} = (H - 1) \times s - 2p + K$$
Note: This is quite different from normal convolution. The stride now expands rather than contracts.
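Applied to the fractional_strides() sketch above (H = 8, s = 2, p = 1, K = 4), the formula predicts exactly the 16×16 output printed by the code:

$$H_{out} = (8 - 1) \times 2 - 2 \times 1 + 4 = 16$$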
TensorFlow and PyTorch handle 'same' padding slightly differently for even kernels. TensorFlow automatically applies asymmetric padding to achieve exact 'same' output. PyTorch requires manual asymmetric padding specification. Always verify dimensions in practice.
The receptive field of a neuron is the region of the input that influences its activation. Stride and padding choices profoundly affect how receptive fields grow through a CNN.
Single Layer Receptive Field:
For a single convolutional layer, the receptive field equals the kernel size K. Each output position 'sees' a K×K region of the input.
Multi-Layer Receptive Field:
For stacked layers, receptive fields grow. The formula for the receptive field after L layers (with kernel sizes Kₗ and strides sₗ) is:
$$RF = 1 + \sum_{l=1}^{L} (K_l - 1) \prod_{m=1}^{l-1} s_m$$
Intuition: each layer extends the receptive field by (Kₗ − 1), but that growth is scaled by the product of all preceding strides—one step in a downsampled feature map corresponds to several pixels in the original input.
Effect of Stride on Receptive Field Growth:
Stride 1 layers increase the receptive field by K − 1 each.
Stride 2 layers double the contribution of every subsequent layer, because one step in the downsampled map now spans two input pixels. The table below traces this growth, and the sketch after it reproduces the numbers programmatically.
| Layer | Kernel | Stride | Receptive Field | Calculation |
|---|---|---|---|---|
| Input | – | – | 1 | Starting point |
| Conv1 | 3×3 | 1 | 3 | 1 + (3-1)×1 = 3 |
| Conv2 | 3×3 | 2 | 5 | 3 + (3-1)×1 = 5 |
| Conv3 | 3×3 | 1 | 9 | 5 + (3-1)×2 = 9 (stride 2 doubles growth) |
| Conv4 | 3×3 | 1 | 13 | 9 + (3-1)×2 = 13 |
| Conv5 | 3×3 | 2 | 17 | 13 + (3-1)×2 = 17 |
| Conv6 | 3×3 | 1 | 25 | 17 + (3-1)×4 = 25 (two stride-2 layers: 2×2=4) |
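A small sketch of the receptive-field formula that reproduces the table above (the function name is illustrative):

```python
def receptive_field(kernels, strides):
    """
    Receptive field after a stack of conv layers:
    RF = 1 + sum over layers l of (K_l - 1) * product of preceding strides.
    """
    rf = 1
    jump = 1  # product of strides of all preceding layers
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# The six 3×3 layers from the table, with strides 1, 2, 1, 1, 2, 1
kernels = [3, 3, 3, 3, 3, 3]
strides = [1, 2, 1, 1, 2, 1]
for n in range(1, len(kernels) + 1):
    print(f"After Conv{n}: RF = {receptive_field(kernels[:n], strides[:n])}")
# Expected: 3, 5, 9, 13, 17, 25
```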
Why Receptive Field Matters:
Context Aggregation: Tasks like semantic segmentation need neurons that 'see' large regions to understand object context. Small receptive fields limit this.
Classification Accuracy: For image classification, the final feature map neurons should ideally 'see' the entire image (global receptive field).
Detection Scales: Object detectors must match receptive field to object sizes. A neuron with a 30×30 receptive field can't reliably detect 100×100 objects.
Efficiency Trade-offs: Large receptive fields can be achieved with large kernels (expensive), with many stacked stride-1 layers (deep networks), or with a few stride > 1 layers (fast but potentially losing fine detail).
The VGG Insight:
VGG showed that two 3×3 convolutions have the same receptive field as one 5×5, and three 3×3s equal one 7×7, but with fewer parameters and more nonlinearity: per channel pair, two 3×3 layers use 2 × 9 = 18 weights versus 25 for a single 5×5, and three 3×3 layers use 27 versus 49 for a 7×7, with an activation after every layer.
This insight drove the adoption of small 3×3 kernels throughout modern architectures.
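A quick arithmetic check of that comparison, ignoring biases and assuming C input and C output channels (the helper name is illustrative):

```python
def conv_params(kernel: int, channels: int) -> int:
    """Weights in one conv layer with `channels` in and out, ignoring biases."""
    return kernel * kernel * channels * channels

C = 64
print(2 * conv_params(3, C), "vs", conv_params(5, C))  # 73728 vs 102400
print(3 * conv_params(3, C), "vs", conv_params(7, C))  # 110592 vs 200704
```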
For ImageNet-scale classification (224×224 input), final layers should have receptive fields covering most/all of the image. Calculate your architecture's receptive field growth and ensure it's sufficient for your task. For tasks requiring fine detail, ensure enough resolution is preserved.
Across successful CNN architectures, certain stride/padding patterns appear repeatedly. Learning these patterns helps in designing and understanding networks.
The 'Same' Convolution: 3×3, stride 1, pad 1:
The most common single-layer configuration: it preserves spatial dimensions exactly, so layers can be stacked freely—the workhorse inside VGG-style and residual blocks.
The Downsampling Convolution: 3×3, stride 2, pad 1:
The standard for halving dimensions: it replaces a separate pooling layer and is typically paired with a doubling of the channel count to keep representational capacity roughly constant.
| Configuration | Kernel | Stride | Padding | Output Effect |
|---|---|---|---|---|
| Same (preserve size) | 3×3 | 1 | 1 | H_out = H_in |
| Same (larger kernel) | 5×5 | 1 | 2 | H_out = H_in |
| Downsample ½ | 3×3 | 2 | 1 | H_out = H_in / 2 |
| Aggressive downsample stem | 7×7 | 2 | 3 | H_out = H_in / 2 |
| 1×1 pointwise | 1×1 | 1 | 0 | H_out = H_in (channel mixing only) |
| Upsample 2× | 4×4 TransConv | 2 | 1 | H_out = 2 × H_in |
"""Trace dimensions through ResNet-18 architecture.Demonstrates standard stride/padding patterns.""" def conv_output(h: int, k: int, s: int, p: int) -> int: return (h + 2*p - k) // s + 1 def resnet18_dimension_trace(): h, w = 224, 224 print(f"Input: {h}×{w}") print("=" * 50) # Stem: 7×7 conv, stride 2, pad 3 h = conv_output(h, k=7, s=2, p=3) w = conv_output(w, k=7, s=2, p=3) print(f"After conv1 (7×7, s=2, p=3): {h}×{w} (64 channels)") # Max pool: 3×3, stride 2, pad 1 h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After maxpool (3×3, s=2, p=1): {h}×{w}") # Layer1: Two BasicBlocks, all 3×3 s=1 p=1 (same) print(f"After layer1 (2 blocks, same): {h}×{w} (64 channels)") # Layer2: First block has 3×3 s=2 for downsample h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After layer2 (downsample, then same): {h}×{w} (128 channels)") # Layer3: First block has 3×3 s=2 for downsample h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After layer3 (downsample, then same): {h}×{w} (256 channels)") # Layer4: First block has 3×3 s=2 for downsample h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After layer4 (downsample, then same): {h}×{w} (512 channels)") # Global average pool: 7×7 → 1×1 print(f"After global avg pool: 1×1 (512 channels)") # FC to 1000 classes print(f"After FC: 1000-dimensional output") if __name__ == "__main__": resnet18_dimension_trace()1×1 convolutions (stride 1, padding 0) are special: they preserve spatial dimensions while only mixing channels. No 'sliding' occurs; each spatial position is transformed independently. Used for channel reduction (bottlenecks), channel expansion, and adding nonlinearity.
Dimension mismatches are among the most common errors when building CNNs. Mastering stride/padding helps prevent and quickly diagnose these issues.
Error: 'RuntimeError: Expected input size X, but got Y'
This typically means: a stride/padding combination produced different output dimensions than you expected, the input size differs from what the architecture was designed for, or a flatten/fully-connected layer was sized for a different feature-map resolution.
Debugging Approach: trace the shape after every layer (as the DebugCNN sketch below does), compute the expected dimensions with the output formula, and compare layer by layer to find the first mismatch.
```python
import torch
import torch.nn as nn

class DebugCNN(nn.Module):
    """
    CNN with built-in shape tracing for debugging.
    """
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(256, 512, 3, stride=2, padding=1)

    def forward(self, x, debug=False):
        if debug:
            print(f"Input: {x.shape}")

        x = self.conv1(x)
        if debug:
            print(f"After conv1 (3×3, s=1, p=1): {x.shape}")

        x = self.conv2(x)
        if debug:
            print(f"After conv2 (3×3, s=2, p=1): {x.shape}")

        x = self.conv3(x)
        if debug:
            print(f"After conv3 (3×3, s=2, p=1): {x.shape}")

        x = self.conv4(x)
        if debug:
            print(f"After conv4 (3×3, s=2, p=1): {x.shape}")

        return x

def check_dimensions():
    """
    Verify dimensions match expectations.
    """
    model = DebugCNN()
    x = torch.randn(1, 3, 224, 224)

    print("Tracing dimensions through network:")
    print("=" * 50)
    output = model(x, debug=True)

    # Expected for 224×224 input:
    # conv1: 224 → 224 (same)
    # conv2: 224 → 112 (÷2)
    # conv3: 112 → 56 (÷2)
    # conv4: 56 → 28 (÷2)
    expected_h = 28
    actual_h = output.shape[2]

    print(f"\nExpected final size: {expected_h}×{expected_h}")
    print(f"Actual final size: {actual_h}×{output.shape[3]}")
    print(f"Match: {expected_h == actual_h}")

if __name__ == "__main__":
    check_dimensions()
```

For complex architectures, write a function that computes expected dimensions symbolically (without running data through). This catches errors before any computation and documents the intended dimension flow.
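One possible shape of such a symbolic checker, sketched against the DebugCNN layers above (the function name and layer specs are illustrative):

```python
def expected_dims(input_dim: int, layers) -> list:
    """
    Compute the spatial dimension after each (kernel, stride, padding) layer,
    without running any data through the network.
    """
    dims = [input_dim]
    for k, s, p in layers:
        dims.append((dims[-1] + 2 * p - k) // s + 1)
    return dims

# The same four layers as DebugCNN above: (kernel, stride, padding)
layers = [(3, 1, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
print(expected_dims(224, layers))  # [224, 224, 112, 56, 28]
```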
Stride and padding are the control mechanisms that determine spatial dimension evolution through CNNs. Mastery of these parameters is essential for architecture design and debugging.
What's Next:
The next page covers dilation (also called atrous or dilated convolutions)—a technique that expands the receptive field without increasing kernel parameters or reducing resolution. Dilation enables efficient large-context processing, crucial for tasks like semantic segmentation.
You now have complete command over how convolution parameters affect spatial dimensions. Every CNN architecture decision about resolution flow traces back to these stride and padding choices. With the output formula internalized, you can design and debug architectures with confidence.