When designing a CNN, choosing kernel sizes is only part of the story. Two additional parameters profoundly affect both the output dimensions and the computational characteristics of convolutional layers: stride and padding.
Stride controls how far the kernel moves between output positions. A stride of 1 means the kernel shifts one pixel at a time; a stride of 2 means it jumps two pixels, effectively downsampling the output.
Padding controls what happens at the input boundaries. Zero-padding adds extra values around the input, allowing the kernel to process edge regions fully and, critically, enabling control over output dimensions.
Together, stride and padding determine the output spatial dimensions, the computational cost of each layer, and how information at the input boundaries is treated.
This page provides complete mastery of these parameters, including the essential dimension formulas that every CNN practitioner must know.
By the end of this page, you will understand stride and padding from first principles, memorize the output dimension formulas, know when to use different stride/padding configurations, and appreciate the design tradeoffs in CNN architectures.
Definition:
The stride (s) specifies the step size of the kernel as it slides across the input. With stride s, the kernel moves s positions between consecutive output computations.
Stride 1 (default): the kernel visits every position, producing the densest possible output and preserving spatial resolution.
Stride > 1: the kernel skips positions, producing a smaller output and reducing computation—an implicit downsampling.
Visual Intuition:
Imagine a 3×3 kernel sliding over a 5×5 input.
With stride 1: The kernel can be placed at positions (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2). Output is 3×3.
With stride 2: The kernel jumps two positions: (0,0), (0,2), (2,0), (2,2). Output is 2×2.
| Stride | Kernel Positions (per row, 7×7 input, 3×3 kernel, no padding) | Output Dimension | Output Elements |
|---|---|---|---|
| s = 1 | Positions 0, 1, 2, 3, 4 (5 positions) | 5 × 5 | 25 |
| s = 2 | Positions 0, 2, 4 (3 positions) | 3 × 3 | 9 |
| s = 3 | Positions 0, 3 (2 positions) | 2 × 2 | 4 |
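A minimal sketch (plain NumPy and a naive loop rather than a framework call) of how stride changes the number of kernel placements, matching the table above:

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Naive 'valid' convolution (cross-correlation) with a given stride."""
    H, W = image.shape
    K = kernel.shape[0]
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+K, j*stride:j*stride+K]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3)) / 9.0  # simple averaging kernel

for s in (1, 2, 3):
    print(f"stride {s}: output shape {conv2d_valid(image, kernel, stride=s).shape}")
# stride 1: (5, 5), stride 2: (3, 3), stride 3: (2, 2)
```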
Stride as Controlled Downsampling:
Before strided convolutions became popular, CNNs used pooling layers (max-pool, average-pool) to reduce spatial dimensions. Strided convolutions achieve the same downsampling effect while simultaneously applying learned filters.
Benefits of strided convolutions over pooling: the downsampling is learned rather than fixed, filtering and downsampling happen in a single operation, and no separate pooling layer is needed.
Potential drawbacks: skipped positions can discard fine detail, strided sampling can introduce aliasing, and the layer adds parameters where pooling has none.
Asymmetric Strides:
Stride can differ between dimensions. Stride (s_h, s_w) specifies separate step sizes for height and width. This is useful when inputs have non-square aspect ratios, though square strides (s, s) are most common.
Stride 1 is used for most convolutional layers to preserve spatial resolution. Stride 2 is used at 'downsampling' points to halve spatial dimensions (replacing pooling in many modern architectures). Strides > 2 are rare because aggressive downsampling loses too much information.
The Boundary Problem:
Consider a 5×5 input and a 3×3 kernel with stride 1. Without padding, the kernel can only be placed where it fully fits within the input bounds—positions (0,0) through (2,2). The output is 3×3: smaller than the input.
This shrinkage compounds through layers: with 3×3 kernels and no padding, each layer removes two pixels per dimension, so ten layers shrink a 32×32 input to 12×12.
Moreover, edge pixels participate in fewer convolutions than center pixels, underweighting boundary information.
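A small sketch of that underweighting: count how many valid 3×3 kernel placements cover each pixel of a 5×5 input with no padding.

```python
import numpy as np

# Count how many 3×3 kernel placements cover each pixel of a 5×5 input (no padding).
H, W, K = 5, 5, 3
coverage = np.zeros((H, W), dtype=int)
for i in range(H - K + 1):        # valid top-left rows
    for j in range(W - K + 1):    # valid top-left columns
        coverage[i:i+K, j:j+K] += 1
print(coverage)
# Corner pixels are covered once; the center pixel nine times.
```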
Padding to the Rescue:
Padding adds extra values around the input's borders, allowing the kernel to extend beyond the original boundaries. The most common padding value is zero (hence 'zero-padding'), but other strategies exist.
Padding Amount:
Padding p specifies how many rows/columns to add on each side: p = 1 turns an H×W input into (H+2)×(W+2), p = 2 yields (H+4)×(W+4), and so on.
```python
import numpy as np

def demonstrate_padding():
    """
    Visualize zero-padding and its effect on convolution.
    """
    # 4×4 input image
    image = np.array([
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ])

    print("Original image (4×4):")
    print(image)
    print(f"Shape: {image.shape}")

    # Padding p=1: add one row/column of zeros on each side
    padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
    print("\nZero-padded image (p=1, becomes 6×6):")
    print(padded)
    print(f"Shape: {padded.shape}")

    # With a 3×3 kernel:
    # - No padding: output is 4-3+1 = 2×2
    # - Padding=1: output is (4+2)-3+1 = 4×4 (same as input!)

    # Padding p=2: add two rows/columns of zeros on each side
    padded_2 = np.pad(image, pad_width=2, mode='constant', constant_values=0)
    print("\nZero-padded image (p=2, becomes 8×8):")
    print(padded_2)
    print(f"Shape: {padded_2.shape}")

def calculate_output_dimensions():
    """
    Calculate output dimensions for various padding amounts.
    """
    H, W = 7, 7  # Input height, width
    K = 3        # Kernel size
    s = 1        # Stride

    print(f"Input: {H}×{W}, Kernel: {K}×{K}, Stride: {s}")
    print("-" * 40)

    for p in range(4):
        H_out = (H + 2*p - K) // s + 1
        W_out = (W + 2*p - K) // s + 1
        print(f"Padding p={p}: Output is {H_out}×{W_out}")

if __name__ == "__main__":
    demonstrate_padding()
    print()
    calculate_output_dimensions()
```

Padding Strategies:
Zero Padding: fills the border with zeros—the simplest strategy and the default in most frameworks.
Replicate/Edge Padding: repeats the nearest edge value outward, avoiding the artificial zero edge.
Reflect Padding: mirrors the values just inside the border, preserving local statistics near the edge.
Circular/Periodic Padding: wraps around to the opposite side, treating the input as periodic (useful for inputs that genuinely wrap, such as panoramas).
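NumPy's pad modes map directly onto these strategies; a minimal illustration:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Each np.pad mode corresponds to one of the strategies above.
print(np.pad(x, 1, mode='constant'))  # zero padding
print(np.pad(x, 1, mode='edge'))      # replicate/edge padding
print(np.pad(x, 1, mode='reflect'))   # reflect padding
print(np.pad(x, 1, mode='wrap'))      # circular/periodic padding
```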
Deep learning frameworks primarily use zero padding for its simplicity and efficiency.
Zero-padding introduces artificial edges where the image transitions to zeros. CNNs can learn to detect these artifacts, potentially making predictions boundary-dependent. For applications where boundary behavior matters (e.g., medical imaging, satellite imagery), consider non-zero padding strategies or analyze boundary effects carefully.
The most important formula in CNN architecture design computes the output spatial dimensions given input size, kernel size, padding, and stride.
The Formula:
For input dimension H (height) or W (width):
$$H_{out} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1$$
Where: H is the input dimension, p is the padding added to each side, K is the kernel size, s is the stride, and ⌊·⌋ denotes the floor function.
Deriving the Formula: after padding, the input has H + 2p positions. The kernel's top edge can start anywhere from position 0 to H + 2p − K. With stride s, the valid starting positions are 0, s, 2s, …, so there are ⌊(H + 2p − K)/s⌋ + 1 of them—one output element per valid position.
| Input | Kernel | Padding | Stride | Calculation | Output |
|---|---|---|---|---|---|
| 32×32 | 3×3 | 0 | 1 | ⌊(32+0-3)/1⌋+1 = 30 | 30×30 |
| 32×32 | 3×3 | 1 | 1 | ⌊(32+2-3)/1⌋+1 = 32 | 32×32 ✓ same |
| 32×32 | 3×3 | 1 | 2 | ⌊(32+2-3)/2⌋+1 = 16 | 16×16 |
| 224×224 | 7×7 | 3 | 2 | ⌊(224+6-7)/2⌋+1 = 112 | 112×112 |
| 112×112 | 3×3 | 1 | 2 | ⌊(112+2-3)/2⌋+1 = 56 | 56×56 |
Special Cases:
'Same' Padding (output size equals input size, stride 1):
To achieve H_out = H with s = 1:
$$p = \frac{K - 1}{2}$$
For odd K (e.g., K=3 → p=1, K=5 → p=2), this is an integer. For even K, 'same' padding requires asymmetric padding (different amounts on each side)—less common.
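A minimal check in PyTorch (assuming a version recent enough to accept padding='same', which applies to stride 1 only):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)

# For odd kernels, padding=(K-1)//2 reproduces 'same' behavior explicitly.
conv_explicit = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding=2)
print(conv_explicit(x).shape)  # torch.Size([1, 1, 32, 32])

# Newer PyTorch versions also accept padding='same' directly (stride 1 only).
conv_same = nn.Conv2d(1, 1, kernel_size=5, stride=1, padding='same')
print(conv_same(x).shape)      # torch.Size([1, 1, 32, 32])
```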
'Valid' Padding (no padding, output shrinks):
With p = 0:
$$H_{out} = \left\lfloor \frac{H - K}{s} \right\rfloor + 1$$
The output is smaller than the input, and edge pixels participate in fewer computations.
Dimensions Must Be Positive:
The output dimension must satisfy H_out > 0. This requires:
$$H + 2p \geq K$$
The padded input must be at least as large as the kernel—otherwise, we can't place the kernel anywhere.
```python
def conv_output_dim(input_dim: int, kernel: int, padding: int, stride: int) -> int:
    """
    Calculate the output dimension for convolution.

    Args:
        input_dim: Input height or width
        kernel: Kernel size
        padding: Padding added to each side
        stride: Stride of the convolution

    Returns:
        Output dimension

    Raises:
        ValueError: If output dimension would be < 1
    """
    output = (input_dim + 2 * padding - kernel) // stride + 1
    if output < 1:
        raise ValueError(
            f"Invalid configuration: input={input_dim}, kernel={kernel}, "
            f"padding={padding}, stride={stride} yields output={output}"
        )
    return output

def same_padding(kernel: int) -> int:
    """
    Calculate padding needed for 'same' output (stride 1).

    For odd kernels, returns (kernel - 1) / 2.
    For even kernels, would require asymmetric padding.
    """
    if kernel % 2 == 0:
        print(f"Warning: Even kernel {kernel} requires asymmetric padding for 'same'")
    return (kernel - 1) // 2

def trace_cnn_dimensions():
    """
    Trace dimensions through a typical CNN architecture.
    """
    # Example: Simplified ResNet-style stem and blocks
    H, W = 224, 224
    print(f"Input: {H}×{W}")
    print("-" * 50)

    # Layer 1: 7×7 conv, stride 2, pad 3
    H = conv_output_dim(H, kernel=7, padding=3, stride=2)
    W = conv_output_dim(W, kernel=7, padding=3, stride=2)
    print(f"After 7×7 conv (s=2, p=3): {H}×{W}")

    # Max pool: 3×3, stride 2, pad 1
    H = conv_output_dim(H, kernel=3, padding=1, stride=2)
    W = conv_output_dim(W, kernel=3, padding=1, stride=2)
    print(f"After 3×3 maxpool (s=2, p=1): {H}×{W}")

    # ResNet Block 1: 3×3 conv, stride 1, pad 1 (same)
    H = conv_output_dim(H, kernel=3, padding=1, stride=1)
    W = conv_output_dim(W, kernel=3, padding=1, stride=1)
    print(f"After 3×3 conv (s=1, p=1): {H}×{W}")

    # Downsampling block: 3×3 conv, stride 2, pad 1
    H = conv_output_dim(H, kernel=3, padding=1, stride=2)
    W = conv_output_dim(W, kernel=3, padding=1, stride=2)
    print(f"After 3×3 conv (s=2, p=1): {H}×{W}")

    # Another 3×3 same
    H = conv_output_dim(H, kernel=3, padding=1, stride=1)
    W = conv_output_dim(W, kernel=3, padding=1, stride=1)
    print(f"After 3×3 conv (s=1, p=1): {H}×{W}")

    # Downsample again
    H = conv_output_dim(H, kernel=3, padding=1, stride=2)
    W = conv_output_dim(W, kernel=3, padding=1, stride=2)
    print(f"After 3×3 conv (s=2, p=1): {H}×{W}")

if __name__ == "__main__":
    trace_cnn_dimensions()
```

The output dimension formula is essential for architecture design and debugging. When dimensions don't match in your network, the issue almost always traces to incorrect stride/padding settings. Internalize: Output = floor((Input + 2×Padding - Kernel) / Stride) + 1
While symmetric configurations (same padding on all sides, same stride in both directions) are most common, asymmetric setups have important uses.
Asymmetric Padding:
Instead of padding p on all sides, specify (p_top, p_bottom, p_left, p_right) or equivalently (p_h, p_w) for height and width paddings.
Use cases: achieving 'same' output with even kernel sizes, handling input dimensions that don't divide evenly by the stride, and causal (one-sided) padding for temporal or autoregressive convolutions.
Framework syntax:
- PyTorch nn.Conv2d: accepts padding=(p_h, p_w); use nn.ZeroPad2d for fully asymmetric (per-side) padding
- TensorFlow/Keras: padding='same' handles asymmetric padding automatically, or specify it explicitly

Asymmetric Strides:
Stride (s_h, s_w) allows different step sizes vertically and horizontally.
Use cases: inputs with strongly non-square aspect ratios (e.g., spectrograms or panoramic frames), or whenever one dimension should be downsampled faster than the other.
```python
import torch
import torch.nn as nn

def asymmetric_padding_example():
    """
    Demonstrate asymmetric padding for 'same' output with even kernel.
    """
    # Input: 8×8
    # Kernel: 4×4 (even)
    # For 'same' output with stride 1:
    #   need total padding = kernel - 1 = 3
    #   Asymmetric: 1 on top, 2 on bottom, 1 on left, 2 on right
    #   (or any distribution summing to 3 in each dimension)
    x = torch.randn(1, 1, 8, 8)

    # Asymmetric padding: (left, right, top, bottom) = (1, 2, 1, 2)
    pad = nn.ZeroPad2d((1, 2, 1, 2))
    x_padded = pad(x)

    print(f"Original shape: {x.shape}")
    print(f"After asymmetric padding (1,2,1,2): {x_padded.shape}")

    # Now apply 4×4 conv with no additional padding
    conv = nn.Conv2d(1, 1, kernel_size=4, padding=0, stride=1)
    y = conv(x_padded)
    print(f"After 4×4 conv: {y.shape}")
    print("Output matches input spatial dims: 8×8 ✓")

def asymmetric_stride_example():
    """
    Demonstrate asymmetric strides.
    """
    # Wide input: 16×32
    x = torch.randn(1, 3, 16, 32)

    # Downsample height by 2, width by 4
    conv = nn.Conv2d(3, 16, kernel_size=3, stride=(2, 4), padding=1)
    y = conv(x)

    print(f"Input shape: {x.shape}")
    print(f"Stride (2, 4) output: {y.shape}")
    # Output: (1, 16, 8, 8) - now square!

def fractional_strides():
    """
    Fractional strides via transposed convolution (for upsampling).
    """
    # Upsampling: 'fractional stride' = transposed/deconvolution
    x = torch.randn(1, 16, 8, 8)

    # Transposed conv with stride 2: conceptually stride 1/2, doubles size
    conv_transpose = nn.ConvTranspose2d(16, 8, kernel_size=4, stride=2, padding=1)
    y = conv_transpose(x)

    print(f"Input shape: {x.shape}")
    print(f"Transposed conv output: {y.shape}")
    # Output: (1, 8, 16, 16) - upsampled 2x

if __name__ == "__main__":
    asymmetric_padding_example()
    print()
    asymmetric_stride_example()
    print()
    fractional_strides()
```

Fractional Strides (Transposed Convolutions):
Stride < 1 doesn't make sense for standard convolution (can't move less than one position). But transposed convolutions (also called deconvolutions or fractionally-strided convolutions) achieve the conceptual equivalent: upsampling.
A transposed convolution with stride 2 doubles spatial dimensions—the 'inverse' of a regular stride-2 convolution. This is crucial for: semantic segmentation decoders, generative models (e.g., GAN generators), autoencoders, and super-resolution—any architecture that must recover spatial resolution.
Output Formula for Transposed Convolution:
$$H_{out} = (H - 1) \times s - 2p + K$$
Note: This is quite different from normal convolution. The stride now expands rather than contracts.
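Applied to the fractional_strides() sketch above (H = 8, s = 2, p = 1, K = 4), the formula predicts exactly the 16×16 output printed by the code:

$$H_{out} = (8 - 1) \times 2 - 2 \times 1 + 4 = 16$$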
TensorFlow and PyTorch handle 'same' padding slightly differently for even kernels. TensorFlow automatically applies asymmetric padding to achieve exact 'same' output. PyTorch requires manual asymmetric padding specification. Always verify dimensions in practice.
The receptive field of a neuron is the region of the input that influences its activation. Stride and padding choices profoundly affect how receptive fields grow through a CNN.
Single Layer Receptive Field:
For a single convolutional layer, the receptive field equals the kernel size K. Each output position 'sees' a K×K region of the input.
Multi-Layer Receptive Field:
For stacked layers, receptive fields grow. The formula for the receptive field after L layers (with kernel sizes Kₗ and strides sₗ) is:
$$RF = 1 + \sum_{l=1}^{L} (K_l - 1) \prod_{m=1}^{l-1} s_m$$
Intuition: each layer extends the receptive field by (Kₗ − 1), but that growth is scaled by the product of all preceding strides—one step in a downsampled feature map corresponds to several pixels in the original input.
Effect of Stride on Receptive Field Growth:
Stride 1 layers increase the receptive field by K − 1 each.
Stride 2 layers double the contribution of every subsequent layer, because one step in the downsampled map now spans two input pixels. The table below traces this growth, and the sketch after it reproduces the numbers programmatically.
| Layer | Kernel | Stride | Receptive Field | Calculation |
|---|---|---|---|---|
| Input | – | – | 1 | Starting point |
| Conv1 | 3×3 | 1 | 3 | 1 + (3-1)×1 = 3 |
| Conv2 | 3×3 | 2 | 5 | 3 + (3-1)×1 = 5 |
| Conv3 | 3×3 | 1 | 9 | 5 + (3-1)×2 = 9 (stride 2 doubles growth) |
| Conv4 | 3×3 | 1 | 13 | 9 + (3-1)×2 = 13 |
| Conv5 | 3×3 | 2 | 17 | 13 + (3-1)×2 = 17 |
| Conv6 | 3×3 | 1 | 25 | 17 + (3-1)×4 = 25 (two stride-2 layers: 2×2=4) |
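A small sketch of the receptive-field formula that reproduces the table above (the function name is illustrative):

```python
def receptive_field(kernels, strides):
    """
    Receptive field after a stack of conv layers:
    RF = 1 + sum over layers l of (K_l - 1) * product of preceding strides.
    """
    rf = 1
    jump = 1  # product of strides of all preceding layers
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# The six 3×3 layers from the table, with strides 1, 2, 1, 1, 2, 1
kernels = [3, 3, 3, 3, 3, 3]
strides = [1, 2, 1, 1, 2, 1]
for n in range(1, len(kernels) + 1):
    print(f"After Conv{n}: RF = {receptive_field(kernels[:n], strides[:n])}")
# Expected: 3, 5, 9, 13, 17, 25
```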
Why Receptive Field Matters:
Context Aggregation: Tasks like semantic segmentation need neurons that 'see' large regions to understand object context. Small receptive fields limit this.
Classification Accuracy: For image classification, the final feature map neurons should ideally 'see' the entire image (global receptive field).
Detection Scales: Object detectors must match receptive field to object sizes. A neuron with a 30×30 receptive field can't reliably detect 100×100 objects.
Efficiency Trade-offs: Large receptive fields can be achieved with large kernels (expensive), with many stacked stride-1 layers (deep networks), or with a few stride > 1 layers (fast but potentially losing fine detail).
The VGG Insight:
VGG showed that two 3×3 convolutions have the same receptive field as one 5×5, and three 3×3s equal one 7×7, but with fewer parameters and more nonlinearity: per channel pair, two 3×3 layers use 2 × 9 = 18 weights versus 25 for a single 5×5, and three 3×3 layers use 27 versus 49 for a 7×7, with an activation after every layer.
This insight drove the adoption of small 3×3 kernels throughout modern architectures.
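A quick arithmetic check of that comparison, ignoring biases and assuming C input and C output channels (the helper name is illustrative):

```python
def conv_params(kernel: int, channels: int) -> int:
    """Weights in one conv layer with `channels` in and out, ignoring biases."""
    return kernel * kernel * channels * channels

C = 64
print(2 * conv_params(3, C), "vs", conv_params(5, C))  # 73728 vs 102400
print(3 * conv_params(3, C), "vs", conv_params(7, C))  # 110592 vs 200704
```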
For ImageNet-scale classification (224×224 input), final layers should have receptive fields covering most/all of the image. Calculate your architecture's receptive field growth and ensure it's sufficient for your task. For tasks requiring fine detail, ensure enough resolution is preserved.
Across successful CNN architectures, certain stride/padding patterns appear repeatedly. Learning these patterns helps in designing and understanding networks.
The 'Same' Convolution: 3×3, stride 1, pad 1:
The most common single-layer configuration: it preserves spatial dimensions exactly, so layers can be stacked freely—the workhorse inside VGG-style and residual blocks.
The Downsampling Convolution: 3×3, stride 2, pad 1:
The standard for halving dimensions: it replaces a separate pooling layer and is typically paired with a doubling of the channel count to keep representational capacity roughly constant.
| Configuration | Kernel | Stride | Padding | Output Effect |
|---|---|---|---|---|
| Same (preserve size) | 3×3 | 1 | 1 | H_out = H_in |
| Same (larger kernel) | 5×5 | 1 | 2 | H_out = H_in |
| Downsample ½ | 3×3 | 2 | 1 | H_out = H_in / 2 |
| Aggressive downsample stem | 7×7 | 2 | 3 | H_out = H_in / 2 |
| 1×1 pointwise | 1×1 | 1 | 0 | H_out = H_in (channel mixing only) |
| Upsample 2× | 4×4 TransConv | 2 | 1 | H_out = 2 × H_in |
"""Trace dimensions through ResNet-18 architecture.Demonstrates standard stride/padding patterns.""" def conv_output(h: int, k: int, s: int, p: int) -> int: return (h + 2*p - k) // s + 1 def resnet18_dimension_trace(): h, w = 224, 224 print(f"Input: {h}×{w}") print("=" * 50) # Stem: 7×7 conv, stride 2, pad 3 h = conv_output(h, k=7, s=2, p=3) w = conv_output(w, k=7, s=2, p=3) print(f"After conv1 (7×7, s=2, p=3): {h}×{w} (64 channels)") # Max pool: 3×3, stride 2, pad 1 h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After maxpool (3×3, s=2, p=1): {h}×{w}") # Layer1: Two BasicBlocks, all 3×3 s=1 p=1 (same) print(f"After layer1 (2 blocks, same): {h}×{w} (64 channels)") # Layer2: First block has 3×3 s=2 for downsample h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After layer2 (downsample, then same): {h}×{w} (128 channels)") # Layer3: First block has 3×3 s=2 for downsample h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After layer3 (downsample, then same): {h}×{w} (256 channels)") # Layer4: First block has 3×3 s=2 for downsample h = conv_output(h, k=3, s=2, p=1) w = conv_output(w, k=3, s=2, p=1) print(f"After layer4 (downsample, then same): {h}×{w} (512 channels)") # Global average pool: 7×7 → 1×1 print(f"After global avg pool: 1×1 (512 channels)") # FC to 1000 classes print(f"After FC: 1000-dimensional output") if __name__ == "__main__": resnet18_dimension_trace()1×1 convolutions (stride 1, padding 0) are special: they preserve spatial dimensions while only mixing channels. No 'sliding' occurs; each spatial position is transformed independently. Used for channel reduction (bottlenecks), channel expansion, and adding nonlinearity.
Dimension mismatches are among the most common errors when building CNNs. Mastering stride/padding helps prevent and quickly diagnose these issues.
Error: 'RuntimeError: Expected input size X, but got Y'
This typically means: a stride/padding combination produced different output dimensions than you expected, the input size differs from what the architecture was designed for, or a flatten/fully-connected layer was sized for a different feature-map resolution.
Debugging Approach: trace the shape after every layer (as the DebugCNN sketch below does), compute the expected dimensions with the output formula, and compare layer by layer to find the first mismatch.
```python
import torch
import torch.nn as nn

class DebugCNN(nn.Module):
    """
    CNN with built-in shape tracing for debugging.
    """
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)
        self.conv4 = nn.Conv2d(256, 512, 3, stride=2, padding=1)

    def forward(self, x, debug=False):
        if debug:
            print(f"Input: {x.shape}")

        x = self.conv1(x)
        if debug:
            print(f"After conv1 (3×3, s=1, p=1): {x.shape}")

        x = self.conv2(x)
        if debug:
            print(f"After conv2 (3×3, s=2, p=1): {x.shape}")

        x = self.conv3(x)
        if debug:
            print(f"After conv3 (3×3, s=2, p=1): {x.shape}")

        x = self.conv4(x)
        if debug:
            print(f"After conv4 (3×3, s=2, p=1): {x.shape}")

        return x

def check_dimensions():
    """
    Verify dimensions match expectations.
    """
    model = DebugCNN()
    x = torch.randn(1, 3, 224, 224)

    print("Tracing dimensions through network:")
    print("=" * 50)
    output = model(x, debug=True)

    # Expected for 224×224 input:
    # conv1: 224 → 224 (same)
    # conv2: 224 → 112 (÷2)
    # conv3: 112 → 56 (÷2)
    # conv4: 56 → 28 (÷2)
    expected_h = 28
    actual_h = output.shape[2]

    print(f"\nExpected final size: {expected_h}×{expected_h}")
    print(f"Actual final size: {actual_h}×{output.shape[3]}")
    print(f"Match: {expected_h == actual_h}")

if __name__ == "__main__":
    check_dimensions()
```

For complex architectures, write a function that computes expected dimensions symbolically (without running data through). This catches errors before any computation and documents the intended dimension flow.
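One possible shape of such a symbolic checker, sketched against the DebugCNN layers above (the function name and layer specs are illustrative):

```python
def expected_dims(input_dim: int, layers) -> list:
    """
    Compute the spatial dimension after each (kernel, stride, padding) layer,
    without running any data through the network.
    """
    dims = [input_dim]
    for k, s, p in layers:
        dims.append((dims[-1] + 2 * p - k) // s + 1)
    return dims

# The same four layers as DebugCNN above: (kernel, stride, padding)
layers = [(3, 1, 1), (3, 2, 1), (3, 2, 1), (3, 2, 1)]
print(expected_dims(224, layers))  # [224, 224, 112, 56, 28]
```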
Stride and padding are the control mechanisms that determine spatial dimension evolution through CNNs. Mastery of these parameters is essential for architecture design and debugging.
What's Next:
The next page covers dilation (also called atrous or dilated convolutions)—a technique that expands the receptive field without increasing kernel parameters or reducing resolution. Dilation enables efficient large-context processing, crucial for tasks like semantic segmentation.
You now have complete command over how convolution parameters affect spatial dimensions. Every CNN architecture decision about resolution flow traces back to these stride and padding choices. With the output formula internalized, you can design and debug architectures with confidence.