While max pooling captures the strongest activation within a region, an equally fundamental question arises: what if we care about the overall activation pattern rather than just its peak? This perspective leads us to average pooling—an operation that computes the arithmetic mean of values within each pooling region.
Average pooling offers a complementary approach to spatial aggregation. Where max pooling asks "Is this feature present strongly anywhere in this region?", average pooling asks "What is the overall strength of activation across this region?" These are fundamentally different questions, and the choice between them has profound implications for what networks learn to represent.
By the end of this page, you will understand the mathematical formulation of average pooling, how gradients distribute uniformly during backpropagation, the smoothing and noise-averaging properties, when average pooling outperforms max pooling, its role in Global Average Pooling (GAP), and practical implementation considerations.
Average pooling computes the arithmetic mean of all values within the pooling window. Let's formalize this precisely.
Notation and Setup:
For an input feature map $X \in \mathbb{R}^{C \times H \times W}$ with a pooling window of size $k \times k$ and stride $s$, the average pooling operation for channel $c$ at output position $(i, j)$ is:
$$Y_{c,i,j} = \frac{1}{k^2} \sum_{(m,n) \in \mathcal{R}_{i,j}} X_{c,m,n}$$
where $\mathcal{R}_{i,j}$ defines the receptive field region:
$$\mathcal{R}_{i,j} = \{(m,n) : i \cdot s \leq m < i \cdot s + k,\; j \cdot s \leq n < j \cdot s + k\}$$
The key distinction from max pooling is the summation followed by division rather than a selection operation.
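To make the formula concrete, here is a minimal worked example with hypothetical values: a 4×4 single-channel map pooled with a 2×2 window and stride 2.

```python
import numpy as np

# Hypothetical 4×4 single-channel feature map (illustrative values)
X = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 2., 2.],
              [0., 4., 2., 2.]])

# 2×2 average pooling, stride 2: group into 2×2 blocks and average each block
Y = X.reshape(2, 2, 2, 2).mean(axis=(1, 3))
print(Y)
# [[2.5 6.5]
#  [1.  2. ]]
```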
Unlike max pooling, which is nonlinear (piecewise linear), average pooling is a strictly linear operation: it can be represented as a convolution with a uniform kernel whose entries are all $1/k^2$. This linearity has implications for gradient flow and compositional analysis.
Output Dimensions:
The output dimension calculation is identical to max pooling:
$$H_{out} = \left\lfloor \frac{H - k + 2p}{s} \right\rfloor + 1$$
$$W_{out} = \left\lfloor \frac{W - k + 2p}{s} \right\rfloor + 1$$
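As a quick sanity check, the formula can be evaluated directly; the helper function below and its example sizes are illustrative, not a framework API.

```python
def avg_pool_output_size(H, W, k, s, p=0):
    """Output height/width of average pooling using the floor formula above."""
    H_out = (H + 2 * p - k) // s + 1
    W_out = (W + 2 * p - k) // s + 1
    return H_out, W_out

print(avg_pool_output_size(28, 28, k=2, s=2))      # (14, 14)
print(avg_pool_output_size(7, 7, k=3, s=2, p=1))   # (4, 4)
```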
Average Pooling as Convolution:
Average pooling with a $k \times k$ window is mathematically equivalent to convolution with a kernel:
$$K = \frac{1}{k^2} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}$$
This equivalence means average pooling can be implemented using optimized convolution routines, though dedicated implementations are typically more efficient.
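A minimal sketch of this equivalence in PyTorch, using a depthwise (grouped) convolution so each channel is averaged independently; the tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8, 8)   # (batch, channels, H, W), illustrative sizes
k = 2

# Dedicated average pooling
pooled = F.avg_pool2d(x, kernel_size=k, stride=k)

# Depthwise convolution with a uniform 1/k² kernel per channel
uniform_kernel = torch.full((x.shape[1], 1, k, k), 1.0 / (k * k))
conv_avg = F.conv2d(x, uniform_kernel, stride=k, groups=x.shape[1])

print(torch.allclose(pooled, conv_avg, atol=1e-6))  # True
```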
| Property | Average Pooling | Max Pooling |
|---|---|---|
| Operation type | Linear (weighted sum) | Nonlinear (selection) |
| Formula | Mean of values | Maximum of values |
| Gradient distribution | Uniform to all inputs | All to maximum only |
| Convolution equivalent | Yes (box filter) | No |
| Sensitivity to outliers | Low (averaging smooths) | High (outliers become output) |
| Information preservation | Aggregate statistics | Peak activation |
The Averaging Perspective:
Conceptually, average pooling implements a soft, AND-like aggregation: the output is large only when most of the region is strongly activated, so no single value can dominate the result.
This contrasts sharply with max pooling's OR-like behavior where a single strong response dominates the output.
Channel-wise Independence:
Like max pooling, average pooling operates independently on each channel. The mean is computed only among spatial neighbors within the same feature map. Cross-channel averaging (if desired) would require different operations like channel pooling or dimensionality reduction.
The gradient behavior of average pooling differs fundamentally from max pooling, with important implications for training dynamics.
Gradient of the Mean:
For a scalar average operation $y = \frac{1}{n} \sum_{i=1}^{n} x_i$, the gradient with respect to each input is:
$$\frac{\partial y}{\partial x_i} = \frac{1}{n}$$
Every input contributes equally to the output, and thus every input receives an equal share of the upstream gradient.
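This can be checked directly with automatic differentiation; the input values below are arbitrary.

```python
import torch

x = torch.tensor([1.0, 3.0, 2.0, 0.5], requires_grad=True)
y = x.mean()     # y = (1/n) * sum(x_i), here n = 4
y.backward()

print(x.grad)    # tensor([0.2500, 0.2500, 0.2500, 0.2500]) — each input receives 1/n
```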
During backpropagation through average pooling, every position in the pooling region receives 1/k² of the upstream gradient. This 'democratic' distribution contrasts with max pooling's winner-take-all approach, where only the maximum position receives the gradient.
Backpropagation Algorithm:
For average pooling, the backward pass is straightforward: each input position in a pooling region accumulates $\frac{1}{k^2}$ of the upstream gradient at the corresponding output position (overlapping windows simply sum their contributions).
Implications for Learning: because every input receives gradient, all positions in a region are updated during training, which encourages smoother, more distributed representations than max pooling's sparse, winner-take-all updates. The implementation below shows both passes and contrasts the two gradient patterns.
```python
import numpy as np

def avg_pool_forward(X, pool_size=2, stride=2):
    """
    Forward pass for average pooling.

    Args:
        X: Input tensor of shape (batch, channels, height, width)
        pool_size: Size of pooling window (assumes square)
        stride: Stride for pooling operation

    Returns:
        out: Pooled output
    """
    N, C, H, W = X.shape
    H_out = (H - pool_size) // stride + 1
    W_out = (W - pool_size) // stride + 1
    out = np.zeros((N, C, H_out, W_out))

    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            w_start = j * stride
            # Extract pooling region and compute mean
            region = X[:, :, h_start:h_start+pool_size, w_start:w_start+pool_size]
            out[:, :, i, j] = np.mean(region, axis=(2, 3))

    return out

def avg_pool_backward(dout, input_shape, pool_size=2, stride=2):
    """
    Backward pass for average pooling.

    Gradients are distributed uniformly to all positions in the pooling
    region. Each position receives 1/(pool_size^2) of the upstream gradient.
    """
    N, C, H, W = input_shape
    dX = np.zeros(input_shape)
    _, _, H_out, W_out = dout.shape
    scale = 1.0 / (pool_size * pool_size)

    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            w_start = j * stride
            # Distribute gradient uniformly
            for m in range(pool_size):
                for n in range(pool_size):
                    dX[:, :, h_start+m, w_start+n] += dout[:, :, i, j] * scale

    return dX

# Comparison of gradient behavior
def compare_gradient_behavior():
    """
    Demonstrate the difference between max and average pooling gradients.
    """
    # Create a 2x2 region with varying activations
    region = np.array([[1.0, 3.0],
                       [2.0, 0.5]])
    upstream_grad = 1.0  # Gradient from the layer above

    # Max pooling: gradient goes only to maximum (3.0 at position [0,1])
    max_grad = np.array([[0.0, 1.0],
                         [0.0, 0.0]])

    # Average pooling: gradient distributed uniformly
    avg_grad = np.array([[0.25, 0.25],
                         [0.25, 0.25]])

    print("Input region:")
    print(region)
    print(f"\nMax pooling output: {np.max(region):.2f}")
    print(f"Average pooling output: {np.mean(region):.2f}")
    print(f"\nMax pooling gradient:\n{max_grad}")
    print(f"\nAverage pooling gradient:\n{avg_grad}")

compare_gradient_behavior()
```

Gradient Vanishing Considerations:
The 1/k² scaling in average pooling gradients can compound across multiple layers:
| Layers | Pool Size | Combined Gradient Scale |
|---|---|---|
| 1 | 2×2 | 1/4 = 0.25 |
| 2 | 2×2 | 1/16 = 0.0625 |
| 3 | 2×2 | 1/64 ≈ 0.0156 |
| 4 | 2×2 | 1/256 ≈ 0.0039 |
| 5 | 2×2 | 1/1024 ≈ 0.00098 |
After 5 average pooling layers, gradients are scaled by roughly 1/1000. While this illustrates the theoretical concern, in practice it is rarely a problem: networks use only a handful of pooling stages, the convolutional layers between them have learnable weights that can rescale signals, and normalization layers keep activation and gradient magnitudes in a healthy range.
Average pooling acts as a low-pass filter on feature maps, smoothing spatial variations and averaging out noise. This property has both advantages and disadvantages depending on the application.
Signal Processing Perspective:
The averaging operation is equivalent to convolution with a box filter. In the frequency domain, this corresponds to multiplying the signal's spectrum by a sinc-shaped response, which passes low spatial frequencies nearly unchanged while attenuating high frequencies (with some ripple from the sinc's side lobes).
Every averaging operation is fundamentally a low-pass filter. In average pooling, this means fine-grained details and high-frequency noise are suppressed, while broad spatial patterns are preserved. This makes average pooling effective for applications where aggregate statistics matter more than precise patterns.
Noise Reduction:
Average pooling's noise-reducing property follows from the statistical properties of averaging: the mean of $n$ independent noise samples has its variance reduced by a factor of $n$ (standard deviation by $\sqrt{n}$), so a $2 \times 2$ window cuts noise variance by 4 while leaving the underlying signal's local mean intact.
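A quick numerical check of this variance argument, simulating many 2×2 windows of pure noise (the noise parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=(100_000, 4))  # 4 i.i.d. noise samples per 2×2 window

pooled = noise.mean(axis=1)    # average pooling applied to pure noise
print(round(noise.var(), 3))   # ≈ 1.0  (per-sample variance)
print(round(pooled.var(), 3))  # ≈ 0.25 (variance reduced by the window size n = 4)
```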
This is particularly valuable in domains where measurements are inherently noisy, such as medical imaging, audio spectrograms, and other sensor or scientific data.
Contrast with Max Pooling:
```python
import numpy as np
import torch
import torch.nn.functional as F

def demonstrate_noise_reduction():
    """
    Demonstrate how average pooling reduces noise while max pooling
    can amplify noise peaks.
    """
    np.random.seed(42)

    # Create a clean signal: a centered blob of activation
    H, W = 8, 8
    clean_signal = np.zeros((1, 1, H, W))

    # Create a smooth Gaussian-like feature
    for i in range(H):
        for j in range(W):
            dist_sq = (i - 3.5)**2 + (j - 3.5)**2
            clean_signal[0, 0, i, j] = np.exp(-dist_sq / 4.0)

    # Add Gaussian noise
    noise = np.random.normal(0, 0.3, clean_signal.shape)
    noisy_signal = clean_signal + noise

    # Clip any negative values (post-ReLU assumption)
    noisy_signal = np.maximum(noisy_signal, 0)

    # Convert to torch tensors
    clean_t = torch.tensor(clean_signal, dtype=torch.float32)
    noisy_t = torch.tensor(noisy_signal, dtype=torch.float32)

    # Apply pooling
    max_clean = F.max_pool2d(clean_t, 2, 2)
    max_noisy = F.max_pool2d(noisy_t, 2, 2)
    avg_clean = F.avg_pool2d(clean_t, 2, 2)
    avg_noisy = F.avg_pool2d(noisy_t, 2, 2)

    # Measure distortion from clean reference
    max_dist = torch.mean((max_noisy - max_clean)**2).item()
    avg_dist = torch.mean((avg_noisy - avg_clean)**2).item()

    print("Noise-induced distortion (MSE from clean pooled output):")
    print(f"  Max pooling:     {max_dist:.6f}")
    print(f"  Average pooling: {avg_dist:.6f}")
    print(f"  Ratio (max/avg): {max_dist/avg_dist:.2f}x")

    # Result: Average pooling shows lower distortion because noise is averaged
    # rather than selected when it happens to be the maximum

demonstrate_noise_reduction()
```

Trade-off: Detail Loss
The same smoothing that reduces noise also removes fine spatial detail: sharp edges are blurred, small high-contrast features are diluted by their neighborhood, and the precise locations of activation peaks are lost.
This is why average pooling alone is rarely used for tasks requiring spatial precision. However, for classification where global feature presence matters more than precise localization, this smoothing can be advantageous.
Global Average Pooling (GAP) extends average pooling to the entire spatial extent of each feature map, computing a single average value per channel. This simple operation, introduced in the Network in Network (NIN) paper by Lin et al. (2014), has become a cornerstone of modern CNN architectures.
Mathematical Definition:
For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, GAP produces:
$$Y_c = \frac{1}{H \cdot W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{c,i,j}$$
The output is a vector $Y \in \mathbb{R}^{C}$—one value per channel, independent of spatial dimensions.
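In code, GAP is simply a mean over the two spatial dimensions and matches AdaptiveAvgPool2d(1); a small sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 512, 7, 7)   # (batch, C, H, W), illustrative sizes

gap_manual = x.mean(dim=(2, 3))                     # (4, 512)
gap_module = nn.AdaptiveAvgPool2d(1)(x).flatten(1)  # (4, 512)

print(torch.allclose(gap_manual, gap_module, atol=1e-6))  # True
```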
Before GAP, CNNs used fully connected layers to convert spatial features to class predictions. This required flattening feature maps into vectors, introducing millions of parameters prone to overfitting. GAP eliminated these parameters entirely while enforcing a direct correspondence between feature maps and output categories.
Advantages of GAP:
- Removes the millions of parameters of flattened fully connected heads, sharply reducing overfitting.
- Accepts any input spatial size, since the average is taken over whatever spatial extent is present.
- Enforces a direct correspondence between feature maps and output categories.
- Acts as a structural regularizer with no extra hyperparameters to tune.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassicClassificationHead(nn.Module):
    """
    Classic approach: Flatten + Fully Connected layers.
    Requires fixed input size, many parameters.
    """
    def __init__(self, in_channels, spatial_size, num_classes):
        super().__init__()
        flat_size = in_channels * spatial_size * spatial_size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat_size, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes)
        )

    def forward(self, x):
        return self.fc(x)

class GAPClassificationHead(nn.Module):
    """
    Modern approach: Global Average Pooling + single linear layer.
    Works with any input size, far fewer parameters.
    """
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # Output: (batch, channels, 1, 1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):
        x = self.gap(x)             # (batch, channels, 1, 1)
        x = x.view(x.size(0), -1)   # (batch, channels)
        return self.fc(x)

# Parameter comparison
def compare_parameters():
    in_channels = 512
    spatial = 7
    num_classes = 1000

    classic = ClassicClassificationHead(in_channels, spatial, num_classes)
    gap_based = GAPClassificationHead(in_channels, num_classes)

    classic_params = sum(p.numel() for p in classic.parameters())
    gap_params = sum(p.numel() for p in gap_based.parameters())

    print("Parameter Comparison:")
    print(f"  Classic FC head: {classic_params:,} parameters")
    print(f"  GAP-based head:  {gap_params:,} parameters")
    print(f"  Reduction: {classic_params / gap_params:.1f}x fewer parameters")

compare_parameters()
# Output:
# Classic FC head: ~124,000,000 parameters
# GAP-based head:  ~513,000 parameters
# Reduction: ~241x fewer parameters
```

GAP in Modern Architectures:
Virtually all modern classification architectures use GAP:
| Architecture | Year | GAP Usage |
|---|---|---|
| VGGNet | 2014 | No (fully connected) |
| GoogLeNet | 2014 | Yes, replaced FC layers |
| ResNet | 2015 | Yes, before final classifier |
| DenseNet | 2017 | Yes, before final classifier |
| EfficientNet | 2019 | Yes, with squeeze-excite |
| ConvNeXt | 2022 | Yes, standard approach |
Adaptive Pooling:
Modern frameworks provide AdaptiveAvgPool2d(output_size) which adjusts pooling kernel sizes to produce a specified output dimension regardless of input size. For GAP, output_size=1 creates the single-value-per-channel output.
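A brief sketch of that size independence; the input resolutions below are arbitrary.

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)

for size in [(32, 64, 28, 28), (32, 64, 17, 23), (32, 64, 7, 7)]:
    x = torch.randn(*size)
    print(f"{tuple(x.shape)} -> {tuple(gap(x).shape)}")
# (32, 64, 28, 28) -> (32, 64, 1, 1)
# (32, 64, 17, 23) -> (32, 64, 1, 1)
# (32, 64, 7, 7)   -> (32, 64, 1, 1)
```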
GAP assumes that the average activation across the entire feature map is meaningful. For tasks where precise spatial information matters (detection, segmentation) or where features have highly variable sizes and positions, alternative approaches like spatial pyramid pooling or Region of Interest (RoI) pooling may be more appropriate.
Selecting between average and max pooling is an architectural decision with meaningful performance implications. Let's examine scenarios where average pooling excels.
Ideal Scenarios for Average Pooling:
- Noisy inputs, where averaging suppresses measurement noise (e.g., medical imaging, audio spectrograms).
- Features that are spatially diffuse or gradual rather than sparse and sharply localized.
- Final aggregation before classification, where GAP summarizes each feature map in a single value.
- Tasks where aggregate statistics over a region matter more than exact peak locations.
Domain-Specific Recommendations:
| Domain | Pooling Preference | Rationale |
|---|---|---|
| Natural images | Max pooling (early), GAP (final) | Sparse texture features, then global aggregation |
| Medical imaging | Average pooling | Noise reduction, gradual features |
| Audio/Speech | Often average pooling | Spectral smoothing, noise robustness |
| Satellite imagery | Depends on task | Detection→max, segmentation→average |
| Text (character-level) | Max pooling | Sparse character patterns |
| Scientific data | Average pooling | Statistical aggregation important |
Hybrid Approaches:
Some architectures combine both pooling types:
```python
import torch
import torch.nn as nn

class ConcatPooling(nn.Module):
    """
    Concatenate max and average pooling outputs.
    Used in fastai and other frameworks for classification.

    This captures both peak activations (max) and aggregate
    statistics (avg), giving the classifier richer information.
    """
    def __init__(self):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)  # Global Max Pooling

    def forward(self, x):
        avg_out = self.gap(x)  # (batch, C, 1, 1)
        max_out = self.gmp(x)  # (batch, C, 1, 1)
        return torch.cat([avg_out, max_out], dim=1)  # (batch, 2C, 1, 1)

class SpatialPyramidPooling(nn.Module):
    """
    Spatial Pyramid Pooling (SPP) combines pooling at multiple scales.
    Enables fixed-size output regardless of input dimensions.
    Each level uses average pooling at different granularities.
    """
    def __init__(self, levels=[1, 2, 4]):
        super().__init__()
        self.levels = levels

    def forward(self, x):
        batch_size, channels, H, W = x.shape
        outputs = []
        for level in self.levels:
            # Pool to level × level grid
            pool = nn.AdaptiveAvgPool2d(level)
            out = pool(x)                   # (batch, C, level, level)
            out = out.view(batch_size, -1)  # Flatten spatial dims
            outputs.append(out)
        # Concatenate all levels
        return torch.cat(outputs, dim=1)

class MixedPoolingBlock(nn.Module):
    """
    Inception-style mixed pooling: parallel max and avg pooling
    branches that are concatenated.
    """
    def __init__(self, in_channels):
        super().__init__()
        self.max_pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.avg_pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.conv_reduce = nn.Conv2d(in_channels * 2, in_channels, 1)

    def forward(self, x):
        max_out = self.max_pool(x)
        avg_out = self.avg_pool(x)
        concat = torch.cat([max_out, avg_out], dim=1)
        return self.conv_reduce(concat)

# Usage example
def test_pooling_variants():
    x = torch.randn(8, 256, 14, 14)

    concat_pool = ConcatPooling()
    spp = SpatialPyramidPooling([1, 2, 4])
    mixed = MixedPoolingBlock(256)

    print(f"Input shape: {x.shape}")
    print(f"ConcatPooling output: {concat_pool(x).shape}")  # (8, 512, 1, 1)
    print(f"SPP output: {spp(x).shape}")                    # (8, 256*(1+4+16)) = (8, 5376)
    print(f"MixedPooling output: {mixed(x).shape}")         # (8, 256, 14, 14)

test_pooling_variants()
```

Empirical Comparisons:
In practice, the choice between max and average pooling for intermediate layers often has only a modest effect on final classification accuracy.
The more significant impact typically comes from broader architectural choices, such as where and how aggressively the network downsamples and whether the head uses GAP or fully connected layers.
Average pooling implementation is straightforward, but several nuances affect practical usage.
Framework APIs:
```python
import torch
import torch.nn as nn
import tensorflow as tf

# ===========================================
# PyTorch Average Pooling
# ===========================================

# Standard 2D average pooling
x = torch.randn(32, 64, 28, 28)  # NCHW format

# Functional API
out = torch.nn.functional.avg_pool2d(
    x,
    kernel_size=2,
    stride=2,
    padding=0,
    ceil_mode=False,
    count_include_pad=True,  # Include padding zeros in average
    divisor_override=None    # Custom divisor (not recommended)
)

# Module API
avg_pool = nn.AvgPool2d(
    kernel_size=2,
    stride=2,
    padding=0,
    ceil_mode=False,
    count_include_pad=True
)

# Adaptive pooling (variable input size → fixed output size)
adaptive_avg = nn.AdaptiveAvgPool2d(output_size=(7, 7))  # Any input → 7×7
global_avg = nn.AdaptiveAvgPool2d(output_size=1)         # GAP

print(f"Standard AvgPool2d output: {avg_pool(x).shape}")  # (32, 64, 14, 14)
print(f"Adaptive 7×7 output: {adaptive_avg(x).shape}")    # (32, 64, 7, 7)
print(f"Global average output: {global_avg(x).shape}")    # (32, 64, 1, 1)

# 1D average pooling (for sequences/temporal data)
x_1d = torch.randn(32, 64, 100)  # (batch, channels, length)
avg_pool_1d = nn.AvgPool1d(kernel_size=4, stride=2)
print(f"AvgPool1d output: {avg_pool_1d(x_1d).shape}")  # (32, 64, 49)

# 3D average pooling (for video/volumetric data)
x_3d = torch.randn(2, 16, 8, 28, 28)  # (batch, C, D, H, W)
avg_pool_3d = nn.AvgPool3d(kernel_size=(2, 2, 2), stride=2)
print(f"AvgPool3d output: {avg_pool_3d(x_3d).shape}")  # (2, 16, 4, 14, 14)

# ===========================================
# TensorFlow / Keras Average Pooling
# ===========================================

x_tf = tf.random.normal([32, 28, 28, 64])  # NHWC format (TF default)

# Functional API
out_tf = tf.nn.avg_pool(
    x_tf,
    ksize=[1, 2, 2, 1],
    strides=[1, 2, 2, 1],
    padding='VALID'
)
print(f"TF avg_pool output: {out_tf.shape}")  # (32, 14, 14, 64)

# Keras Layer API
keras_avg = tf.keras.layers.AveragePooling2D(
    pool_size=(2, 2),
    strides=(2, 2),
    padding='valid'
)

# Global average pooling in Keras
keras_gap = tf.keras.layers.GlobalAveragePooling2D()
gap_output = keras_gap(x_tf)
print(f"Keras GAP output: {gap_output.shape}")  # (32, 64)

# ===========================================
# count_include_pad behavior
# ===========================================

def demonstrate_count_include_pad():
    """
    When padding is used, count_include_pad determines whether padding
    zeros are included in the average calculation.
    """
    x = torch.tensor([[[[1., 2.],
                        [3., 4.]]]])  # 1×1×2×2

    # With padding=1, we get a 4×4 padded input
    # count_include_pad=True: divide by all positions including padding
    # count_include_pad=False: divide only by valid positions
    pool_include = nn.AvgPool2d(2, stride=1, padding=1, count_include_pad=True)
    pool_exclude = nn.AvgPool2d(2, stride=1, padding=1, count_include_pad=False)

    print("Input:")
    print(x.squeeze())
    print("\nWith count_include_pad=True (includes zeros in denominator):")
    print(pool_include(x).squeeze())
    print("\nWith count_include_pad=False (excludes zeros from denominator):")
    print(pool_exclude(x).squeeze())

demonstrate_count_include_pad()
```

When using padding with average pooling, the count_include_pad parameter significantly affects boundary behavior. Setting it to False (exclude padding from the average) often produces more intuitive results at boundaries, avoiding artificial suppression of edge activations.
Performance Considerations:
Average pooling is computationally efficient: each output value requires only $k^2$ additions and one multiplication, the operation parallelizes trivially across channels and spatial positions, and no indices need to be stored for the backward pass.
For the same configuration, average pooling typically runs at roughly the same speed as max pooling in the forward pass, uses less memory during training because no argmax indices are cached, and can fall back to optimized convolution kernels when a dedicated implementation is unavailable.
Numerical Stability:
For very large pooling regions (especially GAP on high-resolution inputs), numerical precision can matter: summing thousands of low-precision values can drift or overflow in float16, so reductions should be accumulated in float32 under mixed-precision training (framework kernels generally do this for you, but manual implementations should take care).
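A small sketch of the precision concern, assuming a deliberately naive sequential float16 accumulation (framework reduction kernels are generally more careful than this):

```python
import numpy as np

x = np.full(16384, 0.1, dtype=np.float16)   # e.g. a 128×128 feature map, flattened

# Naive sequential accumulation in float16: once the running sum grows large,
# each 0.1 addend rounds away and the sum stalls
acc = np.float16(0.0)
for v in x:
    acc = np.float16(acc + v)
naive_mean = acc / x.size

safe_mean = x.astype(np.float32).mean()     # accumulate in float32 instead

print(naive_mean)   # noticeably below 0.1
print(safe_mean)    # ≈ 0.1
```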
Let's consolidate the key differences between max and average pooling to provide a practical decision framework.
| Aspect | Max Pooling | Average Pooling |
|---|---|---|
| Mathematical operation | Maximum selection | Arithmetic mean |
| Gradient behavior | Winner-take-all (sparse) | Uniform distribution (dense) |
| Information preserved | Peak/strongest activation | Overall activation level |
| Noise handling | Can amplify noise peaks | Reduces noise by averaging |
| Texture sensitivity | High (detects sparse patterns) | Lower (smooths textures) |
| Translation invariance | Bounded local invariance | Bounded local invariance + smoothing |
| Biological analog | Complex cells in V1 | Gain normalization circuits |
| Memory (backward pass) | Stores argmax indices | No index storage needed |
| Compute (forward) | Max comparison operations | Sum and divide operations |
| Dominant use case | Intermediate layers for features | Final GAP for classification |
When in doubt, use max pooling for intermediate layers (between conv blocks) and global average pooling before the final classifier. This combination leverages max pooling's strong feature detection while benefiting from GAP's parameter efficiency and regularization.
Decision Flowchart:
- Is this the final aggregation before classification? → Use global average pooling (optionally concatenated with global max pooling).
- Are you working with noisy data or need noise robustness? → Prefer average pooling, which smooths noise away.
- Do features activate sparsely at specific locations? → Prefer max pooling to capture peak responses.
- Is this a dense prediction task (segmentation, detection)? → Minimize pooling or use strided convolutions to preserve spatial resolution.
Modern Trend:
Recent architectures increasingly replace intermediate pooling with strided convolutions, reserving pooling primarily for final global aggregation. This allows the network to learn optimal downsampling rather than using fixed operations.
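A brief sketch of the two downsampling options side by side; the channel count and sizes are illustrative.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)

fixed_downsample = nn.AvgPool2d(kernel_size=2, stride=2)                    # no parameters
learned_downsample = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # learned weights

print(fixed_downsample(x).shape)    # torch.Size([8, 64, 16, 16])
print(learned_downsample(x).shape)  # torch.Size([8, 64, 16, 16])
```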
Average pooling provides a complementary approach to spatial aggregation, with distinct properties that make it essential in modern CNN architectures.
You now understand average pooling's mathematical properties, gradient behavior, and practical applications. Combined with your knowledge of max pooling, you can make informed decisions about spatial aggregation in CNN architectures.
What's Next:
In the next page, we explore Global Pooling in greater depth—examining global max pooling, spatial pyramid pooling, and advanced aggregation strategies that bridge local features and global predictions.