Real-world images aren't single-valued signals—they're multi-channel data. An RGB image has three channels encoding color information. A feature map from a previous layer might have 256 channels encoding different learned features. How do convolutional layers process this multi-channel data?
The answer involves a dimensional expansion that's easy to overlook: a 'single' convolutional filter isn't a k×k matrix—it's a Cᵢₙ×k×k volume that spans all input channels. And a convolutional layer doesn't have one such filter—it has Cₒᵤₜ of them, one for each output channel.
Understanding multi-channel convolution reveals where a network's parameters and compute actually go, and why modern efficiency techniques target the channel dimension.
This page provides a comprehensive treatment of multi-channel processing—the heart of how CNNs actually compute. You will learn the exact mathematics of channel processing, parameter counts and computational costs, the role of 1×1 convolutions, depthwise separable convolutions and their efficiency gains, and modern channel-based attention mechanisms.
Let's establish the complete mathematical picture of how multi-channel convolution works.
Setup:
- Input X of shape [Cᵢₙ, H, W]
- Kernel tensor K of shape [Cₒᵤₜ, Cᵢₙ, k, k]
- Bias vector b with one entry per output channel
- Output Y of shape [Cₒᵤₜ, H', W']
Computation:
For each output channel cₒᵤₜ ∈ {0, 1, ..., Cₒᵤₜ-1} and each spatial position (i, j):
$$Y[c_{out}, i, j] = \sum_{c_{in}=0}^{C_{in}-1} \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X[c_{in}, i+m, j+n] \cdot K[c_{out}, c_{in}, m, n] + b[c_{out}]$$
Key Insight:
The kernel K has four dimensions:
- Cₒᵤₜ: one filter per output channel
- Cᵢₙ: each filter spans every input channel
- k × k: the spatial extent of each filter
Each output channel is produced by its own Cᵢₙ × k × k filter, summing contributions from all input channels.
Visualization:
Kernel Tensor: K[C_out, C_in, k, k]
Filter 0 Filter 1 ... Filter C_out-1
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
Input Ch 0 ─────▶│ K[0,0,:,:] │ │ K[1,0,:,:] │ ... │K[C-1,0,:,:]│
Input Ch 1 ─────▶│ K[0,1,:,:] │ │ K[1,1,:,:] │ ... │K[C-1,1,:,:]│
... │ ... │ │ ... │ │ ... │
Input Ch C_in-1 ─▶│K[0,C_in-1,:,:]│ │K[1,C_in-1,:,:]│ ... │K[C-1,C_in-1,:,:]│
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
Output Ch 0 Output Ch 1 ... Output Ch C_out-1
Don't think of a filter as a k×k matrix. Think of it as a Cᵢₙ × k × k volume. When we say 'ResNet has 64 3×3 filters' on RGB input, we actually mean 64 filters each of shape 3×3×3 = 27 values. The '3×3' describes the spatial extent; each filter is actually 3D.
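To make the indexing concrete, here is a small sketch (the names X, K, and b and the loop structure mirror the formula above; it is for illustration, not an efficient implementation) that evaluates the triple sum with explicit loops and checks it against PyTorch's built-in convolution:

```python
import torch
import torch.nn.functional as F

C_in, C_out, k, H, W = 3, 4, 3, 6, 6
X = torch.randn(C_in, H, W)
K = torch.randn(C_out, C_in, k, k)
b = torch.randn(C_out)

# Naive evaluation of Y[c_out, i, j] (no padding, stride 1)
H_out, W_out = H - k + 1, W - k + 1
Y = torch.zeros(C_out, H_out, W_out)
for c_out in range(C_out):
    for i in range(H_out):
        for j in range(W_out):
            # Sum over all input channels and all k×k spatial offsets
            Y[c_out, i, j] = (X[:, i:i+k, j:j+k] * K[c_out]).sum() + b[c_out]

# PyTorch's conv2d computes exactly this cross-correlation
Y_ref = F.conv2d(X.unsqueeze(0), K, bias=b).squeeze(0)
print(torch.allclose(Y, Y_ref, atol=1e-5))  # True
```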
Parameter Count:
Total trainable parameters in a conv layer:
$$\text{Parameters} = C_{out} \times C_{in} \times k^2 + C_{out}$$
where the +Cₒᵤₜ accounts for biases.
Examples:
| Input Channels | Output Channels | Kernel | Parameters |
|---|---|---|---|
| 3 (RGB) | 64 | 3×3 | 1,792 |
| 64 | 64 | 3×3 | 36,928 |
| 256 | 256 | 3×3 | 590,080 |
| 512 | 512 | 3×3 | 2,359,808 |
| 1024 | 1024 | 3×3 | 9,437,184 |
Critical Observation:
Parameters scale quadratically with channel count (Cᵢₙ × Cₒᵤₜ). This explains why deep networks with many channels have millions of parameters, and why efficiency techniques focus on reducing this channel-mixing cost.
```python
import torch
import torch.nn as nn

def analyze_conv_parameters(in_channels, out_channels, kernel_size):
    """
    Analyze parameter count and computation for a conv layer.
    """
    conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                     padding=kernel_size // 2)

    # Parameter count
    num_params = sum(p.numel() for p in conv.parameters())

    # Weight shape
    weight_shape = conv.weight.shape  # [C_out, C_in, k, k]

    # Breakdown
    print(f"Conv2d({in_channels} → {out_channels}, {kernel_size}×{kernel_size})")
    print(f"  Weight shape: {list(weight_shape)}")
    print(f"  Weight params: {conv.weight.numel():,}")
    if conv.bias is not None:
        print(f"  Bias params: {conv.bias.numel():,}")
    print(f"  Total params: {num_params:,}")

    # FLOPs for a 224×224 input
    H, W = 224, 224
    H_out = H - kernel_size + 1 + 2 * (kernel_size // 2)  # with padding
    W_out = W

    # Each output = sum over C_in * k * k multiplications + additions
    flops_per_output = 2 * in_channels * kernel_size * kernel_size
    total_flops = out_channels * H_out * W_out * flops_per_output

    print(f"  FLOPs (224×224 input): {total_flops:,}")
    print(f"  FLOPs (in billions): {total_flops / 1e9:.3f}B")
    print()

    return num_params, total_flops


# Analyze typical layers in a VGG-style network
print("Parameter and FLOP Analysis for Typical Conv Layers:")
print("=" * 60)
print()

layers = [
    (3, 64, 3),      # First layer (RGB → 64)
    (64, 64, 3),     # Early layer
    (64, 128, 3),    # After first pool
    (128, 256, 3),   # Mid network
    (256, 512, 3),   # Late network
    (512, 512, 3),   # Final conv layers
]

total_params = 0
total_flops = 0

for in_c, out_c, k in layers:
    p, f = analyze_conv_parameters(in_c, out_c, k)
    total_params += p
    total_flops += f

print("=" * 60)
print(f"Total Parameters: {total_params:,}")
print(f"Total FLOPs: {total_flops / 1e9:.3f}B")
```

The computational cost of convolution is dominated by the channel interaction. Let's analyze where the compute goes.
FLOP Count:
For a conv layer processing input [Cᵢₙ, H, W] → output [Cₒᵤₜ, H', W']:
$$\text{FLOPs} = 2 \cdot H' \cdot W' \cdot C_{out} \cdot C_{in} \cdot k^2$$
Breaking this down:
- The factor of 2 counts one multiply and one add per term
- H' · W' is the number of output positions
- Cₒᵤₜ output channels are computed at each position
- Each output value sums over Cᵢₙ · k² input values
Where Does Compute Go?
For a 64→64 3×3 conv on 224×224 input:
$$\text{FLOPs} = 2 \times 224 \times 224 \times 64 \times 64 \times 9 = 3.7\text{B}$$
This single layer requires 3.7 billion operations! The Cᵢₙ × Cₒᵤₜ factor (64 × 64 = 4,096 channel pairs) dominates the cost.
As networks get deeper and channels increase (256 → 512 → 1024), the quadratic cost Cᵢₙ × Cₒᵤₜ becomes prohibitive. A 1024→1024 3×3 conv on a tiny 7×7 feature map still requires roughly 900M FLOPs. This motivated efficient convolution designs.
Memory Bandwidth Cost:
Beyond FLOPs, multi-channel convolutions are memory-bound:
- Weights: must load Cₒᵤₜ × Cᵢₙ × k² values
- Input activations: must load Cᵢₙ × H × W values
- Output activations: must write Cₒᵤₜ × H' × W' values
For a 512→512 3×3 conv, the weights alone are 512 × 512 × 9 ≈ 2.4M values (about 9.4 MB in FP32), before any activations are read or written.
Moving data between GPU memory and compute units often takes longer than the actual computation.
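For a rough sense of scale, here is a back-of-the-envelope sketch (assuming FP32 storage, i.e. 4 bytes per value, and a 28×28 feature map chosen purely for illustration):

```python
def conv_memory_traffic(c_in, c_out, k, h, w, bytes_per_value=4):
    """Rough estimate of data moved for one conv layer (FP32, 'same' spatial size)."""
    weights = c_out * c_in * k * k * bytes_per_value
    input_act = c_in * h * w * bytes_per_value
    output_act = c_out * h * w * bytes_per_value
    return weights, input_act, output_act

w_b, in_b, out_b = conv_memory_traffic(512, 512, 3, 28, 28)
print(f"Weights: {w_b / 1e6:.1f} MB")   # ~9.4 MB
print(f"Input:   {in_b / 1e6:.1f} MB")  # ~1.6 MB
print(f"Output:  {out_b / 1e6:.1f} MB") # ~1.6 MB
```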
Strategies to Reduce Cost:
The table below shows how cost scales as channels widen; the rest of this page covers the main remedies: 1×1 bottlenecks, grouped convolutions, depthwise separable convolutions, and channel attention.
| Configuration | Parameters | FLOPs (112×112) | Relative Cost |
|---|---|---|---|
| 64→64, 3×3 | 36K | 920M | 1.0× |
| 64→128, 3×3 | 74K | 1.8B | 2.0× |
| 128→256, 3×3 | 295K | 7.4B | 8.0× |
| 256→512, 3×3 | 1.2M | 29B | 32× |
| 512→512, 3×3 | 2.4M | 59B | 64× |
A 1×1 convolution might seem pointless—it has no spatial extent. But in the multi-channel setting, 1×1 convolutions are remarkably powerful and efficient.
What 1×1 Convolutions Do:
With k=1, the spatial summation collapses:
$$Y[c_{out}, i, j] = \sum_{c_{in}} X[c_{in}, i, j] \cdot K[c_{out}, c_{in}] + b[c_{out}]$$
This is equivalent to applying a fully-connected layer at each spatial position, with weights shared across positions.
Equivalently, a 1×1 convolution multiplies the length-Cᵢₙ channel vector at each pixel by a single Cₒᵤₜ × Cᵢₙ weight matrix.
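As a quick check of this equivalence, the sketch below (with arbitrary sizes) copies a 1×1 convolution's weights into an `nn.Linear` and applies it at every pixel; the outputs match:

```python
import torch
import torch.nn as nn

C_in, C_out, H, W = 8, 16, 5, 5
x = torch.randn(1, C_in, H, W)

conv1x1 = nn.Conv2d(C_in, C_out, kernel_size=1)

# Linear layer sharing the same weights as the 1×1 conv
fc = nn.Linear(C_in, C_out)
with torch.no_grad():
    fc.weight.copy_(conv1x1.weight.view(C_out, C_in))
    fc.bias.copy_(conv1x1.bias)

# Apply the shared linear layer at every spatial position
y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # [B,C,H,W] → [B,H,W,C] → back
y_conv = conv1x1(x)
print(torch.allclose(y_fc, y_conv, atol=1e-6))  # True
```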
Why 1×1 Convolutions Are Useful:
1. Channel Dimension Reduction (Bottleneck)
Reduce Cᵢₙ to smaller Cₘᵢd before expensive 3×3 conv:
256 → [1×1, 64] → [3×3, 64] → [1×1, 256]
Parameters: 256×64 + 64×64×9 + 64×256 = 16K + 37K + 16K = 69K
Vs full 256→256 3×3: 590K (8.5× reduction!)
2. Channel Dimension Expansion
Increase channels cheaply after spatial operations.
3. Cross-Channel Interaction
Mix information between channels without spatial computation.
The 'Network in Network' paper (Lin et al., 2013) introduced 1×1 convolutions with a key insight: they act as a mini-network applied at each spatial location. This local network can learn complex channel combinations, increasing model expressivity without the cost of larger spatial kernels.
1×1 vs 3×3 Cost Comparison:
For processing [256, H, W] → [256, H, W]:
| Kernel | Parameters | FLOPs (H=W=56) |
|---|---|---|
| 3×3 | 590,080 | 3.7B |
| 1×1 | 65,792 | 411M |
| Ratio | 9× fewer | 9× fewer |
The 9× savings (k² = 3² = 9) makes 1×1 convolutions extremely efficient for channel processing.
Role in Modern Architectures:
ResNet Bottleneck Block:
Input (256)
↓
1×1 conv → 64 (reduce channels)
↓
3×3 conv → 64 (spatial processing)
↓
1×1 conv → 256 (restore channels)
↓
+ skip → Output (256)
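A minimal sketch of this pattern (the class name and widths are illustrative; BatchNorm, stride handling, and the exact activation placement of the real ResNet block are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Simplified ResNet-style bottleneck: 1×1 reduce → 3×3 → 1×1 restore, plus skip."""
    def __init__(self, channels=256, mid_channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid_channels, 1)                  # 1×1: shrink channels
        self.spatial = nn.Conv2d(mid_channels, mid_channels, 3, padding=1)  # 3×3: spatial mixing
        self.restore = nn.Conv2d(mid_channels, channels, 1)                 # 1×1: restore channels

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.spatial(out))
        out = self.restore(out)
        return F.relu(out + x)  # residual (skip) connection

x = torch.randn(1, 256, 56, 56)
print(Bottleneck()(x).shape)  # torch.Size([1, 256, 56, 56])
```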
Inception Module:
                Input
   ┌───────────┬───────────┬───────────┐
   ↓           ↓           ↓           ↓
  1×1         1×1         1×1       MaxPool
   │           ↓           ↓           ↓
   │          3×3         5×5         1×1
   │           │           │           │
   └───────────┴───────────┴───────────┘
                     ↓
                  Concat
1×1 convs before 3×3 and 5×5 reduce computational cost by 3-5×.
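A rough sketch of such a module; the class name and branch widths below are illustrative rather than a specific GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Illustrative Inception-style module: parallel branches concatenated on channels."""
    def __init__(self, in_ch, b1, b3_reduce, b3, b5_reduce, b5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, b1, 1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, b3_reduce, 1),           # 1×1 reduce
            nn.Conv2d(b3_reduce, b3, 3, padding=1))   # 3×3
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, b5_reduce, 1),           # 1×1 reduce
            nn.Conv2d(b5_reduce, b5, 5, padding=2))   # 5×5
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1))           # 1×1 after pooling

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192, 64, 96, 128, 16, 32, 32)(x).shape)  # [1, 256, 28, 28]
```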
```python
import torch
import torch.nn as nn

def compare_conv_costs(in_channels, out_channels, spatial_size):
    """
    Compare 1×1 vs 3×3 convolution costs.
    """
    H, W = spatial_size, spatial_size

    conv_1x1 = nn.Conv2d(in_channels, out_channels, 1)
    conv_3x3 = nn.Conv2d(in_channels, out_channels, 3, padding=1)

    params_1x1 = sum(p.numel() for p in conv_1x1.parameters())
    params_3x3 = sum(p.numel() for p in conv_3x3.parameters())

    # FLOPs: 2 * H * W * C_out * C_in * k^2
    flops_1x1 = 2 * H * W * out_channels * in_channels * 1
    flops_3x3 = 2 * H * W * out_channels * in_channels * 9

    print(f"\nConv {in_channels}→{out_channels} on {H}×{W}:")
    print(f"{'':4}{'1×1 Conv':>15}{'3×3 Conv':>15}{'Ratio':>10}")
    print(f"  Parameters: {params_1x1:>13,} {params_3x3:>14,} {params_3x3/params_1x1:>9.1f}×")
    print(f"  FLOPs: {flops_1x1:>17,} {flops_3x3:>14,} {flops_3x3/flops_1x1:>9.1f}×")


# Compare at different scales
print("1×1 vs 3×3 Convolution Cost Comparison")
print("=" * 60)

compare_conv_costs(256, 256, 56)
compare_conv_costs(512, 512, 28)
compare_conv_costs(1024, 1024, 14)

# Show bottleneck savings
print("\n" + "=" * 60)
print("Bottleneck Block Analysis (256→256):")

# Direct path
direct_params = 256 * 256 * 9 + 256  # 3×3 conv
direct_flops = 2 * 56 * 56 * 256 * 256 * 9

# Bottleneck path: 1×1 (256→64) → 3×3 (64→64) → 1×1 (64→256)
bottleneck_mid = 64
bn_params = (256 * 64 + 64) + (64 * 64 * 9 + 64) + (64 * 256 + 256)
bn_flops = (2 * 56 * 56 * 64 * 256 * 1) + (2 * 56 * 56 * 64 * 64 * 9) + (2 * 56 * 56 * 256 * 64 * 1)

print(f"  Direct 3×3:  {direct_params:>10,} params, {direct_flops/1e9:.2f}B FLOPs")
print(f"  Bottleneck:  {bn_params:>10,} params, {bn_flops/1e9:.2f}B FLOPs")
print(f"  Savings: {direct_params/bn_params:.1f}× params, {direct_flops/bn_flops:.1f}× FLOPs")
```

Grouped convolutions split the channel dimension into independent groups, reducing the channel mixing cost.
Definition:
With G groups, the Cᵢₙ input channels are split into G groups of Cᵢₙ/G channels each; each group is convolved independently to produce Cₒᵤₜ/G output channels, and there is no mixing between groups.
Parameter Reduction:
$$\text{Parameters}_{\text{grouped}} = G \times \frac{C_{out}}{G} \times \frac{C_{in}}{G} \times k^2 = \frac{C_{out} \times C_{in} \times k^2}{G}$$
Grouped conv with G groups uses 1/G the parameters of a standard conv.
Visualization:
Standard Conv (G=1):             Grouped Conv (G=2):

All C_in → All C_out             C_in/2 → C_out/2     C_in/2 → C_out/2
┌─────────────────┐              ┌─────────────┐      ┌─────────────┐
│  C_in × C_out   │              │  ¼ of the   │      │  ¼ of the   │
│    full mix     │              │ connections │      │ connections │
└─────────────────┘              └─────────────┘      └─────────────┘
                                     Group 0              Group 1

No inter-group communication!
Grouped convolutions were originally used in AlexNet (2012) to split computation across two GPUs, as each GPU had insufficient memory for the full model. This engineering constraint accidentally created a useful architecture pattern that improves regularization and efficiency.
ResNeXt: Aggregated Residual Transformations:
ResNeXt (2017) embraced grouped convolutions as a design principle:
ResNet Bottleneck:              ResNeXt Bottleneck (C=32, groups=32):

Input (256)                     Input (256)
    ↓                               ↓
1×1 → 64                        1×1 → 128 (2× wider)
    ↓                               ↓
3×3 → 64                        3×3 → 128, groups=32 (32 groups of 4)
    ↓                               ↓
1×1 → 256                       1×1 → 256
    ↓                               ↓
+ skip                          + skip
ResNeXt uses a 'cardinality' of 32 groups with wider layers, achieving better accuracy with similar compute.
Extreme Case: Depthwise Convolution (G = Cᵢₙ)
When G = Cᵢₙ (one group per input channel), each input channel gets its own k×k spatial filter and no channel mixing happens at all; this is the depthwise convolution used in the next section. The table below shows how parameters shrink as G grows for a 256→256 3×3 conv:
| Groups (G) | Params per Group | Total Params | % of Full |
|---|---|---|---|
| 1 (Standard) | 256×256×9 | 590,080 | 100% |
| 2 | 128×128×9 | 295,040 | 50% |
| 4 | 64×64×9 | 147,520 | 25% |
| 32 | 8×8×9 | 18,432 | 3.1% |
| 256 (Depthwise) | 1×1×9 | 2,304 | 0.4% |
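These counts can be reproduced directly with `nn.Conv2d`'s `groups` argument; the sketch below counts weights only (bias disabled), so the numbers differ slightly from table rows that include biases:

```python
import torch.nn as nn

for groups in [1, 2, 4, 32, 256]:
    conv = nn.Conv2d(256, 256, 3, padding=1, groups=groups, bias=False)
    n = conv.weight.numel()
    print(f"groups={groups:>3}: {n:>8,} params ({100 * n / (256 * 256 * 9):.1f}% of standard)")
```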
Depthwise separable convolutions factor a standard convolution into two operations: depthwise (spatial) and pointwise (channel), dramatically reducing computational cost.
Standard Convolution:
Processes spatial and channel dimensions simultaneously: $$Y[c_{out}, i, j] = \sum_{c_{in}} \sum_{m,n} X[c_{in}, i+m, j+n] \cdot K[c_{out}, c_{in}, m, n]$$
Parameters: Cₒᵤₜ × Cᵢₙ × k²
FLOPs: 2 × H × W × Cₒᵤₜ × Cᵢₙ × k²
Depthwise Separable Convolution:
Step 1: Depthwise Convolution
Apply spatial k×k filter to each input channel independently: $$D[c, i, j] = \sum_{m,n} X[c, i+m, j+n] \cdot K_{dw}[c, m, n]$$
No cross-channel mixing; the output has the same number of channels as the input.
Parameters: Cᵢₙ × k²
FLOPs: 2 × H × W × Cᵢₙ × k²
Step 2: Pointwise Convolution (1×1)
Mix channels without spatial processing: $$Y[c_{out}, i, j] = \sum_{c} D[c, i, j] \cdot K_{pw}[c_{out}, c]$$
Parameters: Cₒᵤₜ × Cᵢₙ
FLOPs: 2 × H × W × Cₒᵤₜ × Cᵢₙ
The cost ratio of depthwise separable to standard convolution is (1/Cₒᵤₜ + 1/k²). For typical values (Cₒᵤₜ = 256, k = 3), this gives 1/256 + 1/9 ≈ 0.11 = 11% of the original cost—roughly a 9× speedup with minimal accuracy loss!
Mathematical Analysis:
Total depthwise separable cost (per output position; the common factor 2 · H · W cancels in the ratio below): $$\text{Cost}_{DS} = C_{in} \cdot k^2 + C_{out} \cdot C_{in}$$
Compared to standard convolution: $$\text{Cost}_{std} = C_{out} \cdot C_{in} \cdot k^2$$
Ratio: $$\frac{\text{Cost}_{DS}}{\text{Cost}_{std}} = \frac{C_{in} \cdot k^2 + C_{out} \cdot C_{in}}{C_{out} \cdot C_{in} \cdot k^2} = \frac{1}{C_{out}} + \frac{1}{k^2}$$
MobileNet Block:
Input (C_in)
↓
Depthwise 3×3 → C_in (spatial processing, no channel change)
↓
BatchNorm + ReLU6
↓
Pointwise 1×1 → C_out (channel processing)
↓
BatchNorm + ReLU6
↓
Output (C_out)
MobileNet V2 Inverted Residual:
Input (C_in)
↓
1×1 → t*C_in (expand channels by factor t)
↓
3×3 Depthwise → t*C_in (spatial processing)
↓
1×1 → C_out (project back)
↓
+ skip (if C_in == C_out)
↓
Output (C_out)
The 'inverted' design expands channels before depthwise, allowing richer feature extraction.
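A minimal sketch of this block (the class name, the default expansion factor of 6, and the BatchNorm-free structure are illustrative assumptions, not the exact MobileNet V2 implementation):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Simplified MobileNet V2-style block: expand → depthwise → project (+ skip)."""
    def __init__(self, c_in, c_out, expansion=6):
        super().__init__()
        hidden = c_in * expansion
        self.expand = nn.Conv2d(c_in, hidden, 1, bias=False)          # 1×1 expand
        self.depthwise = nn.Conv2d(hidden, hidden, 3, padding=1,
                                   groups=hidden, bias=False)          # 3×3 depthwise
        self.project = nn.Conv2d(hidden, c_out, 1, bias=False)         # 1×1 project
        self.act = nn.ReLU6(inplace=True)
        self.use_skip = (c_in == c_out)

    def forward(self, x):
        out = self.act(self.expand(x))
        out = self.act(self.depthwise(out))
        out = self.project(out)  # no activation after projection (linear bottleneck)
        return x + out if self.use_skip else out

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32, 32)(x).shape)  # torch.Size([1, 32, 56, 56])
```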
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """
    Depthwise separable convolution: depthwise + pointwise.
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # Depthwise: groups = in_channels (each channel processed independently)
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding,
            groups=in_channels, bias=False
        )
        self.bn1 = nn.BatchNorm2d(in_channels)
        # Pointwise: 1×1 conv for channel mixing
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = F.relu(self.bn1(self.depthwise(x)))
        x = F.relu(self.bn2(self.pointwise(x)))
        return x


def compare_efficiency(in_channels, out_channels, kernel_size, spatial_size):
    """
    Compare standard vs depthwise separable convolution.
    """
    H, W = spatial_size, spatial_size
    k = kernel_size

    # Standard convolution
    std_params = out_channels * in_channels * k * k + out_channels
    std_flops = 2 * H * W * out_channels * in_channels * k * k

    # Depthwise separable (excluding batchnorm)
    dw_params = in_channels * k * k                        # depthwise weights
    pw_params = out_channels * in_channels + out_channels  # pointwise + bias
    ds_params = dw_params + pw_params

    dw_flops = 2 * H * W * in_channels * k * k
    pw_flops = 2 * H * W * out_channels * in_channels
    ds_flops = dw_flops + pw_flops

    theoretical_ratio = 1 / out_channels + 1 / (k * k)

    print(f"\nComparing {in_channels}→{out_channels} {k}×{k} on {H}×{H}:")
    print(f"{'':4}{'Standard':>15}{'Depthwise Sep':>15}{'Ratio':>10}")
    print(f"  Params: {std_params:>14,} {ds_params:>14,} {ds_params/std_params:>9.2%}")
    print(f"  FLOPs:  {std_flops:>14,} {ds_flops:>14,} {ds_flops/std_flops:>9.2%}")
    print(f"  Theoretical ratio: {theoretical_ratio:.2%}")


print("Depthwise Separable Convolution Efficiency")
print("=" * 60)

compare_efficiency(64, 64, 3, 112)
compare_efficiency(128, 128, 3, 56)
compare_efficiency(256, 512, 3, 28)
compare_efficiency(512, 512, 3, 14)

# Verify with actual modules
print("\n" + "=" * 60)
print("Actual PyTorch Module Parameter Counts:")

std = nn.Conv2d(256, 512, 3, padding=1)
ds = DepthwiseSeparableConv(256, 512, 3, padding=1)

std_p = sum(p.numel() for p in std.parameters())
ds_p = sum(p.numel() for p in ds.parameters())

print(f"  Standard Conv2d:     {std_p:,}")
print(f"  Depthwise Separable: {ds_p:,}")
print(f"  Ratio: {ds_p/std_p:.2%}")
```

Channel attention mechanisms learn to dynamically reweight channels based on the input, emphasizing informative features and suppressing less useful ones.
Squeeze-and-Excitation (SE) Block:
The SE block (Hu et al., 2018) introduced channel attention:
Step 1: Squeeze (Global Information) Global average pooling compresses H×W to 1×1 per channel: $$z_c = \frac{1}{H \times W} \sum_i \sum_j X[c, i, j]$$
Step 2: Excitation (Channel Reweighting) Two FC layers learn channel interdependencies: $$s = \sigma(W_2 \cdot \text{ReLU}(W_1 \cdot z))$$
where W₁ ∈ ℝ^(C/r × C) and W₂ ∈ ℝ^(C × C/r) with reduction ratio r (typically 16).
Step 3: Scale Reweight each channel by learned importance: $$\tilde{X}[c, i, j] = s_c \cdot X[c, i, j]$$
This simple operation improves ResNet-50 by ~1% with <10% parameter overhead.
Not all features are equally important for every input. An image of a red car benefits from emphasizing 'red' and 'car' channels while suppressing 'blue sky' channels. SE blocks learn this input-dependent reweighting, allowing the network to focus computational resources on relevant channels.
Efficient Channel Attention (ECA-Net):
ECA (Wang et al., 2020) simplifies SE by using 1D convolution:
$$s = \sigma(\text{Conv1D}_{k}(z))$$
The kernel size k determines the locality of channel interactions: $$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\text{odd}}$$
This parameter-free sizing achieves similar accuracy to SE with fewer parameters.
CBAM: Channel + Spatial Attention:
CBAM (Woo et al., 2018) combines channel and spatial attention:
Channel Attention:
F → AvgPool/MaxPool → MLP → Sigmoid → Channel weights
Spatial Attention:
F → AvgPool/MaxPool along channels → Conv → Sigmoid → Spatial weights
Combined:
F' = (Channel Attn) × F
F'' = (Spatial Attn) × F'
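A compact sketch of the channel-attention half following the recipe above (the class name and reduction ratio are illustrative; the spatial half would pool along the channel axis and apply a small convolution instead):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: shared MLP over avg- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))    # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))     # max-pooled channel descriptor
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights

x = torch.randn(2, 64, 32, 32)
print(ChannelAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```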
Channel Shuffle (ShuffleNet):
Grouped convolutions don't mix information across groups. ShuffleNet addresses this with channel shuffle:
def channel_shuffle(x, groups):
B, C, H, W = x.shape
# Reshape to [B, groups, C//groups, H, W]
x = x.view(B, groups, C // groups, H, W)
# Transpose groups and channels within groups
x = x.transpose(1, 2).contiguous()
# Flatten back
return x.view(B, C, H, W)
This enables inter-group communication without the cost of 1×1 convolutions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """
    Squeeze-and-Excitation block for channel attention.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Squeeze: [B, C, H, W] → [B, C, 1, 1] → [B, C]
        squeeze = self.squeeze(x).view(b, c)
        # Excitation: [B, C] → [B, C]
        excitation = self.excitation(squeeze).view(b, c, 1, 1)
        # Scale
        return x * excitation


class ECABlock(nn.Module):
    """
    Efficient Channel Attention using 1D convolution.
    """
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size based on channels
        t = int(abs((torch.log2(torch.tensor(channels, dtype=torch.float)) + b) / gamma))
        k = max(t if t % 2 else t + 1, 3)  # Ensure odd kernel size >= 3
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        # Global average pooling
        y = self.avg_pool(x).view(b, 1, c)
        # 1D conv along channel dimension
        y = self.conv(y).view(b, c, 1, 1)
        # Scale
        return x * self.sigmoid(y)


def compare_attention_overhead(channels):
    """
    Compare parameter overhead of attention mechanisms.
    """
    se = SEBlock(channels, reduction=16)
    eca = ECABlock(channels)

    se_params = sum(p.numel() for p in se.parameters())
    eca_params = sum(p.numel() for p in eca.parameters())

    # Base conv for comparison
    base_conv = 3 * 3 * channels * channels

    print(f"Channel Attention Overhead ({channels} channels):")
    print(f"  SE Block params:  {se_params:,} ({100*se_params/base_conv:.2f}% of 3×3 conv)")
    print(f"  ECA Block params: {eca_params:,} ({100*eca_params/base_conv:.4f}% of 3×3 conv)")


# Compare at different channel counts
for c in [64, 256, 512, 1024]:
    compare_attention_overhead(c)
    print()
```

Modern architectures have converged on sophisticated channel design patterns that balance expressivity, efficiency, and trainability.
EfficientNet: Compound Scaling
EfficientNet scales depth (d), width (w), and resolution (r) together: $$d = \alpha^\phi, \quad w = \beta^\phi, \quad r = \gamma^\phi$$
with constraint $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ (doubling compute).
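As a numeric illustration, the commonly quoted base coefficients from the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15; these values are not stated on this page and are used here as an assumption) can be plugged in directly:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth, width, resolution coefficients (EfficientNet paper)

# Check the compute constraint: alpha * beta^2 * gamma^2 ≈ 2
print(f"alpha * beta^2 * gamma^2 = {alpha * beta**2 * gamma**2:.2f}")  # ≈ 1.92

for phi in range(4):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: depth ×{d:.2f}, width ×{w:.2f}, resolution ×{r:.2f}")
```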
The base model is built from MBConv blocks: MobileNet V2-style inverted residuals with 3×3 or 5×5 depthwise convolutions, 1×1 expand/project layers, and SE channel attention.
ConvNeXt: Pure Convolutions Revisited
ConvNeXt modernizes ResNet with insights from Transformers:
ConvNeXt Block:
Depthwise 7×7 conv (spatial mixing with large kernel)
↓
LayerNorm
↓
1×1 conv (expand 4×) (channel mixing, expand)
↓
GELU
↓
1×1 conv (contract) (channel mixing, contract)
↓
+ skip connection
Key insights:
- A large 7×7 depthwise kernel handles spatial mixing
- An inverted bottleneck expands channels 4× between the two 1×1 (pointwise) layers
- LayerNorm replaces BatchNorm and GELU replaces ReLU
- Only one normalization and one activation per block
Vision Transformers process 16×16 patches with global attention. To match this receptive field without attention, ConvNeXt uses 7×7 depthwise convolutions. This larger spatial extent per layer means fewer layers are needed for the same effective receptive field, changing the optimal depth-width tradeoff.
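A minimal sketch of the block diagrammed above (layer scale and stochastic depth from the official implementation are omitted; the 1×1 convolutions are written as Linear layers in channels-last layout, which is equivalent):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt block: 7×7 depthwise → LayerNorm → 1×1 expand → GELU → 1×1 contract."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)  # depthwise 7×7
        self.norm = nn.LayerNorm(dim)                                # over channels (channels-last)
        self.pwconv1 = nn.Linear(dim, expansion * dim)               # 1×1 expand, as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)               # 1×1 contract

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # [B, C, H, W] → [B, H, W, C]
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to [B, C, H, W]
        return x + residual

x = torch.randn(1, 96, 56, 56)
print(ConvNeXtBlock(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```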
RepVGG: Structural Re-parameterization
RepVGG trains with multi-branch blocks (3×3 + 1×1 + identity) but fuses them into single 3×3 convs for inference:
Training:
Input
/ | \
3×3 1×1 Id
\ | /
+ (add)
Inference:
Input → 3×3 → Output
The 1×1 conv is treated as a 3×3 with zeros in the border positions, enabling fusion.
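A small numerical sketch of this fusion (BatchNorm folding is omitted; the channel count and tolerance are arbitrary): the 1×1 kernel is zero-padded to 3×3, the identity branch becomes a 3×3 kernel with a single central 1 per channel, and the three kernels are summed into one conv:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 8
x = torch.randn(1, C, 16, 16)
conv3 = nn.Conv2d(C, C, 3, padding=1, bias=False)
conv1 = nn.Conv2d(C, C, 1, bias=False)

# Training-time multi-branch output (BatchNorm omitted for clarity)
y_branches = conv3(x) + conv1(x) + x  # 3×3 branch + 1×1 branch + identity branch

# Fuse into a single 3×3 kernel for inference
k1_as_3x3 = F.pad(conv1.weight, [1, 1, 1, 1])  # 1×1 kernel zero-padded to 3×3
k_identity = torch.zeros(C, C, 3, 3)
for i in range(C):
    k_identity[i, i, 1, 1] = 1.0               # identity = 3×3 kernel with a central 1

fused = nn.Conv2d(C, C, 3, padding=1, bias=False)
with torch.no_grad():
    fused.weight.copy_(conv3.weight + k1_as_3x3 + k_identity)

print(torch.allclose(y_branches, fused(x), atol=1e-5))  # True
```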
MixNet: Mixed Kernel Sizes
MixNet uses multiple kernel sizes in parallel within depthwise convolutions:
class MixedDepthwise(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Split channels for different kernel sizes (assumes channels is divisible by 3)
        self.conv3 = nn.Conv2d(channels//3, channels//3, 3, padding=1, groups=channels//3)
        self.conv5 = nn.Conv2d(channels//3, channels//3, 5, padding=2, groups=channels//3)
        self.conv7 = nn.Conv2d(channels//3, channels//3, 7, padding=3, groups=channels//3)

    def forward(self, x):
        x3, x5, x7 = x.chunk(3, dim=1)
        return torch.cat([self.conv3(x3), self.conv5(x5), self.conv7(x7)], dim=1)
This multi-scale spatial processing captures features at different scales within a single layer.
| Architecture | Spatial Op | Channel Mixing | Attention | Key Innovation |
|---|---|---|---|---|
| MobileNet V1 | 3×3 DW | 1×1 PW | None | Depthwise separable |
| MobileNet V2 | 3×3 DW | 1×1 expand+project | None | Inverted residual |
| EfficientNet | 3×3/5×5 DW | 1×1 + SE | Channel (SE) | Compound scaling |
| ConvNeXt | 7×7 DW | 1×1 expand×4 | None | Large kernel |
| ShuffleNet V2 | 3×3 DW | Channel shuffle | None | Channel split/shuffle |
| RepVGG | 3×3 | Within 3×3 | None | Structural re-param |
Multi-channel convolution is the computational engine of CNNs. Understanding how channels interact reveals both the power and the cost of convolutional architectures, and opens doors to efficiency improvements.
Module Complete:
With this page, you've completed Module 2: Convolutional Layers. You now understand the five foundational concepts: parameter sharing, equivariance, receptive fields, feature maps, and multi-channel processing.
These concepts form the foundation for understanding all CNN architectures, from simple classifiers to complex detection and segmentation networks.
Congratulations! You've mastered Convolutional Layers—the core building blocks of CNNs. You understand how parameter sharing enables efficiency, how equivariance enables generalization, how receptive fields grow through depth, how feature maps encode learned representations, and how multi-channel processing can be optimized. You're ready to explore pooling, downsampling, and complete CNN architectures!