Having studied LeNet, AlexNet, VGGNet, and Inception, we can now extract actionable design principles that guide the construction of effective convolutional neural networks. These aren't arbitrary rules—they're distilled from decades of empirical research and theoretical understanding.
CNN design is both art and science. The science provides constraints (computational budgets, gradient flow requirements). The art involves making tradeoffs within those constraints for specific applications. This page synthesizes both perspectives into a practical guide for designing your own architectures.
This page covers fundamental CNN design principles: spatial reduction strategies, channel expansion patterns, receptive field design, computational efficiency techniques, regularization approaches, and practical guidelines for architecture selection and customization.
Core Insight: Gradually reduce spatial dimensions while increasing channel depth to maintain representational capacity.
Every successful CNN follows the same pattern: spatial resolution shrinks stage by stage (via pooling or strided convolution) while the channel count grows.
This isn't arbitrary—it reflects the nature of visual processing. Early layers detect local features (edges, textures) across many spatial locations. Deeper layers combine these into complex, abstract features that need fewer locations but more channels to represent.
| Architecture | Reduction Strategy | Final Feature Map |
|---|---|---|
| LeNet-5 | Pool every 2 conv layers | 1×1×120 |
| AlexNet | Pool after conv 1, 2, 5 | 6×6×256 |
| VGG16 | Pool every 2-3 conv layers | 7×7×512 |
| GoogLeNet | Pool between Inception groups | 7×7×1024 |
| ResNet-50 | Stride-2 conv at block starts | 7×7×2048 |
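The halve-spatial/double-channels pattern in the table can be sketched as a quick shape calculator. A minimal sketch: the 56×56×64 stem output used below is a ResNet-style assumption, not something stated in the table.

```python
def stage_shapes(stem_hw, stem_ch, num_stages):
    """Track (H, W, C) through stages that halve spatial size and double channels."""
    h = w = stem_hw
    c = stem_ch
    shapes = []
    for _ in range(num_stages):
        h, w, c = h // 2, w // 2, c * 2
        shapes.append((h, w, c))
    return shapes

# ResNet-style assumption: a 224×224 input leaves the stem as 56×56×64
print(stage_shapes(56, 64, 3))
# [(28, 28, 128), (14, 14, 256), (7, 7, 512)] — ends at 7×7×512, as in ResNet-18/34
```

Note that the total activation volume shrinks each stage (H·W drops 4×, C only doubles), which is what keeps deep networks affordable.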
Core Insight: The final layer's receptive field should cover most of the input image to capture global context.
Receptive field (RF) is how much of the input each output neuron "sees." For 224×224 ImageNet inputs, the final layers need an RF that approaches the full image.
Calculating the receptive field:
For layer $l$ with kernel size $k$, stride $s$, and previous RF: $$RF_l = RF_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i$$
```python
def calc_receptive_field(layers):
    """
    Calculate the receptive field of a CNN.
    layers: list of (kernel_size, stride) tuples
    Returns: final receptive field size
    """
    rf = 1
    stride_product = 1
    for kernel, stride in layers:
        rf += (kernel - 1) * stride_product
        stride_product *= stride
    return rf

# VGG16 receptive field
vgg16_layers = [
    (3, 1), (3, 1), (2, 2),          # Block 1: 2 conv + pool
    (3, 1), (3, 1), (2, 2),          # Block 2
    (3, 1), (3, 1), (3, 1), (2, 2),  # Block 3
    (3, 1), (3, 1), (3, 1), (2, 2),  # Block 4
    (3, 1), (3, 1), (3, 1), (2, 2),  # Block 5
]
print(f"VGG16 RF: {calc_receptive_field(vgg16_layers)}×{calc_receptive_field(vgg16_layers)}")
# Output: VGG16 RF: 212×212 — nearly covers the 224×224 input!

# AlexNet receptive field
alexnet_layers = [
    (11, 4), (3, 2),                 # Conv1 + Pool
    (5, 1), (3, 2),                  # Conv2 + Pool
    (3, 1), (3, 1), (3, 1), (3, 2),  # Conv3-5 + Pool
]
print(f"AlexNet RF: {calc_receptive_field(alexnet_layers)}")
# Output: AlexNet RF: 195
```

Your final receptive field should be at least 70-80% of the input size. For 224×224 inputs, aim for RF ≥ 180. Smaller RFs can work for tasks where objects are centered and cropped, but hurt performance on varied compositions.
Core Insight: Prefer 3×3 convolutions stacked deeply over single large filters.
VGG proved this definitively: two stacked 3×3 convs cover the same 5×5 receptive field with fewer parameters ($18C^2$ vs $25C^2$ weights) and an extra nonlinearity between them.
Modern Refinement: Inception v3 showed that even a 3×3 can be factorized further, into a 1×3 followed by a 3×1, saving another third of the parameters ($6C^2$ vs $9C^2$).
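Both savings are easy to verify with a quick parameter count (weights only, biases omitted; C = 256 is an arbitrary example channel width chosen for illustration):

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a single conv layer (biases omitted)."""
    return k_h * k_w * c_in * c_out

C = 256  # example channel width (an assumption for illustration)

five_by_five = conv_params(5, 5, C, C)                          # one 5×5 conv
stacked_threes = 2 * conv_params(3, 3, C, C)                    # two stacked 3×3 convs
asymmetric = conv_params(1, 3, C, C) + conv_params(3, 1, C, C)  # 1×3 then 3×1

print(f"5×5:       {five_by_five:,}")    # 1,638,400
print(f"two 3×3:   {stacked_threes:,}")  # 1,179,648 — 28% fewer than one 5×5
print(f"1×3 + 3×1: {asymmetric:,}")      # 393,216   — 33% fewer than one 3×3
```

The same ratios hold for FLOPs, since each weight is applied once per output position.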
Core Insight: Use 1×1 convolutions to reduce dimensionality before expensive operations.
The bottleneck pattern appears everywhere in modern CNNs:
```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """
    Bottleneck design used in ResNet-50/101/152 and Inception.
    Reduces parameters and computation several-fold vs a naive 3×3.
    """
    def __init__(self, in_channels, bottleneck_channels, out_channels):
        super().__init__()
        # 1×1 compress
        self.conv1 = nn.Conv2d(in_channels, bottleneck_channels, 1)
        self.bn1 = nn.BatchNorm2d(bottleneck_channels)
        # 3×3 transform (on reduced channels)
        self.conv2 = nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(bottleneck_channels)
        # 1×1 expand
        self.conv3 = nn.Conv2d(bottleneck_channels, out_channels, 1)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return out

# Compare parameter counts
# Naive: 256 → 256 with 3×3
naive_params = 256 * 256 * 3 * 3  # 589,824

# Bottleneck: 256 → 64 → 64 → 256
bottleneck = BottleneckBlock(256, 64, 256)
bottleneck_params = sum(p.numel() for p in bottleneck.parameters())
# ≈ 70,000 (includes BN params)

print(f"Naive 3×3: {naive_params:,} params")
print(f"Bottleneck: {bottleneck_params:,} params")
print(f"Reduction: {naive_params/bottleneck_params:.1f}×")
```

A common ratio is 4:1 — if the main representation has 256 channels, the bottleneck reduces to 64. This cuts cost dramatically while preserving most representational capacity. ResNet uses this throughout its deeper variants (50/101/152).
Core Insight: Apply batch normalization after every convolution (before or after activation).
Batch normalization normalizes each channel's activations across a mini-batch: $$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \cdot \gamma + \beta$$
where $\mu_B$, $\sigma_B^2$ are batch statistics, and $\gamma$, $\beta$ are learnable scale/shift.
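The formula can be checked with a minimal plain-Python implementation for a single channel's mini-batch (a sketch that applies the batch statistics directly; a real `nn.BatchNorm2d` also tracks running statistics for inference):

```python
import math

def batchnorm(x, gamma, beta, eps=1e-5):
    """Apply the BN formula to one channel's mini-batch of activations."""
    mu = sum(x) / len(x)                          # batch mean μ_B
    var = sum((v - mu) ** 2 for v in x) / len(x)  # batch variance σ_B²
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]

out = batchnorm([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
# With γ=1, β=0 the output has (approximately) zero mean and unit variance
```

The learnable γ and β let the network undo the normalization if the identity transform is actually optimal for that channel.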
BN Placement Debate:
Original (Ioffe & Szegedy, 2015): Conv → BN → ReLU
Common practice: Conv → BN → ReLU (same as original)
Some evidence for: Conv → ReLU → BN (preserves ReLU's sparse gradients)
Empirically, differences are small. The original placement is most common.
Core Insight: Replace fully connected layers with global average pooling.
Instead of: flattening the final feature map into large fully connected layers (VGG-style, which holds most of the network's parameters).
Use: global average pooling to collapse each channel to a single value, followed by one linear layer.
```python
import torch.nn as nn

# Modern CNN classifier (ResNet/EfficientNet style)
class GAPClassifier(nn.Module):
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d((1, 1))  # Any input → 1×1
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):
        x = self.gap(x)             # B×C×H×W → B×C×1×1
        x = x.view(x.size(0), -1)   # B×C×1×1 → B×C
        return self.fc(x)           # B×C → B×num_classes

# VGG-style vs GAP-style
vgg_fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000
gap_params = 512 * 1000

print(f"VGG FC params: {vgg_fc_params:,}")            # 123,633,664 (≈ 123M)
print(f"GAP params: {gap_params:,}")                  # 512,000
print(f"Reduction: {vgg_fc_params/gap_params:.0f}×")  # 241×
```

Core Insight: Apply multiple complementary regularization techniques.
| Technique | Where Applied | Effect | Typical Value |
|---|---|---|---|
| Weight Decay (L2) | All conv/FC weights | Penalizes large weights | 1e-4 to 5e-4 |
| Dropout | FC layers (or after GAP) | Prevents co-adaptation | 0.2-0.5 |
| Batch Normalization | After every conv | Adds noise, smooths loss | — |
| Data Augmentation | Training inputs | Expands effective dataset | Task-specific |
| Label Smoothing | Loss function | Softens hard targets | 0.1 |
| Stochastic Depth | Skip random layers | Regularizes depth | Linear 0→0.5 |
For ImageNet-scale training: Weight decay + BN + Data augmentation + Label smoothing. Dropout is less essential when using BN but still useful after GAP. Modern augmentation (RandAugment, AutoAugment, Mixup, CutMix) provides strong regularization.
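As one concrete example from the table, label smoothing with ε = 0.1 can be sketched in a few lines (this uses the Inception-v3 convention $q' = (1-\epsilon)\,\delta + \epsilon/K$; in PyTorch the same effect is available via the `label_smoothing` argument of `nn.CrossEntropyLoss`):

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Soften a one-hot target: mix (1 - ε) of the hard label with ε of uniform."""
    u = eps / num_classes
    return [(1.0 - eps) + u if k == target else u for k in range(num_classes)]

print(smooth_labels(target=2, num_classes=5))
# true class gets 0.92, the other four get 0.02 each; the vector still sums to 1
```

The softened target stops the network from driving logit gaps toward infinity to chase probability 1.0, which improves calibration and generalization.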
Synthesizing all principles into actionable guidelines:
| Scenario | Recommended | Reasoning |
|---|---|---|
| Learning/prototyping | ResNet-18/34 | Simple, fast, well-understood |
| ImageNet-level accuracy | ResNet-50/101 | Strong baseline, many pretrained weights |
| Mobile/edge deployment | MobileNetV3, EfficientNet-B0 | Optimized for latency/params |
| Maximum accuracy | EfficientNet-B7, ConvNeXt | Best accuracy/compute tradeoffs |
| Transfer learning | ResNet-50, VGG16 | Widely available pretrained features |
| Object detection backbone | ResNet-50-FPN | Standard for COCO detection |
| Limited training data | Smaller models + pretrained | Reduce overfitting risk |
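For the transfer-learning rows above, the standard recipe is to freeze the pretrained backbone and retrain only a fresh head. A minimal sketch using a stand-in module — `TinyBackbone` is hypothetical; in practice you would load a pretrained model (e.g. a torchvision ResNet-50) and replace its final layer the same way:

```python
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained backbone with a ResNet-style `fc` head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Conv2d(3, 8, 3, padding=1)
        self.fc = nn.Linear(8, 1000)  # original 1000-class ImageNet head

model = TinyBackbone()
for p in model.parameters():
    p.requires_grad = False      # freeze the pretrained weights
model.fc = nn.Linear(8, 10)      # fresh head for a 10-class target task

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # 90 — only the new head (8×10 weights + 10 biases) trains
```

With more target data, a common follow-up is to unfreeze the last backbone stage and fine-tune it at a reduced learning rate.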
Since 2020, Vision Transformers (ViT) have emerged as alternatives to CNNs. For very large datasets/compute, ViT often wins. For typical applications, CNNs remain excellent choices with better data efficiency. ConvNeXt (2022) showed CNNs can match ViT with modern training techniques.
Module Complete:
You've now journeyed through the evolution of CNN architectures from LeNet's pioneering vision to the sophisticated design principles that guide modern networks. These foundations will serve you in understanding ResNet, EfficientNet, and other architectures covered in later modules.
You now have a comprehensive understanding of CNN architecture design. You can explain why specific architectural choices are made, implement standard components like bottleneck blocks and GAP classifiers, and select appropriate architectures for different applications. You're ready to study Residual Networks and modern architectures.