In 2014, researchers at the Visual Geometry Group (VGG) at Oxford asked a fundamental question: What happens if we make a CNN much deeper while keeping everything else extremely simple?
Their answer—VGGNet—demonstrated that stacking many small 3×3 convolutional filters could outperform larger, irregular filter designs. VGG16 and VGG19 achieved second place in ILSVRC 2014 classification (behind GoogLeNet) but introduced design principles that influenced all subsequent architectures.
VGGNet's key insight is elegantly simple: two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution, but with fewer parameters and more non-linearity. This principle—using depth to expand receptive fields rather than large filters—became foundational to modern CNN design.
This page covers VGGNet's architecture philosophy, the mathematical reasoning behind small filters, configuration variants (VGG11 through VGG19), computational considerations, and the lasting influence on CNN design. You'll understand why VGG's simplicity made it the go-to backbone for transfer learning for years.
VGGNet's design philosophy can be summarized in one sentence: use only 3×3 convolutions, stack them deeper, and see what happens.
This stands in stark contrast to AlexNet's heterogeneous filter sizes (11×11, 5×5, 3×3) and the complex Inception modules of GoogLeNet. VGG prioritized architectural regularity over novelty.
3×3 is the smallest filter size that captures spatial structure (up/down/left/right/center). It's the minimum size that defines a 'neighborhood' in 2D. Smaller (1×1) filters don't capture spatial patterns; larger filters can be composed from multiple 3×3 layers.
VGG's core insight is that stacked small filters achieve the same receptive field as large filters with significant advantages.
Receptive Field Calculation:
General formula for $n$ stacked 3×3 convolutions with stride 1: $$RF = 1 + 2n$$
Parameter Comparison:
For C input and C output channels:
| Configuration | Receptive Field | Parameters | Non-linearities |
|---|---|---|---|
| One 5×5 conv | 5×5 | 25C² | 1 |
| Two 3×3 convs | 5×5 | 2 × 9C² = 18C² | 2 |
| One 7×7 conv | 7×7 | 49C² | 1 |
| Three 3×3 convs | 7×7 | 3 × 9C² = 27C² | 3 |
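To make the table concrete, here is a quick sanity check in PyTorch: it counts the weights of one 5×5 convolution versus two stacked 3×3 convolutions. The choice C = 64 is purely illustrative, and biases are disabled so the counts match the 25C² and 18C² figures above.

```python
import torch.nn as nn

C = 64  # channels in and out (illustrative choice)

# One 5×5 conv vs. two stacked 3×3 convs, bias disabled so the
# counts match the 25C² and 18C² figures in the table.
conv5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
conv3x2 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

params_5 = sum(p.numel() for p in conv5.parameters())    # 25 * 64^2 = 102,400
params_3 = sum(p.numel() for p in conv3x2.parameters())  # 18 * 64^2 = 73,728
print(f"One 5×5:  {params_5:,}")
print(f"Two 3×3:  {params_3:,}")
print(f"Savings:  {1 - params_3 / params_5:.0%}")         # 28%
```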
The Advantages of Stacking:

1. Fewer parameters for the same receptive field (18C² vs. 25C² for the 5×5 case above).
2. More non-linearities (one ReLU after every conv), which makes the decision function more discriminative.
3. A single, uniform building block, which keeps the architecture simple and regular.

The snippet below verifies the receptive-field equivalence:
```python
def compute_receptive_field(kernel_sizes, strides):
    """
    Compute the receptive field of stacked convolutions.

    RF[l] = RF[l-1] + (kernel_size[l] - 1) * product(strides[:l])
    """
    rf = 1
    stride_product = 1
    for k, s in zip(kernel_sizes, strides):
        rf = rf + (k - 1) * stride_product
        stride_product *= s
    return rf

# VGG block: three 3×3 convs
vgg_block = compute_receptive_field([3, 3, 3], [1, 1, 1])
print(f"Three 3×3 convs: RF = {vgg_block}×{vgg_block}")  # 7×7

# Single 7×7 conv
single_large = compute_receptive_field([7], [1])
print(f"One 7×7 conv: RF = {single_large}×{single_large}")  # 7×7

# AlexNet first layer: 11×11, stride 4
alexnet_conv1 = compute_receptive_field([11], [4])
print(f"11×11 stride 4: RF = {alexnet_conv1}×{alexnet_conv1}")  # 11×11
```

The VGG paper systematically evaluated configurations from 11 to 19 weight layers, labeled A through E. The most commonly used are VGG16 (configuration D) and VGG19 (configuration E).
VGG Architecture Configurations (conv layers only):

```
┌───────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ Stage │ A (VGG11) │ B (VGG13) │ C (VGG16) │ D (VGG16) │ E (VGG19) │
├───────┼───────────┴───────────┴───────────┴───────────┴───────────┤
│ Input │                    224 × 224 RGB image                    │
├───────┼───────────┬───────────┬───────────┬───────────┬───────────┤
│ conv1 │ 1×conv64  │ 2×conv64  │ 2×conv64  │ 2×conv64  │ 2×conv64  │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv2 │ 1×conv128 │ 2×conv128 │ 2×conv128 │ 2×conv128 │ 2×conv128 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv3 │ 2×conv256 │ 2×conv256 │ 3×conv256 │ 3×conv256 │ 4×conv256 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv4 │ 2×conv512 │ 2×conv512 │ 3×conv512 │ 3×conv512 │ 4×conv512 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv5 │ 2×conv512 │ 2×conv512 │ 3×conv512 │ 3×conv512 │ 4×conv512 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┴───────────┴───────────┴───────────┴───────────┤
│ FC    │             4096 → 4096 → 1000 (same for all)             │
├───────┼───────────┬───────────┬───────────┬───────────┬───────────┤
│ Params│ 133M      │ 133M      │ 138M      │ 138M      │ 144M      │
│ Layers│ 11        │ 13        │ 16        │ 16        │ 19        │
└───────┴───────────┴───────────┴───────────┴───────────┴───────────┘
```

Note: All convolutions are 3×3 with stride 1, padding 1. All maxpools are 2×2 with stride 2. VGG16-C uses some 1×1 convs; VGG16-D uses all 3×3 filters and is the more common variant.

VGG16 has 138 million parameters, roughly 90% of which sit in the fully connected layers (FC6 alone: 7×7×512×4096 ≈ 102M). The conv layers contribute only ~15M parameters. This imbalance motivated later architectures (ResNet, Inception) to replace FC layers with global average pooling.
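That parameter split is easy to verify by hand. The sketch below simply multiplies out the layer shapes from the table for configuration D (biases ignored for simplicity):

```python
# Conv parameters for VGG16 (config D): 3×3 kernels, (in_channels, out_channels) per layer
conv_layers = [
    (3, 64), (64, 64),                    # conv1
    (64, 128), (128, 128),                # conv2
    (128, 256), (256, 256), (256, 256),   # conv3
    (256, 512), (512, 512), (512, 512),   # conv4
    (512, 512), (512, 512), (512, 512),   # conv5
]
conv_params = sum(3 * 3 * cin * cout for cin, cout in conv_layers)

# FC parameters: flattened 7×7×512 feature map -> 4096 -> 4096 -> 1000
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000

total = conv_params + fc_params
print(f"Conv params: {conv_params / 1e6:.1f}M")  # ~14.7M
print(f"FC params:   {fc_params / 1e6:.1f}M")    # ~123.6M
print(f"FC share:    {fc_params / total:.0%}")   # ~89%
```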
```python
import torch
import torch.nn as nn

# Configuration dictionary: 'M' = maxpool, numbers = conv output channels
cfgs = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

class VGG(nn.Module):
    def __init__(self, cfg_name='VGG16', num_classes=1000, use_bn=False):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfgs[cfg_name], use_bn)
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
        self._initialize_weights()

    def _make_layers(self, cfg, use_bn):
        layers = []
        in_channels = 3
        for v in cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                conv = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
                if use_bn:
                    layers.extend([conv, nn.BatchNorm2d(v), nn.ReLU(True)])
                else:
                    layers.extend([conv, nn.ReLU(True)])
                in_channels = v
        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Create models
vgg16 = VGG('VGG16')
vgg19_bn = VGG('VGG19', use_bn=True)

print(f"VGG16 params: {sum(p.numel() for p in vgg16.parameters()):,}")
```

Training VGG's deep architectures required careful initialization and multi-scale training.
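For the multi-scale (scale-jittering) part, the paper resizes the shorter image side to a random scale S in [256, 512] before taking 224×224 crops. A minimal torchvision sketch of such a pipeline might look like the following; the normalization constants are the standard torchvision ImageNet values rather than the paper's exact mean-RGB subtraction, and the details are a paraphrase, not the reference implementation:

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

class RandomScaleJitter:
    """Resize the shorter side to a random S in [s_min, s_max] (VGG-style scale jittering)."""
    def __init__(self, s_min=256, s_max=512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)
        return TF.resize(img, s)  # shorter side -> s, aspect ratio preserved

train_transform = transforms.Compose([
    RandomScaleJitter(256, 512),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```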
VGG16 requires ~15.5 billion FLOPs for a single forward pass, roughly 20× the cost of AlexNet. Its weights alone occupy over 500MB in FP32, most of them in the fully connected layers. VGG is powerful but impractical for edge deployment.
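That figure can be reproduced with simple arithmetic: each conv layer costs output_height × output_width × k² × C_in × C_out multiply-accumulates. Summing over VGG16's conv layers gives roughly 15.3 G MACs, and the FC layers add only ~0.12 G more:

```python
# Multiply-accumulate count for one conv layer: H_out * W_out * k*k * C_in * C_out
def conv_macs(h, w, cin, cout, k=3):
    return h * w * k * k * cin * cout

# (output spatial size, in_channels, out_channels) for each VGG16 conv layer
layers = [
    (224, 3, 64), (224, 64, 64),
    (112, 64, 128), (112, 128, 128),
    (56, 128, 256), (56, 256, 256), (56, 256, 256),
    (28, 256, 512), (28, 512, 512), (28, 512, 512),
    (14, 512, 512), (14, 512, 512), (14, 512, 512),
]
total_macs = sum(conv_macs(s, s, cin, cout) for s, cin, cout in layers)
print(f"VGG16 conv MACs: {total_macs / 1e9:.1f} G")  # ~15.3 G multiply-accumulates
```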
VGG's greatest legacy is as a feature extractor. The simple, uniform structure made it ideal for transfer learning.
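One common pattern (used, for example, in perceptual losses and image retrieval) is to tap an intermediate activation of the frozen conv stack with a forward hook. This is a small sketch; the layer index is illustrative and corresponds to the last ReLU of stage conv4 in torchvision's VGG16 layout.

```python
import torch
import torchvision.models as models

vgg16 = models.vgg16(pretrained=True).eval()

features = {}
def save_activation(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Index 22 is the ReLU after the third conv of stage conv4 in torchvision's VGG16
# (chosen here for illustration; pick whatever depth suits your task).
vgg16.features[22].register_forward_hook(save_activation('conv4_3'))

with torch.no_grad():
    _ = vgg16(torch.randn(1, 3, 224, 224))

print(features['conv4_3'].shape)  # torch.Size([1, 512, 28, 28])
```

For classification transfer, the more common recipe is to freeze the conv stack and retrain only the classifier, as in the example below.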
```python
import torch
import torchvision.models as models

# Load pretrained VGG16
vgg16 = models.vgg16(pretrained=True)

# Freeze the feature extractor
for param in vgg16.features.parameters():
    param.requires_grad = False

# Replace the final classifier layer for a new task (e.g., 10 classes)
vgg16.classifier[-1] = torch.nn.Linear(4096, 10)

# Only the classifier trains now
trainable = sum(p.numel() for p in vgg16.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # ~120M (the FC classifier still dominates)
```

VGG's limitations (parameter-heavy FC layers, high compute cost, and the difficulty of training much deeper plain stacks) motivated subsequent architectures.
Next: We'll explore Inception/GoogLeNet, which asked: what if we run multiple filter sizes in parallel? The Inception module's clever design achieves excellent accuracy with far fewer parameters than VGG.
You now understand VGGNet's design philosophy, the receptive field argument for small filters, and why VGG became the default choice for transfer learning. You can explain both its strengths (simplicity, good features) and limitations (computational cost, depth barriers).