In 2014, researchers at the Visual Geometry Group (VGG) at Oxford asked a fundamental question: What happens if we make a CNN much deeper while keeping everything else extremely simple?
Their answer—VGGNet—demonstrated that stacking many small 3×3 convolutional filters could outperform larger, irregular filter designs. VGG16 and VGG19 achieved second place in ILSVRC 2014 classification (behind GoogLeNet) but introduced design principles that influenced all subsequent architectures.
VGGNet's key insight is elegantly simple: two stacked 3×3 convolutions have the same receptive field as one 5×5 convolution, but with fewer parameters and more non-linearity. This principle—using depth to expand receptive fields rather than large filters—became foundational to modern CNN design.
This page covers VGGNet's architecture philosophy, the mathematical reasoning behind small filters, configuration variants (VGG11 through VGG19), computational considerations, and the lasting influence on CNN design. You'll understand why VGG's simplicity made it the go-to backbone for transfer learning for years.
VGGNet's design philosophy can be summarized in one sentence: use only 3×3 convolutions, stack them deeper, and see what happens.
This stands in stark contrast to AlexNet's heterogeneous filter sizes (11×11, 5×5, 3×3) and the complex Inception modules of GoogLeNet. VGG prioritized architectural regularity over novelty.
3×3 is the smallest filter size that captures spatial structure (up/down/left/right/center). It's the minimum size that defines a 'neighborhood' in 2D. Smaller (1×1) filters don't capture spatial patterns; larger filters can be composed from multiple 3×3 layers.
VGG's core insight is that stacked small filters achieve the same receptive field as large filters with significant advantages.
Receptive Field Calculation:
General formula for $n$ stacked 3×3 convolutions with stride 1: $$RF = 1 + 2n$$
Parameter Comparison:
For C input and C output channels:
| Configuration | Receptive Field | Parameters | Non-linearities |
|---|---|---|---|
| One 5×5 conv | 5×5 | 25C² | 1 |
| Two 3×3 convs | 5×5 | 2 × 9C² = 18C² | 2 |
| One 7×7 conv | 7×7 | 49C² | 1 |
| Three 3×3 convs | 7×7 | 3 × 9C² = 27C² | 3 |
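To make the table concrete, here is a quick sanity check in PyTorch: it counts the weights of one 5×5 convolution versus two stacked 3×3 convolutions. The choice C = 64 is purely illustrative, and biases are disabled so the counts match the 25C² and 18C² figures above.

```python
import torch.nn as nn

C = 64  # channels in and out (illustrative choice)

# One 5×5 conv vs. two stacked 3×3 convs, bias disabled so the
# counts match the 25C² and 18C² figures in the table.
conv5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
conv3x2 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

params_5 = sum(p.numel() for p in conv5.parameters())    # 25 * 64^2 = 102,400
params_3 = sum(p.numel() for p in conv3x2.parameters())  # 18 * 64^2 = 73,728
print(f"One 5×5:  {params_5:,}")
print(f"Two 3×3:  {params_3:,}")
print(f"Savings:  {1 - params_3 / params_5:.0%}")         # 28%
```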
The Advantages of Stacking:

1. Fewer parameters for the same receptive field (18C² vs. 25C² for the 5×5 case above).
2. More non-linearities (one ReLU after every conv), which makes the decision function more discriminative.
3. A single, uniform building block, which keeps the architecture simple and regular.

The snippet below verifies the receptive-field equivalence:
```python
def compute_receptive_field(kernel_sizes, strides):
    """
    Compute the receptive field of stacked convolutions.

    RF[l] = RF[l-1] + (kernel_size[l] - 1) * product(strides[:l])
    """
    rf = 1
    stride_product = 1
    for k, s in zip(kernel_sizes, strides):
        rf = rf + (k - 1) * stride_product
        stride_product *= s
    return rf

# VGG block: three 3×3 convs
vgg_block = compute_receptive_field([3, 3, 3], [1, 1, 1])
print(f"Three 3×3 convs: RF = {vgg_block}×{vgg_block}")  # 7×7

# Single 7×7 conv
single_large = compute_receptive_field([7], [1])
print(f"One 7×7 conv: RF = {single_large}×{single_large}")  # 7×7

# AlexNet first layer: 11×11, stride 4
alexnet_conv1 = compute_receptive_field([11], [4])
print(f"11×11 stride 4: RF = {alexnet_conv1}×{alexnet_conv1}")  # 11×11
```

The VGG paper systematically evaluated configurations from 11 to 19 weight layers, labeled A through E. The most commonly used are VGG16 (configuration D) and VGG19 (configuration E).
VGG Architecture Configurations (conv layers only):

```
┌───────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ Stage │ A (VGG11) │ B (VGG13) │ C (VGG16) │ D (VGG16) │ E (VGG19) │
├───────┼───────────┴───────────┴───────────┴───────────┴───────────┤
│ Input │                    224 × 224 RGB image                    │
├───────┼───────────┬───────────┬───────────┬───────────┬───────────┤
│ conv1 │ 1×conv64  │ 2×conv64  │ 2×conv64  │ 2×conv64  │ 2×conv64  │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv2 │ 1×conv128 │ 2×conv128 │ 2×conv128 │ 2×conv128 │ 2×conv128 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv3 │ 2×conv256 │ 2×conv256 │ 3×conv256 │ 3×conv256 │ 4×conv256 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv4 │ 2×conv512 │ 2×conv512 │ 3×conv512 │ 3×conv512 │ 4×conv512 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┼───────────┼───────────┼───────────┼───────────┤
│ conv5 │ 2×conv512 │ 2×conv512 │ 3×conv512 │ 3×conv512 │ 4×conv512 │
│       │ maxpool   │ maxpool   │ maxpool   │ maxpool   │ maxpool   │
├───────┼───────────┴───────────┴───────────┴───────────┴───────────┤
│ FC    │             4096 → 4096 → 1000 (same for all)             │
├───────┼───────────┬───────────┬───────────┬───────────┬───────────┤
│ Params│ 133M      │ 133M      │ 138M      │ 138M      │ 144M      │
│ Layers│ 11        │ 13        │ 16        │ 16        │ 19        │
└───────┴───────────┴───────────┴───────────┴───────────┴───────────┘
```

Note: All convolutions are 3×3 with stride 1, padding 1. All maxpools are 2×2 with stride 2. VGG16-C uses some 1×1 convs; VGG16-D uses all 3×3 filters and is the more common variant.

VGG16 has 138 million parameters, roughly 90% of which sit in the fully connected layers (FC6 alone: 7×7×512×4096 ≈ 102M). The conv layers contribute only ~15M parameters. This imbalance motivated later architectures (ResNet, Inception) to replace FC layers with global average pooling.
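That parameter split is easy to verify by hand. The sketch below simply multiplies out the layer shapes from the table for configuration D (biases ignored for simplicity):

```python
# Conv parameters for VGG16 (config D): 3×3 kernels, (in_channels, out_channels) per layer
conv_layers = [
    (3, 64), (64, 64),                    # conv1
    (64, 128), (128, 128),                # conv2
    (128, 256), (256, 256), (256, 256),   # conv3
    (256, 512), (512, 512), (512, 512),   # conv4
    (512, 512), (512, 512), (512, 512),   # conv5
]
conv_params = sum(3 * 3 * cin * cout for cin, cout in conv_layers)

# FC parameters: flattened 7×7×512 feature map -> 4096 -> 4096 -> 1000
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000

total = conv_params + fc_params
print(f"Conv params: {conv_params / 1e6:.1f}M")  # ~14.7M
print(f"FC params:   {fc_params / 1e6:.1f}M")    # ~123.6M
print(f"FC share:    {fc_params / total:.0%}")   # ~89%
```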
```python
import torch
import torch.nn as nn

# Configuration dictionary: 'M' = maxpool, numbers = conv output channels
cfgs = {
    'VGG11': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG13': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'VGG16': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'VGG19': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}

class VGG(nn.Module):
    def __init__(self, cfg_name='VGG16', num_classes=1000, use_bn=False):
        super(VGG, self).__init__()
        self.features = self._make_layers(cfgs[cfg_name], use_bn)
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
        self._initialize_weights()

    def _make_layers(self, cfg, use_bn):
        layers = []
        in_channels = 3
        for v in cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                conv = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
                if use_bn:
                    layers.extend([conv, nn.BatchNorm2d(v), nn.ReLU(True)])
                else:
                    layers.extend([conv, nn.ReLU(True)])
                in_channels = v
        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

# Create models
vgg16 = VGG('VGG16')
vgg19_bn = VGG('VGG19', use_bn=True)

print(f"VGG16 params: {sum(p.numel() for p in vgg16.parameters()):,}")
```

Training VGG's deep architectures required careful initialization and multi-scale training.
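For the multi-scale (scale-jittering) part, the paper resizes the shorter image side to a random scale S in [256, 512] before taking 224×224 crops. A minimal torchvision sketch of such a pipeline might look like the following; the normalization constants are the standard torchvision ImageNet values rather than the paper's exact mean-RGB subtraction, and the details are a paraphrase, not the reference implementation:

```python
import random
from torchvision import transforms
import torchvision.transforms.functional as TF

class RandomScaleJitter:
    """Resize the shorter side to a random S in [s_min, s_max] (VGG-style scale jittering)."""
    def __init__(self, s_min=256, s_max=512):
        self.s_min, self.s_max = s_min, s_max

    def __call__(self, img):
        s = random.randint(self.s_min, self.s_max)
        return TF.resize(img, s)  # shorter side -> s, aspect ratio preserved

train_transform = transforms.Compose([
    RandomScaleJitter(256, 512),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```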
VGG16 requires ~15.5 billion FLOPs for a single forward pass, roughly 20× the cost of AlexNet. Its weights alone occupy over 500MB in FP32, most of them in the fully connected layers. VGG is powerful but impractical for edge deployment.
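That figure can be reproduced with simple arithmetic: each conv layer costs output_height × output_width × k² × C_in × C_out multiply-accumulates. Summing over VGG16's conv layers gives roughly 15.3 G MACs, and the FC layers add only ~0.12 G more:

```python
# Multiply-accumulate count for one conv layer: H_out * W_out * k*k * C_in * C_out
def conv_macs(h, w, cin, cout, k=3):
    return h * w * k * k * cin * cout

# (output spatial size, in_channels, out_channels) for each VGG16 conv layer
layers = [
    (224, 3, 64), (224, 64, 64),
    (112, 64, 128), (112, 128, 128),
    (56, 128, 256), (56, 256, 256), (56, 256, 256),
    (28, 256, 512), (28, 512, 512), (28, 512, 512),
    (14, 512, 512), (14, 512, 512), (14, 512, 512),
]
total_macs = sum(conv_macs(s, s, cin, cout) for s, cin, cout in layers)
print(f"VGG16 conv MACs: {total_macs / 1e9:.1f} G")  # ~15.3 G multiply-accumulates
```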
VGG's greatest legacy is as a feature extractor. The simple, uniform structure made it ideal for transfer learning.
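One common pattern (used, for example, in perceptual losses and image retrieval) is to tap an intermediate activation of the frozen conv stack with a forward hook. This is a small sketch; the layer index is illustrative and corresponds to the last ReLU of stage conv4 in torchvision's VGG16 layout.

```python
import torch
import torchvision.models as models

vgg16 = models.vgg16(pretrained=True).eval()

features = {}
def save_activation(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Index 22 is the ReLU after the third conv of stage conv4 in torchvision's VGG16
# (chosen here for illustration; pick whatever depth suits your task).
vgg16.features[22].register_forward_hook(save_activation('conv4_3'))

with torch.no_grad():
    _ = vgg16(torch.randn(1, 3, 224, 224))

print(features['conv4_3'].shape)  # torch.Size([1, 512, 28, 28])
```

For classification transfer, the more common recipe is to freeze the conv stack and retrain only the classifier, as in the example below.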
```python
import torch
import torchvision.models as models

# Load pretrained VGG16
vgg16 = models.vgg16(pretrained=True)

# Freeze the feature extractor
for param in vgg16.features.parameters():
    param.requires_grad = False

# Replace the final classifier layer for a new task (e.g., 10 classes)
vgg16.classifier[-1] = torch.nn.Linear(4096, 10)

# Only the classifier trains now
trainable = sum(p.numel() for p in vgg16.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")  # ~120M (the FC classifier still dominates)
```

VGG's limitations (parameter-heavy FC layers, high compute cost, and the difficulty of training much deeper plain stacks) motivated subsequent architectures.
Next: We'll explore Inception/GoogLeNet, which asked: what if we run multiple filter sizes in parallel? The Inception module's clever design achieves excellent accuracy with far fewer parameters than VGG.
You now understand VGGNet's design philosophy, the receptive field argument for small filters, and why VGG became the default choice for transfer learning. You can explain both its strengths (simplicity, good features) and limitations (computational cost, depth barriers).