In September 2014, Google published "Going Deeper with Convolutions", introducing GoogLeNet (also called Inception v1)—a 22-layer network that won ILSVRC 2014 with 6.7% top-5 error while using only 5 million parameters.
For context: VGG16 has 138 million parameters. GoogLeNet achieved better accuracy with 27× fewer parameters.
The key innovation was the Inception module: instead of choosing between 1×1, 3×3, or 5×5 convolutions, why not use all of them in parallel and let the network learn which to emphasize? This seemingly wasteful approach, combined with clever dimensionality reduction, produced an architecture that was both deeper and more efficient than anything before it.
This page covers the Inception module's design rationale, the critical role of 1×1 convolutions for dimensionality reduction, GoogLeNet's full architecture including auxiliary classifiers, and the evolution through Inception v2/v3/v4. You'll understand how to build efficient, wide networks.
Traditional CNNs require choosing a filter size at each layer. But what if the optimal size varies across the image?
The Problem:
Different objects in the same image may need different filter sizes. A face detection network needs small filters for eyes and large filters for head shape.
The Inception Solution:
Rather than choosing one, the Inception module applies multiple filter sizes in parallel and concatenates the results.
The name references the movie Inception ("we need to go deeper") and the Network-in-Network paper that introduced 1×1 convolutions. It's a network within a network within a network—dreams within dreams.
The conceptually simple version of the Inception module runs all operations in parallel:
Naive Inception Module:

```
                        INPUT (28×28×256)
                               │
        ┌──────────────┬───────┴──────┬──────────────┐
        ▼              ▼              ▼              ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ Conv    │    │ Conv    │    │ Conv    │    │ MaxPool │
   │ 1×1     │    │ 3×3     │    │ 5×5     │    │ 3×3     │
   │ 128 ch  │    │ 192 ch  │    │ 96 ch   │    │ stride 1│
   └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘
        │              │              │              │
        └──────────────┴──────┬───────┴──────────────┘
                              ▼
                         CONCATENATE
               28×28×(128+192+96+256) = 28×28×672

   Problem: the 5×5 branch alone needs 5×5×256×96 ≈ 614K parameters,
   and because the pooling branch passes all 256 input channels
   through, the output channel count grows with every module.
```

The Computational Problem:
Consider applying a 5×5 convolution to 256 input channels, producing 96 output channels. Each output channel at each spatial position requires 5×5×256 multiply-accumulates, so on a 28×28 feature map the branch costs 28×28 × 5×5 × 256 × 96 ≈ 480 million operations, for just one branch of one module. For comparison, a full forward pass through VGG16 is only about 15 billion operations. The naive approach is computationally infeasible.
The solution is 1×1 convolutions for dimensionality reduction before expensive 3×3 and 5×5 operations.
Cost Reduction Example:
To apply a 5×5 conv reducing 256 → 96 channels:
Without bottleneck: 5×5×256×96 = 614,400 parameters (and ≈ 480M operations at 28×28).
With 1×1 bottleneck (256 → 64 → 96): 1×1×256×64 + 5×5×64×96 = 16,384 + 153,600 = 169,984 parameters, roughly 3.6× fewer.
The same logic applies to compute: reduce channels first, then apply expensive operations.
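As a sanity check, the arithmetic above can be reproduced in a few lines of plain Python; the helper functions below are purely illustrative, not part of any library.

```python
# Back-of-the-envelope cost of the 5×5 branch, with and without a 1×1 bottleneck.
# Figures match the worked example above (28×28 feature map, 256 → 96 channels).

H = W = 28            # spatial size of the feature map
C_IN, C_OUT = 256, 96
C_REDUCE = 64         # channels after the 1×1 reduction

def conv_params(c_in, c_out, k):
    """Weights in a k×k convolution (biases ignored)."""
    return k * k * c_in * c_out

def conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates for a stride-1, 'same'-padded convolution."""
    return h * w * conv_params(c_in, c_out, k)

# Naive: one 5×5 conv straight from 256 to 96 channels
naive_params = conv_params(C_IN, C_OUT, 5)
naive_macs   = conv_macs(C_IN, C_OUT, 5, H, W)

# Bottlenecked: 1×1 (256 → 64) followed by 5×5 (64 → 96)
bottleneck_params = conv_params(C_IN, C_REDUCE, 1) + conv_params(C_REDUCE, C_OUT, 5)
bottleneck_macs   = conv_macs(C_IN, C_REDUCE, 1, H, W) + conv_macs(C_REDUCE, C_OUT, 5, H, W)

print(f"naive:      {naive_params:>9,} params  {naive_macs:>13,} MACs")
print(f"bottleneck: {bottleneck_params:>9,} params  {bottleneck_macs:>13,} MACs")
print(f"savings:    {naive_params / bottleneck_params:.1f}× params, "
      f"{naive_macs / bottleneck_macs:.1f}× MACs")
```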
Inception Module (with 1×1 bottlenecks):

```
                        INPUT (28×28×256)
                               │
        ┌──────────────┬───────┴──────┬──────────────┐
        ▼              ▼              ▼              ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ Conv1×1 │    │ Conv1×1 │    │ Conv1×1 │    │ MaxPool │
   │ 64 ch   │    │ 96 ch   │    │ 16 ch   │    │ 3×3     │
   └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘
        │              ▼              ▼              ▼
        │         ┌─────────┐    ┌─────────┐    ┌─────────┐
        │         │ Conv3×3 │    │ Conv5×5 │    │ Conv1×1 │
        │         │ 128 ch  │    │ 32 ch   │    │ 32 ch   │
        │         └────┬────┘    └────┬────┘    └────┬────┘
        │              │              │              │
        └──────────────┴──────┬───────┴──────────────┘
                              ▼
                         CONCATENATE
                28×28×(64+128+32+32) = 28×28×256

   Now the 5×5 path sees only 16 input channels, and the pooling
   branch is projected down to 32 channels, so total parameters are
   far lower than in the naive version.
```

GoogLeNet stacks 9 Inception modules between conventional conv and pooling layers, for 22 layers total.
| Stage | Layer | Output Size | Notes |
|---|---|---|---|
| Input | — | 224×224×3 | RGB image |
| Stem | Conv 7×7/2, MaxPool 3×3/2, Conv 3×3, MaxPool 3×3/2 | 28×28×192 | Initial feature extraction |
| 3a | Inception | 28×28×256 | First Inception module |
| 3b | Inception + Pool | 14×14×480 | MaxPool reduces spatial |
| 4a-4e | 5× Inception | 14×14×832 | Deep feature learning |
| 4a,4d | Auxiliary classifiers | — | Combat vanishing gradients |
| 5a-5b | 2× Inception + Pool | 7×7×1024 | Final Inception modules |
| Output | AvgPool + FC | 1000 | Global average pooling! |
GoogLeNet replaces VGG's massive FC layers with Global Average Pooling. Flattening the final 7×7×1024 feature map into a VGG-style 4096-unit FC layer would cost 7×7×1024×4096 ≈ 206M parameters; instead, GoogLeNet averages each channel's 7×7 map to a single value (7×7×1024 → 1024) and feeds that vector into a single 1000-way classifier of about 1M parameters. This reduces parameters dramatically and helps prevent overfitting.
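Here is a minimal PyTorch comparison of the two head designs. The layer sizes mirror the numbers above; both heads are illustrative sketches rather than the exact VGG or GoogLeNet classifiers.

```python
import torch
import torch.nn as nn

# Classifier head on a 7×7×1024 feature map:
# VGG-style flatten + FC vs. GoogLeNet-style global average pooling.

fc_head = nn.Sequential(
    nn.Flatten(),                     # 7*7*1024 = 50,176 features
    nn.Linear(7 * 7 * 1024, 4096),    # ≈ 206M weights in this layer alone
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # 7×7×1024 → 1×1×1024
    nn.Flatten(),                     # → 1024
    nn.Linear(1024, 1000),            # ≈ 1M weights
)

features = torch.randn(1, 1024, 7, 7)
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"FC head params:  {count(fc_head):,}")    # ≈ 210M
print(f"GAP head params: {count(gap_head):,}")   # ≈ 1M
print(gap_head(features).shape)                  # torch.Size([1, 1000])
```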
GoogLeNet includes two auxiliary classifiers attached to intermediate layers. These were designed to combat vanishing gradients in the 22-layer network.
Do Auxiliary Classifiers Help?
Later research (Inception v3 paper) found that auxiliary classifiers primarily act as regularizers, not gradient highways. They slightly improve final accuracy but aren't essential. Modern architectures use skip connections (ResNet) instead.
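For reference, here is a sketch of such an auxiliary head in PyTorch, roughly following the structure described in the GoogLeNet paper (average pool 5×5/3, a 128-channel 1×1 conv, a 1024-unit FC layer, heavy dropout, and a 1000-way classifier); the class name and exact wiring here are ours.

```python
import torch
import torch.nn as nn

class AuxiliaryClassifier(nn.Module):
    """Auxiliary head sketch: avg-pool 5×5/3 → 1×1 conv (128) → FC 1024 → dropout → FC 1000.
    Its loss is added to the main loss with a small weight (0.3 in the paper) during
    training only; the head is discarded at inference time.
    """
    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 14×14 → 4×4
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)

# Attached after Inception 4a, whose output is 14×14×512 in GoogLeNet:
aux1 = AuxiliaryClassifier(in_channels=512)
logits = aux1(torch.randn(1, 512, 14, 14))
print(logits.shape)  # torch.Size([1, 1000])
```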
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InceptionModule(nn.Module):
    """
    Inception module with dimensionality reduction.

    Args:
        in_channels: Input channels
        ch1x1: Output channels for 1×1 branch
        ch3x3_reduce: Channels after 1×1 reduction for 3×3 branch
        ch3x3: Output channels for 3×3 branch
        ch5x5_reduce: Channels after 1×1 reduction for 5×5 branch
        ch5x5: Output channels for 5×5 branch
        pool_proj: Output channels for pooling branch
    """

    def __init__(self, in_channels, ch1x1, ch3x3_reduce, ch3x3,
                 ch5x5_reduce, ch5x5, pool_proj):
        super(InceptionModule, self).__init__()

        # Branch 1: 1×1 conv
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.BatchNorm2d(ch1x1),
            nn.ReLU(inplace=True)
        )

        # Branch 2: 1×1 reduce → 3×3 conv
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch3x3_reduce, kernel_size=1),
            nn.BatchNorm2d(ch3x3_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3_reduce, ch3x3, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch3x3),
            nn.ReLU(inplace=True)
        )

        # Branch 3: 1×1 reduce → 5×5 conv
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch5x5_reduce, kernel_size=1),
            nn.BatchNorm2d(ch5x5_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5_reduce, ch5x5, kernel_size=5, padding=2),
            nn.BatchNorm2d(ch5x5),
            nn.ReLU(inplace=True)
        )

        # Branch 4: 3×3 maxpool → 1×1 conv
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.BatchNorm2d(pool_proj),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        # Concatenate along channel dimension
        return torch.cat([b1, b2, b3, b4], dim=1)


# Example: Inception 3a module from GoogLeNet
# Input: 28×28×192
inception_3a = InceptionModule(
    in_channels=192,
    ch1x1=64,
    ch3x3_reduce=96, ch3x3=128,
    ch5x5_reduce=16, ch5x5=32,
    pool_proj=32
)
# Output: 28×28×(64+128+32+32) = 28×28×256

x = torch.randn(1, 192, 28, 28)
out = inception_3a(x)
print(f"Input: {x.shape} → Output: {out.shape}")
print(f"Parameters: {sum(p.numel() for p in inception_3a.parameters()):,}")
```

The Inception architecture evolved through several versions, each adding refinements.
| Version | Year | Key Changes | Top-5 Error |
|---|---|---|---|
| Inception v1 | 2014 | Original design, auxiliary classifiers | 6.67% |
| Inception v2 | 2015 | Batch normalization, factorized convs | 5.6% |
| Inception v3 | 2015 | Label smoothing, RMSprop, 7×1+1×7 factorization (sketched after this table) | 4.2% |
| Inception v4 | 2016 | Cleaner design, combined with ResNet | 3.1% |
| Inception-ResNet | 2016 | Residual connections in Inception modules | 3.1% |
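To make the "factorized convolutions" in the v2/v3 rows concrete, here is a small PyTorch sketch comparing a full 7×7 convolution with the asymmetric 7×1 + 1×7 factorization used in Inception v3; the channel count is an arbitrary example, not a value taken from the papers.

```python
import torch
import torch.nn as nn

C = 192  # example channel count, chosen only for illustration

# A single 7×7 convolution
full_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3)

# The same receptive field factorized into 7×1 followed by 1×7,
# with a nonlinearity between the two stages
factorized = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(7, 1), padding=(3, 0)),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=(1, 7), padding=(0, 3)),
    nn.ReLU(inplace=True),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"7×7 conv:  {count(full_7x7):,} params")    # 49·C² + C
print(f"7×1 + 1×7: {count(factorized):,} params")  # 14·C² + 2C, ≈ 3.5× fewer

x = torch.randn(1, C, 17, 17)
print(full_7x7(x).shape, factorized(x).shape)  # both preserve the 17×17 spatial size
```

In the Inception v3 paper this asymmetric factorization is applied on medium-sized feature maps (the 17×17 stage), where the parameter savings come without a loss in accuracy.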
Next: We'll explore CNN Design Principles, synthesizing the lessons from LeNet, AlexNet, VGGNet, and Inception into general guidelines for building effective convolutional architectures.
You now understand the Inception module's multi-scale approach, how 1×1 convolutions enable efficient parallel paths, and why GoogLeNet achieved state-of-the-art results with a fraction of VGG's parameters. You can implement Inception modules and explain their design rationale.