In September 2014, Google published "Going Deeper with Convolutions", introducing GoogLeNet (also called Inception v1)—a 22-layer network that won ILSVRC 2014 with 6.7% top-5 error while using only 5 million parameters.
For context: VGG16 has 138 million parameters. GoogLeNet achieved better accuracy with 27× fewer parameters.
The key innovation was the Inception module: instead of choosing between 1×1, 3×3, or 5×5 convolutions, why not use all of them in parallel and let the network learn which to emphasize? This seemingly wasteful approach, combined with clever dimensionality reduction, produced an architecture that was both deeper and more efficient than anything before it.
This page covers the Inception module's design rationale, the critical role of 1×1 convolutions for dimensionality reduction, GoogLeNet's full architecture including auxiliary classifiers, and the evolution through Inception v2/v3/v4. You'll understand how to build efficient, wide networks.
Traditional CNNs require choosing a filter size at each layer. But what if the optimal size varies across the image?
The Problem:
Different objects in the same image may need different filter sizes. A face detection network needs small filters for eyes and large filters for head shape.
The Inception Solution:
Rather than choosing one, the Inception module applies multiple filter sizes in parallel and concatenates the results.
The name references the movie Inception ("we need to go deeper") and the Network-in-Network paper that introduced 1×1 convolutions. It's a network within a network within a network—dreams within dreams.
The conceptually simple version of the Inception module runs all operations in parallel:
Naive Inception Module:

```
                        INPUT (28×28×256)
                               │
        ┌──────────────┬───────┴──────┬──────────────┐
        ▼              ▼              ▼              ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ Conv    │    │ Conv    │    │ Conv    │    │ MaxPool │
   │ 1×1     │    │ 3×3     │    │ 5×5     │    │ 3×3     │
   │ 128 ch  │    │ 192 ch  │    │ 96 ch   │    │ stride 1│
   └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘
        │              │              │              │
        └──────────────┴──────┬───────┴──────────────┘
                              ▼
                         CONCATENATE
               28×28×(128+192+96+256) = 28×28×672

   Problem: the 5×5 branch alone needs 5×5×256×96 ≈ 614K parameters,
   and because the pooling branch passes all 256 input channels
   through, the output channel count grows with every module.
```

The Computational Problem:
Consider applying a 5×5 convolution to 256 input channels, producing 96 output channels. Each output channel at each spatial position requires 5×5×256 multiply-accumulates, so on a 28×28 feature map the branch costs 28×28 × 5×5 × 256 × 96 ≈ 480 million operations, for just one branch of one module. For comparison, a full forward pass through VGG16 is only about 15 billion operations. The naive approach is computationally infeasible.
The solution is 1×1 convolutions for dimensionality reduction before expensive 3×3 and 5×5 operations.
Cost Reduction Example:
To apply a 5×5 conv reducing 256 → 96 channels:
Without bottleneck: 5×5×256×96 = 614,400 parameters (and ≈ 480M operations at 28×28).
With 1×1 bottleneck (256 → 64 → 96): 1×1×256×64 + 5×5×64×96 = 16,384 + 153,600 = 169,984 parameters, roughly 3.6× fewer.
The same logic applies to compute: reduce channels first, then apply expensive operations.
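As a sanity check, the arithmetic above can be reproduced in a few lines of plain Python; the helper functions below are purely illustrative, not part of any library.

```python
# Back-of-the-envelope cost of the 5×5 branch, with and without a 1×1 bottleneck.
# Figures match the worked example above (28×28 feature map, 256 → 96 channels).

H = W = 28            # spatial size of the feature map
C_IN, C_OUT = 256, 96
C_REDUCE = 64         # channels after the 1×1 reduction

def conv_params(c_in, c_out, k):
    """Weights in a k×k convolution (biases ignored)."""
    return k * k * c_in * c_out

def conv_macs(c_in, c_out, k, h, w):
    """Multiply-accumulates for a stride-1, 'same'-padded convolution."""
    return h * w * conv_params(c_in, c_out, k)

# Naive: one 5×5 conv straight from 256 to 96 channels
naive_params = conv_params(C_IN, C_OUT, 5)
naive_macs   = conv_macs(C_IN, C_OUT, 5, H, W)

# Bottlenecked: 1×1 (256 → 64) followed by 5×5 (64 → 96)
bottleneck_params = conv_params(C_IN, C_REDUCE, 1) + conv_params(C_REDUCE, C_OUT, 5)
bottleneck_macs   = conv_macs(C_IN, C_REDUCE, 1, H, W) + conv_macs(C_REDUCE, C_OUT, 5, H, W)

print(f"naive:      {naive_params:>9,} params  {naive_macs:>13,} MACs")
print(f"bottleneck: {bottleneck_params:>9,} params  {bottleneck_macs:>13,} MACs")
print(f"savings:    {naive_params / bottleneck_params:.1f}× params, "
      f"{naive_macs / bottleneck_macs:.1f}× MACs")
```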
Inception Module (with 1×1 bottlenecks):

```
                        INPUT (28×28×256)
                               │
        ┌──────────────┬───────┴──────┬──────────────┐
        ▼              ▼              ▼              ▼
   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
   │ Conv1×1 │    │ Conv1×1 │    │ Conv1×1 │    │ MaxPool │
   │ 64 ch   │    │ 96 ch   │    │ 16 ch   │    │ 3×3     │
   └────┬────┘    └────┬────┘    └────┬────┘    └────┬────┘
        │              ▼              ▼              ▼
        │         ┌─────────┐    ┌─────────┐    ┌─────────┐
        │         │ Conv3×3 │    │ Conv5×5 │    │ Conv1×1 │
        │         │ 128 ch  │    │ 32 ch   │    │ 32 ch   │
        │         └────┬────┘    └────┬────┘    └────┬────┘
        │              │              │              │
        └──────────────┴──────┬───────┴──────────────┘
                              ▼
                         CONCATENATE
                28×28×(64+128+32+32) = 28×28×256

   Now the 5×5 path sees only 16 input channels, and the pooling
   branch is projected down to 32 channels, so total parameters are
   far lower than in the naive version.
```

GoogLeNet stacks 9 Inception modules between conventional conv and pooling layers, for 22 layers total.
| Stage | Layer | Output Size | Notes |
|---|---|---|---|
| Input | — | 224×224×3 | RGB image |
| Stem | Conv 7×7/2, MaxPool 3×3/2, Conv 3×3, MaxPool 3×3/2 | 28×28×192 | Initial feature extraction |
| 3a | Inception | 28×28×256 | First Inception module |
| 3b | Inception + Pool | 14×14×480 | MaxPool reduces spatial |
| 4a-4e | 5× Inception | 14×14×832 | Deep feature learning |
| 4a,4d | Auxiliary classifiers | — | Combat vanishing gradients |
| 5a-5b | 2× Inception + Pool | 7×7×1024 | Final Inception modules |
| Output | AvgPool + FC | 1000 | Global average pooling! |
GoogLeNet replaces VGG's massive FC layers with Global Average Pooling. Flattening the final 7×7×1024 feature map into a VGG-style 4096-unit FC layer would cost 7×7×1024×4096 ≈ 206M parameters; instead, GoogLeNet averages each channel's 7×7 map to a single value (7×7×1024 → 1024) and feeds that vector into a single 1000-way classifier of about 1M parameters. This reduces parameters dramatically and helps prevent overfitting.
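Here is a minimal PyTorch comparison of the two head designs. The layer sizes mirror the numbers above; both heads are illustrative sketches rather than the exact VGG or GoogLeNet classifiers.

```python
import torch
import torch.nn as nn

# Classifier head on a 7×7×1024 feature map:
# VGG-style flatten + FC vs. GoogLeNet-style global average pooling.

fc_head = nn.Sequential(
    nn.Flatten(),                     # 7*7*1024 = 50,176 features
    nn.Linear(7 * 7 * 1024, 4096),    # ≈ 206M weights in this layer alone
    nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),          # 7×7×1024 → 1×1×1024
    nn.Flatten(),                     # → 1024
    nn.Linear(1024, 1000),            # ≈ 1M weights
)

features = torch.randn(1, 1024, 7, 7)
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"FC head params:  {count(fc_head):,}")    # ≈ 210M
print(f"GAP head params: {count(gap_head):,}")   # ≈ 1M
print(gap_head(features).shape)                  # torch.Size([1, 1000])
```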
GoogLeNet includes two auxiliary classifiers attached to intermediate layers. These were designed to combat vanishing gradients in the 22-layer network.
Do Auxiliary Classifiers Help?
Later research (Inception v3 paper) found that auxiliary classifiers primarily act as regularizers, not gradient highways. They slightly improve final accuracy but aren't essential. Modern architectures use skip connections (ResNet) instead.
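For reference, here is a sketch of such an auxiliary head in PyTorch, roughly following the structure described in the GoogLeNet paper (average pool 5×5/3, a 128-channel 1×1 conv, a 1024-unit FC layer, heavy dropout, and a 1000-way classifier); the class name and exact wiring here are ours.

```python
import torch
import torch.nn as nn

class AuxiliaryClassifier(nn.Module):
    """Auxiliary head sketch: avg-pool 5×5/3 → 1×1 conv (128) → FC 1024 → dropout → FC 1000.
    Its loss is added to the main loss with a small weight (0.3 in the paper) during
    training only; the head is discarded at inference time.
    """
    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 14×14 → 4×4
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)

# Attached after Inception 4a, whose output is 14×14×512 in GoogLeNet:
aux1 = AuxiliaryClassifier(in_channels=512)
logits = aux1(torch.randn(1, 512, 14, 14))
print(logits.shape)  # torch.Size([1, 1000])
```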
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InceptionModule(nn.Module):
    """
    Inception module with dimensionality reduction.

    Args:
        in_channels: Input channels
        ch1x1: Output channels for 1×1 branch
        ch3x3_reduce: Channels after 1×1 reduction for 3×3 branch
        ch3x3: Output channels for 3×3 branch
        ch5x5_reduce: Channels after 1×1 reduction for 5×5 branch
        ch5x5: Output channels for 5×5 branch
        pool_proj: Output channels for pooling branch
    """

    def __init__(self, in_channels, ch1x1, ch3x3_reduce, ch3x3,
                 ch5x5_reduce, ch5x5, pool_proj):
        super(InceptionModule, self).__init__()

        # Branch 1: 1×1 conv
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.BatchNorm2d(ch1x1),
            nn.ReLU(inplace=True)
        )

        # Branch 2: 1×1 reduce → 3×3 conv
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch3x3_reduce, kernel_size=1),
            nn.BatchNorm2d(ch3x3_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3_reduce, ch3x3, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch3x3),
            nn.ReLU(inplace=True)
        )

        # Branch 3: 1×1 reduce → 5×5 conv
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch5x5_reduce, kernel_size=1),
            nn.BatchNorm2d(ch5x5_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5_reduce, ch5x5, kernel_size=5, padding=2),
            nn.BatchNorm2d(ch5x5),
            nn.ReLU(inplace=True)
        )

        # Branch 4: 3×3 maxpool → 1×1 conv
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.BatchNorm2d(pool_proj),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        # Concatenate along channel dimension
        return torch.cat([b1, b2, b3, b4], dim=1)


# Example: Inception 3a module from GoogLeNet
# Input: 28×28×192
inception_3a = InceptionModule(
    in_channels=192,
    ch1x1=64,
    ch3x3_reduce=96, ch3x3=128,
    ch5x5_reduce=16, ch5x5=32,
    pool_proj=32
)
# Output: 28×28×(64+128+32+32) = 28×28×256

x = torch.randn(1, 192, 28, 28)
out = inception_3a(x)
print(f"Input: {x.shape} → Output: {out.shape}")
print(f"Parameters: {sum(p.numel() for p in inception_3a.parameters()):,}")
```

The Inception architecture evolved through several versions, each adding refinements.
| Version | Year | Key Changes | Top-5 Error |
|---|---|---|---|
| Inception v1 | 2014 | Original design, auxiliary classifiers | 6.67% |
| Inception v2 | 2015 | Batch normalization, factorized convs | 5.6% |
| Inception v3 | 2015 | Label smoothing, RMSprop, 7×1+1×7 factorization (sketched after this table) | 4.2% |
| Inception v4 | 2016 | Cleaner design, combined with ResNet | 3.1% |
| Inception-ResNet | 2016 | Residual connections in Inception modules | 3.1% |
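To make the "factorized convolutions" in the v2/v3 rows concrete, here is a small PyTorch sketch comparing a full 7×7 convolution with the asymmetric 7×1 + 1×7 factorization used in Inception v3; the channel count is an arbitrary example, not a value taken from the papers.

```python
import torch
import torch.nn as nn

C = 192  # example channel count, chosen only for illustration

# A single 7×7 convolution
full_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3)

# The same receptive field factorized into 7×1 followed by 1×7,
# with a nonlinearity between the two stages
factorized = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(7, 1), padding=(3, 0)),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=(1, 7), padding=(0, 3)),
    nn.ReLU(inplace=True),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"7×7 conv:  {count(full_7x7):,} params")    # 49·C² + C
print(f"7×1 + 1×7: {count(factorized):,} params")  # 14·C² + 2C, ≈ 3.5× fewer

x = torch.randn(1, C, 17, 17)
print(full_7x7(x).shape, factorized(x).shape)  # both preserve the 17×17 spatial size
```

In the Inception v3 paper this asymmetric factorization is applied on medium-sized feature maps (the 17×17 stage), where the parameter savings come without a loss in accuracy.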
Next: We'll explore CNN Design Principles, synthesizing the lessons from LeNet, AlexNet, VGGNet, and Inception into general guidelines for building effective convolutional architectures.
You now understand the Inception module's multi-scale approach, how 1×1 convolutions enable efficient parallel paths, and why GoogLeNet achieved state-of-the-art results with a fraction of VGG's parameters. You can implement Inception modules and explain their design rationale.