In December 2015, ResNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with a stunning 152-layer network, 8× deeper than the VGG networks of the previous year. The top-5 error rate dropped to 3.57%, surpassing human-level performance (estimated at 5.1%). This wasn't incremental progress; it was a paradigm shift.
This page dissects the ResNet architecture in detail: from the basic building blocks to the complete network designs, from the original formulation to practical implementation considerations that enabled training these unprecedented depths.
By the end of this page, you will understand: (1) The two ResNet building blocks: Basic and Bottleneck, (2) Complete ResNet architectures from ResNet-18 to ResNet-152, (3) Stage design and downsampling strategies, (4) Initialization and training practices, and (5) How to implement production-ready ResNets.
The Basic Block is the fundamental building unit for shallower ResNets (ResNet-18 and ResNet-34). It consists of two 3×3 convolutional layers with a skip connection.
Structure: $$\mathbf{y} = \text{ReLU}(\mathbf{x} + \mathcal{F}(\mathbf{x}))$$
Where $\mathcal{F}$ consists of: Conv 3×3 → BatchNorm → ReLU → Conv 3×3 → BatchNorm.
The ReLU is applied after the addition, not within $\mathcal{F}$. This placement is crucial and was refined in later work (identity mappings).
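As a minimal sketch of the two orderings (here `residual_fn` is a stand-in for $\mathcal{F}$; the pre-activation variant is the refinement covered at the end of this page):

```python
import torch

# Post-activation (original ResNet): ReLU comes after the addition,
# so every block output is non-negative.
def post_activation(x, residual_fn):
    return torch.relu(x + residual_fn(x))

# Pre-activation ("identity mappings"): the skip path stays a pure
# identity; BN and ReLU move inside the residual function instead.
def pre_activation(x, residual_fn):
    return x + residual_fn(x)
```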
```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """
    Basic residual block for ResNet-18/34.
    Two 3x3 conv layers with skip connection.

    Parameters per block: 2 * (C * C * 9) ≈ 18C²
    (ignoring BatchNorm parameters)
    """
    expansion = 1  # Output channels = input channels * expansion

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()

        # First conv: may downsample spatially via stride
        self.conv1 = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)

        # Second conv: always stride=1
        self.conv2 = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut connection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            # Projection shortcut: 1x1 conv to match dimensions
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        # Skip connection + activation
        out += self.shortcut(x)
        out = torch.relu(out)
        return out
```

| Component | Kernel | Output Channels | Purpose |
|---|---|---|---|
| Conv1 | 3×3 | out_channels | Feature extraction, optional downsampling |
| BN1 + ReLU | — | out_channels | Normalization and non-linearity |
| Conv2 | 3×3 | out_channels | Further feature refinement |
| BN2 | — | out_channels | Normalization before addition |
| Shortcut | 1×1 or Identity | out_channels | Dimension matching |
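A quick shape check of the block (assuming the `BasicBlock` class defined above):

```python
# Stride-2 block with a channel change: the projection shortcut activates.
block = BasicBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(1, 64, 56, 56)
y = block(x)
print(y.shape)  # torch.Size([1, 128, 28, 28]) -- spatial dims halved
```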
For deeper networks (ResNet-50, 101, 152), the Bottleneck Block is more parameter-efficient. It uses a 1×1 → 3×3 → 1×1 structure that reduces, processes, then expands channel dimensions.
Design rationale:
- The first 1×1 conv reduces the channel count (e.g., 256 → 64), making the expensive 3×3 conv cheap.
- The 3×3 conv does the spatial processing at the reduced width.
- The final 1×1 conv expands back (64 → 256) so the output matches the skip connection.
This "bottleneck" structure allows deeper networks with similar computational cost to shallower ones using Basic Blocks.
```python
class Bottleneck(nn.Module):
    """
    Bottleneck residual block for ResNet-50/101/152.
    1x1 -> 3x3 -> 1x1 structure with expansion factor of 4.

    Example: 256 input channels
    - 1x1 conv: 256 -> 64  (reduce)
    - 3x3 conv: 64 -> 64   (process)
    - 1x1 conv: 64 -> 256  (expand)
    """
    expansion = 4  # Output channels = base_channels * 4

    def __init__(self, in_channels: int, base_channels: int, stride: int = 1):
        super().__init__()
        out_channels = base_channels * self.expansion

        # 1x1 reduce
        self.conv1 = nn.Conv2d(in_channels, base_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(base_channels)

        # 3x3 process (may downsample via stride)
        self.conv2 = nn.Conv2d(
            base_channels, base_channels, 3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(base_channels)

        # 1x1 expand
        self.conv3 = nn.Conv2d(base_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)

        # Shortcut
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = torch.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return torch.relu(out)


# Parameter comparison
def compare_block_params():
    """Compare parameters between Basic and Bottleneck blocks."""
    # For 256 channels
    basic = BasicBlock(256, 256)
    bottleneck = Bottleneck(256, 64)  # 64 base -> 256 output

    basic_params = sum(p.numel() for p in basic.parameters())
    bottle_params = sum(p.numel() for p in bottleneck.parameters())

    print(f"Basic Block (256 ch): {basic_params:,} params")
    print(f"Bottleneck (256 ch):  {bottle_params:,} params")
    print(f"Bottleneck is {basic_params/bottle_params:.1f}x smaller")

compare_block_params()
```

The expansion factor of 4 balances parameter efficiency with representational capacity. At a width of 256 channels, a Bottleneck block uses roughly 17× fewer parameters than a Basic Block of the same width (as the comparison above prints), while still producing a 256-channel output. This is what allows ResNet-50 to run at similar computational cost to ResNet-34 despite having far more layers, and it enables much deeper networks within the same parameter budget.
ResNet architectures are organized into stages with consistent channel counts. Downsampling occurs at stage transitions.
General structure:
- Stem: 7×7 conv (stride 2) followed by 3×3 max pool (stride 2), for a 4× spatial reduction
- Four stages of residual blocks with base widths 64, 128, 256, 512 (output widths ×4 for Bottleneck blocks)
- Global average pooling and a fully connected classification layer
| Architecture | Block Type | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Total Params |
|---|---|---|---|---|---|---|
| ResNet-18 | Basic | 2 | 2 | 2 | 2 | 11.7M |
| ResNet-34 | Basic | 3 | 4 | 6 | 3 | 21.8M |
| ResNet-50 | Bottleneck | 3 | 4 | 6 | 3 | 25.6M |
| ResNet-101 | Bottleneck | 3 | 4 | 23 | 3 | 44.5M |
| ResNet-152 | Bottleneck | 3 | 8 | 36 | 3 | 60.2M |
```python
class ResNet(nn.Module):
    """
    Complete ResNet implementation supporting all standard configurations.
    """

    def __init__(
        self,
        block: type,       # BasicBlock or Bottleneck
        layers: list,      # Blocks per stage: [2,2,2,2] or [3,4,6,3] etc.
        num_classes: int = 1000
    ):
        super().__init__()
        self.in_channels = 64

        # Stem: 7x7 conv + maxpool -> 4x spatial reduction
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )

        # 4 stages of residual blocks
        self.stage1 = self._make_stage(block, 64, layers[0], stride=1)
        self.stage2 = self._make_stage(block, 128, layers[1], stride=2)
        self.stage3 = self._make_stage(block, 256, layers[2], stride=2)
        self.stage4 = self._make_stage(block, 512, layers[3], stride=2)

        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Weight initialization
        self._initialize_weights()

    def _make_stage(self, block, channels, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            layers.append(block(self.in_channels, channels, s))
            self.in_channels = channels * block.expansion
        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out',
                                        nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x


# Factory functions
def resnet18(num_classes=1000):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet152(num_classes=1000):
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)
```
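As a quick sanity check, the factory functions above reproduce the parameter counts from the architecture table:

```python
# Verify total parameter counts against the architecture table above.
for name, net in [("ResNet-18", resnet18()),
                  ("ResNet-50", resnet50()),
                  ("ResNet-152", resnet152())]:
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
# Expected: ~11.7M, ~25.6M, ~60.2M
```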
ResNet uses a specific downsampling strategy that differs from earlier networks.

Stage transitions:
- The first block of stages 2, 3, and 4 uses stride 2 in its residual path and doubles the base channel width.
- Whenever the stride or channel count changes, the shortcut becomes a 1×1 projection (with BatchNorm); otherwise it is the identity.
- All remaining blocks in a stage use stride 1 with identity shortcuts.

Where downsampling occurs (for a 224×224 input):
- Stem 7×7 conv, stride 2: 224 → 112
- 3×3 max pool, stride 2: 112 → 56
- First block of stage 2: 56 → 28
- First block of stage 3: 28 → 14
- First block of stage 4: 14 → 7

Five stride-2 operations give a 32× total reduction, so the input reaches the head as a 7×7 feature map.
This "downsample in residual path" design maintains gradient flow better than pooling-based approaches.
The original ResNet placed stride-2 in the first 1×1 conv of Bottleneck blocks. ResNet-B (and later variants) moved stride-2 to the 3×3 conv, as the Bottleneck implementation above does. This seemingly minor change improves accuracy by ~0.5% because a stride-2 1×1 convolution simply discards three-quarters of its input activations, whereas a stride-2 3×3 convolution still covers every input position.
Training very deep networks requires careful practices:
Weight initialization:
- Convolutions: Kaiming (He) normal initialization with `mode='fan_out'` and ReLU gain, as in `_initialize_weights` above.
- BatchNorm: scale (γ) initialized to 1, shift (β) to 0.
The zero-initialization trick: Initializing the final BatchNorm's scale (γ) to 0 makes each residual block initially compute the identity function, helping training stability.
Training hyperparameters (ImageNet):
- Optimizer: SGD with momentum 0.9 and weight decay 1e-4
- Learning rate: 0.1 initially, divided by 10 on a step schedule (e.g., at epochs 30, 60, 90)
- Batch size: 256
- Augmentation: random crops to 224×224 with horizontal flips
```python
def get_resnet_optimizer(model, initial_lr=0.1):
    """Standard ResNet training configuration."""
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=initial_lr,
        momentum=0.9,
        weight_decay=1e-4
    )

    # Step LR scheduler: divide by 10 at epochs 30, 60, 90
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1
    )

    return optimizer, scheduler


def zero_init_residual(model):
    """
    Zero-initialize the last BN in each residual branch.
    This helps training by making blocks start as identity.
    """
    for m in model.modules():
        if isinstance(m, Bottleneck):
            nn.init.constant_(m.bn3.weight, 0)
        elif isinstance(m, BasicBlock):
            nn.init.constant_(m.bn2.weight, 0)
```

You now understand the complete ResNet architecture family. Next, we'll explore Identity Mappings—a refinement that improves gradient flow and enables even deeper networks.