The original ResNet was revolutionary, but its skip connections weren't truly "clean" identity mappings. A subtle issue remained: the ReLU applied after the addition could block gradient flow. In 2016, He et al. published a follow-up paper that refined the residual block design, introducing pre-activation ResNets that made it possible to train networks over 1,000 layers deep.
This refinement—moving BatchNorm and ReLU before the convolutions—seems minor but has profound implications for both forward information flow and backward gradient propagation.
By the end of this page, you will understand: (1) Why post-activation blocks impede information flow, (2) The mathematics of true identity mappings, (3) Pre-activation block design and its benefits, (4) Practical implementation of pre-act ResNets, and (5) When to use pre-activation vs. original designs.
In the original ResNet, the forward pass through a block is:
$$\mathbf{x}_{l+1} = \text{ReLU}(\mathbf{x}_l + \mathcal{F}(\mathbf{x}_l))$$
The issue: The ReLU after addition modifies the skip connection output. If we expand multiple blocks:
$$\mathbf{x}_{l+2} = \text{ReLU}(\text{ReLU}(\mathbf{x}_l + \mathcal{F}_l) + \mathcal{F}_{l+1})$$
The input $\mathbf{x}_l$ passes through multiple ReLUs on its way to later layers. Each ReLU clips negative values to zero, so the skip path is no longer a true identity mapping.
Gradient flow analysis:
For the original block, the gradient becomes:
$$\frac{\partial \varepsilon}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_{l+1}} \cdot \mathbf{1}_{\mathbf{x}_l + \mathcal{F}_l > 0} \cdot \left(1 + \frac{\partial \mathcal{F}_l}{\partial \mathbf{x}_l}\right)$$
The indicator function $\mathbf{1}_{\cdot > 0}$ (from ReLU derivative) can zero out gradients when the pre-ReLU value is negative. Over many layers, this creates "dead paths" where gradients cannot flow.
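A toy autograd check makes this gating concrete. The sketch below uses a hypothetical scalar residual $F(x) = wx$ with $w$ chosen so that $x + F(x)$ is negative; the post-addition ReLU then blocks the gradient entirely:

```python
import torch

# Hypothetical scalar residual F(x) = w*x, with w chosen so that x + F(x) < 0
w = torch.tensor(-2.0)

# Post-activation block: x_{l+1} = ReLU(x_l + F(x_l))
x = torch.tensor(1.0, requires_grad=True)
out = torch.relu(x + w * x)   # pre-ReLU value: 1 - 2 = -1 < 0
out.backward()
print(x.grad)                 # tensor(0.) -- the indicator function zeroed the gradient
```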
The ideal scenario:
For clean gradient highways, we want: $$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l)$$
with no function wrapping the skip connection. This gives: $$\frac{\partial \varepsilon}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_{l+1}} \cdot \left(1 + \frac{\partial \mathcal{F}_l}{\partial \mathbf{x}_l}\right)$$
The "1" is now unimpeded—pure identity gradient flow!
In original ResNets, forward and backward passes are asymmetric. Forward: information passes through ReLU. Backward: gradients pass through ReLU derivative. This asymmetry becomes problematic in very deep networks where both passes must traverse hundreds of such operations.
The solution is elegant: move all operations inside the residual function, leaving the skip connection as a pure identity.
Pre-activation block structure: $$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\text{ReLU}(\text{BN}(\mathbf{x}_l)))$$ where $\mathcal{F}$ now contains only the convolutions (and any further BN-ReLU-Conv stages).
The key insight: BN and ReLU act as a "pre-processing" of the input to each convolution, rather than a "post-processing" of its output, so nothing touches the skip connection itself.
```python
import torch
import torch.nn as nn


class PreActBasicBlock(nn.Module):
    """
    Pre-activation basic block.
    BN-ReLU-Conv ordering keeps the skip connection clean.
    """
    expansion = 1

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # Pre-activation: BN-ReLU before each conv
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)

        # Projection shortcut only when the shape changes (no BN here - keep it pure!)
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-activation
        out = torch.relu(self.bn1(x))

        # Identity skip carries x untouched; a projection shortcut (if needed)
        # branches from the pre-activated tensor
        shortcut = x if self.shortcut is None else self.shortcut(out)

        # First conv
        out = self.conv1(out)

        # Second pre-activation and conv
        out = self.conv2(torch.relu(self.bn2(out)))

        # Clean addition - no ReLU after!
        return out + shortcut


class PreActBottleneck(nn.Module):
    """Pre-activation bottleneck block."""
    expansion = 4

    def __init__(self, in_channels: int, base_channels: int, stride: int = 1):
        super().__init__()
        out_channels = base_channels * self.expansion

        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, base_channels, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(base_channels)
        self.conv2 = nn.Conv2d(base_channels, base_channels, 3, stride, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(base_channels)
        self.conv3 = nn.Conv2d(base_channels, out_channels, 1, bias=False)

        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(x))
        shortcut = x if self.shortcut is None else self.shortcut(out)
        out = self.conv1(out)
        out = self.conv2(torch.relu(self.bn2(out)))
        out = self.conv3(torch.relu(self.bn3(out)))
        return out + shortcut
```

How the two designs compare:

| Aspect | Original (Post-Act) | Pre-Activation |
|---|---|---|
| Skip path | Through ReLU | Pure identity |
| Gradient flow | Modulated by ReLU | Unimpeded '1' term |
| BN-ReLU order | Conv→BN→ReLU | BN→ReLU→Conv |
| After addition | ReLU applied | No activation |
| 1000+ layers | Difficult | Achievable |
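As a quick sanity check of the blocks defined above (the input sizes here are illustrative), both produce the expected output shapes and end with a clean addition:

```python
# Quick shape check for the blocks defined above (illustrative sizes)
block = PreActBasicBlock(64, 128, stride=2)
x = torch.randn(2, 64, 32, 32)
print(block(x).shape)                                  # torch.Size([2, 128, 16, 16])

bottleneck = PreActBottleneck(64, base_channels=64)
print(bottleneck(torch.randn(2, 64, 32, 32)).shape)    # torch.Size([2, 256, 32, 32])
```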
With pre-activation blocks, we can now write clean recursive equations:
Forward propagation: $$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathcal{W}_i)$$
Any layer $\mathbf{x}_L$ is the sum of an earlier layer $\mathbf{x}_l$ plus all residual functions in between.
Backward propagation: $$\frac{\partial \varepsilon}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_L} \left( 1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}_i \right)$$
Key properties:
- Direct propagation: information (forward) and gradients (backward) can travel directly between any two layers.
- Additive gradients: the gradient is a sum, not a product, so it does not vanish exponentially.
- The constant '1': always present regardless of the weights, ensuring non-zero gradient flow.
In plain networks, gradients are products of many terms (chain rule). If any term is small, the product vanishes. In pre-act ResNets, the gradient has an additive structure: 1 + (sum of derivatives). Even if the sum is negative, the 1 provides a baseline. This is the mathematical core of why ResNets train so well.
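These properties can be checked numerically. The sketch below stacks toy linear residual branches (a stand-in for the real BN-ReLU-Conv functions of the blocks above) and verifies both the additive unfolding of the forward pass and that a gradient reaches the earliest layer:

```python
import torch
import torch.nn as nn

# Toy residual branches standing in for BN-ReLU-Conv (assumed for illustration)
torch.manual_seed(0)
residuals = [nn.Linear(4, 4) for _ in range(8)]

x_l = torch.randn(1, 4, requires_grad=True)
x, branches = x_l, []
for f in residuals:
    branch = f(torch.relu(x))   # residual branch applied to the (pre-activated) input
    branches.append(branch)
    x = x + branch              # clean identity addition, nothing wraps the sum

x_L = x
# Forward unfolding: x_L equals x_l plus the sum of all residual branch outputs
print(torch.allclose(x_L, x_l + sum(branches)))   # True

# Backward: the gradient at x_l includes the all-ones contribution from the
# identity path, plus the residual terms
x_L.sum().backward()
print(x_l.grad)
```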
The pre-activation design enabled training at previously impossible depths:
CIFAR-10/100 Results:
| Depth | Original ResNet | Pre-Act ResNet | Improvement |
|---|---|---|---|
| 110 | 6.61% | 6.37% | +0.24% |
| 164 | 5.93% | 5.46% | +0.47% |
| 1001 | 7.61% (degraded) | 4.92% | +2.69% |
The 1001-layer result is remarkable: Original ResNet degraded at this depth (worse than 110 layers), while Pre-Act ResNet achieved the best results. This demonstrates the practical importance of clean identity mappings.
Why original fails at extreme depth: every post-addition ReLU perturbs the skip path, and across a thousand blocks these small distortions of the forward signal and gating of the backward gradients compound, leaving the 1001-layer model harder to optimize than the 110-layer one.
Why pre-act succeeds: the skip path stays a pure identity in every block, so the additive forward and backward structure above holds exactly, regardless of depth.
When to use pre-activation: training very deep networks (roughly 100+ layers) from scratch, or research settings where clean gradient flow at depth is the point.
When original suffices: standard depths such as ResNet-50, and any setting that relies on widely available pretrained weights.
Implementation notes: because pre-act blocks end with a convolution, add a final BN-ReLU after the last block before global pooling, and keep projection shortcuts as bare 1×1 convolutions without BN (see the sketch after these notes).
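Below is a minimal sketch of how the blocks above might be assembled into a small CIFAR-style network. The class name, stage widths, and block counts are illustrative rather than the paper's exact configuration; note the final BN-ReLU before pooling.

```python
import torch
import torch.nn as nn


class PreActResNetCIFAR(nn.Module):
    """Small CIFAR-style pre-activation ResNet built from PreActBasicBlock (illustrative)."""

    def __init__(self, num_blocks=(2, 2, 2), num_classes=10):
        super().__init__()
        # Bare conv stem: no BN/ReLU here, since the first block pre-activates its input
        self.stem = nn.Conv2d(3, 64, 3, 1, 1, bias=False)

        blocks, in_ch = [], 64
        for stage, n in enumerate(num_blocks):
            out_ch = 64 * (2 ** stage)
            for i in range(n):
                stride = 2 if (stage > 0 and i == 0) else 1   # downsample at stage entry
                blocks.append(PreActBasicBlock(in_ch, out_ch, stride))
                in_ch = out_ch
        self.blocks = nn.Sequential(*blocks)

        # Final BN-ReLU: pre-act blocks end in a convolution, so activate before pooling
        self.final_bn = nn.BatchNorm2d(in_ch)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(in_ch, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.blocks(self.stem(x))
        out = torch.relu(self.final_bn(out))
        return self.head(self.pool(out).flatten(1))


model = PreActResNetCIFAR()
print(model(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])
```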
Despite theoretical advantages, most practitioners use original ResNet designs due to pretrained weight availability. Pre-activation is most valuable when training from scratch at extreme depths or in research contexts.
You now understand how identity mappings perfect the skip connection design. Next, we'll explore DenseNet—an architecture that takes feature reuse to the extreme by connecting every layer to every other layer.