The original ResNet was revolutionary, but its skip connections weren't truly "clean" identity mappings. A subtle issue remained: the ReLU applied after the addition could block gradient flow. In 2016, He et al. published a follow-up paper that refined the residual block design, introducing pre-activation ResNets that made it possible to train networks over 1,000 layers deep.
This refinement—moving BatchNorm and ReLU before the convolutions—seems minor but has profound implications for both forward information flow and backward gradient propagation.
By the end of this page, you will understand: (1) Why post-activation blocks impede information flow, (2) The mathematics of true identity mappings, (3) Pre-activation block design and its benefits, (4) Practical implementation of pre-act ResNets, and (5) When to use pre-activation vs. original designs.
In the original ResNet, the forward pass through a block is:
$$\mathbf{x}_{l+1} = \text{ReLU}(\mathbf{x}_l + \mathcal{F}(\mathbf{x}_l))$$
The issue: The ReLU after addition modifies the skip connection output. If we expand multiple blocks:
$$\mathbf{x}_{l+2} = \text{ReLU}(\text{ReLU}(\mathbf{x}_l + \mathcal{F}_l) + \mathcal{F}_{l+1})$$
The input $\mathbf{x}_l$ passes through multiple ReLUs on its way to later layers. Each ReLU clips negative values to zero, so the skip path is no longer a true identity mapping.
Gradient flow analysis:
For the original block, the gradient becomes:
$$\frac{\partial \varepsilon}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_{l+1}} \cdot \mathbf{1}_{\mathbf{x}_l + \mathcal{F}_l > 0} \cdot \left(1 + \frac{\partial \mathcal{F}_l}{\partial \mathbf{x}_l}\right)$$
The indicator function $\mathbf{1}_{\cdot > 0}$ (from ReLU derivative) can zero out gradients when the pre-ReLU value is negative. Over many layers, this creates "dead paths" where gradients cannot flow.
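A toy autograd check makes this gating concrete. The sketch below uses a hypothetical scalar residual $F(x) = wx$ with $w$ chosen so that $x + F(x)$ is negative; the post-addition ReLU then blocks the gradient entirely:

```python
import torch

# Hypothetical scalar residual F(x) = w*x, with w chosen so that x + F(x) < 0
w = torch.tensor(-2.0)

# Post-activation block: x_{l+1} = ReLU(x_l + F(x_l))
x = torch.tensor(1.0, requires_grad=True)
out = torch.relu(x + w * x)   # pre-ReLU value: 1 - 2 = -1 < 0
out.backward()
print(x.grad)                 # tensor(0.) -- the indicator function zeroed the gradient
```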
The ideal scenario:
For clean gradient highways, we want: $$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l)$$
with no function wrapping the skip connection. This gives: $$\frac{\partial \varepsilon}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_{l+1}} \cdot \left(1 + \frac{\partial \mathcal{F}_l}{\partial \mathbf{x}_l}\right)$$
The "1" is now unimpeded—pure identity gradient flow!
In original ResNets, forward and backward passes are asymmetric. Forward: information passes through ReLU. Backward: gradients pass through ReLU derivative. This asymmetry becomes problematic in very deep networks where both passes must traverse hundreds of such operations.
The solution is elegant: move all operations inside the residual function, leaving the skip connection as a pure identity.
Pre-activation block structure: $$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\text{ReLU}(\text{BN}(\mathbf{x}_l)))$$ where $\mathcal{F}$ now contains only the convolutions (and any further BN-ReLU-Conv stages).
The key insight: BN and ReLU act as a "pre-processing" of the input to each convolution, rather than a "post-processing" of its output, so nothing touches the skip connection itself.
```python
import torch
import torch.nn as nn


class PreActBasicBlock(nn.Module):
    """
    Pre-activation basic block.
    BN-ReLU-Conv ordering keeps the skip connection clean.
    """
    expansion = 1

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        # Pre-activation: BN-ReLU before each conv
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False)

        # Projection shortcut only when the shape changes (no BN here - keep it pure!)
        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-activation
        out = torch.relu(self.bn1(x))

        # Identity skip carries x untouched; a projection shortcut (if needed)
        # branches from the pre-activated tensor
        shortcut = x if self.shortcut is None else self.shortcut(out)

        # First conv
        out = self.conv1(out)

        # Second pre-activation and conv
        out = self.conv2(torch.relu(self.bn2(out)))

        # Clean addition - no ReLU after!
        return out + shortcut


class PreActBottleneck(nn.Module):
    """Pre-activation bottleneck block."""
    expansion = 4

    def __init__(self, in_channels: int, base_channels: int, stride: int = 1):
        super().__init__()
        out_channels = base_channels * self.expansion

        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, base_channels, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(base_channels)
        self.conv2 = nn.Conv2d(base_channels, base_channels, 3, stride, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(base_channels)
        self.conv3 = nn.Conv2d(base_channels, out_channels, 1, bias=False)

        self.shortcut = None
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(x))
        shortcut = x if self.shortcut is None else self.shortcut(out)
        out = self.conv1(out)
        out = self.conv2(torch.relu(self.bn2(out)))
        out = self.conv3(torch.relu(self.bn3(out)))
        return out + shortcut
```

How the two designs compare:

| Aspect | Original (Post-Act) | Pre-Activation |
|---|---|---|
| Skip path | Through ReLU | Pure identity |
| Gradient flow | Modulated by ReLU | Unimpeded '1' term |
| BN-ReLU order | Conv→BN→ReLU | BN→ReLU→Conv |
| After addition | ReLU applied | No activation |
| 1000+ layers | Difficult | Achievable |
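As a quick sanity check of the blocks defined above (the input sizes here are illustrative), both produce the expected output shapes and end with a clean addition:

```python
# Quick shape check for the blocks defined above (illustrative sizes)
block = PreActBasicBlock(64, 128, stride=2)
x = torch.randn(2, 64, 32, 32)
print(block(x).shape)                                  # torch.Size([2, 128, 16, 16])

bottleneck = PreActBottleneck(64, base_channels=64)
print(bottleneck(torch.randn(2, 64, 32, 32)).shape)    # torch.Size([2, 256, 32, 32])
```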
With pre-activation blocks, we can now write clean recursive equations:
Forward propagation: $$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathcal{W}_i)$$
Any layer $\mathbf{x}_L$ is the sum of an earlier layer $\mathbf{x}_l$ plus all residual functions in between.
Backward propagation: $$\frac{\partial \varepsilon}{\partial \mathbf{x}_l} = \frac{\partial \varepsilon}{\partial \mathbf{x}_L} \left( 1 + \frac{\partial}{\partial \mathbf{x}_l} \sum_{i=l}^{L-1} \mathcal{F}_i \right)$$
Key properties:
- Direct propagation: information (forward) and gradients (backward) can travel directly between any two layers.
- Additive gradients: the gradient is a sum, not a product, so it does not vanish exponentially.
- The constant '1': always present regardless of the weights, ensuring non-zero gradient flow.
In plain networks, gradients are products of many terms (chain rule). If any term is small, the product vanishes. In pre-act ResNets, the gradient has an additive structure: 1 + (sum of derivatives). Even if the sum is negative, the 1 provides a baseline. This is the mathematical core of why ResNets train so well.
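These properties can be checked numerically. The sketch below stacks toy linear residual branches (a stand-in for the real BN-ReLU-Conv functions of the blocks above) and verifies both the additive unfolding of the forward pass and that a gradient reaches the earliest layer:

```python
import torch
import torch.nn as nn

# Toy residual branches standing in for BN-ReLU-Conv (assumed for illustration)
torch.manual_seed(0)
residuals = [nn.Linear(4, 4) for _ in range(8)]

x_l = torch.randn(1, 4, requires_grad=True)
x, branches = x_l, []
for f in residuals:
    branch = f(torch.relu(x))   # residual branch applied to the (pre-activated) input
    branches.append(branch)
    x = x + branch              # clean identity addition, nothing wraps the sum

x_L = x
# Forward unfolding: x_L equals x_l plus the sum of all residual branch outputs
print(torch.allclose(x_L, x_l + sum(branches)))   # True

# Backward: the gradient at x_l includes the all-ones contribution from the
# identity path, plus the residual terms
x_L.sum().backward()
print(x_l.grad)
```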
The pre-activation design enabled training at previously impossible depths:
CIFAR-10/100 Results:
| Depth | Original ResNet | Pre-Act ResNet | Improvement |
|---|---|---|---|
| 110 | 6.61% | 6.37% | +0.24% |
| 164 | 5.93% | 5.46% | +0.47% |
| 1001 | 7.61% (degraded) | 4.92% | +2.69% |
The 1001-layer result is remarkable: Original ResNet degraded at this depth (worse than 110 layers), while Pre-Act ResNet achieved the best results. This demonstrates the practical importance of clean identity mappings.
Why original fails at extreme depth: every post-addition ReLU perturbs the skip path, and across a thousand blocks these small distortions of the forward signal and gating of the backward gradients compound, leaving the 1001-layer model harder to optimize than the 110-layer one.
Why pre-act succeeds: the skip path stays a pure identity in every block, so the additive forward and backward structure above holds exactly, regardless of depth.
When to use pre-activation: training very deep networks (roughly 100+ layers) from scratch, or research settings where clean gradient flow at depth is the point.
When original suffices: standard depths such as ResNet-50, and any setting that relies on widely available pretrained weights.
Implementation notes: because pre-act blocks end with a convolution, add a final BN-ReLU after the last block before global pooling, and keep projection shortcuts as bare 1×1 convolutions without BN (see the sketch after these notes).
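Below is a minimal sketch of how the blocks above might be assembled into a small CIFAR-style network. The class name, stage widths, and block counts are illustrative rather than the paper's exact configuration; note the final BN-ReLU before pooling.

```python
import torch
import torch.nn as nn


class PreActResNetCIFAR(nn.Module):
    """Small CIFAR-style pre-activation ResNet built from PreActBasicBlock (illustrative)."""

    def __init__(self, num_blocks=(2, 2, 2), num_classes=10):
        super().__init__()
        # Bare conv stem: no BN/ReLU here, since the first block pre-activates its input
        self.stem = nn.Conv2d(3, 64, 3, 1, 1, bias=False)

        blocks, in_ch = [], 64
        for stage, n in enumerate(num_blocks):
            out_ch = 64 * (2 ** stage)
            for i in range(n):
                stride = 2 if (stage > 0 and i == 0) else 1   # downsample at stage entry
                blocks.append(PreActBasicBlock(in_ch, out_ch, stride))
                in_ch = out_ch
        self.blocks = nn.Sequential(*blocks)

        # Final BN-ReLU: pre-act blocks end in a convolution, so activate before pooling
        self.final_bn = nn.BatchNorm2d(in_ch)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(in_ch, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.blocks(self.stem(x))
        out = torch.relu(self.final_bn(out))
        return self.head(self.pool(out).flatten(1))


model = PreActResNetCIFAR()
print(model(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])
```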
Despite theoretical advantages, most practitioners use original ResNet designs due to pretrained weight availability. Pre-activation is most valuable when training from scratch at extreme depths or in research contexts.
You now understand how identity mappings perfect the skip connection design. Next, we'll explore DenseNet—an architecture that takes feature reuse to the extreme by connecting every layer to every other layer.