In December 2015, ResNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with a stunning 152-layer network, 8× deeper than the VGG networks of the previous year. The top-5 error rate dropped to 3.57%, surpassing human-level performance (estimated at 5.1%). This wasn't incremental progress; it was a paradigm shift.
This page dissects the ResNet architecture in detail: from the basic building blocks to the complete network designs, from the original formulation to practical implementation considerations that enabled training these unprecedented depths.
By the end of this page, you will understand: (1) The two ResNet building blocks: Basic and Bottleneck, (2) Complete ResNet architectures from ResNet-18 to ResNet-152, (3) Stage design and downsampling strategies, (4) Initialization and training practices, and (5) How to implement production-ready ResNets.
The Basic Block is the fundamental building unit for shallower ResNets (ResNet-18 and ResNet-34). It consists of two 3×3 convolutional layers with a skip connection.
Structure: $$\mathbf{y} = \text{ReLU}(\mathbf{x} + \mathcal{F}(\mathbf{x}))$$
Where $\mathcal{F}$ consists of: Conv 3×3 → BatchNorm → ReLU → Conv 3×3 → BatchNorm.
The ReLU is applied after the addition, not within $\mathcal{F}$. This placement is crucial and was refined in later work (identity mappings).
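As a minimal sketch of the two orderings (here `residual_fn` is a stand-in for $\mathcal{F}$; the pre-activation variant is the refinement covered at the end of this page):

```python
import torch

# Post-activation (original ResNet): ReLU comes after the addition,
# so every block output is non-negative.
def post_activation(x, residual_fn):
    return torch.relu(x + residual_fn(x))

# Pre-activation ("identity mappings"): the skip path stays a pure
# identity; BN and ReLU move inside the residual function instead.
def pre_activation(x, residual_fn):
    return x + residual_fn(x)
```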
```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """
    Basic residual block for ResNet-18/34.
    Two 3x3 conv layers with skip connection.

    Parameters per block: 2 * (C * C * 9) ≈ 18C²
    (ignoring BatchNorm parameters)
    """
    expansion = 1  # Output channels = input channels * expansion

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()

        # First conv: may downsample spatially via stride
        self.conv1 = nn.Conv2d(
            in_channels, out_channels,
            kernel_size=3, stride=stride, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(out_channels)

        # Second conv: always stride=1
        self.conv2 = nn.Conv2d(
            out_channels, out_channels,
            kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Shortcut connection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            # Projection shortcut: 1x1 conv to match dimensions
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual path
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        # Skip connection + activation
        out += self.shortcut(x)
        out = torch.relu(out)
        return out
```

| Component | Kernel | Output Channels | Purpose |
|---|---|---|---|
| Conv1 | 3×3 | out_channels | Feature extraction, optional downsampling |
| BN1 + ReLU | — | out_channels | Normalization and non-linearity |
| Conv2 | 3×3 | out_channels | Further feature refinement |
| BN2 | — | out_channels | Normalization before addition |
| Shortcut | 1×1 or Identity | out_channels | Dimension matching |
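A quick shape check of the block (assuming the `BasicBlock` class defined above):

```python
# Stride-2 block with a channel change: the projection shortcut activates.
block = BasicBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(1, 64, 56, 56)
y = block(x)
print(y.shape)  # torch.Size([1, 128, 28, 28]) -- spatial dims halved
```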
For deeper networks (ResNet-50, 101, 152), the Bottleneck Block is more parameter-efficient. It uses a 1×1 → 3×3 → 1×1 structure that reduces, processes, then expands channel dimensions.
Design rationale:
- The first 1×1 conv reduces the channel count (e.g., 256 → 64), making the expensive 3×3 conv cheap.
- The 3×3 conv does the spatial processing at the reduced width.
- The final 1×1 conv expands back (64 → 256) so the output matches the skip connection.
This "bottleneck" structure allows deeper networks with similar computational cost to shallower ones using Basic Blocks.
```python
class Bottleneck(nn.Module):
    """
    Bottleneck residual block for ResNet-50/101/152.
    1x1 -> 3x3 -> 1x1 structure with expansion factor of 4.

    Example: 256 input channels
    - 1x1 conv: 256 -> 64  (reduce)
    - 3x3 conv: 64 -> 64   (process)
    - 1x1 conv: 64 -> 256  (expand)
    """
    expansion = 4  # Output channels = base_channels * 4

    def __init__(self, in_channels: int, base_channels: int, stride: int = 1):
        super().__init__()
        out_channels = base_channels * self.expansion

        # 1x1 reduce
        self.conv1 = nn.Conv2d(in_channels, base_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(base_channels)

        # 3x3 process (may downsample via stride)
        self.conv2 = nn.Conv2d(
            base_channels, base_channels, 3,
            stride=stride, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(base_channels)

        # 1x1 expand
        self.conv3 = nn.Conv2d(base_channels, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)

        # Shortcut
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = torch.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return torch.relu(out)


# Parameter comparison
def compare_block_params():
    """Compare parameters between Basic and Bottleneck blocks."""
    # For 256 channels
    basic = BasicBlock(256, 256)
    bottleneck = Bottleneck(256, 64)  # 64 base -> 256 output

    basic_params = sum(p.numel() for p in basic.parameters())
    bottle_params = sum(p.numel() for p in bottleneck.parameters())

    print(f"Basic Block (256 ch): {basic_params:,} params")
    print(f"Bottleneck (256 ch):  {bottle_params:,} params")
    print(f"Bottleneck is {basic_params/bottle_params:.1f}x smaller")

compare_block_params()
```

The expansion factor of 4 balances parameter efficiency with representational capacity. At a width of 256 channels, a Bottleneck block uses roughly 17× fewer parameters than a Basic Block of the same width (as the comparison above prints), while still producing a 256-channel output. This is what allows ResNet-50 to run at similar computational cost to ResNet-34 despite having far more layers, and it enables much deeper networks within the same parameter budget.
ResNet architectures are organized into stages with consistent channel counts. Downsampling occurs at stage transitions.
General structure:
- Stem: 7×7 conv (stride 2) followed by 3×3 max pool (stride 2), for a 4× spatial reduction
- Four stages of residual blocks with base widths 64, 128, 256, 512 (output widths ×4 for Bottleneck blocks)
- Global average pooling and a fully connected classification layer
| Architecture | Block Type | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Total Params |
|---|---|---|---|---|---|---|
| ResNet-18 | Basic | 2 | 2 | 2 | 2 | 11.7M |
| ResNet-34 | Basic | 3 | 4 | 6 | 3 | 21.8M |
| ResNet-50 | Bottleneck | 3 | 4 | 6 | 3 | 25.6M |
| ResNet-101 | Bottleneck | 3 | 4 | 23 | 3 | 44.5M |
| ResNet-152 | Bottleneck | 3 | 8 | 36 | 3 | 60.2M |
```python
class ResNet(nn.Module):
    """
    Complete ResNet implementation supporting all standard configurations.
    """

    def __init__(
        self,
        block: type,       # BasicBlock or Bottleneck
        layers: list,      # Blocks per stage: [2,2,2,2] or [3,4,6,3] etc.
        num_classes: int = 1000
    ):
        super().__init__()
        self.in_channels = 64

        # Stem: 7x7 conv + maxpool -> 4x spatial reduction
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )

        # 4 stages of residual blocks
        self.stage1 = self._make_stage(block, 64, layers[0], stride=1)
        self.stage2 = self._make_stage(block, 128, layers[1], stride=2)
        self.stage3 = self._make_stage(block, 256, layers[2], stride=2)
        self.stage4 = self._make_stage(block, 512, layers[3], stride=2)

        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        # Weight initialization
        self._initialize_weights()

    def _make_stage(self, block, channels, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            layers.append(block(self.in_channels, channels, s))
            self.in_channels = channels * block.expansion
        return nn.Sequential(*layers)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out',
                                        nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x


# Factory functions
def resnet18(num_classes=1000):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)

def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)

def resnet152(num_classes=1000):
    return ResNet(Bottleneck, [3, 8, 36, 3], num_classes)
```
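As a quick sanity check, the factory functions above reproduce the parameter counts from the architecture table:

```python
# Verify total parameter counts against the architecture table above.
for name, net in [("ResNet-18", resnet18()),
                  ("ResNet-50", resnet50()),
                  ("ResNet-152", resnet152())]:
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
# Expected: ~11.7M, ~25.6M, ~60.2M
```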
ResNet uses a specific downsampling strategy that differs from earlier networks.

Stage transitions:
- The first block of stages 2, 3, and 4 uses stride 2 in its residual path and doubles the base channel width.
- Whenever the stride or channel count changes, the shortcut becomes a 1×1 projection (with BatchNorm); otherwise it is the identity.
- All remaining blocks in a stage use stride 1 with identity shortcuts.

Where downsampling occurs (for a 224×224 input):
- Stem 7×7 conv, stride 2: 224 → 112
- 3×3 max pool, stride 2: 112 → 56
- First block of stage 2: 56 → 28
- First block of stage 3: 28 → 14
- First block of stage 4: 14 → 7

Five stride-2 operations give a 32× total reduction, so the input reaches the head as a 7×7 feature map.
This "downsample in residual path" design maintains gradient flow better than pooling-based approaches.
The original ResNet placed stride-2 in the first 1×1 conv of Bottleneck blocks. ResNet-B (and later variants) moved stride-2 to the 3×3 conv, as the Bottleneck implementation above does. This seemingly minor change improves accuracy by ~0.5% because a stride-2 1×1 convolution simply discards three-quarters of its input activations, whereas a stride-2 3×3 convolution still covers every input position.
Training very deep networks requires careful practices:
Weight initialization:
- Convolutions: Kaiming (He) normal initialization with `mode='fan_out'` and ReLU gain, as in `_initialize_weights` above.
- BatchNorm: scale (γ) initialized to 1, shift (β) to 0.
The zero-initialization trick: Initializing the final BatchNorm's scale (γ) to 0 makes each residual block initially compute the identity function, helping training stability.
Training hyperparameters (ImageNet):
- Optimizer: SGD with momentum 0.9 and weight decay 1e-4
- Learning rate: 0.1 initially, divided by 10 on a step schedule (e.g., at epochs 30, 60, 90)
- Batch size: 256
- Augmentation: random crops to 224×224 with horizontal flips
```python
def get_resnet_optimizer(model, initial_lr=0.1):
    """Standard ResNet training configuration."""
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=initial_lr,
        momentum=0.9,
        weight_decay=1e-4
    )

    # Step LR scheduler: divide by 10 at epochs 30, 60, 90
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1
    )

    return optimizer, scheduler


def zero_init_residual(model):
    """
    Zero-initialize the last BN in each residual branch.
    This helps training by making blocks start as identity.
    """
    for m in model.modules():
        if isinstance(m, Bottleneck):
            nn.init.constant_(m.bn3.weight, 0)
        elif isinstance(m, BasicBlock):
            nn.init.constant_(m.bn2.weight, 0)
```

You now understand the complete ResNet architecture family. Next, we'll explore Identity Mappings—a refinement that improves gradient flow and enables even deeper networks.