Traditional approaches to increasing network capacity focus on depth (more layers) or width (more channels). ResNeXt, introduced by Xie et al. in 2017, reveals a third dimension: cardinality—the number of parallel transformation paths.
ResNeXt shows that increasing cardinality (e.g., 32 parallel branches) is more effective than increasing depth or width with the same parameter budget. This insight led to simpler, more modular designs that achieve state-of-the-art results with improved efficiency.
By the end of this page, you will understand: (1) The concept of cardinality and aggregated transformations, (2) ResNeXt block design and equivalent formulations, (3) How cardinality compares to depth and width, (4) Complete ResNeXt architectures, and (5) When and why to choose ResNeXt.
Network capacity scaling dimensions:
- Depth: stack more layers.
- Width: use more channels per layer.
- Cardinality: use more parallel transformation paths within each block (the dimension ResNeXt introduces).
The ResNeXt formula:
$$\mathbf{y} = \mathbf{x} + \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})$$
Where C is the cardinality (number of transformations) and each $\mathcal{T}_i$ has the same topology but different parameters.
This is a generalization of ResNet's residual function. Standard ResNet has C=1; ResNeXt typically uses C=32.
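For comparison, setting C = 1 recovers the familiar ResNet residual block with a single transformation $\mathcal{F}$:

$$\mathbf{y} = \mathbf{x} + \mathcal{F}(\mathbf{x})$$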
Why cardinality works better:
Consider two designs with a similar parameter budget: a single wide bottleneck path (ResNet-50's 1×64d block) versus 32 narrow parallel paths of 4 channels each (ResNeXt-50's 32×4d block).
Mathematically: 32 paths × 4 channels = 128 total bottleneck channels, yet because the 3×3 convolution is grouped, the parameter count stays close to that of the single 64-channel dense path.
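A quick back-of-the-envelope count makes "similar parameters" concrete (a sketch; `bottleneck_params` is just an illustrative helper, and BatchNorm/bias parameters are ignored). It uses the conv2-stage shapes from the paper: 256 input/output channels, one 64-wide dense path versus a 128-wide grouped convolution with 32 groups:

```python
def bottleneck_params(c_in, width, c_out, groups=1):
    """Rough parameter count of a 1x1 -> 3x3 (grouped) -> 1x1 bottleneck."""
    reduce_ = c_in * width                  # 1x1 reduce
    spatial = width * width * 9 // groups   # 3x3 convolution (grouped)
    expand = width * c_out                  # 1x1 expand
    return reduce_ + spatial + expand

print(bottleneck_params(256, 64, 256, groups=1))    # ResNet-50 block:  ~69.6k
print(bottleneck_params(256, 128, 256, groups=32))  # ResNeXt 32x4d:    ~70.1k
```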
But ResNeXt tends to perform better because:
- Each of the 32 paths learns its own low-dimensional embedding and transformation, so the block aggregates a more diverse set of features than a single monolithic path.
- The grouped structure constrains how channels can interact, acting as a mild structural regularizer.
- Every path shares the same topology, so the extra capacity comes with essentially no new design hyperparameters.
ResNeXt can be seen as a simplified, regularized Inception. Where Inception uses heterogeneous branches (1×1, 3×3, 5×5, pooling), ResNeXt uses homogeneous branches (all same structure). This simplification makes ResNeXt easier to design and scale.
ResNeXt blocks can be implemented in three equivalent ways. All three describe the same aggregated transformation but differ in implementation efficiency.
Notation: ResNeXt-50 (32×4d) means a 50-layer network with cardinality C = 32 and a bottleneck width of d = 4 channels per path, so the grouped 3×3 convolution in the first stage operates on 32 × 4 = 128 channels.
```python
import torch
import torch.nn as nn


class ResNeXtBlockA(nn.Module):
    """
    Form (a): Explicit aggregation of C parallel paths.
    Each path: 1x1 reduce -> 3x3 -> 1x1 expand; the C outputs are summed.
    Conceptually clear but inefficient to run.
    """
    expansion = 4

    def __init__(self, in_ch: int, planes: int, cardinality: int = 32,
                 base_width: int = 4, stride: int = 1):
        super().__init__()
        D = int(planes * (base_width / 64))   # Width of each path (4 for 32x4d, stage 1)
        out_ch = planes * self.expansion

        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, D, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
                nn.Conv2d(D, D, 3, stride, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
                nn.Conv2d(D, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            for _ in range(cardinality)
        ])

        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        out = sum(path(x) for path in self.paths)
        out += self.shortcut(x)
        return torch.relu(out)


class ResNeXtBlockB(nn.Module):
    """
    Form (b): Early aggregation with a grouped 3x3 convolution.
    1x1 reduce -> grouped 3x3 -> 1x1 expand.
    The standard, most efficient implementation.
    """
    expansion = 4

    def __init__(self, in_ch: int, planes: int, cardinality: int = 32,
                 base_width: int = 4, stride: int = 1):
        super().__init__()
        # Total grouped width (e.g. 32 groups x 4 channels = 128 in stage 1)
        width = int(planes * (base_width / 64)) * cardinality
        out_ch = planes * self.expansion

        # 1x1 reduce to the grouped width
        self.conv1 = nn.Conv2d(in_ch, width, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        # Grouped 3x3: 'width' channels split into 'cardinality' groups
        self.conv2 = nn.Conv2d(width, width, 3, stride, 1,
                               groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        # 1x1 expand to the block output
        self.conv3 = nn.Conv2d(width, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = torch.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return torch.relu(out)


class ResNeXtBlockC(nn.Module):
    """
    Form (c): Split-transform-concatenate.
    Each path: 1x1 reduce -> 3x3; the C outputs are concatenated and a single
    shared 1x1 expands them to the block output. Equivalent to forms (a)/(b);
    occasionally useful on hardware without efficient grouped convolutions.
    """
    expansion = 4

    def __init__(self, in_ch: int, planes: int, cardinality: int = 32,
                 base_width: int = 4, stride: int = 1):
        super().__init__()
        D = int(planes * (base_width / 64))   # Width of each path
        out_ch = planes * self.expansion

        # C independent 1x1 -> 3x3 paths
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, D, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
                nn.Conv2d(D, D, 3, stride, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
            )
            for _ in range(cardinality)
        ])
        # Final 1x1 is shared: it merges the concatenated paths
        self.conv3 = nn.Conv2d(cardinality * D, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        out = torch.cat([path(x) for path in self.paths], dim=1)
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return torch.relu(out)
```

| Form | Structure | Implementation | Common Use |
|---|---|---|---|
| (a) | Sum of C paths | ModuleList + sum | Conceptual clarity |
| (b) | Grouped 3×3 conv | groups=C parameter | Standard, most efficient |
| (c) | Split-transform-concatenate | ModuleList + torch.cat + shared 1×1 | Hardware without efficient grouped conv |
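A small numerical sketch of why form (b)'s grouped 3×3 and the explicit split-transform-concatenate view compute the same thing: a grouped convolution is exactly a set of independent per-group convolutions applied to disjoint channel slices and concatenated (the sizes below are arbitrary toy values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
c_per_group, d_per_group, groups = 4, 8, 2
x = torch.randn(1, c_per_group * groups, 7, 7)

grouped = nn.Conv2d(c_per_group * groups, d_per_group * groups, 3,
                    padding=1, groups=groups, bias=False)

# Re-run each group as an independent convolution using the same weights
outs = []
for g in range(groups):
    w = grouped.weight[g * d_per_group:(g + 1) * d_per_group]   # group g's filters
    xg = x[:, g * c_per_group:(g + 1) * c_per_group]            # group g's input slice
    outs.append(F.conv2d(xg, w, padding=1))

assert torch.allclose(grouped(x), torch.cat(outs, dim=1), atol=1e-6)
```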
The ResNeXt paper provides extensive ablations showing cardinality's superiority:
Controlled experiment setup:
| Model | Configuration | Top-1 Error | Improvement |
|---|---|---|---|
| ResNet-50 baseline | 1×64d | 23.9% | |
| Wider | 1×80d | 23.4% | +0.5% |
| Deeper | 1×64d, 101 layers | 22.4% | +1.5% |
| ResNeXt-50 | 32×4d | 22.2% | +1.7% |
Key finding: At equal complexity, increasing cardinality is more effective than increasing width or depth.
Further cardinality scaling:
| Cardinality | Width d | Top-1 Error |
|---|---|---|
| 1 | 64 | 23.9% |
| 2 | 40 | 23.3% |
| 4 | 24 | 22.8% |
| 8 | 14 | 22.4% |
| 32 | 4 | 22.2% |
Cardinality improvements have diminishing returns beyond 32. The jump from C=1 to C=8 is larger than C=8 to C=32. Most ResNeXt implementations use C=32 as the sweet spot.
ResNeXt follows the same stage structure as ResNet, replacing bottleneck blocks with ResNeXt blocks:
```python
class ResNeXt(nn.Module):
    """
    Complete ResNeXt network.
    Uses the grouped-convolution block (Form b) for efficiency.
    """

    def __init__(
        self,
        layers: list,              # Blocks per stage, e.g. [3, 4, 6, 3]
        cardinality: int = 32,
        base_width: int = 4,       # Width per group
        num_classes: int = 1000
    ):
        super().__init__()
        self.cardinality = cardinality
        self.base_width = base_width
        self.in_channels = 64

        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )

        # Stages with increasing channels
        self.stage1 = self._make_stage(64, layers[0], stride=1)
        self.stage2 = self._make_stage(128, layers[1], stride=2)
        self.stage3 = self._make_stage(256, layers[2], stride=2)
        self.stage4 = self._make_stage(512, layers[3], stride=2)

        # Head: stage 4 outputs 512 * expansion = 2048 channels
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * ResNeXtBlockB.expansion, num_classes)

    def _make_stage(self, planes, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            layers.append(ResNeXtBlockB(
                self.in_channels, planes,
                self.cardinality, self.base_width, s
            ))
            self.in_channels = planes * ResNeXtBlockB.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return self.fc(x)


# Standard configurations
def resnext50_32x4d(num_classes=1000):
    """ResNeXt-50 with 32 groups, 4-channel width per group."""
    return ResNeXt([3, 4, 6, 3], cardinality=32, base_width=4,
                   num_classes=num_classes)


def resnext101_32x8d(num_classes=1000):
    """ResNeXt-101 with 32 groups, 8-channel width per group."""
    return ResNeXt([3, 4, 23, 3], cardinality=32, base_width=8,
                   num_classes=num_classes)


def resnext101_64x4d(num_classes=1000):
    """ResNeXt-101 with 64 groups, 4-channel width per group."""
    return ResNeXt([3, 4, 23, 3], cardinality=64, base_width=4,
                   num_classes=num_classes)
```

| Model | Layers | Cardinality | Width | Params | Top-1 |
|---|---|---|---|---|---|
| ResNeXt-50 (32×4d) | [3,4,6,3] | 32 | 4 | 25M | 22.2% |
| ResNeXt-101 (32×8d) | [3,4,23,3] | 32 | 8 | 88M | 20.4% |
| ResNeXt-101 (64×4d) | [3,4,23,3] | 64 | 4 | 83M | 20.4% |
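A quick sanity check of the implementation above (a sketch run; the printed parameter count should land near the 25M reported for ResNeXt-50 32×4d):

```python
model = resnext50_32x4d(num_classes=1000)
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)                                    # torch.Size([2, 1000])
print(sum(p.numel() for p in model.parameters()) / 1e6)  # roughly 25 (million)
```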
When to use ResNeXt:
- As a drop-in replacement for a ResNet backbone when you want higher accuracy at a similar parameter and FLOP budget (see the comparison tables above).
- When pretrained ImageNet weights matter: standard ResNeXt-50/101 checkpoints are widely available, so transfer learning works exactly as with ResNet.
- When you want a modular design with few hyperparameters to tune: cardinality and width per group, rather than per-branch choices as in Inception.
Computational considerations:
- Parameter counts and FLOPs match the comparable ResNet, but grouped convolutions can be slower per FLOP on some hardware and library versions, so measure wall-clock throughput rather than relying on FLOPs alone.
- Accuracy gains diminish beyond C = 32 (see the ablation above), so very large cardinalities mostly add implementation overhead.
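In practice you would usually start from pretrained weights rather than train from scratch. A minimal sketch, assuming torchvision ≥ 0.13 (where the `weights` argument replaced `pretrained=True`); the 10-class head is just an example:

```python
import torch
from torchvision import models

# Load an ImageNet-pretrained ResNeXt-50 (32x4d) and swap the classifier head
model = models.resnext50_32x4d(weights="DEFAULT")
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # e.g. a 10-class task

model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)   # torch.Size([1, 10])
```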
ResNeXt principles appear in many later architectures: RegNet (a systematic design-space search over ResNeXt-style networks), EfficientNet (depthwise convolutions, the extreme case of grouping where each channel is its own group), and ConvNeXt (a modernized ResNe(X)t design for the 2020s). Understanding ResNeXt provides the foundation for these successors.
You've now mastered residual networks from foundational skip connections through ResNet, identity mappings, DenseNet, and ResNeXt. These architectures form the backbone of modern computer vision and influence designs across all deep learning domains.