Traditional approaches to increasing network capacity focus on depth (more layers) or width (more channels). ResNeXt, introduced by Xie et al. in 2017, reveals a third dimension: cardinality—the number of parallel transformation paths.
ResNeXt shows that increasing cardinality (e.g., 32 parallel branches) is more effective than increasing depth or width with the same parameter budget. This insight led to simpler, more modular designs that achieve state-of-the-art results with improved efficiency.
By the end of this page, you will understand: (1) The concept of cardinality and aggregated transformations, (2) ResNeXt block design and equivalent formulations, (3) How cardinality compares to depth and width, (4) Complete ResNeXt architectures, and (5) When and why to choose ResNeXt.
Network capacity scaling dimensions:
- Depth: stack more layers.
- Width: use more channels per layer.
- Cardinality: use more parallel transformation paths within each block (the dimension ResNeXt introduces).
The ResNeXt formula:
$$\mathbf{y} = \mathbf{x} + \sum_{i=1}^{C} \mathcal{T}_i(\mathbf{x})$$
Where C is the cardinality (number of transformations) and each $\mathcal{T}_i$ has the same topology but different parameters.
This is a generalization of ResNet's residual function. Standard ResNet has C=1; ResNeXt typically uses C=32.
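For comparison, setting C = 1 recovers the familiar ResNet residual block with a single transformation $\mathcal{F}$:

$$\mathbf{y} = \mathbf{x} + \mathcal{F}(\mathbf{x})$$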
Why cardinality works better:
Consider two designs with a similar parameter budget: a single wide bottleneck path (ResNet-50's 1×64d block) versus 32 narrow parallel paths of 4 channels each (ResNeXt-50's 32×4d block).
Mathematically: 32 paths × 4 channels = 128 total bottleneck channels, yet because the 3×3 convolution is grouped, the parameter count stays close to that of the single 64-channel dense path.
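A quick back-of-the-envelope count makes "similar parameters" concrete (a sketch; `bottleneck_params` is just an illustrative helper, and BatchNorm/bias parameters are ignored). It uses the conv2-stage shapes from the paper: 256 input/output channels, one 64-wide dense path versus a 128-wide grouped convolution with 32 groups:

```python
def bottleneck_params(c_in, width, c_out, groups=1):
    """Rough parameter count of a 1x1 -> 3x3 (grouped) -> 1x1 bottleneck."""
    reduce_ = c_in * width                  # 1x1 reduce
    spatial = width * width * 9 // groups   # 3x3 convolution (grouped)
    expand = width * c_out                  # 1x1 expand
    return reduce_ + spatial + expand

print(bottleneck_params(256, 64, 256, groups=1))    # ResNet-50 block:  ~69.6k
print(bottleneck_params(256, 128, 256, groups=32))  # ResNeXt 32x4d:    ~70.1k
```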
But ResNeXt tends to perform better because:
- Each of the 32 paths learns its own low-dimensional embedding and transformation, so the block aggregates a more diverse set of features than a single monolithic path.
- The grouped structure constrains how channels can interact, acting as a mild structural regularizer.
- Every path shares the same topology, so the extra capacity comes with essentially no new design hyperparameters.
ResNeXt can be seen as a simplified, regularized Inception. Where Inception uses heterogeneous branches (1×1, 3×3, 5×5, pooling), ResNeXt uses homogeneous branches (all same structure). This simplification makes ResNeXt easier to design and scale.
ResNeXt blocks can be implemented in three equivalent ways. All three describe the same aggregated transformation but differ in implementation efficiency.
Notation: ResNeXt-50 (32×4d) means a 50-layer network with cardinality C = 32 and a bottleneck width of d = 4 channels per path, so the grouped 3×3 convolution in the first stage operates on 32 × 4 = 128 channels.
```python
import torch
import torch.nn as nn


class ResNeXtBlockA(nn.Module):
    """
    Form (a): Explicit aggregation of C parallel paths.
    Each path: 1x1 reduce -> 3x3 -> 1x1 expand; the C outputs are summed.
    Conceptually clear but inefficient to run.
    """
    expansion = 4

    def __init__(self, in_ch: int, planes: int, cardinality: int = 32,
                 base_width: int = 4, stride: int = 1):
        super().__init__()
        D = int(planes * (base_width / 64))   # Width of each path (4 for 32x4d, stage 1)
        out_ch = planes * self.expansion

        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, D, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
                nn.Conv2d(D, D, 3, stride, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
                nn.Conv2d(D, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            for _ in range(cardinality)
        ])

        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        out = sum(path(x) for path in self.paths)
        out += self.shortcut(x)
        return torch.relu(out)


class ResNeXtBlockB(nn.Module):
    """
    Form (b): Early aggregation with a grouped 3x3 convolution.
    1x1 reduce -> grouped 3x3 -> 1x1 expand.
    The standard, most efficient implementation.
    """
    expansion = 4

    def __init__(self, in_ch: int, planes: int, cardinality: int = 32,
                 base_width: int = 4, stride: int = 1):
        super().__init__()
        # Total grouped width (e.g. 32 groups x 4 channels = 128 in stage 1)
        width = int(planes * (base_width / 64)) * cardinality
        out_ch = planes * self.expansion

        # 1x1 reduce to the grouped width
        self.conv1 = nn.Conv2d(in_ch, width, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(width)
        # Grouped 3x3: 'width' channels split into 'cardinality' groups
        self.conv2 = nn.Conv2d(width, width, 3, stride, 1,
                               groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(width)
        # 1x1 expand to the block output
        self.conv3 = nn.Conv2d(width, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = torch.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return torch.relu(out)


class ResNeXtBlockC(nn.Module):
    """
    Form (c): Split-transform-concatenate.
    Each path: 1x1 reduce -> 3x3; the C outputs are concatenated and a single
    shared 1x1 expands them to the block output. Equivalent to forms (a)/(b);
    occasionally useful on hardware without efficient grouped convolutions.
    """
    expansion = 4

    def __init__(self, in_ch: int, planes: int, cardinality: int = 32,
                 base_width: int = 4, stride: int = 1):
        super().__init__()
        D = int(planes * (base_width / 64))   # Width of each path
        out_ch = planes * self.expansion

        # C independent 1x1 -> 3x3 paths
        self.paths = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, D, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
                nn.Conv2d(D, D, 3, stride, 1, bias=False),
                nn.BatchNorm2d(D),
                nn.ReLU(inplace=True),
            )
            for _ in range(cardinality)
        ])
        # Final 1x1 is shared: it merges the concatenated paths
        self.conv3 = nn.Conv2d(cardinality * D, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch)
            )

    def forward(self, x):
        out = torch.cat([path(x) for path in self.paths], dim=1)
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        return torch.relu(out)
```

| Form | Structure | Implementation | Common Use |
|---|---|---|---|
| (a) | Sum of C paths | ModuleList + sum | Conceptual clarity |
| (b) | Grouped 3×3 conv | groups=C parameter | Standard, most efficient |
| (c) | Split-transform-concatenate | ModuleList + torch.cat + shared 1×1 | Hardware without efficient grouped conv |
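A small numerical sketch of why form (b)'s grouped 3×3 and the explicit split-transform-concatenate view compute the same thing: a grouped convolution is exactly a set of independent per-group convolutions applied to disjoint channel slices and concatenated (the sizes below are arbitrary toy values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
c_per_group, d_per_group, groups = 4, 8, 2
x = torch.randn(1, c_per_group * groups, 7, 7)

grouped = nn.Conv2d(c_per_group * groups, d_per_group * groups, 3,
                    padding=1, groups=groups, bias=False)

# Re-run each group as an independent convolution using the same weights
outs = []
for g in range(groups):
    w = grouped.weight[g * d_per_group:(g + 1) * d_per_group]   # group g's filters
    xg = x[:, g * c_per_group:(g + 1) * c_per_group]            # group g's input slice
    outs.append(F.conv2d(xg, w, padding=1))

assert torch.allclose(grouped(x), torch.cat(outs, dim=1), atol=1e-6)
```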
The ResNeXt paper provides extensive ablations showing cardinality's superiority:
Controlled experiment setup:
| Model | Configuration | Top-1 Error | Improvement |
|---|---|---|---|
| ResNet-50 baseline | 1×64d | 23.9% | |
| Wider | 1×80d | 23.4% | +0.5% |
| Deeper | 1×64d, 101 layers | 22.4% | +1.5% |
| ResNeXt-50 | 32×4d | 22.2% | +1.7% |
Key finding: At equal complexity, increasing cardinality is more effective than increasing width or depth.
Further cardinality scaling:
| Cardinality | Width d | Top-1 Error |
|---|---|---|
| 1 | 64 | 23.9% |
| 2 | 40 | 23.3% |
| 4 | 24 | 22.8% |
| 8 | 14 | 22.4% |
| 32 | 4 | 22.2% |
Cardinality improvements have diminishing returns beyond 32. The jump from C=1 to C=8 is larger than C=8 to C=32. Most ResNeXt implementations use C=32 as the sweet spot.
ResNeXt follows the same stage structure as ResNet, replacing bottleneck blocks with ResNeXt blocks:
```python
class ResNeXt(nn.Module):
    """
    Complete ResNeXt network.
    Uses the grouped-convolution block (Form b) for efficiency.
    """

    def __init__(
        self,
        layers: list,              # Blocks per stage, e.g. [3, 4, 6, 3]
        cardinality: int = 32,
        base_width: int = 4,       # Width per group
        num_classes: int = 1000
    ):
        super().__init__()
        self.cardinality = cardinality
        self.base_width = base_width
        self.in_channels = 64

        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        )

        # Stages with increasing channels
        self.stage1 = self._make_stage(64, layers[0], stride=1)
        self.stage2 = self._make_stage(128, layers[1], stride=2)
        self.stage3 = self._make_stage(256, layers[2], stride=2)
        self.stage4 = self._make_stage(512, layers[3], stride=2)

        # Head: stage 4 outputs 512 * expansion = 2048 channels
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * ResNeXtBlockB.expansion, num_classes)

    def _make_stage(self, planes, num_blocks, stride):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            layers.append(ResNeXtBlockB(
                self.in_channels, planes,
                self.cardinality, self.base_width, s
            ))
            self.in_channels = planes * ResNeXtBlockB.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return self.fc(x)


# Standard configurations
def resnext50_32x4d(num_classes=1000):
    """ResNeXt-50 with 32 groups, 4-channel width per group."""
    return ResNeXt([3, 4, 6, 3], cardinality=32, base_width=4,
                   num_classes=num_classes)


def resnext101_32x8d(num_classes=1000):
    """ResNeXt-101 with 32 groups, 8-channel width per group."""
    return ResNeXt([3, 4, 23, 3], cardinality=32, base_width=8,
                   num_classes=num_classes)


def resnext101_64x4d(num_classes=1000):
    """ResNeXt-101 with 64 groups, 4-channel width per group."""
    return ResNeXt([3, 4, 23, 3], cardinality=64, base_width=4,
                   num_classes=num_classes)
```

| Model | Layers | Cardinality | Width | Params | Top-1 |
|---|---|---|---|---|---|
| ResNeXt-50 (32×4d) | [3,4,6,3] | 32 | 4 | 25M | 22.2% |
| ResNeXt-101 (32×8d) | [3,4,23,3] | 32 | 8 | 88M | 20.4% |
| ResNeXt-101 (64×4d) | [3,4,23,3] | 64 | 4 | 83M | 20.4% |
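A quick sanity check of the implementation above (a sketch run; the printed parameter count should land near the 25M reported for ResNeXt-50 32×4d):

```python
model = resnext50_32x4d(num_classes=1000)
x = torch.randn(2, 3, 224, 224)
print(model(x).shape)                                    # torch.Size([2, 1000])
print(sum(p.numel() for p in model.parameters()) / 1e6)  # roughly 25 (million)
```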
When to use ResNeXt:
- As a drop-in replacement for a ResNet backbone when you want higher accuracy at a similar parameter and FLOP budget (see the comparison tables above).
- When pretrained ImageNet weights matter: standard ResNeXt-50/101 checkpoints are widely available, so transfer learning works exactly as with ResNet.
- When you want a modular design with few hyperparameters to tune: cardinality and width per group, rather than per-branch choices as in Inception.
Computational considerations:
- Parameter counts and FLOPs match the comparable ResNet, but grouped convolutions can be slower per FLOP on some hardware and library versions, so measure wall-clock throughput rather than relying on FLOPs alone.
- Accuracy gains diminish beyond C = 32 (see the ablation above), so very large cardinalities mostly add implementation overhead.
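In practice you would usually start from pretrained weights rather than train from scratch. A minimal sketch, assuming torchvision ≥ 0.13 (where the `weights` argument replaced `pretrained=True`); the 10-class head is just an example:

```python
import torch
from torchvision import models

# Load an ImageNet-pretrained ResNeXt-50 (32x4d) and swap the classifier head
model = models.resnext50_32x4d(weights="DEFAULT")
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # e.g. a 10-class task

model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)   # torch.Size([1, 10])
```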
ResNeXt principles appear in many later architectures: RegNet (a systematic design-space search over ResNeXt-style networks), EfficientNet (depthwise convolutions, the extreme case of grouping where each channel is its own group), and ConvNeXt (a modernized ResNe(X)t design for the 2020s). Understanding ResNeXt provides the foundation for these successors.
You've now mastered residual networks from foundational skip connections through ResNet, identity mappings, DenseNet, and ResNeXt. These architectures form the backbone of modern computer vision and influence designs across all deep learning domains.