By 2021, Vision Transformers (ViT) seemed poised to replace ConvNets entirely. ViT achieved state-of-the-art results on image classification, and Swin Transformer extended this success to detection and segmentation. Many wondered whether the era of convolutional networks was ending.
ConvNeXt (Liu et al., 2022) challenged this narrative. Starting from a standard ResNet-50, the authors systematically applied modernizations inspired by Transformers—not to create a hybrid, but to demonstrate that a pure ConvNet could match or exceed Swin Transformer performance when given equivalent training recipes and architectural refinements.
The result: ConvNeXt achieved 87.8% top-1 on ImageNet (matching Swin-L) while maintaining the simplicity, efficiency, and inductive biases of ConvNets.
This page traces the systematic modernization from ResNet to ConvNeXt: training recipe updates, macro design changes, ResNeXt-ification, inverted bottleneck, large kernels, and micro design choices. You'll understand which changes mattered most and why.
ConvNeXt's development followed a structured roadmap, measuring accuracy gains at each step:
Starting Point: ResNet-50 trained with modern techniques (76.1% → 78.8%)
Key Modifications:
| Step | Modification | Accuracy | Change |
|---|---|---|---|
| Baseline | ResNet-50 (original training) | 76.1% | |
| 1 | Modern training recipe | 78.8% | +2.7% |
| 2 | Stage ratio 3:3:9:3 (like Swin) | 79.4% | +0.6% |
| 3 | Patchify stem (4×4 stride-4 conv) | 79.5% | +0.1% |
| 4 | ResNeXt (grouped conv, wider) | 80.5% | +1.0% |
| 5 | Inverted bottleneck | 80.6% | +0.1% |
| 6 | Move depthwise conv up | 79.9% | -0.7%* |
| 7 | Large kernel (7×7) | 80.6% | +0.7% |
| 8 | GELU activation | 80.6% | — |
| 9 | Fewer activations/norms | 81.4% | +0.8% |
| 10 | LayerNorm instead of BatchNorm | 81.5% | +0.1% |
| 11 | Separate downsampling layers | 82.0% | +0.5% |

*Temporary drop: moving the depthwise conv up reduces FLOPs, and the accuracy is recovered once the kernel is enlarged to 7×7 in the next step.
The largest single improvement (+2.7%) came from simply updating the training recipe to match modern practices: longer training (300 epochs), AdamW optimizer, data augmentation (Mixup, CutMix, RandAugment), regularization (stochastic depth, label smoothing). Architecture changes built on this foundation.
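As a minimal PyTorch sketch of that recipe (assumptions: `model` is any image classifier such as the ConvNeXt assembled later on this page, and `train_loader` yields batches that already have Mixup/CutMix/RandAugment applied in the data pipeline; warmup, weight EMA, and distributed training are omitted, and the 4e-3 learning rate / 0.05 weight decay follow the paper's large-batch settings):

```python
import torch
import torch.nn as nn


def train_modernized(model: nn.Module, train_loader, epochs: int = 300, device: str = "cuda"):
    """Modern recipe sketch: AdamW, cosine schedule, label smoothing, 300 epochs.

    Stochastic depth is assumed to live inside the model's blocks (see DropPath below),
    and Mixup/CutMix/RandAugment in the data pipeline feeding `train_loader`.
    """
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing

    for _ in range(epochs):
        model.train()
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```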
Stage Compute Ratios:
ResNet-50 distributes its blocks as (3, 4, 6, 3) across the four stages. Swin-T uses a 1:1:3:1 stage compute ratio. ConvNeXt adopts (3, 3, 9, 3), matching Swin's heavy emphasis on stage 3, where most computation happens at a resolution well suited to learning rich features.
Patchify Stem:
ResNet uses an aggressive 7×7 conv with stride 2, followed by max pooling. This rapidly downsamples 4× before the main stages. Transformers use a "patchify" stem: a single 4×4 conv with stride 4.
ConvNeXt adopts the patchify approach, using a 4×4 non-overlapping convolution. This is simpler and slightly more efficient.
Separate Downsampling:
ResNet handles downsampling within the first block of each stage (using stride-2 convolution). Swin uses dedicated downsampling layers between stages.
ConvNeXt separates downsampling into explicit 2×2 stride-2 convolutions with LayerNorm—cleaner and more stable during training.
```python
import torch
import torch.nn as nn


class LayerNorm2d(nn.Module):
    """LayerNorm for (B, C, H, W) tensors, normalizing over the channel dimension."""

    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) → normalize over the C dimension
        u = x.mean(1, keepdim=True)
        s = (x - u).pow(2).mean(1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.eps)
        x = self.weight[:, None, None] * x + self.bias[:, None, None]
        return x


class ConvNeXtStem(nn.Module):
    """Patchify stem: 4×4 conv with stride 4, then LayerNorm."""

    def __init__(self, in_channels: int = 3, out_channels: int = 96):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=4, stride=4),
            LayerNorm2d(out_channels, eps=1e-6),
        )

    def forward(self, x):
        return self.stem(x)


class Downsampling(nn.Module):
    """Explicit downsampling layer between stages: LayerNorm, then 2×2 stride-2 conv."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.downsample = nn.Sequential(
            LayerNorm2d(in_channels, eps=1e-6),
            nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.downsample(x)
```

The ConvNeXt block combines several design choices into a clean, effective pattern:
Structure: Depthwise 7×7 → LayerNorm → 1×1 Conv (expand 4×) → GELU → 1×1 Conv (project) → Residual
Key Design Choices:
Large Depthwise Kernels (7×7): Transformers mix global context via self-attention; large kernels widen a ConvNet's receptive field in the same spirit. Moving the depthwise conv to the beginning of the block (before the 4× expansion) keeps its cost low (see the rough cost comparison after this list).
Inverted Bottleneck (4× expansion): Like Transformers' FFN blocks and MobileNetV2, expand channels before the main computation, then project back down.
GELU Activation: Used in Transformers, smoother than ReLU. Applied only once (after first 1×1).
LayerNorm instead of BatchNorm: More stable, works better with variable batch sizes, aligns with Transformer practice.
Fewer Normalization/Activation Layers: Only one activation and one normalization per block, versus one after nearly every convolution in a ResNet bottleneck. This simplification alone accounted for +0.8% in the roadmap above.
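A rough back-of-the-envelope comparison of the spatial-mixing cost (multiply-accumulate counts only, ignoring the 1×1 convs and biases; the 14×14 resolution and 384 channels correspond to stage 3 of ConvNeXt-T at 224×224 input):

```python
def conv_macs(h, w, c_in, c_out, k, groups=1):
    """Multiply-accumulate count of a k×k convolution producing an h×w output."""
    return h * w * c_out * (c_in // groups) * k * k

H, W, C = 14, 14, 384  # stage-3 feature map of ConvNeXt-T at 224×224 input

dense_7x7 = conv_macs(H, W, C, C, 7)                          # ordinary 7×7 conv
dw_at_C   = conv_macs(H, W, C, C, 7, groups=C)                # depthwise 7×7 before expansion
dw_at_4C  = conv_macs(H, W, 4 * C, 4 * C, 7, groups=4 * C)    # depthwise 7×7 after 4× expansion

print(f"dense 7×7:           {dense_7x7 / 1e6:8.1f} MMACs")
print(f"depthwise 7×7 at C:  {dw_at_C / 1e6:8.2f} MMACs")   # ≈ 3.7 MMACs
print(f"depthwise 7×7 at 4C: {dw_at_4C / 1e6:8.2f} MMACs")  # 4× the cost of placing it at C
```

A dense 7×7 conv would cost C times more than its depthwise counterpart, which is why a large kernel is affordable only in depthwise form, and placing it before the 4× expansion keeps its cost at a quarter of what it would be afterwards. The full block implementation below puts these choices together.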
```python
import torch
import torch.nn as nn


class ConvNeXtBlock(nn.Module):
    """
    ConvNeXt Block.

    Structure: DwConv 7×7 → LayerNorm → Linear (expand 4×) → GELU → Linear (project) → Residual

    This mirrors the Transformer block structure:
    - Depthwise conv ≈ token mixing (like self-attention)
    - 1×1 convs ≈ channel mixing (like FFN)
    """

    def __init__(self, dim: int, drop_path: float = 0.0, layer_scale: float = 1e-6):
        super().__init__()
        # Depthwise convolution (spatial mixing)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Normalization
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        # Pointwise convolutions (channel mixing), implemented as Linear layers
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # expand 4×
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project back
        # Layer Scale: learnable per-channel scaling initialized to a small value
        self.gamma = nn.Parameter(layer_scale * torch.ones(dim)) if layer_scale > 0 else None
        # Stochastic depth for regularization
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        # Depthwise conv
        x = self.dwconv(x)
        # (B, C, H, W) → (B, H, W, C) for LayerNorm and Linear
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        # MLP: expand → GELU → project
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        # Layer Scale
        if self.gamma is not None:
            x = self.gamma * x
        # (B, H, W, C) → (B, C, H, W)
        x = x.permute(0, 3, 1, 2)
        # Residual connection with stochastic depth
        x = shortcut + self.drop_path(x)
        return x


class DropPath(nn.Module):
    """Stochastic depth: randomly drop the entire residual branch per sample."""

    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
        random_tensor.floor_()  # binarize: keep (1) or drop (0) each sample's branch
        return x.div(keep_prob) * random_tensor
```

Layer Scale (from CaiT) multiplies the output of each residual branch by a learnable per-channel factor initialized to a small value (e.g., 1e-6). This helps training stability, especially for larger models, by allowing the network to initially act like a shallower network.
| Model | Channels (C) | Blocks per Stage | Parameters | ImageNet Top-1 |
|---|---|---|---|---|
| ConvNeXt-T | 96 | 3, 3, 9, 3 | 28M | 82.1% |
| ConvNeXt-S | 96 | 3, 3, 27, 3 | 50M | 83.1% |
| ConvNeXt-B | 128 | 3, 3, 27, 3 | 89M | 83.8% |
| ConvNeXt-L | 192 | 3, 3, 27, 3 | 198M | 84.3% |
| ConvNeXt-XL | 256 | 3, 3, 27, 3 | 350M | 84.6% |
Scaling Strategy: ConvNeXt scales along two axes. Depth: ConvNeXt-T → S triples the stage-3 block count (9 → 27). Width: from S to XL the base channel count C grows 96 → 128 → 192 → 256, with channels doubling at every stage (e.g., (96, 192, 384, 768) for C = 96).
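As a minimal sketch (not the paper's reference implementation), these variants can be assembled from the ConvNeXtStem, Downsampling, and ConvNeXtBlock classes defined above; the stochastic-depth schedule and weight initialization are omitted, and the head is a plain LayerNorm + Linear:

```python
import torch
import torch.nn as nn

# Assumes ConvNeXtStem, Downsampling, and ConvNeXtBlock from earlier on this page are in scope.


class ConvNeXt(nn.Module):
    """Simplified ConvNeXt: patchify stem → 4 stages of blocks with downsampling in between."""

    def __init__(self, depths=(3, 3, 9, 3), dims=(96, 192, 384, 768), num_classes=1000):
        super().__init__()
        self.stem = ConvNeXtStem(3, dims[0])
        self.stages = nn.ModuleList()
        for i, (depth, dim) in enumerate(zip(depths, dims)):
            layers = [] if i == 0 else [Downsampling(dims[i - 1], dim)]
            layers += [ConvNeXtBlock(dim) for _ in range(depth)]
            self.stages.append(nn.Sequential(*layers))
        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        x = x.mean(dim=(2, 3))          # global average pooling over H, W
        return self.head(self.norm(x))


# Configurations from the table above (channels double at each stage)
convnext_t = ConvNeXt(depths=(3, 3, 9, 3), dims=(96, 192, 384, 768))
convnext_s = ConvNeXt(depths=(3, 3, 27, 3), dims=(96, 192, 384, 768))
logits = convnext_t(torch.randn(1, 3, 224, 224))   # → shape (1, 1000)
```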
ConvNeXt V2 (2023):
ConvNeXt V2 (Woo et al., 2023) introduced additional improvements:
FCMAE Pre-training: A fully convolutional masked autoencoder brings masked-image-modeling self-supervision to ConvNets, using sparse convolutions so that only visible patches are processed during pre-training.
Global Response Normalization (GRN): A new layer in the block's MLP that encourages feature diversity across channels; it replaces Layer Scale.
Broader Model Family: Sizes ranging from very small (Atto, Femto, Pico, Nano) up to Huge.
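A sketch of the GRN layer, following its description in the ConvNeXt V2 paper (channels-last (B, H, W, C) layout, matching the MLP portion of the ConvNeXtBlock above; treat this as illustrative rather than the reference implementation):

```python
import torch
import torch.nn as nn


class GRN(nn.Module):
    """Global Response Normalization for channels-last (B, H, W, C) tensors."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # Global aggregation: L2 norm over the spatial dimensions, per channel
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)        # (B, 1, 1, C)
        # Divisive normalization across channels
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
        # Calibrate the features, with learnable scale/shift and an identity shortcut
        return self.gamma * (x * nx) + self.beta + x
```

In the V2 block, GRN sits between the GELU and the second pointwise layer, and the Layer Scale parameter is removed.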
You now understand ConvNeXt: how systematic modernization transformed ResNet into a Transformer-competitive architecture while remaining a pure ConvNet. Next, we'll explore the connection to Vision Transformers and the ongoing convergence of architectural paradigms.