By 2021, Vision Transformers (ViT) seemed poised to replace ConvNets entirely. ViT achieved state-of-the-art results on image classification, and Swin Transformer extended this success to detection and segmentation. Many wondered whether the era of convolutional networks was ending.
ConvNeXt (Liu et al., 2022) challenged this narrative. Starting from a standard ResNet-50, the authors systematically applied modernizations inspired by Transformers—not to create a hybrid, but to demonstrate that a pure ConvNet could match or exceed Swin Transformer performance when given equivalent training recipes and architectural refinements.
The result: ConvNeXt achieved 87.8% top-1 on ImageNet (matching Swin-L) while maintaining the simplicity, efficiency, and inductive biases of ConvNets.
This page traces the systematic modernization from ResNet to ConvNeXt: training recipe updates, macro design changes, ResNeXt-ification, inverted bottleneck, large kernels, and micro design choices. You'll understand which changes mattered most and why.
ConvNeXt's development followed a structured roadmap, measuring accuracy gains at each step:
Starting Point: ResNet-50 trained with modern techniques (76.1% → 78.8%)
Key Modifications:
| Step | Modification | Accuracy | Change |
|---|---|---|---|
| Baseline | ResNet-50 (original training) | 76.1% | |
| 1 | Modern training recipe | 78.8% | +2.7% |
| 2 | Stage ratio 3:3:9:3 (like Swin) | 79.4% | +0.6% |
| 3 | Patchify stem (4×4 stride-4 conv) | 79.5% | +0.1% |
| 4 | ResNeXt (grouped conv, wider) | 80.5% | +1.0% |
| 5 | Inverted bottleneck | 80.6% | +0.1% |
| 6 | Move depthwise conv up | 79.9% | -0.7%* |
| 7 | Large kernel (7×7) | 80.6% | +0.7% |
| 8 | GELU activation | 80.6% | — |
| 9 | Fewer activations/norms | 81.4% | +0.8% |
| 10 | LayerNorm instead of BatchNorm | 81.5% | +0.1% |
| 11 | Separate downsampling layers | 82.0% | +0.5% |

*Temporary drop: moving the depthwise conv up reduces FLOPs, and the accuracy is recovered once the kernel is enlarged to 7×7 in the next step.
The largest single improvement (+2.7%) came from simply updating the training recipe to match modern practices: longer training (300 epochs), AdamW optimizer, data augmentation (Mixup, CutMix, RandAugment), regularization (stochastic depth, label smoothing). Architecture changes built on this foundation.
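As a minimal PyTorch sketch of that recipe (assumptions: `model` is any image classifier such as the ConvNeXt assembled later on this page, and `train_loader` yields batches that already have Mixup/CutMix/RandAugment applied in the data pipeline; warmup, weight EMA, and distributed training are omitted, and the 4e-3 learning rate / 0.05 weight decay follow the paper's large-batch settings):

```python
import torch
import torch.nn as nn


def train_modernized(model: nn.Module, train_loader, epochs: int = 300, device: str = "cuda"):
    """Modern recipe sketch: AdamW, cosine schedule, label smoothing, 300 epochs.

    Stochastic depth is assumed to live inside the model's blocks (see DropPath below),
    and Mixup/CutMix/RandAugment in the data pipeline feeding `train_loader`.
    """
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-3, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing

    for _ in range(epochs):
        model.train()
        for images, targets in train_loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```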
Stage Compute Ratios:
ResNet-50 distributes its blocks as (3, 4, 6, 3) across the four stages. Swin-T uses a 1:1:3:1 stage compute ratio. ConvNeXt adopts (3, 3, 9, 3), matching Swin's heavy emphasis on stage 3, where most computation happens at a resolution well suited to learning rich features.
Patchify Stem:
ResNet uses an aggressive 7×7 conv with stride 2, followed by max pooling. This rapidly downsamples 4× before the main stages. Transformers use a "patchify" stem: a single 4×4 conv with stride 4.
ConvNeXt adopts the patchify approach, using a 4×4 non-overlapping convolution. This is simpler and slightly more efficient.
Separate Downsampling:
ResNet handles downsampling within the first block of each stage (using stride-2 convolution). Swin uses dedicated downsampling layers between stages.
ConvNeXt separates downsampling into explicit 2×2 stride-2 convolutions with LayerNorm—cleaner and more stable during training.
```python
import torch
import torch.nn as nn


class LayerNorm2d(nn.Module):
    """LayerNorm for (B, C, H, W) tensors, normalizing over the channel dimension."""

    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        # x: (B, C, H, W) → normalize over the C dimension
        u = x.mean(1, keepdim=True)
        s = (x - u).pow(2).mean(1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.eps)
        x = self.weight[:, None, None] * x + self.bias[:, None, None]
        return x


class ConvNeXtStem(nn.Module):
    """Patchify stem: 4×4 conv with stride 4, then LayerNorm."""

    def __init__(self, in_channels: int = 3, out_channels: int = 96):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=4, stride=4),
            LayerNorm2d(out_channels, eps=1e-6),
        )

    def forward(self, x):
        return self.stem(x)


class Downsampling(nn.Module):
    """Explicit downsampling layer between stages: LayerNorm, then 2×2 stride-2 conv."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.downsample = nn.Sequential(
            LayerNorm2d(in_channels, eps=1e-6),
            nn.Conv2d(in_channels, out_channels, kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.downsample(x)
```

The ConvNeXt block combines several design choices into a clean, effective pattern:
Structure: Depthwise 7×7 → LayerNorm → 1×1 Conv (expand 4×) → GELU → 1×1 Conv (project) → Residual
Key Design Choices:
Large Depthwise Kernels (7×7): Transformers mix global context via self-attention; large kernels widen a ConvNet's receptive field in the same spirit. Moving the depthwise conv to the beginning of the block (before the 4× expansion) keeps its cost low (see the rough cost comparison after this list).
Inverted Bottleneck (4× expansion): Like Transformers' FFN blocks and MobileNetV2, expand channels before the main computation, then project back down.
GELU Activation: Used in Transformers, smoother than ReLU. Applied only once (after first 1×1).
LayerNorm instead of BatchNorm: More stable, works better with variable batch sizes, aligns with Transformer practice.
Fewer Normalization/Activation Layers: Only one activation and one normalization per block, versus one after nearly every convolution in a ResNet bottleneck. This simplification alone accounted for +0.8% in the roadmap above.
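A rough back-of-the-envelope comparison of the spatial-mixing cost (multiply-accumulate counts only, ignoring the 1×1 convs and biases; the 14×14 resolution and 384 channels correspond to stage 3 of ConvNeXt-T at 224×224 input):

```python
def conv_macs(h, w, c_in, c_out, k, groups=1):
    """Multiply-accumulate count of a k×k convolution producing an h×w output."""
    return h * w * c_out * (c_in // groups) * k * k

H, W, C = 14, 14, 384  # stage-3 feature map of ConvNeXt-T at 224×224 input

dense_7x7 = conv_macs(H, W, C, C, 7)                          # ordinary 7×7 conv
dw_at_C   = conv_macs(H, W, C, C, 7, groups=C)                # depthwise 7×7 before expansion
dw_at_4C  = conv_macs(H, W, 4 * C, 4 * C, 7, groups=4 * C)    # depthwise 7×7 after 4× expansion

print(f"dense 7×7:           {dense_7x7 / 1e6:8.1f} MMACs")
print(f"depthwise 7×7 at C:  {dw_at_C / 1e6:8.2f} MMACs")   # ≈ 3.7 MMACs
print(f"depthwise 7×7 at 4C: {dw_at_4C / 1e6:8.2f} MMACs")  # 4× the cost of placing it at C
```

A dense 7×7 conv would cost C times more than its depthwise counterpart, which is why a large kernel is affordable only in depthwise form, and placing it before the 4× expansion keeps its cost at a quarter of what it would be afterwards. The full block implementation below puts these choices together.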
```python
import torch
import torch.nn as nn


class ConvNeXtBlock(nn.Module):
    """
    ConvNeXt Block.

    Structure: DwConv 7×7 → LayerNorm → Linear (expand 4×) → GELU → Linear (project) → Residual

    This mirrors the Transformer block structure:
    - Depthwise conv ≈ token mixing (like self-attention)
    - 1×1 convs ≈ channel mixing (like FFN)
    """

    def __init__(self, dim: int, drop_path: float = 0.0, layer_scale: float = 1e-6):
        super().__init__()
        # Depthwise convolution (spatial mixing)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # Normalization
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        # Pointwise convolutions (channel mixing), implemented as Linear layers
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # expand 4×
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project back
        # Layer Scale: learnable per-channel scaling initialized to a small value
        self.gamma = nn.Parameter(layer_scale * torch.ones(dim)) if layer_scale > 0 else None
        # Stochastic depth for regularization
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        # Depthwise conv
        x = self.dwconv(x)
        # (B, C, H, W) → (B, H, W, C) for LayerNorm and Linear
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        # MLP: expand → GELU → project
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        # Layer Scale
        if self.gamma is not None:
            x = self.gamma * x
        # (B, H, W, C) → (B, C, H, W)
        x = x.permute(0, 3, 1, 2)
        # Residual connection with stochastic depth
        x = shortcut + self.drop_path(x)
        return x


class DropPath(nn.Module):
    """Stochastic depth: randomly drop the entire residual branch per sample."""

    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x
        keep_prob = 1 - self.drop_prob
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
        random_tensor.floor_()  # binarize: keep (1) or drop (0) each sample's branch
        return x.div(keep_prob) * random_tensor
```

Layer Scale (from CaiT) multiplies the output of each residual branch by a learnable per-channel factor initialized to a small value (e.g., 1e-6). This helps training stability, especially for larger models, by allowing the network to initially act like a shallower network.
| Model | Channels (C) | Blocks per Stage | Parameters | ImageNet Top-1 |
|---|---|---|---|---|
| ConvNeXt-T | 96 | 3, 3, 9, 3 | 28M | 82.1% |
| ConvNeXt-S | 96 | 3, 3, 27, 3 | 50M | 83.1% |
| ConvNeXt-B | 128 | 3, 3, 27, 3 | 89M | 83.8% |
| ConvNeXt-L | 192 | 3, 3, 27, 3 | 198M | 84.3% |
| ConvNeXt-XL | 256 | 3, 3, 27, 3 | 350M | 84.6% |
Scaling Strategy: ConvNeXt scales along two axes. Depth: ConvNeXt-T → S triples the stage-3 block count (9 → 27). Width: from S to XL the base channel count C grows 96 → 128 → 192 → 256, with channels doubling at every stage (e.g., (96, 192, 384, 768) for C = 96).
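As a minimal sketch (not the paper's reference implementation), these variants can be assembled from the ConvNeXtStem, Downsampling, and ConvNeXtBlock classes defined above; the stochastic-depth schedule and weight initialization are omitted, and the head is a plain LayerNorm + Linear:

```python
import torch
import torch.nn as nn

# Assumes ConvNeXtStem, Downsampling, and ConvNeXtBlock from earlier on this page are in scope.


class ConvNeXt(nn.Module):
    """Simplified ConvNeXt: patchify stem → 4 stages of blocks with downsampling in between."""

    def __init__(self, depths=(3, 3, 9, 3), dims=(96, 192, 384, 768), num_classes=1000):
        super().__init__()
        self.stem = ConvNeXtStem(3, dims[0])
        self.stages = nn.ModuleList()
        for i, (depth, dim) in enumerate(zip(depths, dims)):
            layers = [] if i == 0 else [Downsampling(dims[i - 1], dim)]
            layers += [ConvNeXtBlock(dim) for _ in range(depth)]
            self.stages.append(nn.Sequential(*layers))
        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)
        self.head = nn.Linear(dims[-1], num_classes)

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        x = x.mean(dim=(2, 3))          # global average pooling over H, W
        return self.head(self.norm(x))


# Configurations from the table above (channels double at each stage)
convnext_t = ConvNeXt(depths=(3, 3, 9, 3), dims=(96, 192, 384, 768))
convnext_s = ConvNeXt(depths=(3, 3, 27, 3), dims=(96, 192, 384, 768))
logits = convnext_t(torch.randn(1, 3, 224, 224))   # → shape (1, 1000)
```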
ConvNeXt V2 (2023):
ConvNeXt V2 (Woo et al., 2023) introduced additional improvements:
FCMAE Pre-training: A fully convolutional masked autoencoder brings masked-image-modeling self-supervision to ConvNets, using sparse convolutions so that only visible patches are processed during pre-training.
Global Response Normalization (GRN): A new layer in the block's MLP that encourages feature diversity across channels; it replaces Layer Scale.
Broader Model Family: Sizes ranging from very small (Atto, Femto, Pico, Nano) up to Huge.
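A sketch of the GRN layer, following its description in the ConvNeXt V2 paper (channels-last (B, H, W, C) layout, matching the MLP portion of the ConvNeXtBlock above; treat this as illustrative rather than the reference implementation):

```python
import torch
import torch.nn as nn


class GRN(nn.Module):
    """Global Response Normalization for channels-last (B, H, W, C) tensors."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # Global aggregation: L2 norm over the spatial dimensions, per channel
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)        # (B, 1, 1, C)
        # Divisive normalization across channels
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
        # Calibrate the features, with learnable scale/shift and an identity shortcut
        return self.gamma * (x * nx) + self.beta + x
```

In the V2 block, GRN sits between the GELU and the second pointwise layer, and the Layer Scale parameter is removed.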
You now understand ConvNeXt: how systematic modernization transformed ResNet into a Transformer-competitive architecture while remaining a pure ConvNet. Next, we'll explore the connection to Vision Transformers and the ongoing convergence of architectural paradigms.