The story of modern computer vision is one of surprising convergence. Vision Transformers (ViT) arrived in 2020 claiming to replace convolutions entirely—yet the most successful architectures today borrow from both paradigms.
From Competition to Synthesis:
Understanding this connection is crucial for modern practitioners who must choose and design architectures for real-world applications.
This page explores Vision Transformer fundamentals, the relationship between convolution and attention, Swin Transformer's hierarchical design, hybrid architectures, and practical guidance on when to choose CNNs vs. Transformers.
ViT Architecture (Dosovitskiy et al., 2020):
ViT applies the standard Transformer encoder to image patches:
Patch Embedding: Split image into non-overlapping patches (e.g., 16×16). Flatten and linearly project each patch to an embedding dimension.
Position Embedding: Add learnable position embeddings to preserve spatial information.
Transformer Encoder: Stack of self-attention + MLP blocks with LayerNorm and residual connections.
Classification Head: Use a learnable [CLS] token (or global average pool) for class prediction.
Key Insight: ViT treats an image as a sequence of tokens (patches) and processes them with global self-attention, allowing each patch to attend to every other patch from the first layer.
```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Convert an image into a sequence of patch embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches (equivalent to a conv with kernel = stride = patch_size)
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        x = self.proj(x)
        x = x.flatten(2).transpose(1, 2)
        return x


class TransformerBlock(nn.Module):
    """Standard (pre-norm) Transformer encoder block."""
    def __init__(self, dim, num_heads, mlp_ratio=4.0, drop=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(), nn.Dropout(drop),
            nn.Linear(int(dim * mlp_ratio), dim), nn.Dropout(drop)
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]        # self-attention + residual
        x = x + self.mlp(self.norm2(x))      # MLP + residual
        return x


class VisionTransformer(nn.Module):
    """Simplified Vision Transformer."""
    def __init__(self, img_size=224, patch_size=16, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)  # standard implementations avoid all-zero position embeddings

        self.blocks = nn.Sequential(*[
            TransformerBlock(embed_dim, num_heads) for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)
        x = x + self.pos_embed
        x = self.blocks(x)
        x = self.norm(x)
        return self.head(x[:, 0])  # prediction from the [CLS] token
```

ViT lacks the inductive biases (locality, translation equivariance) that CNNs have built in. It therefore requires large-scale pre-training (ImageNet-21K or JFT-300M) to match CNN performance; trained on ImageNet-1K alone, ViT underperforms well-tuned ResNets.
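As a quick smoke test of the simplified model above (using only the classes just defined): a 224×224 image with 16×16 patches yields 14×14 = 196 patch tokens plus one [CLS] token.

```python
# Shape check: internally the model processes a (2, 197, 768) token sequence.
model = VisionTransformer(img_size=224, patch_size=16, num_classes=1000)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```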
Fundamental Differences:
| Aspect | Convolution | Self-Attention |
|---|---|---|
| Receptive field | Local (kernel size) | Global (all positions) |
| Weight sharing | Spatial (same kernel reused everywhere) | Projections shared across positions; mixing weights are data-dependent |
| Translation equivariance | Built-in | Requires position encoding |
| Compute complexity | O(K² × C² × HW) | O((HW)² × C) |
| Parameter count | O(K² × C²) | O(C²) |
| Inductive bias | Strong (locality) | Weak (learned from data) |
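To make the complexity row concrete, here is a rough back-of-the-envelope count of multiply-accumulates for a single layer. The helper names and sizes are arbitrary choices for illustration, and the attention count ignores the O(HW × C²) cost of the Q/K/V projections.

```python
def conv_macs(H, W, C, K):
    # Standard convolution: each of H*W positions applies a K x K x C kernel for each of C outputs.
    return H * W * C * C * K * K

def attention_macs(H, W, C):
    # Global self-attention: score computation (n^2 * C) plus value aggregation (n^2 * C).
    n = H * W
    return 2 * n * n * C

H = W = 56; C = 96; K = 3
print(f"conv:      {conv_macs(H, W, C, K):,}")    # ~2.6e8
print(f"attention: {attention_macs(H, W, C):,}")  # ~1.9e9, the (HW)^2 term dominates at high resolution
```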
The Deep Connection:
Surprisingly, convolution and attention are more related than they first appear:
Attention as Dynamic Convolution: Self-attention can be viewed as a convolution with content-dependent, spatially-varying kernels. Where convolution applies the same kernel everywhere, attention computes custom weights for each position based on content.
Convolution as Constrained Attention: A convolution is equivalent to attention where the attention weights are fixed, local, and content-independent.
Depthwise Convolution ≈ Local Attention: Large depthwise kernels (like ConvNeXt's 7×7) approximate local attention—they mix spatial information within a neighborhood. This is why ConvNeXt can match Transformers: large kernels provide a similar inductive bias to local attention.
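A toy check of the "constrained attention" view above: if the attention weights are frozen to be local, uniform, and content-independent, the operation collapses to a depthwise convolution with a uniform kernel. This is a sketch for intuition only; it uses `F.unfold` to gather each 3×3 neighborhood and then averages it with fixed weights.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 1, 4, 8, 8
x = torch.randn(B, C, H, W)

# Path 1: depthwise convolution with a uniform 3x3 kernel (a local average per channel).
kernel = torch.full((C, 1, 3, 3), 1.0 / 9)
conv_out = F.conv2d(x, kernel, padding=1, groups=C)

# Path 2: "attention" with hard-coded weights: gather each 3x3 neighborhood,
# then mix it with fixed, uniform weights instead of content-dependent ones.
patches = F.unfold(x, kernel_size=3, padding=1)   # (B, C*9, H*W)
patches = patches.view(B, C, 9, H * W)
attn_out = patches.mean(dim=2).view(B, C, H, W)

print(torch.allclose(conv_out, attn_out, atol=1e-6))  # True
```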
The MetaFormer paper showed that the Transformer's success comes largely from its general architecture (token mixing + channel mixing + residuals + norms), not specifically from attention. Replacing attention with spatial pooling (PoolFormer) still achieves strong results—suggesting the 'what' of token mixing matters less than 'that' it happens.
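A minimal sketch of the MetaFormer template, with pooling standing in for attention as in PoolFormer. The class name and layer sizes below are my own simplified rendering, not the reference implementation.

```python
import torch
import torch.nn as nn

class PoolFormerBlock(nn.Module):
    """MetaFormer template: token mixing + channel MLP, each wrapped in norm + residual.
    The token mixer here is plain average pooling (as in PoolFormer), not attention."""
    def __init__(self, dim, pool_size=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # channel-wise norm for (B, C, H, W) tensors
        self.token_mixer = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                        count_include_pad=False)
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(          # channel mixing with 1x1 convs
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1)
        )

    def forward(self, x):                  # x: (B, C, H, W)
        y = self.norm1(x)
        x = x + (self.token_mixer(y) - y)  # pooling minus identity, following the PoolFormer paper
        x = x + self.mlp(self.norm2(x))
        return x

print(PoolFormerBlock(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```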
Swin Transformer (Liu et al., 2021) bridges CNNs and ViT by reintroducing hierarchy and locality:
Key Innovations:
Hierarchical Feature Maps: Like CNNs, Swin produces multi-scale features (56×56 → 28×28 → 14×14 → 7×7). This enables dense prediction tasks (detection, segmentation).
Window-based Attention: Instead of global attention (O(n²)), compute attention within local windows (e.g., 7×7). Linear complexity in image size.
Shifted Windows: Alternate between regular and shifted window partitions to allow cross-window connections without global attention.
Patch Merging: Downsample by merging 2×2 neighboring patches, similar to strided convolution.
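The patch-merging step is simple enough to sketch directly. This is a simplified rendering that assumes (B, H, W, C) inputs with even H and W, matching the layout used by the window utilities below.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample by concatenating each 2x2 group of patches and projecting 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```

Applied three times, this produces the 56×56 → 28×28 → 14×14 → 7×7 hierarchy described above.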
```python
def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Partition a feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)


def window_reverse(windows, window_size: int, H: int, W: int) -> torch.Tensor:
    """Reverse the window partition."""
    B = windows.shape[0] // (H * W // window_size ** 2)
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)


class WindowAttention(nn.Module):
    """Window-based multi-head self-attention with relative position bias."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Relative position bias: one learnable scalar per head for each possible (dy, dx) offset
        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads)
        )
        # Precompute, for every pair of positions in a window, the index into the bias table
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        coords = coords.flatten(1)                                     # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]                  # (2, N, N)
        rel = rel.permute(1, 2, 0) + (window_size - 1)                 # shift offsets to be non-negative
        rel[:, :, 0] *= 2 * window_size - 1
        self.register_buffer("relative_position_index", rel.sum(-1))   # (N, N)

    def forward(self, x):
        B_, N, C = x.shape  # B_ = batch * num_windows, N = window_size^2
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn = (q @ k.transpose(-2, -1)) * self.scale

        # Add relative position bias
        bias = self.relative_position_bias_table[self.relative_position_index.view(-1)]
        attn = attn + bias.view(N, N, -1).permute(2, 0, 1).unsqueeze(0)

        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(x)
```

ViT vs. Swin Transformer:
| Aspect | ViT | Swin Transformer |
|---|---|---|
| Feature maps | Single scale | Multi-scale (hierarchical) |
| Attention scope | Global | Local (shifted windows) |
| Complexity | O(n²) | O(n) (linear) |
| Dense prediction | Requires adaptation | Native support |
| Pre-training requirement | Large-scale data | Works with ImageNet-1K |
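To make the complexity row concrete, compare attention-matrix sizes at a 56×56 feature map, the first Swin stage. This is a rough count of attention scores that ignores heads, projections, and the shifted pass.

```python
H = W = 56
n = H * W                                    # 3136 tokens for global attention
global_scores = n * n                        # ~9.8M attention scores per layer

window = 7
num_windows = (H // window) * (W // window)  # 64 windows
local_scores = num_windows * (window * window) ** 2  # 64 * 49^2 = ~154K scores

print(global_scores, local_scores, global_scores / local_scores)  # windowing cuts the count by 64x here
```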
Many state-of-the-art architectures now combine both paradigms: convolutional stems for efficient early processing, attention for later stages where global context matters. The best architecture often depends on your specific constraints (data, compute, deployment) rather than which paradigm is theoretically superior.
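As an illustration of the hybrid pattern, here is a sketch loosely in the spirit of a ViT with a convolutional stem. The class name and layer sizes are arbitrary, and it reuses the `TransformerBlock` defined earlier on this page rather than any library model.

```python
import torch
import torch.nn as nn

class HybridBackbone(nn.Module):
    """Convolutional stem for early, local processing; Transformer blocks for global context.
    A real model would typically also add position embeddings to the token sequence."""
    def __init__(self, embed_dim=384, depth=6, num_heads=6):
        super().__init__()
        # Conv stem: 224 -> 112 -> 56 -> 28 spatial resolution, ending at embed_dim channels.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.GELU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1),
        )
        self.blocks = nn.Sequential(*[TransformerBlock(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                 # (B, 3, 224, 224)
        x = self.stem(x)                  # (B, embed_dim, 28, 28)
        x = x.flatten(2).transpose(1, 2)  # (B, 784, embed_dim) token sequence
        x = self.norm(self.blocks(x))
        return x.mean(dim=1)              # global average pool over tokens
```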
You've completed the Modern CNN Architectures module! You now understand EfficientNet's compound scaling, MobileNet/ShuffleNet for edge deployment, Neural Architecture Search, ConvNeXt's modernization, and the relationship between CNNs and Vision Transformers. This knowledge prepares you to select, adapt, and understand the architectures powering modern computer vision.