The story of modern computer vision is one of surprising convergence. Vision Transformers (ViT) arrived in 2020 claiming to replace convolutions entirely—yet the most successful architectures today borrow from both paradigms.
From Competition to Synthesis:
Understanding this connection is crucial for modern practitioners who must choose and design architectures for real-world applications.
This page explores Vision Transformer fundamentals, the relationship between convolution and attention, Swin Transformer's hierarchical design, hybrid architectures, and practical guidance on when to choose CNNs vs. Transformers.
ViT Architecture (Dosovitskiy et al., 2020):
ViT applies the standard Transformer encoder to image patches:
Patch Embedding: Split image into non-overlapping patches (e.g., 16×16). Flatten and linearly project each patch to an embedding dimension.
Position Embedding: Add learnable position embeddings to preserve spatial information.
Transformer Encoder: Stack of self-attention + MLP blocks with LayerNorm and residual connections.
Classification Head: Use a learnable [CLS] token (or global average pool) for class prediction.
Key Insight: ViT treats an image as a sequence of tokens (patches) and processes them with global self-attention, allowing each patch to attend to every other patch from the first layer.
```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Convert an image into a sequence of patch embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches (equivalent to a conv with kernel = stride = patch_size)
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        x = self.proj(x)
        x = x.flatten(2).transpose(1, 2)
        return x


class TransformerBlock(nn.Module):
    """Standard (pre-norm) Transformer encoder block."""
    def __init__(self, dim, num_heads, mlp_ratio=4.0, drop=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(), nn.Dropout(drop),
            nn.Linear(int(dim * mlp_ratio), dim), nn.Dropout(drop)
        )

    def forward(self, x):
        y = self.norm1(x)
        x = x + self.attn(y, y, y)[0]        # self-attention + residual
        x = x + self.mlp(self.norm2(x))      # MLP + residual
        return x


class VisionTransformer(nn.Module):
    """Simplified Vision Transformer."""
    def __init__(self, img_size=224, patch_size=16, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)  # standard implementations avoid all-zero position embeddings

        self.blocks = nn.Sequential(*[
            TransformerBlock(embed_dim, num_heads) for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)
        x = x + self.pos_embed
        x = self.blocks(x)
        x = self.norm(x)
        return self.head(x[:, 0])  # prediction from the [CLS] token
```

ViT lacks the inductive biases (locality, translation equivariance) that CNNs have built in. It therefore requires large-scale pre-training (ImageNet-21K or JFT-300M) to match CNN performance; trained on ImageNet-1K alone, ViT underperforms well-tuned ResNets.
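As a quick smoke test of the simplified model above (using only the classes just defined): a 224×224 image with 16×16 patches yields 14×14 = 196 patch tokens plus one [CLS] token.

```python
# Shape check: internally the model processes a (2, 197, 768) token sequence.
model = VisionTransformer(img_size=224, patch_size=16, num_classes=1000)
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```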
Fundamental Differences:
| Aspect | Convolution | Self-Attention |
|---|---|---|
| Receptive field | Local (kernel size) | Global (all positions) |
| Weight sharing | Spatial (same kernel reused everywhere) | Projections shared across positions; mixing weights are data-dependent |
| Translation equivariance | Built-in | Requires position encoding |
| Compute complexity | O(K² × C² × HW) | O((HW)² × C) |
| Parameter count | O(K² × C²) | O(C²) |
| Inductive bias | Strong (locality) | Weak (learned from data) |
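To make the complexity row concrete, here is a rough back-of-the-envelope count of multiply-accumulates for a single layer. The helper names and sizes are arbitrary choices for illustration, and the attention count ignores the O(HW × C²) cost of the Q/K/V projections.

```python
def conv_macs(H, W, C, K):
    # Standard convolution: each of H*W positions applies a K x K x C kernel for each of C outputs.
    return H * W * C * C * K * K

def attention_macs(H, W, C):
    # Global self-attention: score computation (n^2 * C) plus value aggregation (n^2 * C).
    n = H * W
    return 2 * n * n * C

H = W = 56; C = 96; K = 3
print(f"conv:      {conv_macs(H, W, C, K):,}")    # ~2.6e8
print(f"attention: {attention_macs(H, W, C):,}")  # ~1.9e9, the (HW)^2 term dominates at high resolution
```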
The Deep Connection:
Surprisingly, convolution and attention are more related than they first appear:
Attention as Dynamic Convolution: Self-attention can be viewed as a convolution with content-dependent, spatially-varying kernels. Where convolution applies the same kernel everywhere, attention computes custom weights for each position based on content.
Convolution as Constrained Attention: A convolution is equivalent to attention where the attention weights are fixed, local, and content-independent.
Depthwise Convolution ≈ Local Attention: Large depthwise kernels (like ConvNeXt's 7×7) approximate local attention—they mix spatial information within a neighborhood. This is why ConvNeXt can match Transformers: large kernels provide a similar inductive bias to local attention.
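A toy check of the "constrained attention" view above: if the attention weights are frozen to be local, uniform, and content-independent, the operation collapses to a depthwise convolution with a uniform kernel. This is a sketch for intuition only; it uses `F.unfold` to gather each 3×3 neighborhood and then averages it with fixed weights.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 1, 4, 8, 8
x = torch.randn(B, C, H, W)

# Path 1: depthwise convolution with a uniform 3x3 kernel (a local average per channel).
kernel = torch.full((C, 1, 3, 3), 1.0 / 9)
conv_out = F.conv2d(x, kernel, padding=1, groups=C)

# Path 2: "attention" with hard-coded weights: gather each 3x3 neighborhood,
# then mix it with fixed, uniform weights instead of content-dependent ones.
patches = F.unfold(x, kernel_size=3, padding=1)   # (B, C*9, H*W)
patches = patches.view(B, C, 9, H * W)
attn_out = patches.mean(dim=2).view(B, C, H, W)

print(torch.allclose(conv_out, attn_out, atol=1e-6))  # True
```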
The MetaFormer paper showed that the Transformer's success comes largely from its general architecture (token mixing + channel mixing + residuals + norms), not specifically from attention. Replacing attention with spatial pooling (PoolFormer) still achieves strong results—suggesting the 'what' of token mixing matters less than 'that' it happens.
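A minimal sketch of the MetaFormer template, with pooling standing in for attention as in PoolFormer. The class name and layer sizes below are my own simplified rendering, not the reference implementation.

```python
import torch
import torch.nn as nn

class PoolFormerBlock(nn.Module):
    """MetaFormer template: token mixing + channel MLP, each wrapped in norm + residual.
    The token mixer here is plain average pooling (as in PoolFormer), not attention."""
    def __init__(self, dim, pool_size=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # channel-wise norm for (B, C, H, W) tensors
        self.token_mixer = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                        count_include_pad=False)
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(          # channel mixing with 1x1 convs
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1)
        )

    def forward(self, x):                  # x: (B, C, H, W)
        y = self.norm1(x)
        x = x + (self.token_mixer(y) - y)  # pooling minus identity, following the PoolFormer paper
        x = x + self.mlp(self.norm2(x))
        return x

print(PoolFormerBlock(96)(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 96, 56, 56])
```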
Swin Transformer (Liu et al., 2021) bridges CNNs and ViT by reintroducing hierarchy and locality:
Key Innovations:
Hierarchical Feature Maps: Like CNNs, Swin produces multi-scale features (56×56 → 28×28 → 14×14 → 7×7). This enables dense prediction tasks (detection, segmentation).
Window-based Attention: Instead of global attention (O(n²)), compute attention within local windows (e.g., 7×7). Linear complexity in image size.
Shifted Windows: Alternate between regular and shifted window partitions to allow cross-window connections without global attention.
Patch Merging: Downsample by merging 2×2 neighboring patches, similar to strided convolution.
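The patch-merging step is simple enough to sketch directly. This is a simplified rendering that assumes (B, H, W, C) inputs with even H and W, matching the layout used by the window utilities below.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample by concatenating each 2x2 group of patches and projecting 4C -> 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]                 # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```

Applied three times, this produces the 56×56 → 28×28 → 14×14 → 7×7 hierarchy described above.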
```python
def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Partition a feature map into non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)


def window_reverse(windows, window_size: int, H: int, W: int) -> torch.Tensor:
    """Reverse the window partition."""
    B = windows.shape[0] // (H * W // window_size ** 2)
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)


class WindowAttention(nn.Module):
    """Window-based multi-head self-attention with relative position bias."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Relative position bias: one learnable scalar per head for each possible (dy, dx) offset
        self.relative_position_bias_table = nn.Parameter(
            torch.zeros((2 * window_size - 1) ** 2, num_heads)
        )
        # Precompute, for every pair of positions in a window, the index into the bias table
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        coords = coords.flatten(1)                                     # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]                  # (2, N, N)
        rel = rel.permute(1, 2, 0) + (window_size - 1)                 # shift offsets to be non-negative
        rel[:, :, 0] *= 2 * window_size - 1
        self.register_buffer("relative_position_index", rel.sum(-1))   # (N, N)

    def forward(self, x):
        B_, N, C = x.shape  # B_ = batch * num_windows, N = window_size^2
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn = (q @ k.transpose(-2, -1)) * self.scale

        # Add relative position bias
        bias = self.relative_position_bias_table[self.relative_position_index.view(-1)]
        attn = attn + bias.view(N, N, -1).permute(2, 0, 1).unsqueeze(0)

        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        return self.proj(x)
```

ViT vs. Swin Transformer:
| Aspect | ViT | Swin Transformer |
|---|---|---|
| Feature maps | Single scale | Multi-scale (hierarchical) |
| Attention scope | Global | Local (shifted windows) |
| Complexity | O(n²) | O(n) (linear) |
| Dense prediction | Requires adaptation | Native support |
| Pre-training requirement | Large-scale data | Works with ImageNet-1K |
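To make the complexity row concrete, compare attention-matrix sizes at a 56×56 feature map, the first Swin stage. This is a rough count of attention scores that ignores heads, projections, and the shifted pass.

```python
H = W = 56
n = H * W                                    # 3136 tokens for global attention
global_scores = n * n                        # ~9.8M attention scores per layer

window = 7
num_windows = (H // window) * (W // window)  # 64 windows
local_scores = num_windows * (window * window) ** 2  # 64 * 49^2 = ~154K scores

print(global_scores, local_scores, global_scores / local_scores)  # windowing cuts the count by 64x here
```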
Many state-of-the-art architectures now combine both paradigms: convolutional stems for efficient early processing, attention for later stages where global context matters. The best architecture often depends on your specific constraints (data, compute, deployment) rather than which paradigm is theoretically superior.
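As an illustration of the hybrid pattern, here is a sketch loosely in the spirit of a ViT with a convolutional stem. The class name and layer sizes are arbitrary, and it reuses the `TransformerBlock` defined earlier on this page rather than any library model.

```python
import torch
import torch.nn as nn

class HybridBackbone(nn.Module):
    """Convolutional stem for early, local processing; Transformer blocks for global context.
    A real model would typically also add position embeddings to the token sequence."""
    def __init__(self, embed_dim=384, depth=6, num_heads=6):
        super().__init__()
        # Conv stem: 224 -> 112 -> 56 -> 28 spatial resolution, ending at embed_dim channels.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.GELU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1),
        )
        self.blocks = nn.Sequential(*[TransformerBlock(embed_dim, num_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                 # (B, 3, 224, 224)
        x = self.stem(x)                  # (B, embed_dim, 28, 28)
        x = x.flatten(2).transpose(1, 2)  # (B, 784, embed_dim) token sequence
        x = self.norm(self.blocks(x))
        return x.mean(dim=1)              # global average pool over tokens
```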
You've completed the Modern CNN Architectures module! You now understand EfficientNet's compound scaling, MobileNet/ShuffleNet for edge deployment, Neural Architecture Search, ConvNeXt's modernization, and the relationship between CNNs and Vision Transformers. This knowledge prepares you to select, adapt, and understand the architectures powering modern computer vision.