For decades, neural network architecture design was a manual process requiring deep expertise and extensive experimentation. Researchers would propose architectures based on intuition, test them empirically, and iterate. This produced milestones like AlexNet, VGG, and ResNet—but the process was slow, expensive, and limited by human creativity.
Neural Architecture Search (NAS) automates this process by treating architecture design as an optimization problem. Given a search space of possible architectures and a performance metric, NAS algorithms automatically discover high-performing designs. The results have been remarkable: NAS-discovered architectures now match or exceed the best human-designed networks across many tasks.
This page covers NAS fundamentals: search spaces, search strategies (reinforcement learning, evolutionary, differentiable), performance estimation strategies, and landmark NAS-discovered architectures like NASNet, AmoebaNet, and DARTS.
Every NAS system consists of three core components:
1. Search Space: Defines what architectures can be discovered, including the set of candidate operations, how they may be connected, and the network's overall macro-structure.
2. Search Strategy: The algorithm that explores the search space, such as reinforcement learning, evolutionary algorithms, or gradient-based (differentiable) methods.
3. Performance Estimation: How candidate architectures are evaluated, for example with full training, shortened proxy training, weight sharing, or learned performance predictors.
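To make the interplay concrete, here is a minimal, framework-agnostic sketch of a NAS loop: a toy search space, random search as the strategy, and a stubbed-out scoring function standing in for performance estimation (all names are illustrative, not from any particular library):

```python
import random

# Toy search space: one operation type, a depth, and a width per candidate.
SEARCH_SPACE = {
    'op': ['sep_conv_3x3', 'sep_conv_5x5', 'max_pool_3x3', 'skip'],
    'num_layers': [8, 12, 16],
    'width': [16, 32, 64],
}

def sample_architecture():
    """Search strategy: plain random search over the space."""
    return {key: random.choice(choices) for key, choices in SEARCH_SPACE.items()}

def estimate_performance(arch):
    """Performance estimation: a real system would train the candidate
    (fully, briefly, or via weight sharing) and return validation accuracy.
    Stubbed with a random score so the sketch runs standalone."""
    return random.random()

def random_search_nas(num_trials=100):
    best_arch, best_score = None, float('-inf')
    for _ in range(num_trials):
        arch = sample_architecture()        # explore the search space
        score = estimate_performance(arch)  # evaluate the candidate
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best_arch, best_score = random_search_nas()
```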
| Strategy | Compute Cost | Strengths | Weaknesses |
|---|---|---|---|
| RL-based | Very High | Can explore discrete spaces | Sample inefficient, expensive |
| Evolutionary | High | Parallel, robust | Slow convergence |
| Differentiable | Low | Efficient, gradient-based | Search space restrictions |
| Weight Sharing | Medium | Amortized evaluation | Ranking inconsistency |
Cell-Based Search Spaces:
Modern NAS typically searches for a repeating cell (or block) that is stacked to form the full network. This dramatically reduces the search space while producing transferable designs.
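As a rough sketch of that macro-structure (hypothetical and simplified: `cell_fn` stands in for whatever cell the search discovers), a full network can be assembled by stacking the cell, with stride-2 reduction cells at one-third and two-thirds of the depth as in NASNet- and DARTS-style models:

```python
import torch.nn as nn

class CellStackNetwork(nn.Module):
    """Stacks a searched cell into a full image classifier (sketch only).

    `cell_fn(C, reduction)` is assumed to build one cell operating on C
    channels; `reduction=True` means a stride-2 cell that downsamples.
    Real NASNet/DARTS networks also double C at each reduction and feed
    every cell the outputs of its two preceding cells (omitted here).
    """
    def __init__(self, cell_fn, C=36, num_cells=20, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, padding=1, bias=False),
            nn.BatchNorm2d(C),
        )
        self.cells = nn.ModuleList()
        for i in range(num_cells):
            reduction = i in (num_cells // 3, 2 * num_cells // 3)
            self.cells.append(cell_fn(C, reduction))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(C, num_classes),
        )

    def forward(self, x):
        x = self.stem(x)
        for cell in self.cells:
            x = cell(x)
        return self.classifier(x)
```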
Operations in Typical Search Spaces:
```python
import torch.nn as nn

# Standard NAS operation set (DARTS-style)
OPERATIONS = {
    'none': lambda C, stride: Zero(stride),
    'skip': lambda C, stride: Identity() if stride == 1 else FactorizedReduce(C, C),
    'avg_pool_3x3': lambda C, stride: nn.AvgPool2d(3, stride=stride, padding=1),
    'max_pool_3x3': lambda C, stride: nn.MaxPool2d(3, stride=stride, padding=1),
    'sep_conv_3x3': lambda C, stride: SepConv(C, C, 3, stride, 1),
    'sep_conv_5x5': lambda C, stride: SepConv(C, C, 5, stride, 2),
    'dil_conv_3x3': lambda C, stride: DilConv(C, C, 3, stride, 2, 2),
    'dil_conv_5x5': lambda C, stride: DilConv(C, C, 5, stride, 4, 2),
}

class SepConv(nn.Module):
    """Separable convolution: depthwise + pointwise, applied twice."""
    def __init__(self, C_in, C_out, kernel, stride, padding):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel, stride, padding, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out),
            nn.ReLU(inplace=False),
            nn.Conv2d(C_out, C_out, kernel, 1, padding, groups=C_out, bias=False),
            nn.Conv2d(C_out, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )

    def forward(self, x):
        return self.op(x)
```

Reinforcement Learning NAS (2017):
The original NAS paper by Zoph & Le used an RNN controller that outputs architecture descriptions as sequences. The controller is trained with REINFORCE to maximize expected validation accuracy.
Cost: The original NAS required 800 GPUs for 28 days (22,400 GPU-days) to search on CIFAR-10.
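A heavily simplified sketch of this setup: an LSTM controller samples one operation per layer, the sampled architecture's validation accuracy serves as the reward, and the controller is updated with REINFORCE against a moving-average baseline (the reward function below is a stub; in the original method it would train the sampled child network):

```python
import torch
import torch.nn as nn

OPS = ['sep_conv_3x3', 'sep_conv_5x5', 'max_pool_3x3', 'skip']

class Controller(nn.Module):
    """RNN controller that emits one operation choice per layer."""
    def __init__(self, num_layers=6, hidden=64):
        super().__init__()
        self.num_layers = num_layers
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.embed = nn.Embedding(len(OPS), hidden)
        self.head = nn.Linear(hidden, len(OPS))

    def sample(self):
        h = c = torch.zeros(1, self.rnn.hidden_size)
        inp = torch.zeros(1, self.rnn.hidden_size)
        actions, log_probs = [], []
        for _ in range(self.num_layers):
            h, c = self.rnn(inp, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            action = dist.sample()
            actions.append(action.item())
            log_probs.append(dist.log_prob(action))
            inp = self.embed(action)
        return actions, torch.stack(log_probs).sum()

def evaluate_architecture(actions):
    """Stub reward: the real method trains the sampled child network
    and returns its validation accuracy."""
    return torch.rand(1).item()

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0  # moving-average baseline reduces gradient variance

for step in range(500):
    actions, log_prob = controller.sample()
    reward = evaluate_architecture(actions)
    baseline = 0.95 * baseline + 0.05 * reward
    loss = -(reward - baseline) * log_prob  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```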
Evolutionary NAS (AmoebaNet, 2019):
Evolutionary algorithms maintain a population of architectures that evolve through mutation and selection: the best member of a randomly sampled tournament is chosen as a parent, mutated (for example by swapping one operation or connection), trained, and added to the population, while the oldest member is removed (the regularized or "aging" evolution used by AmoebaNet), as sketched below.
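Here is a minimal sketch of that regularized-evolution loop in the spirit of AmoebaNet (the `evaluate` function is a stub standing in for training the candidate architecture):

```python
import collections
import random

OPS = ['sep_conv_3x3', 'sep_conv_5x5', 'max_pool_3x3', 'avg_pool_3x3', 'skip']
NUM_LAYERS = 8

def random_architecture():
    return [random.choice(OPS) for _ in range(NUM_LAYERS)]

def mutate(arch):
    """Mutation: change a single randomly chosen operation."""
    child = list(arch)
    child[random.randrange(NUM_LAYERS)] = random.choice(OPS)
    return child

def evaluate(arch):
    """Stub fitness: a real system trains the architecture and
    returns its validation accuracy."""
    return random.random()

def regularized_evolution(population_size=50, cycles=500, sample_size=10):
    # Population is a FIFO queue: the oldest member dies each cycle (aging).
    population = collections.deque()
    history = []
    for _ in range(population_size):
        arch = random_architecture()
        population.append((arch, evaluate(arch)))
    history.extend(population)

    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent, _ = max(tournament, key=lambda item: item[1])  # selection
        child = mutate(parent)
        population.append((child, evaluate(child)))
        population.popleft()  # evict the oldest member, not the worst
        history.append(population[-1])

    return max(history, key=lambda item: item[1])

best_arch, best_acc = regularized_evolution()
```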
Result: AmoebaNet-A achieved 83.9% top-1 on ImageNet, matching the best RL-based results while being more interpretable.
Early NAS methods required enormous compute: thousands of GPU-days for a single search. This motivated the development of efficient NAS techniques: weight sharing, differentiable search, and zero-cost proxies.
DARTS (2019) made NAS dramatically more efficient by making the search space continuous and differentiable.
Key Insight: Instead of selecting one operation per edge, compute a weighted sum of all possible operations. Learn the weights (architecture parameters α) via gradient descent alongside the network weights.
Mixed Operation: For each edge between nodes, compute:
ō(x) = Σᵢ [exp(αᵢ) / Σⱼexp(αⱼ)] · oᵢ(x)
where αᵢ are learnable architecture parameters and oᵢ are candidate operations.
Bi-Level Optimization: DARTS alternates between two coupled problems: the network weights w are trained to minimize the training loss for the current architecture encoding, while the architecture parameters α are updated to minimize the validation loss. Formally: min_α L_val(w*(α), α), subject to w*(α) = argmin_w L_train(w, α). In practice, DARTS approximates w*(α) with a single inner gradient step (second-order variant) or simply the current weights (first-order variant) rather than solving the inner problem to convergence.
Efficiency: DARTS reduces search from thousands of GPU-days to 1-4 GPU-days.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Mixed operation: weighted sum of all candidate operations."""
    def __init__(self, C: int, stride: int):
        super().__init__()
        self.ops = nn.ModuleList([
            OPERATIONS[name](C, stride) for name in OPERATIONS.keys()
        ])

    def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor
            weights: Softmax-normalized architecture weights (one per operation)
        """
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class DARTSCell(nn.Module):
    """Differentiable cell with learnable architecture."""
    def __init__(self, C: int, num_nodes: int = 4):
        super().__init__()
        self.num_nodes = num_nodes

        # Each node receives input from all previous nodes
        self.ops = nn.ModuleDict()
        for i in range(num_nodes):
            for j in range(i + 2):  # +2 for two input nodes
                self.ops[f'{j}_{i}'] = MixedOp(C, stride=1)

        # Architecture parameters: one alpha vector per edge
        num_edges = sum(i + 2 for i in range(num_nodes))
        num_ops = len(OPERATIONS)
        self.alphas = nn.Parameter(torch.randn(num_edges, num_ops) * 1e-3)

    def forward(self, s0: torch.Tensor, s1: torch.Tensor) -> torch.Tensor:
        states = [s0, s1]
        edge_idx = 0
        for i in range(self.num_nodes):
            # Compute output for node i from all previous nodes
            node_inputs = []
            for j in range(len(states)):
                weights = F.softmax(self.alphas[edge_idx], dim=0)
                node_inputs.append(self.ops[f'{j}_{i}'](states[j], weights))
                edge_idx += 1
            states.append(sum(node_inputs))

        # Concatenate intermediate nodes (exclude inputs)
        return torch.cat(states[2:], dim=1)
```

After search, DARTS discretizes the architecture by selecting the top-k operations (by α magnitude) for each node. The final architecture uses only these selected operations—no weighted sums during inference.
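To connect the cell above to the bi-level optimization described earlier, below is a minimal first-order sketch of the alternating updates. The `model`, data batches, and optimizer setup are hypothetical; the sketch only assumes the architecture parameters (e.g., `model.alphas` from `DARTSCell`) are registered on a separate optimizer from the regular weights:

```python
import torch.nn.functional as F

def darts_search_step(model, w_optimizer, alpha_optimizer, train_batch, val_batch):
    """One alternating step of first-order DARTS (sketch).

    w_optimizer holds the network weights; alpha_optimizer holds only the
    architecture parameters (e.g., model.alphas).
    """
    x_train, y_train = train_batch
    x_val, y_val = val_batch

    # Outer step: update architecture parameters alpha on validation data.
    alpha_optimizer.zero_grad()
    val_loss = F.cross_entropy(model(x_val), y_val)
    val_loss.backward()
    alpha_optimizer.step()

    # Inner step: update network weights w on training data.
    w_optimizer.zero_grad()
    train_loss = F.cross_entropy(model(x_train), y_train)
    train_loss.backward()
    w_optimizer.step()

    return train_loss.item(), val_loss.item()

# Typical (hypothetical) setup, roughly following the DARTS paper's values:
#   w_optimizer = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9, weight_decay=3e-4)
#   alpha_optimizer = torch.optim.Adam([model.alphas], lr=3e-4, weight_decay=1e-3)
#   for train_batch, val_batch in zip(train_loader, val_loader):
#       darts_search_step(model, w_optimizer, alpha_optimizer, train_batch, val_batch)
```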
| Architecture | Year | Search Method | Key Achievement |
|---|---|---|---|
| NASNet-A | 2018 | RL | First NAS to beat human designs on ImageNet |
| AmoebaNet-A | 2019 | Evolutionary | Matched NASNet with interpretable evolution |
| DARTS | 2019 | Differentiable | Reduced search to 1-4 GPU-days |
| EfficientNet-B0 | 2019 | RL + compound scaling | SOTA efficiency-accuracy tradeoff |
| MobileNetV3 | 2019 | RL + manual refinement | Best mobile architecture |
| RegNet | 2020 | Design space search | Discovered general network design principles |
RegNet Insight: Rather than searching for specific architectures, RegNet searched for design space constraints. The authors discovered that the best networks follow a simple linear rule for block widths, wⱼ = w₀ + wₐ·j for block index j, which is then quantized so that blocks group into a few uniform stages. This produced a family of networks competitive with EfficientNet but simpler to understand and scale.
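As a small numeric illustration of the linear rule (example values only, not taken from a specific RegNet model):

```python
def linear_widths(w_0, w_a, depth):
    """Per-block widths under the RegNet linear rule w_j = w_0 + w_a * j.

    The actual RegNet models further quantize these widths (rounding in
    log-space to a multiplier w_m, then to multiples of 8) so that blocks
    group into a few uniform stages; that step is omitted here.
    """
    return [w_0 + w_a * j for j in range(depth)]

print(linear_widths(w_0=24, w_a=36, depth=6))
# [24, 60, 96, 132, 168, 204]
```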
You now understand Neural Architecture Search: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation, and landmark architectures. Next, we'll explore ConvNeXt—a return to pure ConvNet design that challenges the dominance of Vision Transformers.