For decades, neural network architecture design was a manual process requiring deep expertise and extensive experimentation. Researchers would propose architectures based on intuition, test them empirically, and iterate. This produced milestones like AlexNet, VGG, and ResNet—but the process was slow, expensive, and limited by human creativity.
Neural Architecture Search (NAS) automates this process by treating architecture design as an optimization problem. Given a search space of possible architectures and a performance metric, NAS algorithms automatically discover high-performing designs. The results have been remarkable: NAS-discovered architectures now match or exceed the best human-designed networks across many tasks.
This page covers NAS fundamentals: search spaces, search strategies (reinforcement learning, evolutionary, differentiable), performance estimation strategies, and landmark NAS-discovered architectures like NASNet, AmoebaNet, and DARTS.
Every NAS system consists of three core components:
1. Search Space: Defines what architectures can be discovered, including the set of candidate operations, how they may be connected, and the network's overall macro-structure.
2. Search Strategy: The algorithm that explores the search space, such as reinforcement learning, evolutionary algorithms, or gradient-based (differentiable) methods.
3. Performance Estimation: How candidate architectures are evaluated, for example with full training, shortened proxy training, weight sharing, or learned performance predictors.
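To make the interplay concrete, here is a minimal, framework-agnostic sketch of a NAS loop: a toy search space, random search as the strategy, and a stubbed-out scoring function standing in for performance estimation (all names are illustrative, not from any particular library):

```python
import random

# Toy search space: one operation type, a depth, and a width per candidate.
SEARCH_SPACE = {
    'op': ['sep_conv_3x3', 'sep_conv_5x5', 'max_pool_3x3', 'skip'],
    'num_layers': [8, 12, 16],
    'width': [16, 32, 64],
}

def sample_architecture():
    """Search strategy: plain random search over the space."""
    return {key: random.choice(choices) for key, choices in SEARCH_SPACE.items()}

def estimate_performance(arch):
    """Performance estimation: a real system would train the candidate
    (fully, briefly, or via weight sharing) and return validation accuracy.
    Stubbed with a random score so the sketch runs standalone."""
    return random.random()

def random_search_nas(num_trials=100):
    best_arch, best_score = None, float('-inf')
    for _ in range(num_trials):
        arch = sample_architecture()        # explore the search space
        score = estimate_performance(arch)  # evaluate the candidate
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

best_arch, best_score = random_search_nas()
```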
| Strategy | Compute Cost | Strengths | Weaknesses |
|---|---|---|---|
| RL-based | Very High | Can explore discrete spaces | Sample inefficient, expensive |
| Evolutionary | High | Parallel, robust | Slow convergence |
| Differentiable | Low | Efficient, gradient-based | Search space restrictions |
| Weight Sharing | Medium | Amortized evaluation | Ranking inconsistency |
Cell-Based Search Spaces:
Modern NAS typically searches for a repeating cell (or block) that is stacked to form the full network. This dramatically reduces the search space while producing transferable designs.
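As a rough sketch of that macro-structure (hypothetical and simplified: `cell_fn` stands in for whatever cell the search discovers), a full network can be assembled by stacking the cell, with stride-2 reduction cells at one-third and two-thirds of the depth as in NASNet- and DARTS-style models:

```python
import torch.nn as nn

class CellStackNetwork(nn.Module):
    """Stacks a searched cell into a full image classifier (sketch only).

    `cell_fn(C, reduction)` is assumed to build one cell operating on C
    channels; `reduction=True` means a stride-2 cell that downsamples.
    Real NASNet/DARTS networks also double C at each reduction and feed
    every cell the outputs of its two preceding cells (omitted here).
    """
    def __init__(self, cell_fn, C=36, num_cells=20, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, C, 3, padding=1, bias=False),
            nn.BatchNorm2d(C),
        )
        self.cells = nn.ModuleList()
        for i in range(num_cells):
            reduction = i in (num_cells // 3, 2 * num_cells // 3)
            self.cells.append(cell_fn(C, reduction))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(C, num_classes),
        )

    def forward(self, x):
        x = self.stem(x)
        for cell in self.cells:
            x = cell(x)
        return self.classifier(x)
```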
Operations in Typical Search Spaces:
```python
import torch.nn as nn

# Standard NAS operation set (DARTS-style)
OPERATIONS = {
    'none': lambda C, stride: Zero(stride),
    'skip': lambda C, stride: Identity() if stride == 1 else FactorizedReduce(C, C),
    'avg_pool_3x3': lambda C, stride: nn.AvgPool2d(3, stride=stride, padding=1),
    'max_pool_3x3': lambda C, stride: nn.MaxPool2d(3, stride=stride, padding=1),
    'sep_conv_3x3': lambda C, stride: SepConv(C, C, 3, stride, 1),
    'sep_conv_5x5': lambda C, stride: SepConv(C, C, 5, stride, 2),
    'dil_conv_3x3': lambda C, stride: DilConv(C, C, 3, stride, 2, 2),
    'dil_conv_5x5': lambda C, stride: DilConv(C, C, 5, stride, 4, 2),
}

class SepConv(nn.Module):
    """Separable convolution: depthwise + pointwise, applied twice."""
    def __init__(self, C_in, C_out, kernel, stride, padding):
        super().__init__()
        self.op = nn.Sequential(
            nn.ReLU(inplace=False),
            nn.Conv2d(C_in, C_in, kernel, stride, padding, groups=C_in, bias=False),
            nn.Conv2d(C_in, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out),
            nn.ReLU(inplace=False),
            nn.Conv2d(C_out, C_out, kernel, 1, padding, groups=C_out, bias=False),
            nn.Conv2d(C_out, C_out, 1, bias=False),
            nn.BatchNorm2d(C_out)
        )

    def forward(self, x):
        return self.op(x)
```

Reinforcement Learning NAS (2017):
The original NAS paper by Zoph & Le used an RNN controller that outputs architecture descriptions as sequences. The controller is trained with REINFORCE to maximize expected validation accuracy.
Cost: The original NAS required 800 GPUs for 28 days (22,400 GPU-days) to search on CIFAR-10.
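A heavily simplified sketch of this setup: an LSTM controller samples one operation per layer, the sampled architecture's validation accuracy serves as the reward, and the controller is updated with REINFORCE against a moving-average baseline (the reward function below is a stub; in the original method it would train the sampled child network):

```python
import torch
import torch.nn as nn

OPS = ['sep_conv_3x3', 'sep_conv_5x5', 'max_pool_3x3', 'skip']

class Controller(nn.Module):
    """RNN controller that emits one operation choice per layer."""
    def __init__(self, num_layers=6, hidden=64):
        super().__init__()
        self.num_layers = num_layers
        self.rnn = nn.LSTMCell(hidden, hidden)
        self.embed = nn.Embedding(len(OPS), hidden)
        self.head = nn.Linear(hidden, len(OPS))

    def sample(self):
        h = c = torch.zeros(1, self.rnn.hidden_size)
        inp = torch.zeros(1, self.rnn.hidden_size)
        actions, log_probs = [], []
        for _ in range(self.num_layers):
            h, c = self.rnn(inp, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            action = dist.sample()
            actions.append(action.item())
            log_probs.append(dist.log_prob(action))
            inp = self.embed(action)
        return actions, torch.stack(log_probs).sum()

def evaluate_architecture(actions):
    """Stub reward: the real method trains the sampled child network
    and returns its validation accuracy."""
    return torch.rand(1).item()

controller = Controller()
optimizer = torch.optim.Adam(controller.parameters(), lr=3e-4)
baseline = 0.0  # moving-average baseline reduces gradient variance

for step in range(500):
    actions, log_prob = controller.sample()
    reward = evaluate_architecture(actions)
    baseline = 0.95 * baseline + 0.05 * reward
    loss = -(reward - baseline) * log_prob  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```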
Evolutionary NAS (AmoebaNet, 2019):
Evolutionary algorithms maintain a population of architectures that evolve through mutation and selection: the best member of a randomly sampled tournament is chosen as a parent, mutated (for example by swapping one operation or connection), trained, and added to the population, while the oldest member is removed (the regularized or "aging" evolution used by AmoebaNet), as sketched below.
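Here is a minimal sketch of that regularized-evolution loop in the spirit of AmoebaNet (the `evaluate` function is a stub standing in for training the candidate architecture):

```python
import collections
import random

OPS = ['sep_conv_3x3', 'sep_conv_5x5', 'max_pool_3x3', 'avg_pool_3x3', 'skip']
NUM_LAYERS = 8

def random_architecture():
    return [random.choice(OPS) for _ in range(NUM_LAYERS)]

def mutate(arch):
    """Mutation: change a single randomly chosen operation."""
    child = list(arch)
    child[random.randrange(NUM_LAYERS)] = random.choice(OPS)
    return child

def evaluate(arch):
    """Stub fitness: a real system trains the architecture and
    returns its validation accuracy."""
    return random.random()

def regularized_evolution(population_size=50, cycles=500, sample_size=10):
    # Population is a FIFO queue: the oldest member dies each cycle (aging).
    population = collections.deque()
    history = []
    for _ in range(population_size):
        arch = random_architecture()
        population.append((arch, evaluate(arch)))
    history.extend(population)

    for _ in range(cycles):
        tournament = random.sample(list(population), sample_size)
        parent, _ = max(tournament, key=lambda item: item[1])  # selection
        child = mutate(parent)
        population.append((child, evaluate(child)))
        population.popleft()  # evict the oldest member, not the worst
        history.append(population[-1])

    return max(history, key=lambda item: item[1])

best_arch, best_acc = regularized_evolution()
```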
Result: AmoebaNet-A achieved 83.9% top-1 on ImageNet, matching the best RL-based results while being more interpretable.
Early NAS methods required enormous compute: thousands of GPU-days for a single search. This motivated the development of efficient NAS techniques: weight sharing, differentiable search, and zero-cost proxies.
DARTS (2019) made NAS dramatically more efficient by making the search space continuous and differentiable.
Key Insight: Instead of selecting one operation per edge, compute a weighted sum of all possible operations. Learn the weights (architecture parameters α) via gradient descent alongside the network weights.
Mixed Operation: For each edge between nodes, compute:
ō(x) = Σᵢ [exp(αᵢ) / Σⱼexp(αⱼ)] · oᵢ(x)
where αᵢ are learnable architecture parameters and oᵢ are candidate operations.
Bi-Level Optimization: DARTS alternates between two coupled problems: the network weights w are trained to minimize the training loss for the current architecture encoding, while the architecture parameters α are updated to minimize the validation loss. Formally: min_α L_val(w*(α), α), subject to w*(α) = argmin_w L_train(w, α). In practice, DARTS approximates w*(α) with a single inner gradient step (second-order variant) or simply the current weights (first-order variant) rather than solving the inner problem to convergence.
Efficiency: DARTS reduces search from thousands of GPU-days to 1-4 GPU-days.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Mixed operation: weighted sum of all candidate operations."""
    def __init__(self, C: int, stride: int):
        super().__init__()
        self.ops = nn.ModuleList([
            OPERATIONS[name](C, stride) for name in OPERATIONS.keys()
        ])

    def forward(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Input tensor
            weights: Softmax-normalized architecture weights (one per operation)
        """
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class DARTSCell(nn.Module):
    """Differentiable cell with learnable architecture."""
    def __init__(self, C: int, num_nodes: int = 4):
        super().__init__()
        self.num_nodes = num_nodes

        # Each node receives input from all previous nodes
        self.ops = nn.ModuleDict()
        for i in range(num_nodes):
            for j in range(i + 2):  # +2 for two input nodes
                self.ops[f'{j}_{i}'] = MixedOp(C, stride=1)

        # Architecture parameters: one alpha vector per edge
        num_edges = sum(i + 2 for i in range(num_nodes))
        num_ops = len(OPERATIONS)
        self.alphas = nn.Parameter(torch.randn(num_edges, num_ops) * 1e-3)

    def forward(self, s0: torch.Tensor, s1: torch.Tensor) -> torch.Tensor:
        states = [s0, s1]
        edge_idx = 0
        for i in range(self.num_nodes):
            # Compute output for node i from all previous nodes
            node_inputs = []
            for j in range(len(states)):
                weights = F.softmax(self.alphas[edge_idx], dim=0)
                node_inputs.append(self.ops[f'{j}_{i}'](states[j], weights))
                edge_idx += 1
            states.append(sum(node_inputs))

        # Concatenate intermediate nodes (exclude inputs)
        return torch.cat(states[2:], dim=1)
```

After search, DARTS discretizes the architecture by selecting the top-k operations (by α magnitude) for each node. The final architecture uses only these selected operations—no weighted sums during inference.
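To connect the cell above to the bi-level optimization described earlier, below is a minimal first-order sketch of the alternating updates. The `model`, data batches, and optimizer setup are hypothetical; the sketch only assumes the architecture parameters (e.g., `model.alphas` from `DARTSCell`) are registered on a separate optimizer from the regular weights:

```python
import torch.nn.functional as F

def darts_search_step(model, w_optimizer, alpha_optimizer, train_batch, val_batch):
    """One alternating step of first-order DARTS (sketch).

    w_optimizer holds the network weights; alpha_optimizer holds only the
    architecture parameters (e.g., model.alphas).
    """
    x_train, y_train = train_batch
    x_val, y_val = val_batch

    # Outer step: update architecture parameters alpha on validation data.
    alpha_optimizer.zero_grad()
    val_loss = F.cross_entropy(model(x_val), y_val)
    val_loss.backward()
    alpha_optimizer.step()

    # Inner step: update network weights w on training data.
    w_optimizer.zero_grad()
    train_loss = F.cross_entropy(model(x_train), y_train)
    train_loss.backward()
    w_optimizer.step()

    return train_loss.item(), val_loss.item()

# Typical (hypothetical) setup, roughly following the DARTS paper's values:
#   w_optimizer = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9, weight_decay=3e-4)
#   alpha_optimizer = torch.optim.Adam([model.alphas], lr=3e-4, weight_decay=1e-3)
#   for train_batch, val_batch in zip(train_loader, val_loader):
#       darts_search_step(model, w_optimizer, alpha_optimizer, train_batch, val_batch)
```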
| Architecture | Year | Search Method | Key Achievement |
|---|---|---|---|
| NASNet-A | 2018 | RL | First NAS to beat human designs on ImageNet |
| AmoebaNet-A | 2019 | Evolutionary | Matched NASNet with interpretable evolution |
| DARTS | 2019 | Differentiable | Reduced search to 1-4 GPU-days |
| EfficientNet-B0 | 2019 | RL + compound scaling | SOTA efficiency-accuracy tradeoff |
| MobileNetV3 | 2019 | RL + manual refinement | Best mobile architecture |
| RegNet | 2020 | Design space search | Discovered general network design principles |
RegNet Insight: Rather than searching for specific architectures, RegNet searched for design space constraints. The authors discovered that the best networks follow a simple linear rule for block widths, wⱼ = w₀ + wₐ·j for block index j, which is then quantized so that blocks group into a few uniform stages. This produced a family of networks competitive with EfficientNet but simpler to understand and scale.
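As a small numeric illustration of the linear rule (example values only, not taken from a specific RegNet model):

```python
def linear_widths(w_0, w_a, depth):
    """Per-block widths under the RegNet linear rule w_j = w_0 + w_a * j.

    The actual RegNet models further quantize these widths (rounding in
    log-space to a multiplier w_m, then to multiples of 8) so that blocks
    group into a few uniform stages; that step is omitted here.
    """
    return [w_0 + w_a * j for j in range(depth)]

print(linear_widths(w_0=24, w_a=36, depth=6))
# [24, 60, 96, 132, 168, 204]
```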
You now understand Neural Architecture Search: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation, and landmark architectures. Next, we'll explore ConvNeXt—a return to pure ConvNet design that challenges the dominance of Vision Transformers.