The search space is arguably the most critical component of any Neural Architecture Search system. It determines which architectures can possibly be discovered—no search algorithm, however sophisticated, can find an architecture outside its search space.
The fundamental tension:
Designing an effective search space requires balancing expressiveness (can we represent good architectures?) with tractability (can we search efficiently?). This page provides a deep understanding of search space design principles and common paradigms.
Master the principles of search space design: cell-based vs macro search, operation selection, connectivity constraints, and how design choices impact search efficiency and discovered architecture quality.
A search space $\mathcal{A}$ is formally the set of all architectures that can be expressed under a given parameterization. Every architecture $a \in \mathcal{A}$ corresponds to a specific instantiation of architectural choices.
Dimensionality of Search Spaces:
The size of a search space grows combinatorially with the number of choices:
$$|\mathcal{A}| = \prod_{i=1}^{n} |C_i|$$
Where $C_i$ is the set of choices for the $i$-th architectural decision. With $n$ decisions and $k$ options each, the space contains $k^n$ architectures.
Example: A Simple Cell Space
Consider a cell with 4 intermediate nodes, where each node selects two inputs (from the two cell inputs plus the outputs of all earlier nodes) and applies one of 7 candidate operations to each selected input.
For node $i$ (0-indexed), there are $(2+i)^2$ ordered input combinations and $7^2 = 49$ operation combinations.
The total space size: $$|\mathcal{A}| = \prod_{i=0}^{3} \left[(2+i)^2 \times 7^2\right] = 2^2 \times 3^2 \times 4^2 \times 5^2 \times 49^4 \approx 8.3 \times 10^{10}$$
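As a quick sanity check, the product above can be computed directly (a minimal sketch using the worked example's 7 candidate operations):

```python
from math import prod

# 4 intermediate nodes, 7 candidate operations, as in the worked example:
# each node contributes (2 + i)^2 input choices times 7^2 operation choices
ops = 7
total = prod((2 + i) ** 2 * ops ** 2 for i in range(4))
print(f"{total:,}")  # 83,013,134,400, i.e. about 8.3e10
```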
Even this "simple" cell contains tens of billions of possible architectures.
| Search Space | Approximate Size | Notable Architectures |
|---|---|---|
| NAS-Bench-101 | 423,624 | Limited but fully evaluated |
| NAS-Bench-201 | 15,625 | Multiple datasets, smaller space |
| NASNet Cell Space | $\sim 10^{18}$ | Original NAS; huge but structured |
| DARTS Space | $\sim 10^{18}$ | Continuous relaxation makes this tractable |
| Unconstrained DAG (10 nodes) | $> 10^{30}$ | Impractical without constraints |
The cell-based paradigm revolutionized NAS by dramatically reducing search space complexity while maintaining expressiveness. Instead of searching for the entire network, we search for a small repeating unit (cell) that is stacked to form the full architecture.
Key insight: Many successful architectures (ResNet, Inception, etc.) consist of repeated blocks. Searching for the block rather than the full network reduces complexity from $O(|C|^L)$ to $O(|C|^B)$, where $L$ is network depth and $B$ is block size ($B \ll L$).
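To make the $O(|C|^L)$ vs $O(|C|^B)$ gap concrete, here is a toy computation (the counts $|C| = 8$, $L = 20$, $B = 5$ are illustrative assumptions, not values from any particular system):

```python
# Illustrative only: 8 candidate ops per decision, a 20-layer network,
# and a cell described by 5 decisions
C, L, B = 8, 20, 5
print(f"full-network space: {C ** L:.1e}")  # full-network space: 1.2e+18
print(f"cell space:         {C ** B}")      # cell space:         32768
```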
NASNet Cell Structure:
The NASNet cell design introduced key concepts still used today:
```python
"""Cell-Based Search Space Implementation"""
from dataclasses import dataclass
from typing import List, Tuple

# Standard operations in NASNet/DARTS-style spaces
OPERATIONS = [
    'none',          # Zero operation (drops the edge)
    'skip_connect',  # Identity connection
    'sep_conv_3x3',  # Separable convolution 3x3
    'sep_conv_5x5',  # Separable convolution 5x5
    'dil_conv_3x3',  # Dilated convolution 3x3
    'dil_conv_5x5',  # Dilated convolution 5x5
    'avg_pool_3x3',  # Average pooling 3x3
    'max_pool_3x3',  # Max pooling 3x3
]

@dataclass
class CellGenotype:
    """Represents a cell architecture"""
    normal: List[Tuple[str, int, str, int]]  # [(op1, input1, op2, input2), ...]
    reduce: List[Tuple[str, int, str, int]]  # Same for the reduction cell

def enumerate_cell_space(num_nodes: int = 4) -> int:
    """Calculate the size of the cell search space"""
    total = 1
    for i in range(num_nodes):
        num_inputs = 2 + i  # 2 cell inputs + previous nodes
        # Each node: 2 input selections × 2 operation selections
        node_choices = (num_inputs ** 2) * (len(OPERATIONS) ** 2)
        total *= node_choices
    return total

print(f"Cell space size (4 nodes, 8 ops): {enumerate_cell_space():,}")
# Output: Cell space size (4 nodes, 8 ops): 241,591,910,400
# (larger than the 7-op worked example above because OPERATIONS includes 'none')
```

Cells searched on small proxy tasks (CIFAR-10) often transfer to larger tasks (ImageNet) by stacking more cells. This enables searching in a cheap setting and deploying at scale, a key efficiency advantage.
Macro search directly searches for the entire network architecture rather than repeating cells. This provides maximum flexibility but at significantly higher search cost.
Chain-Structured Macro Search:
The simplest macro space defines a sequence of layers, choosing the operation at each position:
$$a = (o_1, o_2, ..., o_L)$$
where $o_i \in \mathcal{O}$ is the operation at layer $i$.
Multi-Branch Macro Search:
More expressive spaces allow branches and skip connections: each layer may draw input from any earlier layer, so the space contains general DAG-structured networks rather than a single chain.
When to Use Macro vs Cell Search:
Cell-based search suits limited compute budgets and tasks where a repeated-block prior is reasonable, and the searched cell transfers across datasets by restacking. Macro search is worth its higher cost when layer-wise heterogeneity matters, for example in hardware-aware design where early and late layers benefit from different operations.
The choice of candidate operations fundamentally shapes what architectures can be discovered. Operations must be expressive enough to compose into strong architectures, cheap enough to evaluate across many candidates, and well supported on the target hardware.
| Domain | Typical Operations | Notes |
|---|---|---|
| Image Classification | 3×3/5×5/7×7 conv, separable conv, dilated conv, pooling, skip | Separable convs reduce params |
| Object Detection | Above + deformable conv, FPN connections | Multi-scale is critical |
| Transformers | Self-attention, FFN variants, different head counts | Attention is computationally heavy |
| Mobile | MBConv (inverted residual), SE blocks, small kernels | Efficiency-focused operations |
If operations differ greatly in parameter count or computation, search may favor simpler options regardless of performance. Normalizing operation costs or using multi-objective search helps.
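One simple mitigation can be sketched as a cost-penalized objective. The relative-cost figures and the penalty weight below are illustrative assumptions, not measured values:

```python
# Illustrative relative compute costs per operation (placeholders, not benchmarks)
OP_COST = {
    'skip_connect': 0.0,
    'max_pool_3x3': 0.01,
    'sep_conv_3x3': 1.0,
    'sep_conv_5x5': 2.2,
}

def cost_penalized_score(accuracy: float, op_names: list, lam: float = 0.02) -> float:
    """Multi-objective score: validation accuracy minus a weighted compute penalty."""
    cost = sum(OP_COST[o] for o in op_names)
    return accuracy - lam * cost

# A cheap architecture no longer wins automatically: its penalty is smaller,
# but its accuracy term still has to compete.
print(round(cost_penalized_score(0.94, ['sep_conv_5x5', 'sep_conv_3x3']), 3))  # 0.876
```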
Connectivity constraints define how nodes can connect. Unconstrained DAGs are too expensive to search; practical spaces use structural constraints.
Common Connectivity Patterns:
```python
def valid_connections_cell(node_idx: int, num_cell_inputs: int = 2):
    """
    For cell-based NAS, node i can connect to:
    - the 2 cell input nodes (indices 0, 1)
    - previous intermediate nodes (indices 2 to 2 + i - 1)
    """
    return list(range(num_cell_inputs + node_idx))

# Node 0: can use inputs [0, 1] (the 2 cell inputs)
# Node 1: can use inputs [0, 1, 2] (2 cell inputs + node 0's output)
# Node 2: can use inputs [0, 1, 2, 3] (+ node 1's output)
# etc.
```

DARTS (Differentiable Architecture Search) introduced a search space that enables gradient-based optimization through continuous relaxation.
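Combining this connectivity rule with a 4-node cell fixes how many edges a search method must parameterize, which is exactly the 14 mixed operations a DARTS cell instantiates:

```python
# Each node i (of 4) may draw from 2 + i predecessors, so the edge count is:
num_edges = sum(2 + i for i in range(4))  # 2 + 3 + 4 + 5
print(num_edges)  # 14
```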
Key Design Principles:
Mixed Operations: Instead of selecting one operation per edge, DARTS places a weighted mixture of all operations. Architecture weights $\alpha$ determine the mixture.
Continuous Relaxation: The discrete choice is relaxed: $$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o'} \exp(\alpha_{o'}^{(i,j)})} \cdot o(x)$$
Discretization: After search, the architecture is discretized by taking the $\arg\max$ of the architecture weights on each edge (DARTS additionally keeps only the top-2 strongest incoming edges per node).
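The relaxation-then-argmax step can be illustrated with toy numbers (the $\alpha$ values below are made up for the example):

```python
from math import exp

# One edge with 4 candidate operations and made-up architecture weights
ops = ['none', 'skip_connect', 'sep_conv_3x3', 'max_pool_3x3']
alpha = [0.1, 2.0, -1.0, 0.5]

z = sum(exp(a) for a in alpha)
weights = [exp(a) / z for a in alpha]  # softmax: coefficients of the op mixture
chosen = ops[max(range(len(alpha)), key=alpha.__getitem__)]  # argmax discretization
print(chosen)  # skip_connect
```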
```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """DARTS-style mixed operation for differentiable NAS"""
    def __init__(self, channels, operations):
        super().__init__()
        self.ops = nn.ModuleList([
            self._build_op(op, channels) for op in operations
        ])

    def forward(self, x, weights):
        """
        Args:
            x: Input tensor
            weights: Architecture weights (softmax already applied)
        Returns:
            Weighted sum of all operation outputs
        """
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def _build_op(self, op_name, channels):
        # Build operation by name (Zero and SepConv are assumed defined elsewhere)
        if op_name == 'none':
            return Zero(channels)
        elif op_name == 'skip_connect':
            return nn.Identity()
        elif op_name == 'sep_conv_3x3':
            return SepConv(channels, channels, 3)
        # ... other operations

class DARTSCell(nn.Module):
    """A DARTS cell with mixed operations on all edges"""
    def __init__(self, channels, num_nodes=4):
        super().__init__()
        self.num_nodes = num_nodes
        # Create mixed ops for all edges
        self.edges = nn.ModuleDict()
        for i in range(num_nodes):
            for j in range(2 + i):  # Can connect to cell inputs + prev nodes
                self.edges[f'{j}->{2+i}'] = MixedOp(channels, OPERATIONS)

    def forward(self, s0, s1, arch_weights):
        """Forward with architecture weights determining op mixtures"""
        states = [s0, s1]  # Cell inputs
        for i in range(self.num_nodes):
            node_input = sum(
                self.edges[f'{j}->{2+i}'](states[j], arch_weights[f'{j}->{2+i}'])
                for j in range(2 + i)
            )
            states.append(node_input)
        # Concatenate intermediate node outputs
        return torch.cat(states[2:], dim=1)
```

The architecture trained with soft mixtures may perform differently when discretized. This "discretization gap" is a known DARTS limitation: the searched continuous architecture doesn't perfectly match the final discrete one.
Hierarchical search spaces combine the benefits of cell-based and macro search by operating at multiple levels of abstraction.
Two-Level Hierarchy (Common): an outer level fixes the macro layout (how many cells, where resolutions change, how cells connect), while an inner level searches the cell structure (operations and wiring).
Example: Auto-DeepLab
Auto-DeepLab searches for semantic segmentation architectures at two levels: the cell level (which operations and connections form each cell) and the network level (the sequence of spatial resolutions the features pass through).
This discovers both optimal local computations AND optimal multi-scale structure.
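A two-level genotype can be represented with a simple container. This sketch is hypothetical (the field names and toy values are assumptions, not Auto-DeepLab's actual encoding), but it shows the separation of inner cell from outer resolution path:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TwoLevelGenotype:
    """Hypothetical two-level encoding: inner cell ops + outer resolution path."""
    cell: List[Tuple[str, int]]  # inner level: (operation, input index) per edge
    path: List[int]              # outer level: downsampling factor at each layer

g = TwoLevelGenotype(
    cell=[('sep_conv_3x3', 0), ('skip_connect', 1)],
    path=[4, 8, 8, 16, 16, 32],  # toy path through spatial resolutions
)
print(len(g.path))  # 6
```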
You now understand how to design effective NAS search spaces. Next, we'll explore the search strategies that navigate these spaces to find optimal architectures.