Given a search space containing billions or trillions of possible architectures, how do we find good ones efficiently? This is the central challenge addressed by NAS search strategies.
The search strategy determines which architectures get proposed, in what order, and how feedback from completed evaluations shapes later proposals.
Different strategies make fundamentally different trade-offs between sample efficiency, parallelizability, and the types of architectures they tend to discover.
Master the major NAS search paradigms: RL-based search with policy gradients, evolutionary algorithms with mutation and selection, gradient-based methods like DARTS, and Bayesian optimization. Understand when each approach excels.
Before exploring sophisticated methods, we must understand random search—the simplest strategy and an essential baseline.
Algorithm: sample N architectures uniformly at random from the search space, train and evaluate each one, and return the best. No information from earlier evaluations influences later samples.
Why Random Search Matters:
Research has shown that random search is surprisingly competitive with more complex methods on many NAS benchmarks. This has an important implication: any proposed search strategy should be compared against a well-tuned random-search baseline under the same evaluation budget before claiming an advance.
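The baseline's strength follows directly from the probability that at least one of $N$ independent uniform samples lands in the top $k\%$ of the space, $1 - (1 - k/100)^N$. A quick check of the numbers:

```python
def prob_top_k(k_percent: float, n_samples: int) -> float:
    """Probability that at least one of n uniform random samples
    falls in the top k% of the search space."""
    return 1 - (1 - k_percent / 100) ** n_samples

# Top 1% of architectures:
print(f"{prob_top_k(1, 100):.3f}")   # ~0.634
print(f"{prob_top_k(1, 500):.3f}")   # ~0.993
```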
```python
import random
from typing import Callable

def random_search_nas(
    search_space,
    evaluate_fn: Callable,
    budget: int,
    maximize: bool = True
) -> tuple:
    """
    Random search for NAS.

    Args:
        search_space: Object with sample_random() method
        evaluate_fn: Function that evaluates architecture performance
        budget: Number of architectures to evaluate
        maximize: Whether to maximize (True) or minimize (False)

    Returns:
        (best_architecture, best_performance, history)
    """
    best_arch = None
    best_perf = float('-inf') if maximize else float('inf')
    history = []

    for i in range(budget):
        # Sample random architecture
        arch = search_space.sample_random()

        # Evaluate (this is the expensive part)
        perf = evaluate_fn(arch)
        history.append((arch, perf))

        # Update best
        is_better = perf > best_perf if maximize else perf < best_perf
        if is_better:
            best_perf = perf
            best_arch = arch

        if (i + 1) % 100 == 0:
            print(f"Iteration {i+1}: Best = {best_perf:.4f}")

    return best_arch, best_perf, history

# Expected performance: with N samples, the probability of finding
# a top-k% architecture is 1 - (1 - k/100)^N.
# For top 1%: N=100 gives ~63%, N=500 gives ~99.3%.
```

If architectures in the top 1% perform well, random search with ~500 samples has a 99%+ probability of finding one. This explains why random search works well when search spaces contain many good architectures.
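To make the `search_space` and `evaluate_fn` interfaces concrete, here is a hypothetical toy space (the class name and scoring rule are illustrative, not from any real benchmark) driven by a plain random-search loop:

```python
import random

class ToySearchSpace:
    """Illustrative toy space: 4 layers, 3 candidate ops per layer."""
    operations = ['conv3x3', 'conv5x5', 'skip']

    def sample_random(self):
        return [random.choice(self.operations) for _ in range(4)]

def toy_evaluate(arch):
    # Stand-in for expensive training: pretend conv3x3 layers help most.
    return sum(1.0 if op == 'conv3x3' else 0.2 for op in arch)

space = ToySearchSpace()
best_arch, best_perf = None, float('-inf')
for _ in range(200):
    arch = space.sample_random()
    perf = toy_evaluate(arch)
    if perf > best_perf:
        best_arch, best_perf = arch, perf

# With 200 samples, the best found is usually close to ['conv3x3'] * 4
print(best_arch, best_perf)
```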
The seminal NAS paper (Zoph & Le, 2017) framed architecture search as an RL problem:
The Controller:
The controller is an RNN that autoregressively generates architecture tokens: $$P(a) = \prod_{t=1}^{T} P(a_t | a_1, ..., a_{t-1}; \theta_c)$$
where $\theta_c$ are the controller parameters.
Training with REINFORCE:
Since the reward (validation accuracy) is non-differentiable with respect to the architecture, we use policy gradient:
$$\nabla_{\theta_c} J(\theta_c) = E_{a \sim \pi_{\theta_c}} [(R(a) - b) \nabla_{\theta_c} \log P(a; \theta_c)]$$
where $b$ is a baseline (typically exponential moving average of rewards) to reduce variance.
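To see the update rule in isolation, here is a self-contained toy (pure numpy, illustrative reward function, not the paper's setup) applying REINFORCE with an exponential-moving-average baseline to independent per-layer operation choices. For a categorical policy, $\nabla \log P(a)$ with respect to the logits is the one-hot action minus the probability vector:

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, num_ops = 4, 3
logits = np.zeros((num_layers, num_ops))   # independent categorical per layer
baseline, decay, lr = 0.0, 0.9, 0.1

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reward(arch):
    # Toy stand-in for validation accuracy: op 2 is best in every position.
    return float(np.mean(arch == 2))

for step in range(3000):
    probs = softmax(logits)
    # Sample one operation per layer from the current policy
    arch = np.array([rng.choice(num_ops, p=probs[l]) for l in range(num_layers)])
    R = reward(arch)

    # Exponential-moving-average baseline reduces gradient variance
    baseline = decay * baseline + (1 - decay) * R
    advantage = R - baseline

    # For a categorical policy, d log P(a) / d logits = one-hot(a) - probs,
    # so the REINFORCE ascent step is lr * advantage * (one-hot - probs)
    onehot = np.eye(num_ops)[arch]
    logits += lr * advantage * (onehot - probs)

final_probs = softmax(logits)
print(final_probs[:, 2])   # probability of the rewarded op rises well above uniform
```

Without the baseline, every sampled architecture with a positive reward gets reinforced; subtracting the EMA centers the signal so only better-than-average architectures gain probability.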
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NASController(nn.Module):
    """
    RNN controller for RL-based NAS.
    Generates the architecture sequence autoregressively.
    """

    def __init__(
        self,
        num_layers: int,
        num_operations: int,
        hidden_dim: int = 100,
        temperature: float = 1.0
    ):
        super().__init__()
        self.num_layers = num_layers
        self.num_operations = num_operations
        self.temperature = temperature

        # Embedding for operation tokens
        self.op_embedding = nn.Embedding(num_operations, hidden_dim)

        # LSTM controller
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)

        # Output head
        self.op_classifier = nn.Linear(hidden_dim, num_operations)

        # Learnable initial states
        self.h0 = nn.Parameter(torch.zeros(1, hidden_dim))
        self.c0 = nn.Parameter(torch.zeros(1, hidden_dim))

    def forward(self, batch_size: int = 1):
        """
        Sample architectures from the controller.

        Returns:
            architectures: Sampled operation indices [B, num_layers]
            log_probs: Log probabilities for each choice
            entropies: Entropy of each distribution
        """
        h = self.h0.expand(batch_size, -1)
        c = self.c0.expand(batch_size, -1)

        architectures = []
        log_probs = []
        entropies = []

        # Start token (token 0 doubles as the start symbol)
        input_token = torch.zeros(
            batch_size, dtype=torch.long, device=self.h0.device
        )

        for layer_idx in range(self.num_layers):
            # Embed previous token
            embed = self.op_embedding(input_token)

            # LSTM step
            h, c = self.lstm(embed, (h, c))

            # Compute operation distribution
            logits = self.op_classifier(h) / self.temperature
            probs = F.softmax(logits, dim=-1)

            # Sample operation
            dist = torch.distributions.Categorical(probs)
            op = dist.sample()

            architectures.append(op)
            log_probs.append(dist.log_prob(op))
            entropies.append(dist.entropy())

            input_token = op

        return (
            torch.stack(architectures, dim=1),
            torch.stack(log_probs, dim=1),
            torch.stack(entropies, dim=1)
        )

def train_controller_step(
    controller,
    rewards,
    log_probs,
    entropies,
    baseline,
    optimizer,
    entropy_weight=0.01
):
    """
    One training step using REINFORCE.
    """
    # Advantage = reward - baseline
    advantages = rewards - baseline

    # Policy gradient loss
    policy_loss = -(log_probs.sum(dim=1) * advantages).mean()

    # Entropy bonus for exploration (subtracted so that
    # higher entropy lowers the loss)
    entropy_bonus = entropies.sum(dim=1).mean()

    loss = policy_loss - entropy_weight * entropy_bonus

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Evolutionary NAS applies principles from biological evolution: maintain a population of architectures, select fit individuals, mutate to create offspring, and iterate.
Regularized Evolution (AmoebaNet):
The key algorithm used in AmoebaNet and many subsequent works:
Why remove oldest, not worst?
Removing the oldest provides "regularization"—it prevents the population from converging too quickly to a local optimum and maintains diversity.
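The aging mechanism needs no special machinery; a fixed-capacity FIFO queue (`deque(maxlen=...)`) implements it directly. This small sketch shows the best individual being evicted purely because it is oldest:

```python
from collections import deque

# Aging population: capacity 3. Appending past capacity evicts the OLDEST
# member regardless of fitness -- even the current best can age out.
population = deque(maxlen=3)
for name, fitness in [('a', 0.9), ('b', 0.5), ('c', 0.6), ('d', 0.7)]:
    population.append({'arch': name, 'fitness': fitness})

print([m['arch'] for m in population])  # ['b', 'c', 'd']: best-so-far 'a' aged out
```

Because the best-so-far is tracked separately in `history`, nothing is lost by evicting it from the live population; what is gained is continued exploration around newer lineages.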
```python
from collections import deque
import copy
import random

def regularized_evolution(
    search_space,
    evaluate_fn,
    population_size: int = 50,
    tournament_size: int = 10,
    num_mutations: int = 1,
    budget: int = 1000
):
    """
    Regularized Evolution for NAS (AmoebaNet-style).

    Key insight: remove the oldest member, not the worst,
    for regularization.
    """
    # Initialize population (FIFO queue)
    population = deque(maxlen=population_size)
    history = []

    # Seed with random architectures
    for _ in range(population_size):
        arch = search_space.sample_random()
        fitness = evaluate_fn(arch)
        population.append({'arch': arch, 'fitness': fitness})
        history.append((arch, fitness))

    evaluations = population_size

    while evaluations < budget:
        # Tournament selection
        tournament = random.sample(list(population), tournament_size)
        parent = max(tournament, key=lambda x: x['fitness'])

        # Mutation
        child_arch = mutate(parent['arch'], search_space, num_mutations)

        # Evaluate child
        child_fitness = evaluate_fn(child_arch)
        evaluations += 1

        # Add child (oldest automatically removed due to maxlen)
        population.append({'arch': child_arch, 'fitness': child_fitness})
        history.append((child_arch, child_fitness))

    # Return best ever found
    best = max(history, key=lambda x: x[1])
    return best[0], best[1], history

def mutate(arch, search_space, num_mutations=1):
    """
    Mutate an architecture by randomly changing one operation
    or connection.
    """
    child = copy.deepcopy(arch)

    for _ in range(num_mutations):
        mutation_type = random.choice(['operation', 'connection'])

        if mutation_type == 'operation':
            # Change a random operation
            node_idx = random.randint(0, len(child.nodes) - 1)
            op_idx = random.randint(0, 1)  # Each node has 2 ops
            child.nodes[node_idx]['ops'][op_idx] = random.choice(
                search_space.operations
            )
        else:
            # Change a connection
            node_idx = random.randint(0, len(child.nodes) - 1)
            conn_idx = random.randint(0, 1)
            valid_inputs = list(range(2 + node_idx))  # Cell inputs + prev nodes
            child.nodes[node_idx]['inputs'][conn_idx] = random.choice(
                valid_inputs
            )

    return child
```

DARTS (Differentiable Architecture Search) revolutionized NAS efficiency by enabling gradient-based optimization of architectures.
Core Idea: relax the discrete choice of operation on each edge into a continuous one. Instead of picking a single operation $o$ from the candidate set $\mathcal{O}$, compute a softmax-weighted mixture: $$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x)$$ The mixing weights $\alpha$ become ordinary parameters that can be optimized by gradient descent jointly with the network weights $w$.
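DARTS replaces each discrete operation choice with a softmax-weighted mixture of all candidate operations, which is what makes the architecture parameters $\alpha$ differentiable. A pure-numpy sketch with toy stand-in operations (the op set here is illustrative; real DARTS mixes convolutions, pooling, and skip connections):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy candidate operations on a feature vector (stand-ins for conv/pool/skip)
ops = [
    lambda x: x,                  # identity / skip
    lambda x: 2.0 * x,            # stand-in for a learned transform
    lambda x: np.zeros_like(x),   # "none" operation
]

def mixed_op(x, alpha):
    """Continuous relaxation: softmax(alpha)-weighted sum of candidate ops."""
    w = softmax(alpha)
    return sum(w[k] * op(x) for k, op in enumerate(ops))

x = np.ones(4)
print(mixed_op(x, np.zeros(3)))             # uniform weights: average of ops
print(mixed_op(x, np.array([10.0, 0.0, 0.0])))  # near-discrete: ~identity output
```

Because `mixed_op` is a smooth function of `alpha`, the gradient of a loss through it tells each edge which operations to upweight; discretizing at the end (argmax over `alpha`) recovers a concrete architecture.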
Bi-Level Optimization:
$$\min_{\alpha} \mathcal{L}_{val}(w^*(\alpha), \alpha)$$ $$\text{s.t. } w^*(\alpha) = \arg\min_w \mathcal{L}_{train}(w, \alpha)$$
Approximation (First-Order DARTS):
Full bi-level optimization requires differentiating through the inner weight optimization, which is expensive. The first-order approximation simply alternates single gradient steps: update $\alpha$ on the validation loss using the current weights $w$ (treating them as a stand-in for $w^*(\alpha)$), then update $w$ on the training loss.
```python
import torch
import torch.nn as nn

class DARTSTrainer:
    """
    DARTS training loop with bi-level optimization.
    """

    def __init__(
        self,
        model,
        arch_optimizer,
        weight_optimizer,
        train_loader,
        val_loader
    ):
        self.model = model
        self.arch_optimizer = arch_optimizer
        self.weight_optimizer = weight_optimizer
        self.train_loader = train_loader
        self.val_loader = val_loader

    def search_epoch(self):
        """One epoch of DARTS search."""
        train_iter = iter(self.train_loader)
        val_iter = iter(self.val_loader)

        for step in range(len(self.train_loader)):
            # Get training and validation batches
            train_x, train_y = next(train_iter)
            try:
                val_x, val_y = next(val_iter)
            except StopIteration:
                val_iter = iter(self.val_loader)
                val_x, val_y = next(val_iter)

            # Step 1: Update architecture weights on validation loss
            self.arch_optimizer.zero_grad()
            val_logits = self.model(val_x)
            val_loss = nn.functional.cross_entropy(val_logits, val_y)
            val_loss.backward()
            self.arch_optimizer.step()

            # Step 2: Update network weights on training loss
            self.weight_optimizer.zero_grad()
            train_logits = self.model(train_x)
            train_loss = nn.functional.cross_entropy(train_logits, train_y)
            train_loss.backward()
            self.weight_optimizer.step()

    def derive_architecture(self):
        """
        Discretize continuous architecture weights to get the
        final architecture.
        """
        genotype = []
        for edge_name, alpha in self.model.arch_parameters():
            # Select the operation with the highest weight
            probs = torch.softmax(alpha, dim=-1)
            best_op_idx = probs.argmax().item()
            genotype.append((edge_name, best_op_idx))
        return genotype
```

DARTS can suffer from "architecture collapse": converging to degenerate solutions dominated by skip connections. Many variants (DARTS+, PC-DARTS, SDARTS) address this through regularization, partial channel connections, or perturbation-based stabilization.
Bayesian Optimization (BO) approaches NAS by building a probabilistic model of the architecture-performance mapping and using it to guide search.
Components: a surrogate model that predicts performance from an architecture encoding, an acquisition function that scores candidates by balancing predicted performance against uncertainty, and an encoding that maps architectures into the surrogate's input space.
Algorithm: (1) evaluate a small set of initial architectures; (2) fit the surrogate to all observations so far; (3) select the candidate that maximizes the acquisition function; (4) evaluate it, add the result to the observations, and repeat until the budget is exhausted.
| Component | Common Choices | Trade-offs |
|---|---|---|
| Surrogate | GP, RF, Neural Network, GNN | GP: principled uncertainty; NN: scales better |
| Acquisition | EI, UCB, Thompson Sampling | EI: exploitation; UCB: tunable exploration |
| Encoding | One-hot, adjacency matrix, path encoding | Encoding quality affects surrogate accuracy |
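The loop can be sketched end to end. Below is a minimal, self-contained toy implementation (a GP surrogate with an RBF kernel over one-hot encodings and an expected-improvement acquisition; the space, objective, and all names are illustrative, not from any NAS benchmark):

```python
import math
from itertools import product

import numpy as np

def rbf(A, B, ls=1.0):
    """RBF kernel matrix between row-vector sets A [n, d] and B [m, d]."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_predict(X, y, Xq, noise=1e-4):
    """Zero-mean GP posterior mean and std at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Kq = rbf(X, Xq)                                         # [n, m]
    mu = Kq.T @ np.linalg.solve(K, y)
    var = 1.0 - (Kq * np.linalg.solve(K, Kq)).sum(axis=0)   # rbf(x, x) = 1
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    """EI acquisition for maximization."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf

# Toy space: 3 layers, 3 candidate ops each, one-hot encoded per layer.
archs = list(product(range(3), repeat=3))
X_all = np.array([np.eye(3)[list(a)].ravel() for a in archs])

def evaluate(a):
    # Stand-in for validation accuracy: op 2 helps a lot, op 1 a little.
    return sum(op == 2 for op in a) / 3 + 0.1 * sum(op == 1 for op in a) / 3

observed = [0, 1, 2]                      # warm-start evaluations
y = [evaluate(archs[i]) for i in observed]

for _ in range(7):                        # BO iterations
    mu, sigma = gp_predict(X_all[observed], np.array(y), X_all)
    ei = expected_improvement(mu, sigma, max(y))
    ei[observed] = -np.inf                # propose only unevaluated candidates
    nxt = int(np.argmax(ei))
    observed.append(nxt)
    y.append(evaluate(archs[nxt]))

best_arch = archs[observed[int(np.argmax(y))]]
print(best_arch, max(y))
```

The surrogate makes every proposal informed by all previous evaluations, which is why BO's sample efficiency shines when each evaluation is a full training run; the trade-off is that fitting and maximizing the acquisition adds overhead and serializes the loop.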
| Strategy | Sample Efficiency | Parallelizable | Gradient Required | Best For |
|---|---|---|---|---|
| Random | Low | Yes | No | Baseline; large compute budgets |
| RL (REINFORCE) | Medium | Yes | No | Large spaces; flexible generation |
| Evolution | Medium | Yes | No | Discrete spaces; robust optimization |
| DARTS | Very High | No (sequential) | Yes | Differentiable spaces; limited compute |
| Bayesian Opt | Very High | Limited | No | Expensive evaluations; small budgets |
Key Insights from NAS Research:
- Random search is a surprisingly strong baseline; any new strategy should beat it under an equal evaluation budget.
- Sample-efficient methods (DARTS, Bayesian optimization) matter most when each architecture evaluation is expensive.
- Some form of regularization (aging in evolution, entropy bonuses in RL, stabilization in DARTS variants) helps prevent premature convergence to degenerate solutions.
You now understand the major NAS search strategies. Next, we'll explore weight sharing—the technique that made NAS practical by dramatically reducing evaluation cost.