Even with weight sharing reducing NAS from thousands of GPU-days to hours, further efficiency gains remain valuable: the faster a search runs, the cheaper it is to repeat for new datasets, tasks, and hardware targets.
This page explores the frontier of efficient NAS: zero-cost proxies that evaluate architectures without training, early stopping methods that eliminate poor candidates quickly, performance predictors that estimate final accuracy from partial training, and practical systems that combine these techniques.
Master efficient NAS techniques: zero-cost proxies based on network properties, learning curve extrapolation, predictor-based evaluation, and practical efficient NAS systems like Once-for-All and BigNAS.
Zero-cost proxies estimate architecture quality from network properties computed at initialization—before any training. If reliable, they enable evaluating millions of architectures in minutes.
The Idea:
Certain properties of randomly-initialized networks correlate with trained performance. By measuring these properties, we can rank architectures without training.
Common Zero-Cost Proxies:
| Proxy | Measures | Intuition | Compute Cost |
|---|---|---|---|
| synflow | Synaptic flow (gradient × weight) | Trainability; gradient signal preservation | 1 forward + backward |
| grasp | Gradient × Hessian | Trainability; gradient flow quality | 2 forward + backward |
| snip | Sensitivity-based saliency | Parameter importance | 1 forward + backward |
| jacob_cov | Jacobian covariance | Input-output correlation | Multiple forward passes |
| fisher | Fisher information | Information about labels | 1 forward + backward |
| #params | Parameter count | Capacity (crude baseline) | Instantaneous |
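The parameter count in the last row is the crudest of these proxies, but it costs essentially nothing to compute and makes a useful sanity-check baseline when evaluating fancier proxies. A minimal sketch for a standard PyTorch module:

```python
def compute_num_params(model):
    # Crude capacity proxy: total number of trainable parameters
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```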
```python
import torch
import torch.nn as nn
import numpy as np


def compute_synflow(model, input_shape, device='cuda'):
    """
    SynFlow: measures the gradient × weight product at initialization.
    Higher values indicate better trainability.
    """
    model.to(device)
    model.eval()

    # Create dummy input (all ones for SynFlow)
    x = torch.ones(1, *input_shape, device=device)

    # Make all parameters positive for SynFlow
    @torch.no_grad()
    def linearize(model):
        signs = {}
        for name, param in model.named_parameters():
            signs[name] = param.sign()
            param.abs_()
        return signs

    @torch.no_grad()
    def restore(model, signs):
        for name, param in model.named_parameters():
            param.mul_(signs[name])

    signs = linearize(model)

    # Forward pass with the all-ones input
    model.zero_grad()
    output = model(x)
    if isinstance(output, tuple):
        output = output[0]
    torch.sum(output).backward()

    # SynFlow score: sum of |gradient × param|
    score = 0.0
    for param in model.parameters():
        if param.grad is not None:
            score += (param.grad * param).abs().sum().item()

    restore(model, signs)
    return score


def compute_jacob_cov(model, data_loader, num_batches=5, device='cuda'):
    """
    Jacobian covariance: measures correlation between inputs and outputs.
    Higher correlation suggests better feature propagation.
    """
    model.to(device)
    model.eval()

    jacobians = []
    for i, (x, _) in enumerate(data_loader):
        if i >= num_batches:
            break
        x = x.to(device).requires_grad_(True)
        y = model(x)

        # Compute the Jacobian dy/dx, one output dimension at a time
        for j in range(y.shape[1]):
            model.zero_grad()
            y[:, j].sum().backward(retain_graph=True)
            # Clone before zeroing so the stored gradient is not overwritten
            jacobians.append(x.grad.detach().clone().view(x.shape[0], -1))
            x.grad.zero_()

    # Stack and compute the (uncentered) covariance
    J = torch.cat(jacobians, dim=0)

    # Score: trace of the covariance, normalized by the number of rows
    return torch.trace(J.T @ J).item() / J.shape[0]


def zcp_ranking(architectures, model_builder, proxy_fn, **kwargs):
    """
    Rank architectures using a zero-cost proxy.
    """
    scores = []
    for arch in architectures:
        model = model_builder(arch)
        score = proxy_fn(model, **kwargs)
        scores.append((arch, score))

    # Sort by score (higher is better for most proxies)
    return sorted(scores, key=lambda x: x[1], reverse=True)
```

Zero-cost proxies have limited accuracy: they explain roughly 50-70% of the performance variance on typical benchmarks. They are best used for initial filtering (eliminating clearly bad architectures) rather than final selection; combine them with other methods for the best results.
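As a hedged illustration of that filtering workflow, the sketch below assumes helper functions `sample_random_arch` and `build_model` that are not defined on this page:

```python
# Rank a large random pool with SynFlow before any training, then keep a
# shortlist for real evaluation (sample_random_arch and build_model are assumed).
pool = [sample_random_arch() for _ in range(1000)]
ranked = zcp_ranking(pool, build_model, compute_synflow, input_shape=(3, 32, 32))
shortlist = [arch for arch, _ in ranked[:50]]  # train only the top 5%
```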
Rather than training to completion, we can stop unpromising architectures early and extrapolate final performance from partial learning curves.
Successive Halving (SHA):
A simple but effective multi-fidelity approach: train every candidate for a small budget, keep only the best 1/eta fraction, multiply the budget by eta, and repeat until a single candidate remains.
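For concreteness, here is the schedule produced by the defaults used in the implementation further below (81 candidates, reduction factor 3, budgets growing from 1 to 81 epochs):

```python
# Worked SHA schedule: 81 candidates, eta = 3, budget grows from 1 to 81 epochs
n, budget, eta = 81, 1, 3
while n > 1:
    print(f"{n:2d} candidates x {budget:2d} epochs -> keep {max(1, n // eta)}")
    n, budget = max(1, n // eta), budget * eta
# 81x1 -> keep 27, 27x3 -> keep 9, 9x9 -> keep 3, 3x27 -> keep 1
```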
Hyperband:
Combines multiple SHA brackets that trade off the number of starting configurations against the initial per-candidate budget, hedging against choosing a single, possibly wrong, level of early-stopping aggressiveness.
```python
import random

import numpy as np


def successive_halving(
    architectures,
    evaluate_fn,
    initial_budget=1,   # Initial epochs per architecture
    max_budget=81,      # Maximum epochs
    eta=3               # Reduction factor
):
    """
    Successive Halving for architecture selection.

    Args:
        architectures: List of candidate architectures
        evaluate_fn(arch, budget): Train for 'budget' epochs, return accuracy
        initial_budget: Starting training budget
        max_budget: Maximum training budget
        eta: Reduction factor (keep 1/eta best each round)

    Returns:
        (best_arch, epochs_trained, accuracy)
    """
    # Current candidates with their accumulated training
    candidates = [(arch, 0) for arch in architectures]
    current_budget = initial_budget
    best = None

    while len(candidates) > 1 and current_budget <= max_budget:
        print(f"Round: {len(candidates)} candidates, {current_budget} epochs each")

        # Evaluate all remaining candidates
        results = []
        for arch, trained_epochs in candidates:
            # Train for additional epochs (up to current_budget total)
            acc = evaluate_fn(arch, budget=current_budget)
            results.append((arch, current_budget, acc))

        # Sort by accuracy, keep the top 1/eta
        results.sort(key=lambda x: x[2], reverse=True)
        n_keep = max(1, len(results) // eta)
        best = results[0]
        candidates = [(r[0], r[1]) for r in results[:n_keep]]
        current_budget *= eta

    if best is None:  # Only a single candidate was given
        arch = candidates[0][0]
        best = (arch, current_budget, evaluate_fn(arch, budget=current_budget))

    return best  # (best_arch, epochs_trained, accuracy)


def hyperband(architectures, evaluate_fn, max_budget=81, eta=3):
    """
    Hyperband: multiple successive halving brackets with different trade-offs
    between the number of configs and the starting budget.
    """
    s_max = int(np.log(max_budget) / np.log(eta))
    B = (s_max + 1) * max_budget  # Total budget

    best_arch = None
    best_acc = 0

    for s in range(s_max, -1, -1):
        # Number of configs for this bracket
        n = int(np.ceil(B / max_budget / (s + 1) * eta ** s))
        # Initial budget for this bracket
        r = max(1, int(max_budget * eta ** (-s)))

        # Sample n architectures
        archs = random.sample(architectures, min(n, len(architectures)))

        # Run successive halving on this bracket
        winner = successive_halving(archs, evaluate_fn, r, max_budget, eta)
        if winner[2] > best_acc:
            best_acc = winner[2]
            best_arch = winner[0]

    return best_arch, best_acc
```

Performance predictors are ML models trained to predict architecture accuracy from architectural features. Once trained on previously evaluated architectures, they enable instant prediction for new candidates.
Predictor Types: options range from simple regressors over hand-crafted encodings (random forests, gradient-boosted trees, MLPs) to graph neural networks that operate directly on the architecture DAG; the code below sketches one of each.
Building a Predictor:
```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def encode_architecture(arch, num_nodes, num_ops):
    """
    Encode an architecture as a feature vector.
    Simple approach: one-hot encoding of operations and connections.
    """
    features = []
    for node in arch.nodes:
        # One-hot for input selections
        inp1_onehot = [0] * (2 + num_nodes)
        inp2_onehot = [0] * (2 + num_nodes)
        inp1_onehot[node['inp1']] = 1
        inp2_onehot[node['inp2']] = 1

        # One-hot for operation selections
        op1_onehot = [0] * num_ops
        op2_onehot = [0] * num_ops
        op1_onehot[node['op1']] = 1
        op2_onehot[node['op2']] = 1

        features.extend(inp1_onehot + inp2_onehot + op1_onehot + op2_onehot)

    return np.array(features)


class ArchPredictor:
    """
    Predictor for architecture performance.
    Uses a Random Forest for interpretability and few-shot performance.
    """
    def __init__(self, num_nodes, num_ops):
        self.num_nodes = num_nodes
        self.num_ops = num_ops
        self.model = RandomForestRegressor(
            n_estimators=100,
            max_depth=10,
            random_state=42
        )
        self.trained = False

    def _encode(self, arch):
        return encode_architecture(arch, self.num_nodes, self.num_ops)

    def fit(self, architectures, accuracies):
        """Train the predictor on evaluated architectures."""
        X = np.array([self._encode(a) for a in architectures])
        y = np.array(accuracies)
        self.model.fit(X, y)
        self.trained = True

    def predict(self, architecture):
        """Predict accuracy for a new architecture."""
        if not self.trained:
            raise ValueError("Predictor not trained")
        x = self._encode(architecture).reshape(1, -1)
        return self.model.predict(x)[0]

    def predict_batch(self, architectures):
        """Predict for multiple architectures."""
        X = np.array([self._encode(a) for a in architectures])
        return self.model.predict(X)


class GNNPredictor(nn.Module):
    """
    Graph Neural Network predictor for architectures.
    Treats the architecture as a graph and predicts performance.
    """
    def __init__(self, num_ops, hidden_dim=64):
        super().__init__()
        self.op_embed = nn.Embedding(num_ops, hidden_dim)

        # Message passing layers
        self.conv1 = nn.Linear(hidden_dim, hidden_dim)
        self.conv2 = nn.Linear(hidden_dim, hidden_dim)

        # Readout
        self.readout = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def forward(self, adj, ops):
        """
        adj: Adjacency matrix [N, N]
        ops: Operation indices per node [N]
        """
        x = self.op_embed(ops)  # [N, hidden]

        # Message passing
        x = torch.relu(self.conv1(adj @ x))
        x = torch.relu(self.conv2(adj @ x))

        # Global pooling
        x = x.mean(dim=0)  # [hidden]
        return self.readout(x)  # [1]
```

Combine predictors with acquisition functions for smart sampling: evaluate architectures the predictor is uncertain about (exploration) or that it predicts will perform well (exploitation). This is essentially Bayesian optimization with a learned surrogate.
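A hedged sketch of that predictor-guided loop using a UCB-style acquisition is shown below. The names `candidate_pool` and `train_and_eval`, and the search-space sizes passed to `ArchPredictor`, are assumptions for illustration; the predictor is the Random Forest class defined above.

```python
import random

import numpy as np

predictor = ArchPredictor(num_nodes=4, num_ops=5)  # hypothetical search-space sizes

def ucb_scores(predictor, archs, beta=1.0):
    # The spread of per-tree predictions gives a cheap uncertainty estimate
    X = np.array([predictor._encode(a) for a in archs])
    per_tree = np.stack([tree.predict(X) for tree in predictor.model.estimators_])
    return per_tree.mean(axis=0) + beta * per_tree.std(axis=0)  # exploit + explore

evaluated, accuracies = [], []
for step in range(20):
    if len(evaluated) < 5:
        arch = random.choice(candidate_pool)      # cold start: random sampling
    else:
        predictor.fit(evaluated, accuracies)      # refit on everything seen so far
        scores = ucb_scores(predictor, candidate_pool)
        arch = candidate_pool[int(np.argmax(scores))]
    acc = train_and_eval(arch)                    # expensive true evaluation
    evaluated.append(arch)
    accuracies.append(acc)
    candidate_pool.remove(arch)
```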
Once-for-All (OFA) networks take weight sharing to its logical conclusion: train one network, deploy many architectural variants without any additional training.
Key Innovation: Progressive Shrinking
Rather than training all subnets equally (problematic because their weights are coupled), OFA trains progressively: first the full network at maximum depth, width, and kernel size, then with elastic kernel sizes, then elastic depth, and finally elastic width, so that smaller subnets are fine-tuned without degrading the larger ones.
This produces a single model supporting 10^19+ architectural configurations.
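That count can be sanity-checked with a quick back-of-the-envelope calculation, assuming an OFA-style MobileNet space with 5 units, per-unit depth in {2, 3, 4}, and 3 kernel sizes times 3 width-expansion ratios per layer:

```python
# Each layer has 3 kernel sizes x 3 expansion ratios = 9 choices; each unit
# can use 2, 3, or 4 of its layers; the network has 5 such units.
per_unit = sum(9 ** depth for depth in (2, 3, 4))  # 7,371 subnets per unit
total = per_unit ** 5
print(f"{total:.1e}")                              # ~2.2e19 configurations
```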
```python
import random

import torch
import torch.nn as nn


class OFAMobileNet(nn.Module):
    """
    Once-for-All style network with elastic depth/width/kernel.
    (ElasticBlock and the classifier head are assumed to be defined elsewhere.)
    """
    def __init__(self, max_depth=4, max_width=1.0, kernel_sizes=[3, 5, 7]):
        super().__init__()
        self.max_depth = max_depth
        self.max_width = max_width
        self.kernel_sizes = kernel_sizes

        # Build blocks that support all kernel sizes
        self.blocks = nn.ModuleList([
            ElasticBlock(
                in_channels=...,
                out_channels=...,
                kernel_sizes=kernel_sizes
            )
            for _ in range(max_depth)
        ])

    def forward(self, x, config):
        """
        config: dict with 'depth', 'width', 'kernel_sizes'
        """
        depth = config['depth']
        width = config['width']
        kernels = config['kernel_sizes']

        for i in range(depth):
            x = self.blocks[i](x, width=width, kernel=kernels[i])

        return self.classifier(x)

    def sample_subnet(self, constraint=None, elastic=('kernel', 'depth', 'width')):
        """
        Sample a random subnet, optionally respecting a hardware constraint.
        Dimensions not listed in `elastic` stay at their maximum setting.
        """
        if constraint:
            # Use a predictor to find an architecture meeting the constraint
            return self.constrained_sample(constraint)

        return {
            'depth': random.randint(2, self.max_depth) if 'depth' in elastic else self.max_depth,
            'width': random.choice([0.5, 0.75, 1.0]) if 'width' in elastic else 1.0,
            'kernel_sizes': [
                random.choice(self.kernel_sizes) if 'kernel' in elastic else max(self.kernel_sizes)
                for _ in range(self.max_depth)
            ]
        }


def progressive_shrinking_training(ofa_net, train_loader, epochs_per_stage):
    """
    Progressive shrinking: train the largest network first, then shrink.
    """
    optimizer = torch.optim.SGD(ofa_net.parameters(), lr=0.01)

    # Stage 1: Train the full network
    print("Stage 1: Full network")
    full_config = {'depth': 4, 'width': 1.0, 'kernel_sizes': [7, 7, 7, 7]}
    for epoch in range(epochs_per_stage):
        train_epoch(ofa_net, train_loader, optimizer, full_config)

    # Stage 2: Add elastic kernel
    print("Stage 2: Elastic kernel")
    for epoch in range(epochs_per_stage):
        config = ofa_net.sample_subnet(elastic=('kernel',))
        train_epoch(ofa_net, train_loader, optimizer, config)

    # Stage 3: Add elastic depth
    print("Stage 3: Elastic depth")
    for epoch in range(epochs_per_stage):
        config = ofa_net.sample_subnet(elastic=('kernel', 'depth'))
        train_epoch(ofa_net, train_loader, optimizer, config)

    # Stage 4: Add elastic width
    print("Stage 4: Elastic width (full elasticity)")
    for epoch in range(epochs_per_stage):
        config = ofa_net.sample_subnet(elastic=('kernel', 'depth', 'width'))
        train_epoch(ofa_net, train_loader, optimizer, config)
```

OFA enables deployment to thousands of different hardware targets (phones, tablets, edge devices) from a single trained network. Search takes minutes since no training is needed: just evaluate subnets on the validation set.
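A hedged sketch of that deployment-time search follows; the helpers `measure_latency` and `evaluate_subnet`, and the latency budget, are assumptions rather than part of the code above:

```python
# Pick the most accurate subnet that meets a latency budget, with no training.
LATENCY_BUDGET_MS = 25.0
best_cfg, best_acc = None, 0.0
for _ in range(500):
    cfg = ofa_net.sample_subnet()
    if measure_latency(cfg) > LATENCY_BUDGET_MS:      # reject configs over budget
        continue
    acc = evaluate_subnet(ofa_net, cfg, val_loader)   # forward passes only
    if acc > best_acc:
        best_cfg, best_acc = cfg, acc
```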
Modern efficient NAS doesn't just optimize accuracy—it optimizes for hardware constraints: latency, energy, memory, or FLOPs on specific target devices.
Multi-Objective Formulation:
$$\max_{a \in \mathcal{A}} \text{Accuracy}(a)$$ $$\text{s.t. } \text{Latency}(a) \leq L_{max}$$
Or Pareto optimization: $$\max_{a} (\text{Accuracy}(a), -\text{Latency}(a))$$
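One common way to fold the constraint into a single objective is a soft-constraint reward in the style of MnasNet; a minimal sketch, with illustrative target and exponent values:

```python
def multi_objective_reward(accuracy, latency_ms, target_ms=80.0, w=-0.07):
    # Accuracy scaled by a latency penalty: models slower than the target are
    # discounted, faster ones receive a mild bonus.
    return accuracy * (latency_ms / target_ms) ** w
```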
Approaches:
| Method | Approach | Advantages |
|---|---|---|
| ProxylessNAS | Differentiable latency with learnable thresholds | Direct hardware optimization |
| FBNet | Latency lookup table in search | Fast, accurate for specific HW |
| MnasNet | Multi-objective RL with Pareto reward | Flexible constraint handling |
| AttentionNAS | Attention-based architecture sampling | Efficient multi-objective |
Latency Modeling:
Direct hardware measurement is slow. Common alternatives: per-operator latency lookup tables built from offline measurements, learned latency predictors trained on measured architectures, and FLOPs or parameter counts as crude surrogates.
Key insight: FLOPs don't perfectly correlate with latency. Memory access patterns, parallelism, and hardware-specific optimizations matter.
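A hedged sketch of the lookup-table approach; the table entries, keyed here by operator type, input resolution, and channel count, are hypothetical measurements:

```python
# Per-operator latencies measured once on the target device (values made up)
OP_LATENCY_MS = {
    ('mbconv_k3', 112, 24): 1.8,
    ('mbconv_k5', 112, 24): 3.1,
    ('mbconv_k7', 56, 48): 2.4,
}

def estimate_latency(arch):
    # Assumes sequential execution: total latency is the sum over layers
    return sum(OP_LATENCY_MS[(layer.op, layer.resolution, layer.channels)]
               for layer in arch.layers)
```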
Several libraries and systems make NAS accessible to practitioners:
| System | Key Features | Best For |
|---|---|---|
| AutoKeras | User-friendly; automatic; Keras-based | Beginners; tabular/image/text |
| NNI (Microsoft) | Multiple NAS algorithms; distributed | Research; customizable search |
| Auto-PyTorch | Automated deep learning; HPO + NAS | End-to-end AutoML |
| NATS-Bench | Benchmarking; topology + size search | Fair algorithm comparison |
| AutoGluon | Production-ready; ensemble + NAS | Production deployment |
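To give a sense of what these tools look like in practice, here is a minimal AutoKeras-style flow; this is a sketch, and exact arguments vary between versions (`x_train`, `y_train`, and the test arrays are assumed to be loaded already):

```python
import autokeras as ak

# Search over image classifiers; each trial trains a candidate architecture
clf = ak.ImageClassifier(max_trials=10, overwrite=True)
clf.fit(x_train, y_train, epochs=5)

# Export the best architecture found as a regular Keras model
model = clf.export_model()
print(clf.evaluate(x_test, y_test))
```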
You have now mastered Neural Architecture Search: from foundational concepts through search space design, search strategies, weight sharing, and efficient methods. These techniques represent the frontier of automated machine learning, enabling the discovery of architectures that match or exceed human-designed networks.