Random search's independence property—each sample is drawn and evaluated independently—makes it embarrassingly parallel. This is one of its most significant practical advantages: linear speedup with additional workers, no coordination overhead, and trivial distribution across machines.
In the era of cloud computing and distributed systems, the ability to throw more hardware at a problem is valuable. Random search scales perfectly: 10 machines complete 10× faster with no algorithmic changes. This page develops the theory and practice of parallelizing random search for maximum efficiency.
By the end of this page, you will understand why random search parallelizes perfectly, implement multi-process and distributed random search, handle fault tolerance and checkpointing, deploy random search on cloud infrastructure, and compare parallelization strategies.
Random search's parallelizability stems from a fundamental property: each trial is statistically independent and requires no information from other trials.
Formal Statement:
Let $\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(n)}$ be $n$ configurations sampled from distribution $p(\mathbf{h})$. For any partition of indices $I_1, \ldots, I_k$ (where $\bigcup_j I_j = \{1, \ldots, n\}$):
$$P(\mathbf{h}^{(i)} : i \in I_j) = \prod_{i \in I_j} p(\mathbf{h}^{(i)})$$
This means we can assign any subset of trials to any worker without affecting the distribution of results.
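As a small illustration (the two-parameter search space here is invented for the example), independent seed streams let each worker draw its share of trials in complete isolation, and the merged results are just the union of the workers' draws:

```python
import numpy as np

# Illustration of the independence property: spawn one statistically
# independent random stream per worker, let each worker sample its
# share of trials in isolation, and merge at the end.

def sample_config(rng):
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform
        "n_layers": int(rng.integers(1, 6)),
    }

root = np.random.SeedSequence(42)
worker_seeds = root.spawn(4)  # 4 workers, no shared state needed

all_configs = []
for seed in worker_seeds:
    rng = np.random.default_rng(seed)  # each worker's private generator
    all_configs.extend(sample_config(rng) for _ in range(25))

print(len(all_configs))  # 100 trials, zero inter-worker communication
```

`SeedSequence.spawn` is NumPy's recommended way to derive independent child streams, which avoids the classic pitfall of workers accidentally sharing (or colliding on) seeds.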
Contrast with Sequential Methods:
| Method | Information Flow | Parallelization |
|---|---|---|
| Random Search | None between trials | Perfect (linear speedup) |
| Bayesian Optimization | Each trial informs next | Limited (batch BO possible but complex) |
| Evolutionary | Population interactions | Moderate (island models) |
| Grid Search | None | Perfect (but less useful) |
Computer scientists call problems 'embarrassingly parallel' when they require no inter-process communication. Random search is the canonical example in hyperparameter optimization. The only coordination needed is merging results at the end.
For single-machine parallelization, Python's multiprocessing or joblib provide straightforward parallel random search.
Key Considerations:

- Give each worker a distinct seed so workers do not duplicate samples.
- Use processes, not threads, for CPU-bound evaluation: the GIL blocks thread-level parallelism in Python.
- The evaluation function and sampled configurations must be picklable to cross process boundaries.
```python
import time
import multiprocessing as mp
import numpy as np
from typing import Dict, Any, Callable, List, Optional
from dataclasses import dataclass
from concurrent.futures import ProcessPoolExecutor, as_completed


@dataclass
class ParallelSearchResult:
    """Result from parallel random search."""
    best_config: Dict[str, Any]
    best_score: float
    all_results: List[Dict]
    n_workers: int
    total_time: float


def worker_search(
    worker_id: int,
    n_trials: int,
    base_seed: int,
    distributions: Dict[str, Any],
    evaluate_fn: Callable
) -> List[Dict]:
    """
    Worker function for parallel random search.

    Each worker runs n_trials independent samples.
    """
    # Unique seed for this worker so workers don't duplicate samples
    np.random.seed(base_seed + worker_id * 10000)

    results = []
    for i in range(n_trials):
        # Sample configuration
        config = {k: v.sample() for k, v in distributions.items()}

        # Evaluate; a failed trial gets -inf so it never wins
        try:
            score = evaluate_fn(config)
            status = 'success'
        except Exception as e:
            score = float('-inf')
            status = f'failed: {e}'

        results.append({
            'worker_id': worker_id,
            'trial': i,
            'config': config,
            'score': score,
            'status': status
        })

    return results


class ParallelRandomSearch:
    """Parallel random search using a process pool."""

    def __init__(
        self,
        distributions: Dict[str, Any],
        n_iter: int = 100,
        n_workers: Optional[int] = None,
        random_state: int = 42
    ):
        self.distributions = distributions
        self.n_iter = n_iter
        self.n_workers = n_workers or mp.cpu_count()
        self.random_state = random_state

    def run(self, evaluate_fn: Callable) -> ParallelSearchResult:
        """Execute parallel random search."""
        start_time = time.time()

        # Divide trials among workers
        trials_per_worker = self.n_iter // self.n_workers
        remainder = self.n_iter % self.n_workers

        all_results = []
        with ProcessPoolExecutor(max_workers=self.n_workers) as executor:
            futures = []
            for worker_id in range(self.n_workers):
                # Distribute remainder trials to the first workers
                n_trials = trials_per_worker + (1 if worker_id < remainder else 0)
                futures.append(executor.submit(
                    worker_search,
                    worker_id=worker_id,
                    n_trials=n_trials,
                    base_seed=self.random_state,
                    distributions=self.distributions,
                    evaluate_fn=evaluate_fn
                ))

            # Collect results as they complete
            for future in as_completed(futures):
                all_results.extend(future.result())

        # Find best
        best_result = max(all_results, key=lambda x: x['score'])

        return ParallelSearchResult(
            best_config=best_result['config'],
            best_score=best_result['score'],
            all_results=all_results,
            n_workers=self.n_workers,
            total_time=time.time() - start_time
        )


# Alternative: joblib for simpler syntax
def parallel_random_search_joblib(
    distributions: Dict[str, Any],
    evaluate_fn: Callable,
    n_iter: int = 100,
    n_jobs: int = -1,
    random_state: int = 42
) -> ParallelSearchResult:
    """
    Parallel random search using joblib.

    Simpler API, good for quick prototyping.
    """
    from joblib import Parallel, delayed

    start_time = time.time()

    def evaluate_single(trial_id):
        # Per-trial seed keeps samples reproducible and distinct
        np.random.seed(random_state + trial_id)
        config = {k: v.sample() for k, v in distributions.items()}
        try:
            score = evaluate_fn(config)
        except Exception:
            score = float('-inf')
        return {'config': config, 'score': score}

    results = Parallel(n_jobs=n_jobs)(
        delayed(evaluate_single)(i) for i in range(n_iter)
    )

    best = max(results, key=lambda x: x['score'])
    return ParallelSearchResult(
        best_config=best['config'],
        best_score=best['score'],
        all_results=results,
        n_workers=n_jobs if n_jobs > 0 else mp.cpu_count(),
        total_time=time.time() - start_time
    )
```

For truly large-scale hyperparameter optimization, we need to distribute across multiple machines. Several patterns work well.
Pattern 1: Shared Filesystem
The simplest approach for clusters with shared storage: the master writes each pending trial as a file, and workers claim trials by rewriting the file's status field (implemented below).
Pattern 2: Message Queue
More robust for cloud or heterogeneous environments: the master publishes trial configurations to a queue (e.g., Redis, SQS, RabbitMQ), and workers consume them at their own pace.
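A single-process sketch of the message-queue pattern, with Python's `queue.Queue` standing in for a real broker such as Redis, SQS, or RabbitMQ (the objective function is a toy stand-in):

```python
import queue
import random
import threading

# Master enqueues configurations; workers pull, evaluate, and push
# results onto a second queue. A None "poison pill" per worker signals
# shutdown once the task queue has drained.

task_q = queue.Queue()
result_q = queue.Queue()
N_TRIALS, N_WORKERS = 20, 4

def evaluate(config):
    # Toy objective: best at learning_rate = 0.01.
    return -abs(config["learning_rate"] - 0.01)

def worker():
    while True:
        config = task_q.get()
        if config is None:  # poison pill: no more work
            break
        result_q.put({"config": config, "score": evaluate(config)})

# Master: enqueue sampled configurations.
rng = random.Random(0)
for _ in range(N_TRIALS):
    task_q.put({"learning_rate": 10 ** rng.uniform(-4, -1)})

threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for _ in threads:
    task_q.put(None)  # one shutdown signal per worker
for t in threads:
    t.join()

results = [result_q.get() for _ in range(N_TRIALS)]
best = max(results, key=lambda r: r["score"])
print(best["score"] <= 0.0)  # True: scores are distances from the optimum
```

Threads suffice for this illustration; in the real pattern each worker is a separate machine holding a connection to the broker, and the queues provide all the coordination random search needs.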
Pattern 3: Database Backend
Best for long-running searches that need fault tolerance: trial state lives in a database (e.g., SQLite, PostgreSQL), so claims are transactional and progress survives restarts.
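A sketch of the database-backend pattern using SQLite (a server database such as PostgreSQL works the same way); the table layout and toy objective are illustrative:

```python
import json
import sqlite3

# Trial state lives in one table; a claim is a status flip inside a
# transaction, and completed scores survive worker restarts.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trials ("
    " trial_id INTEGER PRIMARY KEY,"
    " config TEXT, status TEXT DEFAULT 'pending',"
    " score REAL, worker TEXT)"
)

# Master: insert sampled configurations as pending trials.
for lr in (0.1, 0.01, 0.001):
    conn.execute("INSERT INTO trials (config) VALUES (?)",
                 (json.dumps({"learning_rate": lr}),))
conn.commit()

def claim_trial(worker_id):
    """Claim one pending trial; returns (trial_id, config) or None.

    For true multi-worker safety, wrap the select-and-update in an
    exclusive transaction (BEGIN IMMEDIATE) or check the UPDATE's
    rowcount before treating the claim as won.
    """
    row = conn.execute(
        "SELECT trial_id, config FROM trials "
        "WHERE status='pending' LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE trials SET status='running', worker=? "
                 "WHERE trial_id=?", (worker_id, row[0]))
    conn.commit()
    return row

# Worker loop: claim, evaluate, record.
while (row := claim_trial("worker-0")) is not None:
    trial_id, config_json = row
    lr = json.loads(config_json)["learning_rate"]
    score = -abs(lr - 0.01)  # toy objective, peak at lr = 0.01
    conn.execute("UPDATE trials SET status='completed', score=? "
                 "WHERE trial_id=?", (score, trial_id))
    conn.commit()

best = conn.execute("SELECT config, score FROM trials "
                    "ORDER BY score DESC LIMIT 1").fetchone()
print(json.loads(best[0]))  # {'learning_rate': 0.01}
```

The database gives you the fault-tolerance pieces almost for free: atomic claims, durable results, and a single query to find the best trial at any time.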
```python
import json
import hashlib
import time
import uuid
from typing import Dict, Any, Callable, Optional
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Trial:
    """Represents a single hyperparameter trial."""
    trial_id: str
    config: Dict[str, Any]
    status: str  # 'pending', 'running', 'completed', 'failed'
    score: Optional[float] = None
    worker_id: Optional[str] = None
    start_time: Optional[float] = None
    end_time: Optional[float] = None
    error: Optional[str] = None


class FileSystemBackend:
    """
    Simple file-system based distributed coordination.

    Works with any shared filesystem (NFS, S3, GCS).
    """

    def __init__(self, base_path: str):
        self.base_path = Path(base_path)
        self.trials_path = self.base_path / "trials"
        self.results_path = self.base_path / "results"
        self.trials_path.mkdir(parents=True, exist_ok=True)
        self.results_path.mkdir(parents=True, exist_ok=True)

    def _write_trial(self, trial: Trial):
        trial_file = self.trials_path / f"{trial.trial_id}.json"
        with open(trial_file, 'w') as f:
            json.dump(asdict(trial), f)

    def create_trial(self, config: Dict[str, Any]) -> Trial:
        """Create a new pending trial."""
        trial_id = hashlib.md5(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12]

        trial = Trial(trial_id=trial_id, config=config, status='pending')
        self._write_trial(trial)
        return trial

    def claim_trial(self, worker_id: str) -> Optional[Trial]:
        """Claim a pending trial for execution.

        NOTE: this read-modify-write is not atomic on a shared
        filesystem; if double-claims must be prevented, claim via an
        atomic rename into a claimed/ directory or use a lock file.
        """
        for trial_file in self.trials_path.glob("*.json"):
            with open(trial_file, 'r') as f:
                trial_data = json.load(f)

            if trial_data['status'] == 'pending':
                # Claim it
                trial_data['status'] = 'running'
                trial_data['worker_id'] = worker_id
                trial_data['start_time'] = time.time()
                with open(trial_file, 'w') as f:
                    json.dump(trial_data, f)
                return Trial(**trial_data)

        return None

    def complete_trial(self, trial: Trial, score: float):
        """Mark trial as completed with its score."""
        trial.status = 'completed'
        trial.score = score
        trial.end_time = time.time()

        # Write to results
        result_file = self.results_path / f"{trial.trial_id}.json"
        with open(result_file, 'w') as f:
            json.dump(asdict(trial), f)

        # Update trial status
        self._write_trial(trial)

    def fail_trial(self, trial: Trial, error: str):
        """Mark trial as failed so it can be inspected or re-queued."""
        trial.status = 'failed'
        trial.error = error
        trial.end_time = time.time()
        self._write_trial(trial)

    def get_best_result(self) -> Optional[Trial]:
        """Get the best completed trial."""
        best_trial = None
        best_score = float('-inf')

        for result_file in self.results_path.glob("*.json"):
            with open(result_file, 'r') as f:
                trial_data = json.load(f)
            # Explicit None check: a legitimate score of 0.0 is falsy
            if trial_data['score'] is not None and trial_data['score'] > best_score:
                best_score = trial_data['score']
                best_trial = Trial(**trial_data)

        return best_trial


# Worker script
def run_worker(
    backend: FileSystemBackend,
    evaluate_fn: Callable,
    worker_id: Optional[str] = None
):
    """
    Worker process that claims and executes trials.

    Run this on each worker machine.
    """
    worker_id = worker_id or str(uuid.uuid4())[:8]
    print(f"Worker {worker_id} starting...")

    while True:
        trial = backend.claim_trial(worker_id)
        if trial is None:
            print(f"Worker {worker_id}: No pending trials, exiting")
            break

        print(f"Worker {worker_id}: Executing trial {trial.trial_id}")
        try:
            score = evaluate_fn(trial.config)
            backend.complete_trial(trial, score)
            print(f"Worker {worker_id}: Trial {trial.trial_id} = {score:.4f}")
        except Exception as e:
            backend.fail_trial(trial, str(e))
            print(f"Worker {worker_id}: Trial {trial.trial_id} failed: {e}")


# Master script
def run_master(
    backend: FileSystemBackend,
    distributions: Dict[str, Any],
    n_trials: int,
    random_state: int = 42
):
    """
    Master process that creates trials.

    Run this once before starting workers.
    """
    import numpy as np
    np.random.seed(random_state)

    print(f"Creating {n_trials} trials...")
    for _ in range(n_trials):
        config = {k: v.sample() for k, v in distributions.items()}
        trial = backend.create_trial(config)
        print(f"Created trial {trial.trial_id}")

    print("All trials created. Start workers now.")
```

Long-running distributed search must handle failures gracefully. Random search's independence property makes this straightforward.
Failure Modes:

- Worker crash or spot termination mid-trial, leaving the trial stuck in 'running'
- Evaluation errors (bad configurations, out-of-memory)
- Lost or corrupted result files on shared storage
- Hung evaluations that never return
Mitigation Strategies:

- Persist each result as soon as the trial completes (checkpointing)
- Re-queue trials stuck in 'running' longer than a timeout
- Record failures explicitly, then retry or resample them
- Make trial completion idempotent so duplicated work is harmless
Because random search trials are independent, a failed trial doesn't invalidate other work. Simply re-queue failed trials or sample replacements. This is much harder for sequential methods like Bayesian optimization where the chain of trials is interdependent.
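A minimal checkpoint-and-resume sketch (the file layout and toy objective are illustrative): each finished trial is written to disk immediately, so a restarted run skips trials that already have results and only evaluates the rest.

```python
import json
import tempfile
from pathlib import Path

def run_with_checkpointing(configs, evaluate_fn, checkpoint_dir):
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    results = {}
    for trial_id, config in enumerate(configs):
        out = ckpt / f"trial_{trial_id}.json"
        if out.exists():  # finished before a previous crash: reuse it
            results[trial_id] = json.loads(out.read_text())
            continue
        record = {"config": config, "score": evaluate_fn(config)}
        out.write_text(json.dumps(record))  # checkpoint before moving on
        results[trial_id] = record
    return results

objective = lambda c: -abs(c["learning_rate"] - 0.01)  # toy objective
configs = [{"learning_rate": lr} for lr in (0.1, 0.01, 0.001)]

with tempfile.TemporaryDirectory() as d:
    # Simulate a run that dies after the first trial...
    run_with_checkpointing(configs[:1], objective, d)
    # ...then a restart that completes the remaining work.
    final = run_with_checkpointing(configs, objective, d)

print(len(final))  # 3
```

Because trials are independent, "resume" is literally just "skip what's done": no optimizer state needs to be rebuilt, which is exactly what makes this harder for sequential methods.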
Cloud platforms offer scalable compute for random search. Key patterns for major providers:
AWS: EC2 Spot Instances for cheap compute, SQS or AWS Batch for distributing trials, S3 for shared trial and result storage; SageMaker also offers managed hyperparameter tuning with random search.
GCP: Spot (preemptible) VMs, Pub/Sub for queueing, Cloud Storage for results; Vertex AI offers managed hyperparameter tuning.
Azure: Spot Virtual Machines, Queue Storage or Service Bus for queueing, Blob Storage for results; Azure Machine Learning offers managed sweep jobs.
Example: the same search run at different levels of parallelism. Wall-clock time shrinks linearly while total compute hours, and hence total on-demand cost, stay constant:

| Configuration | Wall-clock time | Total cost (AWS example) |
|---|---|---|
| 1 on-demand instance | 16.7 hours | $1.50 - $15 |
| 10 on-demand instances | 1.7 hours | $1.50 - $15 |
| 10 spot instances | 1.7 hours | $0.30 - $3 |
| 100 spot instances | 10 minutes | $0.30 - $3 |
Spot/preemptible instances can be terminated at any time. Ensure your random search implementation handles this gracefully with checkpointing and trial reclamation. The cost savings (60-90%) are worth the added complexity.
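A sketch of trial reclamation after spot termination (the dict layout mirrors the `Trial` records above; the timeout value is an assumption to tune against your trial duration): any trial 'running' longer than the timeout is assumed orphaned and flipped back to 'pending'.

```python
import time

STALE_AFTER = 15 * 60  # seconds; set well above your longest trial

def reclaim_stale(trials, now=None):
    """Return ids of running trials re-queued as pending."""
    now = time.time() if now is None else now
    reclaimed = []
    for t in trials:
        if t["status"] == "running" and now - t["start_time"] > STALE_AFTER:
            t["status"] = "pending"  # another worker can claim it again
            t["worker_id"] = None
            reclaimed.append(t["trial_id"])
    return reclaimed

trials = [
    {"trial_id": "a1", "status": "running", "start_time": 0.0, "worker_id": "w1"},
    {"trial_id": "b2", "status": "completed", "start_time": 0.0, "worker_id": "w2"},
]
print(reclaim_stale(trials, now=3600.0))  # ['a1']
```

Run a sweep like this periodically (or on master restart); combined with idempotent completion, it makes spot terminations cost only the lost in-flight trials.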
Module Complete!
You have completed the Random Search module. You now understand random sampling foundations, why random beats grid, coverage probability theory, practical advantages, and parallelization strategies.
Next Steps:
The next module covers Bayesian Optimization—a more sophisticated approach that uses information from previous evaluations to guide the search, achieving better results when evaluation budgets are very limited.
Congratulations! You have mastered random search for hyperparameter optimization. You can implement efficient, parallelized random search with theoretical understanding of its guarantees and practical advantages.