Random search's independence property—each sample is drawn and evaluated independently—makes it embarrassingly parallel. This is one of its most significant practical advantages: linear speedup with additional workers, no coordination overhead, and trivial distribution across machines.
In the era of cloud computing and distributed systems, the ability to throw more hardware at a problem is valuable. Random search scales perfectly: 10 machines complete 10× faster with no algorithmic changes. This page develops the theory and practice of parallelizing random search for maximum efficiency.
By the end of this page, you will understand why random search parallelizes perfectly, implement multi-process and distributed random search, handle fault tolerance and checkpointing, deploy random search on cloud infrastructure, and compare parallelization strategies.
Random search's parallelizability stems from a fundamental property: each trial is statistically independent and requires no information from other trials.
Formal Statement:
Let $\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(n)}$ be $n$ configurations sampled from distribution $p(\mathbf{h})$. For any partition of indices $I_1, \ldots, I_k$ (where $\bigcup_j I_j = \{1, \ldots, n\}$):
$$P(\mathbf{h}^{(i)} : i \in I_j) = \prod_{i \in I_j} p(\mathbf{h}^{(i)})$$
This means we can assign any subset of trials to any worker without affecting the distribution of results.
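As a small illustration (the two-parameter search space here is invented for the example), independent seed streams let each worker draw its share of trials in complete isolation, and the merged results are just the union of the workers' draws:

```python
import numpy as np

# Illustration of the independence property: spawn one statistically
# independent random stream per worker, let each worker sample its
# share of trials in isolation, and merge at the end.

def sample_config(rng):
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform
        "n_layers": int(rng.integers(1, 6)),
    }

root = np.random.SeedSequence(42)
worker_seeds = root.spawn(4)  # 4 workers, no shared state needed

all_configs = []
for seed in worker_seeds:
    rng = np.random.default_rng(seed)  # each worker's private generator
    all_configs.extend(sample_config(rng) for _ in range(25))

print(len(all_configs))  # 100 trials, zero inter-worker communication
```

`SeedSequence.spawn` is NumPy's recommended way to derive independent child streams, which avoids the classic pitfall of workers accidentally sharing (or colliding on) seeds.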
Contrast with Sequential Methods:
| Method | Information Flow | Parallelization |
|---|---|---|
| Random Search | None between trials | Perfect (linear speedup) |
| Bayesian Optimization | Each trial informs next | Limited (batch BO possible but complex) |
| Evolutionary | Population interactions | Moderate (island models) |
| Grid Search | None | Perfect (but less useful) |
Computer scientists call problems 'embarrassingly parallel' when they require no inter-process communication. Random search is the canonical example in hyperparameter optimization. The only coordination needed is merging results at the end.
For single-machine parallelization, Python's multiprocessing or joblib provide straightforward parallel random search.
Key Considerations:

- Give each worker a distinct seed so workers do not duplicate samples.
- Use processes, not threads, for CPU-bound evaluation: the GIL blocks thread-level parallelism in Python.
- The evaluation function and sampled configurations must be picklable to cross process boundaries.
```python
import time
import multiprocessing as mp
import numpy as np
from typing import Dict, Any, Callable, List, Optional
from dataclasses import dataclass
from concurrent.futures import ProcessPoolExecutor, as_completed


@dataclass
class ParallelSearchResult:
    """Result from parallel random search."""
    best_config: Dict[str, Any]
    best_score: float
    all_results: List[Dict]
    n_workers: int
    total_time: float


def worker_search(
    worker_id: int,
    n_trials: int,
    base_seed: int,
    distributions: Dict[str, Any],
    evaluate_fn: Callable
) -> List[Dict]:
    """
    Worker function for parallel random search.

    Each worker runs n_trials independent samples.
    """
    # Unique seed for this worker so workers don't duplicate samples
    np.random.seed(base_seed + worker_id * 10000)

    results = []
    for i in range(n_trials):
        # Sample configuration
        config = {k: v.sample() for k, v in distributions.items()}

        # Evaluate; a failed trial gets -inf so it never wins
        try:
            score = evaluate_fn(config)
            status = 'success'
        except Exception as e:
            score = float('-inf')
            status = f'failed: {e}'

        results.append({
            'worker_id': worker_id,
            'trial': i,
            'config': config,
            'score': score,
            'status': status
        })

    return results


class ParallelRandomSearch:
    """Parallel random search using a process pool."""

    def __init__(
        self,
        distributions: Dict[str, Any],
        n_iter: int = 100,
        n_workers: Optional[int] = None,
        random_state: int = 42
    ):
        self.distributions = distributions
        self.n_iter = n_iter
        self.n_workers = n_workers or mp.cpu_count()
        self.random_state = random_state

    def run(self, evaluate_fn: Callable) -> ParallelSearchResult:
        """Execute parallel random search."""
        start_time = time.time()

        # Divide trials among workers
        trials_per_worker = self.n_iter // self.n_workers
        remainder = self.n_iter % self.n_workers

        all_results = []
        with ProcessPoolExecutor(max_workers=self.n_workers) as executor:
            futures = []
            for worker_id in range(self.n_workers):
                # Distribute remainder trials to the first workers
                n_trials = trials_per_worker + (1 if worker_id < remainder else 0)
                futures.append(executor.submit(
                    worker_search,
                    worker_id=worker_id,
                    n_trials=n_trials,
                    base_seed=self.random_state,
                    distributions=self.distributions,
                    evaluate_fn=evaluate_fn
                ))

            # Collect results as they complete
            for future in as_completed(futures):
                all_results.extend(future.result())

        # Find best
        best_result = max(all_results, key=lambda x: x['score'])

        return ParallelSearchResult(
            best_config=best_result['config'],
            best_score=best_result['score'],
            all_results=all_results,
            n_workers=self.n_workers,
            total_time=time.time() - start_time
        )


# Alternative: joblib for simpler syntax
def parallel_random_search_joblib(
    distributions: Dict[str, Any],
    evaluate_fn: Callable,
    n_iter: int = 100,
    n_jobs: int = -1,
    random_state: int = 42
) -> ParallelSearchResult:
    """
    Parallel random search using joblib.

    Simpler API, good for quick prototyping.
    """
    from joblib import Parallel, delayed

    start_time = time.time()

    def evaluate_single(trial_id):
        # Per-trial seed keeps samples reproducible and distinct
        np.random.seed(random_state + trial_id)
        config = {k: v.sample() for k, v in distributions.items()}
        try:
            score = evaluate_fn(config)
        except Exception:
            score = float('-inf')
        return {'config': config, 'score': score}

    results = Parallel(n_jobs=n_jobs)(
        delayed(evaluate_single)(i) for i in range(n_iter)
    )

    best = max(results, key=lambda x: x['score'])
    return ParallelSearchResult(
        best_config=best['config'],
        best_score=best['score'],
        all_results=results,
        n_workers=n_jobs if n_jobs > 0 else mp.cpu_count(),
        total_time=time.time() - start_time
    )
```

For truly large-scale hyperparameter optimization, we need to distribute across multiple machines. Several patterns work well.
Pattern 1: Shared Filesystem
The simplest approach for clusters with shared storage: the master writes each pending trial as a file, and workers claim trials by rewriting the file's status field (implemented below).
Pattern 2: Message Queue
More robust for cloud or heterogeneous environments: the master publishes trial configurations to a queue (e.g., Redis, SQS, RabbitMQ), and workers consume them at their own pace.
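A single-process sketch of the message-queue pattern, with Python's `queue.Queue` standing in for a real broker such as Redis, SQS, or RabbitMQ (the objective function is a toy stand-in):

```python
import queue
import random
import threading

# Master enqueues configurations; workers pull, evaluate, and push
# results onto a second queue. A None "poison pill" per worker signals
# shutdown once the task queue has drained.

task_q = queue.Queue()
result_q = queue.Queue()
N_TRIALS, N_WORKERS = 20, 4

def evaluate(config):
    # Toy objective: best at learning_rate = 0.01.
    return -abs(config["learning_rate"] - 0.01)

def worker():
    while True:
        config = task_q.get()
        if config is None:  # poison pill: no more work
            break
        result_q.put({"config": config, "score": evaluate(config)})

# Master: enqueue sampled configurations.
rng = random.Random(0)
for _ in range(N_TRIALS):
    task_q.put({"learning_rate": 10 ** rng.uniform(-4, -1)})

threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads:
    t.start()
for _ in threads:
    task_q.put(None)  # one shutdown signal per worker
for t in threads:
    t.join()

results = [result_q.get() for _ in range(N_TRIALS)]
best = max(results, key=lambda r: r["score"])
print(best["score"] <= 0.0)  # True: scores are distances from the optimum
```

Threads suffice for this illustration; in the real pattern each worker is a separate machine holding a connection to the broker, and the queues provide all the coordination random search needs.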
Pattern 3: Database Backend
Best for long-running searches that need fault tolerance: trial state lives in a database (e.g., SQLite, PostgreSQL), so claims are transactional and progress survives restarts.
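A sketch of the database-backend pattern using SQLite (a server database such as PostgreSQL works the same way); the table layout and toy objective are illustrative:

```python
import json
import sqlite3

# Trial state lives in one table; a claim is a status flip inside a
# transaction, and completed scores survive worker restarts.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trials ("
    " trial_id INTEGER PRIMARY KEY,"
    " config TEXT, status TEXT DEFAULT 'pending',"
    " score REAL, worker TEXT)"
)

# Master: insert sampled configurations as pending trials.
for lr in (0.1, 0.01, 0.001):
    conn.execute("INSERT INTO trials (config) VALUES (?)",
                 (json.dumps({"learning_rate": lr}),))
conn.commit()

def claim_trial(worker_id):
    """Claim one pending trial; returns (trial_id, config) or None.

    For true multi-worker safety, wrap the select-and-update in an
    exclusive transaction (BEGIN IMMEDIATE) or check the UPDATE's
    rowcount before treating the claim as won.
    """
    row = conn.execute(
        "SELECT trial_id, config FROM trials "
        "WHERE status='pending' LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE trials SET status='running', worker=? "
                 "WHERE trial_id=?", (worker_id, row[0]))
    conn.commit()
    return row

# Worker loop: claim, evaluate, record.
while (row := claim_trial("worker-0")) is not None:
    trial_id, config_json = row
    lr = json.loads(config_json)["learning_rate"]
    score = -abs(lr - 0.01)  # toy objective, peak at lr = 0.01
    conn.execute("UPDATE trials SET status='completed', score=? "
                 "WHERE trial_id=?", (score, trial_id))
    conn.commit()

best = conn.execute("SELECT config, score FROM trials "
                    "ORDER BY score DESC LIMIT 1").fetchone()
print(json.loads(best[0]))  # {'learning_rate': 0.01}
```

The database gives you the fault-tolerance pieces almost for free: atomic claims, durable results, and a single query to find the best trial at any time.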
```python
import json
import hashlib
import time
import uuid
from typing import Dict, Any, Callable, Optional
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class Trial:
    """Represents a single hyperparameter trial."""
    trial_id: str
    config: Dict[str, Any]
    status: str  # 'pending', 'running', 'completed', 'failed'
    score: Optional[float] = None
    worker_id: Optional[str] = None
    start_time: Optional[float] = None
    end_time: Optional[float] = None
    error: Optional[str] = None


class FileSystemBackend:
    """
    Simple file-system based distributed coordination.

    Works with any shared filesystem (NFS, S3, GCS).
    """

    def __init__(self, base_path: str):
        self.base_path = Path(base_path)
        self.trials_path = self.base_path / "trials"
        self.results_path = self.base_path / "results"
        self.trials_path.mkdir(parents=True, exist_ok=True)
        self.results_path.mkdir(parents=True, exist_ok=True)

    def _write_trial(self, trial: Trial):
        trial_file = self.trials_path / f"{trial.trial_id}.json"
        with open(trial_file, 'w') as f:
            json.dump(asdict(trial), f)

    def create_trial(self, config: Dict[str, Any]) -> Trial:
        """Create a new pending trial."""
        trial_id = hashlib.md5(
            json.dumps(config, sort_keys=True).encode()
        ).hexdigest()[:12]

        trial = Trial(trial_id=trial_id, config=config, status='pending')
        self._write_trial(trial)
        return trial

    def claim_trial(self, worker_id: str) -> Optional[Trial]:
        """Claim a pending trial for execution.

        NOTE: this read-modify-write is not atomic on a shared
        filesystem; if double-claims must be prevented, claim via an
        atomic rename into a claimed/ directory or use a lock file.
        """
        for trial_file in self.trials_path.glob("*.json"):
            with open(trial_file, 'r') as f:
                trial_data = json.load(f)

            if trial_data['status'] == 'pending':
                # Claim it
                trial_data['status'] = 'running'
                trial_data['worker_id'] = worker_id
                trial_data['start_time'] = time.time()
                with open(trial_file, 'w') as f:
                    json.dump(trial_data, f)
                return Trial(**trial_data)

        return None

    def complete_trial(self, trial: Trial, score: float):
        """Mark trial as completed with its score."""
        trial.status = 'completed'
        trial.score = score
        trial.end_time = time.time()

        # Write to results
        result_file = self.results_path / f"{trial.trial_id}.json"
        with open(result_file, 'w') as f:
            json.dump(asdict(trial), f)

        # Update trial status
        self._write_trial(trial)

    def fail_trial(self, trial: Trial, error: str):
        """Mark trial as failed so it can be inspected or re-queued."""
        trial.status = 'failed'
        trial.error = error
        trial.end_time = time.time()
        self._write_trial(trial)

    def get_best_result(self) -> Optional[Trial]:
        """Get the best completed trial."""
        best_trial = None
        best_score = float('-inf')

        for result_file in self.results_path.glob("*.json"):
            with open(result_file, 'r') as f:
                trial_data = json.load(f)
            # Explicit None check: a legitimate score of 0.0 is falsy
            if trial_data['score'] is not None and trial_data['score'] > best_score:
                best_score = trial_data['score']
                best_trial = Trial(**trial_data)

        return best_trial


# Worker script
def run_worker(
    backend: FileSystemBackend,
    evaluate_fn: Callable,
    worker_id: Optional[str] = None
):
    """
    Worker process that claims and executes trials.

    Run this on each worker machine.
    """
    worker_id = worker_id or str(uuid.uuid4())[:8]
    print(f"Worker {worker_id} starting...")

    while True:
        trial = backend.claim_trial(worker_id)
        if trial is None:
            print(f"Worker {worker_id}: No pending trials, exiting")
            break

        print(f"Worker {worker_id}: Executing trial {trial.trial_id}")
        try:
            score = evaluate_fn(trial.config)
            backend.complete_trial(trial, score)
            print(f"Worker {worker_id}: Trial {trial.trial_id} = {score:.4f}")
        except Exception as e:
            backend.fail_trial(trial, str(e))
            print(f"Worker {worker_id}: Trial {trial.trial_id} failed: {e}")


# Master script
def run_master(
    backend: FileSystemBackend,
    distributions: Dict[str, Any],
    n_trials: int,
    random_state: int = 42
):
    """
    Master process that creates trials.

    Run this once before starting workers.
    """
    import numpy as np
    np.random.seed(random_state)

    print(f"Creating {n_trials} trials...")
    for _ in range(n_trials):
        config = {k: v.sample() for k, v in distributions.items()}
        trial = backend.create_trial(config)
        print(f"Created trial {trial.trial_id}")

    print("All trials created. Start workers now.")
```

Long-running distributed search must handle failures gracefully. Random search's independence property makes this straightforward.
Failure Modes:

- Worker crash or spot termination mid-trial, leaving the trial stuck in 'running'
- Evaluation errors (bad configurations, out-of-memory)
- Lost or corrupted result files on shared storage
- Hung evaluations that never return
Mitigation Strategies:

- Persist each result as soon as the trial completes (checkpointing)
- Re-queue trials stuck in 'running' longer than a timeout
- Record failures explicitly, then retry or resample them
- Make trial completion idempotent so duplicated work is harmless
Because random search trials are independent, a failed trial doesn't invalidate other work. Simply re-queue failed trials or sample replacements. This is much harder for sequential methods like Bayesian optimization where the chain of trials is interdependent.
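A minimal checkpoint-and-resume sketch (the file layout and toy objective are illustrative): each finished trial is written to disk immediately, so a restarted run skips trials that already have results and only evaluates the rest.

```python
import json
import tempfile
from pathlib import Path

def run_with_checkpointing(configs, evaluate_fn, checkpoint_dir):
    ckpt = Path(checkpoint_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    results = {}
    for trial_id, config in enumerate(configs):
        out = ckpt / f"trial_{trial_id}.json"
        if out.exists():  # finished before a previous crash: reuse it
            results[trial_id] = json.loads(out.read_text())
            continue
        record = {"config": config, "score": evaluate_fn(config)}
        out.write_text(json.dumps(record))  # checkpoint before moving on
        results[trial_id] = record
    return results

objective = lambda c: -abs(c["learning_rate"] - 0.01)  # toy objective
configs = [{"learning_rate": lr} for lr in (0.1, 0.01, 0.001)]

with tempfile.TemporaryDirectory() as d:
    # Simulate a run that dies after the first trial...
    run_with_checkpointing(configs[:1], objective, d)
    # ...then a restart that completes the remaining work.
    final = run_with_checkpointing(configs, objective, d)

print(len(final))  # 3
```

Because trials are independent, "resume" is literally just "skip what's done": no optimizer state needs to be rebuilt, which is exactly what makes this harder for sequential methods.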
Cloud platforms offer scalable compute for random search. Key patterns for major providers:
AWS: EC2 Spot Instances for cheap compute, SQS or AWS Batch for distributing trials, S3 for shared trial and result storage; SageMaker also offers managed hyperparameter tuning with random search.
GCP: Spot (preemptible) VMs, Pub/Sub for queueing, Cloud Storage for results; Vertex AI offers managed hyperparameter tuning.
Azure: Spot Virtual Machines, Queue Storage or Service Bus for queueing, Blob Storage for results; Azure Machine Learning offers managed sweep jobs.
Example: the same search run at different levels of parallelism. Wall-clock time shrinks linearly while total compute hours, and hence total on-demand cost, stay constant:

| Configuration | Wall-clock time | Total cost (AWS example) |
|---|---|---|
| 1 on-demand instance | 16.7 hours | $1.50 - $15 |
| 10 on-demand instances | 1.7 hours | $1.50 - $15 |
| 10 spot instances | 1.7 hours | $0.30 - $3 |
| 100 spot instances | 10 minutes | $0.30 - $3 |
Spot/preemptible instances can be terminated at any time. Ensure your random search implementation handles this gracefully with checkpointing and trial reclamation. The cost savings (60-90%) are worth the added complexity.
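A sketch of trial reclamation after spot termination (the dict layout mirrors the `Trial` records above; the timeout value is an assumption to tune against your trial duration): any trial 'running' longer than the timeout is assumed orphaned and flipped back to 'pending'.

```python
import time

STALE_AFTER = 15 * 60  # seconds; set well above your longest trial

def reclaim_stale(trials, now=None):
    """Return ids of running trials re-queued as pending."""
    now = time.time() if now is None else now
    reclaimed = []
    for t in trials:
        if t["status"] == "running" and now - t["start_time"] > STALE_AFTER:
            t["status"] = "pending"  # another worker can claim it again
            t["worker_id"] = None
            reclaimed.append(t["trial_id"])
    return reclaimed

trials = [
    {"trial_id": "a1", "status": "running", "start_time": 0.0, "worker_id": "w1"},
    {"trial_id": "b2", "status": "completed", "start_time": 0.0, "worker_id": "w2"},
]
print(reclaim_stale(trials, now=3600.0))  # ['a1']
```

Run a sweep like this periodically (or on master restart); combined with idempotent completion, it makes spot terminations cost only the lost in-flight trials.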
Module Complete!
You have completed the Random Search module. You now understand random sampling foundations, why random beats grid, coverage probability theory, practical advantages, and parallelization strategies.
Next Steps:
The next module covers Bayesian Optimization—a more sophisticated approach that uses information from previous evaluations to guide the search, achieving better results when evaluation budgets are very limited.
Congratulations! You have mastered random search for hyperparameter optimization. You can implement efficient, parallelized random search with theoretical understanding of its guarantees and practical advantages.