In a field obsessed with optimization and precision, it may seem counterintuitive that random sampling could be a powerful strategy. Yet in hyperparameter optimization, randomness is not a compromise; it is often the more effective choice.
Random search represents a fundamental shift in philosophy: instead of exhaustively enumerating a predetermined grid, we sample hyperparameter configurations from probability distributions over the search space. This simple change has profound implications for efficiency, coverage, and scalability.
The landmark 2012 paper by Bergstra and Bengio demonstrated, both empirically and theoretically, that random search finds good hyperparameters faster than grid search. Since then, random search has become the default baseline for hyperparameter optimization, and understanding its mechanics is essential for any ML practitioner.
By the end of this page, you will understand the mathematical foundations of random sampling for HPO, how to define appropriate sampling distributions for different hyperparameter types, implementation techniques for efficient random search, and the theoretical guarantees that make random search effective.
Random search for hyperparameter optimization operates on a simple principle: sample configurations independently from probability distributions defined over the hyperparameter space.
Formal Definition:
Let $\mathcal{H} = \mathcal{H}_1 \times \mathcal{H}_2 \times \cdots \times \mathcal{H}_d$ be the $d$-dimensional hyperparameter space. We define a probability distribution $p(\mathbf{h})$ over $\mathcal{H}$, typically as a product of independent marginals:
$$p(\mathbf{h}) = \prod_{i=1}^{d} p_i(h_i)$$
where $p_i$ is the distribution for hyperparameter $i$. Random search draws $n$ samples $\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(n)} \sim p(\mathbf{h})$ and evaluates each to find the best configuration.
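As a minimal sketch of this definition (NumPy only, with a toy objective standing in for the real train-and-validate step; the marginals shown are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_config():
    """Draw one configuration h = (h_1, ..., h_d) from independent marginals."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform marginal
        "dropout": rng.uniform(0.0, 0.5),            # uniform marginal
        "n_layers": int(rng.integers(1, 6)),         # discrete uniform marginal
    }

def evaluate(config):
    """Toy stand-in for training + validation; higher is better."""
    return -abs(np.log10(config["learning_rate"]) + 3) - config["dropout"]

n = 20
configs = [sample_config() for _ in range(n)]
scores = [evaluate(h) for h in configs]
print("Best of", n, "samples:", configs[int(np.argmax(scores))])
```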
Key Properties:
| Property | Grid Search | Random Search |
|---|---|---|
| Sampling Strategy | Deterministic enumeration | Stochastic sampling |
| Coverage Pattern | Regular lattice | Uniform scatter |
| Dimensionality Scaling | Exponential: $k^d$ ($k$ values per dimension) | Linear: $n$ samples total |
| Per-Dimension Resolution | Fixed globally | Probabilistic, varies |
| Stopping Behavior | Must complete or waste work | Anytime valid results |
| Parallelization | Trivial but wasteful | Trivial and efficient |
| Reproducibility | Deterministic | Seed-controlled |
Using independent marginal distributions ignores potential interactions between hyperparameters. While this is a simplification, it works well in practice because: (1) many hyperparameters are approximately independent in their effects, and (2) random sampling naturally explores the joint space. For known strong interactions, conditional distributions can be used.
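For example, a sketch of conditional sampling in which the learning-rate range depends on the sampled optimizer (the ranges here are illustrative assumptions, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_conditional_config():
    """Sample the optimizer first, then a learning-rate range conditioned on it."""
    optimizer = rng.choice(["sgd", "adam"])
    # Illustrative conditional ranges: SGD often tolerates larger learning rates
    low, high = (1e-3, 1e-1) if optimizer == "sgd" else (1e-5, 1e-2)
    learning_rate = np.exp(rng.uniform(np.log(low), np.log(high)))
    return {"optimizer": str(optimizer), "learning_rate": float(learning_rate)}

print([sample_conditional_config() for _ in range(3)])
```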
Choosing appropriate sampling distributions is crucial for effective random search. The distribution should match the hyperparameter's nature and place probability mass where good values are likely.
Continuous Hyperparameters:
For continuous hyperparameters, the choice between linear and logarithmic sampling is critical:
Uniform (Linear): Use when the hyperparameter effect is roughly linear across its range $$h \sim \text{Uniform}(a, b)$$ Examples: dropout rate, momentum, weight decay (sometimes)
Log-Uniform: Use when the hyperparameter spans orders of magnitude $$\log h \sim \text{Uniform}(\log a, \log b)$$ Examples: learning rate, regularization strength, kernel bandwidth
Truncated Normal: Use when prior knowledge suggests a likely center $$h \sim \mathcal{N}(\mu, \sigma^2) \text{ truncated to } [a, b]$$ Examples: fine-tuning an existing near-optimal value
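The truncated-normal case could be implemented with `scipy.stats.truncnorm`, in the same spirit as the `Distribution` classes shown in the next code block (this sampler is an illustrative sketch and is not part of that code):

```python
import numpy as np
from scipy import stats

class TruncatedNormal:
    """Normal(mu, sigma^2) truncated to [low, high]."""

    def __init__(self, mu: float, sigma: float, low: float, high: float):
        self.mu, self.sigma = mu, sigma
        self.low, self.high = low, high
        # truncnorm expects bounds in standard-deviation units relative to mu
        self.a = (low - mu) / sigma
        self.b = (high - mu) / sigma

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        return stats.truncnorm.rvs(
            self.a, self.b, loc=self.mu, scale=self.sigma,
            size=n, random_state=random_state
        )

# Example: fine-tuning around a known-good dropout of 0.3
print(TruncatedNormal(mu=0.3, sigma=0.05, low=0.0, high=0.5).sample(5, random_state=0))
```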
```python
import numpy as np
from typing import Union, List, Any
from scipy import stats
from abc import ABC, abstractmethod


class Distribution(ABC):
    """Base class for hyperparameter sampling distributions."""

    @abstractmethod
    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        pass

    @abstractmethod
    def __repr__(self) -> str:
        pass


class Uniform(Distribution):
    """Uniform distribution over [low, high]."""

    def __init__(self, low: float, high: float):
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        return rng.uniform(self.low, self.high, n)

    def __repr__(self) -> str:
        return f"Uniform({self.low}, {self.high})"


class LogUniform(Distribution):
    """Log-uniform distribution: log(x) ~ Uniform(log(low), log(high))."""

    def __init__(self, low: float, high: float):
        assert low > 0 and high > 0, "Log-uniform requires positive bounds"
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        log_low, log_high = np.log(self.low), np.log(self.high)
        return np.exp(rng.uniform(log_low, log_high, n))

    def __repr__(self) -> str:
        return f"LogUniform({self.low}, {self.high})"


class IntUniform(Distribution):
    """Uniform distribution over integers [low, high]."""

    def __init__(self, low: int, high: int):
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        return rng.randint(self.low, self.high + 1, n)

    def __repr__(self) -> str:
        return f"IntUniform({self.low}, {self.high})"


class LogIntUniform(Distribution):
    """Log-uniform over integers: sample log-uniform, round to int."""

    def __init__(self, low: int, high: int):
        assert low > 0, "Log-int-uniform requires positive bounds"
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        log_low, log_high = np.log(self.low), np.log(self.high)
        continuous = np.exp(rng.uniform(log_low, log_high, n))
        return np.round(continuous).astype(int)

    def __repr__(self) -> str:
        return f"LogIntUniform({self.low}, {self.high})"


class Categorical(Distribution):
    """Categorical distribution over discrete choices."""

    def __init__(self, choices: List[Any], weights: List[float] = None):
        self.choices = choices
        self.weights = weights
        if weights:
            self.probs = np.array(weights) / sum(weights)
        else:
            self.probs = None

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        indices = rng.choice(len(self.choices), size=n, p=self.probs)
        return np.array([self.choices[i] for i in indices])

    def __repr__(self) -> str:
        return f"Categorical({self.choices})"


# Example usage
def demonstrate_distributions():
    """Show sampling behavior of different distributions."""
    distributions = {
        'learning_rate': LogUniform(1e-5, 1e-1),
        'batch_size': LogIntUniform(16, 512),
        'dropout': Uniform(0.0, 0.5),
        'n_layers': IntUniform(1, 5),
        'activation': Categorical(['relu', 'tanh', 'elu']),
    }

    print("Sample hyperparameter configurations:")
    for i in range(3):
        config = {name: dist.sample(1, random_state=i)[0]
                  for name, dist in distributions.items()}
        print(f"  Config {i}: {config}")
```

Discrete Hyperparameters:
Categorical: For unordered choices (activation functions, optimizers) $$h \sim \text{Categorical}(\{c_1, \ldots, c_k\}, \mathbf{p})$$
Integer Uniform: For ordered integers (layers, units per layer) $$h \sim \text{DiscreteUniform}(a, b)$$
Log-Integer: For integers spanning orders of magnitude $$h = \lfloor \exp(\text{Uniform}(\log a, \log b)) \rceil$$ Examples: batch size, number of estimators
Distribution Selection Guidelines:
The choice of distribution encodes prior knowledge about where good hyperparameters lie:
| Hyperparameter | Typical Distribution | Rationale |
|---|---|---|
| Learning rate | LogUniform(1e-5, 1) | Varies over orders of magnitude |
| Weight decay | LogUniform(1e-6, 1e-1) | Regularization strength varies exponentially |
| Dropout | Uniform(0, 0.5) | Linear effect on regularization |
| Hidden units | LogIntUniform(32, 1024) | Capacity scales roughly logarithmically |
| Batch size | LogIntUniform(16, 512) | Effect on gradient noise is logarithmic |
A production-ready random search implementation requires careful attention to reproducibility, efficiency, and integration with ML workflows. Here we develop a comprehensive implementation from first principles.
```python
import numpy as np
import zlib
from typing import Dict, Any, Callable, List, Optional
from dataclasses import dataclass
from sklearn.model_selection import cross_val_score


@dataclass
class RandomSearchResult:
    """Container for random search results."""
    best_params: Dict[str, Any]
    best_score: float
    all_results: List[Dict]
    n_iterations: int

    def summary(self) -> str:
        return f"""Random Search Results
====================
Iterations: {self.n_iterations}
Best Score: {self.best_score:.6f}
Best Parameters: {self.best_params}
"""


class RandomSearch:
    """
    Random search hyperparameter optimization.

    Samples configurations from specified distributions and evaluates
    them using cross-validation or a custom scoring function.
    """

    def __init__(
        self,
        param_distributions: Dict[str, Distribution],
        n_iter: int = 100,
        scoring: str = 'accuracy',
        cv: int = 5,
        random_state: Optional[int] = None,
        verbose: int = 1
    ):
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.scoring = scoring
        self.cv = cv
        self.random_state = random_state
        self.verbose = verbose

        self.results_: List[Dict] = []
        self.best_params_: Dict[str, Any] = None
        self.best_score_: float = float('-inf')

    def _sample_configuration(self, iteration: int) -> Dict[str, Any]:
        """Sample a single configuration from the distributions."""
        # Use iteration-dependent seed for reproducibility
        base_seed = self.random_state if self.random_state else 0

        config = {}
        for param_name, distribution in self.param_distributions.items():
            # zlib.crc32 gives a stable per-parameter offset across runs;
            # Python's built-in hash() of strings is salted per interpreter
            # process, which would break reproducibility.
            seed = base_seed + iteration * 1000 + zlib.crc32(param_name.encode()) % 1000
            value = distribution.sample(1, random_state=seed)[0]

            # Convert numpy types to Python types
            if isinstance(value, np.integer):
                value = int(value)
            elif isinstance(value, np.floating):
                value = float(value)

            config[param_name] = value

        return config

    def fit(
        self,
        estimator_class,
        X: np.ndarray,
        y: np.ndarray,
        fixed_params: Dict[str, Any] = None
    ) -> RandomSearchResult:
        """
        Run random search optimization.

        Parameters:
        -----------
        estimator_class: sklearn-compatible estimator class
        X, y: Training data
        fixed_params: Parameters to pass to all estimators

        Returns:
        --------
        RandomSearchResult with best configuration and full history
        """
        fixed_params = fixed_params or {}
        self.results_ = []
        self.best_score_ = float('-inf')

        for i in range(self.n_iter):
            # Sample configuration
            config = self._sample_configuration(i)

            # Combine with fixed params
            full_params = {**fixed_params, **config}

            # Evaluate with cross-validation
            try:
                estimator = estimator_class(**full_params)
                scores = cross_val_score(
                    estimator, X, y,
                    scoring=self.scoring,
                    cv=self.cv
                )
                mean_score = scores.mean()
                std_score = scores.std()
                status = 'success'
            except Exception as e:
                mean_score = float('-inf')
                std_score = 0.0
                status = f'error: {str(e)}'

            # Record result
            result = {
                'iteration': i,
                'params': config,
                'mean_score': mean_score,
                'std_score': std_score,
                'status': status
            }
            self.results_.append(result)

            # Update best
            if mean_score > self.best_score_:
                self.best_score_ = mean_score
                self.best_params_ = config

            # Logging
            if self.verbose and (i + 1) % 10 == 0:
                print(f"Iteration {i+1}/{self.n_iter}: "
                      f"best={self.best_score_:.4f}")

        return RandomSearchResult(
            best_params=self.best_params_,
            best_score=self.best_score_,
            all_results=self.results_,
            n_iterations=self.n_iter
        )
```

Notice how each configuration uses a deterministic seed based on the iteration number.
This ensures that: (1) results are reproducible given the same random_state, (2) adding more iterations doesn't change previously sampled configs, and (3) parallel execution with partitioned iterations yields identical results to sequential execution.
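As a minimal check of point (2), assuming the `RandomSearch`, `LogUniform`, and `IntUniform` classes above are in scope: the configuration drawn at a given iteration depends only on `random_state`, the iteration index, and the parameter name, not on `n_iter`.

```python
# Assumes RandomSearch, LogUniform, and IntUniform from the code above are in scope.
space = {"learning_rate": LogUniform(1e-5, 1e-1), "n_layers": IntUniform(1, 5)}

short_run = RandomSearch(space, n_iter=10, random_state=42)
long_run = RandomSearch(space, n_iter=100, random_state=42)

# Iteration 7's configuration is determined by (random_state, iteration, parameter name),
# so both searches would evaluate the identical configuration at that step.
assert short_run._sample_configuration(7) == long_run._sample_configuration(7)
```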
Random search enjoys strong theoretical guarantees that explain its practical effectiveness. Understanding these guarantees helps in setting appropriate budgets and expectations.
Probability of Finding Good Configurations:
Let $\alpha$ be the fraction of the hyperparameter space containing "good" configurations (e.g., within $\epsilon$ of optimal). With $n$ random samples, the probability of finding at least one good configuration is:
$$P(\text{success}) = 1 - (1 - \alpha)^n$$
Solving for $n$ to achieve success probability $p$:
$$n = \frac{\log(1 - p)}{\log(1 - \alpha)}$$
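A quick numeric check of this formula, rounding up to whole samples (it reproduces the table below):

```python
import math

def samples_needed(p: float, alpha: float) -> int:
    """Smallest n with 1 - (1 - alpha)^n >= p."""
    return math.ceil(math.log(1 - p) / math.log(1 - alpha))

print(samples_needed(p=0.95, alpha=0.05))   # 59
print(samples_needed(p=0.99, alpha=0.001))  # 4603
```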
Practical Implications:
| Target Probability | α = 5% (good) | α = 1% (very good) | α = 0.1% (excellent) |
|---|---|---|---|
| 90% | 45 samples | 229 samples | 2,302 samples |
| 95% | 59 samples | 299 samples | 2,995 samples |
| 99% | 90 samples | 459 samples | 4,603 samples |
This shows that random search is remarkably efficient when good configurations occupy even a small fraction of the space.
If 5% of configurations are 'good enough', just 60 random samples give a better than 95% chance of finding one. This is why random search often works well with modest budgets—we don't need to find THE optimal configuration, just a good one.
Dimension-Independent Guarantees:
Unlike grid search, random search's guarantees are independent of dimensionality. To land in the top $\alpha$ fraction of the range of any given hyperparameter with probability $p$, we need:
$$n \geq \frac{\log(1-p)}{\log(1-\alpha)}$$
This holds regardless of the number of dimensions $d$, because every random sample contributes a distinct value in every dimension, so coverage of whichever dimensions actually matter improves with each draw. Grid search, by contrast, must commit resolution to every dimension in advance and requires $O(\alpha^{-d})$ samples for the same per-dimension guarantee, which is exponentially worse.
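To make the gap concrete, here is a rough comparison under the same $\alpha$ and $p$, using the coarse $\alpha^{-d}$ grid bound from the argument above:

```python
import math

def random_samples(p: float, alpha: float) -> int:
    """Random search: smallest n with 1 - (1 - alpha)^n >= p, independent of d."""
    return math.ceil(math.log(1 - p) / math.log(1 - alpha))

def grid_samples(alpha: float, d: int) -> int:
    """Grid search: roughly (1/alpha) values per axis in all d dimensions."""
    return math.ceil(1 / alpha) ** d

for d in (2, 5, 10):
    print(f"d={d:2d}  random: {random_samples(0.95, 0.05):>4}  grid: {grid_samples(0.05, d):,}")
```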
Concentration Bounds:
For the best score found after $n$ random samples, we have concentration results. If scores are bounded in $[0, 1]$ and the score of a randomly sampled configuration has non-vanishing density near its maximum, the expected gap between the best score found and the optimum decreases as:
$$\mathbb{E}[\text{regret}] = O\left(\frac{1}{n}\right)$$
This $O(1/n)$ rate is competitive with more sophisticated methods on smooth objective landscapes.
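A small simulation illustrates this under a simple assumption: the score of a random configuration is Uniform(0, 1), so the optimum is 1 and the expected regret of the best of $n$ draws is exactly $1/(n+1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 2000

for n in (10, 100, 1000):
    # Best-of-n score when each configuration's score is Uniform(0, 1)
    best = rng.random((n_trials, n)).max(axis=1)
    print(f"n={n:5d}  mean regret = {np.mean(1.0 - best):.4f}  (1/(n+1) = {1 / (n + 1):.4f})")
```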
Several practical considerations affect random search effectiveness in real applications.
The biggest practical failure mode is using the wrong distributions. Sampling the learning rate from a linear uniform distribution, even though it varies over orders of magnitude, concentrates most samples in the top decade of the range and wastes the rest of the budget. Always match the distribution to the hyperparameter's natural scale.
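A quick way to see the effect: count how many samples each scheme places at or below $10^{-2}$, the lower three decades of a typical learning-rate range (treating that region as where good values tend to live is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

linear = rng.uniform(1e-5, 1e-1, n)
log_uniform = np.exp(rng.uniform(np.log(1e-5), np.log(1e-1), n))

# Fraction of samples at or below 1e-2 (three of the four decades in the range)
print(f"linear uniform: {np.mean(linear <= 1e-2):.1%}")       # ~10%
print(f"log-uniform:    {np.mean(log_uniform <= 1e-2):.1%}")  # ~75%
```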
We have established the foundations of random sampling for hyperparameter optimization: the formal sampling model over independent marginals, distribution choices matched to each hyperparameter's scale, a reproducible implementation, and the probabilistic guarantees that explain why modest budgets often suffice.
What's Next:
The next page explores why random search often outperforms grid search—the theoretical and empirical arguments that establish random search as the superior baseline for hyperparameter optimization.
You now understand the foundations of random sampling for hyperparameter optimization. You can implement random search with appropriate distributions and understand its theoretical guarantees. Next, we'll see why random sampling beats exhaustive grid enumeration.