In a field obsessed with optimization and precision, it may seem counterintuitive that random sampling could be a powerful strategy. Yet in hyperparameter optimization, randomness is not a compromise; it is often the more effective choice.
Random search represents a fundamental shift in philosophy: instead of exhaustively enumerating a predetermined grid, we sample hyperparameter configurations from probability distributions over the search space. This simple change has profound implications for efficiency, coverage, and scalability.
The landmark 2012 paper by Bergstra and Bengio demonstrated, both empirically and theoretically, that random search finds good hyperparameters faster than grid search. Since then, random search has become the default baseline for hyperparameter optimization, and understanding its mechanics is essential for any ML practitioner.
By the end of this page, you will understand the mathematical foundations of random sampling for HPO, how to define appropriate sampling distributions for different hyperparameter types, implementation techniques for efficient random search, and the theoretical guarantees that make random search effective.
Random search for hyperparameter optimization operates on a simple principle: sample configurations independently from probability distributions defined over the hyperparameter space.
Formal Definition:
Let $\mathcal{H} = \mathcal{H}_1 \times \mathcal{H}_2 \times \cdots \times \mathcal{H}_d$ be the $d$-dimensional hyperparameter space. We define a probability distribution $p(\mathbf{h})$ over $\mathcal{H}$, typically as a product of independent marginals:
$$p(\mathbf{h}) = \prod_{i=1}^{d} p_i(h_i)$$
where $p_i$ is the distribution for hyperparameter $i$. Random search draws $n$ samples $\mathbf{h}^{(1)}, \ldots, \mathbf{h}^{(n)} \sim p(\mathbf{h})$ and evaluates each to find the best configuration.
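As a minimal sketch of this definition (NumPy only, with a toy objective standing in for the real train-and-validate step; the marginals shown are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_config():
    """Draw one configuration h = (h_1, ..., h_d) from independent marginals."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform marginal
        "dropout": rng.uniform(0.0, 0.5),            # uniform marginal
        "n_layers": int(rng.integers(1, 6)),         # discrete uniform marginal
    }

def evaluate(config):
    """Toy stand-in for training + validation; higher is better."""
    return -abs(np.log10(config["learning_rate"]) + 3) - config["dropout"]

n = 20
configs = [sample_config() for _ in range(n)]
scores = [evaluate(h) for h in configs]
print("Best of", n, "samples:", configs[int(np.argmax(scores))])
```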
Key Properties:
| Property | Grid Search | Random Search |
|---|---|---|
| Sampling Strategy | Deterministic enumeration | Stochastic sampling |
| Coverage Pattern | Regular lattice | Uniform scatter |
| Dimensionality Scaling | Exponential: $k^d$ ($k$ values per dimension) | Linear: $n$ samples total |
| Per-Dimension Resolution | Fixed globally | Probabilistic, varies |
| Stopping Behavior | Must complete or waste work | Anytime valid results |
| Parallelization | Trivial but wasteful | Trivial and efficient |
| Reproducibility | Deterministic | Seed-controlled |
Using independent marginal distributions ignores potential interactions between hyperparameters. While this is a simplification, it works well in practice because: (1) many hyperparameters are approximately independent in their effects, and (2) random sampling naturally explores the joint space. For known strong interactions, conditional distributions can be used.
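For example, a sketch of conditional sampling in which the learning-rate range depends on the sampled optimizer (the ranges here are illustrative assumptions, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_conditional_config():
    """Sample the optimizer first, then a learning-rate range conditioned on it."""
    optimizer = rng.choice(["sgd", "adam"])
    # Illustrative conditional ranges: SGD often tolerates larger learning rates
    low, high = (1e-3, 1e-1) if optimizer == "sgd" else (1e-5, 1e-2)
    learning_rate = np.exp(rng.uniform(np.log(low), np.log(high)))
    return {"optimizer": str(optimizer), "learning_rate": float(learning_rate)}

print([sample_conditional_config() for _ in range(3)])
```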
Choosing appropriate sampling distributions is crucial for effective random search. The distribution should match the hyperparameter's nature and place probability mass where good values are likely.
Continuous Hyperparameters:
For continuous hyperparameters, the choice between linear and logarithmic sampling is critical:
Uniform (Linear): Use when the hyperparameter effect is roughly linear across its range $$h \sim \text{Uniform}(a, b)$$ Examples: dropout rate, momentum, weight decay (sometimes)
Log-Uniform: Use when the hyperparameter spans orders of magnitude $$\log h \sim \text{Uniform}(\log a, \log b)$$ Examples: learning rate, regularization strength, kernel bandwidth
Truncated Normal: Use when prior knowledge suggests a likely center $$h \sim \mathcal{N}(\mu, \sigma^2) \text{ truncated to } [a, b]$$ Examples: fine-tuning an existing near-optimal value
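The truncated-normal case could be implemented with `scipy.stats.truncnorm`, in the same spirit as the `Distribution` classes shown in the next code block (this sampler is an illustrative sketch and is not part of that code):

```python
import numpy as np
from scipy import stats

class TruncatedNormal:
    """Normal(mu, sigma^2) truncated to [low, high]."""

    def __init__(self, mu: float, sigma: float, low: float, high: float):
        self.mu, self.sigma = mu, sigma
        self.low, self.high = low, high
        # truncnorm expects bounds in standard-deviation units relative to mu
        self.a = (low - mu) / sigma
        self.b = (high - mu) / sigma

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        return stats.truncnorm.rvs(
            self.a, self.b, loc=self.mu, scale=self.sigma,
            size=n, random_state=random_state
        )

# Example: fine-tuning around a known-good dropout of 0.3
print(TruncatedNormal(mu=0.3, sigma=0.05, low=0.0, high=0.5).sample(5, random_state=0))
```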
```python
import numpy as np
from typing import Union, List, Any
from scipy import stats
from abc import ABC, abstractmethod


class Distribution(ABC):
    """Base class for hyperparameter sampling distributions."""

    @abstractmethod
    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        pass

    @abstractmethod
    def __repr__(self) -> str:
        pass


class Uniform(Distribution):
    """Uniform distribution over [low, high]."""

    def __init__(self, low: float, high: float):
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        return rng.uniform(self.low, self.high, n)

    def __repr__(self) -> str:
        return f"Uniform({self.low}, {self.high})"


class LogUniform(Distribution):
    """Log-uniform distribution: log(x) ~ Uniform(log(low), log(high))."""

    def __init__(self, low: float, high: float):
        assert low > 0 and high > 0, "Log-uniform requires positive bounds"
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        log_low, log_high = np.log(self.low), np.log(self.high)
        return np.exp(rng.uniform(log_low, log_high, n))

    def __repr__(self) -> str:
        return f"LogUniform({self.low}, {self.high})"


class IntUniform(Distribution):
    """Uniform distribution over integers [low, high]."""

    def __init__(self, low: int, high: int):
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        return rng.randint(self.low, self.high + 1, n)

    def __repr__(self) -> str:
        return f"IntUniform({self.low}, {self.high})"


class LogIntUniform(Distribution):
    """Log-uniform over integers: sample log-uniform, round to int."""

    def __init__(self, low: int, high: int):
        assert low > 0, "Log-int-uniform requires positive bounds"
        self.low = low
        self.high = high

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        log_low, log_high = np.log(self.low), np.log(self.high)
        continuous = np.exp(rng.uniform(log_low, log_high, n))
        return np.round(continuous).astype(int)

    def __repr__(self) -> str:
        return f"LogIntUniform({self.low}, {self.high})"


class Categorical(Distribution):
    """Categorical distribution over discrete choices."""

    def __init__(self, choices: List[Any], weights: List[float] = None):
        self.choices = choices
        self.weights = weights
        if weights:
            self.probs = np.array(weights) / sum(weights)
        else:
            self.probs = None

    def sample(self, n: int = 1, random_state: int = None) -> np.ndarray:
        rng = np.random.RandomState(random_state)
        indices = rng.choice(len(self.choices), size=n, p=self.probs)
        return np.array([self.choices[i] for i in indices])

    def __repr__(self) -> str:
        return f"Categorical({self.choices})"


# Example usage
def demonstrate_distributions():
    """Show sampling behavior of different distributions."""
    distributions = {
        'learning_rate': LogUniform(1e-5, 1e-1),
        'batch_size': LogIntUniform(16, 512),
        'dropout': Uniform(0.0, 0.5),
        'n_layers': IntUniform(1, 5),
        'activation': Categorical(['relu', 'tanh', 'elu']),
    }

    print("Sample hyperparameter configurations:")
    for i in range(3):
        config = {name: dist.sample(1, random_state=i)[0]
                  for name, dist in distributions.items()}
        print(f"  Config {i}: {config}")
```

Discrete Hyperparameters:
Categorical: For unordered choices (activation functions, optimizers) $$h \sim \text{Categorical}(\{c_1, \ldots, c_k\}, \mathbf{p})$$
Integer Uniform: For ordered integers (layers, units per layer) $$h \sim \text{DiscreteUniform}(a, b)$$
Log-Integer: For integers spanning orders of magnitude $$h = \lfloor \exp(\text{Uniform}(\log a, \log b)) \rceil$$ Examples: batch size, number of estimators
Distribution Selection Guidelines:
The choice of distribution encodes prior knowledge about where good hyperparameters lie:
| Hyperparameter | Typical Distribution | Rationale |
|---|---|---|
| Learning rate | LogUniform(1e-5, 1) | Varies over orders of magnitude |
| Weight decay | LogUniform(1e-6, 1e-1) | Regularization strength varies exponentially |
| Dropout | Uniform(0, 0.5) | Linear effect on regularization |
| Hidden units | LogIntUniform(32, 1024) | Capacity scales roughly logarithmically |
| Batch size | LogIntUniform(16, 512) | Effect on gradient noise is logarithmic |
A production-ready random search implementation requires careful attention to reproducibility, efficiency, and integration with ML workflows. Here we develop a comprehensive implementation from first principles.
```python
import numpy as np
import zlib
from typing import Dict, Any, Callable, List, Optional
from dataclasses import dataclass
from sklearn.model_selection import cross_val_score


@dataclass
class RandomSearchResult:
    """Container for random search results."""
    best_params: Dict[str, Any]
    best_score: float
    all_results: List[Dict]
    n_iterations: int

    def summary(self) -> str:
        return f"""Random Search Results
====================
Iterations: {self.n_iterations}
Best Score: {self.best_score:.6f}
Best Parameters: {self.best_params}
"""


class RandomSearch:
    """
    Random search hyperparameter optimization.

    Samples configurations from specified distributions and evaluates
    them using cross-validation or a custom scoring function.
    """

    def __init__(
        self,
        param_distributions: Dict[str, Distribution],
        n_iter: int = 100,
        scoring: str = 'accuracy',
        cv: int = 5,
        random_state: Optional[int] = None,
        verbose: int = 1
    ):
        self.param_distributions = param_distributions
        self.n_iter = n_iter
        self.scoring = scoring
        self.cv = cv
        self.random_state = random_state
        self.verbose = verbose

        self.results_: List[Dict] = []
        self.best_params_: Dict[str, Any] = None
        self.best_score_: float = float('-inf')

    def _sample_configuration(self, iteration: int) -> Dict[str, Any]:
        """Sample a single configuration from the distributions."""
        # Use iteration-dependent seed for reproducibility
        base_seed = self.random_state if self.random_state else 0

        config = {}
        for param_name, distribution in self.param_distributions.items():
            # zlib.crc32 gives a stable per-parameter offset across runs;
            # Python's built-in hash() of strings is salted per interpreter
            # process, which would break reproducibility.
            seed = base_seed + iteration * 1000 + zlib.crc32(param_name.encode()) % 1000
            value = distribution.sample(1, random_state=seed)[0]

            # Convert numpy types to Python types
            if isinstance(value, np.integer):
                value = int(value)
            elif isinstance(value, np.floating):
                value = float(value)

            config[param_name] = value

        return config

    def fit(
        self,
        estimator_class,
        X: np.ndarray,
        y: np.ndarray,
        fixed_params: Dict[str, Any] = None
    ) -> RandomSearchResult:
        """
        Run random search optimization.

        Parameters:
        -----------
        estimator_class: sklearn-compatible estimator class
        X, y: Training data
        fixed_params: Parameters to pass to all estimators

        Returns:
        --------
        RandomSearchResult with best configuration and full history
        """
        fixed_params = fixed_params or {}
        self.results_ = []
        self.best_score_ = float('-inf')

        for i in range(self.n_iter):
            # Sample configuration
            config = self._sample_configuration(i)

            # Combine with fixed params
            full_params = {**fixed_params, **config}

            # Evaluate with cross-validation
            try:
                estimator = estimator_class(**full_params)
                scores = cross_val_score(
                    estimator, X, y,
                    scoring=self.scoring,
                    cv=self.cv
                )
                mean_score = scores.mean()
                std_score = scores.std()
                status = 'success'
            except Exception as e:
                mean_score = float('-inf')
                std_score = 0.0
                status = f'error: {str(e)}'

            # Record result
            result = {
                'iteration': i,
                'params': config,
                'mean_score': mean_score,
                'std_score': std_score,
                'status': status
            }
            self.results_.append(result)

            # Update best
            if mean_score > self.best_score_:
                self.best_score_ = mean_score
                self.best_params_ = config

            # Logging
            if self.verbose and (i + 1) % 10 == 0:
                print(f"Iteration {i+1}/{self.n_iter}: "
                      f"best={self.best_score_:.4f}")

        return RandomSearchResult(
            best_params=self.best_params_,
            best_score=self.best_score_,
            all_results=self.results_,
            n_iterations=self.n_iter
        )
```

Notice how each configuration uses a deterministic seed based on the iteration number.
This ensures that: (1) results are reproducible given the same random_state, (2) adding more iterations doesn't change previously sampled configs, and (3) parallel execution with partitioned iterations yields identical results to sequential execution.
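As a minimal check of point (2), assuming the `RandomSearch`, `LogUniform`, and `IntUniform` classes above are in scope: the configuration drawn at a given iteration depends only on `random_state`, the iteration index, and the parameter name, not on `n_iter`.

```python
# Assumes RandomSearch, LogUniform, and IntUniform from the code above are in scope.
space = {"learning_rate": LogUniform(1e-5, 1e-1), "n_layers": IntUniform(1, 5)}

short_run = RandomSearch(space, n_iter=10, random_state=42)
long_run = RandomSearch(space, n_iter=100, random_state=42)

# Iteration 7's configuration is determined by (random_state, iteration, parameter name),
# so both searches would evaluate the identical configuration at that step.
assert short_run._sample_configuration(7) == long_run._sample_configuration(7)
```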
Random search enjoys strong theoretical guarantees that explain its practical effectiveness. Understanding these guarantees helps in setting appropriate budgets and expectations.
Probability of Finding Good Configurations:
Let $\alpha$ be the fraction of the hyperparameter space containing "good" configurations (e.g., within $\epsilon$ of optimal). With $n$ random samples, the probability of finding at least one good configuration is:
$$P(\text{success}) = 1 - (1 - \alpha)^n$$
Solving for $n$ to achieve success probability $p$:
$$n = \frac{\log(1 - p)}{\log(1 - \alpha)}$$
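A quick numeric check of this formula, rounding up to whole samples (it reproduces the table below):

```python
import math

def samples_needed(p: float, alpha: float) -> int:
    """Smallest n with 1 - (1 - alpha)^n >= p."""
    return math.ceil(math.log(1 - p) / math.log(1 - alpha))

print(samples_needed(p=0.95, alpha=0.05))   # 59
print(samples_needed(p=0.99, alpha=0.001))  # 4603
```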
Practical Implications:
| Target Probability | α = 5% (good) | α = 1% (very good) | α = 0.1% (excellent) |
|---|---|---|---|
| 90% | 45 samples | 229 samples | 2,302 samples |
| 95% | 59 samples | 299 samples | 2,995 samples |
| 99% | 90 samples | 459 samples | 4,603 samples |
This shows that random search is remarkably efficient when good configurations occupy even a small fraction of the space.
If 5% of configurations are 'good enough', just 60 random samples give a better than 95% chance of finding one. This is why random search often works well with modest budgets—we don't need to find THE optimal configuration, just a good one.
Dimension-Independent Guarantees:
Unlike grid search, random search's guarantees are independent of dimensionality. To land in the top $\alpha$ fraction of the range of any given hyperparameter with probability $p$, we need:
$$n \geq \frac{\log(1-p)}{\log(1-\alpha)}$$
This holds regardless of the number of dimensions $d$, because every random sample contributes a distinct value in every dimension, so coverage of whichever dimensions actually matter improves with each draw. Grid search, by contrast, must commit resolution to every dimension in advance and requires $O(\alpha^{-d})$ samples for the same per-dimension guarantee, which is exponentially worse.
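To make the gap concrete, here is a rough comparison under the same $\alpha$ and $p$, using the coarse $\alpha^{-d}$ grid bound from the argument above:

```python
import math

def random_samples(p: float, alpha: float) -> int:
    """Random search: smallest n with 1 - (1 - alpha)^n >= p, independent of d."""
    return math.ceil(math.log(1 - p) / math.log(1 - alpha))

def grid_samples(alpha: float, d: int) -> int:
    """Grid search: roughly (1/alpha) values per axis in all d dimensions."""
    return math.ceil(1 / alpha) ** d

for d in (2, 5, 10):
    print(f"d={d:2d}  random: {random_samples(0.95, 0.05):>4}  grid: {grid_samples(0.05, d):,}")
```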
Concentration Bounds:
For the best score found after $n$ random samples, we have concentration results. If scores are bounded in $[0, 1]$ and the score of a randomly sampled configuration has non-vanishing density near its maximum, the expected gap between the best score found and the optimum decreases as:
$$\mathbb{E}[\text{regret}] = O\left(\frac{1}{n}\right)$$
This $O(1/n)$ rate is competitive with more sophisticated methods on smooth objective landscapes.
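A small simulation illustrates this under a simple assumption: the score of a random configuration is Uniform(0, 1), so the optimum is 1 and the expected regret of the best of $n$ draws is exactly $1/(n+1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 2000

for n in (10, 100, 1000):
    # Best-of-n score when each configuration's score is Uniform(0, 1)
    best = rng.random((n_trials, n)).max(axis=1)
    print(f"n={n:5d}  mean regret = {np.mean(1.0 - best):.4f}  (1/(n+1) = {1 / (n + 1):.4f})")
```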
Several practical considerations affect random search effectiveness in real applications.
The biggest practical failure mode is using the wrong distributions. Sampling the learning rate from a linear uniform distribution, even though it varies over orders of magnitude, concentrates most samples in the top decade of the range and wastes the rest of the budget. Always match the distribution to the hyperparameter's natural scale.
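A quick way to see the effect: count how many samples each scheme places at or below $10^{-2}$, the lower three decades of a typical learning-rate range (treating that region as where good values tend to live is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

linear = rng.uniform(1e-5, 1e-1, n)
log_uniform = np.exp(rng.uniform(np.log(1e-5), np.log(1e-1), n))

# Fraction of samples at or below 1e-2 (three of the four decades in the range)
print(f"linear uniform: {np.mean(linear <= 1e-2):.1%}")       # ~10%
print(f"log-uniform:    {np.mean(log_uniform <= 1e-2):.1%}")  # ~75%
```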
We have established the foundations of random sampling for hyperparameter optimization: the formal sampling model over independent marginals, distribution choices matched to each hyperparameter's scale, a reproducible implementation, and the probabilistic guarantees that explain why modest budgets often suffice.
What's Next:
The next page explores why random search often outperforms grid search—the theoretical and empirical arguments that establish random search as the superior baseline for hyperparameter optimization.
You now understand the foundations of random sampling for hyperparameter optimization. You can implement random search with appropriate distributions and understand its theoretical guarantees. Next, we'll see why random sampling beats exhaustive grid enumeration.