Hyperparameters come in fundamentally different flavors. Continuous hyperparameters like learning rate can take any value in a range—there's always a value between any two values you pick. Discrete hyperparameters like number of layers can only take specific values—there's no such thing as 2.7 layers.
This distinction isn't merely taxonomic—it has profound implications for how we optimize. Continuous spaces support calculus-based reasoning: gradients, interpolation, and smooth optimization. Discrete spaces require combinatorial thinking: enumeration, local search, and different notions of 'nearby' configurations.
In this page, we'll develop a deep understanding of both types and the challenges that arise when optimizing mixed spaces containing both.
By the end of this page, you will:
• Understand the mathematical properties of continuous vs discrete spaces
• Know how different HPO algorithms handle each type
• Master techniques for handling integer and categorical hyperparameters
• Design effective strategies for mixed search spaces
Continuous hyperparameters take values in a continuous subset of the real numbers, typically a bounded interval $[a, b] \subset \mathbb{R}$. They are the most mathematically tractable type, enabling powerful optimization techniques.
Mathematical Properties: continuous domains are totally ordered and dense (between any two values there is always another), and they carry a natural notion of distance. These properties are what make gradient-based reasoning and smooth surrogate models possible.
Common Continuous Hyperparameters:
| Category | Hyperparameter | Typical Domain | Recommended Scale |
|---|---|---|---|
| Optimization | Learning rate | [1e-6, 1.0] | Log |
| Optimization | Momentum | [0.0, 0.999] | Linear, or log scale applied to 1 − momentum |
| Optimization | Weight decay | [1e-8, 0.1] | Log |
| Regularization | L1 penalty (α) | [1e-8, 10] | Log |
| Regularization | L2 penalty (λ) | [1e-8, 10] | Log |
| Regularization | Dropout probability | [0.0, 0.7] | Linear |
| Kernel | RBF γ | [1e-6, 1e3] | Log |
| Kernel | Polynomial degree (when real-valued) | [1.0, 5.0] | Linear |
| Sampling | Subsample ratio | [0.5, 1.0] | Linear |
| Sampling | Column sample ratio | [0.5, 1.0] | Linear |
Optimization Advantages:
Continuous spaces enable powerful techniques: gradient-based refinement, interpolation between evaluated points, and smooth surrogate models such as Gaussian processes.
The Continuity Assumption:
Most HPO methods implicitly assume the response surface is continuous. For a continuous hyperparameter $\lambda$, this means:
$$|f(\lambda_1) - f(\lambda_2)| \leq L |\lambda_1 - \lambda_2|$$
for some Lipschitz constant $L$. This assumption justifies using nearby evaluations to predict performance at new points.
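For example, with $L = 2$, changing the learning rate by 0.01 can change the validation metric by at most 0.02, which is exactly what lets a surrogate model trust nearby observations.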
Some 'continuous' hyperparameters exhibit discontinuous behavior. Example: At learning rate thresholds, training may suddenly diverge, creating cliff-like drops in performance. Robust HPO methods should handle such discontinuities gracefully.
Discrete hyperparameters take values from a finite set. They come in two important subtypes:
Integer Hyperparameters take values from $\{a, a+1, \ldots, b\} \subset \mathbb{Z}$. Examples include the number of layers, the number of hidden units per layer, and the number of trees (n_estimators).
Categorical Hyperparameters take values from an unordered set $\{c_1, c_2, \ldots, c_k\}$. Examples include the optimizer type (SGD, Adam, AdamW) and the activation function (ReLU, GELU, Swish).
The key distinction: integers have ordering and distance, while categories have neither.
Ordinal Hyperparameters: The Middle Ground
Some hyperparameters are discrete with meaningful order but non-uniform spacing, for example batch size ∈ {16, 32, 64, 128, 256}.
These are ordinal: we know a batch size of 256 is 'larger' than 32, but the spacing isn't uniform. They're often treated as categorical (discarding the order), as integer indices (0, 1, 2, ...), or as their actual values on a log scale, as in the module below.
"""Handling Discrete Hyperparameters This module demonstrates different strategies for encoding andoptimizing integer, categorical, and ordinal hyperparameters."""import numpy as npfrom typing import List, Dict, Any, Unionfrom enum import Enum class DiscreteHandling(Enum): """Strategies for handling discrete hyperparameters.""" ENUMERATE = "enumerate" # Exhaustive enumeration RELAX_ROUND = "relax_round" # Treat as continuous, round result ONE_HOT = "one_hot" # One-hot encoding for categorical ORDINAL = "ordinal" # Integer index for ordinal class IntegerParameter: """ Integer hyperparameter with bounds. Can be treated as: 1. Discrete: enumerate all values 2. Relaxed continuous: optimize in [low, high], round at end """ def __init__(self, name: str, low: int, high: int, log_scale: bool = False): self.name = name self.low = low self.high = high self.log_scale = log_scale self.n_values = high - low + 1 def sample_uniform(self) -> int: """Sample uniformly from the integer range.""" if self.log_scale: # Sample on log scale, then round log_value = np.random.uniform(np.log(self.low), np.log(self.high)) return int(np.round(np.exp(log_value))) else: return np.random.randint(self.low, self.high + 1) def relax_and_round(self, continuous_value: float) -> int: """Convert a continuous relaxation back to integer.""" if self.log_scale: # continuous_value is in log space actual = np.exp(continuous_value) else: actual = continuous_value # Round and clip to valid range return int(np.clip(np.round(actual), self.low, self.high)) def to_continuous_bounds(self) -> tuple: """Get bounds for continuous relaxation.""" if self.log_scale: return (np.log(self.low), np.log(self.high)) else: return (self.low - 0.5, self.high + 0.5) # Allow rounding def enumerate(self) -> List[int]: """List all possible values.""" return list(range(self.low, self.high + 1)) class CategoricalParameter: """ Categorical hyperparameter with no natural ordering. Must be handled via: 1. Enumeration 2. One-hot encoding 3. Learned embeddings """ def __init__(self, name: str, choices: List[Any]): self.name = name self.choices = choices self.n_choices = len(choices) self._idx_to_choice = {i: c for i, c in enumerate(choices)} self._choice_to_idx = {c: i for i, c in enumerate(choices)} def sample_uniform(self) -> Any: """Sample uniformly from choices.""" return np.random.choice(self.choices) def to_one_hot(self, choice: Any) -> np.ndarray: """Encode a choice as one-hot vector.""" idx = self._choice_to_idx[choice] one_hot = np.zeros(self.n_choices) one_hot[idx] = 1.0 return one_hot def from_one_hot(self, one_hot: np.ndarray) -> Any: """Decode one-hot vector to choice (argmax).""" idx = np.argmax(one_hot) return self._idx_to_choice[idx] def from_probability(self, probs: np.ndarray) -> Any: """Sample a choice from probability distribution.""" idx = np.random.choice(self.n_choices, p=probs) return self._idx_to_choice[idx] def enumerate(self) -> List[Any]: """List all possible choices.""" return self.choices.copy() class OrdinalParameter: """ Ordinal hyperparameter: discrete with meaningful order. Examples: batch sizes [16, 32, 64, 128, 256] Can be handled as: 1. Categorical (losing order information) 2. Integer index (0, 1, 2, 3, 4) 3. 
Actual values (possibly log-transformed) """ def __init__(self, name: str, values: List[float], log_scale: bool = False): self.name = name self.values = sorted(values) # Ensure ordering self.n_values = len(values) self.log_scale = log_scale # Precompute transformed values for continuous relaxation if log_scale: self._transformed = [np.log(v) for v in self.values] else: self._transformed = self.values.copy() def sample_uniform(self) -> float: """Sample uniformly from ordinal values.""" return np.random.choice(self.values) def to_continuous(self, value: float) -> float: """Convert ordinal value to continuous representation.""" idx = self.values.index(value) return self._transformed[idx] def from_continuous(self, continuous_value: float) -> float: """Find nearest ordinal value from continuous representation.""" distances = [abs(t - continuous_value) for t in self._transformed] nearest_idx = np.argmin(distances) return self.values[nearest_idx] def neighbor(self, value: float, step: int = 1) -> float: """Get neighboring ordinal value (for local search).""" idx = self.values.index(value) new_idx = np.clip(idx + step, 0, self.n_values - 1) return self.values[new_idx] def create_mixed_search_space(): """ Example of a mixed search space containing all types. """ space = { # Continuous parameters 'learning_rate': { 'type': 'continuous', 'bounds': (1e-5, 0.1), 'log_scale': True }, 'dropout': { 'type': 'continuous', 'bounds': (0.0, 0.5), 'log_scale': False }, # Integer parameters 'num_layers': IntegerParameter('num_layers', 1, 6, log_scale=False), 'hidden_units': IntegerParameter('hidden_units', 32, 512, log_scale=True), # Categorical parameters 'optimizer': CategoricalParameter('optimizer', ['sgd', 'adam', 'adamw']), 'activation': CategoricalParameter('activation', ['relu', 'gelu', 'swish']), # Ordinal parameters 'batch_size': OrdinalParameter('batch_size', [16, 32, 64, 128, 256], log_scale=True), } return space def sample_from_mixed_space(space: Dict) -> Dict[str, Any]: """Sample a configuration from a mixed search space.""" config = {} for name, param in space.items(): if isinstance(param, dict): # Continuous parameter low, high = param['bounds'] if param.get('log_scale', False): value = np.exp(np.random.uniform(np.log(low), np.log(high))) else: value = np.random.uniform(low, high) config[name] = value elif isinstance(param, (IntegerParameter, CategoricalParameter, OrdinalParameter)): config[name] = param.sample_uniform() return config # Exampleif __name__ == "__main__": space = create_mixed_search_space() print("Mixed Search Space Example") print("=" * 50) for _ in range(5): config = sample_from_mixed_space(space) print(f"\nSampled configuration:") for name, value in config.items(): print(f" {name}: {value}")A common approach for integer hyperparameters is relaxation and rounding: treat them as continuous during optimization, then round to the nearest integer. This is elegant but introduces subtleties.
The Basic Idea:
For an integer parameter $n \in \{1, 2, 3, 4, 5\}$: relax it to the continuous interval $[0.5, 5.5]$, optimize there as if it were real-valued, then round the result to the nearest integer at evaluation time.
Problems with Naive Rounding: the surrogate sees a flat response between rounding boundaries, distinct continuous proposals can collapse onto the same integer, and the optimizer may repeatedly suggest points that round to a configuration it has already evaluated.
Better Approaches for Integers:
1. Transform Before Optimization
Map integers to a continuous space where uniform sampling gives proper distribution:
```python
# For range [a, b], use bounds [a - 0.5, b + 0.5]
# Then round at evaluation time
```
2. Use Integer-Aware Optimizers
Some Bayesian optimization implementations (SMAC, Optuna) handle integers natively, maintaining proper kernels and acquisition functions.
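As a minimal sketch of native integer handling in Optuna, using a synthetic objective (a real objective would train and validate a model):

```python
import optuna

def objective(trial):
    # Sampled as a true integer on a log scale; no manual relaxation or rounding needed.
    n_estimators = trial.suggest_int("n_estimators", 50, 2000, log=True)
    max_depth = trial.suggest_int("max_depth", 2, 12)
    # Synthetic stand-in for a validation metric; replace with real model evaluation.
    return abs(n_estimators - 400) / 400 + abs(max_depth - 6) / 6

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)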
3. Enumerate for Small Spaces
If the integer range is small (e.g., num_layers ∈ {1, 2, 3, 4}), just treat it as categorical and enumerate. The overhead is minimal.
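A quick sketch of this, using a hypothetical evaluate() stand-in for real training:

```python
# Hypothetical evaluate(); in practice this would train and score a model.
def evaluate(num_layers: int) -> float:
    return abs(num_layers - 3) * 0.1  # Synthetic stand-in for validation loss

scores = {n: evaluate(n) for n in (1, 2, 3, 4)}  # num_layers in {1, 2, 3, 4}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```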
4. Log-Rounding for Wide Ranges
For ranges like n_estimators ∈ {50, ..., 2000}:
```python
# Pseudocode: optimize in log space, then map back to an integer
log_value = optimize_continuous(log(50), log(2000))
actual = round(exp(log_value))
```
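A runnable illustration of the same idea in plain NumPy, using an assumed n_estimators range of [50, 2000]:

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 50, 2000

# Sample uniformly in log space, exponentiate, round, and clip: each order of
# magnitude in the range receives comparable coverage.
log_samples = rng.uniform(np.log(low), np.log(high), size=10)
n_estimators = np.clip(np.round(np.exp(log_samples)), low, high).astype(int)
print(sorted(n_estimators))
```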
For most integer hyperparameters, relaxation + rounding works well enough. The exceptions are: (1) very small ranges where enumeration is better, (2) parameters where specific values matter (e.g., power-of-2 for memory alignment), and (3) parameters that interact nonlinearly with other hyperparameters.
Categorical hyperparameters pose the greatest challenge for optimization because they lack structure that can be exploited. There's no gradient, no meaningful interpolation, and no continuous relaxation.
Strategies for Categorical Variables:
The simplest strategy is exhaustive enumeration: evaluate every possible value.
When to use: Small number of categories (< 10) where evaluation is affordable.
Advantages: complete coverage of every option, no modeling assumptions, and trivially parallel evaluation.
Disadvantages: the cost multiplies with every other hyperparameter being tuned, so it quickly becomes infeasible for many categorical dimensions or expensive evaluations.
Example: Optimizer ∈ {SGD, Adam, AdamW} — just try all three.
"""Strategies for Handling Categorical Hyperparameters Demonstrates one-hot encoding, tree-based handling, and learned embeddings."""import numpy as npfrom typing import List, Dict, Tuple, Anyfrom scipy.special import softmax class CategoricalEncoder: """Base class for categorical encoding strategies.""" def encode(self, category: str) -> np.ndarray: raise NotImplementedError def decode(self, encoding: np.ndarray) -> str: raise NotImplementedError class OneHotEncoder(CategoricalEncoder): """ One-hot encoding for categorical variables. Maps each category to a binary vector. """ def __init__(self, categories: List[str]): self.categories = categories self.n_categories = len(categories) self.cat_to_idx = {c: i for i, c in enumerate(categories)} self.idx_to_cat = {i: c for i, c in enumerate(categories)} def encode(self, category: str) -> np.ndarray: """Encode category as one-hot vector.""" vec = np.zeros(self.n_categories) vec[self.cat_to_idx[category]] = 1.0 return vec def decode(self, encoding: np.ndarray) -> str: """Decode one-hot (or soft) vector to category.""" return self.idx_to_cat[np.argmax(encoding)] def decode_probabilistic(self, encoding: np.ndarray) -> str: """Sample a category from probability distribution.""" probs = softmax(encoding) # Normalize to valid probabilities idx = np.random.choice(self.n_categories, p=probs) return self.idx_to_cat[idx] class LearnedEmbeddingEncoder(CategoricalEncoder): """ Learned embedding for categorical variables. Maps categories to a learned continuous space where similar categories can be close together. """ def __init__(self, categories: List[str], embedding_dim: int = 3): self.categories = categories self.n_categories = len(categories) self.embedding_dim = embedding_dim self.cat_to_idx = {c: i for i, c in enumerate(categories)} # Initialize embeddings randomly self.embeddings = np.random.randn(self.n_categories, embedding_dim) def encode(self, category: str) -> np.ndarray: """Get embedding for category.""" idx = self.cat_to_idx[category] return self.embeddings[idx].copy() def decode(self, encoding: np.ndarray) -> str: """Find nearest category in embedding space.""" distances = np.linalg.norm(self.embeddings - encoding, axis=1) nearest_idx = np.argmin(distances) return self.categories[nearest_idx] def update_embeddings(self, category: str, gradient: np.ndarray, learning_rate: float = 0.01): """Update embeddings via gradient descent.""" idx = self.cat_to_idx[category] self.embeddings[idx] -= learning_rate * gradient def category_distance(self, cat1: str, cat2: str) -> float: """Compute distance between two categories in embedding space.""" emb1 = self.encode(cat1) emb2 = self.encode(cat2) return np.linalg.norm(emb1 - emb2) class TreeBasedCategoricalHandler: """ Simulates how tree-based methods handle categorical variables. Trees can split on categorical features directly, without encoding. This is conceptually how TPE and SMAC work. 
""" def __init__(self, categories: List[str]): self.categories = categories self.n_categories = len(categories) # Track observations for each category self.observations: Dict[str, List[float]] = {c: [] for c in categories} def observe(self, category: str, performance: float): """Record an observation for a category.""" self.observations[category].append(performance) def get_category_statistics(self) -> Dict[str, Dict[str, float]]: """Get mean and variance for each category.""" stats = {} for cat, obs in self.observations.items(): if obs: stats[cat] = { 'mean': np.mean(obs), 'std': np.std(obs) if len(obs) > 1 else float('inf'), 'count': len(obs) } else: stats[cat] = {'mean': None, 'std': None, 'count': 0} return stats def sample_category(self, n_samples: int = 1) -> List[str]: """ Sample categories, balancing exploration (untried) and exploitation (good performers). This mimics how TPE would sample categorical values. """ stats = self.get_category_statistics() samples = [] for _ in range(n_samples): # Prioritize untried categories untried = [c for c, s in stats.items() if s['count'] == 0] if untried: samples.append(np.random.choice(untried)) continue # Otherwise, use Upper Confidence Bound-style selection ucb_scores = [] total_count = sum(s['count'] for s in stats.values()) for cat, s in stats.items(): # Higher is better (assuming minimization, negate) mean = -s['mean'] # Negate for minimization exploration = np.sqrt(2 * np.log(total_count) / s['count']) ucb_scores.append(mean + exploration) # Softmax to get probabilities probs = softmax(np.array(ucb_scores)) idx = np.random.choice(self.n_categories, p=probs) samples.append(self.categories[idx]) return samples # Example usageif __name__ == "__main__": optimizers = ['sgd', 'adam', 'adamw', 'rmsprop'] # One-hot encoding print("One-Hot Encoding") print("=" * 40) encoder = OneHotEncoder(optimizers) for opt in optimizers: print(f"{opt}: {encoder.encode(opt)}") # Learned embeddings print("\nLearned Embeddings (initial)") print("=" * 40) embed_encoder = LearnedEmbeddingEncoder(optimizers, embedding_dim=2) for opt in optimizers: print(f"{opt}: {embed_encoder.encode(opt).round(3)}") print("\nCategory distances:") print(f"adam-adamw: {embed_encoder.category_distance('adam', 'adamw'):.3f}") print(f"adam-sgd: {embed_encoder.category_distance('adam', 'sgd'):.3f}") # Tree-based handling print("\nTree-Based Sampling (after some observations)") print("=" * 40) handler = TreeBasedCategoricalHandler(optimizers) handler.observe('adam', 0.05) handler.observe('adam', 0.06) handler.observe('adamw', 0.04) handler.observe('sgd', 0.15) # rmsprop not observed yet print("Statistics:", handler.get_category_statistics()) print("Next samples:", handler.sample_category(5))Real-world HPO almost always involves mixed spaces—combinations of continuous, integer, and categorical hyperparameters. This creates unique challenges:
Challenge 1: Kernel/Distance Design
For Bayesian optimization, we need a kernel $k(\lambda, \lambda')$ that handles all types:
$$k(\lambda, \lambda') = k_{cont}(\lambda_{cont}, \lambda'_{cont}) \cdot k_{int}(\lambda_{int}, \lambda'_{int}) \cdot k_{cat}(\lambda_{cat}, \lambda'_{cat})$$
Different kernel types are needed for each: for example, a squared-exponential or Matérn kernel on the continuous block, an integer-aware (rounded) kernel on the integer block, and an overlap (Hamming) kernel on the categorical block. A minimal sketch of such a product kernel follows.
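This sketch is not tied to any particular library; the configuration layout (a dict with 'cont' and 'cat' blocks) and the lengthscale value are assumptions made for the example, and the integer block is omitted for brevity.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel over the (already scaled) continuous dimensions."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-0.5 * np.dot(d, d) / lengthscale**2))

def overlap_kernel(c1, c2):
    """Overlap (Hamming) kernel: fraction of categorical values that match."""
    return sum(a == b for a, b in zip(c1, c2)) / len(c1)

def mixed_kernel(lam1, lam2, lengthscale=1.0):
    """Product kernel k_cont * k_cat over a mixed configuration."""
    return (rbf_kernel(lam1['cont'], lam2['cont'], lengthscale)
            * overlap_kernel(lam1['cat'], lam2['cat']))

# Two configurations: [log10(learning_rate), dropout] plus [optimizer, activation]
a = {'cont': [np.log10(1e-3), 0.2], 'cat': ['adam', 'relu']}
b = {'cont': [np.log10(3e-3), 0.3], 'cat': ['adam', 'gelu']}
print(f"k(a, b) = {mixed_kernel(a, b):.3f}")  # Similar continuous values, half the categories match
```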
Challenge 2: Acquisition Function Optimization
Once we have a surrogate model, we need to optimize the acquisition function over the mixed space. This typically requires combining continuous optimization (random restarts or gradient-based search) over the continuous dimensions with enumeration or sampling over the discrete ones; a simple version is sketched below.
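A minimal sketch of that pattern, assuming the acquisition function is cheap to evaluate: enumerate every categorical combination, random-sample the continuous block for each, and keep the candidate with the highest acquisition value. The acquisition function below is a toy stand-in.

```python
import itertools
import numpy as np

def optimize_acquisition(acq, cont_bounds, cat_choices, n_samples=256, seed=0):
    """Maximize acq(x_cont, cats) over a mixed space by enumerating
    categorical combinations and random-sampling the continuous block."""
    rng = np.random.default_rng(seed)
    lows = np.array([b[0] for b in cont_bounds])
    highs = np.array([b[1] for b in cont_bounds])
    best_val, best_cfg = -np.inf, None
    for cats in itertools.product(*cat_choices):
        # Random continuous candidates; a real implementation might run
        # gradient-based restarts here instead.
        candidates = rng.uniform(lows, highs, size=(n_samples, len(cont_bounds)))
        for x in candidates:
            val = acq(x, cats)
            if val > best_val:
                best_val, best_cfg = val, (x.tolist(), cats)
    return best_cfg, best_val

# Toy acquisition: prefers log10(lr) near -3 and the 'adam' optimizer.
toy_acq = lambda x, cats: -(x[0] + 3.0) ** 2 + (1.0 if cats[0] == 'adam' else 0.0)
cfg, val = optimize_acquisition(toy_acq, [(-5.0, -1.0)], [('sgd', 'adam', 'adamw')])
print(cfg, round(val, 3))
```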
Tree-based methods (Random Forests, TPE) naturally handle mixed spaces because trees can split on any feature type. This is why SMAC and TPE-based tools (Optuna, Hyperopt) are often preferred for practical HPO over GP-based methods.
Practical Algorithms for Mixed Spaces:
1. TPE (Tree-structured Parzen Estimator): models the densities of good and bad configurations separately and handles continuous, integer, and categorical parameters natively (used by Optuna and Hyperopt).
2. SMAC (Sequential Model-based Algorithm Configuration): uses a random-forest surrogate, which splits directly on any parameter type.
3. Mixed-Space Bayesian Optimization: Gaussian-process BO with specialized kernels for integer and categorical dimensions.
4. Evolutionary Algorithms: apply mutation and crossover operators defined per parameter type, so mixed spaces need no special encoding.
| Algorithm | Continuous | Integer | Categorical | Mixed |
|---|---|---|---|---|
| GP-BO (Gaussian Process) | ⭐⭐⭐ | ⭐⭐ | ⭐ | ⭐⭐ |
| TPE (Optuna, Hyperopt) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| SMAC (Random Forest) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Random Search | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Grid Search | ⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐ |
| Evolutionary | ⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐ |
Let's consolidate our understanding into practical guidance for handling different hyperparameter types:
Modern HPO libraries handle most of this automatically:
• Optuna: Set log=True for log-scale continuous, use CategoricalDistribution for categories
• Ray Tune: Use tune.loguniform() and tune.choice()
• SMAC: Configure parameter types in the configuration space
Let the library handle the details—just specify types and ranges correctly.
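To make that concrete, here is a hedged sketch of a mixed search space in Optuna; the loss formula is a synthetic stand-in for real training and validation.

```python
import numpy as np
import optuna

def objective(trial):
    # Continuous (learning rate on a log scale)
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # Integers (hidden_units on a log scale)
    num_layers = trial.suggest_int("num_layers", 1, 6)
    hidden_units = trial.suggest_int("hidden_units", 32, 512, log=True)
    # Categorical (batch size treated as a categorical choice here)
    optimizer = trial.suggest_categorical("optimizer", ["sgd", "adam", "adamw"])
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128, 256])
    # Synthetic stand-in for validation loss; replace with real training code.
    loss = (np.log10(lr) + 3) ** 2 + dropout + 0.05 * abs(num_layers - 3)
    loss += 0.001 * abs(hidden_units - 128) / 128 + 0.001 * batch_size / 256
    loss += 0.0 if optimizer == "adam" else 0.2
    return loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```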
Understanding hyperparameter types is essential for effective HPO. The structure of your search space fundamentally constrains what optimization strategies can work.
What's Next
With continuous and discrete hyperparameters understood, we'll next explore conditional hyperparameters—hyperparameters that only exist when another hyperparameter takes certain values. This adds another layer of complexity to search space design.
You now understand how different hyperparameter types affect optimization and can make informed choices about encoding, sampling, and optimization strategies for each type.