Hyperparameters come in fundamentally different flavors. Continuous hyperparameters like learning rate can take any value in a range—there's always a value between any two values you pick. Discrete hyperparameters like number of layers can only take specific values—there's no such thing as 2.7 layers.
This distinction isn't merely taxonomic—it has profound implications for how we optimize. Continuous spaces support calculus-based reasoning: gradients, interpolation, and smooth optimization. Discrete spaces require combinatorial thinking: enumeration, local search, and different notions of 'nearby' configurations.
In this page, we'll develop a deep understanding of both types and the challenges that arise when optimizing mixed spaces containing both.
By the end of this page, you will:
• Understand the mathematical properties of continuous vs discrete spaces
• Know how different HPO algorithms handle each type
• Master techniques for handling integer and categorical hyperparameters
• Design effective strategies for mixed search spaces
Continuous hyperparameters take values in a continuous subset of the real numbers, typically a bounded interval $[a, b] \subset \mathbb{R}$. They are the most mathematically tractable type, enabling powerful optimization techniques.
Mathematical Properties: continuous domains are totally ordered and dense (between any two values there is always another), and they carry a natural notion of distance. These properties are what make gradient-based reasoning and smooth surrogate models possible.
Common Continuous Hyperparameters:
| Category | Hyperparameter | Typical Domain | Recommended Scale |
|---|---|---|---|
| Optimization | Learning rate | [1e-6, 1.0] | Log |
| Optimization | Momentum | [0.0, 0.999] | Linear, or log scale applied to 1 − momentum |
| Optimization | Weight decay | [1e-8, 0.1] | Log |
| Regularization | L1 penalty (α) | [1e-8, 10] | Log |
| Regularization | L2 penalty (λ) | [1e-8, 10] | Log |
| Regularization | Dropout probability | [0.0, 0.7] | Linear |
| Kernel | RBF γ | [1e-6, 1e3] | Log |
| Kernel | Polynomial degree (when real-valued) | [1.0, 5.0] | Linear |
| Sampling | Subsample ratio | [0.5, 1.0] | Linear |
| Sampling | Column sample ratio | [0.5, 1.0] | Linear |
Optimization Advantages:
Continuous spaces enable powerful techniques: gradient-based refinement, interpolation between evaluated points, and smooth surrogate models such as Gaussian processes.
The Continuity Assumption:
Most HPO methods implicitly assume the response surface is continuous. For a continuous hyperparameter $\lambda$, this means:
$$|f(\lambda_1) - f(\lambda_2)| \leq L |\lambda_1 - \lambda_2|$$
for some Lipschitz constant $L$. This assumption justifies using nearby evaluations to predict performance at new points.
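For example, with $L = 2$, changing the learning rate by 0.01 can change the validation metric by at most 0.02, which is exactly what lets a surrogate model trust nearby observations.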
Some 'continuous' hyperparameters exhibit discontinuous behavior. Example: At learning rate thresholds, training may suddenly diverge, creating cliff-like drops in performance. Robust HPO methods should handle such discontinuities gracefully.
Discrete hyperparameters take values from a finite set. They come in two important subtypes:
Integer Hyperparameters take values from $\{a, a+1, \ldots, b\} \subset \mathbb{Z}$. Examples include the number of layers, the number of hidden units per layer, and the number of trees (n_estimators).
Categorical Hyperparameters take values from an unordered set $\{c_1, c_2, \ldots, c_k\}$. Examples include the optimizer type (SGD, Adam, AdamW) and the activation function (ReLU, GELU, Swish).
The key distinction: integers have ordering and distance, while categories have neither.
Ordinal Hyperparameters: The Middle Ground
Some hyperparameters are discrete with meaningful order but non-uniform spacing, for example batch size ∈ {16, 32, 64, 128, 256}.
These are ordinal: we know a batch size of 256 is 'larger' than 32, but the spacing isn't uniform. They're often treated as categorical (discarding the order), as integer indices (0, 1, 2, ...), or as their actual values on a log scale, as in the module below.
"""Handling Discrete Hyperparameters This module demonstrates different strategies for encoding andoptimizing integer, categorical, and ordinal hyperparameters."""import numpy as npfrom typing import List, Dict, Any, Unionfrom enum import Enum class DiscreteHandling(Enum): """Strategies for handling discrete hyperparameters.""" ENUMERATE = "enumerate" # Exhaustive enumeration RELAX_ROUND = "relax_round" # Treat as continuous, round result ONE_HOT = "one_hot" # One-hot encoding for categorical ORDINAL = "ordinal" # Integer index for ordinal class IntegerParameter: """ Integer hyperparameter with bounds. Can be treated as: 1. Discrete: enumerate all values 2. Relaxed continuous: optimize in [low, high], round at end """ def __init__(self, name: str, low: int, high: int, log_scale: bool = False): self.name = name self.low = low self.high = high self.log_scale = log_scale self.n_values = high - low + 1 def sample_uniform(self) -> int: """Sample uniformly from the integer range.""" if self.log_scale: # Sample on log scale, then round log_value = np.random.uniform(np.log(self.low), np.log(self.high)) return int(np.round(np.exp(log_value))) else: return np.random.randint(self.low, self.high + 1) def relax_and_round(self, continuous_value: float) -> int: """Convert a continuous relaxation back to integer.""" if self.log_scale: # continuous_value is in log space actual = np.exp(continuous_value) else: actual = continuous_value # Round and clip to valid range return int(np.clip(np.round(actual), self.low, self.high)) def to_continuous_bounds(self) -> tuple: """Get bounds for continuous relaxation.""" if self.log_scale: return (np.log(self.low), np.log(self.high)) else: return (self.low - 0.5, self.high + 0.5) # Allow rounding def enumerate(self) -> List[int]: """List all possible values.""" return list(range(self.low, self.high + 1)) class CategoricalParameter: """ Categorical hyperparameter with no natural ordering. Must be handled via: 1. Enumeration 2. One-hot encoding 3. Learned embeddings """ def __init__(self, name: str, choices: List[Any]): self.name = name self.choices = choices self.n_choices = len(choices) self._idx_to_choice = {i: c for i, c in enumerate(choices)} self._choice_to_idx = {c: i for i, c in enumerate(choices)} def sample_uniform(self) -> Any: """Sample uniformly from choices.""" return np.random.choice(self.choices) def to_one_hot(self, choice: Any) -> np.ndarray: """Encode a choice as one-hot vector.""" idx = self._choice_to_idx[choice] one_hot = np.zeros(self.n_choices) one_hot[idx] = 1.0 return one_hot def from_one_hot(self, one_hot: np.ndarray) -> Any: """Decode one-hot vector to choice (argmax).""" idx = np.argmax(one_hot) return self._idx_to_choice[idx] def from_probability(self, probs: np.ndarray) -> Any: """Sample a choice from probability distribution.""" idx = np.random.choice(self.n_choices, p=probs) return self._idx_to_choice[idx] def enumerate(self) -> List[Any]: """List all possible choices.""" return self.choices.copy() class OrdinalParameter: """ Ordinal hyperparameter: discrete with meaningful order. Examples: batch sizes [16, 32, 64, 128, 256] Can be handled as: 1. Categorical (losing order information) 2. Integer index (0, 1, 2, 3, 4) 3. 
Actual values (possibly log-transformed) """ def __init__(self, name: str, values: List[float], log_scale: bool = False): self.name = name self.values = sorted(values) # Ensure ordering self.n_values = len(values) self.log_scale = log_scale # Precompute transformed values for continuous relaxation if log_scale: self._transformed = [np.log(v) for v in self.values] else: self._transformed = self.values.copy() def sample_uniform(self) -> float: """Sample uniformly from ordinal values.""" return np.random.choice(self.values) def to_continuous(self, value: float) -> float: """Convert ordinal value to continuous representation.""" idx = self.values.index(value) return self._transformed[idx] def from_continuous(self, continuous_value: float) -> float: """Find nearest ordinal value from continuous representation.""" distances = [abs(t - continuous_value) for t in self._transformed] nearest_idx = np.argmin(distances) return self.values[nearest_idx] def neighbor(self, value: float, step: int = 1) -> float: """Get neighboring ordinal value (for local search).""" idx = self.values.index(value) new_idx = np.clip(idx + step, 0, self.n_values - 1) return self.values[new_idx] def create_mixed_search_space(): """ Example of a mixed search space containing all types. """ space = { # Continuous parameters 'learning_rate': { 'type': 'continuous', 'bounds': (1e-5, 0.1), 'log_scale': True }, 'dropout': { 'type': 'continuous', 'bounds': (0.0, 0.5), 'log_scale': False }, # Integer parameters 'num_layers': IntegerParameter('num_layers', 1, 6, log_scale=False), 'hidden_units': IntegerParameter('hidden_units', 32, 512, log_scale=True), # Categorical parameters 'optimizer': CategoricalParameter('optimizer', ['sgd', 'adam', 'adamw']), 'activation': CategoricalParameter('activation', ['relu', 'gelu', 'swish']), # Ordinal parameters 'batch_size': OrdinalParameter('batch_size', [16, 32, 64, 128, 256], log_scale=True), } return space def sample_from_mixed_space(space: Dict) -> Dict[str, Any]: """Sample a configuration from a mixed search space.""" config = {} for name, param in space.items(): if isinstance(param, dict): # Continuous parameter low, high = param['bounds'] if param.get('log_scale', False): value = np.exp(np.random.uniform(np.log(low), np.log(high))) else: value = np.random.uniform(low, high) config[name] = value elif isinstance(param, (IntegerParameter, CategoricalParameter, OrdinalParameter)): config[name] = param.sample_uniform() return config # Exampleif __name__ == "__main__": space = create_mixed_search_space() print("Mixed Search Space Example") print("=" * 50) for _ in range(5): config = sample_from_mixed_space(space) print(f"\nSampled configuration:") for name, value in config.items(): print(f" {name}: {value}")A common approach for integer hyperparameters is relaxation and rounding: treat them as continuous during optimization, then round to the nearest integer. This is elegant but introduces subtleties.
The Basic Idea:
For an integer parameter $n \in \{1, 2, 3, 4, 5\}$: relax it to the continuous interval $[0.5, 5.5]$, optimize there as if it were real-valued, then round the result to the nearest integer at evaluation time.
Problems with Naive Rounding: the surrogate sees a flat response between rounding boundaries, distinct continuous proposals can collapse onto the same integer, and the optimizer may repeatedly suggest points that round to a configuration it has already evaluated.
Better Approaches for Integers:
1. Transform Before Optimization
Map integers to a continuous space where uniform sampling gives proper distribution:
```python
# For range [a, b], use bounds [a - 0.5, b + 0.5]
# Then round at evaluation time
```
2. Use Integer-Aware Optimizers
Some Bayesian optimization implementations (SMAC, Optuna) handle integers natively, maintaining proper kernels and acquisition functions.
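As a minimal sketch of native integer handling in Optuna, using a synthetic objective (a real objective would train and validate a model):

```python
import optuna

def objective(trial):
    # Sampled as a true integer on a log scale; no manual relaxation or rounding needed.
    n_estimators = trial.suggest_int("n_estimators", 50, 2000, log=True)
    max_depth = trial.suggest_int("max_depth", 2, 12)
    # Synthetic stand-in for a validation metric; replace with real model evaluation.
    return abs(n_estimators - 400) / 400 + abs(max_depth - 6) / 6

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)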
3. Enumerate for Small Spaces
If the integer range is small (e.g., num_layers ∈ {1, 2, 3, 4}), just treat it as categorical and enumerate. The overhead is minimal.
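A quick sketch of this, using a hypothetical evaluate() stand-in for real training:

```python
# Hypothetical evaluate(); in practice this would train and score a model.
def evaluate(num_layers: int) -> float:
    return abs(num_layers - 3) * 0.1  # Synthetic stand-in for validation loss

scores = {n: evaluate(n) for n in (1, 2, 3, 4)}  # num_layers in {1, 2, 3, 4}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```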
4. Log-Rounding for Wide Ranges
For ranges like n_estimators ∈ {50, ..., 2000}:
```python
# Pseudocode: optimize in log space, then map back to an integer
log_value = optimize_continuous(log(50), log(2000))
actual = round(exp(log_value))
```
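A runnable illustration of the same idea in plain NumPy, using an assumed n_estimators range of [50, 2000]:

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 50, 2000

# Sample uniformly in log space, exponentiate, round, and clip: each order of
# magnitude in the range receives comparable coverage.
log_samples = rng.uniform(np.log(low), np.log(high), size=10)
n_estimators = np.clip(np.round(np.exp(log_samples)), low, high).astype(int)
print(sorted(n_estimators))
```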
For most integer hyperparameters, relaxation + rounding works well enough. The exceptions are: (1) very small ranges where enumeration is better, (2) parameters where specific values matter (e.g., power-of-2 for memory alignment), and (3) parameters that interact nonlinearly with other hyperparameters.
Categorical hyperparameters pose the greatest challenge for optimization because they lack structure that can be exploited. There's no gradient, no meaningful interpolation, and no continuous relaxation.
Strategies for Categorical Variables:
The simplest strategy is exhaustive enumeration: evaluate every possible value.
When to use: Small number of categories (< 10) where evaluation is affordable.
Advantages: complete coverage of every option, no modeling assumptions, and trivially parallel evaluation.
Disadvantages: the cost multiplies with every other hyperparameter being tuned, so it quickly becomes infeasible for many categorical dimensions or expensive evaluations.
Example: Optimizer ∈ {SGD, Adam, AdamW} — just try all three.
"""Strategies for Handling Categorical Hyperparameters Demonstrates one-hot encoding, tree-based handling, and learned embeddings."""import numpy as npfrom typing import List, Dict, Tuple, Anyfrom scipy.special import softmax class CategoricalEncoder: """Base class for categorical encoding strategies.""" def encode(self, category: str) -> np.ndarray: raise NotImplementedError def decode(self, encoding: np.ndarray) -> str: raise NotImplementedError class OneHotEncoder(CategoricalEncoder): """ One-hot encoding for categorical variables. Maps each category to a binary vector. """ def __init__(self, categories: List[str]): self.categories = categories self.n_categories = len(categories) self.cat_to_idx = {c: i for i, c in enumerate(categories)} self.idx_to_cat = {i: c for i, c in enumerate(categories)} def encode(self, category: str) -> np.ndarray: """Encode category as one-hot vector.""" vec = np.zeros(self.n_categories) vec[self.cat_to_idx[category]] = 1.0 return vec def decode(self, encoding: np.ndarray) -> str: """Decode one-hot (or soft) vector to category.""" return self.idx_to_cat[np.argmax(encoding)] def decode_probabilistic(self, encoding: np.ndarray) -> str: """Sample a category from probability distribution.""" probs = softmax(encoding) # Normalize to valid probabilities idx = np.random.choice(self.n_categories, p=probs) return self.idx_to_cat[idx] class LearnedEmbeddingEncoder(CategoricalEncoder): """ Learned embedding for categorical variables. Maps categories to a learned continuous space where similar categories can be close together. """ def __init__(self, categories: List[str], embedding_dim: int = 3): self.categories = categories self.n_categories = len(categories) self.embedding_dim = embedding_dim self.cat_to_idx = {c: i for i, c in enumerate(categories)} # Initialize embeddings randomly self.embeddings = np.random.randn(self.n_categories, embedding_dim) def encode(self, category: str) -> np.ndarray: """Get embedding for category.""" idx = self.cat_to_idx[category] return self.embeddings[idx].copy() def decode(self, encoding: np.ndarray) -> str: """Find nearest category in embedding space.""" distances = np.linalg.norm(self.embeddings - encoding, axis=1) nearest_idx = np.argmin(distances) return self.categories[nearest_idx] def update_embeddings(self, category: str, gradient: np.ndarray, learning_rate: float = 0.01): """Update embeddings via gradient descent.""" idx = self.cat_to_idx[category] self.embeddings[idx] -= learning_rate * gradient def category_distance(self, cat1: str, cat2: str) -> float: """Compute distance between two categories in embedding space.""" emb1 = self.encode(cat1) emb2 = self.encode(cat2) return np.linalg.norm(emb1 - emb2) class TreeBasedCategoricalHandler: """ Simulates how tree-based methods handle categorical variables. Trees can split on categorical features directly, without encoding. This is conceptually how TPE and SMAC work. 
""" def __init__(self, categories: List[str]): self.categories = categories self.n_categories = len(categories) # Track observations for each category self.observations: Dict[str, List[float]] = {c: [] for c in categories} def observe(self, category: str, performance: float): """Record an observation for a category.""" self.observations[category].append(performance) def get_category_statistics(self) -> Dict[str, Dict[str, float]]: """Get mean and variance for each category.""" stats = {} for cat, obs in self.observations.items(): if obs: stats[cat] = { 'mean': np.mean(obs), 'std': np.std(obs) if len(obs) > 1 else float('inf'), 'count': len(obs) } else: stats[cat] = {'mean': None, 'std': None, 'count': 0} return stats def sample_category(self, n_samples: int = 1) -> List[str]: """ Sample categories, balancing exploration (untried) and exploitation (good performers). This mimics how TPE would sample categorical values. """ stats = self.get_category_statistics() samples = [] for _ in range(n_samples): # Prioritize untried categories untried = [c for c, s in stats.items() if s['count'] == 0] if untried: samples.append(np.random.choice(untried)) continue # Otherwise, use Upper Confidence Bound-style selection ucb_scores = [] total_count = sum(s['count'] for s in stats.values()) for cat, s in stats.items(): # Higher is better (assuming minimization, negate) mean = -s['mean'] # Negate for minimization exploration = np.sqrt(2 * np.log(total_count) / s['count']) ucb_scores.append(mean + exploration) # Softmax to get probabilities probs = softmax(np.array(ucb_scores)) idx = np.random.choice(self.n_categories, p=probs) samples.append(self.categories[idx]) return samples # Example usageif __name__ == "__main__": optimizers = ['sgd', 'adam', 'adamw', 'rmsprop'] # One-hot encoding print("One-Hot Encoding") print("=" * 40) encoder = OneHotEncoder(optimizers) for opt in optimizers: print(f"{opt}: {encoder.encode(opt)}") # Learned embeddings print("\nLearned Embeddings (initial)") print("=" * 40) embed_encoder = LearnedEmbeddingEncoder(optimizers, embedding_dim=2) for opt in optimizers: print(f"{opt}: {embed_encoder.encode(opt).round(3)}") print("\nCategory distances:") print(f"adam-adamw: {embed_encoder.category_distance('adam', 'adamw'):.3f}") print(f"adam-sgd: {embed_encoder.category_distance('adam', 'sgd'):.3f}") # Tree-based handling print("\nTree-Based Sampling (after some observations)") print("=" * 40) handler = TreeBasedCategoricalHandler(optimizers) handler.observe('adam', 0.05) handler.observe('adam', 0.06) handler.observe('adamw', 0.04) handler.observe('sgd', 0.15) # rmsprop not observed yet print("Statistics:", handler.get_category_statistics()) print("Next samples:", handler.sample_category(5))Real-world HPO almost always involves mixed spaces—combinations of continuous, integer, and categorical hyperparameters. This creates unique challenges:
Challenge 1: Kernel/Distance Design
For Bayesian optimization, we need a kernel $k(\lambda, \lambda')$ that handles all types:
$$k(\lambda, \lambda') = k_{cont}(\lambda_{cont}, \lambda'_{cont}) \cdot k_{int}(\lambda_{int}, \lambda'_{int}) \cdot k_{cat}(\lambda_{cat}, \lambda'_{cat})$$
Different kernel types are needed for each: for example, a squared-exponential or Matérn kernel on the continuous block, an integer-aware (rounded) kernel on the integer block, and an overlap (Hamming) kernel on the categorical block. A minimal sketch of such a product kernel follows.
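This sketch is not tied to any particular library; the configuration layout (a dict with 'cont' and 'cat' blocks) and the lengthscale value are assumptions made for the example, and the integer block is omitted for brevity.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel over the (already scaled) continuous dimensions."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.exp(-0.5 * np.dot(d, d) / lengthscale**2))

def overlap_kernel(c1, c2):
    """Overlap (Hamming) kernel: fraction of categorical values that match."""
    return sum(a == b for a, b in zip(c1, c2)) / len(c1)

def mixed_kernel(lam1, lam2, lengthscale=1.0):
    """Product kernel k_cont * k_cat over a mixed configuration."""
    return (rbf_kernel(lam1['cont'], lam2['cont'], lengthscale)
            * overlap_kernel(lam1['cat'], lam2['cat']))

# Two configurations: [log10(learning_rate), dropout] plus [optimizer, activation]
a = {'cont': [np.log10(1e-3), 0.2], 'cat': ['adam', 'relu']}
b = {'cont': [np.log10(3e-3), 0.3], 'cat': ['adam', 'gelu']}
print(f"k(a, b) = {mixed_kernel(a, b):.3f}")  # Similar continuous values, half the categories match
```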
Challenge 2: Acquisition Function Optimization
Once we have a surrogate model, we need to optimize the acquisition function over the mixed space. This typically requires combining continuous optimization (random restarts or gradient-based search) over the continuous dimensions with enumeration or sampling over the discrete ones; a simple version is sketched below.
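A minimal sketch of that pattern, assuming the acquisition function is cheap to evaluate: enumerate every categorical combination, random-sample the continuous block for each, and keep the candidate with the highest acquisition value. The acquisition function below is a toy stand-in.

```python
import itertools
import numpy as np

def optimize_acquisition(acq, cont_bounds, cat_choices, n_samples=256, seed=0):
    """Maximize acq(x_cont, cats) over a mixed space by enumerating
    categorical combinations and random-sampling the continuous block."""
    rng = np.random.default_rng(seed)
    lows = np.array([b[0] for b in cont_bounds])
    highs = np.array([b[1] for b in cont_bounds])
    best_val, best_cfg = -np.inf, None
    for cats in itertools.product(*cat_choices):
        # Random continuous candidates; a real implementation might run
        # gradient-based restarts here instead.
        candidates = rng.uniform(lows, highs, size=(n_samples, len(cont_bounds)))
        for x in candidates:
            val = acq(x, cats)
            if val > best_val:
                best_val, best_cfg = val, (x.tolist(), cats)
    return best_cfg, best_val

# Toy acquisition: prefers log10(lr) near -3 and the 'adam' optimizer.
toy_acq = lambda x, cats: -(x[0] + 3.0) ** 2 + (1.0 if cats[0] == 'adam' else 0.0)
cfg, val = optimize_acquisition(toy_acq, [(-5.0, -1.0)], [('sgd', 'adam', 'adamw')])
print(cfg, round(val, 3))
```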
Tree-based methods (Random Forests, TPE) naturally handle mixed spaces because trees can split on any feature type. This is why SMAC and TPE-based tools (Optuna, Hyperopt) are often preferred for practical HPO over GP-based methods.
Practical Algorithms for Mixed Spaces:
1. TPE (Tree-structured Parzen Estimator): models the densities of good and bad configurations separately and handles continuous, integer, and categorical parameters natively (used by Optuna and Hyperopt).
2. SMAC (Sequential Model-based Algorithm Configuration): uses a random-forest surrogate, which splits directly on any parameter type.
3. Mixed-Space Bayesian Optimization: Gaussian-process BO with specialized kernels for integer and categorical dimensions.
4. Evolutionary Algorithms: apply mutation and crossover operators defined per parameter type, so mixed spaces need no special encoding.
| Algorithm | Continuous | Integer | Categorical | Mixed |
|---|---|---|---|---|
| GP-BO (Gaussian Process) | ⭐⭐⭐ | ⭐⭐ | ⭐ | ⭐⭐ |
| TPE (Optuna, Hyperopt) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| SMAC (Random Forest) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Random Search | ⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Grid Search | ⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐ |
| Evolutionary | ⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐ |
Let's consolidate our understanding into practical guidance for handling different hyperparameter types:
Modern HPO libraries handle most of this automatically:
• Optuna: Set log=True for log-scale continuous, use CategoricalDistribution for categories
• Ray Tune: Use tune.loguniform() and tune.choice()
• SMAC: Configure parameter types in the configuration space
Let the library handle the details—just specify types and ranges correctly.
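To make that concrete, here is a hedged sketch of a mixed search space in Optuna; the loss formula is a synthetic stand-in for real training and validation.

```python
import numpy as np
import optuna

def objective(trial):
    # Continuous (learning rate on a log scale)
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # Integers (hidden_units on a log scale)
    num_layers = trial.suggest_int("num_layers", 1, 6)
    hidden_units = trial.suggest_int("hidden_units", 32, 512, log=True)
    # Categorical (batch size treated as a categorical choice here)
    optimizer = trial.suggest_categorical("optimizer", ["sgd", "adam", "adamw"])
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128, 256])
    # Synthetic stand-in for validation loss; replace with real training code.
    loss = (np.log10(lr) + 3) ** 2 + dropout + 0.05 * abs(num_layers - 3)
    loss += 0.001 * abs(hidden_units - 128) / 128 + 0.001 * batch_size / 256
    loss += 0.0 if optimizer == "adam" else 0.2
    return loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```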
Understanding hyperparameter types is essential for effective HPO. The structure of your search space fundamentally constrains what optimization strategies can work.
What's Next
With continuous and discrete hyperparameters understood, we'll next explore conditional hyperparameters—hyperparameters that only exist when another hyperparameter takes certain values. This adds another layer of complexity to search space design.
You now understand how different hyperparameter types affect optimization and can make informed choices about encoding, sampling, and optimization strategies for each type.