Gaussian Processes are powerful surrogates, but they have limitations. They struggle with categorical hyperparameters (which require special kernels), conditional hyperparameters, high-dimensional spaces, and large observation counts, since fitting scales as O(n³).
Tree-structured Parzen Estimators (TPE) offer an elegant alternative. Instead of modeling p(y|x) like GPs, TPE models p(x|y)—the probability of configurations given their performance. This inversion enables natural handling of complex, mixed hyperparameter spaces.
TPE is the default algorithm in Optuna and Hyperopt, two of the most popular hyperparameter optimization libraries.
By the end of this page, you will understand: how TPE inverts the modeling problem, the construction of l(x) and g(x) density estimators, why tree structure handles conditionals, and practical usage with Optuna.
Traditional Bayesian Optimization models p(y|x)—the probability of objective values given configurations. TPE flips this:
TPE models p(x|y)—the probability of configurations given objective values.
The key insight: Instead of predicting "what score will configuration x achieve?", TPE asks "what configurations tend to achieve good vs. bad scores?"
Two density estimators:
TPE divides observations into two groups based on a quantile threshold γ:
$$l(x) = p(x | y < y^*)$$ — density of configurations with good (below threshold) scores
$$g(x) = p(x | y \geq y^*)$$ — density of configurations with bad (above threshold) scores
where $y^* = \text{quantile}_{\gamma}(y_1, ..., y_n)$; γ is typically chosen so that the best 15-25% of observations fall below the threshold.
```python
import numpy as np
from scipy.stats import gaussian_kde


class SimpleTPE:
    """
    Simplified TPE implementation demonstrating core concepts.

    Key insight: instead of p(y|x), model p(x|y) via two densities:
    - l(x): density of x given y < threshold (good)
    - g(x): density of x given y >= threshold (bad)
    """

    def __init__(self, gamma: float = 0.25):
        """
        gamma: quantile threshold for good/bad split,
               e.g., gamma=0.25 means top 25% are "good"
        """
        self.gamma = gamma
        self.l = None  # KDE for good observations
        self.g = None  # KDE for bad observations

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit l(x) and g(x) density estimators."""
        # Determine threshold (gamma quantile)
        n_good = max(1, int(len(y) * self.gamma))
        threshold = np.sort(y)[n_good - 1]

        # Split observations
        good_mask = y <= threshold
        X_good = X[good_mask]
        X_bad = X[~good_mask]

        # Fit kernel density estimators
        if len(X_good) > 1:
            self.l = gaussian_kde(X_good.T)
        if len(X_bad) > 1:
            self.g = gaussian_kde(X_bad.T)

    def acquisition(self, X_candidates: np.ndarray) -> np.ndarray:
        """
        TPE acquisition: maximize l(x) / g(x).

        This is equivalent to Expected Improvement under certain
        assumptions about the distribution of y.
        """
        # Compute densities (uniform fallback before estimators exist)
        l_vals = self.l(X_candidates.T) if self.l is not None else np.ones(len(X_candidates))
        g_vals = self.g(X_candidates.T) if self.g is not None else np.ones(len(X_candidates))

        # Avoid division by zero
        g_vals = np.maximum(g_vals, 1e-10)

        # Acquisition: ratio of densities
        return l_vals / g_vals

    def suggest(self, X_candidates: np.ndarray) -> np.ndarray:
        """Suggest next point to evaluate."""
        scores = self.acquisition(X_candidates)
        return X_candidates[np.argmax(scores)]
```

Maximizing l(x)/g(x) finds configurations that are common among good observations and rare among bad observations. It's Bayesian: we want high probability of being good (high l(x)) and low probability of being bad (low g(x)).
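To see how the pieces fit together, here is a minimal usage sketch. The quadratic objective, search interval, and candidate pool are illustrative assumptions, not part of the original example:

```python
import numpy as np

# Hypothetical 1-D objective to minimize (illustrative only)
def objective(x):
    return (x - 0.3) ** 2

rng = np.random.default_rng(0)

# Seed with a few random evaluations, then alternate fit/suggest
X = rng.uniform(0, 1, size=10)
y = objective(X)

tpe = SimpleTPE(gamma=0.25)
for _ in range(20):
    tpe.fit(X, y)
    candidates = rng.uniform(0, 1, size=100)  # random candidate pool
    x_next = tpe.suggest(candidates)
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(f"Best x: {X[np.argmin(y)]:.3f}, best y: {y.min():.5f}")
```

Note that production TPE implementations draw the candidates from l(x) itself rather than uniformly; a uniform pool keeps the sketch short.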
The l(x)/g(x) acquisition has a deep connection to Expected Improvement. Using Bayes' rule:
$$\text{EI}(x) \propto \left(\gamma + \frac{g(x)}{l(x)}(1 - \gamma)\right)^{-1}$$
Derivation sketch:
By Bayes: $p(y < y^* | x) = \frac{p(x | y < y^*) \, p(y < y^*)}{p(x)}$
Using $p(x) = l(x) \, p(y < y^*) + g(x) \, p(y \geq y^*)$
Substituting $p(y < y^*) = \gamma$:
$$p(y < y^* | x) = \frac{\gamma l(x)}{\gamma l(x) + (1-\gamma) g(x)}$$
Thus, maximizing l(x)/g(x) is equivalent to maximizing the probability of improvement (and closely related to EI).
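To make the equivalence concrete: for any fixed γ, $p(y < y^* | x)$ is a monotone increasing function of the ratio l(x)/g(x), so both criteria rank candidates identically. The following sanity check uses made-up density values (illustrative, not from the text):

```python
import numpy as np

gamma = 0.25
# Illustrative density values at three candidate points
l_vals = np.array([2.0, 0.5, 1.2])  # l(x): density under "good" observations
g_vals = np.array([0.4, 1.5, 1.2])  # g(x): density under "bad" observations

ratio = l_vals / g_vals
prob_improve = gamma * l_vals / (gamma * l_vals + (1 - gamma) * g_vals)

# Both criteria order the candidates identically (no ties here)
assert (np.argsort(ratio) == np.argsort(prob_improve)).all()
print(ratio)         # [5.0, 0.333..., 1.0]
print(prob_improve)  # [0.625, 0.1, 0.25]
```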
| Aspect | GP-EI | TPE |
|---|---|---|
| Model | p(y\|x) via Gaussian Process | p(x\|y) via density estimators |
| Acquisition | EI in closed form | l(x)/g(x) ratio |
| Continuous hyperparameters | Natural fit | Via Parzen estimators |
| Categorical hyperparameters | Requires special kernels | Natural via trees |
| Conditional hyperparameters | Difficult | Elegant via tree structure |
| Computational scaling | O(n³) | O(n log n) |
| High dimensions | Struggles (>20D) | Handles better |
The "tree-structured" in TPE refers to how it handles conditional hyperparameters—parameters that only exist depending on other choices.
Example: neural network optimizer configuration
```text
optimizer_type: ['sgd', 'adam', 'rmsprop']
if optimizer_type == 'sgd':
    momentum: [0.0, 0.99]
if optimizer_type == 'adam':
    beta1: [0.8, 0.999]
    beta2: [0.9, 0.9999]
learning_rate: [1e-5, 1.0]  # Always present
```
The hyperparameter space forms a tree: optimizer_type is the root, and momentum vs beta1/beta2 are conditional branches.
TPE handles this elegantly:
```python
import numpy as np
from scipy.stats import gaussian_kde


class ConditionalTPE:
    """
    TPE with tree-structured conditional hyperparameters.

    Key insight: model each hyperparameter independently,
    conditioned on its parent values in the tree.
    """

    def __init__(self, search_space):
        """
        search_space: dict defining the tree structure

        Example:
        {
            'optimizer': {
                'type': 'categorical',
                'choices': ['sgd', 'adam'],
                'children': {
                    'sgd': {
                        'momentum': {'type': 'float', 'range': [0, 0.99]}
                    },
                    'adam': {
                        'beta1': {'type': 'float', 'range': [0.8, 0.999]},
                        'beta2': {'type': 'float', 'range': [0.9, 0.9999]}
                    }
                }
            },
            'lr': {'type': 'float', 'range': [1e-5, 1.0], 'log': True}
        }
        """
        self.search_space = search_space
        self.estimators = {}  # param_name -> (l, g) estimators

    def fit(self, observations: list, losses: list, gamma: float = 0.25):
        """
        Fit density estimators for each parameter.

        observations: list of config dicts; losses: objective value per config.
        For conditional parameters, only use observations where the
        parent condition is satisfied (i.e., the parameter is present).
        """
        losses = np.asarray(losses)
        n_good = max(1, int(len(losses) * gamma))
        threshold = np.sort(losses)[n_good - 1]
        good = [obs for obs, y in zip(observations, losses) if y <= threshold]
        bad = [obs for obs, y in zip(observations, losses) if y > threshold]

        for param, spec in self._iterate_params(self.search_space):
            # Filter observations where param is active
            good_vals = [obs[param] for obs in good if param in obs]
            bad_vals = [obs[param] for obs in bad if param in obs]
            if len(good_vals) >= 2 and len(bad_vals) >= 2:
                self.estimators[param] = self._fit_param(good_vals, bad_vals, spec)

    def sample(self) -> dict:
        """
        Sample a configuration respecting tree structure.

        1. Sample from l(x) for each parameter
        2. Only sample children if parent condition matches
        """
        config = {}
        self._sample_node(self.search_space, config)
        return config

    def _sample_node(self, node, config):
        """Recursively sample from tree structure."""
        for param, spec in node.items():
            # Sample from l(x) for this parameter, or the prior if unfitted
            if param in self.estimators:
                l, g = self.estimators[param]
                config[param] = self._sample_from_l(l, spec)
            else:
                config[param] = self._sample_prior(spec)

            # Handle children (conditional parameters)
            if 'children' in spec:
                parent_value = config[param]
                if parent_value in spec['children']:
                    self._sample_node(spec['children'][parent_value], config)

    def _iterate_params(self, node):
        """Yield (name, spec) for every parameter anywhere in the tree."""
        for param, spec in node.items():
            yield param, spec
            for child_node in spec.get('children', {}).values():
                yield from self._iterate_params(child_node)

    def _fit_param(self, good_vals, bad_vals, spec):
        """Fit the (l, g) estimator pair for a single parameter."""
        if spec['type'] == 'categorical':
            def freqs(vals):
                # Add-one smoothed category frequencies
                counts = {c: 1.0 for c in spec['choices']}
                for v in vals:
                    counts[v] += 1
                total = sum(counts.values())
                return {c: n / total for c, n in counts.items()}
            return freqs(good_vals), freqs(bad_vals)
        # float/int: 1-D kernel density estimates (log-space fitting omitted)
        return (gaussian_kde(np.asarray(good_vals, dtype=float)),
                gaussian_kde(np.asarray(bad_vals, dtype=float)))

    def _sample_from_l(self, l, spec):
        """Draw one value from the fitted 'good' density l(x)."""
        if spec['type'] == 'categorical':
            choices = list(l.keys())
            return np.random.choice(choices, p=list(l.values()))
        value = float(l.resample(1)[0, 0])
        low, high = spec['range']
        value = float(np.clip(value, low, high))
        return int(round(value)) if spec['type'] == 'int' else value

    def _sample_prior(self, spec):
        """Sample from prior distribution."""
        if spec['type'] == 'categorical':
            return np.random.choice(spec['choices'])
        elif spec['type'] == 'float':
            low, high = spec['range']
            if spec.get('log', False):
                return np.exp(np.random.uniform(np.log(low), np.log(high)))
            return np.random.uniform(low, high)
        elif spec['type'] == 'int':
            low, high = spec['range']
            return np.random.randint(low, high + 1)
```

TPE models each hyperparameter independently. This seems like a strong assumption, but it works well in practice because: (1) hyperparameters are often nearly independent, (2) the good/bad split captures important interactions implicitly, and (3) the computational benefits outweigh the modeling loss.
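A short usage sketch under the same assumptions (note that `fit`, as reconstructed above, takes a parallel list of losses; the search space and random losses here are purely illustrative):

```python
import numpy as np

# Illustrative search space following the schema in the docstring above
space = {
    'optimizer': {
        'type': 'categorical',
        'choices': ['sgd', 'adam'],
        'children': {
            'sgd': {'momentum': {'type': 'float', 'range': [0.0, 0.99]}},
            'adam': {'beta1': {'type': 'float', 'range': [0.8, 0.999]}},
        },
    },
    'lr': {'type': 'float', 'range': [1e-5, 1.0], 'log': True},
}

tpe = ConditionalTPE(space)

# Bootstrap with prior samples and made-up losses (stand-ins for real evaluations)
rng = np.random.default_rng(0)
observations = [tpe.sample() for _ in range(30)]  # no estimators yet -> prior
losses = [float(rng.random()) for _ in observations]

tpe.fit(observations, losses)
print(tpe.sample())  # e.g. {'optimizer': 'adam', 'beta1': 0.9..., 'lr': 0.01...}
```

Conditional parameters such as momentum appear in a sampled configuration only when their parent value activates them, so each estimator is fitted only on the observations where its parameter was active.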
Optuna is the most popular library for TPE-based optimization. It provides a clean API for defining complex search spaces with conditionals.
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)


def objective(trial):
    """
    Optuna objective with conditional hyperparameters.

    The tree structure is implicit in the if-else logic.
    TPE handles this automatically!
    """
    # Categorical: model type
    classifier_type = trial.suggest_categorical(
        'classifier', ['rf', 'svm', 'gbdt']
    )

    if classifier_type == 'rf':
        # Random Forest hyperparameters
        n_estimators = trial.suggest_int('rf_n_estimators', 10, 200)
        max_depth = trial.suggest_int('rf_max_depth', 2, 32, log=True)
        min_samples_split = trial.suggest_int('rf_min_samples_split', 2, 10)
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=42
        )

    elif classifier_type == 'svm':
        # SVM hyperparameters
        from sklearn.svm import SVC
        kernel = trial.suggest_categorical('svm_kernel', ['rbf', 'poly'])
        C = trial.suggest_float('svm_C', 1e-3, 100, log=True)
        if kernel == 'rbf':
            gamma = trial.suggest_float('svm_gamma', 1e-4, 10, log=True)
            model = SVC(kernel='rbf', C=C, gamma=gamma)
        else:
            degree = trial.suggest_int('svm_degree', 2, 5)
            model = SVC(kernel='poly', C=C, degree=degree)

    else:  # gbdt
        from sklearn.ensemble import GradientBoostingClassifier
        n_estimators = trial.suggest_int('gbdt_n_estimators', 50, 300)
        learning_rate = trial.suggest_float('gbdt_lr', 0.01, 0.3, log=True)
        max_depth = trial.suggest_int('gbdt_max_depth', 2, 10)
        model = GradientBoostingClassifier(
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            max_depth=max_depth,
            random_state=42
        )

    # Evaluate
    score = cross_val_score(model, X, y, cv=5).mean()
    return score  # maximized, since direction='maximize' below


# Create study with the TPE sampler (Optuna's default)
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(
        n_startup_trials=10,  # Random trials before TPE kicks in
        gamma=lambda n: min(int(0.25 * n), 25),  # Adaptive good/bad split
        multivariate=True  # Consider correlations between parameters
    )
)

# Run optimization
study.optimize(objective, n_trials=100, show_progress_bar=True)

# Results
print(f"Best score: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
```

Both TPE and GP-based BO are effective. Here's when to prefer each:
| Scenario | Preferred Method | Reason |
|---|---|---|
| Pure continuous hyperparameters | GP-BO | Better uncertainty quantification |
| Mixed continuous + categorical | TPE | Natural handling of categoricals |
| Conditional hyperparameters | TPE | Tree structure elegantly handles dependencies |
| High-dimensional (>20) | TPE | O(n log n) vs O(n³) scaling |
| Small budget (<30 evals) | GP-BO | Better sample efficiency |
| Large budget (>100 evals) | Either | Both converge well |
| Neural architecture search | TPE | Complex conditionals are common |
| Simple ML models (RF, SVM) | Either | Both work well |
For most hyperparameter optimization tasks: use TPE (Optuna) for its flexibility with mixed spaces and ease of use. Use GP-BO (GPyOpt, BoTorch) when you have purely continuous hyperparameters and need maximal sample efficiency.
Module Complete!
You've now mastered Bayesian Optimization—from the foundational SMBO framework through Gaussian Process surrogates, acquisition functions, and exploration-exploitation tradeoffs, to TPE for complex spaces. You can implement and apply BO with both GP and TPE surrogates, understand the mathematical foundations, and select the appropriate method for each hyperparameter optimization scenario.