Gaussian Processes are powerful surrogates, but they have limitations. They struggle with categorical hyperparameters (which require special kernels), conditional hyperparameters, high-dimensional spaces, and large observation counts, since fitting scales as O(n³).
Tree-structured Parzen Estimators (TPE) offer an elegant alternative. Instead of modeling p(y|x) like GPs, TPE models p(x|y)—the probability of configurations given their performance. This inversion enables natural handling of complex, mixed hyperparameter spaces.
TPE is the default algorithm in Optuna and Hyperopt, two of the most popular hyperparameter optimization libraries.
By the end of this page, you will understand: how TPE inverts the modeling problem, the construction of l(x) and g(x) density estimators, why tree structure handles conditionals, and practical usage with Optuna.
Traditional Bayesian Optimization models p(y|x)—the probability of objective values given configurations. TPE flips this:
TPE models p(x|y)—the probability of configurations given objective values.
The key insight: Instead of predicting "what score will configuration x achieve?", TPE asks "what configurations tend to achieve good vs. bad scores?"
Two density estimators:
TPE divides observations into two groups based on a quantile threshold γ:
$$l(x) = p(x | y < y^*)$$ — density of configurations with good (below threshold) scores
$$g(x) = p(x | y \geq y^*)$$ — density of configurations with bad (above threshold) scores
where $y^* = \text{quantile}_{\gamma}(y_1, ..., y_n)$; γ is typically chosen so that the best 15-25% of observations fall below the threshold.
```python
import numpy as np
from scipy.stats import gaussian_kde


class SimpleTPE:
    """
    Simplified TPE implementation demonstrating core concepts.

    Key insight: instead of p(y|x), model p(x|y) via two densities:
    - l(x): density of x given y < threshold (good)
    - g(x): density of x given y >= threshold (bad)
    """

    def __init__(self, gamma: float = 0.25):
        """
        gamma: quantile threshold for good/bad split,
               e.g., gamma=0.25 means top 25% are "good"
        """
        self.gamma = gamma
        self.l = None  # KDE for good observations
        self.g = None  # KDE for bad observations

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Fit l(x) and g(x) density estimators."""
        # Determine threshold (gamma quantile)
        n_good = max(1, int(len(y) * self.gamma))
        threshold = np.sort(y)[n_good - 1]

        # Split observations
        good_mask = y <= threshold
        X_good = X[good_mask]
        X_bad = X[~good_mask]

        # Fit kernel density estimators
        if len(X_good) > 1:
            self.l = gaussian_kde(X_good.T)
        if len(X_bad) > 1:
            self.g = gaussian_kde(X_bad.T)

    def acquisition(self, X_candidates: np.ndarray) -> np.ndarray:
        """
        TPE acquisition: maximize l(x) / g(x).

        This is equivalent to Expected Improvement under certain
        assumptions about the distribution of y.
        """
        # Compute densities (uniform fallback before estimators exist)
        l_vals = self.l(X_candidates.T) if self.l is not None else np.ones(len(X_candidates))
        g_vals = self.g(X_candidates.T) if self.g is not None else np.ones(len(X_candidates))

        # Avoid division by zero
        g_vals = np.maximum(g_vals, 1e-10)

        # Acquisition: ratio of densities
        return l_vals / g_vals

    def suggest(self, X_candidates: np.ndarray) -> np.ndarray:
        """Suggest next point to evaluate."""
        scores = self.acquisition(X_candidates)
        return X_candidates[np.argmax(scores)]
```

Maximizing l(x)/g(x) finds configurations that are common among good observations and rare among bad observations. It's Bayesian: we want high probability of being good (high l(x)) and low probability of being bad (low g(x)).
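To see how the pieces fit together, here is a minimal usage sketch. The quadratic objective, search interval, and candidate pool are illustrative assumptions, not part of the original example:

```python
import numpy as np

# Hypothetical 1-D objective to minimize (illustrative only)
def objective(x):
    return (x - 0.3) ** 2

rng = np.random.default_rng(0)

# Seed with a few random evaluations, then alternate fit/suggest
X = rng.uniform(0, 1, size=10)
y = objective(X)

tpe = SimpleTPE(gamma=0.25)
for _ in range(20):
    tpe.fit(X, y)
    candidates = rng.uniform(0, 1, size=100)  # random candidate pool
    x_next = tpe.suggest(candidates)
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(f"Best x: {X[np.argmin(y)]:.3f}, best y: {y.min():.5f}")
```

Note that production TPE implementations draw the candidates from l(x) itself rather than uniformly; a uniform pool keeps the sketch short.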
The l(x)/g(x) acquisition has a deep connection to Expected Improvement. Using Bayes' rule:
$$\text{EI}(x) \propto \left(\gamma + \frac{g(x)}{l(x)}(1 - \gamma)\right)^{-1}$$
Derivation sketch:
By Bayes: $p(y < y^* | x) = \frac{p(x | y < y^*) \, p(y < y^*)}{p(x)}$
Using $p(x) = l(x) \, p(y < y^*) + g(x) \, p(y \geq y^*)$
Substituting $p(y < y^*) = \gamma$:
$$p(y < y^* | x) = \frac{\gamma l(x)}{\gamma l(x) + (1-\gamma) g(x)}$$
Thus, maximizing l(x)/g(x) is equivalent to maximizing the probability of improvement (and closely related to EI).
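To make the equivalence concrete: for any fixed γ, $p(y < y^* | x)$ is a monotone increasing function of the ratio l(x)/g(x), so both criteria rank candidates identically. The following sanity check uses made-up density values (illustrative, not from the text):

```python
import numpy as np

gamma = 0.25
# Illustrative density values at three candidate points
l_vals = np.array([2.0, 0.5, 1.2])  # l(x): density under "good" observations
g_vals = np.array([0.4, 1.5, 1.2])  # g(x): density under "bad" observations

ratio = l_vals / g_vals
prob_improve = gamma * l_vals / (gamma * l_vals + (1 - gamma) * g_vals)

# Both criteria order the candidates identically (no ties here)
assert (np.argsort(ratio) == np.argsort(prob_improve)).all()
print(ratio)         # [5.0, 0.333..., 1.0]
print(prob_improve)  # [0.625, 0.1, 0.25]
```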
| Aspect | GP-EI | TPE |
|---|---|---|
| Model | p(y\|x) via Gaussian Process | p(x\|y) via density estimators |
| Acquisition | EI in closed form | l(x)/g(x) ratio |
| Continuous hyperparameters | Natural fit | Via Parzen estimators |
| Categorical hyperparameters | Requires special kernels | Natural via trees |
| Conditional hyperparameters | Difficult | Elegant via tree structure |
| Computational scaling | O(n³) | O(n log n) |
| High dimensions | Struggles (>20D) | Handles better |
The "tree-structured" in TPE refers to how it handles conditional hyperparameters—parameters that only exist depending on other choices.
Example: neural network optimizer configuration
```text
optimizer_type: ['sgd', 'adam', 'rmsprop']
if optimizer_type == 'sgd':
    momentum: [0.0, 0.99]
if optimizer_type == 'adam':
    beta1: [0.8, 0.999]
    beta2: [0.9, 0.9999]
learning_rate: [1e-5, 1.0]  # Always present
```
The hyperparameter space forms a tree: optimizer_type is the root, and momentum vs beta1/beta2 are conditional branches.
TPE handles this elegantly:
```python
import numpy as np
from scipy.stats import gaussian_kde


class ConditionalTPE:
    """
    TPE with tree-structured conditional hyperparameters.

    Key insight: model each hyperparameter independently,
    conditioned on its parent values in the tree.
    """

    def __init__(self, search_space):
        """
        search_space: dict defining the tree structure

        Example:
        {
            'optimizer': {
                'type': 'categorical',
                'choices': ['sgd', 'adam'],
                'children': {
                    'sgd': {
                        'momentum': {'type': 'float', 'range': [0, 0.99]}
                    },
                    'adam': {
                        'beta1': {'type': 'float', 'range': [0.8, 0.999]},
                        'beta2': {'type': 'float', 'range': [0.9, 0.9999]}
                    }
                }
            },
            'lr': {'type': 'float', 'range': [1e-5, 1.0], 'log': True}
        }
        """
        self.search_space = search_space
        self.estimators = {}  # param_name -> (l, g) estimators

    def fit(self, observations: list, losses: list, gamma: float = 0.25):
        """
        Fit density estimators for each parameter.

        observations: list of config dicts; losses: objective value per config.
        For conditional parameters, only use observations where the
        parent condition is satisfied (i.e., the parameter is present).
        """
        losses = np.asarray(losses)
        n_good = max(1, int(len(losses) * gamma))
        threshold = np.sort(losses)[n_good - 1]
        good = [obs for obs, y in zip(observations, losses) if y <= threshold]
        bad = [obs for obs, y in zip(observations, losses) if y > threshold]

        for param, spec in self._iterate_params(self.search_space):
            # Filter observations where param is active
            good_vals = [obs[param] for obs in good if param in obs]
            bad_vals = [obs[param] for obs in bad if param in obs]
            if len(good_vals) >= 2 and len(bad_vals) >= 2:
                self.estimators[param] = self._fit_param(good_vals, bad_vals, spec)

    def sample(self) -> dict:
        """
        Sample a configuration respecting tree structure.

        1. Sample from l(x) for each parameter
        2. Only sample children if parent condition matches
        """
        config = {}
        self._sample_node(self.search_space, config)
        return config

    def _sample_node(self, node, config):
        """Recursively sample from tree structure."""
        for param, spec in node.items():
            # Sample from l(x) for this parameter, or the prior if unfitted
            if param in self.estimators:
                l, g = self.estimators[param]
                config[param] = self._sample_from_l(l, spec)
            else:
                config[param] = self._sample_prior(spec)

            # Handle children (conditional parameters)
            if 'children' in spec:
                parent_value = config[param]
                if parent_value in spec['children']:
                    self._sample_node(spec['children'][parent_value], config)

    def _iterate_params(self, node):
        """Yield (name, spec) for every parameter anywhere in the tree."""
        for param, spec in node.items():
            yield param, spec
            for child_node in spec.get('children', {}).values():
                yield from self._iterate_params(child_node)

    def _fit_param(self, good_vals, bad_vals, spec):
        """Fit the (l, g) estimator pair for a single parameter."""
        if spec['type'] == 'categorical':
            def freqs(vals):
                # Add-one smoothed category frequencies
                counts = {c: 1.0 for c in spec['choices']}
                for v in vals:
                    counts[v] += 1
                total = sum(counts.values())
                return {c: n / total for c, n in counts.items()}
            return freqs(good_vals), freqs(bad_vals)
        # float/int: 1-D kernel density estimates (log-space fitting omitted)
        return (gaussian_kde(np.asarray(good_vals, dtype=float)),
                gaussian_kde(np.asarray(bad_vals, dtype=float)))

    def _sample_from_l(self, l, spec):
        """Draw one value from the fitted 'good' density l(x)."""
        if spec['type'] == 'categorical':
            choices = list(l.keys())
            return np.random.choice(choices, p=list(l.values()))
        value = float(l.resample(1)[0, 0])
        low, high = spec['range']
        value = float(np.clip(value, low, high))
        return int(round(value)) if spec['type'] == 'int' else value

    def _sample_prior(self, spec):
        """Sample from prior distribution."""
        if spec['type'] == 'categorical':
            return np.random.choice(spec['choices'])
        elif spec['type'] == 'float':
            low, high = spec['range']
            if spec.get('log', False):
                return np.exp(np.random.uniform(np.log(low), np.log(high)))
            return np.random.uniform(low, high)
        elif spec['type'] == 'int':
            low, high = spec['range']
            return np.random.randint(low, high + 1)
```

TPE models each hyperparameter independently. This seems like a strong assumption, but it works well in practice because: (1) hyperparameters are often nearly independent, (2) the good/bad split captures important interactions implicitly, and (3) the computational benefits outweigh the modeling loss.
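A short usage sketch under the same assumptions (note that `fit`, as reconstructed above, takes a parallel list of losses; the search space and random losses here are purely illustrative):

```python
import numpy as np

# Illustrative search space following the schema in the docstring above
space = {
    'optimizer': {
        'type': 'categorical',
        'choices': ['sgd', 'adam'],
        'children': {
            'sgd': {'momentum': {'type': 'float', 'range': [0.0, 0.99]}},
            'adam': {'beta1': {'type': 'float', 'range': [0.8, 0.999]}},
        },
    },
    'lr': {'type': 'float', 'range': [1e-5, 1.0], 'log': True},
}

tpe = ConditionalTPE(space)

# Bootstrap with prior samples and made-up losses (stand-ins for real evaluations)
rng = np.random.default_rng(0)
observations = [tpe.sample() for _ in range(30)]  # no estimators yet -> prior
losses = [float(rng.random()) for _ in observations]

tpe.fit(observations, losses)
print(tpe.sample())  # e.g. {'optimizer': 'adam', 'beta1': 0.9..., 'lr': 0.01...}
```

Conditional parameters such as momentum appear in a sampled configuration only when their parent value activates them, so each estimator is fitted only on the observations where its parameter was active.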
Optuna is the most popular library for TPE-based optimization. It provides a clean API for defining complex search spaces with conditionals.
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)


def objective(trial):
    """
    Optuna objective with conditional hyperparameters.

    The tree structure is implicit in the if-else logic.
    TPE handles this automatically!
    """
    # Categorical: model type
    classifier_type = trial.suggest_categorical(
        'classifier', ['rf', 'svm', 'gbdt']
    )

    if classifier_type == 'rf':
        # Random Forest hyperparameters
        n_estimators = trial.suggest_int('rf_n_estimators', 10, 200)
        max_depth = trial.suggest_int('rf_max_depth', 2, 32, log=True)
        min_samples_split = trial.suggest_int('rf_min_samples_split', 2, 10)
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=42
        )

    elif classifier_type == 'svm':
        # SVM hyperparameters
        from sklearn.svm import SVC
        kernel = trial.suggest_categorical('svm_kernel', ['rbf', 'poly'])
        C = trial.suggest_float('svm_C', 1e-3, 100, log=True)
        if kernel == 'rbf':
            gamma = trial.suggest_float('svm_gamma', 1e-4, 10, log=True)
            model = SVC(kernel='rbf', C=C, gamma=gamma)
        else:
            degree = trial.suggest_int('svm_degree', 2, 5)
            model = SVC(kernel='poly', C=C, degree=degree)

    else:  # gbdt
        from sklearn.ensemble import GradientBoostingClassifier
        n_estimators = trial.suggest_int('gbdt_n_estimators', 50, 300)
        learning_rate = trial.suggest_float('gbdt_lr', 0.01, 0.3, log=True)
        max_depth = trial.suggest_int('gbdt_max_depth', 2, 10)
        model = GradientBoostingClassifier(
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            max_depth=max_depth,
            random_state=42
        )

    # Evaluate
    score = cross_val_score(model, X, y, cv=5).mean()
    return score  # maximized, since direction='maximize' below


# Create study with the TPE sampler (Optuna's default)
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(
        n_startup_trials=10,  # Random trials before TPE kicks in
        gamma=lambda n: min(int(0.25 * n), 25),  # Adaptive good/bad split
        multivariate=True  # Consider correlations between parameters
    )
)

# Run optimization
study.optimize(objective, n_trials=100, show_progress_bar=True)

# Results
print(f"Best score: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
```

Both TPE and GP-based BO are effective. Here's when to prefer each:
| Scenario | Preferred Method | Reason |
|---|---|---|
| Pure continuous hyperparameters | GP-BO | Better uncertainty quantification |
| Mixed continuous + categorical | TPE | Natural handling of categoricals |
| Conditional hyperparameters | TPE | Tree structure elegantly handles dependencies |
| High-dimensional (>20) | TPE | O(n log n) vs O(n³) scaling |
| Small budget (<30 evals) | GP-BO | Better sample efficiency |
| Large budget (>100 evals) | Either | Both converge well |
| Neural architecture search | TPE | Complex conditionals are common |
| Simple ML models (RF, SVM) | Either | Both work well |
For most hyperparameter optimization tasks: use TPE (Optuna) for its flexibility with mixed spaces and ease of use. Use GP-BO (GPyOpt, BoTorch) when you have purely continuous hyperparameters and need maximal sample efficiency.
Module Complete!
You've now mastered Bayesian Optimization—from the foundational SMBO framework through Gaussian Process surrogates, acquisition functions, and exploration-exploitation tradeoffs, to TPE for complex spaces. You can implement and apply BO with both GP and TPE surrogates, understand the mathematical foundations, and select the appropriate method for each hyperparameter optimization scenario.