In the previous page, we explored algorithm selection: choosing the right learning algorithm for a dataset. But this tells only half the story. Every algorithm comes with hyperparameters—learning rates, regularization strengths, tree depths, network architectures—that dramatically affect performance.
Traditionally, practitioners first select an algorithm, then tune its hyperparameters. But this sequential approach has a fundamental flaw: the best algorithm with default hyperparameters may not be the best algorithm with optimized hyperparameters. A random forest with defaults might beat an SVM with defaults, but a well-tuned SVM might outperform both.
This realization led to the formulation of the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem. CASH treats algorithm selection and hyperparameter tuning as a single, unified optimization problem. It's the mathematical foundation of modern AutoML systems like Auto-sklearn, Auto-WEKA, and TPOT.
Understanding CASH is essential for anyone working with AutoML—it explains why these systems work, their computational challenges, and how to design better automated machine learning pipelines.
By the end of this page, you will understand the formal CASH problem definition, hierarchical configuration spaces, search strategies (Bayesian optimization, evolutionary algorithms, random search), the theoretical complexity of CASH, and practical implementations in production AutoML systems.
The CASH problem was formally defined in the foundational Auto-WEKA paper (Thornton et al., 2013). It unifies algorithm selection and hyperparameter optimization into a single search problem.
Formal Definition:
Given:
- a set of candidate algorithms 𝒜 = {A⁽¹⁾, ..., A⁽ᵏ⁾},
- for each algorithm A⁽ⁱ⁾, a hyperparameter space Λ⁽ⁱ⁾,
- a dataset split into D_train and D_valid, and
- a loss function L (e.g., validation error).
The CASH problem is to find:
A*, λ* = argmin_{A⁽ⁱ⁾ ∈ 𝒜, λ ∈ Λ⁽ⁱ⁾} L(A⁽ⁱ⁾_λ, D_train, D_valid)
In words: jointly find the algorithm A* and its hyperparameters λ* that minimize the validation loss when trained on D_train and evaluated on D_valid.
Key Insights:
Unified search space: The combined space is Λ = Λ⁽¹⁾ ∪ Λ⁽²⁾ ∪ ... ∪ Λ⁽ᵏ⁾, with algorithm choice as a categorical hyperparameter.
Hierarchical/conditional structure: Hyperparameters are conditional—SVM's kernel hyperparameters only matter if SVM is selected.
Black-box optimization: We can evaluate configurations but don't have gradients with respect to algorithm/hyperparameter choices.
Expensive evaluations: Each configuration requires training a model and measuring validation performance.
```python
import numpy as np
from sklearn.model_selection import cross_val_score
from typing import Dict, Any, List, Tuple


class CASHProblem:
    """
    Formal representation of the CASH problem.

    Encapsulates the combined algorithm selection and hyperparameter
    optimization as a black-box optimization problem.
    """

    def __init__(self, algorithms: Dict[str, Any],
                 hyperparameter_spaces: Dict[str, Dict],
                 X, y, cv=5, scoring='accuracy'):
        """
        Parameters:
            algorithms: Dict mapping algorithm names to (estimator_class, default_params)
            hyperparameter_spaces: Dict mapping algorithm names to their param distributions
            X: Feature matrix
            y: Target vector
            cv: Cross-validation folds
            scoring: Scoring metric
        """
        self.algorithms = algorithms
        self.hyperparameter_spaces = hyperparameter_spaces
        self.X = X
        self.y = y
        self.cv = cv
        self.scoring = scoring

        # Track evaluations
        self.evaluation_history = []

    def evaluate_configuration(self, algorithm_name: str,
                               hyperparameters: Dict[str, Any]) -> float:
        """
        Evaluate a single configuration (algorithm + hyperparameters).

        This is the black-box objective function that CASH optimizers
        query repeatedly.

        Parameters:
            algorithm_name: Which algorithm to use
            hyperparameters: Hyperparameter settings for that algorithm

        Returns:
            Negative cross-validation score (we minimize)
        """
        if algorithm_name not in self.algorithms:
            raise ValueError(f"Unknown algorithm: {algorithm_name}")

        estimator_class, default_params = self.algorithms[algorithm_name]

        # Merge default params with provided hyperparameters
        params = {**default_params, **hyperparameters}

        try:
            # Instantiate and evaluate
            estimator = estimator_class(**params)
            scores = cross_val_score(estimator, self.X, self.y,
                                     cv=self.cv, scoring=self.scoring)
            score = scores.mean()

            # Track evaluation
            self.evaluation_history.append({
                'algorithm': algorithm_name,
                'hyperparameters': hyperparameters,
                'score': score,
                'std': scores.std()
            })

            return -score  # Negative because we minimize
        except Exception:
            # Invalid configuration - return worst possible score (0 accuracy)
            return 0.0

    def get_configuration_space_size(self) -> Dict[str, int]:
        """
        Estimate the size of the configuration space.

        Returns approximate number of configurations per algorithm.
        """
        sizes = {}
        for algo_name, param_space in self.hyperparameter_spaces.items():
            # Very rough estimate: product of parameter cardinalities
            size = 1
            for param_name, param_dist in param_space.items():
                if hasattr(param_dist, '__len__'):
                    size *= len(param_dist)
                else:
                    size *= 100  # Continuous parameter - discretize
            sizes[algo_name] = size
        return sizes

    def get_best_configuration(self) -> Tuple[str, Dict, float]:
        """
        Get the best configuration found so far.
        """
        if not self.evaluation_history:
            return None, None, None

        best = max(self.evaluation_history, key=lambda x: x['score'])
        return best['algorithm'], best['hyperparameters'], best['score']
```

CASH is a mixed-categorical, hierarchical, black-box optimization problem with expensive evaluations. Standard optimization techniques (gradient descent, convex optimization) don't apply. We must use specialized algorithms designed for this structure.
A crucial aspect of CASH is the hierarchical or conditional structure of the configuration space. Different algorithms have different hyperparameters, and some hyperparameters only matter when certain choices are made.
Example: The Conditional Structure
Consider three algorithms: Random Forest, SVM, and Neural Network.
Configuration Space:
├── algorithm: {RandomForest, SVM, NeuralNetwork}
│
├── [If algorithm = RandomForest]
│ ├── n_estimators: [10, 500]
│ ├── max_depth: [1, 50]
│ └── min_samples_split: [2, 20]
│
├── [If algorithm = SVM]
│ ├── kernel: {linear, rbf, poly}
│ ├── C: [1e-3, 1e3] (log-scale)
│ ├── [If kernel = rbf]
│ │ └── gamma: [1e-4, 1e1]
│ └── [If kernel = poly]
│ ├── degree: [2, 5]
│ └── coef0: [0, 10]
│
└── [If algorithm = NeuralNetwork]
├── hidden_layers: [1, 5]
├── hidden_size: [16, 512]
├── activation: {relu, tanh, sigmoid}
└── learning_rate: [1e-5, 1e-1]
Notice how:
- gamma only matters if kernel = rbf
- degree only matters if kernel = poly
- none of the SVM hyperparameters matter if algorithm ≠ SVM

This conditional structure is essential—ignoring it leads to wasted search effort on irrelevant hyperparameters.
```python
from ConfigSpace import ConfigurationSpace, Configuration
from ConfigSpace import CategoricalHyperparameter, UniformIntegerHyperparameter
from ConfigSpace import UniformFloatHyperparameter, EqualsCondition, InCondition


def create_cash_configuration_space():
    """
    Create a hierarchical configuration space for CASH.

    Uses ConfigSpace library (used by Auto-sklearn) to properly
    represent conditional hyperparameters.
    """
    cs = ConfigurationSpace()

    # Root: Algorithm choice
    algorithm = CategoricalHyperparameter(
        "algorithm",
        ["RandomForest", "SVM", "NeuralNetwork", "GradientBoosting"]
    )
    cs.add_hyperparameter(algorithm)

    # ================= Random Forest =================
    rf_n_estimators = UniformIntegerHyperparameter(
        "rf_n_estimators", lower=10, upper=500, default_value=100
    )
    rf_max_depth = UniformIntegerHyperparameter(
        "rf_max_depth", lower=1, upper=50, default_value=10
    )
    rf_min_samples_split = UniformIntegerHyperparameter(
        "rf_min_samples_split", lower=2, upper=20, default_value=2
    )
    cs.add_hyperparameters([rf_n_estimators, rf_max_depth, rf_min_samples_split])

    # Conditionals: These only matter if algorithm = RandomForest
    cs.add_condition(EqualsCondition(rf_n_estimators, algorithm, "RandomForest"))
    cs.add_condition(EqualsCondition(rf_max_depth, algorithm, "RandomForest"))
    cs.add_condition(EqualsCondition(rf_min_samples_split, algorithm, "RandomForest"))

    # ================= SVM =================
    svm_kernel = CategoricalHyperparameter(
        "svm_kernel", ["linear", "rbf", "poly"], default_value="rbf"
    )
    svm_C = UniformFloatHyperparameter(
        "svm_C", lower=1e-3, upper=1e3, log=True, default_value=1.0
    )
    svm_gamma = UniformFloatHyperparameter(
        "svm_gamma", lower=1e-4, upper=1e1, log=True, default_value=0.1
    )
    svm_degree = UniformIntegerHyperparameter(
        "svm_degree", lower=2, upper=5, default_value=3
    )
    svm_coef0 = UniformFloatHyperparameter(
        "svm_coef0", lower=0.0, upper=10.0, default_value=0.0
    )
    cs.add_hyperparameters([svm_kernel, svm_C, svm_gamma, svm_degree, svm_coef0])

    # SVM hyperparams conditional on algorithm = SVM
    cs.add_condition(EqualsCondition(svm_kernel, algorithm, "SVM"))
    cs.add_condition(EqualsCondition(svm_C, algorithm, "SVM"))

    # gamma only matters for RBF kernel
    cs.add_condition(EqualsCondition(svm_gamma, svm_kernel, "rbf"))

    # degree and coef0 only matter for polynomial kernel
    cs.add_condition(EqualsCondition(svm_degree, svm_kernel, "poly"))
    cs.add_condition(EqualsCondition(svm_coef0, svm_kernel, "poly"))

    # ================= Neural Network =================
    nn_hidden_layers = UniformIntegerHyperparameter(
        "nn_hidden_layers", lower=1, upper=5, default_value=2
    )
    nn_hidden_size = UniformIntegerHyperparameter(
        "nn_hidden_size", lower=16, upper=512, default_value=64
    )
    nn_activation = CategoricalHyperparameter(
        "nn_activation", ["relu", "tanh", "sigmoid"], default_value="relu"
    )
    nn_learning_rate = UniformFloatHyperparameter(
        "nn_learning_rate", lower=1e-5, upper=1e-1, log=True, default_value=1e-3
    )
    cs.add_hyperparameters([nn_hidden_layers, nn_hidden_size,
                            nn_activation, nn_learning_rate])

    # Neural network params conditional on algorithm = NeuralNetwork
    for hp in [nn_hidden_layers, nn_hidden_size, nn_activation, nn_learning_rate]:
        cs.add_condition(EqualsCondition(hp, algorithm, "NeuralNetwork"))

    # ================= Gradient Boosting =================
    gb_n_estimators = UniformIntegerHyperparameter(
        "gb_n_estimators", lower=50, upper=500, default_value=100
    )
    gb_learning_rate = UniformFloatHyperparameter(
        "gb_learning_rate", lower=0.01, upper=1.0, log=True, default_value=0.1
    )
    gb_max_depth = UniformIntegerHyperparameter(
        "gb_max_depth", lower=1, upper=15, default_value=3
    )
    cs.add_hyperparameters([gb_n_estimators, gb_learning_rate, gb_max_depth])
    for hp in [gb_n_estimators, gb_learning_rate, gb_max_depth]:
        cs.add_condition(EqualsCondition(hp, algorithm, "GradientBoosting"))

    return cs


# Usage
cs = create_cash_configuration_space()
print(f"Configuration space has {len(cs)} hyperparameters")
print("Conditionals ensure only relevant params are active")

# Sample valid configurations
for i in range(3):
    config = cs.sample_configuration()
    print(f"Sampled: {dict(config)}")
```

Why Hierarchical Structure Matters:
Reduced effective dimensionality: Instead of searching N = Σᵢ|Λ⁽ⁱ⁾| hyperparameters simultaneously, we only search over the relevant subset for each algorithm.
Valid configurations: Without conditionals, we could generate meaningless configurations (e.g., setting SVM's gamma when using Random Forest).
Efficient surrogate modeling: Bayesian optimization can build better surrogate models when hierarchical structure is respected.
Interpretable results: Final configurations only show relevant hyperparameters.
The ConfigSpace Library:
ConfigSpace is the de facto standard for defining hierarchical configuration spaces in Python AutoML. It's used by Auto-sklearn, SMAC, and other major frameworks.
Given the CASH problem's complexity—hierarchical, mixed-type, expensive-to-evaluate—specialized search strategies are needed. Let's examine the major approaches:
1. Random Search
Surprisingly effective baseline:
Random search has a simple theoretical guarantee: with probability at least 1−δ, it finds a configuration in the top ε fraction of the space after O((1/ε)·log(1/δ)) evaluations. It's embarrassingly parallel and doesn't suffer from poorly fit surrogate models.
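The guarantee follows directly from the miss probability (1−ε)ⁿ ≤ δ for n independent uniform samples. This small helper (illustrative only, not from any library) computes the required budget:

```python
import math

def random_search_budget(epsilon: float, delta: float) -> int:
    """Smallest n such that P(no sample lands in the top-epsilon fraction)
    is at most delta, i.e. (1 - epsilon)^n <= delta for uniform sampling."""
    return math.ceil(math.log(delta) / math.log(1.0 - epsilon))

# Reaching the top 5% of configurations with 95% confidence:
print(random_search_budget(0.05, 0.05))  # 59 evaluations
```

Note how weakly the budget depends on confidence: halving δ adds only a handful of evaluations, while halving ε roughly doubles the budget.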
When random search works well:
2. Grid Search (Don't Use for CASH)
Grid search is exponential in dimensionality and fails catastrophically in high-dimensional conditional spaces. It's mentioned only to emphasize: don't use grid search for CASH.
```python
import numpy as np
from typing import Callable, Dict, Any, List, Tuple
from concurrent.futures import ProcessPoolExecutor


class RandomSearchCASH:
    """
    Random search for the CASH problem.

    Despite its simplicity, random search is a strong baseline that's
    hard to beat with small evaluation budgets.
    """

    def __init__(self, config_space, n_workers=1, random_state=42):
        """
        Parameters:
            config_space: ConfigSpace object defining the search space
            n_workers: Number of parallel workers
            random_state: Random seed for reproducibility
        """
        self.config_space = config_space
        self.n_workers = n_workers
        self.rng = np.random.RandomState(random_state)

    def optimize(self, objective_func: Callable,
                 n_evaluations: int) -> Tuple[Dict, float]:
        """
        Run random search.

        Parameters:
            objective_func: Function that takes a config dict, returns loss.
                Must be picklable (module-level, not a lambda) when n_workers > 1.
            n_evaluations: Number of configurations to try

        Returns:
            (best_config, best_loss)
        """
        # Sample all configurations upfront
        configs = [self.config_space.sample_configuration()
                   for _ in range(n_evaluations)]
        config_dicts = [dict(config) for config in configs]

        # Evaluate (potentially in parallel)
        if self.n_workers == 1:
            results = [objective_func(cd) for cd in config_dicts]
        else:
            # Map the objective directly: lambdas can't be pickled across processes
            with ProcessPoolExecutor(max_workers=self.n_workers) as executor:
                results = list(executor.map(objective_func, config_dicts))

        # Find best
        best_idx = int(np.argmin(results))
        return config_dicts[best_idx], results[best_idx]

    def optimize_with_budget(self, objective_func: Callable,
                             time_budget_seconds: float,
                             min_evaluations: int = 10) -> Tuple[Dict, float]:
        """
        Run random search with a time budget instead of fixed evaluations.

        Useful for fair comparison with adaptive methods.
        """
        import time

        best_config = None
        best_loss = float('inf')
        n_evaluated = 0
        start_time = time.time()

        while True:
            # Check budget (but always do at least min_evaluations)
            elapsed = time.time() - start_time
            if elapsed >= time_budget_seconds and n_evaluated >= min_evaluations:
                break

            # Sample and evaluate
            config = self.config_space.sample_configuration()
            loss = objective_func(dict(config))
            n_evaluated += 1

            if loss < best_loss:
                best_loss = loss
                best_config = dict(config)

        print(f"Random search: {n_evaluated} evaluations in {elapsed:.1f}s")
        return best_config, best_loss
```

3. Bayesian Optimization (SMAC)
Bayesian optimization is the dominant approach for CASH. It builds a surrogate model of the objective function and uses it to guide search:
SMAC (Sequential Model-based Algorithm Configuration) is the state-of-the-art Bayesian optimizer for CASH. Key features:
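The surrogate-plus-acquisition loop can be sketched in a few lines. This is not SMAC's implementation; it's a toy model-based search on a 1-D numeric space, using a random-forest surrogate (as SMAC does) with tree-variance as the uncertainty estimate and expected improvement as the acquisition function:

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def bo_minimize(objective, sample_x, n_init=5, n_iter=20, n_candidates=256, seed=0):
    """Toy model-based search: fit a random-forest surrogate to (x, loss)
    pairs, then evaluate the candidate with the highest expected improvement."""
    rng = np.random.default_rng(seed)
    X = [sample_x(rng) for _ in range(n_init)]   # initial random design
    y = [objective(x) for x in X]
    for _ in range(n_iter):
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(np.array(X).reshape(-1, 1), y)
        cands = np.array([sample_x(rng) for _ in range(n_candidates)])
        # Per-tree predictions give a cheap mean/uncertainty estimate
        preds = np.stack([t.predict(cands.reshape(-1, 1))
                          for t in model.estimators_])
        mu, sigma = preds.mean(axis=0), preds.std(axis=0) + 1e-9
        best = min(y)
        z = (best - mu) / sigma
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
        x_next = float(cands[int(np.argmax(ei))])
        X.append(x_next)
        y.append(objective(x_next))
    return X[int(np.argmin(y))], min(y)

# Minimize (x - 3.7)^2 over [0, 10]
x_best, loss_best = bo_minimize(lambda x: (x - 3.7) ** 2,
                                lambda rng: float(rng.uniform(0, 10)))
print(x_best, loss_best)
```

SMAC extends this idea to hierarchical, mixed-type spaces, intensification over noisy evaluations, and restarts; the core loop of fit-surrogate, maximize-acquisition, evaluate is the same.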
```python
from smac import HyperparameterOptimizationFacade, Scenario
from ConfigSpace import ConfigurationSpace
import numpy as np


def run_smac_optimization(config_space: ConfigurationSpace,
                          objective_func,
                          n_trials: int = 100,
                          n_workers: int = 1,
                          seed: int = 42):
    """
    Run SMAC (Sequential Model-based Algorithm Configuration) for CASH.

    SMAC is the state-of-the-art Bayesian optimizer for hierarchical,
    conditional configuration spaces.

    Parameters:
        config_space: ConfigSpace defining algorithms and hyperparameters
        objective_func: Function mapping config dict to loss (minimize)
        n_trials: Number of configurations to evaluate
        n_workers: Parallel workers
        seed: Random seed

    Returns:
        Best configuration found
    """
    # Define the scenario
    scenario = Scenario(
        configspace=config_space,
        n_trials=n_trials,      # Total evaluations
        n_workers=n_workers,    # Parallelism
        seed=seed,
        deterministic=True
    )

    # Wrap objective for SMAC interface
    def smac_objective(config, seed=None):
        """SMAC-compatible objective wrapper."""
        return objective_func(dict(config))

    # Create SMAC optimizer
    smac = HyperparameterOptimizationFacade(
        scenario=scenario,
        target_function=smac_objective,
        overwrite=True  # Overwrite previous runs
    )

    # Run optimization
    incumbent = smac.optimize()
    return dict(incumbent)


# Example usage with Auto-sklearn style CASH problem
def example_cash_with_smac():
    """Demonstrate SMAC on a CASH problem."""
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.svm import SVC

    # Generate dataset
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=10, random_state=42)

    # Build configuration space (simplified)
    cs = create_cash_configuration_space()  # From previous example

    # Define objective
    def objective(config):
        algo = config['algorithm']
        if algo == 'RandomForest':
            clf = RandomForestClassifier(
                n_estimators=config.get('rf_n_estimators', 100),
                max_depth=config.get('rf_max_depth', None),
                random_state=42
            )
        elif algo == 'GradientBoosting':
            clf = GradientBoostingClassifier(
                n_estimators=config.get('gb_n_estimators', 100),
                learning_rate=config.get('gb_learning_rate', 0.1),
                max_depth=config.get('gb_max_depth', 3),
                random_state=42
            )
        elif algo == 'SVM':
            clf = SVC(
                kernel=config.get('svm_kernel', 'rbf'),
                C=config.get('svm_C', 1.0),
                gamma=config.get('svm_gamma', 'scale'),
                random_state=42
            )
        else:
            raise ValueError(f"Unknown algorithm: {algo}")

        # Return negative accuracy (since we minimize)
        scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
        return -scores.mean()

    # Run SMAC
    best_config = run_smac_optimization(cs, objective, n_trials=50)
    print(f"Best configuration: {best_config}")
    print(f"Best score: {-objective(best_config):.4f}")
```

4. Evolutionary Algorithms (TPOT)
Evolutionary algorithms treat configurations as individuals in a population:
TPOT uses genetic programming to evolve not just hyperparameters but entire ML pipelines (feature selection → preprocessing → model).
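The evolutionary loop itself is simple. The following is a toy sketch (not TPOT's implementation) over a miniature CASH-like space with two hypothetical algorithms "A" and "B", showing selection, elitism, and mutation of both the algorithm choice and a numeric hyperparameter:

```python
import random

def evolve(objective, sample_config, mutate, n_gen=40, pop_size=20, seed=0):
    """Toy evolutionary search: keep the best half (elitism), fill the rest
    with mutated copies of elite members. Minimizes objective(config)."""
    rng = random.Random(seed)
    pop = [sample_config(rng) for _ in range(pop_size)]
    for _ in range(n_gen):
        elite = sorted(pop, key=objective)[: pop_size // 2]
        children = [mutate(rng.choice(elite), rng) for _ in elite]
        pop = elite + children
    return min(pop, key=objective)

def sample_config(rng):
    # Algorithm choice plus one numeric hyperparameter
    return {"algorithm": rng.choice(["A", "B"]), "x": rng.uniform(0, 10)}

def mutate(cfg, rng):
    cfg = dict(cfg)
    if rng.random() < 0.2:  # occasionally switch algorithm
        cfg["algorithm"] = rng.choice(["A", "B"])
    cfg["x"] = min(10.0, max(0.0, cfg["x"] + rng.gauss(0, 0.5)))
    return cfg

def objective(cfg):
    # Invented landscape: "A" is best at x=3 (loss 1); "B" is best at x=7 (loss 0)
    if cfg["algorithm"] == "A":
        return (cfg["x"] - 3) ** 2 + 1.0
    return (cfg["x"] - 7) ** 2

best = evolve(objective, sample_config, mutate)
print(best)
```

With enough generations the population typically migrates to algorithm "B" near x = 7, the joint optimum. TPOT replaces flat config dicts with tree-structured pipelines and adds crossover between them.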
5. Hyperband and Successive Halving
These are multi-fidelity approaches that evaluate configurations at multiple resource levels:
This is especially effective when early performance correlates with final performance—common in deep learning.
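A minimal successive-halving sketch illustrates the mechanism. The `evaluate` cost model below is invented for illustration (each config has an asymptotic loss approached as budget grows):

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Evaluate all configs at a small budget, keep the best 1/eta,
    and re-evaluate survivors with eta times the budget."""
    budget = min_budget
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[: max(1, len(configs) // eta)]  # survivors
        budget *= eta                                    # more resource per round
    return configs[0]

# Toy loss model: observed loss decays toward a config-specific final loss
rng = random.Random(0)
configs = [{"final_loss": rng.uniform(0, 1), "rate": rng.uniform(0.5, 2.0)}
           for _ in range(27)]

def evaluate(cfg, budget):
    return cfg["final_loss"] + 1.0 / (cfg["rate"] * budget)

best = successive_halving(configs, evaluate)
print(best)
```

With 27 configs and eta = 3, the rounds run 27 → 9 → 3 → 1 at budgets 1, 3, 9: most of the compute goes to the few survivors. Hyperband wraps several such brackets with different min_budget values to hedge against misleading low-budget scores.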
Comparison of Strategies:
| Strategy | Mechanism | Strengths | Weaknesses |
|---|---|---|---|
| Random Search | Uniform sampling | Simple, parallelizable, robust | No learning, inefficient for peaked landscapes |
| SMAC (Bayesian) | Surrogate + acquisition | Sample efficient, handles conditionals | Sequential, complex implementation |
| TPE | Density ratio modeling | Handles conditionals, simpler than GP | Less sample efficient than SMAC |
| Evolutionary | Population evolution | Diverse solutions, flexible pipelines | Many hyperparameters, slow convergence |
| Hyperband | Multi-fidelity elimination | Fast, efficient resource use | Requires meaningful fidelity proxy |
Understanding the theoretical complexity of CASH helps set realistic expectations and guides design decisions.
Configuration Space Size
For a CASH problem with k algorithms, where algorithm i has nᵢ hyperparameters and hyperparameter j takes mᵢⱼ discrete values, the total configuration space size is:
|Λ| = Σᵢ₌₁ᵏ Πⱼ₌₁ⁿⁱ mᵢⱼ
For example, with 5 algorithms, each contributing roughly 10⁵ discretized configurations:

|Λ| = 5 × 10⁵ = 500,000 configurations
With continuous hyperparameters, the space is infinite. Even with discretization, it's astronomically large.
Computational Complexity
Each configuration evaluation requires training a model. If training takes time T(n, d) for n samples and d features, a budget of B evaluations costs roughly B · c · T(n, d) with c-fold cross-validation. This motivates sample-efficient search strategies and multi-fidelity evaluation.
A single configuration evaluation might take seconds to hours depending on dataset size and algorithm. With 1000 configurations and 1 minute per evaluation, CASH takes ~17 hours. This is why sample efficiency is crucial—every evaluation counts.
Regret Bounds and Sample Complexity
For Bayesian optimization in CASH, theoretical results show sublinear regret under suitable regularity conditions on the objective. However, these bounds rest on assumptions (smooth objectives, well-specified surrogate models, well-behaved evaluation noise) that rarely hold exactly for real CASH landscapes.
Practically, convergence is problem-dependent. Meta-learning and warm-starting significantly accelerate convergence on structured problem distributions.
Hardness Results
CASH includes algorithm selection as a special case. Algorithm selection is at least as hard as zero-one integer programming in the worst case (NP-hard). No polynomial-time algorithm can solve general CASH problems optimally.
The practical implication: exact CASH optimization is off the table, so real systems aim for good anytime performance within a compute budget rather than provably optimal solutions.
```python
import numpy as np
from typing import Dict


def analyze_cash_complexity(algorithms: Dict,
                            hyperparameter_ranges: Dict,
                            dataset_size: int,
                            cv_folds: int = 5):
    """
    Analyze the computational complexity of a CASH problem.

    Provides estimates of:
    - Configuration space size
    - Estimated time per evaluation
    - Total search time for different budgets
    """
    # Configuration space size
    total_configs = 0
    space_details = {}

    for algo_name, param_ranges in hyperparameter_ranges.items():
        algo_configs = 1
        for param_name, param_range in param_ranges.items():
            if isinstance(param_range, list):
                algo_configs *= len(param_range)
            elif isinstance(param_range, tuple) and len(param_range) == 2:
                # Continuous: estimate 100 effective values
                algo_configs *= 100
        space_details[algo_name] = algo_configs
        total_configs += algo_configs

    print("=== CASH Complexity Analysis ===")
    print(f"Number of algorithms: {len(algorithms)}")
    print("Configurations per algorithm:")
    for algo, count in space_details.items():
        print(f"  {algo}: ~{count:,} configurations")
    print(f"Total configuration space: ~{total_configs:,} configurations")

    # Estimate time per evaluation
    # Very rough: assume O(n * d) training, where d ~ n/10
    d_approx = max(10, dataset_size // 10)
    time_estimates = {
        'RandomForest': dataset_size * d_approx * 100 / 1e8,    # ~100 trees
        'SVM': (dataset_size ** 2) * d_approx / 1e9,            # Kernel computation
        'GradientBoosting': dataset_size * d_approx * 100 / 1e8,
        'NeuralNetwork': dataset_size * d_approx * 1000 / 1e8,  # ~1000 iterations
    }
    avg_time = np.mean([time_estimates.get(algo, 1.0) for algo in algorithms.keys()])

    print(f"=== Time Estimates (dataset size={dataset_size}) ===")
    print(f"Estimated time per evaluation: ~{avg_time * cv_folds:.1f} seconds")
    print(f"  (with {cv_folds}-fold CV)")

    # Total time for different evaluation budgets
    budgets = [50, 100, 500, 1000]
    print("=== Total Search Time Estimates ===")
    for budget in budgets:
        total_time = budget * avg_time * cv_folds
        hours = total_time / 3600
        print(f"  {budget} evaluations: ~{hours:.1f} hours")

    # Coverage
    print("=== Search Coverage ===")
    for budget in budgets:
        coverage = budget / total_configs * 100
        print(f"  {budget} evaluations: {coverage:.4f}% of space explored")

    return {
        'total_configs': total_configs,
        'avg_time_per_eval': avg_time * cv_folds,
        'space_details': space_details
    }


# Example usage
algorithms = {
    'RandomForest': None,
    'SVM': None,
    'GradientBoosting': None,
    'NeuralNetwork': None
}

hyperparameter_ranges = {
    'RandomForest': {
        'n_estimators': (10, 500),
        'max_depth': (1, 50),
        'min_samples_split': (2, 20)
    },
    'SVM': {
        'kernel': ['linear', 'rbf', 'poly'],
        'C': (1e-3, 1e3),
        'gamma': (1e-4, 1e1)
    },
    'GradientBoosting': {
        'n_estimators': (50, 500),
        'learning_rate': (0.01, 1.0),
        'max_depth': (1, 15)
    },
    'NeuralNetwork': {
        'hidden_layers': (1, 5),
        'hidden_size': (16, 512),
        'learning_rate': (1e-5, 1e-1),
        'activation': ['relu', 'tanh', 'sigmoid']
    }
}

analysis = analyze_cash_complexity(algorithms, hyperparameter_ranges,
                                   dataset_size=10000)
```

Let's examine how leading AutoML systems implement CASH:
Auto-sklearn 2.0
Auto-sklearn is the reference implementation for CASH-based AutoML:
Key architectural decisions:
Auto-WEKA
The original CASH paper's implementation:
Auto-WEKA demonstrated CASH's viability but is less maintained than Auto-sklearn.
```python
import autosklearn.classification
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np


def run_autosklearn_cash(time_budget_minutes=10):
    """
    Run Auto-sklearn, demonstrating CASH in action.

    Auto-sklearn internally solves the CASH problem using SMAC
    with meta-learning warm-starting.
    """
    # Load dataset
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    print(f"Dataset: {X_train.shape[0]} training samples, "
          f"{X_train.shape[1]} features")
    print(f"Time budget: {time_budget_minutes} minutes")

    # Initialize Auto-sklearn
    # This sets up the CASH problem internally
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_budget_minutes * 60,  # Total seconds
        per_run_time_limit=60,  # Max seconds per configuration
        memory_limit=4096,      # MB
        # CASH settings
        include={
            'classifier': [
                'random_forest',
                'gradient_boosting',
                'adaboost',
                'extra_trees',
                'libsvm_svc',
                'liblinear_svc',
                'k_nearest_neighbors',
                'decision_tree',
            ]
        },
        # Meta-learning for warm-starting
        initial_configurations_via_metalearning=25,
        # Ensembling
        ensemble_kwargs={'ensemble_size': 50},
        # Reproducibility
        seed=42,
        n_jobs=1
    )

    # Fit - this runs SMAC to solve the CASH problem
    print("Starting Auto-sklearn (CASH optimization)...")
    automl.fit(X_train, y_train)

    # Results
    print("=== CASH Results ===")
    print(f"Test accuracy: {automl.score(X_test, y_test):.4f}")

    # Show examined configurations
    print(f"Configurations examined: {len(automl.cv_results_['mean_test_score'])}")

    # Show best configuration (the CASH solution)
    print("Best configuration found:")
    print(automl.show_models())

    # Leaderboard
    print("Top 5 configurations:")
    leaderboard = automl.leaderboard(detailed=True,
                                     ensemble_size=False)[['rank', 'type', 'cost']]
    print(leaderboard.head())

    return automl


# Run CASH with Auto-sklearn
# automl = run_autosklearn_cash(time_budget_minutes=10)
```

TPOT (Genetic Programming CASH)
TPOT takes a different approach using evolutionary algorithms:
H2O AutoML
H2O emphasizes robustness over selection:
This is a portfolio approach that trades compute efficiency for reliability.
Design Patterns Across Systems:
When using or designing CASH systems, consider these practical guidelines:
Designing Configuration Spaces:
Budget Allocation:
Evaluation Strategies:
In practice, a well-tuned gradient boosting model (XGBoost, LightGBM, CatBoost) with 50-100 random hyperparameter trials often matches or beats elaborate CASH optimization for tabular data. CASH's biggest wins come on diverse problem portfolios where no single algorithm dominates.
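That baseline is easy to set up with scikit-learn's `RandomizedSearchCV`. The sketch below uses `GradientBoostingClassifier` on a synthetic dataset and a deliberately small trial count; in practice you'd use 50-100 trials and a faster library like XGBoost or LightGBM:

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=42)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 150),
        "learning_rate": loguniform(1e-2, 1.0),   # log-uniform, as in CASH spaces
        "max_depth": randint(1, 6),
        "subsample": loguniform(0.5, 1.0),
    },
    n_iter=8,   # keep small here; 50-100 in practice
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)
print(f"best CV accuracy: {search.best_score_:.3f}")
print(search.best_params_)
```

This is a single-algorithm slice of CASH: the configuration space is flat (no conditionals), which is exactly why such a simple tool suffices once the algorithm choice is fixed.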
We've covered the Combined Algorithm Selection and Hyperparameter optimization problem—the mathematical foundation of modern AutoML. Let's consolidate the key takeaways:
What's Next:
Meta-learning is a crucial accelerator for CASH—it allows systems to transfer knowledge from past experiments to new datasets. The next page explores meta-learning for model selection in depth, showing how systems learn to predict algorithm performance and warm-start optimization effectively.
You now understand the CASH problem—how algorithm selection and hyperparameter optimization are unified, the structure of hierarchical configuration spaces, and how leading AutoML systems implement CASH in practice. This knowledge equips you to use, evaluate, and extend AutoML tools effectively.