In 2015, a team from the University of Freiburg introduced Auto-sklearn, a system that would fundamentally reshape how the machine learning community thought about automation. Built on the insight that machine learning algorithm configuration is itself a machine learning problem, Auto-sklearn leveraged decades of meta-learning research to create a framework that could compete with, and often surpass, human experts in traditional ML pipeline construction.
Auto-sklearn wasn't the first attempt at automated machine learning—systems like Auto-WEKA had explored similar territory. However, Auto-sklearn's integration of Bayesian optimization, meta-learning for warm-starting, and automatic ensemble construction created a uniquely powerful combination that achieved first place in the inaugural AutoML Challenge at ICML 2015 and 2016, establishing the template that many subsequent AutoML systems would follow.
By the end of this page, you will understand Auto-sklearn's complete architecture—from its SMAC-based Bayesian optimizer to its meta-learning initialization strategy, from its automated preprocessing to its sophisticated ensemble construction. You will be able to configure Auto-sklearn for production workloads, understand its theoretical foundations, and recognize when it represents the optimal choice for your AutoML needs.
Auto-sklearn approaches automated machine learning as a Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem. This formulation, while seemingly straightforward, represents a profound insight: rather than treating algorithm selection and hyperparameter tuning as separate stages, Auto-sklearn optimizes them jointly within a unified search space.
Formally, given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, a set of possible algorithms $\mathcal{A} = \{A^{(1)}, A^{(2)}, \ldots, A^{(K)}\}$, and for each algorithm $A^{(j)}$ a hyperparameter space $\Lambda^{(j)}$, the CASH problem seeks:
$$A^{*}, \lambda^{*} = \arg\min_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}} \mathcal{L}\left(A^{(j)}_{\lambda}, D_{train}, D_{valid}\right)$$
where $\mathcal{L}$ represents a loss metric evaluated on the validation set after training on the training set.
This joint optimization over both discrete choices (which algorithm) and continuous/categorical hyperparameters creates a hierarchical, conditional search space that is extraordinarily challenging to navigate efficiently.
The CASH search space is inherently hierarchical because hyperparameters are conditional on algorithm choice. The 'C' regularization parameter only exists if you've selected SVM. The 'max_depth' parameter only exists if you've chosen a tree-based method. This conditional structure creates a space where different regions have entirely different dimensionalities and parameter semantics.
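This conditional structure can be made concrete with the ConfigSpace library, which Auto-sklearn uses internally to define its search space. The sketch below builds a toy two-algorithm space in which `C` is only active when the SVM is selected and `max_depth` only when a random forest is selected; the names and ranges are illustrative, not Auto-sklearn's actual defaults, and the exact ConfigSpace calls may vary slightly between library versions.

```python
# Minimal sketch of a conditional (hierarchical) search space with ConfigSpace.
# Names and ranges are illustrative, not Auto-sklearn's real configuration.
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter,
    UniformFloatHyperparameter,
    UniformIntegerHyperparameter,
)
from ConfigSpace.conditions import EqualsCondition

cs = ConfigurationSpace()

# Top-level discrete choice: which algorithm to use
classifier = CategoricalHyperparameter("classifier", ["svm", "random_forest"])

# Hyperparameters that only exist for one branch of the choice
svm_C = UniformFloatHyperparameter("svm:C", 0.01, 100.0, log=True)
rf_max_depth = UniformIntegerHyperparameter("rf:max_depth", 2, 20)

cs.add_hyperparameters([classifier, svm_C, rf_max_depth])

# Conditions encode the hierarchy: 'svm:C' is active only if classifier == "svm"
cs.add_condition(EqualsCondition(svm_C, classifier, "svm"))
cs.add_condition(EqualsCondition(rf_max_depth, classifier, "random_forest"))

# Sampling respects the conditional structure: inactive parameters are absent
print(cs.sample_configuration())
```

Sampling from such a space returns configurations whose active parameters depend on the algorithm chosen, which is exactly why standard fixed-dimension optimizers struggle with CASH.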
Auto-sklearn doesn't just search over classifiers—it searches over complete machine learning pipelines. Each pipeline consists of a data preprocessing stage (imputation, encoding, scaling), a feature preprocessing stage (dimensionality reduction or feature selection), and a final estimator (classifier or regressor), each with its own hyperparameters, as summarized in the table below.
This creates an aggregate search space of approximately 110 hyperparameters, though the effective dimensionality at any point is lower due to the conditional structure.
The search space includes continuous hyperparameters (learning rates, regularization strengths), discrete hyperparameters (number of layers, tree depth), and categorical choices (algorithm family, imputation strategy). This heterogeneous space requires specialized optimization techniques.
| Component | Options | Key Hyperparameters |
|---|---|---|
| Data Preprocessing | Imputers, Encoders, Scalers | Imputation strategy, encoding method, scaling type |
| Feature Preprocessing | 13 methods including PCA, SelectPercentile, etc. | n_components, percentile, polynomial degree |
| Classifiers | 15 algorithms: AdaBoost, Decision Tree, Extra Trees, Gradient Boosting, KNN, LDA, Logistic Regression, Passive Aggressive, QDA, Random Forest, SGD, SVM, MLP, etc. | Algorithm-specific: max_depth, n_estimators, learning_rate, C, etc. |
| Regressors | 14 algorithms: similar to classifiers plus specialized regressors | Corresponding regression-specific parameters |
At Auto-sklearn's core lies SMAC (Sequential Model-based Algorithm Configuration), a Bayesian optimization method specifically designed for algorithm configuration. Unlike traditional Bayesian optimization with Gaussian Processes, SMAC uses Random Forests as its surrogate model, providing crucial advantages for AutoML:
Gaussian Processes (GPs) are the traditional choice for Bayesian optimization because they provide well-calibrated uncertainty estimates. However, GPs face critical limitations in the AutoML context: standard GPs scale cubically with the number of observations, assume a smooth, continuous input space, and handle categorical and conditional hyperparameters poorly—exactly the structure the CASH space exhibits.
Random Forests address these limitations elegantly: they natively handle categorical and conditional parameters, scale gracefully as evaluations accumulate, are robust to the non-smooth, discontinuous performance landscapes typical of algorithm configuration, and still provide uncertainty estimates through the variance of predictions across individual trees.
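The uncertainty estimate that drives the acquisition function can be read directly from the spread of per-tree predictions. The following is a minimal sketch of that idea using scikit-learn's RandomForestRegressor on synthetic data—a simplification of SMAC's actual surrogate, which uses its own random forest implementation:

```python
# Sketch: empirical mean/std over individual trees as a surrogate's
# predictive uncertainty. Synthetic data; SMAC's real surrogate differs in detail.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy "observation history": encoded configurations -> validation loss
X_hist = rng.uniform(0, 1, size=(30, 5))   # 30 evaluated configs, 5 encoded dims
y_hist = np.sin(3 * X_hist[:, 0]) + 0.1 * rng.standard_normal(30)

surrogate = RandomForestRegressor(n_estimators=50, min_samples_leaf=3, random_state=0)
surrogate.fit(X_hist, y_hist)

# Candidate configuration to score
candidate = rng.uniform(0, 1, size=(1, 5))

# Per-tree predictions give both a mean and an uncertainty estimate
per_tree = np.array([tree.predict(candidate)[0] for tree in surrogate.estimators_])
mean, std = per_tree.mean(), per_tree.std()
print(f"predicted loss = {mean:.3f} +/- {std:.3f}")
```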
SMAC operates through an iterative process that balances exploration (trying uncertain regions) and exploitation (refining near good solutions):
Step 1: Initialize — Start with a small set of configurations, potentially informed by meta-learning (discussed later)
Step 2: Train Surrogate — Fit a Random Forest on all observed (configuration, performance) pairs
Step 3: Acquisition Function Optimization — Maximize the Expected Improvement (EI) acquisition function using local search:
$$EI(\lambda) = \mathbb{E}[\max(f^* - f(\lambda), 0)]$$
where $f^*$ is the best performance observed so far and $f(\lambda)$ is the predicted performance of configuration $\lambda$.
Step 4: Evaluate — Train and validate the selected configuration on the actual dataset
Step 5: Update — Add the new observation to the history and return to Step 2
```python
# SMAC Optimization Loop (Pseudocode)
# Helper functions (warm_start_from_meta_learning, generate_random_configs,
# time_remaining, evaluate_configuration, encode_configurations, encode,
# local_search, random_config) are placeholders for Auto-sklearn internals.
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor


def smac_optimize(search_space, dataset, time_budget, meta_features=None):
    """
    SMAC-based Bayesian optimization for AutoML.

    Args:
        search_space: Hierarchical configuration space
        dataset: Training and validation data
        time_budget: Total optimization time in seconds
        meta_features: Optional dataset meta-features for warm-starting

    Returns:
        Best configuration found and its performance
    """
    # Step 1: Initialize with default + random configurations.
    # Meta-learning can provide informed initial configurations.
    if meta_features is not None:
        initial_configs = warm_start_from_meta_learning(meta_features, num_initial=25)
    else:
        initial_configs = generate_random_configs(num_initial=10)

    # Evaluate initial configurations
    history = []
    for config in initial_configs:
        if time_remaining() <= 0:
            break
        performance = evaluate_configuration(config, dataset)
        history.append((config, performance))

    # Main optimization loop
    while time_remaining() > 0:
        # Step 2: Train Random Forest surrogate on (configuration, performance) pairs
        surrogate_model = RandomForestRegressor(
            n_estimators=10,
            max_features='sqrt',
            min_samples_leaf=3,
            bootstrap=True,
        )
        X = encode_configurations([h[0] for h in history])
        y = [h[1] for h in history]
        surrogate_model.fit(X, y)

        # Step 3: Optimize acquisition function via local search
        best_incumbent = min(history, key=lambda x: x[1])[0]

        # Expected Improvement with uncertainty from tree variance
        def expected_improvement(config):
            predictions = [tree.predict(encode(config))
                           for tree in surrogate_model.estimators_]
            mean = np.mean(predictions)
            std = np.std(predictions)
            best_so_far = min([h[1] for h in history])
            # EI formula for minimization
            z = (best_so_far - mean) / (std + 1e-9)
            ei = std * (z * norm.cdf(z) + norm.pdf(z))
            return ei

        # Local search from multiple starting points
        candidates = [
            local_search(best_incumbent, expected_improvement),
            local_search(random_config(), expected_improvement),
            *[random_config() for _ in range(10)],  # Random exploration
        ]
        next_config = max(candidates, key=expected_improvement)

        # Step 4: Evaluate selected configuration on the actual dataset
        performance = evaluate_configuration(next_config, dataset)

        # Step 5: Update history
        history.append((next_config, performance))

    # Return best configuration found
    best_config, best_perf = min(history, key=lambda x: x[1])
    return best_config, best_perf
```

SMAC interleaves model-guided suggestions with purely random configurations at a ratio that favors model-guided selections but maintains exploration diversity. This prevents the optimizer from getting trapped in local optima of the surrogate model, which is particularly important in the high-dimensional, multi-modal CASH landscape.
One of Auto-sklearn's most innovative features is its use of meta-learning to warm-start the optimization process. Rather than beginning from scratch on each new dataset, Auto-sklearn leverages knowledge from 140+ previously tuned datasets to identify promising initial configurations.
The core insight is that similar datasets often benefit from similar configurations. If gradient boosting with learning_rate=0.1 and max_depth=5 works well on dataset A, and dataset B has similar properties to dataset A, then this configuration is likely a good starting point for dataset B.
This hypothesis is formalized through dataset meta-features—quantitative characteristics that describe datasets independently of any specific learning algorithm.
Auto-sklearn computes approximately 38 meta-features for each dataset:
| Category | Example Meta-Features | Intuition |
|---|---|---|
| Simple | Number of instances, number of features, number of classes | Basic dataset scale indicators |
| Statistical | Mean/std of feature means, skewness, kurtosis | Data distribution characteristics |
| Information-Theoretic | Class entropy, mean feature entropy, mutual information | Complexity and redundancy signals |
| Landmarking | 1-NN accuracy, decision stump accuracy, random forest (quick eval) | Cheap performance estimates that correlate with full tuning |
| PCA | PCA fraction variance for first k components, PCA skewness/kurtosis | Intrinsic dimensionality indicators |
Given a new dataset $D_{new}$:
Compute meta-features: Extract the 38-dimensional meta-feature vector $m_{new}$ for the new dataset
Find similar datasets: Using a distance metric (typically L1 or L2 distance on normalized meta-features), identify the $k$ most similar datasets from the meta-knowledge base
Select promising configurations: For each similar dataset, retrieve the best-performing configurations found during that dataset's tuning
Prioritize evaluation: These configurations become the initial candidates for SMAC, evaluated before random exploration begins
The meta-knowledge base stores, for each of the roughly 140 reference datasets, its meta-feature vector together with the best-performing configurations (and their validation scores) found during extensive offline optimization on that dataset.
This process typically identifies 25 initial configurations that represent strong starting points, dramatically reducing the time needed to find good solutions.
```python
# Auto-sklearn Meta-Learning Implementation (illustrative sketch)
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


class MetaLearningWarmStart:
    """
    Implements Auto-sklearn's meta-learning warm-start strategy.
    """

    def __init__(self, meta_knowledge_base_path: str):
        """
        Args:
            meta_knowledge_base_path: Path to stored meta-knowledge containing
                (dataset_meta_features, best_configs, performances)
        """
        self.meta_knowledge = self._load_meta_knowledge(meta_knowledge_base_path)
        self.meta_features = np.array([
            entry['meta_features'] for entry in self.meta_knowledge
        ])
        self.scaler = StandardScaler().fit(self.meta_features)
        self.normalized_features = self.scaler.transform(self.meta_features)

    @staticmethod
    def _load_meta_knowledge(path: str) -> list:
        # Assumed storage format: a pickled list of dicts with keys
        # 'meta_features' and 'best_configs' (list of (config, performance) pairs)
        with open(path, 'rb') as f:
            return pickle.load(f)

    @staticmethod
    def _config_to_key(config: dict) -> tuple:
        # Hashable key for deduplicating configurations
        return tuple(sorted(config.items()))

    def compute_meta_features(self, X, y) -> np.ndarray:
        """
        Computes the meta-feature vector for a dataset.

        Returns:
            meta_features: Array of dataset characteristics (38 entries in the
            full implementation; a representative subset is computed here)
        """
        meta_features = []

        # Simple meta-features
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        meta_features.extend([
            np.log(n_samples),          # Log-transformed for stability
            np.log(n_features),
            n_classes,
            n_features / n_samples,     # Dimensionality ratio
        ])

        # Statistical meta-features
        feature_means = np.nanmean(X, axis=0)
        feature_stds = np.nanstd(X, axis=0)
        meta_features.extend([
            np.mean(feature_means),
            np.std(feature_means),
            np.mean(feature_stds),
            np.std(feature_stds),
        ])

        # Information-theoretic meta-features
        class_counts = np.bincount(y.astype(int))
        class_probs = class_counts / len(y)
        class_entropy = -np.sum(class_probs * np.log2(class_probs + 1e-10))
        meta_features.append(class_entropy)

        # Landmarking meta-features (cheap model evaluations)
        stump = DecisionTreeClassifier(max_depth=1)
        stump_acc = np.mean(cross_val_score(stump, X, y, cv=3, scoring='accuracy'))
        meta_features.append(stump_acc)

        knn = KNeighborsClassifier(n_neighbors=1)
        knn_acc = np.mean(cross_val_score(knn, X, y, cv=3, scoring='accuracy'))
        meta_features.append(knn_acc)

        # PCA meta-features
        pca = PCA().fit(X)
        variance_ratios = pca.explained_variance_ratio_
        meta_features.extend([
            variance_ratios[0] if len(variance_ratios) > 0 else 0,
            np.sum(variance_ratios[:min(5, len(variance_ratios))]),
        ])

        # ... (additional meta-features omitted for brevity)
        return np.array(meta_features)

    def get_warm_start_configs(
        self,
        X,
        y,
        n_configs: int = 25,
        n_similar_datasets: int = 10,
    ) -> list:
        """
        Returns promising initial configurations based on similar datasets.

        Args:
            X, y: The new dataset
            n_configs: Number of configurations to return
            n_similar_datasets: Number of similar datasets to consider

        Returns:
            List of configuration dictionaries to evaluate first
        """
        # Compute meta-features for the new dataset
        new_meta_features = self.compute_meta_features(X, y)
        normalized_new = self.scaler.transform(new_meta_features.reshape(1, -1))

        # Find most similar datasets using L1 distance
        distances = np.sum(np.abs(self.normalized_features - normalized_new), axis=1)
        similar_indices = np.argsort(distances)[:n_similar_datasets]

        # Collect best configurations from similar datasets
        candidate_configs = []
        for idx in similar_indices:
            candidate_configs.extend(self.meta_knowledge[idx]['best_configs'])

        # Deduplicate and rank by how often a configuration was a top performer
        config_scores = {}
        for config, _perf in candidate_configs:
            config_key = self._config_to_key(config)
            if config_key not in config_scores:
                config_scores[config_key] = {'config': config, 'count': 0}
            config_scores[config_key]['count'] += 1

        # Return the top n_configs unique configurations
        sorted_configs = sorted(
            config_scores.values(), key=lambda x: x['count'], reverse=True
        )
        return [entry['config'] for entry in sorted_configs[:n_configs]]
```

Meta-learning works best when the new dataset genuinely resembles datasets in the knowledge base. For highly unusual datasets (novel domains, extreme scales, unique feature distributions), the warm-start configurations may be suboptimal. Auto-sklearn handles this by still allowing substantial random exploration alongside meta-learned suggestions.
A distinguishing feature of Auto-sklearn is its post-hoc ensemble construction. Rather than returning only the single best model found during optimization, Auto-sklearn builds an ensemble from all models evaluated during the search process. This approach consistently improves upon single-model selection across diverse datasets.
Auto-sklearn uses greedy ensemble selection with replacement, based on the algorithm by Caruana et al. (2004): start from an empty ensemble, repeatedly add (with replacement) the candidate model whose inclusion most improves the ensemble's validation score, and stop after a fixed number of iterations or when no addition improves the score. Each model's final weight is proportional to how often it was selected.
The key insight is that ensemble diversity matters more than individual model quality. A slightly weaker model that makes different errors than existing ensemble members can improve overall performance more than another copy of the strongest individual model.
```python
# Ensemble Selection Algorithm (Caruana et al., 2004)
import numpy as np
from sklearn.metrics import accuracy_score


class EnsembleSelection:
    """
    Greedy ensemble selection with replacement.
    Builds a weighted ensemble from candidate models.
    """

    def __init__(
        self,
        candidate_predictions: np.ndarray,
        y_valid: np.ndarray,
        max_ensemble_size: int = 50,
        metric: callable = accuracy_score,
    ):
        """
        Args:
            candidate_predictions: Shape (n_candidates, n_samples, n_classes)
                Predicted probabilities from all trained models
            y_valid: True labels for validation set
            max_ensemble_size: Maximum number of models in ensemble
            metric: Scoring function (higher is better)
        """
        self.candidates = candidate_predictions
        self.y_valid = y_valid
        self.max_size = max_ensemble_size
        self.metric = metric
        self.n_candidates = len(candidate_predictions)
        self.ensemble_indices = []
        self.ensemble_weights = None

    def fit(self) -> 'EnsembleSelection':
        """Performs greedy ensemble selection with replacement."""
        best_score = float('-inf')

        for iteration in range(self.max_size):
            best_candidate = None
            best_new_score = best_score

            # Current ensemble predictions (mean of selected members)
            if len(self.ensemble_indices) > 0:
                current_preds = np.mean(self.candidates[self.ensemble_indices], axis=0)
            else:
                current_preds = np.zeros_like(self.candidates[0])

            # Try adding each candidate
            for i in range(self.n_candidates):
                # New ensemble would be current + candidate i
                n_current = len(self.ensemble_indices)
                new_ensemble_preds = (
                    current_preds * n_current + self.candidates[i]
                ) / (n_current + 1)

                # Evaluate new ensemble
                y_pred = np.argmax(new_ensemble_preds, axis=1)
                score = self.metric(self.y_valid, y_pred)
                if score > best_new_score:
                    best_new_score = score
                    best_candidate = i

            # Check for improvement
            if best_new_score > best_score:
                self.ensemble_indices.append(best_candidate)
                best_score = best_new_score
                print(f"Iteration {iteration+1}: Added model {best_candidate}, "
                      f"score = {best_score:.4f}")
            else:
                # No improvement, stop early
                print(f"Stopping at iteration {iteration+1}: no improvement")
                break

        # Compute final weights
        self._compute_weights()
        return self

    def _compute_weights(self):
        """Computes normalized weights from selection frequencies."""
        counts = np.bincount(self.ensemble_indices, minlength=self.n_candidates)
        self.ensemble_weights = counts / len(self.ensemble_indices)

    def predict_proba(self, candidate_predictions: np.ndarray) -> np.ndarray:
        """
        Makes predictions using the selected ensemble.

        Args:
            candidate_predictions: Shape (n_candidates, n_samples, n_classes)
                Predictions from all candidate models on new data

        Returns:
            Weighted ensemble predictions of shape (n_samples, n_classes)
        """
        weighted_preds = np.zeros_like(candidate_predictions[0])
        for i, weight in enumerate(self.ensemble_weights):
            if weight > 0:
                weighted_preds += weight * candidate_predictions[i]
        return weighted_preds

    def get_selected_models(self) -> list:
        """Returns the indices and weights of selected models."""
        selected = [(i, w) for i, w in enumerate(self.ensemble_weights) if w > 0]
        return sorted(selected, key=lambda x: x[1], reverse=True)
```

The success of post-hoc ensemble selection stems from several factors:
1. Free Diversity: During the SMAC search, many diverse configurations are evaluated at no additional cost. Random forests, gradient boosting, SVMs, and various preprocessing pipelines naturally produce models with different inductive biases.
2. Validation-Guided Weighting: By selecting based on validation performance, the ensemble automatically weights models according to their complementary contributions, not just individual strength.
3. Error Decorrelation: Models with uncorrelated errors provide maximal ensemble benefit. The diverse CASH search space naturally produces decorrelated predictions.
4. Robustness: Ensembles are more robust to overfitting than single models, providing smoother generalization.
In practice, Auto-sklearn ensembles typically contain 5-30 models, even though 50+ configurations may be evaluated. The greedy selection naturally prunes poorly performing and redundant models. For deployment, this means the ensemble inference cost is manageable, not proportional to total search budget.
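The error-decorrelation effect described above can be illustrated with a small simulation: averaging models whose mistakes are independent improves accuracy far more than averaging models that share the same mistakes. The example below uses synthetic 0/1 predictions rather than real Auto-sklearn models.

```python
# Toy illustration: independent errors benefit an averaging ensemble,
# perfectly correlated errors do not. Synthetic predictions, not real models.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_models, p_correct = 10_000, 15, 0.7
y_true = rng.integers(0, 2, size=n_samples)

def majority_vote_accuracy(model_preds):
    # model_preds: shape (n_models, n_samples) of 0/1 predictions
    votes = model_preds.mean(axis=0) >= 0.5
    return (votes == y_true).mean()

# Case 1: each model is correct independently with probability p_correct
independent = np.array([
    np.where(rng.random(n_samples) < p_correct, y_true, 1 - y_true)
    for _ in range(n_models)
])

# Case 2: all models share exactly the same errors (perfect correlation)
shared_mask = rng.random(n_samples) < p_correct
correlated = np.array([
    np.where(shared_mask, y_true, 1 - y_true)
    for _ in range(n_models)
])

print(f"single model accuracy:        ~{p_correct:.2f}")
print(f"ensemble, independent errors: {majority_vote_accuracy(independent):.3f}")
print(f"ensemble, correlated errors:  {majority_vote_accuracy(correlated):.3f}")
```

With fifteen 70%-accurate but independent voters, the majority vote lands around 95% accuracy, while fully correlated voters remain stuck at 70%—the intuition behind preferring diverse ensemble members over redundant copies of the single best model.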
Understanding Auto-sklearn's architecture enables effective practical configuration. Here we explore the key parameters and their impact on optimization behavior.
```python
import autosklearn.classification
import autosklearn.metrics
import autosklearn.regression
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load example dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ========================================
# Basic Usage - Minimal Configuration
# ========================================
automl_basic = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # Total time budget in seconds
    per_run_time_limit=300,        # Max time per model training
)

automl_basic.fit(X_train, y_train)
predictions = automl_basic.predict(X_test)
print(f"Accuracy: {automl_basic.score(X_test, y_test):.4f}")

# ========================================
# Advanced Configuration for Production
# ========================================
automl_advanced = autosklearn.classification.AutoSklearnClassifier(
    # Time Budgets
    time_left_for_this_task=7200,  # 2 hours total
    per_run_time_limit=600,        # 10 min per model

    # Memory Management
    memory_limit=4096,  # 4GB per process

    # Meta-Learning Configuration
    initial_configurations_via_metalearning=25,  # Warm-start configs

    # Ensemble Configuration
    ensemble_size=50,   # Max ensemble members
    ensemble_nbest=50,  # Consider top 50 models for ensemble

    # Parallelization
    n_jobs=4,  # Parallel model evaluation

    # Search Space Customization
    # Note: include and exclude cannot both be set for the same pipeline step,
    # so restricting classifiers here already rules out the remaining algorithms.
    include={
        'classifier': ['random_forest', 'gradient_boosting', 'extra_trees'],
        'feature_preprocessor': ['no_preprocessing', 'pca', 'select_percentile'],
    },

    # Resampling Strategy
    resampling_strategy='cv',  # Cross-validation
    resampling_strategy_arguments={'folds': 5},

    # Reproducibility
    seed=42,

    # Metric Optimization
    metric=autosklearn.metrics.balanced_accuracy,
)

automl_advanced.fit(X_train, y_train)

# ========================================
# Inspecting Results
# ========================================

# View the final ensemble composition
print("\n=== Ensemble Composition ===")
print(automl_advanced.show_models())

# Access leaderboard of all evaluated models
print("\n=== Leaderboard (Top 10) ===")
leaderboard = automl_advanced.leaderboard(detailed=True)
print(leaderboard.head(10))

# Get statistics about the optimization process
print("\n=== Optimization Statistics ===")
stats = automl_advanced.sprint_statistics()
print(stats)

# Export ensemble for deployment:
# each model in the ensemble with its weight
for weight, model in automl_advanced.get_models_with_weights():
    print(f"Weight: {weight:.3f}, Model: {type(model).__name__}")
```

When using n_jobs > 1, total memory consumption is approximately n_jobs × memory_limit. A 4-worker setup with 4GB per worker requires 16GB+ RAM. Plan resources accordingly and consider n_jobs=-1 only when memory is truly abundant.
Auto-sklearn 2.0, released in 2020, introduced several significant architectural improvements that enable more robust performance across diverse dataset types. The key innovation is a portfolio-based strategy that replaces pure meta-learning with a more reliable approach.
Instead of relying solely on dataset similarity for warm-starting, Auto-sklearn 2.0 uses a pre-computed portfolio of configurations that perform well across a wide variety of datasets. This portfolio is built offline by greedily selecting configurations that complement one another across a large collection of benchmark datasets, requires no meta-feature computation at runtime, and provides robust coverage even when a new dataset resembles nothing in the meta-knowledge base.
Auto-sklearn 2.0 integrates early stopping for iterative algorithms, allowing it to terminate unpromising runs after only a few iterations, reallocate that budget to more promising configurations (including via successive-halving-style multi-fidelity scheduling), and therefore evaluate far more configurations within the same time budget.
```python
# Auto-sklearn 2.0 Specific Features
import autosklearn.classification

# Auto-sklearn 2.0 automatically uses:
# - Portfolio-based warm starting
# - Early stopping for iterative algorithms
# - Improved preprocessing pipeline
automl_v2 = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    per_run_time_limit=600,

    # Auto-sklearn 2.0 improvements are enabled by default in version 2.x.
    # Restrict the search to iterative algorithms that support early stopping:
    include={
        'classifier': [
            'extra_trees',        # Iterative (n_estimators)
            'gradient_boosting',  # Iterative (n_estimators)
            'random_forest',      # Iterative (n_estimators)
            'mlp',                # Iterative (epochs)
        ],
    },

    # Disable similarity-based meta-learning in favor of the portfolio
    # (the default behavior in 2.0)
    initial_configurations_via_metalearning=0,

    # Clean up temporary and output folders after the run
    delete_tmp_folder_after_terminate=True,
    delete_output_folder_after_terminate=True,
)

# X_train, y_train come from the earlier example
automl_v2.fit(X_train, y_train)

# Analysis of what Auto-sklearn 2.0 explored
stats = automl_v2.sprint_statistics()
print("Auto-sklearn 2.0 Search Summary:")
print(stats)
```

| Feature | Auto-sklearn 1.0 | Auto-sklearn 2.0 |
|---|---|---|
| Initialization | Meta-learning (dataset similarity) | Portfolio + Meta-learning |
| Early Stopping | Not integrated | Integrated for iterative algorithms |
| Multi-Fidelity | No | Successive Halving support |
| Robustness | Can struggle on unusual datasets | More consistent across dataset types |
| Speed | Full training for all evaluations | Faster through early termination |
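The snippet above configures the familiar 1.x `AutoSklearnClassifier` with 2.0-style settings; recent auto-sklearn releases also ship a dedicated 2.0 estimator. The following is a minimal sketch assuming a version that includes the experimental `AutoSklearn2Classifier`:

```python
# Sketch: the dedicated Auto-sklearn 2.0 interface (experimental module).
# Assumes an auto-sklearn release that provides AutoSklearn2Classifier.
from autosklearn.experimental.askl2 import AutoSklearn2Classifier

automl_askl2 = AutoSklearn2Classifier(
    time_left_for_this_task=3600,  # Total budget in seconds
    per_run_time_limit=600,        # Per-configuration limit
    memory_limit=4096,             # MB per worker
)

# Portfolio warm-starting, early stopping, and the model-selection policy are
# chosen automatically; no meta-feature computation is needed at runtime.
automl_askl2.fit(X_train, y_train)  # X_train, y_train from the earlier example
print(automl_askl2.sprint_statistics())
```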
Despite its strengths, Auto-sklearn has specific limitations that make other AutoML systems preferable in certain scenarios. Understanding these limitations enables informed system selection.
Auto-sklearn remains excellent for: (1) Medium-scale tabular classification/regression, (2) When explainability matters (no black-box neural networks), (3) Academic benchmarking requiring reproducibility, (4) Environments without GPU resources, (5) Datasets where classical ML outperforms deep learning.
Auto-sklearn represents the mature, academically rigorous end of the AutoML spectrum. Its SMAC optimization, meta-learning, and ensemble construction form a coherent, theoretically grounded system that has proven itself across hundreds of benchmark datasets.
For practitioners, Auto-sklearn serves as an excellent baseline for tabular modeling tasks, a reference implementation of CASH optimization, meta-learning, and post-hoc ensembling, and a teaching tool for understanding how modern AutoML systems are built.
As we explore other AutoML systems in subsequent pages, keep Auto-sklearn's architecture in mind as a reference point. Many subsequent systems adopt, extend, or explicitly depart from the patterns Auto-sklearn established.
You now possess deep understanding of Auto-sklearn's architecture—from SMAC Bayesian optimization to meta-learning warm-starting to greedy ensemble construction. You can configure Auto-sklearn for production workloads and evaluate its suitability against alternatives. Next, we explore AutoGluon, which takes a radically different approach to automated machine learning.