A typical neural network has dozens of hyperparameters: learning rate, batch size, weight decay, dropout rates, layer widths, activation functions, optimizer parameters, and more. A gradient boosting model might have: number of trees, max depth, min samples per leaf, learning rate, subsample ratio, colsample ratio, L1/L2 regularization, and various tree-building parameters.
But here's an empirical fact that experienced practitioners know: not all hyperparameters matter equally. Some—like learning rate for neural networks—can change performance by orders of magnitude. Others—like the choice between GELU and ReLU activation—often make little difference. And some matter intensely in certain regimes but are irrelevant in others.
Understanding hyperparameter importance helps you allocate your optimization budget wisely. Why spend 90% of your HPO budget exploring dropout rates when learning rate accounts for 90% of the performance variance?
By the end of this page, you will:

• Understand formal definitions of hyperparameter importance
• Know methods for measuring importance from HPO data
• Recognize which hyperparameters typically matter most for different model families
• Apply importance analysis to improve your HPO strategy
Before we can measure importance, we need to define what it means. There are several valid, complementary definitions:
1. Variance-Based Importance
The importance of hyperparameter $\lambda_i$ is the fraction of output variance it explains:
$$I_i = \frac{\text{Var}[\mathbb{E}[f(\lambda) | \lambda_i]]}{\text{Var}[f(\lambda)]}$$
This is the first-order Sobol' index. It measures how much of the performance variance can be attributed to varying $\lambda_i$ alone.
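To make this concrete, here is a small Monte Carlo sketch that estimates the first-order index for a made-up two-variable response surface (not data from this page) in which one input is three times as influential as the other:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2):
    # Toy response surface: x1's coefficient is 3x that of x2
    return 3.0 * x1 + 1.0 * x2

# S1 = Var[E[f | x1]] / Var[f], estimated by nested sampling:
# for each fixed x1, average f over draws of x2, then take the
# variance of those conditional means.
n_outer, n_inner = 2000, 200
x1_vals = rng.uniform(0, 1, n_outer)
cond_means = np.array([
    f(v, rng.uniform(0, 1, n_inner)).mean() for v in x1_vals
])
total_var = f(rng.uniform(0, 1, 100_000), rng.uniform(0, 1, 100_000)).var()
S1 = cond_means.var() / total_var
```

With coefficients 3 and 1 and independent uniform inputs, the analytic value is 9/(9+1) = 0.9, and the estimate should land close to that.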
2. Marginal Contribution
How much does optimizing $\lambda_i$ improve over using its default value, given other hyperparameters are optimized:
$$I_i = \mathbb{E}_{\lambda_{-i}}\left[f(\lambda_i^{\text{default}}, \lambda_{-i})\right] - \mathbb{E}_{\lambda_{-i}}\left[f^*(\lambda_i, \lambda_{-i})\right]$$
where $\lambda_{-i}$ denotes all hyperparameters except $\lambda_i$, and $f^*(\lambda_i, \lambda_{-i})$ is the performance with $\lambda_i$ tuned to its best value. With a lower-is-better metric, this difference is non-negative: it is the loss you leave on the table by keeping $\lambda_i$ at its default.
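As a toy numeric illustration (the loss values below are invented, not measured):

```python
import numpy as np

# Hypothetical validation losses, each averaged over draws of λ_{-i}:
loss_with_default_lr = np.array([0.55, 0.60, 0.52])  # λ_i held at its default
loss_with_tuned_lr = np.array([0.30, 0.33, 0.29])    # λ_i optimized per setting

# Marginal contribution: average improvement from tuning this one knob
importance = loss_with_default_lr.mean() - loss_with_tuned_lr.mean()
# Tuning this hyperparameter is worth about 0.25 loss on average
```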
3. Sensitivity/Gradient Magnitude
How much does performance change per unit change in $\lambda_i$:
$$I_i = \mathbb{E}\left[\left|\frac{\partial f}{\partial \lambda_i}\right|\right]$$
For log-scaled hyperparameters, this uses the log derivative.
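A minimal sketch of this estimator, assuming a hypothetical loss curve that is quadratic in log10(lr) (an invented stand-in for a real training run):

```python
import numpy as np

def val_loss(lr):
    # Hypothetical loss surface: quadratic in log10(lr), minimum at lr = 1e-3
    return 0.5 * (np.log10(lr) + 3.0) ** 2 + 0.1

def log_sensitivity(f, lam, eps=1e-3):
    # Central finite difference of f with respect to log(lambda)
    return (f(lam * np.exp(eps)) - f(lam * np.exp(-eps))) / (2 * eps)

# Average |df / d log(lr)| over a log-uniform sample of learning rates
rng = np.random.default_rng(0)
lrs = 10 ** rng.uniform(-5, -1, 1000)
sensitivity = np.mean(np.abs([log_sensitivity(val_loss, lr) for lr in lrs]))
# For this curve the analytic mean is 1/ln(10) ≈ 0.43
```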
| Definition | Interpretation | Pros | Cons |
|---|---|---|---|
| Variance-based | Fraction of variance explained | Well-founded, comparable across params | Expensive to estimate accurately |
| Marginal contribution | Value of tuning this param | Directly actionable | Depends on default chosen |
| Sensitivity | Local rate of change | Identifies sensitive regimes | Doesn't capture global structure |
| Ablation importance | Impact of removing this param | Easy to compute from HPO data | May miss interactions |
The definitions above focus on marginal (individual) importance. But hyperparameters often interact: learning rate × batch size, or regularization × model capacity. Total importance should account for:

• Main effects: the importance of each hyperparameter individually
• Interaction effects: the importance of hyperparameter pairs, triples, and so on
Variance-based methods can compute higher-order Sobol' indices for interactions.
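Interactions can dominate entirely. The toy function below (made up for illustration) has essentially zero first-order importance for both inputs; nearly all of the variance sits in the pairwise interaction term:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x1, x2):
    # A purely interactive function: neither input matters alone
    return (x1 - 0.5) * (x2 - 0.5)

def first_order_index(which, n_outer=2000, n_inner=500):
    # Monte Carlo estimate of Var[E[f | x_which]] / Var[f]
    vals = rng.uniform(0, 1, n_outer)
    if which == 0:
        cond_means = [f(v, rng.uniform(0, 1, n_inner)).mean() for v in vals]
    else:
        cond_means = [f(rng.uniform(0, 1, n_inner), v).mean() for v in vals]
    total = f(rng.uniform(0, 1, 100_000), rng.uniform(0, 1, 100_000)).var()
    return np.var(cond_means) / total

S1, S2 = first_order_index(0), first_order_index(1)
S12 = 1.0 - S1 - S2  # remaining variance is the pairwise interaction
```

A marginal-only analysis would call both inputs unimportant here, which is exactly why higher-order indices are worth checking.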
After running HPO, you have a dataset of (configuration, performance) pairs. Several methods can extract importance from this data:
Functional ANOVA (fANOVA) decomposes the response surface into additive components and measures each component's variance.

Method: fit a surrogate model (typically a random forest) to the (configuration, performance) pairs, then decompose its predictions into main-effect and interaction terms and compute how much variance each term contributes.

Outputs: a normalized importance score per hyperparameter, and optionally per pair of hyperparameters for interaction effects.

Advantages: works directly on the trial data you already have from HPO, so no extra training runs are needed; higher-order terms can capture interactions.

Limitations: scores are only as good as the surrogate's fit; higher-order terms become expensive to compute; and importance is always relative to the search space you defined.
```python
"""Hyperparameter Importance Analysis

Methods for quantifying which hyperparameters matter most."""

import numpy as np
from typing import Dict, List, Tuple, Any
from dataclasses import dataclass
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance


@dataclass
class HPOResult:
    """Result from a single hyperparameter evaluation."""
    config: Dict[str, Any]
    performance: float  # Lower is better (validation loss)


class HyperparameterImportanceAnalyzer:
    """
    Analyze hyperparameter importance from HPO trial data.

    Implements multiple importance methods:
    1. Random Forest feature importance (fast)
    2. Permutation importance (more reliable)
    3. Variance-based (fANOVA-style) importance
    4. Ablation analysis
    """

    def __init__(self, results: List[HPOResult], param_names: List[str]):
        """
        Args:
            results: List of (config, performance) pairs from HPO
            param_names: Names of hyperparameters to analyze
        """
        self.results = results
        self.param_names = param_names
        # Convert to arrays for analysis
        self.X, self.y = self._prepare_data()
        # Fit surrogate model
        self.surrogate = RandomForestRegressor(
            n_estimators=100, random_state=42, n_jobs=-1
        )
        self.surrogate.fit(self.X, self.y)

    def _prepare_data(self) -> Tuple[np.ndarray, np.ndarray]:
        """Convert HPO results to numpy arrays."""
        n = len(self.results)
        d = len(self.param_names)
        X = np.zeros((n, d))
        y = np.zeros(n)
        for i, result in enumerate(self.results):
            for j, name in enumerate(self.param_names):
                value = result.config.get(name)
                # Handle categorical: convert to numeric
                if isinstance(value, str):
                    # Simple: hash to numeric (better: use proper encoding)
                    value = hash(value) % 1000
                elif isinstance(value, bool):
                    value = 1.0 if value else 0.0
                elif value is None:
                    value = np.nan
                X[i, j] = value
            y[i] = result.performance
        return X, y

    def random_forest_importance(self) -> Dict[str, float]:
        """
        Compute importance using Random Forest's MDI
        (Mean Decrease in Impurity).

        Fast but can be biased toward high-cardinality features.
        """
        importances = self.surrogate.feature_importances_
        return {name: float(imp)
                for name, imp in zip(self.param_names, importances)}

    def permutation_importance(self, n_repeats: int = 10) -> Dict[str, float]:
        """
        Compute importance via permutation.

        More reliable than MDI but slower. Measures how much
        shuffling each feature hurts prediction.
        """
        perm_imp = permutation_importance(
            self.surrogate, self.X, self.y,
            n_repeats=n_repeats, random_state=42
        )
        return {name: float(imp)
                for name, imp in zip(self.param_names,
                                     perm_imp.importances_mean)}

    def variance_based_importance(self, n_samples: int = 1000) -> Dict[str, float]:
        """
        Estimate first-order Sobol' indices via Monte Carlo.

        For each hyperparameter, estimate:
            Var[E[f(λ) | λ_i]] / Var[f(λ)]

        This measures the fraction of variance explained by each
        hyperparameter.
        """
        # Sample random configurations from the empirical distributions
        X_samples = np.zeros((n_samples, len(self.param_names)))
        for j in range(len(self.param_names)):
            col = self.X[:, j]
            X_samples[:, j] = np.random.choice(col[~np.isnan(col)],
                                               size=n_samples)

        # Predict performances
        y_pred = self.surrogate.predict(X_samples)
        total_var = np.var(y_pred)
        if total_var < 1e-10:
            return {name: 1.0 / len(self.param_names)
                    for name in self.param_names}

        importances = {}
        for j, name in enumerate(self.param_names):
            # Estimate E[f | λ_j] for each unique value of λ_j
            unique_vals = np.unique(X_samples[:, j])
            conditional_means = []
            for val in unique_vals:
                mask = X_samples[:, j] == val
                if mask.sum() > 0:
                    conditional_means.append(y_pred[mask].mean())
            # Variance of conditional means approximates Var[E[f|λ_j]]
            var_conditional = (np.var(conditional_means)
                               if len(conditional_means) > 1 else 0)
            importances[name] = var_conditional / total_var

        # Normalize
        total = sum(importances.values())
        if total > 0:
            importances = {k: v / total for k, v in importances.items()}
        return importances

    def ablation_importance(self) -> Dict[str, float]:
        """
        Compute importance via ablation from the best configuration.

        For each hyperparameter:
        1. Take the best config
        2. Replace that hyperparameter with median/mode
        3. Predict performance drop
        """
        # Find the best configuration
        best_idx = np.argmin(self.y)
        best_config = self.X[best_idx].copy()
        best_perf = self.surrogate.predict(best_config.reshape(1, -1))[0]

        importances = {}
        for j, name in enumerate(self.param_names):
            # Create ablated config
            ablated = best_config.copy()
            # Replace with median (for continuous) or mode (for discrete)
            col = self.X[:, j]
            col_clean = col[~np.isnan(col)]
            ablated[j] = np.median(col_clean)
            # Predict performance with ablation
            ablated_perf = self.surrogate.predict(ablated.reshape(1, -1))[0]
            # Importance = performance degradation
            importances[name] = max(0, ablated_perf - best_perf)

        # Normalize
        total = sum(importances.values())
        if total > 0:
            importances = {k: v / total for k, v in importances.items()}
        return importances

    def marginal_effect(self, param_name: str,
                        n_points: int = 50) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute marginal effect curve for a hyperparameter.

        Shows expected performance as a function of this hyperparameter,
        averaging over other hyperparameters.

        Returns:
            (param_values, expected_performance) arrays for plotting
        """
        j = self.param_names.index(param_name)
        col = self.X[:, j]
        col_clean = col[~np.isnan(col)]
        param_values = np.linspace(col_clean.min(), col_clean.max(), n_points)
        expected_perf = np.zeros(n_points)

        # For each param value, average over other hyperparameters
        n_samples = min(100, len(self.X))
        sample_indices = np.random.choice(len(self.X), n_samples,
                                          replace=False)
        for i, val in enumerate(param_values):
            X_test = self.X[sample_indices].copy()
            X_test[:, j] = val
            predictions = self.surrogate.predict(X_test)
            expected_perf[i] = predictions.mean()
        return param_values, expected_perf

    def full_report(self) -> Dict[str, Dict[str, float]]:
        """Generate comprehensive importance report using all methods."""
        return {
            'random_forest_mdi': self.random_forest_importance(),
            'permutation': self.permutation_importance(),
            'variance_based': self.variance_based_importance(),
            'ablation': self.ablation_importance(),
        }


def print_importance_report(analyzer: HyperparameterImportanceAnalyzer):
    """Pretty-print the importance analysis."""
    report = analyzer.full_report()
    print("Hyperparameter Importance Analysis")
    print("=" * 70)
    # Header
    print(f"{'Parameter':<25} {'RF-MDI':>10} {'Permutation':>12} "
          f"{'Variance':>10} {'Ablation':>10}")
    print("-" * 70)
    for name in analyzer.param_names:
        rf = report['random_forest_mdi'].get(name, 0)
        perm = report['permutation'].get(name, 0)
        var = report['variance_based'].get(name, 0)
        abl = report['ablation'].get(name, 0)
        print(f"{name:<25} {rf:>10.3f} {perm:>12.3f} {var:>10.3f} {abl:>10.3f}")

    # Summary
    print("Most important hyperparameters (by variance-based):")
    sorted_params = sorted(report['variance_based'].items(),
                           key=lambda x: x[1], reverse=True)
    for name, imp in sorted_params[:5]:
        print(f"  {name}: {imp:.3f}")


# Example usage with synthetic HPO data
if __name__ == "__main__":
    np.random.seed(42)
    # Simulate HPO results for a neural network
    param_names = ['learning_rate', 'weight_decay', 'dropout',
                   'hidden_units', 'num_layers', 'batch_size']
    results = []
    for _ in range(100):
        config = {
            'learning_rate': 10 ** np.random.uniform(-5, -1),
            'weight_decay': 10 ** np.random.uniform(-6, -2),
            'dropout': np.random.uniform(0, 0.5),
            'hidden_units': int(2 ** np.random.uniform(5, 9)),
            'num_layers': np.random.randint(1, 6),
            'batch_size': int(np.random.choice([32, 64, 128, 256])),
        }
        # Simulated performance: learning_rate matters most
        log_lr = np.log10(config['learning_rate'])
        optimal_log_lr = -3  # lr=0.001 is optimal
        performance = (
            (log_lr - optimal_log_lr) ** 2 * 0.5   # LR dominates
            + np.log10(config['weight_decay'] + 1e-6) * 0.02
            + config['dropout'] * 0.1
            + 0.01 * np.random.randn()
        )
        results.append(HPOResult(config=config, performance=performance))

    analyzer = HyperparameterImportanceAnalyzer(results, param_names)
    print_importance_report(analyzer)
```

Extensive empirical studies have identified consistent patterns in hyperparameter importance across model families. These findings can guide your HPO strategy before running any experiments.
| Model Family | Critical | Important | Usually Minor |
|---|---|---|---|
| Neural Networks (SGD) | Learning rate | Batch size, Weight decay, Momentum | Activation, Init scheme |
| Neural Networks (Adam) | Learning rate | Weight decay (AdamW), Architecture | β₁, β₂, ε |
| Gradient Boosting | Learning rate, n_estimators | max_depth, min_child_weight | L1/L2 reg, colsample |
| Random Forest | n_estimators, max_depth | max_features, min_samples_split | bootstrap, criterion |
| SVM (RBF) | C, γ | (these two dominate) | All others |
| Transformers | Learning rate, Warmup | Weight decay, Batch size | Attention head count (if large enough) |
Key Insight: The Learning Rate Dominance
Across virtually all neural network architectures and optimizers, learning rate is the most important hyperparameter. Benchmark studies that compute importance scores across many tasks consistently rank it first by variance explained.

Implications: tune learning rate first, search it on a log scale over a wide range, and only once it is roughly right spend budget on secondary hyperparameters.
Regularization hyperparameters (dropout, weight decay, L1/L2) become more important as model capacity increases. For small models on ample data, regularization barely matters. For large models on limited data, it's critical. Assess your regime before deciding where to focus tuning effort.
Hyperparameter importance analysis isn't just academic—it directly informs practical HPO decisions:
Staged/Hierarchical HPO
A powerful strategy based on importance:

Stage 1: Critical hyperparameters. Search the dominant knobs (for neural networks, learning rate) over wide ranges while holding everything else at sensible defaults.

Stage 2: Important hyperparameters. Narrow the critical ranges around the Stage 1 best, and open the search to the next tier (regularization, depth).

Stage 3: Fine-tuning. Tighten all ranges around the Stage 2 best and sweep the remaining minor parameters.
This staged approach can reduce total evaluations by 10-100× compared to searching all hyperparameters simultaneously.
```python
"""Staged HPO Based on Hyperparameter Importance

Demonstrates a hierarchical optimization strategy that
focuses budget on important hyperparameters first."""

import optuna
from typing import Dict, Any, Tuple


def staged_neural_network_hpo(
    train_fn,  # Function that trains and returns validation loss
    n_stage1: int = 20,
    n_stage2: int = 30,
    n_stage3: int = 50,
) -> Tuple[Dict[str, Any], float]:
    """
    Three-stage hierarchical HPO for neural networks.

    Stage 1: Learning rate only (most important)
    Stage 2: Learning rate + regularization
    Stage 3: Fine-tune all parameters

    Total budget: n_stage1 + n_stage2 + n_stage3

    Returns:
        (best_config, best_performance)
    """
    # ==========================================
    # Stage 1: Critical hyperparameters only
    # ==========================================
    print("=== Stage 1: Learning Rate Search ===")

    def stage1_objective(trial):
        config = {
            # CRITICAL: Full range search
            'learning_rate': trial.suggest_float('lr', 1e-5, 1e-1, log=True),
            # DEFAULTS for other parameters
            'weight_decay': 1e-4,
            'dropout': 0.1,
            'hidden_units': 256,
            'num_layers': 3,
            'batch_size': 64,
        }
        return train_fn(config)

    study1 = optuna.create_study(direction='minimize')
    study1.optimize(stage1_objective, n_trials=n_stage1)
    best_lr = study1.best_params['lr']
    print(f"Best learning rate: {best_lr:.6f}")

    # ==========================================
    # Stage 2: Critical + Important
    # ==========================================
    print("=== Stage 2: Regularization Search ===")
    # Narrow learning rate range around Stage 1 best
    lr_low = best_lr / 5
    lr_high = best_lr * 5

    def stage2_objective(trial):
        config = {
            # Narrow LR range
            'learning_rate': trial.suggest_float('lr', lr_low, lr_high,
                                                 log=True),
            # IMPORTANT: Now search these
            'weight_decay': trial.suggest_float('wd', 1e-6, 1e-2, log=True),
            'dropout': trial.suggest_float('dropout', 0.0, 0.5),
            # Still at defaults
            'hidden_units': 256,
            'num_layers': trial.suggest_int('layers', 2, 5),  # Add depth search
            'batch_size': 64,
        }
        return train_fn(config)

    study2 = optuna.create_study(direction='minimize')
    study2.optimize(stage2_objective, n_trials=n_stage2)
    best_stage2 = study2.best_params
    print(f"Best Stage 2: LR={best_stage2['lr']:.6f}, "
          f"WD={best_stage2['wd']:.6f}, "
          f"Dropout={best_stage2['dropout']:.3f}")

    # ==========================================
    # Stage 3: Fine-tuning all
    # ==========================================
    print("=== Stage 3: Fine-Tuning ===")

    # Very narrow ranges around Stage 2 best
    def stage3_objective(trial):
        lr = best_stage2['lr']
        wd = best_stage2['wd']
        do = best_stage2['dropout']
        config = {
            # Fine-tune LR
            'learning_rate': trial.suggest_float('lr', lr * 0.5, lr * 2,
                                                 log=True),
            # Fine-tune regularization
            'weight_decay': trial.suggest_float('wd', wd * 0.2, wd * 5,
                                                log=True),
            'dropout': trial.suggest_float('dropout',
                                           max(0, do - 0.1),
                                           min(0.5, do + 0.1)),
            # Now search architecture details
            'hidden_units': trial.suggest_int('units', 128, 512, log=True),
            'num_layers': best_stage2['layers'],  # Fix at Stage 2 best
            'batch_size': trial.suggest_categorical('bs', [32, 64, 128]),
        }
        return train_fn(config)

    study3 = optuna.create_study(direction='minimize')
    study3.optimize(stage3_objective, n_trials=n_stage3)

    # Combine best parameters
    best_config = {
        'learning_rate': study3.best_params['lr'],
        'weight_decay': study3.best_params['wd'],
        'dropout': study3.best_params['dropout'],
        'hidden_units': study3.best_params['units'],
        'num_layers': best_stage2['layers'],
        'batch_size': study3.best_params['bs'],
    }
    print("=== Final Best Config ===")
    for k, v in best_config.items():
        print(f"  {k}: {v}")
    print(f"  Performance: {study3.best_value:.6f}")
    return best_config, study3.best_value


# Example demonstrating budget efficiency
def compare_strategies():
    """
    Compare staged vs flat HPO.

    Flat: 100 trials searching all hyperparameters
    Staged: 20 + 30 + 50 = 100 trials in stages

    Staged typically reaches better results faster by focusing
    early budget on important hyperparameters.
    """
    print("Staged HPO demonstrates:")
    print("1. Early trials find good learning rate quickly")
    print("2. Middle trials explore regularization with good LR")
    print("3. Final trials fine-tune with high-quality starting point")
    print("Flat search often wastes trials on bad LR + good regularization")
    print("combinations that never had a chance.")
```

A nuanced understanding recognizes that hyperparameter importance isn't static; it shifts with the problem regime:
Dataset Size: with abundant data, regularization hyperparameters typically fade in importance; with scarce data they can rival learning rate.

Model Capacity: the larger the model relative to the data, the more regularization and early-stopping settings tend to matter.

Training Budget: under tight budgets, learning rate and its schedule (warmup, decay) dominate, because there is no time to recover from a poor choice.

Task Difficulty: on easy tasks many configurations reach similar performance, compressing importance scores; harder tasks spread them out.
Importance rankings from benchmarks may not apply to your specific problem. The general patterns (LR matters most for NNs) usually hold, but the relative importance of secondary hyperparameters can shift. When in doubt, run your own importance analysis after initial HPO.
Interaction Effects That Change Importance
Some hyperparameters matter only in combination:
Learning rate × Batch size: With SGD, optimal LR scales with batch size (linear scaling rule). This interaction makes batch size effectively invisible if LR is tuned for each batch size.
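The linear scaling rule can be written as a one-liner (a heuristic, not a guarantee; it holds best for SGD at moderate batch sizes):

```python
def scaled_lr(base_lr: float, base_batch: int, batch_size: int) -> float:
    """Linear scaling rule: keep the ratio lr / batch_size constant."""
    return base_lr * batch_size / base_batch

# If lr=0.1 works at batch size 256, quadrupling the batch suggests:
print(scaled_lr(0.1, 256, 1024))  # → 0.4
```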
Depth × Width: Very deep networks need specific initialization and normalization. One hyperparameter's importance depends on the other's value.
Regularization × Capacity: Dropout matters more in overparameterized models. Weight decay matters more when capacity exceeds data complexity.
When analyzing importance, consider running pairwise analyses to detect significant interactions.
Here's a practical workflow for using hyperparameter importance in your projects: run a broad initial HPO sweep, compute importance scores from the resulting trials, drop or narrow the low-importance dimensions, and re-run with the freed budget concentrated on the parameters that matter. Several tools support the analysis step directly:
Optuna: Built-in optuna.importance.get_param_importances() using fANOVA
CAVE: Visualization tool for SMAC with comprehensive importance analysis
Weights & Biases: Importance visualization in sweeps dashboard
These tools make importance analysis a few lines of code after your HPO run.
Hyperparameter importance analysis lets you work smarter, not harder. By understanding which knobs actually matter, you can focus limited optimization budget where it counts most.
Module Complete
You've now completed the Hyperparameter Fundamentals module. You understand:

• The difference between parameters and hyperparameters
• How to design and scale search spaces
• The main hyperparameter types and how they behave under search
• How to measure hyperparameter importance and use it to direct your budget
With this foundation, you're ready to explore specific HPO algorithms—starting with Grid Search in the next module.
You now have a comprehensive understanding of hyperparameter fundamentals. This conceptual foundation—parameters vs hyperparameters, search space design, hyperparameter types, and importance—provides the basis for all HPO techniques you'll learn in subsequent modules.