Every machine learning model operates at two distinct levels of abstraction. At the first level, the model learns parameters from data—the weights in a neural network, the coefficients in a linear regression, or the split points in a decision tree. At the second level, we choose hyperparameters that govern how that learning happens—the learning rate, the number of hidden layers, or the tree depth.
This distinction between parameters and hyperparameters is fundamental to understanding modern machine learning. Yet it's often glossed over in introductory treatments, leading to confusion about what we're actually optimizing and how. In this page, we'll develop a rigorous understanding of this distinction—one that will serve as the foundation for all hyperparameter optimization techniques.
By the end of this page, you will:

- Precisely distinguish model parameters from hyperparameters
- Understand why this distinction creates a bi-level optimization problem
- Recognize hyperparameters across different model families
- Appreciate the fundamental challenge that hyperparameter optimization addresses
Let's establish precise definitions that will guide our thinking throughout this chapter:
Model Parameters are the internal variables that a model learns from the training data. They define the mapping from inputs to outputs that the model has discovered. Parameters exist within the model's hypothesis space and are optimized directly by the learning algorithm.
Hyperparameters are configuration settings that define how the learning process works. They exist outside the model and cannot be learned from the training data directly. Hyperparameters determine the hypothesis space itself, the learning dynamics, or the regularization strength.
The key insight is that parameters are learned; hyperparameters are chosen. This creates a hierarchical optimization problem: we must choose hyperparameters before learning can begin, but we can only evaluate hyperparameter choices after learning completes.
| Characteristic | Model Parameters | Hyperparameters |
|---|---|---|
| Definition | Internal model variables | External configuration settings |
| Determined by | Learning algorithm (gradient descent, etc.) | Practitioner or optimization algorithm |
| Learned from | Training data | Cannot be learned from data directly |
| When set | During training | Before training begins |
| Evaluation requires | Forward pass through model | Complete training run + validation |
| Scale | Can be millions to billions | Typically dozens to hundreds |
| Gradient signal | Direct gradient from loss | No direct gradient (typically) |
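
To make the table concrete, here is a minimal sketch using scikit-learn's ridge regression (scikit-learn and the synthetic data are assumptions of this example, not something the rest of the page depends on). The hyperparameter is passed to the constructor before any data is seen; the parameters exist only as fitted attributes after training.

```python
"""Sketch: hyperparameters are chosen up front, parameters appear after fit()."""
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# Hyperparameter: chosen by the practitioner before any learning happens.
model = Ridge(alpha=1.0)

# Parameters: produced by the learning algorithm from the training data.
model.fit(X_train, y_train)
print("Learned coefficients (parameters):", model.coef_)
print("Learned intercept (parameter):    ", model.intercept_)
```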
From a meta-learning perspective, hyperparameters encode our prior beliefs about what learning processes work well. When we choose a learning rate of 0.001, we're expressing a belief that updates should be small. When we set max_depth=10, we're limiting model complexity based on prior experience. Hyperparameter optimization can be viewed as learning these priors from data.
To reason precisely about hyperparameter optimization, we need a formal framework. Let's develop the mathematical notation that underlies all HPO methods.
Consider a learning algorithm $\mathcal{A}$ that takes training data $\mathcal{D}_{\text{train}}$ and produces a model with parameters $\theta$. The algorithm is parameterized by hyperparameters $\lambda \in \Lambda$, where $\Lambda$ is the hyperparameter space:
$$\theta^*(\lambda) = \mathcal{A}(\mathcal{D}_{\text{train}}, \lambda)$$
The learned parameters $\theta^*$ depend on our hyperparameter choice $\lambda$. Different hyperparameters lead to different learned models.
The inner optimization (training) finds the best parameters for fixed hyperparameters:
$$\theta^*(\lambda) = \arg\min_\theta \mathcal{L}_{\text{train}}(\theta, \lambda)$$
The outer optimization (hyperparameter optimization) finds the best hyperparameters based on validation performance:
$$\lambda^* = \arg\min_\lambda \mathcal{L}_{\text{val}}(\theta^*(\lambda))$$
This formulation reveals why HPO is fundamentally difficult: the outer objective depends on the solution to the inner problem. Evaluating a single hyperparameter configuration requires solving an entire training problem. With modern deep learning models, a single evaluation can take hours to days.
The Response Surface
The function $f(\lambda) = \mathcal{L}_{\text{val}}(\theta^*(\lambda))$ that maps hyperparameters to validation performance is called the response surface or objective landscape. This surface has several challenging properties:
Non-differentiable: We typically cannot compute $\frac{\partial \mathcal{L}_{\text{val}}}{\partial \lambda}$ directly because training is a complex, discrete procedure
Expensive to evaluate: Each point on the surface requires a full training run
Noisy: Stochastic elements in training (mini-batch sampling, initialization) make evaluations noisy
Non-convex: The surface typically has multiple local minima and complex structure
High-dimensional: Many hyperparameters create a high-dimensional search space
These properties explain why naive hyperparameter optimization (random guessing, manual tuning) often performs poorly, and why specialized optimization techniques have been developed.
"""Bi-level Optimization Framework for Hyperparameter Optimization This illustrates the mathematical structure of HPO as a bi-level problem."""import numpy as npfrom typing import Callable, Dict, Any, Tuple class BiLevelOptimizer: """ Demonstrates the bi-level structure of hyperparameter optimization. Outer level: Optimize hyperparameters λ Inner level: Optimize model parameters θ given λ """ def __init__( self, inner_optimizer: Callable[[Dict, np.ndarray], np.ndarray], train_loss: Callable[[np.ndarray, Dict], float], val_loss: Callable[[np.ndarray], float], lambda_space: Dict[str, Tuple[float, float]], ): """ Args: inner_optimizer: A(D_train, λ) → θ* - training algorithm train_loss: L_train(θ, λ) - training objective val_loss: L_val(θ) - validation objective lambda_space: Hyperparameter bounds {name: (low, high)} """ self.inner_optimizer = inner_optimizer self.train_loss = train_loss self.val_loss = val_loss self.lambda_space = lambda_space # Track all evaluations (expensive, so we cache) self.evaluation_history = [] def evaluate_hyperparameters( self, lambda_config: Dict[str, float], train_data: np.ndarray, ) -> float: """ Evaluate a single hyperparameter configuration. This is the EXPENSIVE step in HPO: 1. Run complete inner optimization (training) 2. Evaluate learned model on validation set Args: lambda_config: Hyperparameter values train_data: Training dataset Returns: Validation loss for this hyperparameter configuration """ # Inner optimization: θ*(λ) = A(D_train, λ) # This is where the computational cost lives theta_star = self.inner_optimizer(lambda_config, train_data) # Outer objective evaluation: L_val(θ*(λ)) val_performance = self.val_loss(theta_star) # Cache the result self.evaluation_history.append({ 'lambda': lambda_config.copy(), 'theta': theta_star.copy(), 'val_loss': val_performance, }) return val_performance def response_surface_at(self, lambda_config: Dict) -> float: """ Evaluate the response surface f(λ) = L_val(θ*(λ)). The response surface is what we're trying to minimize, but each evaluation requires a full training run. """ # Check cache first (important optimization) for entry in self.evaluation_history: if entry['lambda'] == lambda_config: return entry['val_loss'] # Cache miss: must retrain raise ValueError("Configuration not yet evaluated - requires training") def estimate_gradient( self, lambda_config: Dict, epsilon: float = 1e-3, ) -> Dict[str, float]: """ Estimate gradient of response surface via finite differences. This shows why gradient-based HPO is expensive: Each dimension requires 2 full training runs! For d hyperparameters: 2d training runs per gradient estimate. """ gradients = {} for name in lambda_config: # Evaluate f(λ + ε*e_i) lambda_plus = lambda_config.copy() lambda_plus[name] += epsilon f_plus = self.evaluate_hyperparameters(lambda_plus, None) # Evaluate f(λ - ε*e_i) lambda_minus = lambda_config.copy() lambda_minus[name] -= epsilon f_minus = self.evaluate_hyperparameters(lambda_minus, None) # Central difference approximation gradients[name] = (f_plus - f_minus) / (2 * epsilon) return gradients # Example: Linear Regression with L2 Regularizationdef linear_regression_example(): """ Demonstrates parameters vs hyperparameters in ridge regression. 
Parameters (θ): Regression coefficients w, bias b Hyperparameters (λ): Regularization strength α Inner problem: min_θ ||Xθ - y||² + α||θ||² Outer problem: min_α L_val(θ*(α)) """ np.random.seed(42) # Generate synthetic data n_samples, n_features = 100, 10 X_train = np.random.randn(n_samples, n_features) X_val = np.random.randn(50, n_features) true_weights = np.random.randn(n_features) y_train = X_train @ true_weights + 0.1 * np.random.randn(n_samples) y_val = X_val @ true_weights + 0.1 * np.random.randn(50) def inner_optimizer(config: Dict, X: np.ndarray) -> np.ndarray: """ Solve ridge regression: θ* = (X'X + αI)^(-1) X'y This is the inner optimization - finding optimal θ for given α. """ alpha = config['regularization_strength'] # Closed-form solution for ridge regression XtX = X_train.T @ X_train Xty = X_train.T @ y_train theta = np.linalg.solve(XtX + alpha * np.eye(n_features), Xty) return theta def val_loss(theta: np.ndarray) -> float: """Validation MSE - what we ultimately want to minimize.""" predictions = X_val @ theta return np.mean((predictions - y_val) ** 2) # Explore the response surface print("Exploring response surface: Validation Loss vs Regularization") print("=" * 60) alphas = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0] for alpha in alphas: config = {'regularization_strength': alpha} theta_star = inner_optimizer(config, X_train) val_mse = val_loss(theta_star) print(f"α = {alpha:>8.4f} → θ* norm = {np.linalg.norm(theta_star):>8.4f}" f" → Val MSE = {val_mse:.6f}") # Find optimal α best_alpha = min(alphas, key=lambda a: val_loss( inner_optimizer({'regularization_strength': a}, X_train) )) print(f"\nOptimal regularization: α* = {best_alpha}") print("\nNote: Each row above required solving a complete optimization problem!") if __name__ == "__main__": linear_regression_example()Different model families have different hyperparameters, but they serve similar conceptual roles. Let's survey the hyperparameter landscape across common ML models, categorizing them by their function.
Linear and Logistic Regression
Linear models have few hyperparameters, making them an excellent starting point for understanding the concept.
Parameters (Learned): the weight vector $w$ (one coefficient per feature) and the intercept (bias) $b$, both fit directly to the training data.
Hyperparameters (Chosen): the regularization strength $\alpha$ and the type of penalty (L1 vs. L2), both set before fitting begins.
As α → 0, the model approaches unregularized least squares (high variance, low bias). As α → ∞, coefficients shrink toward zero (low variance, high bias). The optimal α balances these extremes for your specific data.
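
The effect of $\alpha$ on the learned parameters can be seen directly. The sketch below assumes scikit-learn and an arbitrary synthetic dataset; the specific values are illustrative, not recommendations.

```python
"""Sketch: how the regularization hyperparameter α reshapes the learned parameters."""
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

for alpha in (1e-4, 1e-2, 1.0, 100.0, 1e4):
    coef = Ridge(alpha=alpha).fit(X, y).coef_   # parameters learned for this α
    print(f"α = {alpha:>8g} -> ||w|| = {np.linalg.norm(coef):.4f}")
# As α grows, the coefficient norm shrinks toward zero: lower variance, higher bias.
```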
While the parameter/hyperparameter distinction is conceptually clean, real-world ML systems sometimes blur this boundary. Understanding these edge cases deepens our appreciation of the underlying concepts.
When in doubt, ask: "Can this be learned via gradient descent on the training loss?" If yes, it's a parameter. If it requires validation data or is set before training, it's a hyperparameter. If it's learned but requires validation, you're in meta-learning territory.
The Hypergradient Perspective
Recent research has developed hypergradient methods that compute gradients of the validation loss with respect to hyperparameters. This is achieved through techniques such as unrolled differentiation (backpropagating through the training procedure itself), implicit differentiation via the implicit function theorem, and finite-difference approximations of the response surface.
These methods begin to treat hyperparameters more like parameters, optimizing them via gradient descent. However, they don't eliminate the fundamental distinction—they just make the outer optimization more efficient.
"""Hypergradient: Gradients of Hyperparameters This illustrates how hypergradients can be computed, showing that evenwhen we can differentiate through training, the bi-level structure remains."""import torchimport torch.nn as nn def unrolled_training(model, train_loader, val_loader, lr, steps=100): """ Perform training while maintaining computation graph for hypergradient. The key insight: if we don't detach intermediate states, we can backpropagate through the entire training process. """ # Learning rate as a differentiable tensor learning_rate = torch.tensor(lr, requires_grad=True) # Create a fresh model copy for this training run model_copy = copy_model(model) optimizer_state = None for step in range(steps): # Forward pass (training) for x_batch, y_batch in train_loader: loss = compute_loss(model_copy, x_batch, y_batch) # Manual SGD step (to maintain graph) grads = torch.autograd.grad(loss, model_copy.parameters(), create_graph=True) # Update parameters (keeping in graph!) for param, grad in zip(model_copy.parameters(), grads): param.data = param - learning_rate * grad # Note: param.data breaks the graph # For true hypergradients, we'd need param = param - lr * grad # Now compute validation loss val_loss = compute_validation_loss(model_copy, val_loader) # This gradient exists because we maintained the computation graph # through all training steps hypergradient = torch.autograd.grad(val_loss, learning_rate) return val_loss, hypergradient def approximate_hypergradient(model, train_data, val_data, lambda_val, epsilon=0.01): """ Approximate hypergradient via finite differences. d(val_loss)/d(lambda) ≈ [val_loss(lambda + ε) - val_loss(lambda - ε)] / 2ε Note: This requires 2 complete training runs per hyperparameter dimension! """ # Training with lambda + epsilon model_plus = train_model(model, train_data, lambda_val + epsilon) val_loss_plus = evaluate(model_plus, val_data) # Training with lambda - epsilon model_minus = train_model(model, train_data, lambda_val - epsilon) val_loss_minus = evaluate(model_minus, val_data) # Finite difference approximation approx_hypergradient = (val_loss_plus - val_loss_minus) / (2 * epsilon) return approx_hypergradient # Key insight: Even with hypergradients, we still have:# 1. An inner optimization (training)# 2. An outer optimization (hyperparameter tuning)# # Hypergradients just make the outer optimization gradient-based# instead of black-box, but don't eliminate the bi-level structure.Understanding the parameter/hyperparameter distinction has direct implications for how you approach machine learning projects:
1. Budget Allocation
With a limited compute budget, you must decide how to allocate resources between training each candidate configuration for longer (more epochs, more data) and evaluating a larger number of hyperparameter configurations.
The optimal allocation depends on whether you're in the "undertrained" regime (more training helps) or "hyperparameter-limited" regime (better hyperparameters help more).
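
The trade-off can be made explicit with a back-of-the-envelope calculation. The numbers below are purely hypothetical illustrations, not recommendations.

```python
"""Sketch: splitting a fixed compute budget between configurations and training time."""
total_gpu_hours = 100  # hypothetical overall budget

for n_configs in (4, 10, 25, 50):
    hours_per_run = total_gpu_hours / n_configs
    print(f"{n_configs:>3} configurations x {hours_per_run:>5.1f} GPU-hours each")
# Fewer, longer runs suit the "undertrained" regime; many short runs suit
# the "hyperparameter-limited" regime.
```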
2. Evaluation Protocol Design
The distinction dictates your evaluation strategy: parameters are fit on the training set, hyperparameters are selected using a separate validation set, and final performance is reported on a held-out test set that influenced neither choice.
Violating this separation leads to selection bias—your reported performance will be optimistically biased.
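
Here is a minimal sketch of a protocol that respects this separation. Ridge regression, scikit-learn, and the synthetic data are stand-ins; the point is which split is used at which stage.

```python
"""Sketch: train/validation/test protocol for honest hyperparameter selection."""
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=300)

# Split once: train (fit parameters), validation (choose hyperparameters),
# test (report performance exactly once).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_alpha, best_val = None, np.inf
for alpha in (1e-3, 1e-2, 1e-1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)           # inner: learn θ
    val_mse = mean_squared_error(y_val, model.predict(X_val))  # outer: score λ
    if val_mse < best_val:
        best_alpha, best_val = alpha, val_mse

final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(f"alpha* = {best_alpha}, val MSE = {best_val:.4f}, test MSE = {test_mse:.4f}")
```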
3. Reproducibility
For results to be reproducible, you must report the final hyperparameter values, the search space and search procedure used to find them, the budget spent on the search, and any random seeds that affect training.
Without this information, others cannot replicate your results, and your reported performance may not be achievable.
4. Transfer and Generalization
Optimal hyperparameters often don't transfer: settings tuned on one dataset are rarely optimal on another, values found at small scale can stop working when the model or dataset grows, and changes in hardware or batch size shift the optimum as well.
This is why hyperparameter tuning is not a one-time cost—it's recurring as your data and infrastructure evolve.
Published ML results often underreport the computational cost of hyperparameter tuning. A paper might say 'training took 2 hours' but hide the 200 hours spent finding those hyperparameters. When estimating project timelines and costs, always account for HPO.
The history of hyperparameter optimization reflects the broader evolution of machine learning from an art to a science.
The Manual Era (Pre-2000s)
Early ML practitioners relied on intuition, rules of thumb, and extensive experimentation. Hyperparameter selection was treated as craft knowledge, passed down through papers, textbooks, and mentorship. This worked when models were simple and training was fast.
The Grid Search Era (2000s)
As models grew more complex, practitioners formalized hyperparameter search. Grid search became standard: define a grid of values, train a model for each combination, select the best. This was systematic but exponentially expensive.
The Random Search Revolution (2012)
Bergstra and Bengio's 2012 paper showed that random search often outperforms grid search with the same budget. This was counterintuitive but mathematically sound: random search explores more values of important hyperparameters.
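
The following toy sketch (not taken from the paper) illustrates the argument: when only one of two hyperparameters matters, random search tries more distinct values of the important one for the same budget, so it usually lands closer to the optimum.

```python
"""Sketch: grid vs. random search when only one hyperparameter matters."""
import numpy as np

rng = np.random.default_rng(0)

def objective(important, unimportant):
    # Toy response surface: validation loss depends almost entirely on `important`.
    return (important - 0.37) ** 2 + 1e-4 * unimportant

budget = 9

# Grid search: a 3 x 3 grid tries only 3 distinct values of the important axis.
grid_vals = np.linspace(0.0, 1.0, 3)
grid_results = [objective(a, b) for a in grid_vals for b in grid_vals]

# Random search: 9 samples try 9 distinct values of the important axis.
random_points = rng.uniform(0.0, 1.0, size=(budget, 2))
random_results = [objective(a, b) for a, b in random_points]

print(f"Best loss, grid search  : {min(grid_results):.5f}")
print(f"Best loss, random search: {min(random_results):.5f}")
```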
The Bayesian Optimization Era (2013+)
BayesOpt, SMAC, and TPE brought principled sequential optimization. By modeling the response surface, these methods could make intelligent decisions about where to evaluate next, dramatically reducing the number of required training runs.
The Modern Era (2018+)
Today's HPO landscape includes multi-fidelity methods such as Successive Halving and Hyperband, which use cheap partial training runs to discard poor configurations early; population-based approaches that adapt hyperparameters during training; gradient-based (hypergradient) methods; and integrated AutoML systems that automate the entire pipeline.
| Era | Method | Evaluations for 10D Space | Key Insight |
|---|---|---|---|
| Manual | Expert intuition | ~10-50 | Domain knowledge matters |
| Grid | Exhaustive grid | 3¹⁰ = 59,049 | Systematic but expensive |
| Random | Random sampling | ~100-1000 | Focus on important dimensions |
| Bayesian | Sequential optimization | ~50-200 | Model guidance reduces waste |
| Multi-fidelity | Early stopping | ~100-500 (partial) | Cheap evaluations via early stopping |
We've established the fundamental concepts that underpin all of hyperparameter optimization. Let's consolidate the key insights: parameters are learned from training data by the inner optimization, while hyperparameters are chosen before training and judged on validation data; this creates a bi-level problem whose outer objective, the response surface $f(\lambda) = \mathcal{L}_{\text{val}}(\theta^*(\lambda))$, is expensive, noisy, non-convex, and typically non-differentiable; and because every evaluation costs a full training run, HPO methods succeed by spending those evaluations wisely.
What's Next
With this foundation in place, we'll next examine how to define the search space for hyperparameter optimization. This seemingly simple task—specifying what values hyperparameters can take—has profound implications for HPO efficiency and success.
You now understand the fundamental distinction between model parameters and hyperparameters, and why this creates the bi-level optimization problem that HPO addresses. This conceptual foundation will inform everything that follows in this chapter.