Real-world machine learning systems are rarely optimized for a single metric. A recommendation system must balance relevance against diversity. A fraud detection model trades off precision against recall. A deployed model must achieve accuracy while meeting latency constraints and minimizing memory usage.
These objectives frequently conflict: improving one degrades another. A larger, more accurate model runs slower. A more aggressive classifier catches more fraud but generates more false alarms. There is no single 'best' hyperparameter configuration—instead, there exists a set of Pareto-optimal configurations, each representing a different trade-off among objectives.
Multi-Objective Hyperparameter Optimization (MOHPO) addresses this challenge by discovering the entire trade-off surface, enabling informed decisions about which compromises to accept. Rather than asking 'what is the best model?', MOHPO answers 'what are the best trade-offs available?'
By the end of this page, you will understand Pareto optimality and dominance, scalarization techniques for converting multi-objective problems to single-objective, evolutionary and Bayesian approaches to multi-objective optimization, and practical strategies for deploying MOHPO in production ML pipelines.
Problem Formulation:
A multi-objective optimization problem seeks to simultaneously optimize k objectives:
min_{λ ∈ Λ} F(λ) = [f₁(λ), f₂(λ), ..., fₖ(λ)]
where each fᵢ(λ) is an objective function (we assume minimization; maximization objectives are negated). Unlike single-objective optimization, there is no natural total ordering among solutions—we cannot say one is 'better' without specifying how to weigh the objectives.
Pareto Dominance:
The key concept in multi-objective optimization is dominance. Configuration λ₁ dominates λ₂ (written λ₁ ≺ λ₂) if and only if:
1. fᵢ(λ₁) ≤ fᵢ(λ₂) for every objective i (no worse anywhere), and
2. fⱼ(λ₁) < fⱼ(λ₂) for at least one objective j (strictly better somewhere).
If neither configuration dominates the other, they are incomparable (or non-dominated with respect to each other).
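As a concrete illustration, here is a minimal dominance check, a sketch assuming minimization and plain NumPy arrays (the helper name `dominates` is ours):

```python
import numpy as np

def dominates(f1, f2):
    """True if objective vector f1 dominates f2 (minimization):
    no worse in every objective and strictly better in at least one."""
    f1, f2 = np.asarray(f1), np.asarray(f2)
    return bool(np.all(f1 <= f2) and np.any(f1 < f2))

# Example: (error, latency_ms) for two configurations
print(dominates([0.08, 20], [0.10, 35]))  # True: better in both objectives
print(dominates([0.08, 40], [0.10, 35]))  # False: incomparable trade-off
```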
Common Multi-Objective Scenarios in ML:
| Scenario | Objective 1 | Objective 2 | Nature of Trade-off |
|---|---|---|---|
| Deployment Efficiency | Accuracy / Quality | Latency / Throughput | Larger models are more accurate but slower |
| Resource Constraints | Accuracy / Quality | Memory / Model Size | More parameters enable better fitting but consume more memory |
| Detection Systems | Precision | Recall | Lower threshold catches more positives but increases false alarms |
| Fairness-Aware | Accuracy | Fairness (demographic parity, etc.) | Unconstrained optimization may amplify biases |
| Training Efficiency | Final Performance | Training Time / Cost | More training improves performance but costs more |
| Multi-Task Learning | Task A Performance | Task B Performance | Sharing capacity may help or hurt individual tasks |
Discovering the full Pareto front is valuable even if you will ultimately deploy a single model. The front reveals what trade-offs are possible, how much you must sacrifice in one objective to gain in another, and whether certain objectives are fundamentally incompatible or surprisingly compatible.
The simplest approach to multi-objective optimization is scalarization: combine the objectives into a single scalar value that can be optimized with standard techniques. Different scalarization methods recover different points on the Pareto front.
Linear Scalarization:
The most common approach weights objectives and sums them:
minimize S(λ) = Σᵢ wᵢ × fᵢ(λ)
where wᵢ ≥ 0 and typically Σᵢ wᵢ = 1. By varying weights, we can find different Pareto-optimal solutions.
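To see this weight-sweeping in action, consider a small self-contained demo on a toy convex bi-objective problem (our own construction, not from any library): f₁(x) = x² and f₂(x) = (x − 2)², whose Pareto set is x ∈ [0, 2].

```python
import numpy as np

# Sweep the weight w; each w recovers a different Pareto-optimal solution.
xs = np.linspace(0.0, 2.0, 401)
for w in (0.1, 0.3, 0.5, 0.7, 0.9):
    s = w * xs**2 + (1 - w) * (xs - 2) ** 2   # linear scalarization
    x_best = xs[np.argmin(s)]
    print(f"w={w:.1f}: x*={x_best:.2f}, "
          f"f1={x_best**2:.2f}, f2={(x_best - 2)**2:.2f}")
```

Because this toy problem has a convex Pareto front, every Pareto point is reachable by some weight; the next section explains why that stops being true in general.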
Limitations of Linear Scalarization:
Linear scalarization can only recover solutions on the convex hull of the Pareto front: points in non-convex regions are not optimal for any choice of weights. Moreover, evenly spaced weight vectors generally do not yield evenly spaced points on the front, so coverage can be uneven. Nonlinear scalarizations such as the augmented Tchebycheff method (implemented below) avoid the convexity limitation.
```python
import numpy as np
from typing import List, Callable, Tuple


class Scalarizer:
    """
    Scalarization methods for multi-objective optimization.
    Converts vector objectives to scalar for standard optimizers.
    """

    @staticmethod
    def weighted_sum(objectives: np.ndarray, weights: np.ndarray) -> float:
        """
        Linear weighted sum scalarization.

        Args:
            objectives: Array of objective values [f1, f2, ..., fk]
            weights: Array of weights [w1, w2, ..., wk], should sum to 1

        Returns:
            Weighted sum of objectives
        """
        return np.dot(weights, objectives)

    @staticmethod
    def tchebycheff(
        objectives: np.ndarray,
        weights: np.ndarray,
        reference_point: np.ndarray,
        rho: float = 0.05
    ) -> float:
        """
        Augmented Tchebycheff scalarization.
        Can find points in non-convex regions of the Pareto front.

        Args:
            objectives: Array of objective values
            weights: Array of weights (importance of each objective)
            reference_point: Ideal/utopia point (best possible for each objective)
            rho: Augmentation parameter (small positive value)

        Returns:
            Augmented Tchebycheff value
        """
        # Main term: weighted max distance from reference
        diffs = np.abs(objectives - reference_point)
        weighted_diffs = weights * diffs
        main_term = np.max(weighted_diffs)

        # Augmentation term: ensures unique optimum
        augmentation = rho * np.sum(diffs)

        return main_term + augmentation

    @staticmethod
    def epsilon_constraint(
        objectives: np.ndarray,
        primary_idx: int,
        epsilon_bounds: np.ndarray
    ) -> Tuple[float, bool]:
        """
        Epsilon-constraint method.
        Optimizes primary objective subject to constraints on others.

        Args:
            objectives: Array of objective values
            primary_idx: Index of objective to optimize
            epsilon_bounds: Upper bounds for non-primary objectives

        Returns:
            (primary_value, is_feasible) tuple
        """
        # Check constraints
        feasible = True
        for i, (obj, bound) in enumerate(zip(objectives, epsilon_bounds)):
            if i != primary_idx and obj > bound:
                feasible = False
                break

        return objectives[primary_idx], feasible


class ParallelScalarizedBO:
    """
    Run multiple scalarized Bayesian optimizations in parallel
    to approximate the Pareto front.
    """

    def __init__(self, n_objectives: int, n_weight_vectors: int = 10):
        self.n_objectives = n_objectives
        # Generate diverse weight vectors
        self.weight_vectors = self._generate_weight_vectors(
            n_objectives, n_weight_vectors
        )
        # One BO instance per weight vector
        self.optimizers = [None] * n_weight_vectors  # Initialize with actual BO

    def _generate_weight_vectors(self, k: int, n: int) -> List[np.ndarray]:
        """Generate evenly distributed weight vectors on the simplex."""
        if k == 2:
            # Simple linear spacing for 2 objectives
            weights = []
            for i in range(n):
                w1 = i / (n - 1)
                weights.append(np.array([w1, 1 - w1]))
            return weights
        else:
            # Random sampling for higher dimensions
            weights = []
            for _ in range(n):
                w = np.random.dirichlet(np.ones(k))
                weights.append(w)
            return weights

    def suggest(self, weight_idx: int) -> np.ndarray:
        """Suggest configuration for a specific weight vector."""
        # Each optimizer focuses on its scalarization
        return self.optimizers[weight_idx].suggest()

    def observe(self, weight_idx: int, config: np.ndarray, objectives: np.ndarray):
        """Record observation for a specific optimizer."""
        # Scalarize objectives and observe
        scalarized = Scalarizer.weighted_sum(objectives, self.weight_vectors[weight_idx])
        self.optimizers[weight_idx].observe(config, scalarized)

    def get_pareto_front(self) -> List[Tuple[np.ndarray, np.ndarray]]:
        """
        Extract Pareto front from all optimizers.
        Returns list of (config, objectives) tuples.
        """
        all_observations = []
        for opt in self.optimizers:
            all_observations.extend(opt.get_all_observations())
        return self._compute_pareto_front(all_observations)

    def _compute_pareto_front(self, points):
        """Filter to non-dominated points."""
        pareto = []
        for i, (config_i, obj_i) in enumerate(points):
            dominated = False
            for j, (config_j, obj_j) in enumerate(points):
                if i != j:
                    # Check if j dominates i
                    if all(obj_j <= obj_i) and any(obj_j < obj_i):
                        dominated = True
                        break
            if not dominated:
                pareto.append((config_i, obj_i))
        return pareto
```

Scalarization is appropriate when: (1) you have a specific trade-off in mind and just need one solution, (2) you want to leverage existing single-objective optimizers, or (3) you're approximating the Pareto front via parallel scalarized runs. For comprehensive Pareto front discovery, native multi-objective methods are more efficient.
Evolutionary algorithms are natural candidates for multi-objective optimization because they maintain a population of solutions. This population can simultaneously cover different regions of the Pareto front, approximating the entire trade-off surface in a single run.
NSGA-II (Non-dominated Sorting Genetic Algorithm II):
NSGA-II is the most widely used multi-objective evolutionary algorithm. Its key innovations are fast non-dominated sorting, which ranks the population into successive fronts in O(kN²) time; crowding distance, a density estimate that preserves diversity within a front; and elitist selection, in which parents and offspring compete together so good solutions are never lost.
Crowding Distance:
Crowding distance estimates the density of solutions around a point on the front. For each objective, it sorts solutions and measures the distance between neighbors:
```python
def crowding_distance(front, objectives):
    """Crowding distance for each solution in a non-dominated front."""
    n_objectives = len(objectives[0])
    distances = [0.0] * len(front)
    for obj_idx in range(n_objectives):
        # Sort solutions by this objective
        sorted_indices = sorted(range(len(front)),
                                key=lambda i: objectives[i][obj_idx])
        # Boundary points get infinite distance (always preserved)
        distances[sorted_indices[0]] = float('inf')
        distances[sorted_indices[-1]] = float('inf')
        # Interior points: normalized distance to neighbors
        obj_range = (objectives[sorted_indices[-1]][obj_idx]
                     - objectives[sorted_indices[0]][obj_idx])
        for i in range(1, len(front) - 1):
            distances[sorted_indices[i]] += (
                objectives[sorted_indices[i + 1]][obj_idx]
                - objectives[sorted_indices[i - 1]][obj_idx]
            ) / max(obj_range, 1e-10)
    return distances
```
Points with higher crowding distance are in sparser regions and are preferred during selection, promoting diversity along the front.
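Crowding distance is paired with NSGA-II's other core routine, fast non-dominated sorting, which partitions the population into successive fronts. A compact sketch (our own simplified version, assuming minimization; like the original algorithm it runs in O(kN²)):

```python
def dominates(a, b):
    """a dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(objectives):
    """Partition solution indices into fronts F1, F2, ... (NSGA-II style)."""
    n = len(objectives)
    domination_count = [0] * n               # how many solutions dominate i
    dominated_set = [[] for _ in range(n)]   # solutions that i dominates
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objectives[i], objectives[j]):
                dominated_set[i].append(j)
            elif dominates(objectives[j], objectives[i]):
                domination_count[i] += 1
        if domination_count[i] == 0:
            fronts[0].append(i)              # first front: non-dominated
    k = 0
    while fronts[k]:
        next_front = []
        for i in fronts[k]:
            for j in dominated_set[i]:
                domination_count[j] -= 1
                if domination_count[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
        k += 1
    return fronts[:-1]
```

Selection then proceeds front by front, using crowding distance to break ties within the last front that fits into the next population.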
| Algorithm | Key Innovation | Strengths | Weaknesses |
|---|---|---|---|
| NSGA-II | Fast non-dominated sorting + crowding distance | Fast, proven, widely implemented | Crowding struggles in many objectives |
| NSGA-III | Reference-point based selection | Scales to many objectives (3+) | Requires defining reference points |
| MOEA/D | Decomposition into scalar subproblems | Efficient, good diversity | Depends on weight vector design |
| SMS-EMOA | Hypervolume contribution for selection | Provably converges to Pareto front | Expensive hypervolume computation |
| SPEA2 | Fine-grained fitness + archive | Good boundary coverage | Slower than NSGA-II |
As the number of objectives grows beyond 3-4, most configurations become mutually non-dominated, so selection based on dominance alone loses its discriminative power. Many-objective optimization (4+ objectives) requires specialized algorithms like NSGA-III, MOEA/D, or hypervolume-based methods.
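This effect is easy to verify empirically. The following sketch (our own demo, assuming uniformly random objective vectors) estimates the fraction of points that are non-dominated as the number of objectives grows:

```python
import numpy as np

def fraction_nondominated(n_points=200, n_objectives=2, seed=0):
    """Fraction of random points in [0,1]^k that no other point dominates."""
    rng = np.random.default_rng(seed)
    Y = rng.random((n_points, n_objectives))
    nondom = 0
    for i in range(n_points):
        dominated = any(
            np.all(Y[j] <= Y[i]) and np.any(Y[j] < Y[i])
            for j in range(n_points) if j != i
        )
        nondom += not dominated
    return nondom / n_points

# The fraction climbs rapidly toward 1 as k grows
for k in (2, 3, 5, 8):
    print(k, fraction_nondominated(n_objectives=k))
```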
When objective function evaluations are expensive, Bayesian optimization becomes essential. Multi-Objective Bayesian Optimization (MOBO) extends BO to multiple objectives by modeling each objective with a separate surrogate and using multi-objective acquisition functions.
Multi-Objective Surrogates:
The standard approach maintains independent Gaussian Process models for each objective:
for each objective i:
GPᵢ: λ → (μᵢ(λ), σᵢ²(λ))
This assumes objective functions are independent given the hyperparameters—a simplification that works well in practice. Correlated models (multi-output GPs) can capture objective dependencies but are more complex.
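A minimal sketch of this independent-surrogate setup using scikit-learn (the helper names `fit_independent_surrogates` and `predict_objectives` are our own; production systems would use BoTorch or similar):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fit_independent_surrogates(X, Y):
    """One GP per objective, treating objectives as independent.
    X: (n_obs, n_hyperparams), Y: (n_obs, k) observed objective values."""
    models = []
    for i in range(Y.shape[1]):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, Y[:, i])
        models.append(gp)
    return models

def predict_objectives(models, X_new):
    """Per-objective posterior means and standard deviations at X_new."""
    means, stds = [], []
    for gp in models:
        mu, sigma = gp.predict(X_new, return_std=True)
        means.append(mu)
        stds.append(sigma)
    return np.stack(means, axis=1), np.stack(stds, axis=1)
```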
Multi-Objective Acquisition Functions:
Unlike single-objective BO where the acquisition function (EI, UCB) returns a scalar, multi-objective acquisition must balance improvement across all objectives. Key approaches:
Expected Hypervolume Improvement (EHVI):
Hypervolume is the volume of objective space dominated by the Pareto front (bounded by a reference point). EHVI measures the expected increase in this volume from evaluating a new point:
EHVI(λ) = E[HV(P ∪ {f(λ)}) - HV(P)]
where P is the current Pareto front and HV is hypervolume. Computing EHVI exactly requires integrating over the joint distribution of objectives, which is tractable for 2-3 objectives but expensive for more.
Modern implementations (BoTorch's qEHVI) use Monte Carlo sampling and efficient partitioning algorithms to scale EHVI to higher dimensions and batch settings.
ParEGO: A Simple and Effective Heuristic:
ParEGO (Pareto Efficient Global Optimization) is remarkably simple: at each iteration, sample a random weight vector from the simplex, scalarize all observed objective vectors with the augmented Tchebycheff function, fit a single GP to the scalarized values, and maximize standard Expected Improvement.
The random scalarization causes different iterations to focus on different parts of the front, naturally spreading coverage.
```python
import numpy as np
from scipy.stats import norm


class MultiObjectiveBO:
    """
    Multi-Objective Bayesian Optimization using ParEGO or EHVI.
    """

    def __init__(self, n_objectives: int, search_space, method='parego'):
        self.n_objectives = n_objectives
        self.search_space = search_space
        self.method = method
        # One GP per objective (GaussianProcess is a placeholder surrogate
        # providing fit / predict_mean / predict_var)
        self.gps = [GaussianProcess() for _ in range(n_objectives)]
        # Observations: list of (config, objectives) tuples
        self.observations = []
        # Reference point for hypervolume (dominated region bound)
        self.reference_point = None

    def observe(self, config: np.ndarray, objectives: np.ndarray):
        """Record an observation."""
        self.observations.append((config, objectives))
        # Update each GP
        configs = np.array([c for c, _ in self.observations])
        for i, gp in enumerate(self.gps):
            obj_values = np.array([o[i] for _, o in self.observations])
            gp.fit(configs, obj_values)
        # Update reference point (worst observed + margin)
        all_objs = np.array([o for _, o in self.observations])
        self.reference_point = np.max(all_objs, axis=0) + 0.1 * np.ptp(all_objs, axis=0)

    def get_pareto_front(self):
        """Return current Pareto front."""
        if not self.observations:
            return []
        points = [(c, o) for c, o in self.observations]
        pareto = []
        for i, (c_i, o_i) in enumerate(points):
            dominated = False
            for j, (c_j, o_j) in enumerate(points):
                if i != j and self._dominates(o_j, o_i):
                    dominated = True
                    break
            if not dominated:
                pareto.append((c_i, o_i))
        return pareto

    def _dominates(self, obj1, obj2):
        """Check if obj1 dominates obj2 (assuming minimization)."""
        return all(obj1 <= obj2) and any(obj1 < obj2)

    def suggest(self) -> np.ndarray:
        """Suggest next configuration to evaluate."""
        if self.method == 'parego':
            return self._suggest_parego()
        elif self.method == 'ehvi':
            return self._suggest_ehvi()
        else:
            raise ValueError(f"Unknown method: {self.method}")

    def _suggest_parego(self) -> np.ndarray:
        """ParEGO: random scalarization + EI."""
        # Sample random weight vector
        weights = np.random.dirichlet(np.ones(self.n_objectives))
        # Get reference point (ideal point) for Tchebycheff
        pareto = self.get_pareto_front()
        if pareto:
            pareto_objs = np.array([o for _, o in pareto])
            z_star = np.min(pareto_objs, axis=0)
        else:
            z_star = np.zeros(self.n_objectives)
        # Find configuration that maximizes EI on scalarized objective
        best_x = None
        best_ei = -np.inf
        for _ in range(1000):  # Random search for acquisition
            x = self.search_space.sample()
            # Predict objectives
            means = [gp.predict_mean(x) for gp in self.gps]
            variances = [gp.predict_var(x) for gp in self.gps]
            # Scalarize prediction (Tchebycheff)
            scalar_mean = self._tchebycheff_scalarization(
                np.array(means), weights, z_star
            )
            # Approximate variance (conservative)
            scalar_var = np.sum(weights**2 * np.array(variances))
            scalar_std = np.sqrt(max(scalar_var, 1e-10))
            # Compute EI
            if pareto:
                best_scalar = min(
                    self._tchebycheff_scalarization(o, weights, z_star)
                    for _, o in pareto
                )
            else:
                best_scalar = float('inf')
            ei = self._expected_improvement(scalar_mean, scalar_std, best_scalar)
            if ei > best_ei:
                best_ei = ei
                best_x = x
        return best_x

    def _tchebycheff_scalarization(self, objectives, weights, z_star, rho=0.05):
        """Augmented Tchebycheff scalarization."""
        diffs = np.abs(objectives - z_star)
        main = np.max(weights * diffs)
        aug = rho * np.sum(diffs)
        return main + aug

    def _expected_improvement(self, mean, std, best):
        """Standard Expected Improvement."""
        if std < 1e-10:
            return 0.0
        z = (best - mean) / std  # Note: we're minimizing
        return std * (z * norm.cdf(z) + norm.pdf(z))

    def _suggest_ehvi(self) -> np.ndarray:
        """EHVI-based suggestion (simplified Monte Carlo version)."""
        current_hv = self._compute_hypervolume()
        best_x = None
        best_ehvi = -np.inf
        for _ in range(500):
            x = self.search_space.sample()
            # Sample from posterior
            ehvi = 0
            n_samples = 50
            for _ in range(n_samples):
                sampled_objs = []
                for gp in self.gps:
                    mean = gp.predict_mean(x)
                    std = np.sqrt(gp.predict_var(x))
                    sampled_objs.append(np.random.normal(mean, std))
                # Compute HV with new point
                new_hv = self._compute_hypervolume(
                    additional_point=np.array(sampled_objs)
                )
                ehvi += (new_hv - current_hv) / n_samples
            if ehvi > best_ehvi:
                best_ehvi = ehvi
                best_x = x
        return best_x

    def _compute_hypervolume(self, additional_point=None):
        """Compute hypervolume indicator (2D case for simplicity)."""
        pareto = self.get_pareto_front()
        pareto_objs = [o for _, o in pareto]
        if additional_point is not None:
            pareto_objs.append(additional_point)
            pareto_objs = self._filter_to_pareto(pareto_objs)
        if not pareto_objs or self.reference_point is None:
            return 0.0
        # Simple 2D hypervolume calculation
        if self.n_objectives == 2:
            return self._hypervolume_2d(pareto_objs)
        else:
            # For higher dimensions, would use a proper HV algorithm
            # (left unimplemented in this sketch)
            return self._approximate_hypervolume(pareto_objs)

    def _hypervolume_2d(self, points):
        """Exact 2D hypervolume computation."""
        sorted_points = sorted(points, key=lambda p: p[0])
        hv = 0.0
        for i, point in enumerate(sorted_points):
            height = self.reference_point[1] - point[1]
            if i < len(sorted_points) - 1:
                width = sorted_points[i + 1][0] - point[0]
            else:
                width = self.reference_point[0] - point[0]
            hv += height * width
        return hv

    def _filter_to_pareto(self, points):
        """Filter to Pareto-optimal points only."""
        pareto = []
        for i, p_i in enumerate(points):
            dominated = False
            for j, p_j in enumerate(points):
                if i != j and self._dominates(np.array(p_j), np.array(p_i)):
                    dominated = True
                    break
            if not dominated:
                pareto.append(p_i)
        return pareto
```

For production multi-objective BO, use BoTorch (PyTorch-based) or similar well-tested libraries. BoTorch provides efficient implementations of qEHVI, qNEHVI, and other state-of-the-art acquisition functions with GPU acceleration and batch optimization support.
Real-world multi-objective problems often involve constraints: not all configurations are acceptable, even if their objectives are good. A model must not exceed a latency budget. Training must complete within available resources. Predictions must satisfy fairness requirements.
Types of Constraints:
- Hard vs. soft: hard constraints (a latency budget, a memory cap) must hold for a configuration to be usable at all; soft constraints tolerate small violations, often handled as penalties
- Cheap vs. expensive: cheap constraints (e.g., parameter count) can be checked before training, so infeasible configurations are rejected for free; expensive constraints (e.g., measured latency) are only observed after evaluation and must be modeled alongside the objectives
Constrained Expected Hypervolume Improvement:
The most principled approach models constraint satisfaction probabilistically:
cEHVI(λ) = EHVI(λ) × P(constraints satisfied | λ)
The feasibility probability is estimated from a GP or classifier trained on constraint observations. This naturally balances improvement against constraint risk.
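A minimal sketch of this feasibility weighting (the helper name `constrained_acquisition` is ours; in practice the constraint posterior comes from the same surrogate machinery as the objectives):

```python
from scipy.stats import norm

def constrained_acquisition(ehvi_value, g_mean, g_std, threshold):
    """Weight an acquisition value by P(g(lambda) <= threshold), where the
    constraint g is modeled by a GP with posterior mean g_mean and
    standard deviation g_std at the candidate configuration."""
    prob_feasible = norm.cdf((threshold - g_mean) / max(g_std, 1e-12))
    return ehvi_value * prob_feasible

# Example: EHVI of 0.4, predicted latency 45 +/- 5 ms, budget 50 ms
print(constrained_acquisition(0.4, g_mean=45.0, g_std=5.0, threshold=50.0))
```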
Practical Considerations:
The distinction between objectives and constraints is sometimes fluid. 'Latency < 50ms' can be a constraint, or latency can be an objective where we seek the Pareto front of accuracy vs. latency. Treating metrics as objectives reveals trade-offs; treating them as constraints enables focused search in the feasible region. Choose based on whether you need to explore the full trade-off surface or have a clear feasibility requirement.
Deploying multi-objective HPO in production requires careful consideration of practical issues beyond algorithmic correctness.
Choosing the Number of Objectives:
More objectives is not always better. Beyond 3-4 objectives:
- Most configurations become mutually non-dominated, so dominance provides little selection pressure
- Covering the Pareto front requires exponentially many points
- Hypervolume computation becomes expensive
- Results become difficult to visualize and for stakeholders to act on
Strategies for many objectives: aggregate strongly correlated metrics into a single objective, demote secondary metrics to constraints, or use decomposition- and reference-point-based algorithms (see the table below).
| Situation | Recommended Approach | Rationale |
|---|---|---|
| Known fixed trade-off | Single-objective with weighted sum | Simplest if weights are clear |
| 2-3 objectives, need full front | MOBO (qEHVI) or NSGA-II | Efficient Pareto front discovery |
| Many evaluations cheap | Evolutionary (NSGA-II/III) | Population naturally diversifies |
| Evaluations expensive | MOBO (ParEGO, qEHVI) | Sample-efficient with surrogates |
| 4+ objectives | MOEA/D, NSGA-III, or decomposition | Handles many-objective scaling |
| Hard constraints present | Constrained MOBO (cEHVI) | Principled constraint handling |
Selecting from the Pareto Front:
After obtaining a Pareto front, you must ultimately select one configuration for deployment. Selection strategies:
Knee-Point Selection: Choose the 'knee' of the front, the point of maximum curvature, which represents the best compromise (see the sketch after this list)
Constraint-Based Selection: Apply deployment constraints to filter the front, select best on primary objective
Decision-Maker Preference: Present the front to stakeholders, let them choose based on business priorities
Weighted Selection: Apply preference weights to rank Pareto-optimal points
Robustness Selection: Choose configurations that remain good under perturbation of inputs or environment
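A common heuristic for the knee-point selection referenced above: normalize the objectives, draw a line between the two extreme points of the front, and pick the point farthest from it. A sketch (our own implementation, assuming a 2D minimization front with at least two distinct points):

```python
import numpy as np

def knee_point_index(front):
    """Index of the 'knee' of a 2D Pareto front (minimization): the point
    farthest from the line joining the two extreme points, after
    normalizing each objective to [0, 1]."""
    F = np.asarray(front, dtype=float)
    F = (F - F.min(axis=0)) / np.maximum(np.ptp(F, axis=0), 1e-12)
    a = F[np.argmin(F[:, 0])]  # best point on objective 1
    b = F[np.argmin(F[:, 1])]  # best point on objective 2
    d = (b - a) / max(np.linalg.norm(b - a), 1e-12)
    rel = F - a
    # Perpendicular distance from the a-b line (2D cross product)
    dist = np.abs(rel[:, 0] * d[1] - rel[:, 1] * d[0])
    return int(np.argmax(dist))

# Example: the middle point is the knee of this simple front
front = [(0.10, 9.0), (0.12, 4.0), (0.30, 3.5)]
print(knee_point_index(front))  # -> 1
```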
Visualizing Multi-Objective Results:
Visualization is critical for understanding trade-offs and communicating results:
- Scatter plots of objective pairs, with Pareto-optimal points highlighted, for two or three objectives (see the sketch below)
- Parallel coordinate plots when there are four or more objectives
- Hypervolume over iterations, to track optimizer convergence
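For the two-objective case, a simple matplotlib sketch (the helper name `plot_front_2d` is ours; assumes both objectives are minimized):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_front_2d(all_objs, pareto_objs):
    """Scatter all evaluated configurations; highlight the Pareto front."""
    all_objs = np.asarray(all_objs)
    pareto_objs = np.asarray(pareto_objs)
    plt.scatter(all_objs[:, 0], all_objs[:, 1], alpha=0.3, label="evaluated")
    order = np.argsort(pareto_objs[:, 0])   # connect front points in order
    plt.plot(pareto_objs[order, 0], pareto_objs[order, 1],
             "r.-", label="Pareto front")
    plt.xlabel("objective 1 (min)")
    plt.ylabel("objective 2 (min)")
    plt.legend()
    plt.show()
```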
In practice, preferences often clarify as the Pareto front is revealed. Start with broad exploration, present early results to stakeholders, refine objectives or constraints based on feedback, and focus search on preferred regions. This iterative process is often more practical than trying to specify all preferences upfront.
Several mature frameworks support multi-objective hyperparameter optimization:
BoTorch / Ax:
BoTorch (PyTorch-based) provides state-of-the-art multi-objective acquisition functions including qEHVI, qNEHVI, and qParEGO. Ax (Adaptive Experimentation Platform) builds on BoTorch to provide a higher-level interface:
```python
from ax.service.ax_client import AxClient
from ax import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="multi_objective_hpo",
    parameters=[...],
    objectives={
        "accuracy": ObjectiveProperties(minimize=False),
        "latency": ObjectiveProperties(minimize=True),
    },
)

for _ in range(n_trials):
    params, trial_idx = ax_client.get_next_trial()
    results = evaluate(params)
    ax_client.complete_trial(trial_idx, results)

pareto_front = ax_client.get_pareto_optimal_parameters()
```
Optuna Multi-Objective Example:
```python
import optuna

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 5)

    # Train model and evaluate multiple objectives
    model = train(lr, n_layers)
    accuracy = evaluate_accuracy(model)
    latency = measure_latency(model)

    return accuracy, latency  # Return tuple for multi-objective

study = optuna.create_study(
    directions=["maximize", "minimize"],  # Maximize accuracy, minimize latency
)
study.optimize(objective, n_trials=100)

# Get Pareto front
pareto_trials = study.best_trials
for trial in pareto_trials:
    print(f"Accuracy: {trial.values[0]:.4f}, Latency: {trial.values[1]:.2f}ms")
```
Integration with ML Workflows:
Multi-objective HPO integrates naturally with ML workflows:
- Log the full objective vector for every trial in your experiment tracker, so the Pareto front can be reconstructed and audited later
- Persist all Pareto-optimal configurations as deployment candidates, not just a single 'best' model
- Re-evaluate shortlisted configurations on held-out data before final selection
- Revisit the saved front when deployment constraints change; a different Pareto point may become preferable without re-running the search
Multi-objective hyperparameter optimization addresses the reality that ML systems must balance multiple, often conflicting goals. Rather than collapsing these into a single metric, MOHPO reveals the full trade-off surface, enabling informed decisions about which compromises to accept.
Looking Ahead:
The next page explores Practical HPO Systems—production-ready tools and practices for deploying hyperparameter optimization at scale, including distributed execution, fault tolerance, result management, and organizational best practices.
You now understand multi-objective hyperparameter optimization—from Pareto optimality and scalarization to evolutionary and Bayesian approaches, constraint handling, and practical deployment considerations. These techniques enable systematic exploration of trade-off spaces in real-world ML systems. Next, we'll examine practical HPO systems for production deployment.