Real-world machine learning systems are rarely optimized for a single metric. A recommendation system must balance relevance against diversity. A fraud detection model trades off precision against recall. A deployed model must achieve accuracy while meeting latency constraints and minimizing memory usage.
These objectives frequently conflict: improving one degrades another. A larger, more accurate model runs slower. A more aggressive classifier catches more fraud but generates more false alarms. There is no single 'best' hyperparameter configuration—instead, there exists a set of Pareto-optimal configurations, each representing a different trade-off among objectives.
Multi-Objective Hyperparameter Optimization (MOHPO) addresses this challenge by discovering the entire trade-off surface, enabling informed decisions about which compromises to accept. Rather than asking 'what is the best model?', MOHPO answers 'what are the best trade-offs available?'
By the end of this page, you will understand Pareto optimality and dominance, scalarization techniques for converting multi-objective problems to single-objective, evolutionary and Bayesian approaches to multi-objective optimization, and practical strategies for deploying MOHPO in production ML pipelines.
Problem Formulation:
A multi-objective optimization problem seeks to simultaneously optimize k objectives:
min_{λ ∈ Λ} F(λ) = [f₁(λ), f₂(λ), ..., fₖ(λ)]
where each fᵢ(λ) is an objective function (we assume minimization; maximization objectives are negated). Unlike single-objective optimization, there is no natural total ordering among solutions—we cannot say one is 'better' without specifying how to weigh the objectives.
Pareto Dominance:
The key concept in multi-objective optimization is dominance. Configuration λ₁ dominates λ₂ (written λ₁ ≺ λ₂) if and only if:
1. fᵢ(λ₁) ≤ fᵢ(λ₂) for every objective i (no worse anywhere), and
2. fⱼ(λ₁) < fⱼ(λ₂) for at least one objective j (strictly better somewhere).
If neither configuration dominates the other, they are incomparable (or non-dominated with respect to each other).
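As a concrete illustration, here is a minimal dominance check, a sketch assuming minimization and plain NumPy arrays (the helper name `dominates` is ours):

```python
import numpy as np

def dominates(f1, f2):
    """True if objective vector f1 dominates f2 (minimization):
    no worse in every objective and strictly better in at least one."""
    f1, f2 = np.asarray(f1), np.asarray(f2)
    return bool(np.all(f1 <= f2) and np.any(f1 < f2))

# Example: (error, latency_ms) for two configurations
print(dominates([0.08, 20], [0.10, 35]))  # True: better in both objectives
print(dominates([0.08, 40], [0.10, 35]))  # False: incomparable trade-off
```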
Common Multi-Objective Scenarios in ML:
| Scenario | Objective 1 | Objective 2 | Nature of Trade-off |
|---|---|---|---|
| Deployment Efficiency | Accuracy / Quality | Latency / Throughput | Larger models are more accurate but slower |
| Resource Constraints | Accuracy / Quality | Memory / Model Size | More parameters enable better fitting but consume more memory |
| Detection Systems | Precision | Recall | Lower threshold catches more positives but increases false alarms |
| Fairness-Aware | Accuracy | Fairness (demographic parity, etc.) | Unconstrained optimization may amplify biases |
| Training Efficiency | Final Performance | Training Time / Cost | More training improves performance but costs more |
| Multi-Task Learning | Task A Performance | Task B Performance | Sharing capacity may help or hurt individual tasks |
Discovering the full Pareto front is valuable even if you will ultimately deploy a single model. The front reveals what trade-offs are possible, how much you must sacrifice in one objective to gain in another, and whether certain objectives are fundamentally incompatible or surprisingly compatible.
The simplest approach to multi-objective optimization is scalarization: combine the objectives into a single scalar value that can be optimized with standard techniques. Different scalarization methods recover different points on the Pareto front.
Linear Scalarization:
The most common approach weights objectives and sums them:
minimize S(λ) = Σᵢ wᵢ × fᵢ(λ)
where wᵢ ≥ 0 and typically Σᵢ wᵢ = 1. By varying weights, we can find different Pareto-optimal solutions.
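To see this weight-sweeping in action, consider a small self-contained demo on a toy convex bi-objective problem (our own construction, not from any library): f₁(x) = x² and f₂(x) = (x − 2)², whose Pareto set is x ∈ [0, 2].

```python
import numpy as np

# Sweep the weight w; each w recovers a different Pareto-optimal solution.
xs = np.linspace(0.0, 2.0, 401)
for w in (0.1, 0.3, 0.5, 0.7, 0.9):
    s = w * xs**2 + (1 - w) * (xs - 2) ** 2   # linear scalarization
    x_best = xs[np.argmin(s)]
    print(f"w={w:.1f}: x*={x_best:.2f}, "
          f"f1={x_best**2:.2f}, f2={(x_best - 2)**2:.2f}")
```

Because this toy problem has a convex Pareto front, every Pareto point is reachable by some weight; the next section explains why that stops being true in general.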
Limitations of Linear Scalarization:
Linear scalarization can only recover solutions on the convex hull of the Pareto front: points in non-convex regions are not optimal for any choice of weights. Moreover, evenly spaced weight vectors generally do not yield evenly spaced points on the front, so coverage can be uneven. Nonlinear scalarizations such as the augmented Tchebycheff method (implemented below) avoid the convexity limitation.
```python
import numpy as np
from typing import List, Callable, Tuple


class Scalarizer:
    """
    Scalarization methods for multi-objective optimization.
    Converts vector objectives to scalar for standard optimizers.
    """

    @staticmethod
    def weighted_sum(objectives: np.ndarray, weights: np.ndarray) -> float:
        """
        Linear weighted sum scalarization.

        Args:
            objectives: Array of objective values [f1, f2, ..., fk]
            weights: Array of weights [w1, w2, ..., wk], should sum to 1

        Returns:
            Weighted sum of objectives
        """
        return np.dot(weights, objectives)

    @staticmethod
    def tchebycheff(
        objectives: np.ndarray,
        weights: np.ndarray,
        reference_point: np.ndarray,
        rho: float = 0.05
    ) -> float:
        """
        Augmented Tchebycheff scalarization.
        Can find points in non-convex regions of the Pareto front.

        Args:
            objectives: Array of objective values
            weights: Array of weights (importance of each objective)
            reference_point: Ideal/utopia point (best possible for each objective)
            rho: Augmentation parameter (small positive value)

        Returns:
            Augmented Tchebycheff value
        """
        # Main term: weighted max distance from reference
        diffs = np.abs(objectives - reference_point)
        weighted_diffs = weights * diffs
        main_term = np.max(weighted_diffs)

        # Augmentation term: ensures unique optimum
        augmentation = rho * np.sum(diffs)

        return main_term + augmentation

    @staticmethod
    def epsilon_constraint(
        objectives: np.ndarray,
        primary_idx: int,
        epsilon_bounds: np.ndarray
    ) -> Tuple[float, bool]:
        """
        Epsilon-constraint method.
        Optimizes primary objective subject to constraints on others.

        Args:
            objectives: Array of objective values
            primary_idx: Index of objective to optimize
            epsilon_bounds: Upper bounds for non-primary objectives

        Returns:
            (primary_value, is_feasible) tuple
        """
        # Check constraints
        feasible = True
        for i, (obj, bound) in enumerate(zip(objectives, epsilon_bounds)):
            if i != primary_idx and obj > bound:
                feasible = False
                break

        return objectives[primary_idx], feasible


class ParallelScalarizedBO:
    """
    Run multiple scalarized Bayesian optimizations in parallel
    to approximate the Pareto front.
    """

    def __init__(self, n_objectives: int, n_weight_vectors: int = 10):
        self.n_objectives = n_objectives
        # Generate diverse weight vectors
        self.weight_vectors = self._generate_weight_vectors(
            n_objectives, n_weight_vectors
        )
        # One BO instance per weight vector
        self.optimizers = [None] * n_weight_vectors  # Initialize with actual BO

    def _generate_weight_vectors(self, k: int, n: int) -> List[np.ndarray]:
        """Generate evenly distributed weight vectors on the simplex."""
        if k == 2:
            # Simple linear spacing for 2 objectives
            weights = []
            for i in range(n):
                w1 = i / (n - 1)
                weights.append(np.array([w1, 1 - w1]))
            return weights
        else:
            # Random sampling for higher dimensions
            weights = []
            for _ in range(n):
                w = np.random.dirichlet(np.ones(k))
                weights.append(w)
            return weights

    def suggest(self, weight_idx: int) -> np.ndarray:
        """Suggest configuration for a specific weight vector."""
        # Each optimizer focuses on its scalarization
        return self.optimizers[weight_idx].suggest()

    def observe(self, weight_idx: int, config: np.ndarray, objectives: np.ndarray):
        """Record observation for a specific optimizer."""
        # Scalarize objectives and observe
        scalarized = Scalarizer.weighted_sum(objectives, self.weight_vectors[weight_idx])
        self.optimizers[weight_idx].observe(config, scalarized)

    def get_pareto_front(self) -> List[Tuple[np.ndarray, np.ndarray]]:
        """
        Extract Pareto front from all optimizers.
        Returns list of (config, objectives) tuples.
        """
        all_observations = []
        for opt in self.optimizers:
            all_observations.extend(opt.get_all_observations())
        return self._compute_pareto_front(all_observations)

    def _compute_pareto_front(self, points):
        """Filter to non-dominated points."""
        pareto = []
        for i, (config_i, obj_i) in enumerate(points):
            dominated = False
            for j, (config_j, obj_j) in enumerate(points):
                if i != j:
                    # Check if j dominates i
                    if all(obj_j <= obj_i) and any(obj_j < obj_i):
                        dominated = True
                        break
            if not dominated:
                pareto.append((config_i, obj_i))
        return pareto
```

Scalarization is appropriate when: (1) you have a specific trade-off in mind and just need one solution, (2) you want to leverage existing single-objective optimizers, or (3) you're approximating the Pareto front via parallel scalarized runs. For comprehensive Pareto front discovery, native multi-objective methods are more efficient.
Evolutionary algorithms are natural candidates for multi-objective optimization because they maintain a population of solutions. This population can simultaneously cover different regions of the Pareto front, approximating the entire trade-off surface in a single run.
NSGA-II (Non-dominated Sorting Genetic Algorithm II):
NSGA-II is the most widely used multi-objective evolutionary algorithm. Its key innovations are fast non-dominated sorting, which ranks the population into successive fronts in O(kN²) time; crowding distance, a density estimate that preserves diversity within a front; and elitist selection, in which parents and offspring compete together so good solutions are never lost.
Crowding Distance:
Crowding distance estimates the density of solutions around a point on the front. For each objective, it sorts solutions and measures the distance between neighbors:
```python
def crowding_distance(front, objectives):
    """Crowding distance for each solution in a non-dominated front."""
    n_objectives = len(objectives[0])
    distances = [0.0] * len(front)
    for obj_idx in range(n_objectives):
        # Sort solutions by this objective
        sorted_indices = sorted(range(len(front)),
                                key=lambda i: objectives[i][obj_idx])
        # Boundary points get infinite distance (always preserved)
        distances[sorted_indices[0]] = float('inf')
        distances[sorted_indices[-1]] = float('inf')
        # Interior points: normalized distance to neighbors
        obj_range = (objectives[sorted_indices[-1]][obj_idx]
                     - objectives[sorted_indices[0]][obj_idx])
        for i in range(1, len(front) - 1):
            distances[sorted_indices[i]] += (
                objectives[sorted_indices[i + 1]][obj_idx]
                - objectives[sorted_indices[i - 1]][obj_idx]
            ) / max(obj_range, 1e-10)
    return distances
```
Points with higher crowding distance are in sparser regions and are preferred during selection, promoting diversity along the front.
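Crowding distance is paired with NSGA-II's other core routine, fast non-dominated sorting, which partitions the population into successive fronts. A compact sketch (our own simplified version, assuming minimization; like the original algorithm it runs in O(kN²)):

```python
def dominates(a, b):
    """a dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(objectives):
    """Partition solution indices into fronts F1, F2, ... (NSGA-II style)."""
    n = len(objectives)
    domination_count = [0] * n               # how many solutions dominate i
    dominated_set = [[] for _ in range(n)]   # solutions that i dominates
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objectives[i], objectives[j]):
                dominated_set[i].append(j)
            elif dominates(objectives[j], objectives[i]):
                domination_count[i] += 1
        if domination_count[i] == 0:
            fronts[0].append(i)              # first front: non-dominated
    k = 0
    while fronts[k]:
        next_front = []
        for i in fronts[k]:
            for j in dominated_set[i]:
                domination_count[j] -= 1
                if domination_count[j] == 0:
                    next_front.append(j)
        fronts.append(next_front)
        k += 1
    return fronts[:-1]
```

Selection then proceeds front by front, using crowding distance to break ties within the last front that fits into the next population.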
| Algorithm | Key Innovation | Strengths | Weaknesses |
|---|---|---|---|
| NSGA-II | Fast non-dominated sorting + crowding distance | Fast, proven, widely implemented | Crowding struggles in many objectives |
| NSGA-III | Reference-point based selection | Scales to many objectives (3+) | Requires defining reference points |
| MOEA/D | Decomposition into scalar subproblems | Efficient, good diversity | Depends on weight vector design |
| SMS-EMOA | Hypervolume contribution for selection | Provably converges to Pareto front | Expensive hypervolume computation |
| SPEA2 | Fine-grained fitness + archive | Good boundary coverage | Slower than NSGA-II |
As the number of objectives grows beyond 3-4, most configurations become mutually non-dominated, so selection based on dominance alone loses its discriminative power. Many-objective optimization (4+ objectives) requires specialized algorithms like NSGA-III, MOEA/D, or hypervolume-based methods.
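This effect is easy to verify empirically. The following sketch (our own demo, assuming uniformly random objective vectors) estimates the fraction of points that are non-dominated as the number of objectives grows:

```python
import numpy as np

def fraction_nondominated(n_points=200, n_objectives=2, seed=0):
    """Fraction of random points in [0,1]^k that no other point dominates."""
    rng = np.random.default_rng(seed)
    Y = rng.random((n_points, n_objectives))
    nondom = 0
    for i in range(n_points):
        dominated = any(
            np.all(Y[j] <= Y[i]) and np.any(Y[j] < Y[i])
            for j in range(n_points) if j != i
        )
        nondom += not dominated
    return nondom / n_points

# The fraction climbs rapidly toward 1 as k grows
for k in (2, 3, 5, 8):
    print(k, fraction_nondominated(n_objectives=k))
```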
When objective function evaluations are expensive, Bayesian optimization becomes essential. Multi-Objective Bayesian Optimization (MOBO) extends BO to multiple objectives by modeling each objective with a separate surrogate and using multi-objective acquisition functions.
Multi-Objective Surrogates:
The standard approach maintains independent Gaussian Process models for each objective:
for each objective i:
GPᵢ: λ → (μᵢ(λ), σᵢ²(λ))
This assumes objective functions are independent given the hyperparameters—a simplification that works well in practice. Correlated models (multi-output GPs) can capture objective dependencies but are more complex.
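A minimal sketch of this independent-surrogate setup using scikit-learn (the helper names `fit_independent_surrogates` and `predict_objectives` are our own; production systems would use BoTorch or similar):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def fit_independent_surrogates(X, Y):
    """One GP per objective, treating objectives as independent.
    X: (n_obs, n_hyperparams), Y: (n_obs, k) observed objective values."""
    models = []
    for i in range(Y.shape[1]):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, Y[:, i])
        models.append(gp)
    return models

def predict_objectives(models, X_new):
    """Per-objective posterior means and standard deviations at X_new."""
    means, stds = [], []
    for gp in models:
        mu, sigma = gp.predict(X_new, return_std=True)
        means.append(mu)
        stds.append(sigma)
    return np.stack(means, axis=1), np.stack(stds, axis=1)
```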
Multi-Objective Acquisition Functions:
Unlike single-objective BO where the acquisition function (EI, UCB) returns a scalar, multi-objective acquisition must balance improvement across all objectives. Key approaches:
Expected Hypervolume Improvement (EHVI):
Hypervolume is the volume of objective space dominated by the Pareto front (bounded by a reference point). EHVI measures the expected increase in this volume from evaluating a new point:
EHVI(λ) = E[HV(P ∪ {f(λ)}) - HV(P)]
where P is the current Pareto front and HV is hypervolume. Computing EHVI exactly requires integrating over the joint distribution of objectives, which is tractable for 2-3 objectives but expensive for more.
Modern implementations (BoTorch's qEHVI) use Monte Carlo sampling and efficient partitioning algorithms to scale EHVI to higher dimensions and batch settings.
ParEGO: A Simple and Effective Heuristic:
ParEGO (Pareto Efficient Global Optimization) is remarkably simple: at each iteration, sample a random weight vector from the simplex, scalarize all observed objective vectors with the augmented Tchebycheff function, fit a single GP to the scalarized values, and maximize standard Expected Improvement.
The random scalarization causes different iterations to focus on different parts of the front, naturally spreading coverage.
```python
import numpy as np
from scipy.stats import norm


class MultiObjectiveBO:
    """
    Multi-Objective Bayesian Optimization using ParEGO or EHVI.
    """

    def __init__(self, n_objectives: int, search_space, method='parego'):
        self.n_objectives = n_objectives
        self.search_space = search_space
        self.method = method
        # One GP per objective (GaussianProcess is a placeholder surrogate
        # providing fit / predict_mean / predict_var)
        self.gps = [GaussianProcess() for _ in range(n_objectives)]
        # Observations: list of (config, objectives) tuples
        self.observations = []
        # Reference point for hypervolume (dominated region bound)
        self.reference_point = None

    def observe(self, config: np.ndarray, objectives: np.ndarray):
        """Record an observation."""
        self.observations.append((config, objectives))
        # Update each GP
        configs = np.array([c for c, _ in self.observations])
        for i, gp in enumerate(self.gps):
            obj_values = np.array([o[i] for _, o in self.observations])
            gp.fit(configs, obj_values)
        # Update reference point (worst observed + margin)
        all_objs = np.array([o for _, o in self.observations])
        self.reference_point = np.max(all_objs, axis=0) + 0.1 * np.ptp(all_objs, axis=0)

    def get_pareto_front(self):
        """Return current Pareto front."""
        if not self.observations:
            return []
        points = [(c, o) for c, o in self.observations]
        pareto = []
        for i, (c_i, o_i) in enumerate(points):
            dominated = False
            for j, (c_j, o_j) in enumerate(points):
                if i != j and self._dominates(o_j, o_i):
                    dominated = True
                    break
            if not dominated:
                pareto.append((c_i, o_i))
        return pareto

    def _dominates(self, obj1, obj2):
        """Check if obj1 dominates obj2 (assuming minimization)."""
        return all(obj1 <= obj2) and any(obj1 < obj2)

    def suggest(self) -> np.ndarray:
        """Suggest next configuration to evaluate."""
        if self.method == 'parego':
            return self._suggest_parego()
        elif self.method == 'ehvi':
            return self._suggest_ehvi()
        else:
            raise ValueError(f"Unknown method: {self.method}")

    def _suggest_parego(self) -> np.ndarray:
        """ParEGO: random scalarization + EI."""
        # Sample random weight vector
        weights = np.random.dirichlet(np.ones(self.n_objectives))
        # Get reference point (ideal point) for Tchebycheff
        pareto = self.get_pareto_front()
        if pareto:
            pareto_objs = np.array([o for _, o in pareto])
            z_star = np.min(pareto_objs, axis=0)
        else:
            z_star = np.zeros(self.n_objectives)
        # Find configuration that maximizes EI on scalarized objective
        best_x = None
        best_ei = -np.inf
        for _ in range(1000):  # Random search for acquisition
            x = self.search_space.sample()
            # Predict objectives
            means = [gp.predict_mean(x) for gp in self.gps]
            variances = [gp.predict_var(x) for gp in self.gps]
            # Scalarize prediction (Tchebycheff)
            scalar_mean = self._tchebycheff_scalarization(
                np.array(means), weights, z_star
            )
            # Approximate variance (conservative)
            scalar_var = np.sum(weights**2 * np.array(variances))
            scalar_std = np.sqrt(max(scalar_var, 1e-10))
            # Compute EI
            if pareto:
                best_scalar = min(
                    self._tchebycheff_scalarization(o, weights, z_star)
                    for _, o in pareto
                )
            else:
                best_scalar = float('inf')
            ei = self._expected_improvement(scalar_mean, scalar_std, best_scalar)
            if ei > best_ei:
                best_ei = ei
                best_x = x
        return best_x

    def _tchebycheff_scalarization(self, objectives, weights, z_star, rho=0.05):
        """Augmented Tchebycheff scalarization."""
        diffs = np.abs(objectives - z_star)
        main = np.max(weights * diffs)
        aug = rho * np.sum(diffs)
        return main + aug

    def _expected_improvement(self, mean, std, best):
        """Standard Expected Improvement."""
        if std < 1e-10:
            return 0.0
        z = (best - mean) / std  # Note: we're minimizing
        return std * (z * norm.cdf(z) + norm.pdf(z))

    def _suggest_ehvi(self) -> np.ndarray:
        """EHVI-based suggestion (simplified Monte Carlo version)."""
        current_hv = self._compute_hypervolume()
        best_x = None
        best_ehvi = -np.inf
        for _ in range(500):
            x = self.search_space.sample()
            # Sample from posterior
            ehvi = 0
            n_samples = 50
            for _ in range(n_samples):
                sampled_objs = []
                for gp in self.gps:
                    mean = gp.predict_mean(x)
                    std = np.sqrt(gp.predict_var(x))
                    sampled_objs.append(np.random.normal(mean, std))
                # Compute HV with new point
                new_hv = self._compute_hypervolume(
                    additional_point=np.array(sampled_objs)
                )
                ehvi += (new_hv - current_hv) / n_samples
            if ehvi > best_ehvi:
                best_ehvi = ehvi
                best_x = x
        return best_x

    def _compute_hypervolume(self, additional_point=None):
        """Compute hypervolume indicator (2D case for simplicity)."""
        pareto = self.get_pareto_front()
        pareto_objs = [o for _, o in pareto]
        if additional_point is not None:
            pareto_objs.append(additional_point)
            pareto_objs = self._filter_to_pareto(pareto_objs)
        if not pareto_objs or self.reference_point is None:
            return 0.0
        # Simple 2D hypervolume calculation
        if self.n_objectives == 2:
            return self._hypervolume_2d(pareto_objs)
        else:
            # For higher dimensions, would use a proper HV algorithm
            # (left unimplemented in this sketch)
            return self._approximate_hypervolume(pareto_objs)

    def _hypervolume_2d(self, points):
        """Exact 2D hypervolume computation."""
        sorted_points = sorted(points, key=lambda p: p[0])
        hv = 0.0
        for i, point in enumerate(sorted_points):
            height = self.reference_point[1] - point[1]
            if i < len(sorted_points) - 1:
                width = sorted_points[i + 1][0] - point[0]
            else:
                width = self.reference_point[0] - point[0]
            hv += height * width
        return hv

    def _filter_to_pareto(self, points):
        """Filter to Pareto-optimal points only."""
        pareto = []
        for i, p_i in enumerate(points):
            dominated = False
            for j, p_j in enumerate(points):
                if i != j and self._dominates(np.array(p_j), np.array(p_i)):
                    dominated = True
                    break
            if not dominated:
                pareto.append(p_i)
        return pareto
```

For production multi-objective BO, use BoTorch (PyTorch-based) or similar well-tested libraries. BoTorch provides efficient implementations of qEHVI, qNEHVI, and other state-of-the-art acquisition functions with GPU acceleration and batch optimization support.
Real-world multi-objective problems often involve constraints: not all configurations are acceptable, even if their objectives are good. A model must not exceed a latency budget. Training must complete within available resources. Predictions must satisfy fairness requirements.
Types of Constraints:
- Hard vs. soft: hard constraints (a latency budget, a memory cap) must hold for a configuration to be usable at all; soft constraints tolerate small violations, often handled as penalties
- Cheap vs. expensive: cheap constraints (e.g., parameter count) can be checked before training, so infeasible configurations are rejected for free; expensive constraints (e.g., measured latency) are only observed after evaluation and must be modeled alongside the objectives
Constrained Expected Hypervolume Improvement:
The most principled approach models constraint satisfaction probabilistically:
cEHVI(λ) = EHVI(λ) × P(constraints satisfied | λ)
The feasibility probability is estimated from a GP or classifier trained on constraint observations. This naturally balances improvement against constraint risk.
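A minimal sketch of this feasibility weighting (the helper name `constrained_acquisition` is ours; in practice the constraint posterior comes from the same surrogate machinery as the objectives):

```python
from scipy.stats import norm

def constrained_acquisition(ehvi_value, g_mean, g_std, threshold):
    """Weight an acquisition value by P(g(lambda) <= threshold), where the
    constraint g is modeled by a GP with posterior mean g_mean and
    standard deviation g_std at the candidate configuration."""
    prob_feasible = norm.cdf((threshold - g_mean) / max(g_std, 1e-12))
    return ehvi_value * prob_feasible

# Example: EHVI of 0.4, predicted latency 45 +/- 5 ms, budget 50 ms
print(constrained_acquisition(0.4, g_mean=45.0, g_std=5.0, threshold=50.0))
```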
Practical Considerations:
The distinction between objectives and constraints is sometimes fluid. 'Latency < 50ms' can be a constraint, or latency can be an objective where we seek the Pareto front of accuracy vs. latency. Treating metrics as objectives reveals trade-offs; treating them as constraints enables focused search in the feasible region. Choose based on whether you need to explore the full trade-off surface or have a clear feasibility requirement.
Deploying multi-objective HPO in production requires careful consideration of practical issues beyond algorithmic correctness.
Choosing the Number of Objectives:
More objectives is not always better. Beyond 3-4 objectives:
- Most configurations become mutually non-dominated, so dominance provides little selection pressure
- Covering the Pareto front requires exponentially many points
- Hypervolume computation becomes expensive
- Results become difficult to visualize and for stakeholders to act on
Strategies for many objectives: aggregate strongly correlated metrics into a single objective, demote secondary metrics to constraints, or use decomposition- and reference-point-based algorithms (see the table below).
| Situation | Recommended Approach | Rationale |
|---|---|---|
| Known fixed trade-off | Single-objective with weighted sum | Simplest if weights are clear |
| 2-3 objectives, need full front | MOBO (qEHVI) or NSGA-II | Efficient Pareto front discovery |
| Many evaluations cheap | Evolutionary (NSGA-II/III) | Population naturally diversifies |
| Evaluations expensive | MOBO (ParEGO, qEHVI) | Sample-efficient with surrogates |
| 4+ objectives | MOEA/D, NSGA-III, or decomposition | Handles many-objective scaling |
| Hard constraints present | Constrained MOBO (cEHVI) | Principled constraint handling |
Selecting from the Pareto Front:
After obtaining a Pareto front, you must ultimately select one configuration for deployment. Selection strategies:
Knee-Point Selection: Choose the 'knee' of the front, the point of maximum curvature, which represents the best compromise (see the sketch after this list)
Constraint-Based Selection: Apply deployment constraints to filter the front, select best on primary objective
Decision-Maker Preference: Present the front to stakeholders, let them choose based on business priorities
Weighted Selection: Apply preference weights to rank Pareto-optimal points
Robustness Selection: Choose configurations that remain good under perturbation of inputs or environment
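A common heuristic for the knee-point selection referenced above: normalize the objectives, draw a line between the two extreme points of the front, and pick the point farthest from it. A sketch (our own implementation, assuming a 2D minimization front with at least two distinct points):

```python
import numpy as np

def knee_point_index(front):
    """Index of the 'knee' of a 2D Pareto front (minimization): the point
    farthest from the line joining the two extreme points, after
    normalizing each objective to [0, 1]."""
    F = np.asarray(front, dtype=float)
    F = (F - F.min(axis=0)) / np.maximum(np.ptp(F, axis=0), 1e-12)
    a = F[np.argmin(F[:, 0])]  # best point on objective 1
    b = F[np.argmin(F[:, 1])]  # best point on objective 2
    d = (b - a) / max(np.linalg.norm(b - a), 1e-12)
    rel = F - a
    # Perpendicular distance from the a-b line (2D cross product)
    dist = np.abs(rel[:, 0] * d[1] - rel[:, 1] * d[0])
    return int(np.argmax(dist))

# Example: the middle point is the knee of this simple front
front = [(0.10, 9.0), (0.12, 4.0), (0.30, 3.5)]
print(knee_point_index(front))  # -> 1
```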
Visualizing Multi-Objective Results:
Visualization is critical for understanding trade-offs and communicating results:
- Scatter plots of objective pairs, with Pareto-optimal points highlighted, for two or three objectives (see the sketch below)
- Parallel coordinate plots when there are four or more objectives
- Hypervolume over iterations, to track optimizer convergence
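For the two-objective case, a simple matplotlib sketch (the helper name `plot_front_2d` is ours; assumes both objectives are minimized):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_front_2d(all_objs, pareto_objs):
    """Scatter all evaluated configurations; highlight the Pareto front."""
    all_objs = np.asarray(all_objs)
    pareto_objs = np.asarray(pareto_objs)
    plt.scatter(all_objs[:, 0], all_objs[:, 1], alpha=0.3, label="evaluated")
    order = np.argsort(pareto_objs[:, 0])   # connect front points in order
    plt.plot(pareto_objs[order, 0], pareto_objs[order, 1],
             "r.-", label="Pareto front")
    plt.xlabel("objective 1 (min)")
    plt.ylabel("objective 2 (min)")
    plt.legend()
    plt.show()
```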
In practice, preferences often clarify as the Pareto front is revealed. Start with broad exploration, present early results to stakeholders, refine objectives or constraints based on feedback, and focus search on preferred regions. This iterative process is often more practical than trying to specify all preferences upfront.
Several mature frameworks support multi-objective hyperparameter optimization:
BoTorch / Ax:
BoTorch (PyTorch-based) provides state-of-the-art multi-objective acquisition functions including qEHVI, qNEHVI, and qParEGO. Ax (Adaptive Experimentation Platform) builds on BoTorch to provide a higher-level interface:
```python
from ax.service.ax_client import AxClient
from ax import ObjectiveProperties

ax_client = AxClient()
ax_client.create_experiment(
    name="multi_objective_hpo",
    parameters=[...],
    objectives={
        "accuracy": ObjectiveProperties(minimize=False),
        "latency": ObjectiveProperties(minimize=True),
    },
)

for _ in range(n_trials):
    params, trial_idx = ax_client.get_next_trial()
    results = evaluate(params)
    ax_client.complete_trial(trial_idx, results)

pareto_front = ax_client.get_pareto_optimal_parameters()
```
Optuna Multi-Objective Example:
```python
import optuna

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 5)

    # Train model and evaluate multiple objectives
    model = train(lr, n_layers)
    accuracy = evaluate_accuracy(model)
    latency = measure_latency(model)

    return accuracy, latency  # Return tuple for multi-objective

study = optuna.create_study(
    directions=["maximize", "minimize"],  # Maximize accuracy, minimize latency
)
study.optimize(objective, n_trials=100)

# Get Pareto front
pareto_trials = study.best_trials
for trial in pareto_trials:
    print(f"Accuracy: {trial.values[0]:.4f}, Latency: {trial.values[1]:.2f}ms")
```
Integration with ML Workflows:
Multi-objective HPO integrates naturally with ML workflows:
- Log the full objective vector for every trial in your experiment tracker, so the Pareto front can be reconstructed and audited later
- Persist all Pareto-optimal configurations as deployment candidates, not just a single 'best' model
- Re-evaluate shortlisted configurations on held-out data before final selection
- Revisit the saved front when deployment constraints change; a different Pareto point may become preferable without re-running the search
Multi-objective hyperparameter optimization addresses the reality that ML systems must balance multiple, often conflicting goals. Rather than collapsing these into a single metric, MOHPO reveals the full trade-off surface, enabling informed decisions about which compromises to accept.
Looking Ahead:
The next page explores Practical HPO Systems—production-ready tools and practices for deploying hyperparameter optimization at scale, including distributed execution, fault tolerance, result management, and organizational best practices.
You now understand multi-objective hyperparameter optimization—from Pareto optimality and scalarization to evolutionary and Bayesian approaches, constraint handling, and practical deployment considerations. These techniques enable systematic exploration of trade-off spaces in real-world ML systems. Next, we'll examine practical HPO systems for production deployment.