No single kernel is optimal for all problems. In practice, the best kernel and its hyperparameters are rarely known in advance, and data often mix several feature types and scales.
Multiple Kernel Learning (MKL) addresses these challenges by learning an optimal combination of base kernels from data.
This page covers the theory of kernel combination, MKL formulations (linear, sparse, localized), optimization algorithms (alternating, gradient-based), and practical considerations for applying MKL effectively.
The fundamental property enabling kernel combination is that positive definite kernels are closed under certain operations. This means we can construct new valid kernels from existing ones.
Closure Properties of Positive Definite Kernels:
Given valid PD kernels $k_1$ and $k_2$, the operations summarized in the table below also yield valid PD kernels.
Sum corresponds to concatenating feature maps: if $k_1 \to \phi_1$ and $k_2 \to \phi_2$, then $k_1 + k_2 \to [\phi_1; \phi_2]$. Product corresponds to the tensor product of feature maps, $\phi_1 \otimes \phi_2$. These correspondences give geometric intuition for combined kernels.
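A quick numerical check of these closure properties (a minimal sketch; the linear and RBF base kernels and the 0.3/0.7 weights are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# Two base Gram matrices on the same sample: linear and RBF (arbitrary choices)
K_lin = X @ X.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.5 * sq_dists)

def min_eig(K):
    """Smallest eigenvalue; values >= -1e-8 indicate PSD up to rounding error."""
    return np.linalg.eigvalsh((K + K.T) / 2).min()

print("sum:              ", min_eig(K_lin + K_rbf))              # closure under addition
print("elementwise prod: ", min_eig(K_lin * K_rbf))              # Schur product theorem
print("0.3*K1 + 0.7*K2:  ", min_eig(0.3 * K_lin + 0.7 * K_rbf))  # conic combination
```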
Linear Combination of Kernels:
The most common MKL formulation uses a linear (conic) combination:
$$k(\mathbf{x}, \mathbf{y}; \boldsymbol{\mu}) = \sum_{m=1}^{M} \mu_m k_m(\mathbf{x}, \mathbf{y})$$
where $\{k_m\}_{m=1}^{M}$ are base kernels and $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_M)^\top$ with $\mu_m \geq 0$ are kernel weights.
The Learning Problem:
In MKL, we simultaneously learn the prediction function (e.g., the SVM dual variables) and the kernel weights $\boldsymbol{\mu}$.
This joint optimization is typically non-convex, but can be reformulated as a convex (though more complex) optimization problem.
| Operation | Formula | Feature Space Effect | Use Case |
|---|---|---|---|
| Sum | $k_1 + k_2$ | Concatenation: $[\phi_1; \phi_2]$ | Different feature types |
| Product | $k_1 \cdot k_2$ | Tensor product: $\phi_1 \otimes \phi_2$ | Interaction between features |
| Weighted sum | $\sum \mu_m k_m$ | Weighted concatenation | Feature/scale selection |
| Power | $k^p$ (positive integer $p$) | Higher-order polynomial expansion | Increased expressiveness |
Different MKL formulations impose different constraints and regularization on the kernel weights, leading to different solution properties.
Standard (L1-regularized) MKL:
The original MKL formulation (Lanckriet et al., 2004; Bach et al., 2004) uses an implicit L1 constraint:
$$\min_{\boldsymbol{\mu}} \min_{f \in \mathcal{H}_{\boldsymbol{\mu}}} \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_{\mathcal{H}_{\boldsymbol{\mu}}}^2$$
$$\text{subject to } \boldsymbol{\mu} \geq 0, \; \|\boldsymbol{\mu}\|_1 = 1$$
where $\mathcal{H}_{\boldsymbol{\mu}}$ is the RKHS of $k_{\boldsymbol{\mu}} = \sum_m \mu_m k_m$.
Key Property: L1 constraint induces sparsity—only a few kernels receive non-zero weight. This acts as automatic kernel selection.
Sparse MKL is interpretable (few kernels selected) but may discard useful complementary information. When base kernels capture truly different but all-relevant aspects, non-sparse MKL often performs better.
Lp-norm MKL (Kloft et al., 2011):
Generalizes to $L_p$ regularization on kernel weights:
$$\|\boldsymbol{\mu}\|_p \leq 1, \quad p \geq 1$$
In practice, $p=2$ often outperforms $p=1$ when all kernels contribute useful information.
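When the per-kernel components $\mathbf{w}_m$ are held fixed, the weight subproblem under the $L_p$ constraint has a closed-form solution with $\mu_m \propto \|\mathbf{w}_m\|^{2/(p+1)}$, rescaled to unit $L_p$ norm. A minimal sketch of that update (the norm values below are made up purely for illustration):

```python
import numpy as np

def lp_mkl_weight_update(w_norms, p):
    """Closed-form Lp-MKL weight update for fixed per-kernel norms ||w_m||.

    Minimizes sum_m ||w_m||^2 / mu_m subject to ||mu||_p <= 1, mu >= 0,
    which gives mu_m proportional to ||w_m||^(2/(p+1)).
    """
    w_norms = np.asarray(w_norms, dtype=float)
    mu = w_norms ** (2.0 / (p + 1))
    return mu / np.linalg.norm(mu, ord=p)  # rescale onto the Lp sphere

# Hypothetical per-kernel norms, purely illustrative
w_norms = [2.0, 1.0, 0.1, 0.01]
print("p=1:", np.round(lp_mkl_weight_update(w_norms, p=1), 3))  # more concentrated
print("p=2:", np.round(lp_mkl_weight_update(w_norms, p=2), 3))  # smoother spread
```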
SimpleMKL (Rakotomamonjy et al., 2008):
A widely used, efficient algorithm that alternates between solving a standard SVM for the current combined kernel and updating the kernel weights with a gradient step on the simplex (see the sketch below).
Bayesian MKL:
Places priors on kernel weights for uncertainty quantification:
$$\mu_m \sim \text{Gamma}(a, b) \quad \text{or} \quad \boldsymbol{\mu} \sim \text{Dirichlet}(\boldsymbol{\alpha})$$
Enables posterior inference over which kernels are relevant.
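A small sketch of the generative view (prior sampling only, with toy random PSD kernels and an arbitrary symmetric Dirichlet concentration; actual posterior inference would also condition on the labels):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 40, 3

# Toy base Gram matrices (random PSD placeholders for real kernels)
Ks = []
for _ in range(M):
    Z = rng.normal(size=(n, n))
    Ks.append(Z @ Z.T)

# Dirichlet prior over kernel weights; each draw induces a combined kernel.
# Posterior inference would reweight these draws by how well each combination
# explains the observed labels.
alpha = np.ones(M)                 # symmetric concentration (assumed)
for s in range(3):
    mu = rng.dirichlet(alpha)      # mu >= 0, sums to 1
    K_s = sum(w * K for w, K in zip(mu, Ks))
    print(f"sample {s}: mu = {np.round(mu, 3)}, trace(K) = {np.trace(K_s):.1f}")
```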
| Formulation | Constraint | Weight Properties | Best When |
|---|---|---|---|
| L1-MKL | $|\mu|_1 = 1$ | Sparse (few non-zero) | Many irrelevant kernels |
| L2-MKL | $|\mu|_2 \leq 1$ | Non-sparse, smooth | All kernels useful |
| L∞-MKL | $|\mu|_\infty \leq 1$ | Approximately uniform | Equal importance known |
| EasyMKL | Centered alignment | Non-sparse | Simple, effective default |
| Bayesian MKL | Prior $p(\mu)$ | Uncertainty quantified | Need confidence estimates |
```python
import numpy as np
from sklearn.svm import SVC


def simple_mkl(Ks, y, C=1.0, max_iter=100, tol=1e-4):
    """
    SimpleMKL-style alternating algorithm for L1-MKL.

    Parameters
    ----------
    Ks : list of kernel matrices, each (n, n)
    y : labels (n,)
    C : SVM regularization

    Returns
    -------
    mu : kernel weights
    alpha : SVM dual variable magnitudes
    """
    M = len(Ks)
    n = len(y)

    # Initialize weights uniformly on the simplex
    mu = np.ones(M) / M
    alpha = np.zeros(n)

    for _ in range(max_iter):
        # Step 1: Solve a standard SVM with the combined kernel
        K_combined = sum(mu[m] * Ks[m] for m in range(M))
        svm = SVC(kernel='precomputed', C=C)
        svm.fit(K_combined, y)

        # sklearn stores the signed duals y_i * alpha_i for the support vectors
        beta = np.zeros(n)
        beta[svm.support_] = svm.dual_coef_.ravel()
        alpha = np.abs(beta)

        # Step 2: Gradient of the dual objective w.r.t. mu
        # dJ/dmu_m = -0.5 * (y * alpha)^T K_m (y * alpha)
        grad = np.array([-0.5 * beta @ Ks[m] @ beta for m in range(M)])

        # Step 3: Update mu via a projected gradient step
        # (fixed step size in place of SimpleMKL's line search)
        d = grad - grad.mean()          # descent direction tangent to the simplex
        mu_new = mu - 0.1 * d           # gradient step
        mu_new = np.maximum(mu_new, 0)  # project onto the non-negative orthant
        mu_new /= mu_new.sum()          # renormalize onto the simplex

        # Check convergence
        if np.linalg.norm(mu_new - mu) < tol:
            mu = mu_new
            break
        mu = mu_new

    return mu, alpha


def lp_mkl_constraint(mu, p):
    """Lp norm constraint ||mu||_p <= 1 (non-negative when satisfied)."""
    return 1 - np.sum(np.abs(mu) ** p) ** (1 / p)


# Example usage
np.random.seed(42)
n = 100

# Synthetic PSD kernel matrices (stand-ins for, e.g., RBF, polynomial, linear Gram matrices)
K1 = np.random.randn(n, n); K1 = K1 @ K1.T
K2 = np.random.randn(n, n); K2 = K2 @ K2.T
K3 = np.random.randn(n, n); K3 = K3 @ K3.T
y = np.sign(np.random.randn(n))

Ks = [K1, K2, K3]
# mu, alpha = simple_mkl(Ks, y)  # Uncomment to run
print("SimpleMKL would learn optimal kernel combination weights")
```

The joint optimization over predictor parameters and kernel weights creates a challenging optimization problem. Several algorithmic frameworks have been developed.
Alternating Optimization:
The most intuitive approach iterates between two subproblems: with the kernel weights fixed, fit the predictor (e.g., solve a standard SVM on the combined kernel); then, with the predictor fixed, update the kernel weights.
Each step is convex, but the joint problem is not—convergence is to a local optimum.
Semi-Definite Programming (SDP) Formulation:
The original MKL formulation (Lanckriet et al., 2004) casts the problem as a semi-definite program:
$$\min_{\boldsymbol{\mu}, K} \max_{\boldsymbol{\alpha}} \; \mathbf{1}^\top \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^\top Y K Y \boldsymbol{\alpha}$$
$$\text{s.t. } K = \sum_m \mu_m K_m, \; \boldsymbol{\mu} \geq 0, \; \|\boldsymbol{\mu}\|_1 = 1, \; K \succeq 0$$
where $Y = \operatorname{diag}(\mathbf{y})$ and $\boldsymbol{\alpha}$ ranges over the SVM dual feasible set.
Pros: Globally optimal solution. Cons: Very expensive ($O(n^6)$ complexity), limited to small datasets.
Second-Order Cone Programming (SOCP):
A reformulation that's more efficient than SDP while maintaining convexity, reducing complexity to $O(n^3 M)$.
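The same problem can be posed as a cone program in which a single variable $t$ upper-bounds every per-kernel quadratic term; the kernel weights are then recovered as the dual variables of those constraints. A minimal sketch using CVXPY (assumed installed), with synthetic PSD matrices standing in for real Gram matrices and Cholesky factors turning each quadratic bound into a second-order cone constraint:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, M, C = 30, 3, 1.0

# Synthetic labels and toy PSD base kernels (placeholders for real Gram matrices)
y = np.sign(rng.normal(size=n))
Ks = []
for _ in range(M):
    Z = rng.normal(size=(n, n))
    Ks.append(Z @ Z.T + 1e-6 * np.eye(n))

# Reformulation: maximize 1'a - t  s.t.  t >= 0.5 * (a*y)' K_m (a*y) for all m,
# 0 <= a <= C, y'a = 0.  Writing each quadratic through a Cholesky factor makes
# the bound a second-order cone constraint.
alpha = cp.Variable(n)
t = cp.Variable()
az = cp.multiply(y, alpha)                    # elementwise alpha_i * y_i
quad_cons = []
for K in Ks:
    L = np.linalg.cholesky(K)                 # K = L L'
    quad_cons.append(t >= 0.5 * cp.sum_squares(L.T @ az))
constraints = quad_cons + [alpha >= 0, alpha <= C, y @ alpha == 0]
prob = cp.Problem(cp.Maximize(cp.sum(alpha) - t), constraints)
prob.solve()

# Kernel weights = dual variables of the per-kernel constraints (non-negative, sum to 1)
mu = np.array([c.dual_value for c in quad_cons])
print("kernel weights:", np.round(mu / mu.sum(), 3))
```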
Exact MKL algorithms become impractical beyond a few thousand training points. Practical implementations use approximations: stochastic gradient descent, random Fourier features, or Nyström approximations of base kernels.
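For instance, each Gaussian base kernel can be replaced by an explicit random Fourier feature map, so a weighted sum of kernels becomes a concatenation of $\sqrt{\mu_m}$-scaled feature blocks and no $n \times n$ matrix is ever formed (a sketch; the bandwidths, weights, and feature dimension below are arbitrary):

```python
import numpy as np

def rff_features(X, gamma, D=200, seed=0):
    """Random Fourier features z(x) with E[z(x)·z(x')] ≈ exp(-gamma ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))  # spectral samples of the RBF kernel
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
gammas = [0.01, 0.1, 1.0]           # bandwidths of the base kernels (assumed)
mu = np.array([0.2, 0.5, 0.3])      # example kernel weights

# Weighted kernel sum == concatenation of sqrt(mu_m)-scaled feature maps,
# so a linear model on Phi approximates the MKL predictor at O(n) memory.
Phi = np.hstack([np.sqrt(w) * rff_features(X, g, seed=i)
                 for i, (w, g) in enumerate(zip(mu, gammas))])
print("feature matrix shape:", Phi.shape)  # (n, M * D)
```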
Gradient Descent Methods:
For large-scale MKL, gradient-based optimization is preferred: compute the gradient of the objective with respect to $\boldsymbol{\mu}$ and take projected-gradient or mirror-descent steps on the simplex (see the sketch below).
The Wrapper vs. Interleaved Debate:
Wrapper methods treat the base learner as a black box and re-fit it after every weight update; they are simple but require expensive repeated model fitting. Interleaved methods update the kernel weights inside the solver's own iterations; they are more efficient but algorithm-specific.
```python
import numpy as np


def mirror_descent_mkl_update(mu, grad, eta=0.1):
    """
    Mirror descent update on the probability simplex,
    using the KL divergence as the Bregman divergence.

    Parameters
    ----------
    mu : current weights (on the simplex)
    grad : gradient of the loss w.r.t. mu
    eta : step size

    Returns
    -------
    mu_new : updated weights (on the simplex)
    """
    # Multiplicative update (equivalent to mirror descent with KL)
    log_mu = np.log(mu + 1e-10) - eta * grad
    mu_new = np.exp(log_mu)
    mu_new /= mu_new.sum()  # normalize back onto the simplex
    return mu_new


def projected_gradient_mkl_update(mu, grad, eta=0.1):
    """Projected gradient descent update on the simplex."""
    mu_new = mu - eta * grad        # gradient step
    return project_simplex(mu_new)  # Euclidean projection onto the simplex


def project_simplex(v):
    """
    Project vector v onto the probability simplex
    (O(n log n) sort-based algorithm, Duchi et al.).
    """
    n = len(v)
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, n + 1) > (cssv - 1))[0][-1]
    theta = (cssv[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0)


def easy_mkl_weights(Ks, y):
    """
    Alignment-based kernel weighting (a simple EasyMKL-style default):
    weights proportional to the centered alignment between each base
    kernel and the ideal kernel y y^T.
    """
    y = y.reshape(-1, 1)
    yy = y @ y.T  # ideal kernel (outer product of labels)
    yy_centered = yy - yy.mean(axis=0) - yy.mean(axis=1, keepdims=True) + yy.mean()

    scores = []
    for K in Ks:
        # Centered kernel alignment
        K_centered = K - K.mean(axis=0) - K.mean(axis=1, keepdims=True) + K.mean()
        alignment = np.sum(K_centered * yy_centered) / (
            np.linalg.norm(K_centered, 'fro') * np.linalg.norm(yy_centered, 'fro')
        )
        scores.append(max(alignment, 0))  # keep only positive alignments

    # Normalize the scores to obtain simplex weights
    scores = np.array(scores)
    if scores.sum() > 0:
        return scores / scores.sum()
    return np.ones(len(Ks)) / len(Ks)


# Example
print("Simplex projection of [0.5, 0.5, 0.5]:",
      project_simplex(np.array([0.5, 0.5, 0.5])))
```

Standard MKL learns a single set of kernel weights for the entire input space. However, different kernels may be appropriate in different regions of the input space.
Localized MKL addresses this by learning input-dependent kernel weights.
Formulation:
$$k(\mathbf{x}, \mathbf{x}'; \eta) = \sum_{m=1}^{M} \eta_m(\mathbf{x}) \eta_m(\mathbf{x}') k_m(\mathbf{x}, \mathbf{x}')$$
where $\eta_m: \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}$ is a gating function that determines the local weight of kernel $m$ at input $\mathbf{x}$.
The product $\eta_m(\mathbf{x})\,\eta_m(\mathbf{x}')$ ensures the combined kernel remains positive definite; using a single factor $\eta_m(\mathbf{x})$ would break symmetry and positive definiteness. This product form means the weights are computed independently for each input and then multiplied.
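A quick numerical check of this positive-definiteness claim, with random non-negative gating values and toy PSD base kernels (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 60, 3

# Toy PSD base kernels and arbitrary non-negative gating values eta_m(x_i)
Ks = []
for _ in range(M):
    Z = rng.normal(size=(n, n))
    Ks.append(Z @ Z.T)
eta = rng.random(size=(n, M))

# Each term outer(eta_m, eta_m) * K_m is a Schur (elementwise) product of two PSD
# matrices, so the gated combination stays PSD.
K_loc = sum(np.outer(eta[:, m], eta[:, m]) * Ks[m] for m in range(M))
print("min eigenvalue:", np.linalg.eigvalsh((K_loc + K_loc.T) / 2).min())  # ~ >= 0
```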
Gating Function Approaches:
Softmax gating: parametric; learns linear decision boundaries for kernel selection.
Cluster- or kernel-based gating: non-parametric; can capture more complex gating boundaries.
SimpleMKL-loc:
Extends SimpleMKL to the localized setting by alternating between the SVM solve and gradient updates of the gating-function parameters.
| Aspect | Global MKL | Localized MKL |
|---|---|---|
| Weights | Single μ for all inputs | η(x) varies with input |
| Expressiveness | Limited to global selection | Adapts to local structure |
| Parameters | M weights | M × (gating params) |
| Overfitting risk | Lower | Higher (more parameters) |
| Interpretability | Clear: which kernels matter | Harder: where and which kernels |
| Optimization | Convex possible | Typically non-convex |
```python
import numpy as np
from scipy.special import softmax


def softmax_gating(X, W):
    """
    Compute softmax gating weights for localized MKL.

    Parameters
    ----------
    X : input data (n, d)
    W : gating parameters (M, d) for M kernels

    Returns
    -------
    eta : gating weights (n, M)
    """
    logits = X @ W.T  # (n, M)
    return softmax(logits, axis=1)


def localized_kernel_matrix(X, Ks, W):
    """
    Compute the localized MKL kernel matrix

        K_loc(x, x') = sum_m eta_m(x) * eta_m(x') * K_m(x, x')

    Parameters
    ----------
    X : input data (n, d)
    Ks : list of M base kernel matrices (n, n)
    W : gating parameters (M, d)

    Returns
    -------
    K : localized combined kernel matrix (n, n)
    """
    eta = softmax_gating(X, W)  # (n, M)
    M = len(Ks)
    n = X.shape[0]

    K = np.zeros((n, n))
    for m in range(M):
        # Outer product of gating weights for kernel m
        gate_product = np.outer(eta[:, m], eta[:, m])
        K += gate_product * Ks[m]
    return K


def cluster_based_local_mkl(X, Ks, n_clusters=5):
    """
    Cluster-based localized MKL: each cluster gets its own weight vector.
    """
    from sklearn.cluster import KMeans

    M = len(Ks)
    n = X.shape[0]

    # Cluster the data
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X)

    # Each cluster gets its own kernel weights
    # (in practice these would be learned, not sampled at random)
    cluster_weights = np.random.dirichlet(np.ones(M), size=n_clusters)

    # Build the kernel matrix with cluster-specific weights
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ci, cj = cluster_labels[i], cluster_labels[j]
            for m in range(M):
                K[i, j] += np.sqrt(cluster_weights[ci, m] * cluster_weights[cj, m]) * Ks[m][i, j]
    return K, cluster_weights


print("Localized MKL allows input-dependent kernel selection")
```

Successfully applying MKL requires careful attention to base kernel design, normalization, and hyperparameter selection.
Base Kernel Design:
Coverage: Ensure the base kernels cover different aspects of the data, e.g., different kernel families (linear, polynomial, RBF), different length-scales, and different feature types.
Redundancy vs. Diversity: Some redundancy (overlapping kernels) is acceptable, as MKL will assign low weights. However, too many similar kernels can slow optimization.
Scale matching: Different kernels may have very different value ranges. A polynomial kernel of degree 5 can produce huge values compared to RBF's [0,1] range.
Kernel Normalization:
Normalize each base kernel to a common scale:
$$\tilde{k}_m(\mathbf{x}, \mathbf{x}') = \frac{k_m(\mathbf{x}, \mathbf{x}')}{\sqrt{k_m(\mathbf{x}, \mathbf{x}) \cdot k_m(\mathbf{x}', \mathbf{x}')}}$$
Or normalize the kernel matrix: $$\tilde{K}_m = D_m^{-1/2} K_m D_m^{-1/2}$$
where $D_m = \text{diag}(K_m)$.
Without normalization, MKL may select kernels based on scale rather than relevance. A high-variance kernel will dominate simply because its values are larger. Always normalize base kernels before combining.
```python
import numpy as np
from scipy.spatial.distance import cdist


def create_rbf_kernel_grid(X, gamma_range):
    """Create a grid of RBF kernels with different length-scales."""
    kernels = []
    sq_dists = cdist(X, X, metric='sqeuclidean')
    for gamma in gamma_range:
        kernels.append(np.exp(-gamma * sq_dists))
    return kernels


def normalize_kernel(K):
    """
    Cosine normalization of a kernel matrix:
    K_norm(i, j) = K(i, j) / sqrt(K(i, i) * K(j, j))
    """
    diag = np.diag(K)
    diag_sqrt = np.sqrt(np.maximum(diag, 1e-10))
    return K / np.outer(diag_sqrt, diag_sqrt)


def trace_normalize_kernel(K):
    """
    Normalize a kernel by its trace: K / trace(K) * n.
    Ensures trace(K) = n (unit average self-similarity).
    """
    n = K.shape[0]
    return K / np.trace(K) * n


def build_mkl_kernel_set(X):
    """Build a comprehensive set of normalized base kernels for MKL."""
    kernels = []
    kernel_names = []

    # 1. RBF kernels with different length-scales
    gammas = [0.001, 0.01, 0.1, 1.0, 10.0]
    for gamma in gammas:
        K = np.exp(-gamma * cdist(X, X, 'sqeuclidean'))
        kernels.append(normalize_kernel(K))
        kernel_names.append(f'RBF(γ={gamma})')

    # 2. Polynomial kernels
    for degree in [2, 3, 4]:
        K = (X @ X.T + 1) ** degree
        kernels.append(normalize_kernel(K))
        kernel_names.append(f'Poly(d={degree})')

    # 3. Linear kernel
    K_linear = X @ X.T
    kernels.append(normalize_kernel(K_linear))
    kernel_names.append('Linear')

    return kernels, kernel_names


# Example
np.random.seed(42)
X = np.random.randn(100, 10)
kernels, names = build_mkl_kernel_set(X)
print(f"Created {len(kernels)} base kernels:")
for name in names:
    print(f"  - {name}")
```

You now understand how to combine multiple kernels optimally using MKL. Next, we'll explore kernel alignment, a principled way to measure kernel quality and guide kernel selection.