No single kernel is optimal for all problems. In practice, the best kernel and its hyperparameters are rarely known in advance, and data often mix several feature types and scales.
Multiple Kernel Learning (MKL) addresses these challenges by learning an optimal combination of base kernels from data.
This page covers the theory of kernel combination, MKL formulations (linear, sparse, localized), optimization algorithms (alternating, gradient-based), and practical considerations for applying MKL effectively.
The fundamental property enabling kernel combination is that positive definite kernels are closed under certain operations. This means we can construct new valid kernels from existing ones.
Closure Properties of Positive Definite Kernels:
Given valid PD kernels $k_1$ and $k_2$, the operations summarized in the table below also yield valid PD kernels.
Sum corresponds to concatenating feature maps: if $k_1 \to \phi_1$ and $k_2 \to \phi_2$, then $k_1 + k_2 \to [\phi_1; \phi_2]$. Product corresponds to the tensor product of feature maps, $\phi_1 \otimes \phi_2$. These correspondences give geometric intuition for combined kernels.
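A quick numerical check of these closure properties (a minimal sketch; the linear and RBF base kernels and the 0.3/0.7 weights are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# Two base Gram matrices on the same sample: linear and RBF (arbitrary choices)
K_lin = X @ X.T
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.5 * sq_dists)

def min_eig(K):
    """Smallest eigenvalue; values >= -1e-8 indicate PSD up to rounding error."""
    return np.linalg.eigvalsh((K + K.T) / 2).min()

print("sum:              ", min_eig(K_lin + K_rbf))              # closure under addition
print("elementwise prod: ", min_eig(K_lin * K_rbf))              # Schur product theorem
print("0.3*K1 + 0.7*K2:  ", min_eig(0.3 * K_lin + 0.7 * K_rbf))  # conic combination
```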
Linear Combination of Kernels:
The most common MKL formulation uses a linear (conic) combination:
$$k(\mathbf{x}, \mathbf{y}; \boldsymbol{\mu}) = \sum_{m=1}^{M} \mu_m k_m(\mathbf{x}, \mathbf{y})$$
where $\{k_m\}_{m=1}^{M}$ are base kernels and $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_M)^\top$ with $\mu_m \geq 0$ are kernel weights.
The Learning Problem:
In MKL, we simultaneously learn the prediction function (e.g., the SVM dual variables) and the kernel weights $\boldsymbol{\mu}$.
This joint optimization is typically non-convex, but can be reformulated as a convex (though more complex) optimization problem.
| Operation | Formula | Feature Space Effect | Use Case |
|---|---|---|---|
| Sum | $k_1 + k_2$ | Concatenation: $[\phi_1; \phi_2]$ | Different feature types |
| Product | $k_1 \cdot k_2$ | Tensor product: $\phi_1 \otimes \phi_2$ | Interaction between features |
| Weighted sum | $\sum \mu_m k_m$ | Weighted concatenation | Feature/scale selection |
| Power | $k^p$ (positive integer $p$) | Higher-order polynomial expansion | Increased expressiveness |
Different MKL formulations impose different constraints and regularization on the kernel weights, leading to different solution properties.
Standard (L1-regularized) MKL:
The original MKL formulation (Lanckriet et al., 2004; Bach et al., 2004) uses an implicit L1 constraint:
$$\min_{\boldsymbol{\mu}} \min_{f \in \mathcal{H}_{\boldsymbol{\mu}}} \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, f(\mathbf{x}_i)) + \lambda \|f\|_{\mathcal{H}_{\boldsymbol{\mu}}}^2$$
$$\text{subject to } \boldsymbol{\mu} \geq 0, \; \|\boldsymbol{\mu}\|_1 = 1$$
where $\mathcal{H}_{\boldsymbol{\mu}}$ is the RKHS of $k_{\boldsymbol{\mu}} = \sum_m \mu_m k_m$.
Key Property: L1 constraint induces sparsity—only a few kernels receive non-zero weight. This acts as automatic kernel selection.
Sparse MKL is interpretable (few kernels selected) but may discard useful complementary information. When base kernels capture truly different but all-relevant aspects, non-sparse MKL often performs better.
Lp-norm MKL (Kloft et al., 2011):
Generalizes to $L_p$ regularization on kernel weights:
$$\|\boldsymbol{\mu}\|_p \leq 1, \quad p \geq 1$$
In practice, $p=2$ often outperforms $p=1$ when all kernels contribute useful information.
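When the per-kernel components $\mathbf{w}_m$ are held fixed, the weight subproblem under the $L_p$ constraint has a closed-form solution with $\mu_m \propto \|\mathbf{w}_m\|^{2/(p+1)}$, rescaled to unit $L_p$ norm. A minimal sketch of that update (the norm values below are made up purely for illustration):

```python
import numpy as np

def lp_mkl_weight_update(w_norms, p):
    """Closed-form Lp-MKL weight update for fixed per-kernel norms ||w_m||.

    Minimizes sum_m ||w_m||^2 / mu_m subject to ||mu||_p <= 1, mu >= 0,
    which gives mu_m proportional to ||w_m||^(2/(p+1)).
    """
    w_norms = np.asarray(w_norms, dtype=float)
    mu = w_norms ** (2.0 / (p + 1))
    return mu / np.linalg.norm(mu, ord=p)  # rescale onto the Lp sphere

# Hypothetical per-kernel norms, purely illustrative
w_norms = [2.0, 1.0, 0.1, 0.01]
print("p=1:", np.round(lp_mkl_weight_update(w_norms, p=1), 3))  # more concentrated
print("p=2:", np.round(lp_mkl_weight_update(w_norms, p=2), 3))  # smoother spread
```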
SimpleMKL (Rakotomamonjy et al., 2008):
A widely used, efficient algorithm that alternates between solving a standard SVM for the current combined kernel and updating the kernel weights with a gradient step on the simplex (see the sketch below).
Bayesian MKL:
Places priors on kernel weights for uncertainty quantification:
$$\mu_m \sim \text{Gamma}(a, b) \quad \text{or} \quad \boldsymbol{\mu} \sim \text{Dirichlet}(\boldsymbol{\alpha})$$
Enables posterior inference over which kernels are relevant.
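A small sketch of the generative view (prior sampling only, with toy random PSD kernels and an arbitrary symmetric Dirichlet concentration; actual posterior inference would also condition on the labels):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 40, 3

# Toy base Gram matrices (random PSD placeholders for real kernels)
Ks = []
for _ in range(M):
    Z = rng.normal(size=(n, n))
    Ks.append(Z @ Z.T)

# Dirichlet prior over kernel weights; each draw induces a combined kernel.
# Posterior inference would reweight these draws by how well each combination
# explains the observed labels.
alpha = np.ones(M)                 # symmetric concentration (assumed)
for s in range(3):
    mu = rng.dirichlet(alpha)      # mu >= 0, sums to 1
    K_s = sum(w * K for w, K in zip(mu, Ks))
    print(f"sample {s}: mu = {np.round(mu, 3)}, trace(K) = {np.trace(K_s):.1f}")
```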
| Formulation | Constraint | Weight Properties | Best When |
|---|---|---|---|
| L1-MKL | $|\mu|_1 = 1$ | Sparse (few non-zero) | Many irrelevant kernels |
| L2-MKL | $|\mu|_2 \leq 1$ | Non-sparse, smooth | All kernels useful |
| L∞-MKL | $|\mu|_\infty \leq 1$ | Approximately uniform | Equal importance known |
| EasyMKL | Centered alignment | Non-sparse | Simple, effective default |
| Bayesian MKL | Prior $p(\mu)$ | Uncertainty quantified | Need confidence estimates |
```python
import numpy as np
from sklearn.svm import SVC


def simple_mkl(Ks, y, C=1.0, max_iter=100, tol=1e-4):
    """
    SimpleMKL-style alternating algorithm for L1-MKL.

    Parameters
    ----------
    Ks : list of kernel matrices, each (n, n)
    y : labels (n,)
    C : SVM regularization

    Returns
    -------
    mu : kernel weights
    alpha : SVM dual variable magnitudes
    """
    M = len(Ks)
    n = len(y)

    # Initialize weights uniformly on the simplex
    mu = np.ones(M) / M
    alpha = np.zeros(n)

    for _ in range(max_iter):
        # Step 1: Solve a standard SVM with the combined kernel
        K_combined = sum(mu[m] * Ks[m] for m in range(M))
        svm = SVC(kernel='precomputed', C=C)
        svm.fit(K_combined, y)

        # sklearn stores the signed duals y_i * alpha_i for the support vectors
        beta = np.zeros(n)
        beta[svm.support_] = svm.dual_coef_.ravel()
        alpha = np.abs(beta)

        # Step 2: Gradient of the dual objective w.r.t. mu
        # dJ/dmu_m = -0.5 * (y * alpha)^T K_m (y * alpha)
        grad = np.array([-0.5 * beta @ Ks[m] @ beta for m in range(M)])

        # Step 3: Update mu via a projected gradient step
        # (fixed step size in place of SimpleMKL's line search)
        d = grad - grad.mean()          # descent direction tangent to the simplex
        mu_new = mu - 0.1 * d           # gradient step
        mu_new = np.maximum(mu_new, 0)  # project onto the non-negative orthant
        mu_new /= mu_new.sum()          # renormalize onto the simplex

        # Check convergence
        if np.linalg.norm(mu_new - mu) < tol:
            mu = mu_new
            break
        mu = mu_new

    return mu, alpha


def lp_mkl_constraint(mu, p):
    """Lp norm constraint ||mu||_p <= 1 (non-negative when satisfied)."""
    return 1 - np.sum(np.abs(mu) ** p) ** (1 / p)


# Example usage
np.random.seed(42)
n = 100

# Synthetic PSD kernel matrices (stand-ins for, e.g., RBF, polynomial, linear Gram matrices)
K1 = np.random.randn(n, n); K1 = K1 @ K1.T
K2 = np.random.randn(n, n); K2 = K2 @ K2.T
K3 = np.random.randn(n, n); K3 = K3 @ K3.T
y = np.sign(np.random.randn(n))

Ks = [K1, K2, K3]
# mu, alpha = simple_mkl(Ks, y)  # Uncomment to run
print("SimpleMKL would learn optimal kernel combination weights")
```

The joint optimization over predictor parameters and kernel weights creates a challenging optimization problem. Several algorithmic frameworks have been developed.
Alternating Optimization:
The most intuitive approach iterates between two subproblems: with the kernel weights fixed, fit the predictor (e.g., solve a standard SVM on the combined kernel); then, with the predictor fixed, update the kernel weights.
Each step is convex, but the joint problem is not—convergence is to a local optimum.
Semi-Definite Programming (SDP) Formulation:
The original MKL formulation (Lanckriet et al., 2004) casts the problem as a semi-definite program:
$$\min_{\boldsymbol{\mu}, K} \max_{\boldsymbol{\alpha}} \; \mathbf{1}^\top \boldsymbol{\alpha} - \frac{1}{2} \boldsymbol{\alpha}^\top Y K Y \boldsymbol{\alpha}$$
$$\text{s.t. } K = \sum_m \mu_m K_m, \; \boldsymbol{\mu} \geq 0, \; \|\boldsymbol{\mu}\|_1 = 1, \; K \succeq 0$$
where $Y = \operatorname{diag}(\mathbf{y})$ and $\boldsymbol{\alpha}$ ranges over the SVM dual feasible set.
Pros: Globally optimal solution. Cons: Very expensive ($O(n^6)$ complexity), limited to small datasets.
Second-Order Cone Programming (SOCP):
A reformulation that's more efficient than SDP while maintaining convexity, reducing complexity to $O(n^3 M)$.
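The same problem can be posed as a cone program in which a single variable $t$ upper-bounds every per-kernel quadratic term; the kernel weights are then recovered as the dual variables of those constraints. A minimal sketch using CVXPY (assumed installed), with synthetic PSD matrices standing in for real Gram matrices and Cholesky factors turning each quadratic bound into a second-order cone constraint:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, M, C = 30, 3, 1.0

# Synthetic labels and toy PSD base kernels (placeholders for real Gram matrices)
y = np.sign(rng.normal(size=n))
Ks = []
for _ in range(M):
    Z = rng.normal(size=(n, n))
    Ks.append(Z @ Z.T + 1e-6 * np.eye(n))

# Reformulation: maximize 1'a - t  s.t.  t >= 0.5 * (a*y)' K_m (a*y) for all m,
# 0 <= a <= C, y'a = 0.  Writing each quadratic through a Cholesky factor makes
# the bound a second-order cone constraint.
alpha = cp.Variable(n)
t = cp.Variable()
az = cp.multiply(y, alpha)                    # elementwise alpha_i * y_i
quad_cons = []
for K in Ks:
    L = np.linalg.cholesky(K)                 # K = L L'
    quad_cons.append(t >= 0.5 * cp.sum_squares(L.T @ az))
constraints = quad_cons + [alpha >= 0, alpha <= C, y @ alpha == 0]
prob = cp.Problem(cp.Maximize(cp.sum(alpha) - t), constraints)
prob.solve()

# Kernel weights = dual variables of the per-kernel constraints (non-negative, sum to 1)
mu = np.array([c.dual_value for c in quad_cons])
print("kernel weights:", np.round(mu / mu.sum(), 3))
```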
Exact MKL algorithms become impractical beyond a few thousand training points. Practical implementations use approximations: stochastic gradient descent, random Fourier features, or Nyström approximations of base kernels.
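For instance, each Gaussian base kernel can be replaced by an explicit random Fourier feature map, so a weighted sum of kernels becomes a concatenation of $\sqrt{\mu_m}$-scaled feature blocks and no $n \times n$ matrix is ever formed (a sketch; the bandwidths, weights, and feature dimension below are arbitrary):

```python
import numpy as np

def rff_features(X, gamma, D=200, seed=0):
    """Random Fourier features z(x) with E[z(x)·z(x')] ≈ exp(-gamma ||x - x'||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))  # spectral samples of the RBF kernel
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
gammas = [0.01, 0.1, 1.0]           # bandwidths of the base kernels (assumed)
mu = np.array([0.2, 0.5, 0.3])      # example kernel weights

# Weighted kernel sum == concatenation of sqrt(mu_m)-scaled feature maps,
# so a linear model on Phi approximates the MKL predictor at O(n) memory.
Phi = np.hstack([np.sqrt(w) * rff_features(X, g, seed=i)
                 for i, (w, g) in enumerate(zip(mu, gammas))])
print("feature matrix shape:", Phi.shape)  # (n, M * D)
```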
Gradient Descent Methods:
For large-scale MKL, gradient-based optimization is preferred: compute the gradient of the objective with respect to $\boldsymbol{\mu}$ and take projected-gradient or mirror-descent steps on the simplex (see the sketch below).
The Wrapper vs. Interleaved Debate:
Wrapper methods treat the base learner as a black box and re-fit it after every weight update; they are simple but require expensive repeated model fitting. Interleaved methods update the kernel weights inside the solver's own iterations; they are more efficient but algorithm-specific.
```python
import numpy as np


def mirror_descent_mkl_update(mu, grad, eta=0.1):
    """
    Mirror descent update on the probability simplex,
    using the KL divergence as the Bregman divergence.

    Parameters
    ----------
    mu : current weights (on the simplex)
    grad : gradient of the loss w.r.t. mu
    eta : step size

    Returns
    -------
    mu_new : updated weights (on the simplex)
    """
    # Multiplicative update (equivalent to mirror descent with KL)
    log_mu = np.log(mu + 1e-10) - eta * grad
    mu_new = np.exp(log_mu)
    mu_new /= mu_new.sum()  # normalize back onto the simplex
    return mu_new


def projected_gradient_mkl_update(mu, grad, eta=0.1):
    """Projected gradient descent update on the simplex."""
    mu_new = mu - eta * grad        # gradient step
    return project_simplex(mu_new)  # Euclidean projection onto the simplex


def project_simplex(v):
    """
    Project vector v onto the probability simplex
    (O(n log n) sort-based algorithm, Duchi et al.).
    """
    n = len(v)
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, n + 1) > (cssv - 1))[0][-1]
    theta = (cssv[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0)


def easy_mkl_weights(Ks, y):
    """
    Alignment-based kernel weighting (a simple EasyMKL-style default):
    weights proportional to the centered alignment between each base
    kernel and the ideal kernel y y^T.
    """
    y = y.reshape(-1, 1)
    yy = y @ y.T  # ideal kernel (outer product of labels)
    yy_centered = yy - yy.mean(axis=0) - yy.mean(axis=1, keepdims=True) + yy.mean()

    scores = []
    for K in Ks:
        # Centered kernel alignment
        K_centered = K - K.mean(axis=0) - K.mean(axis=1, keepdims=True) + K.mean()
        alignment = np.sum(K_centered * yy_centered) / (
            np.linalg.norm(K_centered, 'fro') * np.linalg.norm(yy_centered, 'fro')
        )
        scores.append(max(alignment, 0))  # keep only positive alignments

    # Normalize the scores to obtain simplex weights
    scores = np.array(scores)
    if scores.sum() > 0:
        return scores / scores.sum()
    return np.ones(len(Ks)) / len(Ks)


# Example
print("Simplex projection of [0.5, 0.5, 0.5]:",
      project_simplex(np.array([0.5, 0.5, 0.5])))
```

Standard MKL learns a single set of kernel weights for the entire input space. However, different kernels may be appropriate in different regions of the input space.
Localized MKL addresses this by learning input-dependent kernel weights.
Formulation:
$$k(\mathbf{x}, \mathbf{x}'; \eta) = \sum_{m=1}^{M} \eta_m(\mathbf{x}) \eta_m(\mathbf{x}') k_m(\mathbf{x}, \mathbf{x}')$$
where $\eta_m: \mathcal{X} \rightarrow \mathbb{R}_{\geq 0}$ is a gating function that determines the local weight of kernel $m$ at input $\mathbf{x}$.
The product $\eta_m(\mathbf{x})\,\eta_m(\mathbf{x}')$ ensures the combined kernel remains positive definite; using a single factor $\eta_m(\mathbf{x})$ would break symmetry and positive definiteness. This product form means the weights are computed independently for each input and then multiplied.
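A quick numerical check of this positive-definiteness claim, with random non-negative gating values and toy PSD base kernels (all values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 60, 3

# Toy PSD base kernels and arbitrary non-negative gating values eta_m(x_i)
Ks = []
for _ in range(M):
    Z = rng.normal(size=(n, n))
    Ks.append(Z @ Z.T)
eta = rng.random(size=(n, M))

# Each term outer(eta_m, eta_m) * K_m is a Schur (elementwise) product of two PSD
# matrices, so the gated combination stays PSD.
K_loc = sum(np.outer(eta[:, m], eta[:, m]) * Ks[m] for m in range(M))
print("min eigenvalue:", np.linalg.eigvalsh((K_loc + K_loc.T) / 2).min())  # ~ >= 0
```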
Gating Function Approaches:
Softmax gating: parametric; learns linear decision boundaries for kernel selection.
Cluster- or kernel-based gating: non-parametric; can capture more complex gating boundaries.
SimpleMKL-loc:
Extends SimpleMKL to the localized setting by alternating between the SVM solve and gradient updates of the gating-function parameters.
| Aspect | Global MKL | Localized MKL |
|---|---|---|
| Weights | Single μ for all inputs | η(x) varies with input |
| Expressiveness | Limited to global selection | Adapts to local structure |
| Parameters | M weights | M × (gating params) |
| Overfitting risk | Lower | Higher (more parameters) |
| Interpretability | Clear: which kernels matter | Harder: where and which kernels |
| Optimization | Convex possible | Typically non-convex |
```python
import numpy as np
from scipy.special import softmax


def softmax_gating(X, W):
    """
    Compute softmax gating weights for localized MKL.

    Parameters
    ----------
    X : input data (n, d)
    W : gating parameters (M, d) for M kernels

    Returns
    -------
    eta : gating weights (n, M)
    """
    logits = X @ W.T  # (n, M)
    return softmax(logits, axis=1)


def localized_kernel_matrix(X, Ks, W):
    """
    Compute the localized MKL kernel matrix

        K_loc(x, x') = sum_m eta_m(x) * eta_m(x') * K_m(x, x')

    Parameters
    ----------
    X : input data (n, d)
    Ks : list of M base kernel matrices (n, n)
    W : gating parameters (M, d)

    Returns
    -------
    K : localized combined kernel matrix (n, n)
    """
    eta = softmax_gating(X, W)  # (n, M)
    M = len(Ks)
    n = X.shape[0]

    K = np.zeros((n, n))
    for m in range(M):
        # Outer product of gating weights for kernel m
        gate_product = np.outer(eta[:, m], eta[:, m])
        K += gate_product * Ks[m]
    return K


def cluster_based_local_mkl(X, Ks, n_clusters=5):
    """
    Cluster-based localized MKL: each cluster gets its own weight vector.
    """
    from sklearn.cluster import KMeans

    M = len(Ks)
    n = X.shape[0]

    # Cluster the data
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X)

    # Each cluster gets its own kernel weights
    # (in practice these would be learned, not sampled at random)
    cluster_weights = np.random.dirichlet(np.ones(M), size=n_clusters)

    # Build the kernel matrix with cluster-specific weights
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ci, cj = cluster_labels[i], cluster_labels[j]
            for m in range(M):
                K[i, j] += np.sqrt(cluster_weights[ci, m] * cluster_weights[cj, m]) * Ks[m][i, j]
    return K, cluster_weights


print("Localized MKL allows input-dependent kernel selection")
```

Successfully applying MKL requires careful attention to base kernel design, normalization, and hyperparameter selection.
Base Kernel Design:
Coverage: Ensure the base kernels cover different aspects of the data, e.g., different kernel families (linear, polynomial, RBF), different length-scales, and different feature types.
Redundancy vs. Diversity: Some redundancy (overlapping kernels) is acceptable, as MKL will assign low weights. However, too many similar kernels can slow optimization.
Scale matching: Different kernels may have very different value ranges. A polynomial kernel of degree 5 can produce huge values compared to RBF's [0,1] range.
Kernel Normalization:
Normalize each base kernel to a common scale:
$$\tilde{k}_m(\mathbf{x}, \mathbf{x}') = \frac{k_m(\mathbf{x}, \mathbf{x}')}{\sqrt{k_m(\mathbf{x}, \mathbf{x}) \cdot k_m(\mathbf{x}', \mathbf{x}')}}$$
Or normalize the kernel matrix: $$\tilde{K}_m = D_m^{-1/2} K_m D_m^{-1/2}$$
where $D_m = \text{diag}(K_m)$.
Without normalization, MKL may select kernels based on scale rather than relevance. A high-variance kernel will dominate simply because its values are larger. Always normalize base kernels before combining.
```python
import numpy as np
from scipy.spatial.distance import cdist


def create_rbf_kernel_grid(X, gamma_range):
    """Create a grid of RBF kernels with different length-scales."""
    kernels = []
    sq_dists = cdist(X, X, metric='sqeuclidean')
    for gamma in gamma_range:
        kernels.append(np.exp(-gamma * sq_dists))
    return kernels


def normalize_kernel(K):
    """
    Cosine normalization of a kernel matrix:
    K_norm(i, j) = K(i, j) / sqrt(K(i, i) * K(j, j))
    """
    diag = np.diag(K)
    diag_sqrt = np.sqrt(np.maximum(diag, 1e-10))
    return K / np.outer(diag_sqrt, diag_sqrt)


def trace_normalize_kernel(K):
    """
    Normalize a kernel by its trace: K / trace(K) * n.
    Ensures trace(K) = n (unit average self-similarity).
    """
    n = K.shape[0]
    return K / np.trace(K) * n


def build_mkl_kernel_set(X):
    """Build a comprehensive set of normalized base kernels for MKL."""
    kernels = []
    kernel_names = []

    # 1. RBF kernels with different length-scales
    gammas = [0.001, 0.01, 0.1, 1.0, 10.0]
    for gamma in gammas:
        K = np.exp(-gamma * cdist(X, X, 'sqeuclidean'))
        kernels.append(normalize_kernel(K))
        kernel_names.append(f'RBF(γ={gamma})')

    # 2. Polynomial kernels
    for degree in [2, 3, 4]:
        K = (X @ X.T + 1) ** degree
        kernels.append(normalize_kernel(K))
        kernel_names.append(f'Poly(d={degree})')

    # 3. Linear kernel
    K_linear = X @ X.T
    kernels.append(normalize_kernel(K_linear))
    kernel_names.append('Linear')

    return kernels, kernel_names


# Example
np.random.seed(42)
X = np.random.randn(100, 10)
kernels, names = build_mkl_kernel_set(X)
print(f"Created {len(kernels)} base kernels:")
for name in names:
    print(f"  - {name}")
```

You now understand how to combine multiple kernels optimally using MKL. Next, we'll explore kernel alignment, a principled way to measure kernel quality and guide kernel selection.