The previous page established that sparse Gaussian Processes achieve scalability by compressing N training points into M << N inducing points. But we glossed over a critical question: how do we choose these M inducing points?
This is not merely an implementation detail—it is central to the success or failure of sparse GP methods. Poorly chosen inducing points can render the approximation useless, producing predictions no better than random guessing and uncertainty estimates that are wildly miscalibrated. Optimally chosen inducing points can capture the essence of the full GP with remarkably few representatives.
The inducing point selection problem is fundamentally a compression problem: given N data points with complex interdependencies encoded by the kernel, find M locations that best preserve the information content. This framing connects sparse GPs to core ideas in information theory, optimal experimental design, and numerical linear algebra.
This page develops the full spectrum of inducing point strategies, from simple heuristics suitable for quick prototyping to sophisticated optimization methods that learn inducing locations jointly with kernel hyperparameters. Understanding these approaches is essential for practical sparse GP deployment.
By the end of this page, you will command a toolkit of inducing point selection strategies: random and grid-based methods, clustering approaches, greedy information-theoretic selection, and gradient-based optimization. You'll understand the trade-offs between these approaches and develop practical guidelines for choosing among them.
Let us formalize what "good" inducing points mean. Given N training points $X = \{\mathbf{x}_1, ..., \mathbf{x}_N\}$ and targets $\mathbf{y}$, we seek M inducing locations $Z = \{\mathbf{z}_1, ..., \mathbf{z}_M\}$ that minimize the approximation error of the sparse GP.
Several objective functions capture different aspects of approximation quality:
1. Nyström approximation error:
$$\mathcal{L}_{\text{Nyström}}(Z) = \|K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}\|_F$$
This measures how well the low-rank approximation reconstructs the full kernel matrix. Minimizing this is an NP-hard column subset selection problem.
2. Predictive divergence:
$$\mathcal{L}_{\text{pred}}(Z) = \text{KL}\big(p_{\text{exact}}(f_* \mid \mathbf{y}) \,\|\, p_{\text{sparse}}(f_* \mid \mathbf{y})\big)$$
This measures the difference between exact and sparse GP posteriors. Difficult to compute exactly but captures what we ultimately care about.
3. Information loss:
$$\mathcal{L}_{\text{info}}(Z) = H(\mathbf{f} \mid \mathbf{u}) = \frac{1}{2}\log\left|K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}\right|$$
This measures the entropy of training function values not explained by inducing points. Lower entropy means inducing points capture more information.
All formulations face the same fundamental tension: computing the optimal inducing points requires O(N³) operations (to evaluate objective functions involving K_{ff}), which defeats the purpose of sparse methods. Practical selection algorithms must use approximations or heuristics that avoid full-scale kernel matrix operations.
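To make these objectives concrete, here is a minimal NumPy sketch that evaluates the Nyström error and the information-loss objective for a candidate $Z$. It builds the full $K_{ff}$, so it is only feasible for small N—exactly the tension described above. The helper names (`se_kernel`, `nystrom_error`, `information_loss`) and the jitter values are illustrative choices, not part of any library.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel, matching the examples later on this page."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def nystrom_error(X, Z, kernel_fn, jitter=1e-6):
    """Frobenius norm ||K_ff - K_fu K_uu^{-1} K_uf|| for a candidate Z."""
    Kff = kernel_fn(X, X)
    Kfu = kernel_fn(X, Z)
    Kuu = kernel_fn(Z, Z) + jitter * np.eye(len(Z))
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return np.linalg.norm(Kff - Qff, 'fro')

def information_loss(X, Z, kernel_fn, jitter=1e-6):
    """0.5 * log-determinant of the residual covariance K_ff - Q_ff."""
    Kff = kernel_fn(X, X)
    Kfu = kernel_fn(X, Z)
    Kuu = kernel_fn(Z, Z) + jitter * np.eye(len(Z))
    residual = Kff - Kfu @ np.linalg.solve(Kuu, Kfu.T) + jitter * np.eye(len(X))
    _, logdet = np.linalg.slogdet(residual)
    return 0.5 * logdet

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    Z = X[rng.choice(200, 10, replace=False)]  # a random candidate set
    print(nystrom_error(X, Z, se_kernel), information_loss(X, Z, se_kernel))
```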
Constraints on inducing point placement:
Crucially, inducing points $Z$ need not be a subset of training points $X$. They can be:
Subset selection: $Z \subseteq X$. Simplifies selection to a combinatorial problem but limits flexibility.
Continuous optimization: $Z \in \mathbb{R}^{M \times d}$. Allows inducing points anywhere in input space, typically optimized via gradient descent.
Constrained placement: Inducing points on a grid, in a learned subspace, or satisfying domain constraints.
The choice between these modes affects both the optimization landscape and the quality of the final approximation. Subset selection is simpler but may be suboptimal; continuous optimization is more flexible but may be harder to optimize.
For rapid prototyping and as baselines for more sophisticated methods, simple heuristics often suffice:
Random subset selection:
Select M points uniformly at random from the training set. This is the simplest possible approach:
```python
indices = np.random.choice(N, M, replace=False)
Z = X[indices]
```
Surprisingly effective for well-behaved problems where data is roughly uniformly distributed and the kernel length-scale is not too short. However, random selection has high variance—different random samples can produce very different approximation quality.
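To make the variance concrete, here is a small self-contained sketch (assuming a unit-length-scale SE kernel; all names are illustrative) that repeats random selection over several seeds and reports the spread of the relative Nyström error on a deliberately uneven dataset:

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / lengthscale**2)

def rel_nystrom_error(X, Z, jitter=1e-6):
    Kff = se_kernel(X, X)
    Kfu = se_kernel(X, Z)
    Kuu = se_kernel(Z, Z) + jitter * np.eye(len(Z))
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')

rng = np.random.default_rng(0)
# Uneven data: a large cluster plus a small distant one
X = np.vstack([rng.normal(size=(200, 2)), rng.normal(size=(50, 2)) + 5.0])

errors = []
for seed in range(20):
    idx = np.random.RandomState(seed).choice(len(X), 15, replace=False)
    errors.append(rel_nystrom_error(X, X[idx]))

print(f"relative error: mean={np.mean(errors):.3f}, "
      f"std={np.std(errors):.3f}, worst={np.max(errors):.3f}")
```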
Grid-based placement:
For low-dimensional problems (d ≤ 3), regular grids provide systematic coverage:
```python
# 2D grid
M_per_dim = int(np.sqrt(M))
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), M_per_dim)
x2 = np.linspace(X[:, 1].min(), X[:, 1].max(), M_per_dim)
Z = np.stack(np.meshgrid(x1, x2), -1).reshape(-1, 2)
```
Grids ensure uniform coverage but suffer from the curse of dimensionality—M must grow exponentially with dimension to maintain coverage density. In high dimensions, grids become impractical.
Clustering-based placement: run k-means on the training inputs and use the M cluster centers as inducing points. K-means with k-means++ initialization is the recommended default for most applications. It is fast, deterministic given a seed, provides good coverage of the data distribution, and is available in all major libraries. Use other methods only when k-means demonstrably underperforms or when specific problem structure suggests alternatives.
```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import cdist

def random_subset(X: np.ndarray, M: int, seed: int = None) -> np.ndarray:
    """Random subset of training points."""
    rng = np.random.RandomState(seed)
    indices = rng.choice(len(X), M, replace=False)
    return X[indices].copy()

def kmeans_inducing(X: np.ndarray, M: int, seed: int = None) -> np.ndarray:
    """K-means cluster centers as inducing points."""
    Z, _ = kmeans2(X, M, minit='++', seed=seed)
    return Z

def farthest_point_sampling(X: np.ndarray, M: int, seed: int = None) -> np.ndarray:
    """
    Farthest point sampling for maximally spread inducing points.

    Greedily selects points that maximize the minimum distance to the
    current inducing set. Provides excellent worst-case coverage.
    """
    rng = np.random.RandomState(seed)
    N = len(X)

    # Start with a random point
    selected = [rng.randint(N)]

    # Track minimum distance to the selected set for each point
    min_distances = np.full(N, np.inf)

    for _ in range(M - 1):
        # Update distances from the newly selected point
        new_distances = cdist(X, X[selected[-1:]], metric='euclidean').flatten()
        min_distances = np.minimum(min_distances, new_distances)

        # Select the point with maximum minimum distance
        min_distances[selected] = -np.inf  # Exclude already selected
        next_point = np.argmax(min_distances)
        selected.append(next_point)

    return X[selected].copy()

def leverage_score_sampling(X: np.ndarray, M: int, kernel_fn, seed: int = None) -> np.ndarray:
    """
    Leverage score sampling based on the kernel feature space.

    Samples points proportionally to their leverage scores, which measure
    each point's influence on the kernel matrix. More computationally
    expensive but can be more effective for non-uniform data distributions.
    """
    rng = np.random.RandomState(seed)
    N = len(X)

    # Compute the full kernel matrix (expensive but gives exact scores)
    K = kernel_fn(X, X)

    # Rank-M subspace leverage scores: squared row norms of the top-M eigenvectors
    eigvals, eigvecs = np.linalg.eigh(K + 1e-6 * np.eye(N))
    top_M = min(M, N)
    leverage_scores = np.sum(eigvecs[:, -top_M:]**2, axis=1)

    # Sample proportionally to leverage scores
    probs = leverage_scores / leverage_scores.sum()
    indices = rng.choice(N, M, replace=False, p=probs)

    return X[indices].copy()

# Comparison demonstration
def compare_selection_methods(X, M, kernel_fn):
    """Compare different inducing point selection strategies."""
    from numpy.linalg import norm

    methods = {
        'Random': random_subset(X, M, seed=42),
        'K-Means': kmeans_inducing(X, M, seed=42),
        'Farthest Point': farthest_point_sampling(X, M, seed=42),
    }

    print(f"Inducing Point Selection Comparison (N={len(X)}, M={M})")
    print("=" * 60)

    for name, Z in methods.items():
        # Compute Nyström approximation error
        Kff = kernel_fn(X, X)
        Kfu = kernel_fn(X, Z)
        Kuu = kernel_fn(Z, Z)

        try:
            Kuu_inv = np.linalg.inv(Kuu + 1e-6 * np.eye(M))
            Qff = Kfu @ Kuu_inv @ Kfu.T
            approx_error = norm(Kff - Qff, 'fro') / norm(Kff, 'fro')

            # Coverage: mean distance to nearest inducing point
            dists = cdist(X, Z)
            mean_min_dist = np.mean(np.min(dists, axis=1))

            print(f"{name:20s}: Rel. Error={approx_error:.4f}, Mean Dist={mean_min_dist:.4f}")
        except np.linalg.LinAlgError:
            print(f"{name:20s}: Failed (singular K_uu)")

# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate clustered data (challenging for random selection)
    X = np.vstack([
        np.random.randn(200, 2) + np.array([0, 0]),
        np.random.randn(200, 2) + np.array([5, 5]),
        np.random.randn(100, 2) + np.array([10, 0]),
    ])

    def se_kernel(X1, X2, lengthscale=1.0):
        dists_sq = np.sum((X1[:, None] - X2[None, :])**2, axis=2)
        return np.exp(-0.5 * dists_sq / lengthscale**2)

    compare_selection_methods(X, M=30, kernel_fn=se_kernel)
```

Beyond simple heuristics, greedy algorithms can optimize information-theoretic objectives that directly measure inducing point quality. These methods iteratively select points that maximize the marginal gain in an objective function.
Information gain selection:
The information gain from adding inducing point $\mathbf{z}$ to existing set $Z$ is:
$$\Delta I(\mathbf{z}; Z) = I(\mathbf{f}; \mathbf{u} \cup \{f(\mathbf{z})\}) - I(\mathbf{f}; \mathbf{u})$$
For Gaussian distributions, this can be computed as:
$$\Delta I(\mathbf{z}; Z) = \frac{1}{2}\log\left(1 + \frac{k(\mathbf{z}, \mathbf{z}) - \mathbf{k}_z^\top K_{uu}^{-1}\mathbf{k}_z}{\sigma_n^2}\right)$$
where $\mathbf{k}_z = [k(\mathbf{z}, \mathbf{z}_1), ..., k(\mathbf{z}, \mathbf{z}_M)]^\top$ is the cross-covariance with existing inducing points.
The numerator is the conditional variance of $f(\mathbf{z})$ given current inducing points—it measures how much new information $\mathbf{z}$ provides. Points in already well-covered regions have low conditional variance and thus low information gain.
The greedy algorithm:
```
Algorithm: Greedy Information-Gain Inducing Point Selection
Input:  Training points X, number of inducing points M, kernel k
Output: Inducing set Z

1. Initialize Z ← {random point from X}
2. For m = 2 to M:
   a. For each candidate z ∈ X \ Z:
      - Compute conditional variance: σ²(z|Z) = k(z,z) - k_z^T K_uu^{-1} k_z
      - Compute gain: ΔI(z) = 0.5 · log(1 + σ²(z|Z)/σ_n²)
   b. Select z* = argmax_z ΔI(z)
   c. Z ← Z ∪ {z*}
   d. Update K_uu with new row/column (rank-1 update)
3. Return Z
```
The information-gain objective is submodular: the marginal gain of adding a point decreases as Z grows. Submodularity guarantees that greedy selection achieves at least (1 - 1/e) ≈ 63% of the optimal information gain, providing a theoretical quality guarantee.
Naively, greedy selection requires O(NM) information gain evaluations, each involving O(M²) for K_{uu}^{-1} multiplication. The total cost is O(NM³). With Cholesky rank-1 updates, this reduces to O(NM²). For very large N, sub-sampling candidates can further reduce cost at the expense of optimality.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def greedy_information_gain(
    X: np.ndarray,
    M: int,
    kernel_fn,
    noise_var: float = 0.01,
    seed: int = None,
) -> np.ndarray:
    """
    Greedy inducing point selection maximizing information gain.

    Maintains a rank-1-updated Cholesky factor of K_uu. Conditional variances
    are recomputed from it each round for clarity; an incremental variance
    update would reduce the total cost to O(NM^2).

    Args:
        X: (N, d) training inputs
        M: Number of inducing points to select
        kernel_fn: Kernel function k(X1, X2) -> matrix
        noise_var: Observation noise variance (affects gain computation)
        seed: Random seed for initialization

    Returns:
        Z: (M, d) selected inducing point locations
    """
    rng = np.random.RandomState(seed)
    N, d = X.shape

    # Precompute the diagonal of the kernel matrix (self-variance)
    K_diag = np.array([kernel_fn(X[i:i+1], X[i:i+1])[0, 0] for i in range(N)])

    # Initialize with a random point
    first_idx = rng.randint(N)
    selected_indices = [first_idx]

    # Initialize the Cholesky factor of K_uu
    L = np.array([[np.sqrt(K_diag[first_idx] + 1e-6)]])

    for m in range(1, M):
        # Cross-covariance of all points with the current inducing set
        Z_current = X[selected_indices]
        K_candidates_Z = kernel_fn(X, Z_current)  # (N, m)

        # Conditional variance: σ²(x|Z) = k(x,x) - k_{x,Z} K_{ZZ}^{-1} k_{Z,x}
        #                               = k(x,x) - ||L^{-1} k_{Z,x}||²
        v = solve_triangular(L, K_candidates_Z.T, lower=True)  # (m, N)
        explained_var = np.sum(v**2, axis=0)
        cond_var = K_diag - explained_var

        # Information gain: ΔI ∝ log(1 + σ²(x|Z)/σ_n²)
        info_gain = np.log1p(np.maximum(cond_var, 0) / noise_var)

        # Exclude already selected points
        info_gain[selected_indices] = -np.inf

        # Select the point with maximum gain
        best_idx = np.argmax(info_gain)
        selected_indices.append(best_idx)

        # Rank-1 Cholesky update:
        #   L_new = [L        0]
        #           [ℓ_new^T  ℓ]
        # where L ℓ_new = k(Z, z_new) and ℓ = sqrt(k(z,z) - ||ℓ_new||²)
        k_new = K_candidates_Z[best_idx]
        ell_new = solve_triangular(L, k_new, lower=True)
        ell = np.sqrt(max(K_diag[best_idx] - np.dot(ell_new, ell_new), 1e-6))

        new_L = np.zeros((m + 1, m + 1))
        new_L[:m, :m] = L
        new_L[m, :m] = ell_new
        new_L[m, m] = ell
        L = new_L

    return X[selected_indices].copy()

def differential_entropy_selection(
    X: np.ndarray,
    M: int,
    kernel_fn,
    seed: int = None,
) -> np.ndarray:
    """
    Select inducing points that minimize conditional entropy H(f|u).

    This directly minimizes the information loss from the sparse
    approximation, equivalent to maximizing I(f; u).
    """
    rng = np.random.RandomState(seed)
    N = len(X)

    # Initialize with the point that has maximum self-variance
    variances = np.array([kernel_fn(X[i:i+1], X[i:i+1])[0, 0] for i in range(N)])
    selected = [np.argmax(variances)]

    for m in range(1, M):
        Z = X[selected]
        Kuu = kernel_fn(Z, Z) + 1e-6 * np.eye(len(Z))
        L = cholesky(Kuu, lower=True)

        best_score = -np.inf
        best_idx = None

        # Evaluate each candidate
        for i in range(N):
            if i in selected:
                continue

            # Log-det of K_uu augmented with the candidate, via the
            # matrix determinant lemma:
            #   |K_new| = |K_uu| * (k_ii - k_i^T K_uu^{-1} k_i)
            k_i = kernel_fn(X[i:i+1], Z).flatten()
            k_ii = kernel_fn(X[i:i+1], X[i:i+1])[0, 0]

            v = solve_triangular(L, k_i, lower=True)
            schur_complement = k_ii - np.dot(v, v)

            if schur_complement > 1e-8:
                # Higher Schur complement = more information gained
                score = np.log(schur_complement)
            else:
                score = -np.inf

            if score > best_score:
                best_score = score
                best_idx = i

        if best_idx is not None:
            selected.append(best_idx)
        else:
            # Fall back to random if all candidates are numerically singular
            remaining = [i for i in range(N) if i not in selected]
            if remaining:
                selected.append(rng.choice(remaining))

    return X[selected].copy()

# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Structured data where information-theoretic selection matters
    N = 500
    X = np.vstack([
        np.random.randn(400, 2) * 0.5,                      # Dense cluster
        np.random.randn(100, 2) * 0.3 + np.array([4, 0]),   # Smaller distant cluster
    ])

    def se_kernel(X1, X2, ls=1.0):
        d = np.sum((X1[:, None] - X2[None, :])**2, axis=2)
        return np.exp(-0.5 * d / ls**2)

    M = 20

    # Compare methods
    from scipy.cluster.vq import kmeans2
    Z_random = X[np.random.choice(N, M, replace=False)]
    Z_kmeans, _ = kmeans2(X, M, minit='++', seed=42)
    Z_greedy = greedy_information_gain(X, M, se_kernel, seed=42)

    print("Inducing Point Selection Results:")
    print("=" * 50)
    print(f"Data: Two clusters (N={N})")
    print(f"M = {M} inducing points")

    for name, Z in [("Random", Z_random), ("K-Means", Z_kmeans), ("Greedy Info", Z_greedy)]:
        Kff = se_kernel(X, X)
        Kfu = se_kernel(X, Z)
        Kuu = se_kernel(Z, Z)
        Kuu_inv = np.linalg.inv(Kuu + 1e-6 * np.eye(M))
        Qff = Kfu @ Kuu_inv @ Kfu.T
        rel_error = np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')
        print(f"{name:15s}: Nyström relative error = {rel_error:.4f}")
```

Rather than selecting inducing points from fixed candidates, we can learn their locations by treating $Z$ as continuous parameters optimized jointly with kernel hyperparameters. This is the approach taken in modern sparse GP implementations like GPyTorch and GPflow.
The optimization objective:
For variational sparse GPs (covered in the next page), we optimize the Evidence Lower Bound (ELBO):
$$\text{ELBO}(\theta, Z) = \mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{y}\mid\mathbf{f})] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
where $q(\mathbf{f}) = \int p(\mathbf{f}\mid\mathbf{u})\,q(\mathbf{u})\,d\mathbf{u}$ is the variational distribution over training function values.
Both the expected log-likelihood and KL divergence are differentiable with respect to Z, enabling gradient-based optimization.
Gradients with respect to inducing locations:
The key matrices $K_{uu}$ and $K_{fu}$ depend on $Z$ through the kernel:
$$\frac{\partial K_{uu}}{\partial z_{mj}} = \frac{\partial}{\partial z_{mj}} k(\mathbf{z}_m, \mathbf{z}_{m'})$$
$$\frac{\partial K_{fu}}{\partial z_{mj}} = \frac{\partial}{\partial z_{mj}} k(\mathbf{x}_n, \mathbf{z}_m)$$
For stationary kernels like the squared exponential:
$$\frac{\partial k(\mathbf{x}, \mathbf{z})}{\partial z_j} = k(\mathbf{x}, \mathbf{z}) \cdot \frac{x_j - z_j}{\ell^2}$$
These derivatives can be computed efficiently using automatic differentiation frameworks.
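As a quick sanity check, here is a minimal PyTorch sketch (the tensor values and length-scale are illustrative) comparing the analytic SE-kernel derivative above against the gradient produced by autograd for a single inducing location:

```python
import torch

def se_kernel(x, z, lengthscale=1.5):
    """Squared exponential kernel between single points x and z."""
    return torch.exp(-0.5 * torch.sum((x - z)**2) / lengthscale**2)

x = torch.tensor([0.3, -1.2])
z = torch.tensor([1.0, 0.5], requires_grad=True)
ls = 1.5

k = se_kernel(x, z, ls)
k.backward()  # autograd computes dk/dz

# Analytic formula: ∂k/∂z_j = k(x, z) (x_j - z_j) / ℓ²
analytic = se_kernel(x, z, ls).detach() * (x - z.detach()) / ls**2

print(z.grad)    # autograd gradient
print(analytic)  # matches the analytic derivative
```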
Learning inducing locations is a non-convex optimization problem with many local optima. Poor initialization can lead to suboptimal solutions where inducing points cluster together or fail to cover important regions. Initialization with k-means or other heuristics is strongly recommended before gradient optimization.
The collapse phenomenon:
A well-known failure mode is inducing point collapse: during optimization, inducing points may converge to nearly identical locations, reducing the effective M. This tends to happen when M is larger than the data requires, when the learned length-scale grows so large that neighboring inducing points become redundant, or when the objective is nearly flat with respect to duplicate locations.
Mitigation strategies include initializing with k-means or another space-filling heuristic, monitoring the minimum pairwise distance between inducing points during training (as done in the code below), and reducing M when redundancy is detected.
| Aspect | Fixed Selection | Learned Locations |
|---|---|---|
| Flexibility | Limited to training points or heuristics | Anywhere in input space |
| Optimization | Discrete/combinatorial | Continuous gradient descent |
| Cost | O(NM²) for greedy methods | O(NM² × iterations) |
| Local optima | N/A for heuristics | Significant concern |
| Hyperparameter coupling | Decoupled | Jointly optimized |
| Final quality | Good for well-chosen M | Often better, but not guaranteed |
```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

class SparseGPWithLearnedZ(nn.Module):
    """
    Sparse GP with learnable inducing point locations.

    This implementation demonstrates:
    1. Parameterizing inducing locations as optimizable tensors
    2. Computing gradients through kernel operations
    3. Joint optimization of Z with kernel hyperparameters
    """

    def __init__(self, X: np.ndarray, M: int, init_method: str = 'kmeans'):
        super().__init__()
        N, d = X.shape

        # Initialize inducing points
        if init_method == 'kmeans':
            from scipy.cluster.vq import kmeans2
            Z_init, _ = kmeans2(X, M, minit='++')
        elif init_method == 'random':
            indices = np.random.choice(N, M, replace=False)
            Z_init = X[indices].copy()
        else:
            raise ValueError(f"Unknown init method: {init_method}")

        # Learnable parameters
        self.Z = nn.Parameter(torch.tensor(Z_init, dtype=torch.float64))
        self.log_lengthscale = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_variance = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_noise = nn.Parameter(torch.tensor(-2.0, dtype=torch.float64))

        # Store training data
        self.register_buffer('X', torch.tensor(X, dtype=torch.float64))

    @property
    def lengthscale(self):
        return torch.exp(self.log_lengthscale)

    @property
    def variance(self):
        return torch.exp(self.log_variance)

    @property
    def noise_variance(self):
        return torch.exp(self.log_noise)

    def se_kernel(self, X1, X2):
        """Squared exponential kernel with current hyperparameters."""
        dist_sq = torch.sum((X1.unsqueeze(1) - X2.unsqueeze(0))**2, dim=2)
        return self.variance * torch.exp(-0.5 * dist_sq / self.lengthscale**2)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        """
        Compute the negative log marginal likelihood (to minimize).

        Uses the DTC approximation for simplicity. The variational
        extension is covered on the next page.
        """
        X, Z = self.X, self.Z
        N, M = len(X), len(Z)

        # Kernel matrices
        Kuu = self.se_kernel(Z, Z) + 1e-6 * torch.eye(M, dtype=torch.float64)
        Kfu = self.se_kernel(X, Z)
        Kuf = Kfu.T

        # Cholesky factors of Kuu and Λ = Kuu + σ^{-2} Kuf Kfu
        L_uu = torch.linalg.cholesky(Kuu)
        Lambda = Kuu + Kuf @ Kfu / self.noise_variance
        L_lambda = torch.linalg.cholesky(Lambda)

        # log p(y) = -0.5 y^T (Qff + σ²I)^{-1} y - 0.5 log|Qff + σ²I| - N/2 log(2π)

        # Quadratic form via the Woodbury identity
        y_scaled = y / self.noise_variance
        c = (Kuf @ y_scaled).unsqueeze(1)            # (M, 1)
        c_solve = torch.cholesky_solve(c, L_lambda)  # Λ^{-1} c
        quad_form = torch.dot(y_scaled, y) - (c * c_solve).sum()

        # Log determinant via the matrix determinant lemma
        log_det_noise = N * self.log_noise
        log_det_Lambda = 2 * torch.sum(torch.log(torch.diag(L_lambda)))
        log_det_Kuu = 2 * torch.sum(torch.log(torch.diag(L_uu)))
        log_det = log_det_noise + log_det_Lambda - log_det_Kuu

        # Negative log marginal likelihood
        neg_lml = 0.5 * quad_form + 0.5 * log_det + 0.5 * N * np.log(2 * np.pi)
        return neg_lml

    def predict(self, X_star: torch.Tensor):
        """Predictive cross-covariance (placeholder for the full DTC predictive)."""
        with torch.no_grad():
            Z = self.Z
            M = len(Z)
            Kuu = self.se_kernel(Z, Z) + 1e-6 * torch.eye(M, dtype=torch.float64)
            Ksu = self.se_kernel(X_star, Z)
            Kss_diag = self.variance.expand(len(X_star))  # SE diagonal is constant variance
            L_uu = torch.linalg.cholesky(Kuu)
            # ... (prediction equations would continue)
            return Ksu  # Placeholder

def train_sparse_gp_with_learned_z(
    X: np.ndarray,
    y: np.ndarray,
    M: int,
    n_iterations: int = 500,
):
    """Train a sparse GP with jointly optimized inducing points."""
    model = SparseGPWithLearnedZ(X, M, init_method='kmeans')
    y_tensor = torch.tensor(y, dtype=torch.float64)

    optimizer = Adam(model.parameters(), lr=0.01)

    print("Training Sparse GP with Learned Inducing Points")
    print("=" * 50)

    history = {'loss': [], 'lengthscale': [], 'z_spread': []}

    for i in range(n_iterations):
        optimizer.zero_grad()
        loss = model(y_tensor)
        loss.backward()
        optimizer.step()

        # Monitor inducing point spread (collapse detection)
        Z = model.Z.detach().numpy()
        pairwise_dists = np.sqrt(np.sum((Z[:, None] - Z[None, :])**2, axis=2))
        np.fill_diagonal(pairwise_dists, np.inf)
        min_spread = pairwise_dists.min()

        history['loss'].append(loss.item())
        history['lengthscale'].append(model.lengthscale.item())
        history['z_spread'].append(min_spread)

        if (i + 1) % 100 == 0:
            print(f"Iter {i+1}: Loss={loss.item():.2f}, "
                  f"Lengthscale={model.lengthscale.item():.3f}, "
                  f"Min Z dist={min_spread:.4f}")

    return model, history

# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    torch.manual_seed(42)

    # Generate data
    N = 500
    X = np.random.randn(N, 2) * 2
    y = np.sin(X[:, 0]) + 0.5 * np.cos(X[:, 1]) + 0.1 * np.random.randn(N)

    model, history = train_sparse_gp_with_learned_z(X, y, M=30, n_iterations=300)

    print("\nFinal inducing points:")
    print(model.Z.detach().numpy()[:5])  # Show first 5
```

High-dimensional input spaces pose fundamental challenges for inducing point methods. The curse of dimensionality means that covering a d-dimensional space requires exponentially many points—far more than is computationally feasible.
The geometry problem:
In d dimensions, the volume of a hypercube grows as $V = L^d$ where $L$ is the side length. To maintain a fixed point density $\rho$, the number of required inducing points scales as:
$$M \propto \rho \cdot L^d$$
For d = 10 and moderate L, this quickly becomes intractable. With M = 1,000 inducing points in 10D, the average distance to the nearest inducing point may exceed the kernel length-scale, making the sparse approximation useless.
The curse manifests in several ways: inducing points become sparse relative to the kernel length-scale, the Nyström reconstruction error grows, and predictive uncertainty reverts to the prior away from the inducing set.
Strategies for high dimensions:
1. Automatic Relevance Determination (ARD)
Use an ARD kernel with per-dimension length-scales:
$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{1}{2}\sum_{j=1}^d \frac{(x_j - x_j')^2}{\ell_j^2}\right)$$
If the function depends strongly on only a few dimensions, ARD learns large length-scales for irrelevant dimensions, effectively reducing dimensionality.
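A minimal NumPy sketch of this effect (the length-scale values are purely illustrative): giving one dimension a very large length-scale makes the ARD kernel behave as if that dimension were absent.

```python
import numpy as np

def ard_se_kernel(X1, X2, lengthscales, variance=1.0):
    """Squared exponential kernel with per-dimension length-scales (ARD)."""
    ls = np.asarray(lengthscales)
    diff = (X1[:, None, :] - X2[None, :, :]) / ls   # scale each dimension separately
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Dimension 2 gets a huge length-scale, flattening its contribution
K_ard = ard_se_kernel(X, X, lengthscales=[1.0, 1.0, 100.0])
K_2d = ard_se_kernel(X[:, :2], X[:, :2], lengthscales=[1.0, 1.0])

print(np.max(np.abs(K_ard - K_2d)))  # ~0: the third dimension is effectively ignored
```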
2. Dimensionality reduction
Project inputs to a lower-dimensional space before constructing the GP, for example with PCA, random projections, or a learned neural-network embedding (as in deep kernel learning).
3. Additive structure
Decompose the function as a sum of low-dimensional components: $$f(\mathbf{x}) = \sum_{s \in S} f_s(\mathbf{x}_s)$$ where each $\mathbf{x}_s$ is a subset of variables. Separate inducing points for each component avoid the full-dimensional curse.
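A small sketch of the additive construction, assuming one-dimensional components with SE kernels (names and length-scales are illustrative); in a sparse GP, each component could then carry its own small set of one-dimensional inducing points.

```python
import numpy as np

def se_1d(x1, x2, lengthscale=1.0):
    """1-D squared exponential kernel on a single input dimension."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def additive_kernel(X1, X2, lengthscales):
    """k(x, x') = sum_j k_j(x_j, x_j'): a sum of per-dimension components."""
    K = np.zeros((len(X1), len(X2)))
    for j, ls in enumerate(lengthscales):
        K += se_1d(X1[:, j], X2[:, j], ls)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
K = additive_kernel(X, X, lengthscales=[1.0] * 5)
print(K.shape, np.allclose(K, K.T))  # (100, 100) True
```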
4. Structured inducing points
Place inducing points on a Kronecker product grid. For separable (product) kernels, $K_{uu}$ factorizes as a Kronecker product of small per-dimension matrices, reducing its factorization cost from O(M³) to roughly O(d · M^{3/d}), though applicability is limited to grid-structured inducing sets.
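A brief sketch of the Kronecker structure, assuming a separable (product) SE kernel and a 2-D grid (all names illustrative): the full $K_{uu}$ over the grid equals the Kronecker product of the small per-dimension Gram matrices, so only those small factors ever need to be factorized.

```python
import numpy as np

def se_1d(a, b, lengthscale=1.0):
    """1-D squared exponential Gram matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

# Per-dimension grids: M = m1 * m2 inducing points in 2-D
m1, m2 = 15, 15
g1 = np.linspace(-3, 3, m1)
g2 = np.linspace(-3, 3, m2)

K1 = se_1d(g1, g1)  # (m1, m1)
K2 = se_1d(g2, g2)  # (m2, m2)

# Full K_uu over the grid (product SE kernel with unit length-scale)
Z = np.stack(np.meshgrid(g1, g2, indexing='ij'), -1).reshape(-1, 2)
d2 = np.sum((Z[:, None] - Z[None, :])**2, axis=2)
K_full = np.exp(-0.5 * d2)  # (m1*m2, m1*m2)

# The grid Gram matrix is exactly the Kronecker product of the small factors
assert np.allclose(K_full, np.kron(K1, K2))
# In practice one factorizes K1 and K2 (O(m^3) each) instead of K_full (O(M^3))
```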
For d > 10-20, sparse GPs with standard inducing points become increasingly unreliable. Consider alternatives: deep kernel learning (combining neural networks with GPs), random feature approximations (next pages), or exploiting problem-specific structure. If using sparse GPs in high dimensions, always validate approximation quality on held-out data.
Consolidating the concepts from this page, here are practical recommendations for inducing point selection across different scenarios:
| Scenario | Recommended Approach | Typical M | Notes |
|---|---|---|---|
| Quick prototyping | K-means++ | 100-500 | Simple baseline, often sufficient |
| Production model | K-means + gradient refinement | 500-2000 | Best quality with moderate cost |
| Small data (N < 5000) | Greedy information gain | √N to N/10 | Worth the extra computation |
| Large data (N > 50000) | K-means + minibatch training | 1000-5000 | Gradient refinement impractical |
| Non-uniform data | Leverage score or greedy | Varies | Need good coverage of dense regions |
| High dimensions (d > 10) | K-means + ARD kernel | 1000+ | Consider alternatives |
| Sequential/streaming | Incremental k-medoids | Budget-limited | Update Z as data arrives |
Choosing M:
The optimal number of inducing points depends on the data complexity and computational budget: as a rough guide, M should be large enough that the typical distance from a training point to its nearest inducing point stays below the kernel length-scale, and increasing M is usually a better investment than more elaborate selection over a too-small set.
Validation:
Always validate inducing point quality: compare sparse predictions against an exact GP on a manageable subset, track the Nyström reconstruction error as M grows, and check predictive accuracy and calibration on held-out data.
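One simple validation loop, sketched below under the assumption of an SE kernel and k-means placement (all names are illustrative): sweep M on a manageable subset and watch the relative Nyström error level off.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def se_kernel(X1, X2, lengthscale=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / lengthscale**2)

def rel_nystrom_error(X, Z, kernel_fn, jitter=1e-6):
    Kff = kernel_fn(X, X)
    Kfu = kernel_fn(X, Z)
    Kuu = kernel_fn(Z, Z) + jitter * np.eye(len(Z))
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 2))
subset = X[rng.choice(len(X), 400, replace=False)]  # keep the check cheap

for M in [10, 20, 50, 100, 200]:
    Z, _ = kmeans2(subset, M, minit='++', seed=0)
    err = rel_nystrom_error(subset, Z, se_kernel)
    print(f"M={M:4d}: relative Nyström error = {err:.4f}")
# Choose the smallest M past which the error (and held-out metrics) stop improving.
```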
Common mistakes to avoid: using too few inducing points when the kernel length-scale is short, skipping a sensible initialization before gradient refinement, ignoring inducing point collapse during training, and never validating the sparse approximation against held-out data.
In practice, well-initialized k-means inducing points get you 80% of the way to optimal. Gradient refinement of Z provides marginal improvement in most cases. Invest effort in proper initialization and appropriate M, rather than sophisticated optimization of Z locations.
We have developed a comprehensive understanding of inducing point selection—the critical design choice that determines the quality of sparse GP approximations. The key insights: selection is fundamentally a compression problem; k-means-style heuristics are strong, cheap baselines; greedy information-theoretic selection carries a submodular (1 - 1/e) guarantee; gradient-based learning of Z offers the most flexibility but needs good initialization and collapse monitoring; and high-dimensional inputs demand additional structure such as ARD, additive decompositions, or dimensionality reduction.
What's next:
With the mechanics of inducing point selection established, we now turn to the variational sparse GP framework. This elegant approach recasts sparse GP inference as variational inference, providing principled optimization objectives, proper uncertainty quantification, and enabling scalable stochastic training with minibatches. The variational perspective unifies and extends the classical DTC/FITC approximations we developed earlier.
You now command a toolkit of inducing point selection methods, from quick heuristics to sophisticated optimization. This knowledge is essential for deploying sparse GPs effectively—the best algorithm in the world fails if inducing points are poorly chosen. The variational framework ahead will leverage these inducing points in a principled probabilistic inference scheme.