The previous page established that sparse Gaussian Processes achieve scalability by compressing N training points into M << N inducing points. But we glossed over a critical question: how do we choose these M inducing points?
This is not merely an implementation detail—it is central to the success or failure of sparse GP methods. Poorly chosen inducing points can render the approximation useless, producing predictions no better than random guessing and uncertainty estimates that are wildly miscalibrated. Optimally chosen inducing points can capture the essence of the full GP with remarkably few representatives.
The inducing point selection problem is fundamentally a compression problem: given N data points with complex interdependencies encoded by the kernel, find M locations that best preserve the information content. This framing connects sparse GPs to core ideas in information theory, optimal experimental design, and numerical linear algebra.
This page develops the full spectrum of inducing point strategies, from simple heuristics suitable for quick prototyping to sophisticated optimization methods that learn inducing locations jointly with kernel hyperparameters. Understanding these approaches is essential for practical sparse GP deployment.
By the end of this page, you will command a toolkit of inducing point selection strategies: random and grid-based methods, clustering approaches, greedy information-theoretic selection, and gradient-based optimization. You'll understand the trade-offs between these approaches and develop practical guidelines for choosing among them.
Let us formalize what "good" inducing points mean. Given N training points $X = \{\mathbf{x}_1, ..., \mathbf{x}_N\}$ and targets $\mathbf{y}$, we seek M inducing locations $Z = \{\mathbf{z}_1, ..., \mathbf{z}_M\}$ that minimize the approximation error of the sparse GP.
Several objective functions capture different aspects of approximation quality:
1. Nyström approximation error:
$$\mathcal{L}_{\text{Nyström}}(Z) = \|K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}\|_F$$
This measures how well the low-rank approximation reconstructs the full kernel matrix. Minimizing this is an NP-hard column subset selection problem.
2. Predictive divergence:
$$\mathcal{L}_{\text{pred}}(Z) = \text{KL}\big(p_{\text{exact}}(f_* \mid \mathbf{y}) \,\|\, p_{\text{sparse}}(f_* \mid \mathbf{y})\big)$$
This measures the difference between exact and sparse GP posteriors. Difficult to compute exactly but captures what we ultimately care about.
3. Information loss:
$$\mathcal{L}_{\text{info}}(Z) = H(\mathbf{f} \mid \mathbf{u}) = \frac{1}{2}\log\left|K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}\right|$$
This measures the entropy of training function values not explained by inducing points. Lower entropy means inducing points capture more information.
All formulations face the same fundamental tension: computing the optimal inducing points requires O(N³) operations (to evaluate objective functions involving K_{ff}), which defeats the purpose of sparse methods. Practical selection algorithms must use approximations or heuristics that avoid full-scale kernel matrix operations.
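To make these objectives concrete, here is a minimal NumPy sketch that evaluates the Nyström error and the information-loss objective for a candidate $Z$. It builds the full $K_{ff}$, so it is only feasible for small N—exactly the tension described above. The helper names (`se_kernel`, `nystrom_error`, `information_loss`) and the jitter values are illustrative choices, not part of any library.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential kernel, matching the examples later on this page."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def nystrom_error(X, Z, kernel_fn, jitter=1e-6):
    """Frobenius norm ||K_ff - K_fu K_uu^{-1} K_uf|| for a candidate Z."""
    Kff = kernel_fn(X, X)
    Kfu = kernel_fn(X, Z)
    Kuu = kernel_fn(Z, Z) + jitter * np.eye(len(Z))
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return np.linalg.norm(Kff - Qff, 'fro')

def information_loss(X, Z, kernel_fn, jitter=1e-6):
    """0.5 * log-determinant of the residual covariance K_ff - Q_ff."""
    Kff = kernel_fn(X, X)
    Kfu = kernel_fn(X, Z)
    Kuu = kernel_fn(Z, Z) + jitter * np.eye(len(Z))
    residual = Kff - Kfu @ np.linalg.solve(Kuu, Kfu.T) + jitter * np.eye(len(X))
    _, logdet = np.linalg.slogdet(residual)
    return 0.5 * logdet

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    Z = X[rng.choice(200, 10, replace=False)]  # a random candidate set
    print(nystrom_error(X, Z, se_kernel), information_loss(X, Z, se_kernel))
```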
Constraints on inducing point placement:
Crucially, inducing points $Z$ need not be a subset of training points $X$. They can be:
Subset selection: $Z \subseteq X$. Simplifies selection to a combinatorial problem but limits flexibility.
Continuous optimization: $Z \in \mathbb{R}^{M \times d}$. Allows inducing points anywhere in input space, typically optimized via gradient descent.
Constrained placement: Inducing points on a grid, in a learned subspace, or satisfying domain constraints.
The choice between these modes affects both the optimization landscape and the quality of the final approximation. Subset selection is simpler but may be suboptimal; continuous optimization is more flexible but may be harder to optimize.
For rapid prototyping and as baselines for more sophisticated methods, simple heuristics often suffice:
Random subset selection:
Select M points uniformly at random from the training set. This is the simplest possible approach:
```python
indices = np.random.choice(N, M, replace=False)
Z = X[indices]
```
Surprisingly effective for well-behaved problems where data is roughly uniformly distributed and the kernel length-scale is not too short. However, random selection has high variance—different random samples can produce very different approximation quality.
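To make the variance concrete, here is a small self-contained sketch (assuming a unit-length-scale SE kernel; all names are illustrative) that repeats random selection over several seeds and reports the spread of the relative Nyström error on a deliberately uneven dataset:

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / lengthscale**2)

def rel_nystrom_error(X, Z, jitter=1e-6):
    Kff = se_kernel(X, X)
    Kfu = se_kernel(X, Z)
    Kuu = se_kernel(Z, Z) + jitter * np.eye(len(Z))
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')

rng = np.random.default_rng(0)
# Uneven data: a large cluster plus a small distant one
X = np.vstack([rng.normal(size=(200, 2)), rng.normal(size=(50, 2)) + 5.0])

errors = []
for seed in range(20):
    idx = np.random.RandomState(seed).choice(len(X), 15, replace=False)
    errors.append(rel_nystrom_error(X, X[idx]))

print(f"relative error: mean={np.mean(errors):.3f}, "
      f"std={np.std(errors):.3f}, worst={np.max(errors):.3f}")
```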
Grid-based placement:
For low-dimensional problems (d ≤ 3), regular grids provide systematic coverage:
```python
# 2D grid
M_per_dim = int(np.sqrt(M))
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), M_per_dim)
x2 = np.linspace(X[:, 1].min(), X[:, 1].max(), M_per_dim)
Z = np.stack(np.meshgrid(x1, x2), -1).reshape(-1, 2)
```
Grids ensure uniform coverage but suffer from the curse of dimensionality—M must grow exponentially with dimension to maintain coverage density. In high dimensions, grids become impractical.
Clustering-based placement: run k-means on the training inputs and use the M cluster centers as inducing points. K-means with k-means++ initialization is the recommended default for most applications. It is fast, deterministic given a seed, provides good coverage of the data distribution, and is available in all major libraries. Use other methods only when k-means demonstrably underperforms or when specific problem structure suggests alternatives.
```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import cdist

def random_subset(X: np.ndarray, M: int, seed: int = None) -> np.ndarray:
    """Random subset of training points."""
    rng = np.random.RandomState(seed)
    indices = rng.choice(len(X), M, replace=False)
    return X[indices].copy()

def kmeans_inducing(X: np.ndarray, M: int, seed: int = None) -> np.ndarray:
    """K-means cluster centers as inducing points."""
    Z, _ = kmeans2(X, M, minit='++', seed=seed)
    return Z

def farthest_point_sampling(X: np.ndarray, M: int, seed: int = None) -> np.ndarray:
    """
    Farthest point sampling for maximally spread inducing points.

    Greedily selects points that maximize the minimum distance to the
    current inducing set. Provides excellent worst-case coverage.
    """
    rng = np.random.RandomState(seed)
    N = len(X)

    # Start with a random point
    selected = [rng.randint(N)]

    # Track minimum distance to the selected set for each point
    min_distances = np.full(N, np.inf)

    for _ in range(M - 1):
        # Update distances from the newly selected point
        new_distances = cdist(X, X[selected[-1:]], metric='euclidean').flatten()
        min_distances = np.minimum(min_distances, new_distances)

        # Select the point with maximum minimum distance
        min_distances[selected] = -np.inf  # Exclude already selected
        next_point = np.argmax(min_distances)
        selected.append(next_point)

    return X[selected].copy()

def leverage_score_sampling(X: np.ndarray, M: int, kernel_fn, seed: int = None) -> np.ndarray:
    """
    Leverage score sampling based on the kernel feature space.

    Samples points proportionally to their leverage scores, which measure
    each point's influence on the kernel matrix. More computationally
    expensive but can be more effective for non-uniform data distributions.
    """
    rng = np.random.RandomState(seed)
    N = len(X)

    # Compute the full kernel matrix (expensive but gives exact scores)
    K = kernel_fn(X, X)

    # Rank-M subspace leverage scores: squared row norms of the top-M eigenvectors
    eigvals, eigvecs = np.linalg.eigh(K + 1e-6 * np.eye(N))
    top_M = min(M, N)
    leverage_scores = np.sum(eigvecs[:, -top_M:]**2, axis=1)

    # Sample proportionally to leverage scores
    probs = leverage_scores / leverage_scores.sum()
    indices = rng.choice(N, M, replace=False, p=probs)

    return X[indices].copy()

# Comparison demonstration
def compare_selection_methods(X, M, kernel_fn):
    """Compare different inducing point selection strategies."""
    from numpy.linalg import norm

    methods = {
        'Random': random_subset(X, M, seed=42),
        'K-Means': kmeans_inducing(X, M, seed=42),
        'Farthest Point': farthest_point_sampling(X, M, seed=42),
    }

    print(f"Inducing Point Selection Comparison (N={len(X)}, M={M})")
    print("=" * 60)

    for name, Z in methods.items():
        # Compute Nyström approximation error
        Kff = kernel_fn(X, X)
        Kfu = kernel_fn(X, Z)
        Kuu = kernel_fn(Z, Z)

        try:
            Kuu_inv = np.linalg.inv(Kuu + 1e-6 * np.eye(M))
            Qff = Kfu @ Kuu_inv @ Kfu.T
            approx_error = norm(Kff - Qff, 'fro') / norm(Kff, 'fro')

            # Coverage: mean distance to nearest inducing point
            dists = cdist(X, Z)
            mean_min_dist = np.mean(np.min(dists, axis=1))

            print(f"{name:20s}: Rel. Error={approx_error:.4f}, Mean Dist={mean_min_dist:.4f}")
        except np.linalg.LinAlgError:
            print(f"{name:20s}: Failed (singular K_uu)")

# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Generate clustered data (challenging for random selection)
    X = np.vstack([
        np.random.randn(200, 2) + np.array([0, 0]),
        np.random.randn(200, 2) + np.array([5, 5]),
        np.random.randn(100, 2) + np.array([10, 0]),
    ])

    def se_kernel(X1, X2, lengthscale=1.0):
        dists_sq = np.sum((X1[:, None] - X2[None, :])**2, axis=2)
        return np.exp(-0.5 * dists_sq / lengthscale**2)

    compare_selection_methods(X, M=30, kernel_fn=se_kernel)
```

Beyond simple heuristics, greedy algorithms can optimize information-theoretic objectives that directly measure inducing point quality. These methods iteratively select points that maximize the marginal gain in an objective function.
Information gain selection:
The information gain from adding inducing point $\mathbf{z}$ to existing set $Z$ is:
$$\Delta I(\mathbf{z}; Z) = I(\mathbf{f}; \mathbf{u} \cup \{f(\mathbf{z})\}) - I(\mathbf{f}; \mathbf{u})$$
For Gaussian distributions, this can be computed as:
$$\Delta I(\mathbf{z}; Z) = \frac{1}{2}\log\left(1 + \frac{k(\mathbf{z}, \mathbf{z}) - \mathbf{k}_z^\top K_{uu}^{-1}\mathbf{k}_z}{\sigma_n^2}\right)$$
where $\mathbf{k}_z = [k(\mathbf{z}, \mathbf{z}_1), ..., k(\mathbf{z}, \mathbf{z}_M)]^\top$ is the cross-covariance with existing inducing points.
The numerator is the conditional variance of $f(\mathbf{z})$ given current inducing points—it measures how much new information $\mathbf{z}$ provides. Points in already well-covered regions have low conditional variance and thus low information gain.
The greedy algorithm:
```
Algorithm: Greedy Information-Gain Inducing Point Selection
Input:  Training points X, number of inducing points M, kernel k
Output: Inducing set Z

1. Initialize Z ← {random point from X}
2. For m = 2 to M:
   a. For each candidate z ∈ X \ Z:
      - Compute conditional variance: σ²(z|Z) = k(z,z) - k_z^T K_uu^{-1} k_z
      - Compute gain: ΔI(z) = 0.5 · log(1 + σ²(z|Z)/σ_n²)
   b. Select z* = argmax_z ΔI(z)
   c. Z ← Z ∪ {z*}
   d. Update K_uu with new row/column (rank-1 update)
3. Return Z
```
The information-gain objective is submodular: the marginal gain of adding a point decreases as Z grows. Submodularity guarantees that greedy selection achieves at least (1 - 1/e) ≈ 63% of the optimal information gain, providing a theoretical quality guarantee.
Naively, greedy selection requires O(NM) information gain evaluations, each involving O(M²) for K_{uu}^{-1} multiplication. The total cost is O(NM³). With Cholesky rank-1 updates, this reduces to O(NM²). For very large N, sub-sampling candidates can further reduce cost at the expense of optimality.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def greedy_information_gain(
    X: np.ndarray,
    M: int,
    kernel_fn,
    noise_var: float = 0.01,
    seed: int = None,
) -> np.ndarray:
    """
    Greedy inducing point selection maximizing information gain.

    Maintains a rank-1-updated Cholesky factor of K_uu. Conditional variances
    are recomputed from it each round for clarity; an incremental variance
    update would reduce the total cost to O(NM^2).

    Args:
        X: (N, d) training inputs
        M: Number of inducing points to select
        kernel_fn: Kernel function k(X1, X2) -> matrix
        noise_var: Observation noise variance (affects gain computation)
        seed: Random seed for initialization

    Returns:
        Z: (M, d) selected inducing point locations
    """
    rng = np.random.RandomState(seed)
    N, d = X.shape

    # Precompute the diagonal of the kernel matrix (self-variance)
    K_diag = np.array([kernel_fn(X[i:i+1], X[i:i+1])[0, 0] for i in range(N)])

    # Initialize with a random point
    first_idx = rng.randint(N)
    selected_indices = [first_idx]

    # Initialize the Cholesky factor of K_uu
    L = np.array([[np.sqrt(K_diag[first_idx] + 1e-6)]])

    for m in range(1, M):
        # Cross-covariance of all points with the current inducing set
        Z_current = X[selected_indices]
        K_candidates_Z = kernel_fn(X, Z_current)  # (N, m)

        # Conditional variance: σ²(x|Z) = k(x,x) - k_{x,Z} K_{ZZ}^{-1} k_{Z,x}
        #                               = k(x,x) - ||L^{-1} k_{Z,x}||²
        v = solve_triangular(L, K_candidates_Z.T, lower=True)  # (m, N)
        explained_var = np.sum(v**2, axis=0)
        cond_var = K_diag - explained_var

        # Information gain: ΔI ∝ log(1 + σ²(x|Z)/σ_n²)
        info_gain = np.log1p(np.maximum(cond_var, 0) / noise_var)

        # Exclude already selected points
        info_gain[selected_indices] = -np.inf

        # Select the point with maximum gain
        best_idx = np.argmax(info_gain)
        selected_indices.append(best_idx)

        # Rank-1 Cholesky update:
        #   L_new = [L        0]
        #           [ℓ_new^T  ℓ]
        # where L ℓ_new = k(Z, z_new) and ℓ = sqrt(k(z,z) - ||ℓ_new||²)
        k_new = K_candidates_Z[best_idx]
        ell_new = solve_triangular(L, k_new, lower=True)
        ell = np.sqrt(max(K_diag[best_idx] - np.dot(ell_new, ell_new), 1e-6))

        new_L = np.zeros((m + 1, m + 1))
        new_L[:m, :m] = L
        new_L[m, :m] = ell_new
        new_L[m, m] = ell
        L = new_L

    return X[selected_indices].copy()

def differential_entropy_selection(
    X: np.ndarray,
    M: int,
    kernel_fn,
    seed: int = None,
) -> np.ndarray:
    """
    Select inducing points that minimize conditional entropy H(f|u).

    This directly minimizes the information loss from the sparse
    approximation, equivalent to maximizing I(f; u).
    """
    rng = np.random.RandomState(seed)
    N = len(X)

    # Initialize with the point that has maximum self-variance
    variances = np.array([kernel_fn(X[i:i+1], X[i:i+1])[0, 0] for i in range(N)])
    selected = [np.argmax(variances)]

    for m in range(1, M):
        Z = X[selected]
        Kuu = kernel_fn(Z, Z) + 1e-6 * np.eye(len(Z))
        L = cholesky(Kuu, lower=True)

        best_score = -np.inf
        best_idx = None

        # Evaluate each candidate
        for i in range(N):
            if i in selected:
                continue

            # Log-det of K_uu augmented with the candidate, via the
            # matrix determinant lemma:
            #   |K_new| = |K_uu| * (k_ii - k_i^T K_uu^{-1} k_i)
            k_i = kernel_fn(X[i:i+1], Z).flatten()
            k_ii = kernel_fn(X[i:i+1], X[i:i+1])[0, 0]

            v = solve_triangular(L, k_i, lower=True)
            schur_complement = k_ii - np.dot(v, v)

            if schur_complement > 1e-8:
                # Higher Schur complement = more information gained
                score = np.log(schur_complement)
            else:
                score = -np.inf

            if score > best_score:
                best_score = score
                best_idx = i

        if best_idx is not None:
            selected.append(best_idx)
        else:
            # Fall back to random if all candidates are numerically singular
            remaining = [i for i in range(N) if i not in selected]
            if remaining:
                selected.append(rng.choice(remaining))

    return X[selected].copy()

# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Structured data where information-theoretic selection matters
    N = 500
    X = np.vstack([
        np.random.randn(400, 2) * 0.5,                      # Dense cluster
        np.random.randn(100, 2) * 0.3 + np.array([4, 0]),   # Smaller distant cluster
    ])

    def se_kernel(X1, X2, ls=1.0):
        d = np.sum((X1[:, None] - X2[None, :])**2, axis=2)
        return np.exp(-0.5 * d / ls**2)

    M = 20

    # Compare methods
    from scipy.cluster.vq import kmeans2
    Z_random = X[np.random.choice(N, M, replace=False)]
    Z_kmeans, _ = kmeans2(X, M, minit='++', seed=42)
    Z_greedy = greedy_information_gain(X, M, se_kernel, seed=42)

    print("Inducing Point Selection Results:")
    print("=" * 50)
    print(f"Data: Two clusters (N={N})")
    print(f"M = {M} inducing points")

    for name, Z in [("Random", Z_random), ("K-Means", Z_kmeans), ("Greedy Info", Z_greedy)]:
        Kff = se_kernel(X, X)
        Kfu = se_kernel(X, Z)
        Kuu = se_kernel(Z, Z)
        Kuu_inv = np.linalg.inv(Kuu + 1e-6 * np.eye(M))
        Qff = Kfu @ Kuu_inv @ Kfu.T
        rel_error = np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')
        print(f"{name:15s}: Nyström relative error = {rel_error:.4f}")
```

Rather than selecting inducing points from fixed candidates, we can learn their locations by treating $Z$ as continuous parameters optimized jointly with kernel hyperparameters. This is the approach taken in modern sparse GP implementations like GPyTorch and GPflow.
The optimization objective:
For variational sparse GPs (covered in the next page), we optimize the Evidence Lower Bound (ELBO):
$$\text{ELBO}(\theta, Z) = \mathbb{E}_{q(\mathbf{f})}[\log p(\mathbf{y}\mid\mathbf{f})] - \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))$$
where $q(\mathbf{f}) = \int p(\mathbf{f}\mid\mathbf{u})\,q(\mathbf{u})\,d\mathbf{u}$ is the variational distribution over training function values.
Both the expected log-likelihood and KL divergence are differentiable with respect to Z, enabling gradient-based optimization.
Gradients with respect to inducing locations:
The key matrices $K_{uu}$ and $K_{fu}$ depend on $Z$ through the kernel:
$$\frac{\partial K_{uu}}{\partial z_{mj}} = \frac{\partial}{\partial z_{mj}} k(\mathbf{z}_m, \mathbf{z}_{m'})$$
$$\frac{\partial K_{fu}}{\partial z_{mj}} = \frac{\partial}{\partial z_{mj}} k(\mathbf{x}_n, \mathbf{z}_m)$$
For stationary kernels like the squared exponential:
$$\frac{\partial k(\mathbf{x}, \mathbf{z})}{\partial z_j} = k(\mathbf{x}, \mathbf{z}) \cdot \frac{x_j - z_j}{\ell^2}$$
These derivatives can be computed efficiently using automatic differentiation frameworks.
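As a quick sanity check, here is a minimal PyTorch sketch (the tensor values and length-scale are illustrative) comparing the analytic SE-kernel derivative above against the gradient produced by autograd for a single inducing location:

```python
import torch

def se_kernel(x, z, lengthscale=1.5):
    """Squared exponential kernel between single points x and z."""
    return torch.exp(-0.5 * torch.sum((x - z)**2) / lengthscale**2)

x = torch.tensor([0.3, -1.2])
z = torch.tensor([1.0, 0.5], requires_grad=True)
ls = 1.5

k = se_kernel(x, z, ls)
k.backward()  # autograd computes dk/dz

# Analytic formula: ∂k/∂z_j = k(x, z) (x_j - z_j) / ℓ²
analytic = se_kernel(x, z, ls).detach() * (x - z.detach()) / ls**2

print(z.grad)    # autograd gradient
print(analytic)  # matches the analytic derivative
```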
Learning inducing locations is a non-convex optimization problem with many local optima. Poor initialization can lead to suboptimal solutions where inducing points cluster together or fail to cover important regions. Initialization with k-means or other heuristics is strongly recommended before gradient optimization.
The collapse phenomenon:
A well-known failure mode is inducing point collapse: during optimization, inducing points may converge to nearly identical locations, reducing the effective M. This tends to happen when M is larger than the data requires, when the learned length-scale grows so large that neighboring inducing points become redundant, or when the objective is nearly flat with respect to duplicate locations.
Mitigation strategies include initializing with k-means or another space-filling heuristic, monitoring the minimum pairwise distance between inducing points during training (as done in the code below), and reducing M when redundancy is detected.
| Aspect | Fixed Selection | Learned Locations |
|---|---|---|
| Flexibility | Limited to training points or heuristics | Anywhere in input space |
| Optimization | Discrete/combinatorial | Continuous gradient descent |
| Cost | O(NM²) for greedy methods | O(NM² × iterations) |
| Local optima | N/A for heuristics | Significant concern |
| Hyperparameter coupling | Decoupled | Jointly optimized |
| Final quality | Good for well-chosen M | Often better, but not guaranteed |
```python
import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

class SparseGPWithLearnedZ(nn.Module):
    """
    Sparse GP with learnable inducing point locations.

    This implementation demonstrates:
    1. Parameterizing inducing locations as optimizable tensors
    2. Computing gradients through kernel operations
    3. Joint optimization of Z with kernel hyperparameters
    """

    def __init__(self, X: np.ndarray, M: int, init_method: str = 'kmeans'):
        super().__init__()
        N, d = X.shape

        # Initialize inducing points
        if init_method == 'kmeans':
            from scipy.cluster.vq import kmeans2
            Z_init, _ = kmeans2(X, M, minit='++')
        elif init_method == 'random':
            indices = np.random.choice(N, M, replace=False)
            Z_init = X[indices].copy()
        else:
            raise ValueError(f"Unknown init method: {init_method}")

        # Learnable parameters
        self.Z = nn.Parameter(torch.tensor(Z_init, dtype=torch.float64))
        self.log_lengthscale = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_variance = nn.Parameter(torch.tensor(0.0, dtype=torch.float64))
        self.log_noise = nn.Parameter(torch.tensor(-2.0, dtype=torch.float64))

        # Store training data
        self.register_buffer('X', torch.tensor(X, dtype=torch.float64))

    @property
    def lengthscale(self):
        return torch.exp(self.log_lengthscale)

    @property
    def variance(self):
        return torch.exp(self.log_variance)

    @property
    def noise_variance(self):
        return torch.exp(self.log_noise)

    def se_kernel(self, X1, X2):
        """Squared exponential kernel with current hyperparameters."""
        dist_sq = torch.sum((X1.unsqueeze(1) - X2.unsqueeze(0))**2, dim=2)
        return self.variance * torch.exp(-0.5 * dist_sq / self.lengthscale**2)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        """
        Compute the negative log marginal likelihood (to minimize).

        Uses the DTC approximation for simplicity. The variational
        extension is covered on the next page.
        """
        X, Z = self.X, self.Z
        N, M = len(X), len(Z)

        # Kernel matrices
        Kuu = self.se_kernel(Z, Z) + 1e-6 * torch.eye(M, dtype=torch.float64)
        Kfu = self.se_kernel(X, Z)
        Kuf = Kfu.T

        # Cholesky factors of Kuu and Λ = Kuu + σ^{-2} Kuf Kfu
        L_uu = torch.linalg.cholesky(Kuu)
        Lambda = Kuu + Kuf @ Kfu / self.noise_variance
        L_lambda = torch.linalg.cholesky(Lambda)

        # log p(y) = -0.5 y^T (Qff + σ²I)^{-1} y - 0.5 log|Qff + σ²I| - N/2 log(2π)

        # Quadratic form via the Woodbury identity
        y_scaled = y / self.noise_variance
        c = (Kuf @ y_scaled).unsqueeze(1)            # (M, 1)
        c_solve = torch.cholesky_solve(c, L_lambda)  # Λ^{-1} c
        quad_form = torch.dot(y_scaled, y) - (c * c_solve).sum()

        # Log determinant via the matrix determinant lemma
        log_det_noise = N * self.log_noise
        log_det_Lambda = 2 * torch.sum(torch.log(torch.diag(L_lambda)))
        log_det_Kuu = 2 * torch.sum(torch.log(torch.diag(L_uu)))
        log_det = log_det_noise + log_det_Lambda - log_det_Kuu

        # Negative log marginal likelihood
        neg_lml = 0.5 * quad_form + 0.5 * log_det + 0.5 * N * np.log(2 * np.pi)
        return neg_lml

    def predict(self, X_star: torch.Tensor):
        """Predictive cross-covariance (placeholder for the full DTC predictive)."""
        with torch.no_grad():
            Z = self.Z
            M = len(Z)
            Kuu = self.se_kernel(Z, Z) + 1e-6 * torch.eye(M, dtype=torch.float64)
            Ksu = self.se_kernel(X_star, Z)
            Kss_diag = self.variance.expand(len(X_star))  # SE diagonal is constant variance
            L_uu = torch.linalg.cholesky(Kuu)
            # ... (prediction equations would continue)
            return Ksu  # Placeholder

def train_sparse_gp_with_learned_z(
    X: np.ndarray,
    y: np.ndarray,
    M: int,
    n_iterations: int = 500,
):
    """Train a sparse GP with jointly optimized inducing points."""
    model = SparseGPWithLearnedZ(X, M, init_method='kmeans')
    y_tensor = torch.tensor(y, dtype=torch.float64)

    optimizer = Adam(model.parameters(), lr=0.01)

    print("Training Sparse GP with Learned Inducing Points")
    print("=" * 50)

    history = {'loss': [], 'lengthscale': [], 'z_spread': []}

    for i in range(n_iterations):
        optimizer.zero_grad()
        loss = model(y_tensor)
        loss.backward()
        optimizer.step()

        # Monitor inducing point spread (collapse detection)
        Z = model.Z.detach().numpy()
        pairwise_dists = np.sqrt(np.sum((Z[:, None] - Z[None, :])**2, axis=2))
        np.fill_diagonal(pairwise_dists, np.inf)
        min_spread = pairwise_dists.min()

        history['loss'].append(loss.item())
        history['lengthscale'].append(model.lengthscale.item())
        history['z_spread'].append(min_spread)

        if (i + 1) % 100 == 0:
            print(f"Iter {i+1}: Loss={loss.item():.2f}, "
                  f"Lengthscale={model.lengthscale.item():.3f}, "
                  f"Min Z dist={min_spread:.4f}")

    return model, history

# Example usage
if __name__ == "__main__":
    np.random.seed(42)
    torch.manual_seed(42)

    # Generate data
    N = 500
    X = np.random.randn(N, 2) * 2
    y = np.sin(X[:, 0]) + 0.5 * np.cos(X[:, 1]) + 0.1 * np.random.randn(N)

    model, history = train_sparse_gp_with_learned_z(X, y, M=30, n_iterations=300)

    print("\nFinal inducing points:")
    print(model.Z.detach().numpy()[:5])  # Show first 5
```

High-dimensional input spaces pose fundamental challenges for inducing point methods. The curse of dimensionality means that covering a d-dimensional space requires exponentially many points—far more than is computationally feasible.
The geometry problem:
In d dimensions, the volume of a hypercube grows as $V = L^d$ where $L$ is the side length. To maintain a fixed point density $\rho$, the number of required inducing points scales as:
$$M \propto \rho \cdot L^d$$
For d = 10 and moderate L, this quickly becomes intractable. With M = 1,000 inducing points in 10D, the average distance to the nearest inducing point may exceed the kernel length-scale, making the sparse approximation useless.
The curse manifests in several ways: inducing points become sparse relative to the kernel length-scale, the Nyström reconstruction error grows, and predictive uncertainty reverts to the prior away from the inducing set.
Strategies for high dimensions:
1. Automatic Relevance Determination (ARD)
Use an ARD kernel with per-dimension length-scales:
$$k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left(-\frac{1}{2}\sum_{j=1}^d \frac{(x_j - x_j')^2}{\ell_j^2}\right)$$
If the function depends strongly on only a few dimensions, ARD learns large length-scales for irrelevant dimensions, effectively reducing dimensionality.
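A minimal NumPy sketch of this effect (the length-scale values are purely illustrative): giving one dimension a very large length-scale makes the ARD kernel behave as if that dimension were absent.

```python
import numpy as np

def ard_se_kernel(X1, X2, lengthscales, variance=1.0):
    """Squared exponential kernel with per-dimension length-scales (ARD)."""
    ls = np.asarray(lengthscales)
    diff = (X1[:, None, :] - X2[None, :, :]) / ls   # scale each dimension separately
    return variance * np.exp(-0.5 * np.sum(diff**2, axis=2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Dimension 2 gets a huge length-scale, flattening its contribution
K_ard = ard_se_kernel(X, X, lengthscales=[1.0, 1.0, 100.0])
K_2d = ard_se_kernel(X[:, :2], X[:, :2], lengthscales=[1.0, 1.0])

print(np.max(np.abs(K_ard - K_2d)))  # ~0: the third dimension is effectively ignored
```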
2. Dimensionality reduction
Project inputs to a lower-dimensional space before constructing the GP, for example with PCA, random projections, or a learned neural-network embedding (as in deep kernel learning).
3. Additive structure
Decompose the function as a sum of low-dimensional components: $$f(\mathbf{x}) = \sum_{s \in S} f_s(\mathbf{x}_s)$$ where each $\mathbf{x}_s$ is a subset of variables. Separate inducing points for each component avoid the full-dimensional curse.
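A small sketch of the additive construction, assuming one-dimensional components with SE kernels (names and length-scales are illustrative); in a sparse GP, each component could then carry its own small set of one-dimensional inducing points.

```python
import numpy as np

def se_1d(x1, x2, lengthscale=1.0):
    """1-D squared exponential kernel on a single input dimension."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def additive_kernel(X1, X2, lengthscales):
    """k(x, x') = sum_j k_j(x_j, x_j'): a sum of per-dimension components."""
    K = np.zeros((len(X1), len(X2)))
    for j, ls in enumerate(lengthscales):
        K += se_1d(X1[:, j], X2[:, j], ls)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
K = additive_kernel(X, X, lengthscales=[1.0] * 5)
print(K.shape, np.allclose(K, K.T))  # (100, 100) True
```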
4. Structured inducing points
Place inducing points on a Kronecker product grid. For separable (product) kernels, $K_{uu}$ factorizes as a Kronecker product of small per-dimension matrices, reducing its factorization cost from O(M³) to roughly O(d · M^{3/d}), though applicability is limited to grid-structured inducing sets.
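A brief sketch of the Kronecker structure, assuming a separable (product) SE kernel and a 2-D grid (all names illustrative): the full $K_{uu}$ over the grid equals the Kronecker product of the small per-dimension Gram matrices, so only those small factors ever need to be factorized.

```python
import numpy as np

def se_1d(a, b, lengthscale=1.0):
    """1-D squared exponential Gram matrix."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

# Per-dimension grids: M = m1 * m2 inducing points in 2-D
m1, m2 = 15, 15
g1 = np.linspace(-3, 3, m1)
g2 = np.linspace(-3, 3, m2)

K1 = se_1d(g1, g1)  # (m1, m1)
K2 = se_1d(g2, g2)  # (m2, m2)

# Full K_uu over the grid (product SE kernel with unit length-scale)
Z = np.stack(np.meshgrid(g1, g2, indexing='ij'), -1).reshape(-1, 2)
d2 = np.sum((Z[:, None] - Z[None, :])**2, axis=2)
K_full = np.exp(-0.5 * d2)  # (m1*m2, m1*m2)

# The grid Gram matrix is exactly the Kronecker product of the small factors
assert np.allclose(K_full, np.kron(K1, K2))
# In practice one factorizes K1 and K2 (O(m^3) each) instead of K_full (O(M^3))
```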
For d > 10-20, sparse GPs with standard inducing points become increasingly unreliable. Consider alternatives: deep kernel learning (combining neural networks with GPs), random feature approximations (next pages), or exploiting problem-specific structure. If using sparse GPs in high dimensions, always validate approximation quality on held-out data.
Consolidating the concepts from this page, here are practical recommendations for inducing point selection across different scenarios:
| Scenario | Recommended Approach | Typical M | Notes |
|---|---|---|---|
| Quick prototyping | K-means++ | 100-500 | Simple baseline, often sufficient |
| Production model | K-means + gradient refinement | 500-2000 | Best quality with moderate cost |
| Small data (N < 5000) | Greedy information gain | √N to N/10 | Worth the extra computation |
| Large data (N > 50000) | K-means + minibatch training | 1000-5000 | Gradient refinement impractical |
| Non-uniform data | Leverage score or greedy | Varies | Need good coverage of dense regions |
| High dimensions (d > 10) | K-means + ARD kernel | 1000+ | Consider alternatives |
| Sequential/streaming | Incremental k-medoids | Budget-limited | Update Z as data arrives |
Choosing M:
The optimal number of inducing points depends on the data complexity and computational budget: as a rough guide, M should be large enough that the typical distance from a training point to its nearest inducing point stays below the kernel length-scale, and increasing M is usually a better investment than more elaborate selection over a too-small set.
Validation:
Always validate inducing point quality: compare sparse predictions against an exact GP on a manageable subset, track the Nyström reconstruction error as M grows, and check predictive accuracy and calibration on held-out data.
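One simple validation loop, sketched below under the assumption of an SE kernel and k-means placement (all names are illustrative): sweep M on a manageable subset and watch the relative Nyström error level off.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def se_kernel(X1, X2, lengthscale=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / lengthscale**2)

def rel_nystrom_error(X, Z, kernel_fn, jitter=1e-6):
    Kff = kernel_fn(X, X)
    Kfu = kernel_fn(X, Z)
    Kuu = kernel_fn(Z, Z) + jitter * np.eye(len(Z))
    Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
    return np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 2))
subset = X[rng.choice(len(X), 400, replace=False)]  # keep the check cheap

for M in [10, 20, 50, 100, 200]:
    Z, _ = kmeans2(subset, M, minit='++', seed=0)
    err = rel_nystrom_error(subset, Z, se_kernel)
    print(f"M={M:4d}: relative Nyström error = {err:.4f}")
# Choose the smallest M past which the error (and held-out metrics) stop improving.
```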
Common mistakes to avoid: using too few inducing points when the kernel length-scale is short, skipping a sensible initialization before gradient refinement, ignoring inducing point collapse during training, and never validating the sparse approximation against held-out data.
In practice, well-initialized k-means inducing points get you 80% of the way to optimal. Gradient refinement of Z provides marginal improvement in most cases. Invest effort in proper initialization and appropriate M, rather than sophisticated optimization of Z locations.
We have developed a comprehensive understanding of inducing point selection—the critical design choice that determines the quality of sparse GP approximations. The key insights: selection is fundamentally a compression problem; k-means-style heuristics are strong, cheap baselines; greedy information-theoretic selection carries a submodular (1 - 1/e) guarantee; gradient-based learning of Z offers the most flexibility but needs good initialization and collapse monitoring; and high-dimensional inputs demand additional structure such as ARD, additive decompositions, or dimensionality reduction.
What's next:
With the mechanics of inducing point selection established, we now turn to the variational sparse GP framework. This elegant approach recasts sparse GP inference as variational inference, providing principled optimization objectives, proper uncertainty quantification, and enabling scalable stochastic training with minibatches. The variational perspective unifies and extends the classical DTC/FITC approximations we developed earlier.
You now command a toolkit of inducing point selection methods, from quick heuristics to sophisticated optimization. This knowledge is essential for deploying sparse GPs effectively—the best algorithm in the world fails if inducing points are poorly chosen. The variational framework ahead will leverage these inducing points in a principled probabilistic inference scheme.