The previous page established an uncomfortable truth: exact Gaussian Process inference scales as O(N³), rendering it impractical for datasets beyond tens of thousands of points. The fundamental challenge is the N×N kernel matrix K—computing it, storing it, and inverting it all scale prohibitively with dataset size.
Sparse Gaussian Process methods offer a compelling resolution: what if we could compress the information from N training points into a much smaller set of M representative points, where M << N? If successful, we would replace the N×N matrix operations with M×M operations, dramatically reducing computational burden while preserving the essential GP predictions.
This seemingly simple idea—approximating a large GP with a smaller one—conceals remarkable mathematical depth. The challenge is not merely computational; it's representational. How do we choose which M points best summarize N points? What information is lost in the compression? How do we quantify the approximation error? And critically, how do we ensure the resulting approximation remains a valid Gaussian Process with meaningful uncertainty estimates?
This page develops the foundational sparse GP framework that addresses these questions, establishing the theoretical apparatus for all modern scalable GP methods.
By the end of this page, you will understand the core principles behind sparse GP approximations, master the mathematical formulation that reduces O(N³) to O(NM²), differentiate between classical sparse approximations (SoD, SoR, DTC, FITC), and develop intuition for when and why these approximations succeed or fail.
At the heart of all sparse GP methods lies the concept of inducing points (also called pseudo-inputs, support points, or basis points). Instead of conditioning the GP directly on all N training observations, we introduce a set of M << N synthetic locations $Z = \{\mathbf{z}_1, ..., \mathbf{z}_M\}$ in the input space, along with corresponding function values $\mathbf{u} = [f(\mathbf{z}_1), ..., f(\mathbf{z}_M)]^\top$.
The key insight is that under the GP prior, the function values at any finite set of points are jointly Gaussian. In particular, the training function values $\mathbf{f} = [f(\mathbf{x}_1), ..., f(\mathbf{x}_N)]^\top$ and inducing values $\mathbf{u}$ satisfy:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} K_{ff} & K_{fu} \\ K_{uf} & K_{uu} \end{bmatrix} \right)$$
where $K_{ff}$ is the $N \times N$ kernel matrix over the training inputs, $K_{uu}$ is the $M \times M$ kernel matrix over the inducing locations, and $K_{fu} = K_{uf}^\top$ is the $N \times M$ cross-covariance between the two sets.
The conditional distribution of $\mathbf{f}$ given $\mathbf{u}$ is also Gaussian:
$$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, \, K_{ff} - K_{fu}K_{uu}^{-1}K_{uf})$$
This conditioning relationship is exact under the GP prior—no approximation has been made yet. The magic of sparse GPs happens when we make strategic approximations to this conditional structure.
Notice that K_{uu}^{-1} is only M×M. If we can reformulate inference to depend primarily on K_{uu} rather than K_{ff}, we've shifted the bottleneck from N to M. The art of sparse GPs lies in clever approximations that make this reformulation possible while minimizing information loss.
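This exactness is easy to verify numerically. The sketch below (the helper `rbf` is illustrative, assuming a squared-exponential kernel) builds the joint prior over a handful of training and inducing points and checks that marginalizing $\mathbf{u}$ back out of $p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$ recovers the exact prior covariance $K_{ff}$:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel between row-stacked point sets."""
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 1))   # training inputs
Z = rng.normal(size=(3, 1))   # inducing inputs

Kff = rbf(X, X)
Kuu = rbf(Z, Z) + 1e-10 * np.eye(3)
Kfu = rbf(X, Z)

# Conditional p(f|u): mean map A = Kfu Kuu^{-1}, covariance Kff - Kfu Kuu^{-1} Kuf
A = Kfu @ np.linalg.solve(Kuu, np.eye(3))
cond_cov = Kff - A @ Kfu.T

# Marginalizing u ~ N(0, Kuu) back out must recover the exact prior:
# Cov[f] = A Kuu A^T + cond_cov = Q_ff + (K_ff - Q_ff) = K_ff
recovered = A @ Kuu @ A.T + cond_cov
err = np.max(np.abs(recovered - Kff))
print(err)
```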
The graphical model perspective:
Sparse GPs can be understood through the lens of probabilistic graphical models. In the exact GP, all training points $\mathbf{f}$ are correlated through the full covariance matrix $K_{ff}$. In sparse GPs, we introduce inducing variables $\mathbf{u}$ as intermediate latent variables that mediate the dependence:
$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$$
Different sparse GP approximations correspond to different assumptions about the conditional $p(\mathbf{f}|\mathbf{u})$—specifically, which dependencies are retained and which are discarded. This graphical model viewpoint clarifies that we're not merely subsampling data, but intelligently restructuring the probabilistic model.
The simplest approach to scaling GPs is to ignore most of the data entirely. Subset of Data (SoD) selects M representative training points from the full dataset of N points and performs exact GP inference using only this subset.
Let $(X_M, \mathbf{y}_M)$ denote the selected M points. Inference proceeds exactly as for standard GPs:
$$\mu_* = \mathbf{k}_{*M}^\top (K_{MM} + \sigma_n^2 I)^{-1} \mathbf{y}_M$$

$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_{*M}^\top (K_{MM} + \sigma_n^2 I)^{-1} \mathbf{k}_{*M}$$
where $K_{MM}$ is the M×M kernel matrix over the selected subset.
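A minimal SoD baseline can be sketched in a few lines (the helper `sod_predict` and its random-subset policy are illustrative, not a prescribed API): the exact GP equations applied to M randomly chosen points.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / ell**2)

def sod_predict(X, y, X_star, M, noise_var=0.01, seed=0):
    """Subset of Data: run the exact GP equations on M randomly chosen points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)
    XM, yM = X[idx], y[idx]
    KMM = rbf(XM, XM) + noise_var * np.eye(M)       # only an M x M system
    ksM = rbf(X_star, XM)                           # (N*, M) cross-covariance
    mean = ksM @ np.linalg.solve(KMM, yM)
    kss = rbf(X_star, X_star)[np.diag_indices(len(X_star))]
    var = kss - np.sum(ksM * np.linalg.solve(KMM, ksM.T).T, axis=1) + noise_var
    return mean, var

X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(1).normal(size=200)
mu, var = sod_predict(X, y, X[:5], M=30)
print(mu.round(2), var.round(3))
```

Note that the predictive variance only reflects the M retained points, which is exactly the overconfidence problem discussed below.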
Subset selection strategies:
The performance of SoD depends critically on how the M points are selected. Common strategies include uniform random subsampling (cheap, but blind to geometry), k-means clustering with cluster centers as representatives (covers the input distribution), and greedy information-based selection that repeatedly adds the point with the largest residual variance (better coverage at higher selection cost).
Despite its limitations, SoD serves as an important baseline. If a sophisticated sparse GP method cannot outperform well-selected SoD, its additional complexity is not justified. Moreover, SoD reveals the information-theoretic limits of what M points can capture about N points—setting expectations for all sparse methods.
SoD's most serious flaw is uncertainty estimation. By ignoring N-M points, the model has no memory that data exists in those regions. Predictions may be confident when they should be uncertain, leading to overconfident decisions. This is particularly dangerous in safety-critical applications like Bayesian optimization with expensive experiments.
Subset of Regressors (SoR), also known as the Nyström approximation in the kernel methods literature, takes a fundamentally different approach. Rather than ignoring data, SoR uses all N observations but approximates the kernel matrix $K_{ff}$ with a low-rank structure.
The key approximation is:
$$K_{ff} \approx Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$$
Here, $Q_{ff}$ is the Nyström approximation of the full kernel matrix—a rank-M approximation that can be computed and stored efficiently. The intuition is that the kernel function can be approximated by its projection onto the span of kernel functions centered at the M inducing points.
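As a quick sanity check of the low-rank idea, the sketch below (RBF kernel and grid inducing points, both assumptions for illustration) forms $Q_{ff}$ and measures its relative Frobenius error against the exact $K_{ff}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
Z = np.linspace(-3, 3, 20)[:, None]          # M = 20 inducing points on a grid

k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
Kff = k(X, X)
Kuu = k(Z, Z) + 1e-8 * np.eye(len(Z))
Kfu = k(X, Z)

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)      # rank-M Nystrom approximation

rel_err = np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')
print(f"relative Frobenius error with M = 20: {rel_err:.2e}")
```

For a smooth kernel on a bounded domain, a modest M already gives a tiny relative error; the residual $K_{ff} - Q_{ff}$ is also positive semi-definite, so its diagonal is non-negative.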
Mathematical derivation:
Consider the kernel as defining a feature map $\phi(\mathbf{x})$ in a (possibly infinite-dimensional) Hilbert space:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$$
The Nyström approximation projects this feature map onto the subspace spanned by $\{\phi(\mathbf{z}_1), ..., \phi(\mathbf{z}_M)\}$. The projected feature map is:
$$\tilde{\phi}(\mathbf{x}) = K_{uu}^{-1/2} \mathbf{k}_u(\mathbf{x})$$
where $\mathbf{k}_u(\mathbf{x}) = [k(\mathbf{z}_1, \mathbf{x}), ..., k(\mathbf{z}_M, \mathbf{x})]^\top$. The approximate kernel becomes:
$$\tilde{k}(\mathbf{x}, \mathbf{x}') = \langle \tilde{\phi}(\mathbf{x}), \tilde{\phi}(\mathbf{x}') \rangle = \mathbf{k}_u(\mathbf{x})^\top K_{uu}^{-1} \mathbf{k}_u(\mathbf{x}')$$
Predictive equations under SoR:
Substituting the Nyström approximation into the GP predictive equations:
$$\mu_* = \mathbf{k}_{*u}^\top (K_{uu} + K_{uf}\Sigma^{-1}K_{fu})^{-1} K_{uf}\Sigma^{-1}\mathbf{y}$$

$$\sigma_*^2 = \mathbf{k}_{*u}^\top (K_{uu} + K_{uf}\Sigma^{-1}K_{fu})^{-1} \mathbf{k}_{u*}$$
where $\Sigma = \sigma_n^2 I$ is the noise covariance.
Using the Woodbury identity, these can be rewritten as:
$$\mu_* = \mathbf{k}_{*u}^\top K_{uu}^{-1} K_{uf} (Q_{ff} + \Sigma)^{-1} \mathbf{y}$$
The term $(Q_{ff} + \Sigma)^{-1}$ still involves an N×N matrix, but the low-rank structure of $Q_{ff}$ enables efficient computation via the matrix inversion lemma:
$$(Q_{ff} + \Sigma)^{-1} = \Sigma^{-1} - \Sigma^{-1}K_{fu}(K_{uu} + K_{uf}\Sigma^{-1}K_{fu})^{-1}K_{uf}\Sigma^{-1}$$
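The identity is easy to verify on random matrices; a small check, assuming a synthetic SPD $K_{uu}$ and Gaussian $K_{fu}$ for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 5
Kfu = rng.normal(size=(N, M))
W = rng.normal(size=(M, M))
Kuu = W @ W.T + np.eye(M)             # synthetic SPD M x M matrix
Sigma = 0.1 * np.eye(N)               # noise covariance
Sigma_inv = np.linalg.inv(Sigma)

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)

# Left: direct N x N inversion (what we want to avoid)
direct = np.linalg.inv(Qff + Sigma)

# Right: matrix inversion lemma, requiring only an M x M solve
inner = Kuu + Kfu.T @ Sigma_inv @ Kfu
woodbury = Sigma_inv - Sigma_inv @ Kfu @ np.linalg.solve(inner, Kfu.T @ Sigma_inv)

diff = np.max(np.abs(direct - woodbury))
print(diff)
```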
| Operation | Complexity | Notes |
|---|---|---|
| Compute K_{uu} | O(M²d) | M×M kernel matrix over inducing points |
| Compute K_{fu} | O(NMd) | N×M cross-covariance |
| Cholesky of K_{uu} | O(M³) | Small matrix factorization |
| Compute Λ = K_{uu} + K_{uf}Σ⁻¹K_{fu} | O(NM² + M³) | Key matrix for prediction |
| Solve triangular systems | O(M²) | Per prediction |
| Total training | O(NM² + M³) | Dominated by cross-covariance products |
SoR reduces complexity from O(N³) to O(NM²). For typical settings where M ∝ √N or M = O(log N), this represents a dramatic improvement. With N = 100,000 and M = 1,000, the speedup factor is approximately 10,000×.
The Deterministic Training Conditional (DTC) approximation, also known as the Projected Latent Variables approximation, provides a probabilistically coherent interpretation of the Nyström/SoR method. Rather than viewing the approximation as a matrix factorization trick, DTC interprets it through the lens of conditional Gaussian distributions.
Recall the exact joint distribution:
$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$$
The exact conditional $p(\mathbf{f}|\mathbf{u})$ has covariance $K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}$, which is non-diagonal and captures residual correlations between training points not explained by the inducing points.
DTC's key approximation is to replace this conditional with a deterministic mapping:
$$p_{\text{DTC}}(\mathbf{f}|\mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, \, \mathbf{0})$$
That is, given the inducing values $\mathbf{u}$, the training function values are exactly determined with no residual uncertainty. This is a dramatic simplification—we assume the inducing points capture all variability in the function.
Consequences of the DTC assumption:
With observation noise $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the likelihood becomes:
$$p(\mathbf{y}|\mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, \, \sigma_n^2 I)$$
The marginal likelihood, integrating out $\mathbf{u}$:
$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{u})p(\mathbf{u})d\mathbf{u} = \mathcal{N}(\mathbf{0}, \, Q_{ff} + \sigma_n^2 I)$$
where $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$ is exactly the Nyström approximation.
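This marginal follows from standard linear-Gaussian algebra: if $\mathbf{y} = K_{fu}K_{uu}^{-1}\mathbf{u} + \boldsymbol{\epsilon}$ with $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, K_{uu})$, its covariance is exactly $Q_{ff} + \sigma_n^2 I$. A quick numeric confirmation (RBF kernel assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, noise = 40, 6, 0.05
X, Z = rng.normal(size=(N, 1)), rng.normal(size=(M, 1))
k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
Kuu = k(Z, Z) + 1e-10 * np.eye(M)
Kfu = k(X, Z)

# y = A u + eps with A = Kfu Kuu^{-1}, u ~ N(0, Kuu), eps ~ N(0, noise * I)
A = Kfu @ np.linalg.solve(Kuu, np.eye(M))
marginal_cov = A @ Kuu @ A.T + noise * np.eye(N)  # covariance of the DTC marginal

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
err = np.max(np.abs(marginal_cov - (Qff + noise * np.eye(N))))
print(err)  # ~0
```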
The posterior over inducing points:
$$p(\mathbf{u}|\mathbf{y}) = \mathcal{N}(\mu_u, \Sigma_u)$$
where:

$$\Sigma_u = K_{uu}\left(K_{uu} + \sigma_n^{-2}K_{uf}K_{fu}\right)^{-1}K_{uu}$$

$$\mu_u = \sigma_n^{-2}\,\Sigma_u K_{uu}^{-1}K_{uf}\mathbf{y}$$
Because DTC sets the conditional variance to zero, it systematically underestimates predictive uncertainty. The model cannot represent uncertainty arising from limited inducing point coverage—it assumes the inducing points explain everything. This is particularly problematic in regions far from inducing points or where the true function has fine-scale structure.
```python
import numpy as np
from numpy import linalg as la


class DTCSparseGP:
    """
    Deterministic Training Conditional (DTC) Sparse Gaussian Process.

    Implements the DTC approximation with O(NM²) complexity, compared to
    exact GP inference with O(N³) complexity. This is the "textbook"
    sparse GP, providing a foundation for understanding more sophisticated
    methods like FITC and VFE.
    """

    def __init__(self, kernel_fn, noise_variance: float = 1e-3):
        """
        Args:
            kernel_fn: Kernel function k(X1, X2) -> matrix
            noise_variance: Observation noise variance σ²
        """
        self.kernel_fn = kernel_fn
        self.noise_var = noise_variance
        self.fitted = False

    def fit(self, X: np.ndarray, y: np.ndarray, Z: np.ndarray) -> 'DTCSparseGP':
        """
        Fit the DTC sparse GP.

        Args:
            X: (N, d) training inputs
            y: (N,) training targets
            Z: (M, d) inducing point locations

        Returns:
            self for method chaining
        """
        N, M = len(X), len(Z)

        # Compute kernel matrices
        Kuu = self.kernel_fn(Z, Z)   # M x M
        Kfu = self.kernel_fn(X, Z)   # N x M
        self.L_uu = la.cholesky(Kuu + 1e-6 * np.eye(M))  # Regularized Cholesky

        # Compute Λ = Kuu + σ⁻² Kuf @ Kfu  (M x M)
        Lambda = Kuu + (1 / self.noise_var) * Kfu.T @ Kfu

        # Cholesky of Lambda for stable solves
        L_lambda = la.cholesky(Lambda)

        # Compute α = σ⁻² Λ⁻¹ Kuf y
        alpha = la.solve(L_lambda.T, la.solve(L_lambda, Kfu.T @ y)) / self.noise_var

        # Store for prediction
        self.X, self.y, self.Z = X, y, Z
        self.Kuu = Kuu
        self.L_lambda = L_lambda
        self.alpha = alpha
        self.fitted = True
        return self

    def predict(self, X_star: np.ndarray, return_var: bool = True):
        """
        Make predictions at new points.

        Args:
            X_star: (N*, d) test inputs
            return_var: Whether to return predictive variance

        Returns:
            mean: (N*,) predictive means
            var: (N*,) predictive variances (if return_var=True)
        """
        if not self.fitted:
            raise RuntimeError("Model must be fitted before prediction")

        # Cross-covariance with inducing points
        Ksu = self.kernel_fn(X_star, self.Z)  # N* x M

        # Predictive mean: k_{*u} @ α
        mu = Ksu @ self.alpha

        if not return_var:
            return mu

        # DTC predictive variance:
        # σ²_* = k_** - k_{*u} Kuu⁻¹ k_{u*} + k_{*u} Λ⁻¹ k_{u*}
        # Note: DTC still underestimates uncertainty relative to the exact GP
        kss = self.kernel_fn(X_star, X_star)[np.diag_indices(len(X_star))]
        v_uu = la.solve(self.L_uu, Ksu.T)        # L_uu⁻¹ k_{u*}
        v_lam = la.solve(self.L_lambda, Ksu.T)   # L_Λ⁻¹ k_{u*}
        var = kss - np.sum(v_uu**2, axis=0) + np.sum(v_lam**2, axis=0)
        return mu, np.maximum(var, 0.0)

    def log_marginal_likelihood(self) -> float:
        """
        Compute the DTC log marginal likelihood log p(y | X, Z, θ).
        """
        if not self.fitted:
            raise RuntimeError("Model must be fitted first")

        N = len(self.y)

        # log|Qff + σ²I| via the matrix determinant lemma
        log_det_Kuu = 2 * np.sum(np.log(np.diag(self.L_uu)))
        log_det_Lambda = 2 * np.sum(np.log(np.diag(self.L_lambda)))
        log_det = N * np.log(self.noise_var) + log_det_Lambda - log_det_Kuu

        # y^T (Qff + σ²I)⁻¹ y via Woodbury
        Kfu = self.kernel_fn(self.X, self.Z)
        v = Kfu.T @ self.y / self.noise_var
        w = la.solve(self.L_lambda.T, la.solve(self.L_lambda, v))
        data_fit = self.y @ self.y / self.noise_var - v @ w

        return -0.5 * data_fit - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)


# Example usage
def squared_exponential_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Standard squared exponential (RBF) kernel."""
    dists_sq = np.sum((X1[:, None] - X2[None, :])**2, axis=2)
    return variance * np.exp(-0.5 * dists_sq / lengthscale**2)


# Demonstration
np.random.seed(42)
N, M = 1000, 50  # 1000 training points, 50 inducing points

# Generate synthetic data
X = np.random.randn(N, 2)
y = np.sin(X[:, 0]) + 0.5 * np.cos(2 * X[:, 1]) + 0.1 * np.random.randn(N)

# Select inducing points via k-means
from scipy.cluster.vq import kmeans2
Z, _ = kmeans2(X, M, minit='++')

# Fit DTC sparse GP
kernel_fn = lambda X1, X2: squared_exponential_kernel(X1, X2, lengthscale=1.0, variance=1.0)
model = DTCSparseGP(kernel_fn, noise_variance=0.01)
model.fit(X, y, Z)

# Compute log marginal likelihood
lml = model.log_marginal_likelihood()
print(f"DTC Log Marginal Likelihood: {lml:.2f}")

# Make predictions
X_test = np.random.randn(100, 2)
mu, var = model.predict(X_test)
print(f"Predictions: mean in [{mu.min():.2f}, {mu.max():.2f}], "
      f"var in [{var.min():.4f}, {var.max():.4f}]")
```

The Fully Independent Training Conditional (FITC) approximation addresses DTC's systematic variance underestimation by retaining diagonal residual variance—the part of each training point's variance not explained by the inducing points.
Recall the exact conditional:
$$p(f_i | \mathbf{u}) = \mathcal{N}(\mathbf{k}_{iu}K_{uu}^{-1}\mathbf{u}, \, k_{ii} - \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{k}_{ui})$$
The conditional variance $k_{ii} - \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{k}_{ui}$ represents the function uncertainty at point $i$ that is not captured by the inducing points. This is non-negative and can be substantial in regions far from inducing points.
FITC makes the approximation that, conditioned on $\mathbf{u}$, the training function values are independent (but not deterministic):
$$p_{\text{FITC}}(\mathbf{f}|\mathbf{u}) = \prod_{i=1}^N \mathcal{N}(f_i; \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{u}, \, \lambda_i)$$
where $\lambda_i = k_{ii} - \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{k}_{ui}$ is the residual variance at point $i$.
The FITC covariance structure:
Under FITC, the prior covariance over training points becomes:
$$K_{\text{FITC}} = Q_{ff} + \text{diag}(K_{ff} - Q_{ff}) = Q_{ff} + \Lambda$$
where $\Lambda = \text{diag}(\lambda_1, ..., \lambda_N)$ is a diagonal matrix of residual variances.
This structure is crucial: FITC says "training points are correlated through the inducing points (via $Q_{ff}$), but have independent residual noise beyond that (via $\Lambda$)." The diagonal structure of $\Lambda$ enables efficient computation.
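The sketch below constructs $K_{\text{FITC}}$ directly (RBF kernel assumed for illustration) and confirms that its diagonal matches the exact prior exactly, while off-diagonals are carried by $Q_{ff}$ alone:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))
Z = np.linspace(-2, 2, 8)[:, None]
k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))

Kff = k(X, X)
Kuu = k(Z, Z) + 1e-10 * np.eye(len(Z))
Kfu = k(X, Z)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)

# FITC prior covariance: low-rank part plus diagonal residual correction
K_fitc = Qff + np.diag(np.diag(Kff - Qff))

off_diag = ~np.eye(len(X), dtype=bool)
print("max off-diagonal deviation from the exact prior:",
      np.max(np.abs((K_fitc - Kff)[off_diag])))
```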
The FITC marginal likelihood:
$$\log p(\mathbf{y}) = \log \mathcal{N}(\mathbf{0}, Q_{ff} + \Lambda + \sigma_n^2 I)$$
Defining $D = \Lambda + \sigma_n^2 I$ (a diagonal matrix), this becomes:
$$\log p(\mathbf{y}) = \log \mathcal{N}(\mathbf{0}, Q_{ff} + D)$$
The Woodbury formula gives:
$$(Q_{ff} + D)^{-1} = D^{-1} - D^{-1}K_{fu}(K_{uu} + K_{uf}D^{-1}K_{fu})^{-1}K_{uf}D^{-1}$$
Since $D$ is diagonal, $D^{-1}$ is trivially O(N). The computational bottleneck remains the M×M matrix $(K_{uu} + K_{uf}D^{-1}K_{fu})$, preserving O(NM²) complexity.
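The same structure gives the log-determinant cheaply via the matrix determinant lemma, $|Q_{ff} + D| = |D|\,|K_{uu} + K_{uf}D^{-1}K_{fu}| \,/\, |K_{uu}|$, which the sketch below verifies on synthetic matrices (all sizes and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 60, 7
Kfu = rng.normal(size=(N, M))
W = rng.normal(size=(M, M))
Kuu = W @ W.T + np.eye(M)                       # synthetic SPD M x M matrix
D = np.diag(rng.uniform(0.1, 1.0, size=N))      # diagonal D = Lambda + sigma_n^2 I

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
D_inv = np.diag(1.0 / np.diag(D))               # diagonal inverse is O(N)

# Direct N x N log-determinant vs the matrix determinant lemma
logdet_direct = np.linalg.slogdet(Qff + D)[1]
inner = Kuu + Kfu.T @ D_inv @ Kfu               # the M x M bottleneck matrix
logdet_lemma = (np.linalg.slogdet(D)[1]
                + np.linalg.slogdet(inner)[1]
                - np.linalg.slogdet(Kuu)[1])
diff = abs(logdet_direct - logdet_lemma)
print(diff)
```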
FITC provides better variance estimation than DTC at the cost of additional computation (computing and storing the diagonal Λ). For well-placed inducing points, the improvement can be marginal. For poorly-placed inducing points, FITC provides crucial protection against overconfident predictions.
| Method | p(f\|u) Approximation | Prior Covariance | Variance Estimate | Complexity |
|---|---|---|---|---|
| DTC | Deterministic | Q_{ff} | Underestimates | O(NM²) |
| FITC | Diagonal conditional cov | Q_{ff} + diag(K_{ff} - Q_{ff}) | Better near data | O(NM²) |
| PITC | Block-diagonal conditional | Q_{ff} + blockdiag(...) | Best (among these) | O(NM² + NB²) |
| Full GP | Full conditional cov | K_{ff} | Exact | O(N³) |
PITC: Partially Independent Training Conditional:
A natural extension is PITC, which partitions training points into blocks and models block-wise dependencies while ignoring cross-block dependencies. PITC interpolates between FITC (every point is its own block) and the exact GP (all points in one block). It offers improved approximation at the cost of O(NB²) additional computation where B is the block size. In practice, FITC often provides sufficient improvement over DTC, and modern variational methods (covered in subsequent pages) supersede PITC.
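To make the interpolation concrete, here is a sketch (the helper `pitc_prior_cov` is illustrative) that assembles the PITC training covariance $Q_{ff} + \text{blockdiag}(K_{ff} - Q_{ff})$ for an arbitrary partition:

```python
import numpy as np

def pitc_prior_cov(Kff, Qff, blocks):
    """Assemble the PITC training covariance Q_ff + blockdiag(K_ff - Q_ff).

    `blocks` is a list of index lists partitioning the training points.
    Singleton blocks recover FITC; one block of all points recovers K_ff.
    """
    K = Qff.copy()
    for b in blocks:
        K[np.ix_(b, b)] = Kff[np.ix_(b, b)]  # exact residual kept within each block
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 1))
Z = rng.normal(size=(3, 1))
k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
Kff = k(X, X)
Kuu = k(Z, Z) + 1e-10 * np.eye(3)
Qff = k(X, Z) @ np.linalg.solve(Kuu, k(Z, X))

fitc_cov = pitc_prior_cov(Kff, Qff, [[i] for i in range(12)])
exact_cov = pitc_prior_cov(Kff, Qff, [list(range(12))])
```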
Sparse GP methods introduce approximation error. Understanding when this error is acceptable—and when it corrupts results—is essential for practitioners. Several perspectives illuminate approximation quality:
The projection interpretation:
The Nyström approximation $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$ is a rank-M approximation to the full kernel matrix: the best one (in Frobenius norm) whose column space is spanned by the inducing columns of $K$. By the Eckart–Young theorem, its error is bounded below by that of the optimal rank-M eigendecomposition truncation:

$$\|K_{ff} - Q_{ff}\|_F^2 \;\geq\; \sum_{i=M+1}^{N} \lambda_i^2$$

where $\lambda_i$ are the eigenvalues of $K_{ff}$ beyond the top M. If the kernel matrix has a rapidly decaying spectrum (common for smooth kernels), a small M captures most of the variance.
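This spectral picture can be checked directly: the sketch below (RBF kernel assumed for illustration) compares how much variance the top 20 eigendirections capture for a smooth versus a rapidly varying kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
k = lambda A, B, ell: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2) / ell**2)

fracs = []
for ell in [2.0, 0.2]:  # smooth kernel vs rapidly varying kernel
    eigvals = np.sort(np.linalg.eigvalsh(k(X, X, ell)))[::-1]
    frac = eigvals[:20].sum() / eigvals.sum()  # variance in top 20 eigendirections
    fracs.append(frac)
    print(f"lengthscale {ell}: top-20 eigendirections capture {frac:.1%}")
```

The smooth kernel concentrates almost all variance in a few eigendirections, so a small M suffices; the short-lengthscale kernel spreads variance over many more, demanding a larger M.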
The information-theoretic view:
From an information theory perspective, the inducing points $\mathbf{u}$ should maximize the mutual information $I(\mathbf{f}; \mathbf{u})$—the information they provide about the full function values. This leads to the objective:
$$\max_Z I(\mathbf{f}; \mathbf{u}) = \frac{1}{2}\log|K_{ff}| - \frac{1}{2}\log|K_{ff} - Q_{ff}|$$
Equivalently, we minimize the residual entropy $H(\mathbf{f}|\mathbf{u})$. Greedy algorithms for inducing point selection often optimize this criterion.
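One common greedy scheme is equivalent to pivoted Cholesky on $K_{ff}$: at each step, add the candidate with the largest residual conditional variance given the points chosen so far. A sketch (the helper name `greedy_inducing` is illustrative):

```python
import numpy as np

def greedy_inducing(X, kernel, M, jitter=1e-10):
    """Greedy inducing-point selection, equivalent to pivoted Cholesky on K_ff.

    Each step picks the candidate whose residual conditional variance, given
    the points chosen so far, is largest, greedily shrinking H(f | u).
    """
    N = len(X)
    diag = kernel(X, X)[np.diag_indices(N)].copy()  # residual variances
    L = np.zeros((N, 0))                            # partial Cholesky columns
    chosen = []
    for _ in range(M):
        j = int(np.argmax(diag))                    # largest residual variance
        chosen.append(j)
        col = kernel(X, X[j:j + 1])[:, 0] - L @ L[j]
        col = col / np.sqrt(diag[j] + jitter)
        diag = diag - col**2                        # condition on the new point
        L = np.hstack([L, col[:, None]])
    return np.array(chosen)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
rbf = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
idx = greedy_inducing(X, rbf, M=10)
print("selected indices:", idx)
```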
When sparse approximations fail:
Sparse GPs struggle in specific scenarios:
- **Rapidly varying functions:** if the true function has a length-scale much shorter than the inducing point spacing, the approximation misses important structure
- **Non-stationary functions:** standard kernels assume a uniform length-scale; local structure may require many inducing points
- **High dimensions:** covering the input space requires exponentially many inducing points (curse of dimensionality)
- **Outlier regions:** data points far from inducing points have less predictive influence than they should
Choosing M is fundamentally a bias-variance trade-off. Small M gives fast computation but high approximation bias. Large M gives better approximation but approaches exact GP cost. Rules of thumb suggest M = O(√N) or M = O(log N) depending on smoothness, but the optimal choice is problem-specific and often determined empirically.
Efficient implementation of sparse GPs requires careful attention to numerical linear algebra. Several techniques are essential for production-quality code:
Cholesky-based computation:
Never explicitly compute matrix inverses. Instead, use Cholesky factorization:
$$K_{uu} = L_{uu}L_{uu}^\top$$
where $L_{uu}$ is lower-triangular. Then $K_{uu}^{-1}\mathbf{v}$ is computed by solving $L_{uu}L_{uu}^\top\mathbf{x} = \mathbf{v}$ via forward and back substitution. This is more numerically stable and equally fast.
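A small comparison, using SciPy's `cho_factor`/`cho_solve` (the synthetic matrix here is illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
M = 30
W = rng.normal(size=(M, M))
Kuu = W @ W.T + M * np.eye(M)     # well-conditioned SPD matrix
v = rng.normal(size=M)

# Preferred: factor once, then triangular solves for each right-hand side
c, low = cho_factor(Kuu, lower=True)
x_chol = cho_solve((c, low), v)

# Discouraged: explicit inverse (slower and less stable when Kuu is ill-conditioned)
x_inv = np.linalg.inv(Kuu) @ v

print(np.max(np.abs(x_chol - x_inv)))  # both solve the same system
```

The factorization can be reused across many right-hand sides, which is exactly the access pattern of sparse GP prediction.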
Memory-efficient cross-covariance:
The N×M cross-covariance matrix $K_{fu}$ dominates memory for large N. For prediction, we often don't need the full matrix—only its product with certain vectors. When possible, compute $K_{fu}\mathbf{v}$ incrementally without materializing the full matrix.
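A sketch of the chunked product (the helper name `kfu_matvec` and the chunk size are illustrative):

```python
import numpy as np

def kfu_matvec(X, Z, v, kernel, chunk=1000):
    """Compute K_fu @ v without materializing the full N x M matrix.

    Training inputs are processed in chunks, so peak memory is
    O(chunk * M) instead of O(N * M).
    """
    out = np.empty(len(X))
    for start in range(0, len(X), chunk):
        block = kernel(X[start:start + chunk], Z)  # (chunk, M) slab, then discarded
        out[start:start + chunk] = block @ v
    return out

rbf = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(5000, 2)), rng.normal(size=(50, 2))
v = rng.normal(size=50)
result = kfu_matvec(X, Z, v, rbf)
```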
Incremental updates:
When adding new inducing points (e.g., during greedy selection), use rank-1 updates to existing Cholesky factors rather than recomputing from scratch.
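When a point is appended, the factor grows by one row, which costs O(M²) rather than the O(M³) of refactorizing; a sketch (the helper name `cholesky_append` is illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def cholesky_append(L, k_new, k_nn):
    """Grow the Cholesky factor of K_uu by one inducing point in O(M^2).

    Given L with L @ L.T = K_uu, the cross-covariances k_new = k(Z, z_new),
    and the scalar k_nn = k(z_new, z_new), returns the (M+1) x (M+1) factor
    without refactorizing from scratch.
    """
    M = L.shape[0]
    c = solve_triangular(L, k_new, lower=True)  # forward substitution, O(M^2)
    d = np.sqrt(k_nn - c @ c)                   # new diagonal entry
    L_new = np.zeros((M + 1, M + 1))
    L_new[:M, :M] = L
    L_new[M, :M] = c
    L_new[M, M] = d
    return L_new

# Check against a full refactorization
Z = np.linspace(0.0, 5.0, 6)[:, None]
k = lambda A, B: np.exp(-0.5 * (A - B.T)**2)
K = k(Z, Z) + 1e-8 * np.eye(6)
L5 = np.linalg.cholesky(K[:5, :5])
L6 = cholesky_append(L5, K[:5, 5], K[5, 5])
```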
GPU acceleration:
Matrix-matrix products dominate sparse GP computation. These are highly parallelizable, making GPUs effective even at moderate N. The key is to batch operations and minimize CPU-GPU transfers.
```python
import numpy as np
from scipy import linalg as sla


class EfficientFITC:
    """
    Numerically stable implementation of the FITC sparse GP.

    Key optimizations:
      1. Cholesky-based solves (never explicit inversion)
      2. Precomputed quantities for repeated predictions
      3. Memory-efficient intermediate computations
      4. Numerically stable log-determinant computation
    """

    def __init__(self, kernel_fn, noise_variance=0.01, jitter=1e-6):
        self.kernel_fn = kernel_fn
        self.noise_var = noise_variance
        self.jitter = jitter

    def fit(self, X, y, Z):
        """Fit FITC with numerically stable Cholesky-based computation."""
        N, M = len(X), len(Z)

        # Kernel matrices
        Kuu = self.kernel_fn(Z, Z) + self.jitter * np.eye(M)
        Kfu = self.kernel_fn(X, Z)
        Kff_diag = self.kernel_fn(X, X)[np.diag_indices(N)]  # Only the diagonal is needed

        # Cholesky of K_uu
        L_uu = sla.cholesky(Kuu, lower=True)

        # V = L_uu^{-1} @ K_uf, so that V.T @ V = Q_ff
        V = sla.solve_triangular(L_uu, Kfu.T, lower=True)  # M x N

        # Q_ff diagonal: diag(K_fu @ K_uu^{-1} @ K_uf) = column sums of V²
        Qff_diag = np.sum(V**2, axis=0)

        # Residual variance (FITC diagonal) plus observation noise: D = Λ + σ²I
        Lambda = np.maximum(Kff_diag - Qff_diag, 0) + self.noise_var
        Lambda_inv = 1.0 / Lambda

        # Scaled V: V @ diag(Λ^{-1/2})
        V_scaled = V * np.sqrt(Lambda_inv)

        # B = I + V @ Λ^{-1} @ V.T = I + V_scaled @ V_scaled.T
        B = np.eye(M) + V_scaled @ V_scaled.T
        L_B = sla.cholesky(B, lower=True)

        # α = (Q_ff + D)^{-1} y via Woodbury:
        # (Q + D)^{-1} y = D^{-1} y - D^{-1} K_fu (K_uu + K_uf D^{-1} K_fu)^{-1} K_uf D^{-1} y
        y_scaled = y * Lambda_inv
        c = V @ y_scaled  # represents L_uu^{-1} K_uf D^{-1} y
        c_solve = sla.solve_triangular(
            L_B.T, sla.solve_triangular(L_B, c, lower=True), lower=False
        )
        alpha = y_scaled - Lambda_inv * (V.T @ c_solve)

        # Store for prediction
        self.Z, self.L_uu, self.L_B = Z, L_uu, L_B
        self.V = V
        self.alpha = alpha
        self.Lambda, self.Lambda_inv = Lambda, Lambda_inv

        # For the log marginal likelihood
        self._N, self._y, self._y_scaled = N, y, y_scaled
        return self

    def predict(self, X_star, return_var=False):
        """Predictions with the FITC variance correction."""
        Ksu = self.kernel_fn(X_star, self.Z)
        Kss_diag = self.kernel_fn(X_star, X_star)[np.diag_indices(len(X_star))]

        # V_star = L_uu^{-1} @ K_us
        V_star = sla.solve_triangular(self.L_uu, Ksu.T, lower=True)

        # Mean: μ_* = k_{*u} K_uu^{-1} K_uf α = (L_uu^{-1} K_us).T @ (L_uu^{-1} K_uf α)
        mean = V_star.T @ (self.V @ self.alpha)

        if not return_var:
            return mean

        # Variance: σ²_* = k_** - k_{*u} K_uu^{-1} k_{u*} + k_{*u} Σ k_{u*}
        # with Σ = (K_uu + K_uf D^{-1} K_fu)^{-1} = L_uu^{-T} B^{-1} L_uu^{-1}
        Qss_diag = np.sum(V_star**2, axis=0)
        B_solve = sla.solve_triangular(self.L_B, V_star, lower=True)
        correction = np.sum(B_solve**2, axis=0)
        var = Kss_diag - Qss_diag + correction
        return mean, np.maximum(var, 0)  # Clamp tiny negative values

    def log_marginal_likelihood(self):
        """Compute log p(y | X, Z, θ) with a stable log-determinant."""
        N = self._N

        # log|Q + D| = log|D| + log|B|
        log_det_D = np.sum(np.log(self.Lambda))
        log_det_B = 2 * np.sum(np.log(np.diag(self.L_B)))
        log_det = log_det_D + log_det_B

        # y^T (Q + D)^{-1} y via the same Woodbury decomposition as in fit()
        c = self.V @ self._y_scaled
        c_solve = sla.solve_triangular(
            self.L_B.T, sla.solve_triangular(self.L_B, c, lower=True), lower=False
        )
        quad_form = np.dot(self._y_scaled, self._y) - np.dot(c, c_solve)

        return -0.5 * quad_form - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)
```

We have developed the fundamental framework for sparse Gaussian Processes—methods that reduce complexity from O(N³) to O(NM²) by compressing the GP through M inducing points. The key insights:

- Inducing points summarize N training points through M latent function values, with exact Gaussian conditioning as the starting point
- The Nyström approximation $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$ replaces the full kernel matrix with a rank-M surrogate
- DTC treats the conditional $p(\mathbf{f}|\mathbf{u})$ as deterministic, which is efficient but underestimates uncertainty
- FITC restores the diagonal residual variance, improving uncertainty estimates at the same O(NM²) cost
What's next:
The methods presented here—DTC, FITC, and their variants—are foundational but somewhat ad-hoc. In the next page, we examine inducing point methods more systematically, exploring strategies for selecting and even learning the optimal inducing point locations. Following that, we develop the elegant variational sparse GP framework that unifies and extends these classical approximations with principled Bayesian inference.
You now understand the core mechanics of sparse GP approximations. The inducing point concept, the Nyström approximation, and the DTC/FITC frameworks form the conceptual substrate upon which all modern scalable GP methods build. These ideas will recur throughout the remaining pages as we explore more sophisticated approaches to GP scalability.