The previous page established an uncomfortable truth: exact Gaussian Process inference scales as O(N³), rendering it impractical for datasets beyond tens of thousands of points. The fundamental challenge is the N×N kernel matrix K—computing it, storing it, and inverting it all scale prohibitively with dataset size.
Sparse Gaussian Process methods offer a compelling resolution: what if we could compress the information from N training points into a much smaller set of M representative points, where M << N? If successful, we would replace the N×N matrix operations with M×M operations, dramatically reducing computational burden while preserving the essential GP predictions.
This seemingly simple idea—approximating a large GP with a smaller one—conceals remarkable mathematical depth. The challenge is not merely computational; it's representational. How do we choose which M points best summarize N points? What information is lost in the compression? How do we quantify the approximation error? And critically, how do we ensure the resulting approximation remains a valid Gaussian Process with meaningful uncertainty estimates?
This page develops the foundational sparse GP framework that addresses these questions, establishing the theoretical apparatus for all modern scalable GP methods.
By the end of this page, you will understand the core principles behind sparse GP approximations, master the mathematical formulation that reduces O(N³) to O(NM²), differentiate between classical sparse approximations (SoD, SoR, DTC, FITC), and develop intuition for when and why these approximations succeed or fail.
At the heart of all sparse GP methods lies the concept of inducing points (also called pseudo-inputs, support points, or basis points). Instead of conditioning the GP directly on all N training observations, we introduce a set of M << N synthetic locations $Z = \{\mathbf{z}_1, ..., \mathbf{z}_M\}$ in the input space, along with corresponding function values $\mathbf{u} = [f(\mathbf{z}_1), ..., f(\mathbf{z}_M)]^\top$.
The key insight is that under the GP prior, the function values at any finite set of points are jointly Gaussian. In particular, the training function values $\mathbf{f} = [f(\mathbf{x}_1), ..., f(\mathbf{x}_N)]^\top$ and inducing values $\mathbf{u}$ satisfy:
$$\begin{bmatrix} \mathbf{f} \\ \mathbf{u} \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} K_{ff} & K_{fu} \\ K_{uf} & K_{uu} \end{bmatrix} \right)$$
where $K_{ff}$ is the $N \times N$ kernel matrix over the training inputs, $K_{uu}$ is the $M \times M$ kernel matrix over the inducing locations, and $K_{fu} = K_{uf}^\top$ is the $N \times M$ cross-covariance between the two sets.
The conditional distribution of $\mathbf{f}$ given $\mathbf{u}$ is also Gaussian:
$$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, \, K_{ff} - K_{fu}K_{uu}^{-1}K_{uf})$$
This conditioning relationship is exact under the GP prior—no approximation has been made yet. The magic of sparse GPs happens when we make strategic approximations to this conditional structure.
Notice that K_{uu}^{-1} is only M×M. If we can reformulate inference to depend primarily on K_{uu} rather than K_{ff}, we've shifted the bottleneck from N to M. The art of sparse GPs lies in clever approximations that make this reformulation possible while minimizing information loss.
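This exactness is easy to verify numerically. The sketch below (the helper `rbf` is illustrative, assuming a squared-exponential kernel) builds the joint prior over a handful of training and inducing points and checks that marginalizing $\mathbf{u}$ back out of $p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$ recovers the exact prior covariance $K_{ff}$:

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel between row-stacked point sets."""
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 1))   # training inputs
Z = rng.normal(size=(3, 1))   # inducing inputs

Kff = rbf(X, X)
Kuu = rbf(Z, Z) + 1e-10 * np.eye(3)
Kfu = rbf(X, Z)

# Conditional p(f|u): mean map A = Kfu Kuu^{-1}, covariance Kff - Kfu Kuu^{-1} Kuf
A = Kfu @ np.linalg.solve(Kuu, np.eye(3))
cond_cov = Kff - A @ Kfu.T

# Marginalizing u ~ N(0, Kuu) back out must recover the exact prior:
# Cov[f] = A Kuu A^T + cond_cov = Q_ff + (K_ff - Q_ff) = K_ff
recovered = A @ Kuu @ A.T + cond_cov
err = np.max(np.abs(recovered - Kff))
print(err)
```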
The graphical model perspective:
Sparse GPs can be understood through the lens of probabilistic graphical models. In the exact GP, all training points $\mathbf{f}$ are correlated through the full covariance matrix $K_{ff}$. In sparse GPs, we introduce inducing variables $\mathbf{u}$ as intermediate latent variables that mediate the dependence:
$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$$
Different sparse GP approximations correspond to different assumptions about the conditional $p(\mathbf{f}|\mathbf{u})$—specifically, which dependencies are retained and which are discarded. This graphical model viewpoint clarifies that we're not merely subsampling data, but intelligently restructuring the probabilistic model.
The simplest approach to scaling GPs is to ignore most of the data entirely. Subset of Data (SoD) selects M representative training points from the full dataset of N points and performs exact GP inference using only this subset.
Let $(X_M, \mathbf{y}_M)$ denote the selected M points. Inference proceeds exactly as for standard GPs:
$$\mu_* = \mathbf{k}_{*M}^\top (K_{MM} + \sigma_n^2 I)^{-1} \mathbf{y}_M$$

$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_{*M}^\top (K_{MM} + \sigma_n^2 I)^{-1} \mathbf{k}_{*M}$$
where $K_{MM}$ is the M×M kernel matrix over the selected subset.
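A minimal SoD baseline can be sketched in a few lines (the helper `sod_predict` and its random-subset policy are illustrative, not a prescribed API): the exact GP equations applied to M randomly chosen points.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
    return np.exp(-0.5 * d2 / ell**2)

def sod_predict(X, y, X_star, M, noise_var=0.01, seed=0):
    """Subset of Data: run the exact GP equations on M randomly chosen points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)
    XM, yM = X[idx], y[idx]
    KMM = rbf(XM, XM) + noise_var * np.eye(M)       # only an M x M system
    ksM = rbf(X_star, XM)                           # (N*, M) cross-covariance
    mean = ksM @ np.linalg.solve(KMM, yM)
    kss = rbf(X_star, X_star)[np.diag_indices(len(X_star))]
    var = kss - np.sum(ksM * np.linalg.solve(KMM, ksM.T).T, axis=1) + noise_var
    return mean, var

X = np.linspace(-3, 3, 200)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(1).normal(size=200)
mu, var = sod_predict(X, y, X[:5], M=30)
print(mu.round(2), var.round(3))
```

Note that the predictive variance only reflects the M retained points, which is exactly the overconfidence problem discussed below.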
Subset selection strategies:
The performance of SoD depends critically on how the M points are selected. Common strategies include uniform random subsampling (cheap, but blind to geometry), k-means clustering with cluster centers as representatives (covers the input distribution), and greedy information-based selection that repeatedly adds the point with the largest residual variance (better coverage at higher selection cost).
Despite its limitations, SoD serves as an important baseline. If a sophisticated sparse GP method cannot outperform well-selected SoD, its additional complexity is not justified. Moreover, SoD reveals the information-theoretic limits of what M points can capture about N points—setting expectations for all sparse methods.
SoD's most serious flaw is uncertainty estimation. By ignoring N-M points, the model has no memory that data exists in those regions. Predictions may be confident when they should be uncertain, leading to overconfident decisions. This is particularly dangerous in safety-critical applications like Bayesian optimization with expensive experiments.
Subset of Regressors (SoR), also known as the Nyström approximation in the kernel methods literature, takes a fundamentally different approach. Rather than ignoring data, SoR uses all N observations but approximates the kernel matrix $K_{ff}$ with a low-rank structure.
The key approximation is:
$$K_{ff} \approx Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$$
Here, $Q_{ff}$ is the Nyström approximation of the full kernel matrix—a rank-M approximation that can be computed and stored efficiently. The intuition is that the kernel function can be approximated by its projection onto the span of kernel functions centered at the M inducing points.
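As a quick sanity check of the low-rank idea, the sketch below (RBF kernel and grid inducing points, both assumptions for illustration) forms $Q_{ff}$ and measures its relative Frobenius error against the exact $K_{ff}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
Z = np.linspace(-3, 3, 20)[:, None]          # M = 20 inducing points on a grid

k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
Kff = k(X, X)
Kuu = k(Z, Z) + 1e-8 * np.eye(len(Z))
Kfu = k(X, Z)

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)      # rank-M Nystrom approximation

rel_err = np.linalg.norm(Kff - Qff, 'fro') / np.linalg.norm(Kff, 'fro')
print(f"relative Frobenius error with M = 20: {rel_err:.2e}")
```

For a smooth kernel on a bounded domain, a modest M already gives a tiny relative error; the residual $K_{ff} - Q_{ff}$ is also positive semi-definite, so its diagonal is non-negative.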
Mathematical derivation:
Consider the kernel as defining a feature map $\phi(\mathbf{x})$ in a (possibly infinite-dimensional) Hilbert space:
$$k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$$
The Nyström approximation projects this feature map onto the subspace spanned by $\{\phi(\mathbf{z}_1), ..., \phi(\mathbf{z}_M)\}$. The projected feature map is:
$$\tilde{\phi}(\mathbf{x}) = K_{uu}^{-1/2} \mathbf{k}_u(\mathbf{x})$$
where $\mathbf{k}_u(\mathbf{x}) = [k(\mathbf{z}_1, \mathbf{x}), ..., k(\mathbf{z}_M, \mathbf{x})]^\top$. The approximate kernel becomes:
$$\tilde{k}(\mathbf{x}, \mathbf{x}') = \langle \tilde{\phi}(\mathbf{x}), \tilde{\phi}(\mathbf{x}') \rangle = \mathbf{k}_u(\mathbf{x})^\top K_{uu}^{-1} \mathbf{k}_u(\mathbf{x}')$$
Predictive equations under SoR:
Substituting the Nyström approximation into the GP predictive equations:
$$\mu_* = \mathbf{k}_{*u}^\top (K_{uu} + K_{uf}\Sigma^{-1}K_{fu})^{-1} K_{uf}\Sigma^{-1}\mathbf{y}$$

$$\sigma_*^2 = \mathbf{k}_{*u}^\top (K_{uu} + K_{uf}\Sigma^{-1}K_{fu})^{-1} \mathbf{k}_{u*}$$
where $\Sigma = \sigma_n^2 I$ is the noise covariance.
Using the Woodbury identity, these can be rewritten as:
$$\mu_* = \mathbf{k}_{*u}^\top K_{uu}^{-1} K_{uf} (Q_{ff} + \Sigma)^{-1} \mathbf{y}$$
The term $(Q_{ff} + \Sigma)^{-1}$ still involves an N×N matrix, but the low-rank structure of $Q_{ff}$ enables efficient computation via the matrix inversion lemma:
$$(Q_{ff} + \Sigma)^{-1} = \Sigma^{-1} - \Sigma^{-1}K_{fu}(K_{uu} + K_{uf}\Sigma^{-1}K_{fu})^{-1}K_{uf}\Sigma^{-1}$$
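The identity is easy to verify on random matrices; a small check, assuming a synthetic SPD $K_{uu}$ and Gaussian $K_{fu}$ for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 5
Kfu = rng.normal(size=(N, M))
W = rng.normal(size=(M, M))
Kuu = W @ W.T + np.eye(M)             # synthetic SPD M x M matrix
Sigma = 0.1 * np.eye(N)               # noise covariance
Sigma_inv = np.linalg.inv(Sigma)

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)

# Left: direct N x N inversion (what we want to avoid)
direct = np.linalg.inv(Qff + Sigma)

# Right: matrix inversion lemma, requiring only an M x M solve
inner = Kuu + Kfu.T @ Sigma_inv @ Kfu
woodbury = Sigma_inv - Sigma_inv @ Kfu @ np.linalg.solve(inner, Kfu.T @ Sigma_inv)

diff = np.max(np.abs(direct - woodbury))
print(diff)
```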
| Operation | Complexity | Notes |
|---|---|---|
| Compute K_{uu} | O(M²d) | M×M kernel matrix over inducing points |
| Compute K_{fu} | O(NMd) | N×M cross-covariance |
| Cholesky of K_{uu} | O(M³) | Small matrix factorization |
| Compute Λ = K_{uu} + K_{uf}Σ⁻¹K_{fu} | O(NM² + M³) | Key matrix for prediction |
| Solve triangular systems | O(M²) | Per prediction |
| Total training | O(NM² + M³) | Dominated by cross-covariance products |
SoR reduces complexity from O(N³) to O(NM²). For typical settings where M ∝ √N or M = O(log N), this represents a dramatic improvement. With N = 100,000 and M = 1,000, the speedup factor is approximately 10,000×.
The Deterministic Training Conditional (DTC) approximation, also known as the Projected Latent Variables approximation, provides a probabilistically coherent interpretation of the Nyström/SoR method. Rather than viewing the approximation as a matrix factorization trick, DTC interprets it through the lens of conditional Gaussian distributions.
Recall the exact joint distribution:
$$p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})p(\mathbf{u})$$
The exact conditional $p(\mathbf{f}|\mathbf{u})$ has covariance $K_{ff} - K_{fu}K_{uu}^{-1}K_{uf}$, which is non-diagonal and captures residual correlations between training points not explained by the inducing points.
DTC's key approximation is to replace this conditional with a deterministic mapping:
$$p_{\text{DTC}}(\mathbf{f}|\mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, \, \mathbf{0})$$
That is, given the inducing values $\mathbf{u}$, the training function values are exactly determined with no residual uncertainty. This is a dramatic simplification—we assume the inducing points capture all variability in the function.
Consequences of the DTC assumption:
With observation noise $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the likelihood becomes:
$$p(\mathbf{y}|\mathbf{u}) = \mathcal{N}(K_{fu}K_{uu}^{-1}\mathbf{u}, \, \sigma_n^2 I)$$
The marginal likelihood, integrating out $\mathbf{u}$:
$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{u})p(\mathbf{u})d\mathbf{u} = \mathcal{N}(\mathbf{0}, \, Q_{ff} + \sigma_n^2 I)$$
where $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$ is exactly the Nyström approximation.
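This marginal follows from standard linear-Gaussian algebra: if $\mathbf{y} = K_{fu}K_{uu}^{-1}\mathbf{u} + \boldsymbol{\epsilon}$ with $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, K_{uu})$, its covariance is exactly $Q_{ff} + \sigma_n^2 I$. A quick numeric confirmation (RBF kernel assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, noise = 40, 6, 0.05
X, Z = rng.normal(size=(N, 1)), rng.normal(size=(M, 1))
k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
Kuu = k(Z, Z) + 1e-10 * np.eye(M)
Kfu = k(X, Z)

# y = A u + eps with A = Kfu Kuu^{-1}, u ~ N(0, Kuu), eps ~ N(0, noise * I)
A = Kfu @ np.linalg.solve(Kuu, np.eye(M))
marginal_cov = A @ Kuu @ A.T + noise * np.eye(N)  # covariance of the DTC marginal

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
err = np.max(np.abs(marginal_cov - (Qff + noise * np.eye(N))))
print(err)  # ~0
```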
The posterior over inducing points:
$$p(\mathbf{u}|\mathbf{y}) = \mathcal{N}(\mu_u, \Sigma_u)$$
where:

$$\Sigma_u = K_{uu}\left(K_{uu} + \sigma_n^{-2}K_{uf}K_{fu}\right)^{-1}K_{uu}$$

$$\mu_u = \sigma_n^{-2}\,\Sigma_u K_{uu}^{-1}K_{uf}\mathbf{y}$$
Because DTC sets the conditional variance to zero, it systematically underestimates predictive uncertainty. The model cannot represent uncertainty arising from limited inducing point coverage—it assumes the inducing points explain everything. This is particularly problematic in regions far from inducing points or where the true function has fine-scale structure.
```python
import numpy as np
from numpy import linalg as la


class DTCSparseGP:
    """
    Deterministic Training Conditional (DTC) Sparse Gaussian Process.

    Implements the DTC approximation with O(NM²) complexity, compared to
    exact GP inference with O(N³) complexity. This is the "textbook"
    sparse GP, providing a foundation for understanding more sophisticated
    methods like FITC and VFE.
    """

    def __init__(self, kernel_fn, noise_variance: float = 1e-3):
        """
        Args:
            kernel_fn: Kernel function k(X1, X2) -> matrix
            noise_variance: Observation noise variance σ²
        """
        self.kernel_fn = kernel_fn
        self.noise_var = noise_variance
        self.fitted = False

    def fit(self, X: np.ndarray, y: np.ndarray, Z: np.ndarray) -> 'DTCSparseGP':
        """
        Fit the DTC sparse GP.

        Args:
            X: (N, d) training inputs
            y: (N,) training targets
            Z: (M, d) inducing point locations

        Returns:
            self for method chaining
        """
        N, M = len(X), len(Z)

        # Compute kernel matrices
        Kuu = self.kernel_fn(Z, Z)   # M x M
        Kfu = self.kernel_fn(X, Z)   # N x M
        self.L_uu = la.cholesky(Kuu + 1e-6 * np.eye(M))  # Regularized Cholesky

        # Compute Λ = Kuu + σ⁻² Kuf @ Kfu  (M x M)
        Lambda = Kuu + (1 / self.noise_var) * Kfu.T @ Kfu

        # Cholesky of Lambda for stable solves
        L_lambda = la.cholesky(Lambda)

        # Compute α = σ⁻² Λ⁻¹ Kuf y
        alpha = la.solve(L_lambda.T, la.solve(L_lambda, Kfu.T @ y)) / self.noise_var

        # Store for prediction
        self.X, self.y, self.Z = X, y, Z
        self.Kuu = Kuu
        self.L_lambda = L_lambda
        self.alpha = alpha
        self.fitted = True
        return self

    def predict(self, X_star: np.ndarray, return_var: bool = True):
        """
        Make predictions at new points.

        Args:
            X_star: (N*, d) test inputs
            return_var: Whether to return predictive variance

        Returns:
            mean: (N*,) predictive means
            var: (N*,) predictive variances (if return_var=True)
        """
        if not self.fitted:
            raise RuntimeError("Model must be fitted before prediction")

        # Cross-covariance with inducing points
        Ksu = self.kernel_fn(X_star, self.Z)  # N* x M

        # Predictive mean: k_{*u} @ α
        mu = Ksu @ self.alpha

        if not return_var:
            return mu

        # DTC predictive variance:
        # σ²_* = k_** - k_{*u} Kuu⁻¹ k_{u*} + k_{*u} Λ⁻¹ k_{u*}
        # Note: DTC still underestimates uncertainty relative to the exact GP
        kss = self.kernel_fn(X_star, X_star)[np.diag_indices(len(X_star))]
        v_uu = la.solve(self.L_uu, Ksu.T)        # L_uu⁻¹ k_{u*}
        v_lam = la.solve(self.L_lambda, Ksu.T)   # L_Λ⁻¹ k_{u*}
        var = kss - np.sum(v_uu**2, axis=0) + np.sum(v_lam**2, axis=0)
        return mu, np.maximum(var, 0.0)

    def log_marginal_likelihood(self) -> float:
        """
        Compute the DTC log marginal likelihood log p(y | X, Z, θ).
        """
        if not self.fitted:
            raise RuntimeError("Model must be fitted first")

        N = len(self.y)

        # log|Qff + σ²I| via the matrix determinant lemma
        log_det_Kuu = 2 * np.sum(np.log(np.diag(self.L_uu)))
        log_det_Lambda = 2 * np.sum(np.log(np.diag(self.L_lambda)))
        log_det = N * np.log(self.noise_var) + log_det_Lambda - log_det_Kuu

        # y^T (Qff + σ²I)⁻¹ y via Woodbury
        Kfu = self.kernel_fn(self.X, self.Z)
        v = Kfu.T @ self.y / self.noise_var
        w = la.solve(self.L_lambda.T, la.solve(self.L_lambda, v))
        data_fit = self.y @ self.y / self.noise_var - v @ w

        return -0.5 * data_fit - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)


# Example usage
def squared_exponential_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Standard squared exponential (RBF) kernel."""
    dists_sq = np.sum((X1[:, None] - X2[None, :])**2, axis=2)
    return variance * np.exp(-0.5 * dists_sq / lengthscale**2)


# Demonstration
np.random.seed(42)
N, M = 1000, 50  # 1000 training points, 50 inducing points

# Generate synthetic data
X = np.random.randn(N, 2)
y = np.sin(X[:, 0]) + 0.5 * np.cos(2 * X[:, 1]) + 0.1 * np.random.randn(N)

# Select inducing points via k-means
from scipy.cluster.vq import kmeans2
Z, _ = kmeans2(X, M, minit='++')

# Fit DTC sparse GP
kernel_fn = lambda X1, X2: squared_exponential_kernel(X1, X2, lengthscale=1.0, variance=1.0)
model = DTCSparseGP(kernel_fn, noise_variance=0.01)
model.fit(X, y, Z)

# Compute log marginal likelihood
lml = model.log_marginal_likelihood()
print(f"DTC Log Marginal Likelihood: {lml:.2f}")

# Make predictions
X_test = np.random.randn(100, 2)
mu, var = model.predict(X_test)
print(f"Predictions: mean in [{mu.min():.2f}, {mu.max():.2f}], "
      f"var in [{var.min():.4f}, {var.max():.4f}]")
```

The Fully Independent Training Conditional (FITC) approximation addresses DTC's systematic variance underestimation by retaining diagonal residual variance—the part of each training point's variance not explained by the inducing points.
Recall the exact conditional:
$$p(f_i | \mathbf{u}) = \mathcal{N}(\mathbf{k}_{iu}K_{uu}^{-1}\mathbf{u}, \, k_{ii} - \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{k}_{ui})$$
The conditional variance $k_{ii} - \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{k}_{ui}$ represents the function uncertainty at point $i$ that is not captured by the inducing points. This is non-negative and can be substantial in regions far from inducing points.
FITC makes the approximation that, conditioned on $\mathbf{u}$, the training function values are independent (but not deterministic):
$$p_{\text{FITC}}(\mathbf{f}|\mathbf{u}) = \prod_{i=1}^N \mathcal{N}(f_i; \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{u}, \, \lambda_i)$$
where $\lambda_i = k_{ii} - \mathbf{k}_{iu}K_{uu}^{-1}\mathbf{k}_{ui}$ is the residual variance at point $i$.
The FITC covariance structure:
Under FITC, the prior covariance over training points becomes:
$$K_{\text{FITC}} = Q_{ff} + \text{diag}(K_{ff} - Q_{ff}) = Q_{ff} + \Lambda$$
where $\Lambda = \text{diag}(\lambda_1, ..., \lambda_N)$ is a diagonal matrix of residual variances.
This structure is crucial: FITC says "training points are correlated through the inducing points (via $Q_{ff}$), but have independent residual noise beyond that (via $\Lambda$)." The diagonal structure of $\Lambda$ enables efficient computation.
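The sketch below constructs $K_{\text{FITC}}$ directly (RBF kernel assumed for illustration) and confirms that its diagonal matches the exact prior exactly, while off-diagonals are carried by $Q_{ff}$ alone:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(50, 1))
Z = np.linspace(-2, 2, 8)[:, None]
k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))

Kff = k(X, X)
Kuu = k(Z, Z) + 1e-10 * np.eye(len(Z))
Kfu = k(X, Z)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)

# FITC prior covariance: low-rank part plus diagonal residual correction
K_fitc = Qff + np.diag(np.diag(Kff - Qff))

off_diag = ~np.eye(len(X), dtype=bool)
print("max off-diagonal deviation from the exact prior:",
      np.max(np.abs((K_fitc - Kff)[off_diag])))
```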
The FITC marginal likelihood:
$$\log p(\mathbf{y}) = \log \mathcal{N}(\mathbf{0}, Q_{ff} + \Lambda + \sigma_n^2 I)$$
Defining $D = \Lambda + \sigma_n^2 I$ (a diagonal matrix), this becomes:
$$\log p(\mathbf{y}) = \log \mathcal{N}(\mathbf{0}, Q_{ff} + D)$$
The Woodbury formula gives:
$$(Q_{ff} + D)^{-1} = D^{-1} - D^{-1}K_{fu}(K_{uu} + K_{uf}D^{-1}K_{fu})^{-1}K_{uf}D^{-1}$$
Since $D$ is diagonal, $D^{-1}$ is trivially O(N). The computational bottleneck remains the M×M matrix $(K_{uu} + K_{uf}D^{-1}K_{fu})$, preserving O(NM²) complexity.
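The same structure gives the log-determinant cheaply via the matrix determinant lemma, $|Q_{ff} + D| = |D|\,|K_{uu} + K_{uf}D^{-1}K_{fu}| \,/\, |K_{uu}|$, which the sketch below verifies on synthetic matrices (all sizes and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 60, 7
Kfu = rng.normal(size=(N, M))
W = rng.normal(size=(M, M))
Kuu = W @ W.T + np.eye(M)                       # synthetic SPD M x M matrix
D = np.diag(rng.uniform(0.1, 1.0, size=N))      # diagonal D = Lambda + sigma_n^2 I

Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)
D_inv = np.diag(1.0 / np.diag(D))               # diagonal inverse is O(N)

# Direct N x N log-determinant vs the matrix determinant lemma
logdet_direct = np.linalg.slogdet(Qff + D)[1]
inner = Kuu + Kfu.T @ D_inv @ Kfu               # the M x M bottleneck matrix
logdet_lemma = (np.linalg.slogdet(D)[1]
                + np.linalg.slogdet(inner)[1]
                - np.linalg.slogdet(Kuu)[1])
diff = abs(logdet_direct - logdet_lemma)
print(diff)
```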
FITC provides better variance estimation than DTC at the cost of additional computation (computing and storing the diagonal Λ). For well-placed inducing points, the improvement can be marginal. For poorly-placed inducing points, FITC provides crucial protection against overconfident predictions.
| Method | p(f\|u) Approximation | Prior Covariance | Variance Estimate | Complexity |
|---|---|---|---|---|
| DTC | Deterministic | Q_{ff} | Underestimates | O(NM²) |
| FITC | Diagonal conditional cov | Q_{ff} + diag(K_{ff} - Q_{ff}) | Better near data | O(NM²) |
| PITC | Block-diagonal conditional | Q_{ff} + blockdiag(...) | Best (among these) | O(NM² + NB²) |
| Full GP | Full conditional cov | K_{ff} | Exact | O(N³) |
PITC: Partially Independent Training Conditional:
A natural extension is PITC, which partitions training points into blocks and models block-wise dependencies while ignoring cross-block dependencies. PITC interpolates between FITC (every point is its own block) and the exact GP (all points in one block). It offers improved approximation at the cost of O(NB²) additional computation where B is the block size. In practice, FITC often provides sufficient improvement over DTC, and modern variational methods (covered in subsequent pages) supersede PITC.
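To make the interpolation concrete, here is a sketch (the helper `pitc_prior_cov` is illustrative) that assembles the PITC training covariance $Q_{ff} + \text{blockdiag}(K_{ff} - Q_{ff})$ for an arbitrary partition:

```python
import numpy as np

def pitc_prior_cov(Kff, Qff, blocks):
    """Assemble the PITC training covariance Q_ff + blockdiag(K_ff - Q_ff).

    `blocks` is a list of index lists partitioning the training points.
    Singleton blocks recover FITC; one block of all points recovers K_ff.
    """
    K = Qff.copy()
    for b in blocks:
        K[np.ix_(b, b)] = Kff[np.ix_(b, b)]  # exact residual kept within each block
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 1))
Z = rng.normal(size=(3, 1))
k = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
Kff = k(X, X)
Kuu = k(Z, Z) + 1e-10 * np.eye(3)
Qff = k(X, Z) @ np.linalg.solve(Kuu, k(Z, X))

fitc_cov = pitc_prior_cov(Kff, Qff, [[i] for i in range(12)])
exact_cov = pitc_prior_cov(Kff, Qff, [list(range(12))])
```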
Sparse GP methods introduce approximation error. Understanding when this error is acceptable—and when it corrupts results—is essential for practitioners. Several perspectives illuminate approximation quality:
The projection interpretation:
The Nyström approximation $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$ is a rank-M approximation to the full kernel matrix: the best one (in Frobenius norm) whose column space is spanned by the inducing columns of $K$. By the Eckart–Young theorem, its error is bounded below by that of the optimal rank-M eigendecomposition truncation:

$$\|K_{ff} - Q_{ff}\|_F^2 \;\geq\; \sum_{i=M+1}^{N} \lambda_i^2$$

where $\lambda_i$ are the eigenvalues of $K_{ff}$ beyond the top M. If the kernel matrix has a rapidly decaying spectrum (common for smooth kernels), a small M captures most of the variance.
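This spectral picture can be checked directly: the sketch below (RBF kernel assumed for illustration) compares how much variance the top 20 eigendirections capture for a smooth versus a rapidly varying kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
k = lambda A, B, ell: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2) / ell**2)

fracs = []
for ell in [2.0, 0.2]:  # smooth kernel vs rapidly varying kernel
    eigvals = np.sort(np.linalg.eigvalsh(k(X, X, ell)))[::-1]
    frac = eigvals[:20].sum() / eigvals.sum()  # variance in top 20 eigendirections
    fracs.append(frac)
    print(f"lengthscale {ell}: top-20 eigendirections capture {frac:.1%}")
```

The smooth kernel concentrates almost all variance in a few eigendirections, so a small M suffices; the short-lengthscale kernel spreads variance over many more, demanding a larger M.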
The information-theoretic view:
From an information theory perspective, the inducing points $\mathbf{u}$ should maximize the mutual information $I(\mathbf{f}; \mathbf{u})$—the information they provide about the full function values. This leads to the objective:
$$\max_Z I(\mathbf{f}; \mathbf{u}) = \frac{1}{2}\log|K_{ff}| - \frac{1}{2}\log|K_{ff} - Q_{ff}|$$
Equivalently, we minimize the residual entropy $H(\mathbf{f}|\mathbf{u})$. Greedy algorithms for inducing point selection often optimize this criterion.
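One common greedy scheme is equivalent to pivoted Cholesky on $K_{ff}$: at each step, add the candidate with the largest residual conditional variance given the points chosen so far. A sketch (the helper name `greedy_inducing` is illustrative):

```python
import numpy as np

def greedy_inducing(X, kernel, M, jitter=1e-10):
    """Greedy inducing-point selection, equivalent to pivoted Cholesky on K_ff.

    Each step picks the candidate whose residual conditional variance, given
    the points chosen so far, is largest, greedily shrinking H(f | u).
    """
    N = len(X)
    diag = kernel(X, X)[np.diag_indices(N)].copy()  # residual variances
    L = np.zeros((N, 0))                            # partial Cholesky columns
    chosen = []
    for _ in range(M):
        j = int(np.argmax(diag))                    # largest residual variance
        chosen.append(j)
        col = kernel(X, X[j:j + 1])[:, 0] - L @ L[j]
        col = col / np.sqrt(diag[j] + jitter)
        diag = diag - col**2                        # condition on the new point
        L = np.hstack([L, col[:, None]])
    return np.array(chosen)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
rbf = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
idx = greedy_inducing(X, rbf, M=10)
print("selected indices:", idx)
```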
When sparse approximations fail:
Sparse GPs struggle in specific scenarios:
- **Rapidly varying functions:** if the true function has a length-scale much shorter than the inducing point spacing, the approximation misses important structure
- **Non-stationary functions:** standard kernels assume a uniform length-scale; local structure may require many inducing points
- **High dimensions:** covering the input space requires exponentially many inducing points (curse of dimensionality)
- **Outlier regions:** data points far from inducing points have less predictive influence than they should
Choosing M is fundamentally a bias-variance trade-off. Small M gives fast computation but high approximation bias. Large M gives better approximation but approaches exact GP cost. Rules of thumb suggest M = O(√N) or M = O(log N) depending on smoothness, but the optimal choice is problem-specific and often determined empirically.
Efficient implementation of sparse GPs requires careful attention to numerical linear algebra. Several techniques are essential for production-quality code:
Cholesky-based computation:
Never explicitly compute matrix inverses. Instead, use Cholesky factorization:
$$K_{uu} = L_{uu}L_{uu}^\top$$
where $L_{uu}$ is lower-triangular. Then $K_{uu}^{-1}\mathbf{v}$ is computed by solving $L_{uu}L_{uu}^\top\mathbf{x} = \mathbf{v}$ via forward and back substitution. This is more numerically stable and equally fast.
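A small comparison, using SciPy's `cho_factor`/`cho_solve` (the synthetic matrix here is illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
M = 30
W = rng.normal(size=(M, M))
Kuu = W @ W.T + M * np.eye(M)     # well-conditioned SPD matrix
v = rng.normal(size=M)

# Preferred: factor once, then triangular solves for each right-hand side
c, low = cho_factor(Kuu, lower=True)
x_chol = cho_solve((c, low), v)

# Discouraged: explicit inverse (slower and less stable when Kuu is ill-conditioned)
x_inv = np.linalg.inv(Kuu) @ v

print(np.max(np.abs(x_chol - x_inv)))  # both solve the same system
```

The factorization can be reused across many right-hand sides, which is exactly the access pattern of sparse GP prediction.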
Memory-efficient cross-covariance:
The N×M cross-covariance matrix $K_{fu}$ dominates memory for large N. For prediction, we often don't need the full matrix—only its product with certain vectors. When possible, compute $K_{fu}\mathbf{v}$ incrementally without materializing the full matrix.
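A sketch of the chunked product (the helper name `kfu_matvec` and the chunk size are illustrative):

```python
import numpy as np

def kfu_matvec(X, Z, v, kernel, chunk=1000):
    """Compute K_fu @ v without materializing the full N x M matrix.

    Training inputs are processed in chunks, so peak memory is
    O(chunk * M) instead of O(N * M).
    """
    out = np.empty(len(X))
    for start in range(0, len(X), chunk):
        block = kernel(X[start:start + chunk], Z)  # (chunk, M) slab, then discarded
        out[start:start + chunk] = block @ v
    return out

rbf = lambda A, B: np.exp(-0.5 * np.sum((A[:, None] - B[None, :])**2, axis=2))
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(5000, 2)), rng.normal(size=(50, 2))
v = rng.normal(size=50)
result = kfu_matvec(X, Z, v, rbf)
```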
Incremental updates:
When adding new inducing points (e.g., during greedy selection), use rank-1 updates to existing Cholesky factors rather than recomputing from scratch.
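When a point is appended, the factor grows by one row, which costs O(M²) rather than the O(M³) of refactorizing; a sketch (the helper name `cholesky_append` is illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

def cholesky_append(L, k_new, k_nn):
    """Grow the Cholesky factor of K_uu by one inducing point in O(M^2).

    Given L with L @ L.T = K_uu, the cross-covariances k_new = k(Z, z_new),
    and the scalar k_nn = k(z_new, z_new), returns the (M+1) x (M+1) factor
    without refactorizing from scratch.
    """
    M = L.shape[0]
    c = solve_triangular(L, k_new, lower=True)  # forward substitution, O(M^2)
    d = np.sqrt(k_nn - c @ c)                   # new diagonal entry
    L_new = np.zeros((M + 1, M + 1))
    L_new[:M, :M] = L
    L_new[M, :M] = c
    L_new[M, M] = d
    return L_new

# Check against a full refactorization
Z = np.linspace(0.0, 5.0, 6)[:, None]
k = lambda A, B: np.exp(-0.5 * (A - B.T)**2)
K = k(Z, Z) + 1e-8 * np.eye(6)
L5 = np.linalg.cholesky(K[:5, :5])
L6 = cholesky_append(L5, K[:5, 5], K[5, 5])
```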
GPU acceleration:
Matrix-matrix products dominate sparse GP computation. These are highly parallelizable, making GPUs effective even at moderate N. The key is to batch operations and minimize CPU-GPU transfers.
```python
import numpy as np
from scipy import linalg as sla


class EfficientFITC:
    """
    Numerically stable implementation of the FITC sparse GP.

    Key optimizations:
      1. Cholesky-based solves (never explicit inversion)
      2. Precomputed quantities for repeated predictions
      3. Memory-efficient intermediate computations
      4. Numerically stable log-determinant computation
    """

    def __init__(self, kernel_fn, noise_variance=0.01, jitter=1e-6):
        self.kernel_fn = kernel_fn
        self.noise_var = noise_variance
        self.jitter = jitter

    def fit(self, X, y, Z):
        """Fit FITC with numerically stable Cholesky-based computation."""
        N, M = len(X), len(Z)

        # Kernel matrices
        Kuu = self.kernel_fn(Z, Z) + self.jitter * np.eye(M)
        Kfu = self.kernel_fn(X, Z)
        Kff_diag = self.kernel_fn(X, X)[np.diag_indices(N)]  # Only the diagonal is needed

        # Cholesky of K_uu
        L_uu = sla.cholesky(Kuu, lower=True)

        # V = L_uu^{-1} @ K_uf, so that V.T @ V = Q_ff
        V = sla.solve_triangular(L_uu, Kfu.T, lower=True)  # M x N

        # Q_ff diagonal: diag(K_fu @ K_uu^{-1} @ K_uf) = column sums of V²
        Qff_diag = np.sum(V**2, axis=0)

        # Residual variance (FITC diagonal) plus observation noise: D = Λ + σ²I
        Lambda = np.maximum(Kff_diag - Qff_diag, 0) + self.noise_var
        Lambda_inv = 1.0 / Lambda

        # Scaled V: V @ diag(Λ^{-1/2})
        V_scaled = V * np.sqrt(Lambda_inv)

        # B = I + V @ Λ^{-1} @ V.T = I + V_scaled @ V_scaled.T
        B = np.eye(M) + V_scaled @ V_scaled.T
        L_B = sla.cholesky(B, lower=True)

        # α = (Q_ff + D)^{-1} y via Woodbury:
        # (Q + D)^{-1} y = D^{-1} y - D^{-1} K_fu (K_uu + K_uf D^{-1} K_fu)^{-1} K_uf D^{-1} y
        y_scaled = y * Lambda_inv
        c = V @ y_scaled  # represents L_uu^{-1} K_uf D^{-1} y
        c_solve = sla.solve_triangular(
            L_B.T, sla.solve_triangular(L_B, c, lower=True), lower=False
        )
        alpha = y_scaled - Lambda_inv * (V.T @ c_solve)

        # Store for prediction
        self.Z, self.L_uu, self.L_B = Z, L_uu, L_B
        self.V = V
        self.alpha = alpha
        self.Lambda, self.Lambda_inv = Lambda, Lambda_inv

        # For the log marginal likelihood
        self._N, self._y, self._y_scaled = N, y, y_scaled
        return self

    def predict(self, X_star, return_var=False):
        """Predictions with the FITC variance correction."""
        Ksu = self.kernel_fn(X_star, self.Z)
        Kss_diag = self.kernel_fn(X_star, X_star)[np.diag_indices(len(X_star))]

        # V_star = L_uu^{-1} @ K_us
        V_star = sla.solve_triangular(self.L_uu, Ksu.T, lower=True)

        # Mean: μ_* = k_{*u} K_uu^{-1} K_uf α = (L_uu^{-1} K_us).T @ (L_uu^{-1} K_uf α)
        mean = V_star.T @ (self.V @ self.alpha)

        if not return_var:
            return mean

        # Variance: σ²_* = k_** - k_{*u} K_uu^{-1} k_{u*} + k_{*u} Σ k_{u*}
        # with Σ = (K_uu + K_uf D^{-1} K_fu)^{-1} = L_uu^{-T} B^{-1} L_uu^{-1}
        Qss_diag = np.sum(V_star**2, axis=0)
        B_solve = sla.solve_triangular(self.L_B, V_star, lower=True)
        correction = np.sum(B_solve**2, axis=0)
        var = Kss_diag - Qss_diag + correction
        return mean, np.maximum(var, 0)  # Clamp tiny negative values

    def log_marginal_likelihood(self):
        """Compute log p(y | X, Z, θ) with a stable log-determinant."""
        N = self._N

        # log|Q + D| = log|D| + log|B|
        log_det_D = np.sum(np.log(self.Lambda))
        log_det_B = 2 * np.sum(np.log(np.diag(self.L_B)))
        log_det = log_det_D + log_det_B

        # y^T (Q + D)^{-1} y via the same Woodbury decomposition as in fit()
        c = self.V @ self._y_scaled
        c_solve = sla.solve_triangular(
            self.L_B.T, sla.solve_triangular(self.L_B, c, lower=True), lower=False
        )
        quad_form = np.dot(self._y_scaled, self._y) - np.dot(c, c_solve)

        return -0.5 * quad_form - 0.5 * log_det - 0.5 * N * np.log(2 * np.pi)
```

We have developed the fundamental framework for sparse Gaussian Processes—methods that reduce complexity from O(N³) to O(NM²) by compressing the GP through M inducing points. The key insights:

- Inducing points summarize N training points through M latent function values, with exact Gaussian conditioning as the starting point
- The Nyström approximation $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$ replaces the full kernel matrix with a rank-M surrogate
- DTC treats the conditional $p(\mathbf{f}|\mathbf{u})$ as deterministic, which is efficient but underestimates uncertainty
- FITC restores the diagonal residual variance, improving uncertainty estimates at the same O(NM²) cost
What's next:
The methods presented here—DTC, FITC, and their variants—are foundational but somewhat ad-hoc. In the next page, we examine inducing point methods more systematically, exploring strategies for selecting and even learning the optimal inducing point locations. Following that, we develop the elegant variational sparse GP framework that unifies and extends these classical approximations with principled Bayesian inference.
You now understand the core mechanics of sparse GP approximations. The inducing point concept, the Nyström approximation, and the DTC/FITC frameworks form the conceptual substrate upon which all modern scalable GP methods build. These ideas will recur throughout the remaining pages as we explore more sophisticated approaches to GP scalability.