The stationary kernels we explored on the previous page share a fundamental assumption: the statistical relationship between function values depends only on the separation between input points, $\mathbf{x} - \mathbf{x}'$, not on their absolute location. While this assumption is elegant and often appropriate, it fails dramatically for many real-world phenomena.
Stationarity breaks down in many common scenarios: time series with trends, functions whose smoothness varies across the domain, boundary effects, heteroscedastic noise, and more (the table below collects these cases).
Non-stationary kernels address these limitations by allowing the covariance function to depend on the absolute positions of input points, not just their difference.
By the end of this page, you will understand: (1) why and when stationarity assumptions fail, (2) the mathematical structure of key non-stationary kernels including the Neural Network, Polynomial, and Linear kernels, (3) how to construct input-dependent non-stationary kernels, (4) the Gibbs kernel and warping-based approaches, and (5) practical strategies for deploying non-stationary GPs.
Before diving into non-stationary kernels, let's build intuition for why we need them. Understanding the limitations of stationarity will help you recognize when to reach for more flexible alternatives.
The Stationarity Assumption Revisited
A stationary kernel satisfies: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$
This implies that if we slide the entire function (and all observations) by any offset $\mathbf{h}$, the induced probability distribution over functions remains unchanged. Mathematically:
$$p(f(\mathbf{x}_1), ..., f(\mathbf{x}_n)) = p(f(\mathbf{x}_1 + \mathbf{h}), ..., f(\mathbf{x}_n + \mathbf{h})) \quad \forall \mathbf{h}$$
Cases Where This Assumption Breaks
| Phenomenon | Why Stationarity Fails | Required Property | Solution Approach |
|---|---|---|---|
| Time series with trends | Mean changes with time (growth, decay) | Position-dependent mean | Linear/Polynomial kernel or mean function |
| Varying smoothness | Different regions have different 'wiggliness' | Position-dependent length-scale | Gibbs kernel, deep kernels |
| Boundary effects | Behavior changes near edges | Location-aware covariance | Warping, neural network kernels |
| Multi-fidelity data | Different sources have different noise/correlation | Input-dependent properties | Deep GPs, hierarchical models |
| Heteroscedastic noise | Observation noise varies across domain | Position-dependent noise variance | Heteroscedastic GP models |
| Scale changes | Amplitude of variations changes | Position-dependent output scale | Gibbs kernel, warping |
To check if stationarity is appropriate: (1) Compute residuals from a fitted stationary GP and plot them against input location—patterns indicate non-stationarity. (2) Divide your domain into regions and fit separate GPs—if optimal hyperparameters differ substantially, consider non-stationary alternatives. (3) Examine prior samples—do they look qualitatively similar across the domain?
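Here is a minimal sketch of the second check: fit a stationary RBF kernel separately to two halves of the domain and compare the length-scales that maximize the log marginal likelihood. The data-generating function, helper names, and grid of candidate length-scales are illustrative assumptions, not part of the original material.

```python
import numpy as np

def rbf(X1, X2, ell, sigma_sq=1.0):
    """Stationary RBF kernel on 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return sigma_sq * np.exp(-0.5 * d**2 / ell**2)

def log_marginal_likelihood(X, y, ell, noise=1e-2):
    """Exact log marginal likelihood of a zero-mean GP with an RBF kernel."""
    K = rbf(X, X, ell) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

# Synthetic data: wiggly on the left, smooth on the right (illustrative)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 120)
y = np.where(X < 0, np.sin(6 * X), 0.3 * np.sin(1.5 * X)) + 0.05 * rng.standard_normal(len(X))

ells = np.linspace(0.05, 2.0, 60)
for name, mask in [("left half", X < 0), ("right half", X >= 0)]:
    best = max(ells, key=lambda e: log_marginal_likelihood(X[mask], y[mask], e))
    print(f"{name}: best length-scale ≈ {best:.2f}")
# Substantially different optimal length-scales per region hint at non-stationarity.
```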
The Modeling Trade-Off
Non-stationary kernels offer greater flexibility but come with costs:
| Aspect | Stationary Kernels | Non-Stationary Kernels |
|---|---|---|
| Parameters | Few (σ², ℓ, maybe ν) | Often many more |
| Interpretability | High (length-scale = correlation range) | Can be harder to interpret |
| Overfitting risk | Lower | Higher with many parameters |
| Extrapolation | Consistent behavior | Can behave unpredictably |
| Computational cost | Standard GP cost | Often same, sometimes higher |
The principle of parsimony suggests: use stationary kernels unless you have clear evidence of non-stationarity. Non-stationary models are more powerful but require more data and care to avoid overfitting.
The simplest non-stationary kernels arise from the classical linear and polynomial models. These kernels are computationally attractive and serve as building blocks for more complex structures.
The Linear Kernel
$$k_{\text{Linear}}(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 (\mathbf{x} - \mathbf{c})^\top (\mathbf{x}' - \mathbf{c})$$
Where $\sigma_b^2$ is the bias variance (the prior variance at the offset point), $\sigma_v^2$ is the slope variance (controlling the range of plausible slopes), and $\mathbf{c}$ is the offset point about which the kernel is centered (often the origin).
Key Property: The linear kernel is equivalent to Bayesian linear regression with a Gaussian prior on the weights. GP samples are linear functions.
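To make the equivalence concrete, the following sketch checks numerically that the GP posterior mean under the linear kernel matches the Bayesian linear regression posterior mean with priors $b \sim \mathcal{N}(0, \sigma_b^2)$ and $w \sim \mathcal{N}(0, \sigma_v^2)$. The toy data and the small inline `k_lin` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_b_sq, sigma_v_sq, noise = 1.0, 2.0, 0.1

# Toy 1-D data (illustrative)
X = rng.uniform(-2, 2, size=8)
y = 0.7 * X - 0.3 + np.sqrt(noise) * rng.standard_normal(8)
X_test = np.linspace(-3, 3, 5)

# --- GP regression with the linear kernel k(x, x') = σ_b² + σ_v² x x' ---
def k_lin(a, b):
    return sigma_b_sq + sigma_v_sq * np.outer(a, b)

K = k_lin(X, X) + noise * np.eye(len(X))
gp_mean = k_lin(X_test, X) @ np.linalg.solve(K, y)

# --- Bayesian linear regression with features (1, x) ---
Phi = np.column_stack([np.ones_like(X), X])
Phi_test = np.column_stack([np.ones_like(X_test), X_test])
Sigma_p = np.diag([sigma_b_sq, sigma_v_sq])          # prior covariance of (b, w)
A = Phi @ Sigma_p @ Phi.T + noise * np.eye(len(X))
blr_mean = Phi_test @ Sigma_p @ Phi.T @ np.linalg.solve(A, y)

print(np.allclose(gp_mean, blr_mean))   # True: the two models coincide
```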
The Polynomial Kernel
$$k_{\text{Poly}}(\mathbf{x}, \mathbf{x}') = (\sigma_b^2 + \sigma_v^2 \mathbf{x}^\top \mathbf{x}')^p$$
Where $p$ is the polynomial degree. This kernel places a GP prior over polynomials up to degree $p$.
The linear kernel is non-stationary because $k(\mathbf{x}, \mathbf{x}')$ depends on the absolute positions $\mathbf{x}$ and $\mathbf{x}'$, not just their difference $\mathbf{x} - \mathbf{x}'$. In particular, the prior variance $k(\mathbf{x}, \mathbf{x}) = \sigma_b^2 + \sigma_v^2 \lVert \mathbf{x} - \mathbf{c} \rVert^2$ grows with distance from the offset point, so function values far from $\mathbf{c}$ have larger variance and stronger covariances than those near it.
```python
import numpy as np
import matplotlib.pyplot as plt


def linear_kernel(X1, X2, sigma_b_sq=1.0, sigma_v_sq=1.0, offset=0.0):
    """
    Compute the Linear kernel matrix.

    Parameters
    ----------
    X1 : ndarray of shape (n1, d) or (n1,)
    X2 : ndarray of shape (n2, d) or (n2,)
    sigma_b_sq : float
        Bias variance (variance at the offset point)
    sigma_v_sq : float
        Slope variance (controls the range of slopes)
    offset : float or ndarray
        Offset point c (often 0)

    Returns
    -------
    K : ndarray of shape (n1, n2)
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    if X1.shape[0] == 1:
        X1 = X1.T
    if X2.shape[0] == 1:
        X2 = X2.T

    # Center around the offset point c
    X1_centered = X1 - offset
    X2_centered = X2 - offset

    # Linear kernel: σ_b² + σ_v² (x - c)ᵀ(x' - c)
    K = sigma_b_sq + sigma_v_sq * (X1_centered @ X2_centered.T)
    return K


def polynomial_kernel(X1, X2, sigma_b_sq=1.0, sigma_v_sq=1.0, degree=2):
    """
    Compute the Polynomial kernel matrix.

    k(x, x') = (σ_b² + σ_v² · xᵀx')^p
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    if X1.shape[0] == 1:
        X1 = X1.T
    if X2.shape[0] == 1:
        X2 = X2.T

    # Inner products between all pairs of inputs
    inner_products = X1 @ X2.T

    # Polynomial kernel
    K = (sigma_b_sq + sigma_v_sq * inner_products) ** degree
    return K


def demonstrate_linear_kernel_nonstationarity():
    """Visualize how the linear kernel creates non-stationary behavior."""
    X = np.linspace(-3, 3, 100).reshape(-1, 1)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Sample from a GP with the linear kernel
    K = linear_kernel(X, X, sigma_b_sq=0.0, sigma_v_sq=1.0, offset=0.0)
    K += 1e-6 * np.eye(len(X))  # jitter for numerical stability

    np.random.seed(42)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)

    axes[0].plot(X, samples, alpha=0.7)
    axes[0].set_title('Linear Kernel: GP Samples (Lines)')
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].grid(True, alpha=0.3)

    # Visualize the covariance structure
    im = axes[1].imshow(K, extent=[-3, 3, 3, -3], cmap='viridis')
    axes[1].set_title('Linear Kernel: Covariance Matrix')
    axes[1].set_xlabel("x'")
    axes[1].set_ylabel('x')
    plt.colorbar(im, ax=axes[1])

    # Compare covariance slices at different locations
    x_near_origin = np.array([[0.5]])
    x_far_from_origin = np.array([[2.5]])

    cov_near = linear_kernel(x_near_origin, X, sigma_b_sq=0.0, sigma_v_sq=1.0).flatten()
    cov_far = linear_kernel(x_far_from_origin, X, sigma_b_sq=0.0, sigma_v_sq=1.0).flatten()

    axes[2].plot(X, cov_near, label='Cov with x=0.5 (near origin)')
    axes[2].plot(X, cov_far, label='Cov with x=2.5 (far from origin)')
    axes[2].axhline(0, color='gray', linestyle='--', alpha=0.5)
    axes[2].set_title('Non-Stationarity: Position-Dependent Covariance')
    axes[2].set_xlabel("x'")
    axes[2].set_ylabel("k(x, x')")
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


def demonstrate_polynomial_trends():
    """Show how polynomial kernels model trends."""
    X = np.linspace(0, 5, 100).reshape(-1, 1)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    for ax, degree in zip(axes, [1, 2, 3]):
        K = polynomial_kernel(X, X, sigma_b_sq=0.1, sigma_v_sq=0.1, degree=degree)
        K += 1e-6 * np.eye(len(X))

        np.random.seed(42)
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X), 5)

        ax.plot(X, samples, alpha=0.7)
        ax.set_title(f'Polynomial Kernel (degree={degree})')
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Polynomial Kernels: Modeling Trend Structure')
    plt.tight_layout()
    return fig
```

Applications of Linear and Polynomial Kernels
Trend modeling: Combine with stationary kernels to model a trend plus residual structure, $k_{\text{total}} = k_{\text{Linear}} + k_{\text{Matérn}}$ (see the sketch after this list)
Feature learning: Polynomial kernels can capture interactions between input dimensions
Bayesian linear regression: A GP with the linear kernel alone is exactly Bayesian linear regression; adding a stationary kernel yields Bayesian linear regression with a GP-distributed residual
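As a sketch of the trend-plus-residual idea, the snippet below sums the `linear_kernel` from the code above with a Matérn-3/2 component and draws prior samples. The Matérn implementation and the specific variances and length-scale are illustrative assumptions, not part of the original code.

```python
import numpy as np
import matplotlib.pyplot as plt

def matern32_kernel(X1, X2, ell=0.5, sigma_sq=0.3):
    """Matérn-3/2 kernel σ²(1 + √3 r/ℓ) exp(−√3 r/ℓ) on 1-D inputs."""
    r = np.abs(np.atleast_2d(X1).reshape(-1, 1) - np.atleast_2d(X2).reshape(1, -1))
    a = np.sqrt(3.0) * r / ell
    return sigma_sq * (1.0 + a) * np.exp(-a)

X = np.linspace(0, 5, 200).reshape(-1, 1)

# Sum kernel: linear trend component plus Matérn residual structure
K_total = (linear_kernel(X, X, sigma_b_sq=0.2, sigma_v_sq=0.2)
           + matern32_kernel(X, X, ell=0.5, sigma_sq=0.3))
K_total += 1e-8 * np.eye(len(X))

rng = np.random.default_rng(0)
samples = np.linalg.cholesky(K_total) @ rng.standard_normal((len(X), 3))

plt.plot(X, samples)
plt.title('Samples from k_Linear + k_Matérn: linear trend with local variation')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```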
Caveats
One of the most theoretically fascinating non-stationary kernels arises from the connection between Gaussian Processes and neural networks, established by Radford Neal in his seminal 1996 work on Bayesian neural networks. The resulting Neural Network (NN) kernel is also called the arc-sine kernel; its closed form for the erf activation was later derived by Christopher Williams.
Derivation Intuition
Consider a single-layer neural network with $H$ hidden units:
$$f(\mathbf{x}) = \sum_{h=1}^{H} v_h \sigma(\mathbf{w}_h^\top \mathbf{x} + b_h)$$
Where $\sigma$ is the activation function (e.g., sigmoid, tanh), $\mathbf{w}_h$ are input-to-hidden weights, $v_h$ are hidden-to-output weights, and $b_h$ are biases.
If we place i.i.d. Gaussian priors on all weights and biases, then as $H \to \infty$, the function $f$ converges to a Gaussian Process by the Central Limit Theorem!
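The sketch below checks this numerically in one dimension for the erf activation: the covariance of the infinite-width network (a Monte Carlo average over random hidden units) matches the arc-sine formula given next. The two input locations, prior variances, and sample size are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
sigma_w_sq, sigma_b_sq = 1.0, 1.0
x, x_prime = 0.5, 2.0   # two illustrative input locations

# With output weights of variance 1/H, the infinite-width covariance is
# E_{w,b}[erf(w x + b) erf(w x' + b)]; estimate it by Monte Carlo.
w = rng.normal(0, np.sqrt(sigma_w_sq), size=200_000)
b = rng.normal(0, np.sqrt(sigma_b_sq), size=200_000)
mc_cov = np.mean(erf(w * x + b) * erf(w * x_prime + b))

# Analytic arc-sine kernel for erf: x̃ = (1, x), Σ = diag(σ_b², σ_w²)
def s(a, c):
    return sigma_b_sq + sigma_w_sq * a * c   # x̃ᵀ Σ x̃' in 1-D

analytic = (2 / np.pi) * np.arcsin(
    2 * s(x, x_prime) / np.sqrt((1 + 2 * s(x, x)) * (1 + 2 * s(x_prime, x_prime)))
)

print(f"Monte Carlo: {mc_cov:.4f}   analytic: {analytic:.4f}")
```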
The NN Kernel Formula (for erf activation)
For a single-layer network with error function (erf) activation:
$$k_{\text{NN}}(\mathbf{x}, \mathbf{x}') = \frac{2}{\pi} \sin^{-1}\left(\frac{2\tilde{\mathbf{x}}^\top \Sigma \tilde{\mathbf{x}}'}{\sqrt{(1 + 2\tilde{\mathbf{x}}^\top \Sigma \tilde{\mathbf{x}})(1 + 2\tilde{\mathbf{x}}'^\top \Sigma \tilde{\mathbf{x}}')}}\right)$$
Where $\tilde{\mathbf{x}} = (1, x_1, \ldots, x_d)^\top$ is the input augmented with a constant 1 for the bias term, and $\Sigma$ is the prior covariance of the bias and input weights (diagonal with entries $\sigma_b^2, \sigma_w^2, \ldots, \sigma_w^2$ in the implementation below).
The theory extends to deep networks: each additional layer applies another nonlinearity to the covariance structure. Deep network kernels, including the NNGP kernel and the closely related Neural Tangent Kernel (NTK), have become a major research area, connecting infinite-width networks to GPs and enabling theoretical analysis of deep learning.
```python
import numpy as np


def nn_kernel(X1, X2, sigma_w_sq=1.0, sigma_b_sq=1.0):
    """
    Compute the Neural Network (arc-sine) kernel for the erf activation.

    This kernel corresponds to a Bayesian single-layer neural network with
    infinitely many hidden units and Gaussian priors on weights and biases.

    Parameters
    ----------
    X1 : ndarray of shape (n1, d)
    X2 : ndarray of shape (n2, d)
    sigma_w_sq : float
        Prior variance on input weights
    sigma_b_sq : float
        Prior variance on biases

    Returns
    -------
    K : ndarray of shape (n1, n2)
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    if X1.shape[0] == 1 and X1.shape[1] > 1:
        X1 = X1.T
    if X2.shape[0] == 1 and X2.shape[1] > 1:
        X2 = X2.T

    n1, d = X1.shape
    n2 = X2.shape[0]

    # Augment inputs with 1 for the bias: x̃ = (1, x)
    X1_aug = np.hstack([np.ones((n1, 1)), X1])
    X2_aug = np.hstack([np.ones((n2, 1)), X2])

    # Diagonal Σ matrix (bias variance first, then weight variances)
    Sigma_diag = np.array([sigma_b_sq] + [sigma_w_sq] * d)

    # Inputs with each column scaled by the corresponding prior variance
    X1_Sigma = X1_aug * Sigma_diag   # (n1, d+1)
    X2_Sigma = X2_aug * Sigma_diag   # (n2, d+1)

    # Self-products: x̃ᵀ Σ x̃
    s1 = np.sum(X1_aug * X1_Sigma, axis=1)   # (n1,)
    s2 = np.sum(X2_aug * X2_Sigma, axis=1)   # (n2,)

    # Cross-products: x̃ᵀ Σ x̃'
    cross = X1_Sigma @ X2_aug.T   # (n1, n2)

    # Arc-sine formula
    numerator = 2 * cross
    denominator = np.sqrt(np.outer(1 + 2 * s1, 1 + 2 * s2))

    # Clamp to the valid arcsin range to avoid numerical issues
    arg = np.clip(numerator / denominator, -1 + 1e-10, 1 - 1e-10)
    K = (2 / np.pi) * np.arcsin(arg)
    return K


def visualize_nn_kernel_behavior():
    """Demonstrate the non-stationary behavior of the NN kernel."""
    import matplotlib.pyplot as plt

    X = np.linspace(-3, 3, 150).reshape(-1, 1)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # (0,0): GP samples from the NN kernel
    K = nn_kernel(X, X, sigma_w_sq=1.0, sigma_b_sq=1.0)
    K += 1e-8 * np.eye(len(X))

    np.random.seed(42)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)

    axes[0, 0].plot(X, samples, alpha=0.7)
    axes[0, 0].set_title('Neural Network Kernel: GP Samples')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('f(x)')
    axes[0, 0].grid(True, alpha=0.3)

    # (0,1): Covariance matrix structure
    im = axes[0, 1].imshow(K, extent=[-3, 3, 3, -3], cmap='viridis')
    axes[0, 1].set_title('NN Kernel: Covariance Matrix')
    axes[0, 1].set_xlabel("x'")
    axes[0, 1].set_ylabel('x')
    plt.colorbar(im, ax=axes[0, 1])

    # (1,0): Covariance slices at different positions
    positions = [-2.0, 0.0, 2.0]
    for pos in positions:
        x_anchor = np.array([[pos]])
        cov_slice = nn_kernel(x_anchor, X, sigma_w_sq=1.0, sigma_b_sq=1.0).flatten()
        axes[1, 0].plot(X, cov_slice, label=f'Cov with x={pos}')
    axes[1, 0].set_title('Non-Stationarity: Covariance Slices')
    axes[1, 0].set_xlabel("x'")
    axes[1, 0].set_ylabel("k(x, x')")
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # (1,1): Effect of varying the weight prior scale
    sigma_values = [0.5, 1.0, 2.0]
    for sigma_w in sigma_values:
        K_temp = nn_kernel(X, X, sigma_w_sq=sigma_w**2, sigma_b_sq=1.0)
        K_temp += 1e-6 * np.eye(len(X))
        L_temp = np.linalg.cholesky(K_temp)
        sample = L_temp @ np.random.randn(len(X))
        axes[1, 1].plot(X, sample, label=f'σ_w = {sigma_w}')
    axes[1, 1].set_title('Effect of Weight Prior Variance')
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    plt.suptitle('Neural Network Kernel: Non-Stationary Behavior', fontsize=14)
    plt.tight_layout()
    return fig
```

Properties of the Neural Network Kernel
Non-stationarity: The kernel depends on both $\mathbf{x}$ and $\mathbf{x}'$ absolutely, not just their difference. Covariance structure changes based on position relative to the origin.
Bounded output: The NN kernel takes values in $[-1, 1]$, in contrast to the linear and polynomial kernels, whose variance grows without bound.
Saturation behavior: For inputs far from the origin, the kernel saturates (similar to how sigmoid activations saturate for large inputs).
Step-function limit: As weight variance increases, samples approach step functions.
When to Use the NN Kernel
✅ Appropriate for: functions expected to saturate or flatten far from a reference point, bounded sigmoidal trends, and problems where behavior naturally depends on distance from the origin.
❌ Less appropriate for: data well described by translation-invariant structure, periodic signals, or settings where the origin has no special meaning (unless inputs are re-centered first).
One of the most intuitive approaches to non-stationarity is to allow the length-scale to vary across the input domain. The Gibbs kernel (also called the non-stationary squared exponential) achieves this by making $\ell$ a function of position.
Mathematical Definition
In one dimension, the Gibbs kernel is defined as:
$$k_{\text{Gibbs}}(x, x') = \sqrt{\frac{2\,\ell(x)\,\ell(x')}{\ell(x)^2 + \ell(x')^2}} \exp\left(-\frac{(x - x')^2}{\ell(x)^2 + \ell(x')^2}\right)$$
Where $\ell(\mathbf{x})$ is the length-scale function that specifies how the correlation range varies with input location.
Important Properties
The Length-Scale Function $\ell(\mathbf{x})$
This function must be specified or learned; the `practical_length_scale_functions` examples in the code below illustrate common parametric choices (linear decay, exponential growth, a two-regime step, and a periodic pattern).
Consider position-dependent length-scales when: (1) the data appears smooth in some regions but wiggly in others, (2) you have domain knowledge about where variations are faster or slower (e.g., urban vs rural in spatial modeling), (3) residual analysis from a stationary GP shows spatially-correlated heterogeneity.
```python
import numpy as np
import matplotlib.pyplot as plt


def gibbs_kernel(X1, X2, length_scale_fn, sigma_sq=1.0):
    """
    Compute the Gibbs (non-stationary) kernel with a position-dependent length-scale.

    Parameters
    ----------
    X1 : ndarray of shape (n1,) or (n1, 1)
    X2 : ndarray of shape (n2,) or (n2, 1)
    length_scale_fn : callable
        Function that takes x and returns the length-scale ℓ(x)
    sigma_sq : float
        Signal variance

    Returns
    -------
    K : ndarray of shape (n1, n2)
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Length-scales at all input locations
    ell1 = np.array([length_scale_fn(x) for x in X1])   # (n1,)
    ell2 = np.array([length_scale_fn(x) for x in X2])   # (n2,)

    # Broadcastable grids for vectorized computation
    ell1_grid = ell1[:, np.newaxis]   # (n1, 1)
    ell2_grid = ell2[np.newaxis, :]   # (1, n2)
    X1_grid = X1[:, np.newaxis]
    X2_grid = X2[np.newaxis, :]

    # Sum of squared length-scales
    ell_sq_sum = ell1_grid**2 + ell2_grid**2

    # Normalizing prefactor
    prefactor = np.sqrt(2 * ell1_grid * ell2_grid / ell_sq_sum)

    # Exponential term
    sq_diff = (X1_grid - X2_grid)**2
    exp_term = np.exp(-sq_diff / ell_sq_sum)

    K = sigma_sq * prefactor * exp_term
    return K


def demo_varying_smoothness():
    """Demonstrate a function with varying smoothness modeled by the Gibbs kernel."""

    # Length-scale varies: small near x=0 (wiggly), large far from 0 (smooth)
    def length_scale_fn(x):
        return 0.3 + 0.5 * np.abs(x)

    X = np.linspace(-4, 4, 200)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # (0,0): Visualize the length-scale function
    ell_values = [length_scale_fn(x) for x in X]
    axes[0, 0].plot(X, ell_values, 'b-', linewidth=2)
    axes[0, 0].set_title('Position-Dependent Length-Scale ℓ(x)')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('ℓ(x)')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].fill_between(X, 0, ell_values, alpha=0.3)

    # (0,1): GP samples with the Gibbs kernel
    K = gibbs_kernel(X, X, length_scale_fn, sigma_sq=1.0)
    K += 1e-6 * np.eye(len(X))

    np.random.seed(42)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)

    axes[0, 1].plot(X, samples, alpha=0.7)
    axes[0, 1].set_title('Gibbs Kernel: GP Samples (Varying Smoothness)')
    axes[0, 1].set_xlabel('x')
    axes[0, 1].set_ylabel('f(x)')
    axes[0, 1].grid(True, alpha=0.3)

    # (1,0): Covariance matrix
    im = axes[1, 0].imshow(K, extent=[-4, 4, 4, -4], cmap='viridis')
    axes[1, 0].set_title('Gibbs Kernel: Covariance Matrix')
    axes[1, 0].set_xlabel("x'")
    axes[1, 0].set_ylabel('x')
    plt.colorbar(im, ax=axes[1, 0])

    # (1,1): Compare to a stationary RBF with the average length-scale
    avg_ell = np.mean(ell_values)
    K_rbf = np.exp(-(X[:, np.newaxis] - X)**2 / (2 * avg_ell**2))
    K_rbf += 1e-6 * np.eye(len(X))

    np.random.seed(42)   # same seed for comparison
    L_rbf = np.linalg.cholesky(K_rbf)
    samples_rbf = L_rbf @ np.random.randn(len(X), 5)

    axes[1, 1].plot(X, samples_rbf, alpha=0.7)
    axes[1, 1].set_title(f'Comparison: Stationary RBF (ℓ={avg_ell:.2f})')
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].grid(True, alpha=0.3)

    plt.suptitle('Gibbs Kernel: Modeling Spatially-Varying Smoothness', fontsize=14)
    plt.tight_layout()
    return fig


def practical_length_scale_functions():
    """Examples of useful length-scale functions for different scenarios."""
    X = np.linspace(-5, 5, 100)

    # Different length-scale patterns
    patterns = {
        'Linear decay from center': lambda x: 1.5 - 0.2 * np.abs(x),
        'Exponential growth': lambda x: 0.5 * np.exp(0.3 * np.abs(x)),
        'Step function (two regimes)': lambda x: 0.3 if x < 0 else 1.5,
        'Periodic length-scale': lambda x: 0.5 + 0.3 * np.sin(x)**2,
    }

    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    axes = axes.flatten()

    for ax, (name, fn) in zip(axes, patterns.items()):
        ell = np.array([max(0.1, fn(x)) for x in X])   # ensure positivity
        ax.plot(X, ell, 'b-', linewidth=2)
        ax.fill_between(X, 0, ell, alpha=0.3)
        ax.set_title(name)
        ax.set_xlabel('x')
        ax.set_ylabel('ℓ(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Length-Scale Functions for Different Use Cases')
    plt.tight_layout()
    return fig
```

Learning the Length-Scale Function
Specifying $\ell(\mathbf{x})$ parametrically with few parameters is simple but limited. More powerful approaches include:
Hierarchical GP: Place a GP prior on $\log \ell(\mathbf{x})$, creating a two-layer GP model (see the sketch after this list)
Neural network mapping: Use a neural network to output $\ell(\mathbf{x})$, train end-to-end
Inducing point representation: Specify $\ell$ at a set of inducing points, interpolate elsewhere
Kernel transformation: Apply a non-linear transformation (warping) to inputs before applying a stationary kernel
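A minimal sketch of the hierarchical-GP option, assuming the `gibbs_kernel` function defined in the code above: draw $\log \ell(x)$ from a smooth latent GP, then use the resulting length-scale function inside the Gibbs kernel. The latent RBF covariance, its hyperparameters, and the shift applied to keep $\ell(x)$ in a sensible range are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = np.linspace(-4, 4, 200)

# Layer 1: a smooth GP sample for log ℓ(x) (RBF covariance, illustrative hyperparameters)
K_latent = 0.25 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / 2.0**2)
log_ell = np.linalg.cholesky(K_latent + 1e-6 * np.eye(len(X))) @ rng.standard_normal(len(X))
ell = np.exp(log_ell - 0.5)   # shift so length-scales stay in a sensible range

# Interpolate so ℓ(x) can be queried at arbitrary locations, as gibbs_kernel expects
length_scale_fn = lambda x: np.interp(x, X, ell)

# Layer 2: the observed function, drawn from a Gibbs-kernel GP with this ℓ(x)
K = gibbs_kernel(X, X, length_scale_fn, sigma_sq=1.0) + 1e-6 * np.eye(len(X))
f = np.linalg.cholesky(K) @ rng.standard_normal(len(X))

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
ax1.plot(X, ell)
ax1.set_ylabel('ℓ(x) = exp(GP sample)')
ax2.plot(X, f)
ax2.set_ylabel('f(x)')
ax2.set_xlabel('x')
plt.show()
```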
Practical Considerations
An elegant alternative to directly specifying non-stationary kernels is input warping—transforming inputs through a non-linear function before applying a standard stationary kernel. This approach is conceptually simple, computationally efficient, and can capture complex non-stationary behaviors.
The Warping Approach
Given a warping function $g: \mathbb{R}^d \to \mathbb{R}^{d'}$, the warped kernel is:
$$k_{\text{warped}}(\mathbf{x}, \mathbf{x}') = k_{\text{stationary}}(g(\mathbf{x}), g(\mathbf{x}'))$$
The function $g$ 'stretches' or 'compresses' the input space before measuring distances. Where $g$ compresses space (derivatives small), the effective length-scale increases (smoother). Where $g$ stretches space (derivatives large), variations become more rapid.
Common Warping Functions
Box-Cox transformation: $g(x) = \frac{x^\lambda - 1}{\lambda}$ for $\lambda \neq 0$, or $\log(x)$ for $\lambda = 0$
Cumulative Beta distribution: $g(x) = F_{\text{Beta}}(x; \alpha, \beta)$ — highly flexible, bounded
Neural network: Learn $g$ as a neural network (Deep Kernel Learning)
Inverse Gaussian CDF (probit function): $g(x) = \Phi^{-1}(x)$, which maps $(0, 1)$ to the whole real line
When you apply a stationary kernel in warped space, the effective distance between points x and x' becomes ||g(x) - g(x')||. If g stretches the space in some region, the 'effective distance' increases, reducing correlation. If g compresses, effective distance shrinks, increasing correlation. This elegantly converts non-stationarity into a space transformation problem.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta as beta_dist


def rbf_kernel(X1, X2, length_scale=1.0, sigma_sq=1.0):
    """Standard RBF kernel (1-D inputs) for comparison."""
    X1 = np.atleast_1d(X1).flatten()[:, np.newaxis]
    X2 = np.atleast_1d(X2).flatten()[:, np.newaxis]
    sq_dists = (X1 - X2.T)**2
    return sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))


def warped_rbf_kernel(X1, X2, warp_fn, length_scale=1.0, sigma_sq=1.0):
    """
    RBF kernel applied in warped space.

    Parameters
    ----------
    X1, X2 : ndarrays
        Input locations
    warp_fn : callable
        Warping function g(x)
    length_scale, sigma_sq : floats
        Kernel hyperparameters (applied in the warped space)
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Warp the inputs
    X1_warped = np.array([warp_fn(x) for x in X1])
    X2_warped = np.array([warp_fn(x) for x in X2])

    # RBF kernel in the warped space
    sq_dists = (X1_warped[:, np.newaxis] - X2_warped[np.newaxis, :])**2
    return sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))


def demonstrate_input_warping():
    """Show how different warping functions create non-stationary behavior."""
    X = np.linspace(0.01, 0.99, 200)   # avoid boundary issues

    # Warping functions to compare
    warp_functions = {
        'Identity (no warp)': lambda x: x,
        'Log transform': lambda x: np.log(x + 0.1),
        'Beta CDF (α=2, β=5)': lambda x: beta_dist.cdf(x, 2, 5),
        'Sinusoidal': lambda x: x + 0.2 * np.sin(4 * np.pi * x),
    }

    fig, axes = plt.subplots(3, 4, figsize=(16, 12))

    for col, (name, warp_fn) in enumerate(warp_functions.items()):
        # Row 0: the warping function itself
        X_warped = np.array([warp_fn(x) for x in X])
        axes[0, col].plot(X, X_warped, 'b-', linewidth=2)
        axes[0, col].plot(X, X, 'k--', alpha=0.3, label='Identity')
        axes[0, col].set_title(f'{name}')
        axes[0, col].set_xlabel('x (original)')
        axes[0, col].set_ylabel('g(x) (warped)')
        axes[0, col].grid(True, alpha=0.3)

        # Row 1: covariance matrix in the original coordinates
        K = warped_rbf_kernel(X, X, warp_fn, length_scale=0.3, sigma_sq=1.0)
        K += 1e-6 * np.eye(len(X))
        axes[1, col].imshow(K, extent=[0, 1, 1, 0], cmap='viridis')
        axes[1, col].set_title('Covariance Matrix')
        axes[1, col].set_xlabel("x'")
        axes[1, col].set_ylabel('x')

        # Row 2: GP samples
        np.random.seed(42)
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X), 3)
        axes[2, col].plot(X, samples, alpha=0.7)
        axes[2, col].set_title('GP Samples')
        axes[2, col].set_xlabel('x')
        axes[2, col].set_ylabel('f(x)')
        axes[2, col].grid(True, alpha=0.3)

    axes[0, 0].legend(loc='lower right')
    plt.suptitle('Input Warping: Creating Non-Stationarity via Space Transformation', fontsize=14)
    plt.tight_layout()
    return fig


def beta_warp_kernel(X1, X2, alpha=2.0, beta_param=5.0, length_scale=1.0, sigma_sq=1.0):
    """
    Warped kernel using the Beta CDF transformation.

    The Beta CDF is flexible:
      α > β shifts density left (compress right, stretch left)
      α < β shifts density right (stretch right, compress left)
      α = β = 1 gives the identity (uniform)

    Parameters
    ----------
    alpha, beta_param : floats
        Beta distribution parameters controlling the warp shape
    """
    warp_fn = lambda x: beta_dist.cdf(np.clip(x, 0.001, 0.999), alpha, beta_param)
    return warped_rbf_kernel(X1, X2, warp_fn, length_scale, sigma_sq)


# Deep Kernel Learning: neural network warping
class NeuralNetWarp:
    """
    Simple neural network for input warping (conceptual implementation).

    In practice, use PyTorch/TensorFlow with GPyTorch/GPflow.
    """

    def __init__(self, hidden_sizes=[10, 10], seed=42):
        np.random.seed(seed)
        self.layers = []
        input_dim = 1
        for h in hidden_sizes:
            W = np.random.randn(input_dim, h) * 0.5
            b = np.zeros(h)
            self.layers.append((W, b))
            input_dim = h
        # Output layer
        W_out = np.random.randn(input_dim, 1) * 0.5
        b_out = np.zeros(1)
        self.layers.append((W_out, b_out))

    def __call__(self, x):
        """Forward pass through the network for a scalar input."""
        h = np.array([[x]])
        for W, b in self.layers[:-1]:
            h = np.tanh(h @ W + b)   # tanh activation
        W_out, b_out = self.layers[-1]
        return float(h @ W_out + b_out)
```

Deep Kernel Learning
A powerful modern approach combines neural networks with GP kernels:
$$k(\mathbf{x}, \mathbf{x}') = k_{\text{RBF}}(\phi_{\text{NN}}(\mathbf{x}), \phi_{\text{NN}}(\mathbf{x}'))$$
Where $\phi_{\text{NN}}$ is a neural network that learns a representation (warping) of the inputs in which a simple stationary kernel fits well; the network weights and GP hyperparameters are trained end-to-end by maximizing the marginal likelihood.
Key Libraries for Deep Kernel Learning:
`DeepKernel` class combines any PyTorch network with GP kernels

Caution: Deep kernel learning is powerful but requires careful regularization. Without it, the neural network can overfit by learning representations that make all training points very far apart, collapsing uncertainty.
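As a usage sketch (not a full deep kernel learning pipeline), the `NeuralNetWarp` and `warped_rbf_kernel` helpers from the code above can be composed directly; in a real implementation the network weights would be trained jointly with the GP hyperparameters rather than left at their random initialization, as assumed here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Compose a fixed, untrained neural network warp with an RBF kernel,
# using NeuralNetWarp and warped_rbf_kernel defined in the code above.
warp = NeuralNetWarp(hidden_sizes=[10, 10], seed=0)

X = np.linspace(-3, 3, 200)
K = warped_rbf_kernel(X, X, warp, length_scale=0.5, sigma_sq=1.0)
K += 1e-6 * np.eye(len(X))

rng = np.random.default_rng(0)
samples = np.linalg.cholesky(K) @ rng.standard_normal((len(X), 3))

plt.plot(X, samples)
plt.title('GP samples under a neural-network-warped RBF kernel (untrained warp)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```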
Deploying non-stationary Gaussian Processes requires careful attention to model specification, hyperparameter optimization, and validation. This section synthesizes practical wisdom for real-world applications.
Signs that non-stationarity is warranted:
✓ Residuals from a stationary GP show spatial patterns
✓ Different parts of the domain have visibly different scales of variation
✓ Physical domain knowledge suggests varying properties
✓ Model comparison favors non-stationary structure (see the sketch below)
Signs that a stationary kernel is the safer choice:
✗ Limited data (fewer than ~100 points in many domains)
✗ No clear spatially-varying patterns in the residuals
✗ The stationary model already generalizes well
✗ Extrapolation is important (non-stationary models can behave unpredictably outside the data)
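A minimal model-comparison sketch in the spirit of the last item in the first checklist: compare held-out log predictive density under a stationary RBF GP and a Gibbs-kernel GP. The synthetic data, the fixed hand-picked hyperparameters, and the `heldout_log_density` helper are illustrative assumptions; `gibbs_kernel` is the function defined earlier on this page.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: wiggly for x < 0, smooth for x > 0 (illustrative)
X = np.sort(rng.uniform(-3, 3, 150))
f_true = np.where(X < 0, np.sin(5 * X), 0.4 * np.sin(X))
y = f_true + 0.1 * rng.standard_normal(len(X))

train = rng.random(len(X)) < 0.7
Xtr, ytr, Xte, yte = X[train], y[train], X[~train], y[~train]
noise = 0.01

def heldout_log_density(k_fn):
    """Gaussian held-out log predictive density under a zero-mean GP."""
    K = k_fn(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks, Kss = k_fn(Xte, Xtr), k_fn(Xte, Xte)
    mu = Ks @ np.linalg.solve(K, ytr)
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T)) + noise
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (yte - mu)**2 / var)

rbf = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :])**2 / 0.5**2)
gibbs = lambda A, B: gibbs_kernel(A, B, lambda x: 0.2 if x < 0 else 1.0)

print('RBF   held-out log density:', heldout_log_density(rbf))
print('Gibbs held-out log density:', heldout_log_density(gibbs))
# A clearly higher value for the Gibbs model is evidence for non-stationary structure.
```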
Common Pitfalls and Remedies
| Pitfall | Symptom | Remedy |
|---|---|---|
| Overfitting | Excellent training fit, poor test performance | Regularize, reduce complexity, use informative priors |
| Identifiability issues | Multiple parameter settings give similar fits | Reduce parameter count, fix some via domain knowledge |
| Numerical instability | Cholesky failures, NaN gradients | Check for extreme length-scales, add jitter, normalize inputs |
| Slow optimization | Many hyperparameters, complex landscape | Use multi-start, informative initialization, hierarchical approach |
| Unrealistic extrapolation | Weird behavior outside training domain | Consider how non-stationarity extends, maybe use stationary for extrapolation |
Recommended Workflow
Non-stationary kernels expand the GP toolkit to handle phenomena where statistical properties vary across the input domain. While more powerful than stationary alternatives, they require careful application to avoid overfitting.
What's Next:
We've explored individual kernels, both stationary and non-stationary. The next page covers kernel composition—the powerful principle that valid kernels can be combined via addition, multiplication, and other operations to create complex, interpretable covariance structures tailored to your specific domain.
You now understand non-stationary kernels: when stationarity fails, how the Linear, Polynomial, Neural Network, and Gibbs kernels work, and how input warping provides an elegant path to non-stationarity. This prepares you to model complex, heterogeneous phenomena with appropriate prior knowledge.