The stationary kernels we explored on the previous page share a fundamental assumption: the statistical relationship between function values depends only on the separation between input points, $\mathbf{x} - \mathbf{x}'$, not on their absolute location. While this assumption is elegant and often appropriate, it fails dramatically for many real-world phenomena.
Stationarity breaks down in many common scenarios: time series with trends, functions whose smoothness varies across the domain, boundary effects, heteroscedastic noise, and more (the table below collects these cases).
Non-stationary kernels address these limitations by allowing the covariance function to depend on the absolute positions of input points, not just their difference.
By the end of this page, you will understand: (1) why and when stationarity assumptions fail, (2) the mathematical structure of key non-stationary kernels including the Neural Network, Polynomial, and Linear kernels, (3) how to construct input-dependent non-stationary kernels, (4) the Gibbs kernel and warping-based approaches, and (5) practical strategies for deploying non-stationary GPs.
Before diving into non-stationary kernels, let's build intuition for why we need them. Understanding the limitations of stationarity will help you recognize when to reach for more flexible alternatives.
The Stationarity Assumption Revisited
A stationary kernel satisfies: $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$
This implies that if we slide the entire function (and all observations) by any offset $\mathbf{h}$, the induced probability distribution over functions remains unchanged. Mathematically:
$$p(f(\mathbf{x}_1), ..., f(\mathbf{x}_n)) = p(f(\mathbf{x}_1 + \mathbf{h}), ..., f(\mathbf{x}_n + \mathbf{h})) \quad \forall \mathbf{h}$$
Cases Where This Assumption Breaks
| Phenomenon | Why Stationarity Fails | Required Property | Solution Approach |
|---|---|---|---|
| Time series with trends | Mean changes with time (growth, decay) | Position-dependent mean | Linear/Polynomial kernel or mean function |
| Varying smoothness | Different regions have different 'wiggliness' | Position-dependent length-scale | Gibbs kernel, deep kernels |
| Boundary effects | Behavior changes near edges | Location-aware covariance | Warping, neural network kernels |
| Multi-fidelity data | Different sources have different noise/correlation | Input-dependent properties | Deep GPs, hierarchical models |
| Heteroscedastic noise | Observation noise varies across domain | Position-dependent noise variance | Heteroscedastic GP models |
| Scale changes | Amplitude of variations changes | Position-dependent output scale | Gibbs kernel, warping |
To check if stationarity is appropriate: (1) Compute residuals from a fitted stationary GP and plot them against input location—patterns indicate non-stationarity. (2) Divide your domain into regions and fit separate GPs—if optimal hyperparameters differ substantially, consider non-stationary alternatives. (3) Examine prior samples—do they look qualitatively similar across the domain?
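Here is a minimal sketch of the second check: fit a stationary RBF kernel separately to two halves of the domain and compare the length-scales that maximize the log marginal likelihood. The data-generating function, helper names, and grid of candidate length-scales are illustrative assumptions, not part of the original material.

```python
import numpy as np

def rbf(X1, X2, ell, sigma_sq=1.0):
    """Stationary RBF kernel on 1-D inputs."""
    d = X1[:, None] - X2[None, :]
    return sigma_sq * np.exp(-0.5 * d**2 / ell**2)

def log_marginal_likelihood(X, y, ell, noise=1e-2):
    """Exact log marginal likelihood of a zero-mean GP with an RBF kernel."""
    K = rbf(X, X, ell) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

# Synthetic data: wiggly on the left, smooth on the right (illustrative)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 120)
y = np.where(X < 0, np.sin(6 * X), 0.3 * np.sin(1.5 * X)) + 0.05 * rng.standard_normal(len(X))

ells = np.linspace(0.05, 2.0, 60)
for name, mask in [("left half", X < 0), ("right half", X >= 0)]:
    best = max(ells, key=lambda e: log_marginal_likelihood(X[mask], y[mask], e))
    print(f"{name}: best length-scale ≈ {best:.2f}")
# Substantially different optimal length-scales per region hint at non-stationarity.
```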
The Modeling Trade-Off
Non-stationary kernels offer greater flexibility but come with costs:
| Aspect | Stationary Kernels | Non-Stationary Kernels |
|---|---|---|
| Parameters | Few (σ², ℓ, maybe ν) | Often many more |
| Interpretability | High (length-scale = correlation range) | Can be harder to interpret |
| Overfitting risk | Lower | Higher with many parameters |
| Extrapolation | Consistent behavior | Can behave unpredictably |
| Computational cost | Standard GP cost | Often same, sometimes higher |
The principle of parsimony suggests: use stationary kernels unless you have clear evidence of non-stationarity. Non-stationary models are more powerful but require more data and care to avoid overfitting.
The simplest non-stationary kernels arise from the classical linear and polynomial models. These kernels are computationally attractive and serve as building blocks for more complex structures.
The Linear Kernel
$$k_{\text{Linear}}(\mathbf{x}, \mathbf{x}') = \sigma_b^2 + \sigma_v^2 (\mathbf{x} - \mathbf{c})^\top (\mathbf{x}' - \mathbf{c})$$
Where $\sigma_b^2$ is the bias variance (the prior variance at the offset point), $\sigma_v^2$ is the slope variance (controlling the range of plausible slopes), and $\mathbf{c}$ is the offset point about which the kernel is centered (often the origin).
Key Property: The linear kernel is equivalent to Bayesian linear regression with a Gaussian prior on the weights. GP samples are linear functions.
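To make the equivalence concrete, the following sketch checks numerically that the GP posterior mean under the linear kernel matches the Bayesian linear regression posterior mean with priors $b \sim \mathcal{N}(0, \sigma_b^2)$ and $w \sim \mathcal{N}(0, \sigma_v^2)$. The toy data and the small inline `k_lin` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_b_sq, sigma_v_sq, noise = 1.0, 2.0, 0.1

# Toy 1-D data (illustrative)
X = rng.uniform(-2, 2, size=8)
y = 0.7 * X - 0.3 + np.sqrt(noise) * rng.standard_normal(8)
X_test = np.linspace(-3, 3, 5)

# --- GP regression with the linear kernel k(x, x') = σ_b² + σ_v² x x' ---
def k_lin(a, b):
    return sigma_b_sq + sigma_v_sq * np.outer(a, b)

K = k_lin(X, X) + noise * np.eye(len(X))
gp_mean = k_lin(X_test, X) @ np.linalg.solve(K, y)

# --- Bayesian linear regression with features (1, x) ---
Phi = np.column_stack([np.ones_like(X), X])
Phi_test = np.column_stack([np.ones_like(X_test), X_test])
Sigma_p = np.diag([sigma_b_sq, sigma_v_sq])          # prior covariance of (b, w)
A = Phi @ Sigma_p @ Phi.T + noise * np.eye(len(X))
blr_mean = Phi_test @ Sigma_p @ Phi.T @ np.linalg.solve(A, y)

print(np.allclose(gp_mean, blr_mean))   # True: the two models coincide
```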
The Polynomial Kernel
$$k_{\text{Poly}}(\mathbf{x}, \mathbf{x}') = (\sigma_b^2 + \sigma_v^2 \mathbf{x}^\top \mathbf{x}')^p$$
Where $p$ is the polynomial degree. This kernel places a GP prior over polynomials up to degree $p$.
The linear kernel is non-stationary because $k(\mathbf{x}, \mathbf{x}')$ depends on the absolute positions $\mathbf{x}$ and $\mathbf{x}'$, not just their difference $\mathbf{x} - \mathbf{x}'$. In particular, the prior variance $k(\mathbf{x}, \mathbf{x}) = \sigma_b^2 + \sigma_v^2 \lVert \mathbf{x} - \mathbf{c} \rVert^2$ grows with distance from the offset point, so function values far from $\mathbf{c}$ have larger variance and stronger covariances than those near it.
```python
import numpy as np
import matplotlib.pyplot as plt


def linear_kernel(X1, X2, sigma_b_sq=1.0, sigma_v_sq=1.0, offset=0.0):
    """
    Compute the Linear kernel matrix.

    Parameters
    ----------
    X1 : ndarray of shape (n1, d) or (n1,)
    X2 : ndarray of shape (n2, d) or (n2,)
    sigma_b_sq : float
        Bias variance (variance at the offset point)
    sigma_v_sq : float
        Slope variance (controls the range of slopes)
    offset : float or ndarray
        Offset point c (often 0)

    Returns
    -------
    K : ndarray of shape (n1, n2)
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    if X1.shape[0] == 1:
        X1 = X1.T
    if X2.shape[0] == 1:
        X2 = X2.T

    # Center around the offset point c
    X1_centered = X1 - offset
    X2_centered = X2 - offset

    # Linear kernel: σ_b² + σ_v² (x - c)ᵀ(x' - c)
    K = sigma_b_sq + sigma_v_sq * (X1_centered @ X2_centered.T)
    return K


def polynomial_kernel(X1, X2, sigma_b_sq=1.0, sigma_v_sq=1.0, degree=2):
    """
    Compute the Polynomial kernel matrix.

    k(x, x') = (σ_b² + σ_v² · xᵀx')^p
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    if X1.shape[0] == 1:
        X1 = X1.T
    if X2.shape[0] == 1:
        X2 = X2.T

    # Inner products between all pairs of inputs
    inner_products = X1 @ X2.T

    # Polynomial kernel
    K = (sigma_b_sq + sigma_v_sq * inner_products) ** degree
    return K


def demonstrate_linear_kernel_nonstationarity():
    """Visualize how the linear kernel creates non-stationary behavior."""
    X = np.linspace(-3, 3, 100).reshape(-1, 1)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Sample from a GP with the linear kernel
    K = linear_kernel(X, X, sigma_b_sq=0.0, sigma_v_sq=1.0, offset=0.0)
    K += 1e-6 * np.eye(len(X))  # jitter for numerical stability

    np.random.seed(42)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)

    axes[0].plot(X, samples, alpha=0.7)
    axes[0].set_title('Linear Kernel: GP Samples (Lines)')
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('f(x)')
    axes[0].grid(True, alpha=0.3)

    # Visualize the covariance structure
    im = axes[1].imshow(K, extent=[-3, 3, 3, -3], cmap='viridis')
    axes[1].set_title('Linear Kernel: Covariance Matrix')
    axes[1].set_xlabel("x'")
    axes[1].set_ylabel('x')
    plt.colorbar(im, ax=axes[1])

    # Compare covariance slices at different locations
    x_near_origin = np.array([[0.5]])
    x_far_from_origin = np.array([[2.5]])

    cov_near = linear_kernel(x_near_origin, X, sigma_b_sq=0.0, sigma_v_sq=1.0).flatten()
    cov_far = linear_kernel(x_far_from_origin, X, sigma_b_sq=0.0, sigma_v_sq=1.0).flatten()

    axes[2].plot(X, cov_near, label='Cov with x=0.5 (near origin)')
    axes[2].plot(X, cov_far, label='Cov with x=2.5 (far from origin)')
    axes[2].axhline(0, color='gray', linestyle='--', alpha=0.5)
    axes[2].set_title('Non-Stationarity: Position-Dependent Covariance')
    axes[2].set_xlabel("x'")
    axes[2].set_ylabel("k(x, x')")
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    return fig


def demonstrate_polynomial_trends():
    """Show how polynomial kernels model trends."""
    X = np.linspace(0, 5, 100).reshape(-1, 1)

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    for ax, degree in zip(axes, [1, 2, 3]):
        K = polynomial_kernel(X, X, sigma_b_sq=0.1, sigma_v_sq=0.1, degree=degree)
        K += 1e-6 * np.eye(len(X))

        np.random.seed(42)
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X), 5)

        ax.plot(X, samples, alpha=0.7)
        ax.set_title(f'Polynomial Kernel (degree={degree})')
        ax.set_xlabel('x')
        ax.set_ylabel('f(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Polynomial Kernels: Modeling Trend Structure')
    plt.tight_layout()
    return fig
```

Applications of Linear and Polynomial Kernels
Trend modeling: Combine with stationary kernels to model a trend plus residual structure, $k_{\text{total}} = k_{\text{Linear}} + k_{\text{Matérn}}$ (see the sketch after this list)
Feature learning: Polynomial kernels can capture interactions between input dimensions
Bayesian linear regression: A GP with the linear kernel alone is exactly Bayesian linear regression; adding a stationary kernel yields Bayesian linear regression with a GP-distributed residual
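As a sketch of the trend-plus-residual idea, the snippet below sums the `linear_kernel` from the code above with a Matérn-3/2 component and draws prior samples. The Matérn implementation and the specific variances and length-scale are illustrative assumptions, not part of the original code.

```python
import numpy as np
import matplotlib.pyplot as plt

def matern32_kernel(X1, X2, ell=0.5, sigma_sq=0.3):
    """Matérn-3/2 kernel σ²(1 + √3 r/ℓ) exp(−√3 r/ℓ) on 1-D inputs."""
    r = np.abs(np.atleast_2d(X1).reshape(-1, 1) - np.atleast_2d(X2).reshape(1, -1))
    a = np.sqrt(3.0) * r / ell
    return sigma_sq * (1.0 + a) * np.exp(-a)

X = np.linspace(0, 5, 200).reshape(-1, 1)

# Sum kernel: linear trend component plus Matérn residual structure
K_total = (linear_kernel(X, X, sigma_b_sq=0.2, sigma_v_sq=0.2)
           + matern32_kernel(X, X, ell=0.5, sigma_sq=0.3))
K_total += 1e-8 * np.eye(len(X))

rng = np.random.default_rng(0)
samples = np.linalg.cholesky(K_total) @ rng.standard_normal((len(X), 3))

plt.plot(X, samples)
plt.title('Samples from k_Linear + k_Matérn: linear trend with local variation')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```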
Caveats
One of the most theoretically fascinating non-stationary kernels arises from the connection between Gaussian Processes and neural networks, established by Radford Neal in his seminal 1996 work on Bayesian neural networks. The resulting Neural Network (NN) kernel is also called the arc-sine kernel; its closed form for the erf activation was later derived by Christopher Williams.
Derivation Intuition
Consider a single-layer neural network with $H$ hidden units:
$$f(\mathbf{x}) = \sum_{h=1}^{H} v_h \sigma(\mathbf{w}_h^\top \mathbf{x} + b_h)$$
Where $\sigma$ is the activation function (e.g., sigmoid, tanh), $\mathbf{w}_h$ are input-to-hidden weights, $v_h$ are hidden-to-output weights, and $b_h$ are biases.
If we place i.i.d. Gaussian priors on all weights and biases, then as $H \to \infty$, the function $f$ converges to a Gaussian Process by the Central Limit Theorem!
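The sketch below checks this numerically in one dimension for the erf activation: the covariance of the infinite-width network (a Monte Carlo average over random hidden units) matches the arc-sine formula given next. The two input locations, prior variances, and sample size are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
sigma_w_sq, sigma_b_sq = 1.0, 1.0
x, x_prime = 0.5, 2.0   # two illustrative input locations

# With output weights of variance 1/H, the infinite-width covariance is
# E_{w,b}[erf(w x + b) erf(w x' + b)]; estimate it by Monte Carlo.
w = rng.normal(0, np.sqrt(sigma_w_sq), size=200_000)
b = rng.normal(0, np.sqrt(sigma_b_sq), size=200_000)
mc_cov = np.mean(erf(w * x + b) * erf(w * x_prime + b))

# Analytic arc-sine kernel for erf: x̃ = (1, x), Σ = diag(σ_b², σ_w²)
def s(a, c):
    return sigma_b_sq + sigma_w_sq * a * c   # x̃ᵀ Σ x̃' in 1-D

analytic = (2 / np.pi) * np.arcsin(
    2 * s(x, x_prime) / np.sqrt((1 + 2 * s(x, x)) * (1 + 2 * s(x_prime, x_prime)))
)

print(f"Monte Carlo: {mc_cov:.4f}   analytic: {analytic:.4f}")
```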
The NN Kernel Formula (for erf activation)
For a single-layer network with error function (erf) activation:
$$k_{\text{NN}}(\mathbf{x}, \mathbf{x}') = \frac{2}{\pi} \sin^{-1}\left(\frac{2\tilde{\mathbf{x}}^\top \Sigma \tilde{\mathbf{x}}'}{\sqrt{(1 + 2\tilde{\mathbf{x}}^\top \Sigma \tilde{\mathbf{x}})(1 + 2\tilde{\mathbf{x}}'^\top \Sigma \tilde{\mathbf{x}}')}}\right)$$
Where $\tilde{\mathbf{x}} = (1, x_1, \ldots, x_d)^\top$ is the input augmented with a constant 1 for the bias term, and $\Sigma$ is the prior covariance of the bias and input weights (diagonal with entries $\sigma_b^2, \sigma_w^2, \ldots, \sigma_w^2$ in the implementation below).
The theory extends to deep networks: each additional layer applies another nonlinearity to the covariance structure. Deep network kernels, including the NNGP kernel and the closely related Neural Tangent Kernel (NTK), have become a major research area, connecting infinite-width networks to GPs and enabling theoretical analysis of deep learning.
```python
import numpy as np


def nn_kernel(X1, X2, sigma_w_sq=1.0, sigma_b_sq=1.0):
    """
    Compute the Neural Network (arc-sine) kernel for the erf activation.

    This kernel corresponds to a Bayesian single-layer neural network with
    infinitely many hidden units and Gaussian priors on weights and biases.

    Parameters
    ----------
    X1 : ndarray of shape (n1, d)
    X2 : ndarray of shape (n2, d)
    sigma_w_sq : float
        Prior variance on input weights
    sigma_b_sq : float
        Prior variance on biases

    Returns
    -------
    K : ndarray of shape (n1, n2)
    """
    X1 = np.atleast_2d(X1)
    X2 = np.atleast_2d(X2)
    if X1.shape[0] == 1 and X1.shape[1] > 1:
        X1 = X1.T
    if X2.shape[0] == 1 and X2.shape[1] > 1:
        X2 = X2.T

    n1, d = X1.shape
    n2 = X2.shape[0]

    # Augment inputs with 1 for the bias: x̃ = (1, x)
    X1_aug = np.hstack([np.ones((n1, 1)), X1])
    X2_aug = np.hstack([np.ones((n2, 1)), X2])

    # Diagonal Σ matrix (bias variance first, then weight variances)
    Sigma_diag = np.array([sigma_b_sq] + [sigma_w_sq] * d)

    # Inputs with each column scaled by the corresponding prior variance
    X1_Sigma = X1_aug * Sigma_diag   # (n1, d+1)
    X2_Sigma = X2_aug * Sigma_diag   # (n2, d+1)

    # Self-products: x̃ᵀ Σ x̃
    s1 = np.sum(X1_aug * X1_Sigma, axis=1)   # (n1,)
    s2 = np.sum(X2_aug * X2_Sigma, axis=1)   # (n2,)

    # Cross-products: x̃ᵀ Σ x̃'
    cross = X1_Sigma @ X2_aug.T   # (n1, n2)

    # Arc-sine formula
    numerator = 2 * cross
    denominator = np.sqrt(np.outer(1 + 2 * s1, 1 + 2 * s2))

    # Clamp to the valid arcsin range to avoid numerical issues
    arg = np.clip(numerator / denominator, -1 + 1e-10, 1 - 1e-10)
    K = (2 / np.pi) * np.arcsin(arg)
    return K


def visualize_nn_kernel_behavior():
    """Demonstrate the non-stationary behavior of the NN kernel."""
    import matplotlib.pyplot as plt

    X = np.linspace(-3, 3, 150).reshape(-1, 1)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # (0,0): GP samples from the NN kernel
    K = nn_kernel(X, X, sigma_w_sq=1.0, sigma_b_sq=1.0)
    K += 1e-8 * np.eye(len(X))

    np.random.seed(42)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)

    axes[0, 0].plot(X, samples, alpha=0.7)
    axes[0, 0].set_title('Neural Network Kernel: GP Samples')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('f(x)')
    axes[0, 0].grid(True, alpha=0.3)

    # (0,1): Covariance matrix structure
    im = axes[0, 1].imshow(K, extent=[-3, 3, 3, -3], cmap='viridis')
    axes[0, 1].set_title('NN Kernel: Covariance Matrix')
    axes[0, 1].set_xlabel("x'")
    axes[0, 1].set_ylabel('x')
    plt.colorbar(im, ax=axes[0, 1])

    # (1,0): Covariance slices at different positions
    positions = [-2.0, 0.0, 2.0]
    for pos in positions:
        x_anchor = np.array([[pos]])
        cov_slice = nn_kernel(x_anchor, X, sigma_w_sq=1.0, sigma_b_sq=1.0).flatten()
        axes[1, 0].plot(X, cov_slice, label=f'Cov with x={pos}')
    axes[1, 0].set_title('Non-Stationarity: Covariance Slices')
    axes[1, 0].set_xlabel("x'")
    axes[1, 0].set_ylabel("k(x, x')")
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # (1,1): Effect of varying the weight prior scale
    sigma_values = [0.5, 1.0, 2.0]
    for sigma_w in sigma_values:
        K_temp = nn_kernel(X, X, sigma_w_sq=sigma_w**2, sigma_b_sq=1.0)
        K_temp += 1e-6 * np.eye(len(X))
        L_temp = np.linalg.cholesky(K_temp)
        sample = L_temp @ np.random.randn(len(X))
        axes[1, 1].plot(X, sample, label=f'σ_w = {sigma_w}')
    axes[1, 1].set_title('Effect of Weight Prior Variance')
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    plt.suptitle('Neural Network Kernel: Non-Stationary Behavior', fontsize=14)
    plt.tight_layout()
    return fig
```

Properties of the Neural Network Kernel
Non-stationarity: The kernel depends on both $\mathbf{x}$ and $\mathbf{x}'$ absolutely, not just their difference. Covariance structure changes based on position relative to the origin.
Bounded output: The NN kernel takes values in $[-1, 1]$, in contrast to the linear and polynomial kernels, whose variance grows without bound.
Saturation behavior: For inputs far from the origin, the kernel saturates (similar to how sigmoid activations saturate for large inputs).
Step-function limit: As weight variance increases, samples approach step functions.
When to Use the NN Kernel
✅ Appropriate for: functions expected to saturate or flatten far from a reference point, bounded sigmoidal trends, and problems where behavior naturally depends on distance from the origin.
❌ Less appropriate for: data well described by translation-invariant structure, periodic signals, or settings where the origin has no special meaning (unless inputs are re-centered first).
One of the most intuitive approaches to non-stationarity is to allow the length-scale to vary across the input domain. The Gibbs kernel (also called the non-stationary squared exponential) achieves this by making $\ell$ a function of position.
Mathematical Definition
In one dimension, the Gibbs kernel is defined as:
$$k_{\text{Gibbs}}(x, x') = \sqrt{\frac{2\,\ell(x)\,\ell(x')}{\ell(x)^2 + \ell(x')^2}} \exp\left(-\frac{(x - x')^2}{\ell(x)^2 + \ell(x')^2}\right)$$
Where $\ell(\mathbf{x})$ is the length-scale function that specifies how the correlation range varies with input location.
Important Properties
The Length-Scale Function $\ell(\mathbf{x})$
This function must be specified or learned; the `practical_length_scale_functions` examples in the code below illustrate common parametric choices (linear decay, exponential growth, a two-regime step, and a periodic pattern).
Consider position-dependent length-scales when: (1) the data appears smooth in some regions but wiggly in others, (2) you have domain knowledge about where variations are faster or slower (e.g., urban vs rural in spatial modeling), (3) residual analysis from a stationary GP shows spatially-correlated heterogeneity.
```python
import numpy as np
import matplotlib.pyplot as plt


def gibbs_kernel(X1, X2, length_scale_fn, sigma_sq=1.0):
    """
    Compute the Gibbs (non-stationary) kernel with a position-dependent length-scale.

    Parameters
    ----------
    X1 : ndarray of shape (n1,) or (n1, 1)
    X2 : ndarray of shape (n2,) or (n2, 1)
    length_scale_fn : callable
        Function that takes x and returns the length-scale ℓ(x)
    sigma_sq : float
        Signal variance

    Returns
    -------
    K : ndarray of shape (n1, n2)
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Length-scales at all input locations
    ell1 = np.array([length_scale_fn(x) for x in X1])   # (n1,)
    ell2 = np.array([length_scale_fn(x) for x in X2])   # (n2,)

    # Broadcastable grids for vectorized computation
    ell1_grid = ell1[:, np.newaxis]   # (n1, 1)
    ell2_grid = ell2[np.newaxis, :]   # (1, n2)
    X1_grid = X1[:, np.newaxis]
    X2_grid = X2[np.newaxis, :]

    # Sum of squared length-scales
    ell_sq_sum = ell1_grid**2 + ell2_grid**2

    # Normalizing prefactor
    prefactor = np.sqrt(2 * ell1_grid * ell2_grid / ell_sq_sum)

    # Exponential term
    sq_diff = (X1_grid - X2_grid)**2
    exp_term = np.exp(-sq_diff / ell_sq_sum)

    K = sigma_sq * prefactor * exp_term
    return K


def demo_varying_smoothness():
    """Demonstrate a function with varying smoothness modeled by the Gibbs kernel."""

    # Length-scale varies: small near x=0 (wiggly), large far from 0 (smooth)
    def length_scale_fn(x):
        return 0.3 + 0.5 * np.abs(x)

    X = np.linspace(-4, 4, 200)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # (0,0): Visualize the length-scale function
    ell_values = [length_scale_fn(x) for x in X]
    axes[0, 0].plot(X, ell_values, 'b-', linewidth=2)
    axes[0, 0].set_title('Position-Dependent Length-Scale ℓ(x)')
    axes[0, 0].set_xlabel('x')
    axes[0, 0].set_ylabel('ℓ(x)')
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].fill_between(X, 0, ell_values, alpha=0.3)

    # (0,1): GP samples with the Gibbs kernel
    K = gibbs_kernel(X, X, length_scale_fn, sigma_sq=1.0)
    K += 1e-6 * np.eye(len(X))

    np.random.seed(42)
    L = np.linalg.cholesky(K)
    samples = L @ np.random.randn(len(X), 5)

    axes[0, 1].plot(X, samples, alpha=0.7)
    axes[0, 1].set_title('Gibbs Kernel: GP Samples (Varying Smoothness)')
    axes[0, 1].set_xlabel('x')
    axes[0, 1].set_ylabel('f(x)')
    axes[0, 1].grid(True, alpha=0.3)

    # (1,0): Covariance matrix
    im = axes[1, 0].imshow(K, extent=[-4, 4, 4, -4], cmap='viridis')
    axes[1, 0].set_title('Gibbs Kernel: Covariance Matrix')
    axes[1, 0].set_xlabel("x'")
    axes[1, 0].set_ylabel('x')
    plt.colorbar(im, ax=axes[1, 0])

    # (1,1): Compare to a stationary RBF with the average length-scale
    avg_ell = np.mean(ell_values)
    K_rbf = np.exp(-(X[:, np.newaxis] - X)**2 / (2 * avg_ell**2))
    K_rbf += 1e-6 * np.eye(len(X))

    np.random.seed(42)   # same seed for comparison
    L_rbf = np.linalg.cholesky(K_rbf)
    samples_rbf = L_rbf @ np.random.randn(len(X), 5)

    axes[1, 1].plot(X, samples_rbf, alpha=0.7)
    axes[1, 1].set_title(f'Comparison: Stationary RBF (ℓ={avg_ell:.2f})')
    axes[1, 1].set_xlabel('x')
    axes[1, 1].set_ylabel('f(x)')
    axes[1, 1].grid(True, alpha=0.3)

    plt.suptitle('Gibbs Kernel: Modeling Spatially-Varying Smoothness', fontsize=14)
    plt.tight_layout()
    return fig


def practical_length_scale_functions():
    """Examples of useful length-scale functions for different scenarios."""
    X = np.linspace(-5, 5, 100)

    # Different length-scale patterns
    patterns = {
        'Linear decay from center': lambda x: 1.5 - 0.2 * np.abs(x),
        'Exponential growth': lambda x: 0.5 * np.exp(0.3 * np.abs(x)),
        'Step function (two regimes)': lambda x: 0.3 if x < 0 else 1.5,
        'Periodic length-scale': lambda x: 0.5 + 0.3 * np.sin(x)**2,
    }

    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    axes = axes.flatten()

    for ax, (name, fn) in zip(axes, patterns.items()):
        ell = np.array([max(0.1, fn(x)) for x in X])   # ensure positivity
        ax.plot(X, ell, 'b-', linewidth=2)
        ax.fill_between(X, 0, ell, alpha=0.3)
        ax.set_title(name)
        ax.set_xlabel('x')
        ax.set_ylabel('ℓ(x)')
        ax.grid(True, alpha=0.3)

    plt.suptitle('Length-Scale Functions for Different Use Cases')
    plt.tight_layout()
    return fig
```

Learning the Length-Scale Function
Specifying $\ell(\mathbf{x})$ parametrically with few parameters is simple but limited. More powerful approaches include:
Hierarchical GP: Place a GP prior on $\log \ell(\mathbf{x})$, creating a two-layer GP model (see the sketch after this list)
Neural network mapping: Use a neural network to output $\ell(\mathbf{x})$, train end-to-end
Inducing point representation: Specify $\ell$ at a set of inducing points, interpolate elsewhere
Kernel transformation: Apply a non-linear transformation (warping) to inputs before applying a stationary kernel
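A minimal sketch of the hierarchical-GP option, assuming the `gibbs_kernel` function defined in the code above: draw $\log \ell(x)$ from a smooth latent GP, then use the resulting length-scale function inside the Gibbs kernel. The latent RBF covariance, its hyperparameters, and the shift applied to keep $\ell(x)$ in a sensible range are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = np.linspace(-4, 4, 200)

# Layer 1: a smooth GP sample for log ℓ(x) (RBF covariance, illustrative hyperparameters)
K_latent = 0.25 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / 2.0**2)
log_ell = np.linalg.cholesky(K_latent + 1e-6 * np.eye(len(X))) @ rng.standard_normal(len(X))
ell = np.exp(log_ell - 0.5)   # shift so length-scales stay in a sensible range

# Interpolate so ℓ(x) can be queried at arbitrary locations, as gibbs_kernel expects
length_scale_fn = lambda x: np.interp(x, X, ell)

# Layer 2: the observed function, drawn from a Gibbs-kernel GP with this ℓ(x)
K = gibbs_kernel(X, X, length_scale_fn, sigma_sq=1.0) + 1e-6 * np.eye(len(X))
f = np.linalg.cholesky(K) @ rng.standard_normal(len(X))

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6), sharex=True)
ax1.plot(X, ell)
ax1.set_ylabel('ℓ(x) = exp(GP sample)')
ax2.plot(X, f)
ax2.set_ylabel('f(x)')
ax2.set_xlabel('x')
plt.show()
```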
Practical Considerations
An elegant alternative to directly specifying non-stationary kernels is input warping—transforming inputs through a non-linear function before applying a standard stationary kernel. This approach is conceptually simple, computationally efficient, and can capture complex non-stationary behaviors.
The Warping Approach
Given a warping function $g: \mathbb{R}^d \to \mathbb{R}^{d'}$, the warped kernel is:
$$k_{\text{warped}}(\mathbf{x}, \mathbf{x}') = k_{\text{stationary}}(g(\mathbf{x}), g(\mathbf{x}'))$$
The function $g$ 'stretches' or 'compresses' the input space before measuring distances. Where $g$ compresses space (derivatives small), the effective length-scale increases (smoother). Where $g$ stretches space (derivatives large), variations become more rapid.
Common Warping Functions
Box-Cox transformation: $g(x) = \frac{x^\lambda - 1}{\lambda}$ for $\lambda \neq 0$, or $\log(x)$ for $\lambda = 0$
Cumulative Beta distribution: $g(x) = F_{\text{Beta}}(x; \alpha, \beta)$ — highly flexible, bounded
Neural network: Learn $g$ as a neural network (Deep Kernel Learning)
Inverse Gaussian CDF (probit function): $g(x) = \Phi^{-1}(x)$, which maps $(0, 1)$ to the whole real line
When you apply a stationary kernel in warped space, the effective distance between points x and x' becomes ||g(x) - g(x')||. If g stretches the space in some region, the 'effective distance' increases, reducing correlation. If g compresses, effective distance shrinks, increasing correlation. This elegantly converts non-stationarity into a space transformation problem.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta as beta_dist


def rbf_kernel(X1, X2, length_scale=1.0, sigma_sq=1.0):
    """Standard RBF kernel (1-D inputs) for comparison."""
    X1 = np.atleast_1d(X1).flatten()[:, np.newaxis]
    X2 = np.atleast_1d(X2).flatten()[:, np.newaxis]
    sq_dists = (X1 - X2.T)**2
    return sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))


def warped_rbf_kernel(X1, X2, warp_fn, length_scale=1.0, sigma_sq=1.0):
    """
    RBF kernel applied in warped space.

    Parameters
    ----------
    X1, X2 : ndarrays
        Input locations
    warp_fn : callable
        Warping function g(x)
    length_scale, sigma_sq : floats
        Kernel hyperparameters (applied in the warped space)
    """
    X1 = np.atleast_1d(X1).flatten()
    X2 = np.atleast_1d(X2).flatten()

    # Warp the inputs
    X1_warped = np.array([warp_fn(x) for x in X1])
    X2_warped = np.array([warp_fn(x) for x in X2])

    # RBF kernel in the warped space
    sq_dists = (X1_warped[:, np.newaxis] - X2_warped[np.newaxis, :])**2
    return sigma_sq * np.exp(-sq_dists / (2 * length_scale**2))


def demonstrate_input_warping():
    """Show how different warping functions create non-stationary behavior."""
    X = np.linspace(0.01, 0.99, 200)   # avoid boundary issues

    # Warping functions to compare
    warp_functions = {
        'Identity (no warp)': lambda x: x,
        'Log transform': lambda x: np.log(x + 0.1),
        'Beta CDF (α=2, β=5)': lambda x: beta_dist.cdf(x, 2, 5),
        'Sinusoidal': lambda x: x + 0.2 * np.sin(4 * np.pi * x),
    }

    fig, axes = plt.subplots(3, 4, figsize=(16, 12))

    for col, (name, warp_fn) in enumerate(warp_functions.items()):
        # Row 0: the warping function itself
        X_warped = np.array([warp_fn(x) for x in X])
        axes[0, col].plot(X, X_warped, 'b-', linewidth=2)
        axes[0, col].plot(X, X, 'k--', alpha=0.3, label='Identity')
        axes[0, col].set_title(f'{name}')
        axes[0, col].set_xlabel('x (original)')
        axes[0, col].set_ylabel('g(x) (warped)')
        axes[0, col].grid(True, alpha=0.3)

        # Row 1: covariance matrix in the original coordinates
        K = warped_rbf_kernel(X, X, warp_fn, length_scale=0.3, sigma_sq=1.0)
        K += 1e-6 * np.eye(len(X))
        axes[1, col].imshow(K, extent=[0, 1, 1, 0], cmap='viridis')
        axes[1, col].set_title('Covariance Matrix')
        axes[1, col].set_xlabel("x'")
        axes[1, col].set_ylabel('x')

        # Row 2: GP samples
        np.random.seed(42)
        L = np.linalg.cholesky(K)
        samples = L @ np.random.randn(len(X), 3)
        axes[2, col].plot(X, samples, alpha=0.7)
        axes[2, col].set_title('GP Samples')
        axes[2, col].set_xlabel('x')
        axes[2, col].set_ylabel('f(x)')
        axes[2, col].grid(True, alpha=0.3)

    axes[0, 0].legend(loc='lower right')
    plt.suptitle('Input Warping: Creating Non-Stationarity via Space Transformation', fontsize=14)
    plt.tight_layout()
    return fig


def beta_warp_kernel(X1, X2, alpha=2.0, beta_param=5.0, length_scale=1.0, sigma_sq=1.0):
    """
    Warped kernel using the Beta CDF transformation.

    The Beta CDF is flexible:
      α > β shifts density left (compress right, stretch left)
      α < β shifts density right (stretch right, compress left)
      α = β = 1 gives the identity (uniform)

    Parameters
    ----------
    alpha, beta_param : floats
        Beta distribution parameters controlling the warp shape
    """
    warp_fn = lambda x: beta_dist.cdf(np.clip(x, 0.001, 0.999), alpha, beta_param)
    return warped_rbf_kernel(X1, X2, warp_fn, length_scale, sigma_sq)


# Deep Kernel Learning: neural network warping
class NeuralNetWarp:
    """
    Simple neural network for input warping (conceptual implementation).

    In practice, use PyTorch/TensorFlow with GPyTorch/GPflow.
    """

    def __init__(self, hidden_sizes=[10, 10], seed=42):
        np.random.seed(seed)
        self.layers = []
        input_dim = 1
        for h in hidden_sizes:
            W = np.random.randn(input_dim, h) * 0.5
            b = np.zeros(h)
            self.layers.append((W, b))
            input_dim = h
        # Output layer
        W_out = np.random.randn(input_dim, 1) * 0.5
        b_out = np.zeros(1)
        self.layers.append((W_out, b_out))

    def __call__(self, x):
        """Forward pass through the network for a scalar input."""
        h = np.array([[x]])
        for W, b in self.layers[:-1]:
            h = np.tanh(h @ W + b)   # tanh activation
        W_out, b_out = self.layers[-1]
        return float(h @ W_out + b_out)
```

Deep Kernel Learning
A powerful modern approach combines neural networks with GP kernels:
$$k(\mathbf{x}, \mathbf{x}') = k_{\text{RBF}}(\phi_{\text{NN}}(\mathbf{x}), \phi_{\text{NN}}(\mathbf{x}'))$$
Where $\phi_{\text{NN}}$ is a neural network that learns a representation (warping) of the inputs in which a simple stationary kernel fits well; the network weights and GP hyperparameters are trained end-to-end by maximizing the marginal likelihood.
Key Libraries for Deep Kernel Learning:
`DeepKernel` class combines any PyTorch network with GP kernels

Caution: Deep kernel learning is powerful but requires careful regularization. Without it, the neural network can overfit by learning representations that make all training points very far apart, collapsing uncertainty.
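As a usage sketch (not a full deep kernel learning pipeline), the `NeuralNetWarp` and `warped_rbf_kernel` helpers from the code above can be composed directly; in a real implementation the network weights would be trained jointly with the GP hyperparameters rather than left at their random initialization, as assumed here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Compose a fixed, untrained neural network warp with an RBF kernel,
# using NeuralNetWarp and warped_rbf_kernel defined in the code above.
warp = NeuralNetWarp(hidden_sizes=[10, 10], seed=0)

X = np.linspace(-3, 3, 200)
K = warped_rbf_kernel(X, X, warp, length_scale=0.5, sigma_sq=1.0)
K += 1e-6 * np.eye(len(X))

rng = np.random.default_rng(0)
samples = np.linalg.cholesky(K) @ rng.standard_normal((len(X), 3))

plt.plot(X, samples)
plt.title('GP samples under a neural-network-warped RBF kernel (untrained warp)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```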
Deploying non-stationary Gaussian Processes requires careful attention to model specification, hyperparameter optimization, and validation. This section synthesizes practical wisdom for real-world applications.
Signs that non-stationarity is warranted:
✓ Residuals from a stationary GP show spatial patterns
✓ Different parts of the domain have visibly different scales of variation
✓ Physical domain knowledge suggests varying properties
✓ Model comparison favors non-stationary structure (see the sketch below)
Signs that a stationary kernel is the safer choice:
✗ Limited data (fewer than ~100 points in many domains)
✗ No clear spatially-varying patterns in the residuals
✗ The stationary model already generalizes well
✗ Extrapolation is important (non-stationary models can behave unpredictably outside the data)
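A minimal model-comparison sketch in the spirit of the last item in the first checklist: compare held-out log predictive density under a stationary RBF GP and a Gibbs-kernel GP. The synthetic data, the fixed hand-picked hyperparameters, and the `heldout_log_density` helper are illustrative assumptions; `gibbs_kernel` is the function defined earlier on this page.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: wiggly for x < 0, smooth for x > 0 (illustrative)
X = np.sort(rng.uniform(-3, 3, 150))
f_true = np.where(X < 0, np.sin(5 * X), 0.4 * np.sin(X))
y = f_true + 0.1 * rng.standard_normal(len(X))

train = rng.random(len(X)) < 0.7
Xtr, ytr, Xte, yte = X[train], y[train], X[~train], y[~train]
noise = 0.01

def heldout_log_density(k_fn):
    """Gaussian held-out log predictive density under a zero-mean GP."""
    K = k_fn(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks, Kss = k_fn(Xte, Xtr), k_fn(Xte, Xte)
    mu = Ks @ np.linalg.solve(K, ytr)
    var = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T)) + noise
    return np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (yte - mu)**2 / var)

rbf = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :])**2 / 0.5**2)
gibbs = lambda A, B: gibbs_kernel(A, B, lambda x: 0.2 if x < 0 else 1.0)

print('RBF   held-out log density:', heldout_log_density(rbf))
print('Gibbs held-out log density:', heldout_log_density(gibbs))
# A clearly higher value for the Gibbs model is evidence for non-stationary structure.
```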
Common Pitfalls and Remedies
| Pitfall | Symptom | Remedy |
|---|---|---|
| Overfitting | Excellent training fit, poor test performance | Regularize, reduce complexity, use informative priors |
| Identifiability issues | Multiple parameter settings give similar fits | Reduce parameter count, fix some via domain knowledge |
| Numerical instability | Cholesky failures, NaN gradients | Check for extreme length-scales, add jitter, normalize inputs |
| Slow optimization | Many hyperparameters, complex landscape | Use multi-start, informative initialization, hierarchical approach |
| Unrealistic extrapolation | Weird behavior outside training domain | Consider how non-stationarity extends, maybe use stationary for extrapolation |
Recommended Workflow
Non-stationary kernels expand the GP toolkit to handle phenomena where statistical properties vary across the input domain. While more powerful than stationary alternatives, they require careful application to avoid overfitting.
What's Next:
We've explored individual kernels, both stationary and non-stationary. The next page covers kernel composition—the powerful principle that valid kernels can be combined via addition, multiplication, and other operations to create complex, interpretable covariance structures tailored to your specific domain.
You now understand non-stationary kernels: when stationarity fails, how the Linear, Polynomial, Neural Network, and Gibbs kernels work, and how input warping provides an elegant path to non-stationarity. This prepares you to model complex, heterogeneous phenomena with appropriate prior knowledge.