In probabilistic machine learning, Bayesian inference over functions offers a powerful paradigm for making predictions while quantifying uncertainty. Unlike parametric models that learn fixed parameters, this approach maintains a distribution over all possible functions that could explain the observed data.
The Bayesian Function Predictor is a non-parametric model that defines a prior distribution over functions using only two key components: a mean function (typically assumed to be zero) and a covariance function (also called a kernel). The kernel encodes our assumptions about the function's smoothness, periodicity, and other structural properties.
Given training data points ((X_{train}, y_{train})) and test points (X_{test}), the predictor operates as follows:
The kernel (k(x_i, x_j)) measures the similarity between any two input points. For a linear kernel:
$$k(x_i, x_j) = \sigma_b^2 + \sigma_v^2 \cdot (x_i^T \cdot x_j)$$
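As a minimal NumPy sketch (the function name `linear_kernel` is illustrative, not part of the specification), this kernel can be evaluated for all pairs of points at once:

```python
import numpy as np

def linear_kernel(X1, X2, sigma_b=0.0, sigma_v=1.0):
    """Evaluate k(x_i, x_j) = sigma_b^2 + sigma_v^2 * (x_i . x_j)
    for every pair of rows in X1 (n, d) and X2 (m, d) -> (n, m) matrix."""
    X1, X2 = np.atleast_2d(X1), np.atleast_2d(X2)
    return sigma_b**2 + sigma_v**2 * (X1 @ X2.T)
```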
Where (\sigma_b^2) is the bias variance and (\sigma_v^2) is the weight (slope) variance of the kernel.
Construct three covariance matrices:
- (K_{train,train}): the kernel evaluated between all pairs of training points; adding the noise variance to its diagonal gives (K_y = K_{train,train} + \sigma_n^2 I)
- (K_{test,train}): the kernel between each test point and each training point
- (K_{test,test}): the kernel between all pairs of test points
The predicted mean at test points is computed as:
$$\mu_{test} = K_{test,train} \cdot K_y^{-1} \cdot y_{train}$$
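In code, the explicit inverse is best replaced by a linear solve, which is cheaper and numerically more stable. A hedged sketch (the helper name `posterior_mean` is an assumption, not part of the required API):

```python
import numpy as np

def posterior_mean(K_test_train, K_y, y_train):
    """Compute mu_test = K_test,train @ K_y^{-1} @ y_train.

    np.linalg.solve(K_y, y_train) computes K_y^{-1} @ y_train without
    ever forming the inverse matrix.
    """
    return K_test_train @ np.linalg.solve(K_y, y_train)
```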
This formula elegantly combines the kernel similarity structure with the observed target values to produce predictions that interpolate smoothly through the training data.
Implement the BayesianFunctionPredictor class with the following methods:
__init__(kernel, kernel_params, noise): Initialize the predictor with the specified kernel type, its hyperparameters, and observation noise variance.
fit(X_train, y_train): Condition the prior on the observed training data by computing and storing necessary matrices.
predict(X_test): Return the posterior mean predictions for the test points, formatted to 4 decimal places.
Note: Your implementation should handle multi-dimensional input features and return predictions as a NumPy array.
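Pulling the pieces together, one possible sketch of the class follows. It is not the only valid implementation: only the linear kernel is handled here, and rounding with `np.round` is one reading of "formatted to 4 decimal places" while still returning a NumPy array.

```python
import numpy as np

class BayesianFunctionPredictor:
    """Sketch of the predictor described above (linear kernel only)."""

    def __init__(self, kernel, kernel_params, noise):
        self.kernel = kernel
        self.kernel_params = kernel_params
        self.noise = noise

    def _k(self, X1, X2):
        # Covariance between every row of X1 and every row of X2.
        if self.kernel == "linear":
            sb = self.kernel_params.get("sigma_b", 0.0)
            sv = self.kernel_params.get("sigma_v", 1.0)
            return sb**2 + sv**2 * (X1 @ X2.T)
        raise ValueError(f"unknown kernel: {self.kernel}")

    def fit(self, X_train, y_train):
        # Store training data and the noisy training covariance K_y.
        self.X_train = np.atleast_2d(np.asarray(X_train, dtype=float))
        self.y_train = np.asarray(y_train, dtype=float)
        K = self._k(self.X_train, self.X_train)
        self.K_y = K + self.noise * np.eye(len(self.X_train))
        return self

    def predict(self, X_test):
        # Posterior mean: K_test,train @ K_y^{-1} @ y_train,
        # computed with a linear solve instead of an explicit inverse.
        X_test = np.atleast_2d(np.asarray(X_test, dtype=float))
        K_star = self._k(X_test, self.X_train)
        mu = K_star @ np.linalg.solve(self.K_y, self.y_train)
        return np.round(mu, 4)
```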
X_train = [[1.0], [2.0], [4.0]]
y_train = [3.0, 5.0, 9.0]
X_test = [[3.0]]
kernel = "linear"
kernel_params = {"sigma_b": 0.0, "sigma_v": 1.0}
noise = 1e-8
Output: 7.0000
The training data follows a perfect linear relationship: y = 2x + 1.
With x = 1 → y = 3, x = 2 → y = 5, x = 4 → y = 9.
Using a linear kernel with sigma_b = 0.0 (no bias variance) and sigma_v = 1.0 (unit slope variance), the predictor effectively learns this linear function.
For x_test = 3.0, the prediction is 7.0000, matching y = 2(3) + 1 = 7.
The extremely small noise (1e-8) acts only as regularization that keeps (K_y) invertible; with it, the posterior mean closely fits the training trend and recovers the value of the underlying linear function at the test point.
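This example can be checked numerically with a few lines of NumPy (a standalone sketch of the same computation):

```python
import numpy as np

# Example 1 worked end to end: sigma_b = 0 and sigma_v = 1 reduce the
# linear kernel to the plain dot product x . x'.
X = np.array([[1.0], [2.0], [4.0]])
y = np.array([3.0, 5.0, 9.0])
X_star = np.array([[3.0]])
noise = 1e-8

K_y = X @ X.T + noise * np.eye(len(X))  # K_train,train + noise * I
K_star = X_star @ X.T                   # K_test,train
mu = K_star @ np.linalg.solve(K_y, y)   # posterior mean, ~7.0
```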
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 2.0, 4.0, 6.0]
X_test = [[0.5], [1.5], [2.5]]
kernel = "linear"
kernel_params = {"sigma_b": 0.0, "sigma_v": 1.0}
noise = 1e-6
Output: ["1.0000", "3.0000", "5.0000"]
The training data represents y = 2x (a perfect linear function through the origin).
The predictor with a linear kernel learns this relationship and makes predictions at the intermediate points: x = 0.5 → 1.0000, x = 1.5 → 3.0000, x = 2.5 → 5.0000.
Since the data is perfectly linear and noise is minimal, predictions at any point along this line are highly accurate.
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_train = [1.0, 1.0, 2.0]
X_test = [[0.5, 0.5]]
kernel = "linear"
kernel_params = {"sigma_b": 0.0, "sigma_v": 1.0}
noise = 1e-6
Output: 1.0000
This example demonstrates multi-dimensional input features. The training data is consistent with y = x₁ + x₂:
(1, 0) → 1, (0, 1) → 1, (1, 1) → 2.
For the test point (0.5, 0.5): y = 0.5 + 0.5 = 1.0, so the predictor outputs 1.0000.
This shows the predictor correctly captures the additive linear relationship across multiple input dimensions.
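The same few lines of NumPy confirm the multi-dimensional case (again a standalone sketch, not the required implementation):

```python
import numpy as np

# Example 3: two-dimensional inputs with the same dot-product kernel;
# the (n, d) @ (d, m) matrix products handle any input dimension.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, 2.0])
X_star = np.array([[0.5, 0.5]])
noise = 1e-6

K_y = X @ X.T + noise * np.eye(len(X))
mu = (X_star @ X.T) @ np.linalg.solve(K_y, y)  # ~1.0
```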
Constraints