Kernel-based Support Vector Machines (SVMs) represent one of the most powerful and elegant approaches to binary classification in machine learning. Unlike simpler linear classifiers, kernel SVMs can discover complex, non-linear decision boundaries by implicitly mapping data into high-dimensional feature spaces without explicitly computing the transformation.
The genius of kernel methods lies in the kernel trick: instead of explicitly transforming data points into a higher-dimensional space (which can be computationally prohibitive or even infinite-dimensional), we compute the inner product in that space directly using a kernel function. This allows SVMs to create non-linear decision boundaries while maintaining the efficiency of linear optimization.
Linear Kernel: The simplest kernel, computing the standard dot product between vectors: $$K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^{n} x_i \cdot y_i$$
Radial Basis Function (RBF) Kernel: Also known as the Gaussian kernel, it measures similarity based on Euclidean distance: $$K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{||\mathbf{x} - \mathbf{y}||^2}{2\sigma^2}\right)$$
where $\sigma$ (sigma) controls the "width" of the Gaussian function. Smaller values create more localized, complex decision boundaries, while larger values produce smoother boundaries.
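Both kernels can be implemented directly from their formulas. A minimal sketch in plain Python (the function names are illustrative):

```python
import math

def linear_kernel(x, y):
    # Standard dot product between two feature vectors.
    return sum(xi * yi for xi, yi in zip(x, y))

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian similarity: exp(-||x - y||^2 / (2 * sigma^2)).
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Note that identical vectors give an RBF similarity of exactly 1.0, and the similarity decays toward 0 as the Euclidean distance grows.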
Your task is to implement a batch-mode subgradient descent algorithm to train the SVM. Unlike stochastic variants that randomly sample one training example per iteration, this deterministic approach uses all samples in every iteration for stable, reproducible results.
The SVM learning problem minimizes the regularized hinge loss:
$$\mathcal{L} = \frac{\lambda}{2}||\mathbf{w}||^2 + \frac{1}{n}\sum_{i=1}^{n} \max(0, 1 - y_i \cdot f(\mathbf{x}_i))$$
where $\mathbf{w}$ is the (implicit) weight vector in feature space, $\lambda > 0$ is the regularization strength, $n$ is the number of training samples, $y_i \in \{-1, +1\}$ is the label of sample $i$, and $f(\mathbf{x}_i)$ is the classifier's output for $\mathbf{x}_i$.
For each training sample, update the dual coefficient $\alpha_i$ based on whether it violates the margin constraint:
If the sample is correctly classified with sufficient margin ($y_i \cdot f(\mathbf{x}_i) \geq 1$): $$\alpha_i^{(t+1)} = \alpha_i^{(t)} \cdot (1 - \eta_t \cdot \lambda)$$
If the sample violates the margin ($y_i \cdot f(\mathbf{x}_i) < 1$): $$\alpha_i^{(t+1)} = \alpha_i^{(t)} \cdot (1 - \eta_t \cdot \lambda) + \eta_t \cdot y_i$$
The learning rate $\eta_t$ typically decreases over time as $\eta_t = \frac{1}{\lambda \cdot t}$, where $t \geq 1$ is the iteration number (starting at 1 so the rate is well-defined).
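The two update cases above collapse into a single pass over all samples. A sketch of one deterministic batch step, assuming a precomputed kernel matrix `K` (list of lists) and a fixed bias `b` (the helper name `batch_step` is illustrative):

```python
def batch_step(alpha, b, K, labels, lam, t):
    # Decaying learning rate: eta_t = 1 / (lambda * t), with t starting at 1.
    eta = 1.0 / (lam * t)
    n = len(labels)
    # Batch mode: evaluate f(x_i) for every sample using the current alphas.
    f = [sum(alpha[j] * K[j][i] for j in range(n)) + b for i in range(n)]
    new_alpha = []
    for i in range(n):
        a = alpha[i] * (1.0 - eta * lam)      # regularization shrinkage
        if labels[i] * f[i] < 1:              # margin violated
            a += eta * labels[i]              # correction term
        new_alpha.append(a)
    return new_alpha
```

Because all $f(\mathbf{x}_i)$ are computed before any coefficient changes, the step is fully deterministic and order-independent.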
The classifier's prediction for a new sample $\mathbf{x}$ is: $$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \cdot K(\mathbf{x}_i, \mathbf{x}) + b$$
where $b$ is the bias term, updated to center the decision boundary between the support vectors.
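Prediction then reduces to a kernel-weighted sum over the training points. A minimal sketch, where `kernel` is any function of two vectors (e.g. a dot product):

```python
def predict(x, train_data, alpha, b, kernel):
    # f(x) = sum_i alpha_i * K(x_i, x) + b; the predicted class is its sign.
    score = sum(a * kernel(xi, x) for a, xi in zip(alpha, train_data)) + b
    return 1 if score >= 0 else -1
```

With a linear kernel this is equivalent to $\text{sign}(\mathbf{w}^T\mathbf{x} + b)$ with $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i$; for non-linear kernels the weight vector is never formed explicitly.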
Implement the kernel_svm_classifier function that builds the kernel matrix for the chosen kernel, runs the batch subgradient updates for the requested number of iterations, and returns the learned coefficients $\alpha$ and bias $b$.
Important: Use all training samples in each iteration (no random sampling).
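Putting the pieces together, one possible shape for the function is sketched below. The zero initialization of the coefficients is an assumption, and the exact bias-update rule ("center the decision boundary between the support vectors") is not fully specified above, so `b` is left as a stub to be filled in per the problem's definition:

```python
import math

def kernel_svm_classifier(data, labels, kernel="rbf", lambda_val=0.01,
                          iterations=100, sigma=1.0):
    # Select the kernel function.
    if kernel == "linear":
        k = lambda x, y: sum(p * q for p, q in zip(x, y))
    else:  # "rbf"
        k = lambda x, y: math.exp(
            -sum((p - q) ** 2 for p, q in zip(x, y)) / (2.0 * sigma ** 2))

    n = len(data)
    # Precompute the kernel matrix once; training only ever reads it.
    K = [[k(data[i], data[j]) for j in range(n)] for i in range(n)]

    alpha = [0.0] * n   # assumed initialization
    b = 0.0             # bias update rule left to the problem's specification
    for t in range(1, iterations + 1):
        eta = 1.0 / (lambda_val * t)
        # Batch mode: score all samples with the current coefficients.
        f = [sum(alpha[j] * K[j][i] for j in range(n)) + b for i in range(n)]
        for i in range(n):
            alpha[i] *= (1.0 - eta * lambda_val)
            if labels[i] * f[i] < 1:          # margin violation
                alpha[i] += eta * labels[i]
    return alpha, b
```

Note that on the very first iteration $\eta_1 \lambda = 1$, so the shrinkage factor is 0 and each violating coefficient jumps straight to $\eta_1 y_i = y_i / \lambda$.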
data = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 1.0]]
labels = [1, 1, -1, -1]
kernel = "rbf"
lambda_val = 0.01
iterations = 100
sigma = 1.0

Output:

alpha = [1.0, 1.0, -100.0, -100.0]
b = -85.7884

Using the RBF kernel with σ=1.0, the algorithm learns to separate the two classes in a transformed feature space.
Initial Setup:
Kernel Matrix Computation: The RBF kernel computes pairwise similarities. For example:
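For the data above with σ = 1.0, the similarity of the first sample to each of the others can be computed directly from the RBF formula:

```python
import math

def rbf(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 * sigma^2))
    sq = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

data = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 1.0]]
# Similarity of sample 1 to each of the other samples:
print(round(rbf(data[0], data[1]), 4))  # exp(-1.0) ≈ 0.3679
print(round(rbf(data[0], data[2]), 4))  # exp(-2.5) ≈ 0.0821
print(round(rbf(data[0], data[3]), 4))  # exp(-5.0) ≈ 0.0067
```

Nearby points score close to 1, while distant points fall toward 0, which is exactly the locality behavior σ controls.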
Training Process: Over 100 iterations, the algorithm adjusts the α values. Samples that are harder to classify (closer to the decision boundary) accumulate larger magnitudes. Here the negative-class samples (samples 3 and 4) end up with larger-magnitude coefficients, indicating they are more critical for defining the boundary.
Bias Calculation: The bias b ≈ -85.7884 shifts the decision boundary to properly separate the classes in the kernel-induced feature space.
data = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]]
labels = [1, 1, -1, -1]
kernel = "linear"
lambda_val = 0.01
iterations = 50

Output:

alpha = [100.0, 100.0, -100.0, -100.0]
b = 0.0

With a linear kernel, the algorithm finds a hyperplane in the original feature space.
Data Geometry:
Linear Kernel: K(xᵢ, xⱼ) = xᵢᵀxⱼ (simple dot product)
Training Result: After 50 iterations, all samples reach the maximum coefficient magnitude (100.0). The symmetric distribution of points around the origin results in a bias of exactly 0.0, meaning the separating hyperplane passes through the origin.
Decision Boundary: With the linear kernel and b = 0, the resulting classifier predicts sign(100·x·[1,1] + 100·x·[2,2] − 100·x·[−1,−1] − 100·x·[−2,−2]) = sign(600(x₁ + x₂)), so the decision boundary is the line x₁ + x₂ = 0.
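This arithmetic is easy to check numerically: with the linear kernel and b = 0, the learned decision function collapses to a single linear form in the query point:

```python
data = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]]
alpha = [100.0, 100.0, -100.0, -100.0]

def f(x):
    # f(x) = sum_i alpha_i * (x_i . x), with bias b = 0.
    return sum(a * (xi[0] * x[0] + xi[1] * x[1]) for a, xi in zip(alpha, data))

print(f([1.0, 0.0]))   # 600.0 -> f(x) = 600 * (x1 + x2)
print(f([1.0, -1.0]))  # 0.0   -> points with x1 + x2 = 0 lie on the boundary
```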
data = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]
labels = [1, 1, -1, -1]
kernel = "linear"
lambda_val = 0.1
iterations = 20

Output:

alpha = [10.0, 10.0, -10.0, -10.0]
b = 0.0

A higher regularization parameter (λ=0.1) with fewer iterations produces smaller coefficient magnitudes.
Effect of Regularization:
Data Configuration:
Perfect Symmetry: The symmetric arrangement again yields b=0.0, with the separating line passing through the origin at a 45° angle (the line y = -x separates the classes).
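This symmetry can be verified numerically with the linear kernel and b = 0:

```python
data = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]
alpha = [10.0, 10.0, -10.0, -10.0]

def f(x):
    # f(x) = sum_i alpha_i * (x_i . x), with bias b = 0.
    return sum(a * (xi[0] * x[0] + xi[1] * x[1]) for a, xi in zip(alpha, data))

print(f([1.0, -1.0]))  # 0.0  -> the line y = -x is the decision boundary
print(f([1.0, 1.0]))   # 40.0 -> positive side of the boundary
```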
Constraints