Kernel-based Support Vector Machines (SVMs) represent one of the most powerful and elegant approaches to binary classification in machine learning. Unlike simpler linear classifiers, kernel SVMs can discover complex, non-linear decision boundaries by implicitly mapping data into high-dimensional feature spaces without explicitly computing the transformation.
The genius of kernel methods lies in the kernel trick: instead of explicitly transforming data points into a higher-dimensional space (which can be computationally prohibitive or even infinite-dimensional), we compute the inner product in that space directly using a kernel function. This allows SVMs to create non-linear decision boundaries while maintaining the efficiency of linear optimization.
Linear Kernel: The simplest kernel, computing the standard dot product between vectors: $$K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} = \sum_{i=1}^{n} x_i \cdot y_i$$
Radial Basis Function (RBF) Kernel: Also known as the Gaussian kernel, it measures similarity based on Euclidean distance: $$K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{||\mathbf{x} - \mathbf{y}||^2}{2\sigma^2}\right)$$
where $\sigma$ (sigma) controls the "width" of the Gaussian function. Smaller values create more localized, complex decision boundaries, while larger values produce smoother boundaries.
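Both kernels can be implemented directly from their formulas. A minimal sketch in plain Python (the function names are illustrative):

```python
import math

def linear_kernel(x, y):
    # Standard dot product between two feature vectors.
    return sum(xi * yi for xi, yi in zip(x, y))

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian similarity: exp(-||x - y||^2 / (2 * sigma^2)).
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Note that identical vectors give an RBF similarity of exactly 1.0, and the similarity decays toward 0 as the Euclidean distance grows.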
Your task is to implement a batch-mode subgradient descent algorithm to train the SVM. Unlike stochastic variants that randomly sample one training example per iteration, this deterministic approach uses all samples in every iteration for stable, reproducible results.
The SVM learning problem minimizes the regularized hinge loss:
$$\mathcal{L} = \frac{\lambda}{2}||\mathbf{w}||^2 + \frac{1}{n}\sum_{i=1}^{n} \max(0, 1 - y_i \cdot f(\mathbf{x}_i))$$
where $\mathbf{w}$ is the (implicit) weight vector in feature space, $\lambda > 0$ is the regularization strength, $n$ is the number of training samples, $y_i \in \{-1, +1\}$ is the label of sample $i$, and $f(\mathbf{x}_i)$ is the classifier's output for $\mathbf{x}_i$.
For each training sample, update the dual coefficient $\alpha_i$ based on whether it violates the margin constraint:
If the sample is correctly classified with sufficient margin ($y_i \cdot f(\mathbf{x}_i) \geq 1$): $$\alpha_i^{(t+1)} = \alpha_i^{(t)} \cdot (1 - \eta_t \cdot \lambda)$$
If the sample violates the margin ($y_i \cdot f(\mathbf{x}_i) < 1$): $$\alpha_i^{(t+1)} = \alpha_i^{(t)} \cdot (1 - \eta_t \cdot \lambda) + \eta_t \cdot y_i$$
The learning rate $\eta_t$ typically decreases over time as $\eta_t = \frac{1}{\lambda \cdot t}$, where $t \geq 1$ is the iteration number (starting at 1 so the rate is well-defined).
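The two update cases above collapse into a single pass over all samples. A sketch of one deterministic batch step, assuming a precomputed kernel matrix `K` (list of lists) and a fixed bias `b` (the helper name `batch_step` is illustrative):

```python
def batch_step(alpha, b, K, labels, lam, t):
    # Decaying learning rate: eta_t = 1 / (lambda * t), with t starting at 1.
    eta = 1.0 / (lam * t)
    n = len(labels)
    # Batch mode: evaluate f(x_i) for every sample using the current alphas.
    f = [sum(alpha[j] * K[j][i] for j in range(n)) + b for i in range(n)]
    new_alpha = []
    for i in range(n):
        a = alpha[i] * (1.0 - eta * lam)      # regularization shrinkage
        if labels[i] * f[i] < 1:              # margin violated
            a += eta * labels[i]              # correction term
        new_alpha.append(a)
    return new_alpha
```

Because all $f(\mathbf{x}_i)$ are computed before any coefficient changes, the step is fully deterministic and order-independent.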
The classifier's prediction for a new sample $\mathbf{x}$ is: $$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i \cdot K(\mathbf{x}_i, \mathbf{x}) + b$$
where $b$ is the bias term, updated to center the decision boundary between the support vectors.
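Prediction then reduces to a kernel-weighted sum over the training points. A minimal sketch, where `kernel` is any function of two vectors (e.g. a dot product):

```python
def predict(x, train_data, alpha, b, kernel):
    # f(x) = sum_i alpha_i * K(x_i, x) + b; the predicted class is its sign.
    score = sum(a * kernel(xi, x) for a, xi in zip(alpha, train_data)) + b
    return 1 if score >= 0 else -1
```

With a linear kernel this is equivalent to $\text{sign}(\mathbf{w}^T\mathbf{x} + b)$ with $\mathbf{w} = \sum_i \alpha_i \mathbf{x}_i$; for non-linear kernels the weight vector is never formed explicitly.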
Implement the kernel_svm_classifier function that builds the kernel matrix for the chosen kernel, runs the batch subgradient updates for the requested number of iterations, and returns the learned coefficients $\alpha$ and bias $b$.
Important: Use all training samples in each iteration (no random sampling).
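Putting the pieces together, one possible shape for the function is sketched below. The zero initialization of the coefficients is an assumption, and the exact bias-update rule ("center the decision boundary between the support vectors") is not fully specified above, so `b` is left as a stub to be filled in per the problem's definition:

```python
import math

def kernel_svm_classifier(data, labels, kernel="rbf", lambda_val=0.01,
                          iterations=100, sigma=1.0):
    # Select the kernel function.
    if kernel == "linear":
        k = lambda x, y: sum(p * q for p, q in zip(x, y))
    else:  # "rbf"
        k = lambda x, y: math.exp(
            -sum((p - q) ** 2 for p, q in zip(x, y)) / (2.0 * sigma ** 2))

    n = len(data)
    # Precompute the kernel matrix once; training only ever reads it.
    K = [[k(data[i], data[j]) for j in range(n)] for i in range(n)]

    alpha = [0.0] * n   # assumed initialization
    b = 0.0             # bias update rule left to the problem's specification
    for t in range(1, iterations + 1):
        eta = 1.0 / (lambda_val * t)
        # Batch mode: score all samples with the current coefficients.
        f = [sum(alpha[j] * K[j][i] for j in range(n)) + b for i in range(n)]
        for i in range(n):
            alpha[i] *= (1.0 - eta * lambda_val)
            if labels[i] * f[i] < 1:          # margin violation
                alpha[i] += eta * labels[i]
    return alpha, b
```

Note that on the very first iteration $\eta_1 \lambda = 1$, so the shrinkage factor is 0 and each violating coefficient jumps straight to $\eta_1 y_i = y_i / \lambda$.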
data = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 1.0]]
labels = [1, 1, -1, -1]
kernel = "rbf"
lambda_val = 0.01
iterations = 100
sigma = 1.0

Output:

alpha = [1.0, 1.0, -100.0, -100.0]
b = -85.7884

Using the RBF kernel with σ=1.0, the algorithm learns to separate the two classes in a transformed feature space.
Initial Setup:
Kernel Matrix Computation: The RBF kernel computes pairwise similarities. For example:
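For the data above with σ = 1.0, the similarity of the first sample to each of the others can be computed directly from the RBF formula:

```python
import math

def rbf(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 * sigma^2))
    sq = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

data = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 1.0]]
# Similarity of sample 1 to each of the other samples:
print(round(rbf(data[0], data[1]), 4))  # exp(-1.0) ≈ 0.3679
print(round(rbf(data[0], data[2]), 4))  # exp(-2.5) ≈ 0.0821
print(round(rbf(data[0], data[3]), 4))  # exp(-5.0) ≈ 0.0067
```

Nearby points score close to 1, while distant points fall toward 0, which is exactly the locality behavior σ controls.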
Training Process: Over 100 iterations, the algorithm adjusts the α values. Samples that are harder to classify (closer to the decision boundary) accumulate larger magnitudes. Here the negative-class samples (samples 3 and 4) end up with larger-magnitude coefficients, indicating they are more critical for defining the boundary.
Bias Calculation: The bias b ≈ -85.7884 shifts the decision boundary to properly separate the classes in the kernel-induced feature space.
data = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]]
labels = [1, 1, -1, -1]
kernel = "linear"
lambda_val = 0.01
iterations = 50

Output:

alpha = [100.0, 100.0, -100.0, -100.0]
b = 0.0

With a linear kernel, the algorithm finds a hyperplane in the original feature space.
Data Geometry:
Linear Kernel: K(xᵢ, xⱼ) = xᵢᵀxⱼ (simple dot product)
Training Result: After 50 iterations, all samples reach the maximum coefficient magnitude (100.0). The symmetric distribution of points around the origin results in a bias of exactly 0.0, meaning the separating hyperplane passes through the origin.
Decision Boundary: With the linear kernel and b = 0, the resulting classifier predicts sign(100·x·[1,1] + 100·x·[2,2] − 100·x·[−1,−1] − 100·x·[−2,−2]) = sign(600(x₁ + x₂)), so the decision boundary is the line x₁ + x₂ = 0.
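This arithmetic is easy to check numerically: with the linear kernel and b = 0, the learned decision function collapses to a single linear form in the query point:

```python
data = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]]
alpha = [100.0, 100.0, -100.0, -100.0]

def f(x):
    # f(x) = sum_i alpha_i * (x_i . x), with bias b = 0.
    return sum(a * (xi[0] * x[0] + xi[1] * x[1]) for a, xi in zip(alpha, data))

print(f([1.0, 0.0]))   # 600.0 -> f(x) = 600 * (x1 + x2)
print(f([1.0, -1.0]))  # 0.0   -> points with x1 + x2 = 0 lie on the boundary
```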
data = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]
labels = [1, 1, -1, -1]
kernel = "linear"
lambda_val = 0.1
iterations = 20

Output:

alpha = [10.0, 10.0, -10.0, -10.0]
b = 0.0

A higher regularization parameter (λ=0.1) with fewer iterations produces smaller coefficient magnitudes.
Effect of Regularization:
Data Configuration:
Perfect Symmetry: The symmetric arrangement again yields b=0.0, with the separating line passing through the origin at a 45° angle (the line y = -x separates the classes).
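This symmetry can be verified numerically with the linear kernel and b = 0:

```python
data = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]
alpha = [10.0, 10.0, -10.0, -10.0]

def f(x):
    # f(x) = sum_i alpha_i * (x_i . x), with bias b = 0.
    return sum(a * (xi[0] * x[0] + xi[1] * x[1]) for a, xi in zip(alpha, data))

print(f([1.0, -1.0]))  # 0.0  -> the line y = -x is the decision boundary
print(f([1.0, 1.0]))   # 40.0 -> positive side of the boundary
```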
Constraints