The kernel trick enables SVMs to find nonlinear decision boundaries—but what does this mean geometrically? How do different kernels shape the decision surface? And how do we diagnose and control the complexity of these boundaries?
In this page, we take a deep dive into the geometry of nonlinear classification. We'll build intuition through visualization, understand the bias-variance tradeoff in kernel space, and develop practical diagnostic skills for identifying when a model is too simple or too complex.
Understanding nonlinear boundaries is essential for deploying kernel SVMs effectively. Without this understanding, tuning hyperparameters becomes guesswork, and debugging poor performance is nearly impossible.
By the end of this page, you will: (1) Visualize how kernels transform decision boundaries from linear to nonlinear, (2) Understand the relationship between kernel parameters and boundary complexity, (3) Diagnose underfitting and overfitting in kernel SVM, and (4) Apply strategies to control model complexity.
The fundamental insight of kernel methods is that a linear boundary in feature space becomes a nonlinear boundary in input space. Let's trace this transformation step by step.
Input space: Our original data lies in $\mathbb{R}^d$ (e.g., 2D points on a plane).
Feature map: The kernel implicitly applies a feature map $\phi: \mathbb{R}^d \to \mathcal{H}$ to a higher-dimensional (possibly infinite) space.
Linear boundary in $\mathcal{H}$: The SVM finds a hyperplane in $\mathcal{H}$: $$\langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b = 0$$
Nonlinear boundary in $\mathbb{R}^d$: When we project this hyperplane back to input space, it becomes a curved surface.
The feature map $\phi$ "stretches" and "bends" the input space. Points that are close in input space may end up far apart in feature space, and vice versa. A hyperplane that cuts through feature space at a particular orientation traces out a curve when viewed in the original coordinates.
Consider 2D data and the feature map: $$\phi(x_1, x_2) = (x_1^2, \sqrt{2}x_1 x_2, x_2^2)$$
A hyperplane in 3D feature space has equation: $$w_1 z_1 + w_2 z_2 + w_3 z_3 + b = 0$$
Substituting $\phi$: $$w_1 x_1^2 + w_2 \sqrt{2} x_1 x_2 + w_3 x_2^2 + b = 0$$
This is a conic section in the original 2D space—an ellipse, hyperbola, or parabola depending on the coefficients.
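As a quick sanity check (a minimal sketch; the helper `phi` is just an illustrative name), the explicit map above reproduces the homogeneous quadratic kernel, confirming that inner products in the 3D feature space can be computed directly in input space:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# Inner product in 3D feature space ...
lhs = phi(x) @ phi(z)
# ... equals the homogeneous quadratic kernel (x . z)^2 evaluated in input space
rhs = (x @ z) ** 2

print(f"<phi(x), phi(z)> = {lhs:.6f}")
print(f"(x . z)^2        = {rhs:.6f}")
assert np.isclose(lhs, rhs)
```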
Polynomial kernel of degree $p$ → Decision boundary is a polynomial of degree $p$ in input coordinates.
RBF kernel → Decision boundary can be any smooth curve (given enough support vectors).
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

def visualize_kernel_boundaries(X, y, title_prefix=""):
    """
    Compare decision boundaries for different kernels on the same data.
    """
    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Define kernels to compare
    kernels = [
        ('Linear', 'linear', {}),
        ('Poly (d=2)', 'poly', {'degree': 2, 'coef0': 1}),
        ('Poly (d=3)', 'poly', {'degree': 3, 'coef0': 1}),
        ('RBF (γ=0.5)', 'rbf', {'gamma': 0.5}),
        ('RBF (γ=2.0)', 'rbf', {'gamma': 2.0}),
        ('RBF (γ=10)', 'rbf', {'gamma': 10.0}),
    ]

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.ravel()

    # Create mesh for decision boundary visualization
    x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
    y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))

    for ax, (name, kernel, params) in zip(axes, kernels):
        # Train SVM
        clf = SVC(kernel=kernel, C=1.0, **params)
        clf.fit(X_scaled, y)

        # Compute decision function on mesh
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Plot decision function, boundary (level 0), and margins (levels ±1)
        ax.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 20),
                    cmap='RdBu', alpha=0.6)
        ax.contour(xx, yy, Z, levels=[0], colors='black', linewidths=2)
        ax.contour(xx, yy, Z, levels=[-1, 1], colors='black',
                   linewidths=1, linestyles='dashed')

        # Plot data points
        ax.scatter(X_scaled[y == 1, 0], X_scaled[y == 1, 1], c='blue',
                   edgecolors='white', s=50)
        ax.scatter(X_scaled[y == -1, 0], X_scaled[y == -1, 1], c='red',
                   edgecolors='white', s=50)

        # Mark support vectors
        ax.scatter(X_scaled[clf.support_, 0], X_scaled[clf.support_, 1],
                   s=200, facecolors='none', edgecolors='green', linewidths=2)

        ax.set_title(f'{name}\n(#SV: {len(clf.support_)})')
        ax.set_xlim(x_min, x_max)
        ax.set_ylim(y_min, y_max)

    plt.suptitle(f'{title_prefix}Decision Boundary Comparison', fontsize=14)
    plt.tight_layout()
    return fig

# Generate XOR-like data (classic nonlinearly separable)
np.random.seed(42)
n = 100
X_xor = np.vstack([
    np.random.randn(n, 2) * 0.5 + [1, 1],
    np.random.randn(n, 2) * 0.5 + [-1, -1],
    np.random.randn(n, 2) * 0.5 + [1, -1],
    np.random.randn(n, 2) * 0.5 + [-1, 1],
])
y_xor = np.array([1]*n + [1]*n + [-1]*n + [-1]*n)

# Produce the 2x3 grid:
# - Linear: straight line (will fail on XOR)
# - Poly d=2: curved boundary
# - Poly d=3: more complex curve
# - RBF with increasing gamma: increasingly complex boundaries
fig = visualize_kernel_boundaries(X_xor, y_xor, title_prefix="XOR Data: ")
plt.show()

print("The visualization shows:")
print("- Linear kernel: A single straight line (fails on XOR)")
print("- Polynomial: Curved boundaries, smoother for low degree")
print("- RBF low gamma: Smooth, broad curves")
print("- RBF high gamma: Tight, localized boundaries around points")
print()
print("Key insight: As gamma increases in RBF, boundary becomes more")
print("complex and 'hugs' individual data points more closely.")
```

The RBF kernel produces the most varied and flexible decision boundaries. Let's understand how they form.
For the RBF kernel, the decision function is: $$f(\mathbf{x}) = \sum_{i \in \mathcal{S}} \alpha_i y_i \exp\left(-\gamma\|\mathbf{x} - \mathbf{x}_i\|^2\right) + b$$
Each support vector contributes a "bump" centered at $\mathbf{x}_i$: a Gaussian scaled by $\alpha_i y_i$, whose width is set by $\gamma$. Positive-class support vectors push the decision function up; negative-class support vectors push it down.
The decision boundary is where these bumps sum to zero.
Think of the RBF SVM decision function as a landscape of hills (positive class regions) and valleys (negative class regions). Each support vector anchors a Gaussian-shaped hill or valley. The decision boundary is the shore—where the land meets water at sea level.
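To make the sum-of-bumps picture concrete, the following sketch (assuming scikit-learn's `SVC`, whose `dual_coef_` holds $\alpha_i y_i$ and `support_vectors_` holds the $\mathbf{x}_i$; the toy dataset is illustrative) rebuilds the decision function by hand and checks it against `decision_function`:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Fit an RBF SVM on a small toy problem
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 0.5
clf = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X, y)

def rbf_decision_by_hand(clf, X_query, gamma):
    """Sum of Gaussian 'bumps': f(x) = sum_i (alpha_i y_i) exp(-gamma ||x - x_i||^2) + b."""
    sv = clf.support_vectors_              # support vectors x_i
    coef = clf.dual_coef_.ravel()          # alpha_i * y_i for each support vector
    sq_dists = ((X_query[:, None, :] - sv[None, :, :]) ** 2).sum(axis=2)
    bumps = np.exp(-gamma * sq_dists)      # one Gaussian bump per support vector
    return bumps @ coef + clf.intercept_[0]

manual = rbf_decision_by_hand(clf, X[:5], gamma)
builtin = clf.decision_function(X[:5])
print(np.allclose(manual, builtin))  # True: the hand-built sum matches sklearn
```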
Small $\gamma$: Wide bumps. Each support vector influences a large region, the contributions of many support vectors overlap, and the resulting boundary is smooth and slowly varying.
Large $\gamma$: Narrow bumps. Each support vector influences only its immediate neighborhood, so the boundary can bend sharply and wrap around individual points.
$\gamma \to 0$: every bump becomes nearly flat across the data, the decision function varies very slowly, and the model behaves much like a linear classifier no matter how complex the data is (severe underfitting).
$\gamma \to \infty$: each bump collapses to a spike at its support vector, so the model can memorize the training set, acting like a nearest-neighbor lookup that draws a tiny island around each point (severe overfitting).
| Boundary Characteristic | Diagnosis | Action |
|---|---|---|
| Boundary too smooth, ignores complex patterns | $\gamma$ too small | Increase $\gamma$ |
| Boundary forms isolated islands around points | $\gamma$ too large | Decrease $\gamma$ |
| Most/all training points are support vectors | $\gamma$ too large | Decrease $\gamma$ |
| Very few support vectors, underfitting | $\gamma$ too small (or data is easy) | Increase $\gamma$ or use linear |
| Boundary oscillates wildly | $\gamma$ too large | Decrease $\gamma$ significantly |
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_moons

def analyze_gamma_effect(X, y, gamma_range):
    """
    Analyze how gamma affects model behavior.
    """
    results = []

    for gamma in gamma_range:
        clf = SVC(kernel='rbf', gamma=gamma, C=1.0)
        clf.fit(X, y)

        # Training accuracy
        train_acc = clf.score(X, y)

        # Cross-validation accuracy (generalization estimate)
        cv_scores = cross_val_score(clf, X, y, cv=5)
        cv_acc = cv_scores.mean()
        cv_std = cv_scores.std()

        # Number of support vectors
        n_sv = len(clf.support_)
        sv_fraction = n_sv / len(y)

        results.append({
            'gamma': gamma,
            'train_acc': train_acc,
            'cv_acc': cv_acc,
            'cv_std': cv_std,
            'n_sv': n_sv,
            'sv_fraction': sv_fraction
        })

    return results

# Example analysis
np.random.seed(42)

# Create a moderately complex dataset
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
y = 2*y - 1  # Convert to {-1, +1}

gamma_range = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0, 100.0]
results = analyze_gamma_effect(X, y, gamma_range)

print("Effect of gamma on RBF SVM")
print("=" * 70)
print(f"{'gamma':>8} {'Train Acc':>10} {'CV Acc':>10} {'CV Std':>8} {'#SV':>6} {'SV %':>8}")
print("-" * 70)

for r in results:
    print(f"{r['gamma']:>8.3f} {r['train_acc']:>10.3f} {r['cv_acc']:>10.3f} "
          f"{r['cv_std']:>8.3f} {r['n_sv']:>6d} {r['sv_fraction']:>8.1%}")

print()
print("Interpretation:")
print("- Very small gamma: Low train acc, low CV acc → underfitting")
print("- Optimal gamma: High CV acc, reasonable #SV → good generalization")
print("- Very large gamma: Perfect train acc, lower CV acc → overfitting")
print("- #SV approaching N: Danger sign for overfitting")
```

Polynomial kernels produce mathematically predictable boundary shapes—polynomials of the specified degree in the original coordinates.
For 2D data with polynomial kernel of degree 2, the decision boundary is: $$a x_1^2 + b x_1 x_2 + c x_2^2 + d x_1 + e x_2 + f = 0$$
This is the general equation of a conic section: an ellipse, parabola, or hyperbola (or a degenerate form such as a pair of lines), depending on the sign of the discriminant $b^2 - 4ac$.
The SVM optimization determines the specific coefficients based on the training data.
Degree-3 polynomial boundaries can have: inflection points, asymmetric lobes, and even two disconnected pieces, letting them fit patterns that no conic can capture.
As degree increases: the boundary can bend in more places, the implicit feature space grows rapidly, the fit becomes more sensitive to individual training points, and the risk of overfitting rises accordingly.
The offset parameter $c$ in $(\mathbf{x}^T\mathbf{z} + c)^p$ affects the mix of polynomial terms. With $c = 0$ (homogeneous polynomial), only degree-$p$ terms appear. With $c > 0$, terms of all degrees ≤ $p$ are included. For most problems, $c = 1$ is a good default—it allows the model to learn simpler patterns if they suffice.
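To see the effect in code (a sketch using scikit-learn's `SVC`, where the offset above corresponds to the `coef0` argument and the kernel is computed as $(\gamma\,\mathbf{x}^\top\mathbf{z} + \texttt{coef0})^p$; the dataset is illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=400, noise=0.25, random_state=1)

for coef0 in [0.0, 1.0]:
    # coef0=0 -> homogeneous cubic kernel (only degree-3 terms)
    # coef0=1 -> inhomogeneous cubic kernel (all terms of degree <= 3)
    clf = SVC(kernel='poly', degree=3, coef0=coef0, gamma=1.0, C=1.0)
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()
    clf.fit(X, y)
    print(f"coef0={coef0:.0f}: CV accuracy={cv_acc:.3f}, #SV={len(clf.support_)}")
```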
| Degree | Boundary Type (2D) | Typical Use Case | Risk Level |
|---|---|---|---|
| 1 | Straight line | Linearly separable data | Low (underfitting possible) |
| 2 | Conic section | Quadratic relationships | Low-medium |
| 3 | Cubic curve | Smooth nonlinear patterns | Medium |
| 4 | Quartic curve | Complex smooth patterns | Medium-high |
| 5+ | Higher-order polynomial | Rarely needed | High (overfitting likely) |
To build intuition:
Degree 2 (quadratic): the boundary is a conic section; it can enclose a single convex region (an ellipse around one class) or split the plane with one smooth curved cut.
Degree 3 (cubic): the boundary can include S-shaped bends and asymmetric lobes, useful when one class partially wraps around the other.
Degree 4+ (quartic and higher): the boundary can carve out multiple separate pockets; this much flexibility is rarely needed, and at that point an RBF kernel is usually the more practical choice. The short experiment below makes these differences measurable.
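This sketch (on synthetic two-moons data; the dataset, `gamma`, and `C` choices are illustrative) typically shows training accuracy climbing with degree while cross-validated accuracy plateaus or drops, matching the risk column in the table above:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)

print(f"{'degree':>6} {'train acc':>10} {'CV acc':>8} {'#SV':>6}")
for degree in [1, 2, 3, 4, 5, 6]:
    clf = SVC(kernel='poly', degree=degree, coef0=1.0, gamma=1.0, C=1.0)
    cv_acc = cross_val_score(clf, X, y, cv=5).mean()   # generalization estimate
    clf.fit(X, y)                                      # refit for train acc and #SV
    print(f"{degree:>6d} {clf.score(X, y):>10.3f} {cv_acc:>8.3f} {len(clf.support_):>6d}")
```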
The bias-variance tradeoff—fundamental to all machine learning—manifests distinctly in kernel SVMs through the interplay of kernel choice, hyperparameters, and the regularization parameter $C$.
1. Kernel Complexity (inherent flexibility): the kernel family caps how intricate the boundary can be; linear is least flexible, polynomial more so, and RBF the most.
2. Kernel Hyperparameters ($\gamma$ for RBF, degree for polynomial): these tune flexibility within the chosen family, with larger values permitting more intricate boundaries.
3. Regularization Parameter $C$: controls how heavily margin violations are penalized; larger $C$ pushes the fit toward the flexible end of whatever the kernel allows.
The effects of $C$ and $\gamma$ are not independent. A model with large $\gamma$ (flexible) and small $C$ (strong regularization) behaves differently than one with small $\gamma$ and large $C$. Grid search over both parameters simultaneously is essential.
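A small numeric illustration of this coupling (a sketch with arbitrary parameter grids on two-moons data) shows that the cross-validated score of a given $\gamma$ depends on which $C$ accompanies it:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)

gammas = [0.01, 0.1, 1.0, 10.0]
Cs = [0.1, 1.0, 100.0]

# Rows: gamma; columns: C. Notice how the best gamma shifts as C changes.
print("CV accuracy (rows: gamma, cols: C)")
print(" " * 8 + "".join(f"C={c:<8g}" for c in Cs))
for g in gammas:
    scores = [cross_val_score(SVC(kernel='rbf', gamma=g, C=c), X, y, cv=5).mean()
              for c in Cs]
    print(f"g={g:<6g}" + "".join(f"{s:<10.3f}" for s in scores))
```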
High Bias (Underfitting): training and validation accuracy are both low and close together; the boundary is too smooth to capture real structure in the data.
Solutions: increase $\gamma$ (RBF) or the polynomial degree, increase $C$, switch from a linear to a nonlinear kernel, or add more informative features.
High Variance (Overfitting): training accuracy is near perfect while validation accuracy lags well behind; the boundary hugs individual points and a large fraction of the training set becomes support vectors.
Solutions: decrease $\gamma$ or the polynomial degree, decrease $C$, switch to a simpler kernel, or gather more training data.
| Symptom | Likely Cause | Parameter Adjustment |
|---|---|---|
| Low train acc, low val acc | Underfitting (high bias) | ↑ $\gamma$, ↑ degree, ↑ $C$, or switch to RBF |
| High train acc, low val acc | Overfitting (high variance) | ↓ $\gamma$, ↓ degree, ↓ $C$, or switch to linear |
| Both accuracies high | Good fit | Fine-tune for marginal gains |
| #SV ≈ N (all points) | Severe overfitting | Significantly ↓ $\gamma$ or ↓ $C$ |
| #SV very small | Possibly underfitting | Check if data is actually easy; if not, ↑ $C$ |
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

def diagnose_bias_variance(X, y, kernel='rbf', gamma=1.0, C=1.0):
    """
    Diagnose bias vs variance problem for a kernel SVM configuration.
    """
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Train model
    clf = SVC(kernel=kernel, gamma=gamma, C=C)
    clf.fit(X_train, y_train)

    # Compute metrics
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    n_sv = len(clf.support_)
    sv_fraction = n_sv / len(y_train)

    # Diagnosis
    gap = train_acc - test_acc

    print(f"Configuration: kernel={kernel}, gamma={gamma}, C={C}")
    print(f"  Training accuracy:  {train_acc:.3f}")
    print(f"  Test accuracy:      {test_acc:.3f}")
    print(f"  Gap (train - test): {gap:.3f}")
    print(f"  Support vectors: {n_sv} ({sv_fraction:.1%})")

    # Automatic diagnosis
    if train_acc < 0.7:
        print("  → Diagnosis: HIGH BIAS (underfitting)")
        print("    Recommendation: ↑ C, ↑ gamma, or use more complex kernel")
    elif gap > 0.15:
        print("  → Diagnosis: HIGH VARIANCE (overfitting)")
        print("    Recommendation: ↓ C, ↓ gamma, or use simpler kernel")
    elif sv_fraction > 0.7:
        print("  → Diagnosis: Warning - too many support vectors")
        print("    Recommendation: ↓ gamma or ↓ C")
    else:
        print("  → Diagnosis: Reasonable fit")
    print()

# Generate dataset with some noise
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=2, n_classes=2, flip_y=0.1,
                           random_state=42)
y = 2*y - 1  # Convert to {-1, +1}

print("=" * 60)
print("BIAS-VARIANCE DIAGNOSIS EXAMPLES")
print("=" * 60)

# Case 1: High bias (underfitting)
print("\nCase 1: Very smooth model (likely underfitting)")
diagnose_bias_variance(X, y, kernel='rbf', gamma=0.001, C=0.1)

# Case 2: Balanced
print("\nCase 2: Well-tuned model")
diagnose_bias_variance(X, y, kernel='rbf', gamma=0.1, C=1.0)

# Case 3: High variance (overfitting)
print("\nCase 3: Very flexible model (likely overfitting)")
diagnose_bias_variance(X, y, kernel='rbf', gamma=10.0, C=100.0)

# Case 4: Linear kernel on nonlinear data
print("\nCase 4: Linear kernel (check if appropriate)")
diagnose_bias_variance(X, y, kernel='linear', gamma=1.0, C=1.0)
```

For RBF kernel SVM, the two hyperparameters $C$ and $\gamma$ together control model complexity. Understanding their interaction is crucial for effective tuning.
$C$ (Regularization): sets the penalty for margin violations. Small $C$ tolerates misclassified and in-margin points in exchange for a wider, smoother fit; large $C$ insists on classifying training points correctly, allowing a tighter, more complex fit.
$\gamma$ (Kernel Width): sets the reach of each support vector's influence (the Gaussian bump has a length scale on the order of $1/\sqrt{\gamma}$). Small $\gamma$ gives broad influence and smooth boundaries; large $\gamma$ gives local influence and highly flexible boundaries.
The two parameters interact, and the effective model complexity depends on both jointly:
| | Small $C$ | Medium $C$ | Large $C$ |
|---|---|---|---|
| Small $\gamma$ | Very simple (underfit) | Simple | Moderately complex |
| Medium $\gamma$ | Simple | Balanced | Complex |
| Large $\gamma$ | Moderately complex | Complex | Very complex (overfit) |
Typically, optimal configurations lie along a "ridge" in the $(C, \gamma)$ space—neither extreme. Because increasing $C$ and increasing $\gamma$ both push the model toward a more flexible fit, staying on the ridge usually means raising one while lowering the other. Grid search helps find this ridge, and understanding the interaction guides smarter search strategies.
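To see the ridge directly, one option is to plot cross-validated accuracy over a log-spaced $(C, \gamma)$ grid as a heatmap. The sketch below (grid values and dataset are illustrative) pairs naturally with the staged search that follows:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=400, noise=0.25, random_state=42)

C_values = np.logspace(-2, 3, 6)       # 0.01 ... 1000
gamma_values = np.logspace(-3, 1, 5)   # 0.001 ... 10

# CV accuracy for every (gamma, C) pair
scores = np.array([[cross_val_score(SVC(kernel='rbf', C=C, gamma=g), X, y, cv=5).mean()
                    for C in C_values]
                   for g in gamma_values])

plt.figure(figsize=(7, 5))
plt.imshow(scores, origin='lower', cmap='viridis', aspect='auto')
plt.colorbar(label='CV accuracy')
plt.xticks(range(len(C_values)), [f'{c:g}' for c in C_values])
plt.yticks(range(len(gamma_values)), [f'{g:g}' for g in gamma_values])
plt.xlabel('C')
plt.ylabel('gamma')
plt.title('CV accuracy over the (C, gamma) grid: good settings form a diagonal ridge')
plt.tight_layout()
plt.show()
```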
Stage 1: Coarse Grid Search. Sweep wide, logarithmically spaced ranges (e.g., $C \in \{10^{-2}, \dots, 10^{3}\}$, $\gamma \in \{10^{-3}, \dots, 10^{1}\}$) with a small number of CV folds to cheaply locate the promising region.
Stage 2: Fine Grid Search. Zoom in around the best coarse configuration with a tighter multiplicative grid and more CV folds to refine the choice.
Stage 3: Final Validation. Retrain with the selected parameters on the full training set and evaluate once on a held-out test set that played no role in tuning.
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Generate data
np.random.seed(42)
X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
y = 2*y - 1

# Scale features (critical for RBF!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Stage 1: Coarse grid search
print("Stage 1: Coarse Grid Search")
print("=" * 50)

param_grid_coarse = {
    'C': [0.01, 0.1, 1, 10, 100, 1000],
    'gamma': [0.001, 0.01, 0.1, 1, 10]
}

grid_coarse = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid_coarse,
    cv=3,
    scoring='accuracy',
    return_train_score=True
)
grid_coarse.fit(X_scaled, y)

print(f"Best params: {grid_coarse.best_params_}")
print(f"Best CV score: {grid_coarse.best_score_:.3f}")

# Show top 5 configurations
results = grid_coarse.cv_results_
top_indices = np.argsort(results['mean_test_score'])[-5:][::-1]
print("\nTop 5 configurations:")
for idx in top_indices:
    print(f"  C={results['param_C'][idx]:>7}, γ={results['param_gamma'][idx]:>6} "
          f"→ CV: {results['mean_test_score'][idx]:.3f} ± {results['std_test_score'][idx]:.3f}")

# Stage 2: Fine grid search around best
best_C = grid_coarse.best_params_['C']
best_gamma = grid_coarse.best_params_['gamma']

print(f"\nStage 2: Fine Grid Search around C={best_C}, γ={best_gamma}")
print("=" * 50)

param_grid_fine = {
    'C': [best_C * f for f in [0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0]],
    'gamma': [best_gamma * f for f in [0.3, 0.5, 0.7, 1.0, 1.5, 2.0, 3.0]]
}

grid_fine = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid_fine,
    cv=5,  # More folds for finer search
    scoring='accuracy',
    return_train_score=True
)
grid_fine.fit(X_scaled, y)

print(f"Best params: C={grid_fine.best_params_['C']:.3f}, "
      f"γ={grid_fine.best_params_['gamma']:.4f}")
print(f"Best CV score: {grid_fine.best_score_:.3f}")

# Final model analysis
final_model = grid_fine.best_estimator_
print(f"\nFinal Model:")
print(f"  Number of support vectors: {len(final_model.support_)}")
print(f"  SV fraction: {len(final_model.support_)/len(y):.1%}")
```

The number and distribution of support vectors provide valuable diagnostics for model complexity and generalization.
Number of Support Vectors: the count of training points with $\alpha_i > 0$. More support vectors mean a more flexible decision function and slower predictions, since every support vector must be evaluated at test time.
Rule of thumb: If more than 30-50% of training points are support vectors, the model may be overfitting.
Types of Support Vectors: margin support vectors ($0 < \alpha_i < C$) lie exactly on the margin, while bounded support vectors ($\alpha_i = C$) lie inside the margin or on the wrong side of the boundary.
Many bounded SVs suggest either: substantial class overlap or label noise in the data, or a value of $C$ too small to enforce the separation the data actually supports.
Margin SVs are the "prototypical" examples that define class boundaries. Bounded SVs are the "difficult" or "noisy" examples. Examining bounded SVs often reveals outliers, labeling errors, or inherently ambiguous cases.
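With scikit-learn, the split can be read off a fitted `SVC`, since `dual_coef_` stores $y_i \alpha_i$ for each support vector and its absolute value equals $\alpha_i$. A minimal sketch (dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=7)

C = 1.0
clf = SVC(kernel='rbf', gamma=1.0, C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_.ravel())   # |y_i * alpha_i| = alpha_i
bounded = np.isclose(alphas, C)           # alpha_i == C      -> bounded SV
margin = ~bounded                         # 0 < alpha_i < C   -> margin SV

print(f"Support vectors: {len(alphas)} of {len(y)} training points")
print(f"  Margin SVs  (0 < alpha < C): {margin.sum()}")
print(f"  Bounded SVs (alpha == C):    {bounded.sum()}")

# Bounded SVs are the 'difficult' points: inspect them for outliers or label noise
difficult_points = clf.support_[bounded]
print(f"Indices of a few difficult points: {difficult_points[:10]}")
```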
Statistical learning theory provides bounds on generalization error that involve the number of support vectors:
$$\mathbb{E}[\text{LOO error}] \leq \frac{\mathbb{E}[|\mathcal{S}|]}{n}$$
where $|\mathcal{S}|$ is the number of support vectors, $n$ is the training set size, the expectations are taken over training sets, and the LOO error is the leave-one-out cross-validation error.
This bound suggests: models that rely on fewer support vectors (relative to $n$) are expected to generalize better, so the support-vector fraction is a cheap proxy for generalization when comparing configurations.
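As a quick empirical check of the bound (a sketch on a deliberately small dataset so exact leave-one-out is cheap; all parameter choices are illustrative), compare the measured LOO error with the support-vector fraction:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_moons(n_samples=120, noise=0.3, random_state=3)

for gamma in [0.1, 1.0, 10.0]:
    clf = SVC(kernel='rbf', gamma=gamma, C=1.0)

    # Exact leave-one-out error (n fits, each on n-1 points)
    loo_error = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()

    # Support-vector fraction of the model trained on all the data
    clf.fit(X, y)
    sv_fraction = len(clf.support_) / len(y)

    # The SV fraction upper-bounds the LOO error in expectation
    print(f"gamma={gamma:>5}: LOO error={loo_error:.3f}   #SV/n={sv_fraction:.3f}")
```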
When tuning hyperparameters: among configurations with similar cross-validation accuracy, prefer the one with fewer support vectors, and treat a sudden jump in the SV fraction as a sign that the model has crossed into overfitting.
| Metric | Low Value | High Value |
|---|---|---|
| # SV / $n$ | Simple model, may underfit | Complex model, may overfit |
| # Bounded SV / # SV | Clean margins, good separation | Noisy data or overlapping classes |
| Change in # SV as $\gamma$ varies | Model robust to $\gamma$ | Model sensitive to $\gamma$ |
| Largest class share of SVs | Balanced contribution across classes | Imbalanced (check class distribution) |
We've explored the geometry and diagnostics of nonlinear decision boundaries in kernel SVMs—essential knowledge for effective model development.
In the final page of this module, we'll explore kernel selection—systematic approaches to choosing between kernels for a given problem, including empirical methods, domain knowledge guidance, and hybrid strategies.
You now have a deep understanding of how kernels shape nonlinear decision boundaries, how to diagnose underfitting versus overfitting, and how the key hyperparameters $C$ and $\gamma$ interact. This knowledge is essential for deploying kernel SVMs successfully.