Of all the decisions you make when training an SVM, the choice of C is arguably the most consequential. This single scalar controls the fundamental trade-off between two competing objectives: maximizing the margin and minimizing training violations (slack).
Set C too small, and your SVM ignores the training data, finding a wide margin that misclassifies many examples. Set C too large, and your SVM overfits, twisting the decision boundary to classify every training point correctly at the expense of a narrow, fragile margin.
Understanding C is not optional—it's the difference between an SVM that generalizes brilliantly and one that fails spectacularly.
This page provides a complete treatment of the C parameter—its mathematical meaning, geometric effects, extreme behaviors, selection strategies, and practical guidelines. You will develop intuition for how C affects the decision boundary and how to choose appropriate values for your problems.
Recall the soft margin SVM optimization problem:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$
$$\text{s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
The parameter $C > 0$ is a weighting factor that controls the relative importance of the two terms:
Larger C makes violations more expensive, pushing the solution to fit the training data more closely. Smaller C makes violations cheaper, favoring a wider margin.
C can be viewed as 'penalty per unit slack' (how much we pay for each unit of margin violation) OR as the inverse of regularization strength (C = 1/λ in the regularized risk formulation). Both interpretations are valid and useful in different contexts.
Alternative formulation (regularization view):
The soft margin objective can be rewritten as:
$$\min_{\mathbf{w}, b} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)) + \lambda\|\mathbf{w}\|^2$$
where $\lambda = \frac{1}{2nC}$.
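To see why, note that at the optimum each slack variable equals the hinge loss, $\xi_i = \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$. Dividing the soft margin objective by $nC$ (which does not change the minimizer) gives

$$\frac{1}{nC}\left[\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i\right] = \frac{1}{n}\sum_{i=1}^n \max\big(0,\ 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big) + \frac{1}{2nC}\|\mathbf{w}\|^2,$$

which is exactly the regularized form with $\lambda = \frac{1}{2nC}$.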
In this form, $\lambda$ is the regularization strength: large $\lambda$ (small C) shrinks $\mathbf{w}$ aggressively toward a simple, wide-margin solution, while small $\lambda$ (large C) lets the hinge loss dominate and fits the training data closely.
The relationship $C = 1/(2n\lambda)$ shows that C and λ are inversely related. This connects SVM to the broad family of regularized empirical risk minimizers.
Dimensionless interpretation:
Consider the ratio of the two objective terms: $$\text{Ratio} = \frac{C\sum_i \xi_i}{\frac{1}{2}\|\mathbf{w}\|^2}$$
When C is large, this ratio is large, meaning the optimization prioritizes minimizing slack. When C is small, the ratio is small, meaning the optimization prioritizes minimizing $\|\mathbf{w}\|^2$ (maximizing margin).
| Interpretation | Mathematical Expression | Effect of Large C |
|---|---|---|
| Slack penalty | C × Σξᵢ | Violations are expensive → fit training data |
| Inverse regularization | C = 1/(2nλ) | Weak regularization → complex model |
| Margin-error balance | Balance term in objective | Prioritize low training error over wide margin |
Understanding the behavior at extreme values of C provides crucial intuition.
Case 1: C → ∞ (Hard Margin Limit)
As C approaches infinity, any nonzero slack becomes prohibitively expensive. If the data are linearly separable, the optimizer drives every slack variable to zero:
$$C \to \infty \Rightarrow \xi_i = 0 \ \forall i$$
The constraints become: $$y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \ \forall i$$
This is exactly the hard margin SVM: in the limit of infinite C, the soft margin formulation reduces to the hard margin one (and, like it, admits no zero-slack solution when the classes overlap).
Implications:
Very large C often leads to overfitting. The model will sacrifice margin width to correctly classify every training point, including noisy or mislabeled examples. The resulting narrow margin generalizes poorly to new data.
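You can check this limit numerically: fit a linear SVM with a very large C on linearly separable data and verify that no training point incurs slack. The sketch below uses scikit-learn's SVC; the toy blobs and the value C = 10⁶ are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-3.0, -3.0], scale=0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

# Very large C approximates the hard margin SVM
svm = SVC(kernel='linear', C=1e6)
svm.fit(X, y)

w, b = svm.coef_[0], svm.intercept_[0]
functional_margin = y * (X @ w + b)
slack = np.maximum(0, 1 - functional_margin)

print(f"Minimum functional margin: {functional_margin.min():.4f}")  # approximately 1
print(f"Total slack: {slack.sum():.6f}")                            # approximately 0
print(f"Geometric margin: {1 / np.linalg.norm(w):.4f}")
```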
Case 2: C → 0 (Regularization Dominant)
As C approaches zero, the slack penalty vanishes:
$$C \to 0 \Rightarrow \min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2$$
The optimization effectively ignores the margin constraints, since violating them costs almost nothing. The minimizer is $\mathbf{w} = \mathbf{0}$: the "margin" is infinitely wide, but the classifier has no discriminative power and assigns every point to the same class.
Implications: very small C leads to underfitting. The model pays almost no attention to the training data, the decision boundary is determined by the regularizer rather than the examples, and many points end up misclassified.
The sweet spot:
Optimal C lies between these extremes—large enough to respect the training data, small enough to maintain a healthy margin. Finding this balance is the art of hyperparameter tuning.
The parameter C has profound geometric effects on the learned decision boundary. Let's visualize and understand these effects.
Effect 1: Margin width
Recall that the geometric margin is $\gamma = 1/\|\mathbf{w}\|$. Increasing C penalizes slack more heavily, which typically forces a larger $\|\mathbf{w}\|$ and therefore a narrower margin; decreasing C lets the margin widen at the cost of more violations.
This is the fundamental trade-off: wide margins generalize better but may misclassify more training points.
Effect 2: Decision boundary position
The position and orientation of the decision boundary change with C: with small C, the boundary is placed to balance the broad mass of each class and is barely moved by individual points; with large C, individual points near the boundary (including outliers and mislabeled examples) can pull it substantially toward themselves.
Effect 3: Number of support vectors
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC


def visualize_c_effect(X, y, C_values, figsize=(15, 5)):
    """
    Visualize the effect of C parameter on SVM decision boundaries.
    """
    fig, axes = plt.subplots(1, len(C_values), figsize=figsize)

    for ax, C in zip(axes, C_values):
        # Train SVM
        svm = SVC(kernel='linear', C=C)
        svm.fit(X, y)

        # Get decision boundary
        w = svm.coef_[0]
        b = svm.intercept_[0]
        margin = 1 / np.linalg.norm(w)

        # Create mesh
        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                             np.linspace(y_min, y_max, 200))
        Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Plot decision function contours
        ax.contourf(xx, yy, Z, levels=np.linspace(-2, 2, 20), cmap='RdBu', alpha=0.3)
        ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['blue', 'black', 'red'],
                   linestyles=['--', '-', '--'], linewidths=2)

        # Plot data points
        ax.scatter(X[y == 1, 0], X[y == 1, 1], c='red', edgecolors='black', s=60, label='+1')
        ax.scatter(X[y == -1, 0], X[y == -1, 1], c='blue', edgecolors='black', s=60, label='-1')

        # Highlight support vectors
        ax.scatter(svm.support_vectors_[:, 0], svm.support_vectors_[:, 1],
                   facecolors='none', edgecolors='green', s=150, linewidths=2, label='SVs')

        # Compute statistics
        n_sv = len(svm.support_)
        train_acc = svm.score(X, y)

        ax.set_xlabel('x₁')
        ax.set_ylabel('x₂')
        ax.set_title(f'C = {C}\n'
                     f'Margin = {margin:.3f}, #SV = {n_sv}\n'
                     f'Train Acc = {train_acc:.1%}')
        ax.legend(loc='upper left', fontsize=8)

    plt.tight_layout()
    plt.savefig('c_parameter_effect.png', dpi=150)
    plt.show()


def analyze_c_effect(X, y, C_range=np.logspace(-3, 3, 100)):
    """
    Analyze how metrics change with C.
    """
    metrics = {
        'C': [], 'margin': [], 'n_support_vectors': [],
        'train_accuracy': [], 'n_violations': [], 'sum_slack': []
    }

    for C in C_range:
        svm = SVC(kernel='linear', C=C)
        svm.fit(X, y)

        w = svm.coef_[0]
        b = svm.intercept_[0]

        # Compute metrics
        margin = 1 / np.linalg.norm(w)
        n_sv = len(svm.support_)
        train_acc = svm.score(X, y)

        # Compute slack
        functional_margin = y * (X @ w + b)
        slack = np.maximum(0, 1 - functional_margin)

        metrics['C'].append(C)
        metrics['margin'].append(margin)
        metrics['n_support_vectors'].append(n_sv)
        metrics['train_accuracy'].append(train_acc)
        metrics['n_violations'].append(np.sum(slack > 0))
        metrics['sum_slack'].append(np.sum(slack))

    # Plot
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    ax = axes[0, 0]
    ax.semilogx(metrics['C'], metrics['margin'], 'b-', linewidth=2)
    ax.set_xlabel('C')
    ax.set_ylabel('Margin Width (1/||w||)')
    ax.set_title('Margin vs C')
    ax.grid(True, alpha=0.3)

    ax = axes[0, 1]
    ax.semilogx(metrics['C'], metrics['n_support_vectors'], 'r-', linewidth=2)
    ax.set_xlabel('C')
    ax.set_ylabel('Number of Support Vectors')
    ax.set_title('Support Vectors vs C')
    ax.grid(True, alpha=0.3)

    ax = axes[1, 0]
    ax.semilogx(metrics['C'], metrics['train_accuracy'], 'g-', linewidth=2)
    ax.set_xlabel('C')
    ax.set_ylabel('Training Accuracy')
    ax.set_title('Training Accuracy vs C')
    ax.grid(True, alpha=0.3)

    ax = axes[1, 1]
    ax.semilogx(metrics['C'], metrics['sum_slack'], 'm-', linewidth=2)
    ax.set_xlabel('C')
    ax.set_ylabel('Sum of Slack Variables')
    ax.set_title('Total Slack vs C')
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('c_metrics_analysis.png', dpi=150)
    plt.show()

    return metrics


# Example with overlapping dataset
np.random.seed(42)
n = 100

# Create overlapping classes
X_pos = np.random.randn(n//2, 2) * 0.8 + [1.5, 1.5]
X_neg = np.random.randn(n//2, 2) * 0.8 + [0, 0]
X = np.vstack([X_pos, X_neg])
y = np.array([1]*(n//2) + [-1]*(n//2))

# Visualize for different C values
visualize_c_effect(X, y, [0.01, 0.1, 1.0, 10.0, 100.0])

# Analyze trend
analyze_c_effect(X, y)
```

As C increases: the margin shrinks, fewer points become support vectors, training accuracy increases (until saturating), and the decision boundary becomes more 'wiggly' if using kernels. This visual pattern is a diagnostic for understanding your model's bias-variance trade-off.
The C parameter directly controls where SVM sits on the bias-variance spectrum.
Bias-variance decomposition (informal): expected test error splits into bias (systematic error from a model too simple to capture the true boundary), variance (sensitivity of the learned boundary to the particular training sample), and irreducible noise.
Effect of C:
| C Value | Margin | Model Complexity | Bias | Variance |
|---|---|---|---|---|
| Small | Wide | Low | High | Low |
| Large | Narrow | High | Low | High |
A wide margin is a form of regularization—it constrains the hypothesis space to simple boundaries. This reduces variance (less sensitivity to training data). A narrow margin allows complex boundaries, which can fit training data better (low bias) but may overfit (high variance).
Characteristics at different C values, from small C (high bias, low variance) through the optimal middle ground to large C (low bias, high variance), are summarized below:
| Aspect | Small C | Optimal C | Large C |
|---|---|---|---|
| Training error | Higher | Moderate | Lower/Zero |
| Test error | Higher (underfit) | Lowest | Higher (overfit) |
| Margin width | Wide | Appropriate | Narrow |
| Support vectors | Many | Moderate | Few |
| Sensitivity to outliers | Low | Moderate | High |
| Stability across samples | High | Moderate | Low |
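The variance column can be probed empirically by refitting the SVM on bootstrap resamples of the training set and measuring how often predictions on a fixed grid of points disagree across refits. This is a rough sketch under assumed toy data; the number of resamples and the disagreement measure are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Overlapping two-class data
n = 200
X = np.vstack([rng.normal([1.0, 1.0], 1.0, size=(n // 2, 2)),
               rng.normal([-1.0, -1.0], 1.0, size=(n // 2, 2))])
y = np.array([1] * (n // 2) + [-1] * (n // 2))

# Fixed evaluation grid
grid = rng.uniform(-4, 4, size=(500, 2))

def prediction_variance(C, n_boot=30):
    """Average disagreement of predictions across bootstrap refits."""
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap resample
        svm = SVC(kernel='linear', C=C)
        svm.fit(X[idx], y[idx])
        preds.append(svm.predict(grid))
    preds = np.array(preds)                        # shape (n_boot, n_grid)
    # Fraction of resamples disagreeing with the majority vote, averaged over the grid
    majority = np.sign(preds.sum(axis=0) + 1e-9)
    return np.mean(preds != majority)

for C in [0.01, 100.0]:
    print(f"C = {C:>6}: prediction instability = {prediction_variance(C):.3f}")
```

If the large-C fit is indeed in the high-variance regime, its instability number will be noticeably larger than the small-C one.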
Connection to VC dimension:
Statistical learning theory provides rigorous bounds on generalization error. For SVM, the margin $\gamma$ and the radius $R$ of the smallest sphere enclosing the data determine complexity:
$$\text{VC dimension} \leq \min\left(\frac{R^2}{\gamma^2}, d\right) + 1$$
where $d$ is the input dimension. Wider margin → lower VC dimension → better generalization bound. This is why preferring large margins (small C behavior) has theoretical justification.
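As a rough illustration, both ingredients of this bound can be estimated from a fitted linear SVM: $\gamma = 1/\|\mathbf{w}\|$ from the learned weights, and $R$ from the data (below approximated crudely as the maximum distance from the centroid rather than the true minimum enclosing sphere). The toy dataset is an assumption for the sketch, and the number is an ingredient of an upper bound, not a prediction of test error.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Toy data (illustrative)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)
y = 2 * y - 1

svm = SVC(kernel='linear', C=1.0).fit(X, y)

gamma_margin = 1.0 / np.linalg.norm(svm.coef_[0])        # geometric margin
R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))   # crude enclosing-sphere radius
d = X.shape[1]

vc_bound = min(R**2 / gamma_margin**2, d) + 1
print(f"margin = {gamma_margin:.3f}, R = {R:.3f}, VC bound ingredient = {vc_bound:.1f}")
```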
Choosing the right C is essential for good SVM performance. Here are the principal strategies.
Strategy 1: Grid Search with Cross-Validation
The most common approach is to search over a logarithmic grid of C values and select the one with best cross-validation performance:
```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline


def grid_search_c(X, y, C_range=None, cv=5):
    """
    Select C using grid search with cross-validation.
    """
    if C_range is None:
        C_range = np.logspace(-4, 4, 17)  # 17 values from 10^-4 to 10^4

    # Create pipeline with scaling (important for SVM!)
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='linear'))
    ])

    param_grid = {'svm__C': C_range}

    grid_search = GridSearchCV(
        pipeline, param_grid, cv=cv,
        scoring='accuracy', return_train_score=True, n_jobs=-1
    )
    grid_search.fit(X, y)

    # Results
    print(f"Best C: {grid_search.best_params_['svm__C']}")
    print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")

    return grid_search


def analyze_c_selection(grid_search):
    """
    Analyze and visualize grid search results.
    """
    import matplotlib.pyplot as plt

    results = grid_search.cv_results_
    C_values = results['param_svm__C'].data
    mean_train = results['mean_train_score']
    mean_test = results['mean_test_score']
    std_test = results['std_test_score']

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.semilogx(C_values, mean_train, 'b-o', label='Training Accuracy')
    ax.semilogx(C_values, mean_test, 'r-o', label='Validation Accuracy')
    ax.fill_between(C_values, mean_test - std_test, mean_test + std_test,
                    alpha=0.2, color='red')

    # Mark best C
    best_idx = np.argmax(mean_test)
    ax.axvline(C_values[best_idx], color='green', linestyle='--',
               label=f'Best C = {C_values[best_idx]:.2e}')

    ax.set_xlabel('C (log scale)', fontsize=12)
    ax.set_ylabel('Accuracy', fontsize=12)
    ax.set_title('C Selection via Cross-Validation', fontsize=14)
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('c_selection.png', dpi=150)
    plt.show()

    # Print detailed results around optimum
    print("\nDetailed Results Around Optimum:")
    print("-" * 50)
    start = max(0, best_idx - 3)
    end = min(len(C_values), best_idx + 4)
    for i in range(start, end):
        marker = " *" if i == best_idx else ""
        print(f"C = {C_values[i]:.2e}: "
              f"Train = {mean_train[i]:.4f}, "
              f"Val = {mean_test[i]:.4f} ± {std_test[i]:.4f}{marker}")


# Alternative: Bayesian optimization (more efficient for expensive fits)
def bayesian_c_optimization(X, y, n_calls=50):
    """
    Use Bayesian optimization to find optimal C.
    More sample-efficient than grid search for expensive models.
    """
    from skopt import BayesSearchCV
    from skopt.space import Real

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='linear'))
    ])

    opt = BayesSearchCV(
        pipeline,
        {'svm__C': Real(1e-4, 1e4, prior='log-uniform')},
        n_iter=n_calls, cv=5, n_jobs=-1, random_state=42
    )
    opt.fit(X, y)

    print(f"Bayesian Optimal C: {opt.best_params_['svm__C']:.4f}")
    print(f"Best CV Score: {opt.best_score_:.4f}")

    return opt


# Alternative: heuristic based on data scale
def heuristic_c_estimate(X, y):
    """
    Heuristic C estimate based on data scale.

    Common heuristic: C ≈ 1/mean(||x||)
    This scales C inversely with feature magnitudes.
    """
    # Scale data first
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Compute mean norm
    mean_norm = np.mean(np.linalg.norm(X_scaled, axis=1))

    # Heuristic C value
    C_heuristic = 1.0  # After scaling, C=1 is often reasonable

    print(f"Mean ||x|| (scaled): {mean_norm:.4f}")
    print(f"Heuristic C: {C_heuristic}")
    print("Recommendation: Search in range [0.01, 100] around this value")

    return C_heuristic


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Create dataset
    X, y = make_classification(
        n_samples=500, n_features=20, n_informative=10,
        n_redundant=5, n_clusters_per_class=2, random_state=42
    )
    y = 2 * y - 1  # Convert to {-1, +1}

    # Grid search
    grid_search = grid_search_c(X, y)
    analyze_c_selection(grid_search)

    # Heuristic
    heuristic_c_estimate(X, y)
```

Strategy 2: One Standard Error Rule
Instead of selecting the C with the highest validation accuracy, choose the smallest C (the most regularized, hence simplest, model) whose performance is within one standard error of the maximum. This favors simpler models when the differences are not statistically significant.
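A minimal sketch of this rule, assuming `grid_search` is a fitted `GridSearchCV` like the one returned by `grid_search_c` above (the helper name `one_se_rule_c` and the use of the standard error of the fold scores are illustrative choices):

```python
import numpy as np

def one_se_rule_c(grid_search, cv=5):
    """Smallest C whose mean CV score is within one standard error of the best."""
    results = grid_search.cv_results_
    C_values = np.asarray(results['param_svm__C'].data, dtype=float)
    mean_test = results['mean_test_score']
    # Standard error of the mean over the CV folds
    se_test = results['std_test_score'] / np.sqrt(cv)

    best_idx = np.argmax(mean_test)
    threshold = mean_test[best_idx] - se_test[best_idx]

    # Candidates within one SE of the best; choose the most regularized (smallest C)
    candidates = C_values[mean_test >= threshold]
    C_1se = candidates.min()

    print(f"Best C = {C_values[best_idx]:.2e} (score {mean_test[best_idx]:.4f})")
    print(f"1-SE C = {C_1se:.2e} (threshold {threshold:.4f})")
    return C_1se
```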
Strategy 3: Regularization Path
Compute solutions for many C values efficiently using warm-starting or path following algorithms. Some SVM solvers can compute the entire regularization path at cost similar to solving for a single C.
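scikit-learn's SVC does not expose a path solver, but the warm-starting idea can be sketched with `SGDClassifier` and the hinge loss, whose `alpha` plays a role analogous to $\lambda$ (and hence to $1/C$). This is only an approximation of an SVM regularization path, with an assumed toy dataset and an arbitrary grid of regularization strengths:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

# Sweep regularization from strong to weak, reusing the previous solution each time
alphas = np.logspace(1, -5, 25)
clf = SGDClassifier(loss='hinge', penalty='l2', warm_start=True,
                    max_iter=2000, tol=1e-4, random_state=0)

path = []
for alpha in alphas:
    clf.set_params(alpha=alpha)
    clf.fit(X, y)                     # warm-started from the previous coefficients
    path.append((alpha, np.linalg.norm(clf.coef_), clf.score(X, y)))

for alpha, w_norm, acc in path[::6]:
    print(f"alpha = {alpha:.1e}  ||w|| = {w_norm:6.2f}  train acc = {acc:.3f}")
```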
Strategy 4: Problem-Specific Heuristics
For certain domains, practitioner experience provides useful starting points, but these heuristics should still be checked with cross-validation on your own data.
Years of experience with SVMs have yielded practical wisdom about C selection. Here are the most important guidelines.
Without normalizing features, the optimal C depends on the arbitrary scales of your features. A dataset with features in [0, 1000] needs different C than the same data in [0, 1]. Normalize first, then tune C. The optimal C for normalized data is more stable and transferable.
The C-kernel interaction:
When using kernel SVMs (covered later), C and kernel parameters interact: both control the complexity of the decision boundary, so they need to be tuned jointly rather than one at a time.
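In practice this means C is rarely tuned in isolation for kernel SVMs; for an RBF kernel, C and the kernel width are usually searched over a joint grid. A minimal sketch with assumed toy data and illustrative grid values:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('svm', SVC(kernel='rbf'))])

# Joint grid over C and the RBF width gamma
param_grid = {'svm__C': np.logspace(-2, 3, 6),
              'svm__gamma': np.logspace(-3, 1, 5)}

search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print("Best (C, gamma):", search.best_params_, "CV accuracy:", round(search.best_score_, 4))
```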
Computational considerations:
SVM optimization algorithms (like SMO) have running times that depend on C: larger C values generally make the problem harder and slower to solve, because more effort is spent driving slack toward zero, while strongly regularized problems (small C) tend to converge quickly.
For large-scale problems, starting with a rough C estimate and refining is more efficient than exhaustive grid search.
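A quick, machine-dependent way to see this effect is to time fits across a range of C values (the dataset, label-noise level, and C grid below are illustrative):

```python
import time
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Noisy, non-separable data tends to make the large-C regime noticeably slower
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X = StandardScaler().fit_transform(X)

for C in [0.01, 1.0, 100.0, 10000.0]:
    start = time.perf_counter()
    SVC(kernel='linear', C=C).fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"C = {C:>8}: fit time = {elapsed:.3f} s")
```

The table below summarizes a practical workflow for selecting C.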
| Step | Action | Notes |
|---|---|---|
| 1 | Normalize features | StandardScaler or MinMaxScaler |
| 2 | Set initial C = 1 | Reasonable starting point after normalization |
| 3 | Define grid | C ∈ {10⁻³, 10⁻², ..., 10², 10³} typically sufficient |
| 4 | Cross-validate | 5-fold or 10-fold CV |
| 5 | Analyze results | Plot train/val curves vs C |
| 6 | Refine if needed | Narrow search around optimum |
| 7 | Final evaluation | Hold-out test set, never used in CV |
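Condensed into code, the workflow looks roughly like this (the dataset and grid are placeholders for your own):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Step 7's hold-out test set is split off first and never touched during CV
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([('scaler', StandardScaler()),           # Step 1: normalize
                     ('svm', SVC(kernel='linear', C=1.0))])  # Step 2: start at C = 1

param_grid = {'svm__C': np.logspace(-3, 3, 7)}                # Step 3: coarse grid
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)  # Step 4: cross-validate
search.fit(X_train, y_train)

best_C = search.best_params_['svm__C']                        # Steps 5-6: inspect, refine if needed
print(f"Best C: {best_C}, CV accuracy: {search.best_score_:.4f}")

# Step 7: final evaluation on the untouched test set
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```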
When we derive the dual formulation of soft margin SVM (covered in detail next page), C appears as a box constraint on the Lagrange multipliers.
Primal form: $$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i$$ $$\text{s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Dual form (preview): $$\max_{\boldsymbol{\alpha}} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j$$ $$\text{s.t. } 0 \leq \alpha_i \leq C, \quad \sum_i \alpha_i y_i = 0$$
The constraint $\alpha_i \leq C$ is called the box constraint—each Lagrange multiplier is "boxed in" by the interval $[0, C]$.
The upper bound C on αᵢ comes from the slack constraint ξᵢ ≥ 0. In the Lagrangian, this constraint gets its own multiplier μᵢ ≥ 0, and the stationarity condition with respect to ξᵢ gives μᵢ = C - αᵢ. Since μᵢ ≥ 0, we must have αᵢ ≤ C.
Classification of points by α value:
The α values directly indicate the role of each training point:
| Condition | α Value | Point Type | Location |
|---|---|---|---|
| α = 0 | Inactive | Non-support vector | Outside margin, correct side |
| 0 < α < C | Active, free | Free support vector | Exactly on margin |
| α = C | Active, bounded | Bounded support vector | Inside margin or misclassified |
Implications:
α = 0: Point is correctly classified and outside the margin. It doesn't influence the solution.
0 < α < C: Point is exactly on the margin (ξ = 0). These points are used to compute the bias b.
α = C: Point is a margin violator (ξ > 0). It may be inside the margin (correctly classified) or misclassified.
In hard margin SVM, there's no upper bound on α—points can be "infinitely important." The box constraint C limits how much any single point can influence the solution, providing robustness.
| αᵢ | ξᵢ | Margin yf(x) | Classification |
|---|---|---|---|
| αᵢ = 0 | ξᵢ = 0 | yf(x) > 1 | Correctly outside margin |
| 0 < αᵢ < C | ξᵢ = 0 | yf(x) = 1 | On margin (free SV) |
| αᵢ = C | 0 < ξᵢ < 1 | 0 < yf(x) < 1 | Inside margin (bounded SV) |
| αᵢ = C | ξᵢ = 1 | yf(x) = 0 | On decision boundary |
| αᵢ = C | ξᵢ > 1 | yf(x) < 0 | Misclassified (bounded SV) |
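These cases can be verified on a fitted model: scikit-learn's SVC exposes `dual_coef_`, which stores $y_i\alpha_i$ for the support vectors, so $\alpha_i$ is its absolute value. A minimal sketch with assumed overlapping toy data and an illustrative C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal([1.0, 1.0], 1.2, size=(n // 2, 2)),
               rng.normal([-1.0, -1.0], 1.2, size=(n // 2, 2))])
y = np.array([1] * (n // 2) + [-1] * (n // 2))

C = 1.0
svm = SVC(kernel='linear', C=C).fit(X, y)

alpha = np.abs(svm.dual_coef_[0])              # alpha_i for each support vector
sv_idx = svm.support_                          # indices of the support vectors
yf = y[sv_idx] * svm.decision_function(X[sv_idx])

tol = 1e-6
free = (alpha > tol) & (alpha < C - tol)       # 0 < alpha < C: on the margin
bounded = alpha >= C - tol                     # alpha = C: inside margin or misclassified

print(f"Box constraint 0 <= alpha <= C holds: {np.all(alpha <= C + tol)}")
if free.any():
    print(f"Free SVs ({free.sum()}): y*f(x) in [{yf[free].min():.3f}, {yf[free].max():.3f}] (close to 1)")
if bounded.any():
    print(f"Bounded SVs ({bounded.sum()}): y*f(x) in [{yf[bounded].min():.3f}, {yf[bounded].max():.3f}] (< 1)")
```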
The C parameter is the primary hyperparameter of soft margin SVM, and understanding it is essential for practical success. The key insights to consolidate:
- C weights the slack penalty against margin maximization; equivalently, it is an inverse regularization strength (C = 1/(2nλ)).
- Large C yields a narrow margin and a close fit to the training data (risking overfitting); small C yields a wide margin and tolerates more violations (risking underfitting); C → ∞ recovers the hard margin SVM.
- Choose C by cross-validation over a logarithmic grid after normalizing features, optionally applying the one-standard-error rule.
- In the dual, C appears as the box constraint 0 ≤ αᵢ ≤ C, capping the influence of any single training point.
What's next:
With the C parameter understood, we're ready to derive the dual formulation of soft margin SVM. The dual is where the mathematical elegance of SVM truly shines—it enables kernel methods, provides geometric insight, and often leads to more efficient optimization.
You now understand the C parameter—the critical hyperparameter that determines SVM behavior. You can reason about its effects geometrically, select it appropriately using cross-validation, and interpret its role in both primal and dual formulations. This knowledge is essential for effective SVM deployment.