On the previous page, we introduced slack variables as a mechanism to tolerate margin violations. Now we reveal a deeper truth: the soft margin SVM formulation is equivalent to minimizing a specific loss function called the hinge loss.
This perspective is profoundly important. While the slack variable formulation feels like a clever optimization trick, the hinge loss formulation connects SVM to the broader landscape of machine learning. It reveals why SVM has its characteristic properties—margin maximization, sparse solutions, and robustness to outliers.
The hinge loss also answers a fundamental question in classification: What should we optimize? Not just accuracy, but confident accuracy. Not just whether predictions are correct, but how correct they are. The hinge loss encodes this preference mathematically, penalizing predictions that are correct but not confident.
This page develops a complete understanding of the hinge loss function—from its definition and mathematical properties to its comparison with other loss functions (logistic, squared, 0-1). You will understand why hinge loss creates margins, why it leads to sparse solutions, and how it relates to the slack variable formulation.
The hinge loss (also called the max-margin loss) is defined as:
$$L_{\text{hinge}}(y, f(\mathbf{x})) = \max(0, 1 - y \cdot f(\mathbf{x}))$$
where $y \in \{-1, +1\}$ is the true label and $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$ is the model's real-valued prediction.
Using the notation $[t]_+ = \max(0, t)$ for the positive part function, we can write:
$$L_{\text{hinge}}(z) = [1 - z]_+ = \max(0, 1-z)$$
The functional margin z = y·f(x) is positive when the prediction is correct (y and f(x) have the same sign) and negative when incorrect. Its magnitude indicates confidence: z = 5 means a confident correct prediction, z = 0.1 means a barely correct prediction, and z = -2 means a confident wrong prediction.
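As a quick sanity check, the sketch below (a minimal NumPy snippet; the helper `hinge` is just the definition above) evaluates the hinge loss at the three example margins:

```python
import numpy as np

def hinge(z):
    """Hinge loss [1 - z]_+ as a function of the functional margin z."""
    return np.maximum(0.0, 1.0 - z)

# The three example margins from the text
for z in [5.0, 0.1, -2.0]:
    print(f"z = {z:+.1f}  ->  hinge loss = {hinge(z):.1f}")
# z = +5.0  ->  hinge loss = 0.0   (confident and correct: no penalty)
# z = +0.1  ->  hinge loss = 0.9   (correct but inside the margin: penalized)
# z = -2.0  ->  hinge loss = 3.0   (confidently wrong: large penalty)
```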
Understanding the hinge loss behavior:
The hinge loss has three distinct regions:
Region 1: $z \geq 1$ (confident correct) $$L_{\text{hinge}}(z) = 0$$
When the functional margin exceeds 1, the loss is exactly zero. The prediction is correct and confident—the point lies outside the margin on the correct side. No further improvement is sought.
Region 2: $0 < z < 1$ (correct but not confident) $$L_{\text{hinge}}(z) = 1 - z \in (0, 1)$$
The prediction is correct (positive margin) but not confident enough—the point lies inside the margin. The loss is positive, proportional to how far inside the margin the point lies. This penalizes correct but "weak" predictions.
Region 3: $z \leq 0$ (incorrect) $$L_{\text{hinge}}(z) = 1 - z \geq 1$$
The prediction is wrong. The loss is at least 1 and grows linearly with the magnitude of the error. The more confident the wrong prediction, the higher the loss.
| Margin z | Geometric Meaning | Loss Value | Gradient |
|---|---|---|---|
| z ≥ 1 | Outside margin, correct side | 0 | 0 |
| 0 < z < 1 | Inside margin, correct side | 1 - z ∈ (0, 1) | -1 |
| z = 0 | On decision boundary | 1 | -1 |
| z < 0 | Wrong side of boundary | 1 - z > 1 | -1 |
The hinge loss formulation and the slack variable formulation are exactly equivalent. This is not an approximation—they define the same optimization problem.
Soft margin SVM with slack variables:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$
$$\text{s.t. } y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Soft margin SVM with hinge loss:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b))$$
These are identical because at optimality, $\xi_i^* = \max(0, 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*))$.
The slack formulation is a constrained optimization problem (QP), while the hinge loss formulation is unconstrained. The constrained form leads to the dual formulation and kernel methods. The unconstrained form enables direct gradient-based optimization like SGD. Both are useful in different contexts.
Proof of equivalence:
From the slack formulation, at optimality we must have:
$$\xi_i^* = \max\left(0, 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*)\right)$$
Why? Consider two cases:

Case 1: $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*) \geq 1$. The margin constraint is already satisfied with $\xi_i = 0$, and since every unit of slack is penalized by $C$, the optimum sets $\xi_i^* = 0 = \max(0, 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*))$.

Case 2: $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*) < 1$. The constraint forces $\xi_i \geq 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*) > 0$, and minimizing the penalty drives $\xi_i$ down to this lower bound, so $\xi_i^* = 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*)$.
Substituting this into $C\sum_i \xi_i$ yields the hinge loss formulation.
```python
import numpy as np


def hinge_loss(y, f_x, reduction='sum'):
    """
    Compute hinge loss.

    Parameters
    ----------
    y : array-like, shape (n_samples,)
        True labels in {-1, +1}
    f_x : array-like, shape (n_samples,)
        Model predictions (w'x + b)
    reduction : str
        'sum' returns total loss, 'none' returns per-sample loss

    Returns
    -------
    loss : float or array
    """
    margins = y * f_x
    losses = np.maximum(0, 1 - margins)
    if reduction == 'sum':
        return np.sum(losses)
    elif reduction == 'mean':
        return np.mean(losses)
    else:
        return losses


def compute_slack_from_solution(X, y, w, b):
    """
    Compute optimal slack variables given SVM solution.

    Demonstrates equivalence: ξ* = max(0, 1 - y(w'x + b))
    """
    functional_margins = y * (X @ w + b)
    slack = np.maximum(0, 1 - functional_margins)
    return slack


def verify_equivalence(X, y, w, b, C):
    """
    Verify that slack formulation equals hinge loss formulation.
    """
    # Compute predictions
    f_x = X @ w + b

    # Hinge loss formulation objective
    regularizer = 0.5 * np.dot(w, w)
    total_hinge = hinge_loss(y, f_x, reduction='sum')
    hinge_objective = regularizer + C * total_hinge

    # Slack formulation objective
    slack = compute_slack_from_solution(X, y, w, b)
    slack_objective = regularizer + C * np.sum(slack)

    print(f"Regularizer: {regularizer:.6f}")
    print(f"Hinge loss term: C * Σ hinge = {C} * {total_hinge:.6f} = {C * total_hinge:.6f}")
    print(f"Slack term: C * Σξ = {C} * {np.sum(slack):.6f} = {C * np.sum(slack):.6f}")
    print(f"Hinge objective: {hinge_objective:.6f}")
    print(f"Slack objective: {slack_objective:.6f}")
    print(f"Difference: {abs(hinge_objective - slack_objective):.10f}")

    return hinge_objective, slack_objective


# Example
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)
y = np.sign(X[:, 0] + X[:, 1] + 0.5 * np.random.randn(n))
y[y == 0] = 1

# Hypothetical solution
w = np.array([0.5, 0.5])
b = 0.1

verify_equivalence(X, y, w, b, C=1.0)
```

The hinge loss has a beautiful geometric interpretation that explains SVM's margin-maximizing behavior.
The margin as a threshold:
In SVM, we don't just want correct predictions—we want predictions that are correct by at least a margin. Setting this margin threshold to 1 in the functional margin space, the hinge loss accomplishes exactly this: a point with functional margin $z \geq 1$ incurs no loss, while a point that falls short is penalized in proportion to how far it falls short.
This creates a "dead zone" where correct, confident predictions incur no loss. The model is not incentivized to make already-confident predictions even more confident—the gradient is zero there.
The 'hinge' in hinge loss refers to its shape when plotted. At z=1, the loss function has a kink (like a door hinge). To the right of this kink, the loss is flat at zero. To the left, it's a linear ramp. This non-smooth point at z=1 is where the margin boundary lives.
Why the margin threshold is at 1:
The choice of 1 as the margin threshold is arbitrary but conventional. We could define hinge loss as $\max(0, \gamma - z)$ for any $\gamma > 0$, but this is equivalent to rescaling $\mathbf{w}$ and $b$. The choice $\gamma = 1$ is a normalization that sets the scale of the weight vector.
Geometrically, with this normalization, the margin boundaries are the hyperplanes $\mathbf{w}^\top\mathbf{x} + b = \pm 1$, and the geometric margin (the distance from the decision boundary to either margin boundary) is $1/\|\mathbf{w}\|$.
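The rescaling argument can be checked numerically. In the sketch below (random data and hypothetical weights, chosen only for illustration), the threshold-$\gamma$ hinge applied to $(\mathbf{w}, b)$ is exactly $\gamma$ times the threshold-1 hinge applied to $(\mathbf{w}/\gamma, b/\gamma)$, and the geometric margin is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.choice([-1.0, 1.0], size=5)
w, b, gamma = rng.normal(size=3), 0.3, 2.5

# Threshold-gamma hinge with (w, b) ...
loss_gamma = np.maximum(0, gamma - y * (X @ w + b))
# ... equals gamma times the threshold-1 hinge with the rescaled (w/gamma, b/gamma)
loss_one = np.maximum(0, 1 - y * (X @ (w / gamma) + b / gamma))
print(np.allclose(loss_gamma, gamma * loss_one))  # True

# The geometric margin is unchanged: gamma/||w|| == 1/||w/gamma||
print(np.isclose(gamma / np.linalg.norm(w), 1 / np.linalg.norm(w / gamma)))  # True
```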
Gradient and support vectors:
The hinge loss gradient with respect to $\mathbf{w}$ reveals why SVM produces sparse solutions:
$$\frac{\partial L_{\text{hinge}}}{\partial \mathbf{w}} = \begin{cases} \mathbf{0} & \text{if } y(\mathbf{w}^\top\mathbf{x} + b) \geq 1 \\ -y\mathbf{x} & \text{if } y(\mathbf{w}^\top\mathbf{x} + b) < 1 \end{cases}$$
Points with $y(\mathbf{w}^\top\mathbf{x} + b) \geq 1$ contribute zero gradient. They don't influence the solution. Only points at or inside the margin—the support vectors—have nonzero gradient and affect the model.
| Region | ∂L/∂w | ∂L/∂b | Interpretation |
|---|---|---|---|
| z ≥ 1 (outside margin) | 0 | 0 | Point doesn't affect solution |
| z < 1 (inside/wrong side) | -yx | -y | Point pushes boundary away |
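The sparsity argument is easy to see numerically. This sketch (random data and a hypothetical $\mathbf{w}, b$) assembles the per-point hinge subgradient with respect to $\mathbf{w}$; only points with $z_i < 1$ contribute, exactly as in the table above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
y = rng.choice([-1.0, 1.0], size=8)
w, b = np.array([0.7, -0.4]), 0.2  # hypothetical parameters

z = y * (X @ w + b)                      # functional margins
active = (z < 1).astype(float)           # 1 for points inside the margin or misclassified

# Per-point hinge subgradient w.r.t. w: -y_i * x_i if z_i < 1, else 0
grad_contributions = -(active * y)[:, None] * X

print("margins:", np.round(z, 2))
print("points with nonzero gradient:", int(active.sum()), "of", len(z))
print("total hinge subgradient:", grad_contributions.sum(axis=0))
```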
The gradient pushes the boundary:
When a point has $z < 1$, the gradient $-y\mathbf{x}$ acts to push the decision boundary in the direction that increases this point's margin. For a positive example ($y = +1$) with insufficient margin, updating $\mathbf{w} \leftarrow \mathbf{w} + \eta y\mathbf{x} = \mathbf{w} + \eta\mathbf{x}$ increases $\mathbf{w}^\top\mathbf{x}$, pushing the prediction toward the positive side.
This continues until the point achieves margin 1, at which point its gradient becomes zero and it stops influencing the solution. This is the mechanism behind margin maximization—gradients from margin-violating points "push" the boundary until no point violates the margin (or the penalty for violation is balanced by the regularizer).
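A single update makes this concrete. In the sketch below (a hypothetical point and parameters, ignoring the regularizer term for clarity), one subgradient step on a margin-violating positive example strictly increases its functional margin:

```python
import numpy as np

x = np.array([0.4, -0.2])           # a positive example (y = +1), hypothetical
y = 1.0
w, b = np.array([0.3, 0.5]), 0.0
eta = 0.1

z = y * (w @ x + b)                 # functional margin before the update
assert z < 1                        # the point violates the margin

# Hinge-loss subgradient step for this point: w <- w + eta*y*x, b <- b + eta*y
w_new = w + eta * y * x
b_new = b + eta * y

print("margin before:", round(z, 4))                          # 0.02
print("margin after: ", round(y * (w_new @ x + b_new), 4))    # 0.14, strictly larger
```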
Understanding hinge loss requires comparing it to other classification losses. Each loss function embodies different assumptions about what makes a prediction "good."
The loss function zoo:
Let $z = y \cdot f(\mathbf{x})$ be the functional margin. Common loss functions include:
| Loss | Formula | Properties |
|---|---|---|
| 0-1 (misclassification) | $\mathbb{1}(z < 0)$ | Non-convex, non-differentiable |
| Hinge | $\max(0, 1-z)$ | Convex, piecewise linear |
| Squared hinge | $[\max(0, 1-z)]^2$ | Convex, differentiable |
| Logistic | $\log(1 + e^{-z})$ | Convex, smooth |
| Exponential | $e^{-z}$ | Convex, smooth |
| Squared | $(1-z)^2$ | Convex, smooth |
All losses except 0-1 are convex surrogates: they are convex and upper bound the 0-1 loss (for the logistic loss this holds after rescaling by $1/\ln 2$, i.e., measuring it in bits), enabling efficient optimization. The choice of surrogate affects both computational properties and the geometry of the learned classifier.
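The upper-bound property can be verified on a grid. The sketch below checks each surrogate against the 0-1 loss, using the base-2 (bits) version of the logistic loss as noted above:

```python
import numpy as np

z = np.linspace(-5, 5, 2001)
zero_one = (z < 0).astype(float)

hinge       = np.maximum(0, 1 - z)
sq_hinge    = np.maximum(0, 1 - z) ** 2
exponential = np.exp(-z)
logistic_b2 = np.log1p(np.exp(-z)) / np.log(2)   # logistic loss in bits (base 2)

for name, loss in [("hinge", hinge), ("squared hinge", sq_hinge),
                   ("exponential", exponential), ("logistic (base 2)", logistic_b2)]:
    print(f"{name:18s} upper-bounds 0-1 loss: {bool(np.all(loss >= zero_one))}")
```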
Key differences between losses:
Hinge vs. Logistic:
| Property | Hinge | Logistic |
|---|---|---|
| Smoothness | Non-smooth at z=1 | Smooth everywhere |
| Gradient at z→∞ | 0 (exactly) | Approaches 0 exponentially |
| Sparsity | Yes (exact zero gradient for z≥1) | No (always nonzero gradient) |
| Probability output | No | Yes (via sigmoid) |
| Outlier sensitivity | Linear tail | Linear tail |
Hinge loss creates sparsity because its gradient is exactly zero for confident predictions. Logistic loss always has nonzero (though small) gradients, so all points influence the solution to some degree.
Hinge vs. Exponential (AdaBoost):
| Property | Hinge | Exponential |
|---|---|---|
| Penalty for z<0 | Linear: 1-z | Exponential: e^{-z} |
| Outlier sensitivity | Moderate | Severe |
| Gradient magnitude for errors | Constant: 1 | Grows exponentially |
Exponential loss penalizes misclassifications exponentially, making it very sensitive to outliers and mislabeled data. Hinge loss is more robust—a point on the wrong side contributes loss linear in its margin, not exponential.
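A few values make the difference in tail behavior vivid (a small sketch; the numbers follow directly from the two formulas):

```python
import numpy as np

# How strongly does each loss react to a single badly misclassified point (an outlier)?
for z in [-1, -2, -4, -8]:
    hinge = max(0.0, 1.0 - z)
    expo = np.exp(-z)
    print(f"z = {z:+d}:  hinge = {hinge:6.1f}   exponential = {expo:10.1f}")
# The hinge penalty grows linearly while the exponential penalty explodes,
# so a few mislabeled points can dominate an exponential-loss objective.
```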
```python
import numpy as np
import matplotlib.pyplot as plt


def zero_one_loss(z):
    """0-1 loss: 1 if z < 0, else 0"""
    return (z < 0).astype(float)


def hinge_loss(z):
    """Hinge loss: max(0, 1-z)"""
    return np.maximum(0, 1 - z)


def squared_hinge_loss(z):
    """Squared hinge loss: max(0, 1-z)^2"""
    return np.maximum(0, 1 - z) ** 2


def logistic_loss(z):
    """Logistic loss: log(1 + exp(-z))"""
    # Numerically stable version
    return np.log1p(np.exp(-z))


def exponential_loss(z):
    """Exponential loss: exp(-z)"""
    return np.exp(-z)


def squared_loss(z):
    """Squared loss: (1-z)^2 (for margin)"""
    return (1 - z) ** 2


# Plot comparison
z = np.linspace(-2.5, 3, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: All losses
ax = axes[0]
ax.plot(z, zero_one_loss(z), 'k--', linewidth=2, label='0-1 Loss')
ax.plot(z, hinge_loss(z), 'b-', linewidth=2, label='Hinge')
ax.plot(z, logistic_loss(z), 'r-', linewidth=2, label='Logistic')
ax.plot(z, exponential_loss(z), 'g-', linewidth=2, label='Exponential')
ax.plot(z, squared_hinge_loss(z), 'm--', linewidth=2, label='Squared Hinge')

ax.axvline(x=0, color='gray', linestyle=':', alpha=0.5, label='Decision boundary')
ax.axvline(x=1, color='orange', linestyle=':', alpha=0.5, label='Margin')
ax.set_xlim(-2.5, 3)
ax.set_ylim(-0.1, 4)
ax.set_xlabel('Functional Margin z = y·f(x)', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Classification Loss Functions', fontsize=14)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Right: Gradient comparison
ax = axes[1]


def hinge_grad(z):
    return np.where(z >= 1, 0, -1)


def logistic_grad(z):
    return -1 / (1 + np.exp(z))


def exponential_grad(z):
    return -np.exp(-z)


ax.plot(z, hinge_grad(z), 'b-', linewidth=2, label='Hinge gradient')
ax.plot(z, logistic_grad(z), 'r-', linewidth=2, label='Logistic gradient')
ax.plot(z, exponential_grad(z), 'g-', linewidth=2, label='Exponential gradient')

ax.axvline(x=1, color='orange', linestyle=':', alpha=0.5, label='z=1 (margin)')
ax.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax.set_xlim(-2.5, 3)
ax.set_ylim(-4, 0.5)
ax.set_xlabel('Functional Margin z = y·f(x)', fontsize=12)
ax.set_ylabel('Gradient ∂L/∂z', fontsize=12)
ax.set_title('Loss Function Gradients', fontsize=14)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('loss_function_comparison.png', dpi=150)
plt.show()

# Print key observations
print("KEY OBSERVATIONS:")
print("-" * 50)
print("At z=2 (confident correct prediction):")
print(f"  Hinge: {hinge_loss(2):.4f}, Gradient: {hinge_grad(2)}")
print(f"  Logistic: {logistic_loss(2):.4f}, Gradient: {logistic_grad(2):.4f}")
print()
print("At z=-1 (confident wrong prediction):")
print(f"  Hinge: {hinge_loss(-1):.4f}, Gradient: {hinge_grad(-1)}")
print(f"  Logistic: {logistic_loss(-1):.4f}, Gradient: {logistic_grad(-1):.4f}")
print(f"  Exponential: {exponential_loss(-1):.4f}, Gradient: {exponential_grad(-1):.4f}")
```

Why hinge loss creates margins:
The key insight is that hinge loss has zero loss and zero gradient for $z \geq 1$. This means that once a point is correctly classified with functional margin at least 1, it stops contributing to the objective entirely; all of the optimization pressure comes from points near or violating the margin, which is precisely what determines the large-margin boundary.
In contrast, logistic loss always has positive loss and nonzero gradient, even for very confident predictions. This means logistic regression tries to push all predictions toward infinity, while SVM only cares about getting predictions past the margin threshold.
The hinge loss has several important mathematical properties that influence SVM behavior.
Property 1: Convexity
Hinge loss is convex. To see this, note that it is the pointwise maximum of two affine (hence convex) functions of $z$: the constant function $0$ and the line $1 - z$. The pointwise maximum of convex functions is convex.
Convexity is crucial—it guarantees that any local minimum is a global minimum, and that gradient-based optimization will converge to the optimum.
Property 2: Lipschitz continuity
Hinge loss is Lipschitz continuous with constant 1: $$|L_{\text{hinge}}(z_1) - L_{\text{hinge}}(z_2)| \leq |z_1 - z_2|$$
This bounded rate of change provides stability guarantees for optimization algorithms and generalization bounds.
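A quick empirical check of the Lipschitz bound over random pairs (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
z1 = rng.uniform(-5, 5, size=100_000)
z2 = rng.uniform(-5, 5, size=100_000)

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

# Guard against (extremely unlikely) identical pairs before dividing
mask = np.abs(z1 - z2) > 1e-12
ratio = np.abs(hinge(z1[mask]) - hinge(z2[mask])) / np.abs(z1[mask] - z2[mask])
print("max |L(z1) - L(z2)| / |z1 - z2| over random pairs:", ratio.max())  # <= 1
```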
Property 3: Non-smooth at z=1
Hinge loss is not differentiable at $z = 1$. The left derivative is $-1$ and the right derivative is $0$. This non-smoothness is where the "kink" occurs, and it's precisely where support vectors lie.
At the non-smooth point z=1, we use subgradients instead of gradients. The subdifferential at z=1 is the interval [-1, 0]. Any value in this range is a valid subgradient and can be used in optimization. This is why primal SVM solvers that work directly with the hinge loss rely on subgradient methods rather than plain gradient descent.
Property 4: Upper bound on 0-1 loss
For all $z$: $L_{\text{hinge}}(z) \geq \mathbb{1}(z < 0)$
Proof: if $z < 0$, then $1 - z > 1$, so $L_{\text{hinge}}(z) = 1 - z \geq 1 = \mathbb{1}(z < 0)$. If $z \geq 0$, then $L_{\text{hinge}}(z) \geq 0 = \mathbb{1}(z < 0)$. Hence minimizing the hinge loss also controls the intractable 0-1 loss.
Property 5: Calibration
Hinge loss is classification-calibrated, meaning that minimizing expected hinge loss leads to the Bayes optimal classifier for 0-1 loss as sample size grows. This is a fundamental theoretical guarantee that SVM will learn the optimal decision boundary.
Property 6: Margin awareness
Unlike 0-1 loss, hinge loss penalizes predictions even when correct if they're not confident enough (inside the margin). This is why SVM produces large-margin classifiers—the loss function explicitly penalizes small margins.
| Property | Statement | Implication |
|---|---|---|
| Convexity | L(z) is convex in z | Unique global optimum, efficient optimization |
| Lipschitz | \|L(z₁) - L(z₂)\| ≤ \|z₁ - z₂\| | Stable optimization, generalization bounds |
| Non-smooth | Not differentiable at z=1 | Requires subgradient methods |
| Upper bound | L(z) ≥ 𝟙(z<0) | Minimizing hinge controls 0-1 loss |
| Calibration | Minimizer → Bayes optimal | Statistically consistent |
| Margin aware | L(z) > 0 for 0 < z < 1 | Produces large-margin classifiers |
The non-smoothness of hinge loss at $z=1$ requires careful treatment in optimization. We use subgradients instead of gradients.
Subgradient definition:
A vector $\mathbf{g}$ is a subgradient of convex function $f$ at point $\mathbf{x}_0$ if, for all $\mathbf{x}$: $$f(\mathbf{x}) \geq f(\mathbf{x}_0) + \mathbf{g}^\top(\mathbf{x} - \mathbf{x}_0)$$
The set of all subgradients at $\mathbf{x}_0$ is called the subdifferential, denoted $\partial f(\mathbf{x}_0)$.
For hinge loss with respect to $z$:
$$\partial L_{\text{hinge}}(z) = \begin{cases} \{-1\} & z < 1 \\ [-1, 0] & z = 1 \\ \{0\} & z > 1 \end{cases}$$
At z=1, any value between -1 and 0 (inclusive) is a valid subgradient. This flexibility means that optimization algorithms have choices when a training point lands exactly on the margin. In practice, we can use 0 or -1 consistently, or take the average (-0.5).
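The claim that any $g \in [-1, 0]$ works at $z = 1$ follows directly from the subgradient inequality; this sketch verifies it on a grid:

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

z0 = 1.0
z = np.linspace(-3, 4, 1001)

# Every g in [-1, 0] satisfies the subgradient inequality L(z) >= L(z0) + g*(z - z0)
for g in [-1.0, -0.5, 0.0]:
    ok = bool(np.all(hinge(z) >= hinge(z0) + g * (z - z0) - 1e-12))
    print(f"g = {g:+.1f} is a valid subgradient at z = 1: {ok}")

# A value outside [-1, 0] is not:
g = 0.5
print("g = +0.5 valid:", bool(np.all(hinge(z) >= hinge(z0) + g * (z - z0))))  # False
```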
Subgradient of the SVM objective:
The unconstrained soft margin SVM objective is: $$J(\mathbf{w}, b) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$$
The subgradient with respect to $\mathbf{w}$: $$\mathbf{g}_{\mathbf{w}} \in \mathbf{w} + C\sum_{i=1}^n \partial_{\mathbf{w}} L_i$$
where, for each term: $$\partial_{\mathbf{w}} L_i = \begin{cases} \{-y_i \mathbf{x}_i\} & y_i(\mathbf{w}^\top\mathbf{x}_i + b) < 1 \\ \{\mathbf{0}\} \text{ or } \{-y_i \mathbf{x}_i\} & y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1 \\ \{\mathbf{0}\} & y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 1 \end{cases}$$
Subgradient descent algorithm:
The basic subgradient descent update:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \mathbf{g}^{(t)}$$
where $\mathbf{g}^{(t)}$ is any subgradient at $\mathbf{w}^{(t)}$.
Unlike gradient descent, subgradient descent may not decrease the objective at every step. Convergence requires diminishing step sizes: $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ (e.g., $\eta_t = 1/\sqrt{t}$).
```python
import numpy as np


def svm_subgradient_descent(X, y, C, n_epochs=1000, eta_0=1.0):
    """
    Train SVM using subgradient descent on the hinge loss.

    Uses step size: eta_t = eta_0 / sqrt(t)

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,) with values in {-1, +1}
    C : float, regularization parameter
    n_epochs : int
    eta_0 : float, initial learning rate

    Returns
    -------
    w, b : learned parameters
    history : dict with training history
    """
    n_samples, n_features = X.shape

    # Initialize
    w = np.zeros(n_features)
    b = 0.0

    history = {'objective': [], 'w_norm': [], 'n_violations': []}

    t = 1  # iteration counter for step size

    for epoch in range(n_epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)

        for i in indices:
            # Compute functional margin
            z = y[i] * (np.dot(w, X[i]) + b)

            # Step size
            eta = eta_0 / np.sqrt(t)
            t += 1

            # Compute subgradient
            if z < 1:
                # Margin violation: include hinge term
                g_w = w - C * y[i] * X[i]
                g_b = -C * y[i]
            else:
                # No violation: only regularizer
                g_w = w
                g_b = 0.0

            # Update
            w = w - eta * g_w
            b = b - eta * g_b

        # Track objective
        margins = y * (X @ w + b)
        hinge_losses = np.maximum(0, 1 - margins)
        objective = 0.5 * np.dot(w, w) + C * np.sum(hinge_losses)
        n_violations = np.sum(margins < 1)

        history['objective'].append(objective)
        history['w_norm'].append(np.linalg.norm(w))
        history['n_violations'].append(n_violations)

    return w, b, history


def svm_pegasos(X, y, C, n_epochs=100, batch_size=1):
    """
    PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM.

    A popular and efficient variant of subgradient descent for SVM.
    Uses projected subgradient updates with step size 1/(lambda*t).

    Note: Pegasos minimizes (lambda/2)*||w||^2 + average hinge loss; here
    we take lambda = 1/C to relate the C-parameterized objective to that
    form (the correspondence is up to an overall scaling).
    """
    n_samples, n_features = X.shape
    lambda_param = 1.0 / C

    w = np.zeros(n_features)
    # Note: PEGASOS typically doesn't update bias; we omit for simplicity

    t = 1
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        for i in indices:
            eta = 1.0 / (lambda_param * t)
            t += 1

            z = y[i] * np.dot(w, X[i])

            if z < 1:
                w = (1 - eta * lambda_param) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lambda_param) * w

    return w


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create overlapping dataset
    n = 200
    X_pos = np.random.randn(n//2, 2) + [1, 1]
    X_neg = np.random.randn(n//2, 2) + [-1, -1]
    X = np.vstack([X_pos, X_neg])
    y = np.array([1]*(n//2) + [-1]*(n//2))

    # Train with subgradient descent
    w, b, history = svm_subgradient_descent(X, y, C=1.0, n_epochs=100)

    print(f"Final w: [{w[0]:.4f}, {w[1]:.4f}]")
    print(f"Final b: {b:.4f}")
    print(f"Final objective: {history['objective'][-1]:.4f}")
    print(f"Margin violations: {history['n_violations'][-1]}")

    # Accuracy
    predictions = np.sign(X @ w + b)
    accuracy = np.mean(predictions == y)
    print(f"Training accuracy: {accuracy:.2%}")
```

A popular variant is the squared hinge loss (also called L2-SVM or L2 loss):
$$L_{\text{sq-hinge}}(z) = [\max(0, 1-z)]^2 = [1-z]_+^2$$
This corresponds to the soft margin formulation with squared slack:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i^2$$ $$\text{s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i$$
L1-SVM uses Σξᵢ (standard), L2-SVM uses Σξᵢ². L1 is more robust to outliers and produces sparser αs in the dual. L2 is differentiable and sometimes faster to optimize. The choice depends on the application and data characteristics.
Gradient of squared hinge:
$$\frac{d L_{\text{sq-hinge}}}{dz} = \begin{cases} -2(1-z) & z < 1 \\ 0 & z \geq 1 \end{cases}$$
The gradient is continuous at $z=1$ (both sides approach 0), making the function differentiable.
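Unlike the plain hinge, the squared hinge's derivative matches from both sides at $z = 1$; a small numerical sketch:

```python
import numpy as np

def sq_hinge(z):
    return np.maximum(0.0, 1.0 - z) ** 2

def sq_hinge_grad(z):
    return np.where(z < 1, -2.0 * (1.0 - z), 0.0)

eps = 1e-6
print("gradient just left of z=1: ", sq_hinge_grad(1 - eps))   # ~ -2e-06
print("gradient just right of z=1:", sq_hinge_grad(1 + eps))   # 0.0

# Compare with a finite-difference estimate at z = 1 (also ~ 0)
h = 1e-6
print("central difference at z=1: ", (sq_hinge(1 + h) - sq_hinge(1 - h)) / (2 * h))
```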
Comparison:
| Property | Hinge (L1-SVM) | Squared Hinge (L2-SVM) |
|---|---|---|
| Smoothness | Kink at z=1 | Smooth everywhere |
| Gradient at z<1 | Constant: -1 | Linear: -2(1-z) |
| Outlier sensitivity | Moderate | High |
| Dual sparsity | More sparse | Less sparse |
| Typical use | General classification | Text classification |
| z (margin) | Hinge Loss | Squared Hinge | Logistic Loss |
|---|---|---|---|
| z = 2 | 0 | 0 | 0.127 |
| z = 1 | 0 | 0 | 0.313 |
| z = 0.5 | 0.5 | 0.25 | 0.474 |
| z = 0 | 1 | 1 | 0.693 |
| z = -1 | 2 | 4 | 1.313 |
| z = -2 | 3 | 9 | 2.127 |
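The table values can be reproduced in a few lines (a minimal sketch using the formulas above):

```python
import numpy as np

print(f"{'z':>6} {'hinge':>8} {'sq hinge':>10} {'logistic':>10}")
for z in [2, 1, 0.5, 0, -1, -2]:
    hinge = max(0.0, 1.0 - z)
    sq = hinge ** 2
    logistic = np.log1p(np.exp(-z))
    print(f"{z:>6} {hinge:>8.3f} {sq:>10.3f} {logistic:>10.3f}")
```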
The hinge loss provides a fundamental lens for understanding SVM behavior. Let us consolidate the key insights:

- The hinge loss formulation is exactly equivalent to the slack variable formulation: at optimality, $\xi_i^* = \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b^*))$.
- Zero loss and zero gradient for $z \geq 1$ mean that confidently correct points do not influence the solution; only support vectors do, which produces sparse solutions.
- The hinge loss is a convex upper bound on the 0-1 loss, making it a tractable and classification-calibrated surrogate for classification error.
- Its linear tail for $z < 0$ makes it more robust to outliers than the exponential loss, while its penalty for $0 < z < 1$ is what enforces large margins.
- The kink at $z = 1$ makes the loss non-smooth, which is handled with subgradient methods (or avoided by using the squared hinge).
What's next:
With hinge loss understood, we turn to the crucial question: how do we control the margin-violation trade-off? The answer lies in the C parameter, which balances margin width against training error. Understanding C is essential for practical SVM application—mistuning it can lead to overfitting or underfitting.
You now understand hinge loss—the loss function that gives SVM its defining characteristics. The margin-awareness, sparsity, and robustness of SVM all trace back to the hinge loss. This perspective will illuminate the dual formulation and kernel methods in subsequent pages.