On the previous page, we introduced slack variables as a mechanism to tolerate margin violations. Now we reveal a deeper truth: the soft margin SVM formulation is equivalent to minimizing a specific loss function called the hinge loss.
This perspective is profoundly important. While the slack variable formulation feels like a clever optimization trick, the hinge loss formulation connects SVM to the broader landscape of machine learning. It reveals why SVM has its characteristic properties—margin maximization, sparse solutions, and robustness to outliers.
The hinge loss also answers a fundamental question in classification: What should we optimize? Not just accuracy, but confident accuracy. Not just whether predictions are correct, but how correct they are. The hinge loss encodes this preference mathematically, penalizing predictions that are correct but not confident.
This page develops a complete understanding of the hinge loss function—from its definition and mathematical properties to its comparison with other loss functions (logistic, squared, 0-1). You will understand why hinge loss creates margins, why it leads to sparse solutions, and how it relates to the slack variable formulation.
The hinge loss (also called the max-margin loss) is defined as:
$$L_{\text{hinge}}(y, f(\mathbf{x})) = \max(0, 1 - y \cdot f(\mathbf{x}))$$
where $y \in \{-1, +1\}$ is the true label and $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$ is the model's real-valued prediction.
Using the notation $[t]_+ = \max(0, t)$ for the positive part function, we can write:
$$L_{\text{hinge}}(z) = [1 - z]_+ = \max(0, 1-z)$$
The functional margin z = y·f(x) is positive when the prediction is correct (y and f(x) have the same sign) and negative when incorrect. Its magnitude indicates confidence: z = 5 means a confident correct prediction, z = 0.1 means a barely correct prediction, and z = -2 means a confident wrong prediction.
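As a quick sanity check, the sketch below (a minimal NumPy snippet; the helper `hinge` is just the definition above) evaluates the hinge loss at the three example margins:

```python
import numpy as np

def hinge(z):
    """Hinge loss [1 - z]_+ as a function of the functional margin z."""
    return np.maximum(0.0, 1.0 - z)

# The three example margins from the text
for z in [5.0, 0.1, -2.0]:
    print(f"z = {z:+.1f}  ->  hinge loss = {hinge(z):.1f}")
# z = +5.0  ->  hinge loss = 0.0   (confident and correct: no penalty)
# z = +0.1  ->  hinge loss = 0.9   (correct but inside the margin: penalized)
# z = -2.0  ->  hinge loss = 3.0   (confidently wrong: large penalty)
```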
Understanding the hinge loss behavior:
The hinge loss has three distinct regions:
Region 1: $z \geq 1$ (confident correct) $$L_{\text{hinge}}(z) = 0$$
When the functional margin exceeds 1, the loss is exactly zero. The prediction is correct and confident—the point lies outside the margin on the correct side. No further improvement is sought.
Region 2: $0 < z < 1$ (correct but not confident) $$L_{\text{hinge}}(z) = 1 - z \in (0, 1)$$
The prediction is correct (positive margin) but not confident enough—the point lies inside the margin. The loss is positive, proportional to how far inside the margin the point lies. This penalizes correct but "weak" predictions.
Region 3: $z \leq 0$ (incorrect) $$L_{\text{hinge}}(z) = 1 - z \geq 1$$
The prediction is wrong. The loss is at least 1 and grows linearly with the magnitude of the error. The more confident the wrong prediction, the higher the loss.
| Margin z | Geometric Meaning | Loss Value | Gradient |
|---|---|---|---|
| z ≥ 1 | Outside margin, correct side | 0 | 0 |
| 0 < z < 1 | Inside margin, correct side | 1 - z ∈ (0, 1) | -1 |
| z = 0 | On decision boundary | 1 | -1 |
| z < 0 | Wrong side of boundary | 1 - z > 1 | -1 |
The hinge loss formulation and the slack variable formulation are exactly equivalent. This is not an approximation—they define the same optimization problem.
Soft margin SVM with slack variables:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i$$
$$\text{s.t. } y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
Soft margin SVM with hinge loss:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b))$$
These are identical because at optimality, $\xi_i^* = \max(0, 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*))$.
The slack formulation is a constrained optimization problem (QP), while the hinge loss formulation is unconstrained. The constrained form leads to the dual formulation and kernel methods. The unconstrained form enables direct gradient-based optimization like SGD. Both are useful in different contexts.
Proof of equivalence:
From the slack formulation, at optimality we must have:
$$\xi_i^* = \max\left(0, 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*)\right)$$
Why? Consider two cases:

Case 1: $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*) \geq 1$. The margin constraint is already satisfied with $\xi_i = 0$, and since every unit of slack is penalized by $C$, the optimum sets $\xi_i^* = 0 = \max(0, 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*))$.

Case 2: $y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*) < 1$. The constraint forces $\xi_i \geq 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*) > 0$, and minimizing the penalty drives $\xi_i$ down to this lower bound, so $\xi_i^* = 1 - y_i(\mathbf{w}^{\top}\mathbf{x}_i + b^*)$.
Substituting this into $C\sum_i \xi_i$ yields the hinge loss formulation.
```python
import numpy as np


def hinge_loss(y, f_x, reduction='sum'):
    """
    Compute hinge loss.

    Parameters
    ----------
    y : array-like, shape (n_samples,)
        True labels in {-1, +1}
    f_x : array-like, shape (n_samples,)
        Model predictions (w'x + b)
    reduction : str
        'sum' returns total loss, 'none' returns per-sample loss

    Returns
    -------
    loss : float or array
    """
    margins = y * f_x
    losses = np.maximum(0, 1 - margins)
    if reduction == 'sum':
        return np.sum(losses)
    elif reduction == 'mean':
        return np.mean(losses)
    else:
        return losses


def compute_slack_from_solution(X, y, w, b):
    """
    Compute optimal slack variables given SVM solution.

    Demonstrates equivalence: ξ* = max(0, 1 - y(w'x + b))
    """
    functional_margins = y * (X @ w + b)
    slack = np.maximum(0, 1 - functional_margins)
    return slack


def verify_equivalence(X, y, w, b, C):
    """
    Verify that slack formulation equals hinge loss formulation.
    """
    # Compute predictions
    f_x = X @ w + b

    # Hinge loss formulation objective
    regularizer = 0.5 * np.dot(w, w)
    total_hinge = hinge_loss(y, f_x, reduction='sum')
    hinge_objective = regularizer + C * total_hinge

    # Slack formulation objective
    slack = compute_slack_from_solution(X, y, w, b)
    slack_objective = regularizer + C * np.sum(slack)

    print(f"Regularizer: {regularizer:.6f}")
    print(f"Hinge loss term: C * Σ hinge = {C} * {total_hinge:.6f} = {C * total_hinge:.6f}")
    print(f"Slack term: C * Σξ = {C} * {np.sum(slack):.6f} = {C * np.sum(slack):.6f}")
    print(f"Hinge objective: {hinge_objective:.6f}")
    print(f"Slack objective: {slack_objective:.6f}")
    print(f"Difference: {abs(hinge_objective - slack_objective):.10f}")

    return hinge_objective, slack_objective


# Example
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)
y = np.sign(X[:, 0] + X[:, 1] + 0.5 * np.random.randn(n))
y[y == 0] = 1

# Hypothetical solution
w = np.array([0.5, 0.5])
b = 0.1

verify_equivalence(X, y, w, b, C=1.0)
```

The hinge loss has a beautiful geometric interpretation that explains SVM's margin-maximizing behavior.
The margin as a threshold:
In SVM, we don't just want correct predictions—we want predictions that are correct by at least a margin. Setting this margin threshold to 1 in the functional margin space, the hinge loss accomplishes exactly this: a point with functional margin $z \geq 1$ incurs no loss, while a point that falls short is penalized in proportion to how far it falls short.
This creates a "dead zone" where correct, confident predictions incur no loss. The model is not incentivized to make already-confident predictions even more confident—the gradient is zero there.
The 'hinge' in hinge loss refers to its shape when plotted. At z=1, the loss function has a kink (like a door hinge). To the right of this kink, the loss is flat at zero. To the left, it's a linear ramp. This non-smooth point at z=1 is where the margin boundary lives.
Why the margin threshold is at 1:
The choice of 1 as the margin threshold is arbitrary but conventional. We could define hinge loss as $\max(0, \gamma - z)$ for any $\gamma > 0$, but this is equivalent to rescaling $\mathbf{w}$ and $b$. The choice $\gamma = 1$ is a normalization that sets the scale of the weight vector.
Geometrically, with this normalization, the margin boundaries are the hyperplanes $\mathbf{w}^\top\mathbf{x} + b = \pm 1$, and the geometric margin (the distance from the decision boundary to either margin boundary) is $1/\|\mathbf{w}\|$.
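The rescaling argument can be checked numerically. In the sketch below (random data and hypothetical weights, chosen only for illustration), the threshold-$\gamma$ hinge applied to $(\mathbf{w}, b)$ is exactly $\gamma$ times the threshold-1 hinge applied to $(\mathbf{w}/\gamma, b/\gamma)$, and the geometric margin is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.choice([-1.0, 1.0], size=5)
w, b, gamma = rng.normal(size=3), 0.3, 2.5

# Threshold-gamma hinge with (w, b) ...
loss_gamma = np.maximum(0, gamma - y * (X @ w + b))
# ... equals gamma times the threshold-1 hinge with the rescaled (w/gamma, b/gamma)
loss_one = np.maximum(0, 1 - y * (X @ (w / gamma) + b / gamma))
print(np.allclose(loss_gamma, gamma * loss_one))  # True

# The geometric margin is unchanged: gamma/||w|| == 1/||w/gamma||
print(np.isclose(gamma / np.linalg.norm(w), 1 / np.linalg.norm(w / gamma)))  # True
```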
Gradient and support vectors:
The hinge loss gradient with respect to $\mathbf{w}$ reveals why SVM produces sparse solutions:
$$\frac{\partial L_{\text{hinge}}}{\partial \mathbf{w}} = \begin{cases} \mathbf{0} & \text{if } y(\mathbf{w}^\top\mathbf{x} + b) \geq 1 \\ -y\mathbf{x} & \text{if } y(\mathbf{w}^\top\mathbf{x} + b) < 1 \end{cases}$$
Points with $y(\mathbf{w}^\top\mathbf{x} + b) \geq 1$ contribute zero gradient. They don't influence the solution. Only points at or inside the margin—the support vectors—have nonzero gradient and affect the model.
| Region | ∂L/∂w | ∂L/∂b | Interpretation |
|---|---|---|---|
| z ≥ 1 (outside margin) | 0 | 0 | Point doesn't affect solution |
| z < 1 (inside/wrong side) | -yx | -y | Point pushes boundary away |
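The sparsity argument is easy to see numerically. This sketch (random data and a hypothetical $\mathbf{w}, b$) assembles the per-point hinge subgradient with respect to $\mathbf{w}$; only points with $z_i < 1$ contribute, exactly as in the table above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))
y = rng.choice([-1.0, 1.0], size=8)
w, b = np.array([0.7, -0.4]), 0.2  # hypothetical parameters

z = y * (X @ w + b)                      # functional margins
active = (z < 1).astype(float)           # 1 for points inside the margin or misclassified

# Per-point hinge subgradient w.r.t. w: -y_i * x_i if z_i < 1, else 0
grad_contributions = -(active * y)[:, None] * X

print("margins:", np.round(z, 2))
print("points with nonzero gradient:", int(active.sum()), "of", len(z))
print("total hinge subgradient:", grad_contributions.sum(axis=0))
```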
The gradient pushes the boundary:
When a point has $z < 1$, the gradient $-y\mathbf{x}$ acts to push the decision boundary in the direction that increases this point's margin. For a positive example ($y = +1$) with insufficient margin, updating $\mathbf{w} \leftarrow \mathbf{w} + \eta y\mathbf{x} = \mathbf{w} + \eta\mathbf{x}$ increases $\mathbf{w}^\top\mathbf{x}$, pushing the prediction toward the positive side.
This continues until the point achieves margin 1, at which point its gradient becomes zero and it stops influencing the solution. This is the mechanism behind margin maximization—gradients from margin-violating points "push" the boundary until no point violates the margin (or the penalty for violation is balanced by the regularizer).
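A single update makes this concrete. In the sketch below (a hypothetical point and parameters, ignoring the regularizer term for clarity), one subgradient step on a margin-violating positive example strictly increases its functional margin:

```python
import numpy as np

x = np.array([0.4, -0.2])           # a positive example (y = +1), hypothetical
y = 1.0
w, b = np.array([0.3, 0.5]), 0.0
eta = 0.1

z = y * (w @ x + b)                 # functional margin before the update
assert z < 1                        # the point violates the margin

# Hinge-loss subgradient step for this point: w <- w + eta*y*x, b <- b + eta*y
w_new = w + eta * y * x
b_new = b + eta * y

print("margin before:", round(z, 4))                          # 0.02
print("margin after: ", round(y * (w_new @ x + b_new), 4))    # 0.14, strictly larger
```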
Understanding hinge loss requires comparing it to other classification losses. Each loss function embodies different assumptions about what makes a prediction "good."
The loss function zoo:
Let $z = y \cdot f(\mathbf{x})$ be the functional margin. Common loss functions include:
| Loss | Formula | Properties |
|---|---|---|
| 0-1 (misclassification) | $\mathbb{1}(z < 0)$ | Non-convex, non-differentiable |
| Hinge | $\max(0, 1-z)$ | Convex, piecewise linear |
| Squared hinge | $[\max(0, 1-z)]^2$ | Convex, differentiable |
| Logistic | $\log(1 + e^{-z})$ | Convex, smooth |
| Exponential | $e^{-z}$ | Convex, smooth |
| Squared | $(1-z)^2$ | Convex, smooth |
All losses except 0-1 are convex surrogates: they are convex and upper bound the 0-1 loss (for the logistic loss this holds after rescaling by $1/\ln 2$, i.e., measuring it in bits), enabling efficient optimization. The choice of surrogate affects both computational properties and the geometry of the learned classifier.
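The upper-bound property can be verified on a grid. The sketch below checks each surrogate against the 0-1 loss, using the base-2 (bits) version of the logistic loss as noted above:

```python
import numpy as np

z = np.linspace(-5, 5, 2001)
zero_one = (z < 0).astype(float)

hinge       = np.maximum(0, 1 - z)
sq_hinge    = np.maximum(0, 1 - z) ** 2
exponential = np.exp(-z)
logistic_b2 = np.log1p(np.exp(-z)) / np.log(2)   # logistic loss in bits (base 2)

for name, loss in [("hinge", hinge), ("squared hinge", sq_hinge),
                   ("exponential", exponential), ("logistic (base 2)", logistic_b2)]:
    print(f"{name:18s} upper-bounds 0-1 loss: {bool(np.all(loss >= zero_one))}")
```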
Key differences between losses:
Hinge vs. Logistic:
| Property | Hinge | Logistic |
|---|---|---|
| Smoothness | Non-smooth at z=1 | Smooth everywhere |
| Gradient at z→∞ | 0 (exactly) | Approaches 0 exponentially |
| Sparsity | Yes (exact zero gradient for z≥1) | No (always nonzero gradient) |
| Probability output | No | Yes (via sigmoid) |
| Outlier sensitivity | Linear tail | Linear tail |
Hinge loss creates sparsity because its gradient is exactly zero for confident predictions. Logistic loss always has nonzero (though small) gradients, so all points influence the solution to some degree.
Hinge vs. Exponential (AdaBoost):
| Property | Hinge | Exponential |
|---|---|---|
| Penalty for z<0 | Linear: 1-z | Exponential: e^{-z} |
| Outlier sensitivity | Moderate | Severe |
| Gradient magnitude for errors | Constant: 1 | Grows exponentially |
Exponential loss penalizes misclassifications exponentially, making it very sensitive to outliers and mislabeled data. Hinge loss is more robust—a point on the wrong side contributes loss linear in its margin, not exponential.
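A few values make the difference in tail behavior vivid (a small sketch; the numbers follow directly from the two formulas):

```python
import numpy as np

# How strongly does each loss react to a single badly misclassified point (an outlier)?
for z in [-1, -2, -4, -8]:
    hinge = max(0.0, 1.0 - z)
    expo = np.exp(-z)
    print(f"z = {z:+d}:  hinge = {hinge:6.1f}   exponential = {expo:10.1f}")
# The hinge penalty grows linearly while the exponential penalty explodes,
# so a few mislabeled points can dominate an exponential-loss objective.
```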
```python
import numpy as np
import matplotlib.pyplot as plt


def zero_one_loss(z):
    """0-1 loss: 1 if z < 0, else 0"""
    return (z < 0).astype(float)


def hinge_loss(z):
    """Hinge loss: max(0, 1-z)"""
    return np.maximum(0, 1 - z)


def squared_hinge_loss(z):
    """Squared hinge loss: max(0, 1-z)^2"""
    return np.maximum(0, 1 - z) ** 2


def logistic_loss(z):
    """Logistic loss: log(1 + exp(-z))"""
    # Numerically stable version
    return np.log1p(np.exp(-z))


def exponential_loss(z):
    """Exponential loss: exp(-z)"""
    return np.exp(-z)


def squared_loss(z):
    """Squared loss: (1-z)^2 (for margin)"""
    return (1 - z) ** 2


# Plot comparison
z = np.linspace(-2.5, 3, 1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: All losses
ax = axes[0]
ax.plot(z, zero_one_loss(z), 'k--', linewidth=2, label='0-1 Loss')
ax.plot(z, hinge_loss(z), 'b-', linewidth=2, label='Hinge')
ax.plot(z, logistic_loss(z), 'r-', linewidth=2, label='Logistic')
ax.plot(z, exponential_loss(z), 'g-', linewidth=2, label='Exponential')
ax.plot(z, squared_hinge_loss(z), 'm--', linewidth=2, label='Squared Hinge')

ax.axvline(x=0, color='gray', linestyle=':', alpha=0.5, label='Decision boundary')
ax.axvline(x=1, color='orange', linestyle=':', alpha=0.5, label='Margin')
ax.set_xlim(-2.5, 3)
ax.set_ylim(-0.1, 4)
ax.set_xlabel('Functional Margin z = y·f(x)', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Classification Loss Functions', fontsize=14)
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Right: Gradient comparison
ax = axes[1]


def hinge_grad(z):
    return np.where(z >= 1, 0, -1)


def logistic_grad(z):
    return -1 / (1 + np.exp(z))


def exponential_grad(z):
    return -np.exp(-z)


ax.plot(z, hinge_grad(z), 'b-', linewidth=2, label='Hinge gradient')
ax.plot(z, logistic_grad(z), 'r-', linewidth=2, label='Logistic gradient')
ax.plot(z, exponential_grad(z), 'g-', linewidth=2, label='Exponential gradient')

ax.axvline(x=1, color='orange', linestyle=':', alpha=0.5, label='z=1 (margin)')
ax.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax.set_xlim(-2.5, 3)
ax.set_ylim(-4, 0.5)
ax.set_xlabel('Functional Margin z = y·f(x)', fontsize=12)
ax.set_ylabel('Gradient ∂L/∂z', fontsize=12)
ax.set_title('Loss Function Gradients', fontsize=14)
ax.legend(loc='lower right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('loss_function_comparison.png', dpi=150)
plt.show()

# Print key observations
print("KEY OBSERVATIONS:")
print("-" * 50)
print("At z=2 (confident correct prediction):")
print(f"  Hinge: {hinge_loss(2):.4f}, Gradient: {hinge_grad(2)}")
print(f"  Logistic: {logistic_loss(2):.4f}, Gradient: {logistic_grad(2):.4f}")
print()
print("At z=-1 (confident wrong prediction):")
print(f"  Hinge: {hinge_loss(-1):.4f}, Gradient: {hinge_grad(-1)}")
print(f"  Logistic: {logistic_loss(-1):.4f}, Gradient: {logistic_grad(-1):.4f}")
print(f"  Exponential: {exponential_loss(-1):.4f}, Gradient: {exponential_grad(-1):.4f}")
```

Why hinge loss creates margins:
The key insight is that hinge loss has zero loss and zero gradient for $z \geq 1$. This means that once a point is correctly classified with functional margin at least 1, it stops contributing to the objective entirely; all of the optimization pressure comes from points near or violating the margin, which is precisely what determines the large-margin boundary.
In contrast, logistic loss always has positive loss and nonzero gradient, even for very confident predictions. This means logistic regression tries to push all predictions toward infinity, while SVM only cares about getting predictions past the margin threshold.
The hinge loss has several important mathematical properties that influence SVM behavior.
Property 1: Convexity
Hinge loss is convex. To see this, note that it is the pointwise maximum of two affine (hence convex) functions of $z$: the constant function $0$ and the line $1 - z$. The pointwise maximum of convex functions is convex.
Convexity is crucial—it guarantees that any local minimum is a global minimum, and that gradient-based optimization will converge to the optimum.
Property 2: Lipschitz continuity
Hinge loss is Lipschitz continuous with constant 1: $$|L_{\text{hinge}}(z_1) - L_{\text{hinge}}(z_2)| \leq |z_1 - z_2|$$
This bounded rate of change provides stability guarantees for optimization algorithms and generalization bounds.
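A quick empirical check of the Lipschitz bound over random pairs (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(2)
z1 = rng.uniform(-5, 5, size=100_000)
z2 = rng.uniform(-5, 5, size=100_000)

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

# Guard against (extremely unlikely) identical pairs before dividing
mask = np.abs(z1 - z2) > 1e-12
ratio = np.abs(hinge(z1[mask]) - hinge(z2[mask])) / np.abs(z1[mask] - z2[mask])
print("max |L(z1) - L(z2)| / |z1 - z2| over random pairs:", ratio.max())  # <= 1
```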
Property 3: Non-smooth at z=1
Hinge loss is not differentiable at $z = 1$. The left derivative is $-1$ and the right derivative is $0$. This non-smoothness is where the "kink" occurs, and it's precisely where support vectors lie.
At the non-smooth point z=1, we use subgradients instead of gradients. The subdifferential at z=1 is the interval [-1, 0]. Any value in this range is a valid subgradient and can be used in optimization. This is why primal SVM solvers that work directly with the hinge loss rely on subgradient methods rather than plain gradient descent.
Property 4: Upper bound on 0-1 loss
For all $z$: $L_{\text{hinge}}(z) \geq \mathbb{1}(z < 0)$
Proof: if $z < 0$, then $1 - z > 1$, so $L_{\text{hinge}}(z) = 1 - z \geq 1 = \mathbb{1}(z < 0)$. If $z \geq 0$, then $L_{\text{hinge}}(z) \geq 0 = \mathbb{1}(z < 0)$. Hence minimizing the hinge loss also controls the intractable 0-1 loss.
Property 5: Calibration
Hinge loss is classification-calibrated, meaning that minimizing expected hinge loss leads to the Bayes optimal classifier for 0-1 loss as sample size grows. This is a fundamental theoretical guarantee that SVM will learn the optimal decision boundary.
Property 6: Margin awareness
Unlike 0-1 loss, hinge loss penalizes predictions even when correct if they're not confident enough (inside the margin). This is why SVM produces large-margin classifiers—the loss function explicitly penalizes small margins.
| Property | Statement | Implication |
|---|---|---|
| Convexity | L(z) is convex in z | Unique global optimum, efficient optimization |
| Lipschitz | \|L(z₁) - L(z₂)\| ≤ \|z₁ - z₂\| | Stable optimization, generalization bounds |
| Non-smooth | Not differentiable at z=1 | Requires subgradient methods |
| Upper bound | L(z) ≥ 𝟙(z<0) | Minimizing hinge controls 0-1 loss |
| Calibration | Minimizer → Bayes optimal | Statistically consistent |
| Margin aware | L(z) > 0 for 0 < z < 1 | Produces large-margin classifiers |
The non-smoothness of hinge loss at $z=1$ requires careful treatment in optimization. We use subgradients instead of gradients.
Subgradient definition:
A vector $\mathbf{g}$ is a subgradient of convex function $f$ at point $\mathbf{x}_0$ if, for all $\mathbf{x}$: $$f(\mathbf{x}) \geq f(\mathbf{x}_0) + \mathbf{g}^\top(\mathbf{x} - \mathbf{x}_0)$$
The set of all subgradients at $\mathbf{x}_0$ is called the subdifferential, denoted $\partial f(\mathbf{x}_0)$.
For hinge loss with respect to $z$:
$$\partial L_{\text{hinge}}(z) = \begin{cases} \{-1\} & z < 1 \\ [-1, 0] & z = 1 \\ \{0\} & z > 1 \end{cases}$$
At z=1, any value between -1 and 0 (inclusive) is a valid subgradient. This flexibility means that optimization algorithms have choices when a training point lands exactly on the margin. In practice, we can use 0 or -1 consistently, or take the average (-0.5).
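The claim that any $g \in [-1, 0]$ works at $z = 1$ follows directly from the subgradient inequality; this sketch verifies it on a grid:

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

z0 = 1.0
z = np.linspace(-3, 4, 1001)

# Every g in [-1, 0] satisfies the subgradient inequality L(z) >= L(z0) + g*(z - z0)
for g in [-1.0, -0.5, 0.0]:
    ok = bool(np.all(hinge(z) >= hinge(z0) + g * (z - z0) - 1e-12))
    print(f"g = {g:+.1f} is a valid subgradient at z = 1: {ok}")

# A value outside [-1, 0] is not:
g = 0.5
print("g = +0.5 valid:", bool(np.all(hinge(z) >= hinge(z0) + g * (z - z0))))  # False
```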
Subgradient of the SVM objective:
The unconstrained soft margin SVM objective is: $$J(\mathbf{w}, b) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$$
The subgradient with respect to $\mathbf{w}$: $$\mathbf{g}_{\mathbf{w}} \in \mathbf{w} + C\sum_{i=1}^n \partial_{\mathbf{w}} L_i$$
where, for each term: $$\partial_{\mathbf{w}} L_i = \begin{cases} \{-y_i \mathbf{x}_i\} & y_i(\mathbf{w}^\top\mathbf{x}_i + b) < 1 \\ \{\mathbf{0}\} \text{ or } \{-y_i \mathbf{x}_i\} & y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1 \\ \{\mathbf{0}\} & y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 1 \end{cases}$$
Subgradient descent algorithm:
The basic subgradient descent update:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \mathbf{g}^{(t)}$$
where $\mathbf{g}^{(t)}$ is any subgradient at $\mathbf{w}^{(t)}$.
Unlike gradient descent, subgradient descent may not decrease the objective at every step. Convergence requires diminishing step sizes: $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ (e.g., $\eta_t = 1/\sqrt{t}$).
```python
import numpy as np


def svm_subgradient_descent(X, y, C, n_epochs=1000, eta_0=1.0):
    """
    Train SVM using subgradient descent on the hinge loss.

    Uses step size: eta_t = eta_0 / sqrt(t)

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,) with values in {-1, +1}
    C : float, regularization parameter
    n_epochs : int
    eta_0 : float, initial learning rate

    Returns
    -------
    w, b : learned parameters
    history : dict with training history
    """
    n_samples, n_features = X.shape

    # Initialize
    w = np.zeros(n_features)
    b = 0.0

    history = {'objective': [], 'w_norm': [], 'n_violations': []}

    t = 1  # iteration counter for step size

    for epoch in range(n_epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)

        for i in indices:
            # Compute functional margin
            z = y[i] * (np.dot(w, X[i]) + b)

            # Step size
            eta = eta_0 / np.sqrt(t)
            t += 1

            # Compute subgradient
            if z < 1:
                # Margin violation: include hinge term
                g_w = w - C * y[i] * X[i]
                g_b = -C * y[i]
            else:
                # No violation: only regularizer
                g_w = w
                g_b = 0.0

            # Update
            w = w - eta * g_w
            b = b - eta * g_b

        # Track objective
        margins = y * (X @ w + b)
        hinge_losses = np.maximum(0, 1 - margins)
        objective = 0.5 * np.dot(w, w) + C * np.sum(hinge_losses)
        n_violations = np.sum(margins < 1)

        history['objective'].append(objective)
        history['w_norm'].append(np.linalg.norm(w))
        history['n_violations'].append(n_violations)

    return w, b, history


def svm_pegasos(X, y, C, n_epochs=100, batch_size=1):
    """
    PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM.

    A popular and efficient variant of subgradient descent for SVM.
    Uses projected subgradient updates with step size 1/(lambda*t).

    Note: Pegasos minimizes (lambda/2)*||w||^2 + average hinge loss; here
    we take lambda = 1/C to relate the C-parameterized objective to that
    form (the correspondence is up to an overall scaling).
    """
    n_samples, n_features = X.shape
    lambda_param = 1.0 / C

    w = np.zeros(n_features)
    # Note: PEGASOS typically doesn't update bias; we omit for simplicity

    t = 1
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        for i in indices:
            eta = 1.0 / (lambda_param * t)
            t += 1

            z = y[i] * np.dot(w, X[i])

            if z < 1:
                w = (1 - eta * lambda_param) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lambda_param) * w

    return w


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Create overlapping dataset
    n = 200
    X_pos = np.random.randn(n//2, 2) + [1, 1]
    X_neg = np.random.randn(n//2, 2) + [-1, -1]
    X = np.vstack([X_pos, X_neg])
    y = np.array([1]*(n//2) + [-1]*(n//2))

    # Train with subgradient descent
    w, b, history = svm_subgradient_descent(X, y, C=1.0, n_epochs=100)

    print(f"Final w: [{w[0]:.4f}, {w[1]:.4f}]")
    print(f"Final b: {b:.4f}")
    print(f"Final objective: {history['objective'][-1]:.4f}")
    print(f"Margin violations: {history['n_violations'][-1]}")

    # Accuracy
    predictions = np.sign(X @ w + b)
    accuracy = np.mean(predictions == y)
    print(f"Training accuracy: {accuracy:.2%}")
```

A popular variant is the squared hinge loss (also called L2-SVM or L2 loss):
$$L_{\text{sq-hinge}}(z) = [\max(0, 1-z)]^2 = [1-z]_+^2$$
This corresponds to the soft margin formulation with squared slack:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n \xi_i^2$$ $$\text{s.t. } y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i$$
L1-SVM uses Σξᵢ (standard), L2-SVM uses Σξᵢ². L1 is more robust to outliers and produces sparser αs in the dual. L2 is differentiable and sometimes faster to optimize. The choice depends on the application and data characteristics.
Gradient of squared hinge:
$$\frac{d L_{\text{sq-hinge}}}{dz} = \begin{cases} -2(1-z) & z < 1 \\ 0 & z \geq 1 \end{cases}$$
The gradient is continuous at $z=1$ (both sides approach 0), making the function differentiable.
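Unlike the plain hinge, the squared hinge's derivative matches from both sides at $z = 1$; a small numerical sketch:

```python
import numpy as np

def sq_hinge(z):
    return np.maximum(0.0, 1.0 - z) ** 2

def sq_hinge_grad(z):
    return np.where(z < 1, -2.0 * (1.0 - z), 0.0)

eps = 1e-6
print("gradient just left of z=1: ", sq_hinge_grad(1 - eps))   # ~ -2e-06
print("gradient just right of z=1:", sq_hinge_grad(1 + eps))   # 0.0

# Compare with a finite-difference estimate at z = 1 (also ~ 0)
h = 1e-6
print("central difference at z=1: ", (sq_hinge(1 + h) - sq_hinge(1 - h)) / (2 * h))
```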
Comparison:
| Property | Hinge (L1-SVM) | Squared Hinge (L2-SVM) |
|---|---|---|
| Smoothness | Kink at z=1 | Smooth everywhere |
| Gradient at z<1 | Constant: -1 | Linear: -2(1-z) |
| Outlier sensitivity | Moderate | High |
| Dual sparsity | More sparse | Less sparse |
| Typical use | General classification | Text classification |
| z (margin) | Hinge Loss | Squared Hinge | Logistic Loss |
|---|---|---|---|
| z = 2 | 0 | 0 | 0.127 |
| z = 1 | 0 | 0 | 0.313 |
| z = 0.5 | 0.5 | 0.25 | 0.474 |
| z = 0 | 1 | 1 | 0.693 |
| z = -1 | 2 | 4 | 1.313 |
| z = -2 | 3 | 9 | 2.127 |
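The table values can be reproduced in a few lines (a minimal sketch using the formulas above):

```python
import numpy as np

print(f"{'z':>6} {'hinge':>8} {'sq hinge':>10} {'logistic':>10}")
for z in [2, 1, 0.5, 0, -1, -2]:
    hinge = max(0.0, 1.0 - z)
    sq = hinge ** 2
    logistic = np.log1p(np.exp(-z))
    print(f"{z:>6} {hinge:>8.3f} {sq:>10.3f} {logistic:>10.3f}")
```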
The hinge loss provides a fundamental lens for understanding SVM behavior. Let us consolidate the key insights:

- The hinge loss formulation is exactly equivalent to the slack variable formulation: at optimality, $\xi_i^* = \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b^*))$.
- Zero loss and zero gradient for $z \geq 1$ mean that confidently correct points do not influence the solution; only support vectors do, which produces sparse solutions.
- The hinge loss is a convex upper bound on the 0-1 loss, making it a tractable and classification-calibrated surrogate for classification error.
- Its linear tail for $z < 0$ makes it more robust to outliers than the exponential loss, while its penalty for $0 < z < 1$ is what enforces large margins.
- The kink at $z = 1$ makes the loss non-smooth, which is handled with subgradient methods (or avoided by using the squared hinge).
What's next:
With hinge loss understood, we turn to the crucial question: how do we control the margin-violation trade-off? The answer lies in the C parameter, which balances margin width against training error. Understanding C is essential for practical SVM application—mistuning it can lead to overfitting or underfitting.
You now understand hinge loss—the loss function that gives SVM its defining characteristics. The margin-awareness, sparsity, and robustness of SVM all trace back to the hinge loss. This perspective will illuminate the dual formulation and kernel methods in subsequent pages.