Every learning algorithm optimizes a loss function—a mathematical expression that quantifies how wrong predictions are. The choice of loss function profoundly impacts the resulting classifier's behavior, robustness, and probabilistic interpretation.
Different losses lead to different classifiers, even with the same hypothesis class. Understanding these differences is essential for choosing appropriate methods.
By the end of this page, you will understand the 0-1 loss and why it's intractable, surrogate losses that enable optimization, log loss (cross-entropy) for probabilistic classifiers, hinge loss for margin-based methods, and how different losses shape classifier properties.
Definition of the 0-1 Loss
The most natural classification loss: count errors.
$$L_{0-1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}] = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
The corresponding risk is the error rate: $$R_{0-1}(f) = \mathbb{E}[\mathbb{1}[y \neq f(\mathbf{x})]] = P(f(\mathbf{x}) \neq y)$$
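As a concrete illustration, here is a minimal NumPy sketch that computes the empirical 0-1 risk for a handful of made-up labels and predictions (the arrays are purely illustrative):

```python
import numpy as np

# Hypothetical labels and predictions in {-1, +1}, purely for illustration
y_true = np.array([1, -1, 1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1, 1, 1, -1])

# Empirical 0-1 risk = fraction of misclassified points (the error rate)
empirical_risk = np.mean(y_true != y_pred)
print(f"Empirical 0-1 risk (error rate): {empirical_risk:.3f}")  # 2/8 = 0.250
```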
Why We Can't Optimize 0-1 Loss Directly
Despite being the "true" objective, 0-1 loss is problematic: it is non-convex, piecewise constant with zero gradient almost everywhere (so gradient-based methods get no signal), and minimizing it exactly over a linear hypothesis class is NP-hard in general.
We want to minimize 0-1 loss but can't optimize it directly. Instead, we minimize surrogate losses that are tractable and whose minimization approximately minimizes 0-1 loss. This surrogate approach is central to machine learning.
The Surrogate Framework
A surrogate loss $\phi$ replaces 0-1 loss with a tractable alternative. Good surrogates satisfy two requirements: they are convex, so minimization is tractable, and they are classification-calibrated, meaning that driving the surrogate risk down also drives the 0-1 risk down. Many common surrogates (hinge, exponential) additionally upper-bound the 0-1 loss pointwise.
Margin-Based Losses
Using $y \in \{-1, +1\}$ encoding, define margin $m = y \cdot f(\mathbf{x})$:
Margin-based losses $\phi(m)$ depend only on this product.
| Loss | Formula $\phi(m)$ | Key Property |
|---|---|---|
| 0-1 Loss | $\mathbb{1}[m \leq 0]$ | True objective (not a surrogate) |
| Hinge | $\max(0, 1-m)$ | Margin-based, sparse (SVM) |
| Logistic | $\log(1 + e^{-m})$ | Smooth, probabilistic |
| Exponential | $e^{-m}$ | Used in AdaBoost |
| Squared Hinge | $\max(0, 1-m)^2$ | Differentiable everywhere |
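Before comparing these losses numerically, here is a quick sanity check (over an arbitrary grid of margins) that the hinge and exponential losses sit above the 0-1 loss pointwise, one of the properties that makes them useful surrogates:

```python
import numpy as np

m = np.linspace(-3, 3, 601)          # arbitrary grid of margin values
zero_one = (m <= 0).astype(float)    # 0-1 loss in margin form
hinge = np.maximum(0, 1 - m)
exponential = np.exp(-m)

print("hinge >= 0-1 everywhere:      ", bool(np.all(hinge >= zero_one)))
print("exponential >= 0-1 everywhere:", bool(np.all(exponential >= zero_one)))
```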
```python
import numpy as np

def zero_one_loss(margin):
    """0-1 loss: 1 if margin <= 0, else 0."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss: max(0, 1 - margin)."""
    return np.maximum(0, 1 - margin)

def logistic_loss(margin):
    """Logistic (log) loss: log(1 + exp(-margin))."""
    # Numerically stable version
    return np.log1p(np.exp(-np.clip(margin, -500, 500)))

def exponential_loss(margin):
    """Exponential loss: exp(-margin)."""
    return np.exp(-np.clip(margin, -500, 500))

def squared_hinge_loss(margin):
    """Squared hinge loss: max(0, 1 - margin)^2."""
    return np.maximum(0, 1 - margin) ** 2

# Compare losses at different margins
margins = np.linspace(-2, 3, 100)  # grid of margins (e.g., for plotting the curves)
print("Margin | 0-1 | Hinge | Logistic | Exponential")
for m in [-1.5, -0.5, 0, 0.5, 1.0, 1.5, 2.0]:
    print(f"{m:6.1f} | {zero_one_loss(np.array([m]))[0]:3.0f} | "
          f"{hinge_loss(np.array([m]))[0]:5.2f} | "
          f"{logistic_loss(np.array([m]))[0]:8.4f} | "
          f"{exponential_loss(np.array([m]))[0]:11.4f}")
```

Definition of Log Loss (Cross-Entropy)
For probabilistic predictions $\hat{p} = P(Y=1|\mathbf{x})$ and labels $y \in \{0, 1\}$:
$$L_{\text{log}}(y, \hat{p}) = -[y \log \hat{p} + (1-y) \log(1-\hat{p})]$$
Equivalently, using margin notation with $y \in \{-1, +1\}$ and score $f(\mathbf{x})$: $$L_{\text{log}}(m) = \log(1 + e^{-m})$$
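To see why the two forms agree, here is a small sketch (the score value is arbitrary), assuming the probability comes from the score via the sigmoid, $\hat{p} = 1/(1 + e^{-f(\mathbf{x})})$:

```python
import numpy as np

def log_loss_prob(y01, p_hat):
    """Log loss in the {0, 1} / probability form."""
    return -(y01 * np.log(p_hat) + (1 - y01) * np.log(1 - p_hat))

def log_loss_margin(y_pm, score):
    """Log loss in the {-1, +1} / margin form."""
    return np.log1p(np.exp(-y_pm * score))

score = 1.3                        # arbitrary classifier score f(x)
p_hat = 1 / (1 + np.exp(-score))   # sigmoid turns the score into a probability

for y01, y_pm in [(1, +1), (0, -1)]:
    print(f"y={y01}: prob form {log_loss_prob(y01, p_hat):.6f}, "
          f"margin form {log_loss_margin(y_pm, score):.6f}")
# Both forms give identical values for each label.
```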
Properties of Log Loss
Log loss is convex and smooth in the margin, and it equals the negative log-likelihood of the observed label under the predicted probability; minimizing it encourages $\hat{p}$ to match the true conditional probability $P(Y=1|\mathbf{x})$, which is why log-loss classifiers produce calibrated probabilities.
The Penalty Structure
| Prediction | True Label | Loss | Interpretation |
|---|---|---|---|
| $\hat{p} = 0.9$ | $y = 1$ | $-\log(0.9) \approx 0.105$ | Good prediction, small loss |
| $\hat{p} = 0.5$ | $y = 1$ | $-\log(0.5) \approx 0.693$ | Uncertain, moderate loss |
| $\hat{p} = 0.1$ | $y = 1$ | $-\log(0.1) \approx 2.303$ | Wrong and confident, large loss |
| $\hat{p} = 0.01$ | $y = 1$ | $-\log(0.01) \approx 4.605$ | Very wrong, severe loss |
Key insight: Log loss penalizes confident wrong predictions severely: the loss grows without bound as the predicted probability of the true class approaches zero. Predicting 0.01 for a true positive costs over 40 times as much as predicting 0.9.
If you predict p=0 for a true positive, log loss is infinite! This forces models to never assign zero probability. In practice, predictions are clipped: p = max(ε, min(1-ε, p)) with small ε ≈ 10⁻¹⁵.
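A minimal sketch of that clipping trick (the ε default below mirrors the value mentioned above):

```python
import numpy as np

def safe_log_loss(y, p_hat, eps=1e-15):
    """Log loss with predictions clipped away from exact 0 and 1."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(safe_log_loss(1, 0.0))   # ~34.5 instead of infinity
print(safe_log_loss(1, 0.9))   # ~0.105, unaffected by clipping
```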
Definition of Hinge Loss
For margin $m = y \cdot f(\mathbf{x})$ with $y \in \{-1, +1\}$:
$$L_{\text{hinge}}(m) = \max(0, 1 - m) = [1 - m]_+$$
where $[z]_+ = \max(0, z)$ is the "positive part" or ReLU function.
Properties of Hinge Loss
| Margin | Loss | Status |
|---|---|---|
| $m < 0$ | $1 - m > 1$ | Misclassified, loss grows linearly |
| $0 \leq m < 1$ | $0 < 1 - m \leq 1$ | Correct but within margin, nonzero loss |
| $m \geq 1$ | $0$ | Correct with sufficient margin, zero loss |
Connection to SVMs
Support Vector Machines minimize regularized hinge loss: $$\min_{\mathbf{w}, b} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)) + \lambda \|\mathbf{w}\|^2$$
The hinge loss creates the characteristic SVM behavior: only points near (or violating) the margin boundary affect the solution.
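As a sketch of what that objective looks like in code (not a full SVM solver; the tiny dataset, weights, and λ below are placeholders chosen for illustration):

```python
import numpy as np

def svm_objective(w, b, X, y, lam=0.1):
    """Regularized hinge loss: mean hinge loss + lam * ||w||^2."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0, 1 - margins)
    return hinge.mean() + lam * np.dot(w, w)

# Tiny made-up dataset, just to show the pieces of the objective
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([0.5, 0.2]), 0.0

print(f"Objective value: {svm_objective(w, b, X, y):.3f}")
# Only the points with margin < 1 contribute to the hinge term.
```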
Different losses induce different classifier behaviors. Understanding these differences guides method selection.
| Property | Log Loss | Hinge Loss | Exponential |
|---|---|---|---|
| Probabilistic output | Yes (calibrated) | No (scores) | No (scores) |
| Outlier robustness | Moderate | High (linear penalty) | Low (exponential penalty) |
| Sparsity in support | No | Yes (support vectors) | No |
| Differentiability | Everywhere | Not at m=1 | Everywhere |
| Effect of confident correct | Small positive loss | Zero loss | Small positive loss |
| Optimizer | Gradient descent | QP or SGD | Coordinate descent |
Need probabilities → Log loss (logistic regression). Want geometric margins → Hinge loss (SVM). Sensitive to outliers → Hinge loss is more robust. Need smooth optimization → Log loss or squared hinge. Boosting → Exponential loss.
Asymptotic Behavior
As margin $m \to \infty$ (very confident correct): hinge loss is exactly zero once $m \geq 1$, while logistic and exponential losses decay to zero (logistic roughly like $e^{-m}$), so confident correct points contribute essentially nothing.
As margin $m \to -\infty$ (very confident wrong): hinge and logistic losses grow only linearly in $|m|$, but exponential loss blows up like $e^{|m|}$, which is why exponential loss is so sensitive to outliers and mislabeled points.
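A quick numerical look at these tails (the margin values are chosen arbitrarily):

```python
import numpy as np

def hinge(m): return np.maximum(0.0, 1.0 - m)
def logistic(m): return np.log1p(np.exp(-m))
def exponential(m): return np.exp(-m)

for m in [5.0, 20.0, -5.0, -20.0]:
    print(f"m={m:6.1f}  hinge={hinge(m):12.3f}  "
          f"logistic={logistic(m):12.3f}  exponential={exponential(m):14.3e}")
# m >> 0: hinge is exactly 0, logistic and exponential decay to ~0.
# m << 0: hinge and logistic grow roughly linearly; exponential explodes.
```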
Convexity and Global Optima
All common surrogate losses (log, hinge, exponential) are convex in the margin. For linear classifiers, this means:
$$J(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n \phi(y_i \cdot \mathbf{w}^T\mathbf{x}_i)$$
is convex in $\mathbf{w}$, since $\phi$ is convex and the margin is linear in $\mathbf{w}$ (a convex function composed with an affine map stays convex). Every local minimum is therefore a global minimum, so gradient descent with a suitable step size converges to the global optimum.
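One way to sanity-check this numerically (the random data, labels, and weight vectors below are arbitrary) is to verify the chord inequality $J(t\mathbf{w}_1 + (1-t)\mathbf{w}_2) \leq t J(\mathbf{w}_1) + (1-t) J(\mathbf{w}_2)$ for the logistic objective:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # arbitrary synthetic features
y = rng.choice([-1, 1], size=50)    # arbitrary labels

def J(w):
    """Average logistic loss of a linear classifier with weights w."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

w1, w2 = rng.normal(size=3), rng.normal(size=3)
for t in [0.25, 0.5, 0.75]:
    lhs = J(t * w1 + (1 - t) * w2)
    rhs = t * J(w1) + (1 - t) * J(w2)
    print(f"t={t:.2f}: J(interpolated)={lhs:.4f} <= {rhs:.4f} -> {lhs <= rhs}")
```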
Gradient Computation
```python
import numpy as np

def logistic_loss_gradient(X, y, w):
    """
    Gradient of logistic loss for a linear classifier.
    y in {-1, +1}; w includes the bias (X has a column of 1s).
    """
    margins = y * (X @ w)
    # Gradient of log(1 + exp(-m)) w.r.t. w is -y * x * sigmoid(-m)
    probs = 1 / (1 + np.exp(margins))  # sigmoid(-m)
    return -X.T @ (y * probs) / len(y)

def hinge_loss_gradient(X, y, w):
    """Subgradient of hinge loss for a linear classifier."""
    margins = y * (X @ w)
    # Subgradient is -y * x for points with margin < 1, else 0
    mask = (margins < 1).astype(float)
    return -X.T @ (y * mask) / len(y)

# Gradient descent example
def train_logistic(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = logistic_loss_gradient(X, y, w)
        w -= lr * grad
    return w
```

In practice, we minimize loss plus regularization: J(w) = Loss(w) + λR(w). Regularization (L2, L1) prevents overfitting, and L2 in particular makes the solution unique. The loss function determines the fit to the data; regularization controls model complexity.
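For example, here is a minimal sketch of adding an L2 penalty to the training loop above (reusing logistic_loss_gradient; the λ value is an arbitrary placeholder, and for simplicity the bias weight is penalized too, which practical implementations usually avoid):

```python
def train_logistic_l2(X, y, lam=0.01, lr=0.1, epochs=100):
    """Gradient descent on mean logistic loss + lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of the penalty lam * ||w||^2 is 2 * lam * w
        grad = logistic_loss_gradient(X, y, w) + 2 * lam * w
        w -= lr * grad
    return w
```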
You've completed the binary classification foundation! You understand problem formulation, decision boundaries, probabilistic interpretation, linear classifiers, and loss functions. This prepares you perfectly for the next module: the logistic regression model.