Every learning algorithm optimizes a loss function—a mathematical expression that quantifies how wrong predictions are. The choice of loss function profoundly impacts the resulting classifier's behavior, robustness, and probabilistic interpretation.
Different losses lead to different classifiers, even with the same hypothesis class. Understanding these differences is essential for choosing appropriate methods.
By the end of this page, you will understand the 0-1 loss and why it's intractable, surrogate losses that enable optimization, log loss (cross-entropy) for probabilistic classifiers, hinge loss for margin-based methods, and how different losses shape classifier properties.
Definition of the 0-1 Loss
The most natural classification loss: count errors.
$$L_{0-1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}] = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
The corresponding risk is the error rate: $$R_{0-1}(f) = \mathbb{E}[\mathbb{1}[y \neq f(\mathbf{x})]] = P(f(\mathbf{x}) \neq y)$$
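As a concrete illustration, here is a minimal NumPy sketch that computes the empirical 0-1 risk for a handful of made-up labels and predictions (the arrays are purely illustrative):

```python
import numpy as np

# Hypothetical labels and predictions in {-1, +1}, purely for illustration
y_true = np.array([1, -1, 1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, -1, 1, -1, 1, 1, -1])

# Empirical 0-1 risk = fraction of misclassified points (the error rate)
empirical_risk = np.mean(y_true != y_pred)
print(f"Empirical 0-1 risk (error rate): {empirical_risk:.3f}")  # 2/8 = 0.250
```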
Why We Can't Optimize 0-1 Loss Directly
Despite being the "true" objective, 0-1 loss is problematic: it is non-convex, piecewise constant with zero gradient almost everywhere (so gradient-based methods get no signal), and minimizing it exactly over a linear hypothesis class is NP-hard in general.
We want to minimize 0-1 loss but can't optimize it directly. Instead, we minimize surrogate losses that are tractable and whose minimization approximately minimizes 0-1 loss. This surrogate approach is central to machine learning.
The Surrogate Framework
A surrogate loss $\phi$ replaces 0-1 loss with a tractable alternative. Good surrogates satisfy two requirements: they are convex, so minimization is tractable, and they are classification-calibrated, meaning that driving the surrogate risk down also drives the 0-1 risk down. Many common surrogates (hinge, exponential) additionally upper-bound the 0-1 loss pointwise.
Margin-Based Losses
Using $y \in \{-1, +1\}$ encoding, define margin $m = y \cdot f(\mathbf{x})$:
Margin-based losses $\phi(m)$ depend only on this product.
| Loss | Formula $\phi(m)$ | Key Property |
|---|---|---|
| 0-1 Loss | $\mathbb{1}[m \leq 0]$ | True objective (not a surrogate) |
| Hinge | $\max(0, 1-m)$ | Margin-based, sparse (SVM) |
| Logistic | $\log(1 + e^{-m})$ | Smooth, probabilistic |
| Exponential | $e^{-m}$ | Used in AdaBoost |
| Squared Hinge | $\max(0, 1-m)^2$ | Differentiable everywhere |
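Before comparing these losses numerically, here is a quick sanity check (over an arbitrary grid of margins) that the hinge and exponential losses sit above the 0-1 loss pointwise, one of the properties that makes them useful surrogates:

```python
import numpy as np

m = np.linspace(-3, 3, 601)          # arbitrary grid of margin values
zero_one = (m <= 0).astype(float)    # 0-1 loss in margin form
hinge = np.maximum(0, 1 - m)
exponential = np.exp(-m)

print("hinge >= 0-1 everywhere:      ", bool(np.all(hinge >= zero_one)))
print("exponential >= 0-1 everywhere:", bool(np.all(exponential >= zero_one)))
```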
```python
import numpy as np

def zero_one_loss(margin):
    """0-1 loss: 1 if margin <= 0, else 0."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss: max(0, 1 - margin)."""
    return np.maximum(0, 1 - margin)

def logistic_loss(margin):
    """Logistic (log) loss: log(1 + exp(-margin))."""
    # Numerically stable version
    return np.log1p(np.exp(-np.clip(margin, -500, 500)))

def exponential_loss(margin):
    """Exponential loss: exp(-margin)."""
    return np.exp(-np.clip(margin, -500, 500))

def squared_hinge_loss(margin):
    """Squared hinge loss: max(0, 1 - margin)^2."""
    return np.maximum(0, 1 - margin) ** 2

# Compare losses at different margins
margins = np.linspace(-2, 3, 100)  # grid of margins (e.g., for plotting the curves)
print("Margin | 0-1 | Hinge | Logistic | Exponential")
for m in [-1.5, -0.5, 0, 0.5, 1.0, 1.5, 2.0]:
    print(f"{m:6.1f} | {zero_one_loss(np.array([m]))[0]:3.0f} | "
          f"{hinge_loss(np.array([m]))[0]:5.2f} | "
          f"{logistic_loss(np.array([m]))[0]:8.4f} | "
          f"{exponential_loss(np.array([m]))[0]:11.4f}")
```

Definition of Log Loss (Cross-Entropy)
For probabilistic predictions $\hat{p} = P(Y=1|\mathbf{x})$ and labels $y \in \{0, 1\}$:
$$L_{\text{log}}(y, \hat{p}) = -[y \log \hat{p} + (1-y) \log(1-\hat{p})]$$
Equivalently, using margin notation with $y \in \{-1, +1\}$ and score $f(\mathbf{x})$: $$L_{\text{log}}(m) = \log(1 + e^{-m})$$
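To see why the two forms agree, here is a small sketch (the score value is arbitrary), assuming the probability comes from the score via the sigmoid, $\hat{p} = 1/(1 + e^{-f(\mathbf{x})})$:

```python
import numpy as np

def log_loss_prob(y01, p_hat):
    """Log loss in the {0, 1} / probability form."""
    return -(y01 * np.log(p_hat) + (1 - y01) * np.log(1 - p_hat))

def log_loss_margin(y_pm, score):
    """Log loss in the {-1, +1} / margin form."""
    return np.log1p(np.exp(-y_pm * score))

score = 1.3                        # arbitrary classifier score f(x)
p_hat = 1 / (1 + np.exp(-score))   # sigmoid turns the score into a probability

for y01, y_pm in [(1, +1), (0, -1)]:
    print(f"y={y01}: prob form {log_loss_prob(y01, p_hat):.6f}, "
          f"margin form {log_loss_margin(y_pm, score):.6f}")
# Both forms give identical values for each label.
```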
Properties of Log Loss
Log loss is convex and smooth in the margin, and it equals the negative log-likelihood of the observed label under the predicted probability; minimizing it encourages $\hat{p}$ to match the true conditional probability $P(Y=1|\mathbf{x})$, which is why log-loss classifiers produce calibrated probabilities.
The Penalty Structure
| Prediction | True Label | Loss | Interpretation |
|---|---|---|---|
| $\hat{p} = 0.9$ | $y = 1$ | $-\log(0.9) \approx 0.105$ | Good prediction, small loss |
| $\hat{p} = 0.5$ | $y = 1$ | $-\log(0.5) \approx 0.693$ | Uncertain, moderate loss |
| $\hat{p} = 0.1$ | $y = 1$ | $-\log(0.1) \approx 2.303$ | Wrong and confident, large loss |
| $\hat{p} = 0.01$ | $y = 1$ | $-\log(0.01) \approx 4.605$ | Very wrong, severe loss |
Key insight: Log loss penalizes confident wrong predictions severely: the loss grows without bound as the predicted probability of the true class approaches zero. Predicting 0.01 for a true positive costs over 40 times as much as predicting 0.9.
If you predict p=0 for a true positive, log loss is infinite! This forces models to never assign zero probability. In practice, predictions are clipped: p = max(ε, min(1-ε, p)) with small ε ≈ 10⁻¹⁵.
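A minimal sketch of that clipping trick (the ε default below mirrors the value mentioned above):

```python
import numpy as np

def safe_log_loss(y, p_hat, eps=1e-15):
    """Log loss with predictions clipped away from exact 0 and 1."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(safe_log_loss(1, 0.0))   # ~34.5 instead of infinity
print(safe_log_loss(1, 0.9))   # ~0.105, unaffected by clipping
```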
Definition of Hinge Loss
For margin $m = y \cdot f(\mathbf{x})$ with $y \in \{-1, +1\}$:
$$L_{\text{hinge}}(m) = \max(0, 1 - m) = [1 - m]_+$$
where $[z]_+ = \max(0, z)$ is the "positive part" or ReLU function.
Properties of Hinge Loss
| Margin | Loss | Status |
|---|---|---|
| $m < 0$ | $1 - m > 1$ | Misclassified, loss grows linearly |
| $0 \leq m < 1$ | $0 < 1 - m \leq 1$ | Correct but within margin, nonzero loss |
| $m \geq 1$ | $0$ | Correct with sufficient margin, zero loss |
Connection to SVMs
Support Vector Machines minimize regularized hinge loss: $$\min_{\mathbf{w}, b} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)) + \lambda \|\mathbf{w}\|^2$$
The hinge loss creates the characteristic SVM behavior: only points near (or violating) the margin boundary affect the solution.
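As a sketch of what that objective looks like in code (not a full SVM solver; the tiny dataset, weights, and λ below are placeholders chosen for illustration):

```python
import numpy as np

def svm_objective(w, b, X, y, lam=0.1):
    """Regularized hinge loss: mean hinge loss + lam * ||w||^2."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0, 1 - margins)
    return hinge.mean() + lam * np.dot(w, w)

# Tiny made-up dataset, just to show the pieces of the objective
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([0.5, 0.2]), 0.0

print(f"Objective value: {svm_objective(w, b, X, y):.3f}")
# Only the points with margin < 1 contribute to the hinge term.
```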
Different losses induce different classifier behaviors. Understanding these differences guides method selection.
| Property | Log Loss | Hinge Loss | Exponential |
|---|---|---|---|
| Probabilistic output | Yes (calibrated) | No (scores) | No (scores) |
| Outlier robustness | Moderate | High (linear penalty) | Low (exponential penalty) |
| Sparsity in support | No | Yes (support vectors) | No |
| Differentiability | Everywhere | Not at m=1 | Everywhere |
| Effect of confident correct | Small positive loss | Zero loss | Small positive loss |
| Optimizer | Gradient descent | QP or SGD | Coordinate descent |
Need probabilities → Log loss (logistic regression). Want geometric margins → Hinge loss (SVM). Sensitive to outliers → Hinge loss is more robust. Need smooth optimization → Log loss or squared hinge. Boosting → Exponential loss.
Asymptotic Behavior
As margin $m \to \infty$ (very confident correct): hinge loss is exactly zero once $m \geq 1$, while logistic and exponential losses decay to zero (logistic roughly like $e^{-m}$), so confident correct points contribute essentially nothing.
As margin $m \to -\infty$ (very confident wrong): hinge and logistic losses grow only linearly in $|m|$, but exponential loss blows up like $e^{|m|}$, which is why exponential loss is so sensitive to outliers and mislabeled points.
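A quick numerical look at these tails (the margin values are chosen arbitrarily):

```python
import numpy as np

def hinge(m): return np.maximum(0.0, 1.0 - m)
def logistic(m): return np.log1p(np.exp(-m))
def exponential(m): return np.exp(-m)

for m in [5.0, 20.0, -5.0, -20.0]:
    print(f"m={m:6.1f}  hinge={hinge(m):12.3f}  "
          f"logistic={logistic(m):12.3f}  exponential={exponential(m):14.3e}")
# m >> 0: hinge is exactly 0, logistic and exponential decay to ~0.
# m << 0: hinge and logistic grow roughly linearly; exponential explodes.
```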
Convexity and Global Optima
All common surrogate losses (log, hinge, exponential) are convex in the margin. For linear classifiers, this means:
$$J(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n \phi(y_i \cdot \mathbf{w}^T\mathbf{x}_i)$$
is convex in $\mathbf{w}$, since $\phi$ is convex and the margin is linear in $\mathbf{w}$ (a convex function composed with an affine map stays convex). Every local minimum is therefore a global minimum, so gradient descent with a suitable step size converges to the global optimum.
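One way to sanity-check this numerically (the random data, labels, and weight vectors below are arbitrary) is to verify the chord inequality $J(t\mathbf{w}_1 + (1-t)\mathbf{w}_2) \leq t J(\mathbf{w}_1) + (1-t) J(\mathbf{w}_2)$ for the logistic objective:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # arbitrary synthetic features
y = rng.choice([-1, 1], size=50)    # arbitrary labels

def J(w):
    """Average logistic loss of a linear classifier with weights w."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

w1, w2 = rng.normal(size=3), rng.normal(size=3)
for t in [0.25, 0.5, 0.75]:
    lhs = J(t * w1 + (1 - t) * w2)
    rhs = t * J(w1) + (1 - t) * J(w2)
    print(f"t={t:.2f}: J(interpolated)={lhs:.4f} <= {rhs:.4f} -> {lhs <= rhs}")
```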
Gradient Computation
```python
import numpy as np

def logistic_loss_gradient(X, y, w):
    """
    Gradient of logistic loss for a linear classifier.
    y in {-1, +1}; w includes the bias (X has a column of 1s).
    """
    margins = y * (X @ w)
    # Gradient of log(1 + exp(-m)) w.r.t. w is -y * x * sigmoid(-m)
    probs = 1 / (1 + np.exp(margins))  # sigmoid(-m)
    return -X.T @ (y * probs) / len(y)

def hinge_loss_gradient(X, y, w):
    """Subgradient of hinge loss for a linear classifier."""
    margins = y * (X @ w)
    # Subgradient is -y * x for points with margin < 1, else 0
    mask = (margins < 1).astype(float)
    return -X.T @ (y * mask) / len(y)

# Gradient descent example
def train_logistic(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = logistic_loss_gradient(X, y, w)
        w -= lr * grad
    return w
```

In practice, we minimize loss plus regularization: J(w) = Loss(w) + λR(w). Regularization (L2, L1) prevents overfitting, and L2 in particular makes the solution unique. The loss function determines the fit to the data; regularization controls model complexity.
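For example, here is a minimal sketch of adding an L2 penalty to the training loop above (reusing logistic_loss_gradient; the λ value is an arbitrary placeholder, and for simplicity the bias weight is penalized too, which practical implementations usually avoid):

```python
def train_logistic_l2(X, y, lam=0.01, lr=0.1, epochs=100):
    """Gradient descent on mean logistic loss + lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of the penalty lam * ||w||^2 is 2 * lam * w
        grad = logistic_loss_gradient(X, y, w) + 2 * lam * w
        w -= lr * grad
    return w
```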
You've completed the binary classification foundation! You understand problem formulation, decision boundaries, probabilistic interpretation, linear classifiers, and loss functions. This prepares you perfectly for the next module: the logistic regression model.