Classification represents perhaps the most ubiquitous problem type in machine learning—the task of assigning observations to discrete categories. Where regression predicts how much, classification predicts which one.
Every day, classification systems make millions of decisions: filtering spam, flagging fraudulent transactions, screening medical images for disease, predicting which links will be clicked.
Despite its apparent simplicity—just pick a category—classification harbors deep subtleties. How do we measure the quality of categorical predictions? How do we handle imbalanced classes? When should we trust probability estimates? How do we interpret decision boundaries? This page addresses these questions rigorously.
By the end of this page, you will understand classification at a foundational level: the formal mathematical framework, the distinction between binary and multiclass settings, decision boundaries and their geometry, probability calibration, comprehensive evaluation metrics, and the unique challenges classification presents. This knowledge enables you to approach any classification task with expertise.
Classification admits several formal definitions, each illuminating different aspects of the problem.
The Set-Theoretic View:
Classification is a function mapping inputs to class labels: $$h: \mathcal{X} \rightarrow \mathcal{Y}$$
where:
- $\mathcal{X}$ is the input space (feature vectors, images, text, and so on)
- $\mathcal{Y}$ is a finite set of class labels
For binary classification: $\mathcal{Y} = \{0, 1\}$ or $\{-1, +1\}$
The goal: find $h$ that accurately predicts the class of new, unseen examples.
The Probabilistic View:
Instead of directly predicting labels, we model class probabilities: $$p(Y = k | X = x) \text{ for } k = 1, \ldots, K$$
These probabilities must satisfy $\sum_{k=1}^{K} p(Y = k | X = x) = 1$.
To make a hard prediction, we typically choose the class with highest probability: $$\hat{y} = \arg\max_{k} p(Y = k | X = x)$$
When the probabilities are the true posteriors, this rule is the Bayes optimal classifier—it minimizes the probability of misclassification under the true distribution. Any deviation from the true probabilities increases expected error.
Why the probabilistic view matters:
The Bayes optimal classifier predicts the class with highest posterior probability. It's optimal in the sense that no other classifier can have lower error rate under the same data distribution. In practice, we don't know the true posteriors, so we estimate them from data. The gap between our learned classifier and the Bayes optimal is called 'excess risk.'
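The argmax rule is easy to state in code. A minimal numpy sketch (the posterior values are illustrative, not from a trained model):

```python
import numpy as np

# Estimated class posteriors p(Y = k | x) for 3 inputs and K = 3 classes
# (illustrative numbers, not from a real model).
posteriors = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
])

# Each row must be a valid distribution over the K classes.
assert np.allclose(posteriors.sum(axis=1), 1.0)

# Bayes decision rule: predict the class with the highest posterior.
predictions = posteriors.argmax(axis=1)
print(predictions)  # [0 2 1]
```

With estimated rather than true posteriors, the same rule yields a plug-in classifier whose excess risk shrinks as the probability estimates improve.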
The Geometric View:
Classification can be understood as partitioning the input space into regions, each assigned to a class. The boundaries between regions are decision boundaries.
For a binary classifier: the decision boundary is the set of points where the prediction flips, i.e., where $p(Y=1|x) = 0.5$ or the score $f(x) = 0$.
The shape of decision boundaries characterizes classifier families:
| Component | Definition | Examples |
|---|---|---|
| Input Space (X) | Domain from which features are drawn | ℝᵈ, images, text sequences, graphs |
| Label Space (Y) | Set of possible classes | {0,1}, {cat, dog, bird}, {1,...,1000} |
| Hypothesis Class (H) | Family of classifiers considered | Linear, decision trees, neural networks |
| Loss Function (ℓ) | Penalty for misclassification | 0-1 loss, cross-entropy, hinge loss |
| Decision Boundary | Partition of input space into regions | Hyperplanes, curves, complex surfaces |
| Probability Estimates | P(Y=k|X=x) for each class | Logistic outputs, softmax scores |
The distinction between binary and multiclass classification is more than a matter of counting classes—it affects loss functions, evaluation metrics, reduction strategies, and even which algorithms apply.
Binary Classification (K = 2):
The simplest case: positive vs. negative, yes vs. no, 1 vs. 0.
Conventions vary, but typically:
- Class 1 (positive): the class of interest (spam, disease, fraud)
- Class 0 (negative): the default or background class
Decision: Predict positive if $p(Y=1|X) > \tau$, where $\tau$ is a threshold (default 0.5).
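Thresholding is a one-liner; the sketch below uses made-up probabilities to show how raising $\tau$ trades recall for precision:

```python
import numpy as np

# Predicted probabilities p(Y=1|x) for five examples (illustrative values).
p_pos = np.array([0.92, 0.48, 0.55, 0.10, 0.73])

def predict(p, tau=0.5):
    """Return 1 where p(Y=1|x) exceeds the threshold tau, else 0."""
    return (p > tau).astype(int)

print(predict(p_pos))           # default tau = 0.5 -> [1 0 1 0 1]
print(predict(p_pos, tau=0.8))  # stricter threshold -> [1 0 0 0 0]
```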
Binary classification is well-understood and forms the foundation for multiclass methods.
Multiclass Classification (K > 2):
When there are more than two mutually exclusive classes, several approaches exist:
1. Native Multiclass Methods: Some algorithms generalize naturally to K classes: decision trees, naive Bayes, k-nearest neighbors, and neural networks with a softmax output layer.
2. Reduction to Binary Classification: Other methods require decomposing multiclass into binary subproblems:
One-vs-Rest (OvR) / One-vs-All (OvA): Train K binary classifiers, each distinguishing one class from all others. Predict the class whose classifier gives highest confidence. Simple but may suffer from class imbalance in subproblems.
One-vs-One (OvO): Train $\binom{K}{2} = \frac{K(K-1)}{2}$ binary classifiers, one for each pair of classes. Predict by voting. More classifiers but each trained on smaller, balanced subsets.
Error-Correcting Output Codes (ECOC): Assign each class a binary code; train one classifier per code bit. Decode predictions using the code structure. Provides robustness through redundancy.
For small K (say, K < 10), One-vs-One often works well because each subproblem is balanced and uses only relevant data. For large K (hundreds or thousands of classes), native multiclass methods (neural networks with softmax, gradient boosted trees) are typically more practical. One-vs-Rest is a reasonable middle ground but watch for imbalance issues.
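A toy sketch of OvO voting. The helper `make_pair_clf` is a hypothetical stand-in for a trained binary model (here, a nearest-mean rule on 1-D data):

```python
from itertools import combinations
from collections import Counter

means = {0: 0.0, 1: 1.0, 2: 2.0}  # toy 1-D classes, identified by their means

def make_pair_clf(a, b):
    # Nearest-mean rule between classes a and b, standing in for a trained model.
    return lambda x: a if abs(x - means[a]) < abs(x - means[b]) else b

# One binary classifier per pair: K(K-1)/2 = 3 of them for K = 3.
clfs = {pair: make_pair_clf(*pair) for pair in combinations(means, 2)}

def ovo_predict(x, pairwise_classifiers, classes):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_classifiers[pair](x)] += 1
    return votes.most_common(1)[0][0]

print(ovo_predict(1.9, clfs, classes=[0, 1, 2]))  # -> 2
```

The same voting scheme applies unchanged when the pairwise classifiers are real trained models.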
Multilabel Classification (Multiple Labels per Instance):
Distinct from multiclass, multilabel classification allows assigning multiple labels to each instance:
Formal setup: Output is $\mathbf{y} \in \{0,1\}^K$, a binary vector indicating presence of each label.
Approaches:
- Binary relevance: train one independent binary classifier per label
- Classifier chains: link binary classifiers so each sees earlier labels as extra features, capturing label correlations
- Label powerset: treat each observed combination of labels as a single class
Multilabel introduces unique challenges: label correlations, extreme class imbalance, and evaluation complexity.
The geometry of decision boundaries reveals the inductive bias of different classifier families. Understanding this geometry builds intuition for what each model can and cannot learn.
Linear Decision Boundaries:
Linear classifiers define boundaries as hyperplanes: $$w^\top x + b = 0$$
Points with $w^\top x + b > 0$ are classified as positive; $w^\top x + b < 0$ as negative.
The weight vector $w$ is perpendicular to the decision boundary. The bias $b$ determines the boundary's distance from the origin.
Key insight: Linear classifiers can only learn linearly separable patterns—where a hyperplane can perfectly separate classes. The classic XOR problem is not linearly separable.
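A small numpy sketch makes both points concrete: an AND-like labeling of four points is linearly separable, while the XOR labeling of the same points is not (the weights below are illustrative):

```python
import numpy as np

def linear_classify(X, w, b):
    """Predict +1 on the positive side of the hyperplane w.x + b = 0, else -1."""
    return np.where(X @ w + b > 0, 1, -1)

# A hyperplane implementing "x1 + x2 > 1" (illustrative weights).
w = np.array([1.0, 1.0])
b = -1.0
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(linear_classify(X, w, b))  # [-1 -1 -1  1]: the AND pattern, separable

# The XOR labels for these points are (-1, +1, +1, -1); no choice of (w, b)
# puts both +1 points on one side of a single hyperplane and both -1 points
# on the other.
```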
Nonlinear Decision Boundaries:
Nonlinear classifiers create complex boundary shapes:
Kernel Methods: Implicitly map data to high-dimensional feature space where linear separation is possible. In the original space, this produces curved boundaries (ellipses for RBF kernel, polynomial curves for polynomial kernel).
Decision Trees: Create axis-aligned, rectangular boundaries. Each split is perpendicular to a feature axis. Ensembles of trees can approximate smooth boundaries through averaging.
Neural Networks: Each layer applies an affine transformation followed by a nonlinearity. Deep networks compose many simple transformations to create arbitrarily complex boundaries. The 'bending' of input space allows representing virtually any decision surface.
k-Nearest Neighbors: Decision boundaries are the Voronoi tessellation of training points. Boundaries are local and data-adaptive but can be noisy and computationally expensive.
| Model Family | Boundary Shape | Strengths | Limitations |
|---|---|---|---|
| Logistic Regression | Linear hyperplane | Simple, interpretable, fast | Cannot capture nonlinear patterns |
| SVM (linear kernel) | Margin-maximizing hyperplane | Good generalization, convex optimization | Still limited to linear boundaries |
| SVM (RBF kernel) | Smooth, flexible curves | Can capture complex patterns | Kernel choice and tuning critical; O(n²) memory |
| Decision Trees | Axis-aligned rectangles | Handles mixed types, interpretable | Fragmented regions, may overfit |
| Random Forests | Smoothed rectangles (averaging) | Robust, handles high dimensions | Less interpretable than single trees |
| Neural Networks | Arbitrarily complex, smooth | Universal approximation, state-of-the-art | Requires more data, harder to train |
| k-NN | Voronoi cells | Non-parametric, no training | Sensitive to k, expensive at test time |
Simple boundaries (linear) have high bias but low variance—they're stable but may underfit. Complex boundaries (deep networks) have low bias but high variance—they can overfit. Model selection involves finding the right complexity for your data size and noise level. Regularization, cross-validation, and ensemble methods help navigate this tradeoff.
Margin and Confidence:
For classifiers that output continuous scores (most do), we can define:
- Margin: the (signed) distance from a point to the decision boundary
- Confidence: the magnitude of the score, which grows with distance from the boundary
Points near the boundary have low confidence; points far away have high confidence. This geometric intuition motivates margin maximization in SVMs, rejection options that abstain on low-confidence inputs, and active learning strategies that query points near the boundary.
The choice of loss function shapes optimization, affects generalization, and encodes assumptions about the problem. Classification admits several natural loss functions.
The 0-1 Loss (Misclassification Rate):
$$\ell_{0-1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]$$
The most intuitive loss: 0 for correct, 1 for incorrect. Directly measures classification error.
Problem: The 0-1 loss is non-convex and non-differentiable—gradient-based optimization cannot minimize it directly. We need surrogate losses.
| Loss | Formula (binary, y ∈ {-1,+1}) | Properties | Associated Model |
|---|---|---|---|
| 0-1 Loss | $\mathbb{1}[y \neq \text{sign}(f(x))]$ | Non-convex, non-differentiable | The true objective; not used for training |
| Hinge Loss | $\max(0, 1 - y \cdot f(x))$ | Convex, piecewise linear; margin-based | Support Vector Machines |
| Logistic Loss | $\log(1 + e^{-y \cdot f(x)})$ | Smooth, convex; log-likelihood under logistic model | Logistic Regression |
| Exponential Loss | $e^{-y \cdot f(x)}$ | Convex, smooth; sensitive to outliers | AdaBoost |
| Cross-Entropy | $-\sum_k y_k \log \hat{p}_k$ | Standard for probabilistic multiclass | Neural networks, softmax models |
| Focal Loss | $-\alpha (1-\hat{p})^\gamma \log \hat{p}$ | Down-weights easy examples; for imbalance | Object detection, imbalanced classification |
Understanding Surrogate Losses:
Surrogate losses are differentiable upper bounds on 0-1 loss, enabling gradient-based optimization. Different surrogates induce different behaviors:
Calibration: Minimizing a proper scoring rule (like cross-entropy) results in predictions that, in expectation, match true class probabilities. This is crucial when probability estimates (not just class predictions) matter.
Use cross-entropy/logistic loss when you need calibrated probabilities (for decisions involving thresholds or expected values). Use hinge loss when hard classification matters and you want maximum margin. Avoid exponential loss with noisy labels. Consider focal loss when positive examples are rare and you want to focus learning on hard cases.
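To see the surrogate relationships numerically, here is a sketch evaluating several losses at a few margins $m = y \cdot f(x)$ (values chosen for illustration):

```python
import numpy as np

# Margins m = y * f(x) for three examples with labels in {-1, +1}.
# Positive margin = correct side of the boundary; larger = more confident.
m = np.array([2.0, 0.5, -1.0])

zero_one = (m <= 0).astype(float)      # the true objective
hinge    = np.maximum(0.0, 1.0 - m)    # SVM surrogate
logistic = np.log1p(np.exp(-m))        # logistic-regression surrogate
expo     = np.exp(-m)                  # AdaBoost surrogate

# Hinge and exponential losses upper-bound the 0-1 loss pointwise;
# note how fast the exponential loss grows for negative margins.
assert np.all(hinge >= zero_one) and np.all(expo >= zero_one)
print(hinge.tolist())  # [0.0, 0.5, 2.0]
```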
Cost-Sensitive Classification:
Often, different types of errors have different costs. Missing a cancer diagnosis (false negative) is far worse than a false alarm (false positive).
The cost matrix defines penalties:
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | 0 (correct) | $C_{FN}$ (false negative) |
| Negative | $C_{FP}$ (false positive) | 0 (correct) |
To minimize expected cost rather than error rate, we adjust the decision threshold: $$\text{Predict positive if } p(Y=1|x) > \frac{C_{FP}}{C_{FP} + C_{FN}}$$
This shifts the boundary to favor the less costly type of error.
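A sketch of the resulting decision rule, with illustrative costs where a false negative is nine times as expensive as a false positive:

```python
import numpy as np

def cost_sensitive_threshold(c_fp, c_fn):
    """Expected-cost-minimizing threshold: predict positive if p > c_fp / (c_fp + c_fn)."""
    return c_fp / (c_fp + c_fn)

# Illustrative costs: a missed positive is 9x as costly as a false alarm.
tau = cost_sensitive_threshold(c_fp=1.0, c_fn=9.0)
print(tau)  # 0.1: the classifier now fires on much weaker evidence

p = np.array([0.05, 0.2, 0.6])
print((p > tau).astype(int))  # [0 1 1]
```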
Not all classifiers produce well-calibrated probability estimates. Calibration refers to the agreement between predicted probabilities and actual frequencies of outcomes.
Formal Definition:
A classifier is perfectly calibrated if, among all instances where it predicts probability $p$, the fraction that are actually positive is exactly $p$: $$P(Y=1 | \hat{p}(X) = p) = p \quad \forall p \in [0,1]$$
In practice, we assess calibration empirically using calibration plots (reliability diagrams).
Why Calibration Matters:
Many downstream decisions consume probabilities rather than hard labels: cost-sensitive thresholds, expected-value calculations, risk scores shown to users, and combining the outputs of several models. Miscalibrated probabilities silently corrupt all of them.
Calibration Properties of Common Classifiers:
Logistic regression tends to be reasonably well calibrated because it optimizes a proper scoring rule. SVM scores are not probabilities at all and require a post-hoc mapping such as Platt scaling. Naive Bayes often produces overly extreme probabilities, and modern deep networks are frequently overconfident; temperature scaling is a common remedy.
Diagnosing Miscalibration:
Reliability Diagram: Plot actual fraction positive vs. predicted probability in bins. For perfect calibration, points lie on the diagonal.
Expected Calibration Error (ECE): Average absolute difference between predicted and actual probabilities, weighted by bin size: $$ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|$$
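A minimal implementation of ECE under these definitions (binning choices vary across libraries; equal-width bins are assumed here):

```python
import numpy as np

def ece(y_true, p_pred, n_bins=10):
    """Expected Calibration Error with equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for b in range(n_bins):
        lo, hi = edges[b], edges[b + 1]
        # Half-open bins (lo, hi]; the first bin also includes p = 0.
        mask = (p_pred <= hi) if b == 0 else (p_pred > lo) & (p_pred <= hi)
        if mask.any():
            acc = y_true[mask].mean()    # actual fraction positive in the bin
            conf = p_pred[mask].mean()   # average predicted probability
            total += mask.mean() * abs(acc - conf)
    return total

# Perfectly calibrated toy data: 80% of the p = 0.8 group is positive,
# 20% of the p = 0.2 group is positive.
p = np.array([0.8] * 5 + [0.2] * 5)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
print(ece(y, p))  # 0.0
```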
A well-calibrated model is not necessarily a good model (it could assign 50% probability to everything—perfectly calibrated but useless). Conversely, a model with good discrimination (AUC) might have poor calibration. Both properties matter: discrimination tells you the model can distinguish classes; calibration tells you the probability estimates are reliable.
Classification admits a rich set of evaluation metrics, each capturing different aspects of performance. Choosing the right metrics is critical—optimizing the wrong metric leads to models that succeed on paper but fail in deployment.
The Confusion Matrix:
For binary classification, all metrics derive from four quantities:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
| Metric | Formula | Interpretation | When to Use |
|---|---|---|---|
| Accuracy | $(TP+TN)/(TP+TN+FP+FN)$ | Fraction correct | Only if classes balanced; otherwise misleading |
| Precision | $TP/(TP+FP)$ | Among predicted positives, fraction truly positive | When false positives are costly (spam detection) |
| Recall/Sensitivity | $TP/(TP+FN)$ | Among actual positives, fraction detected | When false negatives are costly (cancer screening) |
| Specificity | $TN/(TN+FP)$ | Among actual negatives, fraction correctly identified | Complement to recall; important in medical screening |
| F1 Score | $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ | Harmonic mean of precision and recall | When you need single metric balancing both |
| F-beta Score | $\frac{(1+\beta^2) \cdot Prec \cdot Rec}{\beta^2 \cdot Prec + Rec}$ | Weighted harmonic mean (β controls weight) | When recall is β times more important than precision |
| Matthews Corr. Coef. | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ | Correlation between predictions and truth | Balanced measure; works well with imbalanced data |
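The threshold-dependent metrics in the table reduce to a few lines of arithmetic. A sketch with illustrative counts:

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Illustrative counts: 120 actual positives, 880 actual negatives.
m = binary_metrics(tp=90, fp=10, fn=30, tn=870)
print(m["precision"], m["recall"], m["accuracy"])  # 0.9 0.75 0.96
```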
Threshold-Independent Metrics:
Metrics above depend on the decision threshold. Threshold-independent metrics evaluate the ranking quality across all thresholds:
ROC Curve and AUC: The ROC curve plots the true positive rate against the false positive rate as the threshold sweeps from 1 to 0. The area under it (AUC) equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.
Precision-Recall Curve and AUPRC: The PR curve plots precision against recall across thresholds. Because it ignores true negatives, the area under it (AUPRC) is more informative than ROC AUC when positives are rare.
When classes are imbalanced (e.g., 99% negative, 1% positive), a model predicting 'always negative' achieves 99% accuracy—useless but impressive-looking. In such cases, use precision, recall, F1, or AUPRC. Accuracy is only meaningful when classes are roughly balanced.
Multiclass Metrics:
Extending binary metrics to multiclass settings:
- Macro-averaging: compute the metric per class, then take the unweighted mean, so every class counts equally
- Micro-averaging: pool TP/FP/FN counts across classes before computing the metric, so frequent classes dominate
- Weighted averaging: average per-class metrics weighted by class frequency
Choice depends on whether you care equally about all classes (macro) or care more about frequent classes (micro/weighted).
Multiclass Confusion Matrix: K×K matrix showing predictions vs. actual for all class pairs. Diagonal elements are correct predictions.
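Building the K×K matrix is straightforward; a sketch with toy labels:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """K x K matrix: rows index the actual class, columns the predicted class."""
    cm = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels for K = 3 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, k=3)
print(cm.diagonal())  # [1 2 1] correct predictions; off-diagonals are errors
```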
Real-world classification problems frequently exhibit class imbalance—where one class vastly outnumbers another. Fraud detection (0.1% fraudulent), disease screening (2% positive), click prediction (1% click-through)—imbalance is the norm, not the exception.
Why Imbalance is Problematic:
Standard training objectives weight every example equally, so a learner can achieve low average loss by siding with the majority class; accuracy becomes misleading, and the minority class, usually the one you care about, receives little signal during training.
If you need calibrated probabilities, don't resample—use class weights or post-hoc calibration. If you need high recall, adjust the threshold or use class weights favoring recall. If you're detecting rare events and interpretability matters, consider tree-based methods with balanced class weights. There's no universal solution—match the technique to your objective.
SMOTE: Synthetic Minority Over-sampling Technique
SMOTE generates synthetic minority examples by interpolating between existing ones:
1. Pick a minority-class point $x_i$
2. Find its $k$ nearest minority-class neighbors
3. Choose one neighbor $x_j$ at random
4. Create $x_{\text{new}} = x_i + \lambda (x_j - x_i)$ with $\lambda \sim \text{Uniform}(0, 1)$
Variants: Borderline-SMOTE (focus on boundary samples), ADASYN (adaptive synthesis based on density).
Caution: SMOTE can create unrealistic samples in high dimensions or when class regions are not convex. It doesn't create new information—just interpolated approximations.
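The interpolation at SMOTE's core can be sketched in a few lines of numpy (a simplified version; production implementations handle neighbor search and sampling more carefully):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(X_min, k=2, n_new=4):
    """Minimal SMOTE sketch: interpolate between a minority point and one
    of its k nearest minority-class neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from point i to every minority point (index 0 is itself).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sample(X_min)
# Synthetic points lie on segments between minority points, so here they
# stay inside the unit square spanned by the originals.
print(X_new.shape)  # (4, 2)
```

Note how the construction only ever produces points on line segments between existing minority samples, which is exactly why it cannot create new information.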
We've traversed the classification landscape comprehensively—formal definitions, binary and multiclass settings, decision boundaries, loss functions, probability calibration, evaluation metrics, and class imbalance. Let's synthesize the key insights.
Connection to Other Problem Types:
Classification connects deeply to other ML problem types:
- Regression: most classifiers fit real-valued scores or probabilities under the hood, then threshold them
- Ranking: threshold-independent metrics such as AUC evaluate a classifier's scores as a ranking
- Clustering: the unsupervised counterpart, grouping instances without labels
What's Next:
Having covered regression (continuous targets) and classification (discrete categories), we turn to clustering—discovering structure in unlabeled data. You'll see how the absence of ground truth labels fundamentally changes the problem: from predicting to discovering, from supervision to patterns.
You now possess a comprehensive understanding of classification problems in machine learning. From formal definitions through practical challenges like imbalance, you have the conceptual framework to approach any classification task with rigor and expertise.