Every classifier, regardless of its underlying algorithm, ultimately performs the same geometric operation: it partitions feature space into regions assigned to different classes. The surfaces separating these regions are called decision boundaries.
Understanding decision boundaries provides deep insight into classifier behavior—their expressiveness, their limitations, and their failure modes. A classifier's decision boundary reveals its implicit assumptions about the problem structure.
By the end of this page, you will understand how classifiers partition feature space, visualize decision boundaries for various hypothesis classes, analyze the tradeoff between boundary complexity and generalization, and recognize how boundary shape connects to model capacity.
Formal Definition
For a classifier $f: \mathbb{R}^d \rightarrow \{0, 1\}$, the decision regions are:
$$\mathcal{R}_0 = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}) = 0\} \qquad \mathcal{R}_1 = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}) = 1\}$$
These regions partition the entire feature space: $\mathcal{R}_0 \cup \mathcal{R}_1 = \mathbb{R}^d$ and $\mathcal{R}_0 \cap \mathcal{R}_1 = \emptyset$.
The decision boundary is the frontier between these regions—the set of points where the classifier's confidence is exactly balanced:
$$\mathcal{B} = \{\mathbf{x} \in \mathbb{R}^d : P(Y=1 \mid \mathbf{x}) = 0.5\}$$
For hard classifiers, this is where small perturbations can flip the prediction.
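As an illustrative sketch (using scikit-learn's `LogisticRegression` and a synthetic dataset, which are my choices rather than anything prescribed above), the snippet below checks that a probabilistic classifier's predicted regions agree with thresholding its estimated $P(Y=1 \mid \mathbf{x})$ at 0.5, so the boundary $\mathcal{B}$ is exactly the 0.5 level set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a probabilistic classifier on a small synthetic 2D problem
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # estimated P(Y=1 | x)
labels = clf.predict(X)              # region membership: 0 -> R_0, 1 -> R_1

# Thresholding the probability at 0.5 reproduces the predicted regions;
# the decision boundary B is the 0.5 level set of P(Y=1 | x).
agreement = np.mean(labels == (proba > 0.5).astype(int))
print(f"Fraction of points where region = [P(Y=1|x) > 0.5]: {agreement:.3f}")
```

The table below summarizes how boundary shape, expressiveness, and parameter count vary across common model families.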
| Model | Boundary Shape | Expressiveness | Parameters |
|---|---|---|---|
| Linear classifier | Hyperplane | Low | $O(d)$ |
| Quadratic classifier | Quadric surface | Medium | $O(d^2)$ |
| Polynomial (degree p) | Polynomial surface | High | $O(d^p)$ |
| Decision tree | Axis-aligned rectangles | Medium-High | Varies with depth |
| k-NN | Voronoi-like cells | Very high | No explicit params |
| RBF kernel SVM | Arbitrary smooth | Very high | Support vectors |
| Neural network | Arbitrary (layers) | Unlimited | Weights in each layer |
The simplest and most fundamental decision boundaries are hyperplanes—flat surfaces that divide space into two half-spaces.
Mathematical Form
A linear classifier computes: $$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x} + b)$$
where $\mathbf{w} \in \mathbb{R}^d$ is the weight vector (normal to the hyperplane) and $b \in \mathbb{R}$ is the bias (offset from origin).
The decision boundary is the hyperplane: $$\mathcal{B} = \{\mathbf{x} : \mathbf{w}^T\mathbf{x} + b = 0\}$$
Geometric Interpretation
The weight vector $\mathbf{w}$ is perpendicular to the hyperplane, so it points in the direction along which the decision function changes fastest. The signed distance from a point $\mathbf{x}$ to the boundary is $(\mathbf{w}^T\mathbf{x} + b)/\|\mathbf{w}\|$, and the bias shifts the hyperplane away from the origin by $|b|/\|\mathbf{w}\|$.
Linear boundaries seem restrictive but offer major advantages: interpretability (weights indicate feature importance), computational efficiency ($O(d)$ prediction cost), convex optimization (a unique global optimum for standard losses), theoretical guarantees (generalization bounds), and surprisingly strong performance in practice, since high-dimensional data are often close to linearly separable.
```python
import numpy as np

def linear_decision_function(X, w, b):
    """Compute linear decision function values."""
    return X @ w + b

def predict_linear(X, w, b):
    """Make predictions using linear classifier."""
    return (linear_decision_function(X, w, b) >= 0).astype(int)

def distance_to_boundary(X, w, b):
    """Compute signed distance to decision boundary."""
    return linear_decision_function(X, w, b) / np.linalg.norm(w)

# Example: 2D linear classifier
w = np.array([1.0, 2.0])  # Normal vector
b = -1.0                  # Bias

# Test points
X_test = np.array([[0, 0], [1, 1], [-1, 0], [2, 0]])
predictions = predict_linear(X_test, w, b)
distances = distance_to_boundary(X_test, w, b)

print("Predictions:", predictions)
print("Distances to boundary:", distances.round(3))
```
When classes are not linearly separable, we need nonlinear boundaries. Several strategies achieve this:
1. Feature Expansion
Transform inputs to a higher-dimensional space where classes become linearly separable: $$\phi(\mathbf{x}) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2, \ldots]$$
A linear boundary in $\phi$-space becomes nonlinear in original space.
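A minimal sketch of this idea, assuming scikit-learn's `PolynomialFeatures` and `LogisticRegression` (my choice of tools, not mandated by the text): expanding to quadratic features lets a linear model separate a circular class pattern.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# Linear boundary in the original space vs. linear boundary in phi-space
linear = LogisticRegression().fit(X, y)
expanded = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LogisticRegression()).fit(X, y)

print("Linear in original space:   ", round(linear.score(X, y), 3))
print("Linear in quadratic phi-space:", round(expanded.score(X, y), 3))
```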
2. Kernel Methods
Implicitly compute in high-dimensional spaces via kernel functions: $$K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$$
RBF kernels enable arbitrarily complex boundaries.
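As a sketch (using scikit-learn's `SVC`, an assumption rather than something the text prescribes), the kernel trick achieves the same effect without constructing $\phi(\mathbf{x})$ explicitly:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# Linear kernel: boundary restricted to a hyperplane in the original space.
# RBF kernel: implicit map to a very high-dimensional feature space.
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(f"{kernel:>6} kernel accuracy: {clf.score(X, y):.3f}, "
          f"support vectors: {clf.n_support_.sum()}")
```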
3. Nonparametric Methods
k-NN and kernel density methods create boundaries that adapt to local data density.
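A brief sketch of how the neighborhood size controls boundary roughness, assuming scikit-learn's `KNeighborsClassifier` (my choice of library): small k traces very local, jagged cells, while larger k smooths the boundary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k = 1 carves a cell around every training point (irregular boundary);
# larger k averages over neighbors and yields a smoother boundary.
for k in (1, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>2}  train acc={knn.score(X_tr, y_tr):.3f}  "
          f"test acc={knn.score(X_te, y_te):.3f}")
```

With k = 1 the training accuracy is essentially perfect by construction, which previews the overfitting discussion below.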
4. Neural Networks
Compose multiple nonlinear transformations, with each layer reshaping the decision surface.
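A compact sketch with scikit-learn's `MLPClassifier` (again an assumption; any neural-network library would do): each hidden layer applies a nonlinear transformation, so the final linear read-out corresponds to a curved boundary in the input space.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# A small two-hidden-layer ReLU network; the hidden layers reshape the
# space so that the last layer's hyperplane bends in input coordinates.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X, y)
print("Training accuracy:", round(mlp.score(X, y), 3))
```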
A decision boundary that perfectly separates training data often memorizes noise. The boundary becomes highly irregular, wrapping around individual points. Such boundaries generalize poorly—small changes in input lead to arbitrary class flips.
Not all correct classifications are equal. The margin quantifies confidence—how far a point lies from the decision boundary.
Geometric Margin
For a point $(\mathbf{x}_i, y_i)$ and linear classifier $(\mathbf{w}, b)$: $$\gamma_i = y_i \cdot \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|}$$
Functional Margin
The unnormalized version: $$\hat{\gamma}_i = y_i \cdot (\mathbf{w}^T\mathbf{x}_i + b)$$
Scale-dependent but easier to compute.
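Both margin definitions are straightforward to compute for a linear classifier. Here is a sketch with made-up weights and labels in $\{-1, +1\}$ (the convention the formulas assume); the values are purely illustrative.

```python
import numpy as np

def functional_margin(X, y, w, b):
    """Unnormalized margin: y * (w^T x + b), with y in {-1, +1}."""
    return y * (X @ w + b)

def geometric_margin(X, y, w, b):
    """Signed distance to the boundary, positive when correctly classified."""
    return functional_margin(X, y, w, b) / np.linalg.norm(w)

# Hypothetical classifier and labeled points (illustrative values only)
w, b = np.array([1.0, 2.0]), -1.0
X = np.array([[2.0, 1.0], [0.5, 0.2], [0.0, 0.0]])
y = np.array([+1, +1, -1])

print("Functional margins:", functional_margin(X, y, w, b))
print("Geometric margins: ", geometric_margin(X, y, w, b).round(3))
```

Rescaling $(\mathbf{w}, b)$ by a constant changes the functional margins but leaves the geometric margins unchanged, which is why the geometric version is the scale-invariant notion of confidence.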
| Margin Value | Interpretation | Implications |
|---|---|---|
| Large positive | Confident correct | Robust to perturbations |
| Small positive | Barely correct | Vulnerable to noise |
| Zero | On boundary | Maximum uncertainty |
| Small negative | Barely wrong | Close to being correct |
| Large negative | Confidently wrong | Severe misclassification |
Why Margin Matters
Support Vector Machines explicitly maximize the minimum margin, guided by the intuition that points far from the boundary are more robust. This leads to better generalization.
The margin distribution across the training data reveals classifier behavior, as in the sketch below.
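One possible way to inspect that distribution, assuming a fitted scikit-learn `LinearSVC` (any linear model exposing `decision_function` would work the same way):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Geometric margins: signed distance to the hyperplane, positive if correct
y_pm = 2 * y - 1                                    # map {0, 1} -> {-1, +1}
margins = y_pm * clf.decision_function(X) / np.linalg.norm(clf.coef_)

print("Min margin:    ", margins.min().round(3))
print("Median margin: ", np.median(margins).round(3))
print("Fraction misclassified (negative margin):", (margins < 0).mean())
```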
Visualization is essential for understanding classifier behavior. For 2D data, we can directly plot decision boundaries.
The Meshgrid Approach
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

def plot_decision_boundary(model, X, y, resolution=200):
    """Plot decision boundary and training data."""
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, resolution),
        np.linspace(y_min, y_max, resolution)
    )

    # Predict on grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot
    plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
    plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0')
    plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1')
    plt.legend()
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

# Generate nonlinearly separable data
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Fit logistic regression
model = LogisticRegression()
model.fit(X, y)

plt.figure(figsize=(8, 6))
plot_decision_boundary(model, X, y)
plt.title('Logistic Regression Decision Boundary')
plt.show()
```
For $d > 2$ dimensions, visualization requires dimensionality reduction (PCA, t-SNE) or examining 2D slices. Be cautious: projections can hide or distort the true boundary structure. A boundary that looks complex in a 2D projection may be simple in the original space.
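One possible sketch of the projection approach, assuming scikit-learn's `PCA` and the built-in breast cancer dataset (both my choices): fit the classifier in the full feature space, then plot its predictions over a grid in the top two principal components. The caveat above applies; the picture is only a slice through the true boundary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# High-dimensional data: fit the classifier in the full feature space
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target
clf = LogisticRegression(max_iter=5000).fit(X, y)

# Project to the top two principal components for plotting
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)

# Grid in PCA space, mapped back to the original space for prediction.
# These grid points lie on the 2D PCA plane, so this is a slice of the boundary.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200))
grid_full = pca.inverse_transform(np.c_[xx.ravel(), yy.ravel()])
Z = clf.predict(grid_full).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap='RdBu', s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Boundary of full-dimensional model, viewed in PCA space')
plt.show()
```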
The complexity of a decision boundary directly impacts generalization. This connection is formalized through VC dimension and related concepts.
VC Dimension
The Vapnik-Chervonenkis dimension measures a hypothesis class's capacity to shatter (perfectly classify all labelings of) point sets.
Generalization Bound (Simplified)
With high probability: $$\text{True Error} \leq \text{Training Error} + O\left(\sqrt{\frac{\text{VC}}{n}}\right)$$
Implications:
- Simpler boundaries (lower VC dimension) come with tighter generalization guarantees for a fixed amount of data.
- Richer hypothesis classes need more training examples before the bound becomes meaningful.
- Zero training error alone guarantees nothing; the capacity term must also be controlled.
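To make the bound concrete, here is a small back-of-the-envelope calculation using the fact that a linear classifier in $\mathbb{R}^d$ has VC dimension $d + 1$; the constant hidden in the $O(\cdot)$ is ignored, so treat the numbers as orders of magnitude only.

```python
import math

def complexity_term(vc_dim, n):
    """The sqrt(VC / n) term from the simplified generalization bound."""
    return math.sqrt(vc_dim / n)

# Linear classifier in R^d has VC dimension d + 1
for d, n in [(10, 100), (10, 10_000), (1_000, 10_000)]:
    gap = complexity_term(d + 1, n)
    print(f"d={d:>5}, n={n:>6}: complexity term ~ {gap:.3f}")
```

More data shrinks the term, while a larger VC dimension requires proportionally more data to reach the same guarantee.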
You now understand classification through a geometric lens. This perspective unifies diverse algorithms: they all seek effective decision boundaries. Next, we explore probabilistic classification—interpreting classifier outputs as calibrated probabilities.