Every classifier, regardless of its underlying algorithm, ultimately performs the same geometric operation: it partitions feature space into regions assigned to different classes. The surfaces separating these regions are called decision boundaries.
Understanding decision boundaries provides deep insight into classifier behavior—their expressiveness, their limitations, and their failure modes. A classifier's decision boundary reveals its implicit assumptions about the problem structure.
By the end of this page, you will understand how classifiers partition feature space, visualize decision boundaries for various hypothesis classes, analyze the tradeoff between boundary complexity and generalization, and recognize how boundary shape connects to model capacity.
Formal Definition
For a classifier $f: \mathbb{R}^d \rightarrow \{0, 1\}$, the decision regions are:
$$\mathcal{R}_0 = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}) = 0\} \qquad \mathcal{R}_1 = \{\mathbf{x} \in \mathbb{R}^d : f(\mathbf{x}) = 1\}$$
These regions partition the entire feature space: $\mathcal{R}_0 \cup \mathcal{R}_1 = \mathbb{R}^d$ and $\mathcal{R}_0 \cap \mathcal{R}_1 = \emptyset$.
The decision boundary is the frontier between these regions—the set of points where the classifier's confidence is exactly balanced:
$$\mathcal{B} = \{\mathbf{x} \in \mathbb{R}^d : P(Y=1 \mid \mathbf{x}) = 0.5\}$$
For hard classifiers, this is where small perturbations can flip the prediction.
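As an illustrative sketch (using scikit-learn's `LogisticRegression` and a synthetic dataset, which are my choices rather than anything prescribed above), the snippet below checks that a probabilistic classifier's predicted regions agree with thresholding its estimated $P(Y=1 \mid \mathbf{x})$ at 0.5, so the boundary $\mathcal{B}$ is exactly the 0.5 level set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a probabilistic classifier on a small synthetic 2D problem
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]   # estimated P(Y=1 | x)
labels = clf.predict(X)              # region membership: 0 -> R_0, 1 -> R_1

# Thresholding the probability at 0.5 reproduces the predicted regions;
# the decision boundary B is the 0.5 level set of P(Y=1 | x).
agreement = np.mean(labels == (proba > 0.5).astype(int))
print(f"Fraction of points where region = [P(Y=1|x) > 0.5]: {agreement:.3f}")
```

The table below summarizes how boundary shape, expressiveness, and parameter count vary across common model families.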
| Model | Boundary Shape | Expressiveness | Parameters |
|---|---|---|---|
| Linear classifier | Hyperplane | Low | $O(d)$ |
| Quadratic classifier | Quadric surface | Medium | $O(d^2)$ |
| Polynomial (degree p) | Polynomial surface | High | $O(d^p)$ |
| Decision tree | Axis-aligned rectangles | Medium-High | Varies with depth |
| k-NN | Voronoi-like cells | Very high | No explicit params |
| RBF kernel SVM | Arbitrary smooth | Very high | Support vectors |
| Neural network | Arbitrary (layers) | Unlimited | Weights in each layer |
The simplest and most fundamental decision boundaries are hyperplanes—flat surfaces that divide space into two half-spaces.
Mathematical Form
A linear classifier computes: $$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x} + b)$$
where $\mathbf{w} \in \mathbb{R}^d$ is the weight vector (normal to the hyperplane) and $b \in \mathbb{R}$ is the bias (offset from origin).
The decision boundary is the hyperplane: $$\mathcal{B} = \{\mathbf{x} : \mathbf{w}^T\mathbf{x} + b = 0\}$$
Geometric Interpretation
The weight vector $\mathbf{w}$ is perpendicular to the hyperplane, so it points in the direction along which the decision function changes fastest. The signed distance from a point $\mathbf{x}$ to the boundary is $(\mathbf{w}^T\mathbf{x} + b)/\|\mathbf{w}\|$, and the bias shifts the hyperplane away from the origin by $|b|/\|\mathbf{w}\|$.
Linear boundaries seem restrictive but offer major advantages: interpretability (weights indicate feature importance), computational efficiency ($O(d)$ prediction cost), convex optimization (a unique global optimum for standard losses), theoretical guarantees (generalization bounds), and surprisingly strong performance in practice, since high-dimensional data are often close to linearly separable.
```python
import numpy as np

def linear_decision_function(X, w, b):
    """Compute linear decision function values."""
    return X @ w + b

def predict_linear(X, w, b):
    """Make predictions using linear classifier."""
    return (linear_decision_function(X, w, b) >= 0).astype(int)

def distance_to_boundary(X, w, b):
    """Compute signed distance to decision boundary."""
    return linear_decision_function(X, w, b) / np.linalg.norm(w)

# Example: 2D linear classifier
w = np.array([1.0, 2.0])  # Normal vector
b = -1.0                  # Bias

# Test points
X_test = np.array([[0, 0], [1, 1], [-1, 0], [2, 0]])
predictions = predict_linear(X_test, w, b)
distances = distance_to_boundary(X_test, w, b)

print("Predictions:", predictions)
print("Distances to boundary:", distances.round(3))
```
When classes are not linearly separable, we need nonlinear boundaries. Several strategies achieve this:
1. Feature Expansion
Transform inputs to a higher-dimensional space where classes become linearly separable: $$\phi(\mathbf{x}) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2, \ldots]$$
A linear boundary in $\phi$-space becomes nonlinear in original space.
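A minimal sketch of this idea, assuming scikit-learn's `PolynomialFeatures` and `LogisticRegression` (my choice of tools, not mandated by the text): expanding to quadratic features lets a linear model separate a circular class pattern.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# Linear boundary in the original space vs. linear boundary in phi-space
linear = LogisticRegression().fit(X, y)
expanded = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LogisticRegression()).fit(X, y)

print("Linear in original space:   ", round(linear.score(X, y), 3))
print("Linear in quadratic phi-space:", round(expanded.score(X, y), 3))
```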
2. Kernel Methods
Implicitly compute in high-dimensional spaces via kernel functions: $$K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$$
RBF kernels enable arbitrarily complex boundaries.
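As a sketch (using scikit-learn's `SVC`, an assumption rather than something the text prescribes), the kernel trick achieves the same effect without constructing $\phi(\mathbf{x})$ explicitly:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# Linear kernel: boundary restricted to a hyperplane in the original space.
# RBF kernel: implicit map to a very high-dimensional feature space.
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, gamma="scale").fit(X, y)
    print(f"{kernel:>6} kernel accuracy: {clf.score(X, y):.3f}, "
          f"support vectors: {clf.n_support_.sum()}")
```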
3. Nonparametric Methods
k-NN and kernel density methods create boundaries that adapt to local data density.
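A brief sketch of how the neighborhood size controls boundary roughness, assuming scikit-learn's `KNeighborsClassifier` (my choice of library): small k traces very local, jagged cells, while larger k smooths the boundary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# k = 1 carves a cell around every training point (irregular boundary);
# larger k averages over neighbors and yields a smoother boundary.
for k in (1, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:>2}  train acc={knn.score(X_tr, y_tr):.3f}  "
          f"test acc={knn.score(X_te, y_te):.3f}")
```

With k = 1 the training accuracy is essentially perfect by construction, which previews the overfitting discussion below.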
4. Neural Networks
Compose multiple nonlinear transformations, with each layer reshaping the decision surface.
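A compact sketch with scikit-learn's `MLPClassifier` (again an assumption; any neural-network library would do): each hidden layer applies a nonlinear transformation, so the final linear read-out corresponds to a curved boundary in the input space.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# A small two-hidden-layer ReLU network; the hidden layers reshape the
# space so that the last layer's hyperplane bends in input coordinates.
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X, y)
print("Training accuracy:", round(mlp.score(X, y), 3))
```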
A decision boundary that perfectly separates training data often memorizes noise. The boundary becomes highly irregular, wrapping around individual points. Such boundaries generalize poorly—small changes in input lead to arbitrary class flips.
Not all correct classifications are equal. The margin quantifies confidence—how far a point lies from the decision boundary.
Geometric Margin
For a point $(\mathbf{x}_i, y_i)$ and linear classifier $(\mathbf{w}, b)$: $$\gamma_i = y_i \cdot \frac{\mathbf{w}^T\mathbf{x}_i + b}{\|\mathbf{w}\|}$$
Functional Margin
The unnormalized version: $$\hat{\gamma}_i = y_i \cdot (\mathbf{w}^T\mathbf{x}_i + b)$$
Scale-dependent but easier to compute.
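Both margin definitions are straightforward to compute for a linear classifier. Here is a sketch with made-up weights and labels in $\{-1, +1\}$ (the convention the formulas assume); the values are purely illustrative.

```python
import numpy as np

def functional_margin(X, y, w, b):
    """Unnormalized margin: y * (w^T x + b), with y in {-1, +1}."""
    return y * (X @ w + b)

def geometric_margin(X, y, w, b):
    """Signed distance to the boundary, positive when correctly classified."""
    return functional_margin(X, y, w, b) / np.linalg.norm(w)

# Hypothetical classifier and labeled points (illustrative values only)
w, b = np.array([1.0, 2.0]), -1.0
X = np.array([[2.0, 1.0], [0.5, 0.2], [0.0, 0.0]])
y = np.array([+1, +1, -1])

print("Functional margins:", functional_margin(X, y, w, b))
print("Geometric margins: ", geometric_margin(X, y, w, b).round(3))
```

Rescaling $(\mathbf{w}, b)$ by a constant changes the functional margins but leaves the geometric margins unchanged, which is why the geometric version is the scale-invariant notion of confidence.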
| Margin Value | Interpretation | Implications |
|---|---|---|
| Large positive | Confident correct | Robust to perturbations |
| Small positive | Barely correct | Vulnerable to noise |
| Zero | On boundary | Maximum uncertainty |
| Small negative | Barely wrong | Close to being correct |
| Large negative | Confidently wrong | Severe misclassification |
Why Margin Matters
Support Vector Machines explicitly maximize the minimum margin, guided by the intuition that points far from the boundary are more robust. This leads to better generalization.
The margin distribution across the training data reveals classifier behavior, as in the sketch below.
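One possible way to inspect that distribution, assuming a fitted scikit-learn `LinearSVC` (any linear model exposing `decision_function` would work the same way):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Geometric margins: signed distance to the hyperplane, positive if correct
y_pm = 2 * y - 1                                    # map {0, 1} -> {-1, +1}
margins = y_pm * clf.decision_function(X) / np.linalg.norm(clf.coef_)

print("Min margin:    ", margins.min().round(3))
print("Median margin: ", np.median(margins).round(3))
print("Fraction misclassified (negative margin):", (margins < 0).mean())
```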
Visualization is essential for understanding classifier behavior. For 2D data, we can directly plot decision boundaries.
The Meshgrid Approach
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_moons

def plot_decision_boundary(model, X, y, resolution=200):
    """Plot decision boundary and training data."""
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, resolution),
        np.linspace(y_min, y_max, resolution)
    )

    # Predict on grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot
    plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
    plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0')
    plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1')
    plt.legend()
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

# Generate nonlinearly separable data
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Fit logistic regression
model = LogisticRegression()
model.fit(X, y)

plt.figure(figsize=(8, 6))
plot_decision_boundary(model, X, y)
plt.title('Logistic Regression Decision Boundary')
plt.show()
```
For $d > 2$ dimensions, visualization requires dimensionality reduction (PCA, t-SNE) or examining 2D slices. Be cautious: projections can hide or distort the true boundary structure. A boundary that looks complex in a 2D projection may be simple in the original space.
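One possible sketch of the projection approach, assuming scikit-learn's `PCA` and the built-in breast cancer dataset (both my choices): fit the classifier in the full feature space, then plot its predictions over a grid in the top two principal components. The caveat above applies; the picture is only a slice through the true boundary.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# High-dimensional data: fit the classifier in the full feature space
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target
clf = LogisticRegression(max_iter=5000).fit(X, y)

# Project to the top two principal components for plotting
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)

# Grid in PCA space, mapped back to the original space for prediction.
# These grid points lie on the 2D PCA plane, so this is a slice of the boundary.
xx, yy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200))
grid_full = pca.inverse_transform(np.c_[xx.ravel(), yy.ravel()])
Z = clf.predict(grid_full).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdBu')
plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap='RdBu', s=10)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Boundary of full-dimensional model, viewed in PCA space')
plt.show()
```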
The complexity of a decision boundary directly impacts generalization. This connection is formalized through VC dimension and related concepts.
VC Dimension
The Vapnik-Chervonenkis dimension measures a hypothesis class's capacity to shatter (perfectly classify all labelings of) point sets.
Generalization Bound (Simplified)
With high probability: $$\text{True Error} \leq \text{Training Error} + O\left(\sqrt{\frac{\text{VC}}{n}}\right)$$
Implications:
- Simpler boundaries (lower VC dimension) come with tighter generalization guarantees for a fixed amount of data.
- Richer hypothesis classes need more training examples before the bound becomes meaningful.
- Zero training error alone guarantees nothing; the capacity term must also be controlled.
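To make the bound concrete, here is a small back-of-the-envelope calculation using the fact that a linear classifier in $\mathbb{R}^d$ has VC dimension $d + 1$; the constant hidden in the $O(\cdot)$ is ignored, so treat the numbers as orders of magnitude only.

```python
import math

def complexity_term(vc_dim, n):
    """The sqrt(VC / n) term from the simplified generalization bound."""
    return math.sqrt(vc_dim / n)

# Linear classifier in R^d has VC dimension d + 1
for d, n in [(10, 100), (10, 10_000), (1_000, 10_000)]:
    gap = complexity_term(d + 1, n)
    print(f"d={d:>5}, n={n:>6}: complexity term ~ {gap:.3f}")
```

More data shrinks the term, while a larger VC dimension requires proportionally more data to reach the same guarantee.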
You now understand classification through a geometric lens. This perspective unifies diverse algorithms: they all seek effective decision boundaries. Next, we explore probabilistic classification—interpreting classifier outputs as calibrated probabilities.