Linear classifiers form the foundation of classification theory and practice. Despite their simplicity, they are surprisingly effective in many real-world applications and serve as the building blocks for more complex methods.
A linear classifier makes predictions based on a linear combination of features: $$f(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x} + b) = \text{sign}\left(\sum_{j=1}^d w_j x_j + b\right)$$
By the end of this page, you will understand the mathematical form and geometric interpretation of linear classifiers, the perceptron algorithm as the simplest linear learner, connections between linear classifiers and hyperplane geometry, and when linear classifiers succeed or fail.
The Linear Model
A linear classifier is characterized by a weight vector $\mathbf{w} \in \mathbb{R}^d$ and a bias (intercept) term $b \in \mathbb{R}$.
The decision function (also called discriminant function) is: $$g(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
The prediction rule is: $$\hat{y} = \begin{cases} 1 & \text{if } g(\mathbf{x}) \geq 0 \\ 0 & \text{if } g(\mathbf{x}) < 0 \end{cases}$$
Homogeneous Form
We can absorb the bias into the weight vector by augmenting features: $$\tilde{\mathbf{x}} = [1, x_1, x_2, \ldots, x_d]^T, \quad \tilde{\mathbf{w}} = [b, w_1, w_2, \ldots, w_d]^T$$
Then: $g(\mathbf{x}) = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$
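As a quick check, here is a minimal numpy sketch (the numbers are illustrative, not from the text) confirming that the augmented form $\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$ produces exactly the same scores as $\mathbf{w}^T\mathbf{x} + b$:

```python
import numpy as np

# Illustrative values: two points in R^2 and arbitrary parameters.
X = np.array([[1.5, -0.3],
              [0.2,  2.1]])
w = np.array([0.4, -1.2])
b = 0.7

# Augment: prepend a constant-1 feature and absorb b as the first weight.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # rows are x~ = [1, x1, x2]
w_aug = np.concatenate([[b], w])                   # w~ = [b, w1, w2]

assert np.allclose(X @ w + b, X_aug @ w_aug)       # g(x) = w~ . x~ for every row
print(X_aug @ w_aug)                               # decision scores; their sign gives the label
```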
| Property | Value/Description |
|---|---|
| Number of parameters | $d + 1$ (weights + bias) |
| Decision boundary | Hyperplane in $\mathbb{R}^d$ |
| Prediction complexity | $O(d)$ per instance |
| Training complexity | Varies by algorithm |
| VC dimension | $d + 1$ |
When features are standardized (zero mean, unit variance), weight magnitudes directly indicate feature importance: a larger $|w_j|$ means feature $j$ has a stronger influence on the decision. This interpretability is a major advantage of linear classifiers.
The perceptron (Rosenblatt, 1958) is the simplest linear classifier, learned through iterative error correction.
Algorithm (using $y \in \{-1, +1\}$)
1. Initialize $\mathbf{w} = \mathbf{0}$ and $b = 0$.
2. For each training example $(\mathbf{x}_i, y_i)$: if $y_i(\mathbf{w}^T\mathbf{x}_i + b) \leq 0$ (misclassified), update $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_i \mathbf{x}_i$ and $b \leftarrow b + \eta\, y_i$.
3. Repeat passes over the data until a full pass produces no updates (or a maximum number of epochs is reached).

The learning rate $\eta$ controls the update magnitude (often $\eta = 1$).
```python
import numpy as np

class Perceptron:
    """Simple perceptron classifier."""

    def __init__(self, learning_rate=1.0, max_epochs=1000):
        self.lr = learning_rate
        self.max_epochs = max_epochs

    def fit(self, X, y):
        """Train perceptron on data."""
        n_samples, n_features = X.shape

        # Convert labels to {-1, +1}
        y_signed = 2 * y - 1

        # Initialize weights
        self.w = np.zeros(n_features)
        self.b = 0.0

        for epoch in range(self.max_epochs):
            errors = 0
            for i in range(n_samples):
                if y_signed[i] * (self.w @ X[i] + self.b) <= 0:
                    # Misclassified - update
                    self.w += self.lr * y_signed[i] * X[i]
                    self.b += self.lr * y_signed[i]
                    errors += 1
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break
        return self

    def predict(self, X):
        """Make predictions."""
        scores = X @ self.w + self.b
        return (scores >= 0).astype(int)


# Example
np.random.seed(42)
X = np.vstack([np.random.randn(50, 2) + [2, 2],
               np.random.randn(50, 2) + [-2, -2]])
y = np.array([1]*50 + [0]*50)

model = Perceptron()
model.fit(X, y)
accuracy = (model.predict(X) == y).mean()
print(f"Training accuracy: {accuracy:.2%}")
```

If the data is linearly separable with margin $\gamma$ and all points have norm $\leq R$, the perceptron converges after at most $(R/\gamma)^2$ updates. However, if the data is not linearly separable, the perceptron never converges; it oscillates indefinitely.
Definition
A dataset is linearly separable if there exists a hyperplane that perfectly separates the classes: $$\exists \mathbf{w}, b : y_i(\mathbf{w}^T\mathbf{x}_i + b) > 0 \quad \forall i$$
Testing for Linear Separability
Linear separability can be checked via linear programming: find $\mathbf{w}, b$ such that $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$ for all $i$. If feasible, data is separable.
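A minimal sketch of that feasibility check, assuming scipy is available (the helper name is ours, not from the text): a linear program with a zero objective is feasible exactly when a separating $(\mathbf{w}, b)$ exists.

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Return True iff some (w, b) satisfies y_i (w.x_i + b) >= 1 for all i.

    X: (n, d) feature matrix; y: labels in {0, 1}.
    """
    y = np.asarray(y)
    n, d = X.shape
    y_signed = 2 * y - 1                                    # map {0, 1} -> {-1, +1}
    # Variables z = [w_1, ..., w_d, b]; the constraint y_i (w.x_i + b) >= 1
    # is rewritten for linprog as  -y_i * [x_i, 1] . z <= -1.
    A_ub = -y_signed[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success                                      # feasible <=> separable
```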
| Dimension | Probability of Separability | Implication |
|---|---|---|
| $d < n$ | Typically low | More samples than features → likely overlap |
| $d \approx n$ | Moderate | May or may not be separable |
| $d \gg n$ | Very high | High-dim space easily separates finite points |
The XOR Problem
The classic example of non-linear-separability: points at $(0,0), (1,1)$ labeled negative and $(0,1), (1,0)$ labeled positive. No single line can separate them—demonstrating a fundamental limitation of linear classifiers.
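To see this concretely, the Perceptron class defined in the code above never converges on these four points, and no linear classifier can get more than 3 of the 4 right (this snippet assumes that class is in scope):

```python
import numpy as np

# The four XOR points with labels as described above.
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_xor = np.array([0, 0, 1, 1])

model = Perceptron(max_epochs=100)            # never prints "Converged ..."
model.fit(X_xor, y_xor)
acc = (model.predict(X_xor) == y_xor).mean()
print(f"XOR training accuracy: {acc:.2%}")    # at most 75%, never 100%
```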
Solutions for Non-Separable Data
Common remedies are to minimize a surrogate loss that tolerates some misclassifications (as logistic regression and soft-margin SVMs do) or to transform the features so the classes become separable in the new space (adding the product feature $x_1 x_2$ makes XOR separable, for example).
The Weight Vector as Normal
The weight vector $\mathbf{w}$ is perpendicular (normal) to the decision boundary. This means that $\mathbf{w}$ points toward the region predicted as class 1, that moving parallel to the boundary leaves $g(\mathbf{x})$ unchanged, and that rescaling $\mathbf{w}$ and $b$ by the same positive constant leaves the boundary and all predictions unchanged.
Distance to the Boundary
For any point $\mathbf{x}$, the signed distance to the decision boundary is: $$\text{distance} = \frac{g(\mathbf{x})}{\|\mathbf{w}\|} = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$$
Positive for class 1 side, negative for class 0 side.
In 2D, the boundary is the line $w_1 x_1 + w_2 x_2 + b = 0$. Solving for $x_2$ (assuming $w_2 \neq 0$), the line has slope $-w_1/w_2$ and intercept $-b/w_2$. Points on either side of the line receive different predictions, with $\mathbf{w}$ pointing toward the class 1 side.
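A small sketch with made-up numbers tying these formulas together:

```python
import numpy as np

# Hypothetical 2D classifier: boundary 2*x1 + 1*x2 - 4 = 0.
w = np.array([2.0, 1.0])
b = -4.0

slope = -w[0] / w[1]                               # -2.0
intercept = -b / w[1]                              #  4.0

x = np.array([3.0, 1.0])                           # a test point
signed_distance = (w @ x + b) / np.linalg.norm(w)  # ~ +1.34, so x lies on the class 1 side
print(slope, intercept, signed_distance)
```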
Several important algorithms produce linear classifiers, differing in their training objectives and resulting properties.
| Algorithm | Loss Function | Output | Key Property |
|---|---|---|---|
| Perceptron | Hinge (margin 0) | Hard labels | Online, simple updates |
| Logistic Regression | Log loss | Probabilities | Calibrated, convex |
| Linear SVM | Hinge (margin 1) | Hard labels | Maximum margin |
| LDA | Generative/Gaussian | Probabilities | Optimal for Gaussian classes |
| Linear Regression + Threshold | Squared loss | Hard labels | Simple but suboptimal |
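A quick comparison sketch, assuming scikit-learn is available (not required by the text): each estimator below learns a boundary of the same linear form $\mathbf{w}^T\mathbf{x} + b = 0$, even though their training objectives differ as summarized in the table.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron as SkPerceptron   # renamed to avoid clashing with the class above
from sklearn.svm import LinearSVC

rng = np.random.default_rng(42)
X = np.vstack([rng.normal( 2.0, 1.0, (50, 2)),     # class 1 cluster
               rng.normal(-2.0, 1.0, (50, 2))])    # class 0 cluster
y = np.array([1] * 50 + [0] * 50)

for clf in (SkPerceptron(), LogisticRegression(), LinearSVC()):
    clf.fit(X, y)
    # Every fitted model exposes the same linear parameters (w, b).
    print(f"{type(clf).__name__:>18}: w = {clf.coef_.ravel()}, b = {clf.intercept_}")
```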
Which to Choose?
As the table suggests: logistic regression is the usual default when calibrated probabilities are needed, a linear SVM when a maximum-margin boundary matters, LDA when the Gaussian class-conditional assumption is plausible, and the perceptron mainly for online settings or as a conceptual baseline; thresholded linear regression is rarely the best choice.
Despite their simplicity, linear classifiers excel in several important settings:
- High-dimensional data such as text or genomics, where classes are often (nearly) linearly separable
- Problems where the interpretability of the learned weights matters
- Large-scale applications, thanks to $O(d)$ prediction cost and simple training
- As fast baselines and as building blocks for more complex methods
In high dimensions, randomly generated point clouds become linearly separable with high probability. This 'blessing of dimensionality' explains why linear classifiers work well for text and genomics data with thousands of features.
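A rough illustration of this effect, reusing the is_linearly_separable helper sketched earlier on this page (random labels, so exact outcomes vary): with 50 points, random labelings are almost never separable in 5 dimensions but almost always separable in 100.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                             # number of points
for d in (5, 25, 100):                             # increasing dimension
    X = rng.standard_normal((n, d))
    y = rng.integers(0, 2, size=n)                 # random labels in {0, 1}
    print(d, is_linearly_separable(X, y))          # True becomes likely once d is comparable to n
```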
You now understand linear classifiers thoroughly. This prepares you for the final piece: loss functions. Understanding how different losses shape classifier behavior completes our foundation for logistic regression.