Linear classifiers—also known as halfspaces, perceptrons, or linear threshold functions—are the most fundamental and widely studied hypothesis class in machine learning. Their VC dimension is known exactly: VCdim(halfspaces in $\mathbb{R}^d$) = d + 1.
This result is more than a curiosity: it yields exact sample complexity bounds for linear classifiers and serves as the foundational worked example for all of learning theory.
In this page, we'll prove this result rigorously, develop geometric intuition, and explore its implications for practical machine learning.
By the end of this page, you will understand the complete proof that VCdim(halfspaces) = d+1, visualize the geometry behind linear shattering, see how this extends to homogeneous classifiers, and appreciate the connection between linear algebra and learning theory.
The Hypothesis Class of Halfspaces:
A halfspace in $\mathbb{R}^d$ is the set of points on one side of a hyperplane. The associated classifier assigns $+1$ to points in the halfspace and $-1$ to points outside.
Formal Definition: $$\mathcal{H}_{d} = \{h_{\mathbf{w},b} : \mathbf{w} \in \mathbb{R}^d,\, b \in \mathbb{R}\}$$
where $$h_{\mathbf{w},b}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b) = \begin{cases} +1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ -1 & \text{if } \mathbf{w}^\top \mathbf{x} + b < 0 \end{cases}$$
Geometric Interpretation: the decision boundary is the hyperplane $\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}$; the vector $\mathbf{w}$ is its normal (pointing toward the $+1$ side), and the bias $b$ controls its offset from the origin.
We can convert inhomogeneous classifiers (with bias b) to homogeneous ones (no bias) by lifting: augment each x ∈ ℝ^d to [x; 1] ∈ ℝ^(d+1) and use w' = [w; b]. This means classifiers in ℝ^d with bias are equivalent to homogeneous classifiers in ℝ^(d+1). We'll use this trick in the proofs.
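As a quick numerical check of this equivalence (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w = rng.standard_normal(d)      # weights of an inhomogeneous classifier
b = rng.standard_normal()       # bias
x = rng.standard_normal(d)      # a test point

inhomog = np.sign(w @ x + b)            # sign(w^T x + b) in R^d

x_lift = np.append(x, 1.0)              # [x; 1] in R^(d+1)
w_lift = np.append(w, b)                # w' = [w; b] in R^(d+1)
homog = np.sign(w_lift @ x_lift)        # homogeneous classifier, no bias

assert inhomog == homog
print("lifted prediction matches:", inhomog == homog)
```

The dot product with the lifted point reproduces $\mathbf{w}^\top \mathbf{x} + b$ exactly, so the two classifiers agree on every input.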
Homogeneous Linear Classifiers:
The homogeneous halfspaces pass through the origin: $$\mathcal{H}_{d}^{\text{hom}} = \{h_{\mathbf{w}} : \mathbf{w} \in \mathbb{R}^d\}$$ where $h_{\mathbf{w}}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x})$.
Key Result: $\text{VCdim}(\mathcal{H}_{d}^{\text{hom}}) = d$ (homogeneous)
Key Result: $\text{VCdim}(\mathcal{H}_{d}) = d + 1$ (with bias)
The difference is exactly one—the bias parameter adds one to the VC dimension. This is consistent with the lifting trick: classifiers in $\mathbb{R}^d$ with bias correspond to homogeneous classifiers in $\mathbb{R}^{d+1}$.
To prove the lower bound, we must exhibit a set of $d+1$ points in $\mathbb{R}^d$ that halfspaces can shatter.
Construction: The Standard Simplex Points
Consider the following $d+1$ points in $\mathbb{R}^d$: $$\mathbf{x}_0 = \mathbf{0} = (0, 0, \ldots, 0)$$ $$\mathbf{x}_1 = \mathbf{e}_1 = (1, 0, \ldots, 0)$$ $$\mathbf{x}_2 = \mathbf{e}_2 = (0, 1, \ldots, 0)$$ $$\vdots$$ $$\mathbf{x}_d = \mathbf{e}_d = (0, 0, \ldots, 1)$$
This is the origin plus the $d$ standard basis vectors—the vertices of a standard simplex.
These points are affinely independent: no single $(d-1)$-dimensional affine subspace (hyperplane) contains all $d+1$ of them. This ensures maximum flexibility for the separating hyperplane. The origin plays a special role: it allows us to use the bias term to control its label independently.
Proof that these points can be shattered:
We need to show that for any labeling $(y_0, y_1, \ldots, y_d) \in \{-1, +1\}^{d+1}$, there exists $(\mathbf{w}, b)$ such that $\text{sign}(\mathbf{w}^\top \mathbf{x}_i + b) = y_i$ for all $i$.
Step 1: Express the constraints
For the origin: $\text{sign}(b) = y_0$. For each basis vector: $\text{sign}(w_i + b) = y_i$ for $i = 1, \ldots, d$.
Step 2: Construct the solution
Set $b = y_0$ (so the origin gets the right label).
For each $i \in \{1, \ldots, d\}$: set $w_i = 2y_i$. Then $w_i + b = 2y_i + y_0$, and since $|2y_i| = 2 > 1 = |y_0|$, the sign of $w_i + b$ equals the sign of $y_i$.
Step 3: Verify
The weight vector $\mathbf{w} = (w_1, \ldots, w_d)$ and bias $b$ correctly classify all $d+1$ points according to any desired labeling.
$\therefore$ VCdim($\mathcal{H}_d$) ≥ d + 1
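The construction can be checked exhaustively for small $d$. A sketch using the concrete choice $b = y_0$ and $w_i = 2y_i$ (one valid choice among many):

```python
import numpy as np
from itertools import product

d = 3
# Points: origin plus the d standard basis vectors (as rows)
X = np.vstack([np.zeros(d), np.eye(d)])

for labels in product([-1, 1], repeat=d + 1):
    y = np.array(labels)
    b = y[0]            # sign(b) = y_0, so the origin gets its label
    w = 2 * y[1:]       # sign(w_i + b) = sign(2*y_i + y_0) = y_i
    pred = np.sign(X @ w + b)
    assert np.array_equal(pred, y)
print("All", 2 ** (d + 1), "labelings achieved")
```

Every one of the $2^{d+1}$ labelings is realized, confirming the lower bound for $d = 3$.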
Construct $(w_1, w_2, b)$ for each of the 8 labelings:

| Labeling $(\mathbf{0}, \mathbf{e}_1, \mathbf{e}_2)$ | $(w_1, w_2, b)$ | Verification |
|---------------------|-------------|---------------|
| (+, +, +) | (0, 0, 1) | sign(1)=+, sign(1)=+, sign(1)=+ |
| (+, +, −) | (0, −2, 1) | sign(1)=+, sign(1)=+, sign(−1)=− |
| (+, −, +) | (−2, 0, 1) | sign(1)=+, sign(−1)=−, sign(1)=+ |
| (+, −, −) | (−2, −2, 1) | sign(1)=+, sign(−1)=−, sign(−1)=− |
| (−, +, +) | (2, 2, −1) | sign(−1)=−, sign(1)=+, sign(1)=+ |
| (−, +, −) | (2, 0, −1) | sign(−1)=−, sign(1)=+, sign(−1)=− |
| (−, −, +) | (0, 2, −1) | sign(−1)=−, sign(−1)=−, sign(1)=+ |
| (−, −, −) | (0, 0, −1) | sign(−1)=−, sign(−1)=−, sign(−1)=− |

Each row shows that the specified hyperplane correctly classifies all three points according to the given labeling. This exhaustively verifies that these 3 points can be shattered.
To prove the upper bound, we must show that no set of $d+2$ points in $\mathbb{R}^d$ can be shattered by halfspaces. This requires a universal argument.
Strategy: Radon's Theorem
We use a fundamental result from convex geometry.
Theorem (Radon, 1921): Any set of d+2 points in ℝ^d can be partitioned into two disjoint sets whose convex hulls intersect.
Geometrically: given d+2 points in d dimensions, we can always split them into two groups such that 'something from the first group' and 'something from the second group' occupy the same location in space (as convex combinations).
Proof of Radon's Theorem (Sketch):
Given $d+2$ points $\mathbf{x}_1, \ldots, \mathbf{x}_{d+2}$ in $\mathbb{R}^d$, consider the augmented vectors $(\mathbf{x}_1, 1), \ldots, (\mathbf{x}_{d+2}, 1)$: these are $d+2$ vectors in the $(d+1)$-dimensional space $\mathbb{R}^{d+1}$, so they must be linearly dependent.
Hence there exist scalars $\alpha_1, \ldots, \alpha_{d+2}$, not all zero, such that: $$\sum_{i=1}^{d+2} \alpha_i \mathbf{x}_i = \mathbf{0} \quad \text{and} \quad \sum_{i=1}^{d+2} \alpha_i = 0$$
Define $I^+ = \{i : \alpha_i > 0\}$ and $I^- = \{i : \alpha_i < 0\}$. Let $\lambda = \sum_{i \in I^+} \alpha_i = -\sum_{i \in I^-} \alpha_i$ (these are equal by the sum-to-zero condition).
The point $\mathbf{p} = \frac{1}{\lambda} \sum_{i \in I^+} \alpha_i \mathbf{x}_i = -\frac{1}{\lambda} \sum_{i \in I^-} \alpha_i \mathbf{x}_i$ is in both convex hulls.
$\therefore$ The sets $\{\mathbf{x}_i : i \in I^+\}$ and $\{\mathbf{x}_i : i \in I^-\}$ have intersecting convex hulls.
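A Radon partition can be computed directly from a null-space vector of the augmented point matrix, mirroring the proof sketch above (an illustrative sketch using SciPy's `null_space`):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
d = 2
P = rng.standard_normal((d + 2, d))   # d+2 random points in R^d (rows)

# Null space of [P^T; 1^T] gives alpha with sum_i alpha_i x_i = 0
# and sum_i alpha_i = 0, exactly the coefficients in the proof.
A = np.vstack([P.T, np.ones(d + 2)])  # shape (d+1, d+2): nontrivial null space
alpha = null_space(A)[:, 0]

I_pos, I_neg = alpha > 0, alpha < 0
lam = alpha[I_pos].sum()

# The Radon point: the same convex combination from both sides
p_from_pos = (alpha[I_pos] @ P[I_pos]) / lam
p_from_neg = -(alpha[I_neg] @ P[I_neg]) / lam
assert np.allclose(p_from_pos, p_from_neg)
print("Radon point:", p_from_pos)
```

Because the matrix $A$ has more columns than rows, a nonzero null-space vector always exists, so the construction never fails.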
Completing the Upper Bound Proof:
Now we show no $d+2$ points can be shattered.
Let $S = \{\mathbf{x}_1, \ldots, \mathbf{x}_{d+2}\}$ be any set of $d+2$ points in $\mathbb{R}^d$.
By Radon's theorem, $S$ can be partitioned into $S = A \cup B$ where $A \cap B = \emptyset$ and $\text{conv}(A) \cap \text{conv}(B) \neq \emptyset$.
Consider the labeling that assigns $+1$ to all points in $A$ and $-1$ to all points in $B$.
Claim: No hyperplane can achieve this labeling.
Proof of Claim: Suppose for contradiction that hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ separates $A$ from $B$.
Then for all $\mathbf{a} \in A$: $\mathbf{w}^\top \mathbf{a} + b \geq 0$, and for all $\mathbf{u} \in B$: $\mathbf{w}^\top \mathbf{u} + b < 0$.
Let $\mathbf{p} \in \text{conv}(A) \cap \text{conv}(B)$.
Since $\mathbf{p} \in \text{conv}(A)$: $\mathbf{p} = \sum_{i} \lambda_i \mathbf{a}_i$ for $\lambda_i \geq 0$, $\sum \lambda_i = 1$. Then $\mathbf{w}^\top \mathbf{p} + b = \sum_i \lambda_i (\mathbf{w}^\top \mathbf{a}_i + b) \geq 0$
Since $\mathbf{p} \in \text{conv}(B)$: Similarly $\mathbf{w}^\top \mathbf{p} + b < 0$.
Contradiction! The same point cannot have both $\mathbf{w}^\top \mathbf{p} + b \geq 0$ and $\mathbf{w}^\top \mathbf{p} + b < 0$.
$\therefore$ No hyperplane separates $A$ from $B$, so this labeling cannot be achieved.
$\therefore$ VCdim($\mathcal{H}_d$) ≤ d + 1
Combining the lower and upper bounds:
VCdim(halfspaces in ℝ^d) = d + 1
This is one of the most important results in learning theory, providing exact sample complexity bounds for linear classifiers.
Let's develop geometric intuition for the VC dimension of linear classifiers across different dimensions.
Dimension d = 1 (Thresholds on the Line):
VCdim = 2
For any two points $x_1 < x_2$, all 4 labelings $(-,-)$, $(-,+)$, $(+,-)$, $(+,+)$ can be achieved by moving the threshold and flipping its orientation.
Why 3 points fail: For $x_1 < x_2 < x_3$, the labeling $(+,-,+)$ requires the middle point to be negative while both endpoints are positive—impossible with a single threshold.
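This impossibility can also be illustrated numerically; the grid search below is not a proof, only a sanity check that no sampled threshold classifier produces $(+,-,+)$:

```python
import numpy as np

# Three collinear points and the alternating target labeling
x = np.array([0.0, 1.0, 2.0])
target = np.array([1, -1, 1])

# Brute-force over a grid of (w, b) for sign(w*x + b)
found = False
for w in np.linspace(-5, 5, 201):
    for b in np.linspace(-5, 5, 201):
        if np.array_equal(np.sign(w * x + b), target):
            found = True
print("(+, -, +) achievable on the grid:", found)
```

The loop never finds a match: with a single threshold, the middle point can never carry the opposite sign of both endpoints.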
Dimension d = 2 (Lines in the Plane):
VCdim = 3
For any three points forming a triangle, all $2^3 = 8$ labelings can be achieved: a line can place all three points on one side, isolate any single vertex, or isolate any pair of vertices.
Why 4 points fail: Consider 4 points in convex position (forming a quadrilateral). The alternating labeling $(+,-,+,-)$ cannot be achieved—a line cannot separate alternating corners. If one point is inside the triangle of the other three, similar issues arise.
Dimension d = 3 (3D Space):
VCdim = 4
For four points forming a tetrahedron, all $2^4 = 16$ labelings can be achieved: a plane can separate any subset of the vertices from its complement.
Why 5 points fail: By Radon's theorem, 5 points in $\mathbb{R}^3$ can be partitioned into two sets with intersecting convex hulls. The corresponding labeling is impossible.
The General Pattern:
In $d$ dimensions, we can shatter $d+1$ points (a simplex) but not $d+2$ points. The VC dimension tracks the 'dimensional capacity' of the hypothesis class.
A d-dimensional simplex has d+1 vertices. This is the maximum number of affinely independent points in ℝ^d—and exactly the number of points that linear classifiers can shatter. The simplex structure allows each vertex to be 'seen' independently by hyperplanes, enabling all 2^(d+1) labelings.
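Affine independence of the simplex vertices reduces to a rank condition on the difference vectors, which is easy to check (illustrative sketch):

```python
import numpy as np

d = 4
X = np.vstack([np.zeros(d), np.eye(d)])   # origin + standard basis: d+1 points

# Affine independence <=> the d difference vectors x_i - x_0 span R^d
diffs = X[1:] - X[0]
rank = np.linalg.matrix_rank(diffs)
print("rank of difference vectors:", rank, "== d:", rank == d)
```

Here the differences are just the identity matrix, so the rank is $d$ and no hyperplane contains all $d+1$ vertices.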
There's an elegant alternative proof using linear algebra that provides additional insight.
Setup: Homogeneous Case
Consider homogeneous halfspaces (through the origin) in $\mathbb{R}^d$: $$\mathcal{H}^{\text{hom}} = \{h_\mathbf{w} : \mathbf{w} \in \mathbb{R}^d\}, \quad h_\mathbf{w}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x})$$
Theorem: VCdim($\mathcal{H}^{\text{hom}}$) = $d$
Lower Bound (VCdim ≥ d):
Take the standard basis: $S = \{\mathbf{e}_1, \ldots, \mathbf{e}_d\}$.
For any labeling $(y_1, \ldots, y_d)$, set $\mathbf{w} = (y_1, \ldots, y_d)$.
Then $\mathbf{w}^\top \mathbf{e}_i = y_i$, so $\text{sign}(\mathbf{w}^\top \mathbf{e}_i) = \text{sign}(y_i) = y_i$ ✓
$\therefore$ The basis is shattered.
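This shattering argument is easy to verify exhaustively for a small dimension (a sketch for $d = 3$):

```python
import numpy as np
from itertools import product

d = 3
E = np.eye(d)                       # rows are e_1, ..., e_d
for labels in product([-1, 1], repeat=d):
    y = np.array(labels, dtype=float)
    w = y                           # choose w = (y_1, ..., y_d)
    assert np.array_equal(np.sign(E @ w), y)
print("all", 2 ** d, "labelings of the basis achieved")
```

Since $E\mathbf{w} = \mathbf{w}$, the prediction on $\mathbf{e}_i$ is exactly $\text{sign}(y_i) = y_i$.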
Upper Bound (VCdim ≤ d):
Take any $d+1$ points $\mathbf{x}_1, \ldots, \mathbf{x}_{d+1}$ in $\mathbb{R}^d$.
These vectors are linearly dependent (more vectors than dimensions), so: $$\sum_{i=1}^{d+1} \alpha_i \mathbf{x}_i = \mathbf{0}$$ for some $\alpha_i$ not all zero.
Partition: $I^+ = \{i : \alpha_i > 0\}$, $I^- = \{i : \alpha_i < 0\}$. At least one of these is non-empty, since the $\alpha_i$ are not all zero (unlike the Radon argument, there is no sum-to-zero condition here, so one side may be empty).
Consider the labeling: $y_i = +1$ for $i \in I^+$, $y_i = -1$ for $i \in I^-$, and $y_i$ arbitrary (say $+1$) when $\alpha_i = 0$.
Claim: No $\mathbf{w}$ achieves this labeling.
Proof: $$0 = \mathbf{w}^\top \mathbf{0} = \mathbf{w}^\top \left(\sum_i \alpha_i \mathbf{x}_i\right) = \sum_i \alpha_i (\mathbf{w}^\top \mathbf{x}_i)$$
If $\text{sign}(\mathbf{w}^\top \mathbf{x}_i) = \text{sign}(\alpha_i)$ for every $i$ with $\alpha_i \neq 0$, then each term $\alpha_i (\mathbf{w}^\top \mathbf{x}_i)$ with $\alpha_i \neq 0$ is strictly positive, the remaining terms are zero, and at least one term is positive, so the sum is strictly positive and cannot equal zero.
Contradiction!
$\therefore$ This labeling cannot be achieved, so no set of $d+1$ points is shattered.
This proof directly connects linear dependence to shattering impossibility. The d+1 vectors in d dimensions must be dependent, and this dependence creates a labeling that violates the weighted sum constraint. It's a purely algebraic argument, complementing the geometric Radon theorem approach.
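The algebraic argument can be observed numerically: for random points, a null-space vector $\alpha$ of $X^\top$ gives a labeling that no weight vector realizes (illustrative sketch):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
d = 3
X = rng.standard_normal((d + 1, d))    # d+1 vectors in R^d: linearly dependent

# Nonzero alpha with sum_i alpha_i x_i = 0 (a null-space vector of X^T)
alpha = null_space(X.T)[:, 0]

# For every w, the weighted margins cancel, so the labeling
# y_i = sign(alpha_i) is never realized coordinate-wise.
for _ in range(1000):
    w = rng.standard_normal(d)
    margins = X @ w
    assert abs(alpha @ margins) < 1e-8          # the dependence constraint
    assert not np.array_equal(np.sign(margins), np.sign(alpha))
print("labeling sign(alpha) never achieved over 1000 random w")
```

A random search is of course weaker than the proof, but it makes the mechanism visible: the dependence forces $\sum_i \alpha_i (\mathbf{w}^\top \mathbf{x}_i) = 0$ for every $\mathbf{w}$.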
The VC dimension analysis extends to many variations of linear classifiers.
Polynomial Feature Mapping:
Consider degree-$k$ polynomial classifiers on $\mathbb{R}^d$: $$h(x) = \text{sign}\left(\sum_{|\alpha| \leq k} w_\alpha x^\alpha\right)$$
where $\alpha = (\alpha_1, \ldots, \alpha_d)$ is a multi-index and $x^\alpha = x_1^{\alpha_1} \cdots x_d^{\alpha_d}$.
This is a linear classifier in feature space!
Map $\mathbf{x} \mapsto \phi(\mathbf{x})$ where $\phi(\mathbf{x})$ contains all monomials up to degree $k$.
The dimension of this feature space is: $$D = \binom{d+k}{k}$$
VC Dimension: $$\text{VCdim}(\text{degree-}k \text{ poly in } \mathbb{R}^d) = \binom{d+k}{k}$$
Examples: for $k = 1$ this gives $\binom{d+1}{1} = d+1$, recovering the halfspace result; quadratic classifiers in the plane ($d=2$, $k=2$) have VC dimension $\binom{4}{2} = 6$; cubics in the plane ($d=2$, $k=3$) have $\binom{5}{3} = 10$.
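The feature-space dimension formula above can be evaluated directly for a few cases:

```python
from math import comb

# D = C(d+k, k): number of monomials of degree <= k on R^d
for d, k in [(2, 1), (2, 2), (2, 3), (10, 2)]:
    print(f"d={d}, k={k}: C(d+k, k) = {comb(d + k, k)}")
```

For $k = 1$ this prints $d+1$, matching the halfspace result; the count grows polynomially in $d$ for fixed degree $k$.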
The VC dimension of linear classifiers has profound implications for machine learning practice.
| Dimension d | VC Dimension | Samples for ε=0.1, δ=0.05 |
|---|---|---|
| 10 | 11 | ~1,800 |
| 100 | 101 | ~15,600 |
| 1,000 | 1,001 | ~153,300 |
| 10,000 | 10,001 | ~1,530,400 |
NLP and image models often have millions of dimensions (word embeddings, pixel features). Classical VC theory suggests astronomical sample requirements. Yet modern deep learning works! This is the 'over-parameterized regime' where additional mechanisms (implicit regularization, architecture, optimization dynamics) enable generalization beyond VC predictions.
Let's implement experiments to verify the VC dimension of linear classifiers empirically.
```python
import numpy as np
from itertools import product
from scipy.optimize import linprog


def can_linear_achieve_labeling(X: np.ndarray, y: np.ndarray) -> bool:
    """
    Check if there exists a linear classifier that achieves the given labeling.
    Uses linear programming to test linear separability.

    Args:
        X: Data matrix of shape (n_samples, n_features)
        y: Labels in {-1, +1} of shape (n_samples,)

    Returns:
        True if the labeling is achievable by a linear classifier
    """
    n, d = X.shape
    # Augment X with bias term: [X | 1]
    X_aug = np.hstack([X, np.ones((n, 1))])
    # We want: y_i * (w · x_i + b) >= 1 for all i
    # In standard form: -y_i * (w · x_i + b) <= -1
    # Variables: [w_1, ..., w_d, b]
    A_ub = -y.reshape(-1, 1) * X_aug
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)  # Feasibility test, no objective
    # Variables must be free: linprog's default bounds are (0, None)
    result = linprog(c, A_ub=A_ub, b_ub=b_ub,
                     bounds=[(None, None)] * (d + 1), method='highs')
    return result.success


def verify_shattering(X: np.ndarray) -> tuple[bool, int]:
    """
    Verify if the given points can be shattered by linear classifiers.

    Returns:
        (is_shattered, num_achievable) - whether all labelings work
        and how many do
    """
    n = len(X)
    total_labelings = 2 ** n
    achievable = 0
    for labeling in product([-1, 1], repeat=n):
        y = np.array(labeling)
        if can_linear_achieve_labeling(X, y):
            achievable += 1
    return achievable == total_labelings, achievable


def estimate_vc_dimension(d: int, num_trials: int = 10) -> int:
    """
    Empirically estimate the VC dimension of linear classifiers in R^d.

    Strategy:
    - Try to shatter sets of increasing size
    - Use random points (in general position with probability 1)

    Returns:
        Estimated VC dimension
    """
    vc_estimate = 0
    for m in range(1, d + 4):  # Try up to d+3 points
        shattered_any = False
        for _ in range(num_trials):
            X = np.random.randn(m, d)
            is_shattered, _ = verify_shattering(X)
            if is_shattered:
                shattered_any = True
                break
        if shattered_any:
            vc_estimate = m
        else:
            break
    return vc_estimate


def verify_simplex_shattering(d: int) -> bool:
    """
    Verify that the standard simplex points can be shattered.
    Points: origin, e_1, e_2, ..., e_d
    """
    X = np.vstack([np.zeros(d), np.eye(d)])  # d+1 points
    is_shattered, achievable = verify_shattering(X)
    print(f"Dimension {d}: Simplex ({d+1} points) shattered = {is_shattered}")
    print(f"  Achievable labelings: {achievable} / {2**(d+1)}")
    return is_shattered


def verify_upper_bound(d: int, num_trials: int = 20) -> bool:
    """
    Verify that no set of d+2 points can be shattered.
    """
    m = d + 2
    for _ in range(num_trials):
        X = np.random.randn(m, d)
        is_shattered, _ = verify_shattering(X)
        if is_shattered:
            print(f"Counterexample found! {m} points shattered in {d}D!")
            return False
    print(f"Dimension {d}: No {d+2}-point set shattered "
          f"(checked {num_trials} random sets)")
    return True


if __name__ == "__main__":
    print("=" * 50)
    print("Verifying VC Dimension of Linear Classifiers")
    print("=" * 50)
    for d in [1, 2, 3]:
        print(f"--- Dimension d = {d} ---")
        verify_simplex_shattering(d)   # Lower bound: simplex can be shattered
        verify_upper_bound(d)          # Upper bound: d+2 points cannot be
        print(f"=> VCdim(halfspaces in R^{d}) = {d + 1}")
```

The VC dimension of linear classifiers is the foundational example for all of learning theory.
What's Next:
With the VC dimension of linear classifiers established, we're ready to derive generalization bounds—precise mathematical statements about how test error relates to training error, sample size, and VC dimension. These bounds are the payoff of all our theoretical development.
You've mastered the complete proof that VCdim(halfspaces) = d+1, including both geometric (Radon) and algebraic (linear dependence) approaches. You understand implications for sample complexity and the role of margin. Next, we derive generalization bounds using VC dimension.