Linear classifiers—also known as halfspaces, perceptrons, or linear threshold functions—are the most fundamental and widely studied hypothesis class in machine learning. Their VC dimension is known exactly: VCdim(halfspaces in $\mathbb{R}^d$) = d + 1.
This result is more than a curiosity: it yields exact sample complexity bounds for linear classifiers and serves as the foundational worked example for all of learning theory.
In this page, we'll prove this result rigorously, develop geometric intuition, and explore its implications for practical machine learning.
By the end of this page, you will understand the complete proof that VCdim(halfspaces) = d+1, visualize the geometry behind linear shattering, see how this extends to homogeneous classifiers, and appreciate the connection between linear algebra and learning theory.
The Hypothesis Class of Halfspaces:
A halfspace in $\mathbb{R}^d$ is the set of points on one side of a hyperplane. The associated classifier assigns $+1$ to points in the halfspace and $-1$ to points outside.
Formal Definition: $$\mathcal{H}_{d} = \{h_{\mathbf{w},b} : \mathbf{w} \in \mathbb{R}^d,\, b \in \mathbb{R}\}$$
where $$h_{\mathbf{w},b}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b) = \begin{cases} +1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ -1 & \text{if } \mathbf{w}^\top \mathbf{x} + b < 0 \end{cases}$$
Geometric Interpretation: the decision boundary is the hyperplane $\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} + b = 0\}$; the vector $\mathbf{w}$ is its normal (pointing toward the $+1$ side), and the bias $b$ controls its offset from the origin.
We can convert inhomogeneous classifiers (with bias b) to homogeneous ones (no bias) by lifting: augment each x ∈ ℝ^d to [x; 1] ∈ ℝ^(d+1) and use w' = [w; b]. This means classifiers in ℝ^d with bias are equivalent to homogeneous classifiers in ℝ^(d+1). We'll use this trick in the proofs.
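As a quick numerical check of this equivalence (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w = rng.standard_normal(d)      # weights of an inhomogeneous classifier
b = rng.standard_normal()       # bias
x = rng.standard_normal(d)      # a test point

inhomog = np.sign(w @ x + b)            # sign(w^T x + b) in R^d

x_lift = np.append(x, 1.0)              # [x; 1] in R^(d+1)
w_lift = np.append(w, b)                # w' = [w; b] in R^(d+1)
homog = np.sign(w_lift @ x_lift)        # homogeneous classifier, no bias

assert inhomog == homog
print("lifted prediction matches:", inhomog == homog)
```

The dot product with the lifted point reproduces $\mathbf{w}^\top \mathbf{x} + b$ exactly, so the two classifiers agree on every input.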
Homogeneous Linear Classifiers:
The homogeneous halfspaces pass through the origin: $$\mathcal{H}_{d}^{\text{hom}} = \{h_{\mathbf{w}} : \mathbf{w} \in \mathbb{R}^d\}$$ where $h_{\mathbf{w}}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x})$.
Key Result: $\text{VCdim}(\mathcal{H}_{d}^{\text{hom}}) = d$ (homogeneous)
Key Result: $\text{VCdim}(\mathcal{H}_{d}) = d + 1$ (with bias)
The difference is exactly one—the bias parameter adds one to the VC dimension. This is consistent with the lifting trick: classifiers in $\mathbb{R}^d$ with bias correspond to homogeneous classifiers in $\mathbb{R}^{d+1}$.
To prove the lower bound, we must exhibit a set of $d+1$ points in $\mathbb{R}^d$ that halfspaces can shatter.
Construction: The Standard Simplex Points
Consider the following $d+1$ points in $\mathbb{R}^d$: $$\mathbf{x}_0 = \mathbf{0} = (0, 0, \ldots, 0)$$ $$\mathbf{x}_1 = \mathbf{e}_1 = (1, 0, \ldots, 0)$$ $$\mathbf{x}_2 = \mathbf{e}_2 = (0, 1, \ldots, 0)$$ $$\vdots$$ $$\mathbf{x}_d = \mathbf{e}_d = (0, 0, \ldots, 1)$$
This is the origin plus the $d$ standard basis vectors—the vertices of a standard simplex.
These points are affinely independent: no single $(d-1)$-dimensional affine subspace (hyperplane) contains all $d+1$ of them. This ensures maximum flexibility for the separating hyperplane. The origin plays a special role: it allows us to use the bias term to control its label independently.
Proof that these points can be shattered:
We need to show that for any labeling $(y_0, y_1, \ldots, y_d) \in \{-1, +1\}^{d+1}$, there exists $(\mathbf{w}, b)$ such that $\text{sign}(\mathbf{w}^\top \mathbf{x}_i + b) = y_i$ for all $i$.
Step 1: Express the constraints
For the origin: $\text{sign}(b) = y_0$. For each basis vector: $\text{sign}(w_i + b) = y_i$ for $i = 1, \ldots, d$.
Step 2: Construct the solution
Set $b = y_0$ (so the origin gets the right label).
For each $i \in \{1, \ldots, d\}$: set $w_i = 2y_i$. Then $w_i + b = 2y_i + y_0$, and since $|2y_i| = 2 > 1 = |y_0|$, the sign of $w_i + b$ equals the sign of $y_i$.
Step 3: Verify
The weight vector $\mathbf{w} = (w_1, \ldots, w_d)$ and bias $b$ correctly classify all $d+1$ points according to any desired labeling.
$\therefore$ VCdim($\mathcal{H}_d$) ≥ d + 1
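The construction can be checked exhaustively for small $d$. A sketch using the concrete choice $b = y_0$ and $w_i = 2y_i$ (one valid choice among many):

```python
import numpy as np
from itertools import product

d = 3
# Points: origin plus the d standard basis vectors (as rows)
X = np.vstack([np.zeros(d), np.eye(d)])

for labels in product([-1, 1], repeat=d + 1):
    y = np.array(labels)
    b = y[0]            # sign(b) = y_0, so the origin gets its label
    w = 2 * y[1:]       # sign(w_i + b) = sign(2*y_i + y_0) = y_i
    pred = np.sign(X @ w + b)
    assert np.array_equal(pred, y)
print("All", 2 ** (d + 1), "labelings achieved")
```

Every one of the $2^{d+1}$ labelings is realized, confirming the lower bound for $d = 3$.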
Construct $(w_1, w_2, b)$ for each of the 8 labelings:

| Labeling $(\mathbf{0}, \mathbf{e}_1, \mathbf{e}_2)$ | $(w_1, w_2, b)$ | Verification |
|---------------------|-------------|---------------|
| (+, +, +) | (0, 0, 1) | sign(1)=+, sign(1)=+, sign(1)=+ |
| (+, +, −) | (0, −2, 1) | sign(1)=+, sign(1)=+, sign(−1)=− |
| (+, −, +) | (−2, 0, 1) | sign(1)=+, sign(−1)=−, sign(1)=+ |
| (+, −, −) | (−2, −2, 1) | sign(1)=+, sign(−1)=−, sign(−1)=− |
| (−, +, +) | (2, 2, −1) | sign(−1)=−, sign(1)=+, sign(1)=+ |
| (−, +, −) | (2, 0, −1) | sign(−1)=−, sign(1)=+, sign(−1)=− |
| (−, −, +) | (0, 2, −1) | sign(−1)=−, sign(−1)=−, sign(1)=+ |
| (−, −, −) | (0, 0, −1) | sign(−1)=−, sign(−1)=−, sign(−1)=− |

Each row shows that the specified hyperplane correctly classifies all three points according to the given labeling. This exhaustively verifies that these 3 points can be shattered.
To prove the upper bound, we must show that no set of $d+2$ points in $\mathbb{R}^d$ can be shattered by halfspaces. This requires a universal argument.
Strategy: Radon's Theorem
We use a fundamental result from convex geometry.
Theorem (Radon, 1921): Any set of d+2 points in ℝ^d can be partitioned into two disjoint sets whose convex hulls intersect.
Geometrically: given d+2 points in d dimensions, we can always split them into two groups such that 'something from the first group' and 'something from the second group' occupy the same location in space (as convex combinations).
Proof of Radon's Theorem (Sketch):
Given $d+2$ points $\mathbf{x}_1, \ldots, \mathbf{x}_{d+2}$ in $\mathbb{R}^d$, consider the augmented vectors $(\mathbf{x}_1, 1), \ldots, (\mathbf{x}_{d+2}, 1)$: these are $d+2$ vectors in the $(d+1)$-dimensional space $\mathbb{R}^{d+1}$, so they must be linearly dependent.
Hence there exist scalars $\alpha_1, \ldots, \alpha_{d+2}$, not all zero, such that: $$\sum_{i=1}^{d+2} \alpha_i \mathbf{x}_i = \mathbf{0} \quad \text{and} \quad \sum_{i=1}^{d+2} \alpha_i = 0$$
Define $I^+ = \{i : \alpha_i > 0\}$ and $I^- = \{i : \alpha_i < 0\}$. Let $\lambda = \sum_{i \in I^+} \alpha_i = -\sum_{i \in I^-} \alpha_i$ (these are equal by the sum-to-zero condition).
The point $\mathbf{p} = \frac{1}{\lambda} \sum_{i \in I^+} \alpha_i \mathbf{x}_i = -\frac{1}{\lambda} \sum_{i \in I^-} \alpha_i \mathbf{x}_i$ is in both convex hulls.
$\therefore$ The sets $\{\mathbf{x}_i : i \in I^+\}$ and $\{\mathbf{x}_i : i \in I^-\}$ have intersecting convex hulls.
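A Radon partition can be computed directly from a null-space vector of the augmented point matrix, mirroring the proof sketch above (an illustrative sketch using SciPy's `null_space`):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
d = 2
P = rng.standard_normal((d + 2, d))   # d+2 random points in R^d (rows)

# Null space of [P^T; 1^T] gives alpha with sum_i alpha_i x_i = 0
# and sum_i alpha_i = 0, exactly the coefficients in the proof.
A = np.vstack([P.T, np.ones(d + 2)])  # shape (d+1, d+2): nontrivial null space
alpha = null_space(A)[:, 0]

I_pos, I_neg = alpha > 0, alpha < 0
lam = alpha[I_pos].sum()

# The Radon point: the same convex combination from both sides
p_from_pos = (alpha[I_pos] @ P[I_pos]) / lam
p_from_neg = -(alpha[I_neg] @ P[I_neg]) / lam
assert np.allclose(p_from_pos, p_from_neg)
print("Radon point:", p_from_pos)
```

Because the matrix $A$ has more columns than rows, a nonzero null-space vector always exists, so the construction never fails.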
Completing the Upper Bound Proof:
Now we show no $d+2$ points can be shattered.
Let $S = \{\mathbf{x}_1, \ldots, \mathbf{x}_{d+2}\}$ be any set of $d+2$ points in $\mathbb{R}^d$.
By Radon's theorem, $S$ can be partitioned into $S = A \cup B$ where $A \cap B = \emptyset$ and $\text{conv}(A) \cap \text{conv}(B) \neq \emptyset$.
Consider the labeling that assigns $+1$ to all points in $A$ and $-1$ to all points in $B$.
Claim: No hyperplane can achieve this labeling.
Proof of Claim: Suppose for contradiction that hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ separates $A$ from $B$.
Then for all $\mathbf{a} \in A$: $\mathbf{w}^\top \mathbf{a} + b \geq 0$, and for all $\mathbf{u} \in B$: $\mathbf{w}^\top \mathbf{u} + b < 0$.
Let $\mathbf{p} \in \text{conv}(A) \cap \text{conv}(B)$.
Since $\mathbf{p} \in \text{conv}(A)$: $\mathbf{p} = \sum_{i} \lambda_i \mathbf{a}_i$ for $\lambda_i \geq 0$, $\sum \lambda_i = 1$. Then $\mathbf{w}^\top \mathbf{p} + b = \sum_i \lambda_i (\mathbf{w}^\top \mathbf{a}_i + b) \geq 0$
Since $\mathbf{p} \in \text{conv}(B)$: Similarly $\mathbf{w}^\top \mathbf{p} + b < 0$.
Contradiction! The same point cannot have both $\mathbf{w}^\top \mathbf{p} + b \geq 0$ and $\mathbf{w}^\top \mathbf{p} + b < 0$.
$\therefore$ No hyperplane separates $A$ from $B$, so this labeling cannot be achieved.
$\therefore$ VCdim($\mathcal{H}_d$) ≤ d + 1
Combining the lower and upper bounds:
VCdim(halfspaces in ℝ^d) = d + 1
This is one of the most important results in learning theory, providing exact sample complexity bounds for linear classifiers.
Let's develop geometric intuition for the VC dimension of linear classifiers across different dimensions.
Dimension d = 1 (Thresholds on the Line):
VCdim = 2
For any two points $x_1 < x_2$, all 4 labelings $(-,-)$, $(-,+)$, $(+,-)$, $(+,+)$ can be achieved by moving the threshold and flipping its orientation.
Why 3 points fail: For $x_1 < x_2 < x_3$, the labeling $(+,-,+)$ requires the middle point to be negative while both endpoints are positive—impossible with a single threshold.
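This impossibility can also be illustrated numerically; the grid search below is not a proof, only a sanity check that no sampled threshold classifier produces $(+,-,+)$:

```python
import numpy as np

# Three collinear points and the alternating target labeling
x = np.array([0.0, 1.0, 2.0])
target = np.array([1, -1, 1])

# Brute-force over a grid of (w, b) for sign(w*x + b)
found = False
for w in np.linspace(-5, 5, 201):
    for b in np.linspace(-5, 5, 201):
        if np.array_equal(np.sign(w * x + b), target):
            found = True
print("(+, -, +) achievable on the grid:", found)
```

The loop never finds a match: with a single threshold, the middle point can never carry the opposite sign of both endpoints.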
Dimension d = 2 (Lines in the Plane):
VCdim = 3
For any three points forming a triangle, all $2^3 = 8$ labelings can be achieved: a line can place all three points on one side, isolate any single vertex, or isolate any pair of vertices.
Why 4 points fail: Consider 4 points in convex position (forming a quadrilateral). The alternating labeling $(+,-,+,-)$ cannot be achieved—a line cannot separate alternating corners. If one point is inside the triangle of the other three, similar issues arise.
Dimension d = 3 (3D Space):
VCdim = 4
For four points forming a tetrahedron, all $2^4 = 16$ labelings can be achieved: a plane can separate any subset of the vertices from its complement.
Why 5 points fail: By Radon's theorem, 5 points in $\mathbb{R}^3$ can be partitioned into two sets with intersecting convex hulls. The corresponding labeling is impossible.
The General Pattern:
In $d$ dimensions, we can shatter $d+1$ points (a simplex) but not $d+2$ points. The VC dimension tracks the 'dimensional capacity' of the hypothesis class.
A d-dimensional simplex has d+1 vertices. This is the maximum number of affinely independent points in ℝ^d—and exactly the number of points that linear classifiers can shatter. The simplex structure allows each vertex to be 'seen' independently by hyperplanes, enabling all 2^(d+1) labelings.
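Affine independence of the simplex vertices reduces to a rank condition on the difference vectors, which is easy to check (illustrative sketch):

```python
import numpy as np

d = 4
X = np.vstack([np.zeros(d), np.eye(d)])   # origin + standard basis: d+1 points

# Affine independence <=> the d difference vectors x_i - x_0 span R^d
diffs = X[1:] - X[0]
rank = np.linalg.matrix_rank(diffs)
print("rank of difference vectors:", rank, "== d:", rank == d)
```

Here the differences are just the identity matrix, so the rank is $d$ and no hyperplane contains all $d+1$ vertices.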
There's an elegant alternative proof using linear algebra that provides additional insight.
Setup: Homogeneous Case
Consider homogeneous halfspaces (through the origin) in $\mathbb{R}^d$: $$\mathcal{H}^{\text{hom}} = \{h_\mathbf{w} : \mathbf{w} \in \mathbb{R}^d\}, \quad h_\mathbf{w}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x})$$
Theorem: VCdim($\mathcal{H}^{\text{hom}}$) = $d$
Lower Bound (VCdim ≥ d):
Take the standard basis: $S = \{\mathbf{e}_1, \ldots, \mathbf{e}_d\}$.
For any labeling $(y_1, \ldots, y_d)$, set $\mathbf{w} = (y_1, \ldots, y_d)$.
Then $\mathbf{w}^\top \mathbf{e}_i = y_i$, so $\text{sign}(\mathbf{w}^\top \mathbf{e}_i) = \text{sign}(y_i) = y_i$ ✓
$\therefore$ The basis is shattered.
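This shattering argument is easy to verify exhaustively for a small dimension (a sketch for $d = 3$):

```python
import numpy as np
from itertools import product

d = 3
E = np.eye(d)                       # rows are e_1, ..., e_d
for labels in product([-1, 1], repeat=d):
    y = np.array(labels, dtype=float)
    w = y                           # choose w = (y_1, ..., y_d)
    assert np.array_equal(np.sign(E @ w), y)
print("all", 2 ** d, "labelings of the basis achieved")
```

Since $E\mathbf{w} = \mathbf{w}$, the prediction on $\mathbf{e}_i$ is exactly $\text{sign}(y_i) = y_i$.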
Upper Bound (VCdim ≤ d):
Take any $d+1$ points $\mathbf{x}_1, \ldots, \mathbf{x}_{d+1}$ in $\mathbb{R}^d$.
These vectors are linearly dependent (more vectors than dimensions), so: $$\sum_{i=1}^{d+1} \alpha_i \mathbf{x}_i = \mathbf{0}$$ for some $\alpha_i$ not all zero.
Partition: $I^+ = \{i : \alpha_i > 0\}$, $I^- = \{i : \alpha_i < 0\}$. At least one of these is non-empty, since the $\alpha_i$ are not all zero (unlike the Radon argument, there is no sum-to-zero condition here, so one side may be empty).
Consider the labeling: $y_i = +1$ for $i \in I^+$, $y_i = -1$ for $i \in I^-$, and $y_i$ arbitrary (say $+1$) when $\alpha_i = 0$.
Claim: No $\mathbf{w}$ achieves this labeling.
Proof: $$0 = \mathbf{w}^\top \mathbf{0} = \mathbf{w}^\top \left(\sum_i \alpha_i \mathbf{x}_i\right) = \sum_i \alpha_i (\mathbf{w}^\top \mathbf{x}_i)$$
If $\text{sign}(\mathbf{w}^\top \mathbf{x}_i) = \text{sign}(\alpha_i)$ for every $i$ with $\alpha_i \neq 0$, then each term $\alpha_i (\mathbf{w}^\top \mathbf{x}_i)$ with $\alpha_i \neq 0$ is strictly positive, the remaining terms are zero, and at least one term is positive, so the sum is strictly positive and cannot equal zero.
Contradiction!
$\therefore$ This labeling cannot be achieved, so no set of $d+1$ points is shattered.
This proof directly connects linear dependence to shattering impossibility. The d+1 vectors in d dimensions must be dependent, and this dependence creates a labeling that violates the weighted sum constraint. It's a purely algebraic argument, complementing the geometric Radon theorem approach.
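The algebraic argument can be observed numerically: for random points, a null-space vector $\alpha$ of $X^\top$ gives a labeling that no weight vector realizes (illustrative sketch):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
d = 3
X = rng.standard_normal((d + 1, d))    # d+1 vectors in R^d: linearly dependent

# Nonzero alpha with sum_i alpha_i x_i = 0 (a null-space vector of X^T)
alpha = null_space(X.T)[:, 0]

# For every w, the weighted margins cancel, so the labeling
# y_i = sign(alpha_i) is never realized coordinate-wise.
for _ in range(1000):
    w = rng.standard_normal(d)
    margins = X @ w
    assert abs(alpha @ margins) < 1e-8          # the dependence constraint
    assert not np.array_equal(np.sign(margins), np.sign(alpha))
print("labeling sign(alpha) never achieved over 1000 random w")
```

A random search is of course weaker than the proof, but it makes the mechanism visible: the dependence forces $\sum_i \alpha_i (\mathbf{w}^\top \mathbf{x}_i) = 0$ for every $\mathbf{w}$.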
The VC dimension analysis extends to many variations of linear classifiers.
Polynomial Feature Mapping:
Consider degree-$k$ polynomial classifiers on $\mathbb{R}^d$: $$h(x) = \text{sign}\left(\sum_{|\alpha| \leq k} w_\alpha x^\alpha\right)$$
where $\alpha = (\alpha_1, \ldots, \alpha_d)$ is a multi-index and $x^\alpha = x_1^{\alpha_1} \cdots x_d^{\alpha_d}$.
This is a linear classifier in feature space!
Map $\mathbf{x} \mapsto \phi(\mathbf{x})$ where $\phi(\mathbf{x})$ contains all monomials up to degree $k$.
The dimension of this feature space is: $$D = \binom{d+k}{k}$$
VC Dimension: $$\text{VCdim}(\text{degree-}k \text{ poly in } \mathbb{R}^d) = \binom{d+k}{k}$$
Examples: for $k = 1$ this gives $\binom{d+1}{1} = d+1$, recovering the halfspace result; quadratic classifiers in the plane ($d=2$, $k=2$) have VC dimension $\binom{4}{2} = 6$; cubics in the plane ($d=2$, $k=3$) have $\binom{5}{3} = 10$.
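The feature-space dimension formula above can be evaluated directly for a few cases:

```python
from math import comb

# D = C(d+k, k): number of monomials of degree <= k on R^d
for d, k in [(2, 1), (2, 2), (2, 3), (10, 2)]:
    print(f"d={d}, k={k}: C(d+k, k) = {comb(d + k, k)}")
```

For $k = 1$ this prints $d+1$, matching the halfspace result; the count grows polynomially in $d$ for fixed degree $k$.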
The VC dimension of linear classifiers has profound implications for machine learning practice.
| Dimension d | VC Dimension | Samples for ε=0.1, δ=0.05 |
|---|---|---|
| 10 | 11 | ~1,800 |
| 100 | 101 | ~15,600 |
| 1,000 | 1,001 | ~153,300 |
| 10,000 | 10,001 | ~1,530,400 |
NLP and image models often have millions of dimensions (word embeddings, pixel features). Classical VC theory suggests astronomical sample requirements. Yet modern deep learning works! This is the 'over-parameterized regime' where additional mechanisms (implicit regularization, architecture, optimization dynamics) enable generalization beyond VC predictions.
Let's implement experiments to verify the VC dimension of linear classifiers empirically.
```python
import numpy as np
from itertools import product
from scipy.optimize import linprog


def can_linear_achieve_labeling(X: np.ndarray, y: np.ndarray) -> bool:
    """
    Check if there exists a linear classifier that achieves the given labeling.
    Uses linear programming to test linear separability.

    Args:
        X: Data matrix of shape (n_samples, n_features)
        y: Labels in {-1, +1} of shape (n_samples,)

    Returns:
        True if the labeling is achievable by a linear classifier
    """
    n, d = X.shape
    # Augment X with bias term: [X | 1]
    X_aug = np.hstack([X, np.ones((n, 1))])
    # We want: y_i * (w · x_i + b) >= 1 for all i
    # In standard form: -y_i * (w · x_i + b) <= -1
    # Variables: [w_1, ..., w_d, b]
    A_ub = -y.reshape(-1, 1) * X_aug
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)  # Feasibility test, no objective
    # Variables must be free: linprog's default bounds are (0, None)
    result = linprog(c, A_ub=A_ub, b_ub=b_ub,
                     bounds=[(None, None)] * (d + 1), method='highs')
    return result.success


def verify_shattering(X: np.ndarray) -> tuple[bool, int]:
    """
    Verify if the given points can be shattered by linear classifiers.

    Returns:
        (is_shattered, num_achievable) - whether all labelings work
        and how many do
    """
    n = len(X)
    total_labelings = 2 ** n
    achievable = 0
    for labeling in product([-1, 1], repeat=n):
        y = np.array(labeling)
        if can_linear_achieve_labeling(X, y):
            achievable += 1
    return achievable == total_labelings, achievable


def estimate_vc_dimension(d: int, num_trials: int = 10) -> int:
    """
    Empirically estimate the VC dimension of linear classifiers in R^d.

    Strategy:
    - Try to shatter sets of increasing size
    - Use random points (in general position with probability 1)

    Returns:
        Estimated VC dimension
    """
    vc_estimate = 0
    for m in range(1, d + 4):  # Try up to d+3 points
        shattered_any = False
        for _ in range(num_trials):
            X = np.random.randn(m, d)
            is_shattered, _ = verify_shattering(X)
            if is_shattered:
                shattered_any = True
                break
        if shattered_any:
            vc_estimate = m
        else:
            break
    return vc_estimate


def verify_simplex_shattering(d: int) -> bool:
    """
    Verify that the standard simplex points can be shattered.
    Points: origin, e_1, e_2, ..., e_d
    """
    X = np.vstack([np.zeros(d), np.eye(d)])  # d+1 points
    is_shattered, achievable = verify_shattering(X)
    print(f"Dimension {d}: Simplex ({d+1} points) shattered = {is_shattered}")
    print(f"  Achievable labelings: {achievable} / {2**(d+1)}")
    return is_shattered


def verify_upper_bound(d: int, num_trials: int = 20) -> bool:
    """
    Verify that no set of d+2 points can be shattered.
    """
    m = d + 2
    for _ in range(num_trials):
        X = np.random.randn(m, d)
        is_shattered, _ = verify_shattering(X)
        if is_shattered:
            print(f"Counterexample found! {m} points shattered in {d}D!")
            return False
    print(f"Dimension {d}: No {d+2}-point set shattered "
          f"(checked {num_trials} random sets)")
    return True


if __name__ == "__main__":
    print("=" * 50)
    print("Verifying VC Dimension of Linear Classifiers")
    print("=" * 50)
    for d in [1, 2, 3]:
        print(f"--- Dimension d = {d} ---")
        verify_simplex_shattering(d)   # Lower bound: simplex can be shattered
        verify_upper_bound(d)          # Upper bound: d+2 points cannot be
        print(f"=> VCdim(halfspaces in R^{d}) = {d + 1}")
```

The VC dimension of linear classifiers is the foundational example for all of learning theory.
What's Next:
With the VC dimension of linear classifiers established, we're ready to derive generalization bounds—precise mathematical statements about how test error relates to training error, sample size, and VC dimension. These bounds are the payoff of all our theoretical development.
You've mastered the complete proof that VCdim(halfspaces) = d+1, including both geometric (Radon) and algebraic (linear dependence) approaches. You understand implications for sample complexity and the role of margin. Next, we derive generalization bounds using VC dimension.