In machine learning, we constantly face a fundamental tradeoff: expressive models can fit training data perfectly but may fail to generalize, while simple models generalize well but may underfit. To navigate this tradeoff rigorously, we need a precise mathematical measure of a hypothesis class's expressive power—its ability to represent arbitrary patterns.
The concept of shattering provides exactly this measure. It answers a seemingly simple but profound question: Given a set of data points, can our hypothesis class realize every possible way of labeling those points?
This question lies at the heart of statistical learning theory and leads directly to the celebrated Vapnik-Chervonenkis (VC) dimension, which quantifies hypothesis class complexity in a way that directly connects to generalization guarantees.
By the end of this page, you will deeply understand what it means for a hypothesis class to shatter a set of points. You'll develop geometric and combinatorial intuition for shattering, see concrete examples across different hypothesis classes, and understand why shattering is the key to measuring model complexity for generalization analysis.
Before defining shattering, we need the concept of a dichotomy. This fundamental notion captures how a hypothesis partitions a finite set of points into two classes.
Formal Definition:
Let $\mathcal{H}$ be a hypothesis class consisting of functions $h: \mathcal{X} \to \{-1, +1\}$ mapping from some input space $\mathcal{X}$ to binary labels. Given a finite set of points $S = \{x_1, x_2, \ldots, x_m\} \subset \mathcal{X}$, each hypothesis $h \in \mathcal{H}$ induces a labeling or dichotomy of $S$:
$$h|_S = (h(x_1), h(x_2), \ldots, h(x_m)) \in \{-1, +1\}^m$$
This is simply the vector of predictions that $h$ makes on the points in $S$. Different hypotheses in $\mathcal{H}$ may induce the same dichotomy, or they may induce different dichotomies.
Think of a dichotomy as a way to 'color' the points in S with two colors (say, red and blue). Each hypothesis h assigns a color to each point. The question is: how many different colorings can our hypothesis class produce? If H can produce all possible colorings, it has maximum expressive power over S.
The Set of Achievable Dichotomies:
The set of all dichotomies that $\mathcal{H}$ can induce on $S$ is denoted:
$$\mathcal{H}(S) = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}$$
This is the restriction of $\mathcal{H}$ to the set $S$. Note that $\mathcal{H}(S) \subseteq \{-1, +1\}^m$, meaning the number of achievable dichotomies satisfies:
$$|\mathcal{H}(S)| \leq 2^m$$
The upper bound $2^m$ represents all possible binary labelings of $m$ points. A hypothesis class that achieves this upper bound for a particular set $S$ has complete expressive freedom over $S$—and this is precisely what shattering means.
Consider the class of threshold classifiers on the real line, $\mathcal{H}_{\text{thresh}} = \{h_\theta : \theta \in \mathbb{R}\}$, where $h_\theta(x) = +1$ if $x \geq \theta$ and $-1$ otherwise. Now take two points: $S = \{1, 3\}$. What dichotomies can $\mathcal{H}_{\text{thresh}}$ achieve?
• $\theta = 0$: Both points ≥ θ → $(+1, +1)$
• $\theta = 2$: Point 1 < θ, point 3 ≥ θ → $(-1, +1)$
• $\theta = 4$: Both points < θ → $(-1, -1)$
But **no threshold can achieve** $(+1, -1)$, because threshold classifiers label every point at or above the threshold +1 and every point below it -1. They cannot label a smaller point +1 and a larger point -1.
So $|\mathcal{H}_{\text{thresh}}(S)| = 3 < 4 = 2^2$. Threshold classifiers cannot shatter any set of 2 points because they impose a monotonic structure on the labeling.
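To make this concrete, here is a minimal Python sketch that enumerates the dichotomies threshold classifiers induce on $S = \{1, 3\}$. The three thresholds are the ones from the example; one threshold per "gap" between the points suffices, since thresholds in the same gap induce the same labeling:

```python
# Enumerate the dichotomies h_theta(x) = +1 if x >= theta else -1
# induces on S = {1, 3}.
S = [1, 3]
dichotomies = {tuple(+1 if x >= theta else -1 for x in S)
               for theta in [0, 2, 4]}

print(sorted(dichotomies))
# [(-1, -1), (-1, 1), (1, 1)] -- only 3 of the 4 possible labelings;
# (+1, -1) is unachievable for any threshold.
```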
We now present the central definition that underpins VC theory.
Definition (Shattering):
A hypothesis class $\mathcal{H}$ shatters a finite set $S = \{x_1, \ldots, x_m\} \subset \mathcal{X}$ if $\mathcal{H}$ can realize every possible dichotomy of $S$. Formally:
$$\mathcal{H} \text{ shatters } S \iff |\mathcal{H}(S)| = 2^m$$
Equivalently, for every labeling $(y_1, \ldots, y_m) \in \{-1, +1\}^m$, there exists some $h \in \mathcal{H}$ such that:
$$h(x_i) = y_i \quad \text{for all } i = 1, \ldots, m$$
If H shatters S, then H can perfectly memorize any possible labeling of S, regardless of whether that labeling has any structure or pattern. This is both powerful (high expressiveness) and dangerous (potential for overfitting). Shattering captures the ability to fit arbitrary noise.
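The definition translates directly into code. Below is a minimal sketch of a shattering check for a finite collection of hypotheses (`shatters` is an illustrative helper, not a library function); note that for an infinite class, passing a finite sample of hypotheses can only certify shattering, never refute it:

```python
from itertools import product

def shatters(hypotheses, points):
    """Check |H(S)| == 2^m for a finite collection of hypotheses.

    hypotheses: iterable of callables mapping an input to -1 or +1
    points: the finite set S
    """
    m = len(points)
    # H(S): the set of dichotomies these hypotheses actually realize on S
    achieved = {tuple(h(x) for x in points) for h in hypotheses}
    # Shattering means every labeling in {-1, +1}^m is realized
    return all(labeling in achieved
               for labeling in product([-1, 1], repeat=m))

# Thresholds cannot shatter two points (cf. the example above):
thresholds = [lambda x, t=t: +1 if x >= t else -1 for t in [0, 2, 4]]
print(shatters(thresholds, [1, 3]))  # False
```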
Key Properties of Shattering:
Set-dependent: Whether $\mathcal{H}$ shatters a set depends on which points are in the set, not just how many.
Subset monotonicity: If $\mathcal{H}$ shatters $S$, it also shatters every subset $S' \subseteq S$. (Proof: any labeling of $S'$ extends to a labeling of $S$; realizing the extension on $S$ and restricting to $S'$ realizes the original labeling.)
Existence vs. universality: For measuring complexity, we only ask whether some set of size $m$ can be shattered—there may be many other sets of size $m$ that $\mathcal{H}$ cannot shatter.
Location matters: Two sets of the same size may have very different shatterability properties based on where the points are positioned.
| Set Size m | Total Possible Dichotomies | Requirement to Shatter | Example (Points) |
|---|---|---|---|
| 1 | 2 | Need both (+1) and (-1) | Any single point |
| 2 | 4 | Need all of (−−), (−+), (+−), (++) | Two distinct points |
| 3 | 8 | Need all 8 labelings | Three points in general position |
| 4 | 16 | Need all 16 labelings | Four points (often impossible) |
| m | 2^m | Need all 2^m labelings | Exponentially harder |
The most illuminating way to understand shattering is through geometric visualization. Consider the hypothesis class of linear classifiers (halfspaces) in $\mathbb{R}^2$:
$$\mathcal{H}_{\text{linear}} = \{h_{\mathbf{w},b} : \mathbf{w} \in \mathbb{R}^2, b \in \mathbb{R}\}$$
where $h_{\mathbf{w},b}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$.
Each hypothesis corresponds to a line in $\mathbb{R}^2$, with points on one side labeled +1 and points on the other labeled -1.
Imagine a line sweeping through the plane. As you rotate and translate this line, you create different partitions of the plane. For a fixed set of points, each line orientation produces a dichotomy. The question is: can you achieve ALL possible dichotomies by appropriately positioning the line?
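This sweeping intuition can be simulated. The sketch below samples many random lines and records which dichotomies they induce on one concrete triangle (the same one used in the code later on this page); with enough samples it finds all 8:

```python
import numpy as np

rng = np.random.default_rng(0)
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])

# Sample random lines sign(w·x + b) and collect the induced dichotomies
dichotomies = set()
for _ in range(20_000):
    w = rng.normal(size=2)
    b = rng.normal()
    labels = np.sign(triangle @ w + b)
    if np.all(labels != 0):  # ignore lines passing exactly through a point
        dichotomies.add(tuple(labels.astype(int)))

print(len(dichotomies))  # 8: every labeling of the triangle is realized
```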
Case 1: A Single Point
For any single point $\mathbf{x}_1$, linear classifiers can trivially shatter it: any line not passing through $\mathbf{x}_1$ puts it on one side, and flipping the orientation of $\mathbf{w}$ flips the label.
✓ All $2^1 = 2$ dichotomies achievable.
Case 2: Two Points
For two distinct points $\{\mathbf{x}_1, \mathbf{x}_2\}$: a line perpendicular to the segment joining them can be positioned and oriented to put both points on the positive side, both on the negative side, or one on each side in either order.
✓ All $2^2 = 4$ dichotomies achievable.
Case 3: Three Points
This is where geometry becomes critical. For three points $\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$:
If the points are collinear (lie on a line): The labeling $(+1, -1, +1)$ cannot be achieved—a line cannot separate the middle point from both endpoints. ✗ Cannot shatter collinear points.
If the points form a triangle (general position): All 8 dichotomies are achievable by appropriately orienting the separating line. ✓ Linear classifiers shatter any 3 points in general position in $\mathbb{R}^2$.
Show that all 8 dichotomies are achievable with linear classifiers. For concreteness, take the triangle $\mathbf{x}_1 = (0, 0)$, $\mathbf{x}_2 = (1, 0)$, $\mathbf{x}_3 = (0.5, 1)$:
• $(+,+,+)$: Line $y = -1$ (all points above)
• $(-,-,-)$: Line $y = 2$ (all points below)
• $(+,-,-)$: Vertical line $x = 0.25$ with $\mathbf{x}_1$ alone on the positive side
• $(-,+,-)$: Line $x - y = 0.75$ with $\mathbf{x}_2$ alone on the positive side
• $(-,-,+)$: Horizontal line $y = 0.5$ (only $\mathbf{x}_3$ above)
• $(+,+,-)$: Horizontal line $y = 0.5$ (only $\mathbf{x}_3$ below)
• $(+,-,+)$: Line separating $\mathbf{x}_2$ from $\{\mathbf{x}_1, \mathbf{x}_3\}$ (the $\mathbf{x}_2$-isolating line, orientation flipped)
• $(-,+,+)$: Line separating $\mathbf{x}_1$ from $\{\mathbf{x}_2, \mathbf{x}_3\}$ ($x = 0.25$, orientation flipped)

For any triangle in the plane, every vertex can be isolated by an appropriate line, and every pair can be grouped together, enabling all 8 dichotomies.
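As a sanity check, each line listed above can be written as a pair $(\mathbf{w}, b)$ and verified numerically. This short sketch asserts that $\text{sign}(\mathbf{w}^\top \mathbf{x} + b)$ reproduces each target dichotomy on the triangle:

```python
import numpy as np

# The triangle from the worked example
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])

# One (w, b) per target dichotomy; predictions are sign(w·x + b)
lines = {
    (+1, +1, +1): ([0.0, 1.0], 1.0),     # line y = -1, everything above
    (-1, -1, -1): ([0.0, 1.0], -2.0),    # line y = 2, everything below
    (+1, -1, -1): ([-1.0, 0.0], 0.25),   # line x = 0.25, x1 on + side
    (-1, +1, -1): ([1.0, -1.0], -0.75),  # line x - y = 0.75 isolates x2
    (-1, -1, +1): ([0.0, 1.0], -0.5),    # line y = 0.5, only x3 above
    (+1, +1, -1): ([0.0, -1.0], 0.5),    # line y = 0.5, only x3 below
    (+1, -1, +1): ([-1.0, 1.0], 0.75),   # x2-isolating line, flipped
    (-1, +1, +1): ([1.0, 0.0], -0.25),   # line x = 0.25, flipped
}

for target, (w, b) in lines.items():
    got = tuple(np.sign(X @ np.array(w) + b).astype(int))
    assert got == target, (target, got)
print("all 8 dichotomies verified")
```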
Case 4: Four Points
Can linear classifiers shatter any set of 4 points in $\mathbb{R}^2$? The answer is no:
Convex hull argument: Consider 4 points in general position that form the vertices of a convex quadrilateral. An alternating labeling like $(+1, -1, +1, -1)$ (going around the quadrilateral) cannot be achieved by a single line: the segment joining the two $+1$ points and the segment joining the two $-1$ points are the diagonals of the quadrilateral, which intersect, so the convex hulls of the two classes overlap and no line can separate them.
One point inside: If one point lies inside the convex hull of the other three, no line can give the inner point the opposite label from all three outer points—any halfspace containing the three outer points also contains their convex hull, and hence the inner point.
In both cases, there exists at least one dichotomy that no linear classifier can achieve. Therefore:
Linear classifiers in $\mathbb{R}^2$ cannot shatter any set of 4 points.
This illustrates a fundamental principle: every hypothesis class has a maximum set size it can shatter. For linear classifiers in ℝ², this maximum is 3. No matter how cleverly you position 4 points, a line cannot realize all 16 dichotomies. This maximum is exactly the VC dimension.
Let's systematically examine shattering behavior across different hypothesis classes to build intuition for how model structure affects expressive power.
Interval Classifiers:
$\mathcal{H}_{\text{interval}} = \{h_{a,b} : a \leq b,\ a, b \in \mathbb{R}\}$
where $h_{a,b}(x) = +1$ if $x \in [a,b]$ and $-1$ otherwise.
Analysis:
Can we shatter 2 points? Yes! For $\{x_1, x_2\}$ with $x_1 < x_2$: the interval $[x_1, x_2]$ gives $(+1, +1)$, the degenerate intervals $[x_1, x_1]$ and $[x_2, x_2]$ give $(+1, -1)$ and $(-1, +1)$, and any interval containing neither point gives $(-1, -1)$.
Can we shatter 3 points? No! Consider $x_1 < x_2 < x_3$. The labeling $(+1, -1, +1)$ is impossible—an interval cannot contain $x_1$ and $x_3$ while excluding $x_2$.
Result: Interval classifiers can shatter at most 2 points.
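A small sketch confirms this count: an achievable interval labeling marks a contiguous block of points +1 (or no points at all), so we can enumerate the dichotomies directly:

```python
def interval_dichotomies(points):
    """Dichotomies realizable on points by h_{a,b}(x) = +1 iff a <= x <= b."""
    pts = sorted(points)
    m = len(pts)
    dichos = {tuple([-1] * m)}  # an interval containing no points
    # Any other achievable labeling marks a contiguous block [i..j] as +1
    for i in range(m):
        for j in range(i, m):
            labels = [-1] * m
            labels[i:j + 1] = [+1] * (j - i + 1)
            dichos.add(tuple(labels))
    return dichos

print(len(interval_dichotomies([1, 2])))     # 4 = 2^2: two points are shattered
print(len(interval_dichotomies([1, 2, 3])))  # 7 < 8: (+1, -1, +1) is missing
```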
Sinusoidal Classifiers:
Consider the one-parameter class $\{x \mapsto \text{sign}(\sin(\omega x)) : \omega \in \mathbb{R}\}$. Remarkably, despite having a single parameter, this class can shatter arbitrarily large point sets: by choosing the frequency $\omega$ carefully, the rapid oscillations of the sine wave can realize any labeling of suitably placed points.
The sinusoidal example reveals a crucial insight: the number of parameters in a model does NOT determine its capacity to overfit. A single-parameter sinusoidal classifier has infinite VC dimension! This is why we need VC dimension rather than parameter counting to measure true model complexity.
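This can be checked computationally with a classic construction (a sketch under these assumptions; the helper name `sinusoid_shatters` is ours): place the points at $x_i = 10^{-i}$ and, for a target labeling $(y_1, \ldots, y_m)$, choose $\omega = \pi\left(1 + \sum_i b_i \, 10^i\right)$ with bits $b_i = (1 - y_i)/2$:

```python
import numpy as np
from itertools import product

def sinusoid_shatters(m):
    """Verify sign(sin(omega * x)) realizes all 2^m labelings of x_i = 10^-i."""
    xs = np.array([10.0 ** -(i + 1) for i in range(m)])
    for ys in product([-1, 1], repeat=m):
        bits = [(1 - y) // 2 for y in ys]  # y = +1 -> bit 0, y = -1 -> bit 1
        omega = np.pi * (1 + sum(b * 10 ** (i + 1) for i, b in enumerate(bits)))
        preds = np.sign(np.sin(omega * xs))
        if tuple(preds.astype(int)) != ys:
            return False
    return True

print(sinusoid_shatters(5))  # True: one parameter, all 32 labelings realized
```

The check passes for any small $m$ (floating-point precision eventually limits how far you can push it), which is precisely what infinite VC dimension means: no finite set size caps the shattering.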
While shattering is a binary concept (a set is either shattered or not), we often want to quantify how many dichotomies a hypothesis class can achieve. This leads to the growth function (also called the shattering coefficient).
Definition (Growth Function):
For a hypothesis class $\mathcal{H}$, the growth function $\Pi_\mathcal{H}: \mathbb{N} \to \mathbb{N}$ is defined as:
$$\Pi_\mathcal{H}(m) = \max_{S \subset \mathcal{X}, |S| = m} |\mathcal{H}(S)|$$
This is the maximum number of dichotomies that $\mathcal{H}$ can induce on any set of $m$ points.
Properties of the Growth Function:
Upper bound: $\Pi_\mathcal{H}(m) \leq 2^m$ for all $m$
Monotonicity: While $\Pi_\mathcal{H}(m)$ is non-decreasing, the rate of growth can vary
Shattering connection: $\mathcal{H}$ shatters some set of size $m$ if and only if $\Pi_\mathcal{H}(m) = 2^m$
VC dimension connection: If $\text{VCdim}(\mathcal{H}) = d$, then $\Pi_\mathcal{H}(m) = 2^m$ for all $m \leq d$, while $\Pi_\mathcal{H}(m) < 2^m$ for every $m > d$.
| Hypothesis Class | VC Dimension | Growth Function Π(m) | Behavior |
|---|---|---|---|
| Positive rays on ℝ | 1 | m + 1 | Linear growth |
| Intervals on ℝ | 2 | m(m-1)/2 + m + 1 = O(m²) | Quadratic growth |
| Halfspaces in ℝ^d | d + 1 | 2 Σᵢ₌₀ᵈ C(m−1,i) = O(m^d) | Polynomial (degree d) |
| Finite class, \|H\| = N | ≤ log₂(N) | min(N, 2^m) | Bounded by class size |
| Sinusoids sin(ωx) | ∞ | 2^m for all m | Exponential (maximal) |
A fundamental result (the Sauer-Shelah Lemma) shows that once the VC dimension is finite (say d), the growth function is bounded by a polynomial: Π(m) ≤ Σᵢ₌₀ᵈ C(m,i), which is at most (em/d)^d for m ≥ d. This polynomial bound is crucial for deriving generalization guarantees—it ensures that the hypothesis class cannot be too complex.
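A quick numeric comparison (a sketch using $d = 3$, the VC dimension of halfspaces in $\mathbb{R}^2$) shows how the polynomial Sauer-Shelah bound separates from the $2^m$ ceiling:

```python
from math import comb, e

d = 3  # e.g., VC dimension of halfspaces in R^2
print(f"{'m':>4} {'Sauer bound':>12} {'(em/d)^d':>12} {'2^m':>16}")
for m in [5, 10, 20, 50]:
    sauer = sum(comb(m, i) for i in range(d + 1))
    print(f"{m:>4} {sauer:>12} {(e * m / d) ** d:>12.0f} {2 ** m:>16}")
# Past m = d, the Sauer bound grows polynomially while 2^m explodes.
```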
Why the Growth Function Matters:
The growth function appears directly in generalization bounds. Roughly speaking, the sample complexity and generalization error depend on $\Pi_\mathcal{H}(m)$ or its logarithm. When $\Pi_\mathcal{H}(m)$ grows polynomially rather than exponentially, we can learn with polynomial sample complexity.
This is the key insight: finite VC dimension implies polynomial growth, which implies learnability.
The concept of shattering might seem abstract, but it has profound implications for machine learning. Understanding shattering answers the fundamental question: Why do some models generalize while others memorize?
The Overfitting Connection:
If $\mathcal{H}$ can shatter a large set, it means $\mathcal{H}$ can perfectly fit any labeling of that set—including completely random labels with no underlying pattern. This ability to fit noise is precisely what we mean by overfitting.
Consider an extreme case: if $\mathcal{H}$ shatters a training set of size $m$, then for any target function $f$ (even if $f$ generates labels randomly), there exists $h \in \mathcal{H}$ with zero training error. But such an $h$ has learned nothing—it has merely memorized.
If your hypothesis class can shatter your training set, you can achieve zero training error on ANY labeling—including pure noise. This is not a feature; it's a liability. It means your training error tells you nothing about your test error.
The Generalization Intuition:
The more points $\mathcal{H}$ can shatter, the more freedom it has to fit arbitrary patterns. This freedom requires more data to distinguish genuine patterns from noise. The VC dimension quantifies this: roughly, the number of training examples needed for reliable generalization scales linearly with the VC dimension of the hypothesis class.
This gives the first principled answer to 'how much data do I need?'—it depends on the VC dimension of your hypothesis class.
The Fundamental Theorem Preview:
We're building toward one of the deepest results in learning theory:
A hypothesis class $\mathcal{H}$ is PAC-learnable if and only if VCdim($\mathcal{H}$) is finite.
This theorem (which we'll prove in later pages) shows that shattering—and its quantification via VC dimension—is not just one way to measure complexity, but the fundamental measure that determines learnability.
While shattering is a theoretical concept, understanding its computational aspects helps bridge theory and practice.
Verifying Shattering:
Given a specific set $S$ and hypothesis class $\mathcal{H}$, how do we check if $\mathcal{H}$ shatters $S$?
Naive approach: Enumerate all $2^{|S|}$ labelings and verify each is realizable. This is exponential in $|S|$.
For structured $\mathcal{H}$: Often we can use geometric or algebraic arguments. For linear classifiers, this reduces to checking linear separability of various point subsets.
```python
import numpy as np
from itertools import product
from scipy.optimize import linprog


def can_shatter_linear(points: np.ndarray) -> bool:
    """
    Check if linear classifiers can shatter the given 2D points.
    Uses linear separability testing for each dichotomy.

    Args:
        points: Array of shape (n_points, 2)

    Returns:
        True if linear classifiers shatter this point set
    """
    n = len(points)
    # Generate all 2^n possible labelings
    for labels in product([-1, 1], repeat=n):
        labels = np.array(labels)
        # Check if this labeling is linearly separable
        if not is_linearly_separable(points, labels):
            return False
    return True


def is_linearly_separable(X: np.ndarray, y: np.ndarray) -> bool:
    """
    Check if points X with labels y are linearly separable,
    via a linear-programming feasibility problem.
    """
    n, d = X.shape
    # Formulate as LP: find w, b such that y_i (w·x_i + b) >= 1
    # Variables: [w_1, ..., w_d, b]
    # Rewrite each constraint as: -y_i * (w·x_i) - y_i * b <= -1
    A_ub = np.zeros((n, d + 1))
    for i in range(n):
        A_ub[i, :d] = -y[i] * X[i]
        A_ub[i, d] = -y[i]
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)  # zero objective: we only need feasibility
    # Variables must be free (linprog defaults to non-negative bounds)
    bounds = [(None, None)] * (d + 1)

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
    return result.success


# Example: Test shattering for 3 points
triangle = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.5, 1.0],
])
print(f"Can linear classifiers shatter 3 points in triangle? {can_shatter_linear(triangle)}")
# Output: True

# Example: Test with 4 points
square = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])
print(f"Can linear classifiers shatter 4 points in square? {can_shatter_linear(square)}")
# Output: False
```

In general, computing the VC dimension of an arbitrary hypothesis class is computationally hard (coNP-complete for neural networks, for example). However, for well-structured classes like linear classifiers, kernel methods, and polynomial classifiers, the VC dimension can be computed or bounded analytically.
We've developed a deep understanding of shattering—the foundation for VC theory and generalization analysis.
What's Next:
With shattering firmly understood, we're ready to define the VC dimension formally—the largest set size a hypothesis class can shatter. This single number summarizes the expressive power of a hypothesis class and directly determines sample complexity and generalization bounds.
You now understand shattering—the core concept underlying VC theory. You can analyze whether a hypothesis class shatters a given point set, compute dichotomies, and understand why shattering ability determines a model's capacity to overfit. Next, we'll formalize the VC dimension and explore its properties.