In machine learning, we constantly face a fundamental tradeoff: expressive models can fit training data perfectly but may fail to generalize, while simple models generalize well but may underfit. To navigate this tradeoff rigorously, we need a precise mathematical measure of a hypothesis class's expressive power—its ability to represent arbitrary patterns.
The concept of shattering provides exactly this measure. It answers a seemingly simple but profound question: Given a set of data points, can our hypothesis class realize every possible way of labeling those points?
This question lies at the heart of statistical learning theory and leads directly to the celebrated Vapnik-Chervonenkis (VC) dimension, which quantifies hypothesis class complexity in a way that directly connects to generalization guarantees.
By the end of this page, you will deeply understand what it means for a hypothesis class to shatter a set of points. You'll develop geometric and combinatorial intuition for shattering, see concrete examples across different hypothesis classes, and understand why shattering is the key to measuring model complexity for generalization analysis.
Before defining shattering, we need the concept of a dichotomy. This fundamental notion captures how a hypothesis partitions a finite set of points into two classes.
Formal Definition:
Let $\mathcal{H}$ be a hypothesis class consisting of functions $h: \mathcal{X} \to \{-1, +1\}$ mapping from some input space $\mathcal{X}$ to binary labels. Given a finite set of points $S = \{x_1, x_2, \ldots, x_m\} \subset \mathcal{X}$, each hypothesis $h \in \mathcal{H}$ induces a labeling or dichotomy of $S$:
$$h|_S = (h(x_1), h(x_2), \ldots, h(x_m)) \in \{-1, +1\}^m$$
This is simply the vector of predictions that $h$ makes on the points in $S$. Different hypotheses in $\mathcal{H}$ may induce the same dichotomy, or they may induce different dichotomies.
Think of a dichotomy as a way to 'color' the points in S with two colors (say, red and blue). Each hypothesis h assigns a color to each point. The question is: how many different colorings can our hypothesis class produce? If H can produce all possible colorings, it has maximum expressive power over S.
The Set of Achievable Dichotomies:
The set of all dichotomies that $\mathcal{H}$ can induce on $S$ is denoted:
$$\mathcal{H}(S) = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}$$
This is the restriction of $\mathcal{H}$ to the set $S$. Note that $\mathcal{H}(S) \subseteq \{-1, +1\}^m$, meaning the number of achievable dichotomies satisfies:
$$|\mathcal{H}(S)| \leq 2^m$$
The upper bound $2^m$ represents all possible binary labelings of $m$ points. A hypothesis class that achieves this upper bound for a particular set $S$ has complete expressive freedom over $S$—and this is precisely what shattering means.
Consider the class of threshold classifiers on the real line, $\mathcal{H}_{\text{thresh}} = \{h_\theta : \theta \in \mathbb{R}\}$, where $h_\theta(x) = +1$ if $x \geq \theta$ and $-1$ otherwise. Now take two points: $S = \{1, 3\}$. What dichotomies can $\mathcal{H}_{\text{thresh}}$ achieve?
• $\theta = 0$: Both points ≥ θ → $(+1, +1)$
• $\theta = 2$: Point 1 < θ, point 3 ≥ θ → $(-1, +1)$
• $\theta = 4$: Both points < θ → $(-1, -1)$
But **no threshold can achieve** $(+1, -1)$, because threshold classifiers label every point at or above the threshold +1 and every point below it -1. They cannot label a smaller point +1 and a larger point -1.
So $|\mathcal{H}_{\text{thresh}}(S)| = 3 < 4 = 2^2$. Threshold classifiers cannot shatter any set of 2 points because they impose a monotonic structure on the labeling.
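To make this concrete, here is a minimal Python sketch that enumerates the dichotomies threshold classifiers induce on $S = \{1, 3\}$. The three thresholds are the ones from the example; one threshold per "gap" between the points suffices, since thresholds in the same gap induce the same labeling:

```python
# Enumerate the dichotomies h_theta(x) = +1 if x >= theta else -1
# induces on S = {1, 3}.
S = [1, 3]
dichotomies = {tuple(+1 if x >= theta else -1 for x in S)
               for theta in [0, 2, 4]}

print(sorted(dichotomies))
# [(-1, -1), (-1, 1), (1, 1)] -- only 3 of the 4 possible labelings;
# (+1, -1) is unachievable for any threshold.
```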
We now present the central definition that underpins VC theory.
Definition (Shattering):
A hypothesis class $\mathcal{H}$ shatters a finite set $S = \{x_1, \ldots, x_m\} \subset \mathcal{X}$ if $\mathcal{H}$ can realize every possible dichotomy of $S$. Formally:
$$\mathcal{H} \text{ shatters } S \iff |\mathcal{H}(S)| = 2^m$$
Equivalently, for every labeling $(y_1, \ldots, y_m) \in \{-1, +1\}^m$, there exists some $h \in \mathcal{H}$ such that:
$$h(x_i) = y_i \quad \text{for all } i = 1, \ldots, m$$
If H shatters S, then H can perfectly memorize any possible labeling of S, regardless of whether that labeling has any structure or pattern. This is both powerful (high expressiveness) and dangerous (potential for overfitting). Shattering captures the ability to fit arbitrary noise.
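The definition translates directly into code. Below is a minimal sketch of a shattering check for a finite collection of hypotheses (`shatters` is an illustrative helper, not a library function); note that for an infinite class, passing a finite sample of hypotheses can only certify shattering, never refute it:

```python
from itertools import product

def shatters(hypotheses, points):
    """Check |H(S)| == 2^m for a finite collection of hypotheses.

    hypotheses: iterable of callables mapping an input to -1 or +1
    points: the finite set S
    """
    m = len(points)
    # H(S): the set of dichotomies these hypotheses actually realize on S
    achieved = {tuple(h(x) for x in points) for h in hypotheses}
    # Shattering means every labeling in {-1, +1}^m is realized
    return all(labeling in achieved
               for labeling in product([-1, 1], repeat=m))

# Thresholds cannot shatter two points (cf. the example above):
thresholds = [lambda x, t=t: +1 if x >= t else -1 for t in [0, 2, 4]]
print(shatters(thresholds, [1, 3]))  # False
```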
Key Properties of Shattering:
Set-dependent: Whether $\mathcal{H}$ shatters a set depends on which points are in the set, not just how many.
Subset monotonicity: If $\mathcal{H}$ shatters $S$, it also shatters every subset $S' \subseteq S$. (Proof: any labeling of $S'$ extends to a labeling of $S$; realizing the extension on $S$ and restricting to $S'$ realizes the original labeling.)
Existence vs. universality: For measuring complexity, we only ask whether some set of size $m$ can be shattered—there may be many other sets of size $m$ that $\mathcal{H}$ cannot shatter.
Location matters: Two sets of the same size may have very different shatterability properties based on where the points are positioned.
| Set Size m | Total Possible Dichotomies | Requirement to Shatter | Example (Points) |
|---|---|---|---|
| 1 | 2 | Need both (+1) and (-1) | Any single point |
| 2 | 4 | Need all of (−−), (−+), (+−), (++) | Two distinct points |
| 3 | 8 | Need all 8 labelings | Three points in general position |
| 4 | 16 | Need all 16 labelings | Four points (often impossible) |
| m | 2^m | Need all 2^m labelings | Exponentially harder |
The most illuminating way to understand shattering is through geometric visualization. Consider the hypothesis class of linear classifiers (halfspaces) in $\mathbb{R}^2$:
$$\mathcal{H}_{\text{linear}} = \{h_{\mathbf{w},b} : \mathbf{w} \in \mathbb{R}^2, b \in \mathbb{R}\}$$
where $h_{\mathbf{w},b}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$.
Each hypothesis corresponds to a line in $\mathbb{R}^2$, with points on one side labeled +1 and points on the other labeled -1.
Imagine a line sweeping through the plane. As you rotate and translate this line, you create different partitions of the plane. For a fixed set of points, each line orientation produces a dichotomy. The question is: can you achieve ALL possible dichotomies by appropriately positioning the line?
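This sweeping intuition can be simulated. The sketch below samples many random lines and records which dichotomies they induce on one concrete triangle (the same one used in the code later on this page); with enough samples it finds all 8:

```python
import numpy as np

rng = np.random.default_rng(0)
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])

# Sample random lines sign(w·x + b) and collect the induced dichotomies
dichotomies = set()
for _ in range(20_000):
    w = rng.normal(size=2)
    b = rng.normal()
    labels = np.sign(triangle @ w + b)
    if np.all(labels != 0):  # ignore lines passing exactly through a point
        dichotomies.add(tuple(labels.astype(int)))

print(len(dichotomies))  # 8: every labeling of the triangle is realized
```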
Case 1: A Single Point
For any single point $\mathbf{x}_1$, linear classifiers can trivially shatter it: any line not passing through $\mathbf{x}_1$ puts it on one side, and flipping the orientation of $\mathbf{w}$ flips the label.
✓ All $2^1 = 2$ dichotomies achievable.
Case 2: Two Points
For two distinct points $\{\mathbf{x}_1, \mathbf{x}_2\}$: a line perpendicular to the segment joining them can be positioned and oriented to put both points on the positive side, both on the negative side, or one on each side in either order.
✓ All $2^2 = 4$ dichotomies achievable.
Case 3: Three Points
This is where geometry becomes critical. For three points $\{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$:
If the points are collinear (lie on a line): The labeling $(+1, -1, +1)$ cannot be achieved—a line cannot separate the middle point from both endpoints. ✗ Cannot shatter collinear points.
If the points form a triangle (general position): All 8 dichotomies are achievable by appropriately orienting the separating line. ✓ Linear classifiers shatter any 3 points in general position in $\mathbb{R}^2$.
Show that all 8 dichotomies are achievable with linear classifiers. For concreteness, take the triangle $\mathbf{x}_1 = (0, 0)$, $\mathbf{x}_2 = (1, 0)$, $\mathbf{x}_3 = (0.5, 1)$:
• $(+,+,+)$: Line $y = -1$ (all points above)
• $(-,-,-)$: Line $y = 2$ (all points below)
• $(+,-,-)$: Vertical line $x = 0.25$ with $\mathbf{x}_1$ alone on the positive side
• $(-,+,-)$: Line $x - y = 0.75$ with $\mathbf{x}_2$ alone on the positive side
• $(-,-,+)$: Horizontal line $y = 0.5$ (only $\mathbf{x}_3$ above)
• $(+,+,-)$: Horizontal line $y = 0.5$ (only $\mathbf{x}_3$ below)
• $(+,-,+)$: Line separating $\mathbf{x}_2$ from $\{\mathbf{x}_1, \mathbf{x}_3\}$ (the $\mathbf{x}_2$-isolating line, orientation flipped)
• $(-,+,+)$: Line separating $\mathbf{x}_1$ from $\{\mathbf{x}_2, \mathbf{x}_3\}$ ($x = 0.25$, orientation flipped)

For any triangle in the plane, every vertex can be isolated by an appropriate line, and every pair can be grouped together, enabling all 8 dichotomies.
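As a sanity check, each line listed above can be written as a pair $(\mathbf{w}, b)$ and verified numerically. This short sketch asserts that $\text{sign}(\mathbf{w}^\top \mathbf{x} + b)$ reproduces each target dichotomy on the triangle:

```python
import numpy as np

# The triangle from the worked example
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])

# One (w, b) per target dichotomy; predictions are sign(w·x + b)
lines = {
    (+1, +1, +1): ([0.0, 1.0], 1.0),     # line y = -1, everything above
    (-1, -1, -1): ([0.0, 1.0], -2.0),    # line y = 2, everything below
    (+1, -1, -1): ([-1.0, 0.0], 0.25),   # line x = 0.25, x1 on + side
    (-1, +1, -1): ([1.0, -1.0], -0.75),  # line x - y = 0.75 isolates x2
    (-1, -1, +1): ([0.0, 1.0], -0.5),    # line y = 0.5, only x3 above
    (+1, +1, -1): ([0.0, -1.0], 0.5),    # line y = 0.5, only x3 below
    (+1, -1, +1): ([-1.0, 1.0], 0.75),   # x2-isolating line, flipped
    (-1, +1, +1): ([1.0, 0.0], -0.25),   # line x = 0.25, flipped
}

for target, (w, b) in lines.items():
    got = tuple(np.sign(X @ np.array(w) + b).astype(int))
    assert got == target, (target, got)
print("all 8 dichotomies verified")
```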
Case 4: Four Points
Can linear classifiers shatter any set of 4 points in $\mathbb{R}^2$? The answer is no:
Convex hull argument: Consider 4 points in general position that form the vertices of a convex quadrilateral. An alternating labeling like $(+1, -1, +1, -1)$ (going around the quadrilateral) cannot be achieved by a single line: the segment joining the two $+1$ points and the segment joining the two $-1$ points are the diagonals of the quadrilateral, which intersect, so the convex hulls of the two classes overlap and no line can separate them.
One point inside: If one point lies inside the convex hull of the other three, no line can give the inner point the opposite label from all three outer points—any halfspace containing the three outer points also contains their convex hull, and hence the inner point.
In both cases, there exists at least one dichotomy that no linear classifier can achieve. Therefore:
Linear classifiers in $\mathbb{R}^2$ cannot shatter any set of 4 points.
This illustrates a fundamental principle: every hypothesis class has a maximum set size it can shatter. For linear classifiers in ℝ², this maximum is 3. No matter how cleverly you position 4 points, a line cannot realize all 16 dichotomies. This maximum is exactly the VC dimension.
Let's systematically examine shattering behavior across different hypothesis classes to build intuition for how model structure affects expressive power.
Interval Classifiers:
$\mathcal{H}_{\text{interval}} = \{h_{a,b} : a \leq b,\ a, b \in \mathbb{R}\}$
where $h_{a,b}(x) = +1$ if $x \in [a,b]$ and $-1$ otherwise.
Analysis:
Can we shatter 2 points? Yes! For $\{x_1, x_2\}$ with $x_1 < x_2$: the interval $[x_1, x_2]$ gives $(+1, +1)$, the degenerate intervals $[x_1, x_1]$ and $[x_2, x_2]$ give $(+1, -1)$ and $(-1, +1)$, and any interval containing neither point gives $(-1, -1)$.
Can we shatter 3 points? No! Consider $x_1 < x_2 < x_3$. The labeling $(+1, -1, +1)$ is impossible—an interval cannot contain $x_1$ and $x_3$ while excluding $x_2$.
Result: Interval classifiers can shatter at most 2 points.
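A small sketch confirms this count: an achievable interval labeling marks a contiguous block of points +1 (or no points at all), so we can enumerate the dichotomies directly:

```python
def interval_dichotomies(points):
    """Dichotomies realizable on points by h_{a,b}(x) = +1 iff a <= x <= b."""
    pts = sorted(points)
    m = len(pts)
    dichos = {tuple([-1] * m)}  # an interval containing no points
    # Any other achievable labeling marks a contiguous block [i..j] as +1
    for i in range(m):
        for j in range(i, m):
            labels = [-1] * m
            labels[i:j + 1] = [+1] * (j - i + 1)
            dichos.add(tuple(labels))
    return dichos

print(len(interval_dichotomies([1, 2])))     # 4 = 2^2: two points are shattered
print(len(interval_dichotomies([1, 2, 3])))  # 7 < 8: (+1, -1, +1) is missing
```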
Sinusoidal Classifiers:
Consider the one-parameter class $\{x \mapsto \text{sign}(\sin(\omega x)) : \omega \in \mathbb{R}\}$. Remarkably, despite having a single parameter, this class can shatter arbitrarily large point sets: by choosing the frequency $\omega$ carefully, the rapid oscillations of the sine wave can realize any labeling of suitably placed points.
The sinusoidal example reveals a crucial insight: the number of parameters in a model does NOT determine its capacity to overfit. A single-parameter sinusoidal classifier has infinite VC dimension! This is why we need VC dimension rather than parameter counting to measure true model complexity.
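This can be checked computationally with a classic construction (a sketch under these assumptions; the helper name `sinusoid_shatters` is ours): place the points at $x_i = 10^{-i}$ and, for a target labeling $(y_1, \ldots, y_m)$, choose $\omega = \pi\left(1 + \sum_i b_i \, 10^i\right)$ with bits $b_i = (1 - y_i)/2$:

```python
import numpy as np
from itertools import product

def sinusoid_shatters(m):
    """Verify sign(sin(omega * x)) realizes all 2^m labelings of x_i = 10^-i."""
    xs = np.array([10.0 ** -(i + 1) for i in range(m)])
    for ys in product([-1, 1], repeat=m):
        bits = [(1 - y) // 2 for y in ys]  # y = +1 -> bit 0, y = -1 -> bit 1
        omega = np.pi * (1 + sum(b * 10 ** (i + 1) for i, b in enumerate(bits)))
        preds = np.sign(np.sin(omega * xs))
        if tuple(preds.astype(int)) != ys:
            return False
    return True

print(sinusoid_shatters(5))  # True: one parameter, all 32 labelings realized
```

The check passes for any small $m$ (floating-point precision eventually limits how far you can push it), which is precisely what infinite VC dimension means: no finite set size caps the shattering.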
While shattering is a binary concept (a set is either shattered or not), we often want to quantify how many dichotomies a hypothesis class can achieve. This leads to the growth function (also called the shattering coefficient).
Definition (Growth Function):
For a hypothesis class $\mathcal{H}$, the growth function $\Pi_\mathcal{H}: \mathbb{N} \to \mathbb{N}$ is defined as:
$$\Pi_\mathcal{H}(m) = \max_{S \subset \mathcal{X}, |S| = m} |\mathcal{H}(S)|$$
This is the maximum number of dichotomies that $\mathcal{H}$ can induce on any set of $m$ points.
Properties of the Growth Function:
Upper bound: $\Pi_\mathcal{H}(m) \leq 2^m$ for all $m$
Monotonicity: While $\Pi_\mathcal{H}(m)$ is non-decreasing, the rate of growth can vary
Shattering connection: $\mathcal{H}$ shatters some set of size $m$ if and only if $\Pi_\mathcal{H}(m) = 2^m$
VC dimension connection: If $\text{VCdim}(\mathcal{H}) = d$, then $\Pi_\mathcal{H}(m) = 2^m$ for all $m \leq d$, while $\Pi_\mathcal{H}(m) < 2^m$ for every $m > d$.
| Hypothesis Class | VC Dimension | Growth Function Π(m) | Behavior |
|---|---|---|---|
| Positive rays on ℝ | 1 | m + 1 | Linear growth |
| Intervals on ℝ | 2 | m(m-1)/2 + m + 1 = O(m²) | Quadratic growth |
| Halfspaces in ℝ^d | d + 1 | 2 Σᵢ₌₀ᵈ C(m−1,i) = O(m^d) | Polynomial (degree d) |
| Finite class, \|H\| = N | ≤ log₂(N) | min(N, 2^m) | Bounded by class size |
| Sinusoids sin(ωx) | ∞ | 2^m for all m | Exponential (maximal) |
A fundamental result (the Sauer-Shelah Lemma) shows that once the VC dimension is finite (say d), the growth function is bounded by a polynomial: Π(m) ≤ Σᵢ₌₀ᵈ C(m,i), which is at most (em/d)^d for m ≥ d. This polynomial bound is crucial for deriving generalization guarantees—it ensures that the hypothesis class cannot be too complex.
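A quick numeric comparison (a sketch using $d = 3$, the VC dimension of halfspaces in $\mathbb{R}^2$) shows how the polynomial Sauer-Shelah bound separates from the $2^m$ ceiling:

```python
from math import comb, e

d = 3  # e.g., VC dimension of halfspaces in R^2
print(f"{'m':>4} {'Sauer bound':>12} {'(em/d)^d':>12} {'2^m':>16}")
for m in [5, 10, 20, 50]:
    sauer = sum(comb(m, i) for i in range(d + 1))
    print(f"{m:>4} {sauer:>12} {(e * m / d) ** d:>12.0f} {2 ** m:>16}")
# Past m = d, the Sauer bound grows polynomially while 2^m explodes.
```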
Why the Growth Function Matters:
The growth function appears directly in generalization bounds. Roughly speaking, the sample complexity and generalization error depend on $\Pi_\mathcal{H}(m)$ or its logarithm. When $\Pi_\mathcal{H}(m)$ grows polynomially rather than exponentially, we can learn with polynomial sample complexity.
This is the key insight: finite VC dimension implies polynomial growth, which implies learnability.
The concept of shattering might seem abstract, but it has profound implications for machine learning. Understanding shattering answers the fundamental question: Why do some models generalize while others memorize?
The Overfitting Connection:
If $\mathcal{H}$ can shatter a large set, it means $\mathcal{H}$ can perfectly fit any labeling of that set—including completely random labels with no underlying pattern. This ability to fit noise is precisely what we mean by overfitting.
Consider an extreme case: if $\mathcal{H}$ shatters a training set of size $m$, then for any target function $f$ (even if $f$ generates labels randomly), there exists $h \in \mathcal{H}$ with zero training error. But such an $h$ has learned nothing—it has merely memorized.
If your hypothesis class can shatter your training set, you can achieve zero training error on ANY labeling—including pure noise. This is not a feature; it's a liability. It means your training error tells you nothing about your test error.
The Generalization Intuition:
The more points $\mathcal{H}$ can shatter, the more freedom it has to fit arbitrary patterns. This freedom requires more data to distinguish genuine patterns from noise. The VC dimension quantifies this: roughly, the number of training examples needed for reliable generalization scales linearly with the VC dimension of the hypothesis class.
This gives the first principled answer to 'how much data do I need?'—it depends on the VC dimension of your hypothesis class.
The Fundamental Theorem Preview:
We're building toward one of the deepest results in learning theory:
A hypothesis class $\mathcal{H}$ is PAC-learnable if and only if VCdim($\mathcal{H}$) is finite.
This theorem (which we'll prove in later pages) shows that shattering—and its quantification via VC dimension—is not just one way to measure complexity, but the fundamental measure that determines learnability.
While shattering is a theoretical concept, understanding its computational aspects helps bridge theory and practice.
Verifying Shattering:
Given a specific set $S$ and hypothesis class $\mathcal{H}$, how do we check if $\mathcal{H}$ shatters $S$?
Naive approach: Enumerate all $2^{|S|}$ labelings and verify each is realizable. This is exponential in $|S|$.
For structured $\mathcal{H}$: Often we can use geometric or algebraic arguments. For linear classifiers, this reduces to checking linear separability of various point subsets.
```python
import numpy as np
from itertools import product
from scipy.optimize import linprog


def can_shatter_linear(points: np.ndarray) -> bool:
    """
    Check if linear classifiers can shatter the given 2D points.
    Uses linear separability testing for each dichotomy.

    Args:
        points: Array of shape (n_points, 2)

    Returns:
        True if linear classifiers shatter this point set
    """
    n = len(points)
    # Generate all 2^n possible labelings
    for labels in product([-1, 1], repeat=n):
        labels = np.array(labels)
        # Check if this labeling is linearly separable
        if not is_linearly_separable(points, labels):
            return False
    return True


def is_linearly_separable(X: np.ndarray, y: np.ndarray) -> bool:
    """
    Check if points X with labels y are linearly separable,
    via a linear-programming feasibility problem.
    """
    n, d = X.shape
    # Formulate as LP: find w, b such that y_i (w·x_i + b) >= 1
    # Variables: [w_1, ..., w_d, b]
    # Rewrite each constraint as: -y_i * (w·x_i) - y_i * b <= -1
    A_ub = np.zeros((n, d + 1))
    for i in range(n):
        A_ub[i, :d] = -y[i] * X[i]
        A_ub[i, d] = -y[i]
    b_ub = -np.ones(n)
    c = np.zeros(d + 1)  # zero objective: we only need feasibility
    # Variables must be free (linprog defaults to non-negative bounds)
    bounds = [(None, None)] * (d + 1)

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method='highs')
    return result.success


# Example: Test shattering for 3 points
triangle = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.5, 1.0],
])
print(f"Can linear classifiers shatter 3 points in triangle? {can_shatter_linear(triangle)}")
# Output: True

# Example: Test with 4 points
square = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
])
print(f"Can linear classifiers shatter 4 points in square? {can_shatter_linear(square)}")
# Output: False
```

In general, computing the VC dimension of an arbitrary hypothesis class is computationally hard (coNP-complete for neural networks, for example). However, for well-structured classes like linear classifiers, kernel methods, and polynomial classifiers, the VC dimension can be computed or bounded analytically.
We've developed a deep understanding of shattering—the foundation for VC theory and generalization analysis.
What's Next:
With shattering firmly understood, we're ready to define the VC dimension formally—the largest set size a hypothesis class can shatter. This single number summarizes the expressive power of a hypothesis class and directly determines sample complexity and generalization bounds.
You now understand shattering—the core concept underlying VC theory. You can analyze whether a hypothesis class shatters a given point set, compute dichotomies, and understand why shattering ability determines a model's capacity to overfit. Next, we'll formalize the VC dimension and explore its properties.