Having established the concept of shattering, we now define the Vapnik-Chervonenkis (VC) dimension—a single number that captures the expressive power of an entire hypothesis class. The VC dimension is arguably the most important concept in statistical learning theory, providing the foundation for generalization bounds, sample complexity analysis, and principled model selection.
Named after Vladimir Vapnik and Alexey Chervonenkis, who introduced it in 1971, the VC dimension has become the cornerstone of computational learning theory. It answers a fundamental question: How complex is this class of models, in terms of its ability to fit arbitrary data?
By the end of this page, you will understand the formal definition of VC dimension, master techniques for proving upper and lower bounds, see how VC dimension relates to (and differs from) parameter counting, and develop intuition for why VC dimension is the 'right' measure of complexity for learning.
The VC dimension crystallizes the concept of shattering into a single numerical measure.
Definition (VC Dimension):
The Vapnik-Chervonenkis dimension of a hypothesis class $\mathcal{H}$, denoted $\text{VCdim}(\mathcal{H})$ or $d_{VC}(\mathcal{H})$, is the largest integer $d$ such that there exists a set of $d$ points that $\mathcal{H}$ can shatter.
$$\text{VCdim}(\mathcal{H}) = \max\{m : \Pi_\mathcal{H}(m) = 2^m\}$$
If $\mathcal{H}$ can shatter arbitrarily large sets, we say $\text{VCdim}(\mathcal{H}) = \infty$.
The VC dimension requires only EXISTENCE of a shattered set of that size. If VCdim(H) = d, it means:
• There EXISTS some set of d points that H shatters.
• For ALL sets of d+1 points, H cannot shatter them.
We do NOT require that H shatters EVERY set of d points—just at least one.
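For small finite classes over finite domains, this definition can be checked directly by brute force: look for *some* shattered set of each size. The sketch below is illustrative (the `shatters` and `vc_dimension` helpers are not from any library), using intervals on a grid as the example class.

```python
from itertools import combinations

def shatters(hypotheses, points):
    """True iff the class realizes all 2^|points| labelings on `points`."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    """Largest d such that SOME d-subset of the finite domain is shattered."""
    hypotheses = list(hypotheses)
    best = 0
    for d in range(1, len(domain) + 1):
        if any(shatters(hypotheses, S) for S in combinations(domain, d)):
            best = d
        else:
            break  # subsets of a shattered set are shattered, so we can stop
    return best

# Intervals [a, b] on a small grid: h(x) = +1 iff a <= x <= b
cuts = [0.5, 1.5, 2.5, 3.5, 4.5]
intervals = [(lambda x, a=a, b=b: 1 if a <= x <= b else -1)
             for a in cuts for b in cuts if a <= b]
print(vc_dimension(intervals, [1, 2, 3, 4]))  # 2: no interval realizes (+1, -1, +1)
```

Note that `vc_dimension` returns 2 even though not every 2-point set needs to be shattered; existence of one shattered pair suffices.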
Equivalent Characterizations:
The VC dimension can be characterized in several equivalent ways:
Maximum shattering: $\text{VCdim}(\mathcal{H}) = d$ iff $d = \max\{|S| : \mathcal{H} \text{ shatters } S\}$
Growth function threshold: $\text{VCdim}(\mathcal{H}) = d$ iff $\Pi_\mathcal{H}(m) = 2^m$ for all $m \leq d$ and $\Pi_\mathcal{H}(m) < 2^m$ for all $m > d$
Dichotomy counting: The VC dimension is the last point where the growth function equals its theoretical maximum
Formal Proof Strategy:
To prove $\text{VCdim}(\mathcal{H}) = d$, we must establish two things:
Lower bound (constructive): Exhibit a specific set $S$ of $d$ points that $\mathcal{H}$ shatters.
Upper bound (universal): Prove that no set of $d+1$ points can be shattered by $\mathcal{H}$.
Example: Compute VCdim($\mathcal{H}_+$), the class of threshold classifiers on the real line with $h_\theta(x) = +1$ iff $x \geq \theta$.

**Lower bound (VCdim ≥ 1):**
Take a single point $\{x_1\}$:
• Label +1: Use $\theta < x_1$
• Label -1: Use $\theta > x_1$
✓ A single point is shattered.
**Upper bound (VCdim ≤ 1):**
For any two points $x_1 < x_2$:
• Label $(-1, +1)$ works: Use $\theta \in (x_1, x_2)$
• Label $(+1, +1)$ works: Use $\theta < x_1$
• Label $(-1, -1)$ works: Use $\theta > x_2$
• Label $(+1, -1)$ fails: No $\theta$ gives $x_1 \geq \theta$ and $x_2 < \theta$
✗ No two points can be shattered.
**Conclusion: VCdim($\mathcal{H}_+$) = 1**

Threshold classifiers can only produce monotonic labelings, limiting their expressive power to VC dimension 1.
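The case analysis above can be confirmed numerically by sweeping the threshold over a grid and collecting the induced label patterns (a sketch; the grid of thresholds is arbitrary):

```python
def threshold(theta):
    # h_theta(x) = +1 iff x >= theta, the convention used above
    return lambda x: 1 if x >= theta else -1

hs = [threshold(t / 10) for t in range(-20, 21)]  # thresholds -2.0 .. 2.0

pat1 = {tuple(h(x) for x in [0.0]) for h in hs}
pat2 = {tuple(h(x) for x in [0.0, 1.0]) for h in hs}
print(sorted(pat1))  # [(-1,), (1,)]: a single point is shattered
print(sorted(pat2))  # only 3 of 4 patterns: (1, -1) never appears
```

The missing pattern `(1, -1)` is exactly the non-monotonic labeling ruled out in the upper-bound argument.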
The VC dimension satisfies several important properties that make it a useful complexity measure.
If VCdim(H) = d, then for all m ≥ d:
Π_H(m) ≤ (em/d)^d
This polynomial bound (compared to the naive 2^m) is what makes learning possible with polynomial sample complexity. The moment VC dimension is finite, the growth function becomes 'tame'.
Behavior of the Growth Function:
The Sauer-Shelah Lemma reveals a striking dichotomy: either $\mathcal{H}$ shatters sets of every size ($\Pi_\mathcal{H}(m) = 2^m$ for all $m$, infinite VC dimension), or the growth function is eventually polynomial of degree $d$.
Moreover, once $\Pi_\mathcal{H}(m) < 2^m$ for some $m$, this gap grows. The growth function transitions from exponential to polynomial behavior exactly at $m = d$.
Formal Statement (Sauer-Shelah Lemma):
For any hypothesis class $\mathcal{H}$ with $\text{VCdim}(\mathcal{H}) = d$:
$$\Pi_\mathcal{H}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$$
For $m \geq d$, this is bounded by $(em/d)^d$, which is polynomial in $m$ with degree $d$.
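The exponential-to-polynomial transition is easy to see numerically. This sketch compares $2^m$, the Sauer-Shelah sum, and the $(em/d)^d$ bound for $d = 3$:

```python
from math import comb, e

def sauer_bound(m, d):
    """Sauer-Shelah bound: sum of C(m, i) for i = 0..d."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3
print("m    2^m      Sauer    (em/d)^d")
for m in (3, 5, 10, 20):
    print(m, 2 ** m, sauer_bound(m, d), round((e * m / d) ** d, 1))
```

At $m = d$ the sum still equals $2^m$; beyond that it falls further and further behind, while staying below the polynomial envelope $(em/d)^d$.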
Computing the VC dimension requires proving both lower and upper bounds. Here are the main techniques used in practice.
Proving VCdim(H) ≥ d:
To establish a lower bound, we must construct a specific set of $d$ points that $\mathcal{H}$ shatters.
Strategy: choose a configuration of $d$ points whose geometry is as generic as possible, then, for each of the $2^d$ labelings, exhibit a hypothesis that realizes it.
Common Constructions:
For linear classifiers in $\mathbb{R}^d$: Use the standard basis vectors $\{e_1, e_2, \ldots, e_d\}$ plus the origin (or any $d+1$ points in general position).
For polynomial classifiers of degree $k$: Use points whose lifted representations (via polynomial features) are in general position.
For interval classifiers: Use well-separated points on the line.
Example: Lower bound for halfspaces in $\mathbb{R}^2$
We need to show VCdim(halfspaces in $\mathbb{R}^2$) ≥ 3.
Choose three non-collinear points: $(0,0), (1,0), (0,1)$.
For each of the 8 labelings, we can find a line that induces it: lines far from all three points give the two constant labelings, and lines separating one point from the other two give the six remaining labelings.
This establishes VCdim ≥ 3.
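The construction can be checked mechanically. Below, one hand-picked halfspace $(w_1, w_2, b)$ per labeling (illustrative values, not unique) confirms that all 8 dichotomies of the three points are realized:

```python
# Three non-collinear points and one hand-picked halfspace per labeling.
points = [(0, 0), (1, 0), (0, 1)]

# (w1, w2, b) with h(p) = +1 iff w1*px + w2*py + b >= 0
separators = [
    (0, 0, 1),      # -> (+1, +1, +1)
    (0, 0, -1),     # -> (-1, -1, -1)
    (-1, -1, 0.5),  # -> (+1, -1, -1)
    (1, 1, -0.5),   # -> (-1, +1, +1)
    (0, -1, 0.5),   # -> (+1, +1, -1)
    (0, 1, -0.5),   # -> (-1, -1, +1)
    (-1, 0, 0.5),   # -> (+1, -1, +1)
    (1, 0, -0.5),   # -> (-1, +1, -1)
]

patterns = {tuple(1 if w1 * x + w2 * y + b >= 0 else -1 for x, y in points)
            for w1, w2, b in separators}
print(len(patterns))  # 8: every dichotomy is induced, so the set is shattered
```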
VC dimension is NOT always equal to the number of parameters! The sinusoidal classifier h_ω(x) = sign(sin(ωx)) has ONE parameter but INFINITE VC dimension. Conversely, the class of classifiers with w ∈ {-1, 1}^d (d parameters, each constrained to ±1) has VCdim ≤ d, because the class is finite with 2^d hypotheses and a finite class always satisfies VCdim ≤ log₂|H|. Always compute VC dimension from first principles.
Boolean functions (mapping $\{0,1\}^n$ to $\{0,1\}$) are fundamental in learning theory. Understanding their VC dimension reveals deep connections to circuit complexity.
The Full Boolean Function Class:
Consider all Boolean functions on $n$ inputs. The input space has $|\mathcal{X}| = 2^n$ points, and each function assigns a label to each point. There are $2^{2^n}$ possible functions.
What is the VC dimension? Since any set of points can be shattered (we have all possible labelings as hypotheses), $\text{VCdim}(\mathcal{H}_{\text{all boolean}}) = 2^n$—the entire input space can be shattered.
| Class | Description | VC Dimension | Notes |
|---|---|---|---|
| All Boolean functions | Every possible mapping | 2^n | Exponential in n |
| Conjunctions | x₁ ∧ x₃ ∧ ¬x₅ | n | One per variable |
| Disjunctions | x₁ ∨ ¬x₂ ∨ x₄ | n | Dual to conjunctions |
| Monotone conjunctions | Only positive literals | n | Cannot shatter n+1 |
| k-CNF formulas | CNF with clauses of size ≤ k | O(n^k) | Polynomial in n |
| Decision lists | If-then-else chains | O(n log n) | Slightly superlinear |
| Decision trees (depth d) | Binary tree classifiers | O(2^d log n) | Exponential in depth |
Example: VC Dimension of Monotone Conjunctions
Monotone conjunctions are Boolean formulas using only AND and positive literals: $h(x_1, \ldots, x_n) = x_{i_1} \wedge x_{i_2} \wedge \ldots \wedge x_{i_k}$.
Lower bound (VCdim ≥ n):
Consider the $n$ points $v_1, \ldots, v_n$, where $v_j$ has every coordinate equal to 1 except coordinate $j$, which is 0.
For any subset $S \subseteq \{1,\ldots,n\}$, the conjunction $h_S = \bigwedge_{i \in \bar{S}} x_i$ outputs $+1$ on $v_j$ iff no conjoined variable is 0 at $v_j$, i.e., iff $j \notin \bar{S}$, which holds exactly when $j \in S$.
As $S$ ranges over all subsets, this achieves all $2^n$ labelings.
Upper bound (VCdim ≤ n):
Take any $n+1$ points. There are only $2^n$ monotone conjunctions (one per subset of variables), so at most $2^n < 2^{n+1}$ labelings are achievable, and no set of $n+1$ points can be shattered.
Conclusion: VCdim(monotone conjunctions) = n
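A brute-force check for small $n$, using the $n$ points that each have exactly one coordinate equal to 0 (a standard choice), confirms the lower bound:

```python
from itertools import product

n = 4
# v_j: every coordinate 1 except a single 0 in coordinate j
points = [tuple(0 if i == j else 1 for i in range(n)) for j in range(n)]

patterns = set()
for chosen in product([False, True], repeat=n):   # which variables to conjoin
    conj = lambda x: 1 if all(x[i] for i in range(n) if chosen[i]) else -1
    patterns.add(tuple(conj(p) for p in points))
print(len(patterns))  # 2**4 = 16: all labelings appear, the n points are shattered
```

Each subset of conjoined variables produces a distinct labeling: $v_j$ is positive exactly when variable $j$ is left out of the conjunction.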
When building complex models from simpler components, we need to understand how VC dimension composes.
Union of Hypothesis Classes:
Given $\mathcal{H}_1$ and $\mathcal{H}_2$, their union is $\mathcal{H}_1 \cup \mathcal{H}_2$.
$$\text{VCdim}(\mathcal{H}_1 \cup \mathcal{H}_2) \leq \text{VCdim}(\mathcal{H}_1) + \text{VCdim}(\mathcal{H}_2) + 1$$
More generally, for $k$ classes: $$\text{VCdim}\left(\bigcup_{i=1}^k \mathcal{H}_i\right) \leq \sum_{i=1}^k \text{VCdim}(\mathcal{H}_i) + O(k \log k)$$
Intersection and Complement:
For the complement class $\bar{\mathcal{H}} = \{\bar{h} : h \in \mathcal{H}\}$ where $\bar{h}(x) = -h(x)$: $$\text{VCdim}(\bar{\mathcal{H}}) = \text{VCdim}(\mathcal{H})$$
For intersection (points labeled +1 only if both component classifiers label +1): $$\text{VCdim}(\mathcal{H}_1 \cap \mathcal{H}_2) \leq O(\text{VCdim}(\mathcal{H}_1) \cdot \text{VCdim}(\mathcal{H}_2))$$
Feature Mappings:
If $\phi: \mathcal{X} \to \mathcal{Z}$ is a feature mapping and $\mathcal{H}$ operates on $\mathcal{Z}$, then for $\mathcal{H} \circ \phi$: $$\text{VCdim}(\mathcal{H} \circ \phi) \leq \text{VCdim}(\mathcal{H})$$
Equality holds when $\phi$ is surjective: every shattered set in $\mathcal{Z}$ then has a preimage in $\mathcal{X}$ that $\mathcal{H} \circ \phi$ shatters.
These composition rules let us analyze complex models by breaking them into simpler parts. An SVM with polynomial kernel can be analyzed as a linear classifier in feature space. A neural network can be bounded (loosely) by considering layer-by-layer composition. Ensemble methods combine multiple hypothesis classes.
Example: Compute VCdim($\mathcal{H}_{k\text{-unions}}$), the class of unions of at most $k$ intervals on the real line.

**Lower bound:** Place $2k$ distinct points on the line. In any labeling of these points, the positive points form at most $k$ contiguous runs, and each run can be covered by one interval while avoiding the negative points. All $2^{2k}$ labelings are therefore achievable.
**Upper bound:** Take any $2k+1$ points on the line. The alternating labeling $(+,-,+,\ldots,+)$ has $k+1$ separated positive runs, which cannot be covered by $k$ intervals without also covering a negative point. Hence no $2k+1$ points can be shattered.
**Result: VCdim = 2k**

The VC dimension grows linearly with the number of intervals, reflecting the additional expressive power from allowing multiple disconnected positive regions.
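A union of $k$ intervals can realize a labeling of collinear points exactly when the positive points form at most $k$ contiguous runs; checking that condition exhaustively for small $k$ confirms both bounds (sketch):

```python
from itertools import product

def positive_runs(labels):
    """Number of maximal contiguous blocks of +1 in a left-to-right labeling."""
    runs, prev = 0, -1
    for y in labels:
        if y == 1 and prev != 1:
            runs += 1
        prev = y
    return runs

def k_union_shatters(m, k):
    # m sorted points are shattered by unions of k intervals iff every
    # labeling has at most k positive runs (one interval per run)
    return all(positive_runs(y) <= k for y in product([1, -1], repeat=m))

k = 3
print(k_union_shatters(2 * k, k))      # True:  2k points are shattered
print(k_union_shatters(2 * k + 1, k))  # False: alternating labels need k+1 intervals
```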
A common misconception is that VC dimension equals or closely tracks the number of parameters in a model. This intuition is dangerously incomplete.
When Parameters ≈ VC Dimension:
For linear models $h_\mathbf{w}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x})$ in $\mathbb{R}^d$, we have $\text{VCdim} = d$ (and $d + 1$ if a bias term is included)—the VC dimension matches the parameter count.
This correspondence arises because linear classifiers have a tight relationship between their parameterization and their expressiveness over finite sets.
When Parameters << VC Dimension:
The sinusoidal classifier $h_\omega(x) = \text{sign}(\sin(\omega x))$ has a single parameter $\omega$, yet $\text{VCdim} = \infty$.
Why? As $\omega \to \infty$, the function oscillates increasingly fast. For any finite set of points, there exists $\omega$ such that the function takes any desired sign pattern.
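One common construction makes this concrete: on the points $x_i = 2^{-i}$, encoding the desired signs in the binary expansion of an integer $N$ and taking $\omega = \pi(N + 0.5)$ realizes any sign pattern, because the sign of $\sin(\pi t)$ depends on the parity of $\lfloor t \rfloor$. A sketch verifying all $2^{10}$ patterns on 10 points:

```python
import math

def omega_for_labels(labels):
    """Pick one frequency realizing the target signs at x_i = 2^-(i+1):
    encode label -1 at position i as bit (i+1) of an integer N."""
    N = sum(2 ** (i + 1) for i, y in enumerate(labels) if y == -1)
    return math.pi * (N + 0.5)

m = 10
points = [2.0 ** -(i + 1) for i in range(m)]

ok = True
for bits in range(2 ** m):
    labels = [-1 if (bits >> i) & 1 else 1 for i in range(m)]
    w = omega_for_labels(labels)
    realized = [1 if math.sin(w * x) > 0 else -1 for x in points]
    ok = ok and (realized == labels)
print(ok)  # True: all 2^10 sign patterns realized, so 10 points are shattered
```

Since $m$ was arbitrary, no finite bound on the VC dimension is possible, despite the single parameter.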
When Parameters >> VC Dimension:
Consider a finite hypothesis class with $N$ hypotheses. Each hypothesis might be specified by many parameters, but: $$\text{VCdim}(\mathcal{H}) \leq \lfloor \log_2 N \rfloor$$
Example: If we have 1 million hypotheses (perhaps specified by 100 parameters each, but constrained to a finite grid), the VC dimension is at most $\log_2(10^6) \approx 20$.
The Right Intuition:
The VC dimension measures the effective degrees of freedom for fitting binary patterns, not the number of adjustable knobs in a model. A parameter that can only be adjusted in limited ways contributes less to VC dimension than one with full freedom.
Deep neural networks have millions of parameters but don't overfit as severely as classical theory predicts. This 'double descent' phenomenon suggests that VC dimension alone doesn't tell the whole story. Modern research explores alternative complexity measures (Rademacher complexity, PAC-Bayes bounds) that better explain deep learning generalization.
The VC dimension applies directly only to binary classifiers. For real-valued functions (regression), we extend the concept via the pseudo-dimension.
Definition (Pseudo-Dimension):
Let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$. The pseudo-dimension $\text{Pdim}(\mathcal{F})$ is the VC dimension of the 'thresholded' class:
$$\mathcal{H}_\mathcal{F} = \{h_{f,t} : f \in \mathcal{F}, t \in \mathbb{R}\}$$
where $h_{f,t}(x) = \mathbf{1}[f(x) \geq t]$.
Intuitively, we convert real-valued predictions to binary by thresholding, and measure the VC dimension of all possible threshold-function pairs.
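For intuition, the pseudo-dimension of affine functions $f(x) = ax + b$ on $\mathbb{R}$ can be brute-forced on a small grid via the thresholded class (an illustrative sketch; the grid and domain are arbitrary choices):

```python
from itertools import combinations

def pseudo_dim_affine_1d():
    """Brute-force Pdim of f(x) = a*x + b on a small grid, via the
    thresholded class h(x) = 1[a*x + b >= t]; c below stands for b - t."""
    grid = [i / 2 for i in range(-8, 9)]
    domain = [0.0, 1.0, 2.0, 3.0]

    def shattered(S):
        pats = {tuple(1 if a * x + c >= 0 else 0 for x in S)
                for a in grid for c in grid}
        return len(pats) == 2 ** len(S)

    return max(k for k in range(1, len(domain) + 1)
               if any(shattered(S) for S in combinations(domain, k)))

print(pseudo_dim_affine_1d())  # 2: matches Pdim = d + 1 for linear functions, d = 1
```

Three collinear points can never be thresholded into the pattern $(1, 0, 1)$ by an affine function, which is why the search stops at 2.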
Pseudo-Dimension of Common Function Classes:
| Function Class | Pseudo-Dimension |
|---|---|
| Linear functions in $\mathbb{R}^d$ | $d + 1$ |
| Polynomials of degree $k$ in one variable | $k + 1$ |
| Linear combinations of $n$ basis functions | $n$ |
| Single-layer neural networks with $m$ hidden units | $O(m \cdot d)$ |
| Lipschitz functions | $\infty$ (without additional constraints) |
Relationship to VC Dimension:
For binary classifiers (outputs in $\{-1, +1\}$), the pseudo-dimension equals the VC dimension (thresholding doesn't add expressiveness).
For regression, pseudo-dimension gives generalization bounds for squared loss and other Lipschitz losses:
$$\text{Generalization error} \leq O\left(\sqrt{\frac{\text{Pdim}(\mathcal{F})}{m}}\right)$$
For margin-based learning and more refined regression analysis, the fat-shattering dimension at scale γ measures complexity relative to a margin: how many points can be labeled with predictions separated by at least γ? This gives tighter bounds for problems with built-in margins.
The VC dimension provides the theoretical foundation for understanding when and why machine learning works.
What's Next:
With the formal definition established, we'll compute the VC dimension of linear classifiers—the workhorse of machine learning. This concrete example will solidify your understanding and provide the foundation for analyzing more complex models.
You now command the formal definition of VC dimension and the techniques to compute it. You understand the Sauer-Shelah lemma, the relationship (and non-relationship) between parameters and VC dimension, and how VC concepts extend to regression. Next, we apply these tools to linear classifiers.