Having established the concept of shattering, we now define the Vapnik-Chervonenkis (VC) dimension—a single number that captures the expressive power of an entire hypothesis class. The VC dimension is arguably the most important concept in statistical learning theory, providing the foundation for generalization bounds, sample complexity analysis, and principled model selection.
Named after Vladimir Vapnik and Alexey Chervonenkis, who introduced it in 1971, the VC dimension has become the cornerstone of computational learning theory. It answers a fundamental question: How complex is this class of models, in terms of its ability to fit arbitrary data?
By the end of this page, you will understand the formal definition of VC dimension, master techniques for proving upper and lower bounds, see how VC dimension relates to (and differs from) parameter counting, and develop intuition for why VC dimension is the 'right' measure of complexity for learning.
The VC dimension crystallizes the concept of shattering into a single numerical measure.
Definition (VC Dimension):
The Vapnik-Chervonenkis dimension of a hypothesis class $\mathcal{H}$, denoted $\text{VCdim}(\mathcal{H})$ or $d_{VC}(\mathcal{H})$, is the largest integer $d$ such that there exists a set of $d$ points that $\mathcal{H}$ can shatter.
$$\text{VCdim}(\mathcal{H}) = \max\{m : \Pi_\mathcal{H}(m) = 2^m\}$$
If $\mathcal{H}$ can shatter arbitrarily large sets, we say $\text{VCdim}(\mathcal{H}) = \infty$.
The VC dimension requires only EXISTENCE of a shattered set of that size. If VCdim(H) = d, it means:
• There EXISTS some set of d points that H shatters.
• For ALL sets of d+1 points, H cannot shatter them.
We do NOT require that H shatters EVERY set of d points—just at least one.
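For small finite classes over finite domains, this definition can be checked directly by brute force: look for *some* shattered set of each size. The sketch below is illustrative (the `shatters` and `vc_dimension` helpers are not from any library), using intervals on a grid as the example class.

```python
from itertools import combinations

def shatters(hypotheses, points):
    """True iff the class realizes all 2^|points| labelings on `points`."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    """Largest d such that SOME d-subset of the finite domain is shattered."""
    hypotheses = list(hypotheses)
    best = 0
    for d in range(1, len(domain) + 1):
        if any(shatters(hypotheses, S) for S in combinations(domain, d)):
            best = d
        else:
            break  # subsets of a shattered set are shattered, so we can stop
    return best

# Intervals [a, b] on a small grid: h(x) = +1 iff a <= x <= b
cuts = [0.5, 1.5, 2.5, 3.5, 4.5]
intervals = [(lambda x, a=a, b=b: 1 if a <= x <= b else -1)
             for a in cuts for b in cuts if a <= b]
print(vc_dimension(intervals, [1, 2, 3, 4]))  # 2: no interval realizes (+1, -1, +1)
```

Note that `vc_dimension` returns 2 even though not every 2-point set needs to be shattered; existence of one shattered pair suffices.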
Equivalent Characterizations:
The VC dimension can be characterized in several equivalent ways:
Maximum shattering: $\text{VCdim}(\mathcal{H}) = d$ iff $d = \max\{|S| : \mathcal{H} \text{ shatters } S\}$
Growth function threshold: $\text{VCdim}(\mathcal{H}) = d$ iff $\Pi_\mathcal{H}(m) = 2^m$ for all $m \leq d$ and $\Pi_\mathcal{H}(m) < 2^m$ for all $m > d$
Dichotomy counting: The VC dimension is the last point where the growth function equals its theoretical maximum
Formal Proof Strategy:
To prove $\text{VCdim}(\mathcal{H}) = d$, we must establish two things:
Lower bound (constructive): Exhibit a specific set $S$ of $d$ points that $\mathcal{H}$ shatters.
Upper bound (universal): Prove that no set of $d+1$ points can be shattered by $\mathcal{H}$.
Example: Compute VCdim($\mathcal{H}_+$), the class of threshold classifiers on the real line with $h_\theta(x) = +1$ iff $x \geq \theta$.

**Lower bound (VCdim ≥ 1):**
Take a single point $\{x_1\}$:
• Label +1: Use $\theta < x_1$
• Label -1: Use $\theta > x_1$
✓ A single point is shattered.
**Upper bound (VCdim ≤ 1):**
For any two points $x_1 < x_2$:
• Label $(-1, +1)$ works: Use $\theta \in (x_1, x_2)$
• Label $(+1, +1)$ works: Use $\theta < x_1$
• Label $(-1, -1)$ works: Use $\theta > x_2$
• Label $(+1, -1)$ fails: No $\theta$ gives $x_1 \geq \theta$ and $x_2 < \theta$
✗ No two points can be shattered.
**Conclusion: VCdim($\mathcal{H}_+$) = 1**

Threshold classifiers can only produce monotonic labelings, limiting their expressive power to VC dimension 1.
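The case analysis above can be confirmed numerically by sweeping the threshold over a grid and collecting the induced label patterns (a sketch; the grid of thresholds is arbitrary):

```python
def threshold(theta):
    # h_theta(x) = +1 iff x >= theta, the convention used above
    return lambda x: 1 if x >= theta else -1

hs = [threshold(t / 10) for t in range(-20, 21)]  # thresholds -2.0 .. 2.0

pat1 = {tuple(h(x) for x in [0.0]) for h in hs}
pat2 = {tuple(h(x) for x in [0.0, 1.0]) for h in hs}
print(sorted(pat1))  # [(-1,), (1,)]: a single point is shattered
print(sorted(pat2))  # only 3 of 4 patterns: (1, -1) never appears
```

The missing pattern `(1, -1)` is exactly the non-monotonic labeling ruled out in the upper-bound argument.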
The VC dimension satisfies several important properties that make it a useful complexity measure.
If VCdim(H) = d, then for all m ≥ d:
Π_H(m) ≤ (em/d)^d
This polynomial bound (compared to the naive 2^m) is what makes learning possible with polynomial sample complexity. The moment VC dimension is finite, the growth function becomes 'tame'.
Behavior of the Growth Function:
The Sauer-Shelah Lemma reveals a striking dichotomy: either $\mathcal{H}$ shatters sets of every size ($\Pi_\mathcal{H}(m) = 2^m$ for all $m$, infinite VC dimension), or the growth function is eventually polynomial of degree $d$.
Moreover, once $\Pi_\mathcal{H}(m) < 2^m$ for some $m$, this gap grows. The growth function transitions from exponential to polynomial behavior exactly at $m = d$.
Formal Statement (Sauer-Shelah Lemma):
For any hypothesis class $\mathcal{H}$ with $\text{VCdim}(\mathcal{H}) = d$:
$$\Pi_\mathcal{H}(m) \leq \sum_{i=0}^{d} \binom{m}{i}$$
For $m \geq d$, this is bounded by $(em/d)^d$, which is polynomial in $m$ with degree $d$.
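The exponential-to-polynomial transition is easy to see numerically. This sketch compares $2^m$, the Sauer-Shelah sum, and the $(em/d)^d$ bound for $d = 3$:

```python
from math import comb, e

def sauer_bound(m, d):
    """Sauer-Shelah bound: sum of C(m, i) for i = 0..d."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3
print("m    2^m      Sauer    (em/d)^d")
for m in (3, 5, 10, 20):
    print(m, 2 ** m, sauer_bound(m, d), round((e * m / d) ** d, 1))
```

At $m = d$ the sum still equals $2^m$; beyond that it falls further and further behind, while staying below the polynomial envelope $(em/d)^d$.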
Computing the VC dimension requires proving both lower and upper bounds. Here are the main techniques used in practice.
Proving VCdim(H) ≥ d:
To establish a lower bound, we must construct a specific set of $d$ points that $\mathcal{H}$ shatters.
Strategy: choose a configuration of $d$ points whose geometry is as generic as possible, then, for each of the $2^d$ labelings, exhibit a hypothesis that realizes it.
Common Constructions:
For linear classifiers in $\mathbb{R}^d$: Use the standard basis vectors $\{e_1, e_2, \ldots, e_d\}$ plus the origin (or any $d+1$ points in general position).
For polynomial classifiers of degree $k$: Use points whose lifted representations (via polynomial features) are in general position.
For interval classifiers: Use well-separated points on the line.
Example: Lower bound for halfspaces in $\mathbb{R}^2$
We need to show VCdim(halfspaces in $\mathbb{R}^2$) ≥ 3.
Choose three non-collinear points: $(0,0), (1,0), (0,1)$.
For each of the 8 labelings, we can find a line that induces it: lines far from all three points give the two constant labelings, and lines separating one point from the other two give the six remaining labelings.
This establishes VCdim ≥ 3.
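The construction can be checked mechanically. Below, one hand-picked halfspace $(w_1, w_2, b)$ per labeling (illustrative values, not unique) confirms that all 8 dichotomies of the three points are realized:

```python
# Three non-collinear points and one hand-picked halfspace per labeling.
points = [(0, 0), (1, 0), (0, 1)]

# (w1, w2, b) with h(p) = +1 iff w1*px + w2*py + b >= 0
separators = [
    (0, 0, 1),      # -> (+1, +1, +1)
    (0, 0, -1),     # -> (-1, -1, -1)
    (-1, -1, 0.5),  # -> (+1, -1, -1)
    (1, 1, -0.5),   # -> (-1, +1, +1)
    (0, -1, 0.5),   # -> (+1, +1, -1)
    (0, 1, -0.5),   # -> (-1, -1, +1)
    (-1, 0, 0.5),   # -> (+1, -1, +1)
    (1, 0, -0.5),   # -> (-1, +1, -1)
]

patterns = {tuple(1 if w1 * x + w2 * y + b >= 0 else -1 for x, y in points)
            for w1, w2, b in separators}
print(len(patterns))  # 8: every dichotomy is induced, so the set is shattered
```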
VC dimension is NOT always equal to the number of parameters! The sinusoidal classifier h_ω(x) = sign(sin(ωx)) has ONE parameter but INFINITE VC dimension. Conversely, the class of classifiers with w ∈ {-1, 1}^d (d parameters, each constrained to ±1) has VCdim ≤ d, because the class is finite with 2^d hypotheses and a finite class always satisfies VCdim ≤ log₂|H|. Always compute VC dimension from first principles.
Boolean functions (mapping $\{0,1\}^n$ to $\{0,1\}$) are fundamental in learning theory. Understanding their VC dimension reveals deep connections to circuit complexity.
The Full Boolean Function Class:
Consider all Boolean functions on $n$ inputs. The input space has $|\mathcal{X}| = 2^n$ points, and each function assigns a label to each point. There are $2^{2^n}$ possible functions.
What is the VC dimension? Since any set of points can be shattered (we have all possible labelings as hypotheses), $\text{VCdim}(\mathcal{H}_{\text{all boolean}}) = 2^n$—the entire input space can be shattered.
| Class | Description | VC Dimension | Notes |
|---|---|---|---|
| All Boolean functions | Every possible mapping | 2^n | Exponential in n |
| Conjunctions | x₁ ∧ x₃ ∧ ¬x₅ | n | One per variable |
| Disjunctions | x₁ ∨ ¬x₂ ∨ x₄ | n | Dual to conjunctions |
| Monotone conjunctions | Only positive literals | n | Cannot shatter n+1 |
| k-CNF formulas | CNF with clauses of size ≤ k | O(n^k) | Polynomial in n |
| Decision lists | If-then-else chains | O(n log n) | Slightly superlinear |
| Decision trees (depth d) | Binary tree classifiers | O(2^d log n) | Exponential in depth |
Example: VC Dimension of Monotone Conjunctions
Monotone conjunctions are Boolean formulas using only AND and positive literals: $h(x_1, \ldots, x_n) = x_{i_1} \wedge x_{i_2} \wedge \ldots \wedge x_{i_k}$.
Lower bound (VCdim ≥ n):
Consider the $n$ points $v_1, \ldots, v_n$, where $v_j$ has every coordinate equal to 1 except coordinate $j$, which is 0.
For any subset $S \subseteq \{1,\ldots,n\}$, the conjunction $h_S = \bigwedge_{i \in \bar{S}} x_i$ outputs $+1$ on $v_j$ iff no conjoined variable is 0 at $v_j$, i.e., iff $j \notin \bar{S}$, which holds exactly when $j \in S$.
As $S$ ranges over all subsets, this achieves all $2^n$ labelings.
Upper bound (VCdim ≤ n):
Take any $n+1$ points. There are only $2^n$ monotone conjunctions (one per subset of variables), so at most $2^n < 2^{n+1}$ labelings are achievable, and no set of $n+1$ points can be shattered.
Conclusion: VCdim(monotone conjunctions) = n
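A brute-force check for small $n$, using the $n$ points that each have exactly one coordinate equal to 0 (a standard choice), confirms the lower bound:

```python
from itertools import product

n = 4
# v_j: every coordinate 1 except a single 0 in coordinate j
points = [tuple(0 if i == j else 1 for i in range(n)) for j in range(n)]

patterns = set()
for chosen in product([False, True], repeat=n):   # which variables to conjoin
    conj = lambda x: 1 if all(x[i] for i in range(n) if chosen[i]) else -1
    patterns.add(tuple(conj(p) for p in points))
print(len(patterns))  # 2**4 = 16: all labelings appear, the n points are shattered
```

Each subset of conjoined variables produces a distinct labeling: $v_j$ is positive exactly when variable $j$ is left out of the conjunction.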
When building complex models from simpler components, we need to understand how VC dimension composes.
Union of Hypothesis Classes:
Given $\mathcal{H}_1$ and $\mathcal{H}_2$, their union is $\mathcal{H}_1 \cup \mathcal{H}_2$.
$$\text{VCdim}(\mathcal{H}_1 \cup \mathcal{H}_2) \leq \text{VCdim}(\mathcal{H}_1) + \text{VCdim}(\mathcal{H}_2) + 1$$
More generally, for $k$ classes: $$\text{VCdim}\left(\bigcup_{i=1}^k \mathcal{H}_i\right) \leq \sum_{i=1}^k \text{VCdim}(\mathcal{H}_i) + O(k \log k)$$
Intersection and Complement:
For the complement class $\bar{\mathcal{H}} = \{\bar{h} : h \in \mathcal{H}\}$ where $\bar{h}(x) = -h(x)$: $$\text{VCdim}(\bar{\mathcal{H}}) = \text{VCdim}(\mathcal{H})$$
For intersection (points labeled +1 only if both component classifiers label +1): $$\text{VCdim}(\mathcal{H}_1 \cap \mathcal{H}_2) \leq O(\text{VCdim}(\mathcal{H}_1) \cdot \text{VCdim}(\mathcal{H}_2))$$
Feature Mappings:
If $\phi: \mathcal{X} \to \mathcal{Z}$ is a feature mapping and $\mathcal{H}$ operates on $\mathcal{Z}$, then for $\mathcal{H} \circ \phi$: $$\text{VCdim}(\mathcal{H} \circ \phi) \leq \text{VCdim}(\mathcal{H})$$
Equality holds when $\phi$ is surjective: every shattered set in $\mathcal{Z}$ then has a preimage in $\mathcal{X}$ that $\mathcal{H} \circ \phi$ shatters.
These composition rules let us analyze complex models by breaking them into simpler parts. An SVM with polynomial kernel can be analyzed as a linear classifier in feature space. A neural network can be bounded (loosely) by considering layer-by-layer composition. Ensemble methods combine multiple hypothesis classes.
Example: Compute VCdim($\mathcal{H}_{k\text{-unions}}$), the class of unions of at most $k$ intervals on the real line.

**Lower bound:** Place $2k$ distinct points on the line. In any labeling of these points, the positive points form at most $k$ contiguous runs, and each run can be covered by one interval while avoiding the negative points. All $2^{2k}$ labelings are therefore achievable.
**Upper bound:** Take any $2k+1$ points on the line. The alternating labeling $(+,-,+,\ldots,+)$ has $k+1$ separated positive runs, which cannot be covered by $k$ intervals without also covering a negative point. Hence no $2k+1$ points can be shattered.
**Result: VCdim = 2k**

The VC dimension grows linearly with the number of intervals, reflecting the additional expressive power from allowing multiple disconnected positive regions.
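A union of $k$ intervals can realize a labeling of collinear points exactly when the positive points form at most $k$ contiguous runs; checking that condition exhaustively for small $k$ confirms both bounds (sketch):

```python
from itertools import product

def positive_runs(labels):
    """Number of maximal contiguous blocks of +1 in a left-to-right labeling."""
    runs, prev = 0, -1
    for y in labels:
        if y == 1 and prev != 1:
            runs += 1
        prev = y
    return runs

def k_union_shatters(m, k):
    # m sorted points are shattered by unions of k intervals iff every
    # labeling has at most k positive runs (one interval per run)
    return all(positive_runs(y) <= k for y in product([1, -1], repeat=m))

k = 3
print(k_union_shatters(2 * k, k))      # True:  2k points are shattered
print(k_union_shatters(2 * k + 1, k))  # False: alternating labels need k+1 intervals
```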
A common misconception is that VC dimension equals or closely tracks the number of parameters in a model. This intuition is dangerously incomplete.
When Parameters ≈ VC Dimension:
For linear models $h_\mathbf{w}(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x})$ in $\mathbb{R}^d$, we have $\text{VCdim} = d$ (and $d + 1$ if a bias term is included)—the VC dimension matches the parameter count.
This correspondence arises because linear classifiers have a tight relationship between their parameterization and their expressiveness over finite sets.
When Parameters << VC Dimension:
The sinusoidal classifier $h_\omega(x) = \text{sign}(\sin(\omega x))$ has a single parameter $\omega$, yet $\text{VCdim} = \infty$.
Why? As $\omega \to \infty$, the function oscillates increasingly fast. For any finite set of points, there exists $\omega$ such that the function takes any desired sign pattern.
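One common construction makes this concrete: on the points $x_i = 2^{-i}$, encoding the desired signs in the binary expansion of an integer $N$ and taking $\omega = \pi(N + 0.5)$ realizes any sign pattern, because the sign of $\sin(\pi t)$ depends on the parity of $\lfloor t \rfloor$. A sketch verifying all $2^{10}$ patterns on 10 points:

```python
import math

def omega_for_labels(labels):
    """Pick one frequency realizing the target signs at x_i = 2^-(i+1):
    encode label -1 at position i as bit (i+1) of an integer N."""
    N = sum(2 ** (i + 1) for i, y in enumerate(labels) if y == -1)
    return math.pi * (N + 0.5)

m = 10
points = [2.0 ** -(i + 1) for i in range(m)]

ok = True
for bits in range(2 ** m):
    labels = [-1 if (bits >> i) & 1 else 1 for i in range(m)]
    w = omega_for_labels(labels)
    realized = [1 if math.sin(w * x) > 0 else -1 for x in points]
    ok = ok and (realized == labels)
print(ok)  # True: all 2^10 sign patterns realized, so 10 points are shattered
```

Since $m$ was arbitrary, no finite bound on the VC dimension is possible, despite the single parameter.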
When Parameters >> VC Dimension:
Consider a finite hypothesis class with $N$ hypotheses. Each hypothesis might be specified by many parameters, but: $$\text{VCdim}(\mathcal{H}) \leq \lfloor \log_2 N \rfloor$$
Example: If we have 1 million hypotheses (perhaps specified by 100 parameters each, but constrained to a finite grid), the VC dimension is at most $\log_2(10^6) \approx 20$.
The Right Intuition:
The VC dimension measures the effective degrees of freedom for fitting binary patterns, not the number of adjustable knobs in a model. A parameter that can only be adjusted in limited ways contributes less to VC dimension than one with full freedom.
Deep neural networks have millions of parameters but don't overfit as severely as classical theory predicts. This 'double descent' phenomenon suggests that VC dimension alone doesn't tell the whole story. Modern research explores alternative complexity measures (Rademacher complexity, PAC-Bayes bounds) that better explain deep learning generalization.
The VC dimension applies directly only to binary classifiers. For real-valued functions (regression), we extend the concept via the pseudo-dimension.
Definition (Pseudo-Dimension):
Let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$. The pseudo-dimension $\text{Pdim}(\mathcal{F})$ is the VC dimension of the 'thresholded' class:
$$\mathcal{H}_\mathcal{F} = \{h_{f,t} : f \in \mathcal{F}, t \in \mathbb{R}\}$$
where $h_{f,t}(x) = \mathbf{1}[f(x) \geq t]$.
Intuitively, we convert real-valued predictions to binary by thresholding, and measure the VC dimension of all possible threshold-function pairs.
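For intuition, the pseudo-dimension of affine functions $f(x) = ax + b$ on $\mathbb{R}$ can be brute-forced on a small grid via the thresholded class (an illustrative sketch; the grid and domain are arbitrary choices):

```python
from itertools import combinations

def pseudo_dim_affine_1d():
    """Brute-force Pdim of f(x) = a*x + b on a small grid, via the
    thresholded class h(x) = 1[a*x + b >= t]; c below stands for b - t."""
    grid = [i / 2 for i in range(-8, 9)]
    domain = [0.0, 1.0, 2.0, 3.0]

    def shattered(S):
        pats = {tuple(1 if a * x + c >= 0 else 0 for x in S)
                for a in grid for c in grid}
        return len(pats) == 2 ** len(S)

    return max(k for k in range(1, len(domain) + 1)
               if any(shattered(S) for S in combinations(domain, k)))

print(pseudo_dim_affine_1d())  # 2: matches Pdim = d + 1 for linear functions, d = 1
```

Three collinear points can never be thresholded into the pattern $(1, 0, 1)$ by an affine function, which is why the search stops at 2.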
Pseudo-Dimension of Common Function Classes:
| Function Class | Pseudo-Dimension |
|---|---|
| Linear functions in $\mathbb{R}^d$ | $d + 1$ |
| Polynomials of degree $k$ in one variable | $k + 1$ |
| Linear combinations of $n$ basis functions | $n$ |
| Single-layer neural networks with $m$ hidden units | $O(m \cdot d)$ |
| Lipschitz functions | $\infty$ (without additional constraints) |
Relationship to VC Dimension:
For binary classifiers (outputs in $\{-1, +1\}$), the pseudo-dimension equals the VC dimension (thresholding doesn't add expressiveness).
For regression, pseudo-dimension gives generalization bounds for squared loss and other Lipschitz losses:
$$\text{Generalization error} \leq O\left(\sqrt{\frac{\text{Pdim}(\mathcal{F})}{m}}\right)$$
For margin-based learning and more refined regression analysis, the fat-shattering dimension at scale γ measures complexity relative to a margin: how many points can be labeled with predictions separated by at least γ? This gives tighter bounds for problems with built-in margins.
The VC dimension provides the theoretical foundation for understanding when and why machine learning works.
What's Next:
With the formal definition established, we'll compute the VC dimension of linear classifiers—the workhorse of machine learning. This concrete example will solidify your understanding and provide the foundation for analyzing more complex models.
You now command the formal definition of VC dimension and the techniques to compute it. You understand the Sauer-Shelah lemma, the relationship (and non-relationship) between parameters and VC dimension, and how VC concepts extend to regression. Next, we apply these tools to linear classifiers.