At the heart of machine learning lies a fundamental question: How does the complexity of a model relate to its ability to generalize? This question has occupied statisticians and computer scientists for decades, and its answer forms the theoretical foundation upon which all of modern deep learning rests.
When we train a neural network, we're essentially searching through a vast space of possible functions to find one that maps our inputs to our desired outputs. The model capacity—also called the hypothesis class complexity or model expressiveness—determines the size and shape of this function space. A model with high capacity can represent more complex functions; a model with low capacity is restricted to simpler ones.
Understanding capacity is not merely academic. It directly impacts every practical decision we make: how many layers to use, how many neurons per layer, whether to add regularization, and how much data we need. Without a principled understanding of capacity, we're navigating in the dark, relying entirely on trial-and-error experimentation.
By the end of this page, you will understand multiple rigorous definitions of model capacity (VC dimension, Rademacher complexity, and more), the relationship between capacity and generalization, and the profound implications these concepts have for designing deep learning systems.
Before diving into formal definitions, let's build intuition for what capacity means. Consider the problem of fitting a curve to a set of data points.
The Polynomial Fitting Example:
Suppose we have 10 data points in 2D space and want to fit a polynomial curve. We must choose the degree of the polynomial: a line (degree 1), a quadratic (degree 2), or something far more flexible, such as a degree-9 polynomial.
The degree of the polynomial is a form of capacity. Higher degree means higher capacity—the model can fit more complex patterns.
```python
import numpy as np

# Generate noisy data from a true quadratic function
np.random.seed(42)
X = np.linspace(0, 1, 10)
y_true = 0.5 + 2*X - 3*X**2  # True underlying function
y_noisy = y_true + np.random.normal(0, 0.1, 10)  # Add noise

# Fit polynomials of different degrees (capacities)
degrees = [1, 2, 9]
X_dense = np.linspace(0, 1, 100)

for degree in degrees:
    coeffs = np.polyfit(X, y_noisy, degree)
    y_fit = np.polyval(coeffs, X_dense)
    print(f"Degree {degree} polynomial:")
    print(f"  Number of parameters: {degree + 1}")
    print(f"  Training error (MSE): {np.mean((np.polyval(coeffs, X) - y_noisy)**2):.6f}")

# Key insight: the degree-9 polynomial has ZERO training error
# but will generalize poorly to new points
```

The Capacity-Generalization Tradeoff:
This example illustrates a fundamental tension:
Too little capacity (underfitting): The model cannot capture the true underlying pattern. The linear fit misses the curvature entirely.
Too much capacity (overfitting): The model fits the noise rather than the signal. The degree-9 polynomial passes through every noisy point, memorizing the training data rather than learning the underlying relationship.
Just right (good generalization): The quadratic model matches the true data-generating process. It captures the essential pattern without fitting the noise.
This is sometimes called the bias-variance tradeoff, which we'll explore in depth. For now, recognize that capacity is not simply 'more is better'—at least in classical statistical learning theory.
Classical theory suggests high-capacity models should overfit. Yet modern deep networks with billions of parameters often generalize well. This apparent contradiction—which we'll resolve in later pages—is one of the most important open questions in machine learning theory.
Computer scientists and statisticians have developed several rigorous ways to quantify model capacity. Each captures different aspects of what it means for a model to be 'complex' or 'expressive'.
We begin with the hypothesis class, denoted $\mathcal{H}$. This is the set of all functions our model can represent. For a neural network, $\mathcal{H}$ depends on the architecture (number of layers, neurons per layer, activation functions) and is implicitly defined by the parameter space.
For example, a linear model on $\mathbb{R}^d$ has the hypothesis class $\mathcal{H} = \{x \mapsto w^\top x + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$, while a fixed ReLU architecture defines the much richer class of all piecewise-linear functions reachable by some setting of its weights.
The 'capacity' of a model is essentially a measure of how 'large' or 'rich' this hypothesis class is.
A naive approach might count parameters: more parameters = higher capacity. But this is inadequate. A ReLU network with 1000 parameters can represent vastly more complex functions than a degree-999 polynomial (which also has ~1000 coefficients). The architecture matters, not just parameter count.
The Vapnik-Chervonenkis (VC) dimension is the most famous measure of capacity, developed in the 1970s. It quantifies capacity through the lens of shattering.
Definition (Shattering): A hypothesis class $\mathcal{H}$ shatters a set of points $\{x_1, \ldots, x_n\}$ if, for every possible binary labeling of these points, there exists some $h \in \mathcal{H}$ that achieves that labeling.
Definition (VC Dimension): The VC dimension of $\mathcal{H}$, denoted $\text{VC}(\mathcal{H})$, is the largest number $n$ such that there exists a set of $n$ points that $\mathcal{H}$ can shatter.
Intuitively, VC dimension measures the largest dataset the model can perfectly memorize for any arbitrary labeling.
"""VC Dimension Examples Understanding VC dimension through concrete examples.""" # Example 1: Linear classifiers in 2D (lines)# # Consider the hypothesis class of all lines in 2D:# H = {h(x) = sign(w1*x1 + w2*x2 + b) : w1, w2, b ∈ ℝ}## Can shatter 3 points? YES# - For any 3 non-collinear points, any binary labeling # (2^3 = 8 labelings) can be achieved by some line.## Can shatter 4 points? NO# - Consider 4 points in convex position (square vertices)# - Label diagonal points the same (XOR pattern)# - No line can separate this labeling## Therefore: VC(linear classifiers in 2D) = 3 # Example 2: Linear classifiers in d dimensions# # General result: VC(linear classifiers in ℝ^d) = d + 1## This makes intuitive sense:# - d=1 (threshold on a line): VC = 2# - d=2 (lines in plane): VC = 3# - d=10 (hyperplanes in ℝ^10): VC = 11 # Example 3: Neural Networks## For a ReLU network with W weights:# - Lower bound: VC ≥ Ω(W log W) for deep networks# - Upper bound: VC ≤ O(W^2) in general## Key insight: ReLU networks have much higher VC dimension# than their parameter count alone would suggest! def demonstrate_shattering_2d(): """ Demonstrate that 3 points can be shattered by lines in 2D. """ import numpy as np # Three non-collinear points points = np.array([[0, 0], [1, 0], [0.5, 1]]) # All 8 possible binary labelings labelings = [ [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1] ] # For each labeling, we claim a separating line exists # (Proof by construction or convexity argument) print("For 3 non-collinear points in 2D:") print(f" Number of possible labelings: {len(labelings)}") print(" All labelings achievable by some line: YES") print(" Therefore these 3 points are shattered") print(f"\n VC dimension of linear classifiers in 2D = 3") demonstrate_shattering_2d()VC Dimension of Common Models:
| Model | VC Dimension |
|---|---|
| Threshold on ℝ (single feature) | 2 |
| Linear classifiers in ℝ^d | d + 1 |
| Axis-aligned rectangles in ℝ^d | 2d |
| Decision stumps (single threshold per feature) | 2d |
| Degree-k polynomials in ℝ^d | $\binom{d+k}{k}$ |
| k-nearest neighbors | ∞ (can shatter any finite set) |
| Neural network with W weights (ReLU) | O(W² log W) upper bound |
Note that k-NN has infinite VC dimension yet often works well in practice. This hints that VC dimension, while theoretically important, doesn't tell the whole story about generalization.
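To make that concrete, here is a minimal sketch (using scikit-learn, which this page has not otherwise assumed) showing that a 1-nearest-neighbor classifier attains zero training error on any labeling of distinct training points, which is the informal meaning of infinite VC dimension.

```python
"""Minimal sketch: 1-nearest-neighbor fits ANY labeling of its training points,
the informal meaning of infinite VC dimension. Requires scikit-learn."""

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # 50 arbitrary (distinct) points in 2D

for trial in range(3):
    y_random = rng.integers(0, 2, size=50)  # a completely arbitrary labeling
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_random)
    print(f"Random labeling {trial}: training accuracy = {clf.score(X, y_random):.2f}")

# Every labeling is fit perfectly (accuracy 1.00): the training set is shattered.
```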
The real power of VC dimension emerges when we connect it to generalization. The VC generalization bound tells us how training error relates to test error as a function of capacity.
Theorem (VC Generalization Bound): For a hypothesis class $\mathcal{H}$ with finite VC dimension $d$, and a training set of size $n$, with probability at least $1 - \delta$:
$$R(h) \leq \hat{R}_n(h) + \sqrt{\frac{d \log(\frac{2en}{d}) + \log(\frac{4}{\delta})}{n}}$$
Where $R(h)$ is the true (test) risk, $\hat{R}_n(h)$ is the empirical (training) risk, $d$ is the VC dimension of $\mathcal{H}$, $n$ is the number of training samples, and $\delta$ is the allowed failure probability.
This is profound: it says that the gap between training and test error depends, up to logarithmic factors, on the ratio $d/n$ (VC dimension over sample size).
The VC bound tells us: to generalize well with a high-capacity model, you need proportionally more data. If VC dimension is d, you need roughly O(d) samples for the generalization gap to be small. This is the theoretical foundation for 'more complex models need more data.'
Let's unpack what this bound actually tells us:
1. Capacity Controls Generalization Gap
The term $\frac{d}{n}$ dominates. If you have a model with VC dimension 1000 and only 100 training samples, the bound is vacuous (greater than 1). You need $n \gg d$ for meaningful bounds.
2. Logarithmic Dependence on Confidence
The $\log(1/\delta)$ term means we pay only a small price for high-confidence bounds: going from 95% confidence to 99.9% confidence adds a modest additive term under the square root (the sketch after point 3 below checks this numerically).
3. Sample Complexity
Rearranging, to achieve generalization gap $\epsilon$ with confidence $1-\delta$, you need: $$n = O\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$$
This is the sample complexity of learning with hypothesis class $\mathcal{H}$.
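As a rough illustration of this formula, the sketch below plugs in numbers while ignoring the constants hidden in the $O(\cdot)$; the helper `samples_needed` is ours, and its outputs should be read as order-of-magnitude estimates only.

```python
"""Order-of-magnitude reading of the VC sample-complexity formula.
Constants hidden in the O(.) are ignored, so treat these as rough estimates."""

import numpy as np


def samples_needed(d: int, epsilon: float, delta: float = 0.05) -> int:
    """n ~ (d + log(1/delta)) / epsilon^2, ignoring constant factors."""
    return int(np.ceil((d + np.log(1 / delta)) / epsilon ** 2))


# More capacity => proportionally more data for the same gap epsilon
for d in [10, 1_000, 1_000_000]:
    print(f"VC dim {d:>9,}: ~{samples_needed(d, epsilon=0.05):>14,} samples for gap 0.05")

# Tightening the confidence level barely changes the requirement (log dependence)
for delta in [0.05, 0.001]:
    print(f"d=1,000, delta={delta:<5}: ~{samples_needed(1_000, 0.05, delta):,} samples")
```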
"""VC Generalization Bound Analysis Exploring how the bound behaves for different capacity/sample relationships.""" import numpy as npimport matplotlib.pyplot as plt def vc_generalization_bound(n: int, d: int, delta: float = 0.05) -> float: """ Compute the VC generalization bound. Args: n: Number of training samples d: VC dimension delta: Confidence parameter (bound holds with prob 1-delta) Returns: Upper bound on generalization gap """ if n < d: return float('inf') # Bound is vacuous # VC bound formula term1 = d * np.log(2 * np.e * n / d) term2 = np.log(4 / delta) return np.sqrt((term1 + term2) / n) # Analyze for a neural network with different numbers of parameters# Assume VC dimension ≈ O(W) for simplicity (actually O(W² log W)) print("VC Bound Analysis for Different Capacity/Sample Ratios")print("=" * 60) # Scenario 1: Small model, adequate datad1, n1 = 100, 10000bound1 = vc_generalization_bound(n1, d1)print(f"\nScenario 1: d={d1}, n={n1}")print(f" Ratio n/d = {n1/d1:.0f}")print(f" Generalization bound: {bound1:.4f}") # Scenario 2: Large model, same datad2, n2 = 10000, 10000bound2 = vc_generalization_bound(n2, d2)print(f"\nScenario 2: d={d2}, n={n2}")print(f" Ratio n/d = {n2/d2:.0f}")print(f" Generalization bound: {bound2:.4f}") # Scenario 3: Large model, proportionally more datad3, n3 = 10000, 1000000bound3 = vc_generalization_bound(n3, d3)print(f"\nScenario 3: d={d3}, n={n3}")print(f" Ratio n/d = {n3/d3:.0f}")print(f" Generalization bound: {bound3:.4f}") # Scenario 4: Modern deep learning regime (overparameterized)d4, n4 = 100000000, 1000000 # 100M params, 1M samplesbound4 = vc_generalization_bound(n4, d4)print(f"\nScenario 4: d={d4:,}, n={n4:,}")print(f" Ratio n/d = {n4/d4:.4f}")print(f" Generalization bound: {bound4:.4f}")print(" Note: Bound is vacuous! Yet deep nets generalize in practice.") print("\n" + "=" * 60)print("KEY INSIGHT: Classical VC bounds fail to explain deep learning!")print("This is the 'generalization puzzle' that motivates modern theory.")For modern deep networks with millions or billions of parameters, VC bounds are completely vacuous. A network with 100 million parameters trained on 1 million samples should—according to VC theory—overfit catastrophically. Yet they generalize remarkably well. Understanding why requires concepts we'll explore in later pages: effective capacity, implicit regularization, and the geometry of loss landscapes.
VC dimension is worst-case: it considers whether any arrangement of points can be shattered. But real data isn't arbitrary—it has structure. Rademacher complexity provides a more refined, data-dependent measure of capacity.
Definition (Empirical Rademacher Complexity): For a hypothesis class $\mathcal{H}$, a sample $S = \{x_1, \ldots, x_n\}$, and i.i.d. Rademacher random variables $\sigma_i \in \{-1, +1\}$ with equal probability:
$$\hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\right]$$
Intuition: how well can the best hypothesis in $\mathcal{H}$ correlate with random noise? If $\mathcal{H}$ has high capacity, some hypothesis will correlate well with any noise pattern. If capacity is low, the correlation will be weak.
Data-Dependence: Unlike VC dimension, Rademacher complexity depends on the actual data distribution. A model might have high VC dimension in theory but low Rademacher complexity on structured real-world data.
Tighter Bounds: The Rademacher generalization bound is:
$$R(h) \leq \hat{R}_n(h) + 2\mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2n}}$$
Crucially, $\mathcal{R}_n(\mathcal{H})$ can be much smaller than what VC bounds predict, especially when the data has low intrinsic complexity.
Connection to Margin: For classification with margins, Rademacher complexity decreases with larger margins. This connects capacity to the geometric structure of the learned classifier.
"""Rademacher Complexity Estimation Empirically estimating Rademacher complexity via Monte Carlo sampling.""" import numpy as npfrom typing import Callable, List def estimate_rademacher_complexity( hypothesis_class_outputs: np.ndarray, n_rademacher_samples: int = 1000) -> float: """ Estimate empirical Rademacher complexity. Args: hypothesis_class_outputs: Array of shape (n_hypotheses, n_samples) where entry [h, i] is the output of hypothesis h on sample i. n_rademacher_samples: Number of Monte Carlo samples for expectation. Returns: Estimated Rademacher complexity. """ n_hypotheses, n_samples = hypothesis_class_outputs.shape supremums = [] for _ in range(n_rademacher_samples): # Sample random Rademacher variables sigma = np.random.choice([-1, 1], size=n_samples) # Compute correlation for each hypothesis correlations = hypothesis_class_outputs @ sigma / n_samples # Take supremum over hypothesis class supremums.append(np.max(correlations)) return np.mean(supremums) # Example: Compare linear models vs. polynomial modelsnp.random.seed(42) # Generate structured data (not arbitrary)n_samples = 100X = np.linspace(-1, 1, n_samples).reshape(-1, 1) # Hypothesis class 1: Linear functions (low capacity)# Sample many random linear functionsn_linear = 1000linear_weights = np.random.randn(n_linear)linear_outputs = linear_weights.reshape(-1, 1) * X.reshape(1, -1) # Hypothesis class 2: Degree-10 polynomials (high capacity)n_poly = 1000poly_features = np.column_stack([X**i for i in range(11)]) # 11 featurespoly_weights = np.random.randn(n_poly, 11)poly_outputs = poly_weights @ poly_features.T # Normalize outputs to [-1, 1] for fair comparisonlinear_outputs = np.tanh(linear_outputs)poly_outputs = np.tanh(poly_outputs) # Estimate Rademacher complexityrc_linear = estimate_rademacher_complexity(linear_outputs)rc_poly = estimate_rademacher_complexity(poly_outputs) print("Rademacher Complexity Comparison")print("=" * 50)print(f"Linear functions: {rc_linear:.4f}")print(f"Degree-10 polynomials: {rc_poly:.4f}")print(f"\nRatio: {rc_poly / rc_linear:.2f}x higher for polynomials")print("\nInterpretation: Higher Rademacher complexity means")print("the model class can fit random labels better → more capacity")Zhang et al. (2017) famously showed that deep networks can fit randomly labeled data perfectly. This is exactly what high Rademacher complexity predicts: the hypothesis class is rich enough to correlate with any noise pattern. The mystery is why, despite this capability, networks trained on real labels generalize well.
Beyond VC dimension and Rademacher complexity, researchers have proposed numerous other ways to quantify model capacity. Each provides different insights.
A covering number $\mathcal{N}(\epsilon, \mathcal{H}, d)$ is the minimum number of balls of radius $\epsilon$ (in metric $d$) needed to cover the hypothesis class $\mathcal{H}$.
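The growth of the covering number as $\epsilon$ shrinks is itself a capacity measure: richer classes need more balls. As a rough illustration, the sketch below greedily builds an $\epsilon$-cover, in the sup metric over a fixed sample, for two finite families of functions; the helper `greedy_cover_size` is ours, not a standard library routine, and the greedy count only upper-bounds the true covering number.

```python
"""Empirical covering-number sketch: represent each hypothesis by its vector of
outputs on a fixed sample, then greedily build an epsilon-cover in the sup
(L-infinity) metric. The greedy count upper-bounds the true covering number."""

import numpy as np


def greedy_cover_size(outputs: np.ndarray, eps: float) -> int:
    """outputs: (n_hypotheses, n_samples) matrix of hypothesis outputs."""
    centers = []
    for f in outputs:
        # Is f already within eps (sup norm) of an existing center?
        if not any(np.max(np.abs(f - c)) <= eps for c in centers):
            centers.append(f)
    return len(centers)


rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 50)

# Bounded-slope linear functions vs. random degree-10 polynomials (squashed)
linear_outputs = np.array([w * X for w in rng.uniform(-1, 1, 500)])
poly_outputs = np.tanh(rng.normal(size=(500, 11)) @ np.array([X**i for i in range(11)]))

for eps in [0.5, 0.25, 0.1]:
    print(f"eps={eps:>4}: linear cover ~{greedy_cover_size(linear_outputs, eps):>3}, "
          f"degree-10 cover ~{greedy_cover_size(poly_outputs, eps):>3}")
# Richer classes need more balls to cover at the same radius.
```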
An extension of VC dimension for real-valued functions that accounts for margin:
Definition: The fat-shattering dimension at margin $\gamma$, denoted $\text{fat}_\gamma(\mathcal{H})$, is the size of the largest set $\{x_1, \ldots, x_n\}$ for which there exist 'witness' values $r_1, \ldots, r_n$ such that for every labeling $(b_1, \ldots, b_n) \in \{-1, +1\}^n$ there is some $h \in \mathcal{H}$ with $b_i (h(x_i) - r_i) \geq \gamma$ for all $i$.
This captures that classifiers with larger margins have lower effective capacity.
The Minimum Description Length (MDL) principle provides an information-theoretic view of capacity: the cost of a model is the number of bits needed to describe it, and the best model is the one that minimizes the total description length, the bits to encode the model plus the bits to encode the data given the model. A rich hypothesis class needs many bits just to single out one of its members.
This perspective is compelling because it connects directly to compression: a model that generalizes well should compress the data, not memorize it.
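As a toy illustration of the two-part-code idea, the sketch below compares description lengths for a setup like the earlier polynomial example (here with 30 points so the comparison is less noisy), under a deliberately crude coding scheme: a fixed 32 bits per coefficient plus an idealized Gaussian code for the residuals. The helper `description_length` is ours; real MDL analyses use far more careful codes.

```python
"""Toy two-part MDL comparison for polynomial fitting. Coding scheme is crude:
32 bits per coefficient for the model, plus an idealized Gaussian code for the
residuals (length ~ n/2 * log2 of the residual variance, up to constants)."""

import numpy as np

np.random.seed(42)
X = np.linspace(0, 1, 30)
y = 0.5 + 2 * X - 3 * X**2 + np.random.normal(0, 0.1, 30)


def description_length(degree: int, bits_per_param: int = 32) -> float:
    coeffs = np.polyfit(X, y, degree)
    residual_var = max(np.mean((np.polyval(coeffs, X) - y) ** 2), 1e-12)
    model_bits = bits_per_param * (degree + 1)         # cost of stating the model
    data_bits = 0.5 * len(X) * np.log2(2 * np.pi * np.e * residual_var)  # cost of residuals
    return model_bits + data_bits


for degree in [1, 2, 3, 5, 9]:
    print(f"degree {degree:>2}: total description length ~ {description_length(degree):8.1f} bits")
# The quadratic (the true model order) should come out near the minimum:
# fit improves slowly beyond degree 2, while model cost keeps growing.
```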
Probably Approximately Correct (PAC)-Bayes bounds provide some of the tightest generalization bounds currently known:
$$R(\rho) \leq \hat{R}(\rho) + \sqrt{\frac{\text{KL}(\rho \| \pi) + \ln(n/\delta)}{2n}}$$
Where $\rho$ is the posterior distribution over hypotheses chosen after seeing the data, $\pi$ is a prior distribution fixed before training, $\text{KL}(\rho \| \pi)$ is the Kullback-Leibler divergence between them, and $R(\rho)$, $\hat{R}(\rho)$ are the true and empirical risks averaged over $\rho$.
The key insight: if the learned model stays 'close' to the prior (in KL sense), generalization is guaranteed. This formalizes the intuition that simpler models (closer to prior) generalize better.
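A minimal sketch of how such a bound can be evaluated, assuming both prior and posterior are isotropic Gaussians over the weights (a common choice in the PAC-Bayes literature, e.g. a prior centered at initialization and a posterior centered at the trained weights); the helpers `kl_gaussians` and `pac_bayes_bound` are ours, and the empirical risk value is made up purely for illustration.

```python
"""PAC-Bayes bound under isotropic-Gaussian prior and posterior over weights.
All numbers here are illustrative, not measurements from a real network."""

import numpy as np


def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ), closed form."""
    k = mu_q.size
    return (np.sum((mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            + 0.5 * k * (sigma_q ** 2 / sigma_p ** 2 - 1)
            + k * np.log(sigma_p / sigma_q))


def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    return emp_risk + np.sqrt((kl + np.log(n / delta)) / (2 * n))


rng = np.random.default_rng(0)
k = 10_000                                    # number of weights (toy scale)
w_init = rng.normal(0, 0.1, k)                # "prior" mean: weights at initialization
w_trained = w_init + rng.normal(0, 0.01, k)   # "posterior" mean: weights after training

kl = kl_gaussians(w_trained, 0.05, w_init, 0.1)
bound = pac_bayes_bound(emp_risk=0.02, kl=kl, n=50_000)
print(f"KL(posterior || prior) = {kl:,.1f} nats")
print(f"PAC-Bayes risk bound   = {bound:.3f}")
# Small weight movement away from the prior => small KL => a non-vacuous bound.
```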
| Measure | Type | Advantages | Limitations |
|---|---|---|---|
| VC Dimension | Worst-case, combinatorial | Conceptually simple, well-studied | Too pessimistic for deep learning |
| Rademacher Complexity | Data-dependent, average-case | Tighter bounds, captures data structure | Hard to compute exactly |
| Covering Numbers | Geometric | Works for continuous hypothesis classes | Depends on choice of metric |
| Fat-Shattering | Margin-aware | Captures benefit of large margins | Still worst-case |
| MDL / Bits | Information-theoretic | Intuitive, connects to compression | Depends on encoding scheme |
| PAC-Bayes | Probabilistic, algorithmic | Tightest known bounds, non-vacuous for DNNs | Requires choosing prior |
PAC-Bayes bounds are currently the most promising approach for understanding deep learning generalization. They can give non-vacuous bounds (bounds less than 1) for networks with millions of parameters—something VC theory cannot achieve. The key is that they account for the specific weights found by training, not just the architecture's capacity.
How do these theoretical concepts apply specifically to neural networks? Let's examine what determines the capacity of deep learning models.
Depth (Number of Layers): Adding layers increases capacity very efficiently; for ReLU networks, the number of linear regions the function can carve out can grow exponentially with depth. The code below compares depth against width directly.

Width (Neurons per Layer): Widening layers also adds capacity, but more gradually; the gain from extra units in a single layer is closer to polynomial.

Connection Patterns: Architectural constraints matter too. Convolutions (weight sharing and locality) restrict the hypothesis class relative to a fully connected layer of the same size, while skip connections change which functions are easy to represent and optimize.
"""Neural Network Capacity Analysis Examining how architectural choices affect representational power.""" import torchimport torch.nn as nnimport numpy as np def count_parameters(model: nn.Module) -> int: """Count trainable parameters.""" return sum(p.numel() for p in model.parameters() if p.requires_grad) def count_linear_regions(model: nn.Module, input_dim: int, n_samples: int = 10000) -> int: """ Estimate the number of linear regions for a ReLU network. For ReLU networks, the function is piecewise linear. More linear regions = higher capacity to represent complex functions. """ model.eval() # Sample random inputs X = torch.randn(n_samples, input_dim) # Get activation patterns for each sample patterns = [] def hook_fn(name): def hook(module, inp, out): # Binary pattern: which neurons are active (> 0) pattern = (out > 0).float() patterns.append(pattern.flatten().cpu().numpy()) return hook # Register hooks on all ReLU activations hooks = [] for name, module in model.named_modules(): if isinstance(module, nn.ReLU): hooks.append(module.register_forward_hook(hook_fn(name))) # Forward pass with torch.no_grad(): _ = model(X) # Remove hooks for hook in hooks: hook.remove() if not patterns: return 1 # Concatenate all activation patterns all_patterns = np.concatenate(patterns, axis=1) if len(patterns) > 1 else patterns[0] # Count unique activation patterns unique_patterns = np.unique(all_patterns, axis=0) return len(unique_patterns) # Compare different architecturesinput_dim = 10hidden_dims = [64, 64] # 2 hidden layers # Architecture 1: Shallow and wideshallow_wide = nn.Sequential( nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, 1)) # Architecture 2: Deep and narrowdeep_narrow = nn.Sequential( nn.Linear(input_dim, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1)) # Architecture 3: Balancedbalanced = nn.Sequential( nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)) models = { "Shallow-Wide (1 hidden, 256 units)": shallow_wide, "Deep-Narrow (4 hidden, 32 units)": deep_narrow, "Balanced (2 hidden, 64 units)": balanced,} print("Neural Network Capacity Comparison")print("=" * 60) for name, model in models.items(): params = count_parameters(model) regions = count_linear_regions(model, input_dim) print(f"\n{name}") print(f" Parameters: {params:,}") print(f" Estimated linear regions: {regions:,}") print(f" Regions per parameter: {regions/params:.2f}") print("\n" + "=" * 60)print("KEY INSIGHT: Depth enables exponentially more linear regions")print("with similar parameter counts. Depth provides more 'capacity per parameter'.")The Universal Approximation Theorem states that a single-hidden-layer neural network with sufficient width can approximate any continuous function on a compact domain to arbitrary precision.
Key Results:
Cybenko (1989): Networks with sigmoidal activations are universal approximators.
Hornik (1991): Universality holds for any non-constant, bounded, continuous activation function.
Modern Extensions: ReLU networks, residual networks, and transformers all have universal approximation properties.
What This Means for Capacity:
Universality is about existence, not efficiency or learnability.
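The sketch below makes the existence-versus-learnability point concrete: single-hidden-layer ReLU networks of increasing width are trained to fit $\sin(2\pi x)$ on $[0, 1]$, and the achievable error shrinks with width. Nothing here is guaranteed by the theorem itself, which only says that good weights exist; whether gradient descent finds them is a separate question.

```python
"""Sketch: a single hidden ReLU layer approximating sin(2*pi*x) on [0, 1].
Wider networks reach lower training error, consistent with universal
approximation -- though the theorem only guarantees good weights exist."""

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.linspace(0, 1, 512).unsqueeze(1)
y = torch.sin(2 * torch.pi * X)

for width in [2, 8, 32, 128]:
    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(3000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(X), y)
        loss.backward()
        opt.step()
    print(f"width {width:>4}: final MSE = {loss.item():.5f}")
```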
Universal approximation is both a blessing and a curse. The blessing: we know our architecture can represent the solution. The curse: the same architecture can represent infinitely many wrong solutions too. The capacity to represent everything includes the capacity to memorize noise.
The bias-variance tradeoff is the classical framework for understanding how capacity affects generalization. While deep learning challenges some aspects of this framework (as we'll see in later pages), understanding it remains essential.
For regression with mean-squared error, the expected prediction error at any point $x$ can be decomposed:
$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$
Where $f(x)$ is the true function, $\hat{f}(x)$ is the model's prediction (a random quantity through its dependence on the sampled training set), and $\sigma^2$ is the variance of the irreducible noise in $y$.
Interpretation:
Low Capacity (e.g., linear model for nonlinear data): High bias, low variance. The model systematically misses the true pattern, but its fit changes little from one training set to the next.

High Capacity (e.g., degree-20 polynomial for 10 data points): Low bias, high variance. The model can express the true pattern, but its fit swings wildly with every new noisy sample.

Optimal Capacity: The complexity that minimizes the sum of bias² and variance; pushing either term toward zero inflates the other.
"""Bias-Variance Tradeoff Simulation Empirically demonstrating the tradeoff through repeated experiments.""" import numpy as npfrom sklearn.preprocessing import PolynomialFeaturesfrom sklearn.linear_model import LinearRegressionfrom sklearn.pipeline import make_pipeline def true_function(x): """The true underlying function we're trying to learn.""" return np.sin(2 * np.pi * x) def generate_data(n_samples=30, noise_std=0.3, seed=None): """Generate noisy samples from the true function.""" if seed is not None: np.random.seed(seed) X = np.random.uniform(0, 1, n_samples) y = true_function(X) + np.random.normal(0, noise_std, n_samples) return X.reshape(-1, 1), y def bias_variance_experiment(degree, n_trials=100, n_train=30, n_test=100): """ Estimate bias and variance for polynomial regression of given degree. """ # Fixed test points X_test = np.linspace(0, 1, n_test).reshape(-1, 1) y_true = true_function(X_test.flatten()) # Collect predictions from many trials all_predictions = np.zeros((n_trials, n_test)) for trial in range(n_trials): X_train, y_train = generate_data(n_train, seed=trial) model = make_pipeline( PolynomialFeatures(degree), LinearRegression() ) model.fit(X_train, y_train) all_predictions[trial] = model.predict(X_test) # Compute bias and variance mean_prediction = np.mean(all_predictions, axis=0) # Pointwise squared bias bias_squared = (mean_prediction - y_true) ** 2 # Pointwise variance variance = np.var(all_predictions, axis=0) # Average over test points avg_bias_squared = np.mean(bias_squared) avg_variance = np.mean(variance) return avg_bias_squared, avg_variance # Run experiment for different polynomial degreesdegrees = [1, 2, 3, 4, 5, 7, 10, 15, 20] print("Bias-Variance Tradeoff: Polynomial Regression")print("=" * 60)print(f"{'Degree':<10} {'Bias²':<15} {'Variance':<15} {'Total':<15}")print("-" * 60) results = []for degree in degrees: bias_sq, var = bias_variance_experiment(degree) total = bias_sq + var results.append((degree, bias_sq, var, total)) print(f"{degree:<10} {bias_sq:<15.4f} {var:<15.4f} {total:<15.4f}") # Find optimalbest = min(results, key=lambda x: x[3])print("-" * 60)print(f"\nOptimal degree: {best[0]} (Total error: {best[3]:.4f})")print("\nObservations:")print(" - Low degrees: High bias, low variance (underfitting)")print(" - High degrees: Low bias, high variance (overfitting)")print(" - Middle degrees: Best balance")The traditional prescription: Choose model complexity using held-out validation. Start simple, increase complexity until validation error starts increasing. This 'Goldilocks' approach—not too simple, not too complex—has been the standard for decades. We'll see how deep learning upends this prescription in later pages.
We have covered substantial ground on the foundational concept of model capacity. Let's consolidate the essential takeaways: capacity measures the richness of a model's hypothesis class; VC dimension and Rademacher complexity make that notion precise in worst-case and data-dependent ways, respectively; generalization bounds tie the train-test gap to the ratio of capacity to sample size; and classical, architecture-only measures become vacuous in the overparameterized regime of modern deep networks.
In the next page, we'll explore effective capacity—the idea that what matters is not what a model can represent, but what it actually represents after training. This distinction is crucial for understanding why deep learning defies classical theory.
Core Questions to Reflect On:
If the VC dimension of a neural network is O(W²), where W is the parameter count, why don't billion-parameter models overfit catastrophically?
How does the training algorithm (SGD) interact with model capacity?
What role does data structure play in determining effective capacity?
These questions drive modern deep learning theory. With the foundation from this page, you're ready to explore the answers.
You now understand the theoretical foundations of model capacity, including VC dimension, Rademacher complexity, and the bias-variance tradeoff. You've also seen why classical theory fails to explain deep learning, setting the stage for more refined concepts like effective capacity and implicit regularization.